Computing in the RAIN: A Reliable Array of Independent Nodes
Vasken Bohossian, Charles C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu and Jehoshua Bruck
In IEEE Transactions on Parallel and Distributed Systems, Feb. 2001
Abstract
The RAIN project is a research collaboration between Caltech and NASA-JPL
on distributed computing and data storage systems for future spaceborne
missions. The goal of the project is to identify and develop key building
blocks for reliable distributed systems built with inexpensive
off-the-shelf components. The RAIN platform consists of a heterogeneous
cluster of computing and/or storage nodes connected via multiple
interfaces to networks configured in fault-tolerant topologies. The RAIN
software components run in conjunction with operating system services and
standard network protocols. Through software-implemented fault tolerance,
the system tolerates multiple node, link, and switch failures, with no
single point of failure. The RAIN technology has been transferred to
Rainfinity, a start-up company focusing on creating clustered solutions
for improving the performance and availability of Internet data centers.
In this paper we describe the following contributions: 1) fault-tolerant
interconnect topologies and communication protocols providing consistent
error reporting of link failures; 2) fault management techniques based on
group membership; and 3) data storage schemes based on computationally
efficient error-control codes. We present several proof-of-concept
applications: a highly-available video server, a highly-available Web
server and a distributed checkpointing system. We also describe a
commercial product, Rainwall, built with the RAIN technology.

View Paper (PDF)

View Presentation (PowerPoint)