1. What is MPICH-Madeleine?
MPICH-Madeleine is a free, MPICH-based implementation of the MPI standard. If you are not familiar with MPI, you should first take a look at http://www-unix.mcs.anl.gov/mpi/. MPI is a high-level communication interface designed to provide high-performance communication on a variety of network architectures, including supercomputers and clusters of workstations (COWs), which are usually off-the-shelf PCs interconnected by high-speed links. Clusters of workstations are becoming increasingly popular thanks to the availability of many high-speed interconnect technologies (Gigabit Ethernet, Myrinet, GigaNet, SCI). Furthermore, interconnecting such COWs to build heterogeneous clusters of clusters is now a hot issue. Unfortunately, no current MPI implementation supports this kind of architecture efficiently. At present, the only way to handle network heterogeneity is to use interoperable implementations of MPI: several MPI implementations (one per cluster) communicate with each other using an inter-MPI glue.
Our alternative proposal is to provide a true multi-protocol implementation of MPI on top of a generic and multi-protocol communication layer called Madeleine (version 3). Madeleine III is the communication sub-system of the Parallel Multithreaded Machine (http://runtime.futurs.inria.fr/pm2) runtime environment. It is especially targeted towards:
- Single clusters with several interconnection networks;
- Clusters of clusters (possibly with several interconnection networks).
1.1 Madeleine Basics
This section explains the key Madeleine concepts that must be understood before using our implementation of MPI. Suppose that two clusters are available:
Figure 1.1. Example of interconnected clusters
- the first cluster, named foo, is composed of 3 nodes, named foo0, foo1 and foo2, linked by a Myrinet network;
- the second cluster, named goo, is composed of 4 nodes, named goo0, goo1, goo2 and goo3, linked by an SCI network.
Both clusters also feature an Ethernet network (TCP) that links all the machines of both clusters together. The clusters are shown in Figure 1.1.
Madeleine uses objects called channels in order to virtualize the available networks in a given configuration. There are basically two types of channels:
- physical channels, which are simple abstractions of existing networks; and
- virtual channels, which are built on top of physical channels and can be used to create heterogeneous networks.
With our simple example, we can build three physical channels:
- a channel built on top of the Myrinet network. This channel encompasses the nodes {foo0, foo1, foo2};
- a channel built on top of the SCI network. This channel encompasses the nodes {goo0, goo1, goo2, goo3};
- a channel built on top of the TCP network. This channel encompasses all the nodes of both clusters.
On top of these three physical channels, we can build a virtual channel that encompasses all the nodes of the configuration. One might think that this is no different from the TCP physical channel, but a program using the virtual channel behaves quite differently: Madeleine automatically selects the best available network for communication between any two nodes of the virtual channel.
All communications within the foo cluster will use the Myrinet network, and all communications within the goo cluster will use the SCI network. If two nodes belonging to different clusters exchange messages, the TCP network will be used.
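The selection rule described above can be modelled in a few lines of Python. This is only an illustrative sketch: the dictionary layout, the function name select_network and the fixed preference order (Myrinet, then SCI, then TCP) are assumptions made for the example and do not reflect Madeleine's actual API or internal algorithm.

```python
# Illustrative model only: the names and the preference order below are
# assumptions, not Madeleine's API or its real selection algorithm.

# Physical channels of the example: network name -> set of member nodes.
PHYSICAL_CHANNELS = {
    "myrinet": {"foo0", "foo1", "foo2"},
    "sci": {"goo0", "goo1", "goo2", "goo3"},
    "tcp": {"foo0", "foo1", "foo2", "goo0", "goo1", "goo2", "goo3"},
}

# Assumed preference: high-speed networks first, TCP as the fallback.
PREFERENCE = ["myrinet", "sci", "tcp"]


def select_network(src, dst):
    """Return the most preferred physical channel shared by src and dst."""
    for name in PREFERENCE:
        members = PHYSICAL_CHANNELS[name]
        if src in members and dst in members:
            return name
    raise ValueError(f"no physical channel connects {src} and {dst}")


if __name__ == "__main__":
    print(select_network("foo0", "foo2"))  # myrinet
    print(select_network("goo1", "goo3"))  # sci
    print(select_network("foo1", "goo0"))  # tcp
```

In this model, the TCP channel is chosen only when the two nodes share no faster network, which matches the behavior described for the virtual channel.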
More complicated configurations can be expressed. Suppose now that the node goo3 also features a Myrinet NIC: we can then build a virtual channel over the physical channels corresponding to the Myrinet and SCI networks. In that case, we do not need (and, more importantly, do not use) the TCP network. With this new configuration, from the application's point of view, all the nodes can still communicate with each other. Even if a node A is not physically connected to a node B, it can send messages to it. Internally, the node goo3, which features both Myrinet and SCI NICs, will forward the Madeleine message from A to B.
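The forwarding idea can be illustrated with a small route search over the two physical channels. The sketch below is a model written for this explanation, assuming that goo3 belongs to both channels; the function find_route and its data layout are hypothetical and are not part of Madeleine, whose actual forwarding mechanism is not described here.

```python
from collections import deque

# Hypothetical model of the configuration in which goo3 sits on both networks.
CHANNELS = {
    "myri_channel": {"foo0", "foo1", "foo2", "goo3"},
    "sci_channel": {"goo0", "goo1", "goo2", "goo3"},
}


def find_route(src, dst, channels=CHANNELS):
    """Breadth-first search for a shortest route from src to dst.

    Returns a list of (channel, next_node) hops, an empty list when
    src == dst, or None when no route exists.
    """
    if src == dst:
        return []
    visited = {src}
    pending = deque([(src, [])])
    while pending:
        node, path = pending.popleft()
        for name in sorted(channels):
            members = channels[name]
            if node not in members:
                continue
            # sorted() builds a list first, so updating visited below is safe.
            for peer in sorted(members - visited):
                hops = path + [(name, peer)]
                if peer == dst:
                    return hops
                visited.add(peer)
                pending.append((peer, hops))
    return None


if __name__ == "__main__":
    # foo0 and goo0 share no network, so the route passes through goo3.
    print(find_route("foo0", "goo0"))
```

For foo0 and goo0, the route found consists of one Myrinet hop to goo3 followed by one SCI hop to goo0, which corresponds to goo3 acting as the gateway.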
Figure 1.2. Example of configuration
How is this information given to Madeleine? The library uses configuration files. The following example files describe the configuration shown in Figure 1.2, in which the node goo3 features both a Myrinet and an SCI NIC:
- the first file, the network configuration file, named localnet.cfg, describes the different available networks.

    networks : ({
        name : tcp;
        hosts : ( foo0, foo1, foo2, goo0, goo1, goo2, goo3 );
        dev : tcp;
    },{
        name : myrinet;
        hosts : ( foo0, foo1, foo2, goo3 );
        dev : mx;
    },{
        name : sci;
        hosts : ( goo0, goo1, goo2, goo3 );
        dev : sisci;
    });

  A network is defined with its name (tag name), a list of machines it includes (tag hosts), and the identifier of the device connecting these machines (tag dev, should be one of the predefined identifiers recognized by Madeleine).
- the second file, the channel configuration file, named config, describes the channel mapping over the different nodes.

    channels : ({
        name : tcp_channel;
        net : tcp;
        hosts : ( foo0, foo1, foo2, goo0, goo1, goo2, goo3 );
    },{
        name : sci_channel;
        net : sci;
        hosts : ( goo0, goo1, goo2, goo3 );
    },{
        name : myri_channel;
        net : myrinet;
        hosts : ( foo0, foo1, foo2, goo3 );
    });

    vchannels : {
        name : default;
        channels : ( myri_channel, sci_channel );
    };

  A physical channel is defined with its name (tag name), the identifier of the network it is based on (tag net, which has to be defined in the network configuration file), and a list of machines it encompasses (tag hosts).
  A virtual channel is defined with its name (tag name) and the list of physical channels it is built upon (tag channels).
Once a virtual channel is built, the physical channels below it are no longer visible to the application. However, it is possible to create several different Madeleine physical channels over the same physical network. Hence a physical channel can truly be seen as a logical network or a network abstraction.
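The relationships between the two configuration files can be made explicit with a small consistency check. The sketch below mirrors the example files as Python dictionaries; this layout and the function check_config are assumptions made for illustration and are not part of Madeleine. The source only states that the net tag must refer to a network defined in the network configuration file; the other two checks (channel hosts belonging to the network, virtual channels referring to defined channels) are plausible assumptions rather than documented rules.

```python
# Python mirror of the two example files. The dictionary layout is an
# assumption made for this sketch, not a Madeleine data structure.
ALL_HOSTS = {"foo0", "foo1", "foo2", "goo0", "goo1", "goo2", "goo3"}

NETWORKS = {
    "tcp": {"dev": "tcp", "hosts": set(ALL_HOSTS)},
    "myrinet": {"dev": "mx", "hosts": {"foo0", "foo1", "foo2", "goo3"}},
    "sci": {"dev": "sisci", "hosts": {"goo0", "goo1", "goo2", "goo3"}},
}

CHANNELS = {
    "tcp_channel": {"net": "tcp", "hosts": set(ALL_HOSTS)},
    "sci_channel": {"net": "sci", "hosts": {"goo0", "goo1", "goo2", "goo3"}},
    "myri_channel": {"net": "myrinet", "hosts": {"foo0", "foo1", "foo2", "goo3"}},
}

VCHANNELS = {"default": ["myri_channel", "sci_channel"]}


def check_config(networks, channels, vchannels):
    """Return a list of consistency errors (empty when none are found)."""
    errors = []
    for name, chan in channels.items():
        net = networks.get(chan["net"])
        if net is None:
            # Stated in the text: net must be defined in the network file.
            errors.append(f"channel {name}: unknown network {chan['net']}")
            continue
        # Assumed rule: a channel may only span hosts of its network.
        extra = chan["hosts"] - net["hosts"]
        if extra:
            errors.append(f"channel {name}: hosts outside network: {sorted(extra)}")
    for name, members in vchannels.items():
        for chan_name in members:
            # Assumed rule: a virtual channel refers to defined channels.
            if chan_name not in channels:
                errors.append(f"vchannel {name}: unknown channel {chan_name}")
    return errors


if __name__ == "__main__":
    print(check_config(NETWORKS, CHANNELS, VCHANNELS))  # []
```

Running the check on the example configuration reports no errors, since every channel refers to a defined network and the virtual channel default is built on two defined physical channels.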
1.2 How is MPICH-Madeleine Implemented?
As its name indicates, our work is based on the MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) implementation of MPI. The core of MPICH-Madeleine is a specific MPICH device, called ch_mad, which handles several Madeleine channels in parallel and thus allows the use of several networks at the same time within the same application. In order to implement such a device, several threads are used: the MPI application code is executed by one particular thread, and each channel is assigned its own thread, which processes all the incoming communications on that channel.
Figure 1.3. Architecture of MPICH-Madeleine
We also developed another device for handling intra-node SHMEM communications. This device uses the same concepts as ch_mad, that is, a dedicated thread is responsible for polling for incoming communications. This device (which is included in the same directory as ch_mad) benefits from the advanced polling mechanisms available within the Marcel library. Figure 1.3 shows the architecture of MPICH-Madeleine.
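The threading model described above (one thread running the application, plus one receiving thread per channel) can be sketched with standard Python threads. This is a simplified model, not the ch_mad code: Python threads stand in for Marcel threads, in-memory queues stand in for the networks, and all names (channel_receiver, run_demo, the message strings) are hypothetical.

```python
import queue
import threading


def channel_receiver(name, incoming, delivered):
    """Drain one channel and pass each message to the shared delivery queue."""
    while True:
        message = incoming.get()
        if message is None:  # sentinel: the channel has been closed
            break
        delivered.put((name, message))


def run_demo():
    """Start one receiver thread per channel and collect what they deliver."""
    # In-memory queues stand in for the Myrinet and SCI channels.
    channels = {"myri_channel": queue.Queue(), "sci_channel": queue.Queue()}
    delivered = queue.Queue()

    receivers = [
        threading.Thread(target=channel_receiver, args=(name, chan, delivered))
        for name, chan in channels.items()
    ]
    for thread in receivers:
        thread.start()

    # The main thread plays the role of the application thread; here it
    # simply simulates messages arriving on each channel.
    channels["myri_channel"].put("message from foo0")
    channels["sci_channel"].put("message from goo1")
    for chan in channels.values():
        chan.put(None)  # close every channel
    for thread in receivers:
        thread.join()

    received = []
    while not delivered.empty():
        received.append(delivered.get())
    return sorted(received)  # sorted because arrival order is not fixed


if __name__ == "__main__":
    for channel_name, message in run_demo():
        print(channel_name, message)
```

Because each channel has its own receiver, a slow or idle network does not prevent messages on another network from being handled, which is the motivation for handling several channels in parallel.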
The internal structure of MPICH should make it possible to design a new device and plug it into the implementation without modifying the upper layers (mainly the Abstract Device Interface, the ADI). Unfortunately, it turned out that developing a multithreaded MPICH device requires some modifications to the ADI and even to some of the higher-level features of MPICH. This is why our Madeleine device cannot be downloaded on its own, but is part of a full, customized MPICH version.
Copyright © July 2008 Team Runtime