I am working in Raymond Namyst's Runtime Inria Team-Project. My primary research interests are:
The long-awaited convergence between specialized HPC networks (Infiniband, Myrinet, ...) and traditional networks (Ethernet) is finally occurring (for instance, Myricom achieved a 2-microsecond MPI latency over Ethernet in 2005). This raises the question of which hardware and software technologies will be used in future converged networks. While complex features such as RDMA or TOE have been proposed, modern NICs offer simple stateless offload features such as TSO or multiqueue to improve performance in a cost-effective manner.
We developed Open-MX to show that high-performance communication can be achieved by designing an MPI stack for such generic Ethernet hardware. We also propose some innovative ideas to improve performance further by adding minimal support in the hardware.
Open-MX was designed within a collaboration between Inria and Myricom.
We are now extending this work as part of the CCI project (Common Communication Interface), which aims at offering high-performance communication for HPC and data centers.
Selected Publications (All publications about this topic)
The widespread use of multicore processors in high-performance computing makes intra-node parallelization as important as inter-node communication. While hybrid models such as MPI+OpenMP are being worked on, many applications still rely on MPI for both intra-node and inter-node communication.
We showed that most existing MPI implementations offer limited throughput for large-message intra-node data transfers and may be easily improved with custom implementations such as Open-MX's specialized intra-node strategies. We are now developing the generic kernel module KNEM to provide MPI implementors with an optimized data transfer model that reduces CPU consumption, cache pollution and memory usage while improving large-message throughput.
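To illustrate why the data transfer model matters for large intra-node messages, the sketch below contrasts a copy-in/copy-out transfer through an intermediate shared buffer with a direct single copy. It is a toy Python model that only counts copied bytes. The function names are hypothetical, it does not use the KNEM interface or any MPI library, and it assumes that the optimized path needs exactly one copy.

```python
# Toy model only: it counts memory copies and does not call KNEM or MPI.

def two_copy_transfer(src, chunk_size):
    """Copy-in/copy-out: the sender fills a shared buffer chunk by chunk,
    and the receiver copies each chunk out again."""
    dst = bytearray(len(src))
    bytes_copied = 0
    for offset in range(0, len(src), chunk_size):
        shared = src[offset:offset + chunk_size]      # copy 1: sender -> shared buffer
        dst[offset:offset + len(shared)] = shared     # copy 2: shared buffer -> receiver
        bytes_copied += 2 * len(shared)
    return bytes(dst), bytes_copied


def single_copy_transfer(src):
    """Assumed optimized path: data goes straight from the sender's buffer
    to the receiver's buffer."""
    dst = bytearray(len(src))
    dst[:] = src                                      # the only copy
    return bytes(dst), len(src)


message = bytes(range(256)) * 4096                    # a 1 MiB message
_, copied_twice = two_copy_transfer(message, 32 * 1024)
_, copied_once = single_copy_transfer(message)
print(copied_twice, copied_once)
```

In this model the intermediate buffer doubles the amount of data moved through the caches, which is one way to picture the CPU consumption and cache pollution mentioned above.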
This work is carried out within Stéphanie Moreaud's PhD, and collaborations between Inria and the MPICH2 team at Argonne National Lab and the Open MPI project.
Selected Publications (All publications about this topic)
The increasing complexity of modern machines, with multiple processors, shared caches, cores, hardware threads, NUMA memory nodes, and I/O devices, causes performance to depend dramatically on where tasks and data buffers are placed. While understanding the hardware architecture manually may be feasible, achieving performance portability requires automatic discovery of the hardware topology and constraints. Such knowledge then enables topology-aware placement of tasks according to their behavior and needs.
We demonstrated the impact of task and data placement on high-speed network performance and then developed the corresponding automatic placement strategies. We also implemented the hwloc software, which gathers deep knowledge about the hardware and exposes it to applications in an abstracted and portable manner, letting MPI and OpenMP runtime systems place tasks according to their affinities.
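As a minimal sketch of what topology-aware placement means, the Python example below keeps each group of cooperating tasks on a single NUMA node. The topology dictionary, the task names, and the `place_tasks` helper are all hypothetical. A real runtime would obtain the topology from a discovery tool such as hwloc rather than hard-coding it, and would use a more elaborate policy than this greedy one.

```python
# Hypothetical machine description: NUMA node id -> list of core ids.
TOPOLOGY = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def place_tasks(groups, topology):
    """Greedy placement that keeps each affinity group on one NUMA node.

    groups: list of lists of task names that share data.
    Returns a dict mapping task name -> (numa_node, core).
    """
    free = {node: list(cores) for node, cores in topology.items()}
    placement = {}
    # Place the largest groups first, since they are the hardest to fit.
    for group in sorted(groups, key=len, reverse=True):
        # Choose the node that currently has the most free cores.
        node = max(free, key=lambda n: len(free[n]))
        if len(free[node]) < len(group):
            raise ValueError("group does not fit on a single NUMA node")
        for task in group:
            placement[task] = (node, free[node].pop(0))
    return placement


groups = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1"]]
for task, (node, core) in sorted(place_tasks(groups, TOPOLOGY).items()):
    print(task, "-> NUMA node", node, "core", core)
```

Tasks in the same group end up sharing a NUMA node, so the buffers they exchange stay local to that node in this model.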
This work is carried out within Stéphanie Moreaud's PhD, and collaborations between Inria and the Open MPI project.
More details about Hardware Locality (hwloc)
Selected Publications (All publications about this topic)
The emergence of multicore processors with multiple levels of shared caches and distributed NUMA memory architectures leads to an HPC world where machines are not flat but hierarchical. Running tasks on such machines requires carefully placing them and their data buffers according to their affinities so as to maximize locality. While OpenMP is an interesting way to parallelize codes using threads, most implementations fail to efficiently manage nested and irregular parallelism: they either spread the workload across all cores without maintaining affinities, or maintain affinity only at startup.
We develop ForestGOMP to tackle this problem. The scheduler takes the parallel structure of the application (OpenMP parallel sections) and the associated data buffers into account throughout the execution. The workload is distributed across all cores and memory nodes while maintaining affinities, and it is redistributed whenever imbalance appears.
To achieve this goal, we had to develop advanced memory management abilities so as to let ForestGOMP distribute/migrate data buffers near their accessing tasks at runtime manually (with optimized memory migration primitives) or automagically (with convenient next-touch migration strategies).
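The next-touch idea can be sketched with a small simulation: pages are marked, and each marked page migrates to the NUMA node of the first thread that touches it afterwards. The Python class below is a hypothetical model written for illustration. It is not the ForestGOMP implementation and performs no real memory migration.

```python
class NextTouchMemory:
    """Toy model of pages that migrate to the node of their next toucher."""

    def __init__(self, num_pages, home_node=0):
        self.node_of = [home_node] * num_pages   # current NUMA node of each page
        self.marked = set()                      # pages waiting for their next touch
        self.migrations = 0

    def mark_next_touch(self, pages):
        """Request migration of these pages on their next access."""
        self.marked.update(pages)

    def touch(self, page, thread_node):
        """Access a page from a thread running on thread_node.

        Returns True when the access is local once any migration is done."""
        if page in self.marked:
            self.marked.discard(page)
            if self.node_of[page] != thread_node:
                self.node_of[page] = thread_node
                self.migrations += 1
        return self.node_of[page] == thread_node


mem = NextTouchMemory(num_pages=8)       # all pages start on node 0
mem.mark_next_touch(range(8))
for page in range(8):
    # Threads on node 0 work on the first half, threads on node 1 on the second.
    mem.touch(page, thread_node=0 if page < 4 else 1)
print(mem.node_of, mem.migrations)
```

Only the first access after marking triggers a migration in this model. Later accesses from another node stay remote until the pages are marked again, which is why the strategy suits phases where the set of threads using a buffer changes.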
This work was carried out within François Broquedis' PhD.
Selected Publications (All publications about this topic)
In the past, I worked on the interaction between high-speed networks and distributed storage, as detailed in my Ph.D. dissertation and the associated, now outdated, research activities.