hwloc

Portable Hardware Locality


News

Software releases are announced and available on the Open MPI subproject page here.

Introduction

This page is the original research side of the Hardware Locality (hwloc) software development. Several other research teams are now also working on hwloc-related topics that may not be listed below. The hwloc software and documentation are available for download as an Open MPI subproject here.

Context

The democratization of multicore processors and NUMA architectures (AMD HyperTransport, Intel QPI, ...) has spread complex hardware topologies across the whole server world. While large shared-memory machines were formerly very rare, nowadays every single cluster node may contain a dozen cores, hierarchical caches, and multiple hardware threads per core, making its topology far from flat.

Such complex and hierarchical topologies have a strong impact on application performance. The developer must take hardware affinities into account when trying to exploit the actual performance of the hardware. For instance, two tasks that cooperate tightly should probably be placed on cores sharing a cache, whereas two independent memory-intensive tasks are better spread across different sockets so as to maximize their memory throughput. MPI processes and OpenMP threads, for instance, have to be placed according to their affinities and to the hardware characteristics.
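
To make the first strategy concrete, here is a minimal C sketch against the hwloc 1.x API that was current when this page was last updated (error handling omitted): it looks for a cache that the first core shares with a sibling and binds the whole process underneath it.

    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_obj_t core, cache;
        hwloc_cpuset_t cpuset;

        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* take the first core and look for a cache it shares with a sibling core */
        core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
        cache = core ? hwloc_get_shared_cache_covering_obj(topology, core) : NULL;
        if (cache) {
            /* bind the whole process below that shared cache so that
             * tightly cooperating tasks stay close to each other */
            cpuset = hwloc_bitmap_dup(cache->cpuset);
            hwloc_set_cpubind(topology, cpuset, HWLOC_CPUBIND_PROCESS);
            hwloc_bitmap_free(cpuset);
        }

        hwloc_topology_destroy(topology);
        return 0;
    }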

hwloc provides a portable abstraction (across operating systems, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various attributes such as cache and memory information. It builds a hierarchical tree that the application may walk to retrieve information about the hardware or to bind tasks properly.
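
Walking that tree only takes a few lines of C; the following minimal sketch (error handling omitted) prints the type of every object, indented by depth:

    #include <hwloc.h>
    #include <stdio.h>

    /* recursively print the topology tree, one object per line */
    static void print_children(hwloc_obj_t obj, int depth)
    {
        unsigned i;
        printf("%*s%s\n", 2 * depth, "", hwloc_obj_type_string(obj->type));
        for (i = 0; i < obj->arity; i++)
            print_children(obj->children[i], depth + 1);
    }

    int main(void)
    {
        hwloc_topology_t topology;

        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);
        print_children(hwloc_get_root_obj(topology), 0);
        hwloc_topology_destroy(topology);
        return 0;
    }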

hwloc also offers affinity information about I/O devices such as network interfaces, InfiniBand HCAs, and GPUs. It enables better I/O data transfers by letting processes and data be placed on the part of the host that is closest to the devices they use.
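
As a rough sketch of how this information may be queried with the hwloc 1.x API (where I/O discovery has to be enabled explicitly), the following walks all OS devices and reports which part of the machine each of them is attached to:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_obj_t osdev = NULL, ancestor;

        hwloc_topology_init(&topology);
        /* I/O devices are not discovered unless explicitly requested */
        hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
        hwloc_topology_load(topology);

        /* iterate over OS devices (NICs, InfiniBand HCAs, GPUs, ...) */
        while ((osdev = hwloc_get_next_osdev(topology, osdev)) != NULL) {
            /* climb up to the first non-I/O object above the device */
            ancestor = hwloc_get_non_io_ancestor_obj(topology, osdev);
            printf("%s is close to a %s\n",
                   osdev->name ? osdev->name : "(unnamed device)",
                   hwloc_obj_type_string(ancestor->type));
        }

        hwloc_topology_destroy(topology);
        return 0;
    }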

Credits

hwloc is the evolution and merger of the INRIA libtopology project and Open MPI's Portable Linux Processor Affinity (PLPA) project. libtopology was developed by the INRIA Runtime project-team (headed by Raymond Namyst). hwloc is now developed in collaboration with the Open MPI community, among others.

libtopology was initially implemented inside the Marcel threading library as a way to inform the BubbleSched framework of hardware affinities. With the advent of multicore machines, this work became relevant well beyond multithreading, so libtopology was extracted from Marcel and became an independent library offering a portable abstraction of hierarchical architectures for high-performance computing.

Papers

  1. François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), Pisa, Italy, February 2010. IEEE Computer Society Press. Available here.
    If you are looking for general-purpose hwloc citations, please use this one. The paper introduces hwloc, its goals and its implementation. It then shows how hwloc may be used by MPI implementations and OpenMP runtime systems as a way to carefully place processes and adapt communication strategies to the underlying hardware.
  2. Brice Goglin. Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc). In Proceedings of 2014 International Conference on High Performance Computing & Simulation (HPCS 2014), Bologna, Italy, July 2014. IEEE Computer Society Press. Available here.
    If you are looking for a citation about I/O device locality and cluster/multi-node support, please use this one. This paper explains how I/O locality is managed in hwloc, how device details are represented, how hwloc interacts with other libraries, and how multiple nodes such as a cluster can be efficiently managed.
  3. François Broquedis, Nathalie Furmento, Brice Goglin, Pierre-André Wacrenier, and Raymond Namyst. ForestGOMP: an efficient OpenMP environment for NUMA architectures. In International Journal of Parallel Programming, Special Issue, 2010. Springer. Available here.
    The article describes the ForestGOMP runtime system which uses information about OpenMP thread affinities and hardware locality so as to dynamically schedule threads in a cache and memory efficient manner.
  4. Stéphanie Moreaud, Brice Goglin, and Raymond Namyst. Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access. In Proceedings of the 17th European MPI Users Group Conference, Lecture Notes in Computer Science, Stuttgart, Germany, September 2010. Springer. Outstanding paper award. Available here.
    The paper describes how hwloc lets Open MPI gather information about network card locality. Combining it with process locality enables adaptive distribution of MPI messages across multiple network rails.
  5. Brice Goglin and Stéphanie Moreaud. Dodging Non-Uniform I/O Access in Hierarchical Collective Operations for Multicore Clusters. In CASS 2011: The 1st Workshop on Communication Architecture for Scalable Systems, held in conjunction with IPDPS 2011, Anchorage, AK, May 2011. IEEE Computer Society Press. Available here.
    This paper explains how to use hwloc to elect leader processes according to NIC localities so as to better cope with non-uniform input/output access in hierarchical collective operations.
  6. Brice Goglin, Jeff Squyres, and Samuel Thibault. Hardware Locality: Peering under the hood of your server. In Linux Pro Magazine, 128:28-33, July 2011.
    This journal article presents hwloc's goals through many concrete examples.

All INRIA Runtime hwloc papers are also listed here with the corresponding BibTeX entries.


Last updated on 2014/07/30.