StarPU is a task programming library for hybrid architectures
Rather than handling low-level issues, programmers can concentrate on algorithmic concerns!
The StarPU documentation is available in PDF and in HTML. Please note that these documents are up-to-date with the latest release of StarPU.
May 2013 » Engineer position open at Inria (Team Runtime + Team Moais): More details here.
April 2013 » StarPU is now featured in the Tools and Libraries section of AMD's heterogeneous computing application showcase.
April 2013 » Read the new report C Language Extensions for Hybrid CPU/GPU Programming with StarPU to learn everything you always wanted to know about the C Language Extensions for StarPU.
February 2013 » The v1.0.5 release of StarPU is now available! This release mainly brings bug fixes.
January 2013 » A tutorial on StarPU was given at the ComPAS conference.
November 2012 » StarPU at SuperComputing'12: A StarPU poster is on display at the Inria booth. Feel free to come and have a chat at booth #1209!
October 2012 » The v1.0.4 release of StarPU is now available! This release mainly brings bug fixes.
For any questions regarding StarPU, please contact the StarPU developers mailing list.
starpu-devel@lists.gforge.inria.fr
Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. If a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, including multiple GPUs. One can even specify several functions for each architecture, and StarPU will automatically determine which version is best for each input size.
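As a rough illustration of the codelet idea, the sketch below groups one implementation per architecture behind a single task description. All names in it (struct codelet, enum arch, codelet_run, scal_cpu) are hypothetical and chosen for this example only; this is not StarPU's actual API, and no GPU implementation is provided here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration types; these are NOT StarPU's real API. */
enum arch { ARCH_CPU = 0, ARCH_GPU = 1, ARCH_COUNT = 2 };

typedef void (*kernel_fn)(float *data, size_t n, float factor);

/* A "codelet" groups one implementation per architecture (NULL if absent). */
struct codelet {
    kernel_fn impl[ARCH_COUNT];
};

/* Plain CPU implementation: scale a vector in place. */
static void scal_cpu(float *data, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

/* Run the codelet on the requested architecture.
 * Returns 0 on success, -1 if no implementation exists for that architecture. */
static int codelet_run(const struct codelet *cl, enum arch where,
                       float *data, size_t n, float factor)
{
    assert(cl != NULL);
    if (cl->impl[where] == NULL)
        return -1;
    cl->impl[where](data, n, factor);
    return 0;
}
```

A caller would fill in only the implementations it has, for instance `struct codelet cl = { .impl = { [ARCH_CPU] = scal_cpu } };`, and a runtime would then be free to pick any architecture for which an entry exists.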
To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as they are needed for further tasks. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which makes it possible to overlap data transfers with computations (including GPU-GPU direct transfers) and get the most out of the architecture.
Dependencies between tasks can be given in several ways, to give the programmer the greatest flexibility.
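The LRU eviction mentioned above can be pictured with a small model of a device that holds a fixed number of data blocks. This is only a sketch under simplifying assumptions: the capacity DEVICE_SLOTS, the struct device_cache type and the cache_acquire function are hypothetical, and the model ignores whether a block is still in use by a running task, which a real runtime has to check before evicting.

```c
#include <assert.h>

#define DEVICE_SLOTS 2  /* hypothetical device capacity, in data blocks */

struct device_cache {
    int block[DEVICE_SLOTS];               /* block id held in each slot, -1 if empty */
    unsigned long last_use[DEVICE_SLOTS];  /* logical time of the last access */
    unsigned long clock;                   /* logical clock, bumped on every access */
};

static void cache_init(struct device_cache *c)
{
    for (int i = 0; i < DEVICE_SLOTS; i++) {
        c->block[i] = -1;
        c->last_use[i] = 0;
    }
    c->clock = 0;
}

/* Make block `id` resident on the device.
 * Returns the id of the block that had to be evicted, or -1 if none was. */
static int cache_acquire(struct device_cache *c, int id)
{
    int victim = 0;

    assert(id >= 0);
    c->clock++;

    /* Hit: the block is already on the device, just refresh its timestamp. */
    for (int i = 0; i < DEVICE_SLOTS; i++) {
        if (c->block[i] == id) {
            c->last_use[i] = c->clock;
            return -1;
        }
    }

    /* Miss: prefer an empty slot, otherwise take the least recently used one. */
    for (int i = 0; i < DEVICE_SLOTS; i++) {
        if (c->block[i] == -1) {
            victim = i;
            break;
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;
    }

    int evicted = c->block[victim];
    c->block[victim] = id;
    c->last_use[victim] = c->clock;
    return evicted;
}
```

With two slots, acquiring blocks 10, 20, 10 and then 30 evicts block 20, because block 10 was touched more recently.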
StarPU also supports an OpenMP-like reduction access mode.
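The idea behind a reduction access mode can be sketched as follows: each worker accumulates into its own private copy, initialized to the neutral element, and the private copies are combined at the end. The code below is a sequential simulation of that pattern in plain C; NWORKERS and reduce_sum are hypothetical names, and nothing here is StarPU's actual interface.

```c
#include <assert.h>
#include <stddef.h>

#define NWORKERS 4  /* hypothetical number of workers */

/* Sum x[0..n-1] the way a reduction would: private partial sums, then a combine step. */
static double reduce_sum(const double *x, size_t n)
{
    double partial[NWORKERS];
    double total = 0.0;

    assert(x != NULL || n == 0);

    /* Each worker starts from the neutral element of the operation (0 for +). */
    for (int w = 0; w < NWORKERS; w++)
        partial[w] = 0.0;

    /* Contributions go to the private copy of the worker that handles them
     * (simulated sequentially here with a round-robin assignment). */
    for (size_t i = 0; i < n; i++)
        partial[i % NWORKERS] += x[i];

    /* Combine the private copies into the final value. */
    for (int w = 0; w < NWORKERS; w++)
        total += partial[w];

    return total;
}
```

Because no worker writes to a shared accumulator during the main loop, the contributions do not have to be serialized against each other; only the final combine step touches all copies.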
StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These models determine the relative performance achieved by the different processing units for the various kinds of tasks, and thus make it possible to automatically let processing units execute the tasks they are best suited for.
To deal with clusters, StarPU can nicely integrate with MPI through explicit network communications, which will then be automatically combined and overlapped with the intra-node data transfers and computation. The application can also just provide the whole task graph and a data distribution over MPI nodes, and StarPU will automatically determine which MPI node should execute which task, and generate all required MPI communications accordingly.
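One simple way to use such performance models is to assign each task to the worker expected to finish it first. The sketch below assumes that policy for illustration; it is not claimed to be the exact strategy StarPU implements. The names NWORKERS and pick_worker are hypothetical, and the predicted durations stand in for what a performance model would provide.

```c
#include <assert.h>

#define NWORKERS 3  /* hypothetical machine: e.g. two CPU cores and one GPU */

/* Choose the worker with the earliest expected completion time for one task.
 * ready_at[w]  : time at which worker w becomes free
 * predicted[w] : duration of the task on worker w, as given by a performance model
 * The chosen worker's ready time is advanced by the predicted duration. */
static int pick_worker(double ready_at[NWORKERS], const double predicted[NWORKERS])
{
    int best = 0;

    for (int w = 1; w < NWORKERS; w++) {
        assert(predicted[w] >= 0.0);
        if (ready_at[w] + predicted[w] < ready_at[best] + predicted[best])
            best = w;
    }

    ready_at[best] += predicted[best];
    return best;
}
```

With predicted durations of 10, 10 and 2 time units, the fast worker receives the first four tasks; once its queue is long enough that a slower worker would finish just as early, the next task goes to a slower worker instead of leaving it idle.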
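To make the notion of a data distribution concrete, the sketch below maps tiles to MPI ranks with a 2D block-cyclic layout and assumes an "owner computes" rule, where a task runs on the rank that owns the tile it writes. Both the layout and the rule are assumptions made for this example, and tile_owner and gemm_remote_inputs are hypothetical helper names, not StarPU functions.

```c
#include <assert.h>

/* Rank owning tile (m, n) under a 2D block-cyclic distribution
 * over a P x Q process grid (ranks numbered row by row). */
static int tile_owner(int m, int n, int P, int Q)
{
    assert(P > 0 && Q > 0 && m >= 0 && n >= 0);
    return (m % P) * Q + (n % Q);
}

/* A gemm task reads tiles (m, k) and (n, k) and writes tile (m, n).
 * Under the assumed owner-computes rule it runs on the owner of (m, n);
 * this returns how many of its input tiles live on another rank and
 * would therefore need an MPI transfer. */
static int gemm_remote_inputs(int m, int n, int k, int P, int Q)
{
    int me = tile_owner(m, n, P, Q);

    return (tile_owner(m, k, P, Q) != me) + (tile_owner(n, k, P, Q) != me);
}
```

On a 2 x 2 grid, the task updating tile (2, 1) at step k = 0 runs on rank 1 and needs both inputs from other ranks, whereas on a single rank no communication is needed at all.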
StarPU comes with a GCC plug-in that extends the C programming language with pragmas and attributes that make it easy to annotate a sequential C program to turn it into a parallel StarPU program.
All this means that, with the help of StarPU's extensions to the C language, the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also valid StarPU code, possibly running on all the CPUs and GPUs. Given a data distribution over MPI nodes, it is even a distributed version!
for (k = 0; k < tiles; k++) {
    potrf(A[k][k]);                           /* factorize the diagonal tile */
    for (m = k+1; m < tiles; m++)
        trsm(A[k][k], A[m][k]);               /* solve for the tiles below it */
    for (m = k+1; m < tiles; m++)
        syrk(A[m][k], A[m][m]);               /* update the remaining diagonal tiles */
    for (m = k+1; m < tiles; m++)
        for (n = k+1; n < m; n++)
            gemm(A[m][k], A[n][k], A[m][n]);  /* update the remaining off-diagonal tiles */
}
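To get a feel for the size of the task graph this loop nest produces, the sketch below replays the same loops and counts one task per kernel call instead of running the kernels. The struct task_counts type and the count_cholesky_tasks function are hypothetical helpers written for this illustration.

```c
#include <assert.h>

/* Number of tasks of each kind generated by the tiled Cholesky loop nest. */
struct task_counts {
    long potrf, trsm, syrk, gemm;
};

/* Replay the loop nest for a matrix of `tiles` x `tiles` tiles, counting tasks. */
static struct task_counts count_cholesky_tasks(int tiles)
{
    struct task_counts c = {0, 0, 0, 0};
    int k, m, n;

    assert(tiles >= 0);
    for (k = 0; k < tiles; k++) {
        c.potrf++;
        for (m = k+1; m < tiles; m++)
            c.trsm++;
        for (m = k+1; m < tiles; m++)
            c.syrk++;
        for (m = k+1; m < tiles; m++)
            for (n = k+1; n < m; n++)
                c.gemm++;
    }
    return c;
}
```

The gemm count grows cubically with the number of tiles while the others grow at most quadratically, so for larger matrices most of the work available to the scheduler comes from gemm tasks.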
To understand the performance obtained by StarPU, it is helpful to visualize the actual behaviour of applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized with the open-source ViTE (Visual Trace Explorer) tool.
Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers from a certain lack of parallelism. Measured speed: 175.32 GFlop/s.

This second trace depicts the behaviour of the same application using a scheduling strategy that tries to minimize load imbalance thanks to auto-tuned performance models, and to keep data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s.

Some software is known to be able to use StarPU to tackle heterogeneous architectures; here is a non-exhaustive list:
You can find below the list of publications related to applications using StarPU.
All StarPU-related publications are also listed here with the corresponding BibTeX entries.
A good overview is available in the following Research Report.
Last updated on 2012/10/03.