Parallel Computing at CSC
As a rough simplification, modern supercomputers consist of tightly connected PCs (computing nodes). To benefit from a supercomputer and surpass the performance of a regular PC, a program needs to utilize the resources (CPU, memory, cache memory, and I/O) of multiple nodes in parallel. In fact, even to get the most out of a single modern multi-core processor, a program needs to execute as multiple processes or threads in parallel.
CSC's experts help users choose the best parallelization method. Please contact CSC's Service Desk, firstname.lastname@example.org, if you have any questions.
MPI and OpenMP
One way to categorize parallel programming paradigms is to divide them into distributed memory approaches and shared memory approaches. The most widely used implementations of the distributed and shared memory paradigms are Message Passing Interface (MPI) libraries and OpenMP compiler directives, respectively.
MPI is the most widely used communication library on massively parallel supercomputers, where programs span multiple computing nodes. In addition to the communication subroutine library itself, MPI implementations include the system tools needed to compile and execute MPI programs.
In MPI programming, the tasks communicate by explicitly exchanging messages. This requires subroutine calls. In principle, there is no difference between tasks running within a single node or tasks running in different nodes.
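The message-passing idea can be illustrated without an MPI installation. The following is a minimal sketch only: it uses threads and queues from the Python standard library as a stand-in for MPI tasks and messages, and the names `worker` and `distributed_sum` are hypothetical. A real MPI program would call the send and receive routines of an MPI library instead.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """One task: receive a chunk of data as a message, send back a partial sum."""
    chunk = inbox.get()              # explicit receive
    outbox.put((rank, sum(chunk)))   # explicit send


def distributed_sum(data, n_tasks):
    """Split data among n_tasks tasks and combine their partial sums."""
    inboxes = [queue.Queue() for _ in range(n_tasks)]
    results = queue.Queue()
    tasks = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_tasks)
    ]
    for task in tasks:
        task.start()
    # The "root" sends every task its own chunk; tasks share no other data.
    for rank in range(n_tasks):
        inboxes[rank].put(data[rank::n_tasks])
    # Collect one result message per task.
    partial = dict(results.get() for _ in range(n_tasks))
    for task in tasks:
        task.join()
    return sum(partial.values())


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101)), 4))  # prints 5050
```

Each task works only on the data it received in a message, which is the essential property of the distributed memory model: it makes no difference to the program logic whether the tasks run on the same node or on different nodes.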
MPI is standardized and thus portable. MPI programming requires some effort due to the explicit communication model and relatively complex subroutine call syntax, but well-written MPI code typically performs well on most architectures.
OpenMP is most often used to utilize the multiple cores of a single processor or multiple processors within a single computing node (or PC, laptop, etc.). OpenMP is implemented as compiler directives (or pragmas) and threads, and as such, after compiling, does not require any additional tools in the system.
In OpenMP programming, it is assumed that the individual threads can all "see" the same memory areas, and separate communication using messages between the threads is not needed.
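The shared memory model can be sketched with Python threads, which also all see the same memory. This is an illustration of the model only, not OpenMP itself: the function names are hypothetical, and in standard CPython builds threads do not speed up CPU-bound loops the way OpenMP threads do.

```python
import threading


def square_range(shared, start, stop):
    """Each thread writes directly into its own part of the shared list."""
    for i in range(start, stop):
        shared[i] = i * i


def parallel_squares(n, n_threads):
    """Fill a list with squares, splitting the index range among threads."""
    shared = [0] * n                          # visible to every thread
    step = (n + n_threads - 1) // n_threads   # ceil(n / n_threads)
    threads = []
    for t in range(n_threads):
        start = t * step
        stop = min((t + 1) * step, n)
        threads.append(
            threading.Thread(target=square_range, args=(shared, start, stop))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared


if __name__ == "__main__":
    print(parallel_squares(10, 3))
```

No messages are exchanged: every thread reads and writes the same list, and because the index ranges do not overlap, no further coordination is needed in this particular case.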
OpenMP is also standardized and portable. Adding directives or pragmas to serial code is easy, and the code can be parallelized step by step. However, when multiple threads access the same shared memory location, synchronization must be handled explicitly to avoid race conditions. Performance is limited by the number of available threads (typically the threads within one node) and by any serial sections in the code.
A hybrid MPI/OpenMP model is also possible, in which, for example, a single MPI task contains multiple threads. A typical setup runs MPI between nodes and OpenMP within nodes. In some cases this improves performance by reducing congestion in the communication resources.
In addition to well-established MPI and OpenMP programming, CSC supports emerging technologies such as CUDA (for Nvidia GPGPUs), Xeon Phi coprocessor programming, and Co-Array Fortran (CAF).
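The need for explicit synchronization can be sketched as follows. This is a Python analogue under stated assumptions, not OpenMP code: the function name is hypothetical, and the lock plays roughly the role that a `critical` section plays in OpenMP.

```python
import threading


def count_with_lock(n_threads, increments):
    """Let several threads increment one shared counter safely."""
    counter = {"value": 0}      # shared memory location
    lock = threading.Lock()

    def work():
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


if __name__ == "__main__":
    print(count_with_lock(4, 10000))  # prints 40000
```

Without the lock, two threads could read the same old value and both write back the same new value, losing an update. The lock removes the race, but it also serializes the protected section, which is one reason why synchronization should be kept to a minimum.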
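The arithmetic behind a hybrid layout can be sketched as below. The function name and the example numbers (4 nodes with 16 cores each) are hypothetical and do not describe any particular CSC server.

```python
def hybrid_layout(nodes, cores_per_node, threads_per_task):
    """Return how many MPI tasks fit when each task runs several threads."""
    if cores_per_node % threads_per_task != 0:
        raise ValueError("threads_per_task must divide cores_per_node evenly")
    tasks_per_node = cores_per_node // threads_per_task
    return {
        "tasks_per_node": tasks_per_node,
        "total_tasks": nodes * tasks_per_node,
        "total_cores": nodes * cores_per_node,
    }


if __name__ == "__main__":
    # Hypothetical example: 4 nodes, 16 cores per node, 4 threads per MPI task.
    print(hybrid_layout(4, 16, 4))
    # prints {'tasks_per_node': 4, 'total_tasks': 16, 'total_cores': 64}
```

In this example a pure MPI run would use 64 tasks, whereas the hybrid run uses 16 tasks with 4 threads each, so fewer tasks take part in the communication between nodes.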
CSC's parallel computing environments and further information
CSC's parallel computing environments are described in detail in the CSC computing environment users' guide and in the users' guides of the computing servers. As a rough generalization, sisu.csc.fi is profiled for massively parallel MPI and MPI/OpenMP programs, and taito.csc.fi for sequential, OpenMP and MPI programs. On Sisu and Taito, programs are executed through the SLURM batch queue system. The Taito-shell service is intended for interactive programs.
- Rinnakkaisohjelmointi MPI:llä (PDF, Parallel Programming Using MPI, only in Finnish)
- CSC computing environment users' guide (a PDF version is also available)
- Sisu User's Guide
- Taito User's Guide
- Taito-shell User's Guide
Additionally, CSC has a comprehensive training curriculum for HPC programming, see courses.
Usage profiles of CSC's (parallel and serial) computers
All of CSC's computing servers support parallel computing, but they have different performance characteristics. The programming environments also vary. Thus the best platform for a computational job depends on the characteristics of the job. See below for further information and recommendations.
Parallel computers for massively parallel programs:
- Sisu (sisu.csc.fi)
- Taito (taito.csc.fi)
For interactive use:
- Taito-shell (taito-shell.csc.fi)
Sisu (Cray XC30) is the new supercomputer, profiled for large parallel jobs. It was taken into use in March 2013.
Taito is a supercluster for serial jobs, small parallel jobs, and jobs that require a lot of memory. Memory and CPUs can be reserved for exclusive use.
Taito-shell is intended for interactive jobs. Memory and CPUs are shared between all logged-in users.
|Server|Address|Type|Job types|Resource allocation|
|Sisu|sisu.csc.fi|supercomputer|parallel|Reservation of resources|
|Taito|taito.csc.fi|supercluster|parallel (256) and serial|Reservation of resources|
|Taito-shell|taito-shell.csc.fi|Interactive shell on Taito|serial and applications|Shared resources|
How is the computation quota billed?
Running parallel programs on CSC's computers consumes billing units, i.e. the computation quota granted by CSC to each project. A single CPU hour costs:
- on Sisu 2.0 billing units
- on Taito 2.0 billing units
On Sisu and Taito, the amount of used billing units is calculated by multiplying the number of processors, the execution time in hours, and the cost of a single CPU hour in billing units. On Taito-shell, the amount of used billing units is calculated by multiplying the actual CPU usage time by the cost of a single CPU hour in billing units.
In short, on Sisu and Taito you consume billing units based on your reservation, regardless of actual usage. On Taito-shell, actual usage is billed.
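The two billing rules can be written out as a short calculation. This is a sketch with hypothetical function names; the default rate of 2.0 billing units per CPU hour is the Sisu and Taito figure given above, while the Taito-shell rate is not stated in this text and therefore has to be supplied by the caller.

```python
def reserved_billing_units(processors, hours, rate=2.0):
    """Sisu/Taito rule: reserved processors x execution time x cost per CPU hour."""
    return processors * hours * rate


def usage_billing_units(cpu_hours_used, rate):
    """Taito-shell rule: actual CPU usage time x cost per CPU hour.

    The rate is an input because the text does not state it for Taito-shell.
    """
    return cpu_hours_used * rate


if __name__ == "__main__":
    # Hypothetical job: 64 processors reserved for 3 hours on Sisu or Taito.
    print(reserved_billing_units(64, 3))  # prints 384.0
```

Note that the reserved job above costs 384 billing units even if the program is idle for part of the three hours, because the reservation is billed rather than the usage.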
CSC grants all new projects a small amount of computation quota for measuring the performance of their parallel program. Quota for production runs is granted on application once the performance of the program meets the requirements.
When the computation quota is almost used up, the project needs to apply for more. Quota is applied for the programs listed in the project's application form. New programs taken into use in the project must be added to the application form, and their performance tests must be carried out.
It is possible to control the quota used by separate programs by applying for a separate project number for each program.
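The performance requirements themselves are not stated here. As an assumption, a commonly used way to summarize such measurements is speedup and parallel efficiency relative to a reference run; the sketch below uses hypothetical function names and hypothetical timings.

```python
def speedup(t_ref, t_n):
    """Ratio of the reference wall time to the wall time of the larger run."""
    return t_ref / t_n


def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Speedup divided by the increase in core count (1.0 means ideal scaling)."""
    return (t_ref * n_ref) / (t_n * n)


if __name__ == "__main__":
    # Hypothetical timings: 1000 s on 1 core, 80 s on 16 cores.
    print(speedup(1000, 80))                     # prints 12.5
    print(parallel_efficiency(1000, 1, 80, 16))  # prints 0.78125
```

In this hypothetical case the program runs 12.5 times faster on 16 cores, so about 78 % of the reserved capacity does useful work.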