Code Optimization

CSC's application specialists can help to optimize the performance of an application, whether it is a simple Matlab script or a complex HPC software package. Small optimization requests are handled as regular user support, while larger optimization tasks are handled through CSC's Code optimization service process. Support can be requested, and the application specialists contacted, through the CSC Service Desk.

Profiling

The optimization process begins with application profiling. Profiling is used to find out which parts of the program consume the most CPU time (or memory, I/O bandwidth, etc.) and where the possible performance bottlenecks are. Software packages often include internal timing routines that can be used for this, or one can use external profilers provided, for example, by processor vendors.
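
As a minimal illustration of internal timing, the following C++ sketch wraps a suspected hot spot with a std::chrono timer; the workload function here is a made-up stand-in, and a real profiler gives this information without modifying the code.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Hypothetical hot spot: a dummy workload standing in for a real
// application kernel whose cost we want to measure.
double compute_step(int n) {
    double s = 0.0;
    for (int i = 1; i <= n; ++i) s += std::sqrt(static_cast<double>(i));
    return s;
}

int main() {
    const auto start = std::chrono::steady_clock::now();

    double sink = 0.0;
    for (int iter = 0; iter < 100; ++iter)
        sink += compute_step(1'000'000);  // the region being timed

    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;

    // Print the result too, so the compiler cannot optimize the work away.
    std::printf("compute_step total: %.3f s (sink=%g)\n", elapsed.count(), sink);
    return 0;
}
```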

Subroutine libraries

It is very hard to beat thoroughly optimized mathematical subroutine libraries with self-written code, to the extent that in many cases smart compilers will silently replace hand-written matrix-matrix multiplication loops with library calls. In general, it is advisable to offload as much of the large matrix algebra, FFT routines, etc. as possible to mathematical subroutine libraries such as Intel MKL.
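
As a sketch of the library-first approach, the following C++ example replaces a hand-written triple loop with a single call to the standard CBLAS dgemm routine (available in Intel MKL, OpenBLAS, and other implementations; the required compile and link flags depend on the library used).

```cpp
#include <cstdio>
#include <vector>
#include <cblas.h>  // generic CBLAS header; with Intel MKL, use <mkl.h> instead

int main() {
    const int n = 512;
    // Row-major n-by-n matrices; we compute C = A * B.
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // dgemm computes C = alpha*A*B + beta*C using blocked, vectorized,
    // and threaded code paths that are very hard to match by hand.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                     B.data(), n,
                0.0, C.data(), n);

    std::printf("C[0] = %g\n", C[0]);  // expect 2*n = 1024
    return 0;
}
```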

Vectorization

Part of the high performance of modern CPUs comes from their ability to apply the same arithmetic operation simultaneously to two, four, or more pieces of data. Modern Intel and AMD processors provide special vector instructions for this (e.g. the AVX2 and AVX-512 instruction sets). Even though compilers can often use these vector instructions automatically, the programmer may need to give hints to the compiler or rearrange the program logic to fully utilize vectorization.
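
As a minimal sketch, most compilers can vectorize the loop below on their own at sufficient optimization levels; the OpenMP simd directive is one portable way to assert that the iterations are independent when the compiler cannot prove it by itself.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// axpy-style loop with independent iterations: the compiler can map it
// onto AVX2/AVX-512 vector instructions. The pragma makes the
// independence explicit (enable with e.g. -fopenmp-simd and -O2 or higher).
void daxpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp simd
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}

int main() {
    std::vector<double> x(1000, 1.0), y(1000, 2.0);
    daxpy(3.0, x, y);
    std::printf("y[0] = %g\n", y[0]);  // expect 5
    return 0;
}
```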

Memory access

Nowadays, the performance bottleneck in numerical algorithms is in most cases not the CPU's capability to perform arithmetic operations, but the memory's capability to supply data to the CPU. The problem is partly solved by a small and fast cache memory that acts as a buffer between the CPU and the main memory. In general, if memory is accessed in large enough contiguous segments, the cache is well utilized. Also, the more arithmetic operations that can be performed on the data already in the cache before the main memory is accessed again, the better.

Fixing memory access performance problems often involves rearranging the program's execution order or data structures, or moving to a lower-level programming language that allows finer control of memory access.
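
The following sketch illustrates the contiguous-access principle for a row-major array: keeping the fast-running loop index on the contiguous dimension lets each fetched cache line be used in full.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<double> a(n * n, 1.0);  // row-major layout: a[i*n + j]

    double sum = 0.0;

    // Cache-friendly: the inner index j walks contiguous memory, so every
    // cache line fetched from main memory is used in full.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += a[i * n + j];

    // Cache-hostile variant (loops swapped): consecutive accesses are
    // n*8 bytes apart, so nearly every access misses the cache.
    // for (std::size_t j = 0; j < n; ++j)
    //     for (std::size_t i = 0; i < n; ++i)
    //         sum += a[i * n + j];

    std::printf("sum = %g\n", sum);
    return 0;
}
```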

Parallel scalability

Using multiple tasks or threads to solve a single problem causes unavoidable overhead. It is likely that, when the number of processors is doubled, the time to solution is not halved; in the worst case, the time to solution may stay the same or even increase. Therefore, it is imperative to know the execution time as a function of the number of tasks and threads used, that is, the parallel scalability of the application.

The scalability also depends on the problem size. If the total size of the problem is kept fixed as the number of tasks is varied, we talk about strong scalability. If the problem size per process is kept fixed, we talk about weak scalability.
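
In standard notation (these are general definitions, not specific to CSC's service), the speedup and parallel efficiency on N tasks follow from the time to solution T(N), and Amdahl's law bounds the strong-scaling speedup when only a fraction p of the runtime can be parallelized:

```latex
S(N) = \frac{T(1)}{T(N)}, \qquad
E(N) = \frac{S(N)}{N}, \qquad
S(N) \le \frac{1}{(1 - p) + p/N} \quad \text{(Amdahl's law)}
```

In a weak-scaling test, ideal behavior corresponds instead to a constant time to solution as N and the total problem size grow together.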

There are problems that are inherently easier to parallelize and others that are harder. In general, the easier problems are those in which different parts of the data can be processed independently; many image processing algorithms, for example, fall into this category. The harder problems are usually those that involve long-range physical interactions, such as in electromagnetics or fluid dynamics.

Scalability bottlenecks are often caused by synchronization, memory bus congestion, message passing overheads, I/O, or load imbalance. As the number of parallel tasks is increased, different parts of the program execution may become bottlenecks.

I/O access

Even with massive parallel file systems, I/O is still relatively slow. In general, the bandwidth of the parallel file systems is adequate for most applications when reading and writing large blocks of data, but operations involving metadata, such as opening and closing files, may be even slower than on a local workstation hard drive.
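
As a sketch of the large-block principle, the following C++ example aggregates many small records into a memory buffer and writes them with a single large operation, instead of issuing one write (or worse, one open/close pair) per record.

```cpp
#include <cstdio>
#include <string>

int main() {
    // Stage many small records in memory and write them out in one large
    // operation, instead of one open/write/close per record, which
    // hammers the file system's metadata servers.
    std::string buffer;
    buffer.reserve(1 << 20);  // 1 MiB staging buffer

    for (int i = 0; i < 100000; ++i)
        buffer += "record " + std::to_string(i) + "\n";

    std::FILE* f = std::fopen("output.dat", "wb");
    if (!f) return 1;
    std::fwrite(buffer.data(), 1, buffer.size(), f);  // one large write
    std::fclose(f);
    return 0;
}
```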

Mahti and Puhti have their own separate parallel file systems. However, if a single application stresses the file system and starts to slow down due to file I/O, for example due to frequent metadata access, all applications using the same file system will likely suffer from reduced I/O performance, affecting a large number of users. Therefore, all I/O problems must be considered high priority.

GPU programming

GPUs are powerful hardware components that perform parallel processing on large blocks of data, delivering enormous computational capability. Due to their efficiency and cost-effectiveness, GPUs are a very good choice for computationally intensive tasks and are gaining an ever stronger foothold in modern supercomputers, including both Puhti and Mahti.

Unlike CPUs, which are built for flexibility and low-latency execution, GPUs are built for massively parallel execution of a single operation on multiple data, and as such they excel at high-throughput computing tasks. Leveraging GPUs requires not only a different programming model but often also rethinking some of the algorithms in the code, e.g. to avoid unnecessary memory copies between the CPU and the GPU.

There are basically three ways to take advantage of GPUs: directive-based programming, native GPU programming, and high-level frameworks. In the directive-based approach, an existing serial code is parallelized by adding small code snippets that look like comments and in essence guide the compiler on how to generate GPU code automatically. In the native approach, the whole program, or at least its performance-critical kernels, is written directly in a GPU programming language such as CUDA or HIP. As an alternative, one can use an external framework, such as Kokkos or AMReX, to automate the parallelization for GPUs.
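
As a minimal sketch of the directive-based approach, the following C++ loop uses OpenMP offloading directives (OpenACC is the analogous alternative). Running it on a GPU requires a compiler built with offload support; with an ordinary compiler the pragma is ignored and the loop runs on the CPU, which is what makes this approach incremental.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double* px = x.data();
    double* py = y.data();

    // The directive asks the compiler to generate GPU code for the loop
    // and the map clauses describe the data movement between host and
    // device memory. Minimizing such copies is a key optimization.
    #pragma omp target teams distribute parallel for \
            map(to: px[0:n]) map(tofrom: py[0:n])
    for (int i = 0; i < n; ++i)
        py[i] += 2.0 * px[i];

    std::printf("y[0] = %g\n", py[0]);  // expect 4
    return 0;
}
```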

It may be relatively easy and quick to achieve a nice speed-up for a serial program, but using GPUs at full capacity, or using more than one GPU in parallel, typically requires a significant programming effort.