Code Optimization - Services for Research
Our application specialists can help you optimize the performance of an application, whether it is a simple Matlab script or a complex high-performance computing (HPC) software package. Small optimization requests are handled as regular user support, while larger optimization tasks go through our code optimization service process. As a first step, regardless of the size of the task, contact CSC's application specialists through our service desk.
The optimization process begins with application profiling. Profiling is used to find out which parts of the program consume the most CPU time (or memory, I/O, and so on) and where the possible performance bottlenecks are. Software packages often include internal timing routines that can be used for this, or one can use external profilers provided by, for example, processor vendors.
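As a minimal illustration of profiling, the sketch below uses Python's built-in cProfile module on a deliberately unbalanced toy workload; the function names (`slow_part`, `fast_part`, `workload`) are hypothetical, not from any CSC application:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately costly: a sum of squares in a pure-Python loop
    return sum(i * i for i in range(n))

def fast_part(n):
    # Mathematically equivalent closed form, negligible cost
    return (n - 1) * n * (2 * n - 1) // 6

def workload():
    return slow_part(200_000) + fast_part(200_000)

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Report the functions sorted by cumulative time: the hot spot
# (slow_part) dominates the listing, which tells us where to optimize.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

External profilers (e.g., Intel VTune or AMD uProf) give the same kind of hot-spot breakdown for compiled codes without modifying the source.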
It is extremely hard to beat thoroughly optimized mathematical subroutine libraries with self-written code (to the extent that in many cases smart compilers will silently replace matrix-matrix multiplication for-loops with library calls). In general, it is advisable to offload as much of the large matrix algebra, FFT routines, and so on, as possible to mathematical subroutine libraries such as Intel MKL.
Part of the high performance of modern CPUs comes from their ability to apply the same arithmetic operation simultaneously to two, four, or more pieces of data. Modern Intel and AMD processors have special vector instructions for this (e.g., AVX2 and AVX-512). Even though compilers can often use these vector instructions automatically, a programmer may need to give hints to the compiler or rearrange the program logic to fully utilize vectorization.
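In high-level languages the practical route to vectorization is to express the computation as whole-array operations so it runs inside compiled code where the compiler can emit SIMD instructions. A minimal sketch (the SAXPY-style function names are illustrative):

```python
import numpy as np

def saxpy_loop(a, x, y):
    # Scalar style: one element per Python iteration, no chance
    # for the hardware's vector units to be used effectively
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vec(a, x, y):
    # Array style: NumPy evaluates this in compiled loops that
    # the compiler can vectorize with AVX-class instructions
    return a * x + y

x = np.arange(100_000, dtype=np.float64)
y = np.ones_like(x)

r_loop = saxpy_loop(2.0, x, y)
r_vec = saxpy_vec(2.0, x, y)
print("results agree:", np.allclose(r_loop, r_vec))
```

In compiled languages the same idea applies directly: simple, branch-free inner loops over contiguous data are what auto-vectorizers handle best.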
Nowadays, the performance bottleneck in numerical algorithms is in most cases not the CPU's capability to perform arithmetic operations, but the memory's capability to supply data to the CPU. The problem is partly solved by placing a small, fast cache memory between the CPU and the main memory. In general, if memory is accessed in large enough contiguous segments, the cache is well utilized. Also, the more arithmetic operations that can be performed on the data already in the cache before accessing main memory again, the better.
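The effect of access patterns can be demonstrated with a NumPy array stored in row-major (C) order, where the elements of a row are contiguous in memory but consecutive elements of a column are far apart:

```python
import time
import numpy as np

# Row-major array, large enough (~128 MB) not to fit in cache
a = np.zeros((4000, 4000))

t0 = time.perf_counter()
for i in range(a.shape[0]):
    a[i, :] += 1.0      # contiguous access: streams whole cache lines
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
for j in range(a.shape[1]):
    a[:, j] += 1.0      # strided access: uses one element per cache line fetched
t_cols = time.perf_counter() - t0

print(f"row-wise: {t_rows:.3f} s, column-wise: {t_cols:.3f} s")
```

On typical hardware the column-wise sweep is noticeably slower even though it performs exactly the same arithmetic; only the memory access pattern differs.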
Fixing memory access performance problems often involves rearranging the program execution order or data structures, or moving to a lower-level programming language that allows better control of memory access.
Using multiple processes or threads to solve a single problem causes unavoidable overhead. It is likely that doubling the number of processors will not halve the time to solution; in the worst case, the time to solution may stay the same or even increase. Therefore, it is imperative to know the execution time as a function of the number of tasks and threads used, that is, the parallel scalability of the application.
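The classical way to reason about this limit is Amdahl's law: if a fraction of the run time is inherently serial, that fraction caps the achievable speedup no matter how many processors are added. A small worked example:

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Ideal speedup when serial_fraction of the run time
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even a 5 % serial fraction caps the speedup far below the
# processor count: the limit as n_procs grows is 1 / 0.05 = 20.
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 1))
# prints:
# 2 1.9
# 8 5.9
# 64 15.4
# 1024 19.6
```

Real applications add communication and synchronization costs on top of this, which is why measured scalability curves must be obtained empirically.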
Scalability also depends on the size of the problem. If the total size of the system is kept fixed while the number of tasks is varied, one speaks of strong scaling. If the size of the system per process is kept fixed, one speaks of weak scaling.
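The two notions lead to different efficiency formulas when analyzing timings from a scaling study. A sketch with hypothetical measurements:

```python
def strong_scaling_efficiency(t1, tp, p):
    # Fixed total problem size: the ideal time on p processes is t1 / p
    return t1 / (p * tp)

def weak_scaling_efficiency(t1, tp):
    # Fixed problem size per process: the ideal time stays equal to t1
    return t1 / tp

# Hypothetical timings (seconds) from a scaling study
t_serial = 100.0
t_8procs_fixed_total = 16.0    # same total problem on 8 processes
t_8procs_fixed_per_proc = 125.0  # 8x larger problem on 8 processes

print("strong:", strong_scaling_efficiency(t_serial, t_8procs_fixed_total, 8))
print("weak:", weak_scaling_efficiency(t_serial, t_8procs_fixed_per_proc))
# prints:
# strong: 0.78125
# weak: 0.8
```

An efficiency of 1.0 would mean perfect scaling; values well below that indicate where adding more processes stops paying off.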
There are problems that are inherently easy to parallelize and others that are harder. In general, the easier problems are those in which different parts of the data can be processed independently; for example, many image processing algorithms fall into this category. The harder problems are usually those that involve long-range physical interactions, as in electromagnetics or fluid dynamics.
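The easy ("embarrassingly parallel") case can be sketched with a toy image-brightening task: each row depends on no other row, so rows can be handed to workers in any order. Threads are used here only to keep the sketch runnable; a real CPU-bound workload in Python would use processes or MPI instead:

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(row):
    # Each image row is processed independently of all others:
    # no communication or synchronization between workers is needed.
    return [min(p + 10, 255) for p in row]

# A toy 8 x 100 "grayscale image" with pixel values 0..99 per row
image = [list(range(100)) for _ in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    brightened = list(pool.map(brighten, image))

print(brightened[0][:5])   # prints [10, 11, 12, 13, 14]
```

Problems with long-range interactions lack this independence: every task needs data held by other tasks, so communication cost, not arithmetic, often dominates.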
Scalability bottlenecks are often caused by synchronization, memory bus congestion, message passing overheads, I/O, or load imbalance. As the number of parallel tasks is increased, different parts of the program execution may become bottlenecks.
Even with massively parallel file systems, I/O is still relatively slow. In general, the bandwidth of the parallel file systems is adequate for most applications when reading and writing large blocks of data. However, operations involving metadata, such as opening and closing files, may be even slower than on a local workstation hard drive.
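The metadata-friendly pattern is therefore to batch output into few, large files rather than many small ones. A sketch contrasting the two patterns (file names are illustrative):

```python
import os
import tempfile

records = [f"result {i}\n" for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    # Metadata-heavy pattern: one open + close (two metadata
    # operations) per tiny record -- slow on a parallel file system
    for i, rec in enumerate(records):
        with open(os.path.join(tmp, f"rec_{i}.txt"), "w") as f:
            f.write(rec)
    n_files = len(os.listdir(tmp))

    # Friendlier pattern: a single open/close, with all records
    # written through one buffered stream in large blocks
    combined = os.path.join(tmp, "all_records.txt")
    with open(combined, "w") as f:
        f.writelines(records)
    with open(combined) as f:
        n_lines = sum(1 for _ in f)

print(n_files, n_lines)   # prints 1000 1000
```

On a local disk the difference is modest; on a shared parallel file system the thousand extra metadata operations are exactly what degrades performance for everyone.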
Mahti and Puhti have their own separate parallel file systems. Still, if a single application stresses the file system and starts to slow down due to file I/O, for example because of frequent metadata access, all applications using the same file system will likely suffer from reduced I/O performance and a large number of users will be affected. Therefore, all I/O problems must be treated as high priority.
GPUs are powerful hardware components that can process large blocks of data in parallel to deliver enormous computational capability. Due to their efficiency and cost-effectiveness, GPUs are a very good choice for computation-intensive tasks, and they are gaining an ever stronger foothold in modern supercomputers, including both Puhti and Mahti.
Unlike CPUs, which are built for flexibility and low latency, GPUs are built for massively parallel execution of the same operation on multiple data elements; they excel at high-throughput computing tasks. Leveraging GPUs requires a different programming model, and often some of the algorithms in the code need to be rethought, for instance to avoid unnecessary memory copies between CPUs and GPUs.
There are essentially three ways to take advantage of GPUs: directive-based programming, native GPU programming, and high-level frameworks. In the directive-based approach, an existing serial code is parallelized by adding small code snippets that look like comments and in essence guide the compiler in automatically generating GPU code. In native GPU programming, the whole program or selected kernels are written directly in a GPU programming language such as CUDA or HIP. As an alternative, one can use an external framework (such as Kokkos or AMReX) to automate the parallelization for GPUs.
It may be relatively easy and quick to achieve a nice speed-up over a serial program. However, using GPUs at full capacity, or using more than one GPU in parallel, typically requires significant programming effort.