Code Optimization

CSC's application specialists can help optimize the performance of an application, whether it is a simple Matlab script or a complex HPC software package. Small optimization requests are handled as regular user support, while larger optimization tasks are handled through CSC's Code optimization service process. Support can be requested, and the application specialists contacted, through the CSC Service Desk.

Profiling

The optimization process begins with application profiling. Profiling is used to find which parts of the program consume the most CPU time (or memory, I/O bandwidth, etc.) and where the possible performance bottlenecks are. Software packages often include internal timing routines that can be used to build the profile, or one can use an external profiler such as CrayPat.
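Before reaching for a full profiler, a simple wall-clock timer around suspect regions already narrows the search. Below is a minimal sketch in C using the POSIX clock_gettime() routine; the wall_time() helper is an illustrative name, not part of CrayPat or any CSC tool.

```c
/* A minimal internal timing sketch using the POSIX clock_gettime()
 * call. The helper and region names are illustrative placeholders. */
#include <stdio.h>
#include <time.h>

/* Return the current wall-clock time in seconds. */
static double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double t0 = wall_time();

    /* ... the code region to be profiled goes here ... */

    double t1 = wall_time();
    printf("region took %.6f s\n", t1 - t0);
    return 0;
}
```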

Subroutine libraries

It is very hard to beat thoroughly optimized mathematical subroutine libraries with self-written code, to the extent that in many cases smart compilers will silently replace matrix-matrix multiplication for-loops with library calls. In general, it is advisable to offload as much of the large matrix algebra, FFT routines, and similar operations as possible to mathematical subroutine libraries such as Intel MKL or Cray LibSci.
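As an illustration, a matrix-matrix multiplication is a single call to the BLAS routine dgemm. The sketch below assumes the C interface to BLAS (CBLAS); the header name varies by implementation (for example, mkl.h for Intel MKL).

```c
/* A sketch of offloading matrix multiplication to an optimized BLAS
 * library through the CBLAS interface. Header name varies by vendor:
 * <cblas.h> here, <mkl.h> for Intel MKL. */
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 1024;
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);

    for (int i = 0; i < n * n; i++) {
        A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C, with row-major n-by-n matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```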

Memory access

Nowadays, the performance bottleneck in numerical algorithms is usually not the CPU's capability to perform arithmetic operations, but the memory system's capability to supply data to the CPU. The problem is partly alleviated by the small, fast cache memory that acts as a buffer between the CPU and the main memory. In general, the cache is well utilized if memory is accessed in large enough contiguous segments. Also, the more arithmetic operations that can be performed on the data already in the cache before the main memory is accessed again, the better.

Fixing memory access bottlenecks often involves rearranging the program's execution order or data structures, or moving to a lower-level programming language that allows finer control over memory access.
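As an example of such rearranging, a two-dimensional array in C is stored row by row, so traversing it row-wise walks through contiguous memory while traversing it column-wise jumps through memory in large strides. The sketch below computes the same sum both ways; only the loop nesting differs.

```c
/* A sketch of how loop order affects memory access in C, where
 * two-dimensional arrays are stored row-major (row by row). */
#include <stddef.h>

#define N 4096

double a[N][N];

/* Cache-friendly: the inner loop walks contiguous memory. */
double sum_row_major(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-unfriendly: the inner loop strides N doubles per step,
 * so nearly every access misses the cache for large N. */
double sum_col_major(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```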

Parallel scalability

Using multiple tasks or threads to solve a single problem necessarily causes overhead. It is likely that when the number of processors is doubled, the time to solution is not halved; in the worst case, the time to solution may stay the same or even increase. Therefore, it is imperative to know the execution time as a function of the number of tasks and threads used, that is, the parallel scalability of the application.
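A common way to quantify this is through the speedup and the parallel efficiency. As a standard background sketch (not specific to CSC systems), with T(N) denoting the time to solution on N tasks:

```latex
% Speedup and parallel efficiency, with T(N) the time to solution on N tasks
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}

% Amdahl's law: if a fraction f of the runtime is inherently serial,
% the speedup is bounded regardless of N
S(N) \le \frac{1}{f + (1 - f)/N}
```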

Scalability depends on the problem size. If the total size of the system is kept fixed as the number of tasks is varied, we talk about strong scalability. If the size of the system per task is kept fixed, we talk about weak scalability.
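The two notions suggest slightly different efficiency measures; as a sketch, again with T(N) the time to solution on N tasks:

```latex
% Strong scaling: fixed total problem size, ideal time falls as 1/N
E_{\mathrm{strong}}(N) = \frac{T(1)}{N \, T(N)}

% Weak scaling: fixed problem size per task, ideal time stays constant
E_{\mathrm{weak}}(N) = \frac{T(1)}{T(N)}
```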

Some problems are inherently easier to parallelize than others. In general, the easier problems are those in which different parts of the data can be processed independently; many image processing algorithms fall into this category, for example. The harder problems are usually those that involve long-range physical interactions, such as electromagnetics or fluid dynamics.

Scalability bottlenecks are often caused by synchronization, memory bus congestion, message-passing overheads, I/O, or load imbalance. As the number of parallel tasks increases, different parts of the program may become the bottleneck.

I/O access

Even with massive parallel file systems, I/O is still relatively slow. In general, the bandwidth of the parallel file systems is adequate for most applications when reading and writing large blocks of data, but operations involving metadata, such as opening and closing files, may be even slower than on a local workstation hard drive.
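A practical consequence is to favor a few large transfers over many small ones, and to avoid opening and closing files inside loops. The sketch below illustrates this with plain C standard I/O; the file name is a placeholder.

```c
/* A sketch of I/O access patterns on a parallel file system: write one
 * large block instead of many small writes, and open the file once
 * instead of once per record. "output.dat" is a placeholder name. */
#include <stdio.h>
#include <stdlib.h>

#define NREC 1000000

int main(void)
{
    double *buf = malloc(NREC * sizeof *buf);
    for (size_t i = 0; i < NREC; i++)
        buf[i] = (double)i;

    /* Good: one open, one large contiguous write, one close. */
    FILE *fp = fopen("output.dat", "wb");
    if (!fp)
        return 1;
    fwrite(buf, sizeof *buf, NREC, fp);
    fclose(fp);

    /* Bad (shown for contrast only): reopening the file for every
     * record hammers the metadata servers shared by all users.
     *
     * for (size_t i = 0; i < NREC; i++) {
     *     FILE *f = fopen("output.dat", "ab");
     *     fwrite(&buf[i], sizeof buf[i], 1, f);
     *     fclose(f);
     * }
     */

    free(buf);
    return 0;
}
```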

CSC's parallel file system is a resource shared between all users and computers in the Kajaani datacenter. If a single application stresses the file system and starts to slow down due to file I/O, for example through frequent metadata access, it likely means that all applications using the same file system (in practice, the whole Kajaani datacenter) suffer from reduced I/O performance. Therefore, all I/O problems must be treated as high priority.