4.6 Profiling Parallel Applications with CrayPAT
You can identify and analyse the performance bottlenecks of your application with the Cray Performance Analysis Toolkit (CrayPAT). You will need a representative test input that is large enough to represent the problem in production runs but still runnable in a couple of minutes. For instance, with a simulation code, take a full-sized problem but run only a few time steps. You should also understand the parallel scalability of the code with the given problem. Carry out the steps below on a core count at which the code still scales.
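One way to judge whether a core count still scales is to compute the parallel efficiency from the wall-clock times of two runs. The sketch below is only an illustration: the function name parallel_efficiency and its argument order are made up for this example, and the timings have to come from your own measurements. An efficiency close to 1 relative to the smaller run suggests that the code still scales at the larger core count.

```shell
# parallel_efficiency BASE_CORES BASE_TIME CORES TIME
# Hypothetical helper, not part of CrayPAT.
# Prints the parallel efficiency of the larger run relative to the
# baseline run:  (BASE_CORES * BASE_TIME) / (CORES * TIME)
parallel_efficiency() {
    LC_ALL=C awk -v p0="$1" -v t0="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", (p0 * t0) / (p * t) }'
}
```

For example, parallel_efficiency 24 100 48 50 compares a made-up 24-core run taking 100 seconds with a 48-core run taking 50 seconds; doubling the cores halved the time, so the efficiency is 1.00.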
Start by loading the CrayPAT modules:
module load perftools-base
module load perftools
Rebuild your application, e.g. make clean && make.
Instrument the application with
pat_build a.out
(here a.out is the name of your binary). You should get a new binary with a "+pat" suffix.
Run the resulting +pat binary with the selected core count. In addition to the usual output files of the application, you should obtain a file with an ".xf" suffix.
Generate a performance report based on sampling with
pat_report a.out+pat+....xf > samp.prof
Here, replace a.out+pat+....xf with the proper filename. This should produce the sampling report (file samp.prof), a file with an .apa suffix and a file with an .ap2 suffix.
Read through the sampling profile to see, for example, in which routines the time is being spent. Next, select the most important user and library routines for more detailed analysis. The .apa file controls the tracing experiment: by editing it you can include more library groups (MPI by default, see "man pat_build" for all possible options) and user functions in the tracing experiment.
Build a new instrumented binary with
pat_build -O (...).apa
This should give yet another binary, this time with a "+apa" suffix.
Run the new +apa binary. This should produce yet another .xf file.
Apply pat_report again:
pat_report (...).xf > tracing.prof
where (...).xf is the name of the most recent .xf file. Read through the file tracing.prof. The profile shows which user functions should be assessed for optimization (do not spend effort on functions that do not consume a significant amount of time), and the reported hardware counters indicate possible performance issues in single-core execution (cache misses etc.). Have a look also at the CrayPAT GUI:
app2 (name of the most recent .ap2 file)
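Because each experiment adds new .xf and .ap2 files to the working directory, it is easy to pick up an older one by mistake. A small shell helper along the following lines can select the most recently modified file with a given suffix. The name latest_file is hypothetical and not part of CrayPAT, and the sketch assumes file names without newlines.

```shell
# latest_file SUFFIX
# Hypothetical helper, not part of CrayPAT.
# Prints the most recently modified entry in the current directory
# whose name ends in SUFFIX (for example .xf or .ap2).
# Prints nothing if no such entry exists.
latest_file() {
    # -t sorts by modification time (newest first); -d lists a matching
    # directory itself instead of its contents.
    ls -td -- *"$1" 2>/dev/null | head -n 1
}

# Usage sketch (requires the perftools modules to be loaded):
#   pat_report "$(latest_file .xf)" > tracing.prof
#   app2 "$(latest_file .ap2)"
```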
After studying the tracing profile and the GUI view (note that you can control the pat_report output, see "pat_report -O help"), repeat the steps above using a core count at which the code no longer scales. Compare the profiles to understand the scalability bottlenecks.