Geocomputing - Services for Research
Intro to the topic, webinar from 23.9.2020
CSC offers a wide range of very high level computing services for any field of research. Moving to CSC computers might make sense in these cases:
- Computing something takes more than 2-4 hours
- Need more memory
- Working with very big datasets
- Need to work with GIS data or software already available in Puhti
- Keep your desktop computer for normal usage, do computation elsewhere
- Need for a server computer (cPouta)
- Need for a lot of computers with the same set-up for courses (CSC Notebooks)
- GPU or MPI programs
Usage of CSC's computing environments is mostly free of charge for users from Finnish universities and state research institutes.
CSC's supercomputers have fast data I/O and a lot more memory than normal desktop computers. In general, the computing speed of one CPU is not much better than that of a normal desktop computer, but there are thousands of CPUs compared to a few in a desktop computer. Using CSC's computers can therefore significantly reduce computing time if the analysis can run in parallel on several CPUs or on a GPU.
For GIS users, Puhti and cPouta are likely to be the most valuable computing environments.
Puhti is CSC's supercomputer and is usually the first option to consider for GIS users. Puhti has several GIS software packages installed and also includes some of the larger Finnish datasets. It is a ready environment: you just need to log in and start working. It is, however, mostly a Linux system used from a text terminal, not a desktop or web application. The main reasons why Puhti might not be suitable for some analyses are software incompatibility and the user's limited Linux skills.
Most software available for Linux can be installed on Puhti. Many common GIS software packages have already been installed on Puhti by CSC. At the moment Puhti's GIS software includes: CloudCompare, FORCE, GDAL, GRASS, LasTools, OpenDroneMap, Orfeo ToolBox, PCL, PDAL, QGIS, SagaGIS, sen2cor, SNAP, SPLITS, WhiteboxTools and Zonation. Additionally, R and Python are available with pre-installed spatial packages.
Before using any of the installed software, the related module must always be loaded first; see the page of the specific software for details. You can also install software yourself in the CSC environment for personal use. Puhti also has GPU partitions, which are mostly used for deep learning.
Puhti's operating system is Linux, so software available only for Windows, for example ArcGIS and Erdas, cannot be installed there. Server software, such as PostGIS or GeoServer, is not suitable for Puhti either. For these, cPouta (described below) can be used.
Puhti has a shared data folder for spatial data, which is available to all users and includes the most important open GIS datasets of Finland, including the NLS DEM, lidar data and topographic database, LUKE VMI, all SYKE open data and many more.
You can also move your own data to Puhti. Different directories are available for different purposes. In the scratch directory everybody has 1 TB by default, which can be extended on request. Scratch is cleaned periodically, so also keep a copy of your important files in the Allas object storage. GDAL and all software based on it can read data directly from Allas. GDAL does not support writing directly to Allas, so normally you have to write your output files to scratch first and then move them to Allas. With Python and R scripts it is possible to write directly to Allas.
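As a minimal sketch of reading directly from Allas, the snippet below builds a GDAL virtual file system path and opens a raster through it. It assumes that Allas is accessed through its S3 interface, that the endpoint is `a3s.fi`, and that S3 credentials are already configured for GDAL. The bucket and object names are hypothetical placeholders, and the helper functions are written for this example only.

```python
def allas_vsi_path(bucket, object_name):
    """Return a GDAL /vsis3/ path for an object in an S3 bucket."""
    return f"/vsis3/{bucket}/{object_name.lstrip('/')}"


def read_raster_size(bucket, object_name, endpoint="a3s.fi"):
    """Open a raster stored in object storage and return (width, height).

    The endpoint is an assumption; check the Allas documentation for the
    value that applies to your project.
    """
    # Imported here so that the path helper also works without GDAL installed.
    from osgeo import gdal

    gdal.SetConfigOption("AWS_S3_ENDPOINT", endpoint)
    dataset = gdal.Open(allas_vsi_path(bucket, object_name))
    if dataset is None:
        raise RuntimeError("GDAL could not open the object; check name and credentials")
    return dataset.RasterXSize, dataset.RasterYSize


if __name__ == "__main__":
    # Hypothetical bucket and object names.
    print(allas_vsi_path("my-bucket", "dem/example.tif"))
```

Output files would still be written to scratch first, as described above, because only reading is done through the virtual path here.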
Working in Puhti
Normally work in Puhti is done using scripts. The most commonly used scripting languages for GIS are R, Python and bash; MATLAB and Julia are also available. When moving existing R or Python scripts to Puhti from a Windows environment, you usually only need to modify the file paths. You also have to check that the packages you use are available, but you can install additional packages for your own use.
The scripts are run as jobs in Puhti. Jobs make it possible to organize and balance the use of computing resources between different users. A job is started with a batch job file. In principle there are three kinds of batch jobs:
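One way to keep the path changes small is to collect the base data directory in a single place. The sketch below reads it from an environment variable, so the same script can run on a Windows desktop and on Puhti. The variable name `GIS_DATA_DIR` and the example file names are hypothetical, not a CSC convention.

```python
import os
from pathlib import Path

# Hypothetical environment variable; set it to the data folder on each machine.
DATA_DIR_VAR = "GIS_DATA_DIR"


def data_path(*parts, default="data"):
    """Build a path under the data directory configured for this machine."""
    base = Path(os.environ.get(DATA_DIR_VAR, default))
    return base.joinpath(*parts)


if __name__ == "__main__":
    # Uses ./data unless GIS_DATA_DIR points somewhere else.
    print(data_path("dem", "example.tif"))
```

With this pattern only the environment variable changes between machines, while the script itself uses forward-slash-free, platform-independent paths.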
- Single core serial jobs with "normal" GIS software. You run your code as it is, just in Puhti. This will not be much faster than on a desktop, but for long computations just freeing up your desktop might be useful.
- Array jobs with several cores, with "normal" GIS software. The idea of an array job is to run the same script several times simultaneously. These jobs are unaware of each other, and the user has no control over their execution order. In a GIS context array jobs are useful, for example, if you are doing the same analysis for different map sheets, different scenarios or different time periods.
- Parallel jobs with several cores.
  - Many scientific software packages support this option, so this is the most common usage type in Puhti in general. Recently some GIS software packages have also started to support parallel computing out of the box.
  - Many programming languages, including R and Python, support parallel computing. Using these features, you can write your own scripts that run in parallel.
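As an illustration of the array job idea, the sketch below lets each array task pick its own map sheet. It assumes that the batch system is Slurm, which sets the `SLURM_ARRAY_TASK_ID` environment variable for each task of an array job. The map sheet names and the `analyse` function are hypothetical placeholders for a real analysis.

```python
import os

# Hypothetical map sheet identifiers; in practice these could be read from a file.
MAP_SHEETS = ["L4131", "L4132", "L4133", "L4134"]


def sheet_for_task(task_id, sheets=MAP_SHEETS):
    """Return the map sheet that the array task with this index should process."""
    if not 0 <= task_id < len(sheets):
        raise IndexError(f"No map sheet for array task {task_id}")
    return sheets[task_id]


def analyse(sheet):
    """Placeholder for the real analysis of one map sheet."""
    return f"processed {sheet}"


if __name__ == "__main__":
    # Falls back to task 0 when run outside an array job, e.g. for testing.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(analyse(sheet_for_task(task_id)))
```

Because each task only reads its own index, the tasks stay independent of each other and of the order in which they are executed.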
CSC provides some example scripts for spatial data analysis in Puhti. The examples also include batch job scripts. Some of the examples include similar solutions for serial, array and parallel jobs. There are examples for Allas, Python, R, FORCE, GDAL, GRASS, PDAL, SNAP and machine learning. The GeoPortti GitHub also includes some longer examples.
Puhti has an interactive partition for using tools in the "normal" way. It is meant for smaller interactive analysis tasks and for software with a graphical user interface (GUI). In this way you can use, for example, CloudCompare, QGIS, SNAP, GRASS GIS, SagaGIS, RStudio or Spyder for Python. The Puhti web interface is the best option for using software with a GUI.
cPouta is an Infrastructure-as-a-Service (IaaS) offering from CSC. It offers different hardware setups on which the user has to install everything needed from scratch (OS, software, network configuration etc.). It is therefore not suitable for small, trivial computing needs. On the other hand, this gives the user the freedom to install custom computing environments. cPouta is ideal for running server software, for example PostGIS and GeoServer. Expert users can also set up their own computing clusters. cPouta requires server administration, software installation and Linux skills.
In practice cPouta supports only different Linux versions, so setting up ArcGIS Pro or ArcGIS Desktop there is not easily possible. The easiest way to use some ArcGIS functionality is to install ArcGIS Server for Linux on cPouta and run ArcPy scripts.
Performance hints for geocomputing
- Use profiling tools to see which parts of your script are the slowest, and look for ways to make those parts faster. Most programming languages have their own profiling tools.
- Different algorithms and different functions from different packages may take very different amounts of time for the same computation task.
- Watch out for 'for loops' and try to find alternatives, such as vectorized functions.
- Make the script run in parallel.
- When working with big raster datasets, using virtual rasters might be very helpful.
- When working with big vector datasets, using a database could be appropriate.
- Remove unnecessary data (clip, select, generalize).
- Index vector data if your software can use the index.
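The first hints can be tried out with Python's built-in `cProfile` module. The sketch below profiles a deliberately slow loop and compares it with a closed-form alternative. Both functions are toy examples written for this illustration, not GIS code; in a real script the profiled function would be your own analysis step.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Sum 0^2 + 1^2 + ... + (n-1)^2 with an explicit for loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_of_squares(n):
    """Same result from the closed-form formula, without a loop."""
    return (n - 1) * n * (2 * n - 1) // 6


def profile(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = profile(slow_sum_of_squares, 1_000_000)
    print(report)
    print(result == fast_sum_of_squares(1_000_000))
```

The report lists the time spent per function, which shows where optimization or parallelization is most likely to pay off.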
Next steps for starting with Puhti
Important Puhti documentation pages:
- Get started with CSC computing services (administrative side)
- Connecting to Puhti
- Moving data to CSC and back
- Linux tutorial
- Puhti software pages
- GIS course materials, including Geocomputing using CSC resources, R and Python GIS courses, machine learning with spatial data etc.
- Geocomputing seminar materials, including point cloud and EO workshops and several use case presentations.
Geocomputing-related news is sent to the gis-hpc mailing list; you are welcome to join!
If you have any questions or comments, or need some other software or data on Puhti, please contact the CSC Servicedesk.