Geocomputing - Services for Research
Intro to the topic, webinar from 10.4.2018
CSC offers a wide range of high-level computing services for any field of research. A typical GIS user works with desktop software on a daily basis. Moving to CSC computers might make sense in these cases:
- Computing something takes more than 2-4 hours
- Need more memory
- Working with very big datasets
- Need to work with GIS data or software already available in Puhti
- Keep your desktop computer for normal usage, do computation elsewhere
- Need for a server computer (cPouta)
- Need for a lot of computers with the same set-up (courses)
- GPU or MPI programs
Use of CSC's computing environments is normally free of charge for users from Finnish universities and state research institutes. Other users are also welcome; for them the price list is available here.
CSC's supercomputers have fast data I/O and a lot more memory than normal desktop computers. In general, the computing speed of one CPU is not much better than that of a normal desktop computer, but there are thousands of CPUs compared to a few in a desktop computer. Using CSC's computers can significantly reduce computing time if the analysis can run in parallel on several CPUs.
For GIS users, Puhti and cPouta in particular should be valuable computing environments.
Puhti is CSC's supercomputer and could be the first option to consider for GIS users. Puhti has several GIS software packages installed and also includes some of the bigger Finnish datasets. It is a ready environment: you just need to log in and start working! But it is mostly a black-terminal Linux system, not a fancy desktop or web application. The main reasons why Puhti might not be suitable for some analyses are software incompatibility and the user's limited (Linux) skills.
Most software available for Linux can be installed on Puhti. Many common GIS software packages are already installed on Puhti by CSC. At the moment Puhti's GIS software includes: GDAL, FORCE, LasTools, Mapnik, OpenDroneMap, Orfeo ToolBox, PDAL, QGIS, SagaGIS, sen2cor, SNAP, WhiteboxTools and Zonation. Additionally, R and Python are available with pre-installed spatial packages. MATLAB is installed, but users need their own licenses.
Before using any of the installed software, the related module must always be loaded first; please see the linked pages for details about the specific software. It is also possible to install software yourself in the CSC environment for personal use. Puhti also has GPU partitions, which are mostly used for deep learning.
Puhti's operating system is Linux, so software available only for Windows, for example ArcGIS and Erdas, cannot be installed there. Server-type software, such as PostGIS or GeoServer, is not suitable for Puhti either. For these cPouta can be used; please see the next chapter.
Puhti has a shared data folder for spatial data, which is available to all users and includes the most important open GIS datasets of Finland, including NLS DEM, lidar data and topographic database, LUKE VMI, all SYKE open data and many more.
You can also move your own data to Puhti. Different directories are available for different purposes. In the scratch directory everybody has 1 TB by default, which can be extended on request. Scratch is cleaned periodically, so also keep a copy of your important files in Allas object storage. GDAL and all other software based on it can also read data directly from Allas. GDAL does not support direct writing to Allas, so normally you have to write your output files to scratch first and then move them to Allas.
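
As a rough sketch of reading directly from Allas: GDAL can address objects in S3-compatible object storage through its /vsis3/ virtual file system. The bucket and object names below are hypothetical, and the endpoint host a3s.fi as well as the presence of S3 credentials in the environment are assumptions that should be checked against the Allas documentation.

```python
def allas_vsi_path(bucket: str, key: str) -> str:
    """Build a GDAL /vsis3/ path for an object in S3-compatible storage."""
    return f"/vsis3/{bucket}/{key.lstrip('/')}"


def open_from_allas(bucket: str, key: str):
    """Open a raster straight from object storage (needs GDAL and S3 credentials)."""
    from osgeo import gdal  # imported here so the path helper works without GDAL

    gdal.UseExceptions()
    # Assumption: Allas is reached through this S3 endpoint; verify the host name.
    gdal.SetConfigOption("AWS_S3_ENDPOINT", "a3s.fi")
    return gdal.Open(allas_vsi_path(bucket, key))


# Hypothetical bucket and object names, for illustration only.
print(allas_vsi_path("my-bucket", "dem/tile_001.tif"))
```

Only the reading side is sketched here; output files would still be written to scratch and copied to Allas afterwards.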
With big raster datasets divided into many map sheets, virtual rasters may be very helpful.
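
A virtual raster can be built with the GDAL Python bindings roughly as follows. This is a minimal sketch that assumes the map sheets are GeoTIFF files in a single folder; the helper names are hypothetical, and GDAL is imported only inside the function that needs it.

```python
from pathlib import Path


def find_map_sheets(folder: str, pattern: str = "*.tif") -> list[str]:
    """Return the raster tiles in a folder, sorted so the result is reproducible."""
    return sorted(str(p) for p in Path(folder).glob(pattern))


def build_virtual_raster(folder: str, vrt_path: str) -> str:
    """Combine all tiles in `folder` into one .vrt file (requires GDAL)."""
    from osgeo import gdal  # imported lazily; only this function needs GDAL

    gdal.UseExceptions()
    tiles = find_map_sheets(folder)
    if not tiles:
        raise FileNotFoundError(f"no tiles matching *.tif in {folder}")
    vrt = gdal.BuildVRT(vrt_path, tiles)
    vrt = None  # closing the dataset writes the .vrt file to disk
    return vrt_path
```

The resulting .vrt file is a small XML file that points to the original tiles, so it can be opened like a single raster without copying any data.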
Working in Puhti
Work in Puhti is normally done using scripts. The most commonly used scripting languages for GIS are R, Python and bash. When moving your existing R or Python scripts to Puhti from a Windows environment, you usually only need to modify the file paths. You also have to check that the packages you use are available, but you can install your own packages for your own use.
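
One way to keep path changes small is to define the data folder in a single place. The sketch below assumes a hypothetical environment variable GIS_DATA_DIR and a hypothetical file name; on Windows the variable could point to a local folder and on Puhti to a project's scratch directory.

```python
import os
from pathlib import Path

# Hypothetical environment variable and fallback folder, for illustration only.
DATA_DIR = Path(os.environ.get("GIS_DATA_DIR", "data"))


def input_file(name: str, base: Path = DATA_DIR) -> Path:
    """Build an input path from the configurable base folder."""
    return base / name


print(input_file("roads.gpkg").as_posix())
```

With this layout only the value of the base folder changes between environments, while the rest of the script stays the same.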
The scripts are run as jobs in Puhti. Jobs make it possible to organize and balance the use of computing resources between different users. A job is started with a batch job file. In principle there are three kinds of batch jobs:
- Single core serial jobs with "normal" GIS-software. You run your code as it is, just in Puhti. This will not be much faster than using a desktop, but for long computations just freeing up your desktop might be useful.
- Array jobs with several cores, with "normal" GIS-software. The idea of an array job is to run the same script several times simultaneously. These jobs are unaware of each other, and the user has no control over their execution order. In a GIS context, array jobs are useful, for example, if you are doing the same analysis for different map sheets, different scenarios or different time periods.
- Parallel jobs with several cores.
  - Many scientific software packages support this option, so this is the most common usage type in general in Puhti. But only very few GIS software packages support parallel computing out of the box, for more info see here.
  - Many programming languages, including R and Python, support parallel computing. Using these features it is possible to write scripts that run in parallel.
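
The array job idea can be sketched in Python as below. It assumes that the batch system is Slurm, which sets the SLURM_ARRAY_TASK_ID environment variable for each task, and that the array indices start from 0 (for example --array=0-3). The map sheet names are hypothetical.

```python
import os

# Hypothetical map sheet names; in practice the list could be read from a text file.
MAP_SHEETS = ["L4131", "L4132", "L4133", "L4134"]


def sheet_for_task(task_id: int, sheets: list[str]) -> str:
    """Pick the map sheet that this array task should process."""
    if not 0 <= task_id < len(sheets):
        raise IndexError(f"task id {task_id} has no matching map sheet")
    return sheets[task_id]


def main() -> None:
    # Slurm sets SLURM_ARRAY_TASK_ID for each task of an array job;
    # fall back to 0 so the script also runs outside a batch job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    sheet = sheet_for_task(task_id, MAP_SHEETS)
    print(f"Task {task_id} processes map sheet {sheet}")


if __name__ == "__main__":
    main()
```

Each task runs the same script but picks a different map sheet, so the tasks do not need to know about each other.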
We provide some R and Python examples for Puhti on GitHub. The examples also include batch job scripts. Some of the examples include similar solutions for serial, array and parallel jobs.
Puhti has an interactive partition for interactive work. It is meant for smaller interactive analysis tasks and for using software with a graphical user interface (GUI). In this way you can use, for example, QGIS, RStudio or the Spyder IDE for Python. To use software with a GUI, the connection to Puhti has to be made via X shell or NoMachine desktop.
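
As a minimal illustration of the parallel variant, and not a copy of the GitHub examples, the sketch below distributes map sheets over several processes with Python's standard library. The function process_sheet is a placeholder for the real analysis and the map sheet names are hypothetical. Reading the core count from SLURM_CPUS_PER_TASK assumes a Slurm batch job that requested cores with --cpus-per-task.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def process_sheet(sheet: str) -> str:
    """Placeholder for the real per-sheet analysis."""
    return f"{sheet}: done"


def worker_count() -> int:
    """Use the cores reserved for the job, or 1 outside a batch job."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))


def run_parallel(sheets: list[str], workers: int) -> list[str]:
    """Process the sheets in separate processes; results keep the input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sheet, sheets))


if __name__ == "__main__":
    for line in run_parallel(["L4131", "L4132", "L4133"], worker_count()):
        print(line)
```

Unlike an array job, this runs as one job, and the script itself decides how the work is split between the reserved cores.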
cPouta is an Infrastructure-as-a-Service (IaaS) offering by CSC. It offers different hardware setups where the user can (and has to) install the whole computing environment needed (software installation, network configuration etc.). It is not suitable for small, trivial computing needs. On the other hand, this gives the user a lot more freedom to set up a custom computing environment. cPouta is ideal for running server-type software, for example PostGIS and GeoServer. Expert users can also set up their own computing clusters. cPouta requires server administration, software installation and Linux skills.
In cPouta, Windows installations are possible in principle, but Windows licenses need to be arranged by the users themselves. The most common GIS Windows software is ArcGIS. The easiest way to use some ArcGIS functionality is to install ArcGIS Server for Linux in a cPouta environment and run ArcPy scripts; see the instructions for that.
Other advanced uses include the installation of Hadoop/Spark environments in cPouta.
In cPouta a wide range of virtual machine flavours ("hardware" configurations) is available; some of these are specially designed for HPC or fast I/O.
Performance hints for geocomputing
- Use profiling tools to see which parts of your script are the slowest, and look for ways to make those parts faster. All programming languages have their own profiling tools.
- Different algorithms, and different functions from different packages, may take quite different amounts of time for the same computation task.
- Watch out for "for" loops and try to find alternatives, such as vectorized functions.
- Make the script run in parallel.
- When working with big raster datasets, using virtual rasters might be very helpful.
- When working with big vector datasets, using a database could be appropriate.
- Remove unnecessary data (clip, select, generalize)
- Index vector data if your software can use it.
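
The first hints can be tried out with Python's built-in cProfile module. The sketch below uses a toy computation, not a GIS workload: it profiles a loop-heavy function and compares it with a different algorithm that gives the same result without a loop. All function names are made up for this illustration.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately loop-heavy so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_of_squares(n: int) -> int:
    """Same result from a closed-form formula, with no loop at all."""
    return (n - 1) * n * (2 * n - 1) // 6


def profile_call(func, *args) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(slow_sum_of_squares, 200_000))
    print(profile_call(fast_sum_of_squares, 200_000))
```

In a real script the report would point to the slowest functions, which are the first candidates for a faster algorithm or for parallel execution.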
Practical info: getting an account and basic use
Projects and Accounts
- To start using CSC computing resources you need a user account and a project.
- Your jobs consume billing units according to these rules.
- If you need help with estimating your job resource needs, use the seff command or see the webinar about estimating needed memory.
- Connecting: use ssh (Linux, macOS), PuTTY (Windows) or NoMachine for GUI.
- Moving data to CSC and back:
  - For Linux/Mac users: scp, rsync.
  - For Windows users: WinSCP, FileZilla or similar.
- There is also a very comprehensive webinar about this topic (including tips for moving very big datasets).
- See our 2018 course about Geocomputing using CSC resources, the materials are still available and include tutorials both for Taito and cPouta.
- Code examples for Taito/Puhti: R, Python, GRASS GIS, SagaGIS.
- Installation guidelines for cPouta: ArcGIS Server, PostGIS, GeoServer.
- Our user guides include a long section about Linux.
- For software-related materials please see each specific software's page.
- Geocomputing-related news is sent to the gis-hpc mailing list, you are welcome to join! The archive is also open.
If you have any questions or comments, or need some other software or data in Puhti, please contact CSC servicedesk at email@example.com