Intro to the topic, webinar from 10.4.2018
CSC offers a wide range of high-level computing services for any field of research. A typical GIS user does daily work with desktop software. Moving to CSC computers might make sense in these cases:
- Computing something takes more than 2-4 hours
- Need more memory
- Working with very big datasets
- Keep your desktop computer for normal usage, do computation elsewhere
- Need for a server computer (cPouta)
- Need for a lot of computers with the same set-up (courses)
- GPU or MPI programs
Use of CSC's computing environments is normally free of charge for users from Finnish universities and state research institutes. Other users are also welcome; the price list for them is available here.
CSC's supercomputers have fast data I/O and a lot more memory than normal desktop computers. In general, the computing speed of one CPU is not much better than that of a normal desktop computer, but there are thousands of CPUs compared to a few in a desktop computer. Using CSC's computers can significantly reduce computing time if the analysis can run in parallel on several CPUs.
For GIS users, Taito and cPouta in particular should be valuable computing environments.
Taito is CSC's supercluster and could be the first option to consider for GIS users. Taito has several GIS software packages installed and also includes some of the bigger Finnish datasets. It is a ready environment, you just need to log in and start working! But it is mostly a black terminal Linux system, not a fancy desktop or web application. The main reasons why Taito might not be suitable for some analyses are software incompatibility and the user's too limited (Linux) skills.
Most of the software available for Linux can be installed on Taito. Many common GIS software packages have already been installed on Taito by CSC. At the moment Taito's GIS software includes: GDAL/OGR, GRASS GIS, LasTools, PDAL, Proj.4, QGIS, SagaGIS, TauDEM and Zonation. Additionally, R and Python are available with pre-installed spatial packages. MATLAB is installed, but users need to have their own licenses.
Before using any of the installed software, the related module must always be loaded first; please see the linked pages for details about the specific software. It is also possible to install software yourself in the CSC environment for personal use. Taito also has GPU accelerators.
Taito's operating system is Linux, so software available only for Windows, for example ArcGIS and Erdas, cannot be installed there. Server software, such as PostGIS or GeoServer, is not suitable for Taito either. For these cPouta can be used, please see the next chapter.
Taito has a shared data folder for spatial data, which is available to all users. It currently includes data from NLS, FMI, LUKE and SYKE.
You can also move your own data to Taito. Different directories are available for different purposes. In the work directory everybody has 5 TB by default, which can be extended if there are good reasons.
Working in Taito
Normally work in Taito is done using scripts. The most commonly used scripting languages for GIS are R, Python and bash. When moving your existing R or Python scripts to Taito from a Windows environment, you usually only need to modify the file paths. You also have to check that the packages you use are available, but you can install your own packages and libraries as you normally would on your own computer.
Scripts are run in Taito as jobs, which lets the cluster organize the order in which scripts from different users are run. A job is started with a batch job file. In principle there are three kinds of batch jobs:
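The path changes mentioned above can be kept small if all paths are built from one base directory. The following is a minimal Python sketch under stated assumptions: the environment variable GEODATA_DIR, the Windows path and the file name are hypothetical and not part of the Taito setup.

```python
import os
from pathlib import Path, PureWindowsPath


def data_dir(default="data"):
    """Return the base data directory.

    GEODATA_DIR is a hypothetical environment variable used only in this
    sketch; set it to the directory that holds your data on each machine.
    """
    return Path(os.environ.get("GEODATA_DIR", default))


def to_portable(windows_path, base):
    """Re-root the file name of a hard-coded Windows path under `base`."""
    return base / PureWindowsPath(windows_path).name


base = data_dir()
# Hypothetical input file, originally hard-coded in a Windows script.
dem = to_portable(r"C:\gisdata\dem_10m.tif", base)
print(dem.as_posix())
```

With this layout only the base directory changes between the desktop and Taito, and the rest of the script stays the same.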
- Single core serial jobs with "normal" GIS software. You run your code as it is, just in Taito. This will not be much faster than using a desktop computer, but for long computations just freeing up your desktop might be useful. You can also use the extra memory and faster input-output of Taito.
- Array jobs with several cores, with "normal" GIS software. The idea of an array job is to run the same script several times simultaneously. These jobs are unaware of each other, and the user has no control over their execution order. In a GIS context array jobs are useful, for example, if you are doing the same analysis for different map sheets, different scenarios or different time periods.
- Parallel jobs with several cores.
  - Many scientific software packages support this option, so in general this is the most common usage type in Taito. But only very few GIS software packages support parallel computing out of the box, for more info see here.
  - Many programming languages support parallel computing (for example snow or foreach in R, or multiprocessing or parallel in Python). Using these features it is possible to write scripts that run in parallel.
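As an illustration of the last point, a script can spread independent inputs over several cores with Python's multiprocessing module. This is a minimal sketch, not a recommended setup for Taito: the map sheet names are hypothetical and process_sheet is a placeholder for the real analysis.

```python
from multiprocessing import Pool

# Hypothetical map sheet names; replace with your own list of inputs.
MAP_SHEETS = ["L4131", "L4132", "L4133", "L4134"]


def process_sheet(sheet):
    """Placeholder for the real analysis of one map sheet.

    It only returns the sheet name and its length, so that the sketch
    runs without any GIS library.
    """
    return sheet, len(sheet)


def run_parallel(sheets, workers=4):
    """Process the sheets in `workers` parallel processes.

    Pool.map returns the results in the same order as the inputs.
    """
    with Pool(processes=workers) as pool:
        return pool.map(process_sheet, sheets)


if __name__ == "__main__":
    for sheet, result in run_parallel(MAP_SHEETS):
        print(sheet, result)
```

The number of workers should not exceed the number of cores reserved for the job, otherwise the processes compete for the same cores.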
We provide some R and Python examples for Taito on GitHub. The examples also include batch job scripts. Some of the examples include similar solutions for serial, array and parallel jobs.
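In an array job each copy of the script has to work out which input is its own. The sketch below is not taken from the CSC examples. It assumes that the batch system is Slurm and that it sets the SLURM_ARRAY_TASK_ID environment variable for each array task; the map sheet names are hypothetical.

```python
import os

# Hypothetical inputs; one array task handles one of them.
MAP_SHEETS = ["L4131", "L4132", "L4133", "L4134"]


def sheet_for_task(task_id, sheets):
    """Return the input that belongs to the array task with this index.

    Assumes that the array indices start from 0 and do not exceed the
    number of inputs.
    """
    if not 0 <= task_id < len(sheets):
        raise IndexError(f"array task {task_id} has no input")
    return sheets[task_id]


# Fall back to task 0 when the script is run outside an array job.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print("Processing map sheet", sheet_for_task(task_id, MAP_SHEETS))
```

With four inputs, the array indices requested in the batch job file would need to cover 0 to 3.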
Taito-shell is the little sister of Taito. It is meant for serial jobs and for using software with a graphical user interface (GUI). Taito-shell uses the same disk environment, software stack and module system as the normal Taito cluster. The difference from Taito is that Taito-shell has no time limit for jobs: an interactive job started in Taito-shell can run as long as the Taito-shell session remains open. In Taito-shell you can use, for example, QGIS, GRASS GIS, RStudio or the Spyder IDE for Python. To use software with a GUI, the connection to Taito-shell has to be made via X shell or NoMachine desktop. You can also submit batch jobs to Taito from Taito-shell.
cPouta is an Infrastructure-as-a-Service service offered by CSC. It offers different hardware setups on which the user can (and has to) install the computing environment needed (software installation, network configuration etc.). It is not suitable for smaller, trivial computing needs. On the other hand, this gives the user a lot more freedom to set up a custom computing environment. cPouta is ideal for running server software, for example PostGIS and GeoServer. Expert users can also set up their own computing clusters. cPouta requires server administration, software installation and Linux skills.
In cPouta, Windows installations are possible in principle, but users need to arrange the Windows licenses themselves. The most common Windows GIS software is ArcGIS. The easiest way to use some ArcGIS functionality is to install ArcGIS Server for Linux in a cPouta environment and run ArcPy scripts; see the instructions for that.
Other advanced uses include the installation of Hadoop/Spark environments in cPouta.
In cPouta a wide range of virtual machine flavours (hardware configurations) is available; some of these are specially designed for HPC or fast I/O.
Performance hints for geocomputing
Scripts:
- Use profiling tools to see which parts of your script are the slowest, and look for ways to make those parts faster. All programming languages have their own profiling tools.
- Different algorithms and different functions from different packages may take quite different amounts of time for the same computation task.
- Watch out for for-loops and look for alternatives to them.
- Make the script run in parallel.
- When working with big raster datasets, using virtual rasters might be very helpful.
- When working with big vector datasets, using a database could be appropriate.
- Remove unnecessary data (clip, select, generalize)
- Index vector data if your software can use it.
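The profiling hint can be tried out with the cProfile and pstats modules from Python's standard library. This is a minimal sketch: slow_sum, fast_sum and analysis are made-up stand-ins for the parts of a real script, chosen so that one of them contains a loop that can be replaced.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum of squares 0^2 + ... + (n-1)^2 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Same result from the closed-form formula, without a loop."""
    return (n - 1) * n * (2 * n - 1) // 6


def analysis(n=200_000):
    """Stand-in for a script that calls both variants."""
    return slow_sum(n), fast_sum(n)


def profile_report(func, lines=5):
    """Run `func` under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(lines)
    return stream.getvalue()


print(profile_report(analysis))
```

In the report the loop-based function should account for most of the time, which marks it as the part worth rewriting first.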
Practical info: getting an account and basic use
Projects and Accounts
- To start using CSC computing resources you need a user account.
- For bigger Taito cases and all cPouta cases, create a project or join an existing one. New projects are given 10 000 billing units (BU) by default, but you can request more whenever needed.
- Your jobs consume billing units according to these rules.
- You can change your billing project and see how many BUs you have used.
- If you need help with estimating your job resource needs, see the seff command at the end of this page or see the webinar about estimating needed memory.
- Connecting: use ssh (Linux, macOS) or PuTTY (Windows).
- Moving data to CSC and back:
- For Linux users: scp, rsync.
- For Windows users with less experience with Linux commands, see the FileZilla section. If you happen to have WinSCP or any other similar tool, it should work as well.
- There is also a very comprehensive webinar about this topic (including tips for moving very big datasets).
- See our 2017 course about Geocomputing using CSC resources, the materials are still available and include tutorials both for Taito and cPouta.
- Code examples for Taito: R, Python, GRASS GIS, SagaGIS.
- Installation guidelines for cPouta: ArcGIS Server, PostGIS, GeoServer.
- Our user guides include a long section about Linux.
- For software-related materials please see each specific software's page.
- Geocomputing-related news is sent to the gis-hpc mailing list, you are welcome to join! The archive is also open.
If you have any questions or comments, or need some other software or data in Taito, please contact CSC servicedesk at firstname.lastname@example.org