CSC offers a wide range of high-performance computing services for any field of research. A typical GIS user does daily work with desktop software. Moving to CSC's computers may make sense in the following cases:
- Computing something takes more than 2-4 hours
- Need more memory
- Working with very big datasets
- Keep your desktop computer for normal usage, do computation elsewhere
- Need for a server computer (cPouta)
- Need for a lot of computers with the same set-up (courses)
- GPU or MPI programs
CSC's supercomputers have fast data I/O and a lot more memory than normal desktop computers. In general, the computing speed of a single CPU is not much better than that of a normal desktop computer, but there are thousands of CPUs compared to the few in a desktop computer. Taito also has GPU accelerators. Using CSC's computers can give a significant speed-up if the analysis can run in parallel on several CPUs.
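Why the parallel part matters can be illustrated with Amdahl's law, which is not named in the text above and is brought in here only as a rough rule of thumb. The sketch below assumes that a known fraction of the run time can be parallelised perfectly; the function name and the 90 % figure are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Upper bound on the speed-up when only part of a job runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always takes the same time; only the rest is divided.
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical workflow in which 90 % of the run time can be parallelised.
    for cores in (1, 4, 16, 64):
        print(cores, "cores:", round(amdahl_speedup(0.9, cores), 2))
```

Under these assumptions the serial part soon dominates, so adding cores helps most when nearly all of the analysis can be split into independent pieces.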
Practical solutions in CSC's environment
The use of these services for GIS has so far been rather limited. One big problem for GIS users has been that the widely used ArcGIS software is available only for the Windows operating system, while CSC's supercomputers run Linux. In addition, GIS software is normally not designed for running in parallel or for using other supercomputing concepts.
Taito is CSC's supercluster and could be the first option for GIS users to consider. Most software available for Linux can be installed on Taito, and CSC has already installed a lot of it. At the moment Taito's GIS software includes QGIS, GDAL/OGR, Proj.4, SagaGIS and R with several spatial packages. Taito is a Linux machine, so neither Windows-only software nor server-type software can be installed there, which means that ArcGIS, Erdas, PostGIS and GeoServer cannot be added. You can also install software in the CSC environment yourself for personal use. If you need software that could also be useful for others, please e-mail CSC servicedesk and ask for it to be installed.
Taito has a shared data folder for spatial data, which is available for all users.
Alternatives for using Taito:
- Using single-core serial jobs with "normal" GIS software. You run your code as it is, just on Taito. This will not be much faster than on a desktop computer, but for long computations simply freeing up your desktop may be useful. You can also benefit from the extra memory and faster input/output of Taito.
- Using several cores with array jobs and "normal" GIS software. The idea of an array job is to start several jobs at the same time. These jobs are unaware of each other, and the user has no control over their execution order. In a GIS context, array jobs are useful, for example, when you run the same analysis for different map sheets, different scenarios or different time periods.
- Using several cores with parallel jobs.
  - Many scientific software packages support this option, so this is the most common usage type in Taito. However, only very few GIS software packages support parallel computing out of the box; see the list below.
  - Many programming languages support parallel computing (for example snow or foreach in R, or multiprocessing or parallel in Python). With these features the user has control over the workflow and over which parts of the code are run in parallel.
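The array-job and parallel-job alternatives above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the map sheet names and the analyse_sheet placeholder are hypothetical, and the array-job part assumes that the batch system exports the task index in the SLURM_ARRAY_TASK_ID environment variable and that the array was submitted with indices 0, 1, 2, ... matching the list.

```python
import os
from multiprocessing import Pool

# Hypothetical map sheet identifiers; replace with your own list of inputs.
MAP_SHEETS = ["sheet_A", "sheet_B", "sheet_C", "sheet_D"]


def analyse_sheet(sheet_id):
    """Placeholder for the real analysis of one map sheet.

    It only sums the character codes of the identifier, so that the sketch
    runs without any GIS library or data.
    """
    return sheet_id, sum(ord(ch) for ch in sheet_id)


def sheet_for_array_task(sheets, environ=os.environ):
    """Array job: each task picks one sheet using its own task index.

    Assumes the scheduler sets SLURM_ARRAY_TASK_ID and that the indices
    run from 0 to len(sheets) - 1; falls back to 0 when it is not set.
    """
    task_index = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return sheets[task_index]


def analyse_in_parallel(sheets, n_workers=2):
    """Parallel job: one Python process spreads the sheets over workers."""
    with Pool(processes=n_workers) as pool:
        # map() returns the results in the same order as the input list.
        return pool.map(analyse_sheet, sheets)


if __name__ == "__main__":
    # Array-job style: this process handles exactly one sheet.
    print(analyse_sheet(sheet_for_array_task(MAP_SHEETS)))
    # Parallel-job style: this process handles all sheets with two workers.
    print(analyse_in_parallel(MAP_SHEETS, n_workers=2))
```

In the array-job style the same script is started once per task and the tasks never communicate. In the parallel-job style a single script decides how the work is divided, which gives more control but requires the code itself to be written for parallel execution.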
cPouta is an Infrastructure-as-a-Service (IaaS) type of service, so the user has to do all the set-up work (software installation, network configuration etc.), which makes it unsuitable for smaller tasks. On the other hand, this gives the user a lot more freedom. In principle, Windows installations are also possible in cPouta. Any GIS software can be installed in cPouta, which makes it most attractive for software that is not suitable for Taito, for example ArcGIS, PostGIS and GeoServer. A wide range of virtual machine flavours is available in cPouta, some of them specially designed for HPC computing or fast I/O.
The easiest way to use ArcGIS functionality is to install ArcGIS Server for Linux and then run ArcPy scripts; see the instructions for that.
For running open source software the easiest way might be installing OSGeoLive.
Some more hints for faster geocomputing
It is always recommended to take a critical look at your code and your data as well. Small changes in these can have a big impact on computing times.
Software:
- Use profiling tools to see which parts of your workflow are the slowest.
  - cProfile in Python
- Look for possibilities to make the slowest parts faster. Different algorithms and different software products may need quite different amounts of time for the same computation.
- When working with big vector data sets, using a database could be appropriate.
- Make sure that vector data has indices.
- Use an appropriate level of detail (generalize if needed).
- Include only the needed area (clip if needed).
- Keep only relevant data as attributes (delete some of the attributes if needed).
- Remove any empty, unused space from the .dbf files of shapefiles.
- For some analyses it may be better to divide your data into parts. For example, in ArcGIS Tabulate Area was about 100 times faster with 1000 input files of one polygon each than with one file of 1000 polygons, so that each polygon was calculated separately.
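The profiling hint in the list above can be tried out with Python's built-in cProfile and pstats modules. The sketch below is a minimal example only: slow_step, fast_step and workflow are hypothetical placeholders for the steps of a real geocomputing script.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Placeholder for a slow part of a workflow: a deliberately naive loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_step(n):
    """Placeholder for a cheap part of the same workflow."""
    return n * 2


def workflow(n=200_000):
    """Hypothetical workflow made of one slow and one fast step."""
    return slow_step(n) + fast_step(n)


def profile_report(func, top=5):
    """Run func under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive calls are listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_report(workflow))
```

The report lists the time spent in each function, which shows where optimisation effort is likely to pay off before moving the work to a bigger machine.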
Software suitable for supercomputers
Some international projects have developed GIS software for use with supercomputers. In these cases the software can make use of the special characteristics of supercomputers by running in parallel or using GPUs for processing.
GRASS has limited support for parallel GRASS jobs, and some script examples of running several GRASS jobs in parallel are available. There has also been one attempt to use GPUs. Some related articles:
- Iteration and supercomputing with GRASS GIS.
- Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures.
- GRASS GIS on high performance computing with MPI, OpenMP and Ninf-G programming framework.
- Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards.
In the CyberGIS project, some open source GIS software packages with support for parallel runs were developed:
- Parallel PySAL: natural breaks classification, weights calculation
- TauDEM: terrain and hydrological analysis
- pRasterBlaster, based on GDAL: map reprojection
- PGAP, Generalized Assignment Problem
- Parallel Agent-Based Modeling
- Parallel Kernel Density Estimation
- Parallel Map Algebra
- Simple Parallel Tiff Writer
- LAStools for lidar data
- A few spatial functions in R support parallel computing
- GeoTrellis: raster manipulation in the Scala language, built to work with Spark.
- GIS tools for Hadoop.
If you have any questions or comments, or are interested in using CSC's supercomputers, contact CSC servicedesk.