Virtual datasets

Virtual datasets are useful for managing large datasets that are split into multiple files. For example, the 10 m and 2 m DEMs available in Taito are often most convenient to use through virtual rasters.

Virtual rasters are just XML files that tell GDAL where the actual data can be found, but from the user's point of view they can be treated much like any other raster format. Virtual rasters are useful because they let you handle a large dataset as if it were a single file, so your script does not have to locate the correct file for each part of the data.
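To make the point that a virtual raster is only XML concrete, here is a minimal sketch in Python that writes a VRT for two tiles placed side by side. The function name, tile names, tile size, pixel size and coordinates are all made up for illustration, and a VRT produced by GDAL usually carries more information (coordinate system, nodata value and so on).

```python
import xml.etree.ElementTree as ET


def make_vrt(tiles, tile_size, pixel_size, origin):
    """Build a minimal single-band VRT for square tiles laid out in one row.

    tiles: tile filenames ordered west to east (hypothetical names)
    tile_size: width and height of each tile in pixels
    pixel_size: ground size of one pixel in map units
    origin: (x, y) map coordinates of the mosaic's upper-left corner
    """
    size = str(tile_size)
    root = ET.Element(
        "VRTDataset",
        rasterXSize=str(tile_size * len(tiles)),
        rasterYSize=size,
    )
    # GeoTransform order: origin x, pixel width, rotation,
    # origin y, rotation, negative pixel height.
    ET.SubElement(root, "GeoTransform").text = (
        f"{origin[0]}, {pixel_size}, 0.0, {origin[1]}, 0.0, {-pixel_size}"
    )
    band = ET.SubElement(root, "VRTRasterBand", dataType="Float32", band="1")
    for i, name in enumerate(tiles):
        src = ET.SubElement(band, "SimpleSource")
        ET.SubElement(src, "SourceFilename", relativeToVRT="1").text = name
        ET.SubElement(src, "SourceBand").text = "1"
        # Read the whole tile ...
        ET.SubElement(src, "SrcRect", xOff="0", yOff="0", xSize=size, ySize=size)
        # ... and place it at its column offset in the mosaic.
        ET.SubElement(
            src, "DstRect", xOff=str(i * tile_size), yOff="0", xSize=size, ySize=size
        )
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


vrt_xml = make_vrt(
    ["tile_a.tif", "tile_b.tif"],
    tile_size=3000,
    pixel_size=2.0,
    origin=(500000.0, 6900000.0),
)
print(vrt_xml)
```

Each SimpleSource element names one source file and says where its pixels land in the mosaic, which is all GDAL needs to present the tiles as one raster.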

For example, the 2 m DEM is available in Taito at /proj/ogiir-csc/mml/dem2m. It is, however, split into a number of tif files (map sheets). If we wanted to calculate zonal statistics for areas scattered around the whole of Finland, we would have to find out which elevation model file covers which area and compute the statistics from the correct file. Further complications arise if an area lies on the border between two or more map sheets. Similar edge effects occur with focal functions, where information from neighbouring files is also needed. Creating a virtual raster for the whole study area avoids these issues, because GDAL takes care of them automatically.

 

Creating virtual rasters with GDAL

As virtual rasters are just XML, we could even write one by hand in a text editor. This is of course impractical for any large number of files. GDAL has a handy tool, gdalbuildvrt, which creates the virtual raster for us. To use GDAL in Taito we must first load the module with:

module load geo-env

gdalbuildvrt is simple to use. It takes a list of input files and the name of the output virtual raster as parameters, like so:

gdalbuildvrt -input_file_list file_list.txt virtual_raster.vrt

Note that the tool has other options available, but this example needs only the most basic functionality.

In our 2 m DEM example, the list of files (including paths) can be generated using find:

find /proj/ogiir-csc/mml/dem2m/ -name "*.tif" > file_list.txt

The above command searches the dem2m folder recursively for all files ending in .tif and writes their paths to file_list.txt, which can then be supplied to gdalbuildvrt as an argument.
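The same two steps can also be driven from a script. The following is a minimal sketch in Python, assuming GDAL's command line tools are on the PATH (in Taito, after loading the module). The function names write_file_list and build_vrt are hypothetical helpers written for this example, not part of GDAL.

```python
import shutil
import subprocess
from pathlib import Path


def write_file_list(data_dir, list_path, pattern="*.tif"):
    """Write every file under data_dir matching pattern to list_path, one per line."""
    files = sorted(str(p) for p in Path(data_dir).rglob(pattern))
    Path(list_path).write_text("\n".join(files) + "\n")
    return files


def build_vrt(list_path, vrt_path):
    """Run gdalbuildvrt on an existing file list and return the command used."""
    cmd = ["gdalbuildvrt", "-input_file_list", str(list_path), str(vrt_path)]
    if shutil.which("gdalbuildvrt") is None:
        # GDAL is not on the PATH (module not loaded?): report instead of failing.
        print("gdalbuildvrt not found; would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

For the 2 m DEM this would be called as write_file_list("/proj/ogiir-csc/mml/dem2m", "file_list.txt") followed by build_vrt("file_list.txt", "virtual_raster.vrt").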

Once the virtual raster has been created, it can be used and visualized like any other raster file in software that uses GDAL, including many Python GIS modules, QGIS, GRASS and SAGA. It is worth noting that while running an analysis on a 2 m DEM covering the whole of Finland is entirely feasible in Taito, viewing the data in, for example, QGIS is not practical for such a large dataset without further optimization.

Working with large virtual rasters visually

If you want to view the aforementioned 2 m DEM of the whole of Finland easily, you have to do a few things:

  • Create overviews for your virtual raster using the gdaladdo command. Take care not to create overviews so detailed that they become a huge file themselves.
  • If your virtual raster is really big, it makes sense to create a hierarchical structure of virtual rasters, where the topmost virtual raster points to smaller virtual rasters, which point to still smaller virtual rasters, and so on until the last level points to the actual files. Without this structure the overviews also become very large. Note that a hierarchical structure like this may produce some artifacts when running analyses on the data, so it should be reserved for viewing purposes.
  • Precalculate statistics for your virtual rasters and source files. This makes opening the files faster in, for example, QGIS. QGIS needs to sample the data for minimum and maximum values to set the color scale correctly, and this takes time with large virtual rasters. To avoid this, you can precompute the statistics to a separate XML file with the gdalinfo -stats command.
  • A good trick in QGIS when working with large rasters is to enable the raster toolbar (View->Toolbars->Raster Toolbar). This lets you easily adjust the color scale to the area shown on screen, which gives good contrast regardless of zoom level.
  • QGIS seems to handle large datasets well once the steps above have been taken. Even with the 2 m DEM of the whole of Finland, zooming and moving the map is quite smooth.
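The steps above can be sketched as a small planning function in Python. It only builds the GDAL commands as argument lists and does not run them. Grouping the tiles by their parent directory, the two-level layout, the output names and the overview levels are all assumptions made for this example and would need tuning for a real dataset. Directories that share the same name would also end up in the same group.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_viewing_commands(tif_paths, out_dir, top_name="all.vrt", levels=(16, 64, 256)):
    """Return GDAL commands (as argument lists) for a two-level VRT hierarchy.

    One sub-VRT is planned per source directory, then a top-level VRT over
    the sub-VRTs, then overviews and statistics for the top-level VRT.
    """
    groups = defaultdict(list)
    for path in tif_paths:
        groups[PurePosixPath(path).parent.name].append(str(path))

    out = PurePosixPath(out_dir)
    commands, sub_vrts = [], []
    for name in sorted(groups):
        sub_vrt = str(out / f"{name}.vrt")
        sub_vrts.append(sub_vrt)
        # gdalbuildvrt takes the output name first, then the inputs.
        commands.append(["gdalbuildvrt", sub_vrt, *sorted(groups[name])])

    top = str(out / top_name)
    # A VRT can itself be a source for another VRT.
    commands.append(["gdalbuildvrt", top, *sub_vrts])
    # -ro opens the VRT read-only so overviews go to an external .ovr file.
    commands.append(["gdaladdo", "-ro", "-r", "average", top, *map(str, levels)])
    # -stats computes band statistics so viewers need not sample the data.
    commands.append(["gdalinfo", "-stats", top])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once GDAL is available. Coarse overview levels are used here on purpose, since fine levels on a country-wide 2 m raster would produce very large overview files.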

Running analyses that use only part of a virtual raster

It is possible to work with very large virtual rasters when the analysis does not need to output a raster of similar size. A good example is calculating zonal statistics for polygons spread out across a large area (see the CSC training GitHub for an example). Note, however, that even if not all of the source files need to be accessed, some time is still spent finding out which source files contain the data needed. As virtual rasters do not need to form continuous surfaces, it might be a better idea to create a virtual raster covering only your study area (see https://research.csc.fi/es/gis_data_in_taito).
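One way to restrict a virtual raster to a study area is the -te option of gdalbuildvrt, which sets the extent of the output VRT. The sketch below, in Python, only builds the command. The function name, the buffer parameter and the example coordinates are hypothetical. Tiles that fall entirely outside the box should be left out of the result, but it is worth checking the resulting .vrt file to confirm this for your GDAL version.

```python
def study_area_vrt_command(bounds, file_list, vrt_path, buffer=0):
    """Build a gdalbuildvrt command limited to a study-area bounding box.

    bounds: (xmin, ymin, xmax, ymax) in the CRS of the source rasters
    buffer: optional margin in map units, e.g. for focal operations
    """
    xmin, ymin, xmax, ymax = bounds
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("bounds must be (xmin, ymin, xmax, ymax)")
    extent = [xmin - buffer, ymin - buffer, xmax + buffer, ymax + buffer]
    return [
        "gdalbuildvrt",
        "-te", *(str(v) for v in extent),  # xmin ymin xmax ymax
        "-input_file_list", str(file_list),
        str(vrt_path),
    ]


# Hypothetical study area with a 100 m margin around it.
cmd = study_area_vrt_command(
    (380000, 6670000, 390000, 6680000), "file_list.txt", "study_area.vrt", buffer=100
)
print(" ".join(cmd))
```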

Working with virtual rasters in different GIS software

If your virtual raster is of a size that could be handled as, for example, a normal single tif file, then most GIS software should be able to use it without problems. For working with larger datasets:

  • QGIS: can be used to view large virtual rasters smoothly (with a hierarchical structure and overviews).
  • Python: packages such as rasterio and rasterstats can use large virtual rasters relatively efficiently (see the training GitHub).
  • R: reading and querying virtual rasters with the raster package works fine.
  • GDAL: the gdal_translate tool lets you specify an operating area, so you can extract information from a large virtual raster fairly efficiently.
  • GRASS: can link to external vrt files with the r.external tool. It also allows setting a computational region, in a similar fashion to GDAL, to process only a small part of a vrt file or to process a vrt file in parallel tiles (see the training GitHub). However, r.external seems to take a long time for really large virtual rasters (the 2 m DEM of the whole of Finland in Taito, for example). Viewing large vrt files is not as smooth as in QGIS.
  • TauDEM: reads vrt files, but as the output files are rasters covering the same area as the input vrt, there is not much point in using large vrt files with TauDEM.
  • SAGA GIS: can import vrt files, but this simply results in one large SAGA grid file, so again there is not much advantage in using a large vrt to begin with.
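As an illustration of the GDAL item above, here is a minimal Python sketch that builds a gdal_translate command for cutting one window out of a large virtual raster. The function name and file names are hypothetical. The main thing it shows is that the -projwin option expects the corners in a different order from the usual (xmin, ymin, xmax, ymax) bounding box.

```python
def extract_window_command(bounds, src, dst):
    """Build a gdal_translate command that copies one window of src to dst.

    bounds: (xmin, ymin, xmax, ymax) in the CRS of the source raster
    """
    xmin, ymin, xmax, ymax = bounds
    # -projwin expects upper-left x, upper-left y, lower-right x, lower-right y.
    projwin = [xmin, ymax, xmax, ymin]
    return ["gdal_translate", "-projwin", *map(str, projwin), str(src), str(dst)]


# Hypothetical window and file names.
cmd = extract_window_command(
    (380000, 6670000, 390000, 6680000), "virtual_raster.vrt", "subset.tif"
)
print(" ".join(cmd))
```

The list can be run with subprocess.run(cmd, check=True) once GDAL is available, and only the source tiles that overlap the window need to be read.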