Virtual rasters - Services for Research
Virtual rasters is useful GDAL concept for managing large raster datasets that are split into not overlapping map sheets. Virtual rasters are not useful for managing time-series or overlapping rasters, for example remote sensing tiles. Virtual rasters are supported by any GDAL based tool, including Python and R spatial packages, ArcGIS, FME, GrassGIS, MapInfo, QGIS, and SagaGIS. Additionally ArcGIS and MapInfo have respectively also their own virtual raster format similar to GDAL virtual raster.
Technically a virtual raster is just a small xml file that tells GDAL where the actual data files are, but from user's point of view virtual rasters can be treated much like any other raster format. Virtual rasters can include raster data in any file format GDAL supports. Virtual rasters are useful because they allow handling of large datasets as if they were a single file eliminating need for locating correct files.
For example the 2m and 10m DEM are available in Puhti. These datasets are split into a number of tif files (map sheets) and if we wanted for example to calculate zonal statistics for some areas scattered around whole Finland we would have to somehow find out which elevation model covers which area and compute statistics from correct file. Further complications would arise if an area we want to calculate statistics for happens to lie at a border between two or more map sheets. Similar issues with edge effects would arise for example when using focal functions where information from surrounding files is also needed. These issues can be easily avoided by creating a virtual raster for the whole study area and above mentioned problems will be automatically taken care of by GDAL.
Virtual rasters with Allas
It is possible to use virtual rasters so, that the big raster files are in Allas and only the small .xml-file is in Puhti or a local PC. The data is moved to Puhti or local PC only when the virtual raster is opened. The best performing format to save your raster data in Allas is often Cloud optimized TIFF, but other formats are also possible.
Creating virtual rasters
Following tools support creating virtual rasters:
- GDAL gdalbuildvrt commandline tool
- Python and R have wrappers for GDAL gdalbuildvrt.
- QGIS, GrassGIS and SagaGIS provide graphical interface for gdalbuildvrt
- lidR supports writing lidar data analysis results directly as virtual raster
Creating virtual raster with GDAL gdalbuildvrt
If your data is divided to subfolders the easiest option is to create the virtual raster with a help of input file list:
gdalbuildvrt -input_file_list file_list.txt virtual_raster.vrt
Check gdalbuildvrt documentation for additional options, for example for setting no-data value.
Creating file list for gdalbuildvrt
File list should include preferably full paths, but for local files also relative paths may also be used.
find /appl/data/geo/mml/dem2m/ -name "*.tif" > file_list.txt
dir \data\dem2m\*.tif /S /B > file_list.txt
Load allas and geoconda modules and set up your credentials, see Puhti GDAL page.
Lists the file names as they are in the bucket with rclone or some other tool:
rclone lsf --include '*.tif' allas:<your_bucket_name > file_list.txt
Next add to file list the full paths as they are required by GDAL, using vsicurl, vsis3 or vsiswift drivers. See longer explanations of GDAL drivers and Allas from Puhti GDAL page.
sed -i -e 's-^-/vsicurl/https://a3s.fi/<your_bucket_name>/-' file_list.txt
sed -i -e 's-^-/vsis3/<your_bucket_name>/-' file_list.txt
sed -i -e 's-^-/vsiswift/<your_bucket_name>/-' file_list.txt
Using virtual rasters
Reading only a part of a big virtual raster with bbox
The key for efficient usage of virtual rasters is often the support for reading only a part of the big virtual raster. This can be done both with R and Python, for many other tools the crop has to be done as extra step with saving the cropped data to a file.
With GDAL it is easy to crop a small part out of the big virtual raster:
gdal_translate -projwin 614500 6668500 644500 6640500 test.vrt test_clip.tif
In R the read extent can be given to the raster read()-function with xmn, xmx, ymn, ymx or ext options.
In Python rasterio there does not seem to be as straight forward option for reading a specific extent, but a work-around could be using merge()-function which supports bounds option:
from rasterio.merge import merge
data_dir = "/projappl/<your_project>/"
fp = os.path.join(data_dir, "test.vrt")
bbox = (614500, 6640500, 644500, 6668500)
with rasterio.open(fp) as src:
data, out_transform = merge([src], bounds=bbox)
Running analysis that only some parts of the virtual raster
It's possible to work with very large virtual rasters when the analysis doesn't actually need to output a raster of similar size. A good example would be calculating zonal statistics for polygons spread out across large area (see csc training github for example). In this case raster data is read only from areas overlapping with the polygons.
Working with large virtual rasters visually
It is worth noting that while running some analysis on a 2m DEM covering whole Finland is entirely feasible in Puhti, viewing the data with for example QGIS is not practical for such a large dataset without further optimization. If you wanted to easily view a big virtual raster , you have to do a few things:
- Create overviews for your virtual raster using gdaladdo command. You should take care to not create overviews that are so large that the overviews become a huge file themselves.
- If your virtual raster is really big it makes sense to create a hierarchial structure of virtual rasters where topmost virtual raster points to smaller virtual rasters which point to smaller virtual rasters and so on until you have the last virtual raster pointing to actual files. The reason for using this approach is that if you don't do this also the overviews used get really big. Note that using this kind of hierachial structure may produce some artifacts when running analysis on the data so it should be reserved for viewing purposes.
- Pre calculate statistics for your virtual rasters and source files. This is to make opening files faster in for example QGIS. QGIS needs to sample for min and max value in the data to be able to set the coloscale right and this takes time with large virtual rasters. To avoid having to do this you can precompute statistics to separate XML file with gdalinfo --stats command.
- A good trick in QGIS when working with large rasters is to enable raster toolbar (View->Toolbars->Raster Toolbar) This allows you to easily adjust colorscale to area shown in screen which lets you have good contrast regardless of zoom level.
- QGIS seems to be pretty good at handling large datasets when above mentioned steps have been taken. Even with 2m dem from whole finland zooming and moving the map is quite smooth.