
How should I back up my data in Sisu and Taito?

The file system used in Sisu and Taito is designed for analysing data, not for storing it. Of the personal disk areas, only the home directory (including the $USERAPPL directory) is backed up regularly. No automatic backups are made of the work ($WRKDIR) and project directories!

Human errors and technical problems may cause data loss, so users should take care to keep their most critical data stored securely.

Small files can be backed up in the home directory

The home directories of Sisu and Taito are backed up daily, but as the home directory quota is only 50 GB, it cannot be used for backing up large data sets.
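If you want to check how much of the home directory quota is already in use, one simple way is the standard du command (it reports the total size of the data under $HOME):

du -sh $HOME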

However, for smaller data such as command scripts, parameter files and instructions, the home directory is the easiest place to keep backups. For the same reason it is good practice to keep your own software installations and scripts in your $USERAPPL directory, where they are automatically backed up by CSC ($USERAPPL instructions).
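For example, a script copied under $USERAPPL is covered by the automatic home directory backups (the file name below is only an illustration):

cp my_analysis_script.sh $USERAPPL/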

For larger data sets, the HPC archive can be used to make a backup copy

In $WRKDIR you can create, manage and analyse data sets of up to 5 TB. However, $WRKDIR is not backed up and thus should not be used as the only storage site for important data. The iRODS-based HPC archive service can be used to store backup copies of the most critical data sets, such as the original input data or the final results of a simulation or analysis process.

As the storage capacity of the HPC archive (2 TB/user) as well as the data transfer rate between CSC servers and the HPC archive are limited, it is not recommended that you back up all your files to the HPC archive or use automatic scripts that copy data there. Further, you should avoid running more than a few HPC archive data transfer processes simultaneously.

The HPC archive (like IDA and other iRODS-based storage systems) is not designed to store massive numbers of individual files. Instead of copying numerous small files directly to the HPC archive, you should first collect them into a single compressed archive file and copy that to the HPC archive.

In the example below, files with names starting with data and ending with .inp (e.g. data_set1.inp or dataA0003.inp) are first collected into one compressed archive file using the tar command, after which the compressed file is uploaded to the HPC archive service with the iput command:

ls data*.inp                                      # check which files will be included
tar zcvf input_files_for_project1.tgz data*.inp   # collect and compress the files into a single archive
ls -lh input_files_for_project1.tgz               # check the size of the archive file
module load irods                                 # run this command only in Sisu, NOT in Taito
iput input_files_for_project1.tgz                 # upload the archive to the HPC archive
ils                                               # list the contents of your HPC archive directory

You can retrieve the data stored above from the HPC archive with the iget command and then uncompress it:

iget input_files_for_project1.tgz
tar zxvf input_files_for_project1.tgz

If the file you are planning to upload to the HPC archive is larger than 100 GB, it may be advisable to split it into several pieces, as transferring very large files to the HPC archive is often problematic. In Sisu and Taito you can split a file with the split command.

For example, the file big_data.tgz can be split into chunks of 100 GB with the command:

split -b 100G big_data.tgz big_data.tgz_part_

The command above creates a set of files that together contain the data of big_data.tgz; the original file is preserved intact. Each data chunk is named by adding an alphabetic suffix to the base name given as the last argument of the command. For example, if the size of the input file big_data.tgz is 290 GB, the following three files will be created: big_data.tgz_part_aa, big_data.tgz_part_ab and big_data.tgz_part_ac.
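Before uploading, you can list the resulting chunks and, optionally, record a checksum of the original file so that the merged file can be verified later (the checksum step is only a suggestion, not required by the HPC archive):

ls -lh big_data.tgz_part_*
md5sum big_data.tgz > big_data.tgz.md5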

These chunks can now be moved to the HPC archive with a set of iput commands:

iput big_data.tgz_part_aa
iput big_data.tgz_part_ab
iput big_data.tgz_part_ac
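After the uploads have finished, you can check that all the chunks are present in the HPC archive, for example with a long listing:

ils -l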

Later on, you can download the split data set with the commands:

iget big_data.tgz_part_aa
iget big_data.tgz_part_ab
iget big_data.tgz_part_ac

And merge them back together with the cat command:

cat big_data.tgz_part_* > big_data.tgz
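If you recorded a checksum of the original file before splitting it (as suggested above), you can check that the merged file is intact before unpacking it:

md5sum -c big_data.tgz.md5
tar zxvf big_data.tgz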

Compressing, splitting and transporting terabyte-scale data sets will take time. Thus, in the case of large data sets, you should use taito-shell.csc.fi or batch jobs to execute these tasks.
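As a rough illustration, a batch job in Taito could run the packing and upload steps for you. The sketch below is only a minimal example: the job name, partition, time limit, memory request and file names are assumptions that you should adjust to your own case.

#!/bin/bash
#SBATCH --job-name=hpc_archive_backup   # illustrative job name
#SBATCH --partition=longrun             # assumed partition, choose one that allows long run times
#SBATCH --time=48:00:00                 # assumed time limit for a large transfer
#SBATCH --mem=2000                      # modest memory request, tar and iput need little memory

cd $WRKDIR
tar zcvf project1_results.tgz project1_results/   # pack an assumed results directory
iput project1_results.tgz                         # upload the archive to the HPC archive

The job can then be submitted with sbatch, for example: sbatch backup_job.sh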