3.2. Storing data in the CSC computing environment
CSC supports several storage systems for longer-term data storage. These storage services are intended for backups and master copies of datasets that are not actively modified during a research project. They are not, however, intended for long-term (tens of years) data preservation.
HPC archive is intended for storing datasets that are used in the CSC computing environment.
IDA storage service is a general storage service for scientific data. It is part of the Fairdata.fi research data management environment.
Pouta Object Storage is a general-purpose storage platform that can be accessed from any platform: CSC servers, virtual machines in the cPouta cloud environment, and users' local computing environments.
The main differences between these services are in their user policy and accessibility. The HPC archive is directly bound to CSC user accounts: all customers of the CSC computing environment automatically have an account with a 2 TB quota in the HPC archive.
The IDA service is not directly linked to the CSC computing environment. Even though CSC produces and hosts the IDA service and users need to register with CSC, the storage space is applied for from the user's home organization (a Finnish higher education institution or state research institute) or from the Academy of Finland, based on a funding decision. IDA users can use the storage space from both their own computers and the servers of CSC. IDA can also be used to publish or share datasets. More information about applying for storage space in IDA can be found on IDA's website:
Pouta Object Storage is a cloud storage service where you can store and retrieve data over HTTPS. It is not tied to any individual virtual machine, and the data can be made accessible from anywhere. For those familiar with commercial cloud services, Amazon S3 is an example of an object storage service.
The storage quotas are granted and used based on CSC customer projects. The default quota is 1 TB, but an extension can be requested. Pouta Object Storage is described in more detail in the next chapter:
HPC archive and IDA are used through client programs. IDA uses an in-house developed WWW interface and a command line client (ida), while the HPC archive service uses the iRODS (Integrated Rule-Oriented Data System) client. Pouta Object Storage can be accessed through S3- or Swift-compatible interfaces (including the cPouta WWW interface).
The files in these storage systems can be managed through the client interfaces, but the content of the archived files can't be studied or modified in place. Instead, a stored file must first be retrieved back to the CSC servers or to some other computer before the dataset can be analysed or modified.
3.2.1 Using HPC archive
The HPC archive service is based on iRODS technology. In the Taito cluster, the iRODS commands are available automatically. In Sisu, you need to run the following set-up command before you can execute iRODS commands:
module load irods
(Note that if you are using HPC archive in Taito, you should not run the module load irods command, as it loads an iRODS version that is not compatible with HPC archive.) The two basic iRODS commands are iput, which stores a file into iRODS, and iget, which gets a file from iRODS.
In addition, there are several other iRODS commands that can be used to manage the data on the archive server. Many of these i-commands, listed in Table 3.1, are analogous to the corresponding Linux commands. For example, the command irm removes a file from the iRODS server and imkdir creates a new directory on the iRODS server.
We recommend that you don't store all the data in the main folder of the server. Instead, create a hierarchical directory structure that helps you locate your files later on. Further, if possible, the files should be merged into larger compressed archive units with programs like tar or zip before moving the data to the HPC archive or IDA.
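The packing recommendation above could be scripted roughly as follows. This is a minimal sketch, not a CSC-provided tool: the function name pack_for_archive is hypothetical, and it relies only on standard tar options.

```shell
#!/bin/bash
# pack_for_archive DIR: hypothetical helper (not a CSC tool) that packs DIR
# into DIR.tgz and checks that the resulting archive can be read back.
pack_for_archive() {
    local dir="${1%/}"              # strip a trailing slash, if any
    local archive="${dir}.tgz"

    if [ ! -d "$dir" ]; then
        echo "error: $dir is not a directory" >&2
        return 1
    fi

    # c = create, z = gzip compression, f = archive file name
    tar czf "$archive" "$dir" || return 1

    # t = list contents; a failure here means the archive is unreadable
    tar tzf "$archive" > /dev/null || return 1

    # Print the archive name so the caller can pass it on, e.g. to iput
    echo "$archive"
}
```

Calling pack_for_archive proj27_data_1 would leave proj27_data_1.tgz in the current directory, ready to be copied to the archive as a single file.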
Table 3.1 Most commonly used iRODS commands.
|icd||Change the current working directory (collection).|
|ichksum||Calculate checksum for one or more data-object or collections.|
|ichmod||Change access permissions to collections or data-objects|
|icp||Copy a data-object (file) or collection (directory) to another.|
|iexit||Exit an irods session (un-iinit).|
|iget||Get a file from iRODS.|
|igetwild.sh||Get one or more files from iRODS using wildcard characters.|
|ihelp||Display a synopsis list of the i-commands|
|iinit||Initialize a session, so you don't need to retype your password.|
|ilocate||Search for data-object(s) OR collections (via a script).|
|ils||List collections (directories) and data-objects (files).|
|imkdir||Make an irods directory (collection).|
|imv||Move/rename an irods data-object (file) or collection (directory).|
|ipasswd||Change your irods password.|
|iput||Put (store) a file into iRODS.|
|ipwd||Print the current working directory (collection) name.|
|iquota||Show information on iRODS quotas (if any).|
|irm||Remove one or more data-objects or collections.|
|irsync||Synchronize collections between a local/irods or irods/irods (at the moment this command is not working properly; we recommend not using it).|
In this example, user kkayttaj copies a set of files from his $WRKDIR directory in Sisu, to his HPC Archive directory.
After logging into Sisu the user sets up the iRODS commands and moves to the work directory of Sisu:
[kkayttaj@c305 ~]$ module load irods
[kkayttaj@c305 ~]$ cd $WRKDIR
If you are operating from within Taito, skip the module command (see above).
Then the user checks the content of the directory with the command ls and creates a new directory called proj27_data_1.
[kkayttaj@c305 kkayttaj]$ ls
images27_a.jpg images27_b.jpg images27_c.jpg input27.dat result27_a.out result27_b.out result27_c.out
[kkayttaj@c305 kkayttaj]$ mkdir proj27_data_1
Then the user copies the files he wants to preserve to the new directory:
[kkayttaj@c305 kkayttaj]$ cp input27.dat proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp result27*.out proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp images27*.jpg proj27_data_1
After that the user checks that the new directory contains all the files that he wishes to store in the archive.
[kkayttaj@c305 kkayttaj]$ ls proj27_data_1
images27_a.jpg images27_b.jpg images27_c.jpg input27.dat result27_a.out result27_b.out result27_c.out
Next, the data to be stored is collected into a compressed tar archive file called proj27_data_1.tgz.
[kkayttaj@c305 kkayttaj]$ tar zcvf proj27_data_1.tgz proj27_data_1
The resulting compressed file proj27_data_1.tgz can now be copied to the HPC archive. Before copying the data, the user first creates a new sub-folder called proj27 on the HPC archive server.
[kkayttaj@c305 kkayttaj]$ imkdir proj27
Next the user checks that the directory was created on the HPC archive server and changes the current HPC archive directory to the new proj27 directory:
[kkayttaj@c305 kkayttaj]$ ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
[kkayttaj@c305 kkayttaj]$ icd proj27
After this the user is ready to execute the iput command, which copies the file to the new directory on the HPC archive server.
[kkayttaj@c305 kkayttaj]$ iput proj27_data_1.tgz
Once the data copying process is finished, the user checks that the file has been successfully copied to the archive:
[kkayttaj@c305 kkayttaj]$ ils -l
/hpc_archive/home/kkayttaj/proj27:
  kkayttaj 0 disk-1.4 1344214352 2013-03-25.13:15 & proj27_data_1.tgz
If you want to be certain that the transfer has been completely successful, you can run the checksum commands for both the local copy (md5sum) and the iRODS copy (ichksum) and verify that the checksums match:
[kkayttaj@c305 kkayttaj]$ ichksum proj27_data_1.tgz
    proj27_data_1.tgz    24eeb2845cbfda238b78fa165c21607d
Total checksum performed = 1, Failed checksum = 0
[kkayttaj@c305 kkayttaj]$ md5sum proj27_data_1.tgz
24eeb2845cbfda238b78fa165c21607d  proj27_data_1.tgz
Once the files are successfully archived, the files in directory proj27_data_1 and the file proj27_data_1.tgz can be removed from the local $WRKDIR:
[kkayttaj@c305 kkayttaj]$ rm proj27_data_1.tgz
[kkayttaj@c305 kkayttaj]$ rm -r proj27_data_1
[kkayttaj@c305 kkayttaj]$ rm input27.dat
[kkayttaj@c305 kkayttaj]$ rm result27*.out
[kkayttaj@c305 kkayttaj]$ rm images27*.jpg
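Since the local copies are removed only after the checksums have been compared, the comparison could be scripted instead of being done by eye. The following is a minimal sketch: the function names local_md5 and checksums_match are hypothetical, and the usage comment assumes that ichksum prints the file name followed by an MD5 digest on its first output line, as in the example above.

```shell
#!/bin/bash
# local_md5 FILE: print only the MD5 hex digest of a local file.
local_md5() {
    md5sum "$1" | awk '{print $1}'
}

# checksums_match FILE EXPECTED: succeed only if the MD5 digest of FILE
# equals EXPECTED (for example, the digest reported by ichksum).
checksums_match() {
    [ "$(local_md5 "$1")" = "$2" ]
}

# Usage sketch (assumption: the first ichksum output line has the form
# "<name> <digest>", as in the example in this guide):
#   remote=$(ichksum proj27_data_1.tgz | awk 'NR==1 {print $2}')
#   checksums_match proj27_data_1.tgz "$remote" && rm proj27_data_1.tgz
```

With this pattern the rm command runs only when the two digests are identical, so a failed or partial transfer leaves the local file in place.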
kkayttaj@sisu-login5:/wrk/kkayttaj>module load irods
kkayttaj@sisu-login5:>cd $WRKDIR
kkayttaj@sisu-login5:/wrk/kkayttaj>ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
kkayttaj@sisu-login5:/wrk/kkayttaj>icd proj27
kkayttaj@sisu-login5:/wrk/kkayttaj>ils
/hpc_archive/home/kkayttaj/proj27:
  proj27_data_1.tgz
kkayttaj@sisu-login5:/wrk/kkayttaj> iget proj27_data_1.tgz
Then decompress and unpack the data:
kkayttaj@sisu-login5:/wrk/kkaytaj> tar zxvf proj27_data_1.tgz
After these commands the $WRKDIR directory will include directory proj27_data_1 that contains the files stored to the HPC-Archive service.
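If $WRKDIR already contains a directory with the same name as the one inside the archive, unpacking in place would overwrite those files. One possible way around this, sketched below, is to unpack into a separate directory. The function name unpack_to is hypothetical, and only standard tar options are used.

```shell
#!/bin/bash
# unpack_to ARCHIVE DEST: hypothetical helper that lists ARCHIVE first and
# then unpacks it under DEST, leaving the current directory untouched.
unpack_to() {
    local archive="$1"
    local dest="$2"

    # Listing first catches a truncated or corrupted download early.
    tar tzf "$archive" > /dev/null || return 1

    mkdir -p "$dest" || return 1

    # -C makes tar change to DEST before extracting.
    tar xzf "$archive" -C "$dest"
}
```

For example, unpack_to proj27_data_1.tgz restored would place the files under restored/proj27_data_1.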
Example 3. Retrieving data from the old archive directory
The HPC archive system was taken into use in 2013. Users who used the CSC archive service before the current HPC archive system can access the data stored in the older system through the HPC archive path:
For example, user kkayttaj could retrieve files stored in the old system with the commands:
[kkayttaj@taito-login3 ~]$ ils /hpc_archive/old_archive/kkayttaj
/hpc_archive/old_archive/kkayttaj:
[kkayttaj@taito-login3 ~]$ iget /hpc_archive/old_archive/kkayttaj/old_project.tgz
Each IDA project has two storage areas: a staging area and a frozen area. The staging area is intended for collecting and managing data. A mature dataset that will not change anymore can be moved to the frozen area to be preserved and further linked to other Fairdata services.
The command line client of IDA, ida, enables data transport between Taito and IDA. Data can be uploaded to and downloaded from the IDA staging area. In the case of the frozen area, only download is possible. Note that some key features of IDA, like moving data to the frozen area or publishing data, are available only through the IDA WWW interface.
Before you can start using IDA client in Taito you must set up your IDA connection by running command.
The configuration process asks for your CSC project number and application password. This information can be obtained from the personal information page of the IDA WWW-interface. The configuration is stored to your home directory so you need to do it only once.
Once you have configured the connection, you can start operating with IDA. The basic syntax of ida commands is:
ida task -options target_in_ida target_in_taito
To check the content of your staging area in IDA, give the command:
ida info /
Adding the option -f to an ida command makes the command use the frozen area instead of the staging area. For example, the following command would give you information about the file test2, located in the frozen area:
[kkayttaj@c305 kkayttaj]$ ida info -f /test2
project: 2000136
pathname: /test2
area: frozen
type: file
pid: 5bc456a74ba89743214993f23695474
size: 113926178937
encoding: application/octet-stream
modified: 2018-10-15T08:17:53Z
frozen: 2018-10-15T08:58:15Z
Uploading and downloading files and directories between Taito and IDA is done with commands:
ida upload target_in_ida local_file
ida download target_in_ida local_file
For example command:
ida upload test123/data1 test_data
will upload the file test_data from Taito to the IDA staging area and store the data there in directory test123 with the name data1. The directory test123 will be created automatically in the staging area if it does not already exist.
If you download a directory, the downloaded files are stored in Taito as a zip archive. Thus you should give the local target file the name extension .zip. For example:
ida download project1 project1_data.zip
The command above would download all the data from the IDA staging area directory project1 and store it to a zip archive file project1_data.zip in your current directory in Taito.
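Because a directory download always produces a zip archive, a small wrapper could add the .zip extension when it is left out. The sketch below uses only the ida download syntax shown above. The function name ida_download_dir and the DRY_RUN variable are hypothetical and are not part of the IDA client.

```shell
#!/bin/bash
# ida_download_dir IDA_DIR [LOCAL_NAME]: hypothetical wrapper around
# "ida download" that makes sure the local target ends in .zip.
# Set DRY_RUN=1 to print the command instead of running it.
ida_download_dir() {
    local ida_dir="$1"
    # Default the local name to the last component of the IDA path.
    local target="${2:-$(basename "$ida_dir")}"

    # Directories arrive as zip archives, so enforce the .zip extension.
    case "$target" in
        *.zip) ;;
        *) target="${target}.zip" ;;
    esac

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ida download $ida_dir $target"
    else
        ida download "$ida_dir" "$target"
    fi
}
```

Running the wrapper with DRY_RUN=1 first shows which command would be executed without contacting IDA.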
More information about using and configuring the IDA client can be found at https://github.com/CSCfi/ida2-command-line-tools
CSC's archiving system, HPC archive, can also be accessed via the Scientist's User Interface web portal by using the portal's My Files tool (https://sui.csc.fi/group/sui/my-files). For instructions on data management with My Files, please see chapter 5.1.