3.2. Archiving data to the HPC archive and IDA storage services

CSC supports two parallel archiving systems for longer-term data storage:

  1. HPC archive is intended for storing datasets that are used in the CSC computing environment.

  2. IDA storage service is a general storage service for scientific data. It is part of the Fairdata.fi research data management environment.

The main differences between these two services are in their user policy and accessibility. The HPC archive is directly bound to CSC user accounts: all customers of the CSC computing environment automatically have an account with a 2 TB quota in the HPC archive.

The IDA service is not directly linked to the CSC computing environment. Even though CSC hosts the IDA service and users need to register with CSC, the storage space is applied for through the universities or the Academy of Finland. IDA users can use the storage space both from their own computers and from the servers of CSC. Thus IDA can be used for transporting data between CSC and a local environment. IDA can also be used to publish or share datasets. More information about applying for storage space in IDA can be found on the home page of IDA.

Both HPC archive and IDA are used through client programs. IDA uses an in-house developed client tool (ida), while the HPC archive service uses the iRODS (Integrated Rule-Oriented Data System) client.

The files in these storage systems can be managed through the client interfaces, but the content of the archived files cannot be examined or modified in place. Instead, a stored file must first be retrieved to the CSC servers or to some other computer before the dataset can be analysed or modified.

3.2.1 Using HPC archive

The HPC archive service is based on iRODS technology. In the Taito cluster the iRODS commands are available automatically. In Sisu you need to run the following set-up command before you can execute iRODS commands:

module load irods

Note that if you are using HPC archive in Taito, you should not run the module load irods command, as it loads an iRODS version that is not compatible with HPC archive.

The two basic iRODS commands are:

  • iput that copies a file to the iRODS server
  • iget that retrieves a file from the iRODS server

In addition, there are several other iRODS commands that can be used to manage the data on the archive server. Many of these i-commands, listed in Table 3.1, are analogous to the corresponding Linux commands. For example, the command irm removes a file from the iRODS server and imkdir creates a new directory on the iRODS server.

We recommend that you do not store all your data in the main folder of the server. Instead, create a hierarchical directory structure that helps you locate your files later on. Further, if possible, the files should be merged into larger compressed archive files with programs like tar or zip before the data is moved to HPC archive or IDA.
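As a minimal sketch of the packing step recommended above, the following shell function bundles a directory into one compressed tar file and then lists the archive to confirm that it can be read back. The function name pack_dir is a hypothetical helper, not a CSC tool; it relies only on the standard tar, dirname and basename commands.

```shell
# pack_dir DIR
# Hypothetical helper: creates DIR.tgz next to DIR and checks that the
# resulting archive is readable. Prints the archive path on success.
pack_dir() {
    dir=${1%/}                       # strip a trailing slash, if any
    if [ ! -d "$dir" ]; then
        echo "pack_dir: $dir is not a directory" >&2
        return 1
    fi
    parent=$(dirname "$dir")
    name=$(basename "$dir")
    # -C stores only the directory name, not the full path, in the archive
    tar -czf "$parent/$name.tgz" -C "$parent" "$name" || return 1
    # Listing the archive fails if the file is truncated or corrupt
    tar -tzf "$parent/$name.tgz" > /dev/null || return 1
    echo "$parent/$name.tgz"
}
```

For example, pack_dir proj27_data_1 would create proj27_data_1.tgz in the current directory, ready to be copied to the archive with iput.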

Table 3.1 Most commonly used iRODS commands.

Command      Function
icd          Change the current working directory (collection).
ichksum      Calculate a checksum for one or more data-objects or collections.
ichmod       Change access permissions to collections or data-objects.
icp          Copy a data-object (file) or collection (directory) to another.
iexit        Exit an irods session (un-iinit).
iget         Get a file from iRODS.
igetwild.sh  Get one or more files from iRODS using wildcard characters.
ihelp        Display a synopsis list of the i-commands.
iinit        Initialize a session, so you don't need to retype your password.
ilocate      Search for data-object(s) OR collections (via a script).
ils          List collections (directories) and data-objects (files).
imkdir       Make an irods directory (collection).
imv          Move/rename an irods data-object (file) or collection (directory).
ipasswd      Change your irods password.
iput         Put (store) a file into iRODS.
ipwd         Print the current working directory (collection) name.
iquota       Show information on iRODS quotas (if any).
irm          Remove one or more data-objects or collections.
irsync       Synchronize collections between local/irods or irods/irods. (At the moment this command is not working properly. We recommend not using it.)

 

Example 1. Storing data from within Sisu to the HPC archive server

In this example, user kkayttaj copies a set of files from his $WRKDIR directory in Sisu to his HPC archive directory.
After logging into Sisu, the user sets up the iRODS commands and moves to the work directory of Sisu:

[kkayttaj@c305 ~]$ module load irods
[kkayttaj@c305 ~]$ cd $WRKDIR

When operating from within Taito, skip the module command (see above).

Then the user checks the content of the directory with the command ls and creates a new directory called proj27_data_1:

[kkayttaj@c305 kkayttaj]$ ls
images27_a.jpg images27_b.jpg  images27_c.jpg  input27.dat  result27_a.out
result27_b.out  result27_c.out
[kkayttaj@c305 kkayttaj]$ mkdir proj27_data_1

Then the user copies the files he wants to preserve to the new directory:

[kkayttaj@c305 kkayttaj]$ cp input27.dat proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp result27*.out proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp images27*.jpg proj27_data_1

After that, the user checks that the new directory contains all the files that he wishes to store in the archive:

[kkayttaj@c305 kkayttaj]$ ls proj27_data_1
images27_a.jpg images27_b.jpg  images27_c.jpg  input27.dat  result27_a.out
result27_b.out  result27_c.out

Next, the data to be stored is collected into a compressed tar archive file called proj27_data_1.tgz:

[kkayttaj@c305 kkayttaj]$ tar zcvf proj27_data_1.tgz proj27_data_1

The resulting compressed file proj27_data_1.tgz can now be copied to the HPC archive. Before copying the data, the user first creates a new sub-folder called proj27 on the HPC archive server:

[kkayttaj@c305 kkayttaj]$ imkdir proj27

Next, the user checks that the directory was created on the HPC archive server and changes the current HPC archive directory to the new proj27 directory:

[kkayttaj@c305 kkayttaj]$ ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
[kkayttaj@c305 kkayttaj]$ icd proj27

After this, the user is ready to execute the iput command that copies the file to the new directory on the HPC archive server:

[kkayttaj@c305 kkayttaj]$ iput proj27_data_1.tgz

Once the copying is finished, the user checks that the file has been successfully copied to the archive:

[kkayttaj@c305 kkayttaj]$ ils -l
/hpc_archive/home/kkayttaj/proj27
  kkayttaj            0 disk-1.4               1344214352 2013-03-25.13:15 & proj27_data_1.tgz

If you want to be certain that the transfer has been completely successful, you can run the checksum commands for both the local copy (md5sum) and the iRODS copy (ichksum) and verify that the checksums match:

[kkayttaj@c305 kkayttaj]$ ichksum proj27_data_1.tgz
    proj27_data_1.tgz                     24eeb2845cbfda238b78fa165c21607d
Total checksum performed = 1, Failed checksum = 0

[kkayttaj@c305 kkayttaj]$ md5sum proj27_data_1.tgz
24eeb2845cbfda238b78fa165c21607d proj27_data_1.tgz
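
The comparison can also be scripted. The sketch below assumes that ichksum prints the file name followed by a plain MD5 hex string, as in the output above; other iRODS versions or configurations may report checksums in a different format, in which case the parsing would need to be adapted. The function name checksums_match is hypothetical.

```shell
# checksums_match FILE ICHKSUM_OUTPUT
# Hypothetical helper: succeeds only if the md5sum of the local FILE equals
# the checksum reported for the same file name in the given ichksum output.
checksums_match() {
    file=$1
    ichksum_output=$2
    name=$(basename "$file")
    local_sum=$(md5sum "$file" | awk '{print $1}')
    # Take the second field of the line whose first field is the file name
    remote_sum=$(printf '%s\n' "$ichksum_output" |
        awk -v n="$name" '$1 == n {print $2; exit}')
    [ -n "$remote_sum" ] && [ "$local_sum" = "$remote_sum" ]
}

# Usage (requires the iRODS client and an archived copy in the current
# iRODS collection):
# if checksums_match proj27_data_1.tgz "$(ichksum proj27_data_1.tgz)"; then
#     echo "checksums match"
# fi
```

Running the comparison before deleting anything makes it harder to remove local files after an incomplete transfer.
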
Once the files are successfully archived, the directory proj27_data_1, the file proj27_data_1.tgz and the original files can be removed from the local $WRKDIR:
[kkayttaj@c305 kkayttaj]$ rm proj27_data_1.tgz
[kkayttaj@c305 kkayttaj]$ rm -r proj27_data_1
[kkayttaj@c305 kkayttaj]$ rm input27.dat
[kkayttaj@c305 kkayttaj]$ rm result27*.out
[kkayttaj@c305 kkayttaj]$ rm images27*.jpg

 

Example 2. Retrieving data from the archive server on Sisu

To retrieve the data stored in HPC archive in the previous example, the user kkayttaj should take the following steps. First, the compressed file is copied from HPC archive to the $WRKDIR directory:

kkayttaj@sisu-login5:/wrk/kkayttaj> module load irods
kkayttaj@sisu-login5:> cd $WRKDIR
kkayttaj@sisu-login5:/wrk/kkayttaj> ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
kkayttaj@sisu-login5:/wrk/kkayttaj> icd proj27
kkayttaj@sisu-login5:/wrk/kkayttaj> ils
/hpc_archive/home/kkayttaj/proj27:
  proj27_data_1.tgz
kkayttaj@sisu-login5:/wrk/kkayttaj> iget proj27_data_1.tgz

Then the data is decompressed and unpacked:

kkayttaj@sisu-login5:/wrk/kkayttaj> tar zxvf proj27_data_1.tgz

After these commands, the $WRKDIR directory will include the directory proj27_data_1, which contains the files that were stored in the HPC archive service.


Example 3. Retrieving data from the old archive directory

The HPC archive system was taken into use in 2013. Users who used the archive service of CSC before the current HPC archive system can access the data stored in the older system through the HPC archive path:

/hpc_archive/old_archive/useraccount

For example, user kkayttaj could retrieve files stored in the old system with the commands:

[kkayttaj@taito-login3 ~]$ ils /hpc_archive/old_archive/kkayttaj
/hpc_archive/old_archive/kkayttaj:
[kkayttaj@taito-login3 ~]$ iget /hpc_archive/old_archive/kkayttaj/old_project.tgz

 

3.2.2 Configuring and using IDA in command line

Each IDA project has two storage areas: a staging area and a frozen area. The staging area is intended for collecting and managing data. A mature dataset that will no longer change can be moved to the frozen area to be preserved and further linked to other Fairdata services.

The command line client of IDA, ida, enables data transport between Taito and IDA. Data can be uploaded to and downloaded from the IDA staging area. In the case of the frozen area, only download is possible. Note that some key features of IDA, like moving data to the frozen area or publishing data, are available only through the IDA WWW interface.

Before you can start using the IDA client in Taito, you must set up your IDA connection by running the command:

ida_configure

The configuration process asks for your CSC project number and application password. This information can be obtained from the personal information page of the IDA WWW interface. The configuration is stored in your home directory, so you need to do it only once.

Once you have configured the connection, you can start working with IDA. The basic syntax of ida commands is:

ida task -options target_in_ida target_in_taito

To check the content of your staging area in IDA, give the command:

ida info /

Adding the option -f to the ida command makes the command use the frozen area instead of the staging area. For example, the following command would give you information about the file test2, located in the frozen area:

[kkayttaj@c305 kkayttaj]$ ida info -f /test2
project:    2000136
pathname:   /test2
area:       frozen
type:       file
pid:        5bc456a74ba89743214993f23695474
size:       113926178937
encoding:   application/octet-stream
modified:   2018-10-15T08:17:53Z
frozen:     2018-10-15T08:58:15Z

 

Uploading and downloading files and directories between Taito and IDA is done with commands:

ida upload target_in_ida local_file
ida download target_in_ida local_file 

For example, the command:

ida upload test123/data1 test_data

will upload the file test_data from Taito to the IDA staging area and store it there in the directory test123 with the name data1. The directory test123 is created automatically in the staging area if it does not already exist.
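When several files are to be uploaded to the same directory, the individual commands can be generated with a small loop. The following is only a sketch: print_ida_uploads is a hypothetical helper that prints one ida upload command per local file, using the upload syntax shown above, so that the list can be reviewed before anything is transferred. It assumes file names without spaces or other special characters.

```shell
# print_ida_uploads IDA_DIR FILE...
# Hypothetical helper: prints one "ida upload" command for each existing
# local file, storing it in IDA_DIR under its own base name.
print_ida_uploads() {
    ida_dir=${1%/}                   # strip a trailing slash, if any
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf 'ida upload %s/%s %s\n' "$ida_dir" "$(basename "$f")" "$f"
        else
            echo "skipping $f: not a regular file" >&2
        fi
    done
}

# Usage: review the printed commands first, then run them, for example with
# print_ida_uploads test123 result27_a.out result27_b.out | sh
```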

If you download a directory, the downloaded files are stored in Taito as a zip archive. Thus you should give the local target file the name extension .zip. For example:

ida download project1 project1_data.zip

The command above would download all the data from the IDA staging area directory project1 and store it in a zip archive file project1_data.zip in your current directory in Taito.
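Because a directory download is delivered as a zip archive, a script can make sure that the local target name carries the .zip extension before calling ida download. The helper below is a hypothetical sketch, not part of the ida client.

```shell
# zip_target NAME
# Hypothetical helper: prints NAME unchanged if it already ends in .zip,
# otherwise prints NAME with .zip appended.
zip_target() {
    case $1 in
        *.zip) printf '%s\n' "$1" ;;
        *)     printf '%s.zip\n' "$1" ;;
    esac
}

# Usage (requires a configured ida client):
# ida download project1 "$(zip_target project1_data)"
```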

More information about using and configuring the IDA client can be found at https://github.com/CSCfi/ida2-command-line-tools

 

 

3.2.3 Using HPC Archive with Scientist's User Interface

CSC's archiving system, HPC archive, can be accessed via the Scientist's User Interface web portal by using the portal's My Files tool (https://sui.csc.fi/group/sui/my-files). For instructions on data management with My Files, please see chapter 5.1.

 
