3.2. Archiving data to the HPC archive and IDA storage services

CSC supports two parallel archiving systems for long-term data storage:

  1. HPC archive is intended for storing datasets that are used in the CSC computing environment.

  2. IDA storage service is a general storage service for scientific data.

The main difference between these two services is in their user policy and accessibility. The HPC archive is bound directly to CSC user accounts: all customers of the CSC computing environment automatically have an account with a 2 TB quota in the HPC archive.

The IDA service is not directly linked to the CSC computing environment. Even though CSC hosts the IDA service and users need to register with CSC, the storage space is applied for from the universities or from the Academy of Finland. IDA users can use the storage space both from their own computers and from the servers of CSC. Thus IDA can be used for transporting data between CSC and a local environment. IDA can also be used to publish or share datasets. More information about applying for storage space in IDA is available on the IDA home page.

Both HPC archive and IDA are based on iRODS (Integrated Rule-Oriented Data System) technology, and thus these storage sites are not directly mounted to the CSC computing environment. The dataset to be stored is copied to the archive over the internet. The files in these storage systems can be managed through iRODS interfaces, but the content of the archived files can't be viewed or modified in place. Instead, a stored file must first be retrieved back to the CSC servers or to some other computer before the dataset can be analysed or modified.

3.2.1 Using HPC archive

In the Taito cluster the iRODS commands are available automatically. In Sisu you need to run the following set-up command before you can execute iRODS commands:

module load irods

(Note that if you are using HPC archive in Taito, you should not run the module load irods command, as it loads an iRODS version that is not compatible with HPC archive.) The two basic iRODS commands are:

  • iput that copies a file to the iRODS server
  • iget that retrieves a file from the iRODS server

In addition, there are several other iRODS commands that can be used to manage the data on the archive server. Many of these i-commands, listed in Table 3.1, are analogous to the corresponding Linux commands. For example, the command irm removes a file from the iRODS server and imkdir creates a new directory on the iRODS server.

We recommend that you don't store all your data in the main folder of the server. Instead, create a hierarchical directory structure that helps you locate your files later on. Further, if possible, the files should be merged into larger compressed archiving units with programs like tar or zip before the data is moved to the HPC archive or IDA.
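The packing step recommended above could be wrapped in a small bash function. This is only a sketch: pack_dir is a hypothetical helper name (not an iRODS or CSC command), and it assumes that the directory is given relative to the current working directory.

```shell
# Sketch: bundle one directory into a single compressed archive before upload.
# pack_dir is a hypothetical helper name, not an iRODS or CSC command.
pack_dir() {
    local dir="${1%/}"            # directory to pack, trailing slash removed
    local archive="${dir}.tgz"    # the archive is written next to the directory

    if [ ! -d "$dir" ]; then
        echo "pack_dir: no such directory: $dir" >&2
        return 1
    fi

    # Create the compressed archive, then list it to confirm it is readable.
    tar zcf "$archive" "$dir" || return 1
    tar ztf "$archive" > /dev/null || return 1

    # Print the archive name so that it can be passed on, e.g. to iput.
    echo "$archive"
}
```

For example, running pack_dir proj27_data_1 in $WRKDIR would create proj27_data_1.tgz and print its name.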

Table 3.1 Most commonly used iRODS commands.

Command       Function
icd           Change the current working directory (collection).
ichksum       Calculate a checksum for one or more data-objects or collections.
ichmod        Change access permissions to collections or data-objects.
icp           Copy a data-object (file) or collection (directory) to another.
iexit         Exit an iRODS session (un-iinit).
iget          Get a file from iRODS.
igetwild.sh   Get one or more files from iRODS using wildcard characters.
ihelp         Display a synopsis list of the i-commands.
iinit         Initialize a session, so you don't need to retype your password.
ilocate       Search for data-object(s) or collections (via a script).
ils           List collections (directories) and data-objects (files).
imkdir        Make an iRODS directory (collection).
imv           Move/rename an iRODS data-object (file) or collection (directory).
ipasswd       Change your iRODS password.
iput          Put (store) a file into iRODS.
ipwd          Print the current working directory (collection) name.
iquota        Show information on iRODS quotas (if any).
irm           Remove one or more data-objects or collections.
irsync        Synchronize collections between local/iRODS or iRODS/iRODS. (At the moment this command is not working properly; we recommend not using it.)

 

Example 1. Storing data from within Sisu to the HPC archive server

In this example, user kkayttaj copies a set of files from his $WRKDIR directory in Sisu to his HPC archive directory.
After logging into Sisu the user sets up the iRODS commands and moves to the work directory of Sisu:

[kkayttaj@c305 ~]$ module load irods
[kkayttaj@c305 ~]$ cd $WRKDIR

When operating from within Taito, skip the module command (see above).

Then the user checks the content of the directory with the command ls and creates a new directory called proj27_data_1:

[kkayttaj@c305 kkayttaj]$ ls
images27_a.jpg images27_b.jpg  images27_c.jpg  input27.dat  result27_a.out
result27_b.out  result27_c.out
[kkayttaj@c305 kkayttaj]$ mkdir proj27_data_1

Then the user copies the files he wants to preserve to the new directory:

[kkayttaj@c305 kkayttaj]$ cp input27.dat proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp result27*.out proj27_data_1
[kkayttaj@c305 kkayttaj]$ cp images27*.jpg proj27_data_1

After that the user checks that the new directory contains all the files that he wishes to store in the archive:

[kkayttaj@c305 kkayttaj]$ ls proj27_data_1
images27_a.jpg images27_b.jpg  images27_c.jpg  input27.dat  result27_a.out
result27_b.out  result27_c.out

Next, the data to be stored is packed into a compressed tar archive file called proj27_data_1.tgz:

[kkayttaj@c305 kkayttaj]$ tar zcvf proj27_data_1.tgz proj27_data_1

The resulting compressed file proj27_data_1.tgz can now be copied to the HPC archive. Before copying the data, the user first creates a new sub-folder called proj27 on the HPC archive server:

[kkayttaj@c305 kkayttaj]$ imkdir proj27

Next the user checks that the directory was created on the HPC archive server and changes the current HPC archive directory to the new proj27 directory:

[kkayttaj@c305 kkayttaj]$ ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
[kkayttaj@c305 kkayttaj]$ icd proj27

After this the user is ready to execute the iput command, which copies the file to the new directory on the HPC archive server:

[kkayttaj@c305 kkayttaj]$ iput proj27_data_1.tgz

Once the data copying process is finished, the user checks that the file has been successfully copied to the archive:

[kkayttaj@c305 kkayttaj]$ ils -l
/hpc_archive/home/kkayttaj/proj27:
  kkayttaj            0 disk-1.4               1344214352 2013-03-25.13:15 & proj27_data_1.tgz

If you want to be certain that the transfer was completely successful, you can run the checksum commands for both the local copy (md5sum) and the iRODS copy (ichksum) and verify that the checksums match:

[kkayttaj@c305 kkayttaj]$ ichksum proj27_data_1.tgz
    proj27_data_1.tgz                     24eeb2845cbfda238b78fa165c21607d
Total checksum performed = 1, Failed checksum = 0

[kkayttaj@c305 kkayttaj]$ md5sum proj27_data_1.tgz
24eeb2845cbfda238b78fa165c21607d proj27_data_1.tgz

Once the files are successfully archived, the directory proj27_data_1, the file proj27_data_1.tgz and the original data files can be removed from the local $WRKDIR:

[kkayttaj@c305 kkayttaj]$ rm proj27_data_1.tgz
[kkayttaj@c305 kkayttaj]$ rm -r proj27_data_1
[kkayttaj@c305 kkayttaj]$ rm input27.dat
[kkayttaj@c305 kkayttaj]$ rm result27*.out
[kkayttaj@c305 kkayttaj]$ rm images27*.jpg
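
The checksum comparison and the removal of the local archive file could be tied together, so that the local copy is deleted only when the two checksums agree. The following bash function is a sketch under stated assumptions: remove_if_archived is a hypothetical helper name, the data-object is assumed to have the same name as the local file and to be in the current iRODS collection, and ichksum is assumed to print the file name followed by a plain MD5 string on its first output line, as in the example above. Other iRODS versions may print checksums in a different format.

```shell
# Sketch: delete a local file only if its MD5 checksum matches the archived copy.
# remove_if_archived is a hypothetical helper name, not an iRODS command.
remove_if_archived() {
    local file="$1"
    local local_sum remote_sum

    # md5sum prints "<checksum>  <file>"; keep the first field.
    local_sum=$(md5sum "$file" | awk '{print $1}')
    # Assumed ichksum format: "<file>  <checksum>" on the first line;
    # keep the second field of that line.
    remote_sum=$(ichksum "$file" | awk 'NR == 1 {print $2}')

    # An empty local checksum (e.g. missing file) is treated as a mismatch.
    if [ -n "$local_sum" ] && [ "$local_sum" = "$remote_sum" ]; then
        rm "$file"
        echo "checksums match, removed local copy of $file"
    else
        echo "checksum mismatch for $file, local copy kept" >&2
        return 1
    fi
}
```

For example, remove_if_archived proj27_data_1.tgz could be run in $WRKDIR after the iput command, while the current iRODS collection is still proj27.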

 

Example 2. Retrieving data from the archive server on Sisu

To retrieve the data stored to the HPC archive in the previous example, the user kkayttaj should do the following steps. First the compressed file is copied from the HPC archive to the $WRKDIR directory:

kkayttaj@sisu-login5:~> module load irods
kkayttaj@sisu-login5:~> cd $WRKDIR
kkayttaj@sisu-login5:/wrk/kkayttaj> ils
/hpc_archive/home/kkayttaj:
  C- /hpc_archive/home/kkayttaj/proj27
kkayttaj@sisu-login5:/wrk/kkayttaj> icd proj27
kkayttaj@sisu-login5:/wrk/kkayttaj> ils
/hpc_archive/home/kkayttaj/proj27:
  proj27_data_1.tgz
kkayttaj@sisu-login5:/wrk/kkayttaj> iget proj27_data_1.tgz

Then the data is decompressed and unpacked:

kkayttaj@sisu-login5:/wrk/kkayttaj> tar zxvf proj27_data_1.tgz

After these commands the $WRKDIR directory contains the directory proj27_data_1, which holds the files that were stored to the HPC archive service.
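
The two retrieval steps could also be combined so that the archive is unpacked only if the download succeeded. This is a minimal sketch: fetch_and_unpack is a hypothetical helper name, and it assumes that the data-object is in the current iRODS collection (set with icd).

```shell
# Sketch: retrieve an archive from the current iRODS collection and unpack it.
# fetch_and_unpack is a hypothetical helper name, not an iRODS command.
fetch_and_unpack() {
    local archive="$1"

    # Copy the data-object from the archive server to the current local directory.
    if ! iget "$archive"; then
        echo "fetch_and_unpack: iget failed for $archive" >&2
        return 1
    fi

    # Decompress and unpack only after the download has succeeded.
    tar zxf "$archive"
}
```

For example, after icd proj27 the command fetch_and_unpack proj27_data_1.tgz would fetch the archive and unpack it in the current directory.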


Example 3. Retrieving data from the old archive directory

The HPC archive system was taken into use in 2013. Users who used the archive service of CSC before the current HPC archive system can access the data stored in the older system through the HPC archive path:

/hpc_archive/old_archive/useraccount

For example, user kkayttaj could retrieve files stored in the old system with commands:

[kkayttaj@taito-login3 ~]$ ils /hpc_archive/old_archive/kkayttaj
/hpc_archive/old_archive/kkayttaj:
[kkayttaj@taito-login3 ~]$ iget /hpc_archive/old_archive/kkayttaj/old_project.tgz

 

3.2.2 Configuring the connection to IDA

You can use the IDA storage service directly from the servers of CSC using the same iRODS commands that are used for HPC archive. However, as IDA is not the default iRODS server in the CSC computing environment, you must modify your iRODS settings before connecting to IDA. Before changing the iRODS setup, close the iRODS connection to the HPC archive server with the command:

iexit full

Then switch to iRODS version 4.0.3 (the default iRODS version of Taito is not compatible with IDA):

module load irods/4.0.3

 

After that you need to create a connection configuration file for IDA. This configuration file should be named irods_environment.json and it should be located in the .irods subdirectory of your home directory.

To start editing this IDA configuration file with, for example, the nano editor, run the command:

nano $HOME/.irods/irods_environment.json

The file should contain the following information:

{
    "irods_home": "/ida/organization/project",
    "irods_user_name": "username",
    "irods_host": "ida.csc.fi",
    "irods_zone": "ida",
    "irods_port": 1247,
    "irods_authentication_scheme": "native"
}

 

In the example above you should change the organization, project and username so that they match your IDA project. For example, for user kkayttaj who belongs to an IDA project called bigproj at the university myuni, the final version of irods_environment.json should look like this (note the commas):

{
    "irods_home": "/ida/myuni/bigproj",
    "irods_user_name": "kkayttaj",
    "irods_port": 1247,
    "irods_host": "ida.csc.fi",
    "irods_zone": "ida",
    "irods_authentication_scheme": "native"
}

Note that the IDA home directory is the same for all the members of your IDA project.
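
As an alternative to editing the file by hand, the configuration file could be generated from the shell. The bash function below is only a sketch: write_ida_config is a hypothetical helper name, and its three arguments (organization, project, username) are placeholders that must match your own IDA project. Note that it overwrites any existing irods_environment.json.

```shell
# Sketch: write an IDA connection file with the fields described above.
# write_ida_config is a hypothetical helper name, not an iRODS command.
write_ida_config() {
    local organization="$1" project="$2" username="$3"
    local config_dir="$HOME/.irods"

    mkdir -p "$config_dir"
    # The here-document expands the three variables into otherwise fixed JSON.
    # An existing irods_environment.json is overwritten.
    cat > "$config_dir/irods_environment.json" <<EOF
{
    "irods_home": "/ida/${organization}/${project}",
    "irods_user_name": "${username}",
    "irods_host": "ida.csc.fi",
    "irods_zone": "ida",
    "irods_port": 1247,
    "irods_authentication_scheme": "native"
}
EOF
}
```

For the example user, write_ida_config myuni bigproj kkayttaj would produce a file with the same fields and values as the one shown above.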

Then open a new iRODS connection to IDA with command:

iinit

To switch back to using HPC archive, give the command:

module unload irods/4.0.3

 

Example 4. Storing data to the IDA server

In this example, user kkayttaj copies a set of files from his $WRKDIR directory in Taito to his IDA directory. In the example we assume that the user belongs to organization jy (University of Jyväskylä) and IDA project jy1234. The IDA connection has been defined as described in chapter 3.2.2. After logging in to CSC the user sets up the iRODS commands and moves to the work directory:

kkayttaj@taito-login3:~> module load irods/4.0.3
kkayttaj@taito-login3:~> cd $WRKDIR

Then the user checks the content of the directory with the command ls and creates a new directory called proj27_data_1:

kkayttaj@taito-login3:/wrk/kkayttaj> ls
images27_a.jpg images27_b.jpg images27_c.jpg input27.dat  result27_a.out
result27_b.out result27_c.out
kkayttaj@taito-login3:/wrk/kkayttaj> mkdir proj27_data_1

Then the user copies the files he wants to preserve to the new directory:

kkayttaj@taito-login3:/wrk/kkayttaj> cp input27.dat proj27_data_1
kkayttaj@taito-login3:/wrk/kkayttaj> cp result27*.out proj27_data_1
kkayttaj@taito-login3:/wrk/kkayttaj> cp images27*.jpg proj27_data_1

After that the user checks that the new directory contains all the files that he wishes to store in the archive:

kkayttaj@taito-login3:/wrk/kkayttaj> ls proj27_data_1
images27_a.jpg images27_b.jpg  images27_c.jpg  input27.dat  result27_a.out
result27_b.out  result27_c.out

Next, the data to be stored is packed into a compressed tar archive file called proj27_data_1.tgz:

kkayttaj@taito-login3:/wrk/kkayttaj> tar zcvf proj27_data_1.tgz proj27_data_1


The resulting compressed file proj27_data_1.tgz can now be copied to the IDA service. Before copying the data, the user first creates a new sub-folder called proj27 on the IDA server:

kkayttaj@taito-login3:/wrk/kkayttaj> imkdir proj27

Next the user checks that the directory was created on the IDA server and changes the current IDA directory to the new proj27 directory:

kkayttaj@taito-login3:/wrk/kkayttaj> ils
/ida/jy/jy1234:
  .apdisk
  ._.apdisk
  data_file_1.tgz
  dataset_123.zip
  ._.TemporaryItems
  C- /ida/jy/jy1234/proj27
  C- /ida/jy/jy1234/published
  C- /ida/jy/jy1234/.TemporaryItems

kkayttaj@taito-login3:/wrk/kkayttaj> icd proj27

After this the user is ready to execute the iput command, which copies the file to the new directory on the IDA server:

kkayttaj@taito-login3:/wrk/kkayttaj> iput proj27_data_1.tgz

Once the data copying process is finished, the user checks that the file has been successfully copied to the archive:

kkayttaj@taito-login3:/wrk/kkayttaj> ils -l
/ida/jy/jy1234/proj27:
  kkayttaj            0 disk-1.4               1344214352 2013-03-25.13:15 & proj27_data_1.tgz

Once the data has been successfully copied to IDA, the local copies of the files can be removed:

kkayttaj@taito-login3:/wrk/kkayttaj> rm proj27_data_1.tgz
kkayttaj@taito-login3:/wrk/kkayttaj> rm -r proj27_data_1
kkayttaj@taito-login3:/wrk/kkayttaj> rm input27.dat
kkayttaj@taito-login3:/wrk/kkayttaj> rm result27*.out
kkayttaj@taito-login3:/wrk/kkayttaj> rm images27*.jpg

 

3.2.3 Using HPC Archive and IDA with Scientist's User Interface

CSC's parallel archiving systems HPC archive and IDA can both be accessed via the Scientist's User Interface web portal by using the portal's My Files tool (https://sui.csc.fi/group/sui/my-files). For instructions on data management with My Files, please see chapter 5.1.

 
