1.5 Disk environment

The CSC supercomputing environment allows researchers to analyse and manage large datasets. The supercluster taito.csc.fi and the supercomputer sisu.csc.fi share a common disk environment and directory structure, in which you can work with datasets that contain several terabytes of data. In Taito (and Sisu) you can store data in several personal disk areas. The disk areas available in Taito are listed in Table 1.2 and Figure 1.2 below. Knowing the basic features of the different disk areas is essential if you wish to use the CSC computing and storage services effectively. Note that in Taito all directories use the same Lustre-based file server (except $TMPDIR, which is local to each node). Thus all directories except $TMPDIR are visible to both the front-end nodes and the computing nodes of Taito.

In addition to the local directories in Taito, users have access to the CSC archive server, HPC archive, which is intended for long term data storage. The HPC archive server is used through the iRODS software (see the CSC Computing environment user's guide, Chapter 3.2).


Table 1.2 Standard user directories at CSC.

| Directory or storage area | Intended use | Default quota/user | Storage time | Backup |
|---|---|---|---|---|
| $HOME | Initialization scripts, source codes, small data files. Not for running programs or research data. | 50 GB | Permanent | Yes |
| $USERAPPL | Users' own application software. | 50 GB | Permanent | Yes |
| $WRKDIR | Temporary data storage. | 5 TB | 90 days | No |
| $TMPDIR | Temporary users' files, scratch, compiling. | | 2 days** | No |
| project | Common storage for project members. A project can consist of one or more user accounts. | On request. | Permanent | No |
| HPC archive* | Long term storage. | 2 TB | Permanent | Yes |


*The HPC-archive server is used through iRODS commands, and it is not mounted to Taito as a directory.
** This applies to the files on the login node $TMPDIR. The files in compute node $TMPDIR are kept for the duration of the batch job and deleted immediately after it.

The directories listed in the table above can be accessed with normal Linux commands, excluding the archive server, which is used through the iRODS software. The $HOME and $WRKDIR directories as well as the HPC archive service can also be accessed through the MyFiles tool of the Scientist's User Interface WWW service. The $USERAPPL directory is a subdirectory of $HOME.

When you are working on the command line, you can use automatically defined environment variables that contain the directory paths to the different disk areas (excluding the project disk, for which there is no environment variable). For example, to move to your work directory, you can type:

cd $WRKDIR

Similarly, you can copy a file data.txt to your work directory with the command:

cp data.txt $WRKDIR/

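The directory variables listed in Table 1.2 can be checked with a short loop. The following is only a sketch: the function name show_disk_areas is made up for this example, and outside the CSC environment $USERAPPL and $WRKDIR are normally not defined, in which case the function reports them as not set.

```shell
# Print the directory variables named in Table 1.2, one per line.
# show_disk_areas is a hypothetical helper, not a CSC command.
show_disk_areas() {
    for name in HOME USERAPPL WRKDIR TMPDIR; do
        # printenv succeeds only if the variable exists in the environment
        if value=$(printenv "$name"); then
            printf '%-9s %s\n' "$name" "$value"
        else
            printf '%-9s %s\n' "$name" "(not set)"
        fi
    done
}

show_disk_areas
```

The project disk is not included, because it has no environment variable.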
In the following chapters you can find more detailed introductions to the usage and features of different user specific disk areas.

 


1.5.1 Home directory

When you log in to CSC, your current directory will first be your home directory. The home directory should be used for initialization and configuration files and for frequently accessed small programs and files. The size of the home directory is rather limited (by default only 50 GB), since this directory is not intended for large datasets.

The files stored in the home directory will be preserved as long as the corresponding user account is valid. This directory is also backed up regularly so that the data can be recovered in the case of disk failures. Taito and Sisu servers share the same home directory. Thus if you modify settings files like .bashrc, the modifications will affect both servers.

Inside Linux commands, the home directory can be indicated by the tilde character (~) or by the environment variable $HOME. The command cd without any argument also returns you to your home directory.
 

1.5.2 Work directory

The work directory is a place where you can temporarily store large datasets that are actively used. By default, you can have up to 5 terabytes of data in it. This user-specific directory is indicated by the environment variable, $WRKDIR. The Taito and Sisu servers share the same $WRKDIR directory.

The $WRKDIR is NOT intended for long term data storage. Files that have not been used for 90 days will be automatically removed. If you want to keep some data in $WRKDIR for a longer time, you can copy it to the directory $WRKDIR/DONOTREMOVE. The files under this subdirectory will not be removed by the automatic cleaning process. Please note that the DONOTREMOVE directory is not intended for general data storage: use it ONLY for important data that is frequently needed. Backup copies are not taken of the contents of the work directory (including the DONOTREMOVE directory). Thus, if some files are accidentally removed by the user or lost due to a disk failure, the data is irreversibly lost.

Please do not use the touch command to refresh file timestamps, particularly if you have a lot of files: it is a metadata-heavy operation and will degrade $WRKDIR performance for all users.

$WRKDIR F.A.Q.

  • Q: Can I check what files the cleaning process is about to remove from my $WRKDIR directory?
    A: You can use the command show_old_wrkdir_files to check which files are in danger of being removed. For example, the commands below list the files that are older than 83 days and thus will be removed within the next seven days.
    show_old_wrkdir_files 83 > files_to_be_removed
    less files_to_be_removed
    The first command produces the list and writes it into a file. Please bear in mind that producing the list is a heavy operation so do it only when needed and refer to the file instead.
     
  • Q: I have extracted a zip/tar archive and the file dates are old. Will those files be removed immediately?
    A: No, extracted files have a grace time of 90 days.
     
  • Q: I've old reference data which I need for verification often. Are those removed?
    A: All files that have been accessed within the last 90 days are safe (read, open, write, append, etc.). The command stat filename shows the timestamps.
     
  • Q: How do I preserve an important dataset I have in $WRKDIR?
    A: Make a compressed tar file of your data and copy it to HPC archive (see chapter 3.2 of the CSC Computing environment user's guide).
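The packing step mentioned in the last answer could look roughly like the sketch below. The directory name my_dataset is hypothetical, and a temporary directory stands in for $WRKDIR so that the example can be run anywhere. Copying the resulting file to HPC archive is done with iRODS commands and is not shown here (see chapter 3.2 of the CSC Computing environment user's guide).

```shell
# Stand-in for $WRKDIR, so that the sketch does not touch real data.
workdir=$(mktemp -d)

# Hypothetical dataset directory with one small file in it.
mkdir -p "$workdir/my_dataset"
echo "example data" > "$workdir/my_dataset/data.txt"

# Pack and compress the directory into a single file.
# -C changes to $workdir first, so paths inside the archive are relative.
tar -czf "$workdir/my_dataset.tar.gz" -C "$workdir" my_dataset

# List the archive contents to check it before removing any originals.
tar -tzf "$workdir/my_dataset.tar.gz"
```

A single compressed file is also easier to move and to restore than a directory with many small files.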

 

 

1.5.3 Software installation directory

Users of CSC servers are free to install their own application software on CSC's computing servers. The software may be developed locally or downloaded from the internet. The main limitation is that the user must be able to do the installation without using the root user account. Further, the software must be installed in the user's own private disk areas instead of the common application directories like /usr/bin.

The user application directory $USERAPPL is a directory intended for installing user's own software. This directory is visible also to the computing nodes of the server, so software installed there can be used in batch jobs. Unlike the work directory, $USERAPPL is regularly backed up.

Sisu and Taito servers have separate $USERAPPL directories. This is reasonable, because if you wish to use the same software on both machines, you normally need to compile a separate version on each machine. The $USERAPPL directories reside in your home directory and are called appl_sisu and appl_taito. These directories are actually visible to both Sisu and Taito servers. However, in Taito the $USERAPPL variable points to $HOME/appl_taito, and in Sisu to $HOME/appl_sisu.
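As an illustration of installing without root rights, the sketch below places a program under $USERAPPL and adds it to the command path. The name mytool is hypothetical, the tool itself is a two-line stand-in script, and a temporary directory is used when $USERAPPL is not defined (for example outside CSC). For real software, the build system's own option for choosing the installation directory would be used instead; with many packages this is an option such as --prefix, but check the installation instructions of the software in question.

```shell
# Use $USERAPPL when it is defined, otherwise a temporary stand-in directory.
appl_dir="${USERAPPL:-$(mktemp -d)}"

# Hypothetical installation directory for a tool called "mytool".
prefix="$appl_dir/mytool"
mkdir -p "$prefix/bin"

# Stand-in for the real build and install step.
cat > "$prefix/bin/mytool" <<'EOF'
#!/bin/sh
echo "mytool ok"
EOF
chmod +x "$prefix/bin/mytool"

# Make the installed program available by name in this shell session.
export PATH="$prefix/bin:$PATH"
mytool
```

Because $USERAPPL is visible to the computing nodes, the same PATH setting can be used inside a batch job script.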


Figure 1.2 Storage environment in Sisu and Taito computers.

1.5.4 Monitoring disk usage


The amount of data that can be stored to different disk areas is limited either by user specific quotas or by the amount of available free disk space. You can check your disk usage and quotas with the command:

quota

The quota command also shows your disk quotas in the different areas. If a disk quota is exceeded, you cannot add more data to the directory. In some directories, the quota can be slightly exceeded temporarily, but after a so-called grace period, the disk usage must be returned to the accepted level.

When a disk area fills up, you should remove unnecessary files, compress existing files and/or move them to the archive server. If you have well-justified reasons to use more disk space than what your quotas allow, you should send a request to the CSC resource manager (resource_at_csc.fi).

When one of your directories is approaching the quota limit, it is reasonable to check which files or folders take up the most space. To list the files in your current directory ordered by size, give the command:

ls -lSrh

Note however, that this command does not tell how much disk space the files in the subdirectories use. Thus it is often more useful to use the command du (disk usage) instead. You can, for example, try command:

du -sh ./*
This command returns the size of each file or the total disk usage of each subdirectory in your current directory. You can also combine du with sort to see what file or directory is the largest item in your current directory:
du -s ./* | sort -n
Note that the du command checks all the files under your current directory, so running it may in some cases take several minutes.
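The du and sort combination above can be wrapped into a small helper that prints human-readable sizes. This is a sketch only: the function name largest_items is made up for this example, and it assumes GNU sort, whose -h option orders sizes such as 4.0K, 2.1M and 1.5G correctly. It also assumes that the directory contains at least one non-hidden file or subdirectory.

```shell
# Show the five largest items in a directory, largest last.
# $1 is the directory to check; the current directory is used if it is omitted.
# largest_items is a hypothetical helper, not a CSC command.
largest_items() {
    du -sh "${1:-.}"/* | sort -h | tail -n 5
}

# Usage, for example in the work directory:
#   largest_items "$WRKDIR"
```

As with plain du, the function reads every file under the given directory, so it can be slow in directories with many files.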

 
