1.5 Disk environment

Supercomputer sisu.csc.fi and supercluster taito.csc.fi have a common disk environment and directory structure that allows researchers to analyze and manage large datasets. A default CSC user account allows working with datasets that contain up to five terabytes of data. In Sisu you can store data to the personal disk areas listed in table 1.2 and figure 1.2. Knowing the basic features of different disk areas is essential if you wish to use the CSC computing and storage services effectively.

In Sisu, all directories use the same Lustre-based file server. Thus all directories are visible to both the front-end nodes and the computing nodes of Sisu. In addition to the local directories in Sisu, users have access to the HPC archive server, which is intended for long-term data storage (see CSC Computing environment user's guide, Chapter 3.2).

Table 1.2 Standard user directories at CSC.

Directory or storage area | Intended use | Default quota/user | Storage time | Backup
--------------------------|--------------|--------------------|--------------|-------
$HOME | Initialization scripts, source codes, small data files. Not for running programs or research data. | 50 GB | Permanent | Yes
$USERAPPL | Users' own application software. | 50 GB | Permanent | Yes
$WRKDIR | Temporary data storage. | 5 TB | 90 days | No
$TMPDIR | Temporary users' files. | | 2 days | No
project | Common storage for project members. A project can consist of one or more user accounts. | On request. | Permanent | No
HPC Archive* | Long term storage. | 2 TB | Permanent | Yes

*The Archive server is used through iRODS commands, and it is not mounted to Sisu as a directory.

 

The directories listed in the table above can be accessed with normal Linux commands, excluding the HPC archive server, which is used through the iRODS software. The $HOME and $WRKDIR directories can also be accessed through the MyFiles tool of the Scientist's User Interface WWW service.

When you are working on the command line, you can use automatically defined environment variables that contain the directory paths to the different disk areas (excluding the project disk, for which there is no environment variable). For example, to move to your work directory, write:

cd $WRKDIR

Similarly, a file data.txt can be copied to your work directory with the command:

cp data.txt $WRKDIR/

In the following chapters you can find more detailed introductions to the usage and features of different user specific disk areas.
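
To see what these variables contain in your own session, you can print them. The sketch below is only an illustration: show_disk_areas is a hypothetical helper name, not a CSC command, and the variable names are the ones listed in table 1.2. A variable that is not defined in the current shell is reported as "(not set)".

```shell
# Print the path stored in each disk-area variable named in table 1.2.
# "show_disk_areas" is a hypothetical helper name used only in this sketch.
show_disk_areas() {
    for name in HOME USERAPPL WRKDIR TMPDIR; do
        # Look up the variable whose name is stored in $name (empty if unset)
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%-9s %s\n' "$name" "$value"
        else
            printf '%-9s %s\n' "$name" "(not set)"
        fi
    done
}

show_disk_areas
```

Quoting the variables in your own commands (for example cd "$WRKDIR") avoids surprises if a path ever contains spaces.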

 

1.5.1 Home directory

When you log in to CSC, your current directory will first be your home directory. The home directory should be used for initialization and configuration files and other frequently accessed small programs and files. The size of the home directory is rather limited (50 GB by default), since this directory is not intended for large datasets.

The files stored in the home directory will be preserved as long as the corresponding user account is valid. The home directory is also backed up regularly so that the data can be recovered in the case of disk failures. The Sisu and Taito servers share the same home directory. Thus, if you modify settings files like .bashrc, the modifications will take effect on both servers.

In Linux commands, the home directory can be indicated by the tilde character (~) or by the environment variable $HOME. The command cd without any argument also returns you to your home directory.
 

1.5.2 Work directory

The work directory is a place where you can temporarily store large datasets that are actively used. By default, you can have up to 5 terabytes of data in it. This user-specific directory is indicated by the environment variable, $WRKDIR. The Taito and Sisu servers share the same $WRKDIR directory.

The $WRKDIR is NOT intended for long-term data storage. Files that have not been used for 90 days are removed automatically. If you want to keep some data in $WRKDIR for a longer period, you can copy it to the directory $WRKDIR/DONOTREMOVE. Files under this subdirectory are not removed by the automatic cleaning process. Please note that the DONOTREMOVE directory is not intended for general data storage; use it ONLY for important data that is needed frequently. No backup copies are taken of the contents of the work directory (including the DONOTREMOVE directory). Thus, if files are accidentally removed by the user or lost due to a disk failure, the data is irreversibly lost.
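
Copying data to the protected subdirectory could look like the sketch below. The helper name keep_in_wrkdir and the file name in the usage comment are hypothetical; only the $WRKDIR/DONOTREMOVE path comes from the text above. The mkdir -p call is a precaution in case the subdirectory does not exist yet, and it does nothing if it already does.

```shell
# Copy a file or directory into $WRKDIR/DONOTREMOVE.
# "keep_in_wrkdir" is a hypothetical helper name, not a CSC command.
keep_in_wrkdir() {
    src="$1"
    if [ -z "${WRKDIR:-}" ]; then
        echo "WRKDIR is not set" >&2
        return 1
    fi
    # Create the subdirectory if needed (no effect if it already exists)
    mkdir -p "$WRKDIR/DONOTREMOVE" || return 1
    # -R copies directories recursively; plain files are copied as usual
    cp -R "$src" "$WRKDIR/DONOTREMOVE/"
}

# Usage (hypothetical file name):
#   keep_in_wrkdir reference_genome.fa
```

Remember that the copy is still not backed up, so anything irreplaceable also belongs in the HPC archive.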

Please do not use the touch command on your files, particularly if you have a lot of them: it is a metadata-heavy operation and degrades $WRKDIR performance for all users.

$WRKDIR F.A.Q.

  • Q: Can I check what files the cleaning process is about to remove from my $WRKDIR directory?
    A: You can use the command show_old_wrkdir_files to check which files are in danger of being removed. For example, the commands below list the files that are older than 83 days and will thus be removed within the next seven days.
    show_old_wrkdir_files 83 > files_to_be_removed
    less files_to_be_removed
    The first command produces the list and writes it into a file. Please bear in mind that producing the list is a heavy operation so do it only when needed and refer to the file instead.
     
  • Q: I have extracted a zip/tar archive and the dates of the extracted files are old. Are those files removed immediately?
    A: No, extracted files have a grace period of 90 days.
     
  • Q: I've old reference data which I need for verification often. Are those removed?
    A: All files that have been accessed within 90 days are safe (read, open, write, append, etc.). The command stat filename shows the timestamps of a file.
     
  • Q: How do I preserve an important dataset I have in $WRKDIR?
    A: Make a compressed tar file of your data and copy it to the HPC archive (see chapter 3.2 of the CSC Computing environment user guide).
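
The packing step from the last answer can be sketched as follows. The helper name pack_dataset and the directory name in the usage comment are hypothetical. The sketch assumes a tar command with gzip support (the -z option), as found on Linux systems. Copying the resulting file to the HPC archive is done with the iRODS commands described in chapter 3.2 and is not shown here.

```shell
# Pack a directory into a gzip-compressed tar file next to it and check the result.
# "pack_dataset" is a hypothetical helper name, not a CSC command.
pack_dataset() {
    dir="${1%/}"                      # strip a trailing slash, if any
    archive="$dir.tar.gz"
    # -C makes the stored paths relative to the parent of the dataset directory
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    # Listing the archive is a quick check that it can be read back
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}

# Usage (hypothetical directory name):
#   pack_dataset "$WRKDIR/simulation_results"
```

A single tar file is also kinder to the file system than thousands of small files, since it needs far fewer metadata operations.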

1.5.3 Software installation directory

Users of CSC servers are free to install their own application software on CSC's computing servers. The software may be developed locally or downloaded from the internet. The main limitation for the software installation is that the user must be able to do the installation without using the root user account. Further, the software must be installed on user's own private disk areas instead of the common application directories like /usr/bin.

The user application directory $USERAPPL is intended for installing user's own software. This directory is visible also to the computing nodes of the server, so software installed there can be used in batch jobs.

The Sisu and Taito servers have separate $USERAPPL directories. This is reasonable: if you wish to use the same software on both machines, you normally need to compile a separate version of the software on each machine. The $USERAPPL directories reside in your home directory and are called appl_sisu and appl_taito. Both directories are actually visible to both the Sisu and Taito servers. However, in Sisu the $USERAPPL variable points to $HOME/appl_sisu, and in Taito to $HOME/appl_taito.
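
Because $USERAPPL resolves to a different directory on each server, it is convenient to build installation paths from the variable instead of hard-coding them. The sketch below is an illustration only: userappl_prefix and the tool name mytool are hypothetical, and the commented build commands assume a package that uses an autotools-style configure script, which may not match your software.

```shell
# Print the installation prefix for a tool under $USERAPPL.
# "userappl_prefix" and the tool name "mytool" are hypothetical.
userappl_prefix() {
    if [ -z "${USERAPPL:-}" ]; then
        echo "USERAPPL is not set" >&2
        return 1
    fi
    printf '%s/%s\n' "$USERAPPL" "$1"
}

# Typical use with an autotools-style package (assumed build system):
#   ./configure --prefix="$(userappl_prefix mytool)"
#   make && make install
#   export PATH="$(userappl_prefix mytool)/bin:$PATH"
```

The same commands then install into appl_sisu when run in Sisu and into appl_taito when run in Taito.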


Figure 1.2 Storage environment in Sisu and Taito computers.
 

More information regarding user application directory $USERAPPL can be found in the CSC Computing environment user guide, especially in chapter 3.1.5, which includes practical examples of installing own software in $USERAPPL.

CSC can help you with installing your own software or tools. Please don't hesitate to contact the Service Desk.

 

1.5.4 Monitoring disk usage


The amount of data that can be stored to different disk areas is limited either by user specific quotas or by the amount of available free disk space. You can check your disk usage and quotas with the command:

quota

The quota command shows both your current disk usage and your quotas for the different disk areas. If a disk quota is exceeded, you cannot add more data to the directory. In some directories, the quota can be slightly exceeded temporarily, but after a so-called grace period, the disk usage must be returned to the accepted level.

When a disk area fills up, you should remove unnecessary files, compress existing files and/or move them to the HPC archive server. If you have well-justified reasons to use more disk space than what your quotas allow, you should send a request to the CSC resource manager (resource_at_csc.fi).

When one of your directories is approaching its quota limit, it is reasonable to check which files or folders require the most space. To list the files in your current directory ordered by size, give the command:

ls -lSrh

Note, however, that this command does not tell how much disk space the files in the subdirectories use. Thus it is often more useful to use the command du (disk usage) instead. You can, for example, try the command:

du -sh ./*

This command returns the size of each file or the total disk usage of each subdirectory in your current directory. You can also combine du with sort to see which file or directory is the largest item in your current directory:

du -s ./* | sort -n

Note that the du command checks all the files under your current directory, so running it may in some cases take several minutes.
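
As a sketch of the du and sort combination described above, the helper below lists the largest items under a directory, largest first. The name largest_items is hypothetical, not a CSC command. The sketch assumes the standard du, sort and head utilities. Sizes are printed in kilobytes so that a plain numeric sort works, and items whose names start with a dot are not matched by the * pattern.

```shell
# List the n largest items (files or subdirectories) directly under a directory.
# "largest_items" is a hypothetical helper name used only in this sketch.
largest_items() {
    dir="${1:-.}"     # directory to inspect, default: current directory
    n="${2:-5}"       # number of items to show, default: 5
    # du -sk: one total per item, in kilobytes; sort -rn: largest first
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Usage:
#   largest_items "$WRKDIR" 10
```

Like any du run, this reads the metadata of every file under the directory, so use it only when needed on large directory trees.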

 
