Example case 2: Data lifecycle at CSC – from collection to preservation

This research project collects data from sensors into a data store. The group expects to collect about 100 TiB of data, which will be processed in batches to create a summary dataset that is published on the web as dynamic data. The research group therefore needs to collect a large amount of sensor data and to process, open and preserve it.

This can be done by using several CSC services together:

  • Allas Object Storage via cPouta: Collecting sensor data
  • CSC computing environment: Data analysis
  • cPouta: Opening and sharing data in the cloud (dynamic data), for example by installing a web service
  • Fairdata Services: Data preservation (static files after the project is finished). The first step is to bring the data into IDA and describe it with Qvain Light, which makes it possible to get an identifier (DOI) for the dataset. The Digital Preservation Service for Research Data requires a contract.

Policies require publishing the data according to the FAIR principles. The published datasets (about 10 TiB) will be proposed for preservation.
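To get a feeling for the volumes in this case (about 100 TiB collected, about 10 TiB published), the arithmetic can be sketched as below. The sustained transfer rate of 100 MiB/s is purely an assumption for illustration; real rates depend on the network and the service used.

```python
TIB = 2**40  # bytes in one tebibyte
MIB = 2**20  # bytes in one mebibyte

collected_bytes = 100 * TIB  # about 100 TiB of raw sensor data (from the case)
published_bytes = 10 * TIB   # about 10 TiB proposed for preservation (from the case)


def transfer_days(size_bytes: int, rate_mib_per_s: float) -> float:
    """Days needed to move size_bytes at a sustained rate given in MiB/s."""
    seconds = size_bytes / (rate_mib_per_s * MIB)
    return seconds / 86400


# Share of the collected data that ends up in the published dataset.
published_share = published_bytes / collected_bytes

# Assumed rate (hypothetical): 100 MiB/s sustained.
days_for_published = transfer_days(published_bytes, 100)
days_for_collected = transfer_days(collected_bytes, 100)

print(f"Published share: {published_share:.0%}")
print(f"Days to move published data: {days_for_published:.1f}")
print(f"Days to move collected data: {days_for_collected:.1f}")
```

Under this assumed rate, moving the full collected volume takes ten times as long as moving the published part, which is one reason to plan the transfers between services early.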

There is no personal information in this dataset.

There is no sensitive data in this dataset.

The project will be using CSC computing environments.

Service | Getting access | Benefits for this case | Requirements for the user

 

Allas Object Storage

Collecting sensor data for temporary storage

Create a CSC user account and a CSC project, apply for access to Allas, and apply for extra storage and billing units

Adhere to General Terms of Use for CSC's Services for Research

Project lifetime data storage. 

Good accessibility and capacity. Integrity control.

The data can have different levels of access control and can be accessed on the CSC servers as well as from anywhere on the internet.

 

 

The user is responsible for managing the data and taking backups.
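Because managing the data and its backups is left to the user, it can help to keep a checksum manifest next to each copy of the data. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of any CSC tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List files from 'before' that are missing or differ in 'after'."""
    return sorted(name for name in before if after.get(name) != before[name])
```

A manifest built before uploading can be compared with one built from a downloaded or backed-up copy; an empty result from `changed_files` suggests that the copy matches the original.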

cPouta

Opening and sharing data in cloud

Create a CSC user account and a CSC project, apply for access to cPouta, and apply for extra storage and billing units

Adhere to General Terms of Use for CSC's Services for Research and to Pouta terms of use

A virtual machine (VM) running in cPouta can be used as the platform to which the sensors send their data. The VM processes the data and stores it in Allas.

Another VM can be set up to allow external users to access the collected data and some light-weight analysis tools.

cPouta allows users to launch and manage their own Linux-based VMs. This allows users to:

  1. Use the root account to install software and to freely define the settings of the server
  2. Run servers (e.g. web servers or databases that can be opened to external users)
  3. Ensure that certain (limited) computing capacity is available whenever needed.

For data storage, the user can attach a volume to a VM. The VM root disk is not suitable for data storage.

The users must be able to independently build and maintain the VM and the servers that run on it.

The user must also set up connections to the other services that the VM uses.

The server requires continuous maintenance and monitoring by the user.
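The collecting VM described above, which receives sensor readings and stores them in Allas, could be organised roughly as in the sketch below. It only shows the buffering and object naming. The `upload` callable, the `raw/YYYY/MM/DD/` key layout and the JSON Lines format are assumptions made for illustration; in a real setup `upload` would wrap whichever object storage client the project uses.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List


class BatchingCollector:
    """Buffers sensor readings and hands them to an uploader in batches."""

    def __init__(self, upload: Callable[[str, bytes], None], batch_size: int = 1000):
        self.upload = upload  # hypothetical: stores bytes under an object key
        self.batch_size = batch_size
        self.buffer: List[Dict] = []
        self.batches_written = 0

    def add(self, sensor_id: str, value: float, timestamp: datetime) -> None:
        """Queue one reading and flush when the batch is full."""
        self.buffer.append(
            {"sensor": sensor_id, "value": value, "time": timestamp.isoformat()}
        )
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the buffered readings as one JSON Lines object."""
        if not self.buffer:
            return
        # Use the date of the first reading in the batch for the key prefix.
        day = self.buffer[0]["time"][:10].replace("-", "/")
        key = f"raw/{day}/batch-{self.batches_written:06d}.jsonl"
        body = "\n".join(json.dumps(r) for r in self.buffer).encode("utf-8")
        self.upload(key, body)
        self.batches_written += 1
        self.buffer = []


# Example usage with an in-memory dict standing in for the object store.
store: Dict[str, bytes] = {}
collector = BatchingCollector(upload=store.__setitem__, batch_size=2)
now = datetime(2024, 5, 17, 12, 0, tzinfo=timezone.utc)
collector.add("s1", 20.5, now)
collector.add("s2", 21.0, now)
collector.add("s1", 20.7, now)
collector.flush()  # write the last, partially filled batch
print(sorted(store))
```

Writing many small readings as fewer, larger objects keeps the number of objects manageable, and a date-based key prefix makes it easy to pick out one batch for later analysis.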

CSC computing environment

Data analysis in CSC environment

Create a CSC user account and a CSC project, apply for access to the CSC computing environment, and apply for extra storage and billing units

Adhere to General Terms of Use for CSC's Services for Research 

 

Data can be copied from Allas to the CSC supercomputing environment for analysis. Mahti and Puhti can be used for tasks that require exceptional resources, such as heavy parallel computing or large amounts of memory.

Powerful large and medium scale simulations, for example of nuclear fusion, material science, fluid and molecular dynamics

Effective data intensive computing e.g. in bioinformatics, digital humanities and geosciences.

GPU partition for handling demanding artificial intelligence tasks with fast and easy access to large data sets.

The user needs to be able to work on the Linux command line to import and export data, and to submit the actual computing tasks through the batch job system.

As supercomputers are used through batch job systems, they are not suitable for tasks that require instant access to the computing resources.

After the computation, data has to be copied to Allas. 
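The copy in, compute, copy out pattern described above can be sketched as a small generator for a batch job script. This assumes that the batch job system is Slurm. The project name, partition, resource values and the three commands in the usage example are hypothetical placeholders, not values taken from this case.

```python
from typing import List


def render_batch_script(account: str, partition: str, time_limit: str,
                        mem_gb: int, stage_in: str, compute: str,
                        stage_out: str) -> str:
    """Render a Slurm batch script that stages data in, computes, and stages out."""
    lines: List[str] = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --mem={mem_gb}G",
        "",
        "set -e  # stop at the first failing step",
        "",
        "# 1. Copy the input batch from object storage",
        stage_in,
        "",
        "# 2. Run the analysis",
        compute,
        "",
        "# 3. Copy the results back to object storage",
        stage_out,
    ]
    return "\n".join(lines) + "\n"


# Example usage; every value below is a hypothetical placeholder.
script = render_batch_script(
    account="project_0000000",
    partition="small",
    time_limit="02:00:00",
    mem_gb=16,
    stage_in="./fetch_batch.sh batch-000001",
    compute="python summarize.py batch-000001",
    stage_out="./store_summary.sh batch-000001",
)
print(script)
```

Keeping the staging steps inside the job script means that each batch job is self-contained: it does not start computing before its input has arrived, and its results do not remain only on the supercomputer's disk.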

 

Fairdata Services

Data publication and preservation

  1. Create a CSC user account and CSC project
  2. Apply for access right to IDA

Adhere to General Terms of Use for CSC's Services for Research and to Fairdata IDA terms of use

 

Easy to use web browser interface for data upload, download etc.

On Puhti and Mahti, users can transfer data to IDA with IDA's command line tools.

Enables saving, organizing and sharing data within the project group and storing the data in an immutable state. Data can be constructed as a dataset via additional services.

Helps you to transform research data documentation into an interoperable, machine-readable format.

Provides download links for data directly from IDA (temporary sharing) or via the Etsin research data finder. A published dataset gets a DOI and a landing page in Etsin.

Offers one way to transfer data into the Digital Preservation Service for Research Data.

The actual data (files) in the published dataset can be made openly downloadable by others, or only the documentation (metadata) of the dataset can be published.

The user needs a web browser and an internet connection. For large amounts of data, the user may need to learn how to use IDA's command line tools for data transfer.