Example case 2: Data lifecycle at CSC – from collection to preservation - Services for Research
This research project collects sensor data into a data store. The group expects to collect about 100 TiB of data, which will be processed in batches to produce a summary dataset that is published on the web as dynamic data. In short, the research group needs to collect, process, open and preserve a large amount of sensor data.
This can be done by using several CSC services together:
- Allas Object Storage via cPouta: Collecting sensor data
- Data analysis in CSC computing environment
- cPouta: Opening and sharing data in the cloud (dynamic data), for example by installing a web service.
- Fairdata Services: Data preservation (static files after the project is finished). The first step is to bring the data into IDA and describe it with Qvain Light, which makes it possible to get an identifier (DOI) for the dataset. The Digital Preservation Service for Research Data requires a contract.
Policies require that the data is published according to the FAIR principles. The published datasets (10 TiB) will be proposed for preservation.
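The volumes above are given in TiB, while quotas and network rates are often quoted in other units. The following is a minimal planning sketch, not part of any CSC tooling; the function names and the 100 MiB/s transfer rate used in the note below are assumptions chosen only for illustration.

```python
TIB = 2**40  # bytes in one tebibyte
TB = 10**12  # bytes in one terabyte
MIB = 2**20  # bytes in one mebibyte


def tib_to_bytes(tib: float) -> int:
    """Convert a size in TiB to bytes."""
    return int(tib * TIB)


def tib_to_tb(tib: float) -> float:
    """Convert a size in TiB to decimal terabytes."""
    return tib * TIB / TB


def transfer_hours(size_tib: float, rate_mib_per_s: float) -> float:
    """Estimate transfer time in hours at a sustained rate (assumed, not measured)."""
    seconds = size_tib * TIB / (rate_mib_per_s * MIB)
    return seconds / 3600


if __name__ == "__main__":
    print(f"Collected: {tib_to_tb(100):.1f} TB")
    print(f"Published: {tib_to_tb(10):.1f} TB")
    print(f"100 TiB at 100 MiB/s: {transfer_hours(100, 100) / 24:.1f} days")
```

Under these assumptions, 100 TiB is about 110 TB, and moving it at a sustained 100 MiB/s would take roughly 12 days, which is worth keeping in mind when scheduling transfers between services.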
There is no personal information in this dataset.
There is no sensitive data in this dataset.
The project will be using CSC computing environments.
|Service||Getting access||Benefits for this case||Requirements for the user|
Collecting sensor data for temporary storage
Project lifetime data storage.
Good accessibility and capacity. Integrity control.
The data can have different levels of access control and can be accessed on the CSC servers as well as from anywhere on the internet.
Responsibility for managing the data and taking backups.
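Because the user is responsible for managing the data and taking backups, it can help to record checksums before files are copied between services. The sketch below is one possible approach using only the Python standard library; it is not an Allas feature, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path relative to root (POSIX style) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files that are missing or have a different checksum in the second manifest."""
    return sorted(name for name, checksum in before.items() if after.get(name) != checksum)
```

A manifest built before upload can be compared with one built from a later download or backup copy; an empty result from `changed_files` indicates that the copies match.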
Opening and sharing data in cloud
A Virtual Machine (VM) running in cPouta can be used as the platform to which the sensors send their data. The VM processes the data and stores it in Allas.
Another VM can be set up to allow external users to access the collected data and some light-weight analysis tools.
cPouta allows users to launch and manage their own Linux-based VMs. This allows users to:
For data storage, the user can attach a volume to a VM. The VM root disk is not suitable for data storage.
The users must be able to independently build and maintain the VM and the servers that run on the VM.
The user must also set up connections to the other services that the VM uses.
The server requires continuous maintenance and monitoring by the user.
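The ingest step on the VM (receive sensor readings, process them, store them in Allas) could be sketched as follows. This is an illustration under stated assumptions: the reading format (`sensor_id`, `timestamp`), the `raw/<sensor>/<date>/<hour>.jsonl` object naming, and the `upload` callable are all hypothetical. The actual upload would go through whichever Allas client the project chooses, so it is passed in as a function here rather than implemented.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import Callable, Iterable


def object_key(sensor_id: str, timestamp: float) -> str:
    """Build a hypothetical object name: one object per sensor per UTC hour."""
    hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y/%m/%d/%H")
    return f"raw/{sensor_id}/{hour}.jsonl"


def batch_readings(readings: Iterable[dict]) -> dict[str, bytes]:
    """Group readings into newline-delimited JSON payloads keyed by object name."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[object_key(reading["sensor_id"], reading["timestamp"])].append(reading)
    return {
        key: "\n".join(json.dumps(row, sort_keys=True) for row in rows).encode("utf-8")
        for key, rows in grouped.items()
    }


def flush(batches: dict[str, bytes], upload: Callable[[str, bytes], None]) -> int:
    """Send every batch through the supplied uploader; return the number of objects sent."""
    for key, payload in batches.items():
        upload(key, payload)
    return len(batches)
```

Grouping readings into hourly objects is a design choice made for this sketch: it keeps the object count manageable compared with writing one object per reading. A different granularity may suit other sensor rates better.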
Data analysis in CSC environment
Data can be copied from Allas to the CSC supercomputing environment for analysis. Mahti and Puhti can be used for tasks that require exceptional resources, such as heavily parallel computing or large amounts of memory.
Powerful large and medium scale simulations, for example of nuclear fusion, material science, fluid and molecular dynamics
Effective data intensive computing e.g. in bioinformatics, digital humanities and geosciences.
GPU partition for handling demanding artificial intelligence tasks with fast and easy access to large data sets.
The user needs to be able to work on the Linux command line to import and export data and to submit the actual computing tasks through the batch job system.
As supercomputers are used through batch job systems, they are not suitable for tasks that require instant access to the computing resources.
After the computation, the data has to be copied to Allas.
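To illustrate what submitting work through a batch job system involves, the sketch below renders a job script using Slurm-style `#SBATCH` directives. It assumes that the batch system accepts these standard Slurm options; the project name, partition name and command are hypothetical placeholders, and some options (memory, for instance) may not apply on systems that allocate whole nodes.

```python
def render_batch_script(
    account: str,
    partition: str,
    time_limit: str,
    cpus: int,
    mem_gb: int,
    command: str,
    job_name: str = "summary",
) -> str:
    """Render a Slurm-style batch script as text; nothing is submitted here."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",      # billing project (placeholder value in usage)
        f"#SBATCH --partition={partition}",  # queue name depends on the system
        f"#SBATCH --time={time_limit}",      # wall-clock limit, HH:MM:SS
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        "",
        "set -euo pipefail",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical examples.
    print(render_batch_script(
        account="project_2001234",
        partition="small",
        time_limit="02:00:00",
        cpus=4,
        mem_gb=16,
        command="python3 summarize.py input/ output/",
    ))
```

The rendered text would be saved to a file and handed to the batch system's submit command. Because the job then waits in a queue, the copy from Allas and the copy of results to Allas are typically separate steps before and after the job.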
Data publication and preservation
Easy to use web browser interface for data upload, download etc.
On Puhti and Mahti, the user can transfer data to IDA with IDA's command line tools.
Enables saving, organizing and sharing data within the project group and storing the data in an immutable state. Data can be constructed as a dataset via additional services.
Helps you to transform research data documentation into an interoperable, machine-readable format.
Provides download links for data directly from IDA (temporary sharing) or via Etsin research data finder. Published dataset gets a DOI and a landing page in Etsin.
One way to transfer data into the Digital Preservation Service for Research Data.
The actual data (files) in the published dataset can either be made openly downloadable by others, or only the documentation (metadata) of the dataset can be published.
The user needs a web browser and an internet connection. For large amounts of data, the user may need to learn how to use IDA's command line tools for data transfer.