Artificial Intelligence enriches digitised archive materials

The main objective of the High-Performance Digitisation Initiative by CSC, the National Library and the National Archive is to create an intelligent annotation pipeline for processing and enriching archived material, such as scanned newspapers, books, images and official documents. Harnessing new artificial intelligence and machine learning technologies offers an opportunity to substantially enhance data quality and thereby increase the value of the data.

Efficient use of the growing body of digital data is currently seriously hampered by inadequate search functions and poor findability, which stem from deficits in data quality and missing metadata. Metadata has traditionally been added manually, and errors or ambiguities in the data, such as those that occur very frequently in digitisation processes, have made full-text search and automatic annotation difficult.

The new annotation pipeline runs in CSC's supercomputing environment and uses high-performance, GPU-accelerated machine learning methods for computer vision and AI-based annotation. The HPD Initiative collects the required data together with collaborating archive data sources, trains advanced machine learning models, implements a production-quality software pipeline and service, and finally provides integration back to the data sources.

The pipeline will be developed into a service offered to memory organisations such as libraries and archives. It will be released as open source software and taken into sustainable production use on CSC's cloud computing platform.

The end users of the solution are citizens, researchers, businesses and public administration, who will reach the material via the national portals of the National Library and the National Archives in Finland; the material will also be available for harvesting by the European Data Portal. The added value for end users lies in the annotated documents and especially images: the vast masses of scanned archive images can be considered inaccessible in their current state, because the human labour required to organise and discover content is very high. The High-Performance Digitisation Initiative provides a computational solution to the problem and opens up unique datasets for public use and refinement.

The High-Performance Digitisation Initiative is funded by the Innovation and Networks Executive Agency under the European Commission.

More information

High-Performance Digitisation Initiative

Aleksi Kallio, CSC, aleksi.kallio@csc.fi, tel. 050 3845158

Vili Haukkovaara, the National Archive of Finland, vili.haukkovaara@arkisto.fi, tel. 0295337019

Heli Kautonen, the National Library of Finland, heli.kautonen@helsinki.fi, tel. 050 3102654