JRA3 - Tools for Scientific Data Services
Storing large amounts of data spread across a large number of files is a common task for many scientific applications. Moreover, users need to find their way through the stored data, i.e. there is a need for extra information describing the data, so-called metadata. Based on the metadata, users should be able to construct search criteria in order to retrieve pieces of the stored data for later processing. Finally, the storage infrastructure should allow for the execution of post-processing or evaluation tasks. This constitutes the basic motivation for the work performed within the JRA3 activity. The main objectives of the project are:
• develop tools for distributed storage and transparent access of highly complex data sets
• develop innovative solutions for scientific data management, search and filtering in order to improve its scientific use
• develop tools for high-level data analysis (feature extraction, statistics, time series) to investigate and describe complex data relationships
• develop portal solutions for convenient access and graphical interaction
The data storage infrastructure is based on iRODS (1). The current setup has five separate zones: CINECA (Bologna), UEDIN-EPCC (Edinburgh), Parallab (Bergen), PSNC (Poznan) and TCD (Dublin). The UEDIN-EPCC, Parallab and TCD zones are federated.
Using Jargon, the Java API for iRODS, a standalone command-line client and a web client (based on the GridSphere (2) framework) were developed with the following basic functionality:
• users can insert data (and corresponding metadata) into the infrastructure
• users can list the contents of a data resource (get a list of the stored data objects)
• users can download (fetch) a data object from the infrastructure
• users can search for data objects having a particular set of metadata values, or do advanced search by combining various metadata attribute conditions
• users can download metadata only, and can modify the metadata information stored in the system
• users can execute selected post-processing and analysis tasks using the rule/micro-service mechanism of iRODS
• users can visualize the data (a separate service implemented using PRODS, the PHP interface to iRODS)
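The advanced search described in the list above, which combines several metadata attribute conditions, can be illustrated with a minimal Python sketch. It does not use iRODS or Jargon: the in-memory catalog, the `DataObject` class, the helper names (`equals`, `greater_than`, `search`), the paths and the attribute names are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a stored data object and its metadata.
@dataclass
class DataObject:
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)

# A condition is a predicate over one object's metadata dictionary.
Condition = Callable[[Dict[str, str]], bool]

def equals(attr: str, value: str) -> Condition:
    """Match objects whose attribute has exactly the given value."""
    return lambda md: md.get(attr) == value

def greater_than(attr: str, threshold: float) -> Condition:
    """Match objects whose numeric attribute exceeds the threshold."""
    def check(md: Dict[str, str]) -> bool:
        try:
            return float(md[attr]) > threshold
        except (KeyError, ValueError):
            return False  # missing or non-numeric attribute never matches
    return check

def search(catalog: List[DataObject], conditions: List[Condition]) -> List[str]:
    """Return the paths of all objects that satisfy every condition (logical AND)."""
    return [obj.path for obj in catalog
            if all(cond(obj.metadata) for cond in conditions)]

# Illustrative catalog; paths and attribute names are made up.
catalog = [
    DataObject("/zoneA/co2/run001.dat", {"application": "co2", "timestep": "10"}),
    DataObject("/zoneA/co2/run002.dat", {"application": "co2", "timestep": "250"}),
    DataObject("/zoneA/turb/run001.dat", {"application": "turbulence", "timestep": "300"}),
]

hits = search(catalog, [equals("application", "co2"), greater_than("timestep", 100)])
print(hits)  # ['/zoneA/co2/run002.dat']
```

In this sketch a simple search is a single `equals` condition, and an advanced search is a list of conditions that must all hold.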
All server-side operations are implemented using so-called micro-services (C functions compiled and installed into the server to perform a particular task). Micro-services are called from the client side by means of rules, and they can be chained together to create complex task workflows. More information is available here (3).
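The idea of chaining small single-purpose steps into a workflow can be sketched in a few lines of Python. This is only an analogy: real micro-services are C functions invoked through iRODS rules, whereas the step functions (`load_values`, `compute_mean`, `format_report`) and the `chain` helper below are hypothetical names invented for this example.

```python
from functools import reduce
from typing import Any, Callable, Dict

# Each step takes a context dictionary and returns an updated copy,
# loosely mirroring how chained micro-services pass results along.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]

def load_values(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Parse whitespace-separated numbers from the raw input string."""
    return {**ctx, "values": [float(v) for v in ctx["raw"].split()]}

def compute_mean(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Compute the arithmetic mean of the parsed values."""
    values = ctx["values"]
    return {**ctx, "mean": sum(values) / len(values)}

def format_report(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Produce a short human-readable summary."""
    report = f"mean of {len(ctx['values'])} values = {ctx['mean']:.3f}"
    return {**ctx, "report": report}

def chain(*steps: Step) -> Step:
    """Compose steps so that each one receives the previous step's output."""
    def workflow(ctx: Dict[str, Any]) -> Dict[str, Any]:
        return reduce(lambda acc, step: step(acc), steps, ctx)
    return workflow

workflow = chain(load_values, compute_mean, format_report)
result = workflow({"raw": "1.0 2.0 3.0 6.0"})
print(result["report"])  # mean of 4 values = 3.000
```

Keeping each step small and independent is what makes it possible to reuse the same steps in different workflows.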
The following applications have been identified:
Geophysics - CO2 sequestration: modeling of an unstable miscible displacement porous media flow
Turbulence simulations by Federico Toschi (TUE Eindhoven). Approx 20 TB of data written in raw binary files.
Cosmological large-scale structure simulations, cosmology group in Trieste using the Gadget 2/3 code.
Galaxy cluster simulations, astrophysics groups of IRA-Bologna and CINECA. The code used is Enzo. The current total data size is about 10 TB. All the raw data files are HDF5.
The metadata describing the data is represented by an XML document which complies with an XSD schema defined for the applications producing the data. A unique schema characterizing the content of the metadata information for the above applications has been created.
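As a rough illustration of how an XML metadata document could be turned into searchable attribute/value pairs, the Python sketch below flattens a small XML tree using only the standard library. The sample document, its element names and the `flatten` helper are assumptions for illustration; they do not reproduce the project's actual schema, and no XSD validation is performed here (the standard library does not provide it).

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical metadata document; the real schema is not reproduced here.
SAMPLE = """<metadata>
  <application>co2</application>
  <timestep>42</timestep>
  <grid><nx>128</nx><ny>64</ny></grid>
</metadata>"""

def flatten(element: ET.Element, prefix: str = "") -> Dict[str, str]:
    """Map every leaf element to a dotted attribute name and its text value."""
    name = f"{prefix}.{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        return {name: (element.text or "").strip()}
    pairs: Dict[str, str] = {}
    for child in children:
        pairs.update(flatten(child, name))
    return pairs

pairs = flatten(ET.fromstring(SAMPLE))
for attribute, value in pairs.items():
    print(attribute, "=", value)
```

Dotted names such as `metadata.grid.nx` keep nested elements distinguishable once the tree structure is gone.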
The initial CO2 data (a set of 12,000 time sequences amounting to a volume of 2.2 TB) has been created and ingested into the data infrastructure using the command-line client.