DataSTaMP – Data Storage and Management Project
One of the key hallmarks of the new generation of synchrotrons is better performance, leading to higher-resolution datasets. Users at MAX IV are now able to see their samples in unprecedented detail in both length and time. This puts enormous strain on the data storage infrastructure, which must adapt to ever-increasing quantities of data that have to be transferred and accessed faster than ever before. If data management is not equally cutting edge, it risks choking the productivity of users and blunting the state-of-the-art scientific instruments. MAX IV has been funded to bring cutting-edge data storage and management to synchrotron users through DataSTaMP (Data Storage and Management Project). The project will ensure that data generated at MAX IV is not scattered across scientists’ hard drives and stored in their home labs, but is made accessible to the research community through the European Open Science Cloud (EOSC), in the spirit of the Open Science movement.
Scope of project
DataSTaMP addresses all stages of the research data lifecycle, from proposal, through data collection and analysis, to publication, archival and reuse. Establishing resources for long-term data storage will preserve all of the scientific data generated at MAX IV, both now and in the future. Storing the data in a standard format in a single secure location will make it possible to offer value-adding services to the MAX IV user community, such as collaborative work platforms, data portals for searching and sharing, and data analysis and processing services. Collecting important metadata will make the data more meaningful and valuable, especially when it is combined into a large dataset. Assigning harmonised metadata tags at the point of collection will make sure that the data generated at MAX IV remains findable, accessible, interoperable and reusable (FAIR) for decades to come.
Organisation of work into work packages
The scope of DataSTaMP is split into four main work packages, each covering a specific subject. Data storage concerns all infrastructure resources for storage, compute and network. Experimental data and metadata involves harvesting and recording datasets and their associated metadata, as well as basic quality assessment of data taken during an experiment or while preparing for one. Data management of experiments relates to the governance of datasets stored in a variety of data catalogues associated with a performed experiment. Finally, data evaluation concerns the processing of datasets, the development of scientific computing tools, and services that support collaboration between scientists on the research data.
Data management of experiments
The EU’s Open science policy will make scientific knowledge and research output more easily accessible. Open science will offer the opportunity to access and reuse all publicly funded research data in Europe across scientific disciplines and facilities. Supporting the intent of Open science means that the tools and resources for organizing and managing datasets at MAX IV need to be improved and extended with new features. New data catalogues are necessary to retain and consolidate the sample, instrument and scientific metadata related to an experiment. Access restrictions must be applied so that the collected data is accessible only to the responsible principal investigator (PI) and the scientific collaborators identified by the PI. The demand for a cross-facility data catalogue service in which all research is made accessible requires unique and persistent identification and authentication of users across scientific disciplines and facilities. The data management solution must keep track not only of the experimental data, but also of how, when and where it was collected and under which license it will be shared, so that the research can be made available.
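The access rule described above can be illustrated with a minimal Python sketch. It is not the MAX IV implementation: the record layout, the field names, the `can_access` helper and the idea of an `open_from` release date are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry; all field names are illustrative."""
    dataset_id: str
    principal_investigator: str
    collaborators: set = field(default_factory=set)
    license: Optional[str] = None     # license under which the data will be shared
    open_from: Optional[date] = None  # assumed release date; None means not released


def can_access(record: DatasetRecord, user_id: str, today: date) -> bool:
    """Return True if the given user may read the dataset on the given day."""
    # The PI and the collaborators named by the PI always have access.
    if user_id == record.principal_investigator or user_id in record.collaborators:
        return True
    # Anyone else gets access only once the data is released under a license.
    return (
        record.license is not None
        and record.open_from is not None
        and today >= record.open_from
    )


# Example usage with made-up identifiers.
record = DatasetRecord(
    dataset_id="dataset-001",
    principal_investigator="pi-anna",
    collaborators={"collab-bo"},
    license="CC-BY-4.0",
    open_from=date(2025, 1, 1),
)
pi_allowed = can_access(record, "pi-anna", date(2024, 6, 1))
stranger_before_release = can_access(record, "someone-else", date(2024, 6, 1))
stranger_after_release = can_access(record, "someone-else", date(2025, 6, 1))
```

Keeping the license and release date on the same record as the access list reflects the requirement that the catalogue tracks both who may see the data and under which terms it will eventually be shared.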
Deliverables to be provided within the scope of Data management of experiments
The Data management of experiments work package focuses on creating seamless access to, and easy management of, datasets from a variety of data catalogues linked to a user experiment. To do this, two new data catalogues will be developed (a sample database and a data portal), and the Digital User Office (DUO) solution will be enhanced to enable the dissemination of information on the users and proposals associated with the data taken at the beamline.
- Deliverable 1.1 – Digital user office system extended with advanced functionality required to enable data storage and management.
- Deliverable 1.2 – Catalogue to store and track samples, extended with sample shipment capabilities for remote and mail-in experiments.
- Deliverable 1.3 – Web portal enabled data catalogue to govern scientific data and publish it as open data (EOSC compatible).
Experimental data & metadata
In recent years synchrotron sources, beamline instrumentation and detectors have improved dramatically. At present, detectors are capable of producing thousands of frames per second. This continuing development in detector technology, with increasing data rates and an expanding portfolio of detectors to manage, is prompting a reconsideration of how experimental data is harvested at MAX IV. The intention is to unify data acquisition from the experiments and to create a flexible and configurable framework that matches the requirements of a wide range of detectors. The framework will be supplemented by features to quickly determine the quality of the raw data as it is taken (fast data), so that there is (almost) instant feedback on whether to stop or continue an experiment with the calibrations made. Furthermore, the framework will be compatible with a unified way for beamlines to harvest standardized metadata during the execution of an experiment. This slow stream of metadata, which will automatically retrieve and record information on calibration and significant measurements from the beamline setup relevant to the experiment and to the scientific user community, is very likely to be physically and logically separated from the fast raw detector data.
Deliverables to be provided within the scope of Experimental data & metadata
The Experimental data & metadata work package aims to support a unified way of harvesting metadata and experimental data from a variety of detectors and instruments, and to build the capability to handle the data volumes and high data rates of the new generation of detectors.
- Deliverable 2.1 – Data acquisition framework to harvest and record streaming data from an increased standard portfolio of detectors.
- Deliverable 2.2 – Metadata acquisition framework to automatically record metadata on a scientific experiment.
- Deliverable 2.3 – Livestream-viewer with basic instant data quality assessment capabilities.
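As a rough illustration of what a basic instant data quality assessment could look like, the Python sketch below computes two simple per-frame indicators and turns them into a continue/stop hint. The chosen metrics, the thresholds, the 16-bit saturation level and the function names are assumptions for illustration; the source does not specify how quality is assessed.

```python
def frame_quality(frame, saturation_level=65535):
    """Compute simple quality indicators for one detector frame.

    `frame` is a list of rows of pixel counts. The saturation level is an
    assumed value for a 16-bit detector.
    """
    pixels = [value for row in frame for value in row]
    if not pixels:
        raise ValueError("frame contains no pixels")
    saturated = sum(1 for value in pixels if value >= saturation_level)
    return {
        "mean": sum(pixels) / len(pixels),
        "saturated_fraction": saturated / len(pixels),
    }


def should_continue(quality, min_mean=1.0, max_saturated_fraction=0.01):
    """Hint whether to keep measuring; the thresholds are illustrative."""
    return (
        quality["mean"] >= min_mean
        and quality["saturated_fraction"] <= max_saturated_fraction
    )


# Example usage with two tiny made-up frames.
good_frame = [[5, 5], [5, 5]]
overexposed_frame = [[0, 10], [20, 65535]]

good_quality = frame_quality(good_frame)
overexposed_quality = frame_quality(overexposed_frame)
keep_going = should_continue(good_quality)
stop_hint = not should_continue(overexposed_quality)
```

A check of this kind only needs the frame itself, which fits the separation described above: the fast detector stream can be assessed on the fly while the slower metadata stream is recorded independently.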
Data evaluation
It is increasingly difficult to take data away from where it is produced, due to its size. The datasets are simply too big to transfer, or to store on laptops and portable memory devices. There is a growing need to process and analyse data where it is taken and will be preserved. This has been recognized by DataSTaMP, and also by the European Union and the scientific user community, which have formed a collaboration of national photon and neutron research infrastructures to create a shared and harmonised framework for the management of scientific data. The ExPaNDS project (Extended Photon and Neutron Data Services), in which MAX IV is a member, runs concurrently with DataSTaMP and strives to reach an agreement on a federated data analysis-as-a-service solution common to all European photon and neutron research. DataSTaMP will closely follow the roadmap of ExPaNDS and comply with its guiding principles for establishing a collaboration platform with analytics capabilities that allows scientists to work remotely and to disseminate their scientific findings quickly and widely.
Deliverables to be provided within the scope of Data evaluation
The work package of Data evaluation will make data collected at MAX IV more useful and reusable, by providing tools and platforms for the scientific user community to share, process and analyse the experimental data post-visit.
- Deliverable 3.1 – Scientific computing tool to visualize data in HDF5-format via web and on mobile devices.
- Deliverable 3.2 – Data analysis-as-a-service to interactively process data and visualize results (EOSC compatible).
- Deliverable 3.3 – Set of protocols and tools allowing users to configure group access to scientific data and provide analysis workflows.
- Deliverable 3.4 – Relevant data compression methods and integrated tools (will be developed if necessary).
Data storage
The scientific data infrastructure at MAX IV is designed to handle both computation and storage needs. Data produced at the beamlines is centrally managed by a tiered storage system connected to a cluster of servers that provides high-performance computing for online data analysis. The infrastructure resources are strategically and physically separated into data acquisition (DAQ), data storage, and data compute. The separation keeps offline analysis away from data recording and storage, which need high availability to serve ongoing experiments. Nevertheless, the Open science movement’s demand for cross-facility services, together with MAX IV’s aim of increasing the long-term value of research data, drives a need to scale up the capacity of the scientific storage infrastructure and to create sustainable storage capabilities. The DataSTaMP project sets out to design a home for the scientific data generated at MAX IV, where the data can be hosted and maintained long-term.
Deliverables to be provided within the scope of Data storage
The deliverables within the Data storage work package will establish a state-of-the-art storage infrastructure where scientific data can be securely preserved for 10+ years.
- Deliverable 4.1 – Tape storage solution with an initial storage capacity of 15 PB
- Deliverable 4.2 – Tape storage capacity build-out (capacity TBD)
- Deliverable 4.3 – Offline storage capacity expansion of 2 PB (in 2019)
- Deliverable 4.4 – Additional offline storage capacity expansion of 2 PB (in 2022)
- Deliverable 4.5 – Online storage capacity expansion of 400 TB
- Deliverable 4.6 – Cloud storage service capability on a science storage provider (driven by demand)
Collaborations and co-operations with EC funded projects
DataSTaMP will have an impact on the access to, and sharing of, research data beyond the MAX IV user community. The project aims to support the ambitions of the European Open Science Cloud and to conform to the principles of FAIR research data, i.e. to enable both humans and machines to find, access, interoperate with and reuse the scientific data produced at MAX IV. The data storage and management project underpins the EC funded project ExPaNDS, in which MAX IV participates and to which it contributes. The work packages on Data evaluation and Data management of experiments, and in particular work package 3.2, Interactive analysis, and work package 1.3, Data portal, will strongly support the realization of the ExPaNDS objective to deliver harmonised, interoperable and integrated data sources and data analysis services for photon and neutron facilities. Achieving such ambitions is undoubtedly a major challenge, as it requires fostering standardization and interoperability of data and metadata infrastructures. However, DataSTaMP is a fundamental enabler in applying such cross-facility services at MAX IV, as the project will provide the essential structure for the facility-specific implementation of ExPaNDS.
Open Science, https://ec.europa.eu/research/openscience
EOSC Portal, https://eosc-portal.eu/
The FAIR Guiding Principles for scientific data management and stewardship, https://dx.doi.org/10.1038/sdata.2016.18
ExPaNDS, https://expands.eu/