Managing distributed research data in the engineering sciences with a Data Mesh approach
Research data management in the engineering sciences is decentralised both organisationally and technically, with respect to tools and data sources. A Data Mesh approach is therefore applied, and first experiences with it are presented.
To make data-driven research transparent and comprehensible, research data management (RDM) covers tools and methods that make research data FAIR (findable, accessible, interoperable, reusable) and maintain it as a valuable resource. Moreover, a cultural change has begun towards building analyses on existing data instead of necessarily collecting new data at the beginning of each experiment.
RDM, especially in the engineering sciences, is characterised by heterogeneity and decentralisation: Thousands of research institutions in Germany work independently of each other unless they collaborate within projects. The engineering sciences comprise different domains, ranging from mechanical and electrical to civil engineering. Engineering is highly interdisciplinary, so data might be reused for a purpose that was unknown when the data was generated. Data formats and structures are heterogeneous, covering relational sensor data and material models as well as images and audio recordings. Data is provided in a variety of repositories, whether generic, institutional, or specialised.
Overall, this starting point makes it hard to discover existing data, to make use of it both in terms of content and technically, and to assess the quality of a reused dataset.
The Data Mesh approach from industrial data management appears appropriate for this setting: Instead of centralised IT teams, domains and their domain owners manage their datasets, as they are able to answer specific questions about the data and to ensure its quality. Data is provided in the form of data products, which ensures that relevant elements are included, such as metadata for context, code for processing, a handle for identification, provenance as history, and a license for legal clarity. Data remains in its original source, leveraging existing and potentially more specialised infrastructure. Compared to a monolithic ‘one-size-fits-all’ solution, this is less complex to maintain and can adapt more easily to future requirements. No complex ETL pipelines are required for data integration, although the data must be accessible in its sources, e.g. via an API. Based on metadata, the decentralised data is registered in a central platform, e.g. a data catalogue or graph, for increased findability. On such a self-serve platform, owners can onboard their data and users can find and access it. To achieve interoperability within a Data Mesh, federated governance is applied, consisting of global and local elements: Designed according to the respective requirements, global governance ensures standardisation across the data within the whole mesh, while local rules leave room for domain-specific design decisions.
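The interplay of data products, a central catalogue, and federated governance can be illustrated with a minimal Python sketch. It is not an implementation used in this work; all names (DataProduct, Catalogue, the rule functions, the domain label, and the example handle and URL in the usage note) are hypothetical and chosen for illustration only. The sketch assumes that a catalogue stores only descriptors while the data stays in its source, and that governance can be expressed as simple checks on a descriptor.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProduct:
    """Descriptor of a data product; the data itself stays in its source."""
    handle: str        # persistent identifier (hypothetical format)
    domain: str        # owning domain, e.g. "civil-engineering"
    owner: str         # domain owner who can answer questions about the data
    access_url: str    # API endpoint of the original source
    license: str       # legal terms of reuse
    metadata: dict = field(default_factory=dict)    # context, e.g. title, unit
    provenance: list = field(default_factory=list)  # history of processing steps


# A governance rule is a named check on a descriptor (illustrative assumption).
Rule = Callable[[DataProduct], bool]


def has_license(product: DataProduct) -> bool:
    """Global rule: every product in the mesh must state a license."""
    return bool(product.license)


def has_title(product: DataProduct) -> bool:
    """Global rule: every product must carry a title in its metadata."""
    return "title" in product.metadata


def has_unit(product: DataProduct) -> bool:
    """Local rule of one hypothetical domain: measurements need a unit."""
    return "unit" in product.metadata


class Catalogue:
    """Central registry of descriptors with federated governance checks."""

    def __init__(self, global_rules: list, local_rules: dict):
        self.global_rules = global_rules  # apply to the whole mesh
        self.local_rules = local_rules    # domain name -> domain-specific rules
        self._entries = {}

    def register(self, product: DataProduct) -> None:
        """Onboard a product if it passes the global and its domain's rules."""
        rules = self.global_rules + self.local_rules.get(product.domain, [])
        failed = [rule.__name__ for rule in rules if not rule(product)]
        if failed:
            raise ValueError(f"{product.handle} violates: {', '.join(failed)}")
        self._entries[product.handle] = product

    def find(self, keyword: str) -> list:
        """Return all products whose metadata values contain the keyword."""
        needle = keyword.lower()
        return [
            product
            for product in self._entries.values()
            if any(needle in str(value).lower() for value in product.metadata.values())
        ]
```

In this sketch, a domain owner would onboard a dataset by calling register with a descriptor whose access_url points to the existing repository (for example a placeholder such as https://repo.example.org/api/datasets/42), and a user would call find and then retrieve the data from that source. Keeping the rules as plain functions is only one possible design; it is meant to show how global standardisation and domain-specific rules can coexist.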
In this talk, the characteristics of RDM in the engineering sciences and of Data Mesh will be presented and mapped against each other. First results on the suitability of the approach and on areas requiring adaptation will be presented. Although not applicable 1:1 to RDM, the Data Mesh approach addresses the main challenges identified above. In particular, the domain-oriented approach and the federated governance go beyond purely technical or centralised solutions.