BIDS-flux
Continuous federated multi-site MRI dataset creation from research instruments to open-publication with a Datalad backbone.
In an era of data-hungry algorithms, large collections of high-quality research data are increasingly important for testing complex scientific hypotheses. Acquiring, curating, preparing, and sharing datasets at scale remain challenging tasks that require extensive field-specific knowledge and thus rely upon groups of researchers with varying expertise. As such, these steps are often conducted in non-integrated, sequential phases rather than continuously, which limits the observability (e.g., progress tracking, quality assessment, error detection) of the collection process and, consequently, the potential for data fixes or operational improvements. Furthermore, while funding agencies and local institutions both strongly encourage researchers to implement FAIR principles (findability, accessibility, interoperability, and reusability) (Wilkinson et al., 2016) in their data management practices, ethics committees, notably those dealing with human research, logically impose important privacy, cybersecurity, sovereignty and governance constraints on data sharing. Altogether, the burden of adequately conducting scientific data collection, preparation and sharing is far too large for individual labs to bear, even more so when novice students or trainees are tasked with implementing these complex steps. These responsibilities further increase when scaling projects to multiple labs or sites with distinct institutional governance or digital infrastructures, inherently slowing collaboration. All these challenges of large-scale, high-quality, collaborative data collection initiatives call for open-science community efforts and institutional infrastructure to support efficient scientific operations and as-open-as-possible data sharing.
Building on our experience in collecting and managing the data of the dense-fMRI CNeuroMod project, we aim to transfer, generalize, scale and automate these workflows in the context of our institution-wide open-science initiative, as well as of a multisite Canadian Pediatric Imaging Platform (CPIP), by designing and implementing the BIDS-Flux infrastructure.
The neuroimaging community has already conducted extensive efforts in data management by specifying the robust and evolving file-based scientific data standard BIDS (Gorgolewski et al., 2016), which automated tools, BIDS Apps (Gorgolewski et al., 2017), can reliably read and write. This foundation allowed us to design and start the implementation of a scalable FAIR data management platform. Modularly organized datasets are managed as Datalad (Halchenko et al., 2021) repositories, which version large files with git-annex while exposing data structure and metadata with git, which also tracks provenance between datasets through submodules. Code-hosting platforms (e.g., GitHub, GitLab) have enhanced git version control with agile practices such as development operations (DevOps) through continuous integration and deployment (CI/CD) mechanisms, which can be translated to data operations (DataOps) on Datalad repositories. As such, we chose to host and organize datasets on self-deployed GitLab instances that orchestrate workflows, while we store data in a modern, scalable object store (MinIO). Provided some data collection conventions are followed (e.g., ReproNim), the platform continuously ingests raw data from instruments (e.g., MRI, REDCap, Biopac) and carries out data standardization (e.g., heudiconv, spec2nii, phys2bids), anonymization (defacing), testing (e.g., BIDS-validator, protocol compliance) and quality control (e.g., MRIQC). After review and merge of the standardized raw data from new sessions, a set of containerized BIDS Apps configured for the study-specific collection protocol is triggered using Datalad reproducibility features to generate derivatives, which also undergo review through reports before being merged into their own datasets. All resulting datasets are continuously and recurrently covered by data-standard compliance, deployment, privacy and security audit tests.
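As an illustration of the kind of automated test a new session could go through before review, the following is a minimal sketch in Python. It is not the BIDS-validator and not the platform's actual code: the regular expression covers only the basic `sub-<label>[_ses-<label>][_<key>-<value>...]_<suffix>.<ext>` naming pattern, and the names `BIDS_NAME`, `check_session` and `required_suffixes` are hypothetical.

```python
import re

# Simplified BIDS-like filename pattern (assumption: far less strict than the
# real BIDS specification, which also constrains entity order and suffixes).
BIDS_NAME = re.compile(
    r"^sub-(?P<sub>[A-Za-z0-9]+)"          # mandatory subject entity
    r"(?:_ses-(?P<ses>[A-Za-z0-9]+))?"     # optional session entity
    r"(?:_[A-Za-z0-9]+-[A-Za-z0-9]+)*"     # any further key-value entities
    r"_(?P<suffix>[A-Za-z0-9]+)"           # suffix, e.g. T1w or bold
    r"(?P<ext>\.[A-Za-z0-9.]+)$"           # extension, e.g. .nii.gz
)


def check_session(filenames, required_suffixes):
    """Return a list of human-readable problems found in one session.

    Hypothetical helper: flags filenames that do not follow the simplified
    pattern above, then reports suffixes the protocol expects but that no
    valid file provides.
    """
    problems = []
    seen = set()
    for name in filenames:
        match = BIDS_NAME.match(name)
        if match is None:
            problems.append(f"not BIDS-like: {name}")
        else:
            seen.add(match.group("suffix"))
    for suffix in sorted(set(required_suffixes) - seen):
        problems.append(f"missing required suffix: {suffix}")
    return problems


if __name__ == "__main__":
    session = [
        "sub-01_ses-001_T1w.nii.gz",
        "sub-01_ses-001_task-rest_bold.nii.gz",
    ]
    # An empty list means this toy check found nothing to report.
    print(check_session(session, {"T1w", "bold"}))
```

In a CI job, a non-empty result would typically fail the pipeline so that the session is fixed before it is merged; how the actual platform reports such failures is not specified here.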
While focused on neuroimaging and peripherally collected data covered by the BIDS extensions, this infrastructure is modular enough to be adapted to other types of file-based scientific data structures, provided that field-specific data standards and reproducible tools exist. Finally, while file-based datasets fit the earlier stages of data preparation, extracted data can also be forwarded to tabular or vector-optimized storage better suited for downstream usage such as statistical modelling or machine-learning operations (MLOps).
Building upon Datalad's inheritance of git decentralization and git-annex versatility, we further designed our platform to enable distributed dataset creation. Multiple distant sites can independently add new data to local forks of the datasets, while federation mechanisms propagate these changes and optionally transfer released non-sensitive data to community-accessible academic cloud storage and long-term archives (e.g., Dataverse), with Datalad flexibly tracking the presence of data in these redundant storage locations. Apart from the benefits of collaboratively scaling scientific endeavours, a distributed, replicated and federated management of scientific operations can accommodate the complex requirements of resilient data sovereignty and open science.
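The idea of tracking where each file's content is present across redundant storage can be sketched with a toy model. The Python below is only an illustration loosely inspired by git-annex location tracking and its minimum-copies setting; it does not reproduce how git-annex or Datalad store this information, and the class `PresenceLog`, its methods and the remote names are hypothetical.

```python
from collections import defaultdict


class PresenceLog:
    """Toy record of which storage locations hold the content of each file."""

    def __init__(self):
        # Maps a file key to the set of locations known to hold its content.
        self._where = defaultdict(set)

    def record(self, key, remote):
        """Note that `remote` now holds the content of `key`."""
        self._where[key].add(remote)

    def forget(self, key, remote):
        """Note that `remote` no longer holds the content of `key`."""
        self._where[key].discard(remote)

    def locations(self, key):
        """Return the sorted list of locations known to hold `key`."""
        return sorted(self._where.get(key, ()))

    def can_drop(self, key, remote, min_copies=1):
        """True if at least `min_copies` other locations still hold `key`."""
        others = self._where.get(key, set()) - {remote}
        return len(others) >= min_copies


if __name__ == "__main__":
    log = PresenceLog()
    key = "sub-01_T1w.nii.gz"
    log.record(key, "site-a")    # acquired and stored at a collection site
    log.record(key, "archive")   # later copied to a long-term archive
    print(log.locations(key))
    print(log.can_drop(key, "site-a"))                 # one other copy exists
    print(log.can_drop(key, "site-a", min_copies=2))   # not enough copies
```

Under this model, a site would only release its local copy once enough other locations are known to hold the content, which is one way to reason about redundancy across sites and archives.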
Finally, while a large portion of the scientific data workflow can be automated, the later stages of analysis and publication require, for now, more human intervention. Even if FAIR data management calls for very technical solutions as described above, access to and use of the data have to be simple and user-friendly to empower researchers with FAIR practices. We thus plan to provide easily launchable, distributed, containerized cloud computing environments (e.g., BinderHub) that track the versions of the datasets used. By powering interactive exploration of rich datasets as much as full-fledged analyses in modern executable reproducible preprints (Karakuzu et al., 2022), such computational environments are the final step toward an end-to-end FAIR scientific workflow, tracking provenance from collection instruments to publications.
Gorgolewski, K.J. et al. (2016) ‘The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments’, Scientific data, 3, p. 160044.
Gorgolewski, K.J. et al. (2017) ‘BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods’, PLoS computational biology, 13(3), p. e1005209.
Halchenko, Y.O. et al. (2021) ‘DataLad: distributed system for joint management of code, data, and their relationship’, Journal of open source software, 6(63). Available at: https://doi.org/10.21105/joss.03262.
Karakuzu, A. et al. (2022) ‘NeuroLibre: A preprint server for full-fledged reproducible neuroscience’. Available at: https://doi.org/10.31219/osf.io/h89js.
Wilkinson, M.D. et al. (2016) ‘The FAIR Guiding Principles for scientific data management and stewardship’, Scientific data, 3, p. 160018.