Schedule
You can also access the schedule information as Pentabarf XML. For viewing the schedule on mobile, we recommend the Giggity app (F-Droid) – use the link to the XML file above. Unfortunately, we are not currently aware of a free app compatible with iOS devices.
All times listed are in Central European Time (CET).
2025-10-23
Arrival and Registration
08:30 (00:30) ‧ Foyer ‧ General
Abstract
Find the venue, pick up name badges, get comfortable
Welcome and Overview
09:00 (00:20) ‧ Event Hall ‧ General
Abstract
Welcome from the organizers
Pragmatic YODA: overview of YODA principles and their wild life encounters
09:20 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software
Yaroslav O. Halchenko
A YODA principle a day keeps gray hair rate at bay
Abstract
YODA Principles were formulated in 2018 as a poster for a human neuroimaging conference and were later covered in the DataLad Handbook. Despite that exposure, the YODA principles remain largely unknown in the community, although they might naturally be thought of as ‘ecological’ and are followed in scientific practice by many researchers. Under the tagline ‘A YODA principle a day keeps gray hair rate at bay’, the presentation will cover essential YODA concepts and demonstrate their real-world application using tools like DataLad, reviewing prominent examples such as the OpenNeuroDerivatives presented at the last distribits. We will further showcase supporting tools like con/duct while providing practical tips for sticking to YODA principles in day-to-day scientific workflows. By relating and potentially contrasting these principles to other hierarchical organization standards (e.g. FHS and the XDG Base Directory specification), we hope to provide insights into creating more maintainable and reproducible research data resources.
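For a concrete flavour of these principles, the following is a minimal sketch using the DataLad Python API; the dataset name, input URL, and analysis script are hypothetical, while the 'yoda' run-procedure is a real DataLad configuration procedure.

    # Minimal YODA-style layout sketch using the DataLad Python API.
    # Paths, the input URL, and the analysis script are hypothetical.
    import datalad.api as dl

    # One analysis, one dataset: the 'yoda' procedure pre-creates code/, README, etc.
    dl.create(path="my-analysis", cfg_proc="yoda")

    # Inputs are linked as a subdataset rather than copied into the analysis.
    dl.clone(source="https://example.org/raw-data.git",
             path="my-analysis/inputs/rawdata", dataset="my-analysis")

    # Results are produced by a recorded, re-executable command.
    dl.run(cmd="python code/analysis.py inputs/rawdata outputs/",
           dataset="my-analysis",
           message="Compute results from raw inputs")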
Research Software Education and Documentation: The overlooked pillar of software sustainability and high quality science
09:40 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software
Julia-Katharina Pfarr
How good can a research software be without proper documentation and training?
Abstract
The crucial role of open and collaborative research software development in advancing high-quality research is increasingly recognized and valued within the scientific community and by funding sources. While attention often focuses on the software engineering aspects of research tools, the development of appropriate training materials remains frequently overlooked, with limited resources available to guide developers in creating effective tutorials and limited recognition of this work in academic evaluation. Though good examples of comprehensive documentation and training materials exist (e.g., DataLad, BIDS, various Python packages), the knowledge of how these materials were created - including the underlying thought processes and decisions - typically remains with their creators. Train-the-trainer programs such as those from ReproNim or Software Carpentry offer invaluable contributions in preparing peers for research software education; however, these programs have limited spots, and their resources for educating trainers are not widely enough known. Therefore, in my talk, I will emphasize the importance of creating educational and training materials for open-source research software and provide practical guidelines for developers.
The first part of my presentation will highlight the value of research software education materials. Acquiring proficiency in software not only increases scientific efficiency - researchers spend less time troubleshooting and more time conducting science with fewer errors - but well-designed tutorials also lower entry barriers to sophisticated analyses: without proper training materials, certain methods remain accessible only to those with advanced computational backgrounds. Well-constructed tutorials break sophisticated techniques into manageable concepts with clear implementation steps. This democratizes science by enabling researchers from smaller institutions and under-resourced settings to implement cutting-edge techniques without extensive local expertise. Furthermore, a community of well-trained researchers contributes to software sustainability, beginning with user-driven documentation improvements and potentially advancing to code contributions. User feedback during training reveals necessary design improvements, creating tools that increasingly align with researchers’ actual workflows.
Key considerations in my talk will address existing training modalities (progressive complexity versus task-based approaches), how to determine which are most appropriate and effective for a specific research software project, the time investment required, and the sustainability challenges that arise. Through these discussions, I aim to emphasize that investing in high-quality training resources is central to research software development - particularly in complex domains like neuroimaging - and essential for maximizing the scientific impact of our tools. Actions to increase the value of software training should include recognition by the scientific community, e.g. through dedicated awards or by quantifying the gains in software quality that good documentation brings. Finally, I will address the critical questions: who should be responsible for undertaking this task? And how do we make it sustainable?
Q&A
10:00 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software
Abstract
Questions and panel discussion
Coffee
10:20 (00:30) ‧ Foyer ‧ Break
Abstract
Coffee and snack overflow
git-annex for computer scientists
10:50 (00:20) ‧ Event Hall ‧ git-annex and special remotes
Joey Hess
quick introduction to git-annex internals for people who are not scared of words like “distributed”
Abstract
git-annex has hundreds of subcommands, a myriad of features and use cases, and comprises a hundred thousand lines of code. But there is a simple core that, if you understand it, makes it easy to grasp everything else.
After learning that, we’ll check our understanding by applying it to a few features that have been recently added to git-annex, like the compute special remote.
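As a rough illustration of that simple core (a sketch, not git-annex's code): annexed content is addressed by a key derived from a hash of the content plus its size; location tracking, special remotes, and the git-annex branch are bookkeeping around such keys. A sketch assuming the common SHA256E backend:

    # Sketch of how a git-annex key for the SHA256E backend is formed:
    # "SHA256E-s<size>--<sha256><extension>". Not git-annex code, just the idea.
    import hashlib
    import os

    def annex_key_sha256e(path):
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        ext = os.path.splitext(path)[1]  # SHA256E keeps the file extension
        return "SHA256E-s{}--{}{}".format(size, digest, ext)

    # A repository then records, per key and per remote, whether the content
    # is present; that location log lives on the "git-annex" branch.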
Managing Tape Archives with git-annex: A Special Remote for Sequential Media
11:10 (00:20) ‧ Event Hall ‧ git-annex and special remotes
Steffen Vogel
Use git-annex to manage your tape archive index.
Abstract
Magnetic tape remains a highly reliable and cost-effective medium for long-term data storage. However, its sequential nature and limited integration with modern software workflows often hinder adoption outside traditional enterprise environments. This talk introduces a new approach that integrates tape storage with git-annex through a purpose-built special remote called git-annex-remote-tape, enabling versioned data management on offline, high-capacity media.
We begin with a brief technical overview of the LTO ecosystem, focusing on tape operation via SCSI command sets and connectivity through SAS or Fibre Channel interfaces. These drives are well-supported in Linux via device nodes (e.g., /dev/st0) and managed with tools such as mt, tar, and sg3_utils. Despite robust hardware capabilities — such as hardware encryption and streaming compression — integrating tape into file-level workflows remains challenging, especially for applications requiring metadata versioning and content addressing.
To address this gap, we introduce git-annex-remote-tape, a special remote that treats tape as a managed backend for git-annex. This remote enables users to archive annexed content to tape, track its location, and restore it reliably using the git-annex interface. It leverages a simple on-tape data format optimized for sequential media, featuring:
- Basic Metadata for each object linked to the git-annex key.
- Streaming-friendly layout supporting large block sizes and seekable objects.
The remote manages the full data lifecycle: writing files to tape, updating metadata, and retrieving content via SCSI-based positioning. Tape media is treated as append-only and immutable, ensuring data consistency and simplifying recovery. All operations are coordinated through the standard git-annex special remote protocol, with additional logic for tape-specific behaviors.
A live demonstration will showcase real-world usage: initializing a tape remote, archiving data, removing local content, and restoring files from tape. We illustrate how tape-backed storage can be used transparently within annex repositories and how the system handles multiple volumes and offline content. The demo uses standard Linux tools and highlights how the solution integrates into existing workflows without requiring proprietary software.
We conclude with an overview of current limitations—such as lack of random access—and the development roadmap, including support for autoloaders, replication strategies, and enhanced diagnostics. By combining the durability of tape with the flexibility of content-addressable storage, git-annex-remote-tape offers a powerful tool for researchers, archivists, and infrastructure engineers seeking sustainable and auditable archival solutions.
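For readers unfamiliar with the special remote protocol mentioned above, the following is a heavily simplified sketch of the stdin/stdout exchange an external special remote implements, storing content in a local directory rather than on tape; the message names follow the documented protocol, everything else (storage location, error handling) is illustrative.

    # Toy external special remote: speaks a minimal subset of the git-annex
    # external special remote protocol on stdin/stdout, storing content in a
    # local directory instead of on tape. Illustrative only.
    import os
    import shutil
    import sys

    STORE = os.path.expanduser("~/toy-remote-store")

    def reply(line):
        sys.stdout.write(line + "\n")
        sys.stdout.flush()

    def main():
        reply("VERSION 1")
        for raw in sys.stdin:
            words = raw.strip().split(" ")
            cmd = words[0]
            if cmd in ("INITREMOTE", "PREPARE"):
                os.makedirs(STORE, exist_ok=True)
                reply(cmd + "-SUCCESS")
            elif cmd == "TRANSFER":          # TRANSFER STORE|RETRIEVE <key> <file>
                direction, key, path = words[1], words[2], " ".join(words[3:])
                if direction == "STORE":
                    shutil.copyfile(path, os.path.join(STORE, key))
                else:
                    shutil.copyfile(os.path.join(STORE, key), path)
                reply("TRANSFER-SUCCESS {} {}".format(direction, key))
            elif cmd == "CHECKPRESENT":      # CHECKPRESENT <key>
                key = words[1]
                present = os.path.exists(os.path.join(STORE, key))
                reply("CHECKPRESENT-{} {}".format("SUCCESS" if present else "FAILURE", key))
            elif cmd == "REMOVE":            # REMOVE <key>
                key = words[1]
                target = os.path.join(STORE, key)
                if os.path.exists(target):
                    os.remove(target)
                reply("REMOVE-SUCCESS " + key)
            else:
                reply("UNSUPPORTED-REQUEST")

    if __name__ == "__main__":
        main()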
Forgejo-aneksajo: a git-annex/DataLad forge
11:30 (00:20) ‧ Event Hall ‧ git-annex and special remotes
Matthias Riße
Apply established software development practices to your (meta-)data projects.
Abstract
Software development and data curation or analysis share many of their issues: keeping track of the evolution of files – ideally with information on how, why, when, and by whom –, organizing collaboration with multiple people, keeping track of known issues and other TODOs, discussing changes, making versions available to others, automating tasks, and more. Often you will even have to write code as part of a data project, blurring the line between the two even more. In the free and open-source software development world these issues already have well-established solutions: a version control system keeps track of your project's history and ongoing development, a forge can serve as a collaboration hub, and CI/CD services provide flexible automation. So, why not apply them to our data management needs? Forgejo-aneksajo does just that and extends Forgejo with git-annex support, making it a versatile self-hostable (meta-)data collaboration platform that neatly fits into a data management ecosystem around git, git-annex and DataLad.
Your Life in Git
11:50 (00:20) ‧ Event Hall ‧ git-annex and special remotes
Yann Büchau
How I track pretty much all my digital assets with git (annex)
Abstract
Accidental deletions, program crashes, disk failures, etc. have led me to put as many relevant files as I can under version control with git. With this, one can go back in history, restore older versions and see changes over time. That’s great for code, small files and documents, but throwing a large media collection at git itself won’t result in a good time. Git annex however makes tracking and syncing large collections of arbitrarily-sized files possible. With Forgejo-Aneksajo (a Forgejo fork with git annex support), there is now a workable web interface for easy access and collaboration on git annex repositories. In this talk I will explain how I “put my life in git” and show tools I use to achieve it.
Q&A
12:10 (00:30) ‧ Event Hall ‧ git-annex and special remotes
Abstract
Questions and panel discussion
Lunch (Self-Organized)
12:40 (01:30) ‧ Foyer ‧ Break
Abstract
Lunch (self-organized, outside venue)
Extending the Brain Imaging Data Structure specification to provenance metadata
14:10 (00:05) ‧ Event Hall ‧ Lightning talks
Boris Clénet
Provenance allows BIDS datasets to hold information on how they were generated.
Abstract
Interpreting and comparing scientific results, as well as enabling reusable data and analysis outputs, requires understanding provenance, i.e. how data were generated and processed. To be useful, the provenance must be comprehensive, understandable, easily communicated, and captured in a machine-accessible form. We present a recent extension of the Brain Imaging Data Structure (BIDS) that aims at describing the provenance of a dataset. This extension was designed as a combination of ergonomics and computability in order to help neuroscientists parse human-readable yet computer-generated provenance records.
Maintaining large datasets at scale
14:15 (00:05) ‧ Event Hall ‧ Lightning talks
Christopher Markiewicz
Finding and fixing defects that arise in automatically managed git-annex repositories.
Abstract
OpenNeuro is a neuroimaging data hosting service that allows researchers to upload datasets, which are converted into git-annex repositories and exposed for download. Users have created over 7000 datasets, in total containing more than 140 TB of data.
Users largely interact with datasets through the web interface and a command-line uploader/downloader, and rarely use git, git-annex or DataLad directly. As a result, software bugs can introduce defects into a repository without detection until a user attempts to access the data, which can be many months and several dataset versions later.
This talk will discuss some of the defects we have found and the strategies we've used to resolve them (a small, generic illustration follows the list). These include:
- Unannex/re-annex commit pairs poisoning the git history. Finding, fixing and rewriting tags.
- Recovering from large files somehow getting into your git-annex branch.
- Copy-on-write git backups so you can rewrite history without fear.
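The following is a minimal, generic sketch of the backup-and-rewrite idea from the last two points; it is illustrative only, not OpenNeuro's actual tooling, and the paths, tag name, and filesystem assumption (reflink support) are hypothetical.

    # Illustrative only (not OpenNeuro's actual tooling): take a cheap
    # copy-on-write backup of a dataset before rewriting history, then move a
    # tag to the rewritten commit. Requires a filesystem with reflink support
    # (e.g. XFS or Btrfs); paths and tag name are hypothetical.
    import subprocess

    dataset = "/srv/datasets/ds000001"
    backup = "/srv/backups/ds000001.pre-rewrite"

    # cp --reflink=always shares data blocks, so the backup is fast and small.
    subprocess.run(["cp", "--reflink=always", "-a", dataset, backup], check=True)

    # ... history rewrite happens here (e.g. dropping an unannex/re-annex pair) ...

    # Re-point an existing version tag at the rewritten commit and publish it.
    subprocess.run(["git", "-C", dataset, "tag", "-f", "1.0.1", "HEAD"], check=True)
    subprocess.run(["git", "-C", dataset, "push", "--force", "origin",
                    "refs/tags/1.0.1"], check=True)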
Using Git-annex to enhance the MediaWiki file repository system
14:20 (00:05) ‧ Event Hall ‧ Lightning talks
Timothy Sanders
Git-annex is a great companion to MediaWiki's built-in file repository functionality. An annex repo is useful for file location tracking and integrity, as well as offering a bridge to external hosting services through special remotes and integrating both platforms through bidirectional metadata exposure.
Abstract
Managing binary file assets can be a pain point for MediaWiki admins, with binary files often eclipsing the wiki text content. MediaWiki offers an API for file repository backends and extensions for hosting binaries in services like S3 buckets, as well as hooks for additional file operations.
A git-annex MediaWiki backend can implicitly provide these features and more by acting as a bridge to special remotes, while offering a more useful form of file versioning and archiving.
I will share my experiments with integrating git-annex with MediaWiki in several ways: starting with a git-annex-compatible uploads folder whose key format matches MediaWiki's native SHA-1 hashdir format, and then integrating it with MediaWiki's hook interface. I also experimented with a git-annex special remote for the MediaWiki API. Finally, I am working on using git-annex as a MediaWiki file repository backend based on the official specification.
I have also found that there is an overlap between DataLad users and MediaWiki users, so this might be relevant.
If there’s time, I can provide some background and case studies. Note that this isn’t related to the git-remote-mediawiki project, although I’m a fan of that as well. I think git-annex makes more sense for this application.
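To make the "matching key format" idea concrete, here is a rough sketch based on my reading of the abstract (not the author's code): the same content hash can name a file both as a git-annex SHA1-backend key and in the base-36 form MediaWiki uses for its stored file hashes; the latter convention is an assumption on my part.

    # Rough sketch of the key correspondence described above (not the author's
    # code). The same SHA-1 digest can identify a file both as a git-annex
    # SHA1-backend key and, converted to base-36, in the style MediaWiki uses
    # for stored file hashes (an assumption from the abstract).
    import hashlib
    import os

    def sha1_hex(path):
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    def git_annex_sha1_key(path):
        # git-annex SHA1 backend: "SHA1-s<size>--<hex digest>"
        return "SHA1-s{}--{}".format(os.path.getsize(path), sha1_hex(path))

    def mediawiki_base36_sha1(path):
        # Base-36 rendering of the same digest.
        n = int(sha1_hex(path), 16)
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out or "0"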
Unconference
14:25 (00:50) ‧ Event Hall ‧ Unconference
Coffee
15:15 (00:30) ‧ Foyer ‧ Break
Abstract
Coffee and snack overflow
Managing distributed research data in the engineering sciences with a Data Mesh approach
15:45 (00:20) ‧ Event Hall ‧ Federated data management across domains
Mario Moser
Research Data Management in the engineering sciences is decentralised both organisationally and technically with respect to tools and data sources; we therefore apply a Data Mesh approach and present first experiences with it.
Abstract
To make scientific, data-driven research transparent and comprehensible, research data management (RDM) covers tools and methods to make data FAIR (findable, accessible, interoperable, reusable) and to maintain research data as a valuable resource. Moreover, a cultural change has started towards building analyses on existing data instead of necessarily collecting new data at the beginning of each experiment. RDM, especially in the engineering sciences, is characterised by heterogeneity and decentralisation: thousands of research institutions in Germany work independently from each other unless collaborating within projects. The engineering sciences comprise different domains, ranging from mechanical through electrical to civil engineering. Engineering is highly interdisciplinary, so data might be reused for a purpose unknown at the time of data generation. Data formats and structures are heterogeneous, covering relational sensor data and material models as well as images and audio. Data is provided in several repositories, whether generic, institutional, or specialised. Overall, this initial situation makes it hard to discover existing data, to leverage this data both content-wise and technically, and to assess the quality of a reused dataset.
The Data Mesh approach from industrial data management appears appropriate for this setting: instead of centralised IT teams, domains and their domain owners manage their datasets, being able to answer specific questions about the data and to ensure its quality. Data is provided in the form of data products, ensuring that relevant elements are included, such as metadata for context, code for processing, a handle for identification, provenance as history, and a license from a legal perspective. Data remains in its original source, leveraging existing and potentially more specialised infrastructure. Compared to a monolithic ‘one-fits-all’ solution, this is less complex to maintain and can adapt more easily to future requirements. No complex ETL pipelines are required for data integration, although data in its sources must be accessible, e.g. via an API. Based on metadata, the decentralised data in its sources is registered in a central platform, e.g. a data catalogue or graph, for increased findability. On such a self-serve platform, owners can onboard their data and users can find and access it. To achieve interoperability within a Data Mesh, federated governance is applied, consisting of global and local elements: designed according to the respective requirements, global governance ensures standardisation of data across the whole mesh, while local rules leave room for domain-specific design decisions.
In this talk, the characteristics of RDM in the engineering sciences and of Data Mesh will be presented and mapped against each other. First results on suitability and fields of adaptation will be presented. Although not exactly applicable 1:1 to RDM, the Data Mesh approach addresses the main challenges identified before. In particular, the domain-oriented approach and the federated governance go beyond purely technical or centralised solutions.
The Helmholtz Earth and Environment DataHub - Highly Distributed Data That Thrives on Metadata
16:05 (00:20) ‧ Event Hall ‧ Federated data management across domains
Ulrich Loup
Decades of Germany-wide, highly distributed observational data of the earth, the atmosphere, and the oceans are now being made available by a Helmholtz initiative through one common, combined data/metadata interface and standard, which poses a number of challenges to established scientific and technical workflows.
Abstract
In the environmental sciences, time-series data is key to, for example, monitoring environmental processes, validating earth system models and remote sensing products, training data-driven methods, and better understanding climate processes. A major issue is the lack of a consistent data availability standard aligned with the FAIR (findable, accessible, interoperable, reusable) principles.
The DataHub initiative, which is part of the Helmholtz Research Field Earth and Environment, addresses these shortcomings by establishing a large-scale infrastructure around common data standards and interfaces, for example, the Open Geospatial Consortium’s SensorThings API (STA). Closely related to the DataHub is the STAMPLATE project, whose challenging task was to harmonize the extremely heterogeneous metadata formats stemming from the different observation domains such as the earth, atmosphere and ocean. Moreover, within the domains different metadata formats developed historically due to diverging system architectures and missing guidelines.
In DataHub, the research data, whether it is collected by measurement devices or acquired through manual processes, is distributed among the seven participating research centers. Each of these centers is responsible for operating its own time series management system, which ingests the observational data. In addition to these data ingest systems, sensor and device management systems provide easy-to-use self-services for entering metadata, such as the Helmholtz Sensor Management System (https://helmholtz.software/software/sensor-management-system) or the O2A Registry (https://registry.o2a-data.de/). Each center operates a data/metadata synchronization service that ultimately makes the data available through STA, which integrates both data and metadata. Quality checking tools such as SaQC (https://helmholtz.software/software/saqc) facilitate data quality control. The powerful and modern Earth Data Portal (www.earth-data.de) with highly customizable thematic viewers is the central portal for data exploration. In order to ensure that metadata entered in any user self-service is also displayed in the Earth Data Portal along with the ingested data, custom, semantic metadata profiles developed in STAMPLATE augment STA’s core data model with domain-specific information.
In summary, the data that is accessible on the Earth Data Portal and available from the STA endpoints is distributed in two distinct respects. Firstly, observation data and its metadata are acquired by separate systems. Secondly, each center operates its own data and metadata infrastructure, with all centers ultimately connecting to STA endpoints.
The operationalization of the framework and its subsequent integration into research data workflows is imminent, presenting us with a number of challenges as our research data management processes undergo a transformative shift from manual, human-based workflows to self-organized, digitally-enabled workflows. For example, new ways of downloading data need to be found that meet the needs of researchers, while addressing issues such as copyright and avoiding infrastructure overload.
This talk addresses the fundamental elements of our initiative and the associated challenges.
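As a flavour of what consuming the SensorThings API looks like, here is an illustrative request; the endpoint URL is a placeholder rather than one of the DataHub's actual services, and the query simply asks for a few Things with their most recent observations using STA's OData-style options.

    # Illustrative SensorThings API (STA) request; the endpoint URL is a
    # placeholder, not one of the DataHub's actual services.
    import requests

    BASE = "https://example.org/FROST-Server/v1.1"  # any OGC STA v1.1 endpoint

    # Fetch a handful of Things, each with its datastreams and the three most
    # recent observations per datastream.
    params = {
        "$top": "5",
        "$expand": "Datastreams($expand=Observations($orderby=phenomenonTime desc;$top=3))",
    }
    resp = requests.get(BASE + "/Things", params=params, timeout=30)
    resp.raise_for_status()
    for thing in resp.json().get("value", []):
        print(thing["name"], len(thing.get("Datastreams", [])), "datastreams")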
Location-Transparent Distributed Datacube Processing
16:25 (00:20) ‧ Event Hall ‧ Federated data management across domains
Peter Baumann
Semantically rich query languages enable automatic query splitting and asymmetric load balancing
Abstract
It is a common first-semester Computer Science insight that general-purpose (“Turing-complete”) languages cannot be understood by another program, which is necessary, among other things, to generate truly distributed execution plans. However, restricted models are indeed amenable to such analysis - in particular, the SQL database language, which (i) has a focused data structure, tables, and (ii) in its core does not have iteration.
We present a similar high-level query language based on a multi-dimensional array model, also known as “datacubes”, and its implementation in the rasdaman Array DBMS, a database system centered around arrays (rather than tables). The query optimizer can generate both local and distributed plans and find efficient work splittings. The net effect for the user is that disparate, independent rasdaman deployments can be federated in a location-transparent manner: users do not need to know where the data sit that get accessed in a query. In the extreme case, cloud-based data centers can get federated with small edge devices like Raspberry Pi, and the optimizer can generate plans which take into account the asymmetric capabilities of the nodes involved.
This language has been adopted as Part 15 of the ISO SQL standard. A slightly modified version named Web Coverage Processing Service (WCPS), which additionally incorporates space-time semantics, has been standardized by the Open Geospatial Consortium and ISO. In rasdaman, WCPS queries internally get mapped to SQL/MDA queries which subsequently get evaluated, including distributed processing.
We present concepts, architecture, and live demos on geo datacubes representing weather forecasts and satellite image timeseries.
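To give a concrete impression of such a query, below is an illustrative WCPS request; the endpoint URL and coverage name are placeholders, and the exact request parameters can differ per deployment.

    # Illustrative WCPS request; endpoint URL and coverage name are placeholders.
    import requests

    endpoint = "https://example.org/rasdaman/ows"
    # Average temperature over one spatial cell for a year of a hypothetical
    # datacube; where and how the evaluation runs is up to the server.
    wcps = """
    for $c in (AvgLandTemp)
    return avg($c[Lat(53.08), Long(8.80), ansi("2014-01":"2014-12")])
    """
    resp = requests.get(endpoint, params={
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": wcps,
    }, timeout=60)
    resp.raise_for_status()
    print(resp.text)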
Q&A
16:45 (00:30) ‧ Event Hall ‧ Federated data management across domains
Abstract
Questions and panel discussion
End of Day
17:15 (00:15) ‧ Foyer ‧ General
Abstract
Head out to a self-organized dinner and social time. Venue closing at 17:30.
2025-10-24
Arrival and Registration
08:30 (00:30) ‧ Foyer ‧ General
Abstract
Talk to (new) friends, get comfortable for the 2nd day
Compute on demand
09:00 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure
Michał Szczepanik
Compute-on-demand with a git-annex special remote: an fMRIPrep use case example
Abstract
Git-annex special remotes can be thought of as more of a protocol than a place. A special remote can perform arbitrary operations to get a piece of content - so why not have it compute instead of download? Recently, two independent implementations have been published. This talk will discuss them both.
The datalad-remake special remote implementation relies on compute instructions stored in a git-tracked file, and uses URL-like git-annex keys to record availability. On get, a temporary worktree is provisioned and the compute instructions are executed. Signing of git commits is used to prevent unauthorised execution: a trust list needs to be declared.
The compute special remote is now built into git-annex. It is based on a compute special remote interface, and a protocol for communicating with external programs (or executable scripts).
As for non-trivial use cases, the ability to compute on demand addresses an important issue faced by neuroscientists. Preprocessing fMRI data includes morphing 4-dimensional (space + time) brain images to match a standard template. Usually, preprocessed images are normalized to more than one template, multiplying the storage requirements. Computing the transformations is time-consuming, but the transformation matrices are small and applying them to produce normalized images is quick. Recent releases of fMRIPrep, a popular software for fMRI data preprocessing, save all files required to apply these transformations, and provide workflows which can apply them reproducibly in a “shortcut” manner. In the talk, I will demonstrate how datalad-remake can be used with fMRIPrep to save hundreds of megabytes per study participant.
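The underlying idea can be illustrated independently of either implementation: when requested content is absent, re-run a recorded instruction to regenerate it instead of downloading it. A toy sketch (not the datalad-remake or git-annex compute-remote API; all file names are hypothetical, loosely modelled on fMRIPrep outputs):

    # Toy illustration of compute-on-demand (not the datalad-remake or
    # git-annex compute-remote API): if an output is missing, re-run the
    # recorded instruction that produced it instead of fetching it.
    import os
    import subprocess

    # A "recipe" recorded alongside the data: inputs, command, outputs.
    RECIPES = {
        "outputs/sub-01_space-MNI_bold.nii.gz": {
            "inputs": ["inputs/sub-01_bold.nii.gz", "transforms/sub-01_to-MNI.h5"],
            "command": ["python", "code/apply_transform.py",
                        "inputs/sub-01_bold.nii.gz",
                        "transforms/sub-01_to-MNI.h5",
                        "outputs/sub-01_space-MNI_bold.nii.gz"],
        },
    }

    def get(path):
        """Return path, recomputing its content from a recipe if it is absent."""
        if os.path.exists(path):
            return path
        recipe = RECIPES[path]
        for dep in recipe["inputs"]:
            if not os.path.exists(dep) and dep in RECIPES:
                get(dep)  # inputs may themselves be computable
        subprocess.run(recipe["command"], check=True)
        return path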
Continuous Benchmarking of HPC Simulations
09:20 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure
Jayesh Badwaik
Opportunities and Challenges in Implementing Continuous Benchmarking for HPC Simulations
Abstract
Ensuring performance consistency and early regression detection is critical in high-performance computing (HPC) operations. Traditional benchmarking methods rely on manual execution, leading to inconsistencies and delayed issue detection. During the JUREAP program, we integrated Continuous Benchmarking (CB) using exacb to standardize performance evaluation across 50 applications. This automation improved reproducibility, streamlined reporting, and enabled early detection of system anomalies, such as faulty Slurm updates and workflow execution issues on the JEDI machine. Even without a fully operational exascale supercomputer, exacb facilitated systematic performance comparisons, providing valuable insights into application scalability.
Beyond JUREAP, Cx enhances research software development and HPC system management. Our framework simplifies benchmarking, ensuring efficient performance tracking and optimization at scale—key for the upcoming JUPITER exascale supercomputer. Automating benchmarking reduces manual overhead, improves system stability, and aids in troubleshooting by providing structured performance insights. In this talk, we share our experience implementing CB in JUREAP, key findings from benchmarking 50 applications, and the broader impact of CI/CD/CB on research software, system administration, and future exascale computing.
Datalad Reproducibility for High Performance Computing: The Datalad Slurm Extension
09:40 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure
Andreas Knüpfer
Datalad run and rerun are incompatible with HPC. The talk presents how we fixed it.
Abstract
Datalad run and rerun for machine-actionable reproducibility do not work in High Performance Computing (HPC) environments, where compute jobs are managed through a batch scheduling system. Using datalad run outside of Slurm for job submission is pointless, and using it inside a batch job causes problems as well as very inefficient behavior.
We present the Datalad Slurm extension to solve this for the Slurm batch scheduling system. It introduces the new sub-commands “slurm-schedule”, “slurm-finish”, and “slurm-reschedule”. They make it possible to schedule many jobs at the same time from the same clone of a repository and to generate a reproducibility record in the git log for successful jobs.
The talk introduces the solution from the user perspective, explains the design idea and some implementation details, presents some performance evaluation, and touches on a few extra features.
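A hypothetical usage sketch of the sub-commands named above follows; the abstract does not spell out their options, so the command-line shapes shown are assumptions rather than the extension's real interface.

    # Hypothetical workflow sketch; the exact arguments of the Datalad Slurm
    # sub-commands are assumptions, only the sub-command names come from the
    # abstract above.
    import subprocess

    # Record and submit a batch job from a clone of the dataset.
    subprocess.run(["datalad", "slurm-schedule", "sbatch", "job.slurm"], check=True)

    # Later, once the Slurm job has finished, turn the scheduled record into a
    # reproducibility record in the git log.
    subprocess.run(["datalad", "slurm-finish"], check=True)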
Iroh p2p QUIC transport and resumable verified transfers
10:00 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure
Floris Bruynooghe
An overview of iroh p2p connections and a blobs protocol based on blake3 hashes running on top
Abstract
With iroh our aim is to provide reliable direct connections where the first byte can flow without delay. Iroh provides a QUIC connection to users that is hole-punched when possible. Users are free to run any protocols on top of the QUIC connection. One useful protocol is the blobs protocol, which uses the internal BLAKE3 Merkle tree to enable verified streaming of data.
This talk aims to provide an overview of how iroh’s architecture and connections work, followed by a brief look at how the verified streaming is designed.
Q&A
10:20 (00:30) ‧ Event Hall ‧ Computational workflows and network infrastructure
Abstract
Questions and panel discussion
Coffee
10:50 (00:30) ‧ Foyer ‧ Break
Abstract
Coffee and snack overflow
dtool and dserver: A flexible ecosystem for findable data
11:20 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability
Johannes Laurin Hörmann
dtool is a lightweight data management tool that packages metadata with immutable data to promote accessibility, interoperability, and reproducibility; dserver makes dtool datasets findable
Abstract
Making data FAIR - findable, accessible, interoperable, reusable - has become the recurring theme behind many research data management efforts. dtool is a lightweight data management tool that packages metadata with immutable data to promote accessibility, interoperability, and reproducibility. Each dataset is self-contained and does not require metadata to be stored in a centralised system. dserver, as defined by a REST API, makes dtool datasets findable, hence rendering the dtool ecosystem fit for a FAIR data management world. Its simplicity, modularity, accessibility and standardisation via API distinguish dtool and dserver from other solutions and enable it to serve as a common denominator for cross-disciplinary research data management. The dtool ecosystem bridges the gap between standardisation-free data management by individuals and FAIR platform solutions with rigid metadata requirements. We show how dtool and dserver have been used productively to enable research in solid mechanics [1], multiscale simulations [2], and molecular dynamics simulations [3].
[1] A. Sanner and L. Pastewka, Crack-Front Model for Adhesion of Soft Elastic Spheres with Chemical Heterogeneity, J. Mech. Phys. Solids 160, 104781 (2022).
[2] H. Holey, A. Codrignani, P. Gumbsch, and L. Pastewka, Height-Averaged Navier–Stokes Solver for Hydrodynamic Lubrication, Tribol. Lett. 70, 36 (2022).
[3] J. L. Hörmann, C. Liu, Y. Meng, and L. Pastewka, Molecular Simulations of Sliding on SDS Surfactant Films, J. Chem. Phys. 158, (2023).
DataChain: Query and Version Your Cloud Storage Without Moving a File
11:40 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability
Dmitry Petrov
The open-source DataChain tool adds querying, versioning, and metadata to raw files—right where they are.
Abstract
Cloud object stores like S3, GCS and Azure are the backbone of modern AI workflows - but they weren’t built for understanding what’s inside files, which is critical for ML and AI teams. DataChain is an open-source tool that helps build a semantic layer on top of your storage using ML models and LLMs. It turns raw files into structured, versioned datasets - without moving, modifying, or duplicating anything.
DataChain is built by the team behind DVC, the industry standard for ML data versioning. In this talk, you’ll see how it helps teams manage unstructured data at scale and enables a new class of AI-native tools - from agentic pipelines to semantic search.
Information Management at the INM-7
12:00 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability
Stephan Heunis
A linked metadata-based toolset for modelling, annotation, discoverability, and protection of information
Abstract
More or less a year and a half ago, around the time of the first distribits event, we started exploring a new direction for information management at the INM-7, Research Center Jülich. The goal was to extend beyond the constraints imposed by a DataLad, git, and git-annex-based data management approach (i.e. the need for a git repository to “live somewhere”) to something that connects managed data to the wider world of data infrastructures. Semantic metadata and RDF soon became the necessary ingredients for describing our data such that it becomes interoperable and machine-actionable by design, opening up possibilities to translate and transform infrastructure-dependent datasets into portable yet comprehensive data descriptors. In this talk, we’ll show the set of interoperable metadata-based tools that we have developed to model, annotate, and protect data, and we’ll demonstrate existing deployments in varying use cases, from the management of institute-internal personnel records to the curation of data samples from a research study.
Q&A
12:20 (00:30) ‧ Event Hall ‧ Metadata-based tools and data discoverability
Abstract
Questions and panel discussion
Lunch (Self-Organized)
12:50 (01:30) ‧ Foyer ‧ Break
Abstract
Lunch (self-organized, outside venue)
Unconference
14:20 (00:25) ‧ Event Hall ‧ Unconference
Coffee
14:45 (00:30) ‧ Foyer ‧ Break
Abstract
Coffee and snack overflow
Distributed data : standards, governance and analyses in neurosciences
15:15 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences
Jean-Baptiste Poline
This talk will review some sociological and technical aspects associated with the standards, governance structure and data analyses of distributed data.
Abstract
Distributed data in life science has become a necessity given the sharing constraints imposed by ethics and legal and political frameworks, which can be found at the laboratory, institution, country or continent levels. While distributing data is necessary, this comes with the huge challenge of making local data interoperable.
Ten years after the Wilkinson et al. FAIR guideline proposal, finding a standard to represent a particular set of data still looks a little like a busy Christmas market where one has to choose the type of decoration to put on the data. The reasons why it is so hard to find a common language are multiple and intertwined with research sociology, i.e. how laboratories or institutions get funding, and how they compete or collaborate. For those who have been working on developing data standards for many years, a feeling that things are never going to be resolved might dominate. In addition, sustaining the development of standards, as well as their maintenance within a reasonable governance structure, is rarely a priority for funding agencies, which tend to reward groundbreaking research and emphasize innovation.
In this presentation, I will first lay out key sociological and technical factors preventing distributed data reuse; some of these, such as the classical research incentives, are well known, while others are less well understood. Secondly, and mainly, I would like to introduce some hope that the process can indeed be successful and propose a number of guidelines, based on the collective experience of the development of standards and tools. In particular, I will consider federated analysis as a key motivation for standardized data and present some first experiments using a combination of tools (FedBiomed, Nipoppy / BIDS, Neurobagel) and standards (metadata, computing, etc.) enabling a community to extract information from distributed datasets. I will give some examples of federated analysis of Parkinson’s disease clinical and neuroimaging data, and I will review where LLMs will play a key role in facilitating this process. It is likely that the capacity to obtain results that cannot be achieved otherwise will drive the adoption and sustainability of distributed data and analysis architectures, but only if the process concurrently builds communities of practice.
Bridging the Gap Between Storage and Applications: A Modular Concept for Large Image Data Access
15:35 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences
Julia Thönnißen
Concept for a modular, cloud-native image delivery service enabling access and transformation of large image datasets—bridging storage and applications without data duplication.
Abstract
Recent advances in imaging technologies—particularly high-throughput methods—have led to an unprecedented growth of image datasets, reaching Terabytes to Petabytes in scale. While such massive datasets offer transformative potential for scientific discovery, they also introduce significant challenges for visualization and analysis due to the sheer size of the data and its continuous growth. Visualizing, annotating, and analyzing large-scale image datasets raises a fundamental dilemma of balancing computational efficiency and memory requirements. Many existing tools fail to manage large datasets effectively due to memory constraints, often forcing lossy methods like downsampling. Conversely, solutions optimized for large data volumes frequently depend on specialized or proprietary formats, reducing interoperability with other ecosystems. This highlights diverging requirements: storage systems favour compression for compactness, analysis tools require fast data access, and visualization tools benefit from tiled, multi-resolution formats. Without a unified strategy, institutions often resort to inefficient workflows involving repeated format conversions and costly data duplication to support diverse applications.
Ongoing standardization efforts within the bioimaging community [1-4] represent important developments towards more efficient and standardized use of bioimaging data. However, the conversion of data into a single (and yet evolving) standard is not feasible for rapidly growing large-scale datasets, especially given very diverging needs for parallel processing on HPC systems.
To address these issues, we present a concept for a modular cloud-native image delivery service designed to act as a flexible middleware layer between large-scale image repositories and consuming applications. The system supports heterogeneous input formats and delivers transformed data views on demand. It performs real-time operations such as coordinate transformations, filtering, and multi-resolution tiling, eliminating the need for pre-processing or intermediate storage. The service offers an extensible set of access points: RESTful APIs for web-based visualization (e.g., Neuroglancer, OpenSeadragon), virtual file system mounts for file-oriented tools (e.g., OMERO, ImageJ), and programmatic interfaces compatible with customizable environments (e.g., Napari, datalad). Additionally, it can dynamically present standard-conformant data views—such as those aligned with the Brain Imaging Data Structure (BIDS) [4]—from arbitrarily organized datasets.
By decoupling data access from physical storage layout, the service facilitates scalable, multi-tool interoperability in distributed environments without data duplication. In summary, we propose a flexible and extensible approach to image data access that supports dynamic transformations, minimizes redundancy, and bridges the gap between diverse storage backends and modern, distributed applications. It aligns with the FAIR data principles and builds upon community standards while enabling efficient workflows for managing and exploiting large-scale image datasets.
[1] S. Besson et al., “Bringing Open Data to Whole Slide Imaging”, Digital Pathology ECDP 2019, Lecture Notes in Computer Science, vol. 11435, pp. 3–10, Jul. 2019. DOI: 10.1007/978-3-030-23937-4_1.
[2] J. Moore et al., “OME-NGFF: A next-generation file format for expanding bioimaging data-access strategies”, Nature Methods, vol. 18, no. 12, pp. 1496–1498, Dec. 2021. DOI: 10.1038/s41592-021-01326-w.
[3] C. Allan et al., “OMERO: Flexible, model-driven data management for experimental biology”, Nature Methods, vol. 9, no. 3, pp. 245–253, Mar. 2012. DOI: 10.1038/nmeth.1896.
[4] K. J. Gorgolewski et al., “The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments”, Scientific Data, vol. 3, no. 1, p. 160044, Jun. 2016. DOI: 10.1038/sdata.2016.44.
BIDS-flux
15:55 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences
Basile Pinsard
Continuous, federated, multi-site MRI dataset creation from research instruments to open publication with a DataLad backbone.
Abstract
In an era of data-hungry algorithms, large collections of high-quality research data are increasingly important to address complex scientific hypotheses. Acquiring, curating, preparing, and sharing datasets at scale remain challenging tasks that require extensive field-specific knowledge, and thus rely upon groups of researchers with varying expertise. As such, these steps are often conducted in non-integrated and sequential phases rather than continuously, which in turn limits the observability (e.g., progress tracking, quality assessment, error detection) of the collection process and, consequently, the potential for data fixes or operational improvements. Furthermore, while funding agencies and local institutions both strongly encourage researchers to implement FAIR principles (findability, accessibility, interoperability, and reusability) (Wilkinson et al., 2016) in their data management practices, ethics committees - notably those dealing with human research - logically impose important privacy, cybersecurity, sovereignty and governance constraints on data sharing. Altogether, the burden of adequately conducting scientific data collection, preparation and sharing is far too large for individual labs to bear, even more so when novice students or trainees are tasked with implementing these complex steps. These responsibilities further increase when scaling projects to multiple labs or sites with distinct institutional governance or digital infrastructures, inherently slowing collaboration. All these challenges of large-scale, high-quality, collaborative data collection initiatives call for open-science community efforts and institutional infrastructure to support efficient scientific operations and as-open-as-possible data sharing.
From our experience in collecting and managing the dense fMRI data of the CNeuroMod project, we aim to transfer, genericize, scale and automate these workflows in the context of our institution-wide open-science initiative, as well as of the multisite Canadian Pediatric Imaging Platform (CPIP), by designing and implementing the BIDS-Flux infrastructure. The neuroimaging community has already conducted extensive efforts in data management by specifying the robust and evolving file-based scientific data standard BIDS (Gorgolewski et al., 2016), which automated tools, BIDS-Apps (Gorgolewski et al., 2017), can robustly read and write. This foundation allowed us to design and start implementing a scalable FAIR data management platform.
Modularly organized datasets are managed as DataLad (Halchenko et al., 2021) repositories to version large files with git-annex while exposing data structure and metadata with git, which also tracks provenance between datasets through submodules. Code-hosting platforms (e.g., GitHub, GitLab) have enhanced git version control with agile practices such as development operations (DevOps) through continuous integration and deployment (CI/CD) mechanisms, which can be translated to data operations (DataOps) on DataLad repositories. We therefore chose to host and organize datasets on self-deployed GitLab instances that orchestrate workflows, while storing the data in a modern, scalable object store (MinIO). Provided some data collection conventions are followed (e.g., ReproNim), the platform continuously ingests raw data from instruments (e.g., MRI, REDCap, Biopac) and carries out data standardization (e.g., heudiconv, spec2nii, phys2bids), anonymization (defacing), testing (e.g., BIDS-validator, protocol compliance) and quality control (e.g., MRIQC).
After review and merge of the raw standard data from new sessions, a set of containerized BIDS-Apps configured for the study-specific collection protocol is triggered using DataLad reproducibility features to generate derivatives, which also undergo review through reports and are merged into their own datasets. All resulting datasets are continuously and recurringly covered by data-standard compliance, deployment, privacy and security audits. While focussed on neuroimaging and peripherally collected data covered by the BIDS extensions, this infrastructure is modular enough to be adapted to other types of file-based scientific data structures, provided field-specific data standards and reproducible tools exist. Additionally, while file-based datasets fit the earlier stages of data preparation, extracted data can also be forwarded to tabular or vector-optimized storage better suited for downstream usage such as statistical modelling or machine-learning operations (MLOps).
Building upon DataLad's inheritance of git decentralization and git-annex's versatility, we further designed our platform to enable distributed dataset creation. Multiple distant sites can independently add new data to local forks of the datasets, while federation mechanisms propagate these changes and optionally transfer released non-sensitive data to community-accessible academic cloud storage and long-term archives (e.g., DataVerse), with DataLad flexibly tracking data presence in these redundant storage locations. Apart from the benefits of collaboratively scaling scientific endeavours, a distributed, replicated and federated management of scientific operations can accommodate the complex requirements of resilient data sovereignty and open science.
Finally, while a large portion of the scientific data workflow can be automated, the later stages of analysis and publication require, for now, more human intervention. Even if FAIR data management calls for very technical solutions as described above, the access and usage of the data have to be simple and user-friendly to empower researchers with FAIR practices. We thus plan to provide easily launchable, distributed, cloud-based containerized computing environments (e.g., BinderHub) with tracking of the dataset versions used. By powering interactive exploration of rich datasets as much as full-fledged analyses in modern executable reproducible preprints (Karakuzu et al., 2022), such computational environments are the final steps to an end-to-end FAIR scientific workflow, tracking provenance from collection instruments to publications.
Gorgolewski, K.J. et al. (2016) ‘The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments’, Scientific Data, 3, p. 160044.
Gorgolewski, K.J. et al. (2017) ‘BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods’, PLoS Computational Biology, 13(3), p. e1005209.
Halchenko, Y.O. et al. (2021) ‘DataLad: distributed system for joint management of code, data, and their relationship’, Journal of Open Source Software, 6(63). Available at: https://doi.org/10.21105/joss.03262.
Karakuzu, A. et al. (2022) ‘NeuroLibre: A preprint server for full-fledged reproducible neuroscience’. Available at: https://doi.org/10.31219/osf.io/h89js.
Wilkinson, M.D. et al. (2016) ‘The FAIR Guiding Principles for scientific data management and stewardship’, Scientific Data, 3, p. 160018.
Q&A
16:15 (00:30) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences
Abstract
Questions and panel discussion
Conference Wrap-Up
16:45 (00:15) ‧ Event Hall ‧ General
End of Day
17:00 (00:30) ‧ Foyer ‧ General
Abstract
Head out to a self-organized dinner and social time, or head home. Venue closing at 17:30.
2025-10-25
Coffee
08:30 (00:30) ‧ Seminar room, 3rd floor ‧ Break
Abstract
Coffee and snack overflow
Kick-Off/Pitches
09:00 (00:30) ‧ Seminar room, 3rd floor ‧ Hackathon
Hacking
09:30 (02:30) ‧ Seminar room, 3rd floor ‧ Hackathon
Lunch (Self-Organized)
12:00 (01:30) ‧ Seminar room, 3rd floor ‧ Break
Abstract
Lunch (self-organized, outside venue)
Hacking
13:30 (01:00) ‧ Seminar room, 3rd floor ‧ Hackathon
Coffee
14:30 (00:15) ‧ Seminar room, 3rd floor ‧ Break
Abstract
Coffee and snack overflow
Hacking
14:45 (01:45) ‧ Seminar room, 3rd floor ‧ Hackathon
Wrap-Up
16:30 (00:30) ‧ Seminar room, 3rd floor ‧ Hackathon
End of Day
17:00 (00:30) ‧ Seminar room, 3rd floor ‧ General
Abstract
Head out to a self-organized dinner and social time, or head home. Venue closing at 17:30.