
Distribits 2025 Schedule

You can also access the schedule information as Pentabarf XML. For viewing the schedule on mobile, we recommend the Giggity app (F-Droid) – use the link to the XML file above. Unfortunately, we are not currently aware of a free app compatible with iOS devices.

All times listed are in Central European Time (CET).

2025-10-23

Arrival and Registration

08:30 (00:30) ‧ Foyer ‧ General

Abstract

Find the venue, pick up name badges, get comfortable

Welcome and Overview

09:00 (00:20) ‧ Event Hall ‧ General

Abstract

Welcome from the organizers

Pragmatic YODA: overview of YODA principles and their wild life encounters

09:20 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software

Yaroslav O. Halchenko

A YODA principle a day keeps gray hair rate at bay

Abstract

The YODA principles were formulated in 2018 as a poster for a human neuroimaging conference and were later covered in the DataLad Handbook. Despite that exposure, they remain largely unknown in the community, although they might naturally be considered ‘ecological’ and are already followed in scientific practice by many researchers. Under the tagline ‘A YODA principle a day keeps gray hair rate at bay,’ the presentation will cover essential YODA concepts and demonstrate their real-world application using tools like DataLad, reviewing some prominent examples such as OpenNeuroDerivatives, presented at the last distribits. We will further showcase supporting tools like con/duct while providing practical tips for sticking to YODA principles in day-to-day scientific workflows. By relating and potentially contrasting these principles to other hierarchical organization standards (e.g., FHS and the XDG Base Directory specification), we hope to provide insights into creating more maintainable and reproducible research data resources.
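For readers unfamiliar with the principles, here is a minimal sketch (not material from the talk) of what a YODA-style setup can look like using DataLad's Python API; the input-dataset URL and the analysis script are placeholders.

    # Minimal YODA-style setup with DataLad's Python API (illustrative only).
    import datalad.api as dl

    # One dedicated dataset per analysis; the 'yoda' procedure pre-creates
    # code/ and a README so code, inputs, and outputs live together.
    dl.create(path="my-analysis", cfg_proc="yoda")

    # Reference inputs as subdatasets instead of copying them in.
    dl.clone(
        source="https://example.org/raw-data.git",   # placeholder input dataset
        path="my-analysis/inputs/raw-data",
        dataset="my-analysis",
    )

    # Record how results were produced, so they can be recomputed later.
    dl.run(
        cmd="python code/analysis.py inputs/raw-data outputs/",  # hypothetical script
        dataset="my-analysis",
        message="Run analysis on raw data",
    )

The same steps are available on the command line as datalad create -c yoda, datalad clone -d ., and datalad run.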

Research Software Education and Documentation: The overlooked pillar of software sustainability and high quality science

09:40 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software

Julia-Katharina Pfarr

How good can research software be without proper documentation and training?

Abstract

The crucial role of open and collaborative research software development in advancing high-quality research is increasingly recognized and valued within the scientific community and by funding sources. While attention often focuses on the software engineering aspects of research tools, the development of appropriate training materials remains frequently overlooked, with limited resources available to guide developers in creating effective tutorials and limited recognition of this work in academic evaluation. Though good examples of comprehensive documentation and training materials exist (e.g., DataLad, BIDS, various Python packages), the knowledge of how these materials were created - including the underlying thought processes and decisions - typically remains with their creators. Train-the-trainer programs such as those from ReproNim or Software Carpentry offer invaluable contributions in preparing peers for research software education; however, these programs have limited spots, and their resources for trainer education are not widely enough known. Therefore, in my talk, I will emphasize the importance of creating educational and training materials for open-source research software and provide practical guidelines for developers.

The first part of my presentation will highlight the value of research software education materials. Acquiring proficiency in software not only increases scientific efficiency - researchers spend less time troubleshooting and more time conducting science with fewer errors - but well-designed tutorials also lower entry barriers to sophisticated analyses: without proper training materials, certain methods remain accessible only to those with advanced computational backgrounds. Well-constructed tutorials break sophisticated techniques into manageable concepts with clear implementation steps. This democratizes science by enabling researchers from smaller institutions and under-resourced settings to implement cutting-edge techniques without extensive local expertise. Furthermore, a community of well-trained researchers contributes to software sustainability, beginning with user-driven documentation improvements and potentially advancing to code contributions. User feedback during training reveals necessary design improvements, creating tools that increasingly align with researchers’ actual workflows.

Key considerations in my talk will address existing training modalities (progressive complexity versus task-based approaches), how to determine which are most appropriate and effective for specific research software, the time investment required, and the sustainability challenges that arise. Through these discussions, I aim to emphasize that investing in high-quality training resources is central to research software development - particularly in complex domains like neuroimaging - and essential for maximizing the scientific impact of our tools. Actions to increase the value of software training should include recognition by the science community, e.g., through specific awards or by quantifying the gains in software quality that come with good documentation. Finally, I will address the critical questions: who should be responsible for undertaking this task? And how do we make it sustainable?

Q&A

10:00 (00:20) ‧ Event Hall ‧ Practical guidelines for data and software

Abstract

Questions and panel discussion

Coffee

10:20 (00:30) ‧ Foyer ‧ Break

Abstract

Coffee and snack overflow

git-annex for computer scientists

10:50 (00:20) ‧ Event Hall ‧ git-annex and special remotes

Joey Hess

quick introduction to git-annex internals for people who are not scared of words like “distributed”

Abstract

git-annex has hundreds of subcommands, a myriad of features and use cases, and comprises a hundred thousand lines of code. But there is a simple core that, if you understand it, makes it easy to grasp everything else.

After learning that, we’ll check our understanding by applying it to a few features that have been recently added to git-annex, like the compute special remote.
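As a taste of that core, below is a toy sketch (not from the talk) of how git-annex addresses content: a key derived from a file's hash and size, content stored under .git/annex/objects/, and a symlink in the work tree pointing at it. The helper only imitates the naming scheme of the SHA256E backend; real keys and object paths are computed by git-annex itself.

    # Toy illustration of git-annex-style content addressing (SHA256E backend).
    import hashlib
    from pathlib import Path

    def annex_key(path: Path) -> str:
        """Imitate a SHA256E key: SHA256E-s<size>--<sha256><extension>."""
        data = path.read_bytes()
        return f"SHA256E-s{len(data)}--{hashlib.sha256(data).hexdigest()}{path.suffix}"

    example = Path("photo.jpg")
    example.write_bytes(b"not really a JPEG, just example content")
    print(annex_key(example))
    # git-annex would store the content at roughly
    #   .git/annex/objects/<two hash-derived dirs>/<key>/<key>
    # and replace photo.jpg with a symlink pointing there. Which repositories
    # hold a copy is recorded per key in the separate 'git-annex' branch.

Location tracking, transfers, and special remotes all operate on such keys, which is what makes the rest of the tool comparatively easy to reason about.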

Managing Tape Archives with git-annex: A Special Remote for Sequential Media

11:10 (00:20) ‧ Event Hall ‧ git-annex and special remotes

Steffen Vogel

Use git-annex to manage your tape archive index.

Abstract

Magnetic tape remains a highly reliable and cost-effective medium for long-term data storage. However, its sequential nature and limited integration with modern software workflows often hinder adoption outside traditional enterprise environments. This talk introduces a new approach that integrates tape storage with git-annex through a purpose-built special remote called git-annex-remote-tape, enabling versioned data management on offline, high-capacity media.

We begin with a brief technical overview of the LTO ecosystem, focusing on tape operation via SCSI command sets and connectivity through SAS or Fibre Channel interfaces. These drives are well-supported in Linux via device nodes (e.g., /dev/st0) and managed with tools such as mt, tar, and sg3_utils. Despite robust hardware capabilities — such as hardware encryption and streaming compression — integrating tape into file-level workflows remains challenging, especially for applications requiring metadata versioning and content addressing.

To address this gap, we introduce git-annex-remote-tape, a special remote that treats tape as a managed backend for git-annex. This remote enables users to archive annexed content to tape, track its location, and restore it reliably using the git-annex interface. It leverages a simple on-tape data format optimized for sequential media, featuring:

  • Basic Metadata for each object linked to the git-annex key.
  • Streaming-friendly layout supporting large block sizes and seekable objects.

The remote manages the full data lifecycle: writing files to tape, updating metadata, and retrieving content via SCSI-based positioning. Tape media is treated as append-only and immutable, ensuring data consistency and simplifying recovery. All operations are coordinated through the standard git-annex special remote protocol, with additional logic for tape-specific behaviors.

A live demonstration will showcase real-world usage: initializing a tape remote, archiving data, removing local content, and restoring files from tape. We illustrate how tape-backed storage can be used transparently within annex repositories and how the system handles multiple volumes and offline content. The demo uses standard Linux tools and highlights how the solution integrates into existing workflows without requiring proprietary software.

We conclude with an overview of current limitations—such as lack of random access—and the development roadmap, including support for autoloaders, replication strategies, and enhanced diagnostics. By combining the durability of tape with the flexibility of content-addressable storage, git-annex-remote-tape offers a powerful tool for researchers, archivists, and infrastructure engineers seeking sustainable and auditable archival solutions.
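To make the protocol mentioned above concrete, here is a minimal sketch of a git-annex external special remote in Python. It is not git-annex-remote-tape: instead of driving a tape drive it copies content into a plain directory, purely to illustrate the shape of the stdin/stdout line protocol such remotes implement (real remotes often build on the annexremote Python library instead).

    #!/usr/bin/env python3
    # Minimal external special remote sketch: a directory stands in for tape.
    import shutil
    import sys
    from pathlib import Path

    ARCHIVE = Path("/tmp/fake-tape")   # stand-in for the tape backend

    def send(line: str) -> None:
        sys.stdout.write(line + "\n")
        sys.stdout.flush()

    def main() -> None:
        send("VERSION 1")                        # announce protocol version
        for raw in sys.stdin:
            words = raw.rstrip("\n").split(" ", 2)
            cmd = words[0]
            if cmd in ("INITREMOTE", "PREPARE"):
                ARCHIVE.mkdir(parents=True, exist_ok=True)
                send(f"{cmd}-SUCCESS")
            elif cmd == "TRANSFER":              # TRANSFER STORE|RETRIEVE <key> <file>
                direction, rest = words[1], words[2]
                key, path = rest.split(" ", 1)
                try:
                    if direction == "STORE":
                        shutil.copyfile(path, ARCHIVE / key)
                    else:                        # RETRIEVE
                        shutil.copyfile(ARCHIVE / key, path)
                    send(f"TRANSFER-SUCCESS {direction} {key}")
                except OSError as exc:
                    send(f"TRANSFER-FAILURE {direction} {key} {exc}")
            elif cmd == "CHECKPRESENT":
                key = words[1]
                status = "SUCCESS" if (ARCHIVE / key).exists() else "FAILURE"
                send(f"CHECKPRESENT-{status} {key}")
            elif cmd == "REMOVE":
                (ARCHIVE / words[1]).unlink(missing_ok=True)
                send(f"REMOVE-SUCCESS {words[1]}")
            else:                                # anything not handled here
                send("UNSUPPORTED-REQUEST")

    if __name__ == "__main__":
        main()

Installed on the PATH as git-annex-remote-<name>, such a program would be enabled with something like git annex initremote mytape type=external externaltype=<name> encryption=none; a real tape remote additionally has to handle volume selection, SCSI positioning, and the append-only bookkeeping described above.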

Forgejo-aneksajo: a git-annex/DataLad forge

11:30 (00:20) ‧ Event Hall ‧ git-annex and special remotes

Matthias Riße

Apply established software development practices to your (meta-)data projects.

Abstract

Software development and data curation or analysis share many of their issues: keeping track of the evolution of files – ideally with information on how, why, when, and by whom – organizing collaboration with multiple people, keeping track of known issues and other TODOs, discussing changes, making versions available to others, automating tasks, and more. Often you will even have to write code as part of a data project, blurring the line between the two even more. In the free and open-source software development world these issues already have well-established solutions: a version control system keeps track of your project’s history and ongoing development, a forge can serve as a collaboration hub, and CI/CD services provide flexible automation. So, why not apply them to our data management needs? Forgejo-aneksajo does just that and extends Forgejo with git-annex support, making it a versatile self-hostable (meta-)data collaboration platform that neatly fits into a data management ecosystem around git, git-annex and DataLad.

Your Life in Git

11:50 (00:20) ‧ Event Hall ‧ git-annex and special remotes

Yann Büchau

How I track pretty much all my digital assets with git (annex)

Abstract

Accidental deletions, program crashes, disk failures, etc. have led me to put as many relevant files as I can under version control with git. With this, one can go back in history, restore older versions and see changes over time. That’s great for code, small files and documents, but throwing a large media collection at git itself won’t result in a good time. git-annex, however, makes tracking and syncing large collections of arbitrarily-sized files possible. With Forgejo-aneksajo (a Forgejo fork with git-annex support), there is now a workable web interface for easy access and collaboration on git-annex repositories. In this talk I will explain how I “put my life in git” and show the tools I use to achieve it.

Q&A

12:10 (00:30) ‧ Event Hall ‧ git-annex and special remotes

Abstract

Questions and panel discussion

Lunch (Self-Organized)

12:40 (01:30) ‧ Foyer ‧ Break

Abstract

Lunch (self-organized, outside venue)

Extending the Brain Imaging Data Structure specification to provenance metadata

14:10 (00:05) ‧ Event Hall ‧ Lightning talks

Boris Clénet

Provenance allows BIDS datasets to hold information on how they were generated.

Abstract

Interpreting and comparing scientific results, as well as enabling reuse of data and analysis outputs, requires understanding provenance, i.e. how data were generated and processed. To be useful, the provenance must be comprehensive, understandable, easily communicated, and captured in a machine-accessible form. We present a recent extension of the Brain Imaging Data Structure (BIDS) that aims at describing the provenance of a dataset. This extension was designed as a combination of ergonomics and computability in order to help neuroscientists parse human-readable yet computer-generated provenance records.

Maintaining large datasets at scale

14:15 (00:05) ‧ Event Hall ‧ Lightning talks

Christopher Markiewicz

Finding and fixing defects that arise in automatically managed git-annex repositories.

Abstract

OpenNeuro is a neuroimaging data hosting service that allows researchers to upload datasets, which are converted into git-annex repositories and exposed for download. Users have created over 7000 datasets, in total containing more than 140TB of data.

Users largely interact with datasets through the web interface and a command-line uploader/downloader, and rarely use git, git-annex or DataLad directly. As a result, software bugs can introduce defects into a repository without detection until a user attempts to access the data, which can be many months and several dataset versions later.

This talk will discuss some defects that we have found, and the strategies we’ve used to resolve them. These include:

  • Unannex/re-annex commit pairs poisoning the git history. Finding, fixing and rewriting tags.
  • Recovering from large files somehow getting into your git-annex branch.
  • Copy-on-write git backups so you can rewrite history without fear.

Using Git-annex to enhance the MediaWiki file repository system

14:20 (00:05) ‧ Event Hall ‧ Lightning talks

Timothy Sanders

Git-annex is a great companion to MediaWiki’s built in file repository functionality. An annex repo is useful for file location tracking and integrity, as well as offering a bridge to external hosting services through special remotes and integrating both platforms through bidirectional metadata exposure.

Abstract

Managing binary file assets can be a pain point for MediaWiki admins, with binary files often eclipsing the wiki text content. MediaWiki offers an API for file repository backends and extensions for hosting binaries in services like S3 buckets, as well as hooks for additional file operations.

A git-annex MediaWiki backend can implicitly provide these features and more by acting as a bridge to special remotes, and can provide a more useful form of file versioning and archiving.

I will share my experiments with integrating git-annex with MediaWiki in several ways: starting with a git-annex-compatible uploads folder that uses a key format matching MediaWiki’s native SHA-1 hashdir format, then integrating it with MediaWiki’s hook interface. I also experimented with a git-annex special remote for the MediaWiki API. Finally, I am working on using git-annex as a MediaWiki file repository backend based on the official specification.

I have also found that there is an overlap between DataLad users and MediaWiki users, so this might be relevant.

If there’s time, I can provide some background and case studies. Note that this isn’t related to the git-remote-mediawiki project, although I’m a fan of that as well. I think git-annex makes more sense for this application.

Unconference

14:25 (00:50) ‧ Event Hall ‧ Unconference

Coffee

15:15 (00:30) ‧ Foyer ‧ Break

Abstract

Coffee and snack overflow

Managing distributed research data in the engineering sciences with a Data Mesh approach

15:45 (00:20) ‧ Event Hall ‧ Federated data management across domains

Mario Moser

Research data management in the engineering sciences is decentralised both organisationally and technically with respect to tools and data sources. A Data Mesh approach is therefore applied, and first experiences with it are presented.

Abstract

To make scientific data-driven research transparent and comprehensible, research data management (RDM) covers tools and methods to make data FAIR (findable, accessible, interoperable, reusable) and to maintain research data as a valuable resource. Moreover, a cultural change has started towards building analyses on existing data instead of necessarily collecting new data at the beginning of each experiment. RDM, especially in the engineering sciences, is characterised by heterogeneity and decentralisation: thousands of research institutions in Germany work independently from each other, unless collaborating within projects. Different domains form the engineering sciences, ranging from mechanical and electrical to civil engineering. Engineering is highly interdisciplinary, so data might be reused for a purpose unknown at the time of data generation. Data formats and structures are heterogeneous, covering relational sensor data and material models as well as images and audio. Data is provided in several repositories, whether generic, institutional, or specialised. Overall, this initial situation makes it hard to discover existing data, leverage this data both content-wise and technically, and assess the quality of a reused dataset.

The Data Mesh approach from industrial data management appears appropriate for this setting: instead of centralised IT teams, domains and their domain owners manage their datasets, being able to answer specific questions about the data and ensuring its quality. Data is provided in the form of data products, ensuring that relevant elements like metadata for context, code for processing, a handle for identification, provenance as history, and a license from a legal perspective, etc., are provided. Data remains in its original source, leveraging existing and potentially more specialised infrastructure. Compared to a monolithic ‘one-fits-all’ solution, this is less complex to maintain and can more easily adapt to future requirements. No complex ETL pipelines are required for data integration, although the approach requires data in their sources to be accessible, e.g. via an API. Based on metadata, the decentralised data in its sources is registered in a central platform, e.g. a data catalogue or graph, for increased findability. On such a self-serve platform, owners can onboard their data and users can find and access it. To achieve interoperability within a Data Mesh, federated governance is applied, consisting of global and local elements: designed depending on the respective requirements, global governance ensures standardisation between data within the whole mesh, while local rules leave room for domain-specific individual design decisions.

In this talk, the characteristics of RDM in the engineering sciences and of Data Mesh will be presented and mapped against each other. First results will be presented on the suitability of the approach and on fields of adaptation. Although not exactly applicable 1:1 to RDM, the Data Mesh approach addresses the main challenges identified before. Especially the domain-oriented approach and the federated governance go beyond purely technical or centralised solution approaches.

The Helmholtz Earth and Environment DataHub - Highly Distributed Data That Thrives on Metadata

16:05 (00:20) ‧ Event Hall ‧ Federated data management across domains

Ulrich Loup

Germany-wide, highly distributed observational data of the earth, the atmosphere, and the oceans, collected over decades, are now being made available by a Helmholtz initiative through one common combined data/metadata interface and standard, which poses a number of challenges to established scientific and technical workflows.

Abstract

In the environmental sciences, time-series data is key to, for example, monitoring environmental processes, validating earth system models and remote sensing products, training data-driven methods, and better understanding climate processes. A major issue is the lack of a consistent data availability standard aligned with the FAIR (findable, accessible, interoperable, reusable) principles.

The DataHub initiative, which is part of the Helmholtz Research Field Earth and Environment, addresses these shortcomings by establishing a large-scale infrastructure around common data standards and interfaces, for example, the Open Geospatial Consortium’s SensorThings API (STA). Closely related to the DataHub is the STAMPLATE project, whose challenging task was to harmonize the extremely heterogeneous metadata formats stemming from the different observation domains such as the earth, atmosphere and ocean. Moreover, within the domains different metadata formats developed historically due to diverging system architectures and missing guidelines.

In DataHub, the research data, whether it is collected by measurement devices or acquired through manual processes, is distributed among the seven participating research centers. Each of these centers is responsible for operating its own time series management system, which ingests the observational data. In addition to these data ingest systems, sensor and device management systems provide easy-to-use self-services for entering metadata, such as the Helmholtz Sensor Management System (https://helmholtz.software/software/sensor-management-system) or the O2A Registry (https://registry.o2a-data.de/). Each center operates a data/metadata synchronization service that ultimately makes the data available through STA, which integrates both data and metadata. Quality checking tools such as SaQC (https://helmholtz.software/software/saqc) facilitate data quality control. The powerful and modern Earth Data Portal (www.earth-data.de) with highly customizable thematic viewers is the central portal for data exploration. In order to ensure that metadata entered in any user self-service is also displayed in the Earth Data Portal along with the ingested data, custom, semantic metadata profiles developed in STAMPLATE augment STA’s core data model with domain-specific information.
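As a rough idea of what consuming such an endpoint looks like, the sketch below queries recent observations of one time series via the SensorThings API; the base URL and Datastream id are placeholders, while the Observations entity and the $top/$orderby/$filter query options are part of the OGC STA standard referred to above.

    # Hedged sketch: read observations from a (placeholder) SensorThings endpoint.
    import requests

    STA = "https://sta.example.org/v1.1"          # placeholder base URL
    url = f"{STA}/Datastreams(42)/Observations"   # observations of one time series

    resp = requests.get(url, params={
        "$top": 100,                              # page size
        "$orderby": "phenomenonTime desc",        # newest first
        "$filter": "phenomenonTime ge 2024-01-01T00:00:00Z",
    })
    resp.raise_for_status()
    for obs in resp.json()["value"]:
        print(obs["phenomenonTime"], obs["result"])
    # Larger result sets are paged; follow the "@iot.nextLink" in the response.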

In summary, the data that is accessible on the Earth Data Portal and available from the STA endpoints is distributed in two distinct ways. Firstly, observation data and its metadata are acquired by separate systems. Secondly, each center operates its own data and metadata infrastructure, with all centers ultimately connecting to STA endpoints.

The operationalization of the framework and its subsequent integration into research data workflows is imminent, presenting us with a number of challenges as our research data management processes undergo a transformative shift from manual, human-based workflows to self-organized, digitally-enabled workflows. For example, new ways of downloading data need to be found that meet the needs of researchers, while addressing issues such as copyright and avoiding infrastructure overload.

This talk addresses the fundamental elements of our initiative and the associated challenges.

Location-Transparent Distributed Datacube Processing

16:25 (00:20) ‧ Event Hall ‧ Federated data management across domains

Peter Baumann

Semantically rich query languages enable automatic query splitting and asymmetric load balancing

Abstract

It is a common first-semester Computer Science insight that programs written in general-purpose (“Turing-complete”) languages cannot, in general, be understood by another program - yet such understanding is necessary, among other things, to generate truly distributed execution plans. However, restricted models are indeed amenable to such analysis - in particular, the SQL database language, which (i) has a focused data structure, tables, and (ii) in its core does not have iteration.

We present a similar high-level query language based on a multi-dimensional array model, also known as “datacubes”, and its implementation in the rasdaman Array DBMS, a database system centered around arrays (rather than tables). The query optimizer can generate both local and distributed plans and find efficient work splittings. The net effect for the user is that disparate, independent rasdaman deployments can be federated in a location-transparent manner: users do not need to know where the data accessed in a query are located. In the extreme case, cloud-based data centers can be federated with small edge devices like a Raspberry Pi, and the optimizer can generate plans that take into account the asymmetric capabilities of the nodes involved.

This language has been adopted as Part 15 (SQL/MDA) of the ISO SQL standard. A slightly modified version named Web Coverage Processing Service (WCPS), which additionally incorporates space-time semantics, has been standardized by the Open Geospatial Consortium and ISO. In rasdaman, WCPS queries are internally mapped to SQL/MDA queries which are subsequently evaluated, including distributed processing.
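As an illustration of what such a declarative datacube query can look like, the sketch below sends a WCPS query over HTTP; the endpoint URL is a placeholder, the coverage and axis names (AvgLandTemp, Lat, Long, ansi) follow common rasdaman tutorial datasets, and the ProcessCoverages request style is assumed here rather than taken from the talk.

    # Hedged sketch: average land temperature at one location for one year, as CSV.
    import requests

    wcps_query = """
    for $c in (AvgLandTemp)
    return encode(
        $c[Lat(53.08), Long(8.80), ansi("2014-01":"2014-12")],
        "csv")
    """

    resp = requests.get(
        "https://datacubes.example.org/rasdaman/ows",   # placeholder endpoint
        params={
            "service": "WCS",
            "version": "2.0.1",
            "request": "ProcessCoverages",
            "query": wcps_query,
        },
    )
    print(resp.text)
    # Because the query is declarative, the server is free to split it across
    # federated deployments without the client noticing.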

We present concepts, architecture, and live demos on geo datacubes representing weather forecasts and satellite image timeseries.

Q&A

16:45 (00:30) ‧ Event Hall ‧ Federated data management across domains

Abstract

Questions and panel discussion

End of Day

17:15 (00:15) ‧ Foyer ‧ General

Abstract

Head out to a self-organized dinner and social time. Venue closing at 17:30.


2025-10-24

Arrival and Registration

08:30 (00:30) ‧ Foyer ‧ General

Abstract

Talk to (new) friends, get comfortable for the 2nd day

Compute on demand

09:00 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure

Michał Szczepanik

Compute-on-demand with a git-annex special remote: an fMRIPrep use case example

Abstract

Git-annex special remotes can be thought of as more of a protocol than a place. A special remote can perform arbitrary operations to get a piece of content - so why not have it compute instead of download? Recently, two independent implementations have been published. This talk will discuss them both.

The datalad-remake special remote implementation relies on compute instructions stored in a git-tracked file, and uses URL-like git-annex keys to record availability. On get, a temporary worktree is provisioned and the compute instructions are executed. Signing of git commits is used to prevent unauthorised execution: a trust list needs to be declared.

The compute special remote is now built into git-annex. It is based on a compute special remote interface, and a protocol for communicating with external programs (or executable scripts).

As for non-trivial use cases, the ability to compute on demand addresses an important issue faced by neuroscientists. Preprocessing fMRI data includes morphing 4-dimensional (space + time) brain images to match a standard template. Usually, preprocessed images are normalized to more than one template, multiplying the storage requirements. Computing the transformations is time-consuming, but the transformation matrices are small and applying them to produce normalized images is quick. Recent releases of fMRIPrep, a popular software for fMRI data preprocessing, save all files required to apply these transformations, and provide workflows which can apply them reproducibly in a “shortcut” manner. In the talk, I will demonstrate how datalad-remake can be used with fMRIPrep to save hundreds of megabytes per study participant.

Continuous Benchmarking of HPC Simulations

09:20 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure

Jayesh Badwaik

Opportunities and Challenges in Implementing Continuous Benchmarking for HPC Simulations

Abstract

Ensuring performance consistency and early regression detection is critical in high-performance computing (HPC) operations. Traditional benchmarking methods rely on manual execution, leading to inconsistencies and delayed issue detection. During the JUREAP program, we integrated Continuous Benchmarking (CB) using exacb to standardize performance evaluation across 50 applications. This automation improved reproducibility, streamlined reporting, and enabled early detection of system anomalies, such as faulty Slurm updates and workflow execution issues on the JEDI machine. Even without a fully operational exascale supercomputer, exacb facilitated systematic performance comparisons, providing valuable insights into application scalability.

Beyond JUREAP, Cx enhances research software development and HPC system management. Our framework simplifies benchmarking, ensuring efficient performance tracking and optimization at scale—key for the upcoming JUPITER exascale supercomputer. Automating benchmarking reduces manual overhead, improves system stability, and aids in troubleshooting by providing structured performance insights. In this talk, we share our experience implementing CB in JUREAP, key findings from benchmarking 50 applications, and the broader impact of CI/CD/CB on research software, system administration, and future exascale computing.

Datalad Reproducibility for High Performance Computing: The Datalad Slurm Extension

09:40 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure

Andreas Knüpfer

Datalad run and rerun are incompatible with HPC. The talk presents how we fixed it.

Abstract

Datalad run and rerun for machine-actionable reproducibility do not work in High Performance Computing (HPC) environments, where compute jobs are managed through a batch scheduling system. Using datalad run outside of Slurm for job submission is pointless, and using it inside a batch job causes problems as well as very inefficient behavior.

We present the Datalad Slurm extension to solve this for the Slurm batch scheduling system. It introduces the new sub-commands “slurm-schedule”, “slurm-finish”, and “slurm-reschedule”. They allow scheduling many jobs at the same time from the same clone of a repository and generate a reproducibility record in the git log for successful jobs.

The talk introduces the solution from the user perspective, explains the design idea and some implementation details, presents some performance evaluation, and touches on a few extra features.

Iroh p2p QUIC transport and resumable verified transfers

10:00 (00:20) ‧ Event Hall ‧ Computational workflows and network infrastructure

Floris Bruynooghe

An overview of iroh p2p connections and a blobs protocol based on blake3 hashes running on top

Abstract

With iroh our aim is to provide reliable direct connections where the first byte can flow without delay. Iroh provides a QUIC connection to users that is hole-punched when possible. Users are free to run any protocols on top of the QUIC connection. One useful protocol is the blobs protocol, which uses the BLAKE3 internal Merkle tree to enable verified streaming of data.

This talk aims to provide an overview of how iroh’s architecture and connections work, followed by a brief look at how the verified streaming is designed.

https://www.iroh.computer/
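As a toy illustration of the verified-streaming idea (not iroh's actual wire format), the sketch below checks each chunk as it arrives instead of hashing the whole blob at the end; a flat list of per-chunk SHA-256 hashes stands in for the BLAKE3 Merkle tree, which additionally allows verifying arbitrary ranges against a single root hash.

    # Toy verified streaming: per-chunk hashes instead of a BLAKE3 Merkle tree.
    import hashlib
    import io

    CHUNK = 16 * 1024

    def chunk_hashes(data: bytes) -> list[bytes]:
        """Sender side: publish one hash per fixed-size chunk."""
        return [hashlib.sha256(data[i:i + CHUNK]).digest()
                for i in range(0, len(data), CHUNK)]

    def receive(stream, hashes):
        """Receiver side: yield chunks only once they verify."""
        for i, expected in enumerate(hashes):
            chunk = stream.read(CHUNK)
            if hashlib.sha256(chunk).digest() != expected:
                raise ValueError(f"chunk {i} failed verification, aborting")
            yield chunk   # safe to write out; a resume can start after this chunk

    blob = b"x" * 50_000
    assert b"".join(receive(io.BytesIO(blob), chunk_hashes(blob))) == blob

A transfer interrupted after N good chunks can resume at chunk N, since everything received so far is already known to be correct.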

Q&A

10:20 (00:30) ‧ Event Hall ‧ Computational workflows and network infrastructure

Abstract

Questions and panel discussion

Coffee

10:50 (00:30) ‧ Foyer ‧ Break

Abstract

Coffee and snack overflow

dtool and dserver: A flexible ecosystem for findable data

11:20 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability

Johannes Laurin Hörmann

dtool is a lightweight data management tool that packages metadata with immutable data to promote accessibility, interoperability, and reproducibility; dserver makes dtool datasets findable

Abstract

Making data FAIR - findable, accessible, interoperable, reusable - has become the recurring theme behind many research data management efforts. dtool is a lightweight data management tool that packages metadata with immutable data to promote accessibility, interoperability, and reproducibility. Each dataset is self-contained and does not require metadata to be stored in a centralised system. dserver, as defined by a REST API, makes dtool datasets findable, hence rendering the dtool ecosystem fit for a FAIR data management world. Its simplicity, modularity, accessibility and standardisation via API distinguish dtool and dserver from other solutions and enable them to serve as a common denominator for cross-disciplinary research data management. The dtool ecosystem bridges the gap between standardisation-free data management by individuals and FAIR platform solutions with rigid metadata requirements. We show how dtool and dserver have been used productively to enable research in solid mechanics [1], multiscale simulations [2], and molecular dynamics simulations [3].

[1] A. Sanner and L. Pastewka, Crack-Front Model for Adhesion of Soft Elastic Spheres with Chemical Heterogeneity, J. Mech. Phys. Solids 160, 104781 (2022).

[2] H. Holey, A. Codrignani, P. Gumbsch, and L. Pastewka, Height-Averaged Navier–Stokes Solver for Hydrodynamic Lubrication, Tribol. Lett. 70, 36 (2022).

[3] J. L. Hörmann, C. Liu, Y. Meng, and L. Pastewka, Molecular Simulations of Sliding on SDS Surfactant Films, J. Chem. Phys. 158, (2023).

DataChain: Query and Version Your Cloud Storage Without Moving a File

11:40 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability

Dmitry Petrov

The open-source DataChain tool adds querying, versioning, and metadata to raw files—right where they are.

Abstract

Cloud object stores like S3, GCS and Azure are the backbone of modern AI workflows - but they weren’t built for understanding what’s inside files, which is critical for ML and AI teams. DataChain is an open-source tool that helps build a layer of semantics on top of your storage using ML models and LLMs. It turns raw files into structured, versioned datasets - without moving, modifying, or duplicating anything.

DataChain is built by the team behind DVC, the industry standard for ML data versioning. In this talk, you’ll see how it helps teams manage unstructured data at scale and enables a new class of AI-native tools - from agentic pipelines to semantic search.

Information Management at the INM-7

12:00 (00:20) ‧ Event Hall ‧ Metadata-based tools and data discoverability

Stephan Heunis

A linked metadata-based toolset for modelling, annotation, discoverability, and protection of information

Abstract

More or less a year and a half ago, around the time of the first distribits event, we started exploring a new direction for information management at the INM-7, Research Center Jülich. The goal was to extend beyond the constraints imposed by a DataLad, git, and git-annex-based data management approach (i.e. the need for a git repository to “live somewhere”) to something that connects managed data to the wider world of data infrastructures. Semantic metadata and RDF soon became the necessary ingredients for describing our data such that it becomes interoperable and machine-actionable by design, opening up possibilities to translate and transform infrastructure-dependent datasets into portable yet comprehensive data descriptors. In this talk, we’ll show the set of interoperable metadata-based tools that we have developed to model, annotate, and protect data, and we’ll demonstrate existing deployments in varying use cases, from the management of institute-internal personnel records to the curation of data samples from a research study.

Q&A

12:20 (00:30) ‧ Event Hall ‧ Metadata-based tools and data discoverability

Abstract

Questions and panel discussion

Lunch (Self-Organized)

12:50 (01:30) ‧ Foyer ‧ Break

Abstract

Lunch (self-organized, outside venue)

Unconference

14:20 (00:25) ‧ Event Hall ‧ Unconference

Coffee

14:45 (00:30) ‧ Foyer ‧ Break

Abstract

Coffee and snack overflow

Distributed data: standards, governance and analyses in neurosciences

15:15 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences

Jean-Baptiste Poline

This talk will review some sociological and technical aspects associated with the standards, governance structure and data analyses of distributed data.

Abstract

Distributed data in life science has become a necessity given the sharing constraints imposed by ethics and legal and political frameworks, which can be found at the laboratory, institution, country or continent levels. While distributing data is necessary, this comes with the huge challenge of making local data interoperable.

Ten years after the Wilkinson et al. FAIR guideline proposal, finding a standard to represent a particular set of data still looks a little like a busy Christmas market where one has to choose the type of decoration to put on the data. The reasons why it is so hard to find a common language are multiple and intertwined with research sociology, i.e. how laboratories or institutions get funding, how they compete or collaborate. For those who have been working on developing data standards for many years, a feeling that things are never going to be resolved might dominate. In addition, sustaining the development of standards as well as their maintenance within a reasonable governance structure is rarely a priority for funding agencies, which tend to reward ground-breaking research and emphasize innovation.

In this presentation, I will first lay out key sociological and technical factors preventing distributed data reuse; some of these, such as the classical research incentives, are well known, while others are less well understood. Secondly, and mainly, I would like to introduce some hope that the process can indeed be successful and propose a number of guidelines, based on the collective experience of the development of standards and tools. In particular, I will consider federated analysis as a key motivation for standardized data and present some first experiments using a combination of tools (FedBiomed, Nipoppy / BIDS, Neurobagel) and standards (metadata, computing, etc.) enabling a community to extract information from distributed datasets. I will give some examples of federated analysis of Parkinson’s disease clinical and neuroimaging data. I will also review where LLMs will play a key role in facilitating this process. It is likely that the capacity to get results that cannot be achieved otherwise will drive the adoption and the sustainability of distributed data and analysis architectures, but only if the process concurrently builds communities of practice.

Bridging the Gap Between Storage and Applications: A Modular Concept for Large Image Data Access

15:35 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences

Julia Thönnißen

Concept for a modular, cloud-native image delivery service enabling access and transformation of large image datasets—bridging storage and applications without data duplication.

Abstract

Recent advances in imaging technologies—particularly high-throughput methods—have led to an unprecedented growth of image datasets, reaching Terabytes to Petabytes in scale. While such massive datasets offer transformative potential for scientific discovery, they also introduce significant challenges for visualization and analysis due to the sheer size of the data and its continuous growth. Visualizing, annotating, and analyzing large-scale image datasets raises a fundamental dilemma of balancing computational efficiency and memory requirements. Many existing tools fail to manage large datasets effectively due to memory constraints, often forcing lossy methods like downsampling. Conversely, solutions optimized for large data volumes frequently depend on specialized or proprietary formats, reducing interoperability with other ecosystems. This highlights diverging requirements: storage systems favour compression for compactness, analysis tools require fast data access, and visualization tools benefit from tiled, multi-resolution formats. Without a unified strategy, institutions often resort to inefficient workflows involving repeated format conversions and costly data duplication to support diverse applications.

Ongoing standardization efforts within the bioimaging community [1-4] represent important developments towards more efficient and standardized use of bioimaging data. However, the conversion of data into a single (and yet evolving) standard is not feasible for rapidly growing large-scale datasets, especially given very diverging needs for parallel processing on HPC systems.

To address these issues, we present a concept for a modular cloud-native image delivery service designed to act as a flexible middleware layer between large-scale image repositories and consuming applications. The system supports heterogeneous input formats and delivers transformed data views on demand. It performs real-time operations such as coordinate transformations, filtering, and multi-resolution tiling, eliminating the need for pre-processing or intermediate storage. The service offers an extensible set of access points: RESTful APIs for web-based visualization (e.g., Neuroglancer, OpenSeadragon), virtual file system mounts for file-oriented tools (e.g., OMERO, ImageJ), and programmatic interfaces compatible with customizable environments (e.g., Napari, datalad). Additionally, it can dynamically present standard-conformant data views—such as those aligned with the Brain Imaging Data Structure (BIDS) [4]—from arbitrarily organized datasets. By decoupling data access from physical storage layout, the service facilitates scalable, multi-tool interoperability in distributed environments without data duplication.

In summary, we propose a flexible and extensible approach to image data access that supports dynamic transformations, minimizes redundancy, and bridges the gap between diverse storage backends and modern, distributed applications. It aligns with the FAIR data principles and builds upon community standards while enabling efficient workflows for managing and exploiting large-scale image datasets.

[1] S. Besson et al., “Bringing Open Data to Whole Slide Imaging”, Digital Pathology ECDP 2019, Lecture Notes in Computer Science, vol. 11435, pp. 3–10, Jul. 2019. DOI: 10.1007/978-3-030-23937-4_1.

[2] J. Moore et al., “OME-NGFF: A next-generation file format for expanding bioimaging data-access strategies”, Nature Methods, vol. 18, no. 12, pp. 1496–1498, Dec. 2021. DOI: 10.1038/s41592-021-01326-w.

[3] C. Allan et al., “OMERO: Flexible, model-driven data management for experimental biology”, Nature Methods, vol. 9, no. 3, pp. 245–253, Mar. 2012. DOI: 10.1038/nmeth.1896.

[4] K. J. Gorgolewski et al., “The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments”, Scientific Data, vol. 3, no. 1, p. 160044, Jun. 2016. DOI: 10.1038/sdata.2016.44.

BIDS-flux

15:55 (00:20) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences

Basile Pinsard

Continuous federated multi-site MRI dataset creation from research instruments to open-publication with a Datalad backbone.

Abstract

In an era of data-hungry algorithms, collections of large high-quality research data are increasingly important to answer complex scientific hypotheses. Acquiring, curating, preparing, and sharing datasets at scale remain challenging tasks that require extensive field-specific knowledge, thus relying upon groups of researchers with varying expertise. As such, these steps are often conducted in non-integrated and sequential phases rather than continuously, which in turn limits the observability (e.g., progress tracking, quality assessment, error detection) of the collection process and, consequently, the potential for data fixes or operational improvements. Furthermore, while funding agencies and local institutions both strongly encourage researchers to implement FAIR principles (findability, accessibility, interoperability, and reusability) (Wilkinson et al., 2016) in their data management practices, ethics committees - notably those dealing with human research - logically impose important privacy, cybersecurity, sovereignty and governance constraints on data sharing. Altogether, the burden of adequately conducting scientific data collection, preparation and sharing is far too large for individual labs to bear, even more so when novice students or trainees are tasked with implementing these complex steps. These responsibilities further increase when scaling projects to multiple labs or sites with distinct institutional governance or digital infrastructures, inherently slowing collaboration. All these challenges of large-scale, high-quality, collaborative data collection initiatives call for open-science community efforts and institutional infrastructure to support efficient scientific operations and as-open-as-possible data sharing.

From our experience in collecting and managing the data of the dense-fMRI CNeuroMod project, we aim to transfer, genericize, scale and automate these workflows in the context of our institution-wide open-science initiative, as well as a multisite Canadian Pediatric Imaging Platform (CPIP), by designing and implementing the BIDS-Flux infrastructure. The neuroimaging community has already conducted extensive efforts in data management by specifying the robust and evolving file-based scientific data standard BIDS (Gorgolewski et al., 2016), which automated tools, BIDS-Apps (Gorgolewski et al., 2017), can robustly read and write. This foundation allowed us to design and start the implementation of a scalable FAIR data management platform.

Modularly organized datasets are managed as DataLad (Halchenko et al., 2021) repositories to version large files with git-annex while exposing data structure and metadata with git, which also tracks provenance between datasets with submodules. Code-hosting platforms (e.g., GitHub, GitLab) have enhanced git version control with agile practices such as development operations (DevOps) through continuous integration and deployment (CI/CD) mechanisms, which can be translated to data operations (DataOps) on DataLad repositories. As such, we chose to host and organize datasets on self-deployed GitLab instances that orchestrate workflows, while we store data in a modern scalable object store (MinIO). Provided some data collection conventions (e.g., ReproNim), the platform continuously ingests raw data from instruments (e.g., MRI, REDCap, Biopac), and carries out data standardization (e.g., heudiconv, spec2nii, phys2bids), anonymization (defacing), testing (e.g., BIDS-validator, protocol compliance) and quality control (e.g., MRIQC).
After review and merging of the raw standard data from new sessions, a set of containerized BIDS-Apps configured for the study-specific collection protocol is triggered using DataLad reproducibility features to generate derivatives, which also undergo review through reports and merging into their own datasets. All resulting datasets are continuously and recurringly covered by data standard compliance, deployment, privacy and security audit tests. While focused on neuroimaging and peripherally collected data covered by the BIDS extensions, this infrastructure is modular enough to be adapted to other types of file-based scientific data structures, provided field-specific data standards and reproducible tools exist. Finally, while file-based datasets fit the earlier stages of data preparation, extracted data can also be forwarded to tabular or vector-optimized storage better suited for downstream usage such as statistical modelling or machine-learning operations (MLOps).

Building upon DataLad’s inheritance of git decentralization and git-annex versatility, we further designed our platform to enable distributed dataset creation. Multiple distant sites can independently add new data to local forks of the datasets, while federation mechanisms propagate these changes and optionally transfer released non-sensitive data to community-accessible academic cloud storage and long-term archives (e.g., DataVerse), with DataLad flexibly tracking the data presence in these redundant storages. Apart from benefits in collaboratively scaling scientific endeavours, a distributed, replicated and federated management of scientific operations can accommodate the complex requirements of resilient data sovereignty and open science.

Finally, while a large portion of the scientific data workflow can be automated, the later stages of analysis and publication require, for now, more human intervention. Even if FAIR data management calls for very technical solutions as described above, the access and usage of the data have to be simple and user-friendly to empower researchers with FAIR practices. We thus plan to provide easily launchable, distributed, cloud-containerized computing environments (e.g., BinderHub) with tracking of the used datasets’ versions. By powering interactive exploration of rich datasets as much as full-fledged analyses in modern executable reproducible preprints (Karakuzu et al., 2022), such computational environments are the final steps to an end-to-end FAIR scientific workflow, tracking provenance from collection instruments to publications.

Gorgolewski, K.J. et al. (2016) ‘The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments’, Scientific Data, 3, p. 160044.

Gorgolewski, K.J. et al. (2017) ‘BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods’, PLoS Computational Biology, 13(3), p. e1005209.

Halchenko, Y.O. et al. (2021) ‘DataLad: distributed system for joint management of code, data, and their relationship’, Journal of Open Source Software, 6(63). Available at: https://doi.org/10.21105/joss.03262.

Karakuzu, A. et al. (2022) ‘NeuroLibre: A preprint server for full-fledged reproducible neuroscience’. Available at: https://doi.org/10.31219/osf.io/h89js.

Wilkinson, M.D. et al. (2016) ‘The FAIR Guiding Principles for scientific data management and stewardship’, Scientific Data, 3, p. 160018.

Q&A

16:15 (00:30) ‧ Event Hall ‧ Start-to-end FAIR analyses in the neurosciences

Abstract

Questions and panel discussion

Conference Wrap-Up

16:45 (00:15) ‧ Event Hall ‧ General

End of Day

17:00 (00:30) ‧ Foyer ‧ General

Abstract

Head out to a self-organized dinner and social time, or head home. Venue closing at 17:30.


2025-10-25

Coffee

08:30 (00:30) ‧ Seminar room, 3rd floor ‧ Break

Abstract

Coffee and snack overflow

Kick-Off/Pitches

09:00 (00:30) ‧ Seminar room, 3rd floor ‧ Hackathon

Hacking

09:30 (02:30) ‧ Seminar room, 3rd floor ‧ Hackathon

Lunch (Self-Organized)

12:00 (01:30) ‧ Seminar room, 3rd floor ‧ Break

Abstract

Lunch (self-organized, outside venue)

Hacking

13:30 (01:00) ‧ Seminar room, 3rd floor ‧ Hackathon

Coffee

14:30 (00:15) ‧ Seminar room, 3rd floor ‧ Break

Abstract

Coffee and snack overflow

Hacking

14:45 (01:45) ‧ Seminar room, 3rd floor ‧ Hackathon

Wrap-Up

16:30 (00:30) ‧ Seminar room, 3rd floor ‧ Hackathon

End of Day

17:00 (00:30) ‧ Seminar room, 3rd floor ‧ General

Abstract

Head out to a self-organized dinner and social time, or head home. Venue closing at 17:30.