Datalad Reproducibility for High Performance Computing: The Datalad Slurm Extension

Datalad run and rerun are incompatible with HPC batch scheduling. This talk presents how we fixed that.

Datalad run and rerun, which provide machine-actionable reproducibility, do not work in High Performance Computing (HPC) environments where compute jobs are managed through a batch scheduling system. Wrapping the job submission itself in datalad run is pointless, since the submission command returns long before the job has produced its outputs. Using datalad run inside a batch job causes problems of its own and is very inefficient.

We present the Datalad Slurm extension, which solves this problem for the Slurm batch scheduling system. It introduces the new sub-commands “slurm-schedule”, “slurm-finish”, and “slurm-reschedule”. They allow users to schedule many jobs at the same time from the same clone of a repository and to generate a reproducibility record in the git log for each successful job.

The talk introduces the solution from the user's perspective, explains the design idea and some implementation details, presents a performance evaluation, and touches on a few extra features.