Maintaining large datasets at scale

Finding and fixing defects that arise in automatically managed git-annex repositories.

OpenNeuro is a neuroimaging data hosting service that allows researchers to upload datasets, which are converted into git-annex repositories and exposed for download. Users have created over 7000 datasets, containing more than 140 TB of data in total.

Users largely interact with datasets through the web interface and a command-line uploader/downloader, and rarely use git, git-annex or DataLad directly. As a result, software bugs can introduce defects into a repository without detection until a user attempts to access the data, which can be many months and several dataset versions later.

This talk will discuss some of the defects we have found and the strategies we’ve used to resolve them, including:

  • Unannex/re-annex commit pairs poisoning the git history. Finding, fixing and rewriting tags.
  • Recovering from large files somehow getting into your git-annex branch.
  • Copy-on-write git backups so you can rewrite history without fear.
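As an illustration of the second item, here is a minimal sketch of one way to look for unexpectedly large blobs on the git-annex branch. It is not the method presented in the talk. It assumes that the branch normally holds only small log and metadata files, so anything above a size threshold is suspect. The function names and the 1 MB default threshold are hypothetical; the only git command used is `git ls-tree -r -l`, which lists each blob together with its size.

```python
import subprocess


def parse_ls_tree(output):
    """Parse `git ls-tree -r -l` output into (size, path) tuples for blobs."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<mode> <type> <object> <size>\t<path>".
        meta, _, path = line.partition("\t")
        fields = meta.split()
        if len(fields) != 4:
            continue
        _mode, objtype, _sha, size = fields
        # Non-blob entries (e.g. submodule commits) report "-" as their size.
        if objtype != "blob" or size == "-":
            continue
        entries.append((int(size), path))
    return entries


def large_blobs(output, threshold=1_000_000):
    """Return blobs of at least `threshold` bytes, largest first."""
    return sorted(
        (entry for entry in parse_ls_tree(output) if entry[0] >= threshold),
        reverse=True,
    )


def scan_branch(repo, branch="git-annex", threshold=1_000_000):
    """Run git in `repo` and report large blobs at the tip of `branch`."""
    result = subprocess.run(
        ["git", "-C", repo, "ls-tree", "-r", "-l", branch],
        check=True,
        capture_output=True,
        text=True,
    )
    return large_blobs(result.stdout, threshold)
```

Calling `scan_branch("/path/to/dataset")` on a repository returns a list of `(size, path)` tuples to investigate. This only inspects the branch tip; blobs that were added and later removed would still be present in the branch history.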