Compute on demand

Compute-on-demand with a git-annex special remote: an fMRIPrep use case example

Git-annex special remotes can be thought of as more of a protocol than a place. A special remote can perform arbitrary operations to get a piece of content - so why not have it compute instead of download? Recently, two independent implementations of this idea have been published. This talk will discuss them both.

The datalad-remake special remote implementation relies on compute instructions stored in a git-tracked file, and uses URL-like git-annex keys to record availability. On get, a temporary worktree is provisioned and the compute instructions are executed. Signing of git commits is used to prevent unauthorised execution: a trust list needs to be declared.
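
The flow described above (look up instructions, check trust, provision a temporary working directory, execute, return the content) can be sketched in a few lines of Python. This is a toy illustration only, not the datalad-remake API: the `INSTRUCTIONS` mapping, the `signer` field, the `TRUSTED_SIGNERS` set and the `compute_on_get` function are all hypothetical names, and a plain string comparison stands in for real commit signature verification.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for compute instructions that would live in a
# git-tracked file: each key maps to a command, the name of the file it
# produces, and the identity that signed the instructions.
INSTRUCTIONS = {
    "example-key": {
        "command": [sys.executable, "-c",
                    "open('out.txt', 'w').write('computed')"],
        "output": "out.txt",
        "signer": "alice",
    },
}

# Hypothetical trust list; a real setup would verify signatures instead.
TRUSTED_SIGNERS = {"alice"}


def compute_on_get(key, instructions=INSTRUCTIONS, trusted=TRUSTED_SIGNERS):
    """Produce the content for `key` by running its instructions."""
    spec = instructions[key]
    # Refuse to execute instructions from an untrusted source.
    if spec["signer"] not in trusted:
        raise PermissionError(f"untrusted signer: {spec['signer']}")
    # Run in a throwaway directory, mimicking a temporary worktree.
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(spec["command"], cwd=workdir, check=True)
        return (Path(workdir) / spec["output"]).read_bytes()
```

Calling `compute_on_get("example-key")` runs the command in a fresh temporary directory and returns the bytes of the file it wrote; passing an empty trust set makes the same call fail before anything is executed.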

The compute special remote is now built into git-annex. It is based on a compute special remote interface, and a protocol for communicating with external programs (or executable scripts).

As a non-trivial use case, the ability to compute on demand addresses an important issue faced by neuroscientists. Preprocessing fMRI data includes morphing 4-dimensional (space + time) brain images to match a standard template. Usually, preprocessed images are normalized to more than one template, multiplying the storage requirements. Computing the transformations is time-consuming, but the transformation matrices are small and applying them to produce normalized images is quick. Recent releases of fMRIPrep, a popular software package for fMRI data preprocessing, save all files required to apply these transformations, and provide workflows which can apply them reproducibly in a “shortcut” manner. In the talk, I will demonstrate how datalad-remake can be used with fMRIPrep to save hundreds of megabytes per study participant.
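
The storage argument can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration under simplifying assumptions: the function name `storage_estimate` and all of its parameters are hypothetical, one transformation file is counted per run and template (a deliberately pessimistic simplification), and any sizes passed in are placeholders rather than measured fMRIPrep output sizes.

```python
def storage_estimate(n_runs, n_templates, image_bytes, transform_bytes):
    """Compare storing normalized images with storing only transforms.

    Returns a tuple (bytes_if_images_kept, bytes_if_only_transforms_kept).
    Assumes one normalized image and one transform per run and template.
    """
    outputs = n_runs * n_templates
    keep_images = outputs * image_bytes
    keep_transforms = outputs * transform_bytes
    return keep_images, keep_transforms


# Placeholder inputs, chosen only to show how the totals scale.
images, transforms = storage_estimate(
    n_runs=2, n_templates=3, image_bytes=100, transform_bytes=1
)
saved = images - transforms
```

Because both totals grow with the number of templates, the absolute saving from regenerating images on demand also grows with every additional template.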