Distributed data: standards, governance and analyses in neurosciences

This talk will review some sociological and technical aspects associated with the standards, governance structure and data analyses of distributed data.

Distributed data in the life sciences has become a necessity given the sharing constraints imposed by ethical, legal and political frameworks, which exist at the laboratory, institution, country and continent levels. While distributing data is necessary, it comes with the huge challenge of making local data interoperable.

Ten years after the FAIR guidelines proposed by Wilkinson et al., finding a standard to represent a particular set of data still looks a little like a busy Christmas market where one has to choose the type of decoration to put on the data. The reasons why it is so hard to find a common language are multiple and intertwined with the sociology of research, i.e. how laboratories or institutions get funding, and how they compete or collaborate. For those who have been working on data standards for many years, the feeling that things will never be resolved might dominate. In addition, sustaining the development and maintenance of standards within a reasonable governance structure is rarely at the top of the priorities of funding agencies, which tend to reward groundbreaking research and emphasize innovation.

In this presentation, I will first lay out key sociological and technical factors preventing the reuse of distributed data. Some of these, such as the classical research incentives, are well known, while others are less well understood. Secondly, and mainly, I would like to introduce some hope that the process can indeed be successful, and propose a number of guidelines based on the collective experience of developing standards and tools. In particular, I will consider federated analysis as a key motivation for standardized data and present some first experiments using a combination of tools (FedBiomed, Nipoppy / BIDS, Neurobagel) and standards (metadata, computing, etc.) that enable a community to extract information from distributed datasets. I will give some examples of federated analysis of Parkinson's disease clinical and neuroimaging data, and I will review where LLMs will play a key role in facilitating this process. The capacity to obtain results that cannot be achieved otherwise is likely to drive the adoption and sustainability of distributed data and analysis architectures, but only if the process concurrently builds communities of practice.