Article

FAIRly big: A framework for computationally reproducible processing of large-scale data

Journal

SCIENTIFIC DATA
Volume 9, Issue 1

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s41597-022-01163-2

Funding

  1. European Union's Horizon 2020 research and innovation programme under grant agreement Human Brain Project (SGA3, H2020-EU.3.1.5.3) [945539]
  2. European Union's Horizon 2020 research and innovation programme under grant agreement VirtualBrainCloud (H2020-EU.3.1.5.3) [826421]
  3. US National Science Foundation [NSF 1912266, 1429999]
  4. German Federal Ministry of Education and Research [BMBF 01GQ1905, 01GQ1411]
  5. ETIUDA grant from the National Science Centre, Poland [2018/28/T/HS6/00507]
  6. US National Science Foundation, Directorate for Computer and Information Science and Engineering, Division of Information and Intelligent Systems [1429999]

This paper introduces a DataLad-based framework for reproducible data processing in compliance with open science mandates. The framework allows capturing machine-actionable computational provenance records to trace and verify research outcomes, as well as re-executing them on different computing infrastructures.
Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, and that can be re-executed independently of the original computing infrastructure. We demonstrate the framework's performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).
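
To make the idea of a machine-actionable provenance record concrete, the following is a minimal sketch using only the Python standard library. It is not DataLad's implementation or API: the function names (`run_with_provenance`, `rerun_and_verify`, `demo`), the record layout, and the toy command are all hypothetical and chosen for illustration. The sketch records a command together with checksums of its inputs and outputs, then re-executes the record and checks that the same outputs are produced.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_with_provenance(cmd, inputs, outputs, cwd):
    """Run cmd inside cwd and return a JSON-serializable provenance record.

    Hypothetical record layout (an assumption of this sketch): the command,
    plus checksums of the declared input and output files.
    """
    cwd = Path(cwd)
    record = {
        "cmd": list(cmd),
        "inputs": {name: sha256(cwd / name) for name in inputs},
    }
    subprocess.run(cmd, cwd=cwd, check=True)
    record["outputs"] = {name: sha256(cwd / name) for name in outputs}
    return record


def rerun_and_verify(record, cwd):
    """Re-execute a record; return True only if inputs and outputs match."""
    cwd = Path(cwd)
    # Do not re-execute if any input differs from what was recorded.
    for name, digest in record["inputs"].items():
        if sha256(cwd / name) != digest:
            return False
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return all(
        sha256(cwd / name) == digest
        for name, digest in record["outputs"].items()
    )


def demo():
    """Capture a record, re-execute it, then re-check after changing the input."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "in.txt").write_text("fairly big\n")
        # Toy processing step: upper-case in.txt into out.txt.
        cmd = [
            sys.executable,
            "-c",
            "from pathlib import Path; "
            "Path('out.txt').write_text(Path('in.txt').read_text().upper())",
        ]
        record = run_with_provenance(cmd, ["in.txt"], ["out.txt"], work)
        # The record survives a JSON round trip, so it can be stored and shared.
        record = json.loads(json.dumps(record))
        (work / "out.txt").unlink()
        reproduced = rerun_and_verify(record, work)
        (work / "in.txt").write_text("changed\n")
        after_change = rerun_and_verify(record, work)
        return record, reproduced, after_change


if __name__ == "__main__":
    rec, reproduced, after_change = demo()
    print(json.dumps(rec, indent=2))
    print("reproduced:", reproduced, "| after input change:", after_change)
```

The sketch stores the interpreter path in the command, which ties the record to one machine, and it does not capture the software environment at all. Re-execution that is independent of the original infrastructure would need both to be portable, which this example does not attempt.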
