Article

Towards self-describing and FAIR bulk formats for biomedical data

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 19, Issue 3

Publisher

PUBLIC LIBRARY SCIENCE
DOI: 10.1371/journal.pcbi.1010944

We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. PFB is based on Avro and includes a data model, a data dictionary, the data itself, and pointers to third-party controlled vocabularies. Each data element in the data dictionary is associated with a third-party controlled vocabulary, which makes PFB files easier to harmonize. We also introduce an open-source SDK called PyPFB for creating, exploring, and modifying PFB files. Experimental studies show improved performance when importing and exporting bulk biomedical data in the PFB format compared with JSON and SQL formats.
We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. PFB is based upon Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third-party controlled vocabularies. In general, each data element in the data dictionary is associated with a third-party controlled vocabulary to make it easier for applications to harmonize two or more PFB files. We also introduce an open-source software development kit (SDK) called PyPFB for creating, exploring, and modifying PFB files. We describe experimental studies showing the performance improvements when importing and exporting bulk biomedical data in the PFB format versus using JSON and SQL formats.

Author summary

Many biomedical datasets have a unique structure that encapsulates and describes the data. When working with these datasets, it can be difficult to keep track of the overall structure and of the ontologies that define the individual properties. PFB was developed to make working with this type of data simpler. It lets anyone working with the data carry a fully self-describing dataset anywhere in a single file and analyze the phenotypic and biological data it contains. PFB was developed on top of the Avro serialized data format, which helps researchers and commons operators update data schemas and change references to external ontologies. In this work we show the advantages of using PFB as a bioinformatics tool and how it is used to enable fast sharing of large biomedical research datasets. The results also show that PFB brings significant speedups for storing and sharing structured biomedical datasets.
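The idea of a self-describing file, in which each data element points to a controlled-vocabulary term so that two files can be harmonized, can be illustrated with a minimal sketch. This is not the PFB layout, does not use Avro, and does not call PyPFB; the function names (`build_bundle`, `harmonizable_fields`), the vocabulary name `EXAMPLE_VOCAB`, and the term identifier `EX:0001` are all hypothetical placeholders chosen for illustration.

```python
import json


def build_bundle(data_dictionary, records):
    """Pack a data dictionary (with vocabulary pointers) and records into one object.

    Hypothetical stand-in for a self-describing file; not the real PFB structure.
    """
    return {"data_dictionary": data_dictionary, "records": records}


def harmonizable_fields(bundle_a, bundle_b):
    """Return fields from two bundles that point at the same vocabulary term."""

    def index(bundle):
        # Map (vocabulary, term id) -> "node.property" for every data element.
        out = {}
        for node, props in bundle["data_dictionary"].items():
            for prop, meta in props.items():
                out[(meta["ontology"], meta["term_id"])] = f"{node}.{prop}"
        return out

    a, b = index(bundle_a), index(bundle_b)
    return {key: (a[key], b[key]) for key in a.keys() & b.keys()}


# Two studies name the same concept differently but share a (made-up) term id.
study_a = build_bundle(
    {"subject": {"sex": {"type": "string", "ontology": "EXAMPLE_VOCAB", "term_id": "EX:0001"}}},
    [{"node": "subject", "id": "s1", "sex": "female"}],
)
study_b = build_bundle(
    {"participant": {"gender": {"type": "string", "ontology": "EXAMPLE_VOCAB", "term_id": "EX:0001"}}},
    [{"node": "participant", "id": "p9", "gender": "F"}],
)

matches = harmonizable_fields(study_a, study_b)

# The dictionary travels with the data, so a round trip restores both together.
restored = json.loads(json.dumps(study_a))
```

Because the vocabulary pointer is stored next to each field definition, the matching step needs nothing beyond the two files themselves. JSON is used here only to keep the sketch dependency-free; the paper's format is built on Avro instead.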
