Article

mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements

Journal

JOURNAL OF PROTEOME RESEARCH
Volume 20, Issue 1, Pages 172-183

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jproteome.0c00192

Keywords

Proteomics Standards Initiative; mzML; mass spectrometry; proteomics; metabolomics; data compression; HDF5

Funding

  1. BBSRC [BB/K01997X/1, BB/R02216X/1, BB/M024954, BB/R021430]
  2. MRC [MR/N028457]
  3. National Institutes of Health [R01GM087221, R24GM127667, U19AG023122]
  4. National Science Foundation [DBI-1933311, IOS-1922871]
  5. BBSRC [BB/R02216X/1, BB/M024954/2, BB/M024954/1, BB/K01997X/1] Funding Source: UKRI
  6. MRC [MR/N028457/1] Funding Source: UKRI

mzMLb is an HDF5-based file format optimized for the read/write speed and storage efficiency of raw mass spectrometry data. Because it preserves the XML encoding of the metadata, it implicitly supports future versions of mzML and is straightforward to implement.
With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise extensible markup language (XML) representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format mzMLb that is optimized for both read/write speed and storage of the raw mass spectrometry data. We provide an extensive validation of the write speed, random read speed, and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb's design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilized by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.
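The design idea in the abstract (keep the metadata as XML, move the numeric arrays into a binary store that allows random access) can be illustrated with a minimal sketch. This is not the mzMLb layout and does not use HDF5: the element and attribute names (`spectrumList`, `binary`, `offset`, `count`) and the functions `split_container` and `read_spectrum` are hypothetical, chosen only for illustration, and a single bytes object stands in for an HDF5 dataset. A real implementation would rely on an HDF5 library, which could also apply chunked compression; the stand-in blob here is left uncompressed.

```python
import struct
import xml.etree.ElementTree as ET


def split_container(spectra):
    """Return (xml_text, blob): XML metadata plus one binary blob.

    Each <binary> element records only where its array lives in the blob
    (hypothetical 'offset' and 'count' attributes, counted in float64
    values), instead of carrying the array inline as text.
    """
    root = ET.Element("spectrumList")
    chunks = []
    offset = 0
    for spec_id, mz in spectra.items():
        spec = ET.SubElement(root, "spectrum", id=spec_id)
        ET.SubElement(spec, "binary", offset=str(offset), count=str(len(mz)))
        # Little-endian float64 values, appended back to back.
        chunks.append(struct.pack(f"<{len(mz)}d", *mz))
        offset += len(mz)
    return ET.tostring(root, encoding="unicode"), b"".join(chunks)


def read_spectrum(xml_text, blob, spec_id):
    """Random access: find one spectrum in the XML, then slice the blob."""
    root = ET.fromstring(xml_text)
    for spec in root.iter("spectrum"):
        if spec.get("id") == spec_id:
            binary = spec.find("binary")
            start = int(binary.get("offset"))
            count = int(binary.get("count"))
            # 8 bytes per float64 value.
            return list(struct.unpack_from(f"<{count}d", blob, start * 8))
    raise KeyError(spec_id)


if __name__ == "__main__":
    xml_text, blob = split_container(
        {"scan=1": [100.0, 200.5], "scan=2": [300.25, 400.0, 500.75]}
    )
    print(xml_text)
    print(read_spectrum(xml_text, blob, "scan=2"))
```

Because the XML is kept as XML, a reader that understands the metadata schema needs only the extra offset lookup to reach the numeric data; reading one spectrum does not require decoding any other spectrum's array.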
