4.7 Article

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Journal

BIOINFORMATICS
Volume 30, Issue 1, Pages 119-120

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btt601

Funding

  1. Finnish Strategic Centre for Science, Technology and Innovation DIGILE
  2. Academy of Finland [139402]
  3. Sardinian (Italy) [L7-2010/COBIK]
  4. COST Action [BM1006]

Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
