4.6 Article

Detecting and Removing Sample Contamination in Phylogenomic Data: An Example and its Implications for Cicadidae Phylogeny (Insecta: Hemiptera)

Journal

SYSTEMATIC BIOLOGY
Volume 71, Issue 6, Pages 1504-1523

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/sysbio/syac043

Keywords

-

Funding

  1. NSF [DEB1655891, DEB0955849, DEB0720664, DEB0529679, DEB0089946]
  2. USAID PEER Science Program [AID-OAA-A-11-00012]
  3. National Centre for Biological Sciences (NCBS)
  4. Ramanujan Fellowship (Department of Science and Technology, Govt. of India)
  5. Vietnam Academy of Science and Technology (VAST) [NCXS02.04/22-23]

Ask authors/readers for more resources

Contamination of genetic samples in molecular phylogenetic studies is a continuing concern. This study developed an automated pipeline to identify and exclude contaminated loci, without relying on prior knowledge of taxon relatedness. The experiment showed that contaminated loci can affect topological patterns and branch supports, and contamination is correlated with long tip branches and certain statistical values. The cleaned data set supported some subfamilies in the Cicadidae family.
Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the causes(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted. One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini is reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available