4.6 Article

CTD: An information-theoretic algorithm to interpret sets of metabolomic and transcriptomic perturbations in the context of graphical models

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 17, Issue 1, Pages -

Publisher

PUBLIC LIBRARY SCIENCE
DOI: 10.1371/journal.pcbi.1008550

Keywords

-

Funding

  1. Gulf Coast Consortia, on the NLM Biomedical Informatics Training Program [T15 LM007093]
  2. Henry and Emma Meyer Professorship in Molecular Genetics [U54-DA036134, U54DA049098, U41-HG009649]

Ask authors/readers for more resources

The paper introduces a novel information-theoretic formulation of identifying highly connected subsets within a weighted graph without the need for computationally costly permutation testing. The proposed algorithm, CTD Connect the Dots, uses data compression to detect highly connected subsets within a given set of nodes.
We consider the following general family of algorithmic problems that arises in transcriptomics, metabolomics and other fields: given a weighted graph G and a subset of its nodes S, find subsets of S that show significant connectedness within G. A specific solution to this problem may be defined by devising a scoring function, the Maximum Clique problem being a classic example, where S includes all nodes in G and where the score is defined by the size of the largest subset of S fully connected within G. Major practical obstacles for the plethora of algorithms addressing this type of problem include computational efficiency and, particularly for more complex scores which take edge weights into account, the computational cost of permutation testing, a statistical procedure required to obtain a bound on the p-value for a connectedness score. To address these problems, we developed CTD Connect the Dots, a fast algorithm based on data compression that detects highly connected subsets within S. CTD provides information-theoretic upper bounds on p-values when S contains a small fraction of nodes in G without requiring computationally costly permutation testing. We apply the CTD algorithm to interpret multi-metabolite perturbations due to inborn errors of metabolism and multi-transcript perturbations associated with breast cancer in the context of disease-specific Gaussian Markov Random Field networks learned directly from respective molecular profiling data. Author summary A frequently encountered omic analysis problem is to identify a subset of nodes within a weighted graph G that is both highly connected in G and belongs to S, a subset of nodes in G. For example, G may represent a biological pathway, kinetic network model, biological interaction network, or a network learned directly from data, where edges represent co-variation relationships between abundances of molecular variables. S may be the set of molecular variables that are perturbed in an individual case or in a set of disease cases relative to controls. In this work, we develop a novel information-theoretic formulation of this problem and a local search algorithm that obviate the need for computationally costly permutation testing, a statistical procedure which is typically required to establish rigorous p-value bounds for other scoring-based methods.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available