Statistics & Probability

Article Automation & Control Systems

A tutorial on automatic hyperparameter tuning of deep spectral modelling for regression and classification tasks

Dario Passos, Puneet Mishra

Summary: This tutorial aims to provide practical tools and methods for non-expert users in the chemometrics field to learn and implement deep spectral modelling and deep learning (DL) optimization. It showcases two practical examples of implementing and optimizing DL models for spectral regression and classification tasks.

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2022)

Article Biochemical Research Methods

ImmuCellAI-mouse: a tool for comprehensive prediction of mouse immune cell abundance and immune microenvironment depiction

Ya-Ru Miao, Mengxuan Xia, Mei Luo, Tao Luo, Mei Yang, An-Yuan Guo

Summary: ImmuCellAI-mouse is a tool for predicting mouse immune cell abundance from gene expression data. It uses a hierarchical strategy to accurately predict most immune cell types, and it reveals dynamic changes in immune cell infiltration in a mouse tumor dataset.

BIOINFORMATICS (2022)

Article Biochemical Research Methods

plotsr: visualizing structural similarities and rearrangements between multiple genomes

Manish Goel, Korbinian Schneeberger

Summary: Third-generation genome sequencing technologies have significantly increased the number of high-quality genome assemblies. plotsr is an efficient tool for visualizing structural similarities and rearrangements between genomes. It can compare genomes at the chromosome level or within selected regions, and it can augment the visualization with regional identifiers or histogram tracks.

BIOINFORMATICS (2022)

Article Economics

Misallocation, Selection, and Productivity: A Quantitative Analysis With Panel Data From China

Tasso Adamopoulos, Loren Brandt, Jessica Leight, Diego Restuccia

Summary: Using household-level panel data from China and a quantitative framework, this study examines the extent and consequences of factor misallocation in agriculture. It finds that within-village frictions in land and capital markets, linked to rural land institutions, disproportionately constrain more productive farmers. These frictions reduce aggregate agricultural productivity by distorting the allocation of resources across farmers and of workers across sectors, in particular the type of farmers who operate in agriculture.

ECONOMETRICA (2022)

Article Economics

In Search of a Job: Forecasting Employment Growth Using Google Trends

Daniel Borup, Erik Christian Montes Schutte

Summary: This study demonstrates a strong correlation between Google search activity and employment growth in the United States, showing that Google search activity is an effective predictor of future employment growth. The authors construct a large panel of 172 variables, using Google's own algorithms to find semantically related search queries, and the resulting Google Trends model outperforms other predictors in forecasting employment growth.

JOURNAL OF BUSINESS & ECONOMIC STATISTICS (2022)

Article Biochemical Research Methods

Graph Convolutional Autoencoder and Generative Adversarial Network-Based Method for Predicting Drug-Target Interactions

Chang Sun, Ping Xuan, Tiangang Zhang, Yilin Ye

Summary: This study proposes GANDTI, a method based on a graph convolutional autoencoder and a generative adversarial network (GAN), for predicting novel drug-target interactions (DTIs). GANDTI constructs a drug-target heterogeneous network and regularizes the feature vectors of nodes into a Gaussian distribution, and it outperforms other methods for drug repositioning.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Biochemical Research Methods

RCSB Protein Data Bank: improved annotation, search and visualization of membrane protein structures archived in the PDB

Sebastian Bittrich, Yana Rose, Joan Segura, Robert Lowe, John D. Westbrook, Jose M. Duarte, Stephen K. Burley

Summary: Membrane proteins, encoded by a significant portion of human genes, account for a majority of FDA-approved drug targets. The RCSB Protein Data Bank web portal has recently integrated a wealth of new membrane protein annotations from external resources. These additions greatly improve the presentation of membrane protein data and give users tools for searching and visualizing these proteins.

BIOINFORMATICS (2022)

Article Statistics & Probability

SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Summary: This paper studies minimum ℓ2-norm interpolation least squares regression in the high-dimensional regime, focusing on linear and nonlinear models. The study identifies double descent behavior in the prediction risk and potential benefits of overparametrization.

ANNALS OF STATISTICS (2022)

Article Biochemical Research Methods

Finding lncRNA-Protein Interactions Based on Deep Learning With Dual-Net Neural Architecture

Lihong Peng, Chang Wang, Xiongfei Tian, Liqian Zhou, Keqin Li

Summary: In this study, a deep learning framework with dual-net neural architecture (LPI-DLDN) is developed to identify potential interactions between long non-coding RNAs (lncRNAs) and proteins. Compared with other state-of-the-art prediction methods, LPI-DLDN demonstrates powerful classification performance and reveals new interaction relationships.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Biochemical Research Methods

An Integrative Framework of Heterogeneous Genomic Data for Cancer Dynamic Modules Based on Matrix Decomposition

Xiaoke Ma, Penggang Sun, Maoguo Gong

Summary: This study proposes an integrative framework for dynamic module detection based on a regularized nonnegative matrix factorization method, which can be applied to cancer diagnosis and therapy. Experimental results demonstrate that the framework outperforms other methods in terms of accuracy. For breast cancer data, the obtained dynamic modules are more enriched and can be used to predict cancer stages and patients' survival time.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Biochemical Research Methods

Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona

Kai Cao, Yiguang Hong, Lin Wan

Summary: Motivated by the need for effective approaches to integrate single-cell multi-omics data, this study presents Pamona, a manifold alignment framework based on the partial Gromov-Wasserstein distance. It aims to delineate and represent the shared and dataset-specific cellular structures across modalities. Pamona outperforms existing methods in identifying shared and dataset-specific cells and in recovering and aligning cellular structures. The framework also allows prior information to be incorporated to enhance alignment quality.

BIOINFORMATICS (2022)

Article Biochemical Research Methods

ToxIBTL: prediction of peptide toxicity based on information bottleneck and transfer learning

Lesong Wei, Xiucai Ye, Tetsuya Sakurai, Zengchao Mu, Leyi Wei

Summary: In this study, a novel deep learning framework called ToxIBTL is proposed for predicting the toxicity of peptides and proteins. By utilizing the information bottleneck principle and transfer learning, ToxIBTL effectively retains relevant information and minimizes redundant information in the features, resulting in improved prediction performance.

BIOINFORMATICS (2022)

Article Biochemical Research Methods

Predicting Drug-Drug Interactions Based on Integrated Similarity and Semi-Supervised Learning

Cheng Yan, Guihua Duan, Yayan Zhang, Fang-Xiang Wu, Yi Pan, Jianxin Wang

Summary: A drug-drug interaction (DDI) refers to the association between drugs where one drug's pharmacological effects are influenced by another drug. This study proposes a novel method, called DDI-IS-SL, to predict DDIs using integrated similarity and semi-supervised learning. DDI-IS-SL combines drug chemical, biological, and phenotype data to calculate the feature similarity of drugs. It also uses a semi-supervised learning method to calculate the interaction possibility scores of drug-drug pairs. DDI-IS-SL demonstrates better prediction performance and shorter computation time compared to other methods, and its performance is further supported by case studies.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Biochemical Research Methods

Dynamic Module Detection in Temporal Attributed Networks of Cancers

Dongyuan Li, Shuyao Zhang, Xiaoke Ma

Summary: Tracking dynamic modules during cancer progression is crucial for studying cancer pathogenesis, diagnosis, and therapy. Current algorithms detect dynamic modules without integrating heterogeneous genomic data, which limits them. This study addresses that limitation by proposing a novel algorithm (TANMF) that integrates temporal networks and gene attributes. Experimental results demonstrate the superiority of TANMF in accuracy and its ability to identify dynamic modules associated with patients' survival time in breast cancer data.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Review Engineering, Environmental

Review of landslide susceptibility assessment based on knowledge mapping

Chen Yong, Jinlong Dong, Guo Fei, Tong Bin, Zhou Tao, Fang Hao, Wang Li, Zhan Qinghua

Summary: This study comprehensively examines the state of landslide susceptibility research through visual analysis and summarization. The findings identify shortcomings in various aspects and propose future research directions. The results provide references and guidance for understanding the current state of landslide susceptibility research and for planning future studies.

STOCHASTIC ENVIRONMENTAL RESEARCH AND RISK ASSESSMENT (2022)

Article Statistics & Probability

Nonparametric Causal Effects Based on Longitudinal Modified Treatment Policies

Ivan Diaz, Nicholas Williams, Katherine L. Hoffman, Edward J. Schenck

Summary: Most causal inference methods focus on fixed values of exposure under interventions, which may not be practical for continuous or multi-valued treatments. Longitudinal modified treatment policies (LMTPs) provide an alternative that yields immediate and practically relevant effects, with an interpretation in terms of meaningful interventions such as changing the exposure by a specific amount. LMTPs also have the advantage of satisfying the positivity assumption required for causal inference.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2023)

Article Statistics & Probability

Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data

Jun Yu, HaiYing Wang, Mingyao Ai, Huiming Zhang

Summary: This article proposes a nonuniform subsampling method based on Poisson subsampling to address the computational burden issue for massive data. The optimal subsampling probabilities are derived under the quasi-likelihood estimation and their consistency and asymptotic normality are established. A distributed subsampling framework is also developed for handling data stored in different locations. The effectiveness of the proposed methods is demonstrated through numerical experiments on simulated and real datasets.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2022)

Article Biochemical Research Methods

DeepBarcoding: Deep Learning for Species Classification Using DNA Barcoding

Cheng-Hong Yang, Kuo-Chuan Wu, Li-Yeh Chuang, Hsueh-Wei Chang

Summary: DNA barcodes are short sequence fragments used for species identification. This study proposes a deep learning framework, called deep barcoding, for species classification using DNA barcodes. By utilizing raw sequence data and deep convolutional neural networks, the deep barcoding model achieves high accuracy in species identification. Although there are challenges, the deep barcoding model has the potential to be an effective tool for species classification.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Engineering, Environmental

Stream water quality prediction using boosted regression tree and random forest models

Ali O. Alnahit, Ashok K. Mishra, Abdul A. Khan

Summary: This study compares two machine learning methods for predicting water quality parameters in 97 watersheds in the Southeast Atlantic region of the USA. The results show that one algorithm is easier to train and more robust. Furthermore, partial dependence plots highlight the complex and non-linear relationships between predictors and water quality indicators.

STOCHASTIC ENVIRONMENTAL RESEARCH AND RISK ASSESSMENT (2022)

Review Statistics & Probability

Advances in statistical modeling of spatial extremes

Raphael Huser, Jennifer L. Wadsworth

Summary: The classical modeling of spatial extremes is often too rigid and fails to capture the localized nature of severe events. It is also computationally challenging or less efficient to fit classical spatial extremes models in high dimensions. Recent progress includes new models with flexible tail structures and likelihood-based inference methods for large datasets.

WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS (2022)