Statistics & Probability

Article Biochemical Research Methods

Analysing high-throughput sequencing data in Python with HTSeq 2.0

Givanna H. Putri, Simon Anders, Paul Theodor Pyl, John E. Pimanda, Fabio Zanini

Summary: HTSeq 2.0 provides an expanded application programming interface, including a new representation for sparse genomic data, enhancements for htseq-count to accommodate single-cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes, and Python 3 support.

BIOINFORMATICS (2022)

Article Statistics & Probability

A Minimax Optimal Ridge-Type Set Test for Global Hypothesis With Applications in Whole Genome Sequencing Association Studies

Yaowu Liu, Zilin Li, Xihong Lin

Summary: This article proposes a minimax optimal ridge-type set test (MORST) for testing a global hypothesis. MORST has higher power than classical tests when the signals are weak or moderate, at only a slight increase in computational cost. Extensive simulations demonstrate the robustness of MORST, and it performs well on real data.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2022)

Article Statistics & Probability

Wasserstein Regression

Yaqing Chen, Zhenhua Lin, Hans-Georg Muller

Summary: This paper proposes a distribution-to-distribution regression model based on the Wasserstein metric for analyzing random object data that do not belong to vector spaces. By utilizing the geometric properties of the tangent bundles of the space of random measures, the distributions are mapped to tangent spaces, enabling regression modeling for distribution data. Through simulations and asymptotic convergence rate derivation, the performance of the model in predicting distributions and estimating regression operators is verified.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2023)

Article Statistics & Probability

Interpretable machine learning: Fundamental principles and 10 grand challenges

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, Chudi Zhong

Summary: This work highlights the fundamental principles of interpretable machine learning and identifies 10 technical challenge areas in this field, including optimizing sparse models, scoring systems, and adding constraints for better interpretability. It serves as a useful starting point for statisticians and computer scientists interested in interpretable machine learning.

STATISTICS SURVEYS (2022)

Article Biochemical Research Methods

CoV-Spectrum: analysis of globally shared SARS-CoV-2 data to identify and characterize new variants

Chaoran Chen, Sarah Nadeau, Michael Yared, Philippe Voinov, Ning Xie, Cornelius Roemer, Tanja Stadler

Summary: The CoV-Spectrum website provides support for identifying and tracking new SARS-CoV-2 variants, with flexible mutation search capabilities and analysis on various data sources to understand characteristics and transmission of different variants.

BIOINFORMATICS (2022)

Article Computer Science, Artificial Intelligence

Optimal ratio for data splitting

V. Roshan Joseph

Summary: When splitting data into training and testing sets, the optimal training-to-testing ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.

STATISTICAL ANALYSIS AND DATA MINING (2022)
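
As a rough illustration of the √p : 1 rule summarized in this entry, the ratio can be converted into training and testing fractions. This is a minimal sketch rather than the author's code: the function name `split_fractions` and the example value p = 9 are assumptions chosen for illustration.

```python
import math


def split_fractions(p: int) -> tuple[float, float]:
    """Return (train_fraction, test_fraction) for a sqrt(p) : 1 split.

    `p` is the number of parameters in the linear regression model
    (hypothetical helper, not taken from the paper).
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    root = math.sqrt(p)
    # A ratio of sqrt(p) : 1 means sqrt(p) parts training to 1 part testing.
    train = root / (root + 1.0)
    return train, 1.0 - train


# Example: p = 9 gives a 3 : 1 split.
train, test = split_fractions(9)
print(f"train={train:.2f}, test={test:.2f}")  # train=0.75, test=0.25
```

Under this rule the training share grows with p: with p = 16 the split would be 4 : 1, that is 80% training and 20% testing.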

Article Biochemical Research Methods

ProteinBERT: a universal deep-learning model of protein sequence and function

Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial

Summary: Self-supervised deep language modeling has achieved unprecedented success on natural language tasks. The authors introduce ProteinBERT, a deep language model designed specifically for proteins. It handles long sequences efficiently and approaches, and sometimes exceeds, the performance of other methods, providing an effective framework for rapidly training protein predictors.

BIOINFORMATICS (2022)

Article Economics

RCTs to Scale: Comprehensive Evidence From Two Nudge Units

Stefano DellaVigna, Elizabeth Linos

Summary: Nudge interventions have been widely implemented in both academic studies and government units, but their measured impact differs significantly between the two settings. This study compares 126 randomized controlled trials (RCTs) conducted by Nudge Units with trials published in academic journals, and identifies three factors contributing to the differences: statistical power, characteristics of the interventions, and selective publication. The findings suggest that selective publication and low statistical power are the major contributors to the disparities, while variation in nudge characteristics explains the remaining differences.

ECONOMETRICA (2022)

Article Economics

Two-way fixed effects and differences-in-differences with heterogeneous treatment effects: a survey

Clement de Chaisemartin, Xavier D'Haultfoeuille

Summary: Linear regressions with period and group fixed effects are commonly used to estimate the effects of policies. However, recent research has shown that these regressions may produce misleading estimates if the effects of policies vary between different groups or over time. Therefore, alternative estimators robust to heterogeneous effects have been proposed in a growing literature. This survey uses these alternative estimators to reexamine a previous study by Wolfers ().

ECONOMETRICS JOURNAL (2023)

Article Biochemical Research Methods

GTDB-Tk v2: memory friendly classification with the genome taxonomy database

Pierre-Alain Chaumeil, Aaron J. Mussig, Philip Hugenholtz, Donovan H. Parks

Summary: This study presents an updated version of GTDB-Tk that uses a divide-and-conquer approach to reduce memory requirements with minimal impact on classification.

BIOINFORMATICS (2022)

Article Biochemical Research Methods

Plant Disease Detection Using Generated Leaves Based on DoubleGAN

Yafeng Zhao, Zhen Chen, Xuan Gao, Wenlong Song, Qiang Xiong, Junfeng Hu, Zhichao Zhang

Summary: The study explores the use of DoubleGAN to generate images of unhealthy plant leaves in order to balance imbalanced datasets. A WGAN is used to obtain a pretrained model, and an SRGAN is used to generate high-resolution images. Compared with DCGAN, the images generated by DoubleGAN are clearer and yield higher recognition accuracy.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Biochemical Research Methods

Graph Convolutional Networks for Drug Response Prediction

Tuan Nguyen, Giang T T Nguyen, Thin Nguyen, Duc-Hau Le

Summary: This study proposes a novel method called GraphDRP based on graph convolutional networks for drug response prediction and finds that graph representation can improve prediction performance.

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2022)

Article Mathematical & Computational Biology

Estimating diversity in networked ecological communities

Amy D. Willis, Bryan D. Martin

Summary: Comparing ecological communities across environmental gradients is challenging, especially when there are many different taxonomic groups. Traditional diversity estimation methods, such as maximum likelihood estimates of the parameters of a multinomial model, have strict assumptions and do not account for ecological networks. In this article, the authors leverage models from the compositional data literature to estimate diversity indices, such as Shannon, Simpson, Bray-Curtis, and Euclidean. They find that their method performs best in strongly networked communities with many taxa, as shown in a case study on the microbiome of seafloor basalts.

BIOSTATISTICS (2022)

Article Computer Science, Theory & Methods

Fuzzy-based dynamic event triggering formation control for nonstrict-feedback nonlinear MASs

Liang Cao, Deyin Yao, Hongyi Li, Wei Meng, Renquan Lu

Summary: This paper investigates the formation control issue of nonlinear multiagent systems with asymmetric input saturation and unmeasured states. A high-gain fuzzy observer is constructed to estimate the unavailable states, and a leader-follower formation control strategy is proposed. Two new dynamic event triggering mechanisms and dynamic rules of threshold parameters are established to reduce the communication between controller and actuator. Furthermore, a modified auxiliary system is developed to counteract the adverse effect of asymmetric input saturation.

FUZZY SETS AND SYSTEMS (2023)

Article Statistics & Probability

Construction of the average variance extracted index for construct validation in structural equation models with adaptive regressions

Patricia Mendes dos Santos, Marcelo Angelo Cirillo

Summary: This study improves a conventional construct validation indicator by using adaptive regressions and finds that the adaptive linear regression method is efficient for correctly specified models in formative structural models.

COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION (2023)

Article Economics

General Equilibrium Effects of Cash Transfers: Experimental Evidence From Kenya

Dennis Egger, Johannes Haushofer, Edward Miguel, Paul Niehaus, Michael Walker

Summary: This study examines the effects of large economic stimuli on individuals and the overall economy, and provides meaningful insights through an experiment conducted in rural Kenya. The results show that cash transfers have significant impacts on consumption and assets for recipients, and also generate positive spillover effects on non-recipient households and firms, with minimal inflation.

ECONOMETRICA (2022)

Article Statistics & Probability

Robust Post-Matching Inference

Alberto Abadie, Jann Spiess

Summary: Nearest-neighbor matching is a useful tool for creating balance between treatment and control groups in observational studies, reducing the dependence on parametric modeling assumptions. Ignoring the matching step can lead to invalid standard errors, especially if matching is conducted with replacement or if the regression model is misspecified.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2022)

Article Biochemical Research Methods

ABlooper: fast accurate antibody CDR loop structure prediction with accuracy estimation

Brennan Abanades, Guy Georges, Alexander Bujotzek, Charlotte M. Deane

Summary: In this study, the researchers developed a deep learning-based tool called ABlooper for predicting the structure of CDR loops in antibodies. ABlooper accurately predicts the structure of CDR-H3 loops, which are known for their sequence and structural variability. The tool provides high accuracy predictions and confidence estimates for each prediction.

BIOINFORMATICS (2022)

Article Biochemical Research Methods

UCSCXenaShiny: an R/CRAN package for interactive analysis of UCSC Xena data

Shixiang Wang, Yi Xiong, Longfei Zhao, Kai Gu, Yin Li, Fei Zhao, Jianfeng Li, Mingjie Wang, Haitao Wang, Ziyu Tao, Tao Wu, Yichao Zheng, Xuejun Li, Xue-Song Liu

Summary: The UCSC Xena platform offers processed cancer omics data, and UCSCXenaShiny is an R Shiny package that allows users to quickly search, download, and explore those data. The tool opens up research opportunities for cancer researchers and clinicians with limited programming experience.

BIOINFORMATICS (2022)

Article Statistics & Probability

Separable Effects for Causal Inference in the Presence of Competing Events

Mats J. Stensrud, Jessica G. Young, Vanessa Didelez, James M. Robins, Miguel A. Hernan

Summary: The presence of competing events complicates the definition of causal effects in time-to-event settings. This study proposes separable effects to examine the causal effect of a treatment on an event of interest. The separable direct effect is the treatment effect on the event of interest not mediated by its effect on the competing event. The separable indirect effect is the treatment effect on the event of interest only through its effect on the competing event. The assumption that the treatment can be decomposed into two distinct components is necessary for identifying the separable effects.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2022)