Article

Assessing putative bias in prediction of anti-microbial resistance from real-world genotyping data under explicit causal assumptions

Journal

ARTIFICIAL INTELLIGENCE IN MEDICINE
Volume 130

Publisher

ELSEVIER
DOI: 10.1016/j.artmed.2022.102326

Keywords

Antimicrobial resistance; Biomedical informatics; Causal methods; Directed acyclic graph; Epidemiology; Explainability; Interpretability; Propensity score

Funding

  1. NIH NIAID [R01AI145552, R01AI141810]
  2. NSF [SCH 2013998]

Whole genome sequencing is becoming the standard method for identifying antimicrobial resistance due to its ability to provide detailed genetic information. However, the development of prediction tools for resistance is challenging due to biased sampling. This study evaluates the effectiveness of bias-handling methods on antibiotic resistance prediction using genetic data. The results show that bias-handling methods can improve the accuracy of prediction.
Whole genome sequencing (WGS) is quickly becoming the customary means for identification of antimicrobial resistance (AMR) due to its ability to obtain high resolution information about the genes and mechanisms that are causing resistance and driving pathogen mobility. By contrast, traditional phenotypic (antibiogram) testing cannot easily elucidate such information. Yet development of AMR prediction tools from genotype-phenotype data can be biased, since sampling is non-randomized. Sample provenience, period of collection, and species representation can confound the association of genetic traits with AMR. Thus, prediction models can perform poorly on new data with sampling distribution shifts. In this work, under an explicit set of causal assumptions, we evaluate the effectiveness of propensity-based rebalancing and confounding adjustment on antibiotic resistance prediction using genotype-phenotype AMR data from the Pathosystems Resource Integration Center (PATRIC). We select bacterial genotypes (encoded as k-mer signatures, i.e., DNA fragments of length k), country, year, species, and AMR phenotypes for the tetracycline drug class, preparing test data with recent genomes coming from a single country. We test boosted logistic regression (BLR) and random forests (RF) with and without bias-handling. On 10,936 instances, we find evidence of species, location, and year imbalance with respect to the AMR phenotype. The crude versus bias-adjusted change in effect of genetic signatures on AMR varies, but only moderately (selecting the top 20,000 out of 40+ million k-mers). The area under the receiver operating characteristic (AUROC) of the RF (0.95) is comparable to that of BLR (0.94) on both out-of-bag samples from bootstrap and the external test (n = 1085), where AUROCs do not decrease. We observe a 1% to 5% gain in AUROC with bias-handling compared to the sole use of genetic signatures.
In conclusion, we recommend using causally-informed prediction methods for modeling real-world AMR data; however, traditional adjustment or propensity-based methods may not provide advantage in all use cases and further methodological development should be sought.
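
As a rough illustration of what propensity-based rebalancing means in this setting, the sketch below computes stabilized inverse-probability weights for the presence of a single k-mer within confounder strata, then compares the crude and weighted association with resistance. It is not the authors' implementation: the function names, the toy records, and the use of simple stratum frequencies in place of a fitted propensity model are all assumptions made for illustration. In practice a stratum would be a combination such as (species, country, year).

```python
from collections import Counter

def stabilized_ip_weights(exposure, strata):
    """Weight each sample by P(E=e) / P(E=e | stratum), estimated from counts.

    Hypothetical helper: the propensity here is a plain stratum frequency,
    standing in for whatever propensity model is actually fitted.
    """
    n = len(exposure)
    marginal = Counter(exposure)
    stratum_size = Counter(strata)
    joint = Counter(zip(strata, exposure))
    weights = []
    for e, s in zip(exposure, strata):
        p_marginal = marginal[e] / n
        p_conditional = joint[(s, e)] / stratum_size[s]
        weights.append(p_marginal / p_conditional)
    return weights

def weighted_risk_difference(exposure, outcome, weights):
    """Weighted P(resistant | k-mer present) minus P(resistant | k-mer absent)."""
    def risk(level):
        num = sum(w * y for e, y, w in zip(exposure, outcome, weights) if e == level)
        den = sum(w for e, w in zip(exposure, weights) if e == level)
        return num / den
    return risk(1) - risk(0)

# Toy records (stratum, kmer_present, resistant); invented for illustration.
# Within each stratum the k-mer has no effect on resistance, but stratum "A"
# has both more k-mer carriers and more resistant isolates.
rows = (
    [("A", 1, 1)] * 4 + [("A", 1, 0)] * 2 + [("A", 0, 1)] * 2 + [("A", 0, 0)] * 1
    + [("B", 1, 1)] * 1 + [("B", 1, 0)] * 2 + [("B", 0, 1)] * 2 + [("B", 0, 0)] * 4
)
strata = [r[0] for r in rows]
exposure = [r[1] for r in rows]
outcome = [r[2] for r in rows]

crude = weighted_risk_difference(exposure, outcome, [1.0] * len(rows))
weights = stabilized_ip_weights(exposure, strata)
adjusted = weighted_risk_difference(exposure, outcome, weights)

# On this toy data the crude difference is about 0.11 and the
# weighted difference is about 0.
print(f"crude={crude:.3f} adjusted={adjusted:.3f}")
```

The same weights could in principle be passed as per-sample weights when fitting a classifier, which is one common way to combine rebalancing with a prediction model. Whether that matches the procedure evaluated in the paper is not stated in the abstract.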
