4.7 Article

Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models

Journal

JOURNAL OF CHEMICAL INFORMATION AND MODELING
Volume 62, Issue 5, Pages 1199-1206

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jcim.2c00079

Keywords

-

Funding

  1. Swiss National Science Foundation [205321_182176]
  2. RETHINK initiative at ETH Zurich
  3. Swiss National Science Foundation (SNF) [205321_182176] Funding Source: Swiss National Science Foundation (SNF)

Ask authors/readers for more resources

Chemical language models (CLMs) are useful for designing molecules with desired properties. This study introduces the perplexity metric to evaluate the generated molecules' similarity to the design objectives, ranking the promising designs. The perplexity scoring also helps identify and remove undesired biases in the model training process.
Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry system (SMILES) strings. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare greedy (beam search) with explorative (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available