Overview



ChemMORT (Chemical Molecular Optimization, Representation and Translation) is a platform for navigating chemical space and performing de novo molecular optimization. By combining a Neural Machine Translation model with the Particle Swarm Optimization method, ChemMORT can carry out multi-parameter optimization tasks effectively. Three modules are provided: SMILES Encoder, Descriptor Decoder and Molecular Optimizer.

Figure 1. Diagram of workflow


Introduction



SMILES Encoder

The SMILES Encoder transforms a SMILES string into a 512-dimensional vector using the well-trained encoder network of the Neural Machine Translation model. Because this representation is continuous, reversible and informative, it is recommended for use as a set of molecular descriptors, or as a kind of GPS for the chemical space of the molecules. Three types of input are supported: single or batch SMILES string(s), file upload (*.sdf/*.csv/*.txt) and molecule drawing in the editor.


Descriptor Decoder

The Descriptor Decoder translates a 512-dimensional vector back into the corresponding canonical SMILES string using the well-trained decoder network of the Neural Machine Translation model. Because the probability output of the decoder can be sampled, this function can serve as a steerable approach to de novo molecular optimization. Users can upload a file (*.csv) containing the 512-dimensional vector representations to be decoded.


Molecular Optimizer

The Molecular Optimizer combines the Neural Machine Translation model with the Particle Swarm Optimization method and is designed to optimize molecular features using credible ADMET prediction models. Because molecular optimization requires balancing several criteria, the Molecular Optimizer supports multi-parameter optimization within customized constraints. The module provides 11 high-quality ADMET prediction models based on the calculated 512-dimensional vectors (for details, see Property of Optimizer), which supports the accurate generation of optimized molecules.
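The Particle Swarm Optimization step can be illustrated with a minimal, generic sketch. This is not ChemMORT's implementation: the objective below is a toy function over a small continuous vector, standing in for a property score computed on the 512-dimensional embedding, and all names and hyperparameter values (`pso_minimize`, `inertia`, `cognitive`, `social`, `toy_objective`) are assumptions chosen for illustration.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_steps=100,
                 inertia=0.7, cognitive=1.5, social=1.5,
                 bounds=(-1.0, 1.0), seed=0):
    """Minimize `objective` over a `dim`-dimensional box with a basic PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions, zero initial velocities.
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position and value.
    best_pos = [p[:] for p in positions]
    best_val = [objective(p) for p in positions]
    # The swarm shares the best position found by any particle.
    g = min(range(n_particles), key=lambda i: best_val[i])
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (best_pos[i][d] - positions[i][d])
                    + social * r2 * (swarm_pos[d] - positions[i][d])
                )
                # Move the particle, then clamp it to the search box.
                moved = positions[i][d] + velocities[i][d]
                positions[i][d] = min(hi, max(lo, moved))
            value = objective(positions[i])
            if value < best_val[i]:
                best_pos[i], best_val[i] = positions[i][:], value
                if value < swarm_val:
                    swarm_pos, swarm_val = positions[i][:], value
    return swarm_pos, swarm_val


def toy_objective(x):
    # Hypothetical stand-in for "distance from the desired property profile".
    return sum((xi - 0.3) ** 2 for xi in x)


best_position, best_value = pso_minimize(toy_objective, dim=4)
```

In a setting like the one described above, the position vector would presumably be the molecular embedding and the objective would be built from the property models, but that mapping is an assumption of this sketch.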


Property of Optimizer

| Endpoint | Description | Performance | Type | Method |
| --- | --- | --- | --- | --- |
| logD7.4 | Log of the octanol/water distribution coefficient at pH 7.4. Optimal: 1~3 | Test set: RMSE 0.555±0.010, MAE 0.426±0.007, R2 0.840±0.004. 5-fold CV: RMSE 0.562±0.009, MAE 0.428±0.13, R2 0.834±0.005 | Basic property | XGBoost |
| AMES | The probability to be positive in Ames test. The smaller the AMES score, the less likely to be AMES positive. | Test set: ACC 0.813±0.007, SEN 0.835±0.013, SPE 0.787±0.013, AUC 0.888±0.004. 5-fold CV: ACC 0.810±0.016, SEN 0.838±0.014, SPE 0.777±0.031, AUC 0.889±0.013 | Toxicity | XGBoost |
| Caco-2 | Papp (Caco-2 Permeability). Optimal: higher than -5.15 log unit or -4.70 or -4.80 | Test set: RMSE 0.332±0.007, MAE 0.244±0.004, R2 0.718±0.019. 5-fold CV: RMSE 0.328±0.004, MAE 0.245±0.005, R2 0.728±0.011 | Absorption | XGBoost & Data Augment |
| MDCK | Papp (MDCK Permeability) | Test set: RMSE 0.323±0.022, MAE 0.232±0.011, R2 0.650±0.041. 5-fold CV: RMSE 0.322±0.034, MAE 0.235±0.021, R2 0.644±0.057 | Absorption | XGBoost & Data Augment |
| PPB | Plasma Protein Binding. Significant with drugs that are highly protein-bound and have a low therapeutic index. | Test set: RMSE 0.152±0.003, MAE 0.104±0.002, R2 0.691±0.016. 5-fold CV: RMSE 0.154±0.010, MAE 0.106±0.007, R2 0.691±0.025 | Distribution | XGBoost |
| QED | Quantitative estimate of drug-likeness | n/a | Drug-likeness score | Molecular Function |
| SlogP | Log of the octanol/water partition coefficient, based on an atomic contribution model [Crippen 1999]. Optimal: 0 < logP < 3. logP < 0: poor lipid bilayer permeability. logP > 3: poor aqueous solubility. | Fitted on an extensive training set of 9920 molecules, with R2 = 0.918 and σ = 0.677 | Basic property | Molecular Function |
| logS | Log of Solubility. Optimal: higher than -4 log mol/L. <10 μg/mL: low solubility. 10–60 μg/mL: moderate solubility. >60 μg/mL: high solubility. | Test set: RMSE 0.823±0.026, MAE 0.572±0.009, R2 0.862±0.011. 5-fold CV: RMSE 0.842±0.084, MAE 0.592±0.056, R2 0.839±0.029 | Basic property | XGBoost |
| hERG | The probability to be hERG Blocker. The higher the hERG score, the more likely to be hERG Blocker. | Test set: ACC 0.814±0.026, SEN 0.841±0.042, SPE 0.760±0.065, AUC 0.854±0.032. 5-fold CV: ACC 0.800±0.036, SEN 0.820±0.068, SPE 0.754±0.147, AUC 0.857±0.053 | Toxicity | XGBoost |
| Hepatoxicity | The probability of liver toxicity. The smaller the hepatoxicity score, the less likely to be liver toxic. | Test set: ACC 0.729±0.016, SEN 0.732±0.019, SPE 0.724±0.044, AUC 0.794±0.015. 5-fold CV: ACC 0.700±0.026, SEN 0.701±0.030, SPE 0.691±0.075, AUC 0.764±0.030 | Toxicity | XGBoost |
| LD50 | LD50 of acute toxicity. High toxicity: 1~50 mg/kg. Toxicity: 51~500 mg/kg. Low toxicity: 501~5000 mg/kg. | Test set: ACC 0.765±0.007, SEN 0.764±0.015, SPE 0.765±0.014, AUC 0.848±0.007. 5-fold CV: ACC 0.741±0.045, SEN 0.742±0.128, SPE 0.740±0.111, AUC 0.833±0.033 | Toxicity | XGBoost |
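For readers unfamiliar with the performance columns, the following sketch shows how such metrics are commonly computed. It is an illustration only: the function names are hypothetical, R2 is assumed here to be the coefficient of determination (the page does not say which definition was used), and AUC is omitted because it needs predicted probabilities rather than class labels.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 (coefficient of determination)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),
        "SEN": tp / (tp + fn),   # true positive rate
        "SPE": tn / (tn + fp),   # true negative rate
    }


# Tiny made-up inputs, not data from the models in the table.
reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
cls = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```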

Development Environment


| Third-party library | Version |
| --- | --- |
| TensorFlow | 1.14.0 |
| Scikit-learn | 0.23.2 |
| RDKit | 2019.03.1 |
| Django | 2.2 |
| XGBoost | 1.2.0 |
| Celery | 4.4.7 |
| RabbitMQ | 3.6.10 |

SMILES Encoder


Three input types are provided for SMILES encoding:

  1. By inputting SMILES string(s)
  2. By uploading a file (*.sdf/*.csv/*.txt)
  3. By drawing a molecule in the editor below

SMILES String(s)

Step1: Access the Services page via Services->SMILES Encoder in the navigation bar.

Step2: Select the first entry for “Input SMILES string(s)” and enter the SMILES string(s) in the input box.

Click the EXAMPLE button to quickly fill in the example SMILES.

Step3: Submit and get results.


After submission, the input SMILES strings are transformed into the corresponding 512-dimensional vector representations by the well-trained encoder network. On the results page, the Summary block presents an overview of the results, and the Results block presents detailed information about the SMILES strings, structure images and final status.

1. In the Summary block:

  • 1) Molecules indicates the total number of input SMILES strings;
  • 2) Invalid molecules indicates the number of unidentified SMILES string(s).

2. In the Results block:

  • 1) index indicates the index of the input SMILES;
  • 2) structure is the molecular image;
  • 3) SMILES is the input SMILES string;
  • 4) Feature: the "Copy" button copies the encoded features.

One output file is available for download; it contains the original SMILES and the successfully encoded 512-dimensional vectors.

Uploading file

Step1: Access the Services page via Services->SMILES Encoder in the navigation bar.

Step2: Select the second entry for “Upload file (*.sdf/*.csv/*.txt)” and upload the file in the input box.

There are three types of files available for upload here, all of which are in a similar format (*.TXT, *.CSV, *.TSV).

Step3: Submit and get results.


After submission, the results page has the same Summary and Results blocks and the same downloadable output file as described under SMILES String(s) above.

Molecular Editor

Step1: Access the Services page via Services->SMILES Encoder in the navigation bar.

Step2: Select the third entry for “Draw molecule from editor below” and draw the target molecular structure in the editor.

Step3: Submit and get results.

After submission, the results page has the same Summary and Results blocks and the same downloadable output file as described under SMILES String(s) above.

Descriptor Decoder


The Descriptor Decoder translates a 512-dimensional vector back into the corresponding canonical SMILES string using the well-trained decoder network.

Step1: Access the Services page via Services->Descriptor Decoder in the navigation bar.

Step2: Upload a file in the specified format: one 512-dimensional vector per row, with the values separated by spaces. Here's an example:

example.csv
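A file in this layout can be produced and checked with a few lines of Python. The sketch below is only an illustration under stated assumptions: the helper names (`write_vectors`, `read_vectors`), the six-decimal precision and the absence of a header row are choices made here, and the random values are placeholders rather than real embeddings.

```python
import random
import tempfile
from pathlib import Path

VECTOR_SIZE = 512  # embedding dimension stated in this documentation


def write_vectors(path, vectors):
    """Write one vector per row, values separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for vector in vectors:
            if len(vector) != VECTOR_SIZE:
                raise ValueError(
                    f"expected {VECTOR_SIZE} values, got {len(vector)}")
            handle.write(" ".join(f"{v:.6f}" for v in vector) + "\n")


def read_vectors(path):
    """Read the file back and check that every row has 512 values."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            values = [float(token) for token in line.split()]
            if len(values) != VECTOR_SIZE:
                raise ValueError(
                    f"row {line_number}: expected {VECTOR_SIZE} values, "
                    f"got {len(values)}")
            rows.append(values)
    return rows


# Placeholder data: random numbers, not embeddings of real molecules.
rng = random.Random(0)
demo_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(VECTOR_SIZE)]
                for _ in range(3)]
demo_path = Path(tempfile.mkdtemp()) / "vectors.csv"
write_vectors(demo_path, demo_vectors)
loaded = read_vectors(demo_path)
```

Checking the row length before upload catches the most likely formatting mistake, such as a truncated vector or a stray delimiter.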

Step3: Submit and get results.


After submission, the input 512-dimensional vectors are decoded into the corresponding canonical SMILES strings by the well-trained decoder network. On the results page, the Summary block presents an overview of the results, and the Results block presents detailed information about the SMILES strings, structure images and final status.

1. In the Summary block:

  • 1) Molecules indicates the total number of input SMILES strings;
  • 2) Invalid molecules indicates the number of unidentified SMILES string(s).

2. In the Results block:

  • 1) index indicates the index of the input SMILES;
  • 2) structure is the molecular image;
  • 3) SMILES is the input SMILES string;
  • 4) Feature: the "Copy" button copies the encoded features.

One output file is available for download; it contains the original SMILES, the status and the successfully encoded 512-dimensional vectors.

Molecular Optimizer


The Molecular Optimizer is designed to optimize molecular features using credible ADMET prediction models. Because molecular optimization requires balancing several criteria, the Molecular Optimizer supports multi-parameter optimization within customized constraints.

Step1: Enter a job name and the email address at which to receive the optimized result;

Step2: Enter the SMILES string of the target molecule to be optimized;

Click the EXAMPLE button to quickly fill in the example SMILES.

Step3: Select the properties to optimize (the aim properties), each with an ideal value or range;

Five feature categories are provided for optimization: basic property, absorption, distribution, drug-likeness score and toxicity. For multi-parameter optimization, the selected properties are combined into an overall score:

` \text{Score}=\frac{\sum_{i=1}^{j}(\text{Scaled Score}_i \cdot \text{Weight}_i)}{\sum_{i=1}^{j}\text{Weight}_i} `

where j is the number of aim properties, Scaled Score_i represents the desirability of aim property i of the optimized molecule, and Weight_i represents the contribution of aim property i of the optimized molecule.
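The formula is a weighted mean of the scaled scores, which can be sketched as follows. The function name `weighted_score` and the example numbers are hypothetical; how ChemMORT derives each scaled score from the ideal value or range is not described here and is not modeled.

```python
def weighted_score(scaled_scores, weights):
    """Weighted mean of per-property scaled scores."""
    if len(scaled_scores) != len(weights):
        raise ValueError("one weight is required per scaled score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(s * w for s, w in zip(scaled_scores, weights))
    return weighted_sum / total_weight


# Three made-up properties; the first counts twice as much as the others.
example_score = weighted_score([0.9, 0.5, 0.2], [2.0, 1.0, 1.0])
```

With these inputs the numerator is 0.9·2 + 0.5·1 + 0.2·1 = 2.5 and the denominator is 4, so the score is 0.625 up to floating-point rounding. Doubling every weight leaves the score unchanged, since only the relative weights matter.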

Step4: Set distance constraint:

Although credible optimization objectives assist molecular optimization, without any constraints the optimizer may focus solely on the objective itself and produce undesirable structures. In keeping with the nature of structure–activity relationships, the Distance Constraint limits the distance between the generated molecules and the reference molecules, based on ECFP4 fingerprints and Tanimoto similarity. It can also be used to steer the generated molecules toward structural features entirely different from those of the initial molecule. The weight of the distance constraint is set in the same way as the weights of the properties.
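The Tanimoto calculation behind such a constraint can be sketched in plain Python. This example makes several assumptions: fingerprints are represented as sets of on-bit indices (computing real ECFP4 fingerprints requires a cheminformatics toolkit and is not shown), distance is taken as 1 minus similarity, and the names `tanimoto` and `within_distance` are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def within_distance(reference_bits, candidate_bits, max_distance):
    """Check an assumed constraint: 1 - similarity must not exceed the limit."""
    distance = 1.0 - tanimoto(reference_bits, candidate_bits)
    return distance <= max_distance


# Made-up bit indices, not fingerprints of real molecules.
reference = {1, 2, 3}
candidate = {2, 3, 4}
similarity = tanimoto(reference, candidate)
```

Here two of the four distinct bits are shared, so the similarity is 0.5 and the candidate passes a distance limit of 0.5 but fails a limit of 0.4.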

Step5: Set Substructure Constraint:

Constraints on the presence of important substructures (such as an active motif or an undesirable substructure) can be specified by entering the corresponding SMARTS in Substructure Constraint. The weight of the substructure constraint is set in the same way as the weights of the properties.

Step6: Set the optimization steps and related circles:

Depending on the available computation time and the desired number of optimal molecules, users can set the STEPS and TRACK hyperparameters to control the optimization process, where STEPS is the number of iterations and TRACK is the number of optimal molecules to display.
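One way to picture the two settings is a loop that runs for a fixed number of iterations while keeping only the best few candidates seen so far. The sketch below is an assumption about the general idea, not ChemMORT's code: the names `keep_best` and `run_tracking` are hypothetical, higher scores are assumed to be better, and the labels stand in for generated molecules.

```python
import heapq


def keep_best(candidates, track):
    """Return the `track` highest-scoring (score, label) pairs, best first."""
    return heapq.nlargest(track, candidates, key=lambda item: item[0])


def run_tracking(score_batches, track):
    """Process one batch of scored candidates per iteration."""
    best = []
    for batch in score_batches:  # len(score_batches) plays the role of STEPS
        best = keep_best(best + list(batch), track)
    return best


# Two iterations of made-up scores; labels are placeholders, not molecules.
batches = [
    [(0.2, "mol_a"), (0.7, "mol_b")],
    [(0.5, "mol_c"), (0.9, "mol_d")],
]
tracked = run_tracking(batches, track=2)
```

More iterations give the search more chances to improve at the cost of run time, while a larger tracking size only changes how many results are retained.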

Step7: Submit and get result.

After submission, the task runs automatically in the background. Once it has completed, users can view the results via the link provided in the email.



The first row shows the information for the initial molecule, and the remaining rows show the information for the optimized molecules.