---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
library_name: pytorch
language:
- en
---
# MolE - Antimicrobial Prediction
This model combines MolE's pre-trained molecular representation with an XGBoost classifier to predict the antimicrobial activity of compounds from their molecular structure. It was developed by Roberto Olayo Alarcon et al.; more information is available in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).
## Files
- `model.pth` - the pre-trained representation model's weights
- `config.yaml` - model configuration
- `MolE-XGBoost-08.03.2024_14.20.pkl` - pre-trained XGBoost model
## Usage
### Inference Example
Below is a minimal example showing how to load and run inference with **MolE** directly from the Hugging Face Hub.
```python
import pickle

import pandas as pd
import torch
import yaml
from huggingface_hub import hf_hub_download

# MolE modules; these must be importable from your Python path.
import mole_representation
import mole_antimicrobial_prediction


class MolE:
    def __init__(self, device="auto"):
        repo = "pavm595/MolE-antimicrobial"
        # "auto" picks the first GPU when available; any other value is used as given.
        if device == "auto":
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.device = device

        # Download the files from the Hub and load them.
        with open(hf_hub_download(repo, "config.yaml")) as f:
            cfg = yaml.safe_load(f)
        self.model = mole_representation.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(
            torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device)
        )
        with open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb") as f:
            self.xgb = pickle.load(f)

    def predict_from_smiles(self, smiles_tsv):
        # Read the molecules and compute their MolE embeddings.
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = mole_representation.batch_representation(
            smiles_df, self.model, "smiles", "chem_name", device=self.device
        )
        # Combine the embeddings with the strains from the screening data.
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "data/01.prepare_training_data/maier_screening_results.tsv.gz"
        )
        # Keep the probability of the positive class.
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )
```
### Run inference
```python
mole = MolE()
pred = mole.predict_from_smiles("examples/input/examples_molecules.tsv")
print(pred)
```
## Metadata
### Input
The input is a TSV file with two columns: `chem_name` and `smiles`. The column `chem_name` contains the name of the molecule from PubChem, e.g. Halicin, and the column `smiles` contains the molecular structure as a SMILES string, e.g. `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`. An example input is the file `examples/input/example_molecules.tsv`.
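If your molecules are not yet in this format, the following minimal sketch shows one way to build such a table with pandas. The helper name `build_input_table` and the output file name in the final comment are illustrative assumptions, not part of MolE.

```python
import pandas as pd


def build_input_table(molecules):
    """Build the two-column input table from a {name: SMILES} mapping.

    Hypothetical helper for illustration; it is not part of the MolE code.
    """
    return pd.DataFrame(
        {
            "chem_name": list(molecules.keys()),
            "smiles": list(molecules.values()),
        }
    )


table = build_input_table(
    {"Halicin": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"}
)

# Serialize as tab-separated text with a header row and no index column.
tsv_text = table.to_csv(sep="\t", index=False)
print(tsv_text)

# To write a file instead (the file name is an arbitrary choice):
# table.to_csv("my_molecules.tsv", sep="\t", index=False)
```

The resulting file can then be passed to `predict_from_smiles` in place of the example input.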
### Output
The output is a TSV file with two columns: `pred_id` and `antimicrobial_predictive_probability`. The column `pred_id` identifies a molecule and a bacterial strain, e.g. `Halicin:Akkermansia muciniphila (NT5021)`. The column `antimicrobial_predictive_probability` contains the antimicrobial potential (AP) score used for molecule prioritization, which reflects the probability that the given molecule inhibits the growth of the corresponding strain, e.g. 0.021192694. An example output is `examples/output/example_molecules_prediction.tsv`.
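For downstream filtering it can help to split `pred_id` back into molecule and strain. The sketch below does this on a toy table, assuming the `molecule:strain` layout shown above and that strain labels never contain a colon. Only the first row comes from this README; the other rows are made-up placeholders, not real predictions.

```python
import pandas as pd

# Toy predictions: the first row repeats the example from this README,
# the remaining rows are made-up placeholders for illustration only.
pred = pd.DataFrame(
    {"antimicrobial_predictive_probability": [0.021192694, 0.80, 0.35]},
    index=pd.Index(
        [
            "Halicin:Akkermansia muciniphila (NT5021)",
            "placeholder_molecule:placeholder strain A",
            "placeholder_molecule:placeholder strain B",
        ],
        name="pred_id",
    ),
)

# Split on the last colon (assumption: strain labels contain no ":").
parts = pred.index.to_series().str.rsplit(":", n=1, expand=True)
pred["chem_name"] = parts[0]
pred["strain"] = parts[1]

# Rank molecule-strain pairs from most to least likely to be inhibited.
ranked = pred.sort_values("antimicrobial_predictive_probability", ascending=False)
print(ranked)
```

Splitting on the last colon rather than the first keeps molecule names that themselves contain a colon intact.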
## Copyright
Code derived from https://github.com/rolayoalarcon/MolE is licensed under the MIT license, Copyright (c) 2024 Roberto Olayo Alarcon. The [model weights](https://doi.org/10.5281/zenodo.10803099) are licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode), Copyright (c) 2024 Roberto Olayo Alarcon. All other code is licensed under the MIT license, Copyright (c) 2025 Maksim Pavlov.