Application in Real-World Scientific Workflows
ChemInformant is most useful as the starting point of a data science workflow: it feeds chemical data directly into Python’s scientific computing ecosystem. This page walks through three cases modeled on real research scenarios, showing how ChemInformant combines with RDKit, Scikit-learn, and NetworkX for tasks ranging from data preprocessing and multi-class classification to community detection.
Note
All examples use ChemInformant’s standardized snake_case property names for consistent data handling across workflows.
Note
The examples on this page depend on additional specialized libraries.
pip install rdkit scikit-learn networkx
Example 1: Batch Preprocessing and Analysis with RDKit
In chemical analysis, raw molecules obtained from a database often need to be standardized first, for example by “desalting” (removing counter-ions). In this scenario, we use ChemInformant to obtain the SMILES for a set of non-steroidal anti-inflammatory drugs (NSAIDs), pass them to RDKit for desalting, and then check whether each one contains a benzene ring, a common structural feature.
import ChemInformant as ci
from rdkit import Chem
from rdkit.Chem import SaltRemover
import pandas as pd

# 1. Use ci to get SMILES for a set of NSAIDs
identifiers = ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac',
               'ketoprofen', 'celecoxib', 'indomethacin']
df = ci.get_properties(identifiers, ['isomeric_smiles', 'input_identifier'])
df_clean = df[df['status'] == 'OK'].copy()

# 2. Use RDKit's SaltRemover to preprocess the data
remover = SaltRemover.SaltRemover()
df_clean['clean_smiles'] = df_clean['isomeric_smiles'].apply(
    lambda s: Chem.MolToSmiles(remover.StripMol(Chem.MolFromSmiles(s)))
)

# 3. Perform substructure analysis based on the preprocessed data
pattern = Chem.MolFromSmarts('c1ccccc1')
df_clean['has_benzene'] = df_clean['clean_smiles'].apply(
    lambda s: Chem.MolFromSmiles(s).HasSubstructMatch(pattern)
)

print(">>> RDKit Substructure Analysis: Do NSAIDs contain a benzene ring?")
print(df_clean[['input_identifier', 'has_benzene']])
Output:
>>> RDKit Substructure Analysis: Do NSAIDs contain a benzene ring?
  input_identifier  has_benzene
0          aspirin         True
1        ibuprofen         True
2         naproxen         True
3       diclofenac         True
4       ketoprofen         True
5        celecoxib         True
6     indomethacin         True
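To make the desalting step more concrete: in a SMILES string, disconnected components such as counter-ions are separated by a `.`. The sketch below is a string-level illustration only and is not how RDKit's `SaltRemover` works (that operates on parsed molecules and removes fragments matching its salt definitions). It simply keeps the longest fragment as a rough stand-in for the parent molecule. The helper name `largest_fragment` and the example salts are assumptions chosen for illustration.

```python
def largest_fragment(smiles: str) -> str:
    """Return the longest dot-separated component of a SMILES string.

    String length is only a rough proxy for fragment size; a real
    workflow should parse the molecule (for example with RDKit).
    """
    fragments = smiles.split(".")
    return max(fragments, key=len)


# Hypothetical inputs: two simple salts and one compound without a counter-ion.
examples = {
    "sodium acetate": "CC(=O)[O-].[Na+]",
    "ethylamine hydrochloride": "CCN.Cl",
    "ibuprofen (no counter-ion)": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
}

for name, smi in examples.items():
    print(f"{name}: {smi} -> {largest_fragment(smi)}")
```

Note that the acetate fragment keeps its negative charge: removing a counter-ion and neutralizing the remaining molecule are separate standardization steps.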
Example 2: Multi-Class Classification with Scikit-learn
We can use the data obtained from ChemInformant as features to train a machine learning model to distinguish between different classes of drugs. This example will differentiate between three classes of drugs: statins, NSAIDs, and antibiotics.
For workflow demonstration only
The purpose of this example is to show how data passes from ChemInformant into Scikit-learn for cross-validation. With only twelve compounds, the resulting accuracy should not be read as a meaningful benchmark.
import ChemInformant as ci
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from collections import Counter

# 1. Define three classes of drugs
classes = {
    'Statin': ['simvastatin', 'atorvastatin', 'pravastatin', 'rosuvastatin'],
    'NSAID': ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac'],
    'Antibiotic': ['amoxicillin', 'ciprofloxacin', 'azithromycin', 'doxycycline']
}

# Flatten the mapping into parallel lists of identifiers and class labels
labels, ids = [], []
for cls, drugs in classes.items():
    ids.extend(drugs)
    labels.extend([cls] * len(drugs))

# 2. Use ci to get comprehensive feature data (all_properties=True)
df_feat = ci.get_properties(ids, all_properties=True)
df_feat_clean = df_feat[df_feat['status'] == 'OK'].copy()

# 3. Prepare training data and perform cross-validation
# Other descriptors returned by ChemInformant (h_bond_donor_count,
# h_bond_acceptor_count, rotatable_bond_count) could be added to this list;
# the demo uses three.
features = ['molecular_weight', 'xlogp', 'tpsa']
X = df_feat_clean[features].values
# Keep only the labels of rows that passed the status filter
# (assumes df_feat keeps a default 0..n-1 index in input order)
y = pd.Categorical(pd.Series(labels).loc[df_feat_clean.index]).codes
counts = Counter(y)
min_class_count = min(counts.values()) if counts else 1
n_splits = min(5, min_class_count)
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
print(f">>> Multi-class accuracy {n_splits}-fold CV: {acc.mean():.2%} ± {acc.std():.2%}")
Output:
>>> Multi-class accuracy 4-fold CV: 91.67% ± 14.43%
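The large spread in this output follows directly from the tiny dataset. A 4-fold split means every class kept all four drugs, so all twelve compounds were retrieved, and stratification puts three test compounds (one per class) in each fold. Each fold's accuracy can therefore only be 0, 1/3, 2/3, or 1. A mean of 91.67% is 11 of 12 correct, which is only possible with three perfect folds and one fold with a single error. The sketch below reproduces the reported figures from that reasoning using only the standard library. It is a back-of-the-envelope check, not part of the ChemInformant workflow, and the variable names are illustrative.

```python
from statistics import mean, pstdev

n_samples = 12                      # 3 classes x 4 drugs, all retrieved
n_splits = 4                        # min(5, smallest class size) = min(5, 4)
fold_size = n_samples // n_splits   # 3 test compounds per fold

# The only per-fold split (in some order) that gives 11/12 correct overall
correct_per_fold = [3, 3, 3, 2]
fold_acc = [c / fold_size for c in correct_per_fold]

# pstdev is the population standard deviation, matching NumPy's default .std()
print(f"mean = {mean(fold_acc):.2%}, std = {pstdev(fold_acc):.2%}")
# prints: mean = 91.67%, std = 14.43%
```

In other words, a single misclassified compound accounts for the entire ±14.43%, which is why the figure says little about how the model would generalize.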
Example 3: Similarity Networking and Community Detection with NetworkX
This is a more advanced application that translates chemical similarity into a network relationship. We use ChemInformant to retrieve molecular information, use RDKit to calculate fingerprint similarity, and then use NetworkX to build a network graph and perform community detection (i.e., find subgroups of the most structurally similar compounds in the network).
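The similarity measure used in this example is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either one. As a minimal sketch of that definition, the code below treats a fingerprint as a plain Python set of on-bit positions. The function name `tanimoto` and the bit positions are made up for illustration; the actual example relies on RDKit's `TanimotoSimilarity` operating on bit vectors.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: shared on-bits / total distinct on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Hypothetical on-bit positions for two fingerprints
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 77}

print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```

A value of 1.0 means identical bit patterns and 0.0 means no bits in common, which is what makes a fixed cutoff a workable rule for deciding whether two compounds get an edge.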
import ChemInformant as ci
from rdkit import Chem
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
from rdkit.DataStructs import TanimotoSimilarity
import networkx as nx
from networkx.algorithms import community
# 1. Use ci to get SMILES for NSAIDs to generate fingerprints
ids_net = ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac']
df_net = ci.get_properties(ids_net, ['isomeric_smiles', 'input_identifier'])
df_net_clean = df_net[df_net['status'] == 'OK'].copy()

# 2. Generate fingerprints using RDKit
fpgen = GetMorganGenerator(radius=2, fpSize=1024)
fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s))
       for s in df_net_clean['isomeric_smiles']]

# 3. Build a graph with NetworkX and add edges based on similarity
G = nx.Graph()
for name in df_net_clean['input_identifier']:
    G.add_node(name)

# Use .iloc (positional access) so row positions line up with the fps list
for i in range(len(df_net_clean)):
    for j in range(i + 1, len(df_net_clean)):
        sim = TanimotoSimilarity(fps[i], fps[j])
        if sim >= 0.2:
            G.add_edge(df_net_clean.iloc[i]['input_identifier'],
                       df_net_clean.iloc[j]['input_identifier'],
                       weight=sim)

# 4. Perform community detection
communities = community.greedy_modularity_communities(G, weight='weight')

print("\n>>> NSAIDs Similarity Network Community Grouping:")
for idx, comm in enumerate(communities, 1):
    print(f" Community {idx}: {sorted(comm)}")
Output:
>>> NSAIDs Similarity Network Community Grouping:
Community 1: ['ibuprofen', 'naproxen']
Community 2: ['aspirin', 'diclofenac']
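The grouping depends heavily on the similarity cutoff used in step 3, since that cutoff decides which edges exist at all. The standard-library sketch below illustrates this with hypothetical similarity values for four placeholder compounds (`A` to `D`); none of the numbers come from real fingerprints. It groups nodes by connected components, which is a simpler technique than the greedy modularity algorithm used above and is swapped in here only to keep the example dependency-free.

```python
from itertools import combinations

# Hypothetical pairwise similarities for four placeholder compounds
similarity = {
    ("A", "B"): 0.35,
    ("A", "C"): 0.10,
    ("A", "D"): 0.15,
    ("B", "C"): 0.12,
    ("B", "D"): 0.08,
    ("C", "D"): 0.27,
}
nodes = ["A", "B", "C", "D"]
threshold = 0.2

# Keep only pairs at or above the cutoff, as in step 3 above
adjacency = {n: set() for n in nodes}
for u, v in combinations(nodes, 2):
    if similarity[(u, v)] >= threshold:
        adjacency[u].add(v)
        adjacency[v].add(u)


def connected_components(adj):
    """Group nodes that can reach one another (iterative depth-first search)."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


print(connected_components(adjacency))  # [['A', 'B'], ['C', 'D']]
```

With these made-up values, lowering `threshold` to 0.1 would connect all four nodes into one group, so it is worth trying a few cutoffs before interpreting any community structure.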