=======================================================
Application in Real-World Scientific Workflows
=======================================================

The core value of ChemInformant lies in its role as a starting point for data science workflows, seamlessly injecting chemical data into Python's powerful scientific computing ecosystem. This page will demonstrate through three cases that more closely resemble real-world research scenarios how ChemInformant can be combined with advanced libraries like **RDKit**, **Scikit-learn**, and **NetworkX** to accomplish diverse tasks ranging from data preprocessing and multi-class classification to community detection.

.. note::
   All examples use ChemInformant's standardized **snake_case** property names for consistent data handling across workflows.

.. note::
   The examples on this page depend on additional specialized libraries.

   .. code-block:: bash

      pip install rdkit-pypi scikit-learn networkx


.. _rdkit_integration:

Example 1: Batch Preprocessing and Analysis with RDKit
-------------------------------------------------------

In chemical analysis, it is often necessary to first standardize raw molecules obtained from a database, for example, by "desalting". In this scenario, we **use ChemInformant to obtain the SMILES for a set of non-steroidal anti-inflammatory drugs (NSAIDs)**, then hand them over to RDKit for desalting, and further analyze whether they contain a benzene ring, a common chemical feature.

.. code-block:: python
   :emphasize-lines: 1, 9, 10, 14

   import ChemInformant as ci
   from rdkit import Chem
   from rdkit.Chem import SaltRemover
   import pandas as pd

   # 1. Use ci to get SMILES for a set of NSAIDs
   identifiers = ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac',
                  'ketoprofen', 'celecoxib', 'indomethacin']
   df = ci.get_properties(identifiers, ['isomeric_smiles', 'input_identifier'])
   df_clean = df[df['status'] == 'OK'].copy()

   # 2. Use RDKit's SaltRemover to preprocess the data
   remover = SaltRemover.SaltRemover()
   df_clean['clean_smiles'] = df_clean['isomeric_smiles'].apply(
       lambda s: Chem.MolToSmiles(remover.StripMol(Chem.MolFromSmiles(s)))
   )
   
   # 3. Perform substructure analysis based on the preprocessed data
   pattern = Chem.MolFromSmarts('c1ccccc1')
   df_clean['has_benzene'] = df_clean['clean_smiles'].apply(
       lambda s: Chem.MolFromSmiles(s).HasSubstructMatch(pattern)
   )
   
   print(">>> RDKit Substructure Analysis: Do NSAIDs contain a benzene ring?")
   print(df_clean[['input_identifier', 'has_benzene']])


Output:

.. code-block:: text

   >>> RDKit Substructure Analysis: Do NSAIDs contain a benzene ring?
     input_identifier  has_benzene
   0          aspirin         True
   1        ibuprofen         True
   2         naproxen         True
   3       diclofenac         True
   4       ketoprofen         True
   5        celecoxib         True
   6     indomethacin         True


.. _sklearn_integration:

Example 2: Multi-Class Classification with Scikit-learn
---------------------------------------------------------

We can use the **data obtained from ChemInformant** as features to train a machine learning model to distinguish between different classes of drugs. This example will differentiate between three classes of drugs: statins, NSAIDs, and antibiotics.

.. admonition:: For workflow demonstration only
   :class: caution

   The core purpose of this example is to demonstrate how to smoothly pass data from **ChemInformant** into Scikit-learn for cross-validation.

.. code-block:: python
   :emphasize-lines: 1, 21, 22, 23

   import ChemInformant as ci
   import pandas as pd
   from rdkit import Chem
   from rdkit.Chem import rdMolDescriptors
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import StratifiedKFold, cross_val_score
   from collections import Counter

   # 1. Define three classes of drugs
   classes = {
       'Statin': ['simvastatin', 'atorvastatin', 'pravastatin', 'rosuvastatin'],
       'NSAID': ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac'],
       'Antibiotic': ['amoxicillin', 'ciprofloxacin', 'azithromycin', 'doxycycline']
   }
   labels, ids = [], []
   for cls, drugs in classes.items():
       ids.extend(drugs)
       labels.extend([cls] * len(drugs))

   # 2. Use ci to get comprehensive feature data efficiently
   # NEW: Using all_properties for comprehensive dataset
   df_feat = ci.get_properties(ids, all_properties=True)
   df_feat_clean = df_feat[df_feat['status'] == 'OK'].copy()
   
   # Extract key features already available from ChemInformant
   features = ['molecular_weight', 'xlogp', 'tpsa', 'h_bond_donor_count', 
              'h_bond_acceptor_count', 'rotatable_bond_count']

   # 3. Prepare training data and perform cross-validation
   features = ['molecular_weight', 'xlogp', 'tpsa']
   X = df_feat_clean[features].values
   y = pd.Categorical(pd.Series(labels).loc[df_feat_clean.index]).codes
   
   counts = Counter(y)
   min_class_count = min(counts.values()) if counts else 1
   n_splits = min(5, min_class_count)
   
   cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
   clf = RandomForestClassifier(n_estimators=100, random_state=42)
   acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

   print(f">>> Multi-class accuracy {n_splits}-fold CV: {acc.mean():.2%} ± {acc.std():.2%}")

Output:

.. code-block:: text

   >>> Multi-class accuracy 4-fold CV: 91.67% ± 14.43%


.. _networkx_integration:

Example 3: Similarity Networking and Community Detection with NetworkX
--------------------------------------------------------------------------

This is a more advanced application that translates chemical similarity into a network relationship. We **use ChemInformant to retrieve molecular information**, use RDKit to calculate fingerprint similarity, and then use NetworkX to build a network graph and perform community detection (i.e., find subgroups of the most structurally similar compounds in the network).

.. code-block:: python
   :emphasize-lines: 1, 10, 11

   import ChemInformant as ci
   from rdkit import Chem
   from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
   from rdkit.DataStructs import TanimotoSimilarity
   import networkx as nx
   from networkx.algorithms import community

   # 1. Use ci to get SMILES for NSAIDs to generate fingerprints
   ids_net = ['aspirin', 'ibuprofen', 'naproxen', 'diclofenac']
   df_net = ci.get_properties(ids_net, ['isomeric_smiles', 'input_identifier'])
   df_net_clean = df_net[df_net['status'] == 'OK'].copy()

   # 2. Generate fingerprints using RDKit
   fpgen = GetMorganGenerator(radius=2, fpSize=1024)
   fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in df_net_clean['isomeric_smiles']]

   # 3. Build a graph with NetworkX and add edges based on similarity
   G = nx.Graph()
   for name in df_net_clean['input_identifier']:
       G.add_node(name)

   # Use .iloc to ensure index alignment
   for i in range(len(df_net_clean)):
       for j in range(i + 1, len(df_net_clean)):
           sim = TanimotoSimilarity(fps[i], fps[j])
           if sim >= 0.2:
               G.add_edge(df_net_clean.iloc[i]['input_identifier'], 
                          df_net_clean.iloc[j]['input_identifier'], 
                          weight=sim)
   
   # 4. Perform community detection
   communities = community.greedy_modularity_communities(G, weight='weight')

   print("\n>>> NSAIDs Similarity Network Community Grouping:")
   for idx, comm in enumerate(communities, 1):
       print(f"  Community {idx}: {sorted(comm)}")

Output:

.. code-block:: text

   >>> NSAIDs Similarity Network Community Grouping:
     Community 1: ['ibuprofen', 'naproxen']
     Community 2: ['aspirin', 'diclofenac']