ChemInformant: A Robust Python Client for PubChem Data Access

Abstract

ChemInformant is a Python client for high-throughput, programmatic access to PubChem that streamlines automated data retrieval by converting large, mixed-type lists of chemical identifiers directly into analysis-ready Pandas DataFrames. To ensure resilience, the package integrates persistent HTTP caching, automatic rate-limiting with exponential backoff retries, and runtime data validation using Pydantic. By addressing critical limitations in existing tools, such as network instability and inefficient batch processing, ChemInformant offers significant performance improvements in batch retrieval operations, providing a more reliable and efficient component for the modern Python cheminformatics ecosystem. The software is released under the MIT license and has been published in the Journal of Open Source Software.

1. Introduction

The PubChem database, maintained by NCBI, contains information on over 100 million chemical compounds and serves as a critical resource for chemical and biological research. However, programmatic access to this vast repository presents several challenges for researchers developing automated workflows.

Existing Python clients often suffer from unreliable network handling, inefficient batch processing, and weak error handling. Many lack built-in request throttling, retries, or persistent caching, which forces users to write boilerplate code to deal with network errors and redundant requests. Maintenance is a further concern: some popular clients have not had a formal release in several years.

Batch processing is often inefficient in existing tools. Workflows with mixed-type identifiers (e.g., names and CIDs) require manual pre-processing. Furthermore, a single invalid identifier in a large batch can cause an entire query to fail without clear error reporting, hindering data acquisition pipelines.

ChemInformant addresses these limitations by providing a robust, workflow-centric Python client that abstracts the complexity of PubChem API interactions while ensuring data integrity and optimal performance. The library transforms multi-step data acquisition tasks into single, elegant function calls, enabling researchers to focus on analysis rather than data retrieval mechanics.

2. Key Features

Analysis-Ready Pandas/SQL Output

The core API returns either a clean Pandas DataFrame or direct SQL output, eliminating data wrangling boilerplate and enabling immediate integration with both the Python data science ecosystem and modern database workflows.

Automated Network Reliability

Keeps workflows running through transient network failures with built-in persistent caching, rate-limiting, and automatic retries. Transparently handles API pagination for large-scale queries, delivering complete result sets without manual intervention.
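
To make the retry behavior concrete, here is a minimal sketch of exponential backoff using only the standard library. It is not ChemInformant's implementation; the helper name call_with_backoff, its parameters, and the choice of ConnectionError as the retryable error are assumptions made for illustration.

```python
import random
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() and retry on ConnectionError with exponential backoff.

    Hypothetical helper for illustration; not part of ChemInformant.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt; a little jitter keeps
            # concurrent clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Usage: a stand-in request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"CID": 2244}


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the example fast and testable; a real client would leave the default time.sleep in place.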

Flexible & Fault-Tolerant Input

Natively accepts mixed lists of identifiers (names, CIDs, SMILES) and handles invalid inputs by flagging each one with a clear status in the output, so a single bad entry never fails an entire batch operation.
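
The idea of per-item status reporting can be sketched as follows. This is an illustration under stated assumptions, not the library's code: resolve_batch, the status strings "not_found" and "invalid_input", and the in-memory KNOWN table standing in for a PubChem name lookup are all hypothetical.

```python
def resolve_batch(identifiers, lookup_name):
    """Resolve a mixed list of names and CIDs, one status row per input.

    Hypothetical helper: a failure on one item is recorded in that item's
    row instead of aborting the whole batch.
    """
    rows = []
    for raw in identifiers:
        row = {"input": raw, "cid": None, "status": "OK"}
        try:
            if isinstance(raw, bool) or not isinstance(raw, (int, str)):
                raise TypeError("identifier must be an int or a str")
            if isinstance(raw, int) or raw.strip().isdigit():
                cid = int(raw)
                if cid <= 0:
                    raise ValueError("CIDs are positive integers")
                row["cid"] = cid
            else:
                # Anything non-numeric is treated as a name to look up.
                row["cid"] = lookup_name(raw.strip().lower())
        except KeyError:
            row["status"] = "not_found"
        except (TypeError, ValueError):
            row["status"] = "invalid_input"
        rows.append(row)
    return rows


# Stand-in for a remote name lookup (assumption for the example).
KNOWN = {"aspirin": 2244, "caffeine": 2519}

rows = resolve_batch(
    ["aspirin", "caffeine", 1983, "not-a-compound", None],
    KNOWN.__getitem__,
)
for row in rows:
    print(row)
```

Every input produces exactly one output row, so the result lines up with the original list even when some entries fail.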

Guaranteed Data Integrity

Employs Pydantic v2 models for rigorous, runtime data validation when using the object-based API, preventing malformed or unexpected data from corrupting analysis pipelines.

Terminal-Ready CLI Tools

Includes chemfetch and chemdraw for rapid data retrieval and 2D structure visualization directly from the terminal, perfect for quick lookups without writing scripts.

2.1 Comparison with Existing Tools

ChemInformant addresses critical gaps in the current landscape of chemical information clients. The following comparison highlights key advantages over widely-used alternatives:

Key Feature	ChemInformant	PubChemPy	PubChemR	webchem	ChemSpiPy
Platform	Python	Python	R	R	Python
Primary Database	PubChem	PubChem	PubChem	Multi-DB	ChemSpider
Persistent Caching	Yes	No	No	No	No
Rate-Limiting & Retries	Yes	No	No	Partial	No
Batch Retrieval	Yes	Partial	Partial	Partial	Partial
Mixed Identifier Support	Yes	No	No	No	No
Fault Tolerance	Yes	No	No	No	No
Runtime Type Safety	Yes	No	Partial	No	No
Project Activity	Active	Inactive	Active	Active	Inactive

Notes: Persistent Caching stores results locally to accelerate repeated queries. Rate-Limiting & Retries manages API request limits and server errors for robust automation. Fault Tolerance reports status per-item in batch queries, avoiding complete failure on single errors.
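
As a rough illustration of what "persistent" means in the note above, the sketch below stores fetched results in a SQLite file so that a second process, or a later run, can reuse them. It uses only the standard library and is not how ChemInformant implements its cache; the PersistentCache class and the fake_fetch function are hypothetical.

```python
import json
import os
import sqlite3
import tempfile


class PersistentCache:
    """Tiny key-value cache backed by a SQLite file (illustrative only)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get_or_fetch(self, key, fetch):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no network call needed
        value = fetch(key)  # cache miss: do the expensive call once
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()
        return value


calls = []  # records how often the "network" is actually hit


def fake_fetch(key):
    calls.append(key)
    return {"name": key}


path = os.path.join(tempfile.mkdtemp(), "cache.db")

cache = PersistentCache(path)
first = cache.get_or_fetch("aspirin", fake_fetch)
cache.conn.close()

# A fresh object on the same file sees the stored entry.
reopened = PersistentCache(path)
second = reopened.get_or_fetch("aspirin", fake_fetch)
reopened.conn.close()

print(first == second, calls)
```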

3. Quick Start

3.1 Basic Usage

    import ChemInformant as ci

    # Define identifiers - mixed types supported
    identifiers = ["aspirin", "caffeine", 1983]  # 1983 is paracetamol's CID

    # Specify properties to retrieve
    properties = ["molecular_weight", "xlogp", "cas"]

    # Get data as Pandas DataFrame
    df = ci.get_properties(identifiers, properties)

    # Save to SQL database
    ci.df_to_sql(df, "sqlite:///chem_data.db", "results", if_exists="replace")

    print(df)

3.2 Expected Output

3.3 Command Line Usage

4. Installation

5. Performance

The benchmark script (benchmark.py) compares ChemInformant with existing Python clients using 285 unique drug names across 6 properties. It measures retrieval speed in compounds per second under different caching scenarios, as reported in the JOSS publication.
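
For readers who want to reproduce the metric, a compounds-per-second figure can be computed along these lines. This is a sketch, not an excerpt from benchmark.py; the function name compounds_per_second, the injectable timer, and the stub retrieval function are assumptions.

```python
import time


def compounds_per_second(retrieve, identifiers, timer=time.perf_counter):
    """Time one batch retrieval and return identifiers processed per second."""
    start = timer()
    retrieve(identifiers)
    elapsed = timer() - start
    if elapsed <= 0:
        return float("inf")  # too fast for the clock resolution
    return len(identifiers) / elapsed


def stub_retrieve(identifiers):
    """Stand-in for a real batch call; returns one record per identifier."""
    return [{"input": item} for item in identifiers]


# Deterministic demonstration with a fake clock: 10 items in 2 "seconds".
ticks = iter([0.0, 2.0])
rate = compounds_per_second(stub_retrieve, list(range(10)), timer=lambda: next(ticks))
print(rate)
```

Running the same measurement once with an empty cache and once with a warm cache gives the two scenarios that such a benchmark would compare.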

5.1 Download Statistics

6. API Reference

6.1 API Structure Overview

ChemInformant provides two main API approaches: Convenience Functions for single compound queries and the get_properties API for batch/complex queries.

    graph LR
        A["ChemInformant API<br/>Entry Point"] --> B["Query Strategy<br/>Selection"]
        B -->|"Single Compound<br/>Quick Access"| C["Convenience Functions<br/>22 Specialized Methods"]
        B -->|"Batch Processing<br/>Complex Queries"| D["get_properties API<br/>Four Operational Modes"]

        C --> C1["Basic Properties<br/>molecular_weight, formula<br/>canonical_smiles, cas"]
        C --> C2["Descriptors<br/>exact_mass, tpsa<br/>complexity, charge"]
        C --> C3["Molecular Counts<br/>h_bond_counts, rotatable_bonds<br/>heavy_atoms"]
        C --> C4["Identifiers<br/>inchi, inchi_key<br/>stereo_counts"]

        D --> D1["Mode 1: CORE<br/>get_properties(identifiers)<br/>22 essential properties"]
        D --> D2["Mode 2: Custom<br/>get_properties(ids, properties)<br/>user-specified selection"]
        D --> D3["Mode 3: Enhanced<br/>get_properties(ids, include_3d=True)<br/>36 properties + 3D data"]
        D --> D4["Mode 4: Complete<br/>get_properties(ids, all_properties=True)<br/>comprehensive dataset"]

        classDef entryNode fill:#ffffff,stroke:#2c3e50,stroke-width:2px,color:#2c3e50
        classDef decisionNode fill:#f8f9fa,stroke:#34495e,stroke-width:2px,color:#2c3e50
        classDef categoryNode fill:#ecf0f1,stroke:#7f8c8d,stroke-width:1.5px,color:#2c3e50
        classDef methodNode fill:#fdfdfd,stroke:#95a5a6,stroke-width:1px,color:#2c3e50
        classDef modeNode fill:#f4f6f7,stroke:#bdc3c7,stroke-width:1px,color:#2c3e50

        class A entryNode
        class B decisionNode
        class C,D categoryNode
        class C1,C2,C3,C4 methodNode
        class D1,D2,D3,D4 modeNode

6.3 get_properties API (Four Calling Modes)

The get_properties function supports four different calling modes for different use cases:

Mode	Usage	Properties Count	Description
Mode 1: Default CORE	get_properties(identifiers)	22 properties	Returns all CORE properties automatically
Mode 2: Custom Properties	get_properties(identifiers, properties)	User-specified	Specify exact properties needed
Mode 3: Include 3D	get_properties(identifiers, include_3d=True)	36 properties	CORE + 3D computational properties
Mode 4: All Properties	get_properties(identifiers, all_properties=True)	40+ properties	Every available PubChem property

6.4 Property Categories Detail

6.4.1 CORE Properties (22)

CORE Properties included in Mode 1:
• molecular_formula, molecular_weight
• canonical_smiles, isomeric_smiles
• iupac_name, cas, synonyms
• xlogp, tpsa, complexity, charge
• exact_mass, monoisotopic_mass
• h_bond_donor_count, h_bond_acceptor_count
• rotatable_bond_count, heavy_atom_count
• atom_stereo_count, bond_stereo_count
• covalent_unit_count
• inchi, inchi_key

6.4.2 3D Properties (14)

Additional 3D Properties in Mode 3:
• volume_3d
• x_steric_quadrupole_3d, y_steric_quadrupole_3d, z_steric_quadrupole_3d
• feature_count_3d, feature_acceptor_count_3d
• feature_donor_count_3d, feature_anion_count_3d
• feature_cation_count_3d, feature_ring_count_3d
• feature_hydrophobe_count_3d
• conformer_model_rmsd_3d, effective_rotor_count_3d
• conformer_count_3d
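
The relationship between the calling modes and the two property lists can be sketched in a few lines. The property names are taken from the lists in this section; the helper select_properties itself is hypothetical, and both the rule that the three options are mutually exclusive and the use of None to mean "no filter" in Mode 4 are assumptions, not documented library behavior.

```python
CORE_PROPERTIES = [
    "molecular_formula", "molecular_weight",
    "canonical_smiles", "isomeric_smiles",
    "iupac_name", "cas", "synonyms",
    "xlogp", "tpsa", "complexity", "charge",
    "exact_mass", "monoisotopic_mass",
    "h_bond_donor_count", "h_bond_acceptor_count",
    "rotatable_bond_count", "heavy_atom_count",
    "atom_stereo_count", "bond_stereo_count",
    "covalent_unit_count",
    "inchi", "inchi_key",
]

THREE_D_PROPERTIES = [
    "volume_3d",
    "x_steric_quadrupole_3d", "y_steric_quadrupole_3d", "z_steric_quadrupole_3d",
    "feature_count_3d", "feature_acceptor_count_3d",
    "feature_donor_count_3d", "feature_anion_count_3d",
    "feature_cation_count_3d", "feature_ring_count_3d",
    "feature_hydrophobe_count_3d",
    "conformer_model_rmsd_3d", "effective_rotor_count_3d",
    "conformer_count_3d",
]


def select_properties(properties=None, include_3d=False, all_properties=False):
    """Return the property names implied by the chosen mode (illustrative)."""
    chosen = sum([properties is not None, include_3d, all_properties])
    if chosen > 1:
        # Assumption: the three options are treated as mutually exclusive.
        raise ValueError("pick at most one of properties, include_3d, all_properties")
    if all_properties:
        return None  # Mode 4: no filter; the full set is not listed here
    if include_3d:
        return CORE_PROPERTIES + THREE_D_PROPERTIES  # Mode 3
    if properties is not None:
        return list(properties)  # Mode 2
    return list(CORE_PROPERTIES)  # Mode 1


print(len(select_properties()))                 # 22
print(len(select_properties(include_3d=True)))  # 36
```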

6.5 Property Specification Formats
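
The string, list, and alias formats tabulated for this section all name the same underlying properties, so a client can reduce them to one canonical list before querying. The sketch below shows one way that might look. It is not ChemInformant's code: normalize_properties is a hypothetical helper, and the ALIASES mapping (in particular "smiles" to "canonical_smiles") is an assumption chosen for the example.

```python
# Hypothetical alias table; the library's real mapping may differ.
ALIASES = {
    "weight": "molecular_weight",
    "logp": "xlogp",
    "smiles": "canonical_smiles",
}


def normalize_properties(spec):
    """Turn a comma-separated string or a list into canonical property names."""
    names = spec.split(",") if isinstance(spec, str) else list(spec)
    result = []
    for name in names:
        key = name.strip().lower()
        if not key:
            continue  # skip empty entries such as a trailing comma
        key = ALIASES.get(key, key)
        if key not in result:  # keep the first occurrence, preserve order
            result.append(key)
    return result


print(normalize_properties("molecular_weight,xlogp"))
print(normalize_properties(["weight", "logp", "smiles"]))
print(normalize_properties(["molecular_weight", "volume_3d"]))
```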

Format	Example	Description
String format	'molecular_weight,xlogp'	Comma-separated property names
List format	['molecular_weight', 'xlogp']	Python list of property names
Alias support	['weight', 'logp', 'smiles']	Shortened aliases for common properties
Mixed categories	['molecular_weight', 'volume_3d']	Combine CORE, 3D, and other properties

7. Resources

7.1 Documentation and Examples

Citation

If you use ChemInformant in your research, please cite:

@article{He2025,
  doi       = {10.21105/joss.08341},
  url       = {https://doi.org/10.21105/joss.08341},
  year      = {2025},
  publisher = {The Open Journal},
  volume    = {10},
  number    = {112},
  pages     = {8341},
  author    = {He, Zhiang},
  title     = {ChemInformant: A Robust and Workflow-Centric Python Client 
              for High-Throughput PubChem Access},
  journal   = {Journal of Open Source Software}
}
        

License: MIT License
