ChemInformant: A Robust Python Client
for PubChem Data Access

Abstract

ChemInformant is a Python client for high-throughput, programmatic access to PubChem that streamlines automated data retrieval by converting large, mixed-type lists of chemical identifiers directly into analysis-ready Pandas DataFrames. To ensure resilience, the package integrates persistent HTTP caching, automatic rate-limiting with exponential backoff retries, and runtime data validation using Pydantic. By addressing critical limitations in existing tools, such as network instability and inefficient batch processing, ChemInformant offers significant performance improvements in batch retrieval operations, providing a more reliable and efficient component for the modern Python cheminformatics ecosystem. The software is released under the MIT license and has been published in the Journal of Open Source Software.
Keywords: cheminformatics, PubChem, data retrieval, Python, batch processing, chemical databases, API client

1. Introduction

The PubChem database, maintained by NCBI, contains information on over 100 million chemical compounds and serves as a critical resource for chemical and biological research. However, programmatic access to this vast repository presents several challenges for researchers developing automated workflows.

Existing Python clients often suffer from network reliability issues, inefficient batch processing, and weak error handling. Many lack built-in request throttling, retries, or persistent caching, so users must write boilerplate code to handle network errors and avoid redundant requests. Maintenance is also a concern: some popular clients have not had a formal release in several years.

Batch processing is often inefficient in existing tools. Workflows with mixed-type identifiers (e.g., names and CIDs) require manual pre-processing. Furthermore, a single invalid identifier in a large batch can cause an entire query to fail without clear error reporting, hindering data acquisition pipelines.

ChemInformant addresses these limitations by providing a robust, workflow-centric Python client that abstracts the complexity of PubChem API interactions while ensuring data integrity and optimal performance. The library transforms multi-step data acquisition tasks into single, elegant function calls, enabling researchers to focus on analysis rather than data retrieval mechanics.

2. Key Features

Analysis-Ready Pandas/SQL Output

The core API returns either a clean Pandas DataFrame or direct SQL output, eliminating data wrangling boilerplate and enabling immediate integration with both the Python data science ecosystem and modern database workflows.

Automated Network Reliability

Keeps automated workflows running through transient network failures with built-in persistent caching, rate-limiting, and automatic retries. API pagination for large-scale queries is handled transparently, so complete result sets are returned without manual intervention.
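The retry behavior described above can be illustrated with a minimal sketch. It is not ChemInformant's internal implementation: TransientError, backoff_delays, and call_with_retries are hypothetical names, and the base delay and doubling factor are assumed values. The sleep function is injectable so the demo records the waits instead of pausing.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (e.g. HTTP 503 or a timeout)."""


def backoff_delays(max_retries, base_delay=1.0, factor=2.0):
    """Return the wait before each retry: base_delay * factor**attempt."""
    return [base_delay * factor**attempt for attempt in range(max_retries)]


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])


# Demo: a request that fails twice, then succeeds.
recorded_waits = []
attempts = {"count": 0}


def flaky_request():
    """Simulate two transient failures followed by a success."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated server error")
    return "ok"


result = call_with_retries(flaky_request, base_delay=0.5, sleep=recorded_waits.append)
print(result, recorded_waits)
```

Doubling the wait after each failure gives a struggling server progressively more room to recover, which is the usual motivation for exponential backoff over fixed-interval retries.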

Flexible & Fault-Tolerant Input

Natively accepts mixed lists of identifiers (names, CIDs, SMILES) and intelligently handles invalid inputs by flagging them with clear status in the output, ensuring a single bad entry never fails an entire batch operation.
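A rough sketch of per-item status reporting for a mixed batch follows. It runs offline against a hypothetical lookup table (KNOWN_NAMES) instead of PubChem, treats every non-numeric string as a name (SMILES detection is omitted), and uses the assumed status labels NOT_FOUND and INVALID_INPUT; only OK appears in the documented output.

```python
# Hypothetical offline lookup table standing in for PubChem name resolution.
KNOWN_NAMES = {"aspirin": 2244, "caffeine": 2519}


def classify_identifier(identifier):
    """Guess the identifier type: 'cid' for positive integers or digit strings, else 'name'."""
    if isinstance(identifier, bool):
        return "invalid"  # bool is a subclass of int, so reject it explicitly
    if isinstance(identifier, int):
        return "cid" if identifier > 0 else "invalid"
    if isinstance(identifier, str):
        text = identifier.strip()
        if not text:
            return "invalid"
        return "cid" if text.isascii() and text.isdigit() else "name"
    return "invalid"


def resolve_batch(identifiers, name_to_cid=KNOWN_NAMES):
    """Resolve each identifier independently so one bad entry never fails the batch."""
    rows = []
    for identifier in identifiers:
        kind = classify_identifier(identifier)
        if kind == "cid":
            cid, status = int(identifier), "OK"
        elif kind == "name":
            cid = name_to_cid.get(identifier.strip().lower())
            status = "OK" if cid is not None else "NOT_FOUND"
        else:
            cid, status = None, "INVALID_INPUT"
        rows.append({"input_identifier": identifier, "cid": cid, "status": status})
    return rows


rows = resolve_batch(["aspirin", "caffeine", 1983, "not-a-compound", None])
for row in rows:
    print(row)
```

Because every input produces exactly one row, the output can be joined back to the original list by position, and failed entries can be filtered by status afterwards.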

Guaranteed Data Integrity

Employs Pydantic v2 models for rigorous, runtime data validation when using the object-based API, preventing malformed or unexpected data from corrupting analysis pipelines.
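As a rough illustration of runtime validation, the sketch below uses a standard-library dataclass as a stand-in for a Pydantic model, so it runs without extra dependencies. CompoundRecord, its fields, and is_valid are hypothetical and do not reflect ChemInformant's actual models.

```python
from dataclasses import dataclass
from typing import Optional


def _is_number(value):
    """True for int or float, excluding bool (which is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


@dataclass(frozen=True)
class CompoundRecord:
    """Hypothetical record that is checked at construction time."""

    cid: int
    molecular_weight: float
    xlogp: Optional[float] = None

    def __post_init__(self):
        if not isinstance(self.cid, int) or isinstance(self.cid, bool) or self.cid <= 0:
            raise ValueError(f"cid must be a positive integer, got {self.cid!r}")
        if not _is_number(self.molecular_weight) or self.molecular_weight <= 0:
            raise ValueError(
                f"molecular_weight must be a positive number, got {self.molecular_weight!r}"
            )
        if self.xlogp is not None and not _is_number(self.xlogp):
            raise ValueError(f"xlogp must be a number or None, got {self.xlogp!r}")


def is_valid(**fields):
    """Return True if the fields build a CompoundRecord, False otherwise."""
    try:
        CompoundRecord(**fields)
    except (TypeError, ValueError):
        return False
    return True


print(is_valid(cid=2244, molecular_weight=180.16, xlogp=1.2))
print(is_valid(cid="2244", molecular_weight=180.16))
```

Rejecting a malformed record at construction time means downstream analysis code never sees it, which is the property the object-based API relies on.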

Terminal-Ready CLI Tools

Includes chemfetch and chemdraw for rapid data retrieval and 2D structure visualization directly from the terminal, perfect for quick lookups without writing scripts.

2.1 Comparison with Existing Tools

ChemInformant addresses critical gaps in the current landscape of chemical information clients. The following comparison highlights key advantages over widely-used alternatives:

Table 1. Comparative analysis of key features in mainstream chemical information clients (from JOSS publication)

| Key Feature | ChemInformant | PubChemPy | PubChemR | webchem | ChemSpiPy |
|---|---|---|---|---|---|
| Platform | Python | Python | R | R | Python |
| Primary Database | PubChem | PubChem | PubChem | Multi-DB | ChemSpider |
| Persistent Caching | Yes | No | No | No | No |
| Rate-Limiting & Retries | Yes | No | No | Partial | No |
| Batch Retrieval | Yes | Partial | Partial | Partial | Partial |
| Mixed Identifier Support | Yes | No | No | No | No |
| Fault Tolerance | Yes | No | No | No | No |
| Runtime Type Safety | Yes | No | Partial | No | No |
| Project Activity | Active | Inactive | Active | Active | Inactive |

Notes: Persistent Caching stores results locally to accelerate repeated queries. Rate-Limiting & Retries manages API request limits and server errors for robust automation. Fault Tolerance reports status per-item in batch queries, avoiding complete failure on single errors.
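To make the idea of persistent caching concrete, here is a minimal standard-library sketch. It is an illustration only, not the caching layer ChemInformant uses: PersistentCache and get_or_fetch are hypothetical names, and the demo uses an in-memory database, so a file path would be needed for results to survive across runs.

```python
import json
import sqlite3


class PersistentCache:
    """Tiny key-value cache backed by SQLite; values must be JSON-serializable."""

    def __init__(self, path=":memory:"):
        # Pass a file path (e.g. "cache.db") to keep entries between runs.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) only on a miss."""
        row = self._conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        value = fetch(key)
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()
        return value


# Demo: the second lookup is served from the cache, so fetch runs only once.
fetch_calls = []


def slow_lookup(key):
    """Stand-in for a network request."""
    fetch_calls.append(key)
    return {"cid": 2244, "name": key}


cache = PersistentCache()
first = cache.get_or_fetch("aspirin", slow_lookup)
second = cache.get_or_fetch("aspirin", slow_lookup)
print(first == second, fetch_calls)
```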

3. Quick Start

3.1 Basic Usage

    import ChemInformant as ci

    # Define identifiers - mixed types supported
    identifiers = ["aspirin", "caffeine", 1983]  # 1983 is paracetamol's CID

    # Specify properties to retrieve
    properties = ["molecular_weight", "xlogp", "cas"]

    # Get data as Pandas DataFrame
    df = ci.get_properties(identifiers, properties)

    # Save to SQL database
    ci.df_to_sql(df, "sqlite:///chem_data.db", "results", if_exists="replace")

    print(df)

3.2 Expected Output

      input_identifier   cid status  molecular_weight  xlogp       cas
    0          aspirin  2244     OK            180.16    1.2   50-78-2
    1         caffeine  2519     OK            194.19   -0.1   58-08-2
    2             1983  1983     OK            151.16    0.5  103-90-2

3.3 Command Line Usage

    # Fetch compound properties
    chemfetch aspirin --props "cas,molecular_weight,iupac_name"

    # Draw chemical structure
    chemdraw aspirin

Figure: Command line interface demonstration

4. Installation

ChemInformant is available on PyPI and can be installed using pip:

    # Basic installation
    pip install ChemInformant

    # With plotting capabilities
    pip install "ChemInformant[plot]"

    # Development installation
    pip install "ChemInformant[dev]"

5. Performance

The benchmark script (benchmark.py) compares ChemInformant with existing Python clients using 285 unique drug names across 6 properties, measuring retrieval speed in compounds per second for different caching scenarios (from JOSS publication).
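The compounds-per-second metric can be computed as in the sketch below. This does not reproduce benchmark.py; measure_throughput and the stand-in fetch function are hypothetical, and the clock is injectable so the demo uses fixed timestamps instead of real timing.

```python
import time


def measure_throughput(fetch, identifiers, clock=time.perf_counter):
    """Time fetch() over all identifiers and return (compounds_per_second, elapsed_seconds)."""
    start = clock()
    results = [fetch(identifier) for identifier in identifiers]
    elapsed = clock() - start
    rate = len(results) / elapsed if elapsed > 0 else float("inf")
    return rate, elapsed


# Demo with a fake clock that reports 0.0 s at the start and 2.0 s at the end.
fake_clock = iter([0.0, 2.0]).__next__
demo_rate, demo_elapsed = measure_throughput(str.upper, ["a"] * 10, clock=fake_clock)
print(demo_rate, demo_elapsed)
```

Running the same measurement once with an empty cache and once with a warm cache would give the cold and cached figures that a caching comparison needs.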

6. API Reference

6.1 API Structure Overview

ChemInformant provides two main API approaches: Convenience Functions for single compound queries and the get_properties API for batch/complex queries.

Figure: API structure. The ChemInformant API entry point leads to a query strategy selection with two branches:

• Single compound, quick access: Convenience Functions (22 specialized methods)
    • Basic properties: molecular_weight, formula, canonical_smiles, cas
    • Descriptors: exact_mass, tpsa, complexity, charge
    • Molecular counts: h_bond_counts, rotatable_bonds, heavy_atoms
    • Identifiers: inchi, inchi_key, stereo_counts
• Batch processing, complex queries: get_properties API (four operational modes)
    • Mode 1, CORE: get_properties(identifiers), 22 essential properties
    • Mode 2, Custom: get_properties(ids, properties), user-specified selection
    • Mode 3, Enhanced: get_properties(ids, include_3d=True), 36 properties + 3D data
    • Mode 4, Complete: get_properties(ids, all_properties=True), comprehensive dataset

6.3 get_properties API (Four Calling Modes)

The get_properties function supports four different calling modes for different use cases:

Table 2. get_properties API Modes

| Mode | Usage | Properties Count | Description |
|---|---|---|---|
| Mode 1: Default CORE | get_properties(identifiers) | 22 properties | Returns all CORE properties automatically |
| Mode 2: Custom Properties | get_properties(identifiers, properties) | User-specified | Specify exact properties needed |
| Mode 3: Include 3D | get_properties(identifiers, include_3d=True) | 36 properties | CORE + 3D computational properties |
| Mode 4: All Properties | get_properties(identifiers, all_properties=True) | 40+ properties | Every available PubChem property |

6.4 Property Categories Detail

6.4.1 CORE Properties (22)

CORE properties included in Mode 1:

• molecular_formula, molecular_weight
• canonical_smiles, isomeric_smiles
• iupac_name, cas, synonyms
• xlogp, tpsa, complexity, charge
• exact_mass, monoisotopic_mass
• h_bond_donor_count, h_bond_acceptor_count
• rotatable_bond_count, heavy_atom_count
• atom_stereo_count, bond_stereo_count
• covalent_unit_count
• inchi, inchi_key

6.4.2 3D Properties (14)

Additional 3D properties in Mode 3:

• volume_3d
• x_steric_quadrupole_3d, y_steric_quadrupole_3d, z_steric_quadrupole_3d
• feature_count_3d, feature_acceptor_count_3d
• feature_donor_count_3d, feature_anion_count_3d
• feature_cation_count_3d, feature_ring_count_3d
• feature_hydrophobe_count_3d
• conformer_model_rmsd_3d, effective_rotor_count_3d
• conformer_count_3d
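The way the four calling modes map onto property lists can be sketched as follows, using the CORE and 3D names listed above. select_properties is a hypothetical helper, not the library's code, and the precedence between arguments is an assumption. Because the full Mode 4 list is not given here, the sketch falls back to CORE + 3D unless a complete list is passed in.

```python
CORE_PROPERTIES = [
    "molecular_formula", "molecular_weight",
    "canonical_smiles", "isomeric_smiles",
    "iupac_name", "cas", "synonyms",
    "xlogp", "tpsa", "complexity", "charge",
    "exact_mass", "monoisotopic_mass",
    "h_bond_donor_count", "h_bond_acceptor_count",
    "rotatable_bond_count", "heavy_atom_count",
    "atom_stereo_count", "bond_stereo_count",
    "covalent_unit_count",
    "inchi", "inchi_key",
]

THREE_D_PROPERTIES = [
    "volume_3d",
    "x_steric_quadrupole_3d", "y_steric_quadrupole_3d", "z_steric_quadrupole_3d",
    "feature_count_3d", "feature_acceptor_count_3d",
    "feature_donor_count_3d", "feature_anion_count_3d",
    "feature_cation_count_3d", "feature_ring_count_3d",
    "feature_hydrophobe_count_3d",
    "conformer_model_rmsd_3d", "effective_rotor_count_3d",
    "conformer_count_3d",
]


def select_properties(properties=None, include_3d=False, all_properties=False,
                      available=None):
    """Pick the property list for a call (assumed precedence: explicit list first)."""
    if properties is not None:
        return list(properties)  # Mode 2: user-specified selection
    if all_properties:
        # Mode 4: 'available' is a placeholder for the complete property list.
        if available is not None:
            return list(available)
        return CORE_PROPERTIES + THREE_D_PROPERTIES
    if include_3d:
        return CORE_PROPERTIES + THREE_D_PROPERTIES  # Mode 3: CORE + 3D
    return list(CORE_PROPERTIES)  # Mode 1: default CORE set


print(len(select_properties()))
print(len(select_properties(include_3d=True)))
print(select_properties(["molecular_weight", "xlogp"]))
```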

6.5 Property Specification Formats

Mode 2 supports flexible property specification:

Table 3. Property Specification Formats

| Format | Example | Description |
|---|---|---|
| String format | 'molecular_weight,xlogp' | Comma-separated property names |
| List format | ['molecular_weight', 'xlogp'] | Python list of property names |
| Alias support | ['weight', 'logp', 'smiles'] | Shortened aliases for common properties |
| Mixed categories | ['molecular_weight', 'volume_3d'] | Combine CORE, 3D, and other properties |
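One plausible way to accept all of these formats is to normalize them to a single list before querying. The sketch below is an assumption about how that could work, not the library's implementation: normalize_properties is a hypothetical helper, and the alias targets (weight, logp, and smiles mapping to molecular_weight, xlogp, and canonical_smiles) are guesses based on the property names in this document.

```python
# Assumed alias mapping; the real alias table may differ.
ALIASES = {
    "weight": "molecular_weight",
    "logp": "xlogp",
    "smiles": "canonical_smiles",
}


def normalize_properties(spec, aliases=ALIASES):
    """Turn a comma-separated string or a list into a de-duplicated list of full names."""
    names = spec.split(",") if isinstance(spec, str) else list(spec)
    normalized = []
    for name in names:
        key = name.strip().lower()
        if not key:
            continue  # skip empty entries such as a trailing comma
        key = aliases.get(key, key)
        if key not in normalized:
            normalized.append(key)  # keep first-seen order
    return normalized


print(normalize_properties("molecular_weight, xlogp"))
print(normalize_properties(["weight", "logp", "smiles"]))
print(normalize_properties(["molecular_weight", "volume_3d", "weight"]))
```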

7. Resources

Citation

If you use ChemInformant in your research, please cite:

    @article{He2025,
      doi = {10.21105/joss.08341},
      url = {https://doi.org/10.21105/joss.08341},
      year = {2025},
      publisher = {The Open Journal},
      volume = {10},
      number = {112},
      pages = {8341},
      author = {He, Zhiang},
      title = {ChemInformant: A Robust and Workflow-Centric Python Client for High-Throughput PubChem Access},
      journal = {Journal of Open Source Software}
    }

License: MIT License