The PubChem database, maintained by NCBI, contains information on over 100 million chemical compounds and serves as a critical resource for chemical and biological research. However, programmatic access to this vast repository presents several challenges for researchers developing automated workflows.
Existing Python clients often suffer from unreliable network handling, inefficient batch processing, and weak error handling. Many lack built-in request throttling, retries, or persistent caching, so users must write boilerplate to handle network errors and avoid redundant requests. Maintenance is also a concern: some popular clients have not had a formal release for several years.
Batch processing is often inefficient in existing tools. Workflows with mixed-type identifiers (e.g., names and CIDs) require manual pre-processing. Furthermore, a single invalid identifier in a large batch can cause an entire query to fail without clear error reporting, hindering data acquisition pipelines.
ChemInformant addresses these limitations by providing a robust, workflow-centric Python client that abstracts the complexity of PubChem API interactions while ensuring data integrity and optimal performance. The library transforms multi-step data acquisition tasks into single, elegant function calls, enabling researchers to focus on analysis rather than data retrieval mechanics.
- The core API returns either a clean Pandas DataFrame or direct SQL output, eliminating data wrangling boilerplate and enabling immediate integration with both the Python data science ecosystem and modern database workflows.
- Built-in persistent caching, smart rate-limiting, and automatic retries keep workflows running reliably. API pagination for large-scale queries is handled transparently, delivering complete result sets without manual intervention.
- Mixed lists of identifiers (names, CIDs, SMILES) are accepted natively. Invalid inputs are flagged with a clear status in the output, so a single bad entry never fails an entire batch operation.
- The object-based API uses Pydantic v2 models for rigorous runtime data validation, preventing malformed or unexpected data from corrupting analysis pipelines.
- The chemfetch and chemdraw command-line tools provide rapid data retrieval and 2D structure visualization directly from the terminal, which suits quick lookups without writing scripts.
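
The per-item status handling for mixed identifier lists can be illustrated with a small, self-contained sketch. This is not ChemInformant's implementation: `classify`, `resolve_batch`, the status strings, and the `FAKE_NAMES` lookup table are all hypothetical names chosen for illustration, and the sketch does not try to tell SMILES strings apart from names.

```python
from typing import Callable, Optional, Union

Identifier = Union[int, str]

# Stand-in for a remote name lookup (hypothetical; a real client would query PubChem).
FAKE_NAMES = {"aspirin": 2244, "caffeine": 2519}


def classify(identifier: Identifier) -> str:
    """Guess the identifier kind: 'cid', 'name', or 'invalid'."""
    if isinstance(identifier, bool):
        return "invalid"
    if isinstance(identifier, int):
        return "cid" if identifier > 0 else "invalid"
    if isinstance(identifier, str):
        text = identifier.strip()
        if not text:
            return "invalid"
        return "cid" if text.isdigit() else "name"
    return "invalid"


def resolve_batch(
    identifiers: list,
    name_to_cid: Callable[[str], Optional[int]],
) -> list:
    """Resolve every input independently so one bad entry cannot fail the batch."""
    rows = []
    for raw in identifiers:
        row = {"input_identifier": raw, "cid": None, "status": "OK"}
        kind = classify(raw)
        try:
            if kind == "cid":
                row["cid"] = int(raw)
            elif kind == "name":
                row["cid"] = name_to_cid(raw.strip())
                if row["cid"] is None:
                    row["status"] = "NotFound"
            else:
                row["status"] = "InvalidInput"
        except Exception as exc:  # record the failure instead of aborting the batch
            row["status"] = f"Error: {type(exc).__name__}"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in resolve_batch(["aspirin", 2519, "not-a-compound", ""], FAKE_NAMES.get):
        print(row)
```

Each input yields exactly one output row, so the result can be joined back to the original list and failed entries can be filtered by their status value.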
ChemInformant addresses critical gaps in the current landscape of chemical information clients. The following comparison highlights key advantages over widely-used alternatives:
| Key Feature | ChemInformant | PubChemPy | PubChemR | webchem | ChemSpiPy |
|---|---|---|---|---|---|
| Platform | Python | Python | R | R | Python |
| Primary Database | PubChem | PubChem | PubChem | Multi-DB | ChemSpider |
| Persistent Caching | Yes | No | No | No | No |
| Rate-Limiting & Retries | Yes | No | No | Partial | No |
| Batch Retrieval | Yes | Partial | Partial | Partial | Partial |
| Mixed Identifier Support | Yes | No | No | No | No |
| Fault Tolerance | Yes | No | No | No | No |
| Runtime Type Safety | Yes | No | Partial | No | No |
| Project Activity | Active | Inactive | Active | Active | Inactive |
Notes: Persistent Caching stores results locally to accelerate repeated queries. Rate-Limiting & Retries manages API request limits and server errors for robust automation. Fault Tolerance reports status per-item in batch queries, avoiding complete failure on single errors.
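
To make the caching and retry notes concrete, here is a minimal sketch of a fetch wrapper that stores results in SQLite and retries transient failures with exponential backoff. It is an illustration under stated assumptions, not the library's internals: `CachedFetcher`, its parameters, and the choice of `ConnectionError` as the retryable error are all hypothetical, and rate limiting is left out for brevity.

```python
import json
import sqlite3
import time
from typing import Callable


class CachedFetcher:
    """Wrap a fetch function with an SQLite cache and retries with backoff."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        db_path: str = ":memory:",  # pass a file path to persist across runs
        max_retries: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.fetch = fetch
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str) -> dict:
        # Serve repeated queries from the local cache without calling fetch.
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])

        # Retry transient failures, doubling the delay after each attempt.
        for attempt in range(self.max_retries + 1):
            try:
                value = self.fetch(key)
                break
            except ConnectionError:
                if attempt == self.max_retries:
                    raise
                self.sleep(self.base_delay * 2 ** attempt)

        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()
        return value
```

Injecting `sleep` keeps the wrapper easy to test: a test can pass a recording function instead of waiting in real time.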
Command line interface demonstration
ChemInformant is available on PyPI and can be installed using pip (`pip install ChemInformant`).
The benchmark script (benchmark.py) compares ChemInformant with existing Python clients using 285 unique drug names across 6 properties, measuring retrieval speed in compounds per second for different caching scenarios (from JOSS publication).
ChemInformant provides two main API approaches: convenience functions for single-compound queries, and the get_properties API for batch or complex queries.
The get_properties function supports four calling modes for different use cases:
| Mode | Usage | Properties Count | Description |
|---|---|---|---|
| Mode 1: Default CORE | get_properties(identifiers) | 22 properties | Returns all CORE properties automatically |
| Mode 2: Custom Properties | get_properties(identifiers, properties) | User-specified | Specify exact properties needed |
| Mode 3: Include 3D | get_properties(identifiers, include_3d=True) | 36 properties | CORE + 3D computational properties |
| Mode 4: All Properties | get_properties(identifiers, all_properties=True) | 40+ properties | Every available PubChem property |
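
The four modes in the table can be exercised side by side with a small helper. This is a hedged sketch: `run_all_modes` is a hypothetical name, the function takes `get_properties` as an argument so it stays runnable without network access, and the two property names in the custom call are just the examples used elsewhere in this document.

```python
from typing import Any, Callable, Sequence


def run_all_modes(
    get_properties: Callable[..., Any],
    identifiers: Sequence[Any],
) -> dict:
    """Call get_properties once per calling mode and collect the results."""
    return {
        # Mode 1: default CORE properties
        "core": get_properties(identifiers),
        # Mode 2: user-specified properties
        "custom": get_properties(identifiers, ["molecular_weight", "xlogp"]),
        # Mode 3: CORE plus 3D computational properties
        "with_3d": get_properties(identifiers, include_3d=True),
        # Mode 4: every available property
        "all": get_properties(identifiers, all_properties=True),
    }
```

With the package installed, and assuming the function is exposed at the package top level, this could be called as `run_all_modes(ChemInformant.get_properties, ["aspirin", "caffeine"])`.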
Mode 2 supports flexible property specification:
| Format | Example | Description |
|---|---|---|
| String format | 'molecular_weight,xlogp' | Comma-separated property names |
| List format | ['molecular_weight', 'xlogp'] | Python list of property names |
| Alias support | ['weight', 'logp', 'smiles'] | Shortened aliases for common properties |
| Mixed categories | ['molecular_weight', 'volume_3d'] | Combine CORE, 3D, and other properties |
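
One way to picture how the accepted formats reduce to a single list of property names is the sketch below. It is illustrative only: `normalize_properties` and the `ALIASES` table are hypothetical, and the two alias mappings are assumptions inferred from the examples in the table rather than the library's actual alias list.

```python
from typing import Iterable, Union

# Hypothetical alias table: these mappings are assumed for illustration.
ALIASES = {
    "weight": "molecular_weight",
    "logp": "xlogp",
}


def normalize_properties(spec: Union[str, Iterable[str]]) -> list:
    """Turn a comma-separated string or an iterable into a clean list of names."""
    names = spec.split(",") if isinstance(spec, str) else list(spec)
    result = []
    for name in names:
        key = name.strip().lower()
        if not key:
            continue  # ignore empty entries such as a trailing comma
        key = ALIASES.get(key, key)  # expand known aliases, keep others as-is
        if key not in result:  # drop duplicates while preserving order
            result.append(key)
    return result


if __name__ == "__main__":
    print(normalize_properties("molecular_weight, xlogp"))
    print(normalize_properties(["weight", "logp", "smiles"]))
```

Normalizing up front means the string and list formats, and any aliases, all reach the request layer in the same canonical form.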
If you use ChemInformant in your research, please cite:
License: MIT License