Pipeline Processing Reference
Overview
The OneCite pipeline is a 4-stage process that transforms raw references into formatted BibTeX:
Parse — read the raw input and produce RawEntry objects
Identify — look up each entry in external APIs and fill in a DOI / basic metadata
Enrich — fetch full metadata for the identified entries
Format — render the completed entries as BibTeX
The implementation lives in the onecite/pipeline/ package with one
module per stage. For backwards compatibility, all public symbols remain
importable from onecite.pipeline:
from onecite.pipeline import (
ParserModule,
IdentifierModule,
EnricherModule,
FormatterModule,
)
Package Layout
onecite/pipeline/
__init__.py - re-exports + requests at package level
_utils.py - _safe_year helper
parser.py - ParserModule
identifier.py - IdentifierModule
enricher.py - EnricherModule
formatter.py - FormatterModule
Pipeline Stages
Stage 1: Parse (ParserModule)
Purpose: split the input into one RawEntry per reference.
Input: the raw input_content string and an input_type
("txt" or "bib").
Output: List[RawEntry].
ParserModule.parse(input_content, input_type) dispatches to
_parse_bibtex or _parse_text. The text parser splits on blank
lines (one reference per block), extracts any DOI or URL found in the
block, and builds a query_string for later identification when no
identifier is present.
from onecite.pipeline import ParserModule
parser = ParserModule()
entries = parser.parse("10.1038/nature14539\n\n1706.03762", "txt")
# [{'id': 0, 'raw_text': '10.1038/nature14539', 'doi': '10.1038/nature14539', ...},
# {'id': 1, 'raw_text': '1706.03762', ...}]
ParseError is raised when the input type is unsupported or BibTeX
parsing fails.
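The blank-line splitting and DOI extraction described above can be sketched in a few lines. This is an illustrative sketch only, not the module's actual code; the field names mirror the RawEntry dicts shown in the example above, and the DOI regex is a common simplification:

```python
import re

# Simplified DOI pattern: prefix "10.", a 4-9 digit registrant, a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

def parse_text_sketch(input_content: str):
    """Split on blank lines (one reference per block), pull out a DOI
    when present, otherwise keep a query_string for identification."""
    entries = []
    for block in input_content.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        m = DOI_RE.search(block)
        entry = {"id": len(entries), "raw_text": block,
                 "doi": m.group(0) if m else None}
        if entry["doi"] is None:
            entry["query_string"] = block  # fed to the identify stage
        entries.append(entry)
    return entries
```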
Stage 2: Identify (IdentifierModule)
Purpose: resolve each RawEntry against academic data sources and
produce an IdentifiedEntry with a DOI (when possible) plus basic
metadata.
Input: List[RawEntry] and an interactive_callback that picks
from candidate lists when confidence is medium.
Output: List[IdentifiedEntry].
Data sources actually queried by the code:
CrossRef (DOI-based and fuzzy search)
Semantic Scholar (keyword search)
arXiv (via feedparser)
PubMed (biomedical, queried when strong cues are present)
DataCite / Zenodo (datasets)
Google Books (books — triggered by ISBN or publisher cues)
OpenAIRE / BASE (theses)
GitHub (software repositories)
Google Scholar (optional, disabled by default; opt-in via
--google-scholar or use_google_scholar=True; requires the scholarly package)
There is no runtime routing based on filename and no fixed priority
for “medical”, “CS” or “general” queries. Signal-based heuristics
inside _fuzzy_search decide when to additionally query PubMed,
Google Books, OpenAIRE/BASE, etc., but CrossRef and Semantic Scholar are
always consulted for text queries.
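The signal-based routing can be pictured as a small cue-matching function. This is a hypothetical sketch, not the real _fuzzy_search internals: the cue patterns below are invented for illustration, and only the overall shape (two always-on sources plus conditional extras) comes from the description above:

```python
import re

def pick_sources(query: str):
    """Return the data sources to consult for a free-text query:
    CrossRef and Semantic Scholar always, others only on cues."""
    q = query.lower()
    sources = ["crossref", "semantic_scholar"]  # always queried
    if re.search(r"\bpmid\b|clinical|randomi[sz]ed trial", q):
        sources.append("pubmed")                # biomedical cues
    if re.search(r"\bisbn\b|university press", q):
        sources.append("google_books")          # book cues
    if re.search(r"\bthesis\b|dissertation", q):
        sources.extend(["openaire", "base"])    # thesis cues
    if "github.com" in q:
        sources.append("github")                # software cues
    return sources
```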
Confidence model:
After all sources have returned candidates, _score_candidates assigns
each candidate a match_score (0–100) based on title / author /
year / venue similarity to the query. The decision logic in
_fuzzy_search then chooses one of three paths:
match_score >= 80 and a clear best candidate → auto-adopt
70 <= match_score < 80 → call the interactive_callback with up to 5 candidates; fall back to the top candidate if the user skips and the score is still ≥ 75
match_score >= 50 and a title is present → adopt cautiously
otherwise → mark the entry as identification_failed
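One reading of that decision ladder, as a standalone sketch (the function name and argument shapes are invented; only the thresholds come from the description above):

```python
def decide(match_score, clear_best, has_title, user_choice=None):
    """Map a candidate's match_score (0-100) to an outcome."""
    if match_score >= 80 and clear_best:
        return "auto_adopt"
    if 70 <= match_score < 80:
        if user_choice is not None:
            return "adopt_user_choice"       # interactive pick
        if match_score >= 75:
            return "adopt_top"               # user skipped, score still high
        return "identification_failed"
    if match_score >= 50 and has_title:
        return "adopt_cautiously"
    return "identification_failed"
```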
Fallback paths never fabricate data: an entry that cannot be resolved is
marked identification_failed rather than filled with invented
metadata.
from onecite.pipeline import IdentifierModule
identifier = IdentifierModule(use_google_scholar=False)
identified = identifier.identify(entries, interactive_callback=lambda c: 0)
Stage 3: Enrich (EnricherModule)
Purpose: take each IdentifiedEntry and produce a
CompletedEntry with the BibTeX fields the selected template
requires.
Input: List[IdentifiedEntry] and the loaded template.
Output: List[CompletedEntry].
Fields typically filled in:
author, title, journal / booktitle, year
volume, number, pages, publisher
doi, url, arxiv / arxiv_id
abstract — returned directly by CrossRef or Semantic Scholar when the identification stage resolved the entry through them; otherwise filled in by a post-hoc DOI-only cascade described below.
The _get_crossref_metadata method requests each DOI with a proper
User-Agent header and a mailto query-string parameter, per
CrossRef’s etiquette (fixes #21).
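The etiquette amounts to building the request like this. A minimal sketch, assuming a placeholder contact address (the real module's User-Agent string and address will differ); the URL shape and the mailto convention are CrossRef's documented "polite pool" usage:

```python
def crossref_request_parts(doi, mailto="maintainer@example.org"):
    """Build the URL, headers, and query params for a polite
    CrossRef works lookup: descriptive User-Agent + mailto param."""
    url = f"https://api.crossref.org/works/{doi}"
    headers = {"User-Agent": f"onecite (mailto:{mailto})"}
    params = {"mailto": mailto}
    return url, headers, params
```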
_complete_fields intentionally performs only one kind of
completion: abstract back-fill, through a DOI-only cascade:
Semantic Scholar (/paper/DOI:{doi}?fields=abstract)
↓ (empty or 4xx)
PubMed ESearch (DOI → PMID) + EFetch (PMID → abstract)
The cascade is gated by allow_abstract_fallback and is only invoked
when the caller’s raw input carried a DOI; DOIs inferred by fuzzy
search never trigger it, so a possibly-wrong candidate does not cost
extra roundtrips. Title-based fallback is intentionally not used
anywhere on this path — in testing it silently returned the abstract
of an unrelated paper for at least one DOI
(10.1007/s10462-019-09792-7), which is strictly worse than
returning None for downstream semantic cross-checks.
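The cascade's control flow can be shown without real HTTP calls by injecting the two fetchers. A sketch only (the function and parameter names are invented): the real enricher talks to Semantic Scholar's /paper/DOI:{doi}?fields=abstract endpoint and PubMed's ESearch/EFetch pair, as described above:

```python
def backfill_abstract(doi, fetch_s2, fetch_pubmed):
    """DOI-only abstract cascade: Semantic Scholar first, PubMed only
    when the first hop returns nothing. Never fabricates a value;
    returns None when both hops come back empty."""
    abstract = fetch_s2(doi)       # first hop: Semantic Scholar
    if abstract:
        return abstract
    return fetch_pubmed(doi)       # second hop: only on empty / 4xx
```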
Wider template-driven field completion from external scrapers (the
Google Scholar path flagged in review #29) was removed in 0.1.0 and is
not being reintroduced here. The template still controls which
entry_type the formatter falls back to when classification is
ambiguous, and continues to determine the declared field set; as of
this release, the default journal_article_full template lists
abstract as an optional field so its declaration matches what the
enricher actually emits.
The legacy kwarg name allow_pubmed_fallback is retained as a
deprecated alias for one release cycle and emits
DeprecationWarning when used — its replacement
allow_abstract_fallback reflects that the flag gates the full
Semantic-Scholar + PubMed cascade, not just PubMed.
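The alias handling follows the usual pattern for retiring a kwarg over one release cycle. This helper is illustrative, not the actual EnricherModule signature:

```python
import warnings

def resolve_fallback_flag(allow_abstract_fallback=None,
                          allow_pubmed_fallback=None):
    """Honour the deprecated alias, warning when it is used, and let
    the new name win if both are passed."""
    if allow_pubmed_fallback is not None:
        warnings.warn(
            "allow_pubmed_fallback is deprecated; "
            "use allow_abstract_fallback instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if allow_abstract_fallback is None:
            allow_abstract_fallback = allow_pubmed_fallback
    return bool(allow_abstract_fallback)
```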
Stage 4: Format (FormatterModule)
Purpose: render each CompletedEntry as a BibTeX string.
Input: List[CompletedEntry] and an output_format.
Output: a dict with results (list of formatted strings) and
report (total / succeeded / failed_entries).
Only "bibtex" is accepted; passing any other value raises
FormatError. The previous APA and MLA renderers were removed in
response to issues #31 and #32; for APA / MLA output, post-process the
BibTeX file with pandoc or citeproc-py.
Rendering uses bibtexparser (bibtexparser.dumps) so the
output complies with the BibTeX grammar; LaTeX-special characters in
author, title, journal, publisher, etc. are escaped
unless the field already contains explicit LaTeX commands
(e.g. K{\\"u}nsch).
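The escape-unless-already-LaTeX rule can be sketched as follows. This is a simplified illustration, not the formatter's actual escaping code; the special-character table is abridged and the "already contains LaTeX" check is a heuristic:

```python
import re

# Abridged table of LaTeX-special characters and their escapes.
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$",
                  "#": r"\#", "_": r"\_"}

def escape_field(value: str) -> str:
    """Escape LaTeX specials unless the value already carries explicit
    LaTeX commands (e.g. K{\\"u}nsch), which must pass through intact."""
    if re.search(r"\\[a-zA-Z]|\{\\", value):
        return value  # looks like hand-written LaTeX; leave untouched
    for ch, repl in LATEX_SPECIALS.items():
        value = value.replace(ch, repl)
    return value
```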
Complete Pipeline
Most callers never touch the individual modules and instead use the
high-level process_references function:
from onecite import process_references
result = process_references(
input_content="10.1038/nature14539",
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
interactive_callback=lambda candidates: 0, # auto-pick first
)
print('\n\n'.join(result['results']))
print(result['report'])
Under the hood this creates a PipelineController and calls its
process method, which runs all four stages in order.
Running Stages Manually
For advanced uses (e.g. unit-testing a single stage) you can drive the modules directly:
from onecite import TemplateLoader
from onecite.pipeline import (
ParserModule,
IdentifierModule,
EnricherModule,
FormatterModule,
)
template = TemplateLoader().load_template("journal_article_full")
parser = ParserModule()
identifier = IdentifierModule(use_google_scholar=False)
enricher = EnricherModule(use_google_scholar=False)
formatter = FormatterModule()
raw = parser.parse("10.1038/nature14539", "txt")
identified = identifier.identify(raw, interactive_callback=lambda c: 0)
completed = enricher.enrich(identified, template, raw)
result = formatter.format(completed, "bibtex")
print(result['results'])
Error Handling
All pipeline errors inherit from OneCiteError:
ValidationError — empty / malformed input
ParseError — ParserModule could not split the input
ResolverError — raised by helpers when a data source cannot resolve an identifier; generally caught internally and recorded as identification_failed on the entry instead of propagating
FormatError — the requested output_format is not "bibtex"
from onecite import process_references, ValidationError, FormatError
try:
result = process_references(
input_content="",
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
interactive_callback=lambda c: 0,
)
except ValidationError:
print("Empty input")
try:
process_references(
input_content="10.1038/nature14539",
input_type="txt",
template_name="journal_article_full",
output_format="apa", # no longer supported
interactive_callback=lambda c: 0,
)
except FormatError as exc:
print(exc)
Testing Hooks
Because onecite/pipeline/__init__.py imports requests at the
package level, tests that mock the network can continue to use the
original patch target:
from unittest.mock import patch
with patch("onecite.pipeline.requests.get", side_effect=fake_get):
...
For mocking the optional scholarly dependency, patch the concrete
submodule attribute instead — scholarly is imported inside
identifier.py and enricher.py:
import onecite.pipeline.identifier as identifier_mod
with patch.object(identifier_mod, "scholarly", fake_scholarly):
...
Next Steps
See Python API Reference for usage examples
Check Core API Reference for the data-class and public-function reference
Review Advanced Usage for real-world workflows