Pipeline Processing Reference
Overview
The OneCite pipeline is a 4-stage process that transforms raw references into formatted citations:
Validation - Check reference validity
Identification - Query data sources
Completion - Enrich with metadata
Formatting - Convert to output format
Pipeline Stages
Stage 1: Validation
Purpose: Ensure input is valid and can be processed
Input: Raw reference text
Output: Validated RawEntry object
Process:
Check for empty/null input
Validate format (txt or bib)
Detect reference type
Extract metadata hints
Error Handling:
Raises ValidationError if:
Input is empty
Format is unrecognized
Data is malformed
Required fields missing
Example:
from onecite import RawEntry
from onecite.pipeline import Validator
raw = RawEntry(content="10.1038/nature14539")
validator = Validator()
if validator.validate(raw):
    print("Valid reference")
else:
    print("Invalid reference")
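The checks listed under Process can be sketched in plain Python. This is only an illustration under stated assumptions, not OneCite's implementation: SketchValidationError, check_reference, and the BibTeX heuristic are hypothetical.

```python
class SketchValidationError(ValueError):
    """Hypothetical stand-in for a validation error; not the onecite class."""


def check_reference(text, input_format="txt"):
    """Apply the basic checks from the Process list to one raw reference."""
    # Check for empty/null input
    if text is None or not text.strip():
        raise SketchValidationError("Input is empty")
    # Validate format (txt or bib)
    if input_format not in ("txt", "bib"):
        raise SketchValidationError(f"Format is unrecognized: {input_format!r}")
    stripped = text.strip()
    # Assumed heuristic: a BibTeX entry starts with '@' and contains a brace
    if input_format == "bib" and not (stripped.startswith("@") and "{" in stripped):
        raise SketchValidationError("Data is malformed")
    return stripped
```

The function returns the cleaned text on success and raises on the first failed check, which mirrors the conditions listed under Error Handling.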
Stage 2: Identification
Purpose: Find matching citations in data sources
Input: Validated RawEntry
Output: List of IdentifiedEntry objects
Process:
Detect identifier type (DOI, arXiv, etc.)
Query appropriate data source
Parse results
Rank by relevance
Return candidates
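The first step, detecting the identifier type, might look like the sketch below. The regular expressions are simplified assumptions (real DOI and arXiv identifiers have more variants, such as old-style arXiv IDs), and detect_identifier_type is a hypothetical helper rather than a OneCite function.

```python
import re

# Assumed patterns: a DOI starts with "10.", a 4-9 digit registrant code, and a slash;
# a new-style arXiv ID looks like YYMM.NNNNN with an optional version suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_PATTERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def detect_identifier_type(text):
    """Return 'doi', 'arxiv', or 'text' for a raw reference string."""
    candidate = text.strip()
    if DOI_PATTERN.match(candidate):
        return "doi"
    if ARXIV_PATTERN.match(candidate):
        return "arxiv"
    # Anything else is treated as free text for keyword search
    return "text"
```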
Data Sources:
CrossRef (DOI-based)
Semantic Scholar (keyword search)
OpenAlex (academic graph)
PubMed (biomedical)
DBLP (computer science)
arXiv (preprints)
DataCite (datasets)
Zenodo (open research)
Google Books (books)
Intelligent Routing:
OneCite automatically selects the most suitable sources for each query:
Medical terms → PubMed priority
CS terms → DBLP/arXiv priority
DOI → CrossRef priority
Mixed → Semantic Scholar
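The routing rules above can be illustrated with a small keyword-based sketch. The keyword sets, the route_sources helper, and the source name strings other than "pubmed" and "crossref" are assumptions made for illustration; OneCite's actual routing logic is not shown in this reference.

```python
# Hypothetical keyword sets; real routing would need far richer signals.
MEDICAL_TERMS = {"clinical", "patient", "disease", "therapy", "cancer"}
CS_TERMS = {"algorithm", "neural", "compiler", "learning", "network"}


def route_sources(query):
    """Return data source names in priority order for a query string."""
    text = query.strip().lower()
    if text.startswith("10.") and "/" in text:
        return ["crossref"]  # DOI -> CrossRef priority
    words = set(text.replace(",", " ").split())
    is_medical = bool(words & MEDICAL_TERMS)
    is_cs = bool(words & CS_TERMS)
    if is_medical and not is_cs:
        return ["pubmed", "semantic_scholar"]
    if is_cs and not is_medical:
        return ["dblp", "arxiv", "semantic_scholar"]
    return ["semantic_scholar"]  # mixed or no clear signal
```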
Example:
from onecite.pipeline import Identifier
from onecite import RawEntry
identifier = Identifier()
raw = RawEntry(content="10.1038/nature14539")
matches = identifier.identify(raw)
for match in matches:
    print(f"{match.title} ({match.year})")
Stage 3: Completion
Purpose: Enrich entries with complete metadata
Input: IdentifiedEntry (often incomplete)
Output: CompletedEntry (fully enriched)
Process:
Query additional data sources
Fill missing fields
Normalize author names
Verify publication details
Calculate completeness score
Fields Enriched:
Authors
Title
Journal/Publisher
Year
Volume/Issue
Pages
DOI/URL
Keywords
Abstract
Completeness Scoring:
A score between 0 and 1 that indicates how complete an entry's metadata is:
0.9-1.0: Excellent (all fields present)
0.7-0.9: Good (most fields present)
0.5-0.7: Fair (core fields present)
< 0.5: Poor (incomplete)
Example:
from onecite.pipeline import Completer
from onecite import IdentifiedEntry
completer = Completer()
identified = IdentifiedEntry(...)
completed = completer.complete(identified)
print(f"Completeness: {completed.completeness_score}")
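One plausible way to arrive at such a score is the fraction of tracked fields that are filled in. The sketch below assumes equal weighting, a dictionary-shaped entry, and inclusive lower bounds for each band; none of these are confirmed by this reference, and both helper names are hypothetical.

```python
# Field names mirror the "Fields Enriched" list; equal weighting is an assumption.
TRACKED_FIELDS = (
    "authors", "title", "journal", "year", "volume",
    "pages", "doi", "keywords", "abstract",
)


def completeness_score(entry):
    """Fraction of tracked fields that hold a non-empty value."""
    present = sum(1 for name in TRACKED_FIELDS if entry.get(name))
    return present / len(TRACKED_FIELDS)


def completeness_label(score):
    """Map a 0-1 score onto the bands from Completeness Scoring."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"
```

With nine tracked fields, an entry that has only authors, title, journal, year, and DOI scores 5/9 and falls in the Fair band.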
Stage 4: Formatting
Purpose: Convert to output format
Input: CompletedEntry
Output: Formatted string
Supported Formats:
BibTeX
APA
MLA
Custom (via templates)
Process:
Load template for format
Map fields to template variables
Apply formatting rules
Handle special characters
Return formatted string
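The steps above can be sketched with the standard library's string.Template. The template text, the escape table, and the dictionary-shaped entry are assumptions for illustration; OneCite's real templates and entry classes may differ. The sample data is made up.

```python
from string import Template

# Hypothetical template; OneCite's real templates may look quite different.
BIBTEX_ARTICLE = Template(
    "@article{$key,\n"
    "  author = {$author},\n"
    "  title = {$title},\n"
    "  journal = {$journal},\n"
    "  year = {$year}\n"
    "}"
)

# Characters that need escaping in BibTeX field values (a partial list).
SPECIAL_CHARS = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_"}


def escape_bibtex(value):
    """Escape special characters in a single field value."""
    return "".join(SPECIAL_CHARS.get(ch, ch) for ch in str(value))


def format_bibtex(entry):
    """Map entry fields to template variables and render the result."""
    fields = {
        "key": entry["key"],
        "author": " and ".join(escape_bibtex(a) for a in entry["authors"]),
        "title": escape_bibtex(entry["title"]),
        "journal": escape_bibtex(entry["journal"]),
        "year": entry["year"],
    }
    return BIBTEX_ARTICLE.substitute(fields)


sample = {
    "key": "doe2020",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "Cats & dogs",
    "journal": "Example Journal",
    "year": 2020,
}
print(format_bibtex(sample))
```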
Example:
from onecite.pipeline import Formatter
from onecite import CompletedEntry
formatter = Formatter()
completed = CompletedEntry(...)
# BibTeX output
bibtex = formatter.format(completed, "bibtex")
# APA output
apa = formatter.format(completed, "apa")
Complete Pipeline
The PipelineController orchestrates all stages:
from onecite import PipelineController
controller = PipelineController()
result = controller.process(
    entries=["10.1038/nature14539"],
    output_format="bibtex"
)
Process:
Validate input
For each entry:
- Identify sources
- Select best match
- Complete entry
- Format output
Aggregate results
Return summary
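The orchestration loop can be sketched as follows. This is not the PipelineController source: run_pipeline is a hypothetical function, the stage callables are supplied by the caller, and the "results" key is an assumption. The processed_count, failed_count, and warnings keys follow the result dictionaries used in the examples elsewhere in this reference.

```python
def run_pipeline(entries, identify, complete, render):
    """Run each entry through the stages and collect a summary.

    identify, complete, and render are caller-supplied callables standing in
    for stages 2 to 4; they are placeholders, not OneCite functions.
    """
    results, warnings = [], []
    for entry in entries:
        # Stage 1: skip entries that are empty after stripping whitespace
        if not entry or not entry.strip():
            warnings.append("Skipped empty entry")
            continue
        candidates = identify(entry)
        if not candidates:
            warnings.append(f"No match for {entry!r}")
            continue
        best = candidates[0]  # assumes candidates arrive ranked by relevance
        results.append(render(complete(best)))
    return {
        "results": results,
        "processed_count": len(results),
        "failed_count": len(entries) - len(results),
        "warnings": warnings,
    }
```

A failed entry is recorded as a warning instead of aborting the run, so one bad reference does not prevent the rest of the batch from being processed.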
Advanced Pipeline Usage
Custom Data Processing
from onecite.pipeline import (
    Validator,
    Identifier,
    Completer,
    Formatter
)
from onecite import RawEntry, ResolverError, ValidationError
# Create components
validator = Validator()
identifier = Identifier()
completer = Completer()
formatter = Formatter()
# Manual pipeline
raw = RawEntry(content="10.1038/nature14539")
# Stage 1
if not validator.validate(raw):
    raise ValidationError("Invalid reference")
# Stage 2
matches = identifier.identify(raw)
if not matches:
    raise ResolverError("No matches found")
# Stage 3
identified = matches[0]
completed = completer.complete(identified)
# Stage 4
formatted = formatter.format(completed, "bibtex")
print(formatted)
Batch Processing
from onecite import PipelineController
controller = PipelineController()
references = [
    "10.1038/nature14539",
    "1706.03762",
    "Smith (2020) Machine Learning"
]
result = controller.process(
    entries=references,
    output_format="bibtex"
)
print(f"Processed: {result['processed_count']}")
print(f"Failed: {result['failed_count']}")
Performance Optimization
Single Reference:
from onecite import process_references
# Fast path for single reference
result = process_references("10.1038/nature14539")
Batch References:
# Use --quiet flag for better performance
onecite process refs.txt --quiet -o output.bib
Large Batches:
# Split into chunks
split -l 100 large_file.txt chunk_
for chunk in chunk_*; do
    onecite process "$chunk" -o "${chunk}.bib" --quiet
done
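The same chunking idea can be applied in Python before handing references to the pipeline. A minimal sketch; chunked is a hypothetical helper and not part of OneCite.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk could then be passed as the entries argument of controller.process, as in the Batch Processing example.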
Error Handling in Pipeline
Validation Errors
from onecite import ValidationError, process_references
try:
    result = process_references("")
except ValidationError:
    print("Empty input")
Resolution Errors
from onecite import ResolverError, process_references
try:
    result = process_references("invalid/doi")
except ResolverError:
    print("Could not find reference")
    print("Check identifier or try again later")
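When a failure may be temporary, "try again later" can be automated with a retry loop and exponential backoff. The sketch below is generic and assumes nothing about OneCite: retry is a hypothetical helper, and the caller chooses which exception types count as retryable.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try
            sleep(base_delay * 2 ** attempt)
```

A lookup could be wrapped as retry(lambda: process_references(ref), retry_on=(ResolverError,)), with the caveat that retrying only helps when the failure is transient and not caused by an invalid identifier.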
Partial Success
from onecite import process_references
# mixed_refs holds a mix of references, some of which may fail to resolve
result = process_references(mixed_refs)
print(f"Success: {result['processed_count']}")
print(f"Failed: {result['failed_count']}")
if result['warnings']:
    for warning in result['warnings']:
        print(f"Warning: {warning}")
Pipeline Configuration
Custom Templates
from onecite import PipelineController
controller = PipelineController()
controller.add_template_path("./my_templates")
result = controller.process(
    entries=["10.1038/nature14539"],
    output_format="my_format"
)
Data Source Priority
from onecite.pipeline import Identifier
identifier = Identifier()
# Set priority for specific query types
identifier.set_source_priority(
    query_type="biomedical",
    sources=["pubmed", "crossref", "openalex"]
)
Timeout Configuration
from onecite import PipelineController
controller = PipelineController()
controller.set_timeout(10) # 10 seconds per query
Next Steps
See Python API Reference for usage examples
Check Core API Reference for class reference
Review Advanced Usage for complex scenarios