Pipeline Processing Reference

Overview

The OneCite pipeline is a 4-stage process that transforms raw references into formatted citations:

  1. Validation - Check reference validity

  2. Identification - Query data sources

  3. Completion - Enrich with metadata

  4. Formatting - Convert to output format

Pipeline Stages

Stage 1: Validation

Purpose: Ensure input is valid and can be processed

Input: Raw reference text

Output: Validated RawEntry object

Process:

  1. Check for empty/null input

  2. Validate format (txt or bib)

  3. Detect reference type

  4. Extract metadata hints

Error Handling:

Raises ValidationError if:

  • Input is empty

  • Format is unrecognized

  • Data is malformed

  • Required fields are missing
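
The checks above can be sketched in plain Python. This is a minimal illustration, not OneCite's implementation: `check_reference` is a hypothetical helper, the local `ValidationError` class only stands in for the library's exception, and the "leading @" rule for bib input is an assumed example of a malformedness test.

```python
class ValidationError(ValueError):
    """Local stand-in for the library's ValidationError (hypothetical)."""


def check_reference(text, fmt="txt"):
    """Return the stripped reference text, or raise ValidationError."""
    # Step 1: reject empty or null input.
    if text is None or not text.strip():
        raise ValidationError("input is empty")
    # Step 2: only the two documented formats are accepted.
    if fmt not in ("txt", "bib"):
        raise ValidationError(f"unrecognized format: {fmt!r}")
    stripped = text.strip()
    # Assumed malformedness check: a BibTeX entry starts with '@'.
    if fmt == "bib" and not stripped.startswith("@"):
        raise ValidationError("malformed BibTeX entry: expected a leading '@'")
    return stripped
```

Raising on the first failed check keeps the error message specific to one cause, which is easier to report back to the user than a generic "invalid" result.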

Example:

from onecite import RawEntry
from onecite.pipeline import Validator

raw = RawEntry(content="10.1038/nature14539")
validator = Validator()

if validator.validate(raw):
    print("Valid reference")
else:
    print("Invalid reference")

Stage 2: Identification

Purpose: Find matching citations in data sources

Input: Validated RawEntry

Output: List of IdentifiedEntry objects

Process:

  1. Detect identifier type (DOI, arXiv, etc.)

  2. Query appropriate data source

  3. Parse results

  4. Rank by relevance

  5. Return candidates
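
Step 1 (identifier detection) could look roughly like the sketch below. The regular expressions and the `detect_identifier_type` name are assumptions made for illustration; they cover the common modern DOI and arXiv forms only and are not taken from OneCite.

```python
import re

# Hypothetical patterns; real identifiers have more variants than these cover.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def detect_identifier_type(text):
    """Classify a reference string as 'doi', 'arxiv', or free 'text'."""
    candidate = text.strip()
    if DOI_RE.match(candidate):
        return "doi"
    if ARXIV_RE.match(candidate):
        return "arxiv"
    # Anything else is treated as free text and needs a keyword search.
    return "text"
```

With these patterns, "10.1038/nature14539" is classified as a DOI, "1706.03762" as an arXiv ID, and "Smith (2020) Machine Learning" as free text.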

Data Sources:

  • CrossRef (DOI-based)

  • Semantic Scholar (keyword search)

  • OpenAlex (academic graph)

  • PubMed (biomedical)

  • DBLP (computer science)

  • arXiv (preprints)

  • DataCite (datasets)

  • Zenodo (open research)

  • Google Books (books)

Intelligent Routing:

OneCite automatically selects the best sources for each query:

  • Medical terms → PubMed priority

  • CS terms → DBLP/arXiv priority

  • DOI → CrossRef priority

  • Mixed → Semantic Scholar
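
The routing rules above can be approximated with a small keyword-based function. Everything in this sketch is illustrative: the keyword sets, the `route_sources` name, and the source labels other than "pubmed" and "crossref" are assumptions, and OneCite's actual term detection is not documented here.

```python
# Hypothetical keyword sets used only to illustrate the routing idea.
MEDICAL_TERMS = {"clinical", "patient", "disease", "therapy"}
CS_TERMS = {"algorithm", "neural", "compiler", "learning"}


def route_sources(query):
    """Return source names in priority order for a query string."""
    text = query.strip().lower()
    # DOIs always start with the "10." directory indicator.
    if text.startswith("10."):
        return ["crossref"]
    words = set(text.replace(",", " ").split())
    medical = bool(words & MEDICAL_TERMS)
    cs = bool(words & CS_TERMS)
    if medical and not cs:
        return ["pubmed", "semantic_scholar"]
    if cs and not medical:
        return ["dblp", "arxiv", "semantic_scholar"]
    # Mixed or unrecognized topics fall back to the general-purpose source.
    return ["semantic_scholar"]
```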

Example:

from onecite.pipeline import Identifier
from onecite import RawEntry

identifier = Identifier()
raw = RawEntry(content="10.1038/nature14539")

matches = identifier.identify(raw)
for match in matches:
    print(f"{match.title} ({match.year})")

Stage 3: Completion

Purpose: Enrich entries with complete metadata

Input: IdentifiedEntry (often incomplete)

Output: CompletedEntry (fully enriched)

Process:

  1. Query additional data sources

  2. Fill missing fields

  3. Normalize author names

  4. Verify publication details

  5. Calculate completeness score

Fields Enriched:

  • Authors

  • Title

  • Journal/Publisher

  • Year

  • Volume/Issue

  • Pages

  • DOI/URL

  • Keywords

  • Abstract
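
Filling missing fields from additional sources amounts to a merge that never overwrites data already present. The sketch below shows one way to do that with plain dictionaries; `fill_missing` and the dictionary representation are assumptions for illustration, not the library's entry types.

```python
def fill_missing(primary, *fallbacks):
    """Return a copy of primary with empty fields filled from fallbacks, in order."""
    merged = dict(primary)
    for source in fallbacks:
        for name, value in source.items():
            # Keep existing values; only fill fields that are absent or empty.
            if not merged.get(name) and value:
                merged[name] = value
    return merged
```

Because fallbacks are consulted in order, listing the most trusted source first gives it precedence for any field the primary record lacks.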

Completeness Scoring:

A score from 0 to 1 indicates how complete the metadata is:

  • 0.9-1.0: Excellent (all fields present)

  • 0.7-0.9: Good (most fields present)

  • 0.5-0.7: Fair (core fields present)

  • < 0.5: Poor (incomplete)
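
One simple way to produce such a score is the fraction of expected fields that are filled in. The sketch below assumes equal weighting and a field list loosely based on "Fields Enriched"; both are illustrative choices, and how OneCite weights fields is not stated. Because the documented bands share their endpoints, the sketch assigns a boundary value to the higher band, which is also an assumption.

```python
# Field names loosely follow the "Fields Enriched" list; equal weights are assumed.
FIELDS = ("authors", "title", "journal", "year", "volume",
          "pages", "doi", "keywords", "abstract")


def completeness_score(entry):
    """Fraction of FIELDS that are present and non-empty in the entry dict."""
    filled = sum(1 for name in FIELDS if entry.get(name))
    return filled / len(FIELDS)


def completeness_band(score):
    """Map a score to its band; a boundary value goes to the higher band."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"
```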

Example:

from onecite.pipeline import Completer
from onecite import IdentifiedEntry

completer = Completer()
identified = IdentifiedEntry(...)

completed = completer.complete(identified)
print(f"Completeness: {completed.completeness_score}")

Stage 4: Formatting

Purpose: Convert to output format

Input: CompletedEntry

Output: Formatted string

Supported Formats:

  • BibTeX

  • APA

  • MLA

  • Custom (via templates)

Process:

  1. Load template for format

  2. Map fields to template variables

  3. Apply formatting rules

  4. Handle special characters

  5. Return formatted string
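
For BibTeX output, steps 2 to 5 can be sketched as follows. This is a simplified illustration with hypothetical helper names; it uses a fixed layout instead of a loaded template, and the escape table covers only a few LaTeX special characters.

```python
# A few characters that are special in LaTeX, with a conservative escape for each.
SPECIAL_CHARS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape_bibtex(value):
    """Escape LaTeX special characters in a field value."""
    return "".join(SPECIAL_CHARS.get(ch, ch) for ch in str(value))


def format_bibtex(key, fields, entry_type="article"):
    """Render a dict of fields as one BibTeX entry, skipping empty values."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        if value:
            lines.append(f"  {name} = {{{escape_bibtex(value)}}},")
    lines.append("}")
    return "\n".join(lines)
```

Fields such as url or doi usually need different treatment from prose fields, since escaping characters inside them can break links; a fuller version would escape per field type.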

Example:

from onecite.pipeline import Formatter
from onecite import CompletedEntry

formatter = Formatter()
completed = CompletedEntry(...)

# BibTeX output
bibtex = formatter.format(completed, "bibtex")

# APA output
apa = formatter.format(completed, "apa")

Complete Pipeline

The PipelineController orchestrates all stages:

from onecite import PipelineController

controller = PipelineController()

result = controller.process(
    entries=["10.1038/nature14539"],
    output_format="bibtex"
)

The controller runs these steps:

  1. Validate input

  2. For each entry:

     • Identify sources

     • Select best match

     • Complete entry

     • Format output

  3. Aggregate results

  4. Return summary
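
The per-entry loop and the aggregated summary can be sketched without the library as shown below. `run_pipeline` and its callable parameters are hypothetical; the `processed_count`, `failed_count`, and `warnings` keys mirror the result fields used elsewhere in this reference, while the `results` key is an assumption.

```python
def run_pipeline(entries, identify, complete, render):
    """Run each entry through the stage callables and summarize the outcome."""
    results, warnings = [], []
    for entry in entries:
        try:
            matches = identify(entry)
            if not matches:
                raise LookupError("no matches found")
            best = matches[0]  # assumes candidates arrive ranked by relevance
            results.append(render(complete(best)))
        except Exception as exc:  # one bad entry should not stop the batch
            warnings.append(f"{entry!r}: {exc}")
    return {
        "results": results,
        "processed_count": len(results),
        "failed_count": len(warnings),
        "warnings": warnings,
    }
```

Catching failures per entry is what makes partial success possible: the batch continues, and each failure is recorded as a warning instead of aborting the run.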

Advanced Pipeline Usage

Custom Data Processing

from onecite.pipeline import (
    Validator,
    Identifier,
    Completer,
    Formatter
)
from onecite import RawEntry, ResolverError, ValidationError

# Create components
validator = Validator()
identifier = Identifier()
completer = Completer()
formatter = Formatter()

# Manual pipeline
raw = RawEntry(content="10.1038/nature14539")

# Stage 1
if not validator.validate(raw):
    raise ValidationError("Invalid reference")

# Stage 2
matches = identifier.identify(raw)
if not matches:
    raise ResolverError("No matches found")

# Stage 3
identified = matches[0]
completed = completer.complete(identified)

# Stage 4
formatted = formatter.format(completed, "bibtex")
print(formatted)

Batch Processing

from onecite import PipelineController

controller = PipelineController()

references = [
    "10.1038/nature14539",
    "1706.03762",
    "Smith (2020) Machine Learning"
]

result = controller.process(
    entries=references,
    output_format="bibtex"
)

print(f"Processed: {result['processed_count']}")
print(f"Failed: {result['failed_count']}")

Performance Optimization

Single Reference:

from onecite import process_references

# Fast path for single reference
result = process_references("10.1038/nature14539")

Batch References:

# Use --quiet flag for better performance
onecite process refs.txt --quiet -o output.bib

Large Batches:

# Split into chunks
split -l 100 large_file.txt chunk_

for chunk in chunk_*; do
    onecite process "$chunk" -o "${chunk}.bib" --quiet
done
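
When references are already loaded in Python, the same chunking idea can be expressed with a small generator. This is a generic sketch; `chunked` is a hypothetical helper, not part of OneCite, and the default size of 100 simply mirrors the shell example.

```python
from itertools import islice


def chunked(items, size=100):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each yielded list can then be passed to a separate processing call, so a failure in one chunk does not discard the results of the others.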

Error Handling in Pipeline

Validation Errors

from onecite import ValidationError, process_references

try:
    result = process_references("")
except ValidationError:
    print("Empty input")

Resolution Errors

from onecite import ResolverError, process_references

try:
    result = process_references("invalid/doi")
except ResolverError:
    print("Could not find reference")
    print("Check identifier or try again later")

Partial Success

from onecite import process_references

# mixed_refs is defined elsewhere: input that mixes resolvable and unresolvable references
result = process_references(mixed_refs)

print(f"Success: {result['processed_count']}")
print(f"Failed: {result['failed_count']}")

if result['warnings']:
    for warning in result['warnings']:
        print(f"Warning: {warning}")

Pipeline Configuration

Custom Templates

from onecite import PipelineController

controller = PipelineController()
controller.add_template_path("./my_templates")

result = controller.process(
    entries=["10.1038/nature14539"],
    output_format="my_format"
)

Data Source Priority

from onecite.pipeline import Identifier

identifier = Identifier()

# Set priority for specific query types
identifier.set_source_priority(
    query_type="biomedical",
    sources=["pubmed", "crossref", "openalex"]
)

Timeout Configuration

from onecite import PipelineController

controller = PipelineController()
controller.set_timeout(10)  # 10 seconds per query

Next Steps