Data Ingestion

Configure your R2R ingestion pipeline

Introduction

R2R’s ingestion pipeline transforms raw documents into structured, searchable content. It supports a wide range of file types (TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video) and can run in different modes and configurations to suit your performance and quality requirements.

The pipeline seamlessly integrates with R2R’s vector databases and knowledge graphs, enabling advanced retrieval, analysis, and entity/relationship extraction at scale.

Deployment Options

R2R ingestion works in two main deployment modes:

  • Light:
    Uses R2R’s built-in parsing for synchronous ingestion. This mode is simple, fast, and supports all file types locally. It’s ideal for lower-volume scenarios or quick testing.

  • Full:
    Employs workflow orchestration to run asynchronous ingestion tasks at higher throughput. It can leverage external providers like unstructured_local or unstructured_api for more advanced parsing capabilities and hybrid (text + image) analysis.

Ingestion Modes

When creating or updating documents, you can select an ingestion mode based on your needs:

  • fast: Prioritizes speed by skipping certain enrichment steps like summarization.
  • hi-res: Aims for high-quality extraction, potentially leveraging visual language models for PDFs and images. Recommended for complex or multimodal documents.
  • custom: Offers full control via ingestion_config, allowing you to tailor parsing, chunking, and enrichment parameters.
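
As a rough illustration of choosing among these modes, the sketch below picks one from a file's extension. The helper name `choose_ingestion_mode` and the extension set are assumptions made for this example and are not part of R2R; only the mode names `fast`, `hi-res`, and `custom` come from the list above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical heuristic: extensions treated as visually complex or multimodal.
# This set is an assumption for illustration, not an R2R default.
VISUAL_EXTENSIONS = {".pdf", ".pptx", ".png", ".jpg", ".jpeg"}


def choose_ingestion_mode(file_path: str, ingestion_config: Optional[dict] = None) -> str:
    """Return "custom", "hi-res", or "fast" for the given file."""
    if ingestion_config:
        # A caller-supplied ingestion_config is what custom mode is for.
        return "custom"
    if Path(file_path).suffix.lower() in VISUAL_EXTENSIONS:
        return "hi-res"
    return "fast"


print(choose_ingestion_mode("notes.txt"))                        # fast
print(choose_ingestion_mode("slides.PPTX"))                      # hi-res
print(choose_ingestion_mode("report.pdf", {"chunk_size": 512}))  # custom
```

The returned string could then be passed as the `ingestion_mode` argument when creating a document.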

Core Concepts

Document Processing Pipeline

Ingestion in R2R covers the entire lifecycle of a document’s preparation for retrieval:

  1. Parsing: Converts source files into text.
  2. Chunking: Breaks text into semantic segments.
  3. Embedding: Transforms segments into vector representations for semantic search.
  4. Storing: Persists chunks and embeddings for retrieval.
  5. Knowledge Graph Integration: Optionally extracts entities and relationships for graph-based analysis.

Each ingested document is associated with user permissions and metadata, enabling comprehensive access control and management.
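
To make the first four stages concrete, here is a toy, in-memory sketch of parse, chunk, embed, and store. It is not R2R's implementation: every function name is hypothetical, the "embedding" is a normalized bag-of-words vector standing in for a real model, and the optional knowledge graph step is left out.

```python
import math
from collections import Counter


def parse(raw: bytes) -> str:
    """Parsing: convert a source file's bytes into text (plain UTF-8 only here)."""
    return raw.decode("utf-8")


def chunk(text: str) -> list[str]:
    """Chunking: split on blank lines as a crude stand-in for semantic segmentation."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def embed(segment: str) -> dict[str, float]:
    """Embedding: a normalized bag-of-words vector, standing in for a real model."""
    counts = Counter(segment.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def ingest(raw: bytes, store: list) -> int:
    """Run parse -> chunk -> embed -> store and return the number of chunks stored."""
    segments = chunk(parse(raw))
    for segment in segments:
        store.append({"text": segment, "vector": embed(segment)})
    return len(segments)


def search(query: str, store: list) -> str:
    """Return the stored chunk whose vector has the largest dot product with the query."""
    q = embed(query)

    def score(item: dict) -> float:
        return sum(weight * item["vector"].get(word, 0.0) for word, weight in q.items())

    return max(store, key=score)["text"]


store: list = []
ingest(b"Cats purr when content.\n\nVolcanoes erupt molten rock.", store)
print(search("molten rock", store))  # Volcanoes erupt molten rock.
```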

Ingestion Architecture

The ingestion pipeline is modular and extensible: you can customize individual components (e.g., choose a different parser or embedding model) without disrupting the rest of the system.

Multimodal Support

For documents that contain images, complex layouts, or mixed media (like PDFs), using hi-res mode can unlock visual language model (VLM) capabilities. On a full deployment, hi-res mode may incorporate unstructured_local or unstructured_api to handle these advanced parsing scenarios.

Configuration

Key Configuration Areas

Ingestion behavior is primarily managed through your r2r.toml configuration file:

[ingestion]
provider = "r2r" # or `unstructured_local` | `unstructured_api`
chunking_strategy = "recursive"
chunk_size = 1024
chunk_overlap = 512

  • Provider: Determines which parsing engine is used (r2r built-in or unstructured_* providers).
  • Chunking Strategy & Parameters: Control how text is segmented into chunks.
  • Other Settings: Adjust file parsing logic, excluded parsers, and integration with embeddings or knowledge graphs.
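
To see how `chunk_size` and `chunk_overlap` interact, the sketch below slides a fixed-size window over a string. It is a simplification under stated assumptions: it counts characters and ignores separators, so it does not reproduce the `recursive` strategy, and `sliding_chunks` is a hypothetical helper rather than an R2R function.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


# With the values from the sample config, each window advances by 1024 - 512 = 512.
document = "x" * 3000
pieces = sliding_chunks(document, chunk_size=1024, chunk_overlap=512)
print(len(pieces), [len(p) for p in pieces])  # 5 [1024, 1024, 1024, 1024, 952]
```

With an overlap of half the chunk size, most of the text lands in two chunks, so the number of chunks to embed and store is roughly double what a zero-overlap split would produce.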

Configuration Impact

Your ingestion settings influence:

  1. Postgres Configuration:
    Ensures that vector and metadata storage are optimized for semantic retrieval.

  2. Embedding Configuration:
    Defines the vector models and parameters used to embed document chunks and queries.

  3. Ingestion Settings Themselves:
    Affect parsing complexity, chunk sizes, and the extent of enrichment during ingestion.

Document Management

Document Ingestion

R2R supports multiple ingestion methods:

  • File Ingestion: Provide a file path and optional metadata:

    ingest_response = client.documents.create(
        file_path="path/to/file.txt",
        metadata={"key1": "value1"},
        ingestion_mode="fast",  # choose fast, hi-res, or custom
        # ingestion_config = {...}  # `custom` setting allows for full specification
    )

  • Direct Chunk Ingestion: Supply pre-processed text segments:

    chunks = ["Pre-chunked content", "other pre-chunked content", ...]
    ingest_response = client.chunks.create(chunks=chunks)
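
For more than a handful of files, the file ingestion call can be wrapped in a loop. The sketch below is one possible shape, not an R2R utility: `ingest_directory` and the `source_dir` metadata key are hypothetical, and a stand-in client replaces a real one so the example runs without a server. It relies only on the `client.documents.create(file_path=..., metadata=..., ingestion_mode=...)` call shown above.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def ingest_directory(client, directory: str, ingestion_mode: str = "fast") -> dict:
    """Call client.documents.create for every file in directory; collect results and errors."""
    results, errors = {}, {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            results[path.name] = client.documents.create(
                file_path=str(path),
                metadata={"source_dir": directory},  # illustrative metadata key
                ingestion_mode=ingestion_mode,
            )
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[path.name] = str(exc)
    return {"ingested": results, "failed": errors}


# Stand-in client for demonstration only; a real R2R client would take its place.
def _fake_create(file_path, metadata, ingestion_mode):
    if file_path.endswith(".bad"):
        raise ValueError("unsupported file")
    return {"file_path": file_path, "ingestion_mode": ingestion_mode}


fake_client = SimpleNamespace(documents=SimpleNamespace(create=_fake_create))

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello")
    Path(tmp, "b.bad").write_text("oops")
    report = ingest_directory(fake_client, tmp)

print(sorted(report["ingested"]), sorted(report["failed"]))  # ['a.txt'] ['b.bad']
```

Collecting failures instead of raising on the first one lets a single unreadable file be retried later without repeating the whole batch.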

Document Updates

Update existing documents to reflect new content or corrected data:

update_response = client.documents.update(
    file_path="path/to/updated_file.txt",
    id=document_id,
    metadata=[{"status": "reviewed"}]
)

By updating documents, you maintain version history and ensure that retrieval remains accurate as documents evolve.

Next Steps

  • Review Embedding Configuration to optimize semantic search.
  • Check out other configuration guides for integrating retrieval and knowledge graph capabilities.