Data Ingestion
Configure your R2R ingestion pipeline
Introduction
R2R’s ingestion pipeline transforms raw documents into structured, searchable content. It supports a wide range of file types (TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video) and can run in different modes and configurations to suit your performance and quality requirements.
The pipeline seamlessly integrates with R2R’s vector databases and knowledge graphs, enabling advanced retrieval, analysis, and entity/relationship extraction at scale.
Deployment Options
R2R ingestion works in two main deployment modes:
- Light: Uses R2R’s built-in parsing for synchronous ingestion. This mode is simple, fast, and supports all file types locally. It’s ideal for lower-volume scenarios or quick testing.
- Full: Employs workflow orchestration to run asynchronous ingestion tasks at higher throughput. It can leverage external providers like unstructured_local or unstructured_api for more advanced parsing capabilities and hybrid (text + image) analysis.
Ingestion Modes
When creating or updating documents, you can select an ingestion mode based on your needs:
- fast: Prioritizes speed by skipping certain enrichment steps like summarization.
- hi-res: Aims for high-quality extraction, potentially leveraging visual language models for PDFs and images. Recommended for complex or multimodal documents.
- custom: Offers full control via ingestion_config, allowing you to tailor parsing, chunking, and enrichment parameters.
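As a minimal sketch of how a caller might select a mode, the helper below validates the mode name and assembles keyword arguments for a document-creation request. The function name build_ingestion_options and the ingestion_mode key are assumptions made for illustration; only the three mode names and ingestion_config come from the text above.

```python
from typing import Any, Optional

# The three mode names described above.
VALID_MODES = {"fast", "hi-res", "custom"}


def build_ingestion_options(
    mode: str,
    ingestion_config: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return keyword arguments for a hypothetical document-creation call."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown ingestion mode: {mode!r}")
    # Assumption: the request carries the mode under the key "ingestion_mode".
    options: dict[str, Any] = {"ingestion_mode": mode}
    if ingestion_config:
        # Copy so that later changes by the caller do not leak into the request.
        options["ingestion_config"] = dict(ingestion_config)
    return options


print(build_ingestion_options("fast"))
print(build_ingestion_options("custom", {"chunk_size": 512}))
```

Rejecting unknown mode names up front surfaces typos such as "hires" before any request is sent.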
Core Concepts
Document Processing Pipeline
Ingestion in R2R covers the entire lifecycle of a document’s preparation for retrieval:
- Parsing: Converts source files into text.
- Chunking: Breaks text into semantic segments.
- Embedding: Transforms segments into vector representations for semantic search.
- Storing: Persists chunks and embeddings for retrieval.
- Knowledge Graph Integration: Optionally extracts entities and relationships for graph-based analysis.
Each ingested document is associated with user permissions and metadata, enabling comprehensive access control and management.
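The stages above can be illustrated with a toy, in-memory pipeline. This is a sketch under stated assumptions, not R2R’s implementation: parsing only decodes UTF-8 text, chunking uses fixed-size character windows rather than semantic segmentation, the embedding is a hash-derived placeholder rather than a model, the store is a plain dictionary, and the knowledge graph step is omitted. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    document_id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def parse(raw: bytes) -> str:
    """Parsing: decode plain text (real parsers handle PDF, DOCX, and so on)."""
    return raw.decode("utf-8")


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Chunking: fixed-size character windows that overlap their neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, stop, step)]


def embed(text: str, dims: int = 8) -> list[float]:
    """Embedding: a deterministic placeholder vector, not a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def ingest(
    document_id: str,
    raw: bytes,
    store: dict[str, list[Chunk]],
    metadata: Optional[dict] = None,
) -> int:
    """Run parse -> chunk -> embed -> store and return the number of chunks."""
    text = parse(raw)
    chunks = [
        Chunk(document_id, piece, embed(piece), dict(metadata or {}))
        for piece in chunk(text)
    ]
    store[document_id] = chunks  # Storing: keep chunks and embeddings together.
    return len(chunks)


store: dict[str, list[Chunk]] = {}
count = ingest(
    "doc-1",
    b"R2R turns raw documents into searchable chunks. " * 3,
    store,
    {"owner": "alice"},
)
print(count, "chunks stored")
```

Because each stage is a separate function, any one of them can be swapped (for example, a different chunk function) without touching the others.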
Ingestion Architecture
The ingestion pipeline is modular and extensible: you can customize components (e.g., choose a different parser or embedding model) without disrupting the rest of the system.
Multimodal Support
For documents that contain images, complex layouts, or mixed media (like PDFs), using hi-res mode can unlock visual language model (VLM) capabilities. On a full deployment, hi-res mode may incorporate unstructured_local or unstructured_api to handle these advanced parsing scenarios.
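One way to act on this guidance is to route files to a mode by extension. The sketch below is a heuristic invented for illustration: the helper name, the set of extensions, and the choice of fast as the fallback are all assumptions, and a real deployment may well prefer a different rule.

```python
from pathlib import Path

# Assumption: file types that commonly carry images or complex layouts.
VISUAL_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".pptx"}


def suggest_ingestion_mode(filename: str) -> str:
    """Suggest hi-res for likely multimodal files and fast for everything else."""
    suffix = Path(filename).suffix.lower()
    return "hi-res" if suffix in VISUAL_EXTENSIONS else "fast"


for name in ("report.PDF", "notes.txt", "slides.pptx"):
    print(name, "->", suggest_ingestion_mode(name))
```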
Configuration
Key Configuration Areas
Ingestion behavior is primarily managed through your r2r.toml configuration file:
- Provider: Determines which parsing engine is used (r2r built-in or unstructured_* providers).
- Chunking Strategy & Parameters: Control how text is segmented into chunks.
- Other Settings: Adjust file parsing logic, excluded parsers, and integration with embeddings or knowledge graphs.
Configuration Impact
Your ingestion settings influence:
- Postgres Configuration: Ensures that vector and metadata storage are optimized for semantic retrieval.
- Embedding Configuration: Defines the vector models and parameters used to embed document chunks and queries.
- Ingestion Settings Themselves: Affect parsing complexity, chunk sizes, and the extent of enrichment during ingestion.
Document Management
Document Ingestion
R2R supports multiple ingestion methods:
- File Ingestion: Provide a file path and optional metadata.
- Direct Chunk Ingestion: Supply pre-processed text segments.
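The two methods can be sketched as thin wrappers around a document-creation call. The example assumes a client object whose documents API exposes create(file_path=..., chunks=..., metadata=...); that signature is an assumption to verify against the SDK version you have installed. So that the example runs offline, it uses a hypothetical RecordingDocuments stand-in that only records what it receives.

```python
from typing import Any, Optional, Sequence


class RecordingDocuments:
    """Offline stand-in for a client's document API; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def create(self, **kwargs: Any) -> dict[str, Any]:
        self.calls.append(kwargs)
        return {"status": "accepted", "received": sorted(kwargs)}


def ingest_file(documents: Any, file_path: str, metadata: Optional[dict] = None) -> Any:
    """File ingestion: pass a path and optional metadata."""
    return documents.create(file_path=file_path, metadata=metadata or {})


def ingest_chunks(
    documents: Any, chunks: Sequence[str], metadata: Optional[dict] = None
) -> Any:
    """Direct chunk ingestion: pass text that is already segmented."""
    cleaned = [piece.strip() for piece in chunks if piece.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty chunk is required")
    return documents.create(chunks=cleaned, metadata=metadata or {})


docs = RecordingDocuments()
print(ingest_file(docs, "reports/q1.pdf", {"team": "finance"}))
print(ingest_chunks(docs, ["First segment.", "  Second segment.  ", ""]))
```

With a real client, the same wrappers would receive its documents attribute in place of the stand-in, provided the assumed signature matches.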
Document Updates
Update existing documents to reflect new content or corrected data.
By updating documents, you maintain version history and ensure that retrieval remains accurate as documents evolve.
Next Steps
- Review Embedding Configuration to optimize semantic search.
- Check out other configuration guides for integrating retrieval and knowledge graph capabilities.