Overview
Configure your R2R ingestion pipeline
Introduction
R2R’s ingestion pipeline processes a wide range of document formats and transforms them into searchable content. It integrates with vector databases and knowledge graphs to support retrieval and analysis.
Implementation Options
R2R offers two main implementations for ingestion:
- Light: Uses R2R’s built-in ingestion logic, supporting a wide range of file types including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video. For high-quality PDF parsing, it is recommended to use the zerox parser.
- Full: Leverages Unstructured’s open-source ingestion platform to handle supported file types. This is the default for the ‘full’ installation and provides more advanced parsing capabilities.
Core Concepts
Document Processing Pipeline
Inside R2R, ingestion refers to the complete pipeline for processing input data:
- Parsing files into text
- Chunking text into semantic units
- Generating embeddings
- Storing data for retrieval
Ingested files are stored with an associated document identifier and user identifier to enable comprehensive management.
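A brief sketch of what those identifiers enable, assuming the Python client's `documents_overview` and `document_chunks` methods (method names vary between SDK versions, and the document id below is a placeholder):

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Each ingested file is tracked by a document id and owning user id,
# so it can be listed, inspected, updated, or deleted later.
print(client.documents_overview())

# Fetch the stored chunks for one document (placeholder id).
print(client.document_chunks(document_id="9fbe403b-c11c-5aae-8ade-ef22980c3ad1"))
```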
Multimodal Support
R2R has expanded its capabilities to include multimodal foundation models. In addition to using such models by default for images, R2R can also apply them to PDFs by configuring a parser override.
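A minimal r2r.toml sketch of such an override, assuming the zerox parser mentioned above (the exact section and key names may differ between R2R versions):

```toml
[ingestion.parser_overrides]
# Route PDFs through the multimodal zerox parser instead of the default PDF parser
pdf = "zerox"
```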
Configuration
Key Configuration Areas
Many settings are managed through the r2r.toml configuration file.
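An illustrative excerpt of that file, loosely following R2R's sample configuration (keys and defaults may differ between versions):

```toml
[database]
# Postgres is used for both document management and vector search
provider = "postgres"

[ingestion]
# Built-in ("r2r") or Unstructured-based ingestion, plus chunking behavior
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 1024
chunk_overlap = 512

[embedding]
# Model used to embed chunks at ingestion time and queries at search time
provider = "litellm"
base_model = "openai/text-embedding-3-small"
base_dimension = 512
```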
Configuration Impact
These settings directly influence how R2R performs ingestion:
- Database Configuration
  - Configures Postgres database for semantic search and document management
  - Used during retrieval to find relevant document chunks via vector similarity
- Ingestion Settings
  - Determines file type processing and text conversion methods
  - Controls text chunking protocols and granularity
  - Affects information storage and retrieval precision
- Embedding Configuration
  - Defines model and parameters for text-to-vector conversion
  - Used during retrieval to embed user queries
  - Enables vector comparison against stored document embeddings
Document Management
Document Ingestion
The system provides several methods for ingesting documents:
- File Ingestion
- Direct Chunk Ingestion
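A minimal Python sketch of both paths, assuming the client's `ingest_files` and `ingest_chunks` methods (signatures may differ between SDK versions):

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# File ingestion: R2R parses, chunks, embeds, and stores the file.
client.ingest_files(
    file_paths=["data/report.pdf"],
    metadatas=[{"title": "Quarterly Report"}],  # one metadata entry per file path
)

# Direct chunk ingestion: supply pre-chunked text and skip parsing/chunking.
client.ingest_chunks(
    chunks=[
        {"text": "R2R can ingest pre-chunked text directly."},
        {"text": "Each chunk is embedded and stored for retrieval."},
    ],
)
```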
Document Updates
Update existing documents while maintaining version history:
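A sketch of an update call, assuming the client's `update_files` method; the document id below is a placeholder:

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Re-ingest a new revision against an existing document id; R2R replaces the
# stored content while keeping the document's identity and version history.
client.update_files(
    file_paths=["data/report_v2.pdf"],
    document_ids=["9fbe403b-c11c-5aae-8ade-ef22980c3ad1"],
)
```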
Vector Index Management
Creating Indices
Vector indices improve search performance for large collections:
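A sketch of index creation, assuming the client's `create_vector_index` method with pgvector-style HNSW parameters (argument names may differ between versions):

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Build an HNSW index over the stored chunk embeddings.
client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={"m": 16, "ef_construction": 64},
    concurrently=True,  # avoid locking the table while the index builds
)
```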
Important considerations for index creation:
- Resource intensive process
- Requires pre-warming for optimal performance (see the warm-up sketch after this list)
- Parameters affect build time and search quality
- Monitor system resources during creation
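As noted above, a freshly built index benefits from pre-warming. One hypothetical warm-up pattern is to run a few representative searches before serving production traffic (this is a usage pattern, not a dedicated R2R feature):

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Issue representative queries so the new index's pages are loaded into memory.
for query in ["ingestion pipeline", "vector index tuning", "document management"]:
    client.search(query=query)
```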
Managing Indices
List and delete indices as needed:
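A sketch of index maintenance, assuming the client's `list_vector_indices` and `delete_vector_index` methods; the index name below is a placeholder:

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Inspect the indices that exist on the chunk embeddings table.
print(client.list_vector_indices(table_name="vectors"))

# Remove an index that is no longer needed.
client.delete_vector_index(
    index_name="ix_vector_cosine_ops_hnsw__demo",
    table_name="vectors",
)
```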
Troubleshooting
Common Issues and Solutions
- Ingestion Failures
  - Verify file permissions and paths
  - Check supported file formats
  - Ensure metadata matches file_paths
  - Monitor memory usage
- Chunking Issues
  - Large chunks may impact retrieval quality
  - Small chunks may lose context
  - Adjust overlap for better context preservation
- Vector Index Performance
  - Monitor creation time
  - Check memory usage
  - Verify warm-up queries
  - Consider rebuilding if quality degrades
Pipeline Architecture
The ingestion pipeline consists of several key components: file parsing, text chunking, embedding generation, and storage of chunks and vectors.
This modular design allows for customization and extension of individual components while maintaining robust document processing capabilities.
Next Steps
For more detailed information on configuring specific components of the ingestion pipeline, refer to the dedicated configuration pages for each component.