Configure your R2R ingestion pipeline

Introduction

R2R’s ingestion pipeline efficiently processes various document formats, transforming them into searchable content. It seamlessly integrates with vector databases and knowledge graphs for optimal retrieval and analysis.

Implementation Options

R2R offers two main implementations for ingestion:

  • Light: Uses R2R’s built-in ingestion logic, supporting a wide range of file types including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video. For high-quality PDF parsing, it is recommended to use the zerox parser.
  • Full: Leverages Unstructured’s open-source ingestion platform to handle supported file types. This is the default for the ‘full’ installation and provides more advanced parsing capabilities.
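In r2r.toml, the choice between the two implementations is expressed through the ingestion provider. The sketch below is an assumption based on the provider names that appear elsewhere on this page ("r2r" for Light, "unstructured_local" for Full); check your installation's default r2r.toml for the exact keys:

```toml
[ingestion]
# Light: R2R's built-in parsing logic
provider = "r2r"

# Full: Unstructured-based parsing (use this provider instead)
# provider = "unstructured_local"
```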

Core Concepts

Document Processing Pipeline

Inside R2R, ingestion refers to the complete pipeline for processing input data:

  • Parsing files into text
  • Chunking text into semantic units
  • Generating embeddings
  • Storing data for retrieval

Ingested files are stored with an associated document identifier and user identifier to enable comprehensive management.
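The four stages above can be sketched end to end. This is an illustrative toy, not R2R's implementation: the VectorStore, parse, chunk, and embed names are stand-ins, and real ingestion uses format-specific parsers, a configured embedding provider, and Postgres.

```python
from dataclasses import dataclass, field


@dataclass
class VectorStore:
    """Toy in-memory store keyed by (document_id, chunk_index)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, document_id, chunk_index, text, embedding, user_id):
        self.rows[(document_id, chunk_index)] = {
            "text": text, "embedding": embedding, "user_id": user_id,
        }


def parse(raw: bytes) -> str:
    # Stage 1: parse the file into text (stand-in for format-specific parsers).
    return raw.decode("utf-8")


def chunk(text: str, size: int = 40) -> list[str]:
    # Stage 2: split text into units (real chunkers respect semantic boundaries).
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(chunks: list[str]) -> list[list[float]]:
    # Stage 3: placeholder embedding; R2R delegates to an embedding provider.
    return [[float(len(c))] for c in chunks]


def ingest(raw: bytes, document_id: str, user_id: str, store: VectorStore) -> int:
    text = parse(raw)
    chunks = chunk(text)
    vectors = embed(chunks)
    # Stage 4: store each chunk with its document and user identifiers,
    # mirroring how ingested files are tracked for later management.
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        store.upsert(document_id, i, c, v, user_id)
    return len(chunks)
```

Keeping the stages as separate functions mirrors the pipeline's modularity: any one stage can be swapped without touching the others.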

Multimodal Support

R2R has recently expanded its capabilities to include multimodal foundation models. In addition to using such models by default for images, R2R can now use them on PDFs by configuring the parser override:

```json
"ingestion_config": {
  "parser_overrides": {
    "pdf": "zerox"
  }
}
```

Configuration

Key Configuration Areas

Many settings are managed through the r2r.toml configuration file:

```toml
[database]
provider = "postgres"

[ingestion]
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 1_024
chunk_overlap = 512
excluded_parsers = ["mp4"]

[embedding]
provider = "litellm"
base_model = "openai/text-embedding-3-small"
base_dimension = 512
batch_size = 128
add_title_as_prefix = false
rerank_model = "None"
concurrent_request_limit = 256
```

Configuration Impact

These settings directly influence how R2R performs ingestion:

  1. Database Configuration

    • Configures Postgres database for semantic search and document management
    • Used during retrieval to find relevant document chunks via vector similarity
  2. Ingestion Settings

    • Determines file type processing and text conversion methods
    • Controls text chunking protocols and granularity
    • Affects information storage and retrieval precision
  3. Embedding Configuration

    • Defines model and parameters for text-to-vector conversion
    • Used during retrieval to embed user queries
    • Enables vector comparison against stored document embeddings
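The interaction between chunk_size and chunk_overlap can be illustrated with a minimal sliding-window chunker. This is a character-based sketch for intuition only, not R2R's "recursive" strategy, which splits on semantic separators:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the final window already covers the end of the text
        i += step
    return chunks
```

With the configuration above (chunk_size = 1_024, chunk_overlap = 512), each chunk shares half its content with its neighbor, which preserves context across boundaries at the cost of storing more chunks.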

Document Management

Document Ingestion

The system provides several methods for ingesting documents:

  1. File Ingestion
```python
file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "new_after_n_chars": 256,
        "max_characters": 512
    }
)
```
  2. Direct Chunk Ingestion
```python
chunks = [
    {
        "text": "Sample text chunk 1...",
    },
    {
        "text": "Sample text chunk 2...",
    }
]

ingest_response = client.ingest_chunks(
    chunks=chunks,
    metadata={"title": "Sample", "source": "example"}
)
```

Document Updates

Update existing documents while maintaining version history:

```python
update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{"status": "reviewed"}]
)
```

Vector Index Management

Creating Indices

Vector indices improve search performance for large collections:

```python
create_response = client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={"m": 16, "ef_construction": 64},
    concurrently=True
)
```

Important considerations for index creation:

  • Resource intensive process
  • Requires pre-warming for optimal performance
  • Parameters affect build time and search quality
  • Monitor system resources during creation

Managing Indices

List and delete indices as needed:

```python
# List indices
indices = client.list_vector_indices(table_name="vectors")

# Delete index
delete_response = client.delete_vector_index(
    index_name="index_name",
    table_name="vectors",
    concurrently=True
)
```

Troubleshooting

Common Issues and Solutions

  1. Ingestion Failures

    • Verify file permissions and paths
    • Check supported file formats
    • Ensure metadata matches file_paths
    • Monitor memory usage
  2. Chunking Issues

    • Large chunks may impact retrieval quality
    • Small chunks may lose context
    • Adjust overlap for better context preservation
  3. Vector Index Performance

    • Monitor creation time
    • Check memory usage
    • Verify warm-up queries
    • Consider rebuilding if quality degrades

Pipeline Architecture

The ingestion pipeline is built from several key components: parsers, a chunker, an embedding provider, and a vector store. This modular design allows individual components to be customized and extended while maintaining robust document processing capabilities.

Next Steps

For more detailed information on configuring specific components of the ingestion pipeline, refer to the dedicated documentation pages for each component.