Configure your R2R ingestion pipeline

Introduction

R2R’s ingestion pipeline efficiently processes various document formats, transforming them into searchable content. It seamlessly integrates with vector databases and knowledge graphs for optimal retrieval and analysis.

Implementation Options

R2R offers two main implementations for ingestion:

  • Light: Uses R2R’s built-in ingestion logic, supporting a wide range of file types including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video. For high-quality PDF parsing, it is recommended to use the zerox parser.
  • Full: Leverages Unstructured’s open-source ingestion platform to handle supported file types. This is the default for the ‘full’ installation and provides more advanced parsing capabilities.
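In r2r.toml, the choice between the two implementations is expressed through the ingestion provider. The sketch below is an assumption based on the provider names that appear elsewhere on this page ("r2r" for Light, "unstructured_local" for Full); check your installation's default r2r.toml for the exact keys:

```toml
[ingestion]
# Light: R2R's built-in parsing logic
provider = "r2r"

# Full: Unstructured-based parsing (use this provider instead)
# provider = "unstructured_local"
```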

Core Concepts

Document Processing Pipeline

Inside R2R, ingestion refers to the complete pipeline for processing input data:

  • Parsing files into text
  • Chunking text into semantic units
  • Generating embeddings
  • Storing data for retrieval

Ingested files are stored with an associated document identifier and user identifier to enable comprehensive management.
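The four stages above can be sketched end to end. This is an illustrative toy, not R2R's implementation: the VectorStore, parse, chunk, and embed names are stand-ins, and real ingestion uses format-specific parsers, a configured embedding provider, and Postgres.

```python
from dataclasses import dataclass, field


@dataclass
class VectorStore:
    """Toy in-memory store keyed by (document_id, chunk_index)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, document_id, chunk_index, text, embedding, user_id):
        self.rows[(document_id, chunk_index)] = {
            "text": text, "embedding": embedding, "user_id": user_id,
        }


def parse(raw: bytes) -> str:
    # Stage 1: parse the file into text (stand-in for format-specific parsers).
    return raw.decode("utf-8")


def chunk(text: str, size: int = 40) -> list[str]:
    # Stage 2: split text into units (real chunkers respect semantic boundaries).
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(chunks: list[str]) -> list[list[float]]:
    # Stage 3: placeholder embedding; R2R delegates to an embedding provider.
    return [[float(len(c))] for c in chunks]


def ingest(raw: bytes, document_id: str, user_id: str, store: VectorStore) -> int:
    text = parse(raw)
    chunks = chunk(text)
    vectors = embed(chunks)
    # Stage 4: store each chunk with its document and user identifiers,
    # mirroring how ingested files are tracked for later management.
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        store.upsert(document_id, i, c, v, user_id)
    return len(chunks)
```

Keeping the stages as separate functions mirrors the pipeline's modularity: any one stage can be swapped without touching the others.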

Multimodal Support

R2R has recently expanded its capabilities to include multimodal foundation models. In addition to using such models by default for images, R2R can now use them on PDFs by configuring the parser override:

```json
"ingestion_config": {
  "parser_overrides": {
    "pdf": "zerox"
  }
}
```

Configuration

Key Configuration Areas

Many settings are managed through the r2r.toml configuration file:

```toml
[database]
provider = "postgres"

[ingestion]
provider = "r2r"
chunking_strategy = "recursive"
chunk_size = 1_024
chunk_overlap = 512
excluded_parsers = ["mp4"]

[embedding]
provider = "litellm"
base_model = "openai/text-embedding-3-small"
base_dimension = 512
batch_size = 128
add_title_as_prefix = false
rerank_model = "None"
concurrent_request_limit = 256
```

Configuration Impact

These settings directly influence how R2R performs ingestion:

  1. Database Configuration

    • Configures Postgres database for semantic search and document management
    • Used during retrieval to find relevant document chunks via vector similarity
  2. Ingestion Settings

    • Determines file type processing and text conversion methods
    • Controls text chunking protocols and granularity
    • Affects information storage and retrieval precision
  3. Embedding Configuration

    • Defines model and parameters for text-to-vector conversion
    • Used during retrieval to embed user queries
    • Enables vector comparison against stored document embeddings
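The interaction between chunk_size and chunk_overlap can be illustrated with a minimal sliding-window chunker. This is a character-based sketch for intuition only, not R2R's "recursive" strategy, which splits on semantic separators:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the final window already covers the end of the text
        i += step
    return chunks
```

With the configuration above (chunk_size = 1_024, chunk_overlap = 512), each chunk shares half its content with its neighbor, which preserves context across boundaries at the cost of storing more chunks.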

Document Management

Document Ingestion

The system provides several methods for ingesting documents:

  1. File Ingestion
```python
file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "new_after_n_chars": 256,
        "max_characters": 512
    }
)
```
  2. Direct Chunk Ingestion
```python
chunks = [
    {
        "text": "Sample text chunk 1...",
    },
    {
        "text": "Sample text chunk 2...",
    }
]

ingest_response = client.ingest_chunks(
    chunks=chunks,
    metadata={"title": "Sample", "source": "example"}
)
```

Document Updates

Update existing documents while maintaining version history:

```python
update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{"status": "reviewed"}]
)
```

Vector Index Management

Creating Indices

Vector indices improve search performance for large collections:

```python
create_response = client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={"m": 16, "ef_construction": 64},
    concurrently=True
)
```

Important considerations for index creation:

  • Resource intensive process
  • Requires pre-warming for optimal performance
  • Parameters affect build time and search quality
  • Monitor system resources during creation

Managing Indices

List and delete indices as needed:

```python
# List indices
indices = client.list_vector_indices(table_name="vectors")

# Delete index
delete_response = client.delete_vector_index(
    index_name="index_name",
    table_name="vectors",
    concurrently=True
)
```

Troubleshooting

Common Issues and Solutions

  1. Ingestion Failures

    • Verify file permissions and paths
    • Check supported file formats
    • Ensure metadata matches file_paths
    • Monitor memory usage
  2. Chunking Issues

    • Large chunks may impact retrieval quality
    • Small chunks may lose context
    • Adjust overlap for better context preservation
  3. Vector Index Performance

    • Monitor creation time
    • Check memory usage
    • Verify warm-up queries
    • Consider rebuilding if quality degrades

Pipeline Architecture

The ingestion pipeline is built from several key components: parsers, a chunker, an embedding provider, and a vector store. This modular design allows individual components to be customized and extended while maintaining robust document processing capabilities.

Next Steps

For more detailed information on configuring specific components of the ingestion pipeline, refer to the dedicated documentation pages for each component.