Ingestion
Ingesting files with R2R.
This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.
Inside R2R, ingestion refers to the complete pipeline for processing input data:
- Parsing files into text
- Chunking text into semantic units
- Generating embeddings
- Storing data for retrieval
Ingested files are stored with an associated document identifier as well as a user identifier to enable comprehensive management.
Document Ingestion and Management
R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs by passing the following in your ingestion configuration:
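A sketch of such an override, assuming the `parser_overrides` key and the `"zerox"` parser name used by recent R2R versions; verify both against your deployment's configuration reference:

```python
# Hypothetical ingestion config override routing PDFs through a multimodal
# (vision-model) parser; check key and parser names against your R2R version.
ingestion_config = {
    "parser_overrides": {
        "pdf": "zerox"  # parse PDF pages with a multimodal foundation model
    }
}
```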
We recommend this method for achieving the highest quality ingestion results.
Ingest Files
Ingest files or directories into your R2R system:
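A minimal sketch of an ingestion call, assuming the v2 Python client's `ingest_files` method and its `file_paths`/`metadatas` parameter names; the file paths and titles here are placeholders:

```python
# Build the ingestion request; metadatas is optional, but if provided it
# must have the same length as file_paths.
file_paths = ["data/manual.pdf", "data/notes.txt"]
metadatas = [
    {"title": "Product Manual"},
    {"title": "Meeting Notes"},
]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.ingest_files(file_paths=file_paths, metadatas=metadatas)
# print(response)
```

Passing a directory path in `file_paths` ingests every file under it, including subdirectories.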
An ingested file is parsed, chunked, embedded, and stored inside your R2R system. The stored information includes a document identifier, a corresponding user identifier, and other metadata. Knowledge graph creation is done separately, at the collection level. Refer to the ingestion configuration section for comprehensive details on available options.
A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.
An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.
An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.
An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Understanding Ingestion Status
After calling `ingest_files`, the response includes important status information:
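The exact fields vary across R2R versions; with orchestration enabled, a response resembles this illustrative shape (all values here are made up for illustration, not real output):

```python
# Illustrative response shape only; field names differ by R2R version.
# With orchestration, ingestion is queued and each file gets a task message
# plus the document identifier it was assigned.
response = {
    "results": [
        {
            "message": "Ingestion task queued successfully.",
            "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
        }
    ]
}
```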
We have added support for contextual chunk enrichment! You can learn more about it here.
Currently, you need to enable it in your ingestion config:
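A sketch of the toggle, assuming the `chunk_enrichment_settings` block and `enable_chunk_enrichment` flag names; confirm both against your version's configuration reference:

```python
# Hypothetical chunk-enrichment toggle; verify key names for your R2R version.
ingestion_config = {
    "chunk_enrichment_settings": {
        "enable_chunk_enrichment": True,
    }
}
```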
Ingest Chunks
The `ingest_chunks` method allows direct ingestion of pre-processed text, bypassing the standard parsing pipeline. This is useful for:
- Custom preprocessing pipelines
- Streaming data ingestion
- Working with non-file data sources
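A sketch of direct chunk ingestion, assuming the v2 client's `ingest_chunks` method and `chunks`/`metadata` parameter names; the chunk texts are placeholders:

```python
# Each chunk needs at least a "text" key; "metadata" is optional per chunk.
chunks = [
    {"text": "Aristotle was a Greek philosopher.", "metadata": {"source": "notes"}},
    {"text": "He studied under Plato in Athens.", "metadata": {"source": "notes"}},
]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.ingest_chunks(chunks=chunks, metadata={"title": "Aristotle"})
```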
A list of chunk dictionaries to ingest. Each dictionary should contain at least a “text” key with the chunk text. An optional “metadata” key can contain a dictionary of metadata for the chunk.
An optional document ID to assign to the ingested chunks. If not provided, a new document ID will be generated.
An optional metadata dictionary for the document.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Update Files
Update existing documents while maintaining version history:
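A sketch of a file update, assuming the v2 client's `update_files` method, which pairs each new file with the document ID it replaces; the path and ID here are placeholders:

```python
# Each entry in file_paths replaces the document with the ID at the same
# position in document_ids; the two lists must have equal length.
file_paths = ["data/manual_v2.pdf"]
document_ids = ["9fbe403b-c11c-5aae-8ade-ef22980c3ad1"]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.update_files(file_paths=file_paths, document_ids=document_ids)
```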
The ingestion configuration can be customized analogously to the ingest files endpoint above.
A list of file paths to update.
A list of document IDs corresponding to the files being updated. When not provided, an attempt is made to generate the correct document ID from the given user ID and file path.
An optional list of metadata dictionaries for the updated files.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Update Chunks
Update the content of an existing chunk in your R2R system:
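A sketch of a chunk update, assuming the v2 client's `update_chunks` method and these parameter names; the IDs are placeholders to replace with real values:

```python
# Parameters for a chunk update; "metadata", if provided, replaces the
# chunk's existing metadata rather than merging with it.
params = {
    "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    "chunk_id": "REPLACE-WITH-CHUNK-UUID",
    "text": "Updated chunk text with corrected figures.",
    "metadata": {"reviewed": True},
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.update_chunks(**params)
```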
The ID of the document containing the chunk to update.
The ID of the specific chunk to update.
The new text content to replace the existing chunk text.
An optional metadata dictionary for the updated chunk. If provided, this will replace the existing chunk metadata.
Whether or not the update runs with orchestration; defaults to `True`. When set to `False`, the update process runs synchronously and returns the result directly.
Documents Overview
Retrieve high-level document information. Results are restricted to the current user’s files, unless the request is made by a superuser, in which case results from all users are returned:
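A sketch of an overview query, assuming the v2 client's `documents_overview` method; the parameter defaults mirror the descriptions below:

```python
# Overview query: no ID filter, starting at the first result, at most 100 rows.
query = {"document_ids": None, "offset": 0, "limit": 100}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# overview = client.documents_overview(**query)
# print(overview)
```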
An optional list of document IDs to filter the overview.
An optional value to offset the starting point of fetched results; defaults to `0`.
An optional value to limit the fetched results; defaults to `100`.
Document Chunks
Fetch and examine chunks for a particular document. Chunks represent the atomic units of text after processing:
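A sketch of fetching a document's chunks, assuming the v2 client's `document_chunks` method; the document ID is a placeholder:

```python
# Fetch the first 100 chunks of a document, without their embedding vectors.
request = {
    "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    "offset": 0,
    "limit": 100,
    "include_vectors": False,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# chunks = client.document_chunks(**request)
```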
The ID of the document to retrieve chunks for.
An optional value to offset the starting point of fetched results; defaults to `0`.
An optional value to limit the fetched results; defaults to `100`.
An optional value to return the vectors associated with each chunk; defaults to `False`.
Delete Documents
Delete a document by its ID:
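A sketch of a filtered delete, assuming the v2 client's `delete` method; the `$eq` filter syntax and the document ID come from the filter description in this section:

```python
# Delete exactly the document whose document_id equals the given UUID.
filters = {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.delete(filters=filters)
```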
A list of logical filters to perform over input document fields which identifies the unique set of documents to delete (e.g., `{"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}`). Logical operations might include variables such as `"user_id"` or `"title"` and filters like `neq`, `gte`, etc.
Vector Index Management
Create Vector Index
Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.
Create a vector index for similarity search:
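A sketch of index creation, assuming the v2 client's `create_vector_index` method; the parameter names mirror the descriptions below, and the HNSW arguments use this page's recommended starting point:

```python
# Create an HNSW index on the default "vectors" table, concurrently so other
# operations are not blocked; m=16 / ef_construction=64 is the recommended
# starting point from this page.
index_request = {
    "table_name": "vectors",
    "index_method": "hnsw",
    "index_measure": "cosine_distance",
    "index_arguments": {"m": 16, "ef_construction": 64},
    "concurrently": True,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.create_vector_index(**index_request)
```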
The table to create the index on. Options: vectors, entities_document, entities_collection, communities. Default: vectors
The indexing method to use. Options: hnsw, ivfflat, auto. Default: hnsw
Distance measure for vector comparisons. Options: cosine_distance, l2_distance, max_inner_product. Default: cosine_distance
Configuration parameters for the chosen index method.
Custom name for the index. If not provided, one will be auto-generated.
Whether to create the index concurrently. Default: True
Important Considerations
Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:
Resource Intensive Process
- Index creation can be CPU and memory intensive, especially for large datasets
- For HNSW indexes, memory usage scales with both dataset size and the `m` parameter
- Consider creating indexes during off-peak hours for production systems
Performance Tuning
- HNSW Parameters:
  - `m`: Higher values (16-64) improve search quality but increase memory usage and build time
  - `ef_construction`: Higher values increase build time and quality but have diminishing returns past 100
- Recommended starting point: `m=16`, `ef_construction=64`
Pre-warming Required
- Important: Newly created indexes require pre-warming to achieve optimal performance
- Initial queries may be slower until the index is loaded into memory
- The first several queries will automatically warm the index
- For production systems, consider implementing explicit pre-warming by running representative queries after index creation
- Without pre-warming, you may not see the expected performance improvements
Best Practices
- Always use `concurrently=True` in production to avoid blocking other operations
- Monitor system resources during index creation
- Test index performance with representative queries before deploying
- Consider creating indexes on smaller test datasets first to validate parameters
Distance Measures
Choose the appropriate measure based on your use case:
- `cosine_distance`: Best for normalized vectors (most common)
- `l2_distance`: Better for absolute distances
- `max_inner_product`: Optimized for dot product similarity
List Vector Indices
List existing vector indices for a table:
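A sketch of listing indices, assuming the v2 client's `list_vector_indices` method:

```python
# List the indices on the default "vectors" table.
table_name = "vectors"

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# indices = client.list_vector_indices(table_name=table_name)
# print(indices)
```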
The table to list indices from. Options: vectors, entities_document, entities_collection, communities. Default: vectors
Delete Vector Index
Delete a vector index from a table:
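A sketch of index deletion, assuming the v2 client's `delete_vector_index` method; the index name is a placeholder:

```python
# Drop an index concurrently so other operations are not blocked.
delete_request = {
    "index_name": "REPLACE-WITH-INDEX-NAME",
    "table_name": "vectors",
    "concurrently": True,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.delete_vector_index(**delete_request)
```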
Name of the index to delete
The table containing the index. Options: vectors, entities_document, entities_collection, communities. Default: vectors
Whether to delete the index concurrently. Default: True
Troubleshooting Common Issues
Ingestion Failures
- Check file permissions and paths
- Verify file formats are supported
- Ensure metadata length matches file_paths length
- Monitor memory usage for large files
Chunking Issues
- Large chunks may impact retrieval quality
- Small chunks may lose context
- Adjust overlap for better context preservation
Vector Index Performance
- Monitor index creation time
- Check memory usage during creation
- Verify warm-up queries are representative
- Consider index rebuild if quality degrades