Ingestion — Build, scale, and manage user-facing Retrieval-Augmented Generation applications.

Document Ingestion and Management

Ingest Files

Ingest files or directories into your R2R system using the ingest-files command:

$ r2r ingest-files path/to/file1.txt path/to/file2.txt \
>   --document-ids 9fbe403b-c11c-5aae-8ade-ef22980c3ad1 \
>   --metadatas '{"key1": "value1"}'
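
The document ID in this example is UUID-shaped. As a minimal sketch, assuming your own IDs follow the same shape, a small pre-check can catch a mistyped ID before it is passed to --document-ids. The is_uuid helper is hypothetical and not part of the r2r CLI.

```shell
# Hypothetical helper (not part of the r2r CLI): succeeds only when the
# argument has the 8-4-4-4-12 hexadecimal shape of the ID shown above.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

doc_id="9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
if is_uuid "$doc_id"; then
  echo "ok: $doc_id"
else
  echo "not a UUID: $doc_id" >&2
fi
```

The check only validates the shape of the string; it does not confirm that the document exists in your R2R system.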

Retry Failed Ingestions

Retry ingestion for documents that previously failed using the retry-ingest-files command:

$ r2r retry-ingest-files 9fbe403b-c11c-5aae-8ade-ef22980c3ad1
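
When several documents have failed, the same command can be wrapped in a loop. The sketch below issues one call per ID, because this page does not state whether retry-ingest-files accepts several IDs at once. The retry_failed function and the R2R_BIN variable are hypothetical names introduced for illustration; R2R_BIN exists only so the loop can be dry-run with echo.

```shell
# Assumed override: set R2R_BIN=echo to print the commands instead of running them.
R2R_BIN="${R2R_BIN:-r2r}"

# Hypothetical helper: retry each document ID given as an argument.
retry_failed() {
  failed=0
  for doc_id in "$@"; do
    # Same invocation form as the single-ID example above.
    if ! "$R2R_BIN" retry-ingest-files "$doc_id"; then
      echo "retry failed for $doc_id" >&2
      failed=$((failed + 1))
    fi
  done
  # Return non-zero when at least one retry did not succeed.
  [ "$failed" -eq 0 ]
}
```

For example, retry_failed 9fbe403b-c11c-5aae-8ade-ef22980c3ad1 runs the command once for that ID and reports failure through its exit status.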

Update Files

Update existing documents using the update-files command:

$ r2r update-files path/to/file1_v2.txt \
>   --document-ids 9fbe403b-c11c-5aae-8ade-ef22980c3ad1 \
>   --metadatas '{"key1": "value2"}'

Vector Index Management

Create Vector Index

Create a new vector index for similarity search using the create-vector-index command:

$ r2r create-vector-index \
>   --table-name vectors \
>   --index-method hnsw \
>   --index-measure cosine_distance \
>   --index-arguments '{"m": 16, "ef_construction": 64}'

Important Considerations

Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:

Resource Intensive Process

- Index creation can be CPU- and memory-intensive, especially for large datasets.
- For HNSW indexes, memory usage scales with both the dataset size and the m parameter.
- Consider creating indexes during off-peak hours on production systems.

Performance Tuning

HNSW Parameters:
- m: Higher values (16-64) improve search quality but increase memory usage and build time
- ef_construction: Higher values increase build time and quality but have diminishing returns past 100
- Recommended starting point: m=16, ef_construction=64

$ # Example balanced configuration
$ r2r create-vector-index \
>   --table-name vectors \
>   --index-method hnsw \
>   --index-measure cosine_distance \
>   --index-arguments '{"m": 16, "ef_construction": 64}'
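
When experimenting with different m and ef_construction values, it can help to generate the --index-arguments JSON rather than edit it by hand. The following is a minimal sketch; build_hnsw_args is a hypothetical helper, not an R2R command, and it only checks that both values are positive integers.

```shell
# Hypothetical helper: print the JSON for --index-arguments from two integers.
build_hnsw_args() {
  m="$1"
  ef="$2"
  for value in "$m" "$ef"; do
    case "$value" in
      # Reject empty strings, non-digits, zero, and leading zeros.
      ''|*[!0-9]*|0*)
        echo "expected a positive integer, got '$value'" >&2
        return 1
        ;;
    esac
  done
  printf '{"m": %d, "ef_construction": %d}\n' "$m" "$ef"
}

build_hnsw_args 16 64   # prints {"m": 16, "ef_construction": 64}
```

The output can then be passed with --index-arguments "$(build_hnsw_args 16 64)" in the create-vector-index command shown above.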

Pre-warming Required

- Important: newly created indexes require pre-warming to achieve optimal performance.
- Initial queries may be slower until the index is loaded into memory.
- The first several queries warm the index automatically.
- For production systems, consider explicit pre-warming: run representative queries right after index creation.
- Without pre-warming, you may not see the expected performance improvements.
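
Explicit pre-warming can be as simple as replaying a file of representative queries, one per line. The sketch below assumes that `r2r search --query` is the search command in your CLI version; that command does not appear on this page, so override SEARCH_CMD if yours differs. The prewarm_index function is a hypothetical name.

```shell
# Assumption: `r2r search --query <text>` runs a search in your CLI version.
SEARCH_CMD="${SEARCH_CMD:-r2r search --query}"

# Hypothetical helper: run every non-empty line of a file as a search query.
prewarm_index() {
  queries_file="$1"
  count=0
  while IFS= read -r query || [ -n "$query" ]; do
    # Skip blank lines.
    [ -n "$query" ] || continue
    # Results are discarded; the goal is only to load the index into memory.
    $SEARCH_CMD "$query" >/dev/null 2>&1 </dev/null \
      || echo "warm-up query failed: $query" >&2
    count=$((count + 1))
  done < "$queries_file"
  echo "ran $count warm-up queries"
}
```

Calling prewarm_index queries.txt after index creation runs each query once, where queries.txt is a placeholder for a file of queries typical of your production traffic.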

Best Practices

- Always use concurrent index creation (avoid --no-concurrent) in production to prevent blocking other operations.
- Monitor system resources during index creation.
- Test index performance with representative queries before deploying.
- Consider creating indexes on smaller test datasets first to validate parameters.
- Implement an index pre-warming strategy before handling production traffic.

Distance Measures

Choose the appropriate measure based on your use case:

- cosine_distance: Best for normalized vectors (most common).
- l2_distance: Better when the absolute (Euclidean) distance between vectors matters.
- max_inner_product: Optimized for dot product similarity.
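
To see how the three measures differ, the following sketch computes each one for a pair of 2-D vectors using awk. It illustrates the underlying math only; how R2R computes or scales these scores internally is not shown on this page, and compare_measures is a hypothetical helper.

```shell
# Hypothetical helper: compare the measures for vectors (ax, ay) and (bx, by).
# Assumes both vectors are non-zero, since cosine distance divides by their norms.
compare_measures() {
  awk -v ax="$1" -v ay="$2" -v bx="$3" -v by="$4" 'BEGIN {
    dot = ax * bx + ay * by
    na  = sqrt(ax * ax + ay * ay)
    nb  = sqrt(bx * bx + by * by)
    l2  = sqrt((ax - bx) * (ax - bx) + (ay - by) * (ay - by))
    printf "cosine_distance=%.4f l2_distance=%.4f inner_product=%.4f\n",
           1 - dot / (na * nb), l2, dot
  }'
}

# Same direction, different length: cosine distance is 0, L2 distance is not.
compare_measures 3 4 6 8
# prints cosine_distance=0.0000 l2_distance=5.0000 inner_product=50.0000

# Perpendicular unit vectors.
compare_measures 1 0 0 1
# prints cosine_distance=1.0000 l2_distance=1.4142 inner_product=0.0000
```

The first pair shows why cosine distance suits normalized embeddings: it ignores vector length, whereas L2 distance and the inner product both change with magnitude.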

List Vector Indices

List existing vector indices using the list-vector-indices command:

$ r2r list-vector-indices --table-name vectors

Delete Vector Index

Delete a vector index using the delete-vector-index command:

$ r2r delete-vector-index my-index-name --table-name vectors

Sample File Management

Ingest Sample Files

Ingest one or more sample files from the R2R GitHub repository:

$ # Ingest a single sample file
$ r2r ingest-sample-file

$ # Ingest a smaller version of the sample file
$ r2r ingest-sample-file --v2

$ # Ingest multiple sample files
$ r2r ingest-sample-files

These commands take no arguments apart from the optional --v2 flag on ingest-sample-file, which ingests a smaller version of the sample Aristotle text file.

Ingest Local Sample Files

Ingest the local sample files in the core/examples/data_unstructured directory:

$ r2r ingest-sample-files-from-unstructured

This command has no additional arguments. It will ingest all files found in the data_unstructured directory.
