Ingestion
Ingesting files with R2R.
This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.
Inside R2R, ingestion refers to the complete pipeline for processing input data:
- Parsing files into text
- Chunking text into semantic units
- Generating embeddings
- Storing data for retrieval
Ingested files are stored with an associated document identifier as well as a user identifier to enable comprehensive management.
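The four stages above can be sketched end to end in a few lines. This is a toy illustration only, not R2R's implementation: the function names, the fixed-size chunker, and the hash-based "embedding" are all assumptions chosen so the example runs without any model or database.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredChunk:
    document_id: str
    user_id: str
    text: str
    embedding: list[float]


def parse(raw: bytes) -> str:
    """Parsing: turn file bytes into text (UTF-8 only in this toy)."""
    return raw.decode("utf-8")


def chunk(text: str, size: int = 40) -> list[str]:
    """Chunking: fixed-size slices stand in for semantic units."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(piece: str, dims: int = 4) -> list[float]:
    """Embedding: a deterministic hash-based stand-in for a real model."""
    digest = hashlib.sha256(piece.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def ingest(raw: bytes, document_id: str, user_id: str, store: list[StoredChunk]) -> int:
    """Run all four stages and return the number of chunks stored."""
    pieces = chunk(parse(raw))
    for piece in pieces:
        # Storing: every chunk keeps its document and user identifiers.
        store.append(StoredChunk(document_id, user_id, piece, embed(piece)))
    return len(pieces)


SAMPLE = b"R2R parses, chunks, embeds, and stores each file it ingests."
store: list[StoredChunk] = []
count = ingest(SAMPLE, "doc-1", "user-1", store)
```

Keeping the document and user identifiers on every stored chunk is what makes later per-document and per-user management possible.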
Document Ingestion and Management
R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs when this is enabled in your ingestion configuration.
We recommend this method for achieving the highest quality ingestion results.
Ingest Files
Ingest files or directories into your R2R system:
Update Files
Array of files to update.
Document IDs corresponding to files being updated.
Optional metadata for updated files.
Chunking configuration options.
Whether or not ingestion runs with orchestration; the default is True. When set to False, the ingestion process runs synchronously and returns the result directly.
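As a rough sketch of how these arguments fit together, the helper below assembles them into a single request dictionary. The helper itself and the key names (`file_paths`, `document_ids`, `metadatas`, `ingestion_config`, `run_with_orchestration`) are assumptions for illustration; check the API Reference for the names your SDK version actually uses.

```python
from typing import Any, Optional


def build_update_request(
    file_paths: list[str],
    document_ids: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    ingestion_config: Optional[dict[str, Any]] = None,
    run_with_orchestration: bool = True,
) -> dict[str, Any]:
    """Hypothetical helper: collect the update arguments described above."""
    # Each file being updated needs exactly one corresponding document ID.
    if len(document_ids) != len(file_paths):
        raise ValueError("document_ids must have one entry per file")
    return {
        "file_paths": list(file_paths),
        "document_ids": list(document_ids),
        "metadatas": metadatas,
        "ingestion_config": ingestion_config or {},
        "run_with_orchestration": run_with_orchestration,
    }


request = build_update_request(
    ["report_v2.pdf"],
    ["doc-1"],
    metadatas=[{"version": "2"}],
    run_with_orchestration=False,  # run synchronously and get the result directly
)
```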
Update Chunks
Update the content of an existing chunk in your R2R system:
The ID of the document containing the chunk to update.
The ID of the specific chunk to update.
The new text content to replace the existing chunk text.
An optional metadata object for the updated chunk. If provided, this will replace the existing chunk metadata.
Whether or not the update runs with orchestration; the default is true. When set to false, the update process runs synchronously and returns the result directly.
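The point that supplied metadata replaces the existing metadata, rather than being merged into it, is easy to miss. The in-memory sketch below mimics that behavior; the `chunks` store and the `update_chunk` function are hypothetical stand-ins, not the SDK call.

```python
from typing import Any, Optional

# In-memory stand-in for stored chunks, keyed by (document_id, chunk_id).
chunks: dict[tuple[str, str], dict[str, Any]] = {
    ("doc-1", "chunk-1"): {"text": "old text", "metadata": {"page": 3, "lang": "en"}},
}


def update_chunk(
    document_id: str,
    chunk_id: str,
    text: str,
    metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Replace the chunk text; replace the metadata only if one is provided."""
    record = chunks[(document_id, chunk_id)]
    record["text"] = text
    if metadata is not None:
        # Full replacement: keys absent from the new object are dropped.
        record["metadata"] = dict(metadata)
    return record


first = dict(update_chunk("doc-1", "chunk-1", "new text"))
second = dict(update_chunk("doc-1", "chunk-1", "newer text", {"page": 4}))
```

After the first call the original metadata is untouched; after the second, only `page` remains, so callers who want to keep existing keys need to send them again.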
Documents Overview
Retrieve high-level document information:
Results are restricted to the current user’s files unless the request is made by a superuser.
Optional array of document IDs to filter results.
Starting point for pagination, defaults to 0.
Maximum number of results to return, defaults to 100.
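The visibility rule and the pagination defaults can be illustrated with a small in-memory model. Everything here (the `DOCUMENTS` list, the `documents_overview` function and its argument names) is a hypothetical sketch of the described behavior, not the server's code.

```python
from typing import Optional

DOCUMENTS = [
    {"id": "doc-1", "user_id": "alice"},
    {"id": "doc-2", "user_id": "bob"},
    {"id": "doc-3", "user_id": "alice"},
]


def documents_overview(
    requesting_user: str,
    is_superuser: bool = False,
    document_ids: Optional[list[str]] = None,
    offset: int = 0,
    limit: int = 100,
) -> list[dict]:
    """Return the documents the caller may see, filtered and paginated."""
    # Regular users only see their own files; superusers see everything.
    visible = [d for d in DOCUMENTS if is_superuser or d["user_id"] == requesting_user]
    if document_ids is not None:
        wanted = set(document_ids)
        visible = [d for d in visible if d["id"] in wanted]
    return visible[offset:offset + limit]


alice_docs = documents_overview("alice")
all_docs = documents_overview("alice", is_superuser=True)
second_page = documents_overview("alice", offset=1, limit=1)
```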
Document Chunks
Fetch and examine chunks for a particular document:
These chunks represent the atomic units of text after processing.
ID of the document to retrieve chunks for.
Starting point for pagination, defaults to 0.
Maximum number of chunks to return, defaults to 100.
Whether to include embedding vectors in the response.
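Because results are capped per request, documents with many chunks have to be read page by page. The generator below is one way to do that with the offset and limit parameters; `fetch_page` is a hypothetical callable standing in for whatever SDK call returns one page of chunks, and `fake_fetch` exists only to make the sketch runnable.

```python
from typing import Callable, Iterator


def iter_chunks(
    fetch_page: Callable[[int, int], list[dict]],
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every chunk by advancing the offset until a short page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            return  # a short (or empty) page means there is nothing left
        offset += limit


# Fake backend holding 250 chunks, used only for demonstration.
ALL_CHUNKS = [{"chunk_id": f"chunk-{i}", "text": f"text {i}"} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> list[dict]:
    return ALL_CHUNKS[offset:offset + limit]


collected = list(iter_chunks(fake_fetch))
```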
Delete Documents
Delete documents using filters:
Filter conditions to identify documents for deletion.
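The sketch below shows the general idea of filter-based deletion on an in-memory list. The operator syntax (`$eq`, `$in`) and both function names are assumptions made for illustration; consult the API Reference for the filter grammar R2R actually accepts.

```python
from typing import Any


def matches(doc: dict[str, Any], filters: dict[str, dict[str, Any]]) -> bool:
    """Return True if the document satisfies every filter condition."""
    for field_name, condition in filters.items():
        for op, expected in condition.items():
            if op == "$eq":
                if doc.get(field_name) != expected:
                    return False
            elif op == "$in":
                if doc.get(field_name) not in expected:
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True


def delete_documents(docs: list[dict], filters: dict) -> tuple[list[dict], int]:
    """Return the surviving documents and the number deleted."""
    if not filters:
        # Guard against accidentally deleting everything.
        raise ValueError("refusing to delete with empty filters")
    kept = [d for d in docs if not matches(d, filters)]
    return kept, len(docs) - len(kept)


docs = [
    {"document_id": "doc-1", "source": "upload"},
    {"document_id": "doc-2", "source": "crawler"},
    {"document_id": "doc-3", "source": "crawler"},
]
kept, deleted = delete_documents(docs, {"document_id": {"$eq": "doc-2"}})
```

Rejecting an empty filter is a design choice of this sketch: a filter that matches everything is rarely what the caller intended.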
Vector Index Management
Create Vector Index
Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.
Create a vector index for similarity search:
Table to create index on: vectors, entities_document, entities_collection, communities.
Index method: hnsw, ivfflat, or auto.
Distance measure: cosine_distance, l2_distance, or max_inner_product.
Configuration for chosen index method.
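Since each option accepts only a fixed set of values, it can help to validate them before sending a request. The helper below is a hypothetical sketch: the allowed values come from the parameter descriptions above, while the argument names (`table_name`, `index_method`, `index_measure`, `index_arguments`) and the sample HNSW values are assumptions.

```python
from typing import Any, Optional

TABLES = {"vectors", "entities_document", "entities_collection", "communities"}
METHODS = {"hnsw", "ivfflat", "auto"}
MEASURES = {"cosine_distance", "l2_distance", "max_inner_product"}


def build_index_request(
    table_name: str,
    index_method: str,
    index_measure: str,
    index_arguments: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Validate the options against the documented values and bundle them."""
    if table_name not in TABLES:
        raise ValueError(f"unknown table: {table_name}")
    if index_method not in METHODS:
        raise ValueError(f"unknown index method: {index_method}")
    if index_measure not in MEASURES:
        raise ValueError(f"unknown distance measure: {index_measure}")
    return {
        "table_name": table_name,
        "index_method": index_method,
        "index_measure": index_measure,
        "index_arguments": index_arguments or {},
    }


index_request = build_index_request(
    "vectors",
    "hnsw",
    "cosine_distance",
    {"m": 16, "ef_construction": 64},  # illustrative HNSW settings
)
```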
List Vector Indices
List existing indices:
Delete Vector Index
Remove an existing index:
Best Practices and Performance Optimization
Vector Index Configuration
- HNSW Parameters:
  - m: Higher values (16-64) improve search quality but increase memory
  - ef_construction: Higher values improve quality but slow construction
  - Recommended starting point: m=16, ef_construction=64
- Distance Measures:
  - cosine_distance: Best for normalized vectors (most common)
  - l2_distance: Better for absolute distances
  - max_inner_product: Optimized for dot product similarity
- Production Considerations:
  - Always use concurrently: true to avoid blocking operations
  - Create indexes during off-peak hours
  - Pre-warm indices with representative queries
  - Monitor memory usage during creation
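To make the three distance measures concrete, the snippet below computes each one in plain Python using the standard mathematical definitions (it does not call R2R or the database). It also shows why the choice matters less for normalized vectors: when every vector has unit length, all three measures rank candidates in the same order.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; ignores vector magnitude."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


query = normalize([1.0, 2.0, 2.0])
candidates = {
    "a": normalize([1.0, 2.0, 1.0]),
    "b": normalize([-2.0, 1.0, 0.0]),
    "c": normalize([3.0, 0.0, 4.0]),
}

# Smaller distance is better; larger inner product is better.
by_cosine = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
by_l2 = sorted(candidates, key=lambda k: l2_distance(query, candidates[k]))
by_inner = sorted(candidates, key=lambda k: -dot(query, candidates[k]))
```

With unnormalized vectors the rankings can diverge, which is when the distinction between the measures starts to matter.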
Chunking Strategy
- Size Guidelines:
  - Avoid chunks >1024 characters for retrieval quality
  - Keep chunks >64 characters to maintain context
  - Use overlap for better context preservation
- Method Selection:
  - Use by_title for structured documents
  - Use basic for uniform text content
  - Consider recursive for nested content
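The size and overlap guidelines can be illustrated with a minimal character-based chunker. This is a simplified sketch, not any of R2R's chunking methods; the function name and the default values of 512 and 64 characters are assumptions picked to sit inside the guideline bounds.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample)
```

Each chunk begins with the last 64 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. The final chunk may be shorter than the others; merging it into its predecessor is one option if it falls below the minimum size.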
Troubleshooting
Common Issues
- Ingestion Failures:
  - Verify file permissions and paths
  - Check file format support
  - Ensure metadata array length matches files
  - Monitor memory for large files
- Vector Index Performance:
  - Check index creation time
  - Monitor memory usage
  - Verify warm-up queries
  - Consider rebuilding if quality degrades
- Chunking Issues:
  - Adjust overlap for context preservation
  - Monitor chunk sizes
  - Verify language detection
  - Check encoding for special characters
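Several of the ingestion checks above can be run locally before any request is sent. The `preflight` function below is a hypothetical helper, and its `supported_extensions` default is a placeholder rather than R2R's actual list of supported formats.

```python
import os
from typing import Optional


def preflight(
    file_paths: list[str],
    metadatas: Optional[list[dict]] = None,
    supported_extensions: frozenset = frozenset({".txt", ".pdf", ".md"}),
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if metadatas is not None and len(metadatas) != len(file_paths):
        problems.append(
            f"metadata has {len(metadatas)} entries for {len(file_paths)} files"
        )
    for path in file_paths:
        if not os.path.isfile(path):
            problems.append(f"{path}: file not found")
        elif not os.access(path, os.R_OK):
            problems.append(f"{path}: file is not readable")
        elif os.path.splitext(path)[1].lower() not in supported_extensions:
            problems.append(f"{path}: unsupported file format")
    return problems


problems = preflight(["does_not_exist.pdf"])
```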