Ingestion
Ingesting files with R2R.
This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.
Inside R2R, ingestion refers to the complete pipeline for processing input data:
- Parsing files into text
- Chunking text into semantic units
- Generating embeddings
- Storing data for retrieval
Ingested files are stored with an associated document identifier as well as a user identifier to enable comprehensive management.
Document Ingestion and Management
R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs by passing the following in your ingestion configuration:
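A sketch of such an override, assuming the `parser_overrides` key and the `"zerox"` parser name used by recent R2R versions; verify both against your deployment's configuration reference:

```python
# Hypothetical ingestion config override routing PDFs through a multimodal
# (vision-model) parser; check key and parser names against your R2R version.
ingestion_config = {
    "parser_overrides": {
        "pdf": "zerox"  # parse PDF pages with a multimodal foundation model
    }
}
```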
We recommend this method for achieving the highest quality ingestion results.
Ingest Files
Ingest files or directories into your R2R system:
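A minimal sketch of an ingestion call, assuming the v2 Python client's `ingest_files` method and its `file_paths`/`metadatas` parameter names; the file paths and titles here are placeholders:

```python
# Build the ingestion request; metadatas is optional, but if provided it
# must have the same length as file_paths.
file_paths = ["data/manual.pdf", "data/notes.txt"]
metadatas = [
    {"title": "Product Manual"},
    {"title": "Meeting Notes"},
]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.ingest_files(file_paths=file_paths, metadatas=metadatas)
# print(response)
```

Passing a directory path in `file_paths` ingests every file under it, including subdirectories.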
An ingested file is parsed, chunked, embedded, and stored inside your R2R system. The stored information includes a document identifier, a corresponding user identifier, and other metadata. Knowledge graph creation is done separately, at the collection level. Refer to the ingestion configuration section for comprehensive details on available options.
A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.
An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.
An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.
An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Understanding Ingestion Status
After calling `ingest_files`, the response includes important status information:
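The exact fields vary across R2R versions; with orchestration enabled, a response resembles this illustrative shape (all values here are made up for illustration, not real output):

```python
# Illustrative response shape only; field names differ by R2R version.
# With orchestration, ingestion is queued and each file gets a task message
# plus the document identifier it was assigned.
response = {
    "results": [
        {
            "message": "Ingestion task queued successfully.",
            "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
        }
    ]
}
```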
We have added support for contextual chunk enrichment! You can learn more about it here.
Currently, you need to enable it in your ingestion config:
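A sketch of the toggle, assuming the `chunk_enrichment_settings` block and `enable_chunk_enrichment` flag names; confirm both against your version's configuration reference:

```python
# Hypothetical chunk-enrichment toggle; verify key names for your R2R version.
ingestion_config = {
    "chunk_enrichment_settings": {
        "enable_chunk_enrichment": True,
    }
}
```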
Ingest Chunks
The `ingest_chunks` method allows direct ingestion of pre-processed text, bypassing the standard parsing pipeline. This is useful for:
- Custom preprocessing pipelines
- Streaming data ingestion
- Working with non-file data sources
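A sketch of direct chunk ingestion, assuming the v2 client's `ingest_chunks` method and `chunks`/`metadata` parameter names; the chunk texts are placeholders:

```python
# Each chunk needs at least a "text" key; "metadata" is optional per chunk.
chunks = [
    {"text": "Aristotle was a Greek philosopher.", "metadata": {"source": "notes"}},
    {"text": "He studied under Plato in Athens.", "metadata": {"source": "notes"}},
]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.ingest_chunks(chunks=chunks, metadata={"title": "Aristotle"})
```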
A list of chunk dictionaries to ingest. Each dictionary should contain at least a “text” key with the chunk text. An optional “metadata” key can contain a dictionary of metadata for the chunk.
An optional document ID to assign to the ingested chunks. If not provided, a new document ID will be generated.
An optional metadata dictionary for the document.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Update Files
Update existing documents while maintaining version history:
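A sketch of a file update, assuming the v2 client's `update_files` method, which pairs each new file with the document ID it replaces; the path and ID here are placeholders:

```python
# Each entry in file_paths replaces the document with the ID at the same
# position in document_ids; the two lists must have equal length.
file_paths = ["data/manual_v2.pdf"]
document_ids = ["9fbe403b-c11c-5aae-8ade-ef22980c3ad1"]

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.update_files(file_paths=file_paths, document_ids=document_ids)
```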
The ingestion configuration can be customized analogously to the ingest files endpoint above.
A list of file paths to update.
A list of document IDs corresponding to the files being updated. When not provided, an attempt is made to generate the correct document ID from the given user ID and file path.
An optional list of metadata dictionaries for the updated files.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.
Whether or not ingestion runs with orchestration; defaults to `True`. When set to `False`, the ingestion process runs synchronously and returns the result directly.
Update Chunks
Update the content of an existing chunk in your R2R system:
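A sketch of a chunk update, assuming the v2 client's `update_chunks` method and these parameter names; the IDs are placeholders to replace with real values:

```python
# Parameters for a chunk update; "metadata", if provided, replaces the
# chunk's existing metadata rather than merging with it.
params = {
    "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    "chunk_id": "REPLACE-WITH-CHUNK-UUID",
    "text": "Updated chunk text with corrected figures.",
    "metadata": {"reviewed": True},
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.update_chunks(**params)
```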
The ID of the document containing the chunk to update.
The ID of the specific chunk to update.
The new text content to replace the existing chunk text.
An optional metadata dictionary for the updated chunk. If provided, this will replace the existing chunk metadata.
Whether or not the update runs with orchestration; defaults to `True`. When set to `False`, the update process runs synchronously and returns the result directly.
Documents Overview
Retrieve high-level document information. Results are restricted to the current user’s files, unless the request is made by a superuser, in which case results from all users are returned:
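A sketch of an overview query, assuming the v2 client's `documents_overview` method; the parameter defaults mirror the descriptions below:

```python
# Overview query: no ID filter, starting at the first result, at most 100 rows.
query = {"document_ids": None, "offset": 0, "limit": 100}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# overview = client.documents_overview(**query)
# print(overview)
```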
An optional list of document IDs to filter the overview.
An optional value to offset the starting point of fetched results; defaults to `0`.
An optional value to limit the fetched results; defaults to `100`.
Document Chunks
Fetch and examine chunks for a particular document. Chunks represent the atomic units of text after processing:
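A sketch of fetching a document's chunks, assuming the v2 client's `document_chunks` method; the document ID is a placeholder:

```python
# Fetch the first 100 chunks of a document, without their embedding vectors.
request = {
    "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    "offset": 0,
    "limit": 100,
    "include_vectors": False,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# chunks = client.document_chunks(**request)
```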
The ID of the document to retrieve chunks for.
An optional value to offset the starting point of fetched results; defaults to `0`.
An optional value to limit the fetched results; defaults to `100`.
An optional value to return the vectors associated with each chunk; defaults to `False`.
Delete Documents
Delete a document by its ID:
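A sketch of a filtered delete, assuming the v2 client's `delete` method; the `$eq` filter syntax and the document ID come from the filter description in this section:

```python
# Delete exactly the document whose document_id equals the given UUID.
filters = {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.delete(filters=filters)
```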
A list of logical filters to perform over input document fields which identifies the unique set of documents to delete (e.g., `{"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}`). Logical operations might include variables such as `"user_id"` or `"title"` and filters like `neq`, `gte`, etc.
Vector Index Management
Create Vector Index
Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.
Create a vector index for similarity search:
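A sketch of index creation, assuming the v2 client's `create_vector_index` method; the parameter names mirror the descriptions below, and the HNSW arguments use this page's recommended starting point:

```python
# Create an HNSW index on the default "vectors" table, concurrently so other
# operations are not blocked; m=16 / ef_construction=64 is the recommended
# starting point from this page.
index_request = {
    "table_name": "vectors",
    "index_method": "hnsw",
    "index_measure": "cosine_distance",
    "index_arguments": {"m": 16, "ef_construction": 64},
    "concurrently": True,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.create_vector_index(**index_request)
```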
The table to create the index on. Options: vectors, entities_document, entities_collection, communities. Default: vectors
The indexing method to use. Options: hnsw, ivfflat, auto. Default: hnsw
Distance measure for vector comparisons. Options: cosine_distance, l2_distance, max_inner_product. Default: cosine_distance
Configuration parameters for the chosen index method.
Custom name for the index. If not provided, one will be auto-generated.
Whether to create the index concurrently. Default: True
Important Considerations
Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:
Resource Intensive Process
- Index creation can be CPU and memory intensive, especially for large datasets
- For HNSW indexes, memory usage scales with both dataset size and the `m` parameter
- Consider creating indexes during off-peak hours for production systems
Performance Tuning
- HNSW Parameters:
  - `m`: Higher values (16-64) improve search quality but increase memory usage and build time
  - `ef_construction`: Higher values increase build time and quality but have diminishing returns past 100
- Recommended starting point: `m=16`, `ef_construction=64`
Pre-warming Required
- Important: Newly created indexes require pre-warming to achieve optimal performance
- Initial queries may be slower until the index is loaded into memory
- The first several queries will automatically warm the index
- For production systems, consider implementing explicit pre-warming by running representative queries after index creation
- Without pre-warming, you may not see the expected performance improvements
Best Practices
- Always use `concurrently=True` in production to avoid blocking other operations
- Monitor system resources during index creation
- Test index performance with representative queries before deploying
- Consider creating indexes on smaller test datasets first to validate parameters
Distance Measures
Choose the appropriate measure based on your use case:
- `cosine_distance`: Best for normalized vectors (most common)
- `l2_distance`: Better for absolute distances
- `max_inner_product`: Optimized for dot product similarity
List Vector Indices
List existing vector indices for a table:
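A sketch of listing indices, assuming the v2 client's `list_vector_indices` method:

```python
# List the indices on the default "vectors" table.
table_name = "vectors"

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# indices = client.list_vector_indices(table_name=table_name)
# print(indices)
```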
The table to list indices from. Options: vectors, entities_document, entities_collection, communities. Default: vectors
Delete Vector Index
Delete a vector index from a table:
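A sketch of index deletion, assuming the v2 client's `delete_vector_index` method; the index name is a placeholder:

```python
# Drop an index concurrently so other operations are not blocked.
delete_request = {
    "index_name": "REPLACE-WITH-INDEX-NAME",
    "table_name": "vectors",
    "concurrently": True,
}

# With a running R2R server:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")
# response = client.delete_vector_index(**delete_request)
```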
Name of the index to delete
The table containing the index. Options: vectors, entities_document, entities_collection, communities. Default: vectors
Whether to delete the index concurrently. Default: True
Troubleshooting Common Issues
Ingestion Failures
- Check file permissions and paths
- Verify file formats are supported
- Ensure metadata length matches file_paths length
- Monitor memory usage for large files
Chunking Issues
- Large chunks may impact retrieval quality
- Small chunks may lose context
- Adjust overlap for better context preservation
Vector Index Performance
- Monitor index creation time
- Check memory usage during creation
- Verify warm-up queries are representative
- Consider index rebuild if quality degrades