Ingestion
Ingesting files and managing vector indices with the R2R CLI.
Document Ingestion and Management
Ingest Files
Ingest files or directories into your R2R system using the ingest-files command:
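As a minimal sketch of scripting this step, the snippet below assembles an ingest-files invocation from Python. It assumes the CLI executable is named `r2r` and that file or directory paths are passed as positional arguments. Neither detail is confirmed in this section, and `build_ingest_command` is a hypothetical helper, so verify both against your installed CLI before relying on it.

```python
import shlex


def build_ingest_command(paths, executable="r2r"):
    """Build an argument list for an ingest-files call.

    Assumptions (not confirmed by this page): the executable is named
    "r2r" and paths are accepted as positional arguments.
    """
    if not paths:
        raise ValueError("at least one file or directory path is required")
    return [executable, "ingest-files", *(str(p) for p in paths)]


# Placeholder paths for illustration only.
command = build_ingest_command(["docs/report.pdf", "docs/notes"])

# Print a shell-safe rendering for review. To actually run it, pass
# `command` to subprocess.run(command, check=True).
print(shlex.join(command))
```

Building the argument list separately from executing it makes it easy to log or review the exact command before it touches a production system.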
Retry Failed Ingestions
Retry ingestion for documents that previously failed using the retry-ingest-files command:
Update Files
Update existing documents using the update-files command:
Vector Index Management
Create Vector Index
Create a new vector index for similarity search using the create-vector-index command:
Important Considerations
Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:
Resource Intensive Process
- Index creation can be CPU and memory intensive, especially for large datasets
- For HNSW indexes, memory usage scales with both dataset size and the m parameter
- Consider creating indexes during off-peak hours for production systems
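To get a feel for how memory grows with dataset size and m, here is a back-of-the-envelope sketch. It assumes 4-byte float components and roughly 8-10 bytes of graph-link storage per element per unit of m, a rule of thumb taken from the hnswlib documentation. The index implementation behind R2R may use a different layout, so treat the numbers as order-of-magnitude guidance only. The function name and the example dataset are hypothetical.

```python
def estimate_hnsw_memory_bytes(num_vectors, dimensions, m,
                               bytes_per_link=10, bytes_per_float=4):
    """Rough HNSW memory estimate: raw vectors plus graph links.

    Assumption: link storage is about m * bytes_per_link bytes per
    element (hnswlib rule of thumb); real indexes add further overhead.
    """
    vector_bytes = num_vectors * dimensions * bytes_per_float
    link_bytes = num_vectors * m * bytes_per_link
    return vector_bytes + link_bytes


# Hypothetical dataset: one million 768-dimensional embeddings.
for m in (16, 32, 64):
    size_gib = estimate_hnsw_memory_bytes(1_000_000, 768, m) / 2**30
    print(f"m={m}: ~{size_gib:.2f} GiB")
```

For high-dimensional embeddings the raw vectors dominate the total, so raising m tends to cost more in build time than in memory.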
Performance Tuning
- HNSW Parameters:
  - m: Higher values (16-64) improve search quality but increase memory usage and build time
  - ef_construction: Higher values increase build time and quality but have diminishing returns past 100
  - Recommended starting point: m=16, ef_construction=64
Pre-warming Required
- Important: Newly created indexes require pre-warming to achieve optimal performance
- Initial queries may be slower until the index is loaded into memory
- The first several queries will automatically warm the index
- For production systems, consider implementing explicit pre-warming by running representative queries after index creation
- Without pre-warming, you may not see the expected performance improvements
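An explicit pre-warming pass can be as simple as replaying representative queries a few times and watching the latency settle. The sketch below assumes a hypothetical `search` callable that wraps whatever client you use to query R2R. The stand-in `fake_search` exists only so the example runs on its own.

```python
import time


def prewarm_index(search, queries, rounds=2):
    """Replay representative queries; return mean latency (seconds) per round.

    `search` is a hypothetical callable taking one query string. Wrap
    your real search client in a function with that shape.
    """
    if not queries:
        raise ValueError("provide at least one representative query")
    mean_latencies = []
    for _ in range(rounds):
        start = time.perf_counter()
        for query in queries:
            search(query)
        elapsed = time.perf_counter() - start
        mean_latencies.append(elapsed / len(queries))
    return mean_latencies


# Stand-in for a real search call so the sketch is self-contained.
def fake_search(query):
    return [query.upper()]


latencies = prewarm_index(fake_search, ["what is r2r?", "vector index tuning"])
print(latencies)
```

Against a real index, a drop in mean latency between the first and later rounds suggests the index has been loaded into memory.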
Best Practices
- Always use concurrent index creation (avoid --no-concurrent) in production to prevent blocking other operations
- Monitor system resources during index creation
- Test index performance with representative queries before deploying
- Consider creating indexes on smaller test datasets first to validate parameters
- Implement index pre-warming strategy before handling production traffic
Distance Measures
Choose the appropriate measure based on your use case:
- cosine_distance: Best for normalized vectors (most common)
- l2_distance: Better for absolute distances
- max_inner_product: Optimized for dot product similarity
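To make the difference between these measures concrete, the sketch below computes each one with its standard mathematical definition in plain Python. It is an illustration only: the function names are chosen for this example, and the exact scaling or sign convention the index uses internally is not specified in this section.

```python
import math


def dot(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector magnitude."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_distance(a, b):
    """Euclidean distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as query, larger magnitude
orthogonal = [0.0, 1.0]      # unit length, unrelated direction

for name, vec in (("same_direction", same_direction), ("orthogonal", orthogonal)):
    print(name,
          "cosine:", cosine_distance(query, vec),
          "l2:", l2_distance(query, vec),
          "inner:", dot(query, vec))
```

Cosine distance treats `same_direction` as a perfect match despite its length, while L2 distance ranks `orthogonal` as closer because it sits nearer in absolute terms. That is the trade-off the list above describes.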
List Vector Indices
List existing vector indices using the list-vector-indices command:
Delete Vector Index
Delete a vector index using the delete-vector-index command:
Sample File Management
Ingest Sample Files
Ingest one or more sample files from the R2R GitHub repository:
These commands take no additional arguments, apart from the --v2 flag for ingest-sample-file, which ingests a smaller version of the sample Aristotle text file.
Ingest Local Sample Files
Ingest the local sample files in the core/examples/data_unstructured directory:
This command has no additional arguments. It will ingest all files found in the data_unstructured directory.