Ingesting files and managing vector indices with the R2R CLI.

Document Ingestion and Management

Ingest Files

Ingest files or directories into your R2R system using the ingest-files command:

$r2r ingest-files path/to/file1.txt path/to/file2.txt \
> --document-ids 9fbe403b-c11c-5aae-8ade-ef22980c3ad1 \
> --metadatas '{"key1": "value1"}'
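When ingestion is scripted, the JSON passed to --metadatas is easy to misquote. The sketch below builds the same invocation from Python and prints a shell-safe command line. The helper name `build_ingest_command` is hypothetical, and it uses only the flags shown above with one file and one document ID; how multiple IDs map to multiple files is not covered here.

```python
import json
import shlex


def build_ingest_command(path, document_id, metadata):
    """Hypothetical helper: assemble an `r2r ingest-files` argv list.

    Only the flags shown in the example above are used.
    """
    return [
        "r2r", "ingest-files", path,
        "--document-ids", document_id,
        # json.dumps produces valid JSON; shell quoting is handled later.
        "--metadatas", json.dumps(metadata),
    ]


argv = build_ingest_command(
    "path/to/file1.txt",
    "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    {"key1": "value1"},
)

# Print a copy-pasteable command line; nothing is executed here.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting entirely, which is usually the safer choice when metadata values contain quotes or spaces.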

Retry Failed Ingestions

Retry ingestion for documents that previously failed using the retry-ingest-files command:

$r2r retry-ingest-files 9fbe403b-c11c-5aae-8ade-ef22980c3ad1

Update Files

Update existing documents using the update-files command:

$r2r update-files path/to/file1_v2.txt \
> --document-ids 9fbe403b-c11c-5aae-8ade-ef22980c3ad1 \
> --metadatas '{"key1": "value2"}'

Vector Index Management

Create Vector Index

Create a new vector index for similarity search using the create-vector-index command:

$r2r create-vector-index \
> --table-name vectors \
> --index-method hnsw \
> --index-measure cosine_distance \
> --index-arguments '{"m": 16, "ef_construction": 64}'

Important Considerations

Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:

Resource Intensive Process

  • Index creation can be CPU and memory intensive, especially for large datasets
  • For HNSW indexes, memory usage scales with both dataset size and m parameter
  • Consider creating indexes during off-peak hours for production systems
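To get a feel for how memory grows with dataset size and m, a back-of-the-envelope estimate can help. The sketch below is a rule of thumb, not a measurement of R2R or its database: it assumes float32 vectors, up to 2*m links per vector on the base HNSW layer, and a hypothetical 8 bytes per link, and it ignores upper layers and storage overhead.

```python
def estimate_hnsw_bytes(num_vectors, dim, m, bytes_per_link=8):
    """Rough estimate of HNSW index size in bytes.

    Assumptions (not taken from R2R): 4-byte float components,
    2 * m links per vector on the base layer, `bytes_per_link`
    bytes per link, upper layers and page overhead ignored.
    """
    vector_bytes = dim * 4
    link_bytes = 2 * m * bytes_per_link
    return num_vectors * (vector_bytes + link_bytes)


# Hypothetical dataset: one million 768-dimensional vectors.
for m in (16, 64):
    gib = estimate_hnsw_bytes(1_000_000, 768, m) / 2**30
    print(f"m={m}: about {gib:.2f} GiB")
```

Treat the result as a lower bound for planning purposes and confirm it against the actual index size on a test dataset.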

Performance Tuning

  1. HNSW Parameters:
    • m: Higher values (16-64) improve search quality but increase memory usage and build time
    • ef_construction: Higher values increase build time and quality but have diminishing returns past 100
    • Recommended starting point: m=16, ef_construction=64

$# Example balanced configuration
>r2r create-vector-index \
> --table-name vectors \
> --index-method hnsw \
> --index-measure cosine_distance \
> --index-arguments '{"m": 16, "ef_construction": 64}'
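When comparing several parameter combinations on a test dataset, it can be convenient to generate the commands rather than type them. The following sketch only builds and prints command lines using the flags shown above; the helper name and the candidate values are hypothetical, and nothing is executed.

```python
import json
import shlex


def build_create_index_command(m, ef_construction):
    """Hypothetical helper: argv for `r2r create-vector-index` with HNSW."""
    index_arguments = json.dumps({"m": m, "ef_construction": ef_construction})
    return [
        "r2r", "create-vector-index",
        "--table-name", "vectors",
        "--index-method", "hnsw",
        "--index-measure", "cosine_distance",
        "--index-arguments", index_arguments,
    ]


# Illustrative candidates, starting from the recommended m=16, ef_construction=64.
candidates = [(16, 64), (32, 64), (32, 128)]
commands = [build_create_index_command(m, ef) for m, ef in candidates]

for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be run by hand, or the argv list can be handed to `subprocess.run(argv, check=True)` once the parameters have been reviewed.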

Pre-warming Required

  • Important: Newly created indexes require pre-warming to achieve optimal performance
  • Initial queries may be slower until the index is loaded into memory
  • The first several queries will automatically warm the index
  • For production systems, consider implementing explicit pre-warming by running representative queries after index creation
  • Without pre-warming, you may not see the expected performance improvements
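Explicit pre-warming can be as simple as replaying a handful of representative queries and watching the latency settle. The sketch below is deliberately independent of any R2R API: `run_query` is a placeholder for whatever function issues a search in your setup, and `fake_search` exists only so the example runs on its own.

```python
import time


def prewarm(run_query, queries, rounds=2):
    """Run each query `rounds` times and return per-query latencies in seconds."""
    timings = {}
    for query in queries:
        latencies = []
        for _ in range(rounds):
            start = time.perf_counter()
            run_query(query)
            latencies.append(time.perf_counter() - start)
        timings[query] = latencies
    return timings


def fake_search(query):
    """Stand-in for a real search call (assumption for illustration only)."""
    return [query.lower()]


representative_queries = ["Who was Aristotle?", "What is a syllogism?"]
for query, latencies in prewarm(fake_search, representative_queries).items():
    print(query, [f"{t * 1000:.3f} ms" for t in latencies])
```

With a real search function, the first round for each query would be expected to be slower than later rounds; once the numbers level off, the index is likely resident in memory.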

Best Practices

  1. Always use concurrent index creation (avoid --no-concurrent) in production to prevent blocking other operations
  2. Monitor system resources during index creation
  3. Test index performance with representative queries before deploying
  4. Consider creating indexes on smaller test datasets first to validate parameters
  5. Implement index pre-warming strategy before handling production traffic

Distance Measures

Choose the appropriate measure based on your use case:

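  • cosine_distance: Compares vector direction and ignores magnitude; well suited to normalized vectors (most common)
  • l2_distance: Euclidean distance; better when the absolute distance between vectors matters
  • max_inner_product: Ranks by dot product; optimized for dot product similarity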
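The difference between the measures is easiest to see on small vectors. The sketch below implements the three quantities in plain Python for illustration (the function names are not R2R APIs) and shows that scaling a vector changes L2 distance and inner product but not cosine distance.

```python
import math


def dot(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; depends on direction only."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a, b = [1.0, 0.0], [2.0, 0.0]  # same direction, different magnitude
print("cosine:", cosine_distance(a, b))
print("l2:", l2_distance(a, b))
print("inner product:", dot(a, b))

# For unit vectors the measures are tied together:
# squared L2 distance equals twice the cosine distance.
u, v = normalize([3.0, 4.0]), normalize([4.0, 3.0])
print(l2_distance(u, v) ** 2, 2 * cosine_distance(u, v))
```

Because the three measures produce the same ranking on unit-length vectors, the choice matters most when embeddings are not normalized.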

List Vector Indices

List existing vector indices using the list-vector-indices command:

$r2r list-vector-indices --table-name vectors

Delete Vector Index

Delete a vector index using the delete-vector-index command:

$r2r delete-vector-index my-index-name --table-name vectors

Sample File Management

Ingest Sample Files

Ingest one or more sample files from the R2R GitHub repository:

$# Ingest a single sample file
>r2r ingest-sample-file
>
># Ingest a smaller version of the sample file
>r2r ingest-sample-file --v2
>
># Ingest multiple sample files
>r2r ingest-sample-files

Apart from the --v2 flag on ingest-sample-file, these commands take no additional arguments. The --v2 flag ingests a smaller version of the sample Aristotle text file.

Ingest Local Sample Files

Ingest the local sample files in the core/examples/data_unstructured directory:

$r2r ingest-sample-files-from-unstructured

This command has no additional arguments. It will ingest all files found in the data_unstructured directory.