Ingesting files with R2R.

This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.

Inside R2R, ingestion refers to the complete pipeline for processing input data:

  • Parsing files into text
  • Chunking text into semantic units
  • Generating embeddings
  • Storing data for retrieval

Ingested files are stored with an associated document identifier as well as a user identifier to enable comprehensive management.
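The four stages can be sketched end to end. Everything below is illustrative: the parser, chunker, and embedder are trivial stand-ins, not R2R internals.

```python
import uuid

def parse(raw: bytes) -> str:
    # Stand-in parser: real ingestion handles PDFs, images, etc.
    return raw.decode("utf-8")

def chunk(text: str, size: int = 40) -> list[str]:
    # Stand-in chunker: fixed-size character windows
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    # Stand-in embedder: one "dimension" per chunk, its length
    return [[float(len(c))] for c in chunks]

def ingest(raw: bytes, user_id: str) -> dict:
    text = parse(raw)
    pieces = chunk(text)
    vectors = embed(pieces)
    # Store chunks alongside document and user identifiers
    return {
        "document_id": str(uuid.uuid4()),
        "user_id": user_id,
        "chunks": list(zip(pieces, vectors)),
    }

record = ingest(b"Aristotle was a Greek philosopher and polymath.", user_id="user-1")
```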

Document Ingestion and Management

R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs, as shown here, by passing the following in your ingestion configuration:

"ingestion_config": {
    ...,
    "parser_overrides": {
        "pdf": "zerox"
    }
}

We recommend this method for achieving the highest quality ingestion results.
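At ingestion time the override travels inside the ingestion_config dict. The snippet below only builds the config; the file path in the commented call is hypothetical, and the call itself needs a running R2R instance.

```python
ingestion_config = {
    "parser_overrides": {
        "pdf": "zerox"  # route PDFs through the multimodal parser
    }
}

# Hypothetical usage against a live client:
# client.ingest_files(file_paths=["report.pdf"], ingestion_config=ingestion_config)
```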

Ingest Files

Ingest files or directories into your R2R system:

file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

# Ingestion configuration for `R2R Full`
ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    # Runtime chunking configuration
    ingestion_config={
        "provider": "unstructured_local",  # Local processing
        "strategy": "auto",  # Automatic processing strategy
        "chunking_strategy": "by_title",  # Split on title boundaries
        "new_after_n_chars": 256,  # Start new chunk (soft limit)
        "max_characters": 512,  # Maximum chunk size (hard limit)
        "combine_under_n_chars": 64,  # Minimum chunk size
        "overlap": 100,  # Character overlap between chunks
        "chunk_enrichment_settings": {  # Document enrichment settings
            "enable_chunk_enrichment": False,
        }
    }
)

An ingested file is parsed, chunked, embedded, and stored inside your R2R system. The stored information includes a document identifier, a corresponding user identifier, and other metadata. Knowledge graph creation is handled separately, at the collection level. Refer to the ingestion configuration section for comprehensive details on available options.
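To build intuition for max_characters and overlap, here is a minimal sliding-window chunker that mimics their semantics. It is illustrative only: the actual splitting is done by the configured provider, which is title- and structure-aware.

```python
def chunk_text(text: str, max_characters: int = 512, overlap: int = 100) -> list[str]:
    # Hard-capped windows where consecutive chunks share `overlap` characters
    if max_characters <= overlap:
        raise ValueError("max_characters must exceed overlap")
    step = max_characters - overlap
    return [text[i:i + max_characters] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, max_characters=512, overlap=100)
```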

file_paths
list[str] Required

A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.

metadatas
Optional[list[dict]]

An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.

document_ids
Optional[list[Union[UUID, str]]]

An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.

versions
Optional[list[str]]

An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.

ingestion_config
Optional[Union[dict, IngestionConfig]]

The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.

run_with_orchestration
Optional[bool]

Whether or not ingestion runs with orchestration; defaults to True. When set to False, the ingestion process runs synchronously and returns the result directly.

Understanding Ingestion Status

After calling ingest_files, the response includes important status information:

# Successful ingestion
{
    'message': 'Ingestion task queued successfully.',
    'task_id': '6e27dfca-606d-422d-b73f-2d9e138661b4',
    'document_id': 'c3291abf-8a4e-5d9d-80fd-232ef6fd8526'
}

# Check document status later
doc_status = client.documents_overview(
    document_ids=['c3291abf-8a4e-5d9d-80fd-232ef6fd8526']
)
# ingestion_status will be one of: 'pending', 'processing', 'success', 'failed'
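Because orchestrated ingestion is asynchronous, a common pattern is to poll documents_overview until a terminal status appears. The helper below sketches that loop; the StubClient is a stand-in for a real client, and the real response shape may differ from these plain dicts.

```python
import time

def wait_for_ingestion(client, document_id, timeout=60.0, poll_interval=0.01):
    # Poll until ingestion succeeds, fails, or the timeout elapses
    status = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        overview = client.documents_overview(document_ids=[document_id])
        status = overview[0]["ingestion_status"]
        if status in ("success", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"document {document_id} still {status!r}")

class StubClient:
    # Simulates a document that is 'processing' twice, then 'success'
    def __init__(self):
        self.calls = 0

    def documents_overview(self, document_ids):
        self.calls += 1
        status = "processing" if self.calls < 3 else "success"
        return [{"id": document_ids[0], "ingestion_status": status}]

result = wait_for_ingestion(StubClient(), "c3291abf-8a4e-5d9d-80fd-232ef6fd8526")
```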

We have added support for contextual chunk enrichment! You can learn more about it here.

Currently, you need to enable it in your ingestion config:

[ingestion.chunk_enrichment_settings]
    enable_chunk_enrichment = true # disabled by default
    strategies = ["semantic", "neighborhood"]
    forward_chunks = 3 # Look ahead 3 chunks
    backward_chunks = 3 # Look behind 3 chunks
    semantic_neighbors = 10 # Find 10 semantically similar chunks
    semantic_similarity_threshold = 0.7 # Minimum similarity score
    generation_config = { model = "openai/gpt-4o-mini" }

Ingest Chunks

The ingest_chunks method allows direct ingestion of pre-processed text, bypassing the standard parsing pipeline. This is useful for:

  • Custom preprocessing pipelines
  • Streaming data ingestion
  • Working with non-file data sources
chunks = [
    {
        "text": "Aristotle was a Greek philosopher...",
    },
    ...,
    {
        "text": "He was born in 384 BC in Stagira..."
    }
]

ingest_response = client.ingest_chunks(
    chunks=chunks,
    metadata={"title": "Aristotle", "source": "wikipedia"}
)
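A custom preprocessing pipeline just needs to emit dictionaries of that shape. The helper below is illustrative (not part of the SDK) and shows one way to attach per-chunk metadata:

```python
def to_chunks(paragraphs: list[str], source: str) -> list[dict]:
    # Build ingest_chunks-style payloads from pre-split text,
    # skipping empty paragraphs and recording original positions
    return [
        {"text": p, "metadata": {"source": source, "position": i}}
        for i, p in enumerate(paragraphs)
        if p.strip()
    ]

chunks = to_chunks(
    ["Aristotle was a Greek philosopher...", "", "He was born in 384 BC in Stagira..."],
    source="wikipedia",
)
```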
chunks
list[dict] Required

A list of chunk dictionaries to ingest. Each dictionary should contain at least a “text” key with the chunk text. An optional “metadata” key can contain a dictionary of metadata for the chunk.

document_id
Optional[UUID]

An optional document ID to assign to the ingested chunks. If not provided, a new document ID will be generated.

metadata
Optional[dict]

An optional metadata dictionary for the document.

run_with_orchestration
Optional[bool]

Whether or not ingestion runs with orchestration; defaults to True. When set to False, the ingestion process runs synchronously and returns the result directly.

Update Files

Update existing documents while maintaining version history:

# Basic update with new metadata
update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{
        "status": "reviewed"
    }]
)

# Update with custom chunking
update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    ingestion_config={
        "chunking_strategy": "by_title",
        "max_characters": 1024  # Larger chunks for this version
    }
)

The ingestion configuration can be customized analogously to the ingest files endpoint above.

file_paths
list[str] Required

A list of file paths to update.

document_ids
Optional[list[Union[UUID, str]]]

A list of document IDs corresponding to the files being updated. When not provided, an attempt is made to derive the correct document ID from the given user ID and file path.

metadatas
Optional[list[dict]]

An optional list of metadata dictionaries for the updated files.

ingestion_config
Optional[Union[dict, IngestionConfig]]

The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime. Learn more about configuration here.

run_with_orchestration
Optional[bool]

Whether or not ingestion runs with orchestration; defaults to True. When set to False, the ingestion process runs synchronously and returns the result directly.

Update Chunks

Update the content of an existing chunk in your R2R system:

document_id = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
extraction_id = "aeba6400-1bd0-5ee9-8925-04732d675434"

update_response = client.update_chunks(
    document_id=document_id,
    extraction_id=extraction_id,
    text="Updated chunk content...",
    metadata={"source": "manual_edit", "edited_at": "2024-10-24"}
)
document_id
UUID Required

The ID of the document containing the chunk to update.

extraction_id
UUID Required

The ID of the specific chunk to update.

text
str Required

The new text content to replace the existing chunk text.

metadata
Optional[dict]

An optional metadata dictionary for the updated chunk. If provided, this will replace the existing chunk metadata.

run_with_orchestration
Optional[bool]

Whether or not the update runs with orchestration; defaults to True. When set to False, the update process runs synchronously and returns the result directly.

Documents Overview

Retrieve high-level document information. Results are restricted to the current user’s files, unless the request is made by a superuser, in which case results from all users are returned:

documents_overview = client.documents_overview()
document_ids
Optional[list[Union[UUID, str]]]

An optional list of document IDs to filter the overview.

offset
Optional[int]

An optional value to offset the starting point of fetched results, defaults to 0.

limit
Optional[int]

An optional value to limit the fetched results, defaults to 100.

Document Chunks

Fetch and examine chunks for a particular document. Chunks represent the atomic units of text after processing:

document_id = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
chunks = client.document_chunks(document_id)
document_id
str Required

The ID of the document to retrieve chunks for.

offset
Optional[int]

An optional value to offset the starting point of fetched results, defaults to 0.

limit
Optional[int]

An optional value to limit the fetched results, defaults to 100.

include_vectors
Optional[bool]

An optional value to return the vectors associated with each chunk, defaults to False.

Delete Documents

Delete a document by its ID:

delete_response = client.delete(
    {
        "document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}
    }
)
filters
list[dict] Required

A list of logical filters over input document fields that identifies the unique set of documents to delete (e.g., {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}). Filters may reference fields such as "user_id" or "title" and use operators like $neq, $gte, etc.
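When filters are built programmatically, a tiny helper keeps the operator shape consistent. eq_filter below is a hypothetical convenience, not an SDK function:

```python
def eq_filter(field: str, value: str) -> dict:
    # One equality condition in the {"field": {"$eq": value}} shape
    return {field: {"$eq": value}}

filters = eq_filter("document_id", "9fbe403b-c11c-5aae-8ade-ef22980c3ad1")
```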

Vector Index Management

Create Vector Index

Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.

Create a vector index for similarity search:

create_response = client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={"m": 16, "ef_construction": 64},
    concurrently=True
)
table_name
str

The table to create the index on. Options: vectors, entities_document, entities_collection, communities. Default: vectors

index_method
str

The indexing method to use. Options: hnsw, ivfflat, auto. Default: hnsw

index_measure
str

Distance measure for vector comparisons. Options: cosine_distance, l2_distance, max_inner_product. Default: cosine_distance

index_arguments
Optional[dict]

Configuration parameters for the chosen index method.

index_name
Optional[str]

Custom name for the index. If not provided, one will be auto-generated.

concurrently
bool

Whether to create the index concurrently. Default: True

Important Considerations

Vector index creation requires careful planning and consideration of your data and performance requirements. Keep in mind:

Resource Intensive Process

  • Index creation can be CPU and memory intensive, especially for large datasets
  • For HNSW indexes, memory usage scales with both dataset size and m parameter
  • Consider creating indexes during off-peak hours for production systems

Performance Tuning

  1. HNSW Parameters:
    • m: Higher values (16-64) improve search quality but increase memory usage and build time
    • ef_construction: Higher values increase build time and quality but have diminishing returns past 100
    • Recommended starting point: m=16, ef_construction=64
# Example balanced configuration
client.create_vector_index(
    table_name="vectors",
    index_method="hnsw",
    index_measure="cosine_distance",
    index_arguments={
        "m": 16,  # Moderate connectivity
        "ef_construction": 64  # Balanced build time/quality
    },
    concurrently=True
)
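A rough back-of-envelope helps size memory before building. The estimate below assumes 4-byte float components plus roughly 2*m 8-byte neighbor links per vector; real index layouts carry additional per-page overhead, so treat this as a floor, not a prediction.

```python
def hnsw_memory_estimate_bytes(n_vectors: int, dim: int, m: int = 16) -> int:
    # ~4 bytes per float component plus ~2*m 8-byte neighbor links per vector
    per_vector = dim * 4 + 2 * m * 8
    return n_vectors * per_vector

# One million 1536-dimensional vectors at m=16: roughly 6.4 GB
estimate = hnsw_memory_estimate_bytes(1_000_000, 1536, m=16)
```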

Pre-warming Required

  • Important: Newly created indexes require pre-warming to achieve optimal performance
  • Initial queries may be slower until the index is loaded into memory
  • The first several queries will automatically warm the index
  • For production systems, consider implementing explicit pre-warming by running representative queries after index creation
  • Without pre-warming, you may not see the expected performance improvements

Best Practices

  1. Always use concurrently=True in production to avoid blocking other operations
  2. Monitor system resources during index creation
  3. Test index performance with representative queries before deploying
  4. Consider creating indexes on smaller test datasets first to validate parameters

Distance Measures

Choose the appropriate measure based on your use case:

  • cosine_distance: Best for normalized vectors (most common)
  • l2_distance: Better for absolute distances
  • max_inner_product: Optimized for dot product similarity
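For unit-length vectors the measures are closely related: squared L2 distance equals exactly twice the cosine distance, so the two produce identical rankings on normalized embeddings. A quick check in plain Python:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]                        # unit vector along x
b = [math.sqrt(0.5), math.sqrt(0.5)]  # unit vector at 45 degrees

cd = cosine_distance(a, b)
ld = l2_distance(a, b)
# For unit vectors: ld**2 == 2 * cd (up to floating-point error)
```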

List Vector Indices

List existing vector indices for a table:

indices = client.list_vector_indices(table_name="vectors")
table_name
str

The table to list indices from. Options: vectors, entities_document, entities_collection, communities. Default: vectors

Delete Vector Index

Delete a vector index from a table:

delete_response = client.delete_vector_index(
    index_name="ix_vector_cosine_ops_hnsw__20241021211541",
    table_name="vectors",
    concurrently=True
)
index_name
str Required

Name of the index to delete.

table_name
str

The table containing the index. Options: vectors, entities_document, entities_collection, communities. Default: vectors

concurrently
bool

Whether to delete the index concurrently. Default: True

Troubleshooting Common Issues

Ingestion Failures

  • Check file permissions and paths
  • Verify file formats are supported
  • Ensure metadata length matches file_paths length
  • Monitor memory usage for large files

Chunking Issues

  • Large chunks may impact retrieval quality
  • Small chunks may lose context
  • Adjust overlap for better context preservation

Vector Index Performance

  • Monitor index creation time
  • Check memory usage during creation
  • Verify warm-up queries are representative
  • Consider index rebuild if quality degrades