Ingesting files with R2R

This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.

Inside R2R, ingestion refers to the complete pipeline for processing input data:

  • Parsing files into text
  • Chunking text into semantic units
  • Generating embeddings
  • Storing data for retrieval

Ingested files are stored with an associated document identifier and a user identifier, enabling document-level and user-level management.

Document Ingestion and Management

R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs, as shown here, by passing the following in your ingestion configuration:

{
  "ingestion_config": {
    "parser_overrides": {
      "pdf": "zerox"
    }
  }
}

We recommend this method for achieving the highest quality ingestion results.
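For example, here is a minimal sketch of passing this override at ingestion time from the JavaScript client, mirroring the ingestFiles call shown below; the file path is illustrative:

// Minimal sketch: route PDFs through the multimodal "zerox" parser.
// Assumes an initialized R2R client; the file path is illustrative.
const ingestResponse = await client.ingestFiles(
  [{ path: 'path/to/report.pdf', name: 'report.pdf' }],
  {
    ingestion_config: {
      parser_overrides: {
        pdf: 'zerox'
      }
    }
  }
);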

Ingest Files

Ingest files or directories into your R2R system:

const files = [
  { path: 'path/to/file1.txt', name: 'file1.txt' },
  { path: 'path/to/file2.txt', name: 'file2.txt' }
];
const metadatas = [
  { key1: 'value1' },
  { key2: 'value2' }
];

// Runtime chunking configuration
const ingestResponse = await client.ingestFiles(files, {
  metadatas,
  user_ids: ['user-id-1', 'user-id-2'],
  ingestion_config: {
    provider: "unstructured_local",  // Local processing
    strategy: "auto",                // Automatic processing strategy
    chunking_strategy: "by_title",   // Split on title boundaries
    new_after_n_chars: 256,          // Start new chunk (soft limit)
    max_characters: 512,             // Maximum chunk size (hard limit)
    combine_under_n_chars: 64,       // Minimum chunk size
    overlap: 100,                    // Character overlap between chunks
  }
});

[Previous sections remain the same through the Update Files code example, then continuing with:]

files
Array<File | { path: string; name: string }> (required)

Array of files to update.

options
object (required)

document_ids
Array<string> (required)

Document IDs corresponding to the files being updated.

metadatas
Array<Record<string, any>>

Optional metadata for the updated files.

ingestion_config
object

Chunking configuration options.

run_with_orchestration
boolean

Whether or not ingestion runs with orchestration; defaults to true. When set to false, ingestion runs synchronously and returns the result directly.
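Since the Update Files code example is elided above, here is a hedged sketch reconstructed purely from the parameter list; the method name updateFiles and the exact call shape are assumptions, not the elided original:

// Hedged sketch based on the parameter list above; the method name and
// call shape are assumptions, not the elided original example.
const files = [
  { path: 'path/to/file1_v2.txt', name: 'file1_v2.txt' }
];

const updateResponse = await client.updateFiles(files, {
  document_ids: ['9fbe403b-c11c-5aae-8ade-ef22980c3ad1'],
  metadatas: [{ version: 'v2' }],  // optional
  run_with_orchestration: true     // optional; defaults to true
});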

Update Chunks

Update the content of an existing chunk in your R2R system:

const documentId = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1";
const extractionId = "aeba6400-1bd0-5ee9-8925-04732d675434";

const updateResponse = await client.updateChunks({
  document_id: documentId,
  extraction_id: extractionId,
  text: "Updated chunk content...",
  metadata: {
    source: "manual_edit",
    edited_at: "2024-10-24"
  }
});

params
object (required)

document_id
string (required)

The ID of the document containing the chunk to update.

extraction_id
string (required)

The ID of the specific chunk to update.

text
string (required)

The new text content to replace the existing chunk text.

metadata
Record<string, any>

An optional metadata object for the updated chunk. If provided, this will replace the existing chunk metadata.

run_with_orchestration
boolean

Whether or not the update runs with orchestration; defaults to true. When set to false, the update runs synchronously and returns the result directly.

Documents Overview

Retrieve high-level document information:

// Get all documents (paginated)
const documentsOverview = await client.documentsOverview();

// Get specific documents
const specificDocs = await client.documentsOverview({
  document_ids: ['doc-id-1', 'doc-id-2'],
  offset: 0,
  limit: 10
});

Results are restricted to the current user’s files unless the request is made by a superuser.

document_ids
Array<string>

Optional array of document IDs to filter results.

offset
number

Starting point for pagination, defaults to 0.

limit
number

Maximum number of results to return, defaults to 100.

Document Chunks

Fetch and examine chunks for a particular document:

const documentId = '9fbe403b-c11c-5aae-8ade-ef22980c3ad1';
const chunks = await client.documentChunks(
  documentId,
  0,     // offset
  100,   // limit
  false  // includeVectors
);

These chunks represent the atomic units of text after processing.

documentId
string (required)

ID of the document to retrieve chunks for.

offset
number

Starting point for pagination, defaults to 0.

limit
number

Maximum number of chunks to return, defaults to 100.

includeVectors
boolean

Whether to include embedding vectors in the response.

Delete Documents

Delete documents using filters:

const deleteResponse = await client.delete({
  document_id: {
    "$eq": "91662726-7271-51a5-a0ae-34818509e1fd"
  }
});

// Delete multiple documents
const bulkDelete = await client.delete({
  user_id: {
    "$in": ["user-1", "user-2"]
  }
});

filters
object (required)

Filter conditions to identify documents for deletion.

Vector Index Management

Create Vector Index

Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.

Create a vector index for similarity search:

const createResponse = await client.createVectorIndex({
  tableName: "vectors",
  indexMethod: "hnsw",
  indexMeasure: "cosine_distance",
  indexArguments: {
    m: 16,               // Number of connections
    ef_construction: 64  // Build-time quality factor
  },
  concurrently: true
});

tableName
string

Table to create index on: vectors, entities_document, entities_collection, communities.

indexMethod
string

Index method: hnsw, ivfflat, or auto.

indexMeasure
string

Distance measure: cosine_distance, l2_distance, or max_inner_product.

indexArguments
object

Configuration for chosen index method.

List Vector Indices

List existing indices:

const indices = await client.listVectorIndices({
  tableName: "vectors"
});

Delete Vector Index

Remove an existing index:

const deleteResponse = await client.deleteVectorIndex({
  indexName: "ix_vector_cosine_ops_hnsw__20241021211541",
  tableName: "vectors",
  concurrently: true
});

Best Practices and Performance Optimization

Vector Index Configuration

  1. HNSW Parameters:

    • m: Higher values (16-64) improve search quality but increase memory
    • ef_construction: Higher values improve quality but slow construction
    • Recommended starting point: m=16, ef_construction=64
  2. Distance Measures:

    • cosine_distance: Best for normalized vectors (most common)
    • l2_distance: Better for absolute distances
    • max_inner_product: Optimized for dot product similarity
  3. Production Considerations:

    • Always use concurrently: true to avoid blocking operations
    • Create indices during off-peak hours
    • Pre-warm indices with representative queries (see the sketch after this list)
    • Monitor memory usage during creation
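To illustrate the pre-warming step, here is a hypothetical sketch that runs a few representative queries immediately after index creation; the client.search call shape is an assumption and may differ in your SDK version:

// Hypothetical warm-up routine: touch the new index with realistic queries
// so first user searches do not pay cold-start latency. The client.search
// signature is assumed; adjust to match your SDK version.
const warmupQueries = [
  'quarterly revenue summary',
  'security compliance policy',
  'employee onboarding checklist'
];

for (const query of warmupQueries) {
  await client.search(query);
}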

Chunking Strategy

  1. Size Guidelines:

    • Avoid chunks >1024 characters for retrieval quality
    • Keep chunks >64 characters to maintain context
    • Use overlap for better context preservation
  2. Method Selection:

    • Use by_title for structured documents (see the sketch after this list)
    • Use basic for uniform text content
    • Consider recursive for nested content
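As a concrete reference, here is a hedged sketch translating these guidelines into the same runtime ingestion_config options used in the ingestFiles example above; the file path and exact values are illustrative, not mandated defaults:

// Sketch: chunking settings that follow the size guidelines above.
const files = [{ path: 'path/to/handbook.pdf', name: 'handbook.pdf' }];

const response = await client.ingestFiles(files, {
  ingestion_config: {
    chunking_strategy: "by_title",  // structured documents; "basic" suits uniform text
    max_characters: 1024,           // hard upper bound from the size guidelines
    new_after_n_chars: 512,         // soft limit: start a new chunk early
    combine_under_n_chars: 64,      // merge fragments below the minimum size
    overlap: 100                    // overlap preserves context across boundaries
  }
});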

Troubleshooting

Common Issues

  1. Ingestion Failures:

    • Verify file permissions and paths
    • Check file format support
    • Ensure the metadata array length matches the files array (see the sketch after this list)
    • Monitor memory for large files
  2. Vector Index Performance:

    • Check index creation time
    • Monitor memory usage
    • Verify warm-up queries
    • Consider rebuilding if quality degrades
  3. Chunking Issues:

    • Adjust overlap for context preservation
    • Monitor chunk sizes
    • Verify language detection
    • Check encoding for special characters
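For the first class of failures, here is an illustrative pre-flight helper (not part of the R2R SDK) that catches mismatched metadata arrays and unreadable paths before calling ingestFiles:

import { promises as fs } from 'fs';

// Illustrative pre-flight helper (not part of the R2R SDK): catch the
// most common ingestion failures before calling client.ingestFiles.
async function validateIngestInput(
  files: { path: string; name: string }[],
  metadatas?: Record<string, any>[]
): Promise<void> {
  if (metadatas && metadatas.length !== files.length) {
    throw new Error(
      `metadatas length (${metadatas.length}) must match files length (${files.length})`
    );
  }
  for (const file of files) {
    await fs.access(file.path); // throws if the path is missing or unreadable
  }
}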