Ingesting files with R2R

This SDK documentation is periodically updated. For the latest parameter details, please cross-reference with the API Reference documentation.

Inside R2R, ingestion refers to the complete pipeline for processing input data:

  • Parsing files into text
  • Chunking text into semantic units
  • Generating embeddings
  • Storing data for retrieval

Ingested files are stored with an associated document identifier and a user identifier, enabling document-level and user-level management.

Document Ingestion and Management

R2R has recently expanded the available options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now apply them to PDFs, as shown here, by passing the following in your ingestion configuration:

{
  "ingestion_config": {
    "parser_overrides": {
      "pdf": "zerox"
    }
  }
}

We recommend this method for achieving the highest quality ingestion results.
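For example, here is a minimal sketch of passing this override at ingestion time from the JavaScript client, mirroring the ingestFiles call shown below; the file path is illustrative:

// Minimal sketch: route PDFs through the multimodal "zerox" parser.
// Assumes an initialized R2R client; the file path is illustrative.
const ingestResponse = await client.ingestFiles(
  [{ path: 'path/to/report.pdf', name: 'report.pdf' }],
  {
    ingestion_config: {
      parser_overrides: {
        pdf: 'zerox'
      }
    }
  }
);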

Ingest Files

Ingest files or directories into your R2R system:

const files = [
  { path: 'path/to/file1.txt', name: 'file1.txt' },
  { path: 'path/to/file2.txt', name: 'file2.txt' }
];
const metadatas = [
  { key1: 'value1' },
  { key2: 'value2' }
];

// Runtime chunking configuration
const ingestResponse = await client.ingestFiles(files, {
  metadatas,
  user_ids: ['user-id-1', 'user-id-2'],
  ingestion_config: {
    provider: "unstructured_local",  // Local processing
    strategy: "auto",                // Automatic processing strategy
    chunking_strategy: "by_title",   // Split on title boundaries
    new_after_n_chars: 256,          // Start new chunk (soft limit)
    max_characters: 512,             // Maximum chunk size (hard limit)
    combine_under_n_chars: 64,       // Minimum chunk size
    overlap: 100,                    // Character overlap between chunks
  }
});

[Previous sections remain the same through the Update Files code example, then continuing with:]

files
Array<File | { path: string; name: string }> (required)

Array of files to update.

options
object (required)

document_ids
Array<string> (required)

Document IDs corresponding to the files being updated.

metadatas
Array<Record<string, any>>

Optional metadata for the updated files.

ingestion_config
object

Chunking configuration options.

run_with_orchestration
boolean

Whether or not ingestion runs with orchestration; defaults to true. When set to false, ingestion runs synchronously and returns the result directly.
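Since the Update Files code example is elided above, here is a hedged sketch reconstructed purely from the parameter list; the method name updateFiles and the exact call shape are assumptions, not the elided original:

// Hedged sketch based on the parameter list above; the method name and
// call shape are assumptions, not the elided original example.
const files = [
  { path: 'path/to/file1_v2.txt', name: 'file1_v2.txt' }
];

const updateResponse = await client.updateFiles(files, {
  document_ids: ['9fbe403b-c11c-5aae-8ade-ef22980c3ad1'],
  metadatas: [{ version: 'v2' }],  // optional
  run_with_orchestration: true     // optional; defaults to true
});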

Update Chunks

Update the content of an existing chunk in your R2R system:

const documentId = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1";
const extractionId = "aeba6400-1bd0-5ee9-8925-04732d675434";

const updateResponse = await client.updateChunks({
  document_id: documentId,
  extraction_id: extractionId,
  text: "Updated chunk content...",
  metadata: {
    source: "manual_edit",
    edited_at: "2024-10-24"
  }
});

params
object (required)

document_id
string (required)

The ID of the document containing the chunk to update.

extraction_id
string (required)

The ID of the specific chunk to update.

text
string (required)

The new text content to replace the existing chunk text.

metadata
Record<string, any>

An optional metadata object for the updated chunk. If provided, this will replace the existing chunk metadata.

run_with_orchestration
boolean

Whether or not the update runs with orchestration; defaults to true. When set to false, the update runs synchronously and returns the result directly.

Documents Overview

Retrieve high-level document information:

// Get all documents (paginated)
const documentsOverview = await client.documentsOverview();

// Get specific documents
const specificDocs = await client.documentsOverview({
  document_ids: ['doc-id-1', 'doc-id-2'],
  offset: 0,
  limit: 10
});

Results are restricted to the current user’s files unless the request is made by a superuser.

document_ids
Array<string>

Optional array of document IDs to filter results.

offset
number

Starting point for pagination, defaults to 0.

limit
number

Maximum number of results to return, defaults to 100.

Document Chunks

Fetch and examine chunks for a particular document:

const documentId = '9fbe403b-c11c-5aae-8ade-ef22980c3ad1';
const chunks = await client.documentChunks(
  documentId,
  0,     // offset
  100,   // limit
  false  // includeVectors
);

These chunks represent the atomic units of text after processing.

documentId
string (required)

ID of the document to retrieve chunks for.

offset
number

Starting point for pagination, defaults to 0.

limit
number

Maximum number of chunks to return, defaults to 100.

includeVectors
boolean

Whether to include embedding vectors in the response.

Delete Documents

Delete documents using filters:

const deleteResponse = await client.delete({
  document_id: {
    "$eq": "91662726-7271-51a5-a0ae-34818509e1fd"
  }
});

// Delete multiple documents
const bulkDelete = await client.delete({
  user_id: {
    "$in": ["user-1", "user-2"]
  }
});

filters
object (required)

Filter conditions to identify documents for deletion.

Vector Index Management

Create Vector Index

Vector indices significantly improve search performance for large collections but add overhead for smaller datasets. Only create indices when working with hundreds of thousands of documents or when search latency is critical.

Create a vector index for similarity search:

const createResponse = await client.createVectorIndex({
  tableName: "vectors",
  indexMethod: "hnsw",
  indexMeasure: "cosine_distance",
  indexArguments: {
    m: 16,               // Number of connections
    ef_construction: 64  // Build-time quality factor
  },
  concurrently: true
});

tableName
string

Table to create index on: vectors, entities_document, entities_collection, communities.

indexMethod
string

Index method: hnsw, ivfflat, or auto.

indexMeasure
string

Distance measure: cosine_distance, l2_distance, or max_inner_product.

indexArguments
object

Configuration for chosen index method.

List Vector Indices

List existing indices:

const indices = await client.listVectorIndices({
  tableName: "vectors"
});

Delete Vector Index

Remove an existing index:

const deleteResponse = await client.deleteVectorIndex({
  indexName: "ix_vector_cosine_ops_hnsw__20241021211541",
  tableName: "vectors",
  concurrently: true
});

Best Practices and Performance Optimization

Vector Index Configuration

  1. HNSW Parameters:

    • m: Higher values (16-64) improve search quality but increase memory
    • ef_construction: Higher values improve quality but slow construction
    • Recommended starting point: m=16, ef_construction=64
  2. Distance Measures:

    • cosine_distance: Best for normalized vectors (most common)
    • l2_distance: Better for absolute distances
    • max_inner_product: Optimized for dot product similarity
  3. Production Considerations:

    • Always use concurrently: true to avoid blocking operations
    • Create indices during off-peak hours
    • Pre-warm indices with representative queries (see the sketch after this list)
    • Monitor memory usage during creation
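To illustrate the pre-warming step, here is a hypothetical sketch that runs a few representative queries immediately after index creation; the client.search call shape is an assumption and may differ in your SDK version:

// Hypothetical warm-up routine: touch the new index with realistic queries
// so first user searches do not pay cold-start latency. The client.search
// signature is assumed; adjust to match your SDK version.
const warmupQueries = [
  'quarterly revenue summary',
  'security compliance policy',
  'employee onboarding checklist'
];

for (const query of warmupQueries) {
  await client.search(query);
}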

Chunking Strategy

  1. Size Guidelines:

    • Avoid chunks >1024 characters for retrieval quality
    • Keep chunks >64 characters to maintain context
    • Use overlap for better context preservation
  2. Method Selection:

    • Use by_title for structured documents (see the sketch after this list)
    • Use basic for uniform text content
    • Consider recursive for nested content
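As a concrete reference, here is a hedged sketch translating these guidelines into the same runtime ingestion_config options used in the ingestFiles example above; the file path and exact values are illustrative, not mandated defaults:

// Sketch: chunking settings that follow the size guidelines above.
const files = [{ path: 'path/to/handbook.pdf', name: 'handbook.pdf' }];

const response = await client.ingestFiles(files, {
  ingestion_config: {
    chunking_strategy: "by_title",  // structured documents; "basic" suits uniform text
    max_characters: 1024,           // hard upper bound from the size guidelines
    new_after_n_chars: 512,         // soft limit: start a new chunk early
    combine_under_n_chars: 64,      // merge fragments below the minimum size
    overlap: 100                    // overlap preserves context across boundaries
  }
});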

Troubleshooting

Common Issues

  1. Ingestion Failures:

    • Verify file permissions and paths
    • Check file format support
    • Ensure the metadata array length matches the files array (see the sketch after this list)
    • Monitor memory for large files
  2. Vector Index Performance:

    • Check index creation time
    • Monitor memory usage
    • Verify warm-up queries
    • Consider rebuilding if quality degrades
  3. Chunking Issues:

    • Adjust overlap for context preservation
    • Monitor chunk sizes
    • Verify language detection
    • Check encoding for special characters
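For the first class of failures, here is an illustrative pre-flight helper (not part of the R2R SDK) that catches mismatched metadata arrays and unreadable paths before calling ingestFiles:

import { promises as fs } from 'fs';

// Illustrative pre-flight helper (not part of the R2R SDK): catch the
// most common ingestion failures before calling client.ingestFiles.
async function validateIngestInput(
  files: { path: string; name: string }[],
  metadatas?: Record<string, any>[]
): Promise<void> {
  if (metadatas && metadatas.length !== files.length) {
    throw new Error(
      `metadatas length (${metadatas.length}) must match files length (${files.length})`
    );
  }
  for (const file of files) {
    await fs.access(file.path); // throws if the path is missing or unreadable
  }
}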