Ingestion Cookbook

Learn how to ingest, update, and delete documents with R2R

Introduction

R2R provides a powerful and flexible ingestion pipeline that allows you to efficiently process and manage various types of documents. This cookbook will guide you through the process of ingesting files, updating existing documents, and deleting documents using the R2R Python SDK.

As of version 3.2.13, we have expanded the options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now use them on PDFs by passing the following in your ingestion configuration:

"ingestion_config": {
    ...,
    "parser_overrides": {
        "pdf": "zerox"
    }
}

We recommend this method when you need the highest-quality ingestion results.
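As a minimal sketch of how the override above can be added to an existing configuration from Python, the hypothetical helper below (with_pdf_parser_override is not part of the R2R SDK) copies a config dictionary and sets the "pdf" entry under "parser_overrides". The "chunking_strategy" key in the example is only a stand-in for whatever settings you already use.

```python
from typing import Any, Dict, Optional


def with_pdf_parser_override(
    base_config: Optional[Dict[str, Any]] = None,
    parser: str = "zerox",
) -> Dict[str, Any]:
    """Return a copy of base_config with the PDF parser override set.

    Hypothetical helper for illustration; not part of the R2R SDK.
    """
    config = dict(base_config or {})
    # Copy the nested dict too, so the caller's config is never mutated.
    overrides = dict(config.get("parser_overrides", {}))
    overrides["pdf"] = parser
    config["parser_overrides"] = overrides
    return config


base = {"chunking_strategy": "by_title"}
ingestion_config = with_pdf_parser_override(base)
```

The resulting dictionary can then be passed as the ingestion_config argument shown in the sections that follow.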

Ingesting Files

To ingest files into your R2R system, you can use the ingest_files method from the Python SDK:

from r2r import R2RClient

# Adjust the URL to point at your own R2R deployment.
client = R2RClient("http://localhost:7272")

file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
# One metadata dictionary per file, in the same order as file_paths.
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "new_after_n_chars": 256,
        "max_characters": 512,
        "combine_under_n_chars": 64,
        "overlap": 100,
    }
)

The ingest_files method accepts the following parameters:

  • file_paths (required): A list of file paths or directory paths to ingest.
  • metadatas (optional): A list of metadata dictionaries corresponding to each file.
  • document_ids (optional): A list of document IDs to assign to the ingested files.
  • ingestion_config (optional): Custom ingestion settings to override the default configuration, which you can read more about here.
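Because metadatas and document_ids are matched to file_paths by position, it can help to check the list lengths before making the call. The sketch below is one way to do that; build_ingest_kwargs is a hypothetical helper, not an SDK function, and it assumes every entry in file_paths is a single file rather than a directory.

```python
from typing import Any, Dict, List, Optional


def build_ingest_kwargs(
    file_paths: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
    document_ids: Optional[List[str]] = None,
    ingestion_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate the parallel lists and return keyword arguments for ingest_files.

    Hypothetical helper for illustration; not part of the R2R SDK.
    """
    if not file_paths:
        raise ValueError("file_paths must contain at least one path")

    # Assumption: one metadata dict and one document ID per entry in file_paths.
    for name, values in (("metadatas", metadatas), ("document_ids", document_ids)):
        if values is not None and len(values) != len(file_paths):
            raise ValueError(
                f"{name} has {len(values)} entries, expected {len(file_paths)}"
            )

    kwargs: Dict[str, Any] = {"file_paths": list(file_paths)}
    # Only pass the optional arguments that were actually supplied.
    if metadatas is not None:
        kwargs["metadatas"] = metadatas
    if document_ids is not None:
        kwargs["document_ids"] = document_ids
    if ingestion_config is not None:
        kwargs["ingestion_config"] = ingestion_config
    return kwargs


kwargs = build_ingest_kwargs(
    ["path/to/file1.txt", "path/to/file2.txt"],
    metadatas=[{"key1": "value1"}, {"key2": "value2"}],
)
# With an initialized client: ingest_response = client.ingest_files(**kwargs)
```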

Ingesting Chunks

If you have pre-processed chunks of text, you can directly ingest them using the ingest_chunks method:

chunks = [
    {"text": "This is the first chunk."},
    {"text": "This is the second chunk."}
]

ingest_response = client.ingest_chunks(
    chunks=chunks,
    document_id="custom_document_id",
    metadata={"custom_metadata": "value"},
)

The ingest_chunks method accepts the following parameters:

  • chunks (required): A list of dictionaries containing the text and metadata for each chunk.
  • document_id (optional): A custom document ID to assign to the ingested chunks.
  • metadata (optional): Additional metadata to associate with the ingested chunks.
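If your pre-processing happens in Python, the chunks list can be produced by a small function. The following is a rough sketch of one possible approach, packing paragraphs into chunks of limited size; paragraphs_to_chunks and its max_chars parameter are hypothetical names, and this is not how R2R itself chunks documents.

```python
from typing import Dict, List


def paragraphs_to_chunks(text: str, max_chars: int = 512) -> List[Dict[str, str]]:
    """Greedily pack blank-line-separated paragraphs into chunk dictionaries.

    Hypothetical helper for illustration; not part of the R2R SDK.
    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks: List[Dict[str, str]] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            # The paragraph still fits, so keep growing the current chunk.
            current = candidate
        else:
            # Close the current chunk and start a new one with this paragraph.
            if current:
                chunks.append({"text": current})
            current = paragraph
    if current:
        chunks.append({"text": current})
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 600
chunks = paragraphs_to_chunks(sample, max_chars=40)
# With an initialized client:
# ingest_response = client.ingest_chunks(chunks=chunks, document_id="custom_document_id")
```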

Updating Files

To update existing documents in your R2R system, you can use the update_files method:

file_paths = ['path/to/updated_file1.txt', 'path/to/updated_file2.txt']
# Each document ID corresponds to the file at the same position in file_paths.
document_ids = ['document1_id', 'document2_id']

update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{"version": "2.0"}, {"version": "1.5"}],
)

The update_files method accepts the following parameters:

  • file_paths (required): A list of file paths for the updated documents.
  • document_ids (required): A list of document IDs corresponding to the files being updated.
  • metadatas (optional): A list of metadata dictionaries to update for each document.
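Since update_files pairs each file with a document ID by position, keeping the pairs in a dictionary and deriving the lists from it avoids mismatches. This is only a sketch under that assumption; build_update_kwargs is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, Optional


def build_update_kwargs(
    path_by_document_id: Dict[str, str],
    metadata_by_document_id: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Turn {document_id: file_path} pairs into aligned lists for update_files.

    Hypothetical helper for illustration; not part of the R2R SDK.
    """
    if not path_by_document_id:
        raise ValueError("at least one document is required")

    # Dictionaries preserve insertion order, so all lists stay aligned.
    document_ids = list(path_by_document_id)
    kwargs: Dict[str, Any] = {
        "file_paths": [path_by_document_id[doc_id] for doc_id in document_ids],
        "document_ids": document_ids,
    }
    if metadata_by_document_id is not None:
        missing = [d for d in document_ids if d not in metadata_by_document_id]
        if missing:
            raise ValueError(f"no metadata given for: {missing}")
        kwargs["metadatas"] = [metadata_by_document_id[d] for d in document_ids]
    return kwargs


update_kwargs = build_update_kwargs(
    {
        "document1_id": "path/to/updated_file1.txt",
        "document2_id": "path/to/updated_file2.txt",
    },
    {"document1_id": {"version": "2.0"}, "document2_id": {"version": "1.5"}},
)
# With an initialized client: update_response = client.update_files(**update_kwargs)
```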

Updating Chunks

To update specific chunks within existing documents in your R2R deployment, you can use the update_chunks method:

document_id = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
extraction_id = "aeba6400-1bd0-5ee9-8925-04732d675434"

update_response = client.update_chunks(
    document_id=document_id,
    extraction_id=extraction_id,
    text="Updated chunk content with new information...",
    # This dictionary replaces the chunk's existing metadata.
    metadata={
        "source": "manual_edit",
        "edited_at": "2024-10-24",
        "editor": "John Doe"
    }
)

The update_chunks method accepts the following parameters:

  • document_id (required): The ID of the document containing the chunk you want to update.
  • extraction_id (required): The ID of the specific chunk you want to update.
  • text (required): The new text content that will replace the existing chunk text.
  • metadata (optional): A metadata dictionary that will replace the existing chunk metadata.
  • run_with_orchestration (optional): Whether to run the update through orchestration (default: true).

This method is particularly useful when you need to:

  • Correct errors in specific chunks
  • Update outdated information
  • Add or modify metadata for individual chunks
  • Make targeted changes without reprocessing entire documents

Note that updating chunks will trigger a re-vectorization of the modified content, ensuring that your vector search capabilities remain accurate with the updated information.
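Because the metadata argument replaces the existing chunk metadata rather than merging into it, fields you want to keep have to be included in the new dictionary. The sketch below shows one way to merge them client-side; merged_chunk_metadata is a hypothetical helper, and it assumes you already hold the chunk's current metadata as a dictionary (how you obtain it is outside the scope of this example).

```python
from datetime import date
from typing import Any, Dict, Optional


def merged_chunk_metadata(
    existing: Dict[str, Any],
    changes: Dict[str, Any],
    edited_on: Optional[date] = None,
) -> Dict[str, Any]:
    """Combine existing metadata with changes and stamp the edit date.

    Hypothetical helper for illustration; not part of the R2R SDK.
    """
    merged = dict(existing)   # copy, so the caller's dictionary is untouched
    merged.update(changes)    # changed fields win over existing ones
    merged["edited_at"] = (edited_on or date.today()).isoformat()
    return merged


existing = {"source": "upload", "page": 3}
new_metadata = merged_chunk_metadata(
    existing,
    {"source": "manual_edit", "editor": "John Doe"},
    edited_on=date(2024, 10, 24),
)
# With an initialized client, pass new_metadata as the metadata argument
# of client.update_chunks(...).
```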

Deleting Documents and Chunks

To delete documents or chunks from your R2R deployment, you can use the delete method:

# For documents
delete_response = client.delete(
    {
        "document_id": {"$eq": "document1_id"}
    }
)

# For chunks
delete_response = client.delete(
    {
        "extraction_id": {"$eq": "extraction1_id"}
    }
)

The delete method accepts a dictionary of filters that identifies the documents or chunks to delete. In the example above, the first call deletes the document with the ID “document1_id” and the second call deletes the chunk with the ID “extraction1_id.”
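The filter dictionaries follow the same shape in both cases, so they can be built by a small function that also rejects empty IDs before anything is deleted. This is a minimal sketch; eq_filter is a hypothetical helper, and only the "$eq" operator shown in this cookbook is used.

```python
from typing import Dict


def eq_filter(field: str, value: str) -> Dict[str, Dict[str, str]]:
    """Build an equality filter such as {"document_id": {"$eq": "..."}}.

    Hypothetical helper for illustration; not part of the R2R SDK.
    """
    # Refuse blank IDs so a mistake upstream cannot produce a meaningless filter.
    if not value or not value.strip():
        raise ValueError(f"{field} must be a non-empty ID")
    return {field: {"$eq": value}}


document_filter = eq_filter("document_id", "document1_id")
chunk_filter = eq_filter("extraction_id", "extraction1_id")
# With an initialized client: delete_response = client.delete(document_filter)
```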

Conclusion

R2R’s ingestion pipeline provides a flexible and efficient way to process, update, and manage your documents. By using the ingest_files, ingest_chunks, update_files, update_chunks, and delete methods from the Python SDK, you can integrate document management capabilities into your applications.

For more detailed information on the available parameters and response formats, refer to the Python SDK Ingestion Documentation.