Ingestion Cookbook
Learn how to ingest, update, and delete documents with R2R
Introduction
R2R provides a powerful and flexible ingestion pipeline that allows you to efficiently process and manage various types of documents. This cookbook will guide you through the process of ingesting files, updating existing documents, and deleting documents using the R2R Python SDK.
As of version 3.2.13, we have expanded the options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now use them on PDFs when the corresponding option is set in your ingestion configuration.
We recommend this method for achieving the highest quality ingestion results.
Ingesting Files
To ingest files into your R2R system, use the ingest_files method from the Python SDK.
The ingest_files method accepts the following parameters:
- file_paths (required): A list of file paths or directory paths to ingest.
- metadatas (optional): A list of metadata dictionaries corresponding to each file.
- document_ids (optional): A list of document IDs to assign to the ingested files.
- ingestion_config (optional): Custom ingestion settings that override the default configuration.
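The parameters above can be combined as in the following minimal sketch. It assumes that client is an R2R Python SDK client (for example an R2RClient from the r2r package, pointed at your deployment) and that every entry in file_paths is a single file rather than a directory. The helper name ingest_with_metadata is hypothetical and not part of the SDK.

```python
from typing import Any, Optional


def ingest_with_metadata(
    client: Any,
    file_paths: list[str],
    metadatas: Optional[list[dict]] = None,
    document_ids: Optional[list[str]] = None,
    ingestion_config: Optional[dict] = None,
) -> Any:
    """Forward the documented parameters to client.ingest_files.

    `client` is assumed to be an R2R Python SDK client; this helper itself
    is illustrative and not part of the SDK.
    """
    # Assumption: every entry in file_paths is a single file, so each optional
    # per-file list must contain exactly one entry per path.
    for name, values in (("metadatas", metadatas), ("document_ids", document_ids)):
        if values is not None and len(values) != len(file_paths):
            raise ValueError(f"{name} needs one entry per file path")

    # Pass only the optional arguments that were supplied, so the SDK
    # defaults apply to everything else.
    optional = {
        "metadatas": metadatas,
        "document_ids": document_ids,
        "ingestion_config": ingestion_config,
    }
    kwargs = {key: value for key, value in optional.items() if value is not None}
    return client.ingest_files(file_paths=file_paths, **kwargs)
```

With a connected client, a call might look like ingest_with_metadata(client, ["report.pdf"], metadatas=[{"title": "Q3 report"}]); the file name and metadata are placeholders.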
Ingesting Chunks
If you have pre-processed chunks of text, you can ingest them directly using the ingest_chunks method.
The ingest_chunks method accepts the following parameters:
- chunks (required): A list of dictionaries containing the text and metadata for each chunk.
- document_id (optional): A custom document ID to assign to the ingested chunks.
- metadata (optional): Additional metadata to associate with the ingested chunks.
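As an illustration of the chunk format, the sketch below turns (text, metadata) pairs into chunk dictionaries and passes them to ingest_chunks. The "text" and "metadata" key names are an assumption based on the parameter description above, client is assumed to be an R2R Python SDK client, and ingest_text_chunks is a hypothetical helper.

```python
from typing import Any, Optional


def ingest_text_chunks(
    client: Any,
    chunks: list[tuple[str, dict]],
    document_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Any:
    """Build chunk dictionaries and forward them to client.ingest_chunks."""
    # Assumed chunk shape: one dictionary per chunk with "text" and
    # "metadata" keys.
    payload = [
        {"text": text, "metadata": chunk_metadata}
        for text, chunk_metadata in chunks
        if text.strip()  # skip chunks that contain only whitespace
    ]
    if not payload:
        raise ValueError("no non-empty chunks to ingest")

    # document_id and metadata are optional, so include them only when given.
    kwargs: dict[str, Any] = {"chunks": payload}
    if document_id is not None:
        kwargs["document_id"] = document_id
    if metadata is not None:
        kwargs["metadata"] = metadata
    return client.ingest_chunks(**kwargs)
```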
Updating Files
To update existing documents in your R2R system, use the update_files method.
The update_files method accepts the following parameters:
- file_paths (required): A list of file paths for the updated documents.
- document_ids (required): A list of document IDs corresponding to the files being updated.
- metadatas (optional): A list of metadata dictionaries to update for each document.
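Because file_paths and document_ids are parallel lists, it can help to build both from a single mapping so they cannot drift out of order. The sketch below does that; client is assumed to be an R2R Python SDK client, and update_documents is a hypothetical helper rather than an SDK function.

```python
from typing import Any, Optional


def update_documents(
    client: Any,
    updates: dict[str, str],
    metadatas: Optional[list[dict]] = None,
) -> Any:
    """Update documents from a {document_id: new_file_path} mapping."""
    if not updates:
        raise ValueError("updates must contain at least one document")

    # Derive both required lists from the same mapping so that each
    # document ID stays aligned with its replacement file.
    document_ids = list(updates)
    file_paths = [updates[document_id] for document_id in document_ids]

    kwargs: dict[str, Any] = {"file_paths": file_paths, "document_ids": document_ids}
    if metadatas is not None:
        # One metadata dictionary per document, in the same order.
        if len(metadatas) != len(document_ids):
            raise ValueError("metadatas needs one entry per document")
        kwargs["metadatas"] = metadatas
    return client.update_files(**kwargs)
```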
Updating Chunks
To update specific chunks within existing documents in your R2R deployment, use the update_chunks method.
The update_chunks method accepts the following parameters:
- document_id (required): The ID of the document containing the chunk you want to update.
- extraction_id (required): The ID of the specific chunk you want to update.
- text (required): The new text content that will replace the existing chunk text.
- metadata (optional): A metadata dictionary that will replace the existing chunk metadata.
- run_with_orchestration (optional): Whether to run the update through orchestration (default: true).
This method is particularly useful when you need to:
- Correct errors in specific chunks
- Update outdated information
- Add or modify metadata for individual chunks
- Make targeted changes without reprocessing entire documents
Note that updating chunks will trigger a re-vectorization of the modified content, ensuring that your vector search capabilities remain accurate with the updated information.
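Since the metadata parameter replaces the existing chunk metadata rather than merging into it, adding or changing a single field means sending the full dictionary. The sketch below merges the changes locally before calling update_chunks. It assumes client is an R2R Python SDK client and that you already hold the chunk's current metadata; update_chunk_text and its existing_metadata and metadata_changes arguments are hypothetical.

```python
from typing import Any, Optional


def update_chunk_text(
    client: Any,
    document_id: str,
    extraction_id: str,
    text: str,
    existing_metadata: Optional[dict] = None,
    metadata_changes: Optional[dict] = None,
    run_with_orchestration: bool = True,
) -> Any:
    """Replace one chunk's text and, optionally, its merged metadata."""
    kwargs: dict[str, Any] = {
        "document_id": document_id,
        "extraction_id": extraction_id,
        "text": text,
        "run_with_orchestration": run_with_orchestration,
    }
    if existing_metadata is not None or metadata_changes is not None:
        # The metadata argument replaces what is stored, so merge the
        # changes over the existing values and send the complete result.
        kwargs["metadata"] = {**(existing_metadata or {}), **(metadata_changes or {})}
    return client.update_chunks(**kwargs)
```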
Deleting Documents and Chunks
To delete documents or chunks from your R2R deployment, use the delete method.
The delete method accepts a dictionary of filters that identifies the documents or chunks to delete, for example the document with the ID “document1_id” or the chunk with the ID “extraction1_id”.
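A minimal sketch of building such a filter dictionary and passing it to delete. The {"$eq": value} operator form is an assumption about the filter syntax and should be checked against the SDK documentation for your version. The helper delete_by_id and the DELETABLE_FIELDS set are hypothetical, and client is assumed to be an R2R Python SDK client.

```python
from typing import Any

# Fields this sketch allows in a filter; both names come from the IDs
# mentioned in this cookbook.
DELETABLE_FIELDS = {"document_id", "extraction_id"}


def delete_by_id(client: Any, field: str, value: str) -> Any:
    """Delete the document or chunk whose `field` equals `value`."""
    if field not in DELETABLE_FIELDS:
        raise ValueError(f"unsupported filter field: {field}")

    # Assumed filter syntax: an equality operator keyed by the field name.
    filters = {field: {"$eq": value}}
    return client.delete(filters)
```

Deleting the document and the chunk named above would then take two calls: delete_by_id(client, "document_id", "document1_id") and delete_by_id(client, "extraction_id", "extraction1_id").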
Conclusion
R2R’s ingestion pipeline provides a flexible and efficient way to process, update, and manage your documents. By using the ingest_files, ingest_chunks, update_files, update_chunks, and delete methods from the Python SDK, you can integrate document management capabilities into your applications.
For more detailed information on the available parameters and response formats, refer to the Python SDK Ingestion Documentation.