Introduction

R2R provides a powerful and flexible ingestion pipeline that allows you to efficiently process and manage various types of documents. This cookbook will guide you through the process of ingesting files, updating existing documents, and deleting documents using the R2R Python SDK.

As of version 3.2.13, we have expanded the options for ingesting files using multimodal foundation models. In addition to using such models by default for images, R2R can now use them on PDFs by passing the following in your ingestion configuration:

{
  "parser_overrides": {
    "pdf": "zerox"
  }
}

We recommend this method for achieving the highest quality ingestion results.

Ingesting Files

To ingest files into your R2R system, you can use the ingest_files method from the Python SDK:

file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "new_after_n_chars": 256,
        "max_characters": 512,
        "combine_under_n_chars": 64,
        "overlap": 100,
    }
)

The ingest_files method accepts the following parameters:

  • file_paths (required): A list of file paths or directory paths to ingest.
  • metadatas (optional): A list of metadata dictionaries corresponding to each file.
  • document_ids (optional): A list of document IDs to assign to the ingested files.
  • ingestion_config (optional): Custom ingestion settings to override the default configuration, which you can read more about here.

Ingesting Chunks

If you have pre-processed chunks of text, you can directly ingest them using the ingest_chunks method:

chunks = [
    {"text": "This is the first chunk."},
    {"text": "This is the second chunk."}
]

ingest_response = client.ingest_chunks(
    chunks=chunks,
    document_id="custom_document_id",
    metadata={"custom_metadata": "value"},
)

The ingest_chunks method accepts the following parameters:

  • chunks (required): A list of dictionaries containing the text and metadata for each chunk.
  • document_id (optional): A custom document ID to assign to the ingested chunks.
  • metadata (optional): Additional metadata to associate with the ingested chunks.

Updating Files

To update existing documents in your R2R system, you can use the update_files method:

file_paths = ['path/to/updated_file1.txt', 'path/to/updated_file2.txt']
document_ids = ['document1_id', 'document2_id']

update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{"version": "2.0"}, {"version": "1.5"}],
)

The update_files method accepts the following parameters:

  • file_paths (required): A list of file paths for the updated documents.
  • document_ids (required): A list of document IDs corresponding to the files being updated.
  • metadatas (optional): A list of metadata dictionaries to update for each document.

Deleting Documents

To delete documents from your R2R system, you can use the delete method:

delete_response = client.delete(
    {
        "document_id": {"$eq": "document1_id"}
    }
)

The delete method accepts a dictionary specifying the filters to identify the documents to delete. In this example, it deletes the document with the ID “document1_id”.

Conclusion

R2R’s ingestion pipeline provides a flexible and efficient way to process, update, and manage your documents. By utilizing the ingest_files, ingest_chunks, update_files, and delete methods from the Python SDK, you can seamlessly integrate document management capabilities into your applications.

For more detailed information on the available parameters and response formats, refer to the Python SDK Ingestion Documentation.