Documents

Ingest and manage your documents

R2R provides a powerful and flexible ingestion pipeline to process and manage various types of documents. It supports a wide range of file formats—text, documents, PDFs, images, audio, and even video—and transforms them into searchable, analyzable content.

The ingestion process includes parsing, chunking, embedding, and optionally extracting entities and relationships for knowledge graph construction.

This documentation will guide you through:

  • Ingesting files, raw text, or pre-processed chunks
  • Choosing an ingestion mode (fast, hi-res, or custom)
  • Updating and deleting documents and chunks

For detailed examples of working with documents, refer to the documents API and SDK reference.

Ingesting Documents

A Document represents ingested content in R2R. When you ingest a file, text, or chunks:

  1. The file (or text) is parsed into text.
  2. Text is chunked into manageable units.
  3. Embeddings are generated for semantic search.
  4. Content is stored for retrieval and optionally linked to the knowledge graph.
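The four steps above can be sketched as a toy pipeline. This is an illustration of the general parse, chunk, embed, store flow only, not R2R's implementation: every function name here is hypothetical, the chunker is a naive fixed-size word window, and the "embedding" is a hashed bag of words standing in for a real embedding model.

```python
import hashlib
import math


def parse(raw: bytes) -> str:
    """Step 1: turn raw file bytes into text (here: plain UTF-8 decoding)."""
    return raw.decode("utf-8", errors="replace")


def chunk(text: str, max_words: int = 8) -> list[str]:
    """Step 2: split text into units of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(chunk_text: str, dim: int = 16) -> list[float]:
    """Step 3: toy embedding. Hash each word into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in chunk_text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(raw: bytes, store: list[dict]) -> int:
    """Step 4: store every chunk with its embedding; return the number of chunks stored."""
    chunks = chunk(parse(raw))
    for text in chunks:
        store.append({"text": text, "embedding": embed(text)})
    return len(chunks)


# Hypothetical in-memory store standing in for a real vector database.
store: list[dict] = []
count = ingest(
    b"R2R parses files into text, splits the text into chunks, and embeds each chunk.",
    store,
)
print(count, "chunks stored")
```

In a real deployment each of these stages is handled by R2R itself; the sketch is only meant to make the order of operations concrete.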

Ingestion inside R2R is asynchronous. You can monitor ingestion status and confirm when documents are ready:

$ r2r documents list
{
'id': '9fbe403b-c11c-5aae-8ade-ef22980c3ad1',
'title': 'file.txt',
'user_id': '2acb499e-8428-543b-bd85-0d9098718220',
'type': 'txt',
'created_at': '2024-09-05T18:20:47.921933Z',
'updated_at': '2024-09-05T18:20:47.921938Z',
'ingestion_status': 'success',
'restructuring_status': 'pending',
'version': 'v0',
'summary': 'The document contains a ....', # AI generated summary
'collection_ids': [],
'metadata': {'version': 'v0'}
}
...

An ingestion_status of "success" confirms the document is fully ingested. You can also check your R2R dashboard for ingestion progress and status.
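Because ingestion is asynchronous, client code often needs to wait until a document reaches the "success" status. The following is a minimal polling sketch under stated assumptions: `wait_for_ingestion` and `fetch_status` are hypothetical names, `fetch_status` stands in for whatever call you use to read a document's `ingestion_status`, and the "failed" status name is an assumption rather than something confirmed on this page. Any other status is treated as still in progress.

```python
import time
from typing import Callable


def wait_for_ingestion(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    failure_statuses: tuple[str, ...] = ("failed",),  # assumed status name
) -> str:
    """Poll fetch_status() until it returns 'success', a failure status, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "success":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"ingestion ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion still {status!r} after {timeout_s} s")
        time.sleep(poll_interval_s)


# Stand-in for a real status lookup: two in-progress reads, then success.
# "pending" is a placeholder for whatever in-progress value the server reports.
fake_statuses = iter(["pending", "pending", "success"])
result = wait_for_ingestion(lambda: next(fake_statuses), timeout_s=5.0, poll_interval_s=0.01)
print(result)
```

Passing the status lookup in as a callable keeps the helper independent of any particular client call, so it can be tested without a running server.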

For more details on creating documents, refer to the create document API.

Ingestion Modes

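R2R offers three ingestion modes (fast, hi-res, and custom) to allow for maximal customization. The examples below cover each supported input type in turn: unprocessed files, raw text, and pre-processed chunks.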

Unprocessed files

Files can be ingested directly from a path. The example below uses fast, a speed-oriented ingestion mode that prioritizes rapid processing with minimal enrichment. Summaries and some advanced parsing are skipped, which makes this mode well suited to processing large volumes of documents quickly.

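# Assumes the r2r Python SDK is installed
from r2r import R2RClient

# Set your API key in the environment first: export R2R_API_KEY='sk-....'
client = R2RClient()

file_path = 'path/to/file.txt'

ingest_response = client.documents.create(
    file_path=file_path,
    ingestion_mode="fast",  # fast mode for quick processing
)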

Raw text

If your content is already available as a string, you can ingest it directly as raw text instead of writing it to a file first.

raw_text = "This is my first document."
ingest_response = client.documents.create(
    raw_text=raw_text,
)

Pre-Processed Chunks

If you have pre-processed chunks from your own pipeline, you can directly ingest them. This is especially useful if you’ve already divided content into logical segments.

chunks = ["This is my first parsed chunk", "This is my second parsed chunk"]
ingest_response = client.documents.create(
    chunks=chunks,
)
print(ingest_response)
# {'results': [{'message': 'Document created and ingested successfully.', 'document_id': '7a0dad00-b041-544e-8028-bc9631a0a527'}]}
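If you do not yet have a chunking pipeline, one simple way to produce logical segments is to group paragraphs up to a size limit. The sketch below is one possible approach, not something R2R prescribes: `split_into_chunks` and its `max_chars` parameter are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph, somewhat longer than the others."
chunks = split_into_chunks(text, max_chars=40)
print(len(chunks), "chunks")
```

The result is a plain list of strings, the same shape as the `chunks` argument shown in the example above.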

Deleting Documents and Chunks

To remove documents or chunks, call their respective delete methods:

# Delete a document
delete_response = client.documents.delete(document_id)

# Delete a chunk
delete_response = client.chunks.delete(chunk_id)

You can also delete documents by specifying filters using the by-filter route.
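Filter-based deletion can remove many documents at once, so it may be worth previewing which documents a filter would match before sending it. The sketch below evaluates a filter locally against document records shaped like the listing shown earlier. The `{"field": {"$eq": value}}` filter shape is an assumption about the syntax, the `matches` helper is hypothetical, and the document IDs are made up; check the API reference for the operators the by-filter route actually accepts.

```python
def matches(document: dict, filters: dict) -> bool:
    """Return True if the document satisfies every {field: {"$eq": value}} condition."""
    for field, condition in filters.items():
        if set(condition) != {"$eq"}:
            # Only equality is modeled in this sketch.
            raise ValueError(f"unsupported condition for {field!r}: {condition!r}")
        if document.get(field) != condition["$eq"]:
            return False
    return True


# Made-up records with the same field names as the `documents list` output.
documents = [
    {"id": "doc-1", "type": "txt", "ingestion_status": "success"},
    {"id": "doc-2", "type": "pdf", "ingestion_status": "success"},
    {"id": "doc-3", "type": "txt", "ingestion_status": "success"},
]

filters = {"type": {"$eq": "txt"}}  # assumed filter syntax
to_delete = [doc["id"] for doc in documents if matches(doc, filters)]
print(to_delete)
```

Running a local preview like this costs one extra listing call but makes an overly broad filter visible before anything is removed.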

Conclusion

R2R’s ingestion pipeline is flexible and efficient, allowing you to tailor ingestion to your needs:

  • Use fast for quick processing.
  • Use hi-res for high-quality, multimodal analysis.
  • Use custom for advanced, granular control.

You can easily ingest documents or pre-processed chunks, update their content, and delete them when no longer needed. Combined with powerful retrieval and knowledge graph capabilities, R2R enables seamless integration of advanced document management into your applications.
