Ingestion
Learn how to ingest, update, and delete documents with R2R
Introduction
R2R provides a powerful and flexible ingestion pipeline to process and manage various types of documents. It supports a wide range of file formats—text, documents, PDFs, images, audio, and even video—and transforms them into searchable, analyzable content. The ingestion process includes parsing, chunking, embedding, and optionally extracting entities and relationships for knowledge graph construction.
This cookbook will guide you through:
- Ingesting files, raw text, or pre-processed chunks
- Choosing an ingestion mode (fast, hi-res, or custom)
- Updating and deleting documents and chunks
For more on configuring ingestion, see the Ingestion Configuration Overview.
Ingestion Modes
R2R offers three primary ingestion modes to tailor the process to your requirements:
- fast:
  A speed-oriented ingestion mode that prioritizes rapid processing with minimal enrichment. Summaries and some advanced parsing are skipped, making this ideal for quickly processing large volumes of documents.
- hi-res:
  A comprehensive, high-quality ingestion mode that may leverage multimodal foundation models (visual language models) for parsing complex documents and PDFs, even integrating image-based content.
  - On a light deployment, R2R uses its built-in (r2r) parser.
  - On a full deployment, it can use unstructured_local or unstructured_api for more robust parsing and advanced features.
  Choose hi-res mode if you need the highest quality extraction, including image-to-text analysis and richer semantic segmentation.
- custom:
  For advanced users who require fine-grained control. In custom mode, you provide a full ingestion_config dict or object to specify every detail: parser options, chunking strategy, character limits, and more.
Example Usage:
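A minimal sketch of how the three modes might be selected from Python. The helper below only builds keyword arguments and makes no network calls. build_ingest_kwargs is a hypothetical name, and passing its result to client.documents.create assumes the R2R Python SDK accepts file_path, ingestion_mode, and ingestion_config parameters. The keys inside the custom configuration are illustrative assumptions as well; check the configuration reference for the ones your deployment supports.

```python
from typing import Any, Dict, Optional

VALID_MODES = ("fast", "hi-res", "custom")


def build_ingest_kwargs(
    file_path: str,
    mode: str = "hi-res",
    ingestion_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a document-creation call (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown ingestion mode: {mode!r}")
    if mode == "custom" and not ingestion_config:
        # custom mode expects the caller to spell out the configuration
        raise ValueError("custom mode requires an ingestion_config")

    kwargs: Dict[str, Any] = {"file_path": file_path, "ingestion_mode": mode}
    if ingestion_config:
        kwargs["ingestion_config"] = dict(ingestion_config)  # defensive copy
    return kwargs


# Quick processing of a large batch
fast_kwargs = build_ingest_kwargs("path/to/file.txt", mode="fast")

# Full control: every key below is an assumed, illustrative setting
custom_kwargs = build_ingest_kwargs(
    "path/to/file.txt",
    mode="custom",
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "max_characters": 512,
        "overlap": 100,
    },
)

# With a configured client (assumed SDK call, not executed here):
# client.documents.create(**custom_kwargs)
```

Keeping request construction separate from the client call makes it easy to validate the mode before anything is sent to the server.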
Ingesting Documents
A Document represents ingested content in R2R. When you ingest a file, text, or chunks:
- The file (or text) is parsed into text.
- Text is chunked into manageable units.
- Embeddings are generated for semantic search.
- Content is stored for retrieval and optionally linked to the knowledge graph.
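To make the chunking step concrete, here is a toy character-based chunker with overlap. It is an illustration only, not R2R's implementation: the function name and the chunk_size and overlap parameters are assumptions, and real parsers typically split on semantic boundaries such as titles or paragraphs rather than fixed character counts.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role overlap plays in preserving context across chunk boundaries.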
In a full R2R installation, ingestion is asynchronous. You can monitor ingestion status and confirm when documents are ready:
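One way to wait for an asynchronous ingestion is to poll the document's status until it settles. The sketch below is deliberately client-agnostic: fetch_status is a hypothetical callable you supply (for example, one that looks up the document through the SDK and returns its ingestion_status), and treating "failed" as the terminal failure state is an assumption.

```python
import time
from typing import Callable, Tuple


def wait_for_ingestion(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    failure_statuses: Tuple[str, ...] = ("failed",),  # assumed failure state name
) -> str:
    """Poll fetch_status() until it returns "success", fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "success":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"ingestion ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion still {status!r} after {timeout_s}s")
        time.sleep(poll_interval_s)


# Stand-in for a real status lookup; the intermediate state names are made up.
fake_statuses = iter(["pending", "parsing", "success"])
result = wait_for_ingestion(lambda: next(fake_statuses), poll_interval_s=0.01)
print(result)  # success
```

Passing the lookup in as a callable keeps the waiting logic testable without a running server.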
An ingestion_status of "success" confirms the document is fully ingested. You can also check the R2R dashboard at http://localhost:7273 for ingestion progress and status.
For more details on creating documents, refer to the Create Document API.
Ingesting Pre-Processed Chunks
If you have pre-processed chunks from your own pipeline, you can directly ingest them. This is especially useful if you’ve already divided content into logical segments.
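As a sketch of preparing such chunks, the snippet below splits text on blank lines into paragraph-level segments and packages them for ingestion. split_into_chunks is a hypothetical helper, and the commented call assumes the SDK's document-creation method accepts a chunks list alongside ingestion_mode.

```python
import re
from typing import List


def split_into_chunks(text: str) -> List[str]:
    """Split on blank lines, collapse inner whitespace, and drop empty segments."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


raw = """This is my first
parsed chunk.

This is my second parsed chunk.

"""
chunks = split_into_chunks(raw)
print(chunks)
# ['This is my first parsed chunk.', 'This is my second parsed chunk.']

# With a configured client (assumed SDK call, not executed here):
# client.documents.create(chunks=chunks, ingestion_mode="fast")
```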
For more on ingesting chunks, see the Create Chunks API.
Deleting Documents and Chunks
To remove documents or chunks, call their respective delete methods:
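A minimal sketch of batch deletion. It assumes a client object exposing documents.delete(id) and chunks.delete(id); the recording stub below stands in for a real client so the example runs without a server, and delete_all is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def delete_all(client, document_ids: Iterable[str] = (), chunk_ids: Iterable[str] = ()) -> int:
    """Delete the given chunks, then the given documents; return the number of calls made."""
    count = 0
    for chunk_id in chunk_ids:
        client.chunks.delete(chunk_id)
        count += 1
    for document_id in document_ids:
        client.documents.delete(document_id)
        count += 1
    return count


# Stub client that records calls instead of contacting a server.
calls: List[Tuple[str, str]] = []
stub_client = SimpleNamespace(
    documents=SimpleNamespace(delete=lambda doc_id: calls.append(("document", doc_id))),
    chunks=SimpleNamespace(delete=lambda chunk_id: calls.append(("chunk", chunk_id))),
)

# The IDs are placeholders; real IDs come from earlier ingestion responses.
deleted = delete_all(stub_client, document_ids=["doc-1"], chunk_ids=["chunk-1", "chunk-2"])
print(deleted, calls)
```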
You can also delete documents by specifying filters using the by-filter route.
Additional Configuration & Concepts
- Light vs. Full Deployments:
  - Light (default) uses R2R’s built-in parser and supports synchronous ingestion.
  - Full deployments orchestrate ingestion tasks asynchronously and integrate with more complex providers like unstructured_local.
- Provider Configuration:
  Settings in r2r.toml or at runtime (ingestion_config) can adjust parsing and chunking strategies:
  - fast and hi-res modes are influenced by strategies like "auto" or "hi_res" in the unstructured provider.
  - custom mode allows you to override chunk size, overlap, excluded parsers, and more at runtime.
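The relationship between server-side defaults and runtime overrides can be pictured as a dictionary merge. This is a conceptual sketch rather than R2R's actual resolution logic: the key names (chunk_size, chunk_overlap, excluded_parsers) and the default values are assumptions chosen to mirror the options mentioned above.

```python
from typing import Any, Dict, Optional

# Stand-in for ingestion defaults that would normally live in r2r.toml
# (assumed keys and values).
SERVER_DEFAULTS: Dict[str, Any] = {
    "provider": "r2r",
    "chunk_size": 1024,
    "chunk_overlap": 512,
    "excluded_parsers": [],
}


def resolve_ingestion_config(
    defaults: Dict[str, Any],
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new config in which runtime overrides take precedence over defaults."""
    resolved = dict(defaults)  # shallow copy so the defaults stay untouched
    resolved.update(overrides or {})
    return resolved


runtime_overrides = {"chunk_size": 512, "chunk_overlap": 64, "excluded_parsers": ["mp4"]}
effective = resolve_ingestion_config(SERVER_DEFAULTS, runtime_overrides)
print(effective)
```

Keys absent from the overrides, such as provider here, fall back to the server-side values.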
For detailed configuration options, see the Ingestion Configuration Overview.
Conclusion
R2R’s ingestion pipeline is flexible and efficient, allowing you to tailor ingestion to your needs:
- Use fast for quick processing.
- Use hi-res for high-quality, multimodal analysis.
- Use custom for advanced, granular control.
You can easily ingest documents or pre-processed chunks, update their content, and delete them when no longer needed. Combined with powerful retrieval and knowledge graph capabilities, R2R enables seamless integration of advanced document management into your applications.