Multimodal RAG features in R2R are still under development. Multimodal features with local LLMs are currently disabled.


R2R supports ingesting images, audio, and video. These multimodal features are implemented by the ImageParser, AudioParser, and MovieParser. They are currently powered by OpenAI multimodal models, with ongoing work to expand support to other providers and local LLMs.

Processing Different Modalities

R2R handles various types of media files:

  • Audio Files: Transcribed using whisper-1
  • Image Files: Described using gpt-4o
  • Video Files: Processed using both transcripts and frame slices
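
As a rough illustration of the routing in the list above, the sketch below maps a file extension to a modality, the parser name, and the model mentioned in this guide. The extension sets, the `Route` type, and the `route_file` helper are assumptions made for this example and are not R2R's actual parser registry.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Route(NamedTuple):
    modality: str
    parser: str            # parser name as mentioned in this guide
    model: Optional[str]   # model named in this guide, if any


# Hypothetical extension table; the extensions R2R really accepts may differ.
_EXTENSIONS = {
    "audio": {".mp3", ".wav", ".m4a"},
    "image": {".png", ".jpg", ".jpeg"},
    "video": {".mp4", ".mov"},
}

_ROUTES = {
    "audio": Route("audio", "AudioParser", "whisper-1"),
    "image": Route("image", "ImageParser", "gpt-4o"),
    # Video combines transcripts and frame slices; no single model is named here.
    "video": Route("video", "MovieParser", None),
}


def route_file(path: str) -> Optional[Route]:
    """Return the route for a file, or None if the extension is not multimodal."""
    suffix = Path(path).suffix.lower()
    for modality, extensions in _EXTENSIONS.items():
        if suffix in extensions:
            return _ROUTES[modality]
    return None
```

Calling `route_file("talk.mp3")` returns the audio route, while a file such as `notes.txt` returns `None` and would fall through to the regular text pipeline.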

All results are chunked and embedded using the default pipeline settings. If knowledge graphs are enabled, all ingested data is also used to populate your knowledge graph.
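
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap applied to a transcript or image description. It is character-based for simplicity, and the `chunk_text` helper, `chunk_size`, and `overlap` values are placeholders for illustration, not R2R's default pipeline settings.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that context spanning
    a boundary appears in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each resulting chunk would then be passed to the embedding model. A token-based splitter is usually a better fit for real embedding limits; characters are used here only to keep the example dependency-free.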

Ingesting Multimodal Data


R2R’s multimodal RAG capabilities enable:

  • Ingestion and processing of diverse data types (images, audio, video)
  • Use of OpenAI models (whisper-1 for transcription, gpt-4o for description)
  • Chunking and embedding of the results for search and retrieval

The provided examples demonstrate how R2R handles different file types, generating contextually relevant search results across modalities.

R2R is continuously evolving to include more robust features and support for additional data types. Your feedback is invaluable in this development process.

For detailed setup and basic functionality, refer to the R2R Quickstart. For more advanced usage and customization options, join the R2R Discord community.