Local LLMs
Learn how to run a Retrieval-Augmented Generation system locally using R2R
Introduction
R2R natively supports RAG with local LLMs through LM Studio and Ollama.
Ollama
To get started with Ollama, follow the installation instructions on its official website.
To run R2R with the default Ollama settings, which use llama3.1 and mxbai-embed-large, execute r2r serve --docker --config-name=ollama.
Preparing Local LLMs
Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require a larger window.
We recommend setting the context size to at least 16k tokens. The following rough guideline helps estimate what your system can handle:
- 8GB RAM/VRAM: ~4K-8K context
- 16GB RAM/VRAM: ~16K-32K context
- 24GB+ RAM/VRAM: 32K+ context
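The guideline above can be turned into a small lookup. This is only a sketch: suggest_num_ctx is a hypothetical helper, not part of R2R or Ollama, and it picks the low end of each range as a conservative starting point. Below 8GB it falls back to Ollama's 2048-token default.

```shell
# Hypothetical helper: map available RAM/VRAM (whole GB) to a conservative num_ctx.
# The thresholds mirror the guideline list; the returned values are the low end
# of each range and are an assumption, not a measured limit.
suggest_num_ctx() {
  mem_gb="$1"
  if [ "$mem_gb" -ge 24 ]; then
    echo 32768
  elif [ "$mem_gb" -ge 16 ]; then
    echo 16384
  elif [ "$mem_gb" -ge 8 ]; then
    echo 4096
  else
    echo 2048
  fi
}

suggest_num_ctx 16   # prints 16384
```

Treat the result as a starting value and lower it if the model fails to load or responses slow down noticeably.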
To change the default, you must first create a Modelfile for Ollama, where you can set num_ctx:
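A minimal sketch of this step, assuming the base model is llama3.1 (the default LLM named earlier) and a context size of 16000 tokens. Both values are assumptions to adjust for your model and hardware.

```shell
# Write a Modelfile that derives from llama3.1 and raises the context window.
# Assumption: 16000 tokens, in line with the 16k minimum recommended above.
cat > Modelfile <<'EOF'
FROM llama3.1
PARAMETER num_ctx 16000
EOF

# Print the file as a quick sanity check.
cat Modelfile
```

FROM names the base model and PARAMETER num_ctx sets the context window that Ollama allocates for it.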
Then you must create a manifest for that model:
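One way to do this, assuming a file named Modelfile in the current directory and that the customized model keeps the name llama3.1 so the default configuration picks it up. The variable name and the PATH guard are illustrative only.

```shell
# Assumption: reuse the name llama3.1 so the default R2R settings need no change.
MODEL_NAME="llama3.1"

if command -v ollama >/dev/null 2>&1; then
  # Register the model described by ./Modelfile under MODEL_NAME.
  ollama create "$MODEL_NAME" -f Modelfile
else
  echo "ollama not found on PATH; install it before creating the model" >&2
fi
```

If you prefer to keep the stock model untouched, choose a different name here and reference that name in your R2R configuration instead.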
Next, make sure that you have all the necessary LLMs installed:
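For the default setup this means the two models named earlier, llama3.1 and mxbai-embed-large. The loop below is a sketch; the MODELS variable is an illustrative name, and the list should match whatever your configuration references.

```shell
# Models used by the default Ollama setup: one LLM and one embedding model.
MODELS="llama3.1 mxbai-embed-large"

if command -v ollama >/dev/null 2>&1; then
  for model in $MODELS; do
    # Download the model if it is not already present locally.
    ollama pull "$model"
  done
  # List local models to confirm both are installed.
  ollama list
else
  echo "ollama not found on PATH; install it before pulling models" >&2
fi
```

Pulling requires the Ollama server to be running, for example through ollama serve or the desktop application.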
When deploying R2R with a customized configuration, replace the model names in these commands with the ones your configuration uses.
Configuration
R2R uses a TOML configuration file for managing settings, which you can read about here. For local setup, we’ll use the default local_llm configuration. This can be customized to your needs by setting up a standalone project.
Local Configuration Details
The local_llm configuration file (core/configs/local_llm.toml) includes:
We are still working on adding local multimodal RAG features. Your feedback would be appreciated.
For more information on how to configure R2R, visit here.
Summary
The above steps are all you need to get RAG up and running with local LLMs in R2R. For detailed setup and basic functionality, refer back to the R2R Quickstart. For more advanced usage and customization options, refer to the basic configuration or join the R2R Discord community.