Local LLMs

Learn how to run a Retrieval-Augmented Generation system locally using R2R

Introduction

R2R natively supports RAG with local LLMs through LM Studio and Ollama.

To get started with Ollama, follow the installation instructions on their official website.

To run R2R with the default Ollama settings, which use llama3.1 for completions and mxbai-embed-large for embeddings, execute:

r2r serve --docker --config-name=ollama
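Once the containers are up, you can sanity-check that the server is reachable. The port and path below reflect R2R's defaults at the time of writing (7272 and /v2/health); verify them against your installed version:

# confirm the R2R server is responding
curl http://localhost:7272/v2/health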

Preparing Local LLMs

Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require larger window sizes.

We recommend setting the context size to at least 16k tokens. The following guidelines can help you determine what your system can handle; a quick way to inspect a model's current context length is shown after the list:

  • 8GB RAM/VRAM: ~4K-8K context
  • 16GB RAM/VRAM: ~16K-32K context
  • 24GB+ RAM/VRAM: 32K+ context
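After pulling a model (the pull commands appear below), you can check the context length it reports with ollama show; the exact fields in the output vary across Ollama versions:

# inspect model metadata, including its context length
ollama show llama3.1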

To change the default, first create a Modelfile for Ollama and set num_ctx in it:

echo 'FROM llama3.1
PARAMETER num_ctx 16000' > Modelfile

Then create the model from that Modelfile:

ollama create llama3.1 -f Modelfile
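You can verify that the parameter took effect by printing the new model's Modelfile back out:

# confirm num_ctx is set on the model
ollama show llama3.1 --modelfile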

Next, make sure that you have pulled the necessary models and that the Ollama server is running:

# in a separate terminal
ollama pull llama3.1
ollama pull mxbai-embed-large
ollama serve

When deploying R2R with a customized configuration, replace these models with the ones named in your config.
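With ollama serve running, you can smoke-test both models directly against Ollama's HTTP API, which listens on port 11434 by default:

# test the completion model
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "Hello", "stream": false}'

# test the embedding model
curl http://localhost:11434/api/embeddings -d '{"model": "mxbai-embed-large", "prompt": "Hello"}'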

Configuration

R2R manages its settings through a TOML configuration file, which you can read about here. For local setup, we’ll use the default local_llm configuration; it can be customized to your needs by setting up a standalone project.

The local_llm configuration file (core/configs/local_llm.toml) includes:

[completion]
provider = "litellm"
concurrent_request_limit = 1

  [completion.generation_config]
  model = "ollama/llama3.1"
  temperature = 0.1
  top_p = 1
  max_tokens_to_sample = 1_024
  stream = false
  add_generation_kwargs = { }

[database]
provider = "postgres"

[embedding]
provider = "ollama"
base_model = "mxbai-embed-large"
base_dimension = 1_024
batch_size = 32
add_title_as_prefix = true
concurrent_request_limit = 32

[ingestion]
excluded_parsers = [ "mp4" ]
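To run with a modified copy of this file instead of the built-in local_llm configuration, you can point the server at your own TOML. The --config-path flag below is an assumption based on R2R's CLI at the time of writing; confirm it with r2r serve --help:

# serve R2R with a custom configuration file (flag name may vary by version)
r2r serve --docker --config-path=my_r2r.toml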

We are still working on adding local multimodal RAG features. Your feedback would be appreciated.

For more information on how to configure R2R, visit here.

Summary

The above steps are all you need to get RAG up and running with local LLMs in R2R. For detailed setup and basic functionality, refer back to the R2R Quickstart. For more advanced usage and customization options, refer to the basic configuration or join the R2R Discord community.
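As a final check, you can exercise the pipeline end to end from the command line. The commands below follow the pattern in the R2R Quickstart; exact command names vary across R2R versions, so confirm with r2r --help, and note that my_document.txt is a placeholder for any file you want to index:

# ingest a document, then ask a question grounded in it
r2r ingest-files my_document.txt
r2r rag --query="What are the key points of this document?"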
