Local LLMs

Introduction

R2R natively supports RAG with local LLMs through LM Studio and Ollama.

Follow along with our Local LLM cookbook for a full walkthrough on how to use R2R with local LLMs!

Ollama

To get started with Ollama, follow the installation instructions on the official Ollama website.

To run R2R with default Ollama settings, which utilize llama3.1 and mxbai-embed-large, run:

export R2R_CONFIG_NAME=ollama
python -m r2r.serve

Preparing Local LLMs

Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require a larger window.

We recommend setting the context size to at least 16k tokens. The following rough guideline can help you estimate what your system can handle:

8GB RAM/VRAM: ~4K-8K context
16GB RAM/VRAM: ~16K-32K context
24GB+ RAM/VRAM: 32K+ context
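
As a rough illustration, the guideline above can be written as a small lookup. This is only a sketch: the function name `suggested_context_range` is hypothetical, "K" is treated as 1,000 tokens (matching the `num_ctx 16000` value used on this page), and memory sizes that fall between tiers are assigned to the tier below them, which is an assumption rather than something the guideline states.

```python
from typing import Optional, Tuple

# (minimum RAM/VRAM in GB, (low, high) context range in tokens).
# A high value of None means "this many tokens or more".
GUIDELINE = [
    (24, (32_000, None)),
    (16, (16_000, 32_000)),
    (8, (4_000, 8_000)),
]


def suggested_context_range(memory_gb: float) -> Optional[Tuple[int, Optional[int]]]:
    """Return the guideline's context range for a memory size, or None below 8 GB."""
    for min_gb, context_range in GUIDELINE:
        if memory_gb >= min_gb:
            return context_range
    # The guideline gives no figure for less than 8 GB.
    return None


print(suggested_context_range(16))
```

Treat the result as a starting point: actual headroom also depends on the model size and on whatever else is using the memory.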

To change the default, first create a Modelfile for Ollama that sets num_ctx:

echo 'FROM llama3.1
PARAMETER num_ctx 16000' > Modelfile

Then create the model from that Modelfile:

ollama create llama3.1 -f Modelfile
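
If you manage several models or context sizes, it may be easier to generate the Modelfile text than to type it by hand. The sketch below is one way to do that; `build_modelfile` is a hypothetical helper, not part of R2R or Ollama, and it emits only the two directives shown above.

```python
def build_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that sets the context window for base_model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    # Same two directives as the echo command: a FROM line and a num_ctx parameter.
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


print(build_modelfile("llama3.1", 16000), end="")
```

Write the returned text to a file named Modelfile, then run the `ollama create` command as before.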

Next, make sure that you have all the necessary LLMs installed:

# in a separate terminal
ollama pull llama3.1
ollama pull mxbai-embed-large
ollama serve

If you deploy R2R with a customized configuration, replace the model names in these commands with the models your configuration uses.
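
Before starting R2R, it can help to confirm that the required models are actually installed. The sketch below queries Ollama's `/api/tags` endpoint and reports anything missing. It assumes the server is listening on Ollama's default address, `http://localhost:11434`, and that the response has the shape `{"models": [{"name": "llama3.1:latest"}, ...]}`; the helper names `missing_models` and `fetch_tags` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; change if your server differs
REQUIRED_MODELS = ["llama3.1", "mxbai-embed-large"]


def missing_models(required, tags_response):
    """Return the required models that do not appear in an /api/tags response."""
    names = [m["name"] for m in tags_response.get("models", [])]
    exact = set(names)
    # "llama3.1:latest" -> "llama3.1", so untagged requirements match any tag.
    bases = {name.split(":", 1)[0] for name in names}
    missing = []
    for model in required:
        found = model in exact if ":" in model else model in bases
        if not found:
            missing.append(model)
    return missing


def fetch_tags(base_url=OLLAMA_URL):
    """Fetch the list of locally installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With `ollama serve` running, `missing_models(REQUIRED_MODELS, fetch_tags())` returns an empty list when both default models are installed.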

Configuration

R2R uses a TOML configuration file for managing settings, which you can read about here. For local setup, we’ll use the default ollama configuration. This can be customized to your needs by setting up a standalone project.

Local Configuration Details

The ollama configuration file (core/configs/ollama.toml) includes:

[completion]
provider = "litellm"
concurrent_request_limit = 1

  [completion.generation_config]
  model = "ollama/llama3.1"
  temperature = 0.1
  top_p = 1
  max_tokens_to_sample = 1_024
  stream = false
  add_generation_kwargs = { }

[database]
provider = "postgres"

[embedding]
provider = "ollama"
base_model = "mxbai-embed-large"
base_dimension = 1_024
batch_size = 32
add_title_as_prefix = true
concurrent_request_limit = 32

[ingestion]
excluded_parsers = [ "mp4" ]

For more information on how to configure R2R, visit here.

We are still working on adding local multimodal RAG features. Your feedback would be appreciated.

The ingestion and graph creation process has been tested across different language models. When selecting a model, consider the tradeoff between performance and model size. Larger models often generate more detailed graphs with more elements, while smaller models may be more efficient but produce simpler graphs.

Model	Entities	Relationships
llama3.1:8B	76	60
llama3.2:3B	29	29

Summary

The above steps are all you need to get RAG up and running with local LLMs in R2R. For detailed setup and basic functionality, refer back to the R2R Quickstart. For more advanced usage and customization options, refer to the basic configuration or join the R2R Discord community.