Local LLMs
Run R2R with Local LLMs
Overview
There are many amazing LLMs and embedding models that can be run locally. R2R fully supports using these models, giving you full control over your data and infrastructure.
Running models locally can be ideal for sensitive data handling, reducing API costs, or situations where internet connectivity is limited. While cloud-based LLMs often provide cutting-edge performance, local models offer a compelling balance of capability, privacy, and cost-effectiveness for many use cases.
Local LLM features are currently restricted to:
- Self-hosted instances
- Enterprise tier cloud accounts
Contact our sales team for Enterprise pricing and features.
Serving Local Models
For this cookbook, we’ll serve our local models via Ollama. You may follow the instructions on their official website to install it.
You can also follow along using LM Studio. To get started with LM Studio, see our Local LLM documentation.
R2R supports LiteLLM for routing embedding and completion requests, so if you are serving local models another way, any OpenAI-compatible endpoint can be called and routed to seamlessly.
We must first download the models that we wish to run and start our Ollama server. The following commands will pull the models and start the Ollama server at http://localhost:11434.
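A minimal sketch of these commands, assuming the llama3.1 and mxbai-embed-large models that the default R2R configuration expects:

```bash
# Pull the chat and embedding models used by the default local configuration
ollama pull llama3.1
ollama pull mxbai-embed-large

# Start the Ollama server at http://localhost:11434
# (skip this if your installation already runs Ollama as a background service)
ollama serve
```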
Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require larger window sizes.
We recommend setting the context size to a minimum of 16k tokens. The following guideline is generally useful for determining what your system can handle:
- 8GB RAM/VRAM: ~4K-8K context
- 16GB RAM/VRAM: ~16K-32K context
- 24GB+ RAM/VRAM: 32K+ context
To change the default context window, you must first create a Modelfile for Ollama, where you can set num_ctx:
Then you must create a manifest for that model:
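Assuming the Modelfile above sits in the current directory and we keep the llama3.1 name, this can be done with:

```bash
# Build the model manifest from the Modelfile, registering it under the llama3.1 tag
ollama create llama3.1 -f Modelfile
```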
Then, we can start the Ollama server:
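If it is not already running from the earlier step, the server can be started directly:

```bash
# Serve models at http://localhost:11434 (the default address)
ollama serve
```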
Configuring R2R
Now that our models have been loaded and our Ollama server is ready, we can launch our R2R server.
The standard distribution of R2R includes a configuration file for running llama3.1 and mxbai-embed-large. If you wish to utilize other models, you must create a custom config file and pass this to your server.
local_llm.toml
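The snippet below is only an illustrative sketch of this file’s general shape, assuming LiteLLM-routed completions and Ollama-served embeddings; section and key names vary between R2R versions, so treat the local_llm.toml shipped with your distribution as the authoritative reference.

```toml
# Illustrative sketch only -- key names may differ between R2R versions
[completion]
provider = "litellm"

  [completion.generation_config]
  model = "ollama/llama3.1"         # routed to the local Ollama server

[embedding]
provider = "ollama"
base_model = "mxbai-embed-large"    # mxbai-embed-large produces 1024-dim vectors
base_dimension = 1024
```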
We launch R2R by specifying this configuration file:
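A sketch of the launch command, assuming the bundled config name and the flags exposed by recent versions of the R2R CLI (check r2r serve --help for your install):

```bash
# Flag names may differ between R2R versions
r2r serve --docker --config-name=local_llm
```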
Since we’re serving with Docker, the R2R dashboard opens once R2R successfully launches. We can upload a document and watch requests hit our Ollama server.
Retrieval and Search
Now that we have ingested our file, we can perform RAG and chunk search over it. Here, we see that we are able to get relevant results and correct answers—all without needing to make a request out to an external provider!
Local RAG
Local Search
Extracting Entities and Relationships
If we’d like to build a graph for our document, we must first extract the entities and relationships that it contains. Through the dashboard we can select the ‘Document Extraction’ action in the documents table. This will start the extraction process in the background, which uses named entity recognition to find entities and relationships.
Note that this process can take quite a bit of time, depending on the size of your document and the hardware running your model. Once the process is complete, we will see that the extraction status has turned green.
Successful Extraction
Extracted Entities
Extracted Relationships
Graph RAG
Now we must pull the document extractions into the graph. This is done at the collection level, and creates a copy of our extractions for searching over and creating communities with.
Then, we can conduct search, RAG, or agent queries that utilize the graph.
Graph RAG
Pulling Extractions into Graph
Building Communities
We can go one step further and create communities over the entities and relationships in the graph. By clustering closely related extractions, we can further develop our understanding of how these entities and relationships interact. This can be particularly helpful across sets of documents with overarching or recurring themes.
We trigger the community-building procedure, which produces a number of communities. Now, when we run queries over our graph, we can utilize the communities to provide context that better encompasses the overall concepts and ideas throughout our documents.