Local LLMs

Run R2R with Local LLMs

Overview

There are many amazing LLMs and embedding models that can be run locally. R2R fully supports using these models, giving you full control over your data and infrastructure.

Running models locally can be ideal for sensitive data handling, reducing API costs, or situations where internet connectivity is limited. While cloud-based LLMs often provide cutting-edge performance, local models offer a compelling balance of capability, privacy, and cost-effectiveness for many use cases.

Local LLM features are currently restricted to:

  • Self-hosted instances
  • Enterprise tier cloud accounts

Contact our sales team for Enterprise pricing and features.


Serving Local Models

For this cookbook, we'll serve our local models with Ollama. Follow the instructions on their official website to install it.

You can also follow along using LM Studio. To get started with LM Studio, see our Local LLM documentation.

R2R uses LiteLLM to route embedding and completion requests. If you serve local models another way, any OpenAI-compatible endpoint can be called and routed to seamlessly.
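
For illustration, here is a minimal sketch of that routing using the LiteLLM Python package directly. It assumes the llama3.1 model and the Ollama server that we set up in the steps below; the api_base value is Ollama's default address.

# Minimal sketch: LiteLLM routes "ollama/"-prefixed model names to a local
# Ollama server. Assumes llama3.1 has been pulled and Ollama is running on
# its default port (11434).
from litellm import completion

response = completion(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    api_base="http://localhost:11434",
)
print(response.choices[0].message.content)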

We must first download the models that we wish to run and start our Ollama server. The following commands 'pull' the models; once started, the Ollama server listens at http://localhost:11434.

ollama pull llama3.1
ollama pull mxbai-embed-large

Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require larger window sizes.

It is recommended to set the context size to a minimum of 16k tokens. The following guidelines can help determine what your system can handle:

  • 8GB RAM/VRAM: ~4K-8K context
  • 16GB RAM/VRAM: ~16K-32K context
  • 24GB+ RAM/VRAM: 32K+ context

To change the default context window you must first create a Modelfile for Ollama, where you can set num_ctx:

echo 'FROM llama3.1
PARAMETER num_ctx 16000' > Modelfile

Then create the model from that Modelfile:

ollama create llama3.1 -f Modelfile

Then, we can start the Ollama server:

ollama serve
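
Before configuring R2R, it can be worth confirming that the server is reachable and that both models respond. The optional sanity check below calls Ollama's HTTP API from Python; the endpoint paths follow Ollama's documented API, but verify them against your installed version.

# Optional sanity check against the local Ollama server (default port 11434).
import requests

BASE = "http://localhost:11434"

# List the models that have been pulled.
tags = requests.get(f"{BASE}/api/tags").json()
print([model["name"] for model in tags.get("models", [])])

# Request an embedding from mxbai-embed-large.
emb = requests.post(
    f"{BASE}/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "hello world"},
).json()
print(len(emb["embedding"]))  # mxbai-embed-large produces 1024-dimensional vectors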

Configuring R2R

Now that our models have been loaded and our Ollama server is ready, we can launch our R2R server.

The standard distribution of R2R includes a configuration file for running llama3.1 and mxbai-embed-large. If you wish to utilize other models, you must create a custom config file and pass this to your server.

[agent]
system_instruction_name = "rag_agent"
tool_names = ["local_search"]

[agent.generation_config]
model = "ollama/llama3.1"

[completion]
provider = "litellm"
concurrent_request_limit = 1

[completion.generation_config]
model = "ollama/llama3.1"
temperature = 0.1
top_p = 1
max_tokens_to_sample = 1_024
stream = false
add_generation_kwargs = { }

[embedding]
provider = "ollama"
base_model = "mxbai-embed-large"
base_dimension = 1_024
batch_size = 128
add_title_as_prefix = true
concurrent_request_limit = 2

[database]
provider = "postgres"

[database.graph_creation_settings]
  graph_entity_description_prompt = "graphrag_entity_description"
  entity_types = [] # if empty, all entities are extracted
  relation_types = [] # if empty, all relations are extracted
  fragment_merge_count = 4 # number of fragments to merge into a single extraction
  max_knowledge_relationships = 100
  max_description_input_length = 65536
  generation_config = { model = "ollama/llama3.1" } # and other params, model used for relationship extraction

[database.graph_enrichment_settings]
  community_reports_prompt = "graphrag_community_reports"
  max_summary_input_length = 65536
  generation_config = { model = "ollama/llama3.1" } # and other params, model used for node description and graph clustering
  leiden_params = {}

[database.graph_search_settings]
  generation_config = { model = "ollama/llama3.1" }

[orchestration]
provider = "simple"

[ingestion]
vision_img_model = "ollama/llama3.2-vision"
vision_pdf_model = "ollama/llama3.2-vision"
chunks_for_document_summary = 16
document_summary_model = "ollama/llama3.1"

[ingestion.extra_parsers]
  pdf = "zerox"

We launch R2R by specifying this configuration file:

r2r serve --docker --config-name=local_llm

Since we're serving with Docker, the R2R dashboard opens for us once R2R successfully launches. We can upload a document and watch requests hit our Ollama server.

The R2R dashboard and Ollama server logs showing successful ingestion.
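
Documents can also be ingested programmatically. The sketch below assumes the R2R Python SDK, the default server address (http://localhost:7272), and a placeholder file path; the exact client method names vary between SDK versions, so check the client reference for your release.

# Sketch: ingest a document through the R2R Python SDK instead of the dashboard.
# Assumes the default server address; method names follow the v3-style client
# and may differ in your installed SDK version.
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Upload a local file; the ingestion requests are routed to the Ollama models
# configured above. The file path is a placeholder.
ingest_response = client.documents.create(file_path="my_document.pdf")
print(ingest_response)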

Extracting Entities and Relationships

If we’d like to build a graph for our document, we must first extract the entities and relationships that it contains. Through the dashboard we can select the ‘Document Extraction’ action in the documents table. This will start the extraction process in the background, which uses named entity recognition to find entities and relationships.
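
The same action can be triggered programmatically. This is a sketch under the same assumptions as before: the document ID is a placeholder, and the extraction method name is an assumption about the v3-style SDK, so confirm it against the client reference.

# Sketch: trigger entity and relationship extraction for a single document.
# The document ID is a placeholder; the method name may differ by SDK version.
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

document_id = "replace-with-your-document-id"  # shown in the documents table
client.documents.extract(document_id)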

Note that this process can take quite a bit of time, depending on the size of your document and the hardware running your model. Once the process is complete, we will see that the extraction status has turned green.

A successful extraction shown on the documents table.

Graph RAG

Now we must pull the document extractions into the graph. This is done at the collection level and creates a copy of our extractions to search over and build communities from.

Then, we can conduct search, RAG, or agent queries that utilize the graph.
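
Roughly, the programmatic equivalent looks like the sketch below: pull the extractions into the collection's graph, then issue a query. The collection ID is a placeholder, and the graph and retrieval method names are assumptions about the v3-style SDK, so verify them against the client reference.

# Sketch: pull document extractions into a collection-level graph, then query.
# The collection ID is a placeholder; method names are assumptions about the
# v3-style SDK.
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

collection_id = "replace-with-your-collection-id"

# Copy the document extractions into the collection's graph.
client.graphs.pull(collection_id=collection_id)

# Run a RAG query that can now draw on the graph's entities and relationships.
response = client.retrieval.rag(
    query="How do the main entities in this document relate to each other?",
)
print(response)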

A RAG search that includes entities and relationships from the graph.

Building communities

We can go one step further and create communities over the entities and relationships in the graph. By clustering closely related extractions, we deepen our understanding of how these entities and relationships interact. This can be particularly helpful across sets of documents with overarching or recurring themes.

We trigger the community-building procedure, which produces a number of communities. Now, when we run queries over our graph, we can use the communities to provide context that better captures the overall concepts and ideas throughout our documents.
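
As a rough SDK equivalent, under the same assumptions as the previous sketches (placeholder collection ID, v3-style method names that may differ in your release):

# Sketch: build communities over the collection's graph, then query again.
# The collection ID is a placeholder; method names are assumptions about the
# v3-style SDK.
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

collection_id = "replace-with-your-collection-id"

# Cluster the graph's entities and relationships into communities and summarize them.
client.graphs.build(collection_id=collection_id)

# Later queries can draw on the community summaries for broader context.
response = client.retrieval.rag(
    query="What are the overarching themes across these documents?",
)
print(response)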

A RAG query that utilizes communities.