Local LLMs
Run R2R with Local LLMs
Overview
There are many amazing LLMs and embedding models that can be run locally. R2R fully supports using these models, giving you full control over your data and infrastructure.
Running models locally can be ideal for sensitive data handling, reducing API costs, or situations where internet connectivity is limited. While cloud-based LLMs often provide cutting-edge performance, local models offer a compelling balance of capability, privacy, and cost-effectiveness for many use cases.
Local LLM features are currently restricted to:
- Self-hosted instances
- Enterprise tier cloud accounts
Contact our sales team for Enterprise pricing and features.
Serving Local Models
For this cookbook, we’ll serve our local models via Ollama. You may follow the instructions on their official website to install it.
You can also follow along using LM Studio. To get started with LM Studio, see our Local LLM documentation.
R2R supports LiteLLM for routing embedding and completion requests, so if you are serving local models another way, any OpenAI-compatible endpoint can be called and routed to seamlessly.
We must first download the models that we wish to run and start our Ollama server. The following commands will pull the models and start the Ollama server at http://localhost:11434.
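A minimal sketch of these commands, assuming the llama3.1 and mxbai-embed-large models that the default R2R configuration expects:

```bash
# Pull the chat and embedding models used by the default local configuration
ollama pull llama3.1
ollama pull mxbai-embed-large

# Start the Ollama server at http://localhost:11434
# (skip this if your installation already runs Ollama as a background service)
ollama serve
```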
Ollama has a default context window size of 2048 tokens. Many of the prompts and processes that R2R uses require larger window sizes.
We recommend setting the context size to a minimum of 16k tokens. The following guideline is generally useful for determining what your system can handle:
- 8GB RAM/VRAM: ~4K-8K context
- 16GB RAM/VRAM: ~16K-32K context
- 24GB+ RAM/VRAM: 32K+ context
To change the default context window, you must first create a Modelfile for Ollama, where you can set num_ctx:
Then you must create a manifest for that model:
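Assuming the Modelfile above sits in the current directory and we keep the llama3.1 name, this can be done with:

```bash
# Build the model manifest from the Modelfile, registering it under the llama3.1 tag
ollama create llama3.1 -f Modelfile
```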
Then, we can start the Ollama server:
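If it is not already running from the earlier step, the server can be started directly:

```bash
# Serve models at http://localhost:11434 (the default address)
ollama serve
```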
Configuring R2R
Now that our models have been loaded and our Ollama server is ready, we can launch our R2R server.
The standard distribution of R2R includes a configuration file for running llama3.1 and mxbai-embed-large. If you wish to utilize other models, you must create a custom config file and pass this to your server.
local_llm.toml
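The snippet below is only an illustrative sketch of this file’s general shape, assuming LiteLLM-routed completions and Ollama-served embeddings; section and key names vary between R2R versions, so treat the local_llm.toml shipped with your distribution as the authoritative reference.

```toml
# Illustrative sketch only -- key names may differ between R2R versions
[completion]
provider = "litellm"

  [completion.generation_config]
  model = "ollama/llama3.1"         # routed to the local Ollama server

[embedding]
provider = "ollama"
base_model = "mxbai-embed-large"    # mxbai-embed-large produces 1024-dim vectors
base_dimension = 1024
```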
We launch R2R by specifying this configuration file:
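A sketch of the launch command, assuming the bundled config name and the flags exposed by recent versions of the R2R CLI (check r2r serve --help for your install):

```bash
# Flag names may differ between R2R versions
r2r serve --docker --config-name=local_llm
```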
Since we’re serving with Docker, the R2R dashboard opens once R2R successfully launches. We can upload a document and watch requests hit our Ollama server.
Retrieval and Search
Now that we have ingested our file, we can perform RAG and chunk search over it. Here, we see that we are able to get relevant results and correct answers—all without needing to make a request out to an external provider!
Local RAG
Local Search
Extracting Entities and Relationships
If we’d like to build a graph for our document, we must first extract the entities and relationships that it contains. Through the dashboard we can select the ‘Document Extraction’ action in the documents table. This will start the extraction process in the background, which uses named entity recognition to find entities and relationships.
Note that this process can take quite a bit of time, depending on the size of your document and the hardware running your model. Once the process is complete, we will see that the extraction status has turned green.
Successful Extraction
Extracted Entities
Extracted Relationships
Graph RAG
Now we must pull the document extractions into the graph. This is done at the collection level, and creates a copy of our extractions for searching over and creating communities with.
Then, we can conduct search, RAG, or agent queries that utilize the graph.
Graph RAG
Pulling Extractions into Graph
Building Communities
We can go one step further and create communities over the entities and relationships in the graph. By clustering closely related extractions, we can further develop our understanding of how these entities and relationships interact. This can be particularly helpful across sets of documents with overarching or recurring themes.
We trigger the community-building procedure, which produces a number of communities. Now, when we run queries over our graph, we can utilize the communities to provide context that better encompasses the overall concepts and ideas throughout our documents.