Easy Local RAG with R2R: A Step-by-Step Guide

Introduction

R2R, short for "RAG to Riches," is a game-changing framework that simplifies the process of building applications with LLMs. With R2R, you can hit the ground running and have a system up locally in minutes, eliminating the need for complex cloud infrastructure or costly hosted services.

In this comprehensive, step-by-step tutorial, we'll guide you through the process of installing R2R, ingesting your documents, querying those docs using a local LLM, and tailoring the RAG pipeline to perfectly fit your unique requirements. By the end of this guide, you'll have a fully functional, locally-hosted, and readily deployable LLM application at your fingertips!

Running R2R with Docker

If you prefer using Docker, you can easily run R2R in a containerized environment. Here's how:

  1. Pull the latest R2R Docker image:

    docker pull emrgntcmplxty/r2r:latest
  2. Choose the appropriate configuration option based on your deployment:

    • For cloud deployment, select default and pass --env-file .env to provide the necessary environment variables.
    • For local deployment, select local_ollama.
  3. Run the R2R Docker container:

    docker run -d --name r2r_container -p 8000:8000 -e CONFIG_OPTION=local_ollama emrgntcmplxty/r2r:latest

    This command starts the R2R container in detached mode (-d), names it r2r_container, maps port 8000 from the container to the host (-p 8000:8000), and sets the configuration option to local_ollama using the -e flag.

Once the container is running, you can interact with the R2R API in the same way as described in the tutorial.
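
For a quick sanity check that the containerized API is reachable, you can hit the server from Python. This is a minimal sketch: it assumes the server exposes interactive API documentation at /docs (typical for FastAPI-based apps), so adjust the URL or path if your deployment differs.

import requests  # third-party: pip install requests

# Assumes the container from the previous step is running and publishing
# port 8000 on the host, as in the docker run command above.
resp = requests.get("http://localhost:8000/docs", timeout=10)
print(resp.status_code)  # 200 means the R2R server is up and reachable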

Setting Up Your Environment

R2R supports two popular approaches to local LLM inference: ollama and Llama.cpp. Llama.cpp is supported through a dedicated implementation in R2R, whereas ollama support is provided through a connection managed by the litellm library.

If you wish to use ollama, it must be installed independently. You can install ollama by following the instructions on their official website or by referring to their GitHub README.

Next, let's install R2R itself. We'll use pip to manage our Python dependencies. Run the following command:

pip install --upgrade pip
pip install 'r2r[eval,local-llm]'

This will install R2R along with the dependencies needed to run local LLMs.
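
To confirm the installation succeeded, a quick check from Python is enough. The snippet below only assumes the package was installed under the distribution name r2r, as in the pip command above.

import importlib.metadata

# Prints the installed R2R version; raises PackageNotFoundError if the
# install above did not complete.
print(importlib.metadata.version("r2r"))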

Pipeline Configuration

Let's move on to setting up the R2R pipeline. R2R relies on a config.json file for defining various settings, such as embedding models and chunk sizes. By default, the config.json found in the R2R GitHub repository's root directory is set up for cloud-based services.

For setting up an on-premises RAG system, we need to adjust the configuration to use local resources. This involves changing the embedding provider, selecting the appropriate LLM provider, and disabling evaluations.

To streamline this process, we've provided pre-configured local settings in the examples/configs directory, named local_ollama and local_llama_cpp. Here's an overview of the primary changes from the default configuration:

{
  "embedding": {
    "provider": "sentence-transformers",
    "model": "mixedbread-ai/mxbai-embed-large-v1",
    "dimension": 384,
    "batch_size": 32
  },
  "evals": {
    "provider": "none",
    "sampling_fraction": 0.0
  },
  "language_model": {
    "provider": "ollama"  // or "llama-cpp"
    // "model-name": "your-target-model.gguf" necessary for "llama-cpp"
    // default model-name is `tinyllama-1.1b-chat-v1.0.Q2_K.gguf`
  },
  ...
}

You may also modify the configuration defaults for ingestion, logging, and your vector database provider in a similar manner. More information on this follows below.

The configuration above instructs R2R to use the sentence-transformers library for embeddings with the mixedbread-ai/mxbai-embed-large-v1 model, turns off evals, and sets the LLM provider to ollama. During ingestion, documents are split by default into chunks of 512 characters with 20 characters of overlap between chunks.

A local vector database will be used to store the embeddings. The current default is a minimal SQLite implementation, with plans to migrate the tutorial to LanceDB shortly.
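
Before standing up the server, you can double-check the settings R2R will pick up by inspecting the config file directly. This is a minimal sketch using only the standard library; the file path is an assumption, so point it at whichever config you actually chose.

import json
from pathlib import Path

# Adjust this path to the config you are using (e.g., local_llama_cpp).
config_path = Path("r2r/examples/configs/local_ollama.json")
config = json.loads(config_path.read_text())

# Confirm the local settings described above.
print(config["embedding"]["provider"], config["embedding"]["model"])
print(config["language_model"]["provider"])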

Server Standup

To stand up the server, choose one of the two options below:

python -m r2r.examples.servers.configurable_pipeline --config local_ollama
python -m r2r.examples.servers.configurable_pipeline --config local_llama_cpp

Note: to run with Llama.cpp you must download tinyllama-1.1b-chat-v1.0.Q2_K.gguf here and place the file in ~/cache/model/. The choice between ollama and Llama.cpp comes down to your preferred LLM provider.

The server exposes an API for interacting with the RAG pipeline. See the API docs for details on the available endpoints.

Ingesting and Embedding Documents

With our environment set up and our server running in a separate process, we're ready to ingest a document! As an example, R2R includes the famous Stoic text "Meditations" by Marcus Aurelius as a PDF, which ships with the package.

Run this command to ingest the document:

python -m r2r.examples.clients.qna_rag_client ingest

The output should look something like this:

> Upload response =  {'message': "File 'meditations.pdf' processed and saved at '/Users/user/R2R/uploads/meditations.pdf'"}

Here's what's happening under the hood:

  1. R2R loads the included PDF and converts it to text using PyPDF2.
  2. It splits the text into chunks of 512 characters each, with 20 characters overlapping between chunks.
  3. Each chunk is embedded using the mixedbread-ai/mxbai-embed-large-v1 model from sentence-transformers.
  4. The chunks and embeddings are stored in the specified vector database, which defaults to a local SQLite database.

With just one command, we've gone from a raw document to an embedded knowledge base we can query. In addition to the raw chunks, metadata such as user ID or document ID can be attached to enable easy filtering later.
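
To make the chunking and embedding steps concrete, here is a rough sketch of the same idea in plain Python. It is illustrative only, not R2R's actual implementation: the character-window chunker mirrors the 512/20 defaults above, and the embedding call uses the sentence-transformers library directly.

from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows with a small overlap."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# Embed each chunk with the same model the local config specifies.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
chunks = chunk_text("...full text extracted from meditations.pdf...")
embeddings = model.encode(chunks)  # one vector per chunk, ready to store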

Running Queries on the Local LLM

Time for the fun part - asking questions!

To query our knowledge base using ollama, first make sure the llama2 model has been pulled and the ollama server is running:

ollama pull llama2
ollama serve

Then, to ask a question, run:

python -m r2r.examples.clients.qna_rag_client rag_completion \
  --query="What was Lyfts profit in 2020?" \
  --model="ollama/llama2"

For Llama.cpp, run:

python -m r2r.examples.clients.qna_rag_client rag_completion \
  --query="What was Lyfts profit in 2020?" \
  --model=tinyllama-1.1b-chat-v1.0.Q2_K.gguf

This command tells R2R to use the specified model to generate a completion for the given query. R2R will:

  1. Embed the query using mixedbread-ai/mxbai-embed-large-v1.
  2. Find the chunks most similar to the query embedding.
  3. Pass the query and relevant chunks to the LLM to generate a response.
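
Conceptually, the retrieval step boils down to comparing the query embedding against the stored chunk embeddings and keeping the closest matches. The sketch below is a simplified stand-in for what the pipeline does internally, using cosine similarity over embeddings like those produced during ingestion; it is not R2R's own search code.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

def top_k_chunks(query: str, chunks: list[str], embeddings: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = model.encode([query])[0]
    # Cosine similarity between the query vector and every chunk vector.
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

The retrieved chunks are then interpolated into a prompt and handed to ollama or Llama.cpp for the final completion.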

After a brief wait, you should see the LLM's response, perhaps something like:

{
  "message": {
    "search_results": [...],
    "content": "Lyft did not make a profit in 2020; instead, it reported a net loss of $1.8 billion."
  }
}

How cool is that? With R2R, we've built a simple QA system backed by a local LLM that we can converse with. The best part is, all the data and compute stayed on our own machine.

Configuring Your RAG Pipeline

R2R provides flexibility in customizing various aspects of the RAG pipeline to suit your needs. This can be done through the config.json file. Some key configuration options include:

Vector Database Provider

R2R supports multiple vector database providers:

  • local: A SQLite-based local vector database
  • qdrant: Integration with Qdrant
  • pgvector: Integration with pgvector extension for Postgres

Set the provider field under vector_database in config.json to specify your provider.

Embedding Provider

R2R supports OpenAI and local inference embedding providers:

  • openai: OpenAI models like text-embedding-3-small
  • sentence-transformers: HuggingFace models like mixedbread-ai/mxbai-embed-large-v1

Configure the embedding section to set your desired embedding model, dimension, and batch size.

Language Model Provider

  • openai: Models like gpt-3.5-turbo
  • litellm (default): Integrates with many providers (OpenAI, Anthropic, Vertex AI, etc.)
  • ollama: A specifically supported provider, with the connection managed by litellm

Logging Provider

R2R supports logging to Postgres, SQLite (local), and Redis. The logs capture pipeline execution information.

Check out the full R2R configuration docs for more details on all available options.

Customizing Your RAG Pipeline

The R2R library provides flexibility in customizing various aspects of the RAG pipeline to suit your specific needs. You can create custom implementations of the ingestion pipeline, embedding pipeline, RAG pipeline, and evaluation pipeline by subclassing the respective base classes.

For example, to create a custom ingestion pipeline:

from r2r.pipelines import IngestionPipeline
 
class CustomIngestionPipeline(IngestionPipeline):
    def process_data(self, entry_type, entry_data):
        # Custom processing logic
        ...

Then pass your custom pipeline to E2EPipelineFactory.create_pipeline() using the ingestion_pipeline_impl parameter:

# Adjust the import path to match your installed R2R version if needed.
from r2r.main import E2EPipelineFactory, R2RConfig

app = E2EPipelineFactory.create_pipeline(
    config=R2RConfig.load_config(),
    ingestion_pipeline_impl=CustomIngestionPipeline,
)

Similar subclassing patterns can be used for the embedding pipeline (EmbeddingPipeline), RAG pipeline (RAGPipeline), and evaluation pipeline (EvalPipeline). You can see an example of a custom pipeline, runnable in the same way as the previous examples in this tutorial, here.
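
By analogy with the ingestion example, a custom RAG pipeline follows the same subclassing shape. Treat the sketch below as hypothetical: the overridden method name and the rag_pipeline_impl keyword are assumptions made for illustration, so check the RAGPipeline base class and the create_pipeline signature for the exact names.

from r2r.main import E2EPipelineFactory, R2RConfig  # as in the previous example
from r2r.pipelines import RAGPipeline

class CustomRAGPipeline(RAGPipeline):
    # Hypothetical hook: override whichever method the base class exposes
    # for turning the query and retrieved chunks into the final prompt.
    def build_prompt(self, query, context):
        ...

# Hypothetical keyword, named by analogy with ingestion_pipeline_impl.
app = E2EPipelineFactory.create_pipeline(
    config=R2RConfig.load_config(),
    rag_pipeline_impl=CustomRAGPipeline,
)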

This allows you to deeply customize the behavior of ingestion, embedding, retrieval, and evaluation to fit your application's exact needs.

Scaling Beyond Local

While running locally is great for initial development and testing, you may quickly run into limitations as you scale up. Ingesting larger datasets, searching over more chunks, or supporting more concurrent users will require beefier hardware than your laptop provides.

Luckily, R2R makes it easy to transition from local development to a scalable cloud deployment. The same config.json you've been using can be easily modified to swap out local components for hosted services:

{
  "vector_database": {
    "provider": "qdrant"
  }
}

Then set QDRANT_HOST, QDRANT_PORT, and QDRANT_API_KEY as environment variables.
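
These can be exported in the shell that runs the server, or set from Python before launching it; the values below are placeholders for your own Qdrant instance.

import os

# Placeholder values: substitute the host, port, and API key of your own
# Qdrant instance. They must be visible to the process that runs the R2R server.
os.environ["QDRANT_HOST"] = "your-qdrant-host.example.com"
os.environ["QDRANT_PORT"] = "6333"  # Qdrant's default REST port
os.environ["QDRANT_API_KEY"] = "your-api-key"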

With these changes, R2R will store vectors in a remote Qdrant vector database while still using your local LLM for querying. You can incrementally adopt cloud services as needed while maintaining a consistent development experience.

Next Steps

In this tutorial, we've walked through the steps to get a local LLM application up and running with R2R:

  1. Installing dependencies
  2. Configuring the R2R pipeline
  3. Ingesting and embedding documents
  4. Running queries on a local LLM
  5. Customizing pipeline components

But we've only scratched the surface of what you can build with R2R! I encourage you to keep experimenting: try ingesting your own documents, tweaking model parameters, and exploring R2R's evaluation capabilities.

If you have any questions or want to learn more, R2R's GitHub repo, R2R Discord, and docs are great resources.

Happy building! I can't wait to see what you create with R2R and local LLMs.