LLMs

Configure your LLM provider

Language Model System

R2R uses Large Language Models (LLMs) as the core reasoning engine for RAG operations, providing sophisticated text generation and analysis capabilities.

R2R uses LiteLLM to route LLM requests because of its provider flexibility. Read more about LiteLLM here.

LLM Configuration

The LLM system can be customized through the completion section in your r2r.toml file:

r2r.toml
[app]
# LLM used for internal operations, like deriving conversation names
fast_llm = "openai/gpt-4o-mini"

# LLM used for user-facing output, like RAG replies
quality_llm = "openai/gpt-4o"

# LLM used for ingesting visual inputs
vlm = "openai/gpt-4o"

# LLM used for transcription
audio_lm = "openai/whisper-1"

...

[completion]
provider = "r2r" # defaults to "r2r" with "litellm" fallback
concurrent_request_limit = 16 # defaults to 256

  [completion.generation_config]
  temperature = 0.1 # defaults to 0.1
  top_p = 1 # defaults to 1
  max_tokens_to_sample = 1_024 # defaults to 1_024
  stream = false # defaults to false
  add_generation_kwargs = {} # defaults to {}

Depending on your chosen provider, the relevant environment variables for the configuration above are OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_API_KEY, and so on.

Advanced LLM Features in R2R

R2R leverages several advanced LLM features to provide robust text generation:

Concurrent Request Management

The system implements sophisticated request handling with rate limiting and concurrency control; a minimal sketch of this pattern follows the list below:

  1. Rate Limiting: Prevents API throttling through intelligent request scheduling
  2. Concurrent Processing: Manages multiple LLM requests efficiently
  3. Error Handling: Implements retry logic with exponential backoff
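
The following is a minimal, illustrative sketch of this pattern in Python, not R2R's internal implementation: an asyncio semaphore caps the number of in-flight requests (mirroring the concurrent_request_limit setting), and failed calls are retried with exponential backoff plus jitter. The fake_llm_call helper is a hypothetical stand-in for a real provider call.

import asyncio
import random

CONCURRENT_REQUEST_LIMIT = 16  # mirrors the concurrent_request_limit example above

async def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a provider call; fails ~20% of the time to exercise retries.
    if random.random() < 0.2:
        raise RuntimeError("simulated provider error")
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def call_with_retry(semaphore: asyncio.Semaphore, prompt: str, max_retries: int = 3) -> str:
    async with semaphore:  # cap the number of concurrent in-flight requests
        for attempt in range(max_retries):
            try:
                return await fake_llm_call(prompt)
            except RuntimeError:
                if attempt == max_retries - 1:
                    raise
                # exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                await asyncio.sleep(2 ** attempt + random.random())

async def main() -> None:
    semaphore = asyncio.Semaphore(CONCURRENT_REQUEST_LIMIT)
    answers = await asyncio.gather(*(call_with_retry(semaphore, f"query {i}") for i in range(32)))
    print(f"completed {len(answers)} requests")

asyncio.run(main())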

Performance Considerations

When configuring LLMs in R2R, consider these optimization strategies:

  1. Concurrency Management:

    • Adjust concurrent_request_limit based on provider limits
    • Monitor API usage and adjust accordingly
    • Consider implementing request caching for repeated queries
  2. Model Selection:

    • Balance model capabilities with latency requirements
    • Consider cost per token for different providers
    • Evaluate context window requirements
  3. Resource Management:

    • Monitor token usage with large responses
    • Implement appropriate error handling and retry strategies
    • Consider implementing fallback models for critical systems (see the sketch after this list)
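
As a sketch of the fallback strategy mentioned above, a request can walk an ordered preference list and move to the next model only when the current one raises. The model names and the generate callable here are assumptions for illustration, not part of the R2R API.

from typing import Callable, Optional

# Ordered preference list; the model names here are examples only.
FALLBACK_MODELS = ["openai/gpt-4o", "openai/gpt-4o-mini", "anthropic/claude-3-haiku-20240307"]

def generate_with_fallback(generate: Callable[[str, str], str], prompt: str) -> str:
    """Try each model in order and return the first successful completion.

    `generate` is any callable taking (model, prompt); in practice it could wrap an
    R2R rag call that passes the model through rag_generation_config.
    """
    last_error: Optional[Exception] = None
    for model in FALLBACK_MODELS:
        try:
            return generate(model, prompt)
        except Exception as exc:  # a real implementation would catch provider-specific errors
            last_error = exc
    raise RuntimeError("all fallback models failed") from last_error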

Serving select LLM providers

The example below shows the OpenAI provider; other providers follow the same pattern of exporting the provider's API key and referencing its models in your config.
export OPENAI_API_KEY=your_openai_key
# .. set other environment variables

# Set your `my_r2r.toml` similar to shown:
# [app]
# quality_llm = "openai/gpt-4o-mini"

Supported models include:

  • openai/gpt-4o
  • openai/gpt-4-turbo
  • openai/gpt-4
  • openai/gpt-4o-mini

For a complete list of supported OpenAI models and detailed usage instructions, please refer to the LiteLLM OpenAI documentation.

Runtime Configuration of LLM Provider

R2R supports runtime configuration of the LLM provider, allowing you to dynamically change the model or provider for each request. This flexibility enables you to use different models or providers based on specific requirements or use cases.

Combining Search and Generation

When performing a RAG query, you can dynamically set the LLM generation settings:

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

response = client.rag(
    "What are the latest advancements in quantum computing?",
    rag_generation_config={
        "stream": False,
        "model": "openai/gpt-4o-mini",
        "temperature": 0.7,
        "max_tokens": 150
    }
)
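
Because model strings are resolved through LiteLLM, the same call can target a different provider per request. For example, assuming the corresponding ANTHROPIC_API_KEY is set (the model name below is illustrative):

response = client.rag(
    "What are the latest advancements in quantum computing?",
    rag_generation_config={
        "model": "anthropic/claude-3-5-sonnet-20241022",
        "temperature": 0.3,
        "max_tokens": 150
    }
)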

For more detailed information on configuring other search and RAG settings, please refer to the RAG Configuration documentation.
