LLMs

Configure your LLM provider

Language Model System

R2R uses Large Language Models (LLMs) as the core reasoning engine for RAG operations, providing sophisticated text generation and analysis capabilities.

R2R uses LiteLLM to route LLM requests because of its provider flexibility. Read more about LiteLLM here.
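Under the hood, LiteLLM routes each request based on the provider prefix in the model string (for example, openai/gpt-4o is sent to OpenAI). As a rough illustration of the routing R2R relies on, a direct LiteLLM call looks like this (the prompt and model here are illustrative only):

import litellm

# Illustrative direct LiteLLM call; R2R performs the equivalent internally
# using the model set in [completion.generation_config].
response = litellm.completion(
    model="openai/gpt-4o",  # the "openai/" prefix selects the OpenAI provider
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation."}],
)
print(response.choices[0].message.content)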

LLM Configuration

The LLM system can be customized through the completion section in your r2r.toml file:

r2r.toml
[completion]
provider = "litellm" # defaults to "litellm"
concurrent_request_limit = 16 # defaults to 256

  [completion.generation_config]
  model = "openai/gpt-4o" # defaults to "openai/gpt-4o"
  temperature = 0.1 # defaults to 0.1
  top_p = 1 # defaults to 1
  max_tokens_to_sample = 1_024 # defaults to 1_024
  stream = false # defaults to false
  add_generation_kwargs = {} # defaults to {}

The environment variables relevant to the above configuration are OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_API_KEY, etc., depending on your chosen provider.

Advanced LLM Features in R2R

R2R leverages several advanced LLM features to provide robust text generation:

Concurrent Request Management

The system implements sophisticated request handling with rate limiting and concurrency control:

class CompletionProvider:
    async def aget_completion(
        self,
        messages: list[dict],
        generation_config: GenerationConfig,
        **kwargs,
    ) -> LLMChatCompletion:
        # Bundle the request so it can be scheduled and retried as a unit
        task = {
            "messages": messages,
            "generation_config": generation_config,
            "kwargs": kwargs,
        }
        # Execute with rate limiting and exponential backoff on transient failures
        response = await self._execute_with_backoff_async(task)
        return LLMChatCompletion(**response.dict())
  1. Rate Limiting: Prevents API throttling through intelligent request scheduling
  2. Concurrent Processing: Manages multiple LLM requests efficiently
  3. Error Handling: Implements retry logic with exponential backoff
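R2R's actual retry policy lives inside the provider implementation; the sketch below only illustrates the pattern described above, combining a semaphore for concurrency control with exponential backoff on failures (the BackoffExecutor name and parameters are hypothetical, not part of the R2R API):

import asyncio
import random

class BackoffExecutor:
    # Hypothetical helper illustrating the pattern; not the actual R2R class.
    def __init__(self, concurrent_request_limit: int = 16, max_retries: int = 5):
        self._semaphore = asyncio.Semaphore(concurrent_request_limit)
        self._max_retries = max_retries

    async def execute(self, call, *args, **kwargs):
        async with self._semaphore:  # cap in-flight requests at the configured limit
            delay = 1.0
            for attempt in range(self._max_retries):
                try:
                    return await call(*args, **kwargs)
                except Exception:
                    if attempt == self._max_retries - 1:
                        raise  # retries exhausted: surface the error to the caller
                    # wait with exponential backoff plus jitter before retrying
                    await asyncio.sleep(delay + random.random())
                    delay *= 2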

Performance Considerations

When configuring LLMs in R2R, consider these optimization strategies:

  1. Concurrency Management:

    • Adjust concurrent_request_limit based on provider limits
    • Monitor API usage and adjust accordingly
    • Consider implementing request caching for repeated queries (see the sketch after this list)
  2. Model Selection:

    • Balance model capabilities with latency requirements
    • Consider cost per token for different providers
    • Evaluate context window requirements
  3. Resource Management:

    • Monitor token usage with large responses
    • Implement appropriate error handling and retry strategies
    • Consider implementing fallback models for critical systems
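For example, a thin caching wrapper around the client call avoids paying for identical, repeated queries. This is a minimal sketch; the cached_rag helper below is illustrative and not part of the R2R SDK:

import hashlib
import json

_rag_cache: dict[str, object] = {}

def cached_rag(client, query: str, rag_generation_config: dict):
    # Key the cache on the query plus the generation settings so that
    # changing the model or temperature still triggers a fresh request.
    key = hashlib.sha256(
        json.dumps({"query": query, "config": rag_generation_config}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _rag_cache:
        _rag_cache[key] = client.rag(query, rag_generation_config=rag_generation_config)
    return _rag_cache[key]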

Serving select LLM providers

Select from the toggleable providers below.
export OPENAI_API_KEY=your_openai_key
# .. set other environment variables

# To use a custom configuration, set up `my_r2r.toml` as shown:
# [completion]
# provider = "litellm"
# [completion.generation_config]
# model = "openai/gpt-4o-mini"
# and launch with: r2r serve --config-path=my_r2r.toml

# Or launch with the default configuration:
r2r serve

Supported models include:

  • openai/gpt-4o
  • openai/gpt-4-turbo
  • openai/gpt-4
  • openai/gpt-4o-mini

For a complete list of supported OpenAI models and detailed usage instructions, please refer to the LiteLLM OpenAI documentation.

Runtime Configuration of LLM Provider

R2R supports runtime configuration of the LLM provider, allowing you to dynamically change the model or provider for each request. This flexibility enables you to use different models or providers based on specific requirements or use cases.

Combining Search and Generation

When performing a RAG query, you can dynamically set the LLM generation settings:

response = client.rag(
    "What are the latest advancements in quantum computing?",
    rag_generation_config={
        "stream": False,
        "model": "openai/gpt-4o-mini",
        "temperature": 0.7,
        "max_tokens": 150
    }
)

For more detailed information on configuring other search and RAG settings, please refer to the RAG Configuration documentation.