LLMs

Configure your LLM provider

Language Model System

R2R uses Large Language Models (LLMs) as the core reasoning engine for RAG operations, providing sophisticated text generation and analysis capabilities.

R2R uses LiteLLM to route LLM requests because of its provider flexibility. Read more about LiteLLM here.
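Under the hood, LiteLLM routes each request based on the provider prefix in the model string (for example, openai/gpt-4o is sent to OpenAI). As a rough illustration of the routing R2R relies on, a direct LiteLLM call looks like this (the prompt and model here are illustrative only):

import litellm

# Illustrative direct LiteLLM call; R2R performs the equivalent internally
# using the model set in [completion.generation_config].
response = litellm.completion(
    model="openai/gpt-4o",  # the "openai/" prefix selects the OpenAI provider
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation."}],
)
print(response.choices[0].message.content)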

LLM Configuration

The LLM system can be customized through the completion section in your r2r.toml file:

r2r.toml
[completion]
provider = "litellm" # defaults to "litellm"
concurrent_request_limit = 16 # defaults to 256

  [completion.generation_config]
  model = "openai/gpt-4o" # defaults to "openai/gpt-4o"
  temperature = 0.1 # defaults to 0.1
  top_p = 1 # defaults to 1
  max_tokens_to_sample = 1_024 # defaults to 1_024
  stream = false # defaults to false
  add_generation_kwargs = {} # defaults to {}

The environment variables relevant to the above configuration are OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_API_KEY, etc., depending on your chosen provider.

Advanced LLM Features in R2R

R2R leverages several advanced LLM features to provide robust text generation:

Concurrent Request Management

The system implements sophisticated request handling with rate limiting and concurrency control:

class CompletionProvider:
    async def aget_completion(
        self,
        messages: list[dict],
        generation_config: GenerationConfig,
        **kwargs,
    ) -> LLMChatCompletion:
        # Bundle the request so it can be scheduled and retried as a unit
        task = {
            "messages": messages,
            "generation_config": generation_config,
            "kwargs": kwargs,
        }
        # Execute with rate limiting and exponential backoff on transient failures
        response = await self._execute_with_backoff_async(task)
        return LLMChatCompletion(**response.dict())
  1. Rate Limiting: Prevents API throttling through intelligent request scheduling
  2. Concurrent Processing: Manages multiple LLM requests efficiently
  3. Error Handling: Implements retry logic with exponential backoff
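R2R's actual retry policy lives inside the provider implementation; the sketch below only illustrates the pattern described above, combining a semaphore for concurrency control with exponential backoff on failures (the BackoffExecutor name and parameters are hypothetical, not part of the R2R API):

import asyncio
import random

class BackoffExecutor:
    # Hypothetical helper illustrating the pattern; not the actual R2R class.
    def __init__(self, concurrent_request_limit: int = 16, max_retries: int = 5):
        self._semaphore = asyncio.Semaphore(concurrent_request_limit)
        self._max_retries = max_retries

    async def execute(self, call, *args, **kwargs):
        async with self._semaphore:  # cap in-flight requests at the configured limit
            delay = 1.0
            for attempt in range(self._max_retries):
                try:
                    return await call(*args, **kwargs)
                except Exception:
                    if attempt == self._max_retries - 1:
                        raise  # retries exhausted: surface the error to the caller
                    # wait with exponential backoff plus jitter before retrying
                    await asyncio.sleep(delay + random.random())
                    delay *= 2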

Performance Considerations

When configuring LLMs in R2R, consider these optimization strategies:

  1. Concurrency Management:

    • Adjust concurrent_request_limit based on provider limits
    • Monitor API usage and adjust accordingly
    • Consider implementing request caching for repeated queries (see the sketch after this list)
  2. Model Selection:

    • Balance model capabilities with latency requirements
    • Consider cost per token for different providers
    • Evaluate context window requirements
  3. Resource Management:

    • Monitor token usage with large responses
    • Implement appropriate error handling and retry strategies
    • Consider implementing fallback models for critical systems
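For example, a thin caching wrapper around the client call avoids paying for identical, repeated queries. This is a minimal sketch; the cached_rag helper below is illustrative and not part of the R2R SDK:

import hashlib
import json

_rag_cache: dict[str, object] = {}

def cached_rag(client, query: str, rag_generation_config: dict):
    # Key the cache on the query plus the generation settings so that
    # changing the model or temperature still triggers a fresh request.
    key = hashlib.sha256(
        json.dumps({"query": query, "config": rag_generation_config}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _rag_cache:
        _rag_cache[key] = client.rag(query, rag_generation_config=rag_generation_config)
    return _rag_cache[key]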

Serving select LLM providers

Select from the toggleable providers below.
export OPENAI_API_KEY=your_openai_key
# .. set other environment variables

# To use a custom configuration, set up `my_r2r.toml` as shown:
# [completion]
# provider = "litellm"
# [completion.generation_config]
# model = "openai/gpt-4o-mini"
# and launch with: r2r serve --config-path=my_r2r.toml

# Or launch with the default configuration:
r2r serve

Supported models include:

  • openai/gpt-4o
  • openai/gpt-4-turbo
  • openai/gpt-4
  • openai/gpt-4o-mini

For a complete list of supported OpenAI models and detailed usage instructions, please refer to the LiteLLM OpenAI documentation.

Runtime Configuration of LLM Provider

R2R supports runtime configuration of the LLM provider, allowing you to dynamically change the model or provider for each request. This flexibility enables you to use different models or providers based on specific requirements or use cases.

Combining Search and Generation

When performing a RAG query, you can dynamically set the LLM generation settings:

response = client.rag(
    "What are the latest advancements in quantum computing?",
    rag_generation_config={
        "stream": False,
        "model": "openai/gpt-4o-mini",
        "temperature": 0.7,
        "max_tokens": 150
    }
)

For more detailed information on configuring other search and RAG settings, please refer to the RAG Configuration documentation.