R2R uses language models to generate responses based on retrieved context. You can configure R2R’s server-side LLM generation settings in `r2r.toml`:

r2r.toml
[completion]
provider = "litellm"
concurrent_request_limit = 16

    [completion.generation_config]
    model = "openai/gpt-4o"
    temperature = 0.1
    top_p = 1
    max_tokens_to_sample = 1_024
    stream = false
    add_generation_kwargs = {}

Key generation configuration options:

  • provider: The LLM provider (defaults to "litellm", which routes requests through LiteLLM for maximum provider flexibility).
  • concurrent_request_limit: Maximum number of concurrent LLM requests.
  • model: The language model to use for generation.
  • temperature: Controls the randomness of the output (0.0 to 1.0).
  • top_p: Nucleus sampling parameter (0.0 to 1.0).
  • max_tokens_to_sample: Maximum number of tokens to generate.
  • stream: Enable/disable streaming of generated text.
  • api_base: The base URL of the provider API, e.g. https://api.openai.com/v1 (see the example below).
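
If you route requests through a hosted or self-managed OpenAI-compatible endpoint, you can point api_base at it. A minimal sketch of the relevant TOML (the URL shown is the OpenAI default; substitute your own endpoint):

    [completion.generation_config]
    model = "openai/gpt-4o"
    api_base = "https://api.openai.com/v1"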

Serving select LLM providers

export OPENAI_API_KEY=your_openai_key
# .. set other environment variables

# Launch the server with the default configuration
r2r serve

# Optional - update the default model:
# set 'model = "openai/gpt-4o-mini"' in `r2r.toml`,
# then relaunch with `r2r serve --config-path=r2r.toml`
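
The same pattern applies to other LiteLLM-supported providers. A minimal sketch for serving an Anthropic model instead (the model name is illustrative):

export ANTHROPIC_API_KEY=your_anthropic_key

# Set 'model = "anthropic/claude-3-5-sonnet-20240620"' in `r2r.toml`,
# then launch with the custom config
r2r serve --config-path=r2r.toml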

Supported models include:

  • openai/gpt-4o
  • openai/gpt-4-turbo
  • openai/gpt-4
  • openai/gpt-4o-mini

For a complete list of supported OpenAI models and detailed usage instructions, please refer to the LiteLLM OpenAI documentation.

Runtime Configuration of LLM Provider

R2R supports runtime configuration of the LLM provider, allowing you to change the model or provider on a per-request basis. This flexibility lets you match models to specific requirements or use cases, as shown in the example below.

Combining Search and Generation

When performing a RAG query, you can dynamically set the LLM generation settings:

from r2r import R2RClient

client = R2RClient()  # connects to the local R2R server by default

response = client.rag(
    "What are the latest advancements in quantum computing?",
    rag_generation_config={
        "stream": False,
        "model": "openai/gpt-4o-mini",
        "temperature": 0.7,
        "max_tokens": 150
    }
)
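
Because generation settings are resolved per request, a follow-up call can switch to a different provider without restarting the server. A sketch assuming the server also holds an Anthropic API key (the model name is illustrative):

response = client.rag(
    "Explain quantum error correction in simple terms.",
    rag_generation_config={
        "model": "anthropic/claude-3-5-sonnet-20240620",
        "temperature": 0.2,
        "max_tokens": 150
    }
)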

For more detailed information on configuring other search and RAG settings, please refer to the RAG Configuration documentation.

Next Steps

For more detailed information on configuring specific components of R2R, please refer to the following pages: