Evals

Overview

This guide demonstrates how to evaluate your R2R RAG outputs using the Ragas evaluation framework.

In this tutorial, you will:

  • Prepare a sample dataset in R2R
  • Use R2R’s /rag endpoint to perform Retrieval-Augmented Generation
  • Install and configure Ragas for evaluation
  • Evaluate the generated responses using multiple metrics
  • Analyze evaluation traces for deeper insights

Setting Up Ragas for R2R Evaluation

Installing Ragas

First, install Ragas and its dependencies:

%pip install ragas langchain-openai -q

Configuring Ragas with OpenAI

Ragas uses an LLM to perform evaluations. Set up an OpenAI model as the evaluator:

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Make sure your OPENAI_API_KEY environment variable is set
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

# If you'll be using embeddings for certain metrics
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Sample Dataset and R2R RAG Implementation

For this guide, we assume you have:

  1. An initialized R2R client
  2. A dataset about AI companies already ingested into R2R
  3. Basic knowledge of R2R’s RAG capabilities

Here’s a quick example of using R2R’s /rag endpoint to generate an answer:

from r2r import R2RClient

client = R2RClient()  # Assuming R2R_API_KEY is set in your environment

query = "What makes Meta AI's LLaMA models stand out?"

search_settings = {
    "limit": 2,
    "graph_settings": {"enabled": False, "limit": 2},
}

response = client.retrieval.rag(
    query=query,
    search_settings=search_settings
)

print(response.results.generated_answer)

The output might look like:

Meta AI's LLaMA models stand out due to their open-source nature, which supports innovation and experimentation by making high-quality models accessible to researchers and developers [1]. This approach democratizes AI development, fostering collaboration across industries and enabling researchers without access to expensive resources to work with advanced AI models [2].

Evaluating R2R with Ragas

Ragas provides a comprehensive evaluation framework specifically designed for RAG systems. The R2R-Ragas integration makes it easy to assess the quality of your R2R implementation.

Creating a Test Dataset

First, prepare a set of test questions and reference answers:

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft's Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft's Azure AI platform is known for integrating OpenAI's GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]

Collecting R2R Responses

Generate responses using your R2R implementation:

r2r_responses = []

search_settings = {
    "limit": 2,
    "graph_settings": {"enabled": False, "limit": 2},
}

for que in questions:
    response = client.retrieval.rag(query=que, search_settings=search_settings)
    r2r_responses.append(response)

The R2R-Ragas Integration

Ragas includes a dedicated integration for R2R that handles the conversion of R2R’s response format to Ragas’s evaluation dataset format:

from ragas.integrations.r2r import transform_to_ragas_dataset

# Convert R2R responses to Ragas format
ragas_eval_dataset = transform_to_ragas_dataset(
    user_inputs=questions,
    r2r_responses=r2r_responses,
    references=references
)

print(ragas_eval_dataset)
# Output: EvaluationDataset(features=['user_input', 'retrieved_contexts', 'response', 'reference'], len=3)

The transform_to_ragas_dataset function extracts the necessary components from R2R responses, including:

  • The generated answer
  • The retrieved context chunks
  • Citation information
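If you want to sanity-check the conversion before running any metrics, you can inspect an individual sample. This is a minimal sketch, assuming the EvaluationDataset supports positional indexing into its samples; the attribute names mirror the dataset features printed above:

# Peek at the first converted sample
sample = ragas_eval_dataset[0]
print(sample.user_input)               # the original question
print(sample.response)                 # R2R's generated answer
print(len(sample.retrieved_contexts))  # number of retrieved chunks
print(sample.reference)                # the reference answer you supplied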

Key Evaluation Metrics for R2R

Ragas offers several metrics that are particularly useful for evaluating R2R implementations:

from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness
from ragas import evaluate

# Define the metrics to use
ragas_metrics = [
    AnswerRelevancy(llm=evaluator_llm),   # How relevant is the answer to the query?
    ContextPrecision(llm=evaluator_llm),  # How precisely were the right documents retrieved?
    Faithfulness(llm=evaluator_llm),      # Does the answer stick to facts in the context?
]

# Run the evaluation
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)

Each metric provides valuable insights:

  • Answer Relevancy: Measures how well the R2R-generated response addresses the user’s query
  • Context Precision: Evaluates if R2R’s retrieval mechanism is bringing back the most relevant documents
  • Faithfulness: Checks if R2R’s generated answers accurately reflect the information in the retrieved documents
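Printing the returned results object shows the aggregate score for each metric across the dataset. This is a minimal sketch; the exact print format can vary between Ragas versions, and the numbers are illustrative averages of the per-sample scores shown in the next section:

# Aggregate scores across all samples
print(results)
# e.g. {'answer_relevancy': 0.9509, 'context_precision': 1.0000, 'faithfulness': 0.9444}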

Interpreting Evaluation Results

The evaluation results show detailed scores for each sample and metric:

# View results as a dataframe
df = results.to_pandas()
print(df)

Example output:

   user_input                                 retrieved_contexts                                  response                                          reference                            answer_relevancy  context_precision  faithfulness
0  Who are the major players...               [In the rapidly advancing field of...]              The major players in the large language...       The major players include OpenAI...         1.000000                1.0      1.000000
1  What is Microsoft's Azure AI...            [Microsoft's Azure AI platform is famous for...]    Microsoft's Azure AI platform is known for...    Microsoft's Azure AI platform is...          0.948908                1.0      0.833333
2  What kind of models does Cohere provide?   [Cohere is well-known for its language models...]  Cohere provides language models tailored for...  Cohere provides language models...          0.903765                1.0      1.000000
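Because the results are an ordinary pandas dataframe, you can slice and aggregate them to spot weak samples. The sketch below assumes the column names shown in the output above:

# Average score per metric across the dataset
print(df[["answer_relevancy", "context_precision", "faithfulness"]].mean())

# Flag samples whose faithfulness falls below a chosen threshold
weak = df[df["faithfulness"] < 0.9]
print(weak[["user_input", "faithfulness"]])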

Advanced Visualization with Ragas App

For a more interactive analysis, upload results to the Ragas app:

# Make sure RAGAS_APP_TOKEN is set in your environment
results.upload()

This generates a shareable dashboard with:

  • Detailed scores per metric and sample
  • Visual comparisons across metrics
  • Trace information showing why scores were assigned
  • Suggestions for improvement

You can examine:

  • Which queries R2R handled well
  • Where retrieval or generation could be improved
  • Patterns in your RAG system’s performance

Advanced Evaluation Features

Non-LLM Metrics for Fast Evaluation

In addition to LLM-based metrics, you can use non-LLM metrics for faster evaluations:

from ragas.metrics import BleuScore

# Create a BLEU score metric
bleu_metric = BleuScore()

# Add it to your evaluation
quick_metrics = [bleu_metric]
quick_results = evaluate(dataset=ragas_eval_dataset, metrics=quick_metrics)
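BleuScore compares the generated response to the reference by n-gram overlap, so it needs no evaluator LLM. You can also spot-check a single sample directly; a small sketch, assuming the synchronous single_turn_score helper available on Ragas metrics:

# Score one converted sample without running a full evaluation
sample = ragas_eval_dataset[0]
print(bleu_metric.single_turn_score(sample))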

Custom Evaluation Criteria with AspectCritic

For tailored evaluations specific to your use case, AspectCritic allows you to define custom evaluation criteria:

from ragas.metrics import AspectCritic

# Define a custom evaluation aspect
custom_metric = AspectCritic(
    name="factual_accuracy",
    llm=evaluator_llm,
    definition="Verify if the answer accurately states company names, model names, and specific capabilities without any factual errors."
)

# Evaluate with your custom criteria
custom_results = evaluate(dataset=ragas_eval_dataset, metrics=[custom_metric])
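AspectCritic returns a binary verdict per sample (1 if the response satisfies the definition, 0 otherwise). In the results dataframe, the per-sample column takes the name you gave the metric:

# Inspect the per-sample verdicts for the custom aspect
df_custom = custom_results.to_pandas()
print(df_custom[["user_input", "factual_accuracy"]])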

Training Your Own Metric

If you want to fine-tune metrics to your specific requirements:

  1. Use the Ragas app to annotate evaluation results
  2. Download the annotations as JSON
  3. Train your custom metric:
from ragas.config import InstructionConfig, DemonstrationConfig

demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)

# Train your metric with your annotations
# (`metric` is the trainable metric instance you want to tune, e.g. the AspectCritic defined above)
metric.train(
    path="your-annotations.json",
    demonstration_config=demo_config,
    instruction_config=inst_config
)
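Once trained, the tuned metric is used like any other metric. A minimal sketch, re-running the earlier evaluation with it:

# Re-evaluate with the trained metric
tuned_results = evaluate(dataset=ragas_eval_dataset, metrics=[metric])
print(tuned_results)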

Conclusion

This guide demonstrated how to use Ragas to thoroughly evaluate your R2R RAG implementation. By leveraging these evaluation tools, you can:

  1. Measure the quality of your R2R system across multiple dimensions
  2. Identify specific areas for improvement in retrieval and generation
  3. Track performance improvements as you refine your implementation
  4. Establish benchmarks for consistent quality

Through regular evaluation with Ragas, you can optimize your R2R configuration to deliver the most accurate, relevant, and helpful responses to your users.

For more information on R2R features, refer to the R2R documentation. To explore additional evaluation metrics and techniques with Ragas, visit the Ragas documentation.