Evals

Overview

This guide demonstrates how to evaluate your R2R RAG outputs using the Ragas evaluation framework.

In this tutorial, you will:

  • Prepare a sample dataset in R2R
  • Use R2R’s /rag endpoint to perform Retrieval-Augmented Generation
  • Install and configure Ragas for evaluation
  • Evaluate the generated responses using multiple metrics
  • Analyze evaluation traces for deeper insights

Setting Up Ragas for R2R Evaluation

Installing Ragas

First, install Ragas and its dependencies:

%pip install ragas langchain-openai -q

Configuring Ragas with OpenAI

Ragas uses an LLM to perform evaluations. Set up an OpenAI model as the evaluator:

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Make sure your OPENAI_API_KEY environment variable is set
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

# If you'll be using embeddings for certain metrics
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Sample Dataset and R2R RAG Implementation

For this guide, we assume you have:

  1. An initialized R2R client
  2. A dataset about AI companies already ingested into R2R
  3. Basic knowledge of R2R’s RAG capabilities

Here’s a quick example of using R2R’s /rag endpoint to generate an answer:

from r2r import R2RClient

client = R2RClient()  # Assuming R2R_API_KEY is set in your environment

query = "What makes Meta AI's LLaMA models stand out?"

search_settings = {
    "limit": 2,
    "graph_settings": {"enabled": False, "limit": 2},
}

response = client.retrieval.rag(
    query=query,
    search_settings=search_settings
)

print(response.results.generated_answer)

The output might look like:

Meta AI's LLaMA models stand out due to their open-source nature, which supports innovation and experimentation by making high-quality models accessible to researchers and developers [1]. This approach democratizes AI development, fostering collaboration across industries and enabling researchers without access to expensive resources to work with advanced AI models [2].

Evaluating R2R with Ragas

Ragas provides a comprehensive evaluation framework specifically designed for RAG systems. The R2R-Ragas integration makes it easy to assess the quality of your R2R implementation.

Creating a Test Dataset

First, prepare a set of test questions and reference answers:

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft's Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft's Azure AI platform is known for integrating OpenAI's GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]

Collecting R2R Responses

Generate responses using your R2R implementation:

r2r_responses = []

search_settings = {
    "limit": 2,
    "graph_settings": {"enabled": False, "limit": 2},
}

for que in questions:
    response = client.retrieval.rag(query=que, search_settings=search_settings)
    r2r_responses.append(response)

The R2R-Ragas Integration

Ragas includes a dedicated integration for R2R that handles the conversion of R2R’s response format to Ragas’s evaluation dataset format:

from ragas.integrations.r2r import transform_to_ragas_dataset

# Convert R2R responses to Ragas format
ragas_eval_dataset = transform_to_ragas_dataset(
    user_inputs=questions,
    r2r_responses=r2r_responses,
    references=references
)

print(ragas_eval_dataset)
# Output: EvaluationDataset(features=['user_input', 'retrieved_contexts', 'response', 'reference'], len=3)

The transform_to_ragas_dataset function extracts the necessary components from R2R responses, including:

  • The generated answer
  • The retrieved context chunks
  • Citation information
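If you want to sanity-check the conversion before running any metrics, you can inspect an individual sample. This is a minimal sketch, assuming the EvaluationDataset supports positional indexing into its samples; the attribute names mirror the dataset features printed above:

# Peek at the first converted sample
sample = ragas_eval_dataset[0]
print(sample.user_input)               # the original question
print(sample.response)                 # R2R's generated answer
print(len(sample.retrieved_contexts))  # number of retrieved chunks
print(sample.reference)                # the reference answer you supplied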

Key Evaluation Metrics for R2R

Ragas offers several metrics that are particularly useful for evaluating R2R implementations:

from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness
from ragas import evaluate

# Define the metrics to use
ragas_metrics = [
    AnswerRelevancy(llm=evaluator_llm),   # How relevant is the answer to the query?
    ContextPrecision(llm=evaluator_llm),  # How precisely were the right documents retrieved?
    Faithfulness(llm=evaluator_llm),      # Does the answer stick to facts in the context?
]

# Run the evaluation
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)

Each metric provides valuable insights:

  • Answer Relevancy: Measures how well the R2R-generated response addresses the user’s query
  • Context Precision: Evaluates if R2R’s retrieval mechanism is bringing back the most relevant documents
  • Faithfulness: Checks if R2R’s generated answers accurately reflect the information in the retrieved documents
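Printing the returned results object shows the aggregate score for each metric across the dataset. This is a minimal sketch; the exact print format can vary between Ragas versions, and the numbers are illustrative averages of the per-sample scores shown in the next section:

# Aggregate scores across all samples
print(results)
# e.g. {'answer_relevancy': 0.9509, 'context_precision': 1.0000, 'faithfulness': 0.9444}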

Interpreting Evaluation Results

The evaluation results show detailed scores for each sample and metric:

# View results as a dataframe
df = results.to_pandas()
print(df)

Example output:

   user_input                                 retrieved_contexts                                  response                                          reference                            answer_relevancy  context_precision  faithfulness
0  Who are the major players...               [In the rapidly advancing field of...]              The major players in the large language...       The major players include OpenAI...         1.000000                1.0      1.000000
1  What is Microsoft's Azure AI...            [Microsoft's Azure AI platform is famous for...]    Microsoft's Azure AI platform is known for...    Microsoft's Azure AI platform is...          0.948908                1.0      0.833333
2  What kind of models does Cohere provide?   [Cohere is well-known for its language models...]  Cohere provides language models tailored for...  Cohere provides language models...          0.903765                1.0      1.000000
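Because the results are an ordinary pandas dataframe, you can slice and aggregate them to spot weak samples. The sketch below assumes the column names shown in the output above:

# Average score per metric across the dataset
print(df[["answer_relevancy", "context_precision", "faithfulness"]].mean())

# Flag samples whose faithfulness falls below a chosen threshold
weak = df[df["faithfulness"] < 0.9]
print(weak[["user_input", "faithfulness"]])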

Advanced Visualization with Ragas App

For a more interactive analysis, upload results to the Ragas app:

# Make sure RAGAS_APP_TOKEN is set in your environment
results.upload()

This generates a shareable dashboard with:

  • Detailed scores per metric and sample
  • Visual comparisons across metrics
  • Trace information showing why scores were assigned
  • Suggestions for improvement

You can examine:

  • Which queries R2R handled well
  • Where retrieval or generation could be improved
  • Patterns in your RAG system’s performance

Advanced Evaluation Features

Non-LLM Metrics for Fast Evaluation

In addition to LLM-based metrics, you can use non-LLM metrics for faster evaluations:

from ragas.metrics import BleuScore

# Create a BLEU score metric
bleu_metric = BleuScore()

# Add it to your evaluation
quick_metrics = [bleu_metric]
quick_results = evaluate(dataset=ragas_eval_dataset, metrics=quick_metrics)
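BleuScore compares the generated response to the reference by n-gram overlap, so it needs no evaluator LLM. You can also spot-check a single sample directly; a small sketch, assuming the synchronous single_turn_score helper available on Ragas metrics:

# Score one converted sample without running a full evaluation
sample = ragas_eval_dataset[0]
print(bleu_metric.single_turn_score(sample))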

Custom Evaluation Criteria with AspectCritic

For tailored evaluations specific to your use case, AspectCritic allows you to define custom evaluation criteria:

from ragas.metrics import AspectCritic

# Define a custom evaluation aspect
custom_metric = AspectCritic(
    name="factual_accuracy",
    llm=evaluator_llm,
    definition="Verify if the answer accurately states company names, model names, and specific capabilities without any factual errors."
)

# Evaluate with your custom criteria
custom_results = evaluate(dataset=ragas_eval_dataset, metrics=[custom_metric])
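AspectCritic returns a binary verdict per sample (1 if the response satisfies the definition, 0 otherwise). In the results dataframe, the per-sample column takes the name you gave the metric:

# Inspect the per-sample verdicts for the custom aspect
df_custom = custom_results.to_pandas()
print(df_custom[["user_input", "factual_accuracy"]])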

Training Your Own Metric

If you want to fine-tune metrics to your specific requirements:

  1. Use the Ragas app to annotate evaluation results
  2. Download the annotations as JSON
  3. Train your custom metric:
from ragas.config import InstructionConfig, DemonstrationConfig

demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)

# Train your metric with your annotations
# (`metric` is the trainable metric instance you want to tune, e.g. the AspectCritic defined above)
metric.train(
    path="your-annotations.json",
    demonstration_config=demo_config,
    instruction_config=inst_config
)
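Once trained, the tuned metric is used like any other metric. A minimal sketch, re-running the earlier evaluation with it:

# Re-evaluate with the trained metric
tuned_results = evaluate(dataset=ragas_eval_dataset, metrics=[metric])
print(tuned_results)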

Conclusion

This guide demonstrated how to use Ragas to thoroughly evaluate your R2R RAG implementation. By leveraging these evaluation tools, you can:

  1. Measure the quality of your R2R system across multiple dimensions
  2. Identify specific areas for improvement in retrieval and generation
  3. Track performance improvements as you refine your implementation
  4. Establish benchmarks for consistent quality

Through regular evaluation with Ragas, you can optimize your R2R configuration to deliver the most accurate, relevant, and helpful responses to your users.

For more information on R2R features, refer to the R2R documentation. To explore additional evaluation metrics and techniques with Ragas, visit the Ragas documentation.