Evals
Overview
This guide demonstrates how to evaluate your R2R RAG outputs using the Ragas evaluation framework.
In this tutorial, you will:
- Prepare a sample dataset in R2R
- Use R2R’s `/rag` endpoint to perform Retrieval-Augmented Generation
- Install and configure Ragas for evaluation
- Evaluate the generated responses using multiple metrics
- Analyze evaluation traces for deeper insights
Setting Up Ragas for R2R Evaluation
Installing Ragas
First, install Ragas and its dependencies:
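A minimal install command (package names are the standard PyPI distributions; `langchain-openai` is included because the evaluator setup below uses LangChain's OpenAI wrapper):

```bash
pip install ragas r2r langchain-openai
```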
Configuring Ragas with OpenAI
Ragas uses an LLM to perform evaluations. Set up an OpenAI model as the evaluator:
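A sketch of one way to do this, using Ragas's LangChain wrappers around OpenAI models (the model name is an example; use whichever model your account supports):

```python
import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper

# Ragas calls this LLM to judge responses, so an OpenAI API key is required.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder; use your own key

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```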
Sample Dataset and R2R RAG Implementation
For this guide, we assume you have:
- An initialized R2R client
- A dataset about AI companies already ingested into R2R
- Basic knowledge of R2R’s RAG capabilities
Here’s a quick example of using R2R’s `/rag` endpoint to generate an answer:
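A minimal sketch, assuming a locally running R2R server and the current R2R Python SDK; method and attribute names may differ slightly between SDK versions, and the query is just an example against the AI-companies dataset:

```python
from r2r import R2RClient

# Point the client at your R2R deployment.
client = R2RClient("http://localhost:7272")

query = "What makes Meta AI's LLaMA models stand out?"

# /rag: retrieve relevant chunks, then generate an answer grounded in them.
response = client.retrieval.rag(
    query=query,
    search_settings={"limit": 3},
)

print(response.results.generated_answer)
```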
The output might look like:
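(Illustrative only; the actual answer depends on the documents you ingested and the generation model configured in R2R.)

```text
Meta AI's LLaMA models stand out because their weights are openly released,
which lets researchers and developers fine-tune and deploy them locally
rather than relying solely on a hosted API.
```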
Evaluating R2R with Ragas
Ragas provides a comprehensive evaluation framework specifically designed for RAG systems. The R2R-Ragas integration makes it easy to assess the quality of your R2R implementation.
Creating a Test Dataset
First, prepare a set of test questions and reference answers:
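For example (the questions and reference answers below are placeholders; replace them with ones that match your ingested documents):

```python
# Test questions paired with reference ("ground truth") answers.
questions = [
    "Who are the major players in the large language model space?",
    "What is Mistral AI known for?",
    "What makes Meta AI's LLaMA models stand out?",
]

references = [
    "Major players include OpenAI (GPT series), Anthropic (Claude), Google "
    "(Gemini), Meta (LLaMA), and Mistral AI, among others.",
    "Mistral AI is known for efficient, open-weight language models.",
    "LLaMA models are released with open weights, enabling local deployment "
    "and community fine-tuning.",
]
```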
Collecting R2R Responses
Generate responses using your R2R implementation:
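A simple loop over the test questions, keeping the full response objects so the retrieved contexts can be extracted later (SDK method names as in the earlier example):

```python
# One R2R RAG call per test question.
r2r_responses = []
for q in questions:
    resp = client.retrieval.rag(query=q, search_settings={"limit": 3})
    r2r_responses.append(resp)
```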
The R2R-Ragas Integration
Ragas includes a dedicated integration for R2R that handles the conversion of R2R’s response format to Ragas’s evaluation dataset format:
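A sketch of the conversion step; the import path and parameter names below follow the Ragas R2R integration docs, so double-check them against your installed Ragas version:

```python
from ragas.integrations.r2r import transform_to_ragas_dataset

# Pair each R2R response with its query and reference answer, producing
# a Ragas EvaluationDataset ready for evaluate().
ragas_eval_dataset = transform_to_ragas_dataset(
    user_inputs=questions,
    r2r_responses=r2r_responses,
    references=references,
)
```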
The `transform_to_ragas_dataset` function extracts the necessary components from R2R responses, including:
- The generated answer
- The retrieved context chunks
- Citation information
Key Evaluation Metrics for R2R
Ragas offers several metrics that are particularly useful for evaluating R2R implementations:
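For example, running three of these metrics against the dataset built above (metric class names follow recent Ragas releases; older versions expose lowercase singletons such as `answer_relevancy` instead):

```python
from ragas import evaluate
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness

metrics = [
    AnswerRelevancy(),   # does the answer actually address the question?
    ContextPrecision(),  # are the retrieved chunks relevant to the question?
    Faithfulness(),      # is the answer grounded in the retrieved chunks?
]

results = evaluate(
    dataset=ragas_eval_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
```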
Each metric provides valuable insights:
- Answer Relevancy: Measures how well the R2R-generated response addresses the user’s query
- Context Precision: Evaluates if R2R’s retrieval mechanism is bringing back the most relevant documents
- Faithfulness: Checks if R2R’s generated answers accurately reflect the information in the retrieved documents
Interpreting Evaluation Results
The evaluation results show detailed scores for each sample and metric:
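The result object prints aggregate scores, and `to_pandas()` exposes per-sample rows, one column per metric alongside the inputs, retrieved contexts, responses, and references:

```python
# Aggregate scores, averaged over all samples.
print(results)

# Per-sample scores for closer inspection.
df = results.to_pandas()
print(df.head())
```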
Example output:
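(The numbers below are illustrative, not from a real run.)

```text
{'answer_relevancy': 0.9132, 'context_precision': 0.8333, 'faithfulness': 0.9500}

                                  user_input  answer_relevancy  context_precision  faithfulness
0  Who are the major players in the large...              0.93               1.00          1.00
1              What is Mistral AI known for?              0.89               0.67          0.90
2  What makes Meta AI's LLaMA models stan...              0.91               0.83          0.95
```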
Advanced Visualization with Ragas App
For a more interactive analysis, upload results to the Ragas app:
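A sketch, assuming you have a Ragas app account and an app token generated at app.ragas.io:

```python
import os

# The upload call reads the token from the environment.
os.environ["RAGAS_APP_TOKEN"] = "your-app-token"  # placeholder

results.upload()
```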
This generates a shareable dashboard with:
- Detailed scores per metric and sample
- Visual comparisons across metrics
- Trace information showing why scores were assigned
- Suggestions for improvement
You can examine:
- Which queries R2R handled well
- Where retrieval or generation could be improved
- Patterns in your RAG system’s performance
Advanced Evaluation Features
Non-LLM Metrics for Fast Evaluation
In addition to LLM-based metrics, you can use non-LLM metrics for faster evaluations:
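For example, string-overlap metrics such as BLEU and ROUGE compare the response directly against the reference, so they need no evaluator LLM (they may require extra packages such as `sacrebleu` or `rouge_score`):

```python
from ragas import evaluate
from ragas.metrics import BleuScore, RougeScore

# No LLM needed: these score the response against the reference string.
fast_results = evaluate(
    dataset=ragas_eval_dataset,
    metrics=[BleuScore(), RougeScore()],
)
print(fast_results)
```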
Custom Evaluation Criteria with AspectCritic
For tailored evaluations specific to your use case, AspectCritic allows you to define custom evaluation criteria:
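A sketch of a custom binary criterion (the name and definition below are just examples; write your own):

```python
from ragas import evaluate
from ragas.metrics import AspectCritic

# AspectCritic returns 1 (pass) or 0 (fail) per sample based on the
# natural-language definition you provide.
conciseness_critic = AspectCritic(
    name="conciseness",
    definition=(
        "Return 1 if the response answers the question directly and without "
        "unnecessary detail, otherwise return 0."
    ),
    llm=evaluator_llm,
)

critic_results = evaluate(
    dataset=ragas_eval_dataset,
    metrics=[conciseness_critic],
)
```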
Training Your Own Metric
If you want to fine-tune metrics to your specific requirements:
- Use the Ragas app to annotate evaluation results
- Download the annotations as JSON
- Train your custom metric:
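A rough sketch of the training step; the `train()` helper and its config objects are newer Ragas features, so treat the parameter names below as assumptions and confirm them against the Ragas docs for your installed version (`annotations.json` is the file downloaded from the Ragas app):

```python
from ragas.config import DemonstrationConfig, InstructionConfig

# Assumed API: align the metric's prompt and few-shot demonstrations
# with your downloaded annotations.
demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)

conciseness_critic.train(
    path="annotations.json",
    demonstration_config=demo_config,
    instruction_config=inst_config,
)
```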
Conclusion
This guide demonstrated how to use Ragas to thoroughly evaluate your R2R RAG implementation. By leveraging these evaluation tools, you can:
- Measure the quality of your R2R system across multiple dimensions
- Identify specific areas for improvement in retrieval and generation
- Track performance improvements as you refine your implementation
- Establish benchmarks for consistent quality
Through regular evaluation with Ragas, you can optimize your R2R configuration to deliver the most accurate, relevant, and helpful responses to your users.
For more information on R2R features, refer to the R2R documentation. To explore additional evaluation metrics and techniques with Ragas, visit the Ragas documentation.