RAG evals
Metrics to evaluate a RAG system.
In this tutorial, we’ll demonstrate how to evaluate different aspects of Retrieval-Augmented Generation (RAG) using Evidently.
We’ll demonstrate a local open-source workflow, viewing results as a pandas dataframe and a visual report, which is ideal for Jupyter or Colab. At the end, we also show how to upload results to the Evidently Platform; choose this option if you work in a non-interactive Python environment.
We will evaluate both retrieval and generation quality:
- Retrieval. Assessing the quality of retrieved contexts, including per-chunk relevance.
- Generation. Evaluating the quality of the final response, both with and without ground truth.
By the end of this tutorial, you’ll know how to evaluate different aspects of a RAG system, and generate structured reports to track RAG performance.
Run a sample notebook: Jupyter notebook or open it in Colab.
To simplify things, we won’t create an actual RAG app; instead, we will simulate its outputs and score them.
1. Installation and Imports
Install Evidently:
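In a notebook, the install typically looks like this; the [llm] extra is assumed to cover the dependencies for LLM-based evaluators:

```python
# Install Evidently with the extras used by LLM-based descriptors
!pip install evidently[llm]
```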
Import the required modules:
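A possible set of imports for this tutorial is sketched below. The module paths and descriptor names are assumptions based on the current Evidently API; check the sample notebook for the exact imports in your version:

```python
import os
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import (
    BERTScore,
    ContextQualityLLMEval,
    ContextRelevance,
    CorrectnessLLMEval,
    FaithfulnessLLMEval,
    SemanticSimilarity,
)
from evidently.presets import TextEvals
from evidently.tests import eq  # only needed for the optional test conditions in section 4
```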
Pass your OpenAI key as an environment variable:
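For example:

```python
# The built-in LLM judges use OpenAI models by default, so the key must be available
os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
```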
2. Evaluating Retrieval
Single Context
First, let’s test retrieval quality when a single context is retrieved for each query.
Generate a synthetic dataset. We create a simple dataset with questions, retrieved contexts, and generated responses.
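Here is a minimal sketch of such a dataset; the rows are made-up placeholders rather than outputs of a real RAG app:

```python
# Toy data: each row has a question, one retrieved context, and a generated response
data = [
    ["Why is the sky blue?",
     "Sunlight is scattered by air molecules; shorter blue wavelengths scatter the most.",
     "Because air molecules scatter blue light from the sun more than other colors."],
    ["How do airplanes stay in the air?",
     "Bananas are rich in potassium and are a popular snack fruit.",
     "Airplane wings generate lift as air flows over them."],
]

synthetic_df = pd.DataFrame(data, columns=["question", "context", "response"])
```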
To preview the full column contents of the pandas dataframe, widen the display:
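```python
# Show full cell contents instead of truncating long texts
pd.set_option("display.max_colwidth", None)
```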
Evaluate overall context quality. We first assess whether the retrieved context provides sufficient information to answer the question and view results as a pandas dataframe.
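A sketch of this step is below. It assumes the descriptor takes the context column as its main input and the question column via a question= parameter; verify the exact signature against the descriptor reference:

```python
context_quality_eval = Dataset.from_pandas(
    synthetic_df,
    data_definition=DataDefinition(),
    descriptors=[
        # LLM judge: does the context contain enough information to answer the question?
        ContextQualityLLMEval("context", question="question"),
    ],
)

# View the scored rows, including the judge's label and explanation
context_quality_eval.as_dataframe()
```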
What happened in this code:
- We create an Evidently dataset object.
- Simultaneously, we add descriptors: evaluators that score each row.
- We use a built-in LLM judge metric ContextQualityLLMEval.
You can also choose a different evaluator LLM or modify the prompt. See LLM judge parameters.
Here is what you get:
Evaluate chunk relevance. You can also score the relevance of the retrieved chunk using a different ContextRelevance metric.
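A sketch, assuming ContextRelevance takes the question and context columns plus a scoring method; the parameter names below may differ in your Evidently version:

```python
single_relevance_eval = Dataset.from_pandas(
    synthetic_df,
    data_definition=DataDefinition(),
    descriptors=[
        # Score context relevance with an LLM judge and aggregate as a binary "hit"
        ContextRelevance(
            "question",
            "context",
            method="llm",               # assumed: use an LLM judge for scoring
            aggregation_method="hit",   # assumed: binary relevant / not relevant
            alias="Hit",
        ),
    ],
)
single_relevance_eval.as_dataframe()
```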
In this case, you will get a binary “Hit” showing whether the context is relevant or not. This metric is more useful when multiple contexts are retrieved, though.
Multiple Contexts
RAG systems often retrieve multiple chunks. In this case, we can assess the relevance of each individual chunk first.
Let’s generate a toy dataset. Pass all contexts as a list.
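For instance, with placeholder rows where the “context” column holds a list of chunks per question:

```python
multi_data = [
    ["Why is the sky blue?",
     ["Sunlight is scattered by air molecules.",
      "Blue light has a shorter wavelength and scatters more than red light.",
      "Bananas are rich in potassium."],
     "Because shorter blue wavelengths of sunlight are scattered more by air molecules."],
    ["How do airplanes stay in the air?",
     ["The Wright brothers flew the first powered airplane in 1903.",
      "Bananas are rich in potassium."],
     "Airplane wings generate lift as air flows over them."],
]

multi_context_df = pd.DataFrame(multi_data, columns=["question", "context", "response"])
```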
Hit Rate. To aggregate the results per query, we can assess if at least one retrieved chunk contains relevant information (Hit).
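A sketch of the hit-rate evaluation; the output_scores and aggregation_method parameter names are assumptions to check against the descriptor reference:

```python
hit_rate_eval = Dataset.from_pandas(
    multi_context_df,
    data_definition=DataDefinition(),
    descriptors=[
        ContextRelevance(
            "question",
            "context",
            method="llm",
            output_scores=True,        # assumed: also return per-chunk relevance scores
            aggregation_method="hit",  # assumed: "hit" if at least one chunk is relevant
            alias="Hit",
        ),
    ],
)
hit_rate_eval.as_dataframe()
```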
You can see the list of individual relevance scores that appear in the same order as your chunks.
Mean Relevance. Alternatively, you can compute an average relevance score.
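The only change from the previous sketch is the aggregation setting (again, an assumed parameter value):

```python
mean_relevance_eval = Dataset.from_pandas(
    multi_context_df,
    data_definition=DataDefinition(),
    descriptors=[
        ContextRelevance(
            "question",
            "context",
            method="llm",
            output_scores=True,
            aggregation_method="mean",  # assumed: average relevance across all chunks
            alias="Mean relevance",
        ),
    ],
)
mean_relevance_eval.as_dataframe()
```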
Here is an example result:
3. Evaluating Generation
With Ground Truth
If you have a ground truth dataset for RAG, you can compare the generated responses against known correct answers.
Synthetic data. You can generate a ground truth dataset for your RAG using Evidently Platform.
Let’s generate a new toy example with “target” column:
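For example, with placeholder rows:

```python
# Toy data with a ground truth "target" answer next to the generated "response"
gt_data = [
    ["Why is the sky blue?",
     "The sky looks blue because air molecules scatter blue light from the sun more than red light.",
     "Because molecules in the air scatter shorter blue wavelengths of sunlight more strongly."],
    ["How do airplanes stay in the air?",
     "Airplanes stay in the air because their wings generate lift.",
     "Airplanes stay in the air because they are lighter than air."],
]

gt_df = pd.DataFrame(gt_data, columns=["question", "target", "response"])
```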
There are multiple ways to run this comparison, including LLM-based matching (CorrectnessLLMEval) and non-LLM methods like Semantic similarity and BERTScore. Let’s run all three at once, though in practice we’d recommend choosing just one:
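A sketch of running all three descriptors together; the target_output= and columns= parameter names are assumptions, so check the descriptor reference for the exact signatures:

```python
correctness_eval = Dataset.from_pandas(
    gt_df,
    data_definition=DataDefinition(),
    descriptors=[
        # LLM judge that compares the response to the ground truth answer
        CorrectnessLLMEval("response", target_output="target"),
        # Embedding-based similarity between response and target
        SemanticSimilarity(columns=["response", "target"], alias="Semantic similarity"),
        # Token-level similarity between response and target
        BERTScore(columns=["response", "target"], alias="BERTScore"),
    ],
)
correctness_eval.as_dataframe()
```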
Here is what you get:
Editing the LLM prompt. You can tweak the definition of correctness to your own liking. Here is an example tutorial on how we tune a correctness descriptor prompt.
Without Ground Truth
If you don’t have reference answers, you can use reference-free LLM judges to assess response quality. For example, here is how you can run an evaluation for Faithfulness to detect if the response is contradictory or unfaithful to the context:
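A sketch, assuming the Faithfulness judge is exposed as FaithfulnessLLMEval and takes the context column via a context= parameter:

```python
faithfulness_eval = Dataset.from_pandas(
    synthetic_df,
    data_definition=DataDefinition(),
    descriptors=[
        # LLM judge: is the response consistent with the retrieved context?
        FaithfulnessLLMEval("response", context="context"),
    ],
)
faithfulness_eval.as_dataframe()
```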
Here is an example result:
You can add other useful checks over your final response like:
- Length constraints: are responses within expected limits?
- Refusal rate: monitoring how often the system declines questions.
- String matching: checking for required wording (e.g., disclaimers).
- Response tone: ensuring responses match the intended style.
Available evaluators. Check a full list of available descriptors.
4. Get Reports
Once you have defined what you are evaluating, you can group all your evals in a Report to summarize the results across multiple tested inputs.
Let’s put it all together.
Score data. Once you have a pandas dataframe synthetic_df, you create an Evidently dataset object and add the selected descriptors by simply listing them.
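For example, combining the retrieval and generation checks used above (same assumptions about descriptor signatures as before):

```python
rag_eval_dataset = Dataset.from_pandas(
    synthetic_df,
    data_definition=DataDefinition(),
    descriptors=[
        ContextQualityLLMEval("context", question="question"),
        FaithfulnessLLMEval("response", context="context"),
    ],
)
```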
Get a Report. Instead of rendering the results as a dataframe, you create a Report.
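A minimal sketch using the TextEvals preset, which summarizes all descriptors added to the dataset:

```python
report = Report([
    TextEvals(),
])

my_eval = report.run(rag_eval_dataset, None)
my_eval  # display the result in the notebook cell
```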
This will render an HTML report in the notebook cell. You can use other export options, like as_dict() for a Python dictionary output.
This lets you see a well-rounded evaluation. In this toy example, we can see that the system generally retrieves the right data well but struggles with generation. The next step could be improving your prompt to ensure responses stay true to context.
Add test conditions. You can also set up explicit pass/fail tests based on expected score distributions using Tests: conditional expectations that you add to metrics.
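Here is a sketch of how such conditions might look. It assumes that descriptors accept a tests= argument, that the judges return “VALID” and “FAITHFUL” labels, and that the Report supports include_tests=True; treat all of these as assumptions to verify against the testing docs:

```python
rag_test_dataset = Dataset.from_pandas(
    synthetic_df,
    data_definition=DataDefinition(),
    descriptors=[
        # Expect every retrieved context to be labeled as valid (assumed label value)
        ContextQualityLLMEval("context", question="question", tests=[eq("VALID")]),
        # Expect every response to be labeled as faithful (assumed label value)
        FaithfulnessLLMEval("response", context="context", tests=[eq("FAITHFUL")]),
    ],
)

report = Report([TextEvals()], include_tests=True)
my_eval = report.run(rag_test_dataset, None)
my_eval
```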
In this case, we expect all retrieved contexts to be valid and all responses to be faithful, so our tests fail. You can adjust these conditions — for example, allowing a certain percentage of responses to fail.
5. Upload to Evidently Cloud
To be able to easily run and compare evals, systematically track the results, and interact with your evaluation dataset, you can use the Evidently Cloud platform.
Set up Evidently Cloud
- Sign up for a free Evidently Cloud account.
- Create an Organization when you log in for the first time. Get the ID of your organization. (Link).
- Get an API token. Click the Key icon in the left menu. Generate and save the token. (Link).
Import the components to connect with Evidently Cloud:
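The import path below is an assumption based on the current API; if it differs in your version, use the one from the platform quickstart:

```python
from evidently.ui.workspace import CloudWorkspace
```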
Create a Project
Connect to Evidently Cloud using your API token:
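For example (the token string is a placeholder):

```python
ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")
```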
Create a Project within your Organization, or connect to an existing Project:
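A sketch; the project name and description are placeholders:

```python
# Replace with the Organization ID from your account
project = ws.create_project("RAG evals tutorial", org_id="YOUR_ORG_ID")
project.description = "Evaluating retrieval and generation quality"
project.save()
```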
Alternatively, retrieve an existing project:
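```python
# Replace with the ID of an existing Project
project = ws.get_project("YOUR_PROJECT_ID")
```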
Send your eval
Since you already created the eval, you can simply upload it to the Evidently Cloud.
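Assuming my_eval is the Report result from the previous section:

```python
# include_data=True uploads the scored dataset along with the summary
ws.add_run(project.id, my_eval, include_data=True)
```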
You can then go to the Evidently Cloud, open your Project and explore the Report with scored data that’s easy to interact with.
What’s Next?
Consider implementing regression testing at every update to monitor how your RAG system’s retrieval and response quality changes.