LLM regression testing
How to run regression testing for LLM outputs.
In this tutorial, you will learn how to perform regression testing for LLM outputs.
You can compare new and old responses after changing a prompt, model, or anything else in your system. By re-running the same inputs with new parameters, you can spot any significant changes. This helps you push updates with confidence or identify issues to fix.
This example uses Evidently Cloud. You’ll run evals in Python and upload them. You can also skip the upload and view Reports locally. For self-hosted, replace `CloudWorkspace` with `Workspace`.
Tutorial scope
Here’s what we’ll do:
- Create a toy dataset. Build a small Q&A dataset with answers and reference responses.
- Get new answers. Imitate generating new answers to the same questions.
- Create and run a Report with Tests. Compare the answers using LLM-as-a-judge to evaluate length, correctness and style consistency.
- Build a monitoring Dashboard. Get plots to track the results of Tests over time.
To simplify things, we won’t create an actual LLM app, but will simulate generating new outputs.
To complete the tutorial, you will need:
- Basic Python knowledge.
- An OpenAI API key to use for the LLM evaluator.
- An Evidently Cloud account to track test results. If you don’t have one yet, sign up for a free account.
You can see all the code in a Jupyter notebook or click to open it in Colab.
1. Installation and Imports
Install Evidently:
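For example (the base package is enough for this tutorial; depending on your version, you may also need the OpenAI client for the LLM judge):

```bash
pip install evidently
pip install openai
```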
Import the required modules:
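A minimal import block for this tutorial, assuming a recent Evidently release (module paths occasionally change between versions, so adjust if your version differs):

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import TextLength, SentenceCount, LLMEval
from evidently.presets import TextEvals
from evidently.metrics import MaxValue, CategoryCount
from evidently.tests import lte, eq
from evidently.llm.templates import BinaryClassificationPromptTemplate
```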
To connect to Evidently Cloud:
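Assuming the Cloud workspace import path used in recent versions:

```python
from evidently.ui.workspace import CloudWorkspace
```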
Optional. To create monitoring panels as code:
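These imports are only needed if you define Dashboard Panels in code; the module paths below are an assumption about one recent SDK version and may differ in yours:

```python
from evidently.sdk.models import PanelMetric
from evidently.sdk.panels import DashboardPanelPlot
```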
Pass your OpenAI key:
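Set it as an environment variable (replace the placeholder with your own key):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```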
2. Create a Project
Connect to Evidently Cloud. Replace with your actual token:
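For example, assuming the standard Cloud URL:

```python
ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")
```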
Create a Project:
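A sketch; depending on your account setup, you may need to pass your organization ID:

```python
project = ws.create_project("LLM regression testing tutorial", org_id="YOUR_ORG_ID")
project.description = "Compare new and old LLM responses on a toy Q&A dataset"
project.save()
```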
3. Prepare the Dataset
Create a toy dataset with questions and reference answers.
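Here is a small illustrative example (the questions, reference answers, and column names are made up for this tutorial; any Q&A pairs will do):

```python
# Toy Q&A data: each question comes with a reference (target) answer.
data = [
    ["Why is the sky blue?",
     "The sky looks blue because air molecules scatter blue light from the sun more than red light."],
    ["How do airplanes stay in the air?",
     "Airplanes stay in the air because their wings generate lift as air flows over them."],
    ["Why do we have seasons?",
     "We have seasons because the Earth's tilted axis changes how directly sunlight hits each hemisphere during the year."],
]

eval_data = pd.DataFrame(data, columns=["question", "target_response"])
```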
Get a quick preview:
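For example, with pandas:

```python
eval_data.head()
```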
Here is how the data looks:
Optional: quick data exploration. You might want to have a quick look at some data statistics to help you set conditions for Tests. Let’s check the text length and sentence count distribution.
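A sketch of this step, assuming the `Dataset` and descriptor API from recent Evidently versions (the `explore_dataset` name is just for illustration):

```python
# Wrap the DataFrame into an Evidently Dataset with an automatic data definition,
# and add two built-in descriptors: text length and sentence count.
explore_dataset = Dataset.from_pandas(
    eval_data,
    data_definition=DataDefinition(),
    descriptors=[
        TextLength("target_response", alias="Length"),
        SentenceCount("target_response", alias="Sentences"),
    ],
)

# Export the scored data back to pandas to look at the distributions.
explore_dataset.as_dataframe()
```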
In this code, you:
- Created an Evidently Dataset object with automatic data definition.
- Added two built-in descriptors for text length and sentence count. (See others).
- Exported the results as a dataframe.
Here is the preview:
For a small dataset, you can grasp it all at once. For a larger dataset, you can add a summary Report to see the distribution.
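For example, using the `TextEvals` preset (a sketch under the same API assumptions as above):

```python
# Summarize the descriptor distributions in a Report.
report = Report([TextEvals()])
my_eval = report.run(explore_dataset, None)
my_eval
```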
This renders the Report directly in an interactive Python environment like Jupyter notebook or Colab. See other export options.
4. Get new answers
Suppose you generate new responses using your LLM after changing a prompt. We will imitate it by adding a new column with new responses to the DataFrame:
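For example (the “new” responses below are made up to imitate the output of an updated prompt; note that one of them is deliberately verbose):

```python
# Imitate new responses from the updated app by adding a column to the same DataFrame.
eval_data["response"] = [
    "Blue light from the sun is scattered by air molecules more than red light, so the sky appears blue.",
    "Wings are shaped so that air moves faster above them than below, creating lift that keeps the plane up. "
    "Engines provide thrust to keep air flowing over the wings, and control surfaces keep the flight stable.",
    "Seasons happen because the tilt of the Earth's axis changes how much direct sunlight each hemisphere gets.",
]
```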
Here is the resulting dataset with the added new column:
How to connect this to your app? Replace this step with calling your LLM app on the same inputs and adding the new responses to the DataFrame. You can also use our `tracely` library to instrument your app and get traces as a tabular dataset. Check the tutorial with the tracing workflow.
5. Design the Test suite
To compare new answers with old ones, we need evaluation metrics. You can use deterministic or embeddings-based metrics like Semantic Similarity. However, you often need more custom criteria. Using LLM-as-a-judge is useful for this, letting you define what to detect.
Let’s formulate what we want to test:
- Length check. All new responses must be no longer than 200 symbols.
- Correctness. All new responses should not contradict the reference answer.
- Style. All new responses should match the style of the reference.
Text length is easy to check, but for Correctness and Style, we’ll write our custom LLM judges.
Correctness judge
We implement the correctness evaluator using an Evidently template for binary classification. We ask the LLM to classify each response as “correct” or “incorrect” based on the `target_response` column, and to provide reasoning for its decision.
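A sketch of the correctness judge, assuming the `BinaryClassificationPromptTemplate` interface from recent Evidently versions (parameter names may differ slightly in yours; see the LLM judge docs):

```python
correctness = BinaryClassificationPromptTemplate(
    criteria="""An ANSWER is correct when it conveys the same facts and details as the REFERENCE,
even if it is worded differently.
An ANSWER is incorrect when it contradicts the REFERENCE, adds new claims, or omits or changes details.

REFERENCE:
=====
{target_response}
=====""",
    target_category="incorrect",
    non_target_category="correct",
    uncertainty="unknown",
    include_reasoning=True,
    pre_messages=[("system", "You are an expert evaluator. You will be given an ANSWER and a REFERENCE.")],
)
```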
We recommend splitting each evaluation criterion into separate judges and using a simple grading scale, like binary classifiers, for better reliability.
Ideally, evaluate your judge first! Each LLM evaluator is a small ML system you should align with your preferences. We recommend running a couple of iterations. Check the tutorial on LLM judges.
Template parameters. For an explanation of each parameter, check the LLM judge docs.
Style judge
Using a similar approach, we’ll create a custom judge for style match: it should check whether the style (not the contents!) of both responses remains similar.
This can help detect more subtle changes, like the LLM suddenly becoming more verbose.
At the same time, these checks are more subjective, and we can expect some variability in the judge’s responses, so we will treat this test as “non-critical”.
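A sketch of the style judge, under the same template assumptions:

```python
style_match = BinaryClassificationPromptTemplate(
    criteria="""An ANSWER is style-matched when it uses a similar tone, length and level of detail as the REFERENCE,
regardless of whether the facts are the same.
An ANSWER is style-mismatched when its tone, length or verbosity differs noticeably from the REFERENCE.

REFERENCE:
=====
{target_response}
=====""",
    target_category="style_mismatch",
    non_target_category="style_match",
    uncertainty="unknown",
    include_reasoning=True,
    pre_messages=[("system", "You are an expert evaluator. You will compare the style of an ANSWER and a REFERENCE.")],
)
```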
6. Run the evaluation
Now, we can run the evaluation for correctness, style and text length. We do this in two steps.
Score the data. First, we define the row-level descriptors we want to add. They will process each individual response and add the score or label to the dataset. We’ll include the two LLM evaluators we just created and the built-in `TextLength()` descriptor.
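A sketch of the descriptor list. The aliases and the `additional_columns` mapping (which passes the reference column into the judge prompt) are assumptions about the current `LLMEval` API; check the LLM judge docs for your version:

```python
descriptors = [
    TextLength("response", alias="Length"),
    LLMEval(
        "response",
        template=correctness,
        provider="openai",
        model="gpt-4o-mini",
        alias="Correctness",
        additional_columns={"target_response": "target_response"},
    ),
    LLMEval(
        "response",
        template=style_match,
        provider="openai",
        model="gpt-4o-mini",
        alias="Style",
        additional_columns={"target_response": "target_response"},
    ),
]
```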
Understand Descriptors. See the list of other built-in descriptors.
To add these descriptors to the dataset, run:
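A sketch, creating the `eval_dataset` from the DataFrame with the new responses and scoring it:

```python
eval_dataset = Dataset.from_pandas(
    eval_data,
    data_definition=DataDefinition(),
)
eval_dataset.add_descriptors(descriptors=descriptors)
```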
To preview the results of this step locally:
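For example:

```python
# Inspect the scored rows, including the judge labels and explanations.
eval_dataset.as_dataframe()
```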
However, simply looking at the dataset is not very useful: we need to summarize the results and assess whether they meet our expectations. For that, we need a Report with the added Tests.
Create a Report. Let’s formulate the Report:
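A sketch of the Report, using the column aliases set above (“Length”, “Correctness”, “Style”); the specific Metric and Test condition names follow recent Evidently versions and may differ in yours:

```python
report = Report([
    TextEvals(),
    MaxValue(column="Length", tests=[lte(200)]),
    CategoryCount(column="Correctness", category="incorrect", tests=[eq(0)]),
    CategoryCount(column="Style", category="style_mismatch", tests=[eq(0, is_critical=False)]),
])
```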
What happens in this code:
- We create an Evidently Report to compute aggregate Metrics.
- We use `TextEvals` to summarize all descriptors.
- We also add Tests for specific values we want to validate. You add Tests by picking a Metric you want to assess and adding a condition to it. (See available Metrics).
- To set Test conditions, you define the expectations using parameters like `gt` (greater than), `lt` (less than), `eq` (equal), etc. (Check Test docs).
- We also label one of the Tests (style match) as non-critical. This means it will trigger a warning instead of a fail and will be visually labeled yellow in the Report and on the monitoring panel.
If you want to test the share of rows instead of the count, use `share_tests` instead of `tests`.
Run the Report. Now that our Report with its Test conditions is ready, let’s run it! We will apply it to the `eval_dataset` we prepared earlier and send the results to Evidently Cloud.
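A sketch under the same assumptions:

```python
my_eval = report.run(eval_dataset, None)

# Upload the run together with the scored data to Evidently Cloud.
ws.add_run(project.id, my_eval, include_data=True)
```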
Including data is optional but useful for most LLM use cases, since you’d want to see not just the aggregate results but also the raw text outputs.
You can preview the results in your Python notebook: call `my_eval` or `my_eval.json()`.
To view the results, navigate to the Evidently Platform. Go to the Home Page, enter your Project, and find the Reports section in the left menu. Here, you’ll see the Report you can explore.
The Report will have two sections. Metrics show a summary of all values, and Tests show the pass/fail results in the next tab. You will also see the Dataset with the added scores and explanations.
Report view, with “Style” metric selected:
Note: your explanations will vary since LLMs are non-deterministic.
The Test Suite with all Test results:
You can see that we failed the Length check. To find the failing output, sort the “Length” column in descending order and find the longest response.
Using Tags. You can optionally attach Tags to your Reports to associate this specific run with some parameter, like a prompt version. Check the docs on Tags and Metadata.
7. Test again
Let’s say you made yet another change to the prompt. Our reference dataset stays the same, but we generate a new set of answers that we want to compare to this reference.
Here is the toy `eval_data_2` to imitate the result of the change.
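For example (again, the “new” responses are made up: one now contradicts the reference and one has a different tone):

```python
# Reuse the same questions and references; imitate another set of new responses.
eval_data_2 = eval_data[["question", "target_response"]].copy()
eval_data_2["response"] = [
    "The sky is blue because the atmosphere absorbs all the blue light coming from the sun.",
    "Wings create lift as air flows over them, which keeps the plane in the air.",
    "Honestly, seasons are just the Earth's tilt doing its thing with the sunlight!",
]
```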
Create a new dataset:
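Same as before:

```python
eval_dataset_2 = Dataset.from_pandas(
    eval_data_2,
    data_definition=DataDefinition(),
)
```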
Repeat the same evaluation as before. Since we already defined the descriptors and the Report composition with conditional checks, we only need to apply them to the new data:
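A sketch:

```python
# Score the new data with the same descriptors and re-run the same Report.
eval_dataset_2.add_descriptors(descriptors=descriptors)

my_eval_2 = report.run(eval_dataset_2, None)
ws.add_run(project.id, my_eval_2, include_data=True)
```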
Explore the new Report. This time, the response length is within bounds, but one of the responses is incorrect: you can see the explanation of the contradiction picked up by the LLM judge.
There is also a “softer” fail for one of the responses that now has a different tone.
8. Get a Dashboard
As you run multiple Reports, you may want to track the results over time to see if you are improving. You can configure a Dashboard either in the UI or programmatically.
Let’s create a couple of Panels using the Dashboards-as-code approach so that it’s easy to reproduce. The following code will add:
- A counter panel to show the SUCCESS rate of the latest Test run.
- A test monitoring panel to show all Test results over time.
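A sketch of the Panels-as-code step. The Panel classes, the `metric` identifier, and the `plot_params` options below are assumptions about one recent version of the Evidently SDK; check the Dashboards docs for the exact names in your version:

```python
# Counter panel: success rate of the latest Test run.
project.dashboard.add_panel(
    DashboardPanelPlot(
        title="Success rate: latest run",
        subtitle="Share of passed Tests",
        size="half",
        values=[PanelMetric(legend="Success rate", metric="TestResult")],  # metric id is a placeholder
        plot_params={"plot_type": "counter", "aggregation": "last"},
    ),
    tab="Tests",
)

# Test monitoring panel: all Test results over time.
project.dashboard.add_panel(
    DashboardPanelPlot(
        title="Test results over time",
        subtitle="Pass/fail/warning per run",
        size="full",
        values=[PanelMetric(legend="Test results", metric="TestResult")],  # metric id is a placeholder
        plot_params={"plot_type": "bar"},
    ),
    tab="Tests",
)
```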
When you navigate to the UI, you will now see a Panel which shows a summary of Test results (Success, Failure, and Warning) for each Report we ran. As you add more Reports to the same Project, the Panels will be automatically updated to show the new Test results.
If you hover over individual Test results, you will be able to see the specific Test and its conditions. You can click on it to open the underlying Report and explore further.
Using Dashboards. You can design and add other Panel types, like simply plotting mean/max values or distributions of scores over time. Check the docs on Dashboards.
What’s next? As you design a similar Test Suite for your use case, you can integrate it with CI/CD workflows to run on every change. You can also enable alerts to be sent to your email / Slack whenever the Tests fail.