LLM Regression Testing
How to run regression testing for LLM outputs.
In this tutorial, we’ll show you how to do regression testing for LLM outputs. You’ll learn how to compare new and old responses after changing a prompt, model, or anything else in your system. By re-running the same inputs, you can spot any significant changes. This helps you push updates with confidence or identify issues to fix.
Tutorial scope
Here's what we'll do:
Create a toy dataset. Build a small Q&A dataset with questions and reference responses.
Get new answers. Imitate generating new answers to the same questions, which we then compare against the reference.
Create and run a Test Suite. Compare the answers using LLM-as-a-judge to evaluate length, correctness and style match.
Build a monitoring Dashboard. Get plots to track the results of Tests over time.
To complete the tutorial, you will need:
Basic Python knowledge.
An OpenAI API key to use for the LLM evaluator.
An Evidently Cloud account to track test results. If you don't have one yet, sign up for a free account.
Use the provided code snippets or run a sample notebook.
Jupyter notebook:
Or click to open in Colab.
1. Installation and Imports
Install Evidently:
Import the required modules:
To connect to Evidently Cloud:
To create monitoring panels as code:
Pass your OpenAI key:
2. Create a Project
Connect to Evidently Cloud. Replace with your actual token:
Create a Project:
Need help? Check how to find API key and create a Team.
3. Prepare the Dataset
Create a dataset with questions and reference answers. We'll later compare the new LLM responses against them:
Get a quick preview:
Here is how the data looks:
You might want to have a quick look at some data statistics to help you set conditions for Tests. Let's check the text length distribution. This will render a summary Report directly in the notebook cell.
If you work in a non-interactive Python environment, call report.as_dict() or report.json() instead.
Here is the distribution of text length:
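If you only need a few numbers to pick a sensible length range, a summary like the one in the Report can be approximated with the standard library. This is a minimal sketch, not the Evidently Report: the function name and the sample texts are hypothetical and only illustrate the idea.

```python
import statistics


def length_summary(texts):
    """Return basic statistics of text lengths, measured in characters."""
    lengths = [len(text) for text in texts]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Hypothetical reference responses, for illustration only.
reference_responses = [
    "The sky looks blue because air scatters short wavelengths of sunlight.",
    "Water boils at 100 degrees Celsius at sea level and turns into steam.",
]

summary = length_summary(reference_responses)
print(summary)
```

The minimum and maximum give a starting point for the allowed range. In practice you would likely leave some margin around them.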
4. Get new answers
Suppose you generate new responses using your LLM after changing a prompt. We will imitate it by adding a new column with new responses to the DataFrame:
Here is the resulting dataset with the added new column:
How to run it in production? In practice, replace this step with calling your LLM app to generate responses for the same inputs. After you get the new responses, add them to a DataFrame. You can also use our tracing library to instrument your app and get traces as a tabular dataset. Check the tutorial with tracing workflow.
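The production step could look roughly like the sketch below. It assumes your app exposes a single function that takes a question and returns an answer. Here generate_answer is a hypothetical stub and the question field name is an assumption, while response and target_response follow the column names used in this tutorial.

```python
def generate_answer(question):
    """Hypothetical stand-in for a call to your LLM application."""
    return f"Placeholder answer to: {question}"


def collect_responses(rows, answer_fn):
    """Return copies of the rows with a new 'response' field added."""
    return [{**row, "response": answer_fn(row["question"])} for row in rows]


# Hypothetical reference data, for illustration only.
reference_rows = [
    {
        "question": "Why is the sky blue?",
        "target_response": "Air scatters short wavelengths of sunlight more strongly.",
    },
]

eval_rows = collect_responses(reference_rows, generate_answer)
print(eval_rows[0]["response"])
```

The resulting list of records can then be converted to a DataFrame, for example with pandas.DataFrame(eval_rows).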
5. Design the Test suite
To compare new answers with old ones, we need evaluation metrics. You can use deterministic or embeddings-based metrics like SemanticSimilarity. However, you often need more custom criteria. Using LLM-as-a-judge is useful for this, letting you define what to detect.
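To make the trade-off concrete, here is a minimal sketch of a deterministic metric built on difflib from the standard library. It is a stand-in chosen for illustration, not the SemanticSimilarity descriptor, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Return a ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


reference = "Water boils at 100 degrees Celsius at sea level."
paraphrase = "At sea level, the boiling point of water is 100 degrees Celsius."
contradiction = "Water boils at 50 degrees Celsius at sea level."

print(char_similarity(reference, paraphrase))
print(char_similarity(reference, contradiction))
```

A surface-level score like this rewards shared wording, so a contradicting answer that reuses the reference phrasing can score higher than a correct paraphrase. That gap is what a custom judge is meant to cover.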
Let’s design our Tests:
Length check. All new responses must be between 80 and 200 symbols.
Correctness. All new responses should give the same answer without contradictions.
Style. All new responses should match the style of the reference.
Text length is easy to check, but for Correctness and Style checks, we'll write our custom LLM judges.
Correctness judge
We implement the correctness evaluator using an Evidently template for binary classification. We ask the LLM to classify each response as correct or incorrect based on the {target_response} column and provide reasoning for its decision.
We recommend splitting each evaluation criterion into separate judges and using a simple grading scale, like binary classifiers, for better reliability.
Don't forget to evaluate your judge! Each LLM evaluator is a small ML system you should tune to align with your preferences. We recommend running a couple of iterations to tune it. Check the tutorial on creating LLM judges.
Docs on LLM judge. For an explanation of each parameter, check the docs on LLM judge functionality.
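To show what a binary judge boils down to, here is a library-agnostic sketch that builds the prompt and parses the verdict. It is not the Evidently template: the prompt wording, the function names and the "unknown" fallback are assumptions made for illustration, and the call to the LLM itself is left out.

```python
JUDGE_TEMPLATE = """You are evaluating a RESPONSE against a REFERENCE.
Criteria: {criteria}

REFERENCE:
{target_response}

RESPONSE:
{response}

Classify the RESPONSE as exactly one of: {labels}.
Return the label on the first line and a short reasoning on the second."""


def build_judge_prompt(response, target_response, criteria,
                       labels=("correct", "incorrect")):
    """Fill the template for a single row of the dataset."""
    return JUDGE_TEMPLATE.format(
        criteria=criteria,
        target_response=target_response,
        response=response,
        labels=" or ".join(labels),
    )


def parse_verdict(raw, labels=("correct", "incorrect")):
    """Split judge output into (label, reasoning); odd labels become 'unknown'."""
    first_line, _, rest = raw.strip().partition("\n")
    label = first_line.strip().lower()
    if label not in labels:
        label = "unknown"
    return label, rest.strip()


prompt = build_judge_prompt(
    response="Water boils at 50 degrees Celsius.",
    target_response="Water boils at 100 degrees Celsius at sea level.",
    criteria="The response states the same facts as the reference, with no contradictions.",
)
print(prompt)
```

Keeping one criterion per prompt and two labels per judge makes the output easy to parse and to aggregate into a share of failures.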
Style judge
Using a similar approach, we'll create a judge for style. We'll also add clarifications to define what we mean by a style match.
Complete Test Suite
Now, we can create a Test Suite that includes checks for correctness, style matching, and text length.
Choose Tests. We select Evidently column-level Tests like TestCategoryCount and TestShareOfOutRangeValues. (You can pick other Tests, like TestColumnValueMin or TestColumnValueMean.)
Set Parameters and Conditions. Some Tests require parameters: for example, left and right to set the allowed range for Text Length. For Test fail conditions, use parameters like gt (greater than), lt (less than), eq (equal), etc.
Set non-critical Tests. Identify non-critical Tests, like the style match check, to trigger warnings instead of fails. This helps visually separate them on monitoring panels and set alerts only for critical failures.
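The effect of marking a Test as non-critical can be sketched in plain Python. This is not how Evidently implements it; the function names and status strings below are assumptions that mirror the Success, Failure and Warning outcomes described in this tutorial.

```python
def resolve_status(passed, is_critical=True):
    """Map a check outcome to a status; non-critical failures only warn."""
    if passed:
        return "SUCCESS"
    return "FAIL" if is_critical else "WARNING"


def should_alert(statuses):
    """Alert only when at least one critical check failed."""
    return "FAIL" in statuses


# Hypothetical outcomes: the style check failed, the others passed.
statuses = [
    resolve_status(passed=True),                      # length
    resolve_status(passed=True),                      # correctness
    resolve_status(passed=False, is_critical=False),  # style
]
print(statuses, should_alert(statuses))
```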
We reference our two LLM judges, style_eval and correctness_eval, and apply them to the response column in our dataset. For text length, we use the built-in TextLength() descriptor for the same column.
In this example, we expect the share of failures to be zero using the eq=0 condition. You can adjust this, such as using lte=0.1, which means "less than or equal to 10%". This would cause the Test to fail if more than 10% of rows are out of the set length range.
Allowing some share of Tests to fail is convenient for real-world applications.
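The difference between a strict and a tolerant condition can be sketched without the library. The helper names below are hypothetical, the sample lengths are made up, and treating the range boundaries as inclusive is an assumption.

```python
import operator

# Condition names follow the parameters mentioned above.
CONDITIONS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def share_out_of_range(lengths, left, right):
    """Return the share of values outside the inclusive [left, right] range."""
    outside = sum(1 for n in lengths if not left <= n <= right)
    return outside / len(lengths)


def meets(value, **condition):
    """Check a value against exactly one condition, e.g. meets(x, lte=0.1)."""
    if len(condition) != 1:
        raise ValueError("pass exactly one condition")
    name, threshold = next(iter(condition.items()))
    return CONDITIONS[name](value, threshold)


# Hypothetical response lengths: one of ten is longer than 200 characters.
lengths = [90, 120, 150, 210, 95, 130, 110, 180, 100, 160]
share = share_out_of_range(lengths, left=80, right=200)

print(share)
print(meets(share, eq=0))     # strict: any out-of-range row fails the check
print(meets(share, lte=0.1))  # tolerant: up to 10% of rows may be out of range
```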
You can add more Tests as you see fit, for example for regular expressions or word presence, and include Tests for other columns in the same Test Suite.
Understand Tests. Learn how to set Test conditions and use Tests with text data. See the list of All tests.
Understand Descriptors. See the list of available text Descriptors in the All metrics table.
6. Run the Test Suite
Now that our Test Suite is ready - let's run it!
To apply this Test Suite to the eval_data that we prepared earlier:
This will compute the Test Suite, but how do you see it? You can preview the results in your Python notebook (call test_suite). However, we’ll now send it to Evidently Cloud along with the scored data:
Including data is optional but useful for most LLM use cases since you'd want to see not just the aggregate Test results but also the raw texts to debug when Tests fail.
To view the results, navigate to the Evidently Platform. Go to the Home Page, enter your Project, and find the "Test Suites" section in the left menu. Here, you'll see the Test Suite, which you can open and explore.
You'll find both the summary Test results and the Dataset with added scores and explanations. You can zoom in on specific evaluations, such as sorting the data by Text Length or finding rows labeled as "incorrect" or "style-mismatched".
Note: your explanations will vary since LLMs are non-deterministic.
Using Tags. You can optionally attach Tags to your Test Suite to associate this specific run with some parameter, like a prompt version. Check the docs on generating snapshots.
7. Test again
Let's say you made yet another change to the prompt. Our reference dataset stays the same, but we generate a new set of answers that we want to compare to this reference.
Here is the toy eval_data_2 to imitate the result of the change.
Now, we can apply the same Test Suite to this data and send it to Evidently Cloud.
If you go and open the new Test Suite results, you can again explore the outcomes and explanations.
8. Get a Dashboard
You can continue running Test Suites in this manner. As you run more of them, you may want to track Test results over time.
You can easily add this to a Dashboard, either in the UI or programmatically. Let's create a couple of Panels using the Dashboards-as-code approach.
The following code will add:
A counter panel to show the SUCCESS rate of the latest Test run.
A test monitoring panel to show all Test results over time.
When you navigate to the UI, you will now see a Panel which shows a summary of Test results (Success, Failure, and Warning) for each Test Suite we ran. As you add more Tests to the same Project, the Panels will be automatically updated to show new Test results.
If you hover over individual Test results, you will be able to see the specific Test and conditions.
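The numbers behind these two Panels are simple aggregates, which the following sketch computes by hand. It does not use the Dashboard API: the run records, dates and function names are hypothetical and only show what the counter and the over-time view summarize.

```python
from collections import Counter


def latest_success_rate(runs):
    """Return the share of SUCCESS results in the most recent run."""
    _, statuses = max(runs, key=lambda run: run[0])
    return sum(status == "SUCCESS" for status in statuses) / len(statuses)


def status_counts_over_time(runs):
    """Return (timestamp, Counter of statuses) per run, oldest first."""
    ordered = sorted(runs, key=lambda run: run[0])
    return [(timestamp, Counter(statuses)) for timestamp, statuses in ordered]


# Hypothetical results of two Test Suite runs, keyed by ISO date.
runs = [
    ("2024-07-02", ["SUCCESS", "SUCCESS", "SUCCESS"]),
    ("2024-07-01", ["SUCCESS", "FAIL", "WARNING"]),
]

history = status_counts_over_time(runs)
print(latest_success_rate(runs))
for timestamp, counts in history:
    print(timestamp, dict(counts))
```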
Using Dashboards. You can design and add other Panel types. Check the docs on Dashboards.
What's next? As you design a similar Test Suite for your use case, you can integrate it with CI/CD workflows to run on every change.
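One common way to wire such checks into CI/CD is to turn the Test outcomes into a process exit code, so the pipeline stops on critical failures. The sketch below assumes you have already collected a list of status strings from a run. The function name and the statuses are hypothetical, and this is not an Evidently API.

```python
def ci_exit_code(statuses):
    """Return 1 if any critical Test failed, otherwise 0. Warnings do not block."""
    return 1 if "FAIL" in statuses else 0


# Hypothetical outcomes of one regression run.
statuses = ["SUCCESS", "WARNING", "SUCCESS"]
code = ci_exit_code(statuses)
print(code)

# In a CI script, the last step would pass this value to sys.exit(code).
```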