Tutorial - LLM Evaluation
Evaluate and test your LLM use case in 15 minutes.
Evaluating the quality of LLM outputs is essential for building a production-grade LLM application. During development, you need to compare quality with different prompts and detect regressions. Once your app is live, you need to ensure outputs are safe and accurate and understand user behavior.
Manually reviewing individual outputs doesn't scale. This tutorial shows you how to automate LLM evaluations from experiments to production.
You will learn both about the evaluation methods and the workflow to run and track them.
Want a very simple example first? This "Hello World" will take a couple minutes.
In this tutorial, you will:
Prepare a toy chatbot dataset
Evaluate responses using different methods:
Text statistics
Text patterns
Model-based evaluations
LLM-as-a-judge
Metadata analysis
Generate visual Reports to explore evaluation results
Get a monitoring Dashboard to track metrics over time
Build a custom Test Suite to run conditional checks
You can run this tutorial locally, with the option to use Evidently Cloud for monitoring. You will work with a Q&A chatbot example, but the methods will apply to other use cases, such as RAGs and agents.
Requirements:
Basic Python knowledge.
The open-source Evidently Python library.
Optional:
An OpenAI API key (to use LLM-as-a-judge).
An Evidently Cloud account (for live monitoring).
Let's get started!
To complete the tutorial, use the provided code snippets or run a sample notebook.
Jupyter notebook:
Or click to open in Colab.
You can also follow the video version:
If you're having problems or getting stuck, reach out on Discord.
1. Installation and imports
Install Evidently in your Python environment:
Run the imports. To work with toy data:
To run the evals:
To send results to Evidently Cloud:
Optional. To remotely manage the dashboard design in Evidently Cloud:
2. Prepare a dataset
We'll use a dialogue dataset that imitates a company Q&A system where employees ask questions about HR, finance, etc. You can download the example CSV file from source or import it using requests
:
Convert it into the pandas DataFrame. Parse dates and set conversation "start_time" as index:
To get a preview:
How to collect data?: you can use the open-source tracely
library to collect the inputs and outputs from your LLM app. Check the Tracing Quickstart. You can then download the traced Dataset for evaluation.
How to pass an existing dataset? You can import a pandas DataFrame with flexible structure. Include any text columns (e.g., inputs and responses), DateTime, and optional metadata like ID, feedback, model type, etc. If you have multi-turn conversations, parse them into a table by session or input-output pairs.
3. Create a Project
This step is optional. You can also run the evaluations locally without sending results to the Cloud.
To be able to save and share results and get a live monitoring dashboard, create a Project in Evidently Cloud. Here's how to set it up:
Sign up. If you do not have one yet, create a free Evidently Cloud account and name your Organization.
Create an Organization when you log in for the first time. Get an ID of your organization. Organizations page.
Get your API token. Click the Key icon in the left menu to go. Generate and save the token. (Token page).
Connect to Evidently Cloud. Pass your API key to connect.
Create a Project. Create a new Project inside your Organization, adding your title and description:
4. Run evaluations
You will now learn how to apply different methods to evaluate your text data.
Text statistics. Evaluate simple properties like text length.
Text patterns. Detect specific words or regular patterns.
Model-based evals. Use ready-made ML models to score data (e.g., by sentiment).
LLM-as-a-judge. Prompt LLMs to categorize or score texts by custom criteria.
Similarity metrics. Measure semantic similarity between pairs of text.
To view the evaluation results, you will generate visual Reports in your Python environment. In the following sections of the tutorial, you'll also explore other formats like conditional Test Suites and live monitoring Dashboards.
It is recommended to map the data schema to make sure it is parsed correctly.
Create column mapping. Identify the type of columns in your data. Pointing to a "datetime" column will also add a time index to the plots.
Now, let's run evaluations!
You can skip steps. Each example below is self-contained, so you can skip any of them or head directly to Step 6 to see the monitoring flow.
Text statistics
Let's run a simple evaluation to understand the basic flow.
Evaluate text length. Generate a Report to evaluate the length of texts in the "response" column. Run this check for the first 100 rows in the assistant_logs
dataframe:
This calculates the number of symbols in each text and shows a summary in your notebook cell. (You can also export it in other formats - see step 5).
You can see the distribution of text length across all responses and descriptive statistics like the mean or minimal text length.
Click on "details" to see how the mean text length changes over time. The index comes from the datetime
column you mapped earlier. This helps you notice any temporal patterns, such as if texts are longer or shorter during specific periods.
Get a side-by-side comparison. You can also generate statistics for two datasets at once. For example, compare the outputs of two different prompts or data from today against yesterday.
Pass one dataset as reference
and another as current
. For simplicity, let's compare the first and next 50 rows from the same dataframe:
You will now see the summary results for both datasets:
Each evaluation that computes a score for every text in the dataset is called a descriptor
. Descriptors can be numerical (like the TextLength()
you just used) or categorical.
Evidently has many built-in descriptors. For example, try other simple statistics like SentenceCount()
or WordCount()
. We'll show more complex examples below.
List of all descriptors See all available descriptors in the "Descriptors" section of All Metrics table.
Text patterns
You can use regular expressions to identify text patterns. For example, check if the responses mention competitors, named company products, include emails, or specific topical words. These descriptors return a binary score ("True" or "False") for pattern matches.
Let's check if responses contain words related to compensation (such as salary, benefits, or payroll). Pass this word list to the IncludesWords
descriptor. This will also check for word variants.
Add an optional display name for this eval:
Here is an example result. You can see that 10 responses out of 100 relate to the topic of compensation as defined by this word list. "Details" show occurrences in time.
Such pattern evals are fast and cheap to compute at scale. You can try other descriptors like:
Contains(items=[])
for non-vocabulary words like competitor names or longer expressions,BeginsWith(prefix="")
for specific starting sequence,Custom
RegEx(reg_exp=r"")
, etc.
Model-based scoring
You can use pre-trained machine learning models to score your texts. Evidently has:
Built-in model-based descriptors like
Sentiment
.Wrappers to call external models published on HuggingFace.
Let's start with a Sentiment check. This returns a sentiment score from -1 (very negative) to 1 (very positive).
You will see the distribution of response sentiment. Most are positive or neutral, but there are a few chats with a negative sentiment.
In "details", you can look at specific times when the average sentiment of responses dipped:
To review specific responses with sentiment below zero, you can also export the dataset with scores. We'll show this later on.
Let's first see how to use external models from HuggingFace. There are two options:
Pre-selected models, like Toxicity. Pass the
HuggingFaceToxicityModel()
descriptor. This model returns a predicted toxicity score between 0 to 1.Custom models, where you specify the model name and output to use. For example, let's call the
SamLowe/roberta-base-go_emotions
model using the generalHuggingFaceModel
descriptor. This model classifies text into 28 emotions. If you pick the "neutral" label, the descriptor will return the predicted score from 0 to 1 on whether responses convey neutral emotion.
In each case, the descriptor first downloads the model from HuggingFace to your environment and then uses it to score the data. It takes a few moments to load the model.
How to interpret the results? It's typical to use a predicted score above 0.5 as a "positive" label. The toxicity score is near 0 for all responses - nothing to worry about! For neutrality, most responses have predicted scores above the 0.5 threshold, but a few are below. You can review them individually.
Choosing other models. You can choose other models, e.g. to score texts by topic. See docs.
LLM as a judge
For more complex or nuanced checks, you can use LLMs as a judge. This requires creating an evaluation prompt asking LLMs to assess the text by specific criteria, such as tone or conciseness.
This step is optional. You'll need an OpenAI API key and will incur costs by running the evaluation. Skip if you don't want to use external LLMs.
Pass the OpenAI key. It is recommended to pass the key as an environment variable. See Open AI docs for best practices.
Run template evals. Let's start with built-in prompt templates.
DeclineLLMEval()
checks if the response contains a denial.PIILLMEval()
checks if the response contains personally identifiable information. You can also ask to provide for a reasoning of the score.
To minimize API calls, we will pass only 10 data rows.
Create a custom judge. You can also define your own LLM judge with a custom prompt. To illustrate, let's ask the LLM to judge whether the provided responses are concise and return a Concise
or Verbose
label with an explanation. (Or Unknown
if not sure).
Include the custom_judge
descriptor to the Report:
All our responses are concise - great! To see the individual scores, you can publish a dataframe (see Step 5), or send the results to Evidently Cloud.
How to create your own judge. You can create custom prompts, and optionally pass the context or reference answer alongside the response. See docs
Metadata summary
Our dataset also includes user upvotes and downvotes in a categorical feedback
column. You can easily add summaries for any numerical or categorical column to the Report.
To add a summary on the “feedback” column, use ColumnSummaryMetric()
:
You will see a distribution of upvotes and downvotes.
Semantic Similarity
You can evaluate how closely two texts are in meaning using an embedding model. This descriptor requires you to define two columns. In our example, we can compare Responses and Questions to see if the chatbot answers are semantically relevant to the question.
This descriptor converts all texts into embeddings, measures Cosine Similarity between them, and returns a score from 0 to 1:
0 means that texts are opposite in meaning;
0.5 means that texts are unrelated;
1 means that texts are semantically close.
To compute the Semantic Similarity:
In our examples, the semantic similarity always stays above 0.81, which means that answers generally relate to the question.
5. Export results
This is optional. You can proceed without exporting the results.
You can export the evaluation results beyond viewing the visual Reports in Python. Here are some options.
Publish a DataFrame. Add computed scores (like semantic similarity, or LLM-based scores with an explanation) directly to your original dataset. This will let you further analyze the data, like identifying examples with the lowest scores.
Python dictionary. Get summary scores as a dictionary. Use it to export specific values for further pipeline actions:
JSON. Export summary scores as JSON:
HTML. Save a visual HTML report as a file:
You can also send the results to Evidently Cloud for monitoring!
6. Monitor results over time
In this section, you will learn how to monitor evaluations using Evidently Cloud. This allows you to:
Track offline experiment results. Keep records of evaluation scores from different experiments, like comparing output quality using different prompts.
Run evaluations in production. Periodically evaluate batches or samples of production data, such as hourly or daily.
Here's how you can set this up.
Define the evaluations. First, let's design a Report. This will specify what you want to evaluate.
Say, you want to compute summaries for metadata columns and evaluate text length, sentiment, and mentions of compensation in chatbot responses.
You can include more complex checks like LLM-as-a-judge in the same way: just list the corresponding descriptor.
Run the Report. Compute the Report for the first 50 rows:
Upload the results. Send the Report to the Evidently Cloud Project you created earlier:
View the Report. Go to the Project and open the Reports section using the menu on the left.
A single Report gives us all the information right there. But as you run more checks, you will want to see how values change over time. Let's imitate a few consecutive runs to evaluate more batches of data.
Imitate ongoing evaluations. Run and send several Reports, each time taking the next 50 rows of data. For illustration, we repeat the runs. In practice, you would compute each Report after new experiments or as you get a new batch of production data to evaluate.
Run the Report for the next 50 rows of data:
Now you will have 5 Reports in the Project. Let's get a dashboard!
Get a Monitoring Dashboard. You can start with pre-built templates.
Go to Project Dashboard.
Enter the edit mode by clicking on the "Edit" button in the top right corner.
Choose "Add Tab",
Add a "Descriptors" Tab and then a "Columns" Tab.
Use the "Show in Order" toggle above the dashboard to ignore the time gaps.
You will instantly get a dashboard with evaluation results over time.
In the "Descriptors" tab, you will see how the distributions of the text evaluation results. For example, you can notice a dip in mean Sentiment in the fourth evaluation run.
In the "Columns" tab, you can see all the metadata summaries over time. For example, you can notice that all responses in the last run were generated with gpt-3.5.
You can also add alerting conditions for specific values.
Monitoring Panel types. In addition to Tabs, you can choose monitoring panels one by one. You can choose panel title, type (bar, line chart), etc. Read more on available Panels.
7. Run conditional tests
So far, you've used Reports to summarize evaluation outcomes. However, you often want to set specific conditions for the metric values. For example, check if all texts fall within the expected length range and review results only if something goes wrong.
This is where you can use an alternative interface called TestSuites
. It will look like this:
Test Suites work similarly to Reports
, but instead of listing metrics
, you define tests
and set conditions using parameters like gt
(greater than), lt
(less than), eq
(equal), etc.
Define a Test Suite. Let’s create a simple example:
This test checks the following conditions:
Average response sentiment is positive.
Response length is always non-zero.
Maximum response length does not exceed 2000 symbols (e.g., due to chat window constraints).
Mean response length is above 500 symbols (e.g., this is a known pattern).
How to test set test conditions. Read more about Tests. You can use other descriptors and tests. For example, use TestCategoryShare
to check if the share of responses labeled "Concise" by the LLM judge is above a certain threshold. You can also automatically generate conditions from a reference dataset (e.g. expect +/- 10% of the reference values).
Compute multiple Test Suites. Let's simulate running 5 Test Suites sequentially, each on 50 rows of data, with timestamps spaced hourly:
We use a cycle for demonstration. In production, you would run these checks sequentially.
Add a test monitoring Panel. Now, let's add a simple panel to display Test results over time. You can manage dashboards in the UI (like you did before) or programmatically. Let's now explore how to do it from Python.
Load the latest dashboard configuration to Python. If you skip this step, the new Test panels will override the Tabs you added earlier.
Copy the Project ID from above the dashboard:
Next, create a Test panel within the "Tests" tab to display detailed test results:
View the test results in time. Go to the Evidently Cloud dashboard to see the history of all tests. You can notice that a single test failed in the last run. If you hover on the specific test, you can see that we failed the mean text length condition.
View the individual Test Suite. To debug, open the latest Test Suite. In "Details," you will see the distribution of text length and the current mean value, which is just slightly below the set threshold.
When can you use these Test Suites? Here are two ideas:
Regression testing. Run Test Suites whenever you change prompt or app parameters to compare new responses with references or against set criteria.
Continuous testing. Run Test Suites periodically over production logs to check that the output quality stays within expectations.
You can also set up alerts to get a notification if your Tests contain failures.
What is regression testing?. Check a separate tutorial on the regression testing workflow.
What's next?
Here are some of the things you might want to explore next:
Design the monitoring. Read more about how to add monitoring panels, configure alerts, or send data in near real-time in the Monitoring User Guide.
Need help? Ask in our Discord community.
Last updated