Tutorial - LLM Evaluation
Evaluate and test your LLM use case in 15 minutes.
Evaluating output quality is a crucial element of building production-grade LLM applications. During development, you must compare different prompts and models and catch issues as you tweak them. Once your app is live, you must monitor its performance on actual data to ensure it's safe and accurate.
Simple "vibe checks" of individual outputs don't scale well. This tutorial shows how you can automate LLM evaluations from experiments to production.
Want a very simple example first? This "Hello World" will take a couple of minutes.
In this tutorial, you will:
Run preset evaluations for text data.
Build a custom evaluation suite using assertions and model-based grading.
Visualize results to compare two datasets or experiments.
Create a test suite to catch regressions automatically.
Get a live dashboard to track evaluation results.
You can run this tutorial locally, with the option to use Evidently Cloud for live monitoring in the final step.
Requirements:
Basic Python knowledge.
The open-source Evidently Python library.
Optional:
An OpenAI API key (to use LLM-as-a-judge).
An Evidently Cloud account (for live monitoring).
This tutorial covers several methods for LLM evals, from regular expressions to external ML models for scoring and LLM judges. We'll use a Q&A chatbot as an example use case, but these methods apply to other use cases like RAGs and agents.
Let's get started!
1. Installation and imports
Install Evidently in your Python environment:
Import the components to prepare the toy data:
Import the components to run the evals:
For some checks, you also need the NLTK package:
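The extra dictionaries can be fetched with NLTK's downloader. The exact corpora you need depend on the checks you run; the list below is a sketch covering the vocabulary, sentence-splitting, and sentiment checks:

```python
import nltk

# Dictionaries used by checks like out-of-vocabulary share,
# sentence splitting, and sentiment scoring
nltk.download("words")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("vader_lexicon")
nltk.download("punkt")
```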
Optional. To be able to send results to Evidently Cloud:
Optional. To remotely manage the dashboard design in Evidently Cloud:
2. Prepare a dataset
We'll use an example dialogue dataset that imitates a company Q&A system in which employees ask questions about HR, finance, etc.
Download the CSV file from GitHub:
Import it as a pandas DataFrame and add a datetime index:
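If you prefer to follow along without the download, here is a minimal stand-in with the same kind of structure (the column names and values are made up) and a datetime index added for the time plots:

```python
import pandas as pd

# Toy stand-in for the downloaded Q&A logs; columns are assumptions
assistant_logs = pd.DataFrame({
    "question": ["How do I request vacation days?",
                 "Where can I see my payslip?"],
    "response": ["You can request vacation days through the HR portal.",
                 "Payslips are available in the finance self-service tool."],
    "feedback": ["upvote", "downvote"],
})

# Add a datetime index: one row every 10 minutes, purely for the time plots
assistant_logs.index = pd.date_range(
    start="2024-01-01", periods=len(assistant_logs), freq="10min"
)
```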
Here is a preview with `assistant_logs.head()`:
How do I pass my own data? Import it as a pandas DataFrame. The structure is flexible: you can include text columns (inputs and responses), DateTime columns, and optional metadata like ID, feedback, model type, etc. If you have multi-turn conversations, parse them into a table by session or input-output pairs.
3. Create a Project
This step is optional. You can also run all evaluations locally.
To be able to save and share results and get a live monitoring dashboard, create a Project in Evidently Cloud. Here's how to set it up:
Sign up. If you do not have one yet, create an Evidently Cloud account and your Organization.
Add a Team. Click Teams in the left menu. Create a Team, copy and save the Team ID. (Team page).
Get your API token. Click the Key icon in the left menu to open the Token page. Generate and save the token.
Connect to Evidently Cloud. Pass your API key to connect.
Create a Project. Create a new Project inside your Team, adding your title and description:
4. Run your first eval
You will now run a few simple out-of-the-box evals and generate a visual Report in your Python environment.
Create column mapping. This optional step helps identify specific columns in your data. For example, pointing to a "datetime" column will add a time index to the plots.
Run simple evals. Let's generate a Report with some pre-selected text statistics using the `TextEvals` Preset. We'll look at the "response" column in the first 100 rows (`assistant_logs[:100]`):
The Report will show stats like:
text sentiment (scale -1 to 1)
text length (number of symbols)
number of sentences in a text
percentage of out-of-vocabulary words (scale 0 to 100)
percentage of non-letter characters (scale 0 to 100)
We call these generated statistics `descriptors`. They can be numerical or categorical.
What else is there? See available descriptors in the All Metrics table. We’ll show more complex evaluations later in the tutorial. Additionally, you can run your evals as a Test Suite (get a pass/fail for each check), or see trends on a monitoring dashboard.
5. Export results
This is optional. You can proceed without exporting or sending data elsewhere.
You can export and save evaluation results beyond viewing them in Python. Here are some options.
Python dictionary. Get summary scores:
JSON. Export summary scores as JSON:
HTML. Save a visual HTML report as a file:
Publish a DataFrame. You can add computed scores (like sentiment) directly to your original dataset. This allows you to further analyze your data, e.g. by finding low-sentiment responses.
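For example, once a sentiment score has been published back to the frame (the `sentiment` column below is a hypothetical result of that step), finding negative responses is plain pandas:

```python
import pandas as pd

# Hypothetical frame with a published "sentiment" descriptor column
scored = pd.DataFrame({
    "response": ["Happy to help!",
                 "That is not my problem.",
                 "Sure, here is the policy."],
    "sentiment": [0.8, -0.4, 0.3],
})

# Keep only responses the sentiment model scored as negative
low_sentiment = scored[scored["sentiment"] < 0]
```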
Evidently Cloud. Save results for sharing and tracking quality over time. To add the Report to the Project you created earlier, use `add_report`.
To see it in the UI, go to the Reports section using the left menu.
6. Customize evaluations
You will now learn to create a custom evaluation suite for your LLM system inputs and outputs.
You can combine different types of checks:
Rule-based. Detect specific words or patterns in your data.
ML-based. Use external models to score data (e.g., for toxicity, topic, tone).
LLM-as-a-judge. Prompt LLMs to categorize or score texts.
Similarity metrics. Use distance metrics to compare pairs of texts.
Custom Python functions. Pass your own eval.
Evidently provides a library of ready-made descriptors to parametrize. The following section will show a few examples. For clarity, we'll generate separate Reports for each group of checks. In practice, you can put all evals together in a single Report.
Rule-based evals
These evals are fast and cheap to compute at scale. Evidently has built-in descriptors for:
Regular expression checks like custom `RegExp`, `BeginsWith`, `EndsWith`, `Contains`, `IncludesWords`, etc. They return a binary score ("True" or "False") for each row.
Numerical descriptors like `OOV` (share of out-of-vocabulary words), `SentenceCount`, `WordCount`, etc. They return a numerical score for each row in the dataset.
You will again use the `TextEvals` Preset, but now add a list of `descriptors` with their parameters. Display names are optional but make the Report easier to read.
Here is an example result for the `IncludesWords(words_list=['salary'])` descriptor. Only 4 instances match this condition. "Details" show occurrences over time.
ML models
You can also use any pre-trained machine learning model to score your texts. Evidently has:
In-built model-based descriptors like `Sentiment`.
Wrappers to call external Python functions or models published on Hugging Face (`HuggingFaceModel`).
Let's evaluate the responses for Sentiment (an in-built model with scores from -1 to 1) and Toxicity (using an external Hugging Face classifier model that returns a score between 0 and 1 for the "toxic" class).
This code downloads the Hugging Face model to score your data locally. Example result with the distribution of toxicity scores:
Choosing other models. You can choose other models, e.g., to score texts by topic or emotion. See the docs.
LLM as a judge
This step is optional. Skip if you don't have an OpenAI API key or want to avoid using external LLMs.
OpenAI key. Pass it as an environment variable: see docs. You will incur costs when running this eval.
For more complex or nuanced checks, you can use LLMs as a judge. This requires creating an evaluation prompt asking LLMs to assess the text by specific criteria, for example, tone or conciseness.
To illustrate, let's create a prompt to ask the LLM to judge if the provided text includes personally identifiable information (PII) and return the label "1" if it is present. Use "REPLACE" in the prompt to specify where to include the text from your column.
Include an `OpenAIPrompting` descriptor in the Report, referencing this prompt. We will pass only 10 rows of the current data to minimize API calls.
How to create your own judge. You can create your own prompts, and optionally pass the context for scoring alongside the response. See docs.
Metadata columns
Our dataset also includes pre-existing user evaluations in a categorical `feedback` column with upvotes and downvotes. You can add summaries for any numerical or categorical column in the Report.
To add a summary of the "feedback" column, use `ColumnSummaryMetric()`:
You will see a distribution of upvotes and downvotes.
7. Compare datasets
You might want to compare two datasets using the same criteria. For example, you could compare completions to the same prompt from two different models or today's data to yesterday's. In Evidently, we call the two datasets `current` and `reference`.
Side-by-side Reports
You can generate similar Reports as before but with two datasets. This lets you visualize the distributions side by side.
For simplicity, let's take the first 100 rows as "reference" and the next 100 as "current". You can combine text evals and metadata summary.
Here is how a summary of upvotes and downvotes looks for two datasets:
Data Drift detection
In addition to side-by-side visualizations, you can evaluate data drift - shift in distributions between two datasets. You can run statistical tests or use distance metrics.
You can compare both the distribution of raw texts (“how different the texts are”) and distributions of descriptors (e.g., “how different is the distribution of text length”).
This is useful for detecting pattern shifts. For example, you might notice a sudden increase in responses of fixed length or that responses generally become shorter or longer. You can also use the "drift score" as a metric in monitoring to detect when things change significantly.
Descriptor drift. To compare the distributions of descriptors, pass them to the `TextDescriptorsDriftMetric`:
Here is the output. In our case, both samples come from the same distribution, so no drift is detected.
Data drift methods. You might want to tweak data drift detection methods and thresholds to adjust the sensitivity. Check more here. It’s also important to choose appropriate comparison windows where you expect the distributions to be generally similar.
Raw data drift. To perform drift detection on raw text data, pass the column with texts to `ColumnDriftMetric()`:
Data Drift Preset. You can also use `DataDriftPreset()` to compare the distributions of all columns in the dataset (text and metadata) at once.
To detect drift on raw data, Evidently trains and evaluates a classifier model to differentiate between the two text datasets. If the model can tell whether a text sample belongs to the "reference" or "current" dataset, you can consider them different enough. The resulting drift score is the ROC AUC of this classifier: 0.5 means the classifier is no better than random and the datasets are very similar, while values between 0.5 and 1 mean the model can tell the datasets apart and there is a likely change between them.
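The idea behind this domain-classifier score can be sketched outside Evidently with scikit-learn; the toy texts, vectorizer, and model below are my own choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Two toy text samples: "reference" and "current"
reference = ["the hr portal opens at nine"] * 10
current = ["submit expense reports by friday"] * 10

texts = reference + current
labels = [0] * len(reference) + [1] * len(current)  # 0 = reference, 1 = current

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Train a classifier to tell the two datasets apart
vectorizer = TfidfVectorizer().fit(X_train)
clf = LogisticRegression().fit(vectorizer.transform(X_train), y_train)

# ROC AUC near 0.5: datasets look alike; near 1.0: clearly different
auc = roc_auc_score(y_test, clf.predict_proba(vectorizer.transform(X_test))[:, 1])
```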
If drift is detected, Evidently shows phrases that help differentiate between the two datasets. In our case, there is no drift, so there is no interpretation.
8. Regression testing
Up to now, you've used Reports to view computed values. However, manually comparing results can be inconvenient at scale. You might want to set specific expectations for your text qualities and only review results when something goes wrong.
You can use Evidently `Test Suites` for this purpose. They have a similar API to `Reports`, but instead of listing `metrics`, you list `tests` and pass conditions using parameters like `gt` (greater than), `lt` (less than), `eq` (equal), etc.
Let’s run an example.
This checks the following conditions:
Response Length: Should always be more than 100 characters. The Test will fail if at least one response is under 100 characters.
Question Length Range: Should be between 30 and 100 characters 90% of the time. The Test will fail if more than 10% of the values are outside this range.
Response Sentiment Score: Should always be above 0. The Test will fail if at least one response is slightly negative (below 0).
Out-of-Vocabulary Words in Response: Should be under 15%. The Test will fail if more than 15% of the words are out of vocabulary. This might signal a change in the generated texts (e.g., in language or special symbol usage) that we want to know about.
Here’s how the resulting Test Suite looks. In our case, the sentiment Test failed. You can open “Details” to see supporting visuals to debug.
Setting Test conditions. You can flexibly encode conditions using in-built Tests and parameters. You can also automatically generate conditions from a reference dataset (e.g. expect +/- 10% of the reference values). Read more about Tests.
9. Monitoring dashboard
You can also create a live dashboard to monitor values and check results over time. You can use Evidently Cloud or self-host a UI service. Let's run a quick example with Evidently Cloud.
Let's write a script to simulate several production runs, each time passing 20 data rows to generate a new Test Suite (the same checks as in the example above). We will also add a daily timestamp.
Note: We do this loop for demonstration. In production, you would run checks sequentially.
Compute and send ten Test Suites to Evidently Cloud.
Finally, define what you'd like to see on the dashboard. You can add monitoring Panels and Tabs from the UI or define them programmatically.
Let's add a simple Panel to show Test results over time using the Python API.
Once you go to Evidently Cloud, you can see the Test results over time. We can clearly see that we consistently fail the sentiment check.
Monitoring Panel types. You can plot not only Test results, but also statistics and distributions of individual metrics and descriptors over time. See available Panels.
What's next?
Here are some of the things you might want to explore next:
Designing monitoring. Read more about how to design monitoring panels, configure alerts, or send data in near real-time in the Monitoring User Guide.
Need help? Ask in our Discord community.
Last updated