Evaluate Text Data
How to run evaluations for text data with Descriptors.
A Descriptor is a row-level score that evaluates a specific characteristic of text data. A simple example is text length.
Descriptors range from regular expressions and text statistics to ML- and LLM-based checks. For example, you can calculate the semantic similarity between two texts, or ask an LLM to label responses as "relevant" or "not relevant".
You can use Descriptors in two ways:
* In a Report. This helps visualize and summarize the scores, for example, to show the text length across all texts.
* In a Test Suite. This checks whether conditions are met, for example, that all texts are within a certain length (True/False).
Code examples
* Using descriptors to evaluate LLM outputs with the `TextEvals` Preset
* Using descriptors with tabular Metrics and Tests
Imports
After installing Evidently, import the selected descriptors and the relevant components based on whether you want to generate Reports or run Tests.
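For example, here is a set of imports that covers the snippets on this page. The module paths follow the Evidently 0.4.x API and may differ in other versions:

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.test_suite import TestSuite
from evidently.metric_preset import TextEvals
from evidently.metrics import ColumnSummaryMetric, ColumnDriftMetric
from evidently.tests import (
    TestColumnValueMean,
    TestColumnValueMax,
    TestColumnValueMin,
    TestColumnQuantile,
    TestCategoryCount,
    TestCategoryShare,
)
from evidently.descriptors import (
    Sentiment,
    TextLength,
    Contains,
    IncludesWords,
    SemanticSimilarity,
)
```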
Note. For some Descriptors that use vocabulary-based checks (like `IncludesWords` or `OOV` for out-of-vocabulary words), you may need to download `nltk` dictionaries:
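A typical set of downloads (the exact corpora you need depend on the Descriptors you use):

```python
import nltk

nltk.download("words")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("vader_lexicon")
```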
How it works
Here is the general flow to run an evaluation:
1. Input data. Prepare the data as a Pandas DataFrame. Include at least one text column. This will be your `current_data` to run evals on. Optionally, prepare the `reference` dataset.
2. Schema mapping. Define your data schema using Column Mapping. Optional, but highly recommended (see the mapping sketch after this list).
3. Define the Report or Test Suite. Create a `Report` or a `TestSuite` object with the selected checks.
4. Run the Report. Run the Report on your `current_data`, passing the `column_mapping`. Optionally, pass the `reference_data`.
5. Get the summary results. Get a visual Report in a Jupyter notebook, export the metrics, or upload it to Evidently Platform.
6. Get the scored datasets. To see row-level scores, export the Pandas DataFrame with added descriptors. (Or view this on Evidently Platform.)
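For example, a minimal Column Mapping for a dataset with two text columns ("question" and "response" are example column names):

```python
from evidently import ColumnMapping

# Mark the raw text columns so Evidently treats them as text features
column_mapping = ColumnMapping(
    text_features=["question", "response"],
)
```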
Available Descriptors. See Descriptors in the All Metrics page.
Reports and Test Suites. For basic API, read how to run Reports and Test Suites.
Text Evals
For most cases, we recommend using the `TextEvals` Preset. It provides an easy way to create a Report that summarizes Descriptor values for a specific column.
Basic example. To evaluate the Sentiment and Text Length (in symbols) for the `response` column:
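A sketch of such a Report, using the `TextEvals` Preset with the `Sentiment` and `TextLength` Descriptors:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Sentiment(),
        TextLength(),
    ])
])
```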
Run the Report on your DataFrame `df`:
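For example, passing the Column Mapping defined earlier (the reference dataset is optional):

```python
report.run(reference_data=None, current_data=df, column_mapping=column_mapping)
```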
You can work with the Report as usual and export the results as HTML, JSON, a Python dictionary, etc. To view the interactive Report directly in a Jupyter notebook or Colab:
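In a notebook cell, calling the Report object renders it inline:

```python
report  # or report.show()
```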
You can also add the computed Descriptors to your original dataset. To view or export the scored DataFrame:
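A sketch, assuming a recent Evidently version where the Report exposes the scored data via `datasets()`; the exact method may differ between releases:

```python
# View the current dataset with the Descriptor columns added
report.datasets().current

# Keep it as a regular Pandas DataFrame for further analysis
df_with_scores = report.datasets().current
```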
How to get the outputs. Check the details on all available Output Formats.
Display name. It's a good idea to add a `display_name` to each Descriptor. This name shows up in visualizations and column headers. It's especially handy if you're using checks like regular expressions with word lists, where the auto-generated title could get very long.
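For example:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Sentiment(display_name="Response sentiment"),
        TextLength(display_name="Response length, symbols"),
    ])
])
```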
Evaluations for multiple columns. If you want to evaluate several columns, like "response" and "question", just list multiple Presets in the same Report and include the Descriptors you need for each one.
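For example, to evaluate both columns with the same Descriptors:

```python
report = Report(metrics=[
    TextEvals(column_name="question", descriptors=[
        Sentiment(),
        TextLength(),
    ]),
    TextEvals(column_name="response", descriptors=[
        Sentiment(),
        TextLength(),
    ]),
])
```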
Descriptor parameters. Some Descriptors have required parameters. For example, if you're testing for competitor mentions using the `Contains` Descriptor, you must include the names in the `items` list:
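A sketch (the competitor names are placeholders):

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Contains(
            items=["AcmeCorp", "Globex"],  # hypothetical competitor names
            display_name="Competitor mentions",
        ),
    ])
])
```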
Multi-column descriptors. Some Descriptors, like `SemanticSimilarity`, require a second column. Pass it as a parameter:
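A sketch, assuming the second column is passed via a `with_column` argument; check the Descriptor reference for the exact parameter name in your version:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        SemanticSimilarity(with_column="question"),
    ])
])
```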
Some Descriptors, like custom LLM judges, might require a more complex setup, but you can still include them in the Report just like any other Descriptor.
Reference. To see the Descriptor parameters, check the All Metrics page.
LLM-as-a-judge. For a detailed guide on setting up LLM-based evals, check the guide to LLM as a judge.
Custom descriptors. You can implement descriptors as Python functions. Check the guide on custom descriptors.
Using Metrics
The `TextEvals` Preset works by generating a `ColumnSummaryMetric` for each Descriptor you calculate. You can achieve the same results by explicitly creating this Metric for each Descriptor:
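A sketch, assuming Descriptors are bound to a column with the `.on()` helper (the exact helper may differ between versions):

```python
report = Report(metrics=[
    ColumnSummaryMetric(column_name=Sentiment().on("response")),
    ColumnSummaryMetric(column_name=TextLength().on("response")),
])
```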
For a two-column Descriptor like `SemanticSimilarity()`, pass both columns as a list:
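For example (with the same `.on()` assumption as above):

```python
report = Report(metrics=[
    ColumnSummaryMetric(column_name=SemanticSimilarity().on(["response", "question"])),
])
```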
Text Descriptor Drift detection. Sometimes, you might want to use a different Metric, like `ColumnDriftMetric`. Here is how to do this:
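A sketch (`ref_df` and `cur_df` stand for your reference and current DataFrames):

```python
report = Report(metrics=[
    ColumnDriftMetric(column_name=TextLength().on("response")),
])
report.run(reference_data=ref_df, current_data=cur_df, column_mapping=column_mapping)
```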
In this case, you'll need to pass both the `reference` and `current` datasets. The Metric will compare the distribution of the "response" Text Length in the two datasets and return a drift score.
You can use other column-level Metrics this way.
However, in most cases, it's better to first generate a DataFrame with the scores through `TextEvals`. You can then run evaluations on the new dataset by referencing the newly added column directly.
Run Tests
You can also run Tests with text Descriptors to verify set conditions and return a Pass or Fail result.
Example 1. To test that the average response sentiment is greater than or equal (`gte`) to 0, and that the maximum text length is less than or equal (`lte`) to 200 characters:
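A sketch (with the same `.on()` assumption as above):

```python
test_suite = TestSuite(tests=[
    TestColumnValueMean(column_name=Sentiment().on("response"), gte=0),
    TestColumnValueMax(column_name=TextLength().on("response"), lte=200),
])
test_suite.run(reference_data=None, current_data=df, column_mapping=column_mapping)
```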
Example 2. To test that the number of responses mentioning competitors is zero:
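A sketch using `TestCategoryCount` together with the `Contains` Descriptor (competitor names are placeholders):

```python
test_suite = TestSuite(tests=[
    TestCategoryCount(
        column_name=Contains(items=["AcmeCorp", "Globex"]).on("response"),
        category=True,  # count rows where the check returned True
        eq=0,
    ),
])
```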
Example 3. To test that the Semantic Similarity between two columns is greater than or equal to 0.9:
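A sketch using `TestColumnValueMin` (the second column name is a placeholder):

```python
test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=SemanticSimilarity().on(["response", "reference_response"]),
        gte=0.9,
    ),
])
```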
Available Tests. You can use any column-level Tests with Descriptors. Here are a few particularly useful ones.
For numerical Descriptors:
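A sketch of a few useful numerical Tests, run without explicit conditions so that they are auto-derived from the reference dataset (`ref_df` and `cur_df` stand for your reference and current DataFrames):

```python
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name=TextLength().on("response")),
    TestColumnValueMax(column_name=TextLength().on("response")),
    TestColumnValueMean(column_name=Sentiment().on("response")),
    TestColumnQuantile(column_name=TextLength().on("response"), quantile=0.75),
])
test_suite.run(reference_data=ref_df, current_data=cur_df, column_mapping=column_mapping)
```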
In these examples, the Test conditions come from the `reference` dataset. You can also pass custom ones.
Test conditions. See the list of All tests with defaults. Learn how to set custom Test conditions.
For categorical Descriptors, use the `TestCategoryCount` or `TestCategoryShare` Tests.
For example, to test if the share of responses that contain travel-related words is less than or equal to 20%:
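A sketch using the `IncludesWords` Descriptor (the word list and the `words_list` argument name are assumptions; check the Descriptor reference for your version):

```python
test_suite = TestSuite(tests=[
    TestCategoryShare(
        column_name=IncludesWords(words_list=["flight", "hotel", "luggage"]).on("response"),
        category=True,  # share of rows where the check returned True
        lte=0.2,
    ),
])
```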