Descriptors
How to run evaluations for text data.
To evaluate text data, like LLM inputs and outputs, you create Descriptors. This is a universal interface for all evals - from text statistics to LLM judges.
Each descriptor computes a score or label per row of your dataset. You can combine multiple descriptors and set optional pass/fail conditions. You can use built-in descriptors or create custom ones using LLM prompts or Python.
For a general introduction, check Core Concepts.
Basic flow
Step 1. Imports. Import the following modules:
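A minimal set of imports might look like this; exact module paths can vary between Evidently versions, so treat this as a sketch:

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
```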
Note. Some Descriptors (like OOVWordsPercentage()) may require nltk dictionaries:
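For example, you can download the dictionaries with nltk directly. The exact set of dictionaries depends on which descriptors you use; the list below is an assumption:

```python
import nltk

nltk.download("words")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("vader_lexicon")
```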
Step 2. Add descriptors via the Dataset object. There are two ways to do this:
- Option A. Simultaneously create the Dataset object and add descriptors to the selected columns (in this case, the “answer” column). Read more on how to create the Dataset and Data Definition.
- Option B. Add descriptors to an existing Dataset using add_descriptors. In this case, you first create the Dataset, then add the scores to it.
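Here is a minimal sketch of both options, assuming a small pandas DataFrame with an “answer” column and the Sentiment and TextLength descriptors; adjust column names, descriptors, and the DataDefinition to your data:

```python
import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength

df = pd.DataFrame({
    "question": ["What is your return policy?"],
    "answer": ["You can return any item within 30 days of purchase."],
})

# Option A: create the Dataset and add descriptors in one step.
eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
    ],
)

# Option B: create the Dataset first, then add descriptors to it.
eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
)
eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment"),
    TextLength("answer", alias="Length"),
])
```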
Step 3. (Optional). Export results. You can preview the DataFrame with results:
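For example:

```python
eval_dataset.as_dataframe()
```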
Step 4. Get the Report. This will summarize the results, capturing stats and distributions for all descriptors. The easiest way to get the Report is through the TextEvals Preset. To configure and run the Report for the eval_dataset:
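A sketch of running the TextEvals Preset; the run signature (current dataset first, optional reference second) is an assumption about the current Evidently API:

```python
from evidently import Report
from evidently.presets import TextEvals

report = Report([
    TextEvals(),
])

my_eval = report.run(eval_dataset, None)
my_eval  # renders the Report in a notebook; export options depend on your version
```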
You can view the Report in Python, export the outputs (HTML, JSON, Python dictionary), or upload it to the Evidently platform. Learn more in output formats.
Customizing descriptors
All descriptors and parameters. Evidently has multiple implemented descriptors, both deterministic and LLM-based. See a reference table with all descriptors and parameters.
Alias. It is best to add an alias to each Descriptor to make it easier to reference. This name shows up in visualizations and column headers. It’s especially handy if you’re using checks like regular expressions with word lists, where the auto-generated title could get very long.
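For instance (a sketch; the descriptor and column name are placeholders):

```python
Sentiment("answer", alias="Sentiment")
```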
Descriptor parameters. Some Descriptors have required parameters. For example, if you’re testing for competitor mentions using the Contains Descriptor, add the list of items:
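A sketch with hypothetical competitor names passed via the items parameter:

```python
from evidently.descriptors import Contains

eval_dataset.add_descriptors(descriptors=[
    Contains(
        "answer",
        items=["AcmeCorp", "CompetitorX"],  # hypothetical list of competitor names
        alias="Competitor Mentions",
    ),
])
```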
These parameters are specific to each descriptor. Check the reference table.
Multi-column descriptors. Some evals use more than one column: for example, to match a new answer against a reference, or to measure semantic similarity between two texts. Pass both columns using parameters:
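For example, a semantic similarity check between the answer and a reference column might look like this; the reference_answer column and the shape of the columns parameter are assumptions:

```python
from evidently.descriptors import SemanticSimilarity

eval_dataset.add_descriptors(descriptors=[
    SemanticSimilarity(columns=["answer", "reference_answer"], alias="Similarity"),
])
```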
LLM-as-a-judge. There are also built-in descriptors that prompt an external LLM to return an evaluation score. You can add them like any other descriptor, but you must also provide an API key to use the corresponding LLM.
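A sketch using the built-in DeclineLLMEval judge; it assumes an OpenAI-backed judge, so the API key must be available in the environment:

```python
import os
from evidently.descriptors import DeclineLLMEval

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # required by the default OpenAI-based judge

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Denials"),
])
```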
Using and customizing LLM judge. Check the in-depth LLM judge guide on using built-in and custom LLM-based evaluators.
Custom evals. Beyond custom LLM judges, you can also implement your own programmatic evals as Python functions. Check the custom descriptor guide.
Adding Descriptor Tests
Descriptor Tests let you define pass/fail checks for each row in your dataset. Instead of just calculating values (like “How long is this text?”), you can ask:
- Is the text under 100 characters?
- Is the sentiment positive?
You can also combine multiple tests into a single summary result per row.
Step 1. Imports. Run imports:
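A likely minimal set of imports for row-level tests (treat the module paths as assumptions):

```python
from evidently.descriptors import TextLength, Sentiment, TestSummary, ColumnTest
from evidently.tests import gte, lte, eq
```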
Step 2. Add tests to a descriptor. When creating a descriptor (like TextLength or Sentiment), use the tests argument to set conditions. Each test adds a new column with a True/False result.
Use test parameters like gte (greater than or equal), lte (less than or equal), and eq (equal). Check the full list here.
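For example, the sketch below checks text length and sentiment per row:

```python
eval_dataset.add_descriptors(descriptors=[
    TextLength("answer", alias="Length", tests=[lte(100)]),   # is the text at most 100 characters?
    Sentiment("answer", alias="Sentiment", tests=[gte(0)]),   # is the sentiment non-negative?
])
```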
You can preview the results with eval_dataset.as_dataframe().
Step 3. Add a Test Summary. Use TestSummary to combine multiple tests into one or more summary columns. For example, the following returns True if all tests pass:
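A sketch; the success_all parameter name is an assumption about the TestSummary API:

```python
eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment", tests=[gte(0)]),
    TextLength("answer", alias="Length", tests=[lte(100)]),
    # One summary column per row: True only if all tests above pass.
    TestSummary(success_all=True, alias="All passed"),  # parameter name is an assumption
])
```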
TestSummary will only consider tests added before it in the list of descriptors.
For LLM judge descriptors that return multiple columns (e.g., label and reasoning), you must specify the target column for the test (see DeclineLLMEval in the example).
You can aggregate Test results differently and include multiple summary columns, such as total count, pass rate, or weighted score:
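A sketch of multiple summary columns; the aggregation parameter names are assumptions:

```python
eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment", tests=[gte(0)]),
    TextLength("answer", alias="Length", tests=[lte(100)]),
    TestSummary(
        success_all=True,    # True if all tests pass
        success_any=True,    # True if at least one test passes
        success_count=True,  # number of passed tests
        success_rate=True,   # share of passed tests
    ),
])
```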
Testing existing columns. Use ColumnTest to apply checks to any column, even ones not generated by descriptors. This is useful for working with metadata or precomputed values:
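For example, to check a precomputed metadata column (the column name and the positional test argument are assumptions):

```python
eval_dataset.add_descriptors(descriptors=[
    ColumnTest("feedback", eq("positive")),  # "feedback" is a hypothetical metadata column
])
```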
Summary Reports
You’ve already seen how to generate a report using the TextEvals preset. It’s the simplest way to summarize evaluation results. However, you can also create custom reports using different Metric combinations for more control.
Imports. Import the components you’ll need:
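A likely set of imports (module paths may vary by version):

```python
from evidently import Report
from evidently.presets import TextEvals
from evidently.metrics import MeanValue, MaxValue, QuantileValue, ValueDrift
```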
Selecting a list of columns. You can apply TextEvals to specific descriptors in your dataset. This makes your report more focused and lightweight.
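For example, limiting the preset to specific descriptor columns (the columns parameter is an assumption):

```python
report = Report([
    TextEvals(columns=["Sentiment", "Length"]),
])
my_eval = report.run(eval_dataset, None)
```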
Custom Report with different Metrics. Each Evidently Report is built from individual Metrics. For example, TextEvals internally uses the ValueStats Metric for each descriptor. To customize the Report, you can reference specific descriptors and use metrics like MeanValue, MaxValue, etc.:
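A sketch referencing descriptor columns by their alias; the column and quantile parameter names are assumptions:

```python
report = Report([
    MeanValue(column="Length"),
    MaxValue(column="Length"),
    QuantileValue(column="Length", quantile=0.75),
])
my_eval = report.run(eval_dataset, None)
```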
List of all Metrics. Check the Reference table. Consider using column-level Metrics like MeanValue, MaxValue, QuantileValue, OutRangeValueCount, and CategoryCount.
Drift detection. You can also run advanced checks, like comparing distributions between two datasets, for example, to detect text length drift:
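For example, to compare the text length distribution between two datasets (eval_dataset_1 and eval_dataset_2 are hypothetical reference and current datasets):

```python
report = Report([
    ValueDrift(column="Length"),
])
my_eval = report.run(eval_dataset_2, eval_dataset_1)  # current first, reference second (assumed order)
```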
Dataset-level Test Suites
You can also attach Tests to your Metrics to get pass/fail results at the dataset Report level. Example tests:
- No response has sentiment < 0
- No response exceeds 150 characters
- No more than 10% of rows fail the summary test
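A sketch covering the first two checks; MinValue and the test conditions follow the same pattern as above and are assumptions about the available metrics:

```python
from evidently.metrics import MinValue, MaxValue
from evidently.tests import gte, lte

report = Report([
    MinValue(column="Sentiment", tests=[gte(0)]),   # no response has sentiment < 0
    MaxValue(column="Length", tests=[lte(150)]),    # no response exceeds 150 characters
    # A share-based check on the summary column (e.g., no more than 10% of rows failing)
    # could be added with a category/share metric; the exact API is not shown here.
])
my_eval = report.run(eval_dataset, None)
```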
This produces a Test Suite that shows clear pass/fail results for the overall dataset. This is useful for automated checks and regression testing.
Report and Tests API. Check separate guides on generating Reports and setting Test conditions.