To evaluate text data, like LLM inputs and outputs, you create Descriptors. This is a universal interface for all evals, from text statistics to LLM judges. Each descriptor computes a score or label per row of your dataset. You can combine multiple descriptors and set optional pass/fail conditions. You can use built-in descriptors or create custom ones using LLM prompts or Python.

For a general introduction, check Core Concepts. You can also refer to the LLM quickstart for a minimal example.
Use this code snippet to create sample data for testing:
```python
import pandas as pd

data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
    ["What is the boiling point of water?", "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit)."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["What’s the fastest animal on land?", "The cheetah is the fastest land animal, capable of running up to 75 miles per hour."],
    ["Can you help me with my math homework?", "I'm sorry, but I can't assist with homework. You might want to consult your teacher for help."],
    ["How many states are there in the USA?", "There are 50 states in the USA."],
    ["What’s the primary function of the heart?", "The primary function of the heart is to pump blood throughout the body."],
    ["Can you tell me the latest stock market trends?", "I'm sorry, but I can't provide real-time stock market trends. You might want to check a financial news website or consult a financial advisor."],
]

# Columns
columns = ["question", "answer"]

# Creating the DataFrame
df = pd.DataFrame(data, columns=columns)
```
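The eval_dataset used in the next steps is a Dataset with descriptors attached (Steps 1 and 2 of this workflow). A minimal sketch, assuming the same imports and API as the examples later in this section:

```python
# Minimal sketch: create a Dataset and attach two built-in descriptors.
# Import paths mirror the examples shown later in this section.
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
    ],
)
```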
Step 3. (Optional). Export results. You can preview the DataFrame with results:
```python
eval_dataset.as_dataframe()
```
Step 4. Get the Report. This will summarize the results, capturing stats and distributions for all descriptors. The easiest way to get the Report is through the TextEvals Preset. To configure and run the Report for the eval_dataset:
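A minimal sketch, assuming the Report and TextEvals preset import paths of the current Evidently API:

```python
from evidently import Report
from evidently.presets import TextEvals

# Build a Report from the TextEvals preset and run it on the evaluated dataset
report = Report([TextEvals()])
my_eval = report.run(eval_dataset, None)  # no reference dataset for a single-dataset run
my_eval  # renders the Report in a notebook
```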
You can view the Report in Python, export the outputs (HTML, JSON, Python dictionary), or upload it to the Evidently platform. See output formats for more.
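For example, a sketch assuming the run result exposes the usual export helpers (check output formats for the exact methods):

```python
my_eval.save_html("report.html")  # standalone HTML file
report_json = my_eval.json()      # JSON string
report_dict = my_eval.dict()      # Python dictionary
```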
All descriptors and parameters. Evidently includes many built-in descriptors, both deterministic and LLM-based. See the reference table for all descriptors and their parameters.
Alias. It is best to add an alias to each Descriptor to make it easier to reference. This name shows up in visualizations and column headers. It’s especially handy if you’re using checks like regular expressions with word lists, where the auto-generated title could get very long.
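For example, naming a length check explicitly (the alias value here is arbitrary):

```python
TextLength("answer", alias="Answer length")
```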
Descriptor parameters. Some Descriptors have required parameters. For example, if you’re testing for competitor mentions using the Contains Descriptor, add the list of items:
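A sketch with hypothetical competitor names (the items values are placeholders):

```python
Contains(
    "answer",
    items=["AcmeCorp", "SuperTool"],  # hypothetical competitor names to detect
    alias="Mentions competitors",
)
```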
These parameters are specific to each descriptor; check the reference table.

Multi-column descriptors. Some evals use more than one column, for example, to match a new answer against a reference or to measure semantic similarity. Pass both columns using parameters:
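For example, a semantic similarity check over two columns (a sketch; the columns parameter is assumed here, check the reference table for the exact signature):

```python
SemanticSimilarity(
    columns=["question", "answer"],  # the pair of columns to compare
    alias="Semantic similarity",
)
```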
LLM-as-a-judge. There are also built-in descriptors that prompt an external LLM to return an evaluation score. You can add them like any other descriptor, but you must also provide an API key to use the corresponding LLM.
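For example, adding the DeclineLLMEval judge used later in this section (a sketch assuming an OpenAI-backed judge, with the key set via an environment variable):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # placeholder key for the LLM judge

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Denials"),  # detects refusals using an LLM judge
])
```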
Descriptor Tests let you define pass/fail checks for each row in your dataset. Instead of just calculating values (like “How long is this text?”), you can ask:
Is the text under 100 characters?
Is the sentiment positive?
You can also combine multiple tests into a single summary result per row.

Step 1. Imports. Run imports:
```python
from evidently.descriptors import ColumnTest, TestSummary
from evidently.tests import *
```
Step 2. Add tests to a descriptor. When creating a descriptor (like TextLength or Sentiment), use the tests argument to set conditions. Each test adds a new column with a True/False result.
```python
eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment", tests=[
            gte(0, alias="Sentiment is non-negative")]),
        TextLength("answer", alias="Length", tests=[
            lte(100, alias="Length is under 100")]),
    ])
```
Use test conditions like gte (greater than or equal), lte (less than or equal), and eq (equal). Check the full list here.

You can preview the results with eval_dataset.as_dataframe().
Step 3. Add a Test Summary. Use TestSummary to combine multiple tests into one or more summary columns. For example, the following returns True if all tests pass:
```python
eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment", tests=[
            gte(0, alias="Sentiment is non-negative")]),
        TextLength("answer", alias="Length", tests=[
            lte(100, alias="Length is under 100")]),
        DeclineLLMEval("answer", alias="Denials", tests=[
            eq("OK", column="Denials", alias="Is not a refusal")]),
        TestSummary(success_all=True, alias="Test result"),  # returns True if all conditions are satisfied
    ])
```
TestSummary will only consider tests added before it in the list of descriptors.
For LLM judge descriptors returning multiple columns (e.g., label and reasoning), you must specify the target column for the test — see DeclineLLMEval in the example.
You can aggregate Test results differently and include multiple summary columns, such as total count, pass rate, or weighted score:
```python
eval_dataset.add_descriptors(descriptors=[
    TestSummary(
        success_all=True,    # True if all tests pass
        success_any=True,    # True if any test passes
        success_count=True,  # Total number of tests passed
        success_rate=True,   # Share of passed tests
        score=True,          # Weighted score
        score_weights={
            "Sentiment is non-negative": 0.9,
            "Length is under 100": 0.1,
        },
    )
])
```
Testing existing columns. Use ColumnTest to apply checks to any column, even ones not generated by descriptors. This is useful for working with metadata or precomputed values:
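A sketch, assuming ColumnTest takes the column name and a test condition, applied to a hypothetical feedback column that is not part of the sample data above:

```python
eval_dataset.add_descriptors(descriptors=[
    ColumnTest(
        "feedback",                       # hypothetical precomputed/metadata column
        eq("upvote", alias="Is upvote"),  # reuses the eq condition shown above
    ),
])
```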
You’ve already seen how to generate a Report using the TextEvals preset. It’s the simplest way to summarize evaluation results. However, you can also create custom Reports with different metric combinations for more control.

Imports. Import the components you’ll need:
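A sketch of the imports, assuming the Report class and the column-level Metrics named below are importable from the top-level and metrics modules:

```python
from evidently import Report
from evidently.metrics import MeanValue, MaxValue, QuantileValue, CategoryCount
```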
Custom Report with different Metrics. Each Evidently Report is built from individual Metrics. For example, TextEvals internally uses the ValueStats Metric for each descriptor. To customize the Report, you can reference specific descriptors and use Metrics like MeanValue, MaxValue, and so on:
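For example, summarizing the Length descriptor column added earlier (a sketch; the column argument refers to the descriptor alias):

```python
report = Report([
    MeanValue(column="Length"),  # average text length across all rows
    MaxValue(column="Length"),   # longest answer
])
my_eval = report.run(eval_dataset, None)
```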
List of all Metrics. Check the Reference table. Consider using column-level Metrics like MeanValue, MaxValue, QuantileValue, OutRangeValueCount, and CategoryCount.
Drift detection. You can also run advanced checks, like comparing distributions between two datasets, for example, to detect text length drift:
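A sketch, assuming two Datasets prepared with the same descriptors and a ValueDrift Metric on the descriptor column (dataset names are illustrative):

```python
from evidently import Report
from evidently.metrics import ValueDrift

report = Report([
    ValueDrift(column="Length"),  # compares the "Length" distribution between the two datasets
])

# eval_dataset_1 is the current data, eval_dataset_2 the reference data (illustrative names)
my_eval = report.run(eval_dataset_1, eval_dataset_2)
```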