You know how to generate Reports or Test Suites for text data using Descriptors.
You know how to pass custom parameters for Reports or Test Suites.
You know how to specify text data in column mapping.
You can use external LLMs to score your text data. This method lets you evaluate texts by any custom criteria you define in a prompt.
The LLM "judge" will return a numerical score or a category for each text in a column. It works like any other Evidently descriptor: you can view and analyze the scores, run conditional Tests, and monitor evaluation results over time.
Evidently currently supports scoring data using OpenAI LLMs (more LLM providers coming soon). Use the LLMEval descriptor to create an evaluator with any custom criteria, or choose one of the built-in evaluators (such as detection of denials, personally identifiable information, etc.).
LLM Eval
Code example
Refer to a How-to example:
OpenAI key. Add the token as an environment variable: see docs. Note that you will incur costs when running this eval.
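For example, you can set the key from your environment or notebook before running the evaluation. The placeholder value below is hypothetical:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # replace with your own key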
Built-in evaluators
You can use built-in evaluators that include pre-written prompts for specific criteria. By default, these descriptors return a binary category label with reasoning and use the gpt-4o-mini model from OpenAI.
Imports. Import the LLMEval and built-in evaluators you want to use:
from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval, BiasLLMEval, ToxicityLLMEval, ContextQualityLLMEval
Get a Report. To create a Report, simply list them like any other descriptor:
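Here is a minimal sketch. It assumes a DataFrame named eval_df with a response column and a column_mapping defined as usual; adjust both to match your data:
from evidently.report import Report
from evidently.metric_preset import TextEvals

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        NegativityLLMEval(),
        PIILLMEval(),
        DeclineLLMEval(),
    ])
])
# run the evals on the current data; display the Report in a notebook by calling `report`
report.run(reference_data=None, current_data=eval_df, column_mapping=column_mapping)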
Run descriptors over two columns. An evaluator that assesses if the context contains enough information to answer the question requires both columns. Run the evaluation over the context column and pass the name of the column containing the question as a parameter.
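For instance, with the built-in ContextQualityLLMEval you can run the check over the context column and point it to the question column. The column names here are assumptions about your dataset, and the sketch reuses the imports and eval_df from above:
report = Report(metrics=[
    TextEvals(column_name="context", descriptors=[
        ContextQualityLLMEval(question="question"),  # pass the name of the column with the question
    ])
])
report.run(reference_data=None, current_data=eval_df, column_mapping=column_mapping)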
Which descriptors are there? See the list of available built-in descriptors in the All Metrics page.
Custom LLM judge
You can also create a custom LLM evaluator using the provided templates. You specify the parameters and evaluation criteria, and Evidently will generate the complete evaluation prompt to send to the LLM together with the evaluation data.
Imports. To import the template for the Binary Classification evaluator prompt:
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
Fill in the template. Include the definition of your criteria, names of categories, etc. For example, to define the prompt for "conciseness" evaluation:
custom_judge = LLMEval(
    subcolumn="category",
    template=BinaryClassificationPromptTemplate(
        criteria="""Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
        A concise response should:
        - Provide the necessary information without unnecessary details or repetition.
        - Be brief yet comprehensive enough to address the query.
        - Use simple and direct language to convey the message effectively.""",
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are a judge which evaluates text.")],
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="Conciseness",
)
See the explanation of each parameter below.
You do not need to explicitly ask the LLM to classify your input into two classes, ask for reasoning, or format it specially. This is already part of the template.
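To apply the custom judge, pass it to TextEvals like any other descriptor. The response column and the eval_df DataFrame are assumptions about your dataset:
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        custom_judge,  # the LLMEval descriptor defined above
    ])
])
report.run(reference_data=None, current_data=eval_df, column_mapping=column_mapping)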
Using text from multiple columns. You can use this template to run evals that use data from multiple columns.
For example, you can evaluate the output in the response column, simultaneously including data from the context or question column. This applies to scenarios like classifying the relevance of the response in relation to the question or its factuality based on context, etc.
Pass the names of the additional columns in your dataset using the additional_columns parameter and reference the {column} placeholder when you write the criteria. When you run the eval, Evidently will insert the contents of each text in the corresponding column into the evaluation prompt.
multi_column_judge = LLMEval(
    subcolumn="category",
    additional_columns={"question": "question"},
    template=BinaryClassificationPromptTemplate(
        criteria=""""Relevance" refers to whether the response directly addresses the question and effectively meets the user's intent.
        A relevant answer is an answer that directly addresses the question and effectively meets the user's intent.
        =====
        {question}
        =====""",
        target_category="Relevant",
        non_target_category="Irrelevant",
        include_reasoning=True,
        pre_messages=[("system", "You are an expert evaluator assessing the quality of a Q&A system. Your goal is to determine if the provided answer is relevant to the question based on the criteria below.")],
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="Relevancy",
)
You do not need to explicitly include the name of your primary column in the evaluation prompt. Since you include it as column_name in the TextEvals preset, it will be automatically passed to the template.
There is an earlier implementation of this approach using the OpenAIPrompting descriptor. See the documentation below.
OpenAIPrompting Descriptor
To import the Descriptor:
from evidently.descriptors import OpenAIPrompting
Define a prompt. This is a simplified example:
pii_prompt = """Please identify whether the below text contains personally identifiable information, such as name, address, date of birth, or other.

Text: REPLACE

Use the following categories for PII identification:
1 if text contains PII
0 if text does not contain PII
0 if the provided data is not sufficient to make a clear determination

Return only one category."""
The prompt has a REPLACE placeholder that will be filled with the texts you want to evaluate. Evidently will take the content of each row in the selected column, insert it into the placeholder position in the prompt, and pass it to the LLM for scoring.
To compute the score for the column response and get a summary Report:
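A minimal sketch, reusing the Report and TextEvals imports from above. The model choice, feature_type, and display name are illustrative, and the eval_df DataFrame and column_mapping are assumptions about your setup:
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        OpenAIPrompting(
            prompt=pii_prompt,
            prompt_replace_string="REPLACE",  # the placeholder used in the prompt above
            model="gpt-3.5-turbo-instruct",
            feature_type="num",  # the prompt returns a numerical category (0 or 1)
            display_name="PII for response",
        ),
    ])
])
report.run(reference_data=None, current_data=eval_df, column_mapping=column_mapping)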
Descriptor parameters
subcolumn
Specifies the type of descriptor. Available values: category, score.
template
Forces a specific template for evaluation. Available: BinaryClassificationPromptTemplate.
provider
The provider of the LLM to be used for evaluation. Available: openai.
model
Specifies the model used for evaluation within the provider, e.g., gpt-3.5-turbo-instruct.
criteria
Free-form text defining evaluation criteria.
target_category
Name of the desired or positive category.
non_target_category
Name of the undesired or negative category.
uncertainty
Category to return when the provided information is not sufficient to make a clear determination. Available: unknown (Default), target, non_target.
include_reasoning
Specifies whether to include reasoning in the classification. Available: True (Default), False. It will be included with the result.
pre_messages
List of system messages that set context or instructions before the evaluation task. For example, you can explain the evaluator role ("you are an expert...") or context ("your goal is to grade the work of an intern...").
additional_columns
A dictionary of additional columns present in your dataset to include in the evaluation prompt. Use it to map the column name to the placeholder name you reference in the criteria. For example: {"mycol": "question"}.