Configure LLM Judges
How to run prompt-based evaluators for custom criteria.
LLM-based descriptors use an external LLM for evaluation. You can:
- Use built-in evaluators (with pre-written prompts), or
- Run evals for custom criteria you configure.
Pre-requisites:
- You know how to use descriptors to evaluate text data.
Imports
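For example, a minimal set of imports for the snippets on this page (a sketch: exact import paths and descriptor names can differ between Evidently versions, so check the reference table for your release):

```python
import os

import pandas as pd

from evidently import Dataset, DataDefinition
from evidently.descriptors import LLMEval, ToxicityLLMEval, ContextQualityLLMEval
from evidently.llm.templates import (
    BinaryClassificationPromptTemplate,
    MulticlassClassificationPromptTemplate,
)
```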
Built-in LLM judges
Available descriptors. Check all available built-in LLM evals in the reference table.
There are built-in evaluators for popular criteria, like detecting toxicity or checking whether the text contains a refusal. These built-in descriptors:
- Default to binary classifiers.
- Default to the `gpt-4o-mini` model from OpenAI.
- Return a label, the reasoning for the decision, and an optional score.
OpenAI key. Add the API key as an environment variable: see docs.
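For example, using the standard OpenAI environment variable:

```python
# Set the OpenAI API key before running LLM-based descriptors
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"
```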
Run a single-column eval. For example, to evaluate whether the `response` column contains any toxicity:
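A sketch, assuming a small illustrative dataset and the `ToxicityLLMEval` descriptor name from the reference table:

```python
eval_df = pd.DataFrame({
    "question": ["What is your return policy?"],
    "context": ["Items can be returned within 30 days of purchase."],
    "response": ["You can return items within 30 days of purchase."],
})

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        ToxicityLLMEval("response", alias="Toxicity"),  # binary judge over the "response" column
    ],
)
```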
View the results as usual:
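For example, by converting the evaluated dataset to a DataFrame:

```python
eval_dataset.as_dataframe()
```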
Example output:
Run a multi-column eval. Some evaluators naturally require two columns. For example, to evaluate Context Quality (“does it have enough information to answer the question?”), you must run this evaluation over your `context` column and pass the `question` column as a parameter.
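A sketch with the `ContextQualityLLMEval` descriptor (name per the reference table), reusing the illustrative `eval_df` from above:

```python
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        # run the eval over the "context" column, pass "question" as a parameter
        ContextQualityLLMEval("context", question="question"),
    ],
)
```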
Example output:
Parametrize evaluators. You can switch the output format from a category to a score (0 to 1), or exclude the reasoning to get only the label:
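For example (the `include_score` and `include_reasoning` parameter names here mirror the template options documented below; treat applying them to built-in evaluators as an assumption to verify against the reference):

```python
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        ToxicityLLMEval(
            "response",
            include_score=True,       # return a 0 to 1 score together with the label
            include_reasoning=False,  # skip the explanation to get only the label
        ),
    ],
)
```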
Column names. The alias you set defines the column name with the category. If you enable the score result as well, it will get the “Alias score” name.
Change the evaluator LLM
You can change the model and provider that you use for LLM evaluations.
Change the model. Specify a different model from OpenAI:
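For example, using the `model` parameter:

```python
ToxicityLLMEval("response", model="gpt-4o")
```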
Change the provider. To use a different LLM, first set the corresponding API key as an environment variable, then pass the `provider` and `model` names. For example:
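A sketch for Anthropic models (the environment variable and model name follow Anthropic and LiteLLM conventions):

```python
os.environ["ANTHROPIC_API_KEY"] = "YOUR_KEY"

ToxicityLLMEval(
    "response",
    provider="anthropic",
    model="claude-3-5-sonnet-20240620",
)
```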
List of providers and models. Evidently uses `litellm` to call different model APIs; it supports 50+ providers. Match the `provider` and `model` parameters to the names listed in the LiteLLM docs.
You can also pass the API key as an option instead of an environment variable.
Custom LLM judge
You can also create a custom LLM evaluator using the provided templates:
- Choose a template (binary or multi-class classification).
- Specify the evaluation criteria (grading logic and names of categories).
Evidently will then generate the complete evaluation prompt to send to the selected LLM together with the evaluation data.
Binary classifier
You can ask the LLM judge to classify texts into two categories you define.
Single column
Example 1. To evaluate if the text is “concise” or “verbose”:
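A sketch of the template definition (the criteria text and category names are illustrative):

```python
conciseness_template = BinaryClassificationPromptTemplate(
    criteria="""Conciseness means that the response directly answers the question
using the fewest words necessary, without filler or repetition.

A CONCISE response is brief and to the point.
A VERBOSE response includes redundant details, repetition, or filler phrases.""",
    target_category="concise",
    non_target_category="verbose",
    uncertainty="unknown",
    include_reasoning=True,
)
```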
You do not need to explicitly ask the LLM to classify your input into two classes, return reasoning, or format the output. This is already part of the Evidently template.
To apply this descriptor to your data, pass the template to the `LLMEval` descriptor:
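For example, reusing the illustrative `eval_df` from above:

```python
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        LLMEval(
            "response",
            template=conciseness_template,
            provider="openai",
            model="gpt-4o-mini",
            alias="Conciseness",
        ),
    ],
)
```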
Publish results as usual:
Example 2. This template is very flexible: you can adapt it for any custom criteria. For instance, you can evaluate whether the question is appropriate to the scope of your LLM application. A simplified prompt:
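A sketch (the criteria wording and category names are illustrative):

```python
scope_template = BinaryClassificationPromptTemplate(
    criteria="""An APPROPRIATE question is related to the product:
its features, setup, billing, or troubleshooting.
An OUT OF SCOPE question is unrelated to the product
or asks for advice outside the assistant's role.""",
    target_category="appropriate",
    non_target_category="out of scope",
    uncertainty="unknown",
    include_reasoning=True,
)
```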
Apply the template:
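Pass it to the `LLMEval` descriptor as before, this time over the `question` column:

```python
LLMEval("question", template=scope_template, alias="Scope")
```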
Example output:
Multiple columns
A custom evaluator can also use multiple columns. To implement this, mention the second `{column_name}` placeholder inside your evaluation criteria.
Example. To evaluate if the response is faithful to the context:
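A sketch; note the `{context}` placeholder referenced inside the criteria (the wording is illustrative):

```python
faithfulness_template = BinaryClassificationPromptTemplate(
    criteria="""A FAITHFUL response only includes facts that are supported
by the provided context: {context}

An UNFAITHFUL response contradicts the context
or adds information that is not present in it.""",
    target_category="faithful",
    non_target_category="unfaithful",
    uncertainty="unknown",
    include_reasoning=True,
)
```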
You do not need to include the primary column name in the evaluation prompt, since it’s already part of the template. You choose this column when you apply the descriptor.
When you apply the descriptor, include the second column using the `additional_columns` parameter:
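For example, mapping the `context` column to the `{context}` placeholder:

```python
LLMEval(
    "response",
    template=faithfulness_template,
    additional_columns={"context": "context"},  # dataset column -> placeholder name
    alias="Faithfulness",
)
```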
Get the results as usual:
Example output:
Multi-class classifier
If you want to use more than two classes, use the multi-class template and define your grading rubric (class and class definition) as a dictionary.
Single column
Example. Let’s evaluate how well the response aligns with the brand policy, classifying into 5 categories.
It’s usually best to use as few categories as possible to make sure each of them is clearly defined.
To define the template:
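A sketch with an illustrative five-category rubric:

```python
brand_policy_template = MulticlassClassificationPromptTemplate(
    criteria="""Evaluate how well the response follows the brand communication policy:
the tone must be friendly and professional, with no unsupported claims.""",
    category_criteria={
        "fully compliant": "Follows the tone and content policy with no issues.",
        "minor issues": "Mostly compliant, with small deviations in tone or wording.",
        "needs review": "Partially compliant; some statements require manual review.",
        "non-compliant": "Clearly violates the tone or content policy.",
        "unrelated": "The response is not related to the brand or product at all.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_score=False,
)
```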
Note that we set `include_score` to False: in this case, we only get a single resulting label.
To apply the template:
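Pass it to the `LLMEval` descriptor over the `response` column:

```python
LLMEval("response", template=brand_policy_template, alias="Brand policy")
```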
Get the results as usual:
Example output:
Multi-column
Similarly to the binary evaluator, you can pass multiple columns to your evaluation prompt. To implement this, mention the additional `{column_name}` placeholder inside your evaluation criteria.
Let’s evaluate the relevance of the answer to the question, classifying it as “relevant”, “irrelevant”, or “partially relevant”. To define the evaluation template, we include the placeholder in our template:
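A sketch; the `{question}` placeholder refers to the second column (category definitions are illustrative):

```python
relevance_template = MulticlassClassificationPromptTemplate(
    criteria="""Evaluate whether the response answers the question: {question}""",
    category_criteria={
        "relevant": "The response fully addresses the question.",
        "partially relevant": "The response addresses the question only in part.",
        "irrelevant": "The response does not address the question.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_score=True,
)
```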
Note that we set `include_score` to True: in this case, we will also receive individual scores for each label.
To apply the template:
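Map the `question` column to the placeholder through `additional_columns`:

```python
LLMEval(
    "response",
    template=relevance_template,
    additional_columns={"question": "question"},
    alias="Answer relevance",
)
```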
Get the results as usual:
Example output:
Parameters
LLMEval
| Parameter | Description | Options |
|---|---|---|
| `template` | Sets a specific template for evaluation. | `BinaryClassificationPromptTemplate`, `MulticlassClassificationPromptTemplate` |
| `provider` | The provider of the LLM used for evaluation. | `openai` (Default) or any provider supported by LiteLLM. |
| `model` | Specifies the model used for evaluation. | Any available provider model (e.g., `gpt-3.5-turbo`, `gpt-4`). |
| `additional_columns` | A dictionary of additional columns present in your dataset to include in the evaluation prompt. Use it to map the column name to the placeholder name you reference in the `criteria`. For example: `{"mycol": "question"}`. | Custom dictionary (optional) |
BinaryClassificationPromptTemplate
| Parameter | Description | Options |
|---|---|---|
| `criteria` | Free-form text defining the evaluation criteria. | Custom string (required) |
| `target_category` | Name of the target category you want to detect (e.g., the one whose precision/recall you care about more than the other). The choice of “target” category has no impact on the evaluation itself, but it can be useful for later quality evaluations of your LLM judge. | Custom category (required) |
| `non_target_category` | Name of the non-target category. | Custom category (required) |
| `uncertainty` | Category to return when the provided information is not sufficient to make a clear determination. | `unknown` (Default), `target`, `non_target` |
| `include_reasoning` | Specifies whether to include the LLM-generated explanation of the result. | `True` (Default), `False` |
| `pre_messages` | List of system messages that set context or instructions before the evaluation task. Use it to explain the evaluator role (“you are an expert…”) or the context (“your goal is to grade the work of an intern…”). | Custom list (optional) |
MulticlassClassificationPromptTemplate
| Parameter | Description | Options |
|---|---|---|
| `criteria` | Free-form text defining the evaluation criteria. | Custom string (required) |
| `target_category` | Name of the target category you want to detect (e.g., the one whose precision/recall you care about more than the others). The choice of “target” category has no impact on the evaluation itself, but it can be useful for later quality evaluations of your LLM judge. | Custom category (required) |
| `category_criteria` | A dictionary with categories and their definitions. | Custom dictionary (required) |
| `uncertainty` | Category to return when the provided information is not sufficient to make a clear determination. | `unknown` (Default) |
| `include_reasoning` | Specifies whether to include the LLM-generated explanation of the result. | `True` (Default), `False` |
| `pre_messages` | List of system messages that set context or instructions before the evaluation task. | Custom list (optional) |