In this tutorial, we’ll show how to evaluate text against custom criteria using an LLM as the judge, and how to evaluate the LLM judge itself.

This is a local example. You will run and explore results using the open-source Python library. At the end, we’ll optionally show how to upload results to the Evidently Platform for easy exploration.

We’ll explore two ways to use an LLM as a judge:

  • Reference-based. Compare new responses against a reference. This is useful for regression testing or whenever you have a “ground truth” (approved responses) to compare against.

  • Open-ended. Evaluate responses based on custom criteria, which helps evaluate new outputs when there’s no reference available.

We will focus on demonstrating how to create and tune the LLM evaluator, which you can then apply in different contexts, like regression testing or prompt comparison.

Tutorial scope

Here’s what we’ll do:

  • Create an evaluation dataset. Create a toy Q&A dataset.

  • Create and run an LLM as a judge. Design an LLM evaluator prompt.

  • Evaluate the judge. Compare the LLM judge’s evaluations with manual labels.

We’ll start with the reference-based evaluator that determines whether a new response is correct (it’s more complex since it requires passing two columns to the prompt). Then, we’ll create a simpler judge focused on verbosity.

To complete the tutorial, you will need:

  • Basic Python knowledge.

  • An OpenAI API key to use for the LLM evaluator.

We recommend running this tutorial in Jupyter Notebook or Google Colab to render rich HTML objects with summary results directly in a notebook cell.

Run a sample notebook: Jupyter notebook or open it in Colab.

1. Installation and Imports

Install Evidently:

!pip install evidently

Import the required modules:

import pandas as pd
import numpy as np

from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *

from evidently.future.datasets import BinaryClassification

from evidently.future.report import Report
from evidently.future.presets import TextEvals, ValueStats, ClassificationPreset
from evidently.future.metrics import *

from evidently.features.llm_judge import BinaryClassificationPromptTemplate

Pass your OpenAI key as an environment variable:

import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"

2. Create the Dataset

First, we’ll create a toy Q&A dataset with customer support questions that includes:

  • Questions. The inputs sent to the LLM app.

  • Target responses. The approved responses you consider accurate.

  • New responses. Simulated new responses from the system.

  • Manual labels with explanations. Labels that say whether each response is correct or not.

Why add the labels? It’s a good idea to be the judge yourself before you write a prompt. This helps:

  • Formulate better criteria. You discover nuances that help you write a better prompt.

  • Get the “ground truth”. You can use it to evaluate the quality of the LLM judge.

Ultimately, an LLM judge is a small ML system, and it needs its own evals!

Generate the dataframe. Here’s how you can create this dataset in one go:
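The exact content doesn’t matter; below is one possible toy version with placeholder questions and answers. The column names must match those used later (question, target_response, new_response, label), and you can optionally keep a free-text explanation alongside each label:

# Toy example data: placeholder questions, approved responses,
# simulated new responses, and manual correctness labels.
golden_dataset = pd.DataFrame({
    "question": [
        "How do I reset my password?",
        "Can I use the free plan for my team?",
        "How do I contact support?",
        "Do you offer refunds?",
    ],
    "target_response": [
        "Click 'Forgot password' on the login page and follow the link sent to your email.",
        "The free plan is limited to a single user; team features require a paid plan.",
        "You can email support@example.com or use the in-app chat.",
        "Refunds are available within 30 days of purchase.",
    ],
    "new_response": [
        "Use the 'Forgot password' link on the login page and follow the emailed instructions.",
        "Yes, the free plan fully supports teams of any size.",
        "You can email support@example.com or use the in-app chat.",
        "Refunds are available within 30 days of purchase, and shipping is always free.",
    ],
    "label": ["correct", "incorrect", "correct", "incorrect"],
})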

Synthetic data. You can also generate example inputs for your LLM app using the Evidently Platform.

Create an Evidently dataset object. Pass the dataframe and map the column types:

definition = DataDefinition(
    text_columns=["question", "target_response", "new_response"],
    categorical_columns=["label"]
    )

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(golden_dataset),
    data_definition=definition)

To preview the dataset:

pd.set_option('display.max_colwidth', None)
golden_dataset.head(5)

Here’s the distribution of examples: we have both correct and incorrect responses.
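To check the label balance programmatically with plain pandas:

golden_dataset["label"].value_counts()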

3. Correctness evaluator

Now it’s time to set up an LLM judge! We’ll start with an evaluator that checks if responses are correct compared to the reference. The goal is to match the quality of our manual labels.

Configure the evaluator prompt. We’ll use the LLMEval Descriptor to create a custom binary evaluator. Here’s how to define the prompt template for correctness:

correctness = BinaryClassificationPromptTemplate(
        criteria = """An ANSWER is correct when it is the same as the REFERENCE in all facts and details, even if worded differently.
        The ANSWER is incorrect if it contradicts the REFERENCE, adds additional claims, omits or changes details.
        REFERENCE:
        =====
        {target_response}
        =====""",
        target_category="incorrect",
        non_target_category="correct",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are an expert evaluator. You will be given an ANSWER and REFERENCE")],
        )

The Binary Classification template (check docs) instructs an LLM to classify the input into two classes and add reasoning. You don’t need to ask for these details explicitly, or worry about parsing the output structure — that’s built into the template. You only need to add the criteria.

In this example, we’ve set up the prompt to be strict (“all facts and details”). You can write it differently. This flexibility is one of the key benefits of creating a custom judge.
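For example, here is a sketch of a more lenient variant that only changes the criteria text and tolerates minor omissions:

correctness_lenient = BinaryClassificationPromptTemplate(
        criteria = """An ANSWER is correct when it agrees with the REFERENCE on all key facts.
        Minor omissions or extra clarifications are acceptable as long as nothing contradicts the REFERENCE.
        The ANSWER is incorrect if it contradicts the REFERENCE or adds claims that the REFERENCE does not support.
        REFERENCE:
        =====
        {target_response}
        =====""",
        target_category="incorrect",
        non_target_category="correct",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are an expert evaluator. You will be given an ANSWER and REFERENCE")],
        )

In the steps below, we continue with the stricter correctness template defined above.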

Score your data. To add this new descriptor to your dataset, run:

eval_dataset.add_descriptors(descriptors=[
    LLMEval("new_response",
            template=correctness,
            provider = "openai",
            model = "gpt-4o-mini",
            alias="Correctness",
            additional_columns={"target_response": "target_response"}),
    ])

Preview the results. You can view the scored dataset in Python. This will show a DataFrame with newly added scores and explanations.

eval_dataset.as_dataframe()

Note: your explanations will vary since LLMs are non-deterministic.

If you want, you can also add a column that will help you easily sort and find all errors where the LLM-judged label differs from the ground-truth label.

eval_dataset.add_descriptors(descriptors=[
    ExactMatch(columns=["label", "Correctness"], alias="Judge_match")])
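You can then pull up the disagreements with plain pandas. This is a sketch that assumes the Judge_match column holds booleans; adjust the comparison if your version returns strings:

df = eval_dataset.as_dataframe()
df[df["Judge_match"] == False]  # rows where the judge disagrees with the manual label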

Get a Report. Summarize the result by generating an Evidently Report.

report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset, None)
my_eval

This will render an HTML report in the notebook cell. You can use other export options, like as_dict() for a Python dictionary output.

Since we already performed exact matching, you can see the crude accuracy of our judge. However, accuracy is not always the best metric. In this case, we might be more interested in recall: we want to make sure that the judge does not miss any “incorrect” answers.
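As a sanity check, you can compute this recall directly from the scored dataframe. The sketch below assumes the label values are exactly "correct" and "incorrect":

df = eval_dataset.as_dataframe()

# Of all responses manually labeled "incorrect", how many did the judge also flag?
truly_incorrect = df["label"] == "incorrect"
flagged_incorrect = df["Correctness"] == "incorrect"
recall = (truly_incorrect & flagged_incorrect).sum() / truly_incorrect.sum()
print(f"Recall for the 'incorrect' class: {recall:.2f}")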

4. Evaluate the LLM Eval quality

This part is a bit meta: we’re going to evaluate the quality of our LLM evaluator itself! We can treat it as a simple binary classification problem.

Data definition. To evaluate the classification quality, we need to map the structure of the dataset accordingly first. The column with the manual label is the “target”, and the LLM-judge response is the “prediction”:

df = eval_dataset.as_dataframe()

definition_2 = DataDefinition(
    classification=[BinaryClassification(
        target="label",
        prediction_labels="Correctness",
        pos_label = "incorrect")],
    categorical_columns=["label", "Correctness"])

class_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=definition_2)

pos_label refers to the class that is treated as the target (“what we want to predict better”) for metrics like precision, recall, and F1-score. Here, pos_label is “incorrect”, so recall shows the share of truly incorrect responses that the judge successfully flags.

Get a Report. Let’s use a ClassificationPreset() that combines several classification metrics:

report = Report([
    ClassificationPreset()
])

my_eval = report.run(class_dataset, None)
my_eval

# or my_eval.as_dict()

We can now get a well-rounded evaluation and explore the confusion matrix. We have one error of each type: overall, the results are pretty good! You can also refine the prompt to try to improve them.

5. Verbosity evaluator

Next, let’s create a simpler verbosity judge. It will check whether the responses are concise and to the point. This only requires evaluating one output column: such checks are perfect for production evaluations where you don’t have a reference answer.

Here’s how to set up the prompt template for verbosity:

verbosity = BinaryClassificationPromptTemplate(
        criteria = """Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
            A concise response should:
            - Provide the necessary information without unnecessary details or repetition.
            - Be brief yet comprehensive enough to address the query.
            - Use simple and direct language to convey the message effectively.""",
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are an expert text evaluator. You will be given a text of the response to a user question.")],
        )

Add this new descriptor to our existing dataset:

eval_dataset.add_descriptors(descriptors=[
    LLMEval("new_response",
            template=verbosity,
            provider = "openai",
            model = "gpt-4o-mini",
            alias="Verbosity")
    ])

Run the Report and view the summary results: 

report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset, None)
my_eval

You can also view the dataframe using eval_dataset.as_dataframe().

Don’t fully agree with the results? Use these labels as a starting point and edit the decisions where you see fit: now you’ve got your golden dataset! Next, iterate on your judge prompt. You can also try different evaluator LLMs to see which one does the job better, as shown in the sketch below. How to change an LLM.
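For instance, you can re-run the same template with a different evaluator model by changing the model argument. The snippet below is a sketch using another OpenAI model; other providers follow the same pattern (see the linked guide):

eval_dataset.add_descriptors(descriptors=[
    LLMEval("new_response",
            template=verbosity,
            provider = "openai",
            model = "gpt-4o",  # a different evaluator model
            alias="Verbosity_gpt4o")
    ])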

What’s next?

The LLM judge itself is just one part of your overall evaluation framework. You can integrate this evaluator into different workflows, such as testing your LLM outputs after changing a prompt.

To be able to easily run and compare evals, systematically track the results, and interact with your evaluation dataset, you can use the Evidently Cloud platform.

Set up Evidently Cloud

  • Sign up for a free Evidently Cloud account.

  • Create an Organization if you log in for the first time. Get an ID of your organization. (Link).

  • Get an API token. Click the Key icon in the left menu. Generate and save the token. (Link).

Import the components to connect with Evidently Cloud:

from evidently.ui.workspace.cloud import CloudWorkspace

Create a Project

Connect to Evidently Cloud using your API token:

ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

Create a Project within your Organization, or connect to an existing Project:

project = ws.create_project("My project name", org_id="YOUR_ORG_ID")
project.description = "My project description"
project.save()

# or project = ws.get_project("PROJECT_ID")

Send your eval

Since you already created the eval, you can simply upload it to the Evidently Cloud.

ws.add_run(project.id, my_eval, include_data=True)

You can then go to the Evidently Cloud, open your Project and explore the Report.