Text evals with LLM-as-judge
How to use external LLMs to score text data.
Prerequisites:
You know how to generate Reports or Test Suites for text data using Descriptors.
You know how to pass custom parameters for Reports or Test Suites.
You know how to specify text data in column mapping.
You can use external LLMs to score your text data. This method lets you evaluate texts by any custom criteria you define in a prompt.
The LLM "judge" will return a numerical score or a category for each text in a column. It works like any other Evidently descriptor: you can view and analyze scores, run conditional Tests, and monitor evaluation results over time.
Evidently currently supports scoring data using OpenAI LLMs (more providers coming soon). Use the LLMEval descriptor to create an evaluator with any custom criteria, or choose one of the built-in evaluators (such as detection of Denials, Personally Identifiable Information, etc.).
Refer to a How-to example:
OpenAI key. Add the token as an environment variable: see docs. Note that you will incur OpenAI usage costs when running this eval.
You can use built-in evaluators that include pre-written prompts for specific criteria. By default, these descriptors return a binary category label with reasoning, using the gpt-4o-mini model from OpenAI.
Imports. Import the LLMEval descriptor and the built-in evaluators you want to use:
Get a Report. To create a Report, simply list them like any other descriptor:
Parametrize evaluators. You can switch the output format from category to score (0 to 1) or exclude the reasoning:
Run descriptors over two columns. An evaluator that assesses whether the context contains enough information to answer the question requires both columns. Run the evaluation over the context column and pass the name of the column containing the question as a parameter.
Which descriptors are there? See the list of available built-in descriptors on the All Metrics page.
You can also create a custom LLM evaluator using the provided templates. You specify the parameters and evaluation criteria, and Evidently will generate the complete evaluation prompt to send to the LLM together with the evaluation data.
Imports. To import the template for the Binary Classification evaluator prompt:
Fill in the template. Include the definition of your criteria, names of categories, etc. For example, to define the prompt for "conciseness" evaluation:
See the explanation of each parameter below.
You do not need to explicitly ask the LLM to classify your input into two classes, ask for reasoning, or format it specially. This is already part of the template.
Using text from multiple columns. You can use this template to run evals that use data from multiple columns.
For example, you can evaluate the output in the response column while simultaneously including data from the context or question column. This applies to scenarios like classifying the relevance of the response in relation to the question, or its factuality based on the context.
Pass the names of the additional_columns in your dataset and reference the {column} placeholder when you write the criteria. When you run the eval, Evidently will insert the contents of each text in the corresponding column into the evaluation prompt.
You do not need to explicitly include the name of your primary column in the evaluation prompt. Since you pass it as column_name in the TextEvals preset, it will be automatically passed to the template.
There is an earlier implementation of this approach with the OpenAIPrompting descriptor. See the documentation below.
Parameter | Description |
---|---|
subcolumn | Specifies the type of descriptor. Available values: category, score. |
template | Forces a specific template for evaluation. Available: BinaryClassificationPromptTemplate. |
provider | The provider of the LLM used for evaluation. Available: openai. |
model | Specifies the model used for evaluation within the provider, e.g., gpt-3.5-turbo-instruct. |
criteria | Free-form text defining the evaluation criteria. |
target_category | Name of the desired or positive category. |
non_target_category | Name of the undesired or negative category. |
uncertainty | Category to return when the provided information is not sufficient to make a clear determination. Available: unknown (default), target, non_target. |
include_reasoning | Specifies whether to include reasoning in the classification. Available: True (default), False. It will be included with the result. |
pre_messages | List of system messages that set context or instructions before the evaluation task. For example, you can explain the evaluator role ("you are an expert...") or context ("your goal is to grade the work of an intern..."). |
additional_columns | A dictionary of additional columns present in your dataset to include in the evaluation prompt. Use it to map the column name to the placeholder name you reference in the criteria. For example: {"mycol": "question"}. |