How to run prompt-based evaluators for custom criteria.
Toy data to run the example
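Below is a minimal sketch of such a dataset, assuming a pandas DataFrame with `question`, `context` and `response` columns (the column names are illustrative and should match your own data):

```python
import pandas as pd

# Toy dataset used in the examples below: a user question,
# the retrieved context, and the model response.
eval_df = pd.DataFrame(
    [
        ["Why is the sky blue?",
         "Sunlight is scattered by air molecules; shorter (blue) wavelengths scatter the most.",
         "The sky looks blue because air molecules scatter blue light more than other colors."],
        ["What is the boiling point of water?",
         "At standard atmospheric pressure, water boils at 100 degrees Celsius.",
         "Water boils at 100 C at sea level."],
    ],
    columns=["question", "context", "response"],
)
```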
By default, the built-in evaluators use the `gpt-4o-mini` model from OpenAI. For example, to check if the `response` contains any toxicity:
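A sketch of this check, assuming the built-in `ToxicityLLMEval` descriptor and the `TextEvals` preset (import paths may differ slightly between Evidently versions; an `OPENAI_API_KEY` environment variable must be set):

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import ToxicityLLMEval

# Run the LLM judge over the "response" column and render a summary Report.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        ToxicityLLMEval(),
    ])
])
report.run(reference_data=None, current_data=eval_df)
report  # in a notebook; or use report.save_html("report.html")
```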
To evaluate the quality of the retrieved context, run the evaluation over the `context` column, and pass the `question` column as a parameter:
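A sketch using the built-in `ContextQualityLLMEval`, continuing from the imports above (the `question` parameter name is an assumption and may vary by version):

```python
from evidently.descriptors import ContextQualityLLMEval

# Evaluate the "context" column; the "question" column is passed as a parameter
# so the judge can assess whether the context is sufficient to answer it.
report = Report(metrics=[
    TextEvals(column_name="context", descriptors=[
        ContextQualityLLMEval(question="question"),
    ])
])
report.run(reference_data=None, current_data=eval_df)
```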
You can switch the output from a `category` to a `score` (0 to 1), or exclude the reasoning to get only the label:
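A sketch, under the assumption that built-in evaluators accept the same `include_score` / `include_reasoning` switches as the underlying prompt templates (check the parameter tables below and your Evidently version for the exact names):

```python
# Return a numerical score (0 to 1) and drop the reasoning to keep only the label.
ToxicityLLMEval(include_score=True, include_reasoning=False)
```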
Each built-in evaluator is a pre-configured instance of the `LLMEval` descriptor.
Change the model. To specify a different model from OpenAI, set the `provider` and `model` parameters. For example:
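A sketch using a built-in evaluator; `gpt-4o` is an illustrative model name:

```python
# Use a different OpenAI model for the evaluation.
ToxicityLLMEval(provider="openai", model="gpt-4o")
```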
To use other providers, Evidently relies on `litellm`, which implements 50+ providers, to call different model APIs. You can match the `provider` name and the `model` name parameters to the list given in the LiteLLM docs. Make sure to verify the correct path, since implementations vary slightly, e.g. `provider="gemini"`, `model="gemini/gemini-2.0-flash-lite"`.
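For example, to route the evaluation through Gemini via LiteLLM (a sketch; the corresponding API key, e.g. `GEMINI_API_KEY`, must be available in the environment):

```python
# Provider and model names follow the LiteLLM naming conventions.
ToxicityLLMEval(provider="gemini", model="gemini/gemini-2.0-flash-lite")
```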
Create a custom evaluator. To define your own evaluation criteria, create a classification prompt template, as in the sketch below. You can preview the complete evaluation prompt it generates with `print(conciseness.get_template())`.
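A sketch of a custom binary judge for conciseness, assuming `BinaryClassificationPromptTemplate` is importable from `evidently.features.llm_judge` (the module path, the criteria text, and the system message format are illustrative and may vary by version):

```python
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

conciseness = BinaryClassificationPromptTemplate(
    criteria="""Conciseness means the response is brief and to the point,
without unnecessary repetition or filler.""",
    target_category="concise",
    non_target_category="verbose",
    uncertainty="unknown",   # label to return when the judge cannot decide
    include_reasoning=True,  # also return the explanation
    pre_messages=[("system", "You are an expert text evaluator.")],
)

# Preview the complete evaluation prompt that will be sent to the LLM.
print(conciseness.get_template())
```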
To apply it, pass the created template to the `LLMEval` descriptor:
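A sketch of the application step, using the `LLMEval` parameters listed in the table below and the `conciseness` template defined above:

```python
from evidently.descriptors import LLMEval

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        LLMEval(
            template=conciseness,
            provider="openai",
            model="gpt-4o-mini",
        ),
    ])
])
report.run(reference_data=None, current_data=eval_df)
```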
You can include other columns from your dataset in the evaluation prompt. To do that, reference a `{column_placeholder}` inside your evaluation criteria.
Example. To evaluate if the response is faithful to the context:
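A sketch of such a template; the `{context}` placeholder will be filled from the dataset column mapped via `additional_columns` (the criteria text is illustrative):

```python
faithfulness = BinaryClassificationPromptTemplate(
    criteria="""A response is faithful if it only contains information
supported by the CONTEXT below and does not contradict it.

=====CONTEXT=====
{context}
=====CONTEXT=====""",
    target_category="faithful",
    non_target_category="not faithful",
    uncertainty="unknown",
    include_reasoning=True,
    include_score=False,  # return only the label, no per-label scores
)
```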
Since the context comes from a separate column, use the `additional_columns` parameter to map it to the placeholder inside the prompt, as shown in the application example below.
We also set `include_score` to False: in this case, we only get a single resulting label.
To apply the template:
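A sketch of the application step, mapping the `context` column to the `{context}` placeholder:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        LLMEval(
            template=faithfulness,
            additional_columns={"context": "context"},  # dataset column mapped to the prompt placeholder
            provider="openai",
            model="gpt-4o-mini",
        ),
    ])
])
report.run(reference_data=None, current_data=eval_df)
```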
Multi-class evaluator. You can also classify texts into more than two categories. As with the binary template, you can reference a `{column_placeholder}` inside your evaluation criteria.
Let’s evaluate the relevance of the answer to the question, classifying responses as “relevant”, “irrelevant” or “partially relevant”. To define the evaluation template, we include the question placeholder in our template:
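A sketch, assuming a `MulticlassClassificationPromptTemplate` with a `category_criteria` dictionary (see the parameter table below; the criteria text and category definitions are illustrative):

```python
from evidently.features.llm_judge import MulticlassClassificationPromptTemplate

relevance = MulticlassClassificationPromptTemplate(
    criteria="""Classify how relevant the RESPONSE is as an answer to the QUESTION below.

=====QUESTION=====
{question}
=====QUESTION=====""",
    category_criteria={
        "relevant": "The response fully answers the question.",
        "partially relevant": "The response addresses the question but is incomplete or partly off-topic.",
        "irrelevant": "The response does not answer the question.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_score=True,  # also return individual scores for each label
)
```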
We set `include_score` to True: in this case, we will also receive individual scores for each label.
To apply the template:
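A sketch of the application step, mapping the `question` column to the `{question}` placeholder:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        LLMEval(
            template=relevance,
            additional_columns={"question": "question"},
            provider="openai",
            model="gpt-4o-mini",
        ),
    ])
])
report.run(reference_data=None, current_data=eval_df)
```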
LLMEval parameters:

Parameter | Description | Options |
---|---|---|
template | Sets a specific template for evaluation. | BinaryClassificationPromptTemplate or MulticlassClassificationPromptTemplate |
provider | The provider of the LLM to be used for evaluation. | openai (Default) or any provider supported by LiteLLM. |
model | Specifies the model used for evaluation. | Any available provider model (e.g., gpt-3.5-turbo, gpt-4) |
additional_columns | A dictionary of additional columns present in your dataset to include in the evaluation prompt. Use it to map the column name to the placeholder name you reference in the criteria. For example: {"mycol": "question"}. | Custom dictionary (optional) |
BinaryClassificationPromptTemplate parameters:

Parameter | Description | Options |
---|---|---|
criteria | Free-form text defining evaluation criteria. | Custom string (required) |
target_category | Name of the target category you want to detect (e.g., the category whose precision or recall matters most to you). The choice of “target” category has no impact on the evaluation itself, but it can be useful for later quality evaluations of your LLM judge. | Custom category (required) |
non_target_category | Name of the non-target category. | Custom category (required) |
uncertainty | Category to return when the provided information is not sufficient to make a clear determination. | unknown (Default), target , non_target |
include_reasoning | Specifies whether to include the LLM-generated explanation of the result. | True (Default), False |
pre_messages | List of system messages that set context or instructions before the evaluation task. Use it to explain the evaluator role (“you are an expert..”) or context (“your goal is to grade the work of an intern..”). | Custom string (optional) |
MulticlassClassificationPromptTemplate parameters:

Parameter | Description | Options |
---|---|---|
criteria | Free-form text defining evaluation criteria. | Custom string (required) |
target_category | Name of the target category you want to detect (e.g., the category whose precision or recall matters most to you). The choice of “target” category has no impact on the evaluation itself, but it can be useful for later quality evaluations of your LLM judge. | Custom category (required) |
category_criteria | A dictionary with categories and definitions. | Custom category list (required) |
uncertainty | Category to return when the provided information is not sufficient to make a clear determination. | unknown (Default) |
include_reasoning | Specifies whether to include the LLM-generated explanation of the result. | True (Default), False |
pre_messages | List of system messages that set context or instructions before the evaluation task. | Custom string (optional) |
OpenAIPrompting descriptor
An earlier version of this functionality is available as the OpenAIPrompting descriptor. See the documentation below.

To import the Descriptor:
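A sketch, assuming the descriptor lives in `evidently.descriptors` (the import path may differ by version):

```python
from evidently.descriptors import OpenAIPrompting
```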
To compute the Descriptor for the `response` column and get a summary Report:
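A sketch with an illustrative prompt; the parameter names follow the older OpenAIPrompting API and may differ by version, and the REPLACE placeholder in the prompt is substituted with the evaluated text:

```python
pii_prompt = """
Does the text below contain personally identifiable information (PII)?
Answer with one word only: YES or NO.

text: REPLACE
"""

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        OpenAIPrompting(
            prompt=pii_prompt,
            prompt_replace_string="REPLACE",  # placeholder replaced by the evaluated text
            model="gpt-3.5-turbo-instruct",
            feature_type="cat",               # return a categorical result
        ),
    ])
])
report.run(reference_data=None, current_data=eval_df)
```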
OpenAIPrompting parameters:

Parameter | Description | Options |
---|---|---|
feature_type | The type of the resulting Descriptor. | num (numerical) or cat (categorical) |
context | Additional context added to the evaluation prompt, passed as a string. You cannot use context and context_column simultaneously. | Custom string (optional) |
context_column | Additional context added to the evaluation prompt, taken from a dataset column. You cannot use context and context_column simultaneously. | Column name (optional) |
model | The OpenAI model used for the evaluation. | Any available OpenAI model (e.g., gpt-3.5-turbo-instruct) |
possible_values | A list of possible values the model can return. If the response does not match, it is set to None as a response label. | Custom list (optional) |