Text evals with LLM-as-judge

How to use external LLMs to score text data.


Prerequisites:

  • You know how to generate Reports or Test Suites for text data using Descriptors.

  • You know how to pass custom parameters for Reports or Test Suites.

  • You know how to specify text data in column mapping.

You can use external LLMs to score your text data. This method lets you evaluate texts based on any custom criteria that you define in a prompt.

The LLM “judge” must return a numerical score or a category for each text in a column. You will then be able to view scores, analyze their distribution or run conditional tests through the usual Descriptor interface.

Evidently currently supports scoring data using OpenAI LLMs. Use the OpenAIPrompting() descriptor to define your prompt and criteria.

Code example

You can refer to an end-to-end example with different Descriptors:

OpenAI key. Set your OpenAI API key as an environment variable: see docs. You will incur costs when running this eval.
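A minimal sketch of checking for the key before you start a paid run. It assumes the key is stored in the OPENAI_API_KEY environment variable, which is the name the OpenAI Python client reads by default. The helper require_openai_key is hypothetical and not part of Evidently.

```python
import os


def require_openai_key(env=os.environ):
    """Return the OpenAI key from the environment, or raise a clear error."""
    # Assumption: the key lives in OPENAI_API_KEY.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running the eval.")
    return key
```

Calling require_openai_key() at the top of a notebook or script fails fast instead of partway through scoring.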

To import the Descriptor:

from evidently.descriptors import OpenAIPrompting

Define a prompt. This is a simplified example:

pii_prompt = """
Please identify whether the below text contains personally identifiable information, such as name, address, date of birth, or other.
Text: REPLACE
Use the following categories for PII identification:
1 if text contains PII
0 if text does not contain PII
0 if the provided data is not sufficient to make a clear determination
Return only one category.
"""

The prompt must contain a REPLACE placeholder that will be filled with the texts you want to evaluate. Evidently takes the content of each row in the selected column, inserts it in place of the placeholder, and passes the resulting prompt to the LLM for scoring.
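The substitution step can be sketched in plain Python. This is an illustration of the idea only, not Evidently's internal code: fill_prompt and the sample rows are hypothetical.

```python
def fill_prompt(template: str, text: str, placeholder: str = "REPLACE") -> str:
    """Insert one row's text into the prompt template."""
    if placeholder not in template:
        raise ValueError(f"Prompt must contain the placeholder {placeholder!r}.")
    return template.replace(placeholder, text)


template = "Does the text below contain PII? Answer 1 or 0.\nText: REPLACE"
rows = ["My name is Ann.", "The weather is nice."]

# One prompt per row; each would be sent to the LLM separately.
prompts = [fill_prompt(template, row) for row in rows]
```

Because each row produces its own LLM call, the cost of the eval grows with the number of rows in the column.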

To compute the score for the column response and get a summary Report:

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        OpenAIPrompting(prompt=pii_prompt, model="gpt-3.5-turbo-instruct", feature_type="num",
            display_name="PII for response (by gpt3.5)"),
    ]),
])

You can do the same for Test Suites.

Descriptor parameters


prompt: str

  • The text of the evaluation prompt that will be sent to the LLM.

  • Include at least one placeholder string.

prompt_replace_string: str

  • A placeholder string within the prompt that will be replaced by the evaluated text.

  • The default string name is "REPLACE".

feature_type: str

  • The type of Descriptor the prompt will return.

  • Available: num (numerical) or cat (categorical).

  • This affects the statistics and default visualizations.
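To illustrate why feature_type matters, here is a rough sketch of how the same list of LLM outputs might be summarized differently. The summarize helper is hypothetical and the statistics shown are an assumption for illustration; Evidently's actual summaries and plots may differ.

```python
from collections import Counter
from statistics import mean


def summarize(scores, feature_type):
    """Summarize raw LLM outputs as numbers ("num") or as labels ("cat")."""
    if feature_type == "num":
        values = [float(s) for s in scores]
        return {"min": min(values), "mean": mean(values), "max": max(values)}
    if feature_type == "cat":
        return dict(Counter(scores))
    raise ValueError("feature_type must be 'num' or 'cat'")


num_summary = summarize(["1", "0", "1", "1"], "num")
cat_summary = summarize(["PII", "OK", "PII"], "cat")
```

A numerical score supports range-style checks such as a maximum mean value, while a categorical label supports checks on the share of each category.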

context_replace_string: str

  • An optional placeholder string within the prompt that will be replaced by the additional context.

  • The default string name is "CONTEXT".

context: Optional[str]

  • Additional context added to the evaluation prompt that stays the same for every row.

  • Examples: a reference document, a set of positive and negative examples, etc.

  • Pass this context as a string.

  • You cannot use context and context_column simultaneously.

context_column: Optional[str]

  • Additional context added to the evaluation prompt that is specific to each row.

  • Examples: a chunk of text retrieved from reference documents for a specific query.

  • Point to the column that contains the context.

  • You cannot use context and context_column simultaneously.
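The difference between context and context_column can be sketched as follows. This is a simplified illustration with hypothetical helpers (resolve_context, build_prompt) and made-up column names, assuming the default "REPLACE" and "CONTEXT" placeholders. It is not Evidently's implementation.

```python
from typing import Optional


def resolve_context(row: dict, context: Optional[str] = None,
                    context_column: Optional[str] = None) -> Optional[str]:
    """Pick the shared context or the row-specific one, never both."""
    if context is not None and context_column is not None:
        raise ValueError("Use either context or context_column, not both.")
    if context_column is not None:
        return str(row[context_column])
    return context


def build_prompt(template: str, row: dict, text_column: str,
                 context: Optional[str] = None,
                 context_column: Optional[str] = None) -> str:
    """Fill the text placeholder, then the context placeholder if any."""
    prompt = template.replace("REPLACE", str(row[text_column]))
    ctx = resolve_context(row, context, context_column)
    if ctx is not None:
        prompt = prompt.replace("CONTEXT", ctx)
    return prompt


template = "Is the answer supported by the source?\nSource: CONTEXT\nAnswer: REPLACE"
row = {"response": "Paris is the capital.", "retrieved": "Paris is the capital of France."}

shared = build_prompt(template, row, "response", context="A fixed reference document.")
per_row = build_prompt(template, row, "response", context_column="retrieved")
```

Use the shared form for a single reference text, and the per-row form when each response has its own retrieved chunk.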

model: str

  • The name of the OpenAI model to be used for the LLM prompting, e.g., gpt-3.5-turbo-instruct.

openai_params: Optional[dict]

  • A dictionary with additional parameters for the OpenAI API call.

  • Examples: temperature, max_tokens, etc.

  • Use parameters that OpenAI API accepts for a specific model.
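As an illustration, a parameters dictionary might look like the one below. The build_request helper is hypothetical: it only shows the idea of merging extra parameters into a request, and how Evidently forwards them internally is an assumption. temperature and max_tokens are standard OpenAI completion parameters.

```python
def build_request(model: str, prompt: str, openai_params=None) -> dict:
    """Merge optional extra parameters into the basic request arguments."""
    request = {"model": model, "prompt": prompt}
    request.update(openai_params or {})
    return request


# Low temperature for more repeatable scores; few tokens because
# the judge only needs to return a short label.
openai_params = {"temperature": 0, "max_tokens": 5}
request = build_request("gpt-3.5-turbo-instruct", "Text: hello", openai_params)
```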

possible_values: Optional[List[str]]

  • A list of possible values that the LLM can return.

  • This helps validate the output from the LLM and ensure it matches the expected categories.

  • If the validation does not pass, you will get None as a response label.
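The validation behavior described above can be sketched like this. The validate_label helper is hypothetical, and trimming whitespace before the comparison is an assumption rather than documented Evidently behavior.

```python
def validate_label(raw_output: str, possible_values=None):
    """Return the label if it is allowed, otherwise None."""
    label = raw_output.strip()  # assumption: surrounding whitespace is ignored
    if possible_values is not None and label not in possible_values:
        return None
    return label


ok = validate_label(" 1\n", possible_values=["0", "1"])
rejected = validate_label("maybe", possible_values=["0", "1"])
```

Rows that come back as None are worth inspecting: they usually mean the prompt did not constrain the output format tightly enough.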

display_name: Optional[str]

  • A display name visible in Reports and as a column name in tabular export.

  • Use it to name your Descriptor.
