> ## Documentation Index
> Fetch the complete documentation index at: https://docs.evidentlyai.com/llms.txt
> Use this file to discover all available pages before exploring further.

# LLM Evaluation

> Evaluate text outputs in under 5 minutes

Evidently helps you evaluate LLM outputs automatically. The lets you compare prompts, models, run regression or adversarial tests with clear, repeatable checks. That means faster iterations, more confident decisions, and fewer surprises in production.

In this Quickstart, you'll try a simple eval in Python and view the results in Jupyter notebook or Colab. There are a few extras, like custom LLM judges or tests, if you want to go further.

Let’s dive in.

<Info>
  Need help at any point? Ask on [Discord](https://discord.com/invite/xZjKRaNp8b).
</Info>

## 1. Set up your environment

Install the Evidently Python library:

```python theme={null}
!pip install evidently
```

Components to run the evals:

```python theme={null}
import pandas as pd
from evidently import Dataset
from evidently import DataDefinition
from evidently import Report
from evidently.presets import TextEvals
from evidently.tests import lte, gte, eq
from evidently.descriptors import LLMEval, TestSummary, DeclineLLMEval, Sentiment, TextLength, IncludesWords
from evidently.llm.templates import BinaryClassificationPromptTemplate
```

## 2. Prepare the dataset

Let's create a toy demo chatbot dataset with "Questions" and "Answers".

```python theme={null}
data = [
    ["What is the chemical symbol for gold?", "Gold chemical symbol is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? Too many bugs!"],
    ["When does water boil?", "Water's boiling point is 100 degrees Celsius."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["What’s the fastest animal on land?", "The cheetah is the fastest land animal, capable of running up to 75 miles per hour."],
    ["Can you help me with my math homework?", "I'm sorry, but I can't assist with homework."],
    ["How many states are there in the USA?", "USA has 50 states."],
    ["What’s the primary function of the heart?", "The primary function of the heart is to pump blood throughout the body."],
    ["Can you tell me the latest stock market trends?", "I'm sorry, but I can't provide real-time stock market trends. You might want to check a financial news website or consult a financial advisor."]
]
columns = ["question", "answer"]

eval_df = pd.DataFrame(data, columns=columns)
#eval_df.head()
```

<Info>
  **Preparing your own data**. You can provide data with any structure. Some common setups:

  * Inputs and outputs from your LLM
  * Inputs, outputs, and reference outputs (for comparison)
  * Inputs, context, and outputs (for RAG evaluation)
</Info>

<Info>
  **Collecting live data**. You can also trace inputs and outputs from your LLM app and download the dataset from traces. See the [Tracing Quickstart](/quickstart_tracing)
</Info>

## 3. Run evaluations

We'll evaluate the answers for:

* **Sentiment:** from -1 (negative) to 1 (positive)
* **Text length:** character count
* **Denials:** refusals to answer. This uses an LLM-as-a-judge with built-in prompt.

Each evaluation is a `descriptor`. It adds a new score or label to each row in your dataset.

For LLM-as-a-judge, we'll use OpenAI GPT-4o mini. Set OpenAI key as an environment variable:

```python theme={null}
## import os
## os.environ["OPENAI_API_KEY"] = "YOUR KEY"
```

<Info>
  If you don't have an OpenAI key, you can use a keyword-based check `IncludesWords` instead.
</Info>

To run evals, pass the dataset and specify the list of descriptors to add:

```python theme={null}
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        DeclineLLMEval("answer", alias="Denials")]) 

# Or IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials")
```

**Congratulations!** You've just run your first eval. Preview the results locally in pandas:

```python theme={null}
eval_dataset.as_dataframe()
```

<img src="https://mintcdn.com/evi/1w-MTC1_UznqpX8R/images/examples/llm_quickstart_preview.png?fit=max&auto=format&n=1w-MTC1_UznqpX8R&q=85&s=9610f1564b855787dd98e7f6e5f23513" alt="" width="1880" height="590" data-path="images/examples/llm_quickstart_preview.png" />

<Info>
  **What other evals are there?** Browse all available descriptors including deterministic checks, semantic similarity, and LLM judges in the [descriptor list](/metrics/all_descriptors).
</Info>

## 4.  Create a Report

**Create and run a Report**. It will summarize the evaluation results.

```python theme={null}
report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset, None)
```

**Local preview**. In a Python environment like Jupyter notebook or Colab, run:

```python theme={null}
my_eval
```

This will render the Report directly in the notebook cell. You can also get a JSON or Python dictionary, or save as an external HTML file.

```python theme={null}
# my_eval.json()
# my_eval.dict()
# my_report.save_html(“file.html”)
```

<Info>
  Local Reports are great for quick experiments. To run comparisons, keep track of the results and collaborate with others, you can also upload the results to Evidently Platform and build a dashboard to visualize the results. Read more about [platform self-hosting](https://docs.evidentlyai.com/docs/setup/self-hosting).
</Info>

## 5. (Optional) Add tests

You can add conditions to your evaluations. For example, you may expect that:

* **Sentiment** is non-negative (greater or equal to 0)
* **Text length** is at most 150 symbols (less or equal to 150).
* **Denials**: there are none.
* If any condition is false, consider the output to be a "fail".

You can implement this logic easily.

<Accordion title="Add test conditions" description="How to add test conditions" icon="ballot-check">
  ```python theme={null}
  # Run the evaluation with tests 
  eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment",
                  tests=[gte(0, alias="Is_non_negative")]),
        TextLength("answer", alias="Length",
                   tests=[lte(150, alias="Has_expected_length")]),
        DeclineLLMEval("answer", alias="Denials",
                       tests=[eq("OK", column="Denials",
                                 alias="Is_not_a_refusal")]),
        TestSummary(success_all=True, alias="All_tests_passed")])

  # Uncomment to preview the results locally
  # eval_dataset.as_dataframe()
  ```

  <img src="https://mintcdn.com/evi/1w-MTC1_UznqpX8R/images/examples/llm_quickstart_descriptor_tests-min.png?fit=max&auto=format&n=1w-MTC1_UznqpX8R&q=85&s=ee0ba4d12063808e5cf8fe0479334c32" alt="" width="2124" height="878" data-path="images/examples/llm_quickstart_descriptor_tests-min.png" />

  You can limit the summary report to include only specific descriptor(s).

  ```python theme={null}
  report = Report([
      TextEvals(columns=["All_tests_passed"])
  ])

  my_eval = report.run(eval_dataset, None)
  ws.add_run(project.id, my_eval, include_data=True)

  #my_eval
  ```

  To identify rows that failed any criteria, sort by "All\_test\_passed" column:
</Accordion>

<img src="https://mintcdn.com/evi/1w-MTC1_UznqpX8R/images/examples/llm_quickstart_descriptor_tests_report-min.png?fit=max&auto=format&n=1w-MTC1_UznqpX8R&q=85&s=f60bbbb3023b73203f91037df3cde7de" alt="" width="2396" height="1566" data-path="images/examples/llm_quickstart_descriptor_tests_report-min.png" />

## 6. (Optional) Add a custom LLM jugde

You can implement custom criteria using built-in LLM judge templates.

<Accordion title="Custom LLM judge" description="How to create a custom LLM evaluator" icon="sparkles">
  Let's classify user questions as "appropriate" or "inappropriate" for an educational tool.

  ```python theme={null}
  # Define the evaluation criteria
  appropriate_scope = BinaryClassificationPromptTemplate(
      criteria="""An appropriate question is any educational query related to
      academic subjects, general school-level world knowledge, or skills.
      An inappropriate question is anything offensive, irrelevant, or out of
      scope.""",
      target_category="APPROPRIATE",
      non_target_category="INAPPROPRIATE",
      include_reasoning=True,
  )

  # Apply evaluation
  llm_evals = Dataset.from_pandas(
      eval_df,
      data_definition=DataDefinition(),
      descriptors=[
          LLMEval("question", template=appropriate_scope,
                  provider="openai", model="gpt-4o-mini",
                  alias="Question topic")
      ]
  )

  # Run and upload report
  report = Report([
      TextEvals()
  ])

  my_eval = report.run(llm_evals, None)
  ws.add_run(project.id, my_eval, include_data=True)

  # Uncomment to replace ws.add_run for a local preview 
  # my_eval
  ```

  You can implement any criteria this way, and plug in different LLM models.
</Accordion>

<img src="https://mintcdn.com/evi/1w-MTC1_UznqpX8R/images/examples/llm_quickstart_descriptor_custom_llm_judge-min.png?fit=max&auto=format&n=1w-MTC1_UznqpX8R&q=85&s=e18ff176747ed051ffb4f3a3674084a2" alt="" width="2382" height="1536" data-path="images/examples/llm_quickstart_descriptor_custom_llm_judge-min.png" />

## What's next?

Read more on how you can configure [LLM judges for custom criteria or using other LLMs](/metrics/customize_llm_judge).

We also have lots of other examples! [Explore tutorials](/metrics/introduction).
