LLM Evaluation
Evaluate text outputs in under 5 minutes
Evidently helps you evaluate LLM outputs automatically. The library lets you compare prompts and models and run regression or adversarial tests with clear, repeatable checks. That means faster iterations, more confident decisions, and fewer surprises in production.
In this Quickstart, you’ll try a simple eval in Python and view the results in Evidently Cloud. If you want to stay fully local, you can do that too - just skip a couple of steps.
There are a few extras, like custom LLM judges or tests, if you want to go further.
Let’s dive in.
Need help at any point? Ask on Discord.
1. Set up your environment
For a fully local flow, skip steps 1.1 and 1.3.
1.1. Set up Evidently Cloud
- Sign up for a free Evidently Cloud account.
- Create an Organization when you log in for the first time and copy your organization ID. (Link).
- Get an API token: click the Key icon in the left menu, then generate and save the token. (Link).
1.2. Installation and imports
Install the Evidently Python library:
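For example, with pip (the LLM-as-a-judge checks in this Quickstart call OpenAI, so you may also need the OpenAI client installed):

```bash
pip install evidently
# if the OpenAI client is not already installed:
# pip install openai
```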
Components to run the evals:
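A possible set of imports, assuming a recent Evidently release; descriptor names and module paths may differ in older versions:

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength, DeclineLLMEval, IncludesWords
from evidently.presets import TextEvals
```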
Components to connect with Evidently Cloud:
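And the Cloud workspace client (import path assumed for recent releases):

```python
from evidently.ui.workspace import CloudWorkspace
```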
1.3. Create a Project
Connect to Evidently Cloud using your API token:
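A sketch of the connection; replace the placeholder with the token you generated in step 1.1:

```python
ws = CloudWorkspace(
    token="YOUR_API_TOKEN",            # the token from step 1.1
    url="https://app.evidently.cloud",
)
```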
Create a Project within your Organization, or connect to an existing Project:
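For example (the project name and description below are placeholders; method names assume the current Cloud workspace API):

```python
# Create a new Project in your Organization (ID from step 1.1) ...
project = ws.create_project("LLM eval quickstart", org_id="YOUR_ORG_ID")
project.description = "Toy chatbot evaluation"
project.save()

# ... or connect to an existing Project by its ID:
# project = ws.get_project("YOUR_PROJECT_ID")
```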
2. Prepare the dataset
Let’s create a toy demo chatbot dataset with “Questions” and “Answers”.
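A minimal example; the questions and answers below are made up for illustration:

```python
data = [
    ["What is the return policy?", "You can return any item within 30 days of purchase."],
    ["Can you share my colleague's home address?", "I am sorry, I cannot share personal information."],
    ["How do I reset my password?", "Click 'Forgot password' on the login page and follow the emailed link."],
    ["What do you think of your competitors?", "I'd rather not comment on other companies."],
]

eval_df = pd.DataFrame(data, columns=["question", "answer"])
```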
Preparing your own data. You can provide data with any structure. Some common setups:
- Inputs and outputs from your LLM
- Inputs, outputs, and reference outputs (for comparison)
- Inputs, context, and outputs (for RAG evaluation)
Collecting live data. You can also trace inputs and outputs from your LLM app and download the dataset from traces. See the Tracing Quickstart.
3. Run evaluations
We’ll evaluate the answers for:
- Sentiment: from -1 (negative) to 1 (positive)
- Text length: character count
- Denials: refusals to answer. This check uses an LLM-as-a-judge with a built-in prompt.
Each evaluation is a descriptor. It adds a new score or label to each row in your dataset.
For LLM-as-a-judge, we’ll use OpenAI GPT-4o mini. Set your OpenAI API key as an environment variable:
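For example (replace the placeholder with your own key, or export it in your shell instead):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
```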
If you don’t have an OpenAI key, you can use the keyword-based IncludesWords check instead.
To run evals, pass the dataset and specify the list of descriptors to add:
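A sketch, assuming the descriptor names from the imports above; the column names and aliases come from the toy dataset:

```python
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),        # -1 to 1
        TextLength("answer", alias="Length"),          # character count
        DeclineLLMEval("answer", alias="Denials"),     # LLM judge, needs the OpenAI key
        # No OpenAI key? Use a keyword check instead (word list is illustrative):
        # IncludesWords("answer", words_list=["sorry", "cannot"], alias="Denials"),
    ],
)
```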
Congratulations! You’ve just run your first eval. Preview the results locally in pandas:
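For example:

```python
eval_dataset.as_dataframe()
```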
What other evals are there? Browse all available descriptors, including deterministic checks, semantic similarity, and LLM judges, in the descriptor list.
4. Create a Report
Create and run a Report. It will summarize the evaluation results.
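A sketch using the TextEvals preset, which summarizes all descriptors added to the dataset:

```python
report = Report([
    TextEvals(),
])

my_eval = report.run(eval_dataset, None)   # no reference dataset in this example
```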
Local preview. In a Python environment like a Jupyter notebook or Colab, run:
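Calling the result object displays it in the cell:

```python
my_eval
```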
This will render the Report directly in the notebook cell. You can also get a JSON or Python dictionary, or save as an external HTML file.
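Possible export options (method names assumed from the current API; check the docs for your version):

```python
my_eval.json()                      # JSON string
my_eval.dict()                      # Python dictionary
my_eval.save_html("report.html")    # standalone HTML file
```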
Local Reports are great for quick experiments. To compare runs, keep track of results over time, and collaborate with others, upload the results to Evidently Platform.
Upload the Report to Evidently Cloud together with scored data:
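For example, assuming the workspace and Project created in step 1.3:

```python
ws.add_run(project.id, my_eval, include_data=True)
```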
Explore. Go to Evidently Cloud, open your Project, and navigate to Reports. You will see all score summaries and can browse the data: for example, sort to find all answers labeled as “Denials”.
5. Get a Dashboard
As you run more evals, it’s useful to track them over time. Go to “Dashboard” in the left menu, enter the “Edit” mode, and add a new “Columns” tab:
You’ll see a set of panels with descriptor values. Each will have a single data point for now. As you log more evaluation results, you can track trends and set up alerts.
Want to see more complex workflows? You can add pass/fail conditions and custom evals.
6. (Optional) Add tests
You can add conditions to your evaluations. For example, you may expect that:
- Sentiment is non-negative (greater than or equal to 0).
- Text length is at most 150 characters (less than or equal to 150).
- There are no denials.
If any condition is false, the output is considered a “fail”.
You can implement this logic by attaching test conditions directly to the descriptors.
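A sketch, reusing the toy dataset; the `eq("OK")` condition for the denial judge assumes the built-in template returns an “OK” label when there is no denial:

```python
from evidently.tests import gte, lte, eq

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment", tests=[gte(0)]),       # non-negative sentiment
        TextLength("answer", alias="Length", tests=[lte(150)]),       # at most 150 characters
        DeclineLLMEval("answer", alias="Denials", tests=[eq("OK")]),  # "OK" label means no denial
    ],
)
```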
7. (Optional) Add a custom LLM judge
You can implement custom criteria using built-in LLM judge templates.
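A minimal sketch of a custom binary judge for an invented “conciseness” criterion; the import paths and template parameters are assumptions based on the built-in binary classification template, so check the LLM judge docs for your version:

```python
from evidently.descriptors import LLMEval
from evidently.llm.templates import BinaryClassificationPromptTemplate

# Hypothetical judge: classify each answer as "concise" or "verbose".
conciseness = LLMEval(
    "answer",
    template=BinaryClassificationPromptTemplate(
        criteria="A CONCISE answer directly addresses the question without unnecessary detail.",
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
    ),
    provider="openai",
    model="gpt-4o-mini",
    alias="Conciseness",
)
```

Add the new descriptor to the descriptors list alongside the built-in checks.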
What’s next?
Read more on how to configure LLM judges for custom criteria or use other LLMs.
We also have lots of other examples! Explore tutorials.