Need help at any point? Ask on Discord.
1. Set up your environment
For a fully local flow, skip steps 1.1 and 1.3.
1.1. Set up Evidently Cloud
- Sign up for a free Evidently Cloud account.
- Create an Organization when you log in for the first time, and copy your organization ID.
- Get an API token. Click the Key icon in the left menu, then generate and save the token.
1.2. Installation and imports
Install the Evidently Python library:
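```bash
pip install evidently
```

Then import the components used in this quickstart. This is a sketch assuming a recent Evidently release; module paths can shift between versions, so check the docs if an import fails:

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength, DeclineLLMEval
from evidently.presets import TextEvals
from evidently.ui.workspace import CloudWorkspace
```

1.3. Create a Project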
Connect to Evidently Cloud using your API token:
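A minimal sketch, assuming the CloudWorkspace client; fill in the API token and organization ID from step 1.1:

```python
# Connect to Evidently Cloud
ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

# Create a Project in your Organization to store the evaluation results
project = ws.create_project("My quickstart project", org_id="YOUR_ORG_ID")
project.description = "Toy chatbot evals"
project.save()
```

2. Prepare the dataset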
Let’s create a toy demo chatbot dataset with “Questions” and “Answers”.
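For example, a few made-up rows (the questions and answers below are purely illustrative):

```python
data = [
    ["What is the return policy?", "You can return items within 30 days of purchase."],
    ["Can you write my essay for me?", "I'm sorry, I cannot help with that request."],
    ["How do I reset my password?", "Click 'Forgot password' on the login page and follow the instructions."],
]
eval_df = pd.DataFrame(data, columns=["question", "answer"])
```

Preparing your own data. You can provide data with any structure. Some common setups: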
- Inputs and outputs from your LLM
- Inputs, outputs, and reference outputs (for comparison)
- Inputs, context, and outputs (for RAG evaluation)
Collecting live data. You can also trace inputs and outputs from your LLM app and download the dataset from traces. See the Tracing Quickstart.
3. Run evaluations
We’ll evaluate the answers for:
- Sentiment: from -1 (negative) to 1 (positive)
- Text length: character count
- Denials: refusals to answer. This uses an LLM-as-a-judge with a built-in prompt.
Each evaluation is a descriptor. It adds a new score or label to each row in your dataset.
For LLM-as-a-judge, we’ll use OpenAI GPT-4o mini. Set your OpenAI key as an environment variable:
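For example (replace the placeholder with your own key):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
```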
If you don’t have an OpenAI key, you can use a keyword-based check IncludesWords instead.
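Here is a sketch of the evaluation step, assuming the Dataset and descriptor API of recent Evidently releases, with DeclineLLMEval as the built-in denial judge (check the descriptor list for exact names):

```python
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        DeclineLLMEval("answer", alias="Denials"),  # LLM-as-a-judge with a built-in prompt
    ],
)
```

Each descriptor adds a column with its score or label; in recent versions you can preview the scored data with `eval_dataset.as_dataframe()`.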
What other evals are there? Browse all available descriptors including deterministic checks, semantic similarity, and LLM judges in the descriptor list.
4. Create a Report
Create and run a Report. It will summarize the evaluation results.
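A sketch, assuming the TextEvals preset and the cloud upload call from recent releases:

```python
# Summarize all descriptor results in one Report
report = Report([TextEvals()])
my_eval = report.run(eval_dataset)

# Upload the results to your Evidently Cloud Project (skip for a fully local flow)
ws.add_run(project.id, my_eval, include_data=True)

# Or render the Report inline in a notebook:
# my_eval
```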
5. Get a Dashboard
As you run more evals, it’s useful to track them over time. Go to “Dashboard” in the left menu, enter “Edit” mode, and add a new “Columns” tab.
6. (Optional) Add tests
You can add conditions to your evaluations. For example, you may expect that:
- Sentiment is non-negative (greater than or equal to 0)
- Text length is at most 150 characters (less than or equal to 150)
- Denials: there are none
If any condition is false, consider the output a “fail”.
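One way to express these conditions, assuming the test helpers in recent Evidently releases (the "OK" label for non-denials is an assumption; check the descriptor's actual output values):

```python
from evidently.tests import eq, gte, lte

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment", tests=[gte(0)]),
        TextLength("answer", alias="Length", tests=[lte(150)]),
        # Assumes the denial judge labels non-refusals as "OK"
        DeclineLLMEval("answer", alias="Denials", tests=[eq("OK")]),
    ],
)
```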
To learn more, see the docs: How to add test conditions.


7. (Optional) Add a custom LLM judge
You can implement custom criteria using built-in LLM judge templates. To learn more, see the docs: How to create a custom LLM evaluator.
Let’s classify user questions as “appropriate” or “inappropriate” for an educational tool.
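A sketch using the binary classification template, assuming the LLMEval descriptor and template import path of recent releases (the criteria text and alias are illustrative):

```python
from evidently.descriptors import LLMEval
from evidently.llm.templates import BinaryClassificationPromptTemplate

appropriateness = LLMEval(
    "question",
    template=BinaryClassificationPromptTemplate(
        criteria="""An APPROPRIATE question relates to learning and education.
An INAPPROPRIATE question is off-topic, harmful, or asks for personal data.""",
        target_category="APPROPRIATE",
        non_target_category="INAPPROPRIATE",
        uncertainty="unknown",
        include_reasoning=True,
    ),
    provider="openai",
    model="gpt-4o-mini",
    alias="Appropriateness",
)
```

Run it like any other descriptor by adding it to the `descriptors` list. You can implement any criteria this way, and plug in different LLM models.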
