Quickstart - LLM evaluations
LLM evaluation "Hello world."
This quickstart shows how to evaluate text data, such as inputs and outputs from your LLM system.
You will run evals locally in Python and send results to Evidently Cloud for analysis and monitoring.
Need help? Ask on Discord.
1. Set up Evidently Cloud
Set up your Evidently Cloud workspace:
Sign up for a free Evidently Cloud account.
Create an Organization when you log in for the first time, then copy its ID from the Organizations page.
Get your API token: click the Key icon in the left menu, then generate and save the token (Token page).
Now, switch to your Python environment.
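One way to keep the token and Organization ID out of your code is to read them from environment variables. Here is a minimal sketch; the variable names EVIDENTLY_API_TOKEN and EVIDENTLY_ORG_ID are hypothetical choices for this example, not names defined by Evidently.

```python
import os


def load_cloud_credentials(env=None):
    """Return (token, org_id) read from environment variables.

    The variable names below are hypothetical, chosen for this sketch only.
    """
    env = os.environ if env is None else env
    token = env.get("EVIDENTLY_API_TOKEN", "").strip()
    org_id = env.get("EVIDENTLY_ORG_ID", "").strip()

    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (
            ("EVIDENTLY_API_TOKEN", token),
            ("EVIDENTLY_ORG_ID", org_id),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return token, org_id
```

Call load_cloud_credentials() once at the top of your script and pass the two values wherever the token and Organization ID are needed.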
2. Installation
Install the Evidently Python library:
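Assuming you use pip and install from PyPI, the command would look like this. In a notebook cell, prefix it with an exclamation mark.

```shell
pip install evidently
```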
Import the components to run the evals:
Import the components to connect with Evidently Cloud:
3. Create a Project
Connect to Evidently Cloud using your API token:
Create a Project within your Organization:
4. Import the toy dataset
Prepare your data as a pandas dataframe with text and metadata columns. Here’s a toy chatbot dataset with "Questions" and "Answers".
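A dataset of this shape could be built as follows. The rows are made up for illustration, and the column names question and answer are an assumption of this sketch; use whatever names match your data.

```python
import pandas as pd

# Made-up rows for illustration only; replace them with your own data.
rows = [
    ("What is the chemical symbol for gold?", "The chemical symbol for gold is Au."),
    ("What is the capital of Japan?", "The capital of Japan is Tokyo."),
    ("Can you tell me a joke?", "Why don't programmers like nature? It has too many bugs!"),
    ("What's the weather like today?", "I'm sorry, I don't have access to live weather data."),
    ("Can you book a flight for me?", "I apologize, but I can't make bookings."),
]

# One row per exchange: the user question and the chatbot answer.
evals = pd.DataFrame(rows, columns=["question", "answer"])
print(evals.head())
```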
Collecting live data: use the open-source tracely library to collect the inputs and outputs from your LLM app. Check the Tracing Quickstart. You can then download the traced dataset for evaluation.
5. Run your first eval
You have two options:
Run evals that work locally.
Use LLM-as-a-judge (requires an OpenAI token).
Define your evals. You will evaluate all "Answers" for:
Sentiment: from -1 for negative to 1 for positive.
Text length: character count.
Presence of "sorry" or "apologize": True/False.
Each evaluation is a descriptor. You can choose from multiple built-in evaluations or create custom ones, including LLM-as-a-judge.
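To make the idea of a descriptor concrete, here is what two of the checks above compute, written in plain pandas rather than with Evidently's built-in descriptors. This is only an approximation of the built-in behavior: the answers are hypothetical, the word check is a simple case-insensitive whole-word match, and sentiment is left out because it needs a scoring model.

```python
import pandas as pd

# Hypothetical answers, used only to show what the checks return.
evals = pd.DataFrame(
    {
        "answer": [
            "The chemical symbol for gold is Au.",
            "I'm sorry, I can't help with that.",
            "We apologize for the delay.",
        ]
    }
)

# Text length: number of characters in each answer.
evals["Text length"] = evals["answer"].str.len()

# Presence of "sorry" or "apologize" as whole words, ignoring case.
evals["Denials"] = evals["answer"].str.contains(
    r"\b(?:sorry|apologize)\b", case=False, regex=True
)

print(evals)
```

Each descriptor adds one new column with a value per row, which is why the results can be sorted and filtered like any other column.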
6. Send results to Evidently Cloud
Upload the Report and include raw data for detailed analysis:
View the Report. Go to Evidently Cloud, open your Project, and navigate to "Reports" in the left menu.
You will see a summary of the scores and the dataset with the new descriptor columns. For example, you can sort the dataset to find all answers flagged as "Denials".
7. Get a dashboard
Go to the "Dashboard" tab and enter the "Edit" mode. Add a new tab, and select the "Descriptors" template.
You'll see a set of panels that show descriptor values. For now, each panel has a single data point. As you log ongoing evaluation results, you can track trends and set up alerts.
What's next?
Explore the full tutorial for advanced workflows: custom LLM judges, conditional test suites, monitoring, and more.