OSS Quickstart - LLM evals
Run your first LLM evaluation using Evidently open-source.
This quickstart shows how to evaluate text data, such as inputs and outputs from your LLM system.
It's best to run this example in Jupyter Notebook or Google Colab so that you can render HTML Reports directly in a notebook cell.
1. Installation
Install the Evidently library.
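For example, using pip:

```
pip install evidently
```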
Import the required modules:
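A minimal import set for this walkthrough. This assumes a recent Evidently release (0.7 or later); in older versions the import paths differ.

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
```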
2. Create a toy dataset
Prepare your data as a pandas dataframe, with any texts and metadata columns. Here’s a toy example with chatbot "Questions" and "Answers":
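For example, a small question-answer dataframe (the contents and the "question"/"answer" column names are illustrative; use your own data):

```python
data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["Can you help me with my homework?", "I'm sorry, but I can't assist with homework."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
]

eval_df = pd.DataFrame(data, columns=["question", "answer"])
```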
Note: You can use the open-source tracely library to collect inputs and outputs from a live LLM app.
3. Run your first eval
Run evaluations for the "Answer" column (see the example after the list):
Sentiment (from -1 for negative to 1 for positive)
Text length (number of symbols)
Presence of "sorry" or "apologize" (True/False)
Each evaluation is a descriptor. You can choose from many built-in evaluations or create custom ones.
View the Report in Python to see the distribution of scores:
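For example, using the TextEvals preset to summarize all descriptors (a sketch; the second argument to run() is an optional reference dataset):

```python
report = Report([TextEvals()])

my_eval = report.run(eval_dataset, None)
my_eval  # in a notebook cell, this renders the HTML Report inline
```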
You can also export the dataset with added descriptors for each row.
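For example, assuming the eval_dataset object created above:

```python
eval_dataset.as_dataframe()
```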
Or get a dictionary with results:
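A sketch, using the snapshot object returned by run() above:

```python
my_eval.dict()
# or: my_eval.json()
```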
4. Use LLM as a judge (Optional)
To run this step, you'll need an OpenAI API key.
Set the key as an environment variable rather than hard-coding it in your script. See the OpenAI docs for best practices.
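For example (replace the placeholder with your own key; in real projects, export the variable in your shell instead of setting it in code):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
```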
Run a Report with the new DeclineLLMEval descriptor. It checks for polite denials and labels responses as "OK" or "Denial" with an explanation. This evaluator uses LLM-as-a-judge (defaulting to gpt-4o-mini) and a template prompt.
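A sketch of adding this descriptor to the same dataset (the "Denials" alias is an arbitrary label):

```python
from evidently.descriptors import DeclineLLMEval

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Denials"),
])
```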
View the Report in Python:
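For example, by re-running the Report on the updated dataset:

```python
report = Report([TextEvals()])

my_eval = report.run(eval_dataset, None)
my_eval  # renders the HTML Report in a notebook cell
```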
View the dataset with scores and explanation:
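Again using the dataset object, which now includes the LLM-judged columns:

```python
eval_dataset.as_dataframe()
```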
Or get a dictionary with results:
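Using the same snapshot object as before:

```python
my_eval.dict()
```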
What's next?
Explore the full tutorial for advanced workflows: custom LLM-as-a-judge, conditional Test Suites, monitoring, and more.
You can also send evaluation results to Evidently Cloud to analyze and track them. See the Quickstart:
Need help? Ask in our Discord community.