Tutorials and guides
End-to-end code examples.
We have an applied course on LLM evaluations! Free video course with 10+ tutorials. Sign up.
Quickstarts
If you are new, start here.
LLM quickstart
Evaluate the quality of text outputs.
ML quickstart
Test tabular data quality and data drift.
Tracing quickstart
Collect inputs and outputs from your AI app.
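Conceptually, tracing means wrapping your app's functions so that every call's inputs, outputs, and latency get recorded for later evaluation. As a rough stdlib-only sketch of the idea (this is not Evidently's actual tracing API — see the Tracing quickstart for that; the `trace` decorator and `answer` function here are hypothetical):

```python
import functools
import json
import time

TRACES = []  # in-memory store; a real tracing setup would export these


def trace(fn):
    """Record each call's inputs, output, and latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACES.append({
            "function": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": round(time.time() - start, 4),
        })
        return result
    return wrapper


@trace
def answer(question: str) -> str:
    # stand-in for a real LLM call
    return f"Echo: {question}"


answer("What is tracing?")
print(json.dumps(TRACES[0], indent=2))
```

Each traced call becomes a structured record you can later score or inspect; dedicated tracing tools add export, sessions, and UI on top of this basic pattern.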
LLM Tutorials
End-to-end examples of specific workflows and use cases.
LLM judges
Create and evaluate an LLM judge. (Python)
Regression testing
Test LLM outputs against expected responses.
RAG evaluation
A walkthrough of different RAG evaluation metrics.
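As a rough illustration of the regression-testing workflow listed above, a toy check might compare new LLM outputs against expected responses using a similarity threshold. This is a conceptual stdlib sketch, not the library's API; the threshold value is arbitrary and would need tuning:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude text similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Expected (reference) answers from a previous, approved run
expected = {
    "What is the capital of France?": "The capital of France is Paris.",
}

# Outputs from the new version of the app under test
new_outputs = {
    "What is the capital of France?": "Paris is the capital of France.",
}

THRESHOLD = 0.6  # arbitrary cutoff for this sketch

for question, reference in expected.items():
    score = similarity(new_outputs[question], reference)
    status = "PASS" if score >= THRESHOLD else "FAIL"
    print(f"{status} ({score:.2f}): {question}")
```

Real regression testing typically replaces the string-similarity check with semantic matching or an LLM judge, but the pass/fail structure stays the same.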
LLM Evaluation Course - Video Tutorials
We offer an applied LLM evaluation course that walks through the core evaluation workflows. Each lesson includes a code example and a video tutorial walkthrough.
📹 See complete YouTube playlist
Tutorial | Description | Code example | Video |
---|---|---|---|
Intro to LLM Evals | Introduction to LLM evaluation: concepts, goals, and motivations behind evaluating LLM outputs. | – | |
LLM Evaluation Methods | Tutorial with an overview of LLM evaluation methods. | Open Notebook | |
LLM as a Judge | Tutorial on creating and tuning LLM judges aligned with human preferences. | Open Notebook | |
Classification Evaluation | Tutorial on evaluating LLMs and a simple predictive ML baseline on a multi-class classification task. | Open Notebook | |
Content Generation with LLMs | Tutorial on how to use LLMs to write tweets and evaluate how engaging they are. Introduces the concept of tracing. | Open Notebook | |
RAG evaluations | Tutorial on evaluating RAG systems. | Open Notebook | |
AI agent evaluations | Tutorial on how to build a simple Q&A agent and evaluate tool choice and answer correctness. | Open Notebook | |
Adversarial testing | Tutorial on how to run scenario-based risk testing on forbidden topics and brand risks. | Open Notebook | |
More examples
You can also find more examples in the Example Repository.