
Quickstarts

If you are new, start here.

LLM quickstart

Evaluate the quality of text outputs.

ML quickstart

Test tabular data quality and data drift.
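Data drift checks compare the distribution of current data against a reference sample. As a stdlib-only illustration of the idea (not Evidently's implementation), the Population Stability Index is one common drift score; the `psi` helper below is a hypothetical name for this sketch:

```python
import math
from collections import Counter

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.
    Bins are derived from the reference range; values outside it
    fall into the edge bins. A common rule of thumb: PSI > 0.2
    signals significant drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def frac(sample):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in sample)
        n = len(sample)
        # small epsilon avoids log(0) for empty bins
        return [max(counts.get(b, 0) / n, 1e-6) for b in range(bins)]

    r, c = frac(reference), frac(current)
    return sum((ci - ri) * math.log(ci / ri) for ri, ci in zip(r, c))

reference = [0.1 * i for i in range(100)]      # stable baseline
shifted = [0.1 * i + 5 for i in range(100)]    # shifted distribution
print(psi(reference, reference))  # ~0: identical samples, no drift
print(psi(reference, shifted))    # large value: drift detected
```

In practice a drift report runs checks like this per column and flags columns whose score crosses a threshold.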

Tracing quickstart

Collect inputs and outputs from your AI app.

LLM Tutorials

End-to-end examples of specific workflows and use cases.

LLM as a judge

How to create and evaluate an LLM judge against human labels.
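Evaluating a judge against human labels comes down to measuring agreement. A minimal stdlib-only sketch (the `agreement` helper is illustrative, not part of any library API) computes raw accuracy plus Cohen's kappa, which corrects for chance agreement:

```python
from collections import Counter

def agreement(human, judge):
    """Compare LLM-judge verdicts against human labels:
    returns raw accuracy and Cohen's kappa (chance-corrected)."""
    n = len(human)
    accuracy = sum(h == j for h, j in zip(human, judge)) / n
    # expected agreement by chance, from marginal label frequencies
    h_freq, j_freq = Counter(human), Counter(judge)
    expected = sum(h_freq[label] * j_freq.get(label, 0) for label in h_freq) / n**2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0
    return accuracy, kappa

human = ["good", "good", "bad", "good", "bad", "bad"]
judge = ["good", "bad", "bad", "good", "bad", "good"]
acc, kappa = agreement(human, judge)
print(f"accuracy={acc:.2f}, kappa={kappa:.2f}")  # accuracy=0.67, kappa=0.33
```

A judge with high accuracy but low kappa mostly mirrors the majority class, so both numbers are worth tracking.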

RAG evaluation

A walkthrough of different RAG evaluation metrics.
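Retrieval-side RAG metrics score where the relevant documents land in the ranked results. A stdlib-only sketch of two common ones, hit rate and Mean Reciprocal Rank (the function names here are illustrative):

```python
def hit_rate(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(
        any(doc in rel for doc in docs[:k])
        for docs, rel in zip(retrieved, relevant)
    )
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant doc."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved)

# toy data: ranked doc ids per query, and sets of relevant ids
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d2"}, {"d4"}, {"d0"}]
print(hit_rate(retrieved, relevant))  # 2/3: one query found nothing relevant
print(mrr(retrieved, relevant))       # (1/2 + 1 + 0) / 3 = 0.5
```

Generation-side metrics (correctness, faithfulness) typically need an LLM judge on top of these.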

LLM as a jury

Using multiple LLMs to evaluate the same output.
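The jury pattern aggregates verdicts from several judge models into one decision. A minimal sketch, assuming simple majority voting with ties escalated for human review (`jury_verdict` is an illustrative name, not a library function):

```python
from collections import Counter

def jury_verdict(votes):
    """Aggregate verdicts from several LLM judges by majority vote;
    ties are reported as 'undecided' so a human can review them."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "undecided"
    return counts[0][0]

# each call: verdicts on one output from e.g. three different judge models
print(jury_verdict(["pass", "pass", "fail"]))  # "pass"
print(jury_verdict(["pass", "fail"]))          # "undecided"
```

Weighted votes or per-judge reliability scores are common refinements once you have agreement data per judge.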

LLM evaluation methods

A walkthrough of different LLM evaluation methods. [CODE + VIDEO]

Descriptor cookbook

A walkthrough of different descriptors (deterministic, ML-based, etc.) in a single notebook.

LLM judge prompt optimization (1)

Optimize a multi-class classifier using target labels.

LLM judge prompt optimization (2)

Optimize a binary classifier using target labels and free-form feedback.

ML tutorials

End-to-end examples of specific workflows and use cases.

Metric cookbook

Various data/ML metrics: Regression, Classification, Data Quality, Data Drift.
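For the regression side, the core quality metrics are simple aggregates of prediction error. A stdlib-only sketch of MAE and RMSE (illustrative helper, not the cookbook's code):

```python
import math

def regression_metrics(y_true, y_pred):
    """Basic regression-quality metrics: MAE and RMSE.
    RMSE penalizes large errors more heavily than MAE."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

mae, rmse = regression_metrics([3.0, 5.0, 2.0], [2.5, 5.5, 2.0])
print(mae, rmse)
```

Classification adds accuracy, precision, and recall, and drift checks reuse distribution comparisons like the PSI shown earlier on this page.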

Integrations

End-to-end examples of integrating Evidently with other tools and platforms.

GitHub actions

Running Evidently evals as part of a CI/CD workflow. Native GitHub Action integration for regression testing.

Different LLM providers as judges

Examples of using different external evaluator LLMs as LLM judges: OpenAI, Gemini, Google Vertex, Mistral, Ollama.

Evidently + Grafana: LLM evals

Visualize Evidently LLM evaluation metrics with Grafana. (Postgres as a database).

Evidently + Grafana: Data drift

Visualize Evidently data drift evaluations on a Grafana dashboard. (Postgres as a database).

Deployment

Evidently Open-source UI tutorial

How to create a workspace and project, and run Reports.

LLM Evaluation Course - Video Tutorials

We have an applied LLM evaluation course that walks through the core evaluation workflows. Each tutorial consists of a code example and a video walkthrough. 📥 Sign up for the course 📹 See the complete YouTube playlist
Intro to LLM Evals

Introduction to LLM evaluation: concepts, goals, and motivations behind evaluating LLM outputs.
  • Video

LLM Evaluation Methods

Tutorial with an overview of evaluation methods.
  • Part 1. Anatomy of a single evaluation. Covers basic LLM evaluation API and setup.
  • Part 2. Reference-based evaluation: exact match, semantic similarity, BERTScore, and LLM judge.
  • Part 3. Reference-free evaluation: text statistics, regex, ML models, LLM judges, and session-level evaluators.
Open Notebook
  • Video 1
  • Video 2
  • Video 3

LLM as a Judge

Tutorial on creating and tuning LLM judges aligned with human preferences.
Open Notebook
  • Video

Classification Evaluation

Tutorial on evaluating LLMs and a simple predictive ML baseline on a multi-class classification task.
Open Notebook
  • Video

Content Generation with LLMs

Tutorial on how to use LLMs to write tweets and evaluate how engaging they are. Introduction to the concept of tracing.
Open Notebook
  • Video

RAG evaluations

  • Part 1. Theory on how to evaluate RAG systems: retrieval, generation quality, and synthetic data.
  • Part 2. Tutorial on building a toy RAG application and evaluating correctness and faithfulness.
Open Notebook
  • Video 1
  • Video 2

AI agent evaluations

Tutorial on how to build a simple Q&A agent and evaluate tool choice and answer correctness.
Open Notebook
  • Video

Adversarial testing

Tutorial on how to run scenario-based risk testing on forbidden topics and brand risks.
Open Notebook
  • Video

More examples

You can also find more examples in the Example Repository.