What is Evidently?

Evidently helps evaluate, test, and monitor data and ML-powered systems.

Predictive tasks: classification, regression, ranking, recommendations.
Generative tasks: chatbots, RAGs, Q&A, summarization.
Data monitoring: data quality and data drift for text, tabular data, embeddings.

Evidently is available both as an open-source Python library and as the Evidently Cloud platform.

Get started

Evidently Cloud

An AI evaluation and observability platform built on top of the Evidently Python library. Includes advanced features, collaboration, and support.

Evidently Open-Source

An open-source Python library with 20M+ downloads. Helps evaluate, test, and monitor data, ML, and LLM-powered systems.

You can explore more in-depth Examples and Tutorials.

How it works

Evidently helps evaluate and track the quality of ML-based systems, from experimentation to production.

Evidently is both a library of 100+ ready-made evaluations and a framework for implementing your own, from Python functions to LLM judges.

Evidently has a modular architecture, and you can start with ad hoc checks without complex installations. There are 3 interfaces: you can get a visual Report to see a summary of evaluation metrics, run conditional checks with a TestSuite to get a pass/fail outcome, or plot the evaluation results over time on a Monitoring Dashboard.

Reports

Reports compute different metrics on data and ML quality. You can use Reports for visual analysis and debugging, or as a computation layer for the monitoring dashboard.

You can be as hands-off or hands-on as you like: start with Presets, and customize metrics as you go.

Test Suites

Tests verify whether computed metrics satisfy defined conditions. Each Test returns a pass or fail result.

This interface helps automate your evaluations for regression testing, checks during CI/CD, or validation steps in data pipelines.
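
The pass/fail idea can be sketched in plain Python. This is a minimal illustration of "compute a metric, then check it against a condition", not Evidently's API: the names `CheckResult`, `share_of_missing`, and `run_check` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    """Outcome of one conditional check (hypothetical structure)."""
    name: str
    value: float
    passed: bool


def share_of_missing(values: Sequence[object]) -> float:
    """Return the fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def run_check(name: str, value: float, condition: Callable[[float], bool]) -> CheckResult:
    """Evaluate a condition against an already computed metric value."""
    return CheckResult(name=name, value=value, passed=condition(value))


# Ten rows, two of them missing.
column = [1.0, None, 3.5, 4.2, None, 6.1, 7.0, 8.8, 9.3, 10.0]

result = run_check(
    name="share_of_missing <= 0.1",
    value=share_of_missing(column),
    condition=lambda v: v <= 0.1,
)
print(result)

# In a CI/CD job or pipeline step, a failed check can stop the run, for example:
# if not result.passed: raise SystemExit(1)
```

Keeping the metric computation separate from the condition means the same metric value can feed a visual summary and a pass/fail gate, which mirrors how the text describes Reports and Tests sharing metrics.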

ML monitoring dashboard

The monitoring dashboard helps visualize ML system performance over time and detect issues. You can track key metrics and test outcomes.

You can use Evidently Cloud or self-host. Evidently Cloud offers extra features like user authentication and roles, built-in alerting, and a no-code interface.

What can you evaluate?

Evidently Reports, Test Suites, and the ML Monitoring dashboard rely on a shared set of metrics. Here are some examples of what you can evaluate.

Evaluation group	Examples
Tabular Data Quality	Missing values, duplicates, empty rows or columns, min-max ranges, new categorical values, correlation changes, etc.
Text Descriptors	Text length, out-of-vocabulary words, share of special symbols, regular expression matches.
Data Distribution Drift	Statistical tests and distance metrics to compare distributions of model predictions, numerical and categorical features, text data, or embeddings.
Classification Quality	Accuracy, precision, recall, ROC AUC, confusion matrix, class separation quality, classification bias.
Regression Quality	MAE, ME, RMSE, error distribution, error normality, error bias per group and feature.
Ranking and Recommendations	NDCG, MAP, MRR, Hit Rate, recommendation serendipity, novelty, diversity, popularity bias.
LLM Output Quality	Model-based scoring with external models and LLMs to detect toxicity, sentiment, evaluate retrieval relevance, etc.
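
To make the "Data Distribution Drift" row more concrete, here is a small standard-library sketch of one widely used distance metric, the Population Stability Index (PSI). It is an illustration only: the function name, the equal-width binning, and the `eps` floor are assumptions of this example, and the table above does not specify which tests or metrics Evidently applies in a given case.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two numeric samples using equal-width bins built on the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    # Each term is non-negative, so the score grows as the distributions diverge.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 10 for i in range(100)]  # values from 0.0 to 9.9
shifted = [x + 3.0 for x in reference]    # same shape, moved to the right

same_score = population_stability_index(reference, reference)
shift_score = population_stability_index(reference, shifted)
print(same_score, shift_score)
```

Identical samples score zero, and the score rises as the current data moves away from the reference. What threshold counts as drift is a judgment call that depends on the data and the use case.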

You can also implement custom checks as Python functions or define your prompts for LLM-as-a-judge.
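
As a rough sketch of what a custom check written as a Python function might look like, the example below computes two of the text descriptors listed above (text length and share of special symbols) for a batch of texts. The function names and the `describe` helper are hypothetical and are not part of Evidently's API. How such functions are registered with the library is not covered here.

```python
from typing import Callable, Dict, List, Sequence


def text_length(text: str) -> float:
    """Number of characters in the text."""
    return float(len(text))


def special_symbol_share(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return special / len(text)


def describe(
    texts: Sequence[str],
    descriptors: Dict[str, Callable[[str], float]],
) -> Dict[str, List[float]]:
    """Apply every descriptor function to every text, one score list per descriptor."""
    return {name: [fn(t) for t in texts] for name, fn in descriptors.items()}


responses = ["Hello, world!", "OK", ""]
scores = describe(
    responses,
    {"text_length": text_length, "special_symbol_share": special_symbol_share},
)
print(scores)
```

Each descriptor maps one text to one number, so the resulting columns can be summarized, compared between datasets, or checked against a condition like any other metric.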

Community and support

Evidently is in active development, and we are happy to receive and incorporate feedback. If you have questions or ideas, or want to hang out and chat about doing ML in production, join our Discord community!

To get updates on new features, integrations and code tutorials, sign up for the Evidently User Newsletter.

