What is Evidently?
Evidently helps evaluate, test, and monitor data and ML-powered systems.
Predictive tasks: classification, regression, ranking, recommendations.
Generative tasks: chatbots, RAGs, Q&A, summarization.
Data monitoring: data quality and data drift for text, tabular data, embeddings.
Evidently is available both as an open-source Python library and as the Evidently Cloud platform.
You can explore more in-depth Examples and Tutorials.
Evidently helps evaluate and track the quality of ML-based systems, from experimentation to production.
Evidently is both a library of 100+ ready-made evaluations and a framework for implementing your own, from Python functions to LLM judges.
Evidently has a modular architecture, and you can start with ad hoc checks without complex installations. There are 3 interfaces: you can get a visual Report to see a summary of evaluation metrics, run conditional checks with a TestSuite to get a pass/fail outcome, or plot the evaluation results over time on a Monitoring Dashboard.
Reports compute different metrics on data and ML quality. You can use Reports for visual analysis and debugging, or as a computation layer for the monitoring dashboard.
You can be as hands-off or hands-on as you like: start with Presets, and customize metrics as you go.
Tests verify whether computed metrics satisfy defined conditions. Each Test returns a pass or fail result.
This interface helps automate your evaluations for regression testing, checks during CI/CD, or validation steps in data pipelines.
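To make the pass/fail idea concrete, here is a minimal plain-Python sketch of a conditional check. It does not use Evidently's API: the names `CheckResult`, `share_of_missing` and `run_check` are hypothetical and exist only for illustration, and the 0.3 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CheckResult:
    """Outcome of one check (hypothetical structure, not an Evidently class)."""
    name: str
    value: float
    passed: bool


def share_of_missing(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def run_check(name: str, value: float, condition: Callable[[float], bool]) -> CheckResult:
    """Evaluate a computed metric against a condition and record pass/fail."""
    return CheckResult(name=name, value=value, passed=bool(condition(value)))


# Example: one value out of four is missing, and the assumed limit is 30%.
column = [1.0, None, 3.0, 4.0]
result = run_check(
    "share_of_missing < 0.3",
    share_of_missing(column),
    lambda v: v < 0.3,
)
print(result)
```

In a CI/CD job or a pipeline validation step, a collection of such results would typically be reduced to a single outcome, for example by failing the step when any check has `passed` set to `False`.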
The monitoring dashboard helps visualize ML system performance over time and detect issues. You can track key metrics and test outcomes.
You can use Evidently Cloud or self-host. Evidently Cloud offers extra features like user authentication and roles, built-in alerting, and a no-code interface.
Evidently Reports, Test Suites and the ML Monitoring dashboard rely on a shared set of metrics. Here are some examples of what you can evaluate.
Tabular Data Quality
Missing values, duplicates, empty rows or columns, min-max ranges, new categorical values, correlation changes, etc.
Text Descriptors
Text length, out-of-vocabulary words, share of special symbols, regular expression matches.
Data Distribution Drift
Statistical tests and distance metrics to compare distributions of model predictions, numerical and categorical features, text data, or embeddings.
Classification Quality
Accuracy, precision, recall, ROC AUC, confusion matrix, class separation quality, classification bias.
Regression Quality
MAE, ME, RMSE, error distribution, error normality, error bias per group and feature.
Ranking and Recommendations
NDCG, MAP, MRR, Hit Rate, recommendation serendipity, novelty, diversity, popularity bias.
LLM Output Quality
Model-based scoring with external models and LLMs to detect toxicity, sentiment, evaluate retrieval relevance, etc.
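For readers who want to see what a few of the metrics in the list above measure, the sketch below computes them in plain Python. It is an illustration under stated assumptions, not Evidently's implementation: the function names are hypothetical, mean error is taken as prediction minus actual, and the two-sample Kolmogorov-Smirnov statistic stands in as one example of a statistical drift measure.

```python
import math


def regression_errors(y_true, y_pred):
    """ME, MAE and RMSE, with error defined as prediction minus actual (assumed sign)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def reciprocal_rank(ranked_items, relevant):
    """1 / position of the first relevant item, or 0.0 if none is present."""
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numerical samples."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(reference) | set(current))
    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in points)


reg = regression_errors([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
prec, rec = precision_recall([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
rr = reciprocal_rank(["a", "b", "c"], {"b"})
drift = ks_statistic([1, 2], [3, 4])
print(reg, prec, rec, rr, drift)
```

Averaging `reciprocal_rank` over many queries or users gives MRR. A KS statistic near 0 means the two samples have similar distributions, while a value near 1 means they barely overlap.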
You can also implement custom checks as Python functions or define your prompts for LLM-as-a-judge.
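As a rough idea of what a custom check written as a Python function might look like, the sketch below scores text responses with two simple descriptors. The function names, the email pattern and the sample texts are assumptions made for illustration. How such functions are registered with Evidently is not shown here.

```python
import re

# Deliberately loose email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def share_of_special_symbols(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    return special / len(text)


def contains_email(text: str) -> bool:
    """True if the text contains something that looks like an email address."""
    return EMAIL_PATTERN.search(text) is not None


# Hypothetical model responses to score row by row.
responses = [
    "Contact me at jane@example.com!",
    "Sure, happy to help.",
]

rows = [
    {
        "length": len(text),
        "special_share": share_of_special_symbols(text),
        "has_email": contains_email(text),
    }
    for text in responses
]

for row in rows:
    print(row)
```

Each function takes one text and returns one value, so the results can be summarized per dataset or compared against a condition in the same way as built-in metrics.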
Evidently is in active development, and we are happy to receive and incorporate feedback. If you have any questions, ideas or want to hang out and chat about doing ML in production, join our Discord community!
To get updates on new features, integrations and code tutorials, sign up for the Evidently User Newsletter.
Evidently Cloud
AI evaluation and observability platform built on top of the Evidently Python library. Includes advanced features, collaboration and support.
Evidently Open-Source
An open-source Python library with 20m+ downloads. Helps evaluate, test and monitor data, ML and LLM-powered systems.