All metrics
List of Metrics, Descriptors and Metric Presets available in Evidently.
We do our best to keep this page up to date. In case of discrepancies, check the "All metrics" notebook in examples. If you notice an error, please send us a pull request with an update!
Metric Presets
Defaults: Presets use the default parameters for each Metric. You can see them in the tables below.
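A minimal sketch of running a Preset with its defaults (assuming `reference_df` and `current_df` are pandas DataFrames prepared by you):

```python
from evidently.report import Report
from evidently.metric_preset import DataQualityPreset

# A Preset bundles several Metrics configured with their default parameters.
report = Report(metrics=[DataQualityPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.show()
```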
Data Quality
Metric | Parameters |
---|---|
DatasetSummaryMetric() Dataset-level. Calculates descriptive dataset statistics, including row and column counts, column types, missing values, and duplicates. | Required: n/a Optional: see docs for parameters. |
DatasetMissingValuesMetric() Dataset-level. Calculates the number and share of missing values in the dataset. Displays the number of missing values per column. | Required: n/a Optional: missing_values (list), replace (bool) |
DatasetCorrelationsMetric() Dataset-level. Calculates the correlations between all columns in the dataset. Uses: Pearson, Spearman, Kendall, Cramer_V. Visualizes the heatmap. | Required: n/a Optional: n/a |
ColumnSummaryMetric() Column-level. Calculates various descriptive statistics for numerical, categorical, text or DateTime columns. Plots the distribution histogram. If DateTime is provided, also plots the distribution over time. If Target is provided, also plots the relation with Target. | Required: column_name Optional: n/a |
ColumnMissingValuesMetric() Column-level. Calculates the number and share of missing values in the column. | Required: n/a Optional: missing_values (list), replace (bool) |
ColumnRegExpMetric() Column-level. Calculates the number and share of the values that do not match a defined regular expression. | Required: column_name, reg_exp Optional: top |
ColumnDistributionMetric() Column-level. Plots the distribution histogram and returns bin positions and values for the given column. | Required: column_name Optional: n/a |
ColumnValuePlot() Column-level. Plots the values in time. | Required: column_name Optional: n/a |
ColumnQuantileMetric() Column-level. Calculates the defined quantile value and plots the distribution for the given numerical column. | Required: column_name, quantile Optional: n/a |
ColumnCorrelationsMetric() Column-level. Calculates the correlations between the defined column and all the other columns in the dataset. | Required: column_name Optional: n/a |
ColumnValueListMetric() Column-level. Calculates the number of values in the list / out of the list / not found in a given column. The value list should be specified. | Required: column_name, values Optional: n/a |
ColumnValueRangeMetric() Column-level. Calculates the number and share of values in the specified range / out of range in a given column. Plots the distributions. | Required: column_name, left, right Optional: n/a |
ConflictPredictionMetric() Dataset-level. Calculates the number of instances where the model returns a different output for an identical input. Can be a signal of low-quality model or data errors. | Required: n/a Optional: n/a |
ConflictTargetMetric() Dataset-level. Calculates the number of instances where there is a different target value or label for an identical input. Can be a signal of a labeling or data error. | Required: n/a Optional: n/a |
Defaults for Missing Values. The metrics that calculate the number or share of missing values detect four types of missing values by default: Pandas nulls (None, NaN, etc.), "" (empty string), Numpy -inf value, Numpy inf value. You can also pass custom missing values as a parameter and specify whether to replace the default list.
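For example, a minimal sketch assuming the missing_values and replace parameters described above (with replace=False extending the default list rather than replacing it):

```python
from evidently.metrics import DatasetMissingValuesMetric

# Count the string "n/a" and -9999 as missing values, in addition to the defaults.
# Setting replace=True instead would use only the values listed here.
DatasetMissingValuesMetric(missing_values=["n/a", -9999], replace=False)
```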
Text Evals
Text Evals only apply to text columns. To compute a Descriptor for a single text column, use a TextEvals Preset. Read docs.
You can also explicitly specify the Evidently Metric (e.g., ColumnSummaryMetric) to visualize the descriptor, or pick a Test (e.g., TestColumnValueMin) to run validations.
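A minimal usage sketch, assuming a DataFrame df with a text column named "response" (the column name is illustrative):

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

df = pd.DataFrame({"response": ["Thanks, this helps!", "This is not what I asked for."]})

# TextEvals computes the listed descriptors for the given text column
# and summarizes the results in one report.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        TextLength(),
        Sentiment(),
    ])
])
report.run(reference_data=None, current_data=df)
```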
Descriptors: Text Patterns
Check for regular expression matches. Each pattern descriptor returns a True/False result for every input text.
Descriptor | Parameters |
---|---|
RegExp() Matches the text against a defined regular expression. | Required: reg_exp Optional: display_name |
BeginsWith() Checks if the text begins with a defined prefix. | Required: prefix Optional: case_sensitive, display_name |
EndsWith() Checks if the text ends with a defined suffix. | Required: suffix Optional: case_sensitive, display_name |
Contains() Checks if the text contains defined items. | Required: items Optional: case_sensitive, mode, display_name |
DoesNotContain() Checks if the text does not contain defined items. | Required: items Optional: case_sensitive, mode, display_name |
IncludesWords() Checks if the text includes defined words. | Required: words_list Optional: lemmatize, mode, display_name |
ExcludesWords() Checks if the text excludes all defined words. | Required: words_list Optional: lemmatize, mode, display_name |
ItemMatch() Checks if the text contains items listed in a separate column of the same dataset. | Required: with_column Optional: case_sensitive, mode, display_name |
ItemNoMatch() Checks if the text does not contain items listed in a separate column. | Required: with_column Optional: case_sensitive, mode, display_name |
WordMatch() Checks if the text includes words listed in a separate column. | Required: with_column Optional: lemmatize, mode, display_name |
WordNoMatch() Checks if the text does not include words listed in a separate column. | Required: with_column Optional: lemmatize, mode, display_name |
ExactMatch() Checks if the text exactly matches the contents of a separate column. | Required: with_column Optional: display_name |
IsValidJSON() Checks if the text is valid JSON. | Required: n/a Optional: display_name |
JSONSchemaMatch() Checks if the text matches a defined JSON schema. | Required: expected_schema Optional: validate_types, exact_match, display_name |
JSONMatch() Compares JSON in the text against JSON in a separate column. | Required: with_column Optional: display_name |
ContainsLink() Checks if the text contains at least one valid URL. | Required: n/a Optional: display_name |
IsValidPython() Checks if the text is valid Python code. | Required: n/a Optional: display_name |
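A sketch of applying a pattern descriptor through the TextEvals Preset (the column name and word list are illustrative):

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import IncludesWords

# Flags each text in "response" that mentions any of the listed words.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        IncludesWords(words_list=["refund", "cancel"]),
    ])
])
```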
Descriptors: Text stats
Computes descriptive text statistics.
Descriptor | Parameters |
---|---|
TextLength() Measures the length of the text in symbols. | Required: n/a Optional: display_name |
OOV() Calculates the percentage of out-of-vocabulary words. | Required: n/a Optional: ignore_words, display_name |
NonLetterCharacterPercentage() Calculates the percentage of non-letter characters. | Required: n/a Optional: display_name |
SentenceCount() Counts the number of sentences in the text. | Required: n/a Optional: display_name |
WordCount() Counts the number of words in the text. | Required: n/a Optional: display_name |
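A sketch of validating a text statistic with a Test, as mentioned in the section intro (assuming a text column "response" in a DataFrame df):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin
from evidently.descriptors import TextLength

# Fails if any text in "response" is shorter than 1 symbol.
suite = TestSuite(tests=[
    TestColumnValueMin(column_name=TextLength().on("response"), gte=1),
])
suite.run(reference_data=None, current_data=df)
```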
Descriptors: LLM-based
Use external LLMs with an evaluation prompt to score text data (the LLM-as-a-judge method).
Descriptor | Parameters |
---|---|
LLMEval() Scores the text using the user-defined criteria, automatically formatted in a templated evaluation prompt. | See docs for examples and parameters. |
DeclineLLMEval() Detects texts containing a refusal or a rejection to do something. Returns a label (DECLINE or OK) or score. | See docs for parameters. |
PIILLMEval() Detects texts containing PII (Personally Identifiable Information). Returns a label (PII or OK) or score. | See docs for parameters. |
NegativityLLMEval() Detects negative texts (containing critical or pessimistic tone). Returns a label (NEGATIVE or POSITIVE) or score. | See docs for parameters. |
BiasLLMEval() Detects biased texts (containing prejudice for or against a person or group). Returns a label (BIAS or OK) or score. | See docs for parameters. |
ToxicityLLMEval() Detects toxic texts (containing harmful, offensive, or derogatory language). Returns a label (TOXICITY or OK) or score. | See docs for parameters. |
ContextQualityLLMEval() Evaluates if CONTEXT is VALID (has sufficient information to answer the QUESTION) or INVALID (has missing or contradictory information). Returns a label (VALID or INVALID) or score. | Run the descriptor over the context column and pass the question column as a parameter. See docs for parameters. |
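A sketch of running one of the built-in LLM-based descriptors (assumes an OpenAI key is set in the OPENAI_API_KEY environment variable; the column name is illustrative):

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import NegativityLLMEval

# Labels each text in "response" as NEGATIVE or POSITIVE using an LLM judge.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        NegativityLLMEval(),
    ])
])
```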
Descriptors: Model-based
Use pre-trained machine learning models for evaluation.
Descriptor | Parameters |
---|---|
SemanticSimilarity() Calculates pairwise semantic similarity between the texts in the column and a defined second column. | Required: with_column Optional: display_name |
Sentiment() Analyzes the sentiment of the text. Returns a score from -1 (very negative) to 1 (very positive). | Required: n/a Optional: display_name |
HuggingFaceModel() Scores the text using the user-selected HuggingFace model. | See docs with some example models (classification by topic, emotion, etc.) |
HuggingFaceToxicityModel() Scores the text for toxicity using a pre-trained HuggingFace toxicity model. | Optional: see docs for parameters. |
BERTScore() Calculates the similarity between the texts in the column and a defined second column using token embeddings (BERTScore). | Required: with_column Optional: see docs for parameters. |
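A sketch of visualizing a model-based descriptor with an explicitly chosen Metric, as described in the section intro:

```python
from evidently.report import Report
from evidently.metrics import ColumnSummaryMetric
from evidently.descriptors import Sentiment

# Summarizes the sentiment scores computed over the "response" column.
report = Report(metrics=[
    ColumnSummaryMetric(column_name=Sentiment().on("response")),
])
```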
Data Drift
Defaults for Data Drift. By default, all data drift metrics use the Evidently drift detection logic that selects a drift detection method based on feature type and volume. You always need a reference dataset.
To modify the logic or select a different test, you should set data drift parameters or embeddings drift parameters. You can choose from 20+ drift detection methods and optionally pass feature importances.
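A sketch of overriding the drift detection method, as described above (the column name, methods, and threshold are illustrative):

```python
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DataDriftTable

report = Report(metrics=[
    # Use Wasserstein distance with a custom threshold for a single column.
    ColumnDriftMetric(column_name="age", stattest="wasserstein", stattest_threshold=0.2),
    # Use PSI for all columns in the drift table.
    DataDriftTable(stattest="psi"),
])
report.run(reference_data=reference_df, current_data=current_df)
```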
Metric | Parameters |
---|---|
DatasetDriftMetric() Dataset-level. Calculates the number and share of drifted columns. Returns dataset drift if the share of drifted columns exceeds the defined threshold. | Required: n/a Optional: columns, drift_share, data drift parameters. How to set data drift parameters. |
DataDriftTable() Dataset-level. Calculates data drift for all or selected columns. Plots a table with the drift results and distributions for each column. | Required: n/a Optional: columns, data drift parameters. How to set data drift parameters, embeddings drift parameters. |
ColumnDriftMetric() Column-level. Calculates data drift for a defined column and plots the distributions. | Required: column_name Optional: data drift parameters. How to set data drift parameters. |
EmbeddingsDriftMetric() Calculates drift for a defined set of embeddings. | Required: embeddings set name (as defined in column mapping) Optional: drift_method. How to set embeddings drift parameters. |
Classification
The metrics work both for probabilistic and non-probabilistic classification. All metrics are dataset-level. All metrics require column mapping of target and prediction.
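A sketch of the required column mapping (the column names are illustrative):

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ClassificationQualityMetric

# Map the target and prediction columns of your DataFrame.
mapping = ColumnMapping(target="label", prediction="predicted_label")

report = Report(metrics=[ClassificationQualityMetric()])
report.run(reference_data=reference_df, current_data=current_df, column_mapping=mapping)
```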
Metric | Parameters |
---|---|
ClassificationDummyMetric() Calculates the quality of the dummy model built on the same data. This can serve as a baseline. | Required: n/a Optional: n/a |
ClassificationQualityMetric() Calculates various classification performance metrics, including accuracy, precision, recall, and F1. For probabilistic classification, also includes ROC AUC and LogLoss. | Required: n/a Optional: probas_threshold, k |
ClassificationClassBalance() Calculates the number of objects for each label. Plots the histogram. | Required: n/a Optional: n/a |
ClassificationConfusionMatrix() Calculates the TPR, TNR, FPR, FNR, and plots the confusion matrix. | Required: n/a Optional: probas_threshold, k |
ClassificationQualityByClass() Calculates the classification quality metrics for each class. Plots the matrix. | Required: n/a Optional: probas_threshold, k |
ClassificationClassSeparationPlot() Visualization of the predicted probabilities by class. Applicable for probabilistic classification only. | Required: n/a Optional: n/a |
ClassificationProbDistribution() Visualization of the probability distribution by class. Applicable for probabilistic classification only. | Required: n/a Optional: n/a |
ClassificationRocCurve() Plots ROC Curve. Applicable for probabilistic classification only. | Required: n/a Optional: n/a |
ClassificationPRCurve() Plots Precision-Recall Curve. Applicable for probabilistic classification only. | Required: n/a Optional: n/a |
ClassificationPRTable() Calculates the Precision-Recall table that shows model quality at a different decision threshold. | Required: n/a Optional: n/a |
ClassificationQualityByFeatureTable() Plots the relationship between feature values and model quality. | Required: n/a Optional: columns |
Regression
All metrics are dataset-level. All metrics require column mapping of target and prediction.
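The same mapping pattern applies here; a brief sketch with a numeric target (names illustrative):

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import RegressionQualityMetric

mapping = ColumnMapping(target="price", prediction="predicted_price")

report = Report(metrics=[RegressionQualityMetric()])
report.run(reference_data=reference_df, current_data=current_df, column_mapping=mapping)
```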
Metric | Parameters |
---|---|
RegressionDummyMetric() Calculates the quality of the dummy model built on the same data. This can serve as a baseline. | Required: n/a Optional: n/a |
RegressionQualityMetric() Calculates various regression performance metrics, including ME, MAE, MAPE, and RMSE. | Required: n/a Optional: n/a |
RegressionPredictedVsActualScatter() Visualizes predicted vs actual values in a scatter plot. | Required: n/a Optional: n/a |
RegressionPredictedVsActualPlot() Visualizes predicted vs. actual values in a line plot. | Required: n/a Optional: n/a |
RegressionErrorPlot() Visualizes the model error (predicted - actual) in a line plot. | Required: n/a Optional: n/a |
RegressionAbsPercentageErrorPlot() Visualizes the absolute percentage error in a line plot. | Required: n/a Optional: n/a |
RegressionErrorDistribution() Visualizes the distribution of the model error in a histogram. | Required: n/a Optional: n/a |
RegressionErrorNormality() Visualizes the quantile-quantile plot (Q-Q plot) to estimate value normality. | Required: n/a Optional: n/a |
RegressionTopErrorMetric() Calculates the regression performance metrics for different groups: top-X% of predictions with overestimation, top-X% with underestimation, and the rest (the majority). Visualizes the group division on a scatter plot with predicted vs. actual values. | Required: n/a Optional: top_error |
RegressionErrorBiasTable() Plots the relationship between feature values and model quality per group (for top-X% error groups, as above). | Required: n/a Optional: columns, top_error |
Ranking and Recommendations
All metrics are dataset-level. Check individual metric descriptions here. All metrics require recommendations column mapping.
Optional shared parameters for multiple metrics:
- no_feedback_users: bool = False. Specifies whether to include the users who did not select any of the items when computing the quality metric. Default: False.
- min_rel_score: Optional[int] = None. Specifies the minimum relevance score to consider relevant when calculating the quality metrics for non-binary targets (e.g., if a target is a rating or a custom score).
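A sketch of passing these shared parameters (k and the score threshold are illustrative):

```python
from evidently.report import Report
from evidently.metrics import PrecisionTopKMetric, RecallTopKMetric

report = Report(metrics=[
    # Treat ratings of 4 and above as relevant; include users without any hits.
    PrecisionTopKMetric(k=5, min_rel_score=4, no_feedback_users=True),
    RecallTopKMetric(k=5),
])
```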
Metric | Parameters |
---|---|
RecallTopKMetric() Calculates the recall at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
PrecisionTopKMetric() Calculates the precision at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
FBetaTopKMetric() Calculates the F-measure at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
MAPKMetric() Calculates the Mean Average Precision (MAP) at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
MARKMetric() Calculates the Mean Average Recall (MAR) at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
NDCGKMetric() Calculates the Normalized Discounted Cumulative Gain (NDCG) at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
MRRKMetric() Calculates the Mean Reciprocal Rank (MRR) at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
HitRateKMetric() Calculates the hit rate at the top K. | Required: k Optional: no_feedback_users, min_rel_score |
DiversityMetric() Calculates intra-list diversity at the top K. | Required: k, item_features Optional: see docs for parameters. |
NoveltyMetric() Calculates novelty at the top K. Requires a training dataset. | Required: k Optional: see docs for parameters. |
SerendipityMetric() Calculates serendipity at the top K. Requires a training dataset. | Required: k, item_features Optional: min_rel_score |
PersonalizationMetric() Measures the average uniqueness of each user's top-K recommendations. | Required: k Optional: see docs for parameters. |
PopularityBias() Evaluates the popularity bias in recommendations by computing ARP (average recommendation popularity), Gini index, and coverage. Requires a training dataset. | Required: k Optional: normalize_arp |
ItemBiasMetric() Visualizes the distribution of recommendations by a chosen dimension (column), compared to its distribution in the training set. Requires a training dataset. | Required: k, column_name Optional: see docs for parameters. |
UserBiasMetric() Visualizes the distribution of the chosen category (e.g., a user characteristic), compared to its distribution in the training dataset. Requires a training dataset. | Required: column_name Optional: see docs for parameters. |
ScoreDistribution() Computes the predicted score entropy. Visualizes the distribution of the scores at the top K. | Required: k Optional: see docs for parameters. |
RecCasesTable() Shows the list of recommendations for specific user IDs (or 5 random if not specified). | Required: n/a Optional: user_ids, display_features |