All tests
List of all tests available in Evidently.
How to read the tables:
  • Test: the name of an individual test that you can include in a Test Suite. If a test has an optional parameter, we include an example.
  • Description: plain text explanation of how the test works. We also specify whether the test applies to the whole dataset or individual columns. (Note that you can still apply column-level tests to all the columns in the dataset.)
  • Default: plain text explanation of the default parameters. Many tests have two types of defaults. The first applies when you pass a reference dataset and Evidently can derive expectations from this baseline. The second applies if you do not provide the reference. You can always override the defaults by specifying a custom condition.
We organize the tests into logical groups. Note that the groups do not match the presets with the same name, e.g., there are more Data Quality tests below than in the DataQuality preset.
We do our best to keep this page up to date. In case of discrepancies, consult the code on GitHub (API reference coming soon!) or the current version of the "All tests" example notebook in the [Examples](https://docs.evidentlyai.com/examples) section. If you notice an error, please send us a pull request to update the documentation!

Data integrity

Note: the tests that evaluate the number or share of nulls detect four types of nulls by default: Pandas nulls (None, NaN, etc.), "" (empty string), the Numpy -inf value, and the Numpy inf value. You can also pass a custom list of nulls as a parameter and specify whether it should replace the default list. Example:

```python
TestNumberOfNulls(null_values=["", 0, "n/a", -9999, None], replace=True)
```
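For instance, here is a minimal sketch of using this parameter inside a Test Suite. It assumes the `TestSuite` API and the test names as listed on this page (names can differ across Evidently versions); the toy data is hypothetical.

```python
import pandas as pd

from evidently.test_suite import TestSuite
from evidently.tests import TestNumberOfNulls

# Hypothetical toy data: 0 and "n/a" should also count as nulls here.
current_data = pd.DataFrame({"feature": [1, 0, "n/a", 4, None]})

suite = TestSuite(tests=[
    # replace=True swaps the default null list for the custom one.
    TestNumberOfNulls(null_values=["", 0, "n/a", -9999, None], replace=True),
])
suite.run(reference_data=None, current_data=current_data)
suite  # in a notebook, this renders the visual test results
```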
| Test | Description | Default |
|------|-------------|---------|
| `TestNumberOfRows()` | Dataset-level. Tests if the number of rows is within the expected range. | Expects +/-10% or >30. With reference: the test fails if the number of rows differs by over 10% from the reference. No reference: the test fails if the number of rows is <= 30. |
| `TestNumberOfColumns()` | Dataset-level. Tests if the number of columns is within the expected range. | Expects the same or non-zero. With reference: the test fails if the number of columns differs from the reference. No reference: the test fails if the number of columns is 0. |
| `TestNumberOfNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the number of nulls and missing values in the whole dataset is within the expected range. | Expects up to +10% or 0. With reference: the test fails if the share of nulls and missing values is over 10% higher than in the reference. No reference: the test fails if the dataset contains nulls. |
| `TestShareOfNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the share of nulls and missing values in the dataset is within the expected range. | Expects up to +10% or 0. With reference: the test fails if the share of nulls and missing values is over 10% higher than in the reference. No reference: the test fails if the dataset contains nulls. |
| `TestNumberOfColumnsWithNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the number of columns that contain nulls and missing values is within the expected range. | Expects <= or 0. With reference: the test fails if the number of columns with nulls and missing values is higher than in the reference. No reference: the test fails if the dataset contains columns with nulls. |
| `TestShareOfColumnsWithNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the share of columns that contain nulls and missing values is within the expected range. | Expects <= or 0. With reference: the test fails if the share of columns with nulls and missing values is higher than in the reference. No reference: the test fails if the dataset contains columns with nulls. |
| `TestNumberOfRowsWithNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the number of rows that contain nulls and missing values is within the expected range. | Expects up to +10% or 0. With reference: the test fails if the share of rows with nulls and missing values is over 10% higher than in the reference. No reference: the test fails if the dataset contains rows with nulls. |
| `TestShareOfRowsWithNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the share of rows that contain nulls and missing values is within the expected range. | Expects up to +10% or 0. With reference: the test fails if the share of rows with nulls and missing values is over 10% higher than in the reference. No reference: the test fails if the dataset contains rows with nulls. |
| `TestNumberOfDifferentNulls()` <br>`null_values: list` <br>`replace: bool = True` | Dataset-level. Tests if the number of differently encoded nulls in the dataset is within the expected range. Detects four types of nulls by default and/or nulls from a user-defined list. | Expects <= or none. With reference: the test fails if the current dataset has more types of nulls than the reference. No reference: the test fails if the current dataset contains nulls. |
| `TestNumberOfConstantColumns()` | Dataset-level. Tests if the number of columns with all constant values is within the expected range. | Expects <= or none. With reference: the test fails if the number of constant columns is higher than in the reference. No reference: the test fails if there is at least one constant column. |
| `TestNumberOfEmptyRows()` | Dataset-level. Tests if the number of empty rows is within the expected range. | Expects +/-10% or none. With reference: the test fails if the share of empty rows is over 10% higher or lower than in the reference. No reference: the test fails if there is at least one empty row. |
| `TestNumberOfEmptyColumns()` | Dataset-level. Tests if the number of empty columns is within the expected range. | Expects <= or none. With reference: the test fails if the number of empty columns is higher than in the reference. No reference: the test fails if there is at least one empty column. |
| `TestNumberOfDuplicatedRows()` | Dataset-level. Tests if the number of duplicate rows is within the expected range. | Expects +/-10% or none. With reference: the test fails if the share of duplicate rows is over 10% higher or lower than in the reference. No reference: the test fails if there is at least one duplicate row. |
| `TestNumberOfDuplicatedColumns()` | Dataset-level. Tests if the number of duplicate columns is within the expected range. | Expects <= or none. With reference: the test fails if the number of duplicate columns is higher than in the reference. No reference: the test fails if there is at least one duplicate column. |
| `TestColumnsType()` <br>`columns_type: dict` | Dataset-level. Tests the types of all columns against the reference. | Expects types to match. With reference: the test fails if at least one column type does not match. No reference: N/A. |
| `TestColumnNumberOfNulls(column_name='name')` <br>`null_values: list` <br>`replace: bool = True` | Column-level. Tests the number of nulls and missing values in a given column against the reference. | Expects up to +10% or none. With reference: the test fails if the share of nulls and missing values in the column is over 10% higher than in the reference. No reference: the test fails if the column contains nulls. |
| `TestColumnShareOfNulls(column_name='name')` <br>`null_values: list` <br>`replace: bool = True` | Column-level. Tests the share of nulls and missing values in a given column against the reference. | Expects up to +10% or none. With reference: the test fails if the share of nulls and missing values in the column is over 10% higher than in the reference. No reference: the test fails if the column contains nulls. |
| `TestColumnNumberOfDifferentNulls(column_name='name')` <br>`null_values: list` <br>`replace: bool = True` | Column-level. Tests if the number of differently encoded nulls in the column is within the expected range. Detects four types of nulls by default and/or nulls from a user-defined list. | Expects <= or none. With reference: the test fails if the current column has more types of nulls than the reference. No reference: the test fails if the column contains nulls. |
| `TestColumnAllConstantValues(column_name='name')` | Column-level. Tests if all the values in a given column are constant. | Expects non-constant values. The test fails if all values in the column are constant. |
| `TestColumnAllUniqueValues(column_name='name')` | Column-level. Tests if all the values in a given column are unique. | Expects all unique values (e.g., IDs). The test fails if at least one value in the column is not unique. |
| `TestColumnValueRegExp(column_name='name', reg_exp='^[0-9]')` | Column-level. Tests if the values in the column match a defined regular expression. You need to specify the regular expression to run this test. | Expects +/-10% or all to match. With reference: the test fails if the share of values that match the regular expression is over 10% higher or lower than in the reference. No reference: the test fails if at least one value does not match the regular expression. |
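As noted above, you can apply a column-level test to every column you care about by creating one test instance per column. A minimal sketch, with hypothetical data and column names:

```python
import pandas as pd

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnAllUniqueValues, TestColumnShareOfNulls

reference_data = pd.DataFrame({"user_id": [1, 2, 3], "age": [23, 35, 41]})
current_data = pd.DataFrame({"user_id": [4, 5, 6], "age": [29, None, 52]})

# One column-level test instance per column of interest.
suite = TestSuite(tests=[
    *[TestColumnShareOfNulls(column_name=col) for col in ["user_id", "age"]],
    TestColumnAllUniqueValues(column_name="user_id"),
])
suite.run(reference_data=reference_data, current_data=current_data)
```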

Data quality

If you provide the reference dataset, Evidently will automatically derive all relevant statistics (e.g., minimum value, maximum value, value range, value list, etc.) to shape expectations. If you do not provide the reference, you can pass these conditions as parameters, as in the sketch below.
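This sketch assumes the standard Evidently condition parameters (`gt`, `gte`, `lt`, `lte`, `eq`, `is_in`, and so on); the column names and thresholds are hypothetical.

```python
import pandas as pd

from evidently.test_suite import TestSuite
from evidently.tests import TestFeatureValueMin, TestMostCommonValueShare

current_data = pd.DataFrame({"age": [25, 31, 47, 52], "city": ["NY", "NY", "LA", "SF"]})

suite = TestSuite(tests=[
    TestFeatureValueMin(column_name="age", gte=18),        # the minimum age must be >= 18
    TestMostCommonValueShare(column_name="city", lt=0.8),  # no single value should dominate
])
suite.run(reference_data=None, current_data=current_data)
```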
| Test | Description | Default |
|------|-------------|---------|
| `TestTargetPredictionCorrelation()` <br>`method: str = 'pearson'` <br>Available: `pearson`, `spearman`, `kendall`, `cramer_v` | Dataset-level. Tests if the strength of correlation between the target and prediction is within the expected range. | Expects +/-0.25 in correlation strength, or > 0. With reference: the test fails if there is a 0.25+ change in the correlation strength between the target and prediction. No reference: the test fails if the correlation between the target and prediction is <= 0. |
| `TestHighlyCorrelatedFeatures()` <br>`method: str = 'pearson'` <br>Available: `pearson`, `spearman`, `kendall`, `cramer_v` | Dataset-level. Tests if any of the columns are highly correlated. Example use: to detect and drop highly correlated features. | Expects +/-10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the most correlated feature pair. No reference: the test fails if at least one pair of features is correlated >= 0.9. |
| `TestTargetFeaturesCorrelations()` <br>`method: str = 'pearson'` <br>Available: `pearson`, `spearman`, `kendall`, `cramer_v` | Dataset-level. Tests if any of the features is highly correlated with the target. Example use: to detect target leak. | Expects +/-10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the feature most correlated with the target. No reference: the test fails if at least one feature is correlated with the target >= 0.9. |
| `TestPredictionFeaturesCorrelations()` <br>`method: str = 'pearson'` <br>Available: `pearson`, `spearman`, `kendall`, `cramer_v` | Dataset-level. Tests if any of the features is highly correlated with the prediction. Example use: to detect when predictions rely on a single feature. | Expects +/-10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the feature most correlated with the prediction. No reference: the test fails if at least one feature is correlated with the prediction >= 0.9. |
| `TestCorrelationChanges()` <br>`method: str = 'pearson'` <br>Available: `pearson`, `spearman`, `kendall`, `cramer_v` <br>`corr_diff: float = 0.25` | Dataset-level. Tests the number of correlation violations (significant changes in the correlation strength between pairs of features). | Expects none. With reference: the test fails if at least one correlation violation is detected. A significant correlation change is 0.25+. No reference: N/A. |
| `TestFeatureValueMin(column_name='num_column')` | Column-level. Tests if the minimum value of a given numerical column is within the expected range. | Expects not lower. With reference: the test fails if the minimum value is lower than in the reference. No reference: N/A. |
| `TestFeatureValueMax(column_name='num_column')` | Column-level. Tests if the maximum value of a given numerical column is within the expected range. | Expects not higher. With reference: the test fails if the maximum value is higher than in the reference. No reference: N/A. |
| `TestFeatureValueMean(column_name='num_column')` | Column-level. Tests if the mean value of a given numerical column is within the expected range. | Expects +/-10%. With reference: the test fails if the mean value differs by more than 10%. No reference: N/A. |
| `TestFeatureValueMedian(column_name='num_column')` | Column-level. Tests if the median value of a given numerical column is within the expected range. | Expects +/-10%. With reference: the test fails if the median value differs by more than 10%. No reference: N/A. |
| `TestFeatureValueStd(column_name='num_column')` | Column-level. Tests if the standard deviation of a given numerical column is within the expected range. | Expects +/-10%. With reference: the test fails if the standard deviation differs by more than 10%. No reference: N/A. |
| `TestNumberOfUniqueValues(column_name='name')` | Column-level. Tests if the number of unique values in a given column is within the expected range. | Expects +/-10%. With reference: the test fails if the number of unique values differs by more than 10%. No reference: N/A. |
| `TestUniqueValuesShare(column_name='name')` | Column-level. Tests if the share of unique values in a given column is within the expected range. | Expects +/-10%. With reference: the test fails if the share of unique values differs by more than 10%. No reference: N/A. |
| `TestMostCommonValueShare(column_name='name')` | Column-level. Tests if the share of the most common value in a given column is within the expected range. | Expects +/-10%. With reference: the test fails if the share of the most common value differs by more than 10% from the reference. No reference: the test fails if the share of the most common value is >= 80%. |
| `TestMeanInNSigmas(column_name='num_column')` <br>`n_sigmas: int = 2` | Column-level. Tests if the mean value in a given numerical column is within the expected range, defined in standard deviations. | Expects +/-2 std dev. With reference: the test fails if the current mean value is outside the +/-2 std dev interval around the reference mean value. No reference: N/A. |
| `TestValueRange(column_name='num_column')` <br>`left: float` <br>`right: float` | Column-level. Tests if a numerical column contains values outside the min-max range. | Expects all values to be in range. With reference: the test fails if the column contains values outside the min-max range seen in the reference. No reference: N/A. |
| `TestShareOfOutRangeValues(column_name='num_column')` <br>`left: float` <br>`right: float` | Column-level. Tests the share of values outside the min-max range. | Expects +/-10%. With reference: the test fails if over 10% of values are out of range. No reference: N/A. |
| `TestNumberOfOutRangeValues(column_name='num_column')` <br>`left: float` <br>`right: float` | Column-level. Tests the number of values outside the min-max range. | Expects +/-10%. With reference: the test fails if over 10% of values are out of range. No reference: N/A. |
| `TestValueList(column_name='cat_column')` <br>`values: List[str]` | Column-level. Tests if a categorical column contains values outside the list. | Expects all values to be in the list. With reference: the test fails if the column contains values outside the list (as seen in the reference). No reference: N/A. |
| `TestNumberOfOutListValues(column_name='cat_column')` <br>`values: List[str]` | Column-level. Tests the number of values in a given column that are out of the list. | Expects +/-10%. With reference: the test fails if over 10% of values are out of the list. No reference: N/A. |
| `TestShareOfOutListValues(column_name='cat_column')` <br>`values: List[str]` | Column-level. Tests the share of values in a given column that are out of the list. | Expects +/-10%. With reference: the test fails if over 10% of values are out of the list. No reference: N/A. |
| `TestValueQuantile(column_name='num_column', quantile=0.25)` | Column-level. Tests if a defined quantile value is within the expected range. You need to pass the quantile to run this test. | Expects +/-10%. With reference: the test fails if the quantile value is over 10% higher or lower than in the reference. No reference: N/A. |

Data drift

By default, all data drift tests use the Evidently drift detection logic, which selects a statistical test or metric based on the feature type and volume. To modify this logic or select a different statistical test, pass a DataDriftOptions object, as in the sketch below.
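A sketch of customizing the drift logic: the `DataDriftOptions` arguments, the `TestFeatureValueDrift` test, and the way options are passed to the `TestSuite` constructor are assumptions based on the 0.1.x-era API and may differ in your version.

```python
from evidently.options import DataDriftOptions
from evidently.test_suite import TestSuite
from evidently.tests import TestFeatureValueDrift

# Assumption: force PSI for all features instead of the default test-selection logic.
options = DataDriftOptions(all_features_stattest="psi")

suite = TestSuite(
    tests=[TestFeatureValueDrift(column_name="age")],  # hypothetical column name
    options=[options],  # assumption: TestSuite accepts a list of option objects
)
# suite.run(reference_data=reference_data, current_data=current_data)
```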

Regression

If there is no reference data, Evidently will compare the model performance to a dummy model that predicts the optimal constant (which varies by metric), as in the sketch below.
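A minimal sketch of running the regression tests without a reference, so that each metric is checked against its dummy-model baseline; the data and column names are hypothetical.

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.test_suite import TestSuite
from evidently.tests import TestValueMAE, TestValueMeanError, TestValueRMSE

current_data = pd.DataFrame({
    "target": [3.1, 4.2, 5.0, 6.3],
    "prediction": [2.9, 4.5, 4.8, 6.0],
})
mapping = ColumnMapping(target="target", prediction="prediction")

suite = TestSuite(tests=[TestValueMAE(), TestValueRMSE(), TestValueMeanError()])
suite.run(reference_data=None, current_data=current_data, column_mapping=mapping)
```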
| Test | Description | Default |
|------|-------------|---------|
| `TestValueMAE()` | Dataset-level. Computes the Mean Absolute Error (MAE) and compares it to the reference if available. | Expects +/-10% or better than a dummy model. With reference: the test fails if the MAE is over 10% higher or lower. No reference: the test fails if the MAE is higher than the MAE of the dummy model that predicts the optimal constant (the median of the target values). |
| `TestValueRMSE()` | Dataset-level. Computes the Root Mean Square Error (RMSE) and compares it to the reference if available. | Expects +/-10% or better than a dummy model. With reference: the test fails if the RMSE is over 10% higher or lower. No reference: the test fails if the RMSE is higher than the RMSE of the dummy model that predicts the optimal constant (the mean of the target values). |
| `TestValueMeanError()` | Dataset-level. Computes the Mean Error (ME) and tests if it is near zero. | Expects the Mean Error to be near zero. With/without reference: the test fails if the Mean Error is skewed and the condition is violated. Condition: `eq = approx(absolute=0.1*error_std)`, where `error_std = (curr_true - curr_preds).std()`. |
| `TestValueMAPE()` | Dataset-level. Computes the Mean Absolute Percentage Error (MAPE) and compares it to the reference if available. | Expects +/-10% or better than a dummy model. With reference: the test fails if the MAPE is over 10% higher or lower. No reference: the test fails if the MAPE is higher than the MAPE of the dummy model that predicts the optimal constant (the weighted median of the target values). |
| `TestValueAbsMaxError()` | Dataset-level. Computes the absolute maximum error and compares it to the reference if available. | Expects +/-10% or better than a dummy model. With reference: the test fails if the absolute maximum error is over 10% higher or lower. No reference: the test fails if the absolute maximum error is higher than that of the dummy model that predicts the optimal constant (the median of the target values). |
| `TestValueR2Score()` | Dataset-level. Computes the R2 score (coefficient of determination) and compares it to the reference if available. | Expects +/-10% or > 0. With reference: the test fails if the R2 score is over 10% higher or lower. No reference: the test fails if the R2 score is <= 0. |

Classification

You can apply these tests to non-probabilistic classification, probabilistic classification, and ranking. Metrics are calculated slightly differently depending on the provided inputs: labels only, probabilities, a decision threshold, and/or K (to compute, e.g., precision@K).
If there is no reference data, Evidently will compare the model performance to a dummy model, using a set of heuristics to verify that the quality is better than random. See the sketch below.
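A sketch of setting a custom decision threshold; the data is hypothetical, and with no reference the scores are checked against the dummy-model baseline.

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.test_suite import TestSuite
from evidently.tests import TestAccuracyScore, TestPrecisionScore

# Hypothetical binary classification data with predicted probabilities.
current_data = pd.DataFrame({
    "target": [0, 1, 1, 0, 1],
    "prediction": [0.2, 0.9, 0.6, 0.4, 0.8],
})
mapping = ColumnMapping(target="target", prediction="prediction")

suite = TestSuite(tests=[
    TestAccuracyScore(),                               # uses the default threshold of 0.5
    TestPrecisionScore(classification_threshold=0.8),  # stricter custom threshold
])
suite.run(reference_data=None, current_data=current_data, column_mapping=mapping)
```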
| Test | Description | Default |
|------|-------------|---------|
| `TestAccuracyScore()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the Accuracy and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the Accuracy is over 20% higher or lower. No reference: the test fails if the Accuracy is lower than the Accuracy of the dummy model. The default decision threshold is 0.5. |
| `TestPrecisionScore()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the Precision and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the Precision is over 20% higher or lower. No reference: the test fails if the Precision is lower than the Precision of the dummy model. The default decision threshold is 0.5. |
| `TestRecallScore()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the Recall and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the Recall is over 20% higher or lower. No reference: the test fails if the Recall is lower than the Recall of the dummy model. The default decision threshold is 0.5. |
| `TestF1Score()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the F1 score and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the F1 is over 20% higher or lower. No reference: the test fails if the F1 is lower than the F1 of the dummy model. The default decision threshold is 0.5. |
| `TestPrecisionByClass(label='classN')` | Dataset-level. Computes the Precision for the specified class and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the Precision is over 20% higher or lower. No reference: the test fails if the Precision is lower than the Precision of the dummy model. The default decision threshold is 0.5. |
| `TestRecallByClass(label='classN')` | Dataset-level. Computes the Recall for the specified class and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the Recall is over 20% higher or lower. No reference: the test fails if the Recall is lower than the Recall of the dummy model. The default decision threshold is 0.5. |
| `TestF1ByClass(label='classN')` | Dataset-level. Computes the F1 for the specified class and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the F1 is over 20% higher or lower. No reference: the test fails if the F1 is lower than the F1 of the dummy model. The default decision threshold is 0.5. |
| `TestTPR()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the True Positive Rate and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the TPR is over 20% higher or lower. No reference: the test fails if the TPR is lower than the TPR of the dummy model. The default decision threshold is 0.5. |
| `TestTNR()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the True Negative Rate and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the TNR is over 20% higher or lower. No reference: the test fails if the TNR is lower than the TNR of the dummy model. The default decision threshold is 0.5. |
| `TestFPR()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the False Positive Rate and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the FPR is over 20% higher or lower. No reference: the test fails if the FPR is higher than the FPR of the dummy model. The default decision threshold is 0.5. |
| `TestFNR()` <br>`classification_threshold: float` <br>`k: Union[float, int]` | Dataset-level. Computes the False Negative Rate and compares it to the reference if available. | Expects +/-20% or better than a dummy model. With reference: the test fails if the FNR is over 20% higher or lower. No reference: the test fails if the FNR is higher than the FNR of the dummy model. The default decision threshold is 0.5. |

Probabilistic classification

Additional tests apply to probabilistic classification. To run them, map the prediction to a column that holds probabilities, as in the sketch below.
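A minimal sketch with hypothetical binary classification data:

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.test_suite import TestSuite
from evidently.tests import TestLogLoss, TestRocAuc

current_data = pd.DataFrame({
    "target": [0, 1, 1, 0],
    "predicted_proba": [0.1, 0.85, 0.7, 0.3],
})
# The prediction column holds the probability of the positive class.
mapping = ColumnMapping(target="target", prediction="predicted_proba")

suite = TestSuite(tests=[TestRocAuc(), TestLogLoss()])
suite.run(reference_data=None, current_data=current_data, column_mapping=mapping)
```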
| Test | Description | Default |
|------|-------------|---------|
| `TestRocAuc()` | Dataset-level. Computes the ROC AUC and compares it to the reference if available. | Expects +/-20%, or > 0.5. With reference: the test fails if the ROC AUC is over 20% higher or lower than in the reference. No reference: the test fails if the ROC AUC is <= 0.5. |
| `TestLogLoss()` | Dataset-level. Computes the LogLoss and compares it to the reference if available. | Expects +/-20%, or better than a dummy model. With reference: the test fails if the LogLoss is over 20% higher or lower than in the reference. No reference: the test fails if the LogLoss is higher than the LogLoss of the dummy model (0.5 for a constant model). |