All tests

List of all tests and test presets available in Evidently.

How to use this page

This is a reference page. You can return here:

  • To discover available tests and choose which to include in a custom test suite.

  • To understand which parameters you can change for a specific test or preset.

  • To verify which tests are included in a test preset.

You can use the menu on the right to navigate the sections. We organize individual tests into groups, e.g., Data Quality, Data Integrity, or Regression. Note that these groups do not match the presets with similar names. For example, the Data Quality group contains more tests than the DataQualityTestPreset includes.

How to read the tables

  • Name: the name of the test or test preset.

  • Description: plain text explanation of the test, or the content of the preset. For tests, we specify whether it applies to the whole dataset or individual columns.

  • Parameters: available configurations.

    • Required parameters are necessary for calculations, e.g. a column name for a column-level test.

    • Optional parameters modify how the underlying metric is calculated, e.g. which statistical test or correlation method is used.

    • Test condition parameters help set the conditions (e.g., equal, not equal, greater than) that define the expectations for the test output. If the condition is violated, the test returns a fail. The standard condition parameters apply to most of the tests and are optional; see the example after this list.

  • Default test conditions: these apply if you do not set a custom condition.

    • With reference: the test conditions that apply when you pass a reference dataset and Evidently can derive expectations from it.

    • No reference: the test conditions that apply if you do not provide the reference. They are based on heuristics.
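
For instance, here is a minimal sketch of setting custom conditions with the standard parameters, assuming the TestSuite API from pre-0.7 Evidently versions (the toy data and column names are illustrative):

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.tests import TestNumberOfRows, TestColumnValueMin

# Toy data for illustration.
current = pd.DataFrame({"age": [25, 31, 47], "salary": [50000, 62000, 58000]})

suite = TestSuite(tests=[
    TestNumberOfRows(gte=3),                       # fail if fewer than 3 rows
    TestColumnValueMin(column_name="age", gte=0),  # fail if the minimum age is negative
])
suite.run(reference_data=None, current_data=current)
suite.show()  # in a notebook; or suite.save_html("tests.html")
```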

Test visualizations. Each test also includes a default render. If you want to see the visualization, navigate to the example notebooks.

We do our best to keep this page up to date. In case of discrepancies, consult the API reference or the "All tests" notebook in the Examples section. If you notice an error, please send us a pull request to update the documentation!

Test Presets

Default conditions for each Test in the Preset match the Test's defaults. You can see them in the tables below. The listed Preset parameters apply to the relevant individual Tests inside the Preset.
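
To illustrate, here is a minimal sketch of running a preset with its default conditions, assuming the TestSuite API (the toy data is illustrative):

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset

reference = pd.DataFrame({"age": [25, 31, 47, 52], "salary": [50000, 62000, 58000, 71000]})
current = reference.sample(frac=1.0, replace=True, random_state=42)

# Each test inside the preset derives its default conditions from the reference.
suite = TestSuite(tests=[DataStabilityTestPreset()])
suite.run(reference_data=reference, current_data=current)
suite.save_html("data_stability.html")
```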

Preset name and description | Parameters

NoTargetPerformanceTestPreset

  • TestShareOfDriftedColumns()

  • TestColumnDrift(column_name=prediction)

  • TestColumnShareOfMissingValues(column_name=column_name) for all columns, or among columns if provided

  • TestShareOfOutRangeValues(column_name=column_name) for all numerical_columns or among columns if provided

  • TestShareOfOutListValues(column_name=column_name) for all categorical_columns or among columns if provided

  • TestMeanInNSigmas(column_name=column_name, n=2) for all numerical_columns or among columns if provided

Optional:

  • columns

  • stattest

  • cat_stattest

  • num_stattest

  • per_column_stattest

  • text_stattest

  • stattest_threshold

  • cat_stattest_threshold

  • num_stattest_threshold

  • per_column_stattest_threshold

  • text_stattest_threshold

  • embeddings

  • embeddings_drift_method

  • drift_share

See how to set data drift parameters and embeddings drift parameters.
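
For example, a sketch of configuring the preset parameters (the column names, stattest choice, and thresholds are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import NoTargetPerformanceTestPreset

# Restrict per-column tests to selected columns and override the drift
# detection options for the underlying drift tests.
suite = TestSuite(tests=[
    NoTargetPerformanceTestPreset(
        columns=["age", "salary"],
        num_stattest="wasserstein",
        stattest_threshold=0.1,
        drift_share=0.5,
    )
])
```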

DataStabilityTestPreset

  • TestNumberOfRows()

  • TestNumberOfColumns()

  • TestColumnsType()

  • TestColumnShareOfMissingValues()

  • TestShareOfOutRangeValues(column_name=column_name) for all numerical_columns or among columns if provided

  • TestShareOfOutListValues(column_name=column_name) for all categorical_columns or among columns if provided

  • TestMeanInNSigmas(column_name=column_name, n=2) for all numerical_columns or among columns if provided

Optional:

  • columns

DataQualityTestPreset

  • TestColumnShareOfMissingValues(column_name=column_name) for all columns, or among columns if provided

  • TestMostCommonValueShare(column_name=column_name) for all columns, or among columns if provided

  • TestNumberOfConstantColumns()

  • TestNumberOfDuplicatedColumns()

  • TestNumberOfDuplicatedRows()

  • TestHighlyCorrelatedColumns()

Optional:

  • columns

DataDriftTestPreset

  • TestShareOfDriftedColumns()

  • TestColumnDrift(column_name=column_name) for all columns, or among columns if provided

Optional:

  • columns

  • stattest

  • cat_stattest

  • num_stattest

  • per_column_stattest

  • text_stattest

  • stattest_threshold

  • cat_stattest_threshold

  • num_stattest_threshold

  • per_column_stattest_threshold

  • text_stattest_threshold

  • embeddings

  • embeddings_drift_method

  • drift_share

See how to set data drift parameters and embeddings drift parameters.
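
For example, a sketch of overriding the drift method per column (the column names are illustrative; "ks" and "psi" are among the documented stattest options):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

suite = TestSuite(tests=[
    DataDriftTestPreset(per_column_stattest={"age": "ks", "education": "psi"}),
])
```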

RegressionTestPreset

  • TestValueMeanError()

  • TestValueMAE()

  • TestValueRMSE()

  • TestValueMAPE()

N/A

MulticlassClassificationTestPreset

  • TestAccuracyScore()

  • TestF1Score()

  • TestPrecisionByClass()

  • TestRecallByClass()

  • TestColumnDrift(column_name=target)

  • TestNumberOfRows()

If probabilistic classification, also:

  • TestLogLoss()

  • TestRocAuc()

Optional:

  • stattest

  • stattest_threshold

See how to set data drift parameters.

BinaryClassificationTopKTestPreset

  • TestAccuracyScore(k=k)

  • TestPrecisionScore(k=k)

  • TestRecallScore(k=k)

  • TestF1Score(k=k)

  • TestColumnDrift(column_name=target)

  • TestRocAuc()

  • TestLogLoss()

Required:

  • k

Optional:

  • stattest

  • stattest_threshold

  • probas_threshold

See how to set data drift parameters.
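
For example, a sketch of passing the required k parameter (the value is illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import BinaryClassificationTopKTestPreset

# Quality metrics are computed for the top-10 predictions ranked by
# predicted probability.
suite = TestSuite(tests=[BinaryClassificationTopKTestPreset(k=10)])
```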

BinaryClassificationTestPreset

  • TestColumnDrift(column_name=target)

  • TestPrecisionScore()

  • TestRecallScore()

  • TestF1Score()

  • TestAccuracyScore()

If probabilistic classification, also:

  • TestRocAuc()

Optional:

  • stattest

  • stattest_threshold

  • probas_threshold

See how to set data drift parameters.

RecsysTestPreset

  • TestPrecisionTopK()

  • TestRecallTopK()

  • TestMAPK()

  • TestNDCGK()

  • TestHitRateK()

Required:

  • k

Optional:

  • min_rel_score: Optional[int]

  • no_feedback_users: bool
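
For example, a sketch of configuring the preset, assuming RecsysTestPreset is importable from evidently.test_preset as in recent versions (the values are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import RecsysTestPreset

# min_rel_score=4 treats ratings of 4 and above as relevant;
# no_feedback_users=True also counts users without any interactions.
suite = TestSuite(tests=[RecsysTestPreset(k=5, min_rel_score=4, no_feedback_users=True)])
```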

Data Integrity

Defaults for Data Integrity. If there is no reference data or defined conditions, data integrity will be checked against a set of heuristics. If you pass the reference data, Evidently will automatically derive all relevant statistics (e.g., number of columns, rows, share of missing values) and apply default test conditions. You can also pass custom test conditions.

Defaults for Missing Values. The metrics that calculate the number or share of missing values detect four types of missing values by default: Pandas nulls (None, NaN, etc.), "" (empty string), Numpy -inf value, Numpy inf value. You can also pass a list of custom missing values as a parameter and specify whether it should replace the default list. Example:

```python
TestNumberOfMissingValues(missing_values=["", 0, "n/a", -9999, None], replace=True)
```
Test name | Description | Parameters | Default test conditions

TestNumberOfRows()

Dataset-level. Tests the number of rows against the reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or >30. With reference: the test fails if the number of rows differs by over 10% from the reference. No reference: the test fails if the number of rows is <= 30.

TestNumberOfColumns()

Dataset-level. Tests the number of columns against the reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects the same or non-zero. With reference: the test fails if the number of columns differs from the reference. No reference: the test fails if the number of columns is 0.

TestNumberOfMissingValues()

Dataset-level. Tests the number of missing values in the dataset against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to +10% or 0. With reference: the test fails if the share of missing values is over 10% higher than in reference. No reference: the test fails if the dataset contains missing values.

TestShareOfMissingValues()

Dataset-level. Tests the share of missing values in the dataset against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to +10% or 0. With reference: the test fails if the share of missing values is over 10% higher than in reference. No reference: the test fails if the dataset contains missing values.

TestNumberOfColumnsWithMissingValues()

Dataset-level. Tests the number of columns that contain missing values in the dataset against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects <= or 0. With reference: the test fails if the number of columns with missing values is higher than in reference. No reference: the test fails if the dataset contains columns with missing values.

TestShareOfColumnsWithMissingValues()

Dataset-level. Tests the share of columns that contain missing values in the dataset against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects <= or 0. With reference: the test fails if the share of columns with missing values is higher than in reference. No reference: the test fails if the dataset contains columns with missing values.

TestNumberOfRowsWithMissingValues()

Dataset-level. Tests the number of rows that contain missing values against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to +10% or 0. With reference: the test fails if the share of rows with missing values is over 10% higher than in reference. No reference: the test fails if the dataset contains rows with missing values.

TestShareOfRowsWithMissingValues()

Dataset-level. Tests the share of rows that contain missing values against the reference or a defined condition.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to +10% or 0. With reference: the test fails if the share of rows with missing values is over 10% higher than in reference. No reference: the test fails if the dataset contains rows with missing values.

TestNumberOfDifferentMissingValues()

Dataset-level. Tests the number of differently encoded missing values in the dataset against the reference or a defined condition. Detects 4 types of missing values by default and/or values from a user list.

Required: N/A Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects <= or none. With reference: the test fails if the current dataset has more types of missing values. No reference: the test fails if the current dataset contains missing values.

TestNumberOfConstantColumns()

Dataset-level. Tests the number of columns with all constant values against reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects <= or none. With reference: the test fails if the number of constant columns is higher than in the reference. No reference: the test fails if there is at least one constant column.

TestNumberOfEmptyRows()

Dataset-level. Tests the number of empty rows against reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/- 10% or none. With reference: the test fails if the share of empty rows is over 10% higher or lower than in the reference. No reference: the test fails if there is at least one empty row.

TestNumberOfEmptyColumns()

Dataset-level. Tests the number of empty columns against reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects <= or none. With reference: the test fails if the number of empty columns is higher than in the reference. No reference: the test fails if there is at least one empty column.

TestNumberOfDuplicatedRows()

Dataset-level. Tests the number of duplicate rows against reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/- 10% or none. With reference: the test fails if the share of duplicate rows is over 10% higher or lower than in the reference. No reference: the test fails if there is at least one duplicate row.

TestNumberOfDuplicatedColumns()

Dataset-level. Tests the number of duplicate columns against reference or a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects <= or none. With reference: the test fails if the number of duplicate columns is higher than in the reference. No reference: the test fails if there is at least one duplicate column.

TestColumnsType()

Dataset-level. Tests the types of all columns against the reference.

Required: N/A Optional: columns_type: dict Test conditions: N/A

Expects types to match. With reference: the test fails if at least one column type does not match. No reference: N/A

TestColumnNumberOfMissingValues(column_name='name')

Column-level. Tests the number of missing values in a given column against the reference or a defined condition.

Required:

  • column_name

Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to 10% or none. With reference: the test fails if the share of missing values in a column is over 10% higher than in reference. No reference: the test fails if the column contains missing values.

TestColumnShareOfMissingValues(column_name='name')

Column-level. Tests the share of missing values in a given column against the reference or a defined condition.

Required:

  • column_name

Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects up to 10% or none. With reference: the test fails if the share of missing values in a column is over 10% higher than in reference. No reference: the test fails if the column contains missing values.

TestColumnNumberOfDifferentMissingValues(column_name='name')

Column-level. Tests the number of differently encoded missing values in the column against reference or a defined condition. Detects 4 types of missing values by default and/or values from a user list.

Required:

  • column_name

Optional:

  • missing_values = [], replace = True/False (default = default list)

Test conditions:

  • standard parameters

Expects <= or none. With reference: the test fails if the current column has more types of missing values. No reference: the test fails if the column contains missing values.

TestColumnAllConstantValues(column_name='name')

Column-level. Tests if all the values in a given column are constant.

Required:

  • column_name

Optional: N/A Test conditions: N/A

Expects non-constant. The test fails if all values in a given column are constant.

TestColumnAllUniqueValues(column_name='name')

Column-level. Tests if all the values in a given column are unique.

Required:

  • column_name

Optional: N/A Test conditions: N/A

Expects all unique (e.g., IDs). The test fails if at least one value in a given column is not unique.

TestColumnRegExp(column_name='name', reg_exp='^[0-9]')

Column-level. Tests the number of values in a column that do not match a defined regular expression, against reference or a defined condition.

Required:

  • column_name

  • reg_exp

Optional: N/A Test conditions:

  • standard parameters

With reference: the test fails if the share of values that match a regular expression is over 10% higher or lower than in the reference. No reference: the test fails if at least one of the values does not match a regular expression.
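
For example, a sketch with a hypothetical column, where the pattern checks for a 5-digit ZIP code:

```python
from evidently.tests import TestColumnRegExp

# Fails (per the defaults above) if values stop matching the pattern.
test = TestColumnRegExp(column_name="zip_code", reg_exp=r"^[0-9]{5}$")
```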

TestCategoryShare(column_name='education', category='Some-college', lt=0.5)

Column-level. Tests if the share of objects belonging to a defined category (or having a defined numerical value) is within the threshold.

Required:

  • column_name

  • category

Optional: N/A Test conditions:

  • standard parameters

Expects the category to be present. The test fails if the category is not present.

TestCategoryCount(column_name='education', category='Some-college', lt=0.5)

Column-level. Tests if the number of objects belonging to a defined category (or having a defined numerical value) is within the threshold.

Required:

  • column_name

  • category

Optional: N/A Test conditions:

  • standard parameters

Expects the category to be present. The test fails if the category is not present.

Data Quality

Defaults for data quality. If there is no reference data or defined conditions, data quality will be checked against a set of heuristics. If you pass the reference data, Evidently will automatically derive all relevant statistics (e.g., min value, max value, value range, value list, etc.) and apply default test conditions. You can also pass custom test conditions.
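
For example, a sketch of passing custom conditions that override the defaults (the column names and bounds are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean, TestMostCommonValueShare

suite = TestSuite(tests=[
    TestColumnValueMean(column_name="salary", gte=40000, lte=90000),  # mean must stay in range
    TestMostCommonValueShare(column_name="education", lt=0.8),        # no single value dominates
])
```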

Test name | Description | Parameters | Default test conditions

TestConflictTarget()

Dataset-level. Tests if there are conflicts in the target (instances where a different label is assigned for an identical input).

N/A

Expects no conflicts in the target (with or without reference).

TestConflictPrediction()

Dataset-level. Tests if there are conflicts in the prediction (instances where a different prediction is made for an identical input).

N/A

Expects no conflicts in the prediction (with or without reference).

TestTargetPredictionCorrelation()

Dataset-level. Tests the strength of correlation between the target and prediction.

Required: N/A Optional:

  • method (default = pearson, available = pearson, spearman, kendall, cramer_v)

Test conditions:

  • standard parameters

Expects +/- 0.25 in correlation strength, or > 0. With reference: the test fails if there is a 0.25+ change in the correlation strength between target and prediction. No reference: the test fails if the correlation between target and prediction is <= 0.

TestHighlyCorrelatedColumns()

Dataset-level. Tests the strongest correlation between a pair of features, against reference or a defined condition.

Required: N/A Optional:

  • method (default = pearson, available = pearson, spearman, kendall, cramer_v)

Test conditions:

  • standard parameters

Expects +/- 10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the most correlated feature pair. No reference: the test fails if there is at least one pair of features with a correlation >= 0.9.

TestTargetFeaturesCorrelations()

Dataset-level. Tests if any of the features is highly correlated with the target. Example use: to detect target leak.

Required: N/A Optional:

  • method (default = pearson, available = pearson, spearman, kendall, cramer_v)

Test conditions:

  • standard parameters

Expects +/- 10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the feature most correlated with the target. No reference: the test fails if at least one feature has a correlation with the target >= 0.9.

TestPredictionFeaturesCorrelations()

Dataset-level. Tests if any of the features is highly correlated with the prediction. Example use: to detect when predictions rely on a single feature.

Required: N/A Optional:

  • method (default = pearson, available = pearson, spearman, kendall, cramer_v)

Test conditions:

  • standard parameters

Expects +/- 10% in max correlation strength, or < 0.9. With reference: the test fails if there is a 10%+ change in the correlation strength for the feature most correlated with the prediction. No reference: the test fails if at least one feature has a correlation with the prediction >= 0.9.

TestCorrelationChanges()

Dataset-level. Tests the number of correlation violations (significant change in the correlation strength between any two columns).

Required: N/A Optional:

  • method (default = pearson, available = pearson, spearman, kendall, cramer_v)

  • corr_diff (default = 0.25)

  • column_name (checks for correlation changes only between a chosen column and the other columns in the dataset)

Test conditions:

  • standard parameters

Expects none. With reference: the test fails if at least 1 correlation violation is detected. No reference: N/A

TestColumnValueMin(column_name='num-column')

Column-level. Tests the minimum value of a given numerical column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects not lower. With reference: the test fails if the minimum value is lower than in the reference. No reference: N/A

TestColumnValueMax(column_name='num-column')

Column-level. Tests the maximum value of a given numerical column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects not higher. With reference: the test fails if the maximum value is higher than in the reference. No reference: N/A

TestColumnValueMean(column_name='num-column')

Column-level. Tests the mean value of a given numerical column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the mean value is different by more than 10%. No reference: N/A

TestColumnValueMedian(column_name='num-column')

Column-level. Tests the median value of a given numerical column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the median value is different by more than 10%. No reference: N/A

TestColumnValueStd(column_name='num-column')

Column-level. Tests the standard deviation of a given numerical column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the standard deviation is different by more than 10%. No reference: N/A

TestNumberOfUniqueValues(column_name='name')

Column-level. Tests the number of unique values in a given column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the number of unique values is different by more than 10%. No reference: N/A

TestUniqueValuesShare(column_name='name')

Column-level. Tests the share of unique values in a given column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the share of unique values is different by more than 10%. No reference: N/A

TestMostCommonValueShare(column_name='name')

Column-level. Tests the share of the most common value in a given column against reference or a defined condition.

Required:

  • column_name

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the share of the most common value is different by more than 10% from the reference. No reference: the test fails if the share of the most common value is >= 80%.

TestMeanInNSigmas(column_name='num-column')

Column-level. Tests if the mean value in a given numerical column is within the expected range, defined in standard deviations. This test requires a reference.

Required:

  • column_name

Optional:

  • n_sigmas

Expects +/- 2 std dev. With reference: the test fails if the current mean value is out of the +/- 2 std dev interval from the reference mean value. No reference: N/A

TestValueRange(column_name='num_column')

Column-level. Tests if a numerical column contains values out of the min-max range.

Required:

  • column_name

Optional:

  • left

  • right

Test conditions: N/A

Expects all values to be in range. With reference: the test fails if the column contains values out of the min-max range as seen in the reference. No reference: N/A
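
For example, a sketch of setting the range explicitly with left and right instead of deriving it from the reference (the values are illustrative):

```python
from evidently.tests import TestShareOfOutRangeValues, TestValueRange

range_test = TestValueRange(column_name="age", left=0, right=120)
# The share-based variant also accepts the standard condition parameters.
share_test = TestShareOfOutRangeValues(column_name="age", left=0, right=120, lte=0.05)
```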

TestShareOfOutRangeValues(column_name='num_column')

Column-level. Tests the share of values out of the min-max range against reference or a defined condition.

Required:

  • column_name

Optional:

  • left

  • right

Test conditions:

  • standard parameters

Expects all values to be in range. With reference: the test fails if at least 1 value is out of the min-max range (as seen in the reference). No reference: N/A

TestNumberOfOutRangeValues(column_name='num_column')

Column-level. Tests the number of values out of the min-max range against reference or a defined condition.

Required:

  • column_name

Optional:

  • left

  • right

Test conditions:

  • standard parameters

Expects all values to be in range. With reference: the test fails if at least 1 value is out of range (as seen in reference). No reference: N/A

TestValueList(column_name='cat_column')

Column-level. Tests if a categorical column contains values out of the list.

Required:

  • column_name

Optional:

  • values: List[str]

Test conditions: N/A

Expects all values to be in the list. With reference: the test fails if the column contains values out of the list (as seen in reference). No reference: N/A

TestNumberOfOutListValues(column_name='cat_column')

Column-level. Tests the number of values in a given column that are out of list, against reference or a defined condition.

Required:

  • column_name

Optional:

  • values: List[str]

Test conditions:

  • standard parameters

Expects all values to be in the list. With reference: the test fails if the column contains values out of the list (as seen in reference). No reference: N/A

TestShareOfOutListValues(column_name='cat_column')

Column-level. Tests the share of values in a given column that are out of list against reference or a defined condition.

Required:

  • column_name

Optional:

  • values: List[str]

Test conditions:

  • standard parameters

Expects all values to be in the list. With reference: the test fails if the column contains values out of the list (as seen in reference). No reference: N/A

TestColumnQuantile(column_name='num_column', quantile=0.25)

Column-level. Computes a quantile value and compares it to the reference or against a defined condition.

Required:

  • column_name

  • quantile

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10%. With reference: the test fails if the quantile value is over 10% higher or lower. No reference: N/A

Data Drift

Defaults for Data Drift. By default, all data drift tests use the Evidently drift detection logic that selects a different statistical test or metric based on feature type and volume. You always need a reference dataset.

To modify the logic or select a different test, you should set data drift parameters.
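
For example, a sketch of overriding the automated selection for a single column (the column name and threshold are illustrative; "wasserstein" is one of the documented drift methods for numerical columns):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift

suite = TestSuite(tests=[
    TestColumnDrift(column_name="age", stattest="wasserstein", stattest_threshold=0.1),
])
```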

Test name | Description | Parameters | Default test conditions

TestNumberOfDriftedColumns()

Dataset-level. Compares the distribution of each column in the current dataset to the reference and tests the number of drifting features against a defined condition.

Required: N/A Optional:

  • columns

  • stattest (default = automated selection)

  • cat_stattest

  • num_stattest

  • per_column_stattest

  • stattest_threshold (default = test default)

  • cat_stattest_threshold

  • num_stattest_threshold

  • per_column_stattest_threshold

Test conditions:

  • standard parameters

Expects <= 1/3 of features to drift. With reference: if > 1/3 of features drifted, the test fails. No reference: N/A

TestShareOfDriftedColumns()

Dataset-level. Compares the distribution of each column in the current dataset to the reference and tests the share of drifting features against a defined condition.

Required: N/A Optional:

  • columns

  • stattest (default = automated selection)

  • cat_stattest

  • num_stattest

  • per_column_stattest

  • stattest_threshold (default = test default)

  • cat_stattest_threshold

  • num_stattest_threshold

  • per_column_stattest_threshold

Test conditions:

  • standard parameters

Expects <= 1/3 of features to drift. With reference: if > 1/3 of features drifted, the test fails. No reference: N/A

TestColumnDrift(column_name='name')

Column-level. Tests if there is a distribution shift in a given column compared to the reference.

Required:

  • column_name

Optional:

  • stattest (default = automated selection)

  • stattest_threshold (default = test default)

Expects no drift. With reference: the test fails if the distribution drift is detected in a given column. No reference: N/A

TestEmbeddingsDrift(embeddings_name='small_subset')

Column-level. Tests if there is drift in embeddings compared to reference.

Required:

  • embeddings_name

Optional:

  • drift_method (default = model)

Expects no drift. With reference: the test fails if the drift is detected in a given subset of columns. No reference: N/A

Regression

Defaults for Regression tests: if there is no reference data or defined conditions, Evidently will compare the model performance to a dummy model that predicts the optimal constant (varies by the metric). You can also pass the reference dataset and run the test with default conditions, or define custom test conditions.
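
For example, a sketch of setting explicit conditions instead of relying on the dummy-model heuristics (the thresholds are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestValueMAE, TestValueR2Score

suite = TestSuite(tests=[
    TestValueMAE(lte=10.0),     # fail if MAE exceeds 10
    TestValueR2Score(gte=0.7),  # fail if R2 drops below 0.7
])
```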

Test name | Description | Parameters | Default test conditions

TestValueMAE()

Dataset-level. Computes the Mean Absolute Error (MAE) and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or better than a dummy model. With reference: if MAE is higher or lower by over 10%, the test fails. No reference: the test fails if the MAE value is higher than the MAE of the dummy model that predicts the optimal constant (median of the target value).

TestValueRMSE()

Dataset-level. Computes the Root Mean Square Error (RMSE) and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or better than a dummy model. With reference: if RMSE is higher or lower by over 10%, the test fails. No reference: the test fails if the RMSE value is higher than the RMSE of the dummy model that predicts the optimal constant (mean of the target value).

TestValueMeanError()

Dataset-level. Computes the Mean Error (ME) and tests if it is near zero or compares it against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects the Mean Error to be near zero. With/without reference: the test fails if the Mean Error is skewed and the condition is violated. Condition: eq = approx(absolute=0.1 * error_std), where error_std = (curr_true - curr_preds).std().

TestValueMAPE()

Dataset-level. Computes the Mean Absolute Percentage Error (MAPE) and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or better than a dummy model. With reference: if MAPE is higher or lower by over 10%, the test fails. No reference: the test fails if the MAPE value is higher than the MAPE of the dummy model that predicts the optimal constant (weighted median of the target value).

TestValueAbsMaxError()

Dataset-level. Computes the absolute maximum error and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or better than a dummy model. With reference: if the absolute maximum error is higher or lower by over 10%, the test fails. No reference: the test fails if the absolute maximum error is higher than the absolute maximum error of the dummy model that predicts the optimal constant (median of the target value).

TestValueR2Score()

Dataset-level. Computes the R2 Score (coefficient of determination) and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% or > 0. With reference: if R2 is higher or lower by over 10%, the test fails. No reference: the test fails if the R2 value is <= 0.

Classification

You can apply the tests to non-probabilistic classification, probabilistic classification, and ranking. The underlying metrics will be calculated slightly differently depending on the provided inputs: only labels, probabilities, decision threshold, and/or K (to compute, e.g., precision@K).

Defaults for Classification tests. If there is no reference data or defined conditions, Evidently will compare the model performance to a dummy model. It is based on a set of heuristics to verify that the quality is better than random. You can also pass the reference dataset and run the test with default conditions, or define custom test conditions.
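
For example, a sketch of combining a decision threshold with custom conditions for probabilistic classification (the values are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestPrecisionScore, TestRecallScore

suite = TestSuite(tests=[
    TestPrecisionScore(probas_threshold=0.8, gte=0.9),  # predictions >= 0.8 count as positive
    TestRecallScore(probas_threshold=0.8, gte=0.5),
])
```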

Test name | Description | Parameters | Default test conditions

TestAccuracyScore()

Dataset-level. Computes the Accuracy and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the Accuracy is over 20% higher or lower, the test fails. No reference: if the Accuracy is lower than the Accuracy of the dummy model, the test fails.

TestPrecisionScore()

Dataset-level. Computes the Precision and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the Precision is over 20% higher or lower, the test fails. No reference: if the Precision is lower than the Precision of the dummy model, the test fails.

TestRecallScore()

Dataset-level. Computes the Recall and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the Recall is over 20% higher or lower, the test fails. No reference: if the Recall is lower than the Recall of the dummy model, the test fails.

TestF1Score()

Dataset-level. Computes the F1 score and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the F1 is over 20% higher or lower, the test fails. No reference: if the F1 is lower than the F1 of the dummy model, the test fails.

TestPrecisionByClass(label='classN')

Dataset-level. Computes the Precision for the specified class and compares it to the reference or against a defined condition.

Required:

  • label

Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the Precision is over 20% higher or lower, the test fails. No reference: if the Precision is lower than the Precision of the dummy model, the test fails.

TestRecallByClass(label='classN')

Dataset-level. Computes the Recall for the specified class and compares it to the reference or against a defined condition.

Required:

  • label

Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: if the Recall is over 20% higher or lower, the test fails. No reference: if the Recall is lower than the Recall of the dummy model, the test fails.

TestF1ByClass(label='classN')

Dataset-level. Computes the F1 for the specified class and compares it to the reference or against a defined constraint.

Required:

  • label

Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the F1 is over 20% higher or lower. No reference: the test fails if the F1 is lower than the F1 of the dummy model.

TestTPR()

Dataset-level. Computes the True Positive Rate and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the TPR is over 20% higher or lower. No reference: the test fails if the TPR is lower than the TPR of the dummy model.

TestTNR()

Dataset-level. Computes the True Negative Rate and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the TNR is over 20% higher or lower. No reference: the test fails if the TNR is lower than the TNR of the dummy model.

TestFPR()

Dataset-level. Computes the False Positive Rate and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the FPR is over 20% higher or lower. No reference: the test fails if the FPR is higher than the FPR of the dummy model.

TestFNR()

Dataset-level. Computes the False Negative Rate and compares it to the reference or against a defined condition.

Required: N/A Optional:

  • probas_threshold (default for classification = None; default for probabilistic classification = 0.5)

  • k (default = None)

Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the FNR is over 20% higher or lower. No reference: the test fails if the FNR is higher than the FNR of the dummy model.

TestRocAuc()

Dataset-level. Applies to probabilistic classification. Computes the ROC AUC and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-20% or > 0.5. With reference: the test fails if the ROC AUC is over 20% higher or lower than in the reference. No reference: the test fails if ROC AUC is <= 0.5.

TestLogLoss()

Dataset-level. Applies to probabilistic classification. Computes the LogLoss and compares it to the reference or against a defined condition.

Required: N/A Optional: N/A Test conditions:

  • standard parameters

Expects +/-20% or better than a dummy model. With reference: the test fails if the LogLoss is over 20% higher or lower than in the reference. No reference: the test fails if LogLoss is higher than the LogLoss of the dummy model (equals 0.5 for a constant model).

Ranking and Recommendations

Check the individual metric descriptions for details.

Optional shared parameters:

  • no_feedback_users: bool = False. Specifies whether to include the users who did not select any of the items when computing the quality metrics. Default: False.

  • min_rel_score: Optional[int] = None. Specifies the minimum relevance score to consider relevant when calculating the quality metrics for non-binary targets (e.g., if a target is a rating or a custom score).
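
For example, a sketch of passing the shared parameters to individual tests, assuming the ranking tests are importable from evidently.tests as in recent versions (the values are illustrative):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestNDCGK, TestPrecisionTopK

suite = TestSuite(tests=[
    TestPrecisionTopK(k=5, min_rel_score=4),  # ratings of 4+ count as relevant
    TestNDCGK(k=10, no_feedback_users=True),  # include users without feedback
])
```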

Test name | Description | Parameters | Default test conditions

TestPrecisionTopK(k=k)

Dataset-level. Computes the Precision at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Precision at the top K is over 10% higher or lower, the test fails. No reference: Tests if precision > 0.

TestRecallTopK(k=k)

Dataset-level. Computes the Recall at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Recall at the top K is over 10% higher or lower, the test fails. No reference: Tests if recall > 0.

TestFBetaTopK(k=k)

Dataset-level. Computes the F-beta score at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the F-beta score at the top K is over 10% higher or lower, the test fails. No reference: Tests if F-beta > 0.

TestHitRateK(k=k)

Dataset-level. Computes the Hit Rate at the top K recommendations and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Hit Rate at the top K is over 10% higher or lower, the test fails. No reference: Tests if Hit Rate > 0.

TestMAPK(k=k)

Dataset-level. Computes the Mean Average Precision at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the MAP at the top K is over 10% higher or lower, the test fails. No reference: Tests if MAP > 0.

TestMRRK(k=k)

Dataset-level. Computes the Mean Reciprocal Rank at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the MRR at the top K is over 10% higher or lower, the test fails. No reference: Tests if MRR > 0.

TestNDCGK(k=k)

Dataset-level. Computes the Normalized Discounted Cumulative Gain at the top K and compares it to the reference or against a defined condition.

Required:

  • k

Optional:

  • no_feedback_users

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Normalized Discounted Cumulative Gain at the top K is over 10% higher or lower, the test fails. No reference: Tests if NDCG > 0.

TestNovelty(k=k)

Dataset-level. Computes the Novelty at the top K recommendations and compares it to the reference or against a defined condition. Requires a training dataset.

Required:

  • k

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Novelty at the top K is over 10% higher or lower, the test fails. No reference: Tests if novelty > 0.

TestPersonalization(k=k)

Dataset-level. Computes the Personalization at the top K recommendations and compares it to the reference or against a defined condition.

Required:

  • k

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Personalization at the top K is over 10% higher or lower, the test fails. No reference: Tests if personalization > 0.

TestSerendipity(k=k, item_features=item_features)

Dataset-level. Computes the Serendipity at the top K recommendations considering item features and compares it to the reference or against a defined condition. Requires a training dataset.

Required:

  • k

  • item_features

Optional:

  • min_rel_score

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Serendipity at the top K is over 10% higher or lower, the test fails. No reference: Tests if serendipity > 0.

TestDiversity(k=k, item_features=item_features)

Dataset-level. Computes the Diversity at the top K recommendations considering item features and compares it to the reference or against a defined condition.

Required:

  • k

  • item_features

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Diversity at the top K is over 10% higher or lower, the test fails. No reference: Tests if diversity > 0.

TestARP(k=k)

Dataset-level. Computes the Average Recommendation Popularity at the top K recommendations and compares it to the reference or against a defined condition. Requires a training dataset.

Required:

  • k

Optional:

  • normalize_arp (default: False)

Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the ARP at the top K is over 10% higher or lower, the test fails. No reference: Tests if ARP > 0.

TestGiniIndex(k=k)

Dataset-level. Computes the Gini Index at the top K recommendations and compares it to the reference or against a defined condition. Requires a training dataset.

Required:

  • k

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Gini Index at the top K is over 10% higher or lower, the test fails. No reference: Tests if Gini Index < 1.

TestCoverage(k=k)

Dataset-level. Computes the Coverage at the top K recommendations and compares it to the reference or against a defined condition. Requires a training dataset.

Required:

  • k

Optional: N/A Test conditions:

  • standard parameters

Expects +/-10% from reference. With reference: if the Coverage at the top K is over 10% higher or lower, the test fails. No reference: Tests if Coverage > 0.
