How to change data drift detection methods and conditions.
gt, lt, etc.) This accounts for nuances like the varying role of thresholds across drift detection methods, where “greater” can be better or worse depending on the method.
drift_share threshold.
Parameter | Description | Applies To |
---|---|---|
method | Defines the drift detection method for a given column (if one column is tested), or all columns in the dataset (if multiple columns are tested and the method can apply to all columns). | ValueDrift(), DriftedColumnsCount(), DataDriftPreset() |
threshold | Sets the drift threshold in a given column or all columns. The threshold meaning varies based on the drift detection method, e.g., it can be the value of a distance metric or a p-value of a statistical test. | ValueDrift(), DriftedColumnsCount(), DataDriftPreset() |
drift_share | Defines the share of drifting columns as a condition for Dataset Drift. Default: 0.5 | DriftedColumnsCount(), DataDriftPreset() |
cat_method, cat_threshold | Sets the drift method and/or threshold for all categorical columns. | DriftedColumnsCount(), DataDriftPreset() |
num_method, num_threshold | Sets the drift method and/or threshold for all numerical columns. | DriftedColumnsCount(), DataDriftPreset() |
per_column_method, per_column_threshold | Sets the drift method and/or threshold for the listed columns (accepts a dictionary). | DriftedColumnsCount(), DataDriftPreset() |
text_method, text_threshold | Sets the drift method and/or threshold for all text columns. | DriftedColumnsCount(), DataDriftPreset() |
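
The interplay of these parameters can be illustrated with a minimal sketch in plain Python. It does not call the library. `resolve_method` and `dataset_drift` are hypothetical helper names, and two behaviors are assumptions rather than documented facts: that the most specific setting wins (per-column, then per-type, then the general `method`), and that Dataset Drift is flagged when the share of drifted columns is greater than or equal to `drift_share`.

```python
from typing import Dict, Optional


def resolve_method(
    column: str,
    column_type: str,
    method: Optional[str] = None,
    type_methods: Optional[Dict[str, str]] = None,
    per_column_method: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Pick the drift method for one column (hypothetical helper).

    Assumed precedence: per_column_method, then the per-type setting
    (cat_method / num_method / text_method), then the general method.
    Returning None stands for "use the default method".
    """
    if per_column_method and column in per_column_method:
        return per_column_method[column]
    if type_methods and column_type in type_methods:
        return type_methods[column_type]
    return method


def dataset_drift(column_drift: Dict[str, bool], drift_share: float = 0.5) -> bool:
    """Flag Dataset Drift from per-column drift results (hypothetical helper)."""
    if not column_drift:
        return False
    share = sum(column_drift.values()) / len(column_drift)
    # Assumption: the comparison is inclusive (share >= drift_share).
    return share >= drift_share


settings = dict(
    method="psi",
    type_methods={"num": "wasserstein"},   # stands in for num_method
    per_column_method={"income": "ks"},
)
print(resolve_method("age", "num", **settings))     # wasserstein
print(resolve_method("income", "num", **settings))  # ks
print(resolve_method("city", "cat", **settings))    # psi

flags = {"age": True, "income": False, "city": True, "review": False}
print(dataset_drift(flags))                   # True: 2 of 4 columns, share 0.5
print(dataset_drift(flags, drift_share=0.7))  # False
```

Raising `drift_share` makes the dataset-level condition stricter without changing how any individual column is tested.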
stattest (or num_stattest, etc.) parameter.
StatTest | Applicable to | Drift score |
---|---|---|
ks: Kolmogorov–Smirnov (K-S) test | Tabular data, numerical only. Default method for numerical data if ≤ 1000 objects. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
chisquare: Chi-Square test | Tabular data, categorical only. Default method for categorical data with > 2 labels if ≤ 1000 objects. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
z: Z-test | Tabular data, categorical only. Default method for binary data if ≤ 1000 objects. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
wasserstein: Wasserstein distance (normed) | Tabular data, numerical only. Default method for numerical data if > 1000 objects. | Returns distance. Drift detected when distance ≥ threshold. Default threshold: 0.1. |
kl_div: Kullback-Leibler divergence | Tabular data, numerical and categorical. | Returns divergence. Drift detected when divergence ≥ threshold. Default threshold: 0.1. |
psi: Population Stability Index (PSI) | Tabular data, numerical and categorical. | Returns psi_value. Drift detected when psi_value ≥ threshold. Default threshold: 0.1. |
jensenshannon: Jensen-Shannon distance | Tabular data, numerical and categorical. Default method for categorical data if > 1000 objects. | Returns distance. Drift detected when distance ≥ threshold. Default threshold: 0.1. |
anderson: Anderson-Darling test | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
fisher_exact: Fisher’s Exact test | Tabular data, categorical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
cramer_von_mises: Cramer-Von-Mises test | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
g-test: G-test | Tabular data, categorical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
hellinger: Hellinger Distance (normed) | Tabular data, numerical and categorical. | Returns distance. Drift detected when distance ≥ threshold. Default threshold: 0.1. |
mannw: Mann-Whitney U-rank test | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
ed: Energy distance | Tabular data, numerical only. | Returns distance. Drift detected when distance ≥ threshold. Default threshold: 0.1. |
es: Epps-Singleton test | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
t_test: T-Test | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
empirical_mmd: Empirical-MMD | Tabular data, numerical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
TVD: Total-Variation-Distance | Tabular data, categorical only. | Returns p_value. Drift detected when p_value < threshold. Default threshold: 0.05. |
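
The table implies two rules that are easy to mix up: which method is the default for a column, and in which direction the threshold is compared. The sketch below restates both in plain Python as a reading aid. It is not the library's code: `default_method` and `drift_detected` are hypothetical helpers, and the `"num"` / `"cat"` type labels and the `n_labels` argument are assumptions made for illustration.

```python
# Methods grouped by the kind of score they return, as listed in the table.
P_VALUE_METHODS = {
    "ks", "chisquare", "z", "anderson", "fisher_exact", "cramer_von_mises",
    "g-test", "mannw", "es", "t_test", "empirical_mmd", "TVD",
}
DISTANCE_METHODS = {"wasserstein", "kl_div", "psi", "jensenshannon", "hellinger", "ed"}


def default_method(column_type: str, n_objects: int, n_labels: int = 0) -> str:
    """Default tabular method per the table (hypothetical helper)."""
    small = n_objects <= 1000
    if column_type == "num":
        return "ks" if small else "wasserstein"
    if column_type == "cat":
        if not small:
            return "jensenshannon"
        # Binary columns use the Z-test; more than 2 labels use Chi-Square.
        return "chisquare" if n_labels > 2 else "z"
    raise ValueError(f"unsupported column type: {column_type!r}")


def drift_detected(method: str, score: float, threshold: float = None) -> bool:
    """Apply the threshold in the direction the method expects."""
    if method in P_VALUE_METHODS:
        threshold = 0.05 if threshold is None else threshold
        return score < threshold   # a small p-value signals drift
    if method in DISTANCE_METHODS:
        threshold = 0.1 if threshold is None else threshold
        return score >= threshold  # a large distance signals drift
    raise ValueError(f"unknown method: {method!r}")


print(default_method("num", n_objects=500))               # ks
print(default_method("num", n_objects=5000))              # wasserstein
print(default_method("cat", n_objects=500, n_labels=2))   # z
print(default_method("cat", n_objects=500, n_labels=5))   # chisquare
print(default_method("cat", n_objects=5000, n_labels=5))  # jensenshannon

# The same score of 0.03 means drift for a p-value test but not for a distance.
print(drift_detected("ks", 0.03))   # True
print(drift_detected("psi", 0.03))  # False
```

This is why a single threshold value should not be reused across methods without checking which kind of score the method returns.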
stattest (or text_stattest) parameter.
StatTest | Description | Drift score |
---|---|---|
perc_text_content_drift: Text content drift (domain classifier, with statistical hypothesis testing) | Applies only to text data. Trains a classifier model to distinguish between text in the “current” and “reference” datasets. Default for text data with ≤ 1000 objects. | |
abs_text_content_drift: Text content drift (domain classifier) | Applies only to text data. Trains a classifier model to distinguish between text in the “current” and “reference” datasets. Default for text data with > 1000 objects. | |
Parameter | Type | Description |
---|---|---|
name | str | A short name used to reference the Stat Test from the options (registered globally). |
display_name | str | A long name displayed in the Report. |
func | Callable | The StatTest function. |
allowed_feature_types | List[str] | The list of allowed feature types for this function (cat, num). |
(reference_data: pd.Series, current_data: pd.Series, feature_type: str, threshold: float) -> Tuple[float, bool] signature.
Accepts:
- reference_data: pd.Series - The reference data series.
- current_data: pd.Series - The current data series to compare.
- feature_type: str - The type of feature being analyzed.
- threshold: float - The test threshold for drift detection.

Returns:
- score: float - The Stat Test score (actual value).
- drift_detected: bool - Whether drift is detected at the given threshold.
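
A custom function following this contract could look like the sketch below. It is an illustration, not a recommended drift test: `mean_shift_stattest` and the `"mean_shift"` name are hypothetical, and the score (the shift in means measured in reference standard deviations) was chosen only because it needs nothing beyond the standard library. The function accepts any iterable of numbers, so a `pd.Series` would work as input. The final dictionary mirrors the four parameters from the table above; how they are passed to the library's StatTest object is not shown here.

```python
from statistics import fmean, pstdev
from typing import Iterable, Tuple


def mean_shift_stattest(
    reference_data: Iterable[float],
    current_data: Iterable[float],
    feature_type: str,
    threshold: float,
) -> Tuple[float, bool]:
    """Hypothetical custom test: mean shift in units of reference std."""
    ref = [float(x) for x in reference_data]
    cur = [float(x) for x in current_data]
    shift = abs(fmean(cur) - fmean(ref))
    spread = pstdev(ref)
    if spread > 0:
        score = shift / spread
    else:
        # Constant reference column: any shift at all counts as maximal.
        score = 0.0 if shift == 0 else float("inf")
    # Distance-like score: drift when the score reaches the threshold.
    return score, score >= threshold


# Parameters as described in the table (values here are illustrative).
stattest_params = {
    "name": "mean_shift",
    "display_name": "Mean shift (reference std units)",
    "func": mean_shift_stattest,
    "allowed_feature_types": ["num"],
}

score, drifted = stattest_params["func"](
    [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], "num", 0.5
)
print(round(score, 4), drifted)  # 0.7071 True
```

Because the score behaves like a distance, the comparison uses `>=`; a custom test that returns a p-value would compare with `<` instead.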