Data drift parameters
How to set custom data drift conditions and thresholds for tabular and text data.
Pre-requisites:
- You know how to generate Reports or Test Suites with default parameters.
- You know how to pass custom parameters for Reports or Test Suites.
- You know how to use Column Mapping to set the input data type.
All Presets, Tests, and Metrics that include data or target (prediction) drift evaluation use the default Data Drift algorithm. It automatically selects an appropriate drift detection method based on the column type and the number of objects.
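The selection logic can be pictured roughly as follows. This is a minimal sketch that only restates the defaults listed in the method tables on this page; the helper name `default_stattest` and its arguments are hypothetical and not part of the library, and the fallback to `chisquare` when the label count is unknown is an assumption.

```python
from typing import Optional


def default_stattest(column_type: str, n_objects: int, n_labels: Optional[int] = None) -> str:
    """Return the default drift method name for a column (hypothetical helper)."""
    small = n_objects <= 1000  # size cut-off used in the method tables
    if column_type == "numerical":
        return "ks" if small else "wasserstein"
    if column_type == "categorical":
        if not small:
            return "jensenshannon"
        # Assumption: n_labels is the number of distinct labels in the column.
        if n_labels is not None and n_labels <= 2:
            return "z"
        return "chisquare"
    if column_type == "text":
        return "perc_text_content_drift" if small else "abs_text_content_drift"
    raise ValueError(f"unknown column type: {column_type!r}")


print(default_stattest("numerical", 500))         # ks
print(default_stattest("categorical", 5000, 4))   # jensenshannon
```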
You can override the defaults by passing a custom parameter to the chosen Test, Metric, or Preset. You can define the drift detection method, the threshold, or both.
You can refer to an example How-to-notebook showing how to pass custom drift parameters.
To set a custom drift method and threshold on the column level:
ColumnDriftMetric(column_name='feature1', stattest='wasserstein', stattest_threshold=0.2)
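To get a feel for what a threshold of 0.2 means for a normed Wasserstein distance, the sketch below computes the score by hand. It assumes the distance is normalised by the standard deviation of the reference sample, which may differ from the library's exact implementation; the helper `normed_wasserstein` is hypothetical and, for simplicity, only handles samples of equal size.

```python
from statistics import pstdev


def normed_wasserstein(reference, current):
    """1-D Wasserstein distance for equal-sized samples, divided by the reference std."""
    if len(reference) != len(current):
        raise ValueError("this simplified sketch needs samples of equal size")
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    # For equal-sized samples, the 1-D distance is the mean gap between sorted values.
    distance = sum(abs(r - c) for r, c in zip(ref_sorted, cur_sorted)) / len(ref_sorted)
    # Assumption: normalise by the reference standard deviation (floored to avoid division by zero).
    return distance / max(pstdev(reference), 0.001)


reference = [float(x) for x in range(10)]
current = [x + 1.0 for x in reference]  # every value shifted by one unit

score = normed_wasserstein(reference, current)
drift_detected = score >= 0.2  # same rule as stattest_threshold=0.2: drift when distance >= threshold
print(round(score, 3), drift_detected)
```

Under this assumption, the score reads as "how many reference standard deviations the distribution moved", so a threshold of 0.2 flags a shift of about a fifth of a standard deviation.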
If you have a Preset, Test or Metric that checks for drift in multiple columns at the same time, you can set a custom drift method for all columns, all numerical/categorical columns, or for each column individually.
Here is how you set the drift detection method for all categorical columns:
DataDriftPreset(cat_stattest='chisquare', cat_stattest_threshold=0.05)
To set a custom condition for the dataset drift (share of drifting columns in the dataset) in the relevant Metrics or Presets:
DatasetDriftMetric(drift_share=0.7)
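The dataset-level condition can be sketched as a simple share computation. The helper below is hypothetical, the column names are made up, and two details are assumptions rather than documented behavior: the default of 0.5 and the use of `>=` when comparing the share to `drift_share`.

```python
def dataset_drift(column_drift_flags, drift_share=0.5):
    """Return (share of drifted columns, dataset drift flag) for per-column results."""
    drifted = sum(1 for flag in column_drift_flags.values() if flag)
    share = drifted / len(column_drift_flags)
    # Assumption: dataset drift is reported when the share reaches the threshold.
    return share, share >= drift_share


# Hypothetical per-column drift results: two of four columns drifted.
flags = {"age": True, "income": True, "city": False, "plan": False}

share, drifted = dataset_drift(flags, drift_share=0.7)
print(share, drifted)  # 0.5 False
```

With `drift_share=0.7`, half of the columns drifting is not enough to flag the dataset as a whole.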
Note that this works slightly differently for Tests. To set a custom condition for the dataset drift when you run a relevant Test, set a condition for the share of drifted features using the standard lt and gt parameters:
TestShareOfDriftedColumns(lt=0.5)
When you set a drift threshold for TestColumnDrift(), use stattest_threshold and the other parameters the same way as in Metrics (not lt and gt).
The following methods and parameters apply to tabular data (as parsed automatically or specified as numerical or categorical columns in the column mapping).
The following drift detection parameters are available in DataDriftTable(), DatasetDriftMetric(), ColumnDriftMetric(), related Tests, and Presets that contain them.
Parameter | Description |
---|---|
stattest | Defines the drift detection method for a given column (if a single column is tested), or all columns in the dataset (if multiple columns are tested). |
stattest_threshold | Sets the drift threshold in a given column or all columns. The threshold meaning varies based on the drift detection method, e.g., it can be the value of a distance metric or a p-value of a statistical test. |
drift_share | Defines the share of drifting columns as a condition for Dataset Drift in DatasetDriftMetric or inside a Preset. |
cat_stattest, cat_stattest_threshold | Sets the drift method and/or threshold for all categorical columns in the dataset. |
num_stattest, num_stattest_threshold | Sets the drift method and/or threshold for all numerical columns in the dataset. |
per_column_stattest, per_column_stattest_threshold | Sets the drift method and/or threshold for the listed columns (accepts a dictionary). |
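As an illustration of how these parameters might be combined, the sketch below collects them in a plain dictionary. The column names `age` and `plan` are hypothetical, and the assumption that column-specific entries take precedence over the type-wide settings should be checked against the API reference.

```python
# Hypothetical column names; replace them with columns from your own dataset.
drift_params = {
    "num_stattest": "wasserstein",        # method for all numerical columns
    "num_stattest_threshold": 0.2,        # distance threshold
    "cat_stattest": "psi",                # method for all categorical columns
    "cat_stattest_threshold": 0.1,        # PSI threshold
    # Dictionaries keyed by column name for column-specific settings.
    "per_column_stattest": {"age": "ks", "plan": "chisquare"},
    "per_column_stattest_threshold": {"age": 0.01, "plan": 0.01},
}

# Sanity check: every column with a custom method also has a custom threshold.
missing = set(drift_params["per_column_stattest"]) - set(drift_params["per_column_stattest_threshold"])
print(sorted(missing))  # []
```

A dictionary like this could then be unpacked into a Preset or Metric that accepts these parameters, for example `DataDriftPreset(**drift_params)`.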
How to check available parameters. You can verify which parameters are available for a specific Test, Metric, or Preset in the All tests or All metrics tables, or consult the API reference.
To use the following drift detection methods, pass them using the stattest parameter.
StatTest | Applicable to | Drift score |
---|---|---|
ks: Kolmogorov–Smirnov (K-S) test | tabular data, only numerical. Default method for numerical data, if <= 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
chisquare: Chi-Square test | tabular data, only categorical. Default method for categorical with > 2 labels, if <= 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
z: Z-test | tabular data, only categorical. Default method for binary data, if <= 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
wasserstein: Wasserstein distance (normed) | tabular data, only numerical. Default method for numerical data, if > 1000 objects | returns distance; drift detected when distance >= threshold; default threshold: 0.1 |
kl_div: Kullback-Leibler divergence | tabular data, numerical and categorical | returns divergence; drift detected when divergence >= threshold; default threshold: 0.1 |
psi: Population Stability Index (PSI) | tabular data, numerical and categorical | returns psi_value; drift detected when psi_value >= threshold; default threshold: 0.1 |
jensenshannon: Jensen-Shannon distance | tabular data, numerical and categorical. Default method for categorical, if > 1000 objects | returns distance; drift detected when distance >= threshold; default threshold: 0.1 |
anderson: Anderson-Darling test | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
fisher_exact: Fisher's Exact test | tabular data, only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
cramer_von_mises: Cramer-Von-Mises test | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
g-test: G-test | tabular data, only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
hellinger: Hellinger Distance (normed) | tabular data, numerical and categorical | returns distance; drift detected when distance >= threshold; default threshold: 0.1 |
mannw: Mann-Whitney U-rank test | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
ed: Energy distance | tabular data, only numerical | returns distance; drift detected when distance >= threshold; default threshold: 0.1 |
es: Epps-Singleton test | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
t_test: T-Test | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
emperical_mmd: Empirical-MMD | tabular data, only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
TVD: Total-Variation-Distance | tabular data, only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
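Note that the threshold direction differs between the two families of methods: statistical tests flag drift when the p-value falls below the threshold, while distance-style scores flag drift when the score reaches or exceeds it. As a worked illustration of a distance-style score, here is a minimal sketch of the textbook PSI formula for a categorical column. The helper `psi` is hypothetical, and the handling of empty categories (flooring shares at a small `eps`) is an assumption that may not match the library's implementation.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-4):
    """Textbook PSI over the category shares of two samples."""
    categories = set(reference) | set(current)
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in categories:
        # Floor empty shares at eps so the logarithm stays defined (assumption of this sketch).
        ref_share = max(ref_counts[category] / len(reference), eps)
        cur_share = max(cur_counts[category] / len(current), eps)
        total += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return total


reference = ["a"] * 50 + ["b"] * 50
current = ["a"] * 80 + ["b"] * 20

psi_value = psi(reference, current)
drift_detected = psi_value >= 0.1  # distance-style rule: drift when the score reaches the threshold
print(round(psi_value, 3), drift_detected)  # 0.416 True
```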
Text drift detection applies to columns with raw text data, as specified in column mapping.
Embedding drift detection. If you work with embeddings, you can use Embeddings Drift Detection methods.
The following text drift detection parameters are available in DataDriftTable(), DatasetDriftMetric(), ColumnDriftMetric(), related Tests, and Presets that contain them.
Parameter | Description |
---|---|
stattest | Defines the drift detection method for a given column that contains text data, or for all columns in the dataset if all columns contain text data. |
stattest_threshold | Sets the threshold as a drift detection parameter. |
text_stattest | Defines the drift detection method for all text columns in the dataset. |
text_stattest_threshold | Sets the threshold as a drift detection parameter. |
To use the following text drift detection methods, pass them using the stattest parameter.
StatTest | Description | Drift score |
---|---|---|
perc_text_content_drift: Text content drift (domain classifier, with statistical hypothesis testing) | Applies only to text data. Trains a classifier model to distinguish between text in “current” and “reference” datasets. Default for text data when <= 1000 objects. | |
abs_text_content_drift: Text content drift (domain classifier) | Applies only to text data. Trains a classifier model to distinguish between text in “current” and “reference” datasets. Default for text data when > 1000 objects. | |
You can also check for distribution drift in text descriptors (such as text length).
To use this method, call a separate TextDescriptorsDriftMetric(). You can pass any of the tabular drift detection methods as a parameter.
from evidently.report import Report
from evidently.metrics import TextDescriptorsDriftMetric

# reviews_ref, reviews_cur, and column_mapping are assumed to be defined earlier.
report = Report(metrics=[
    TextDescriptorsDriftMetric("Review_Text"),
])
report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
report