Data Drift
TL;DR: The report detects changes in feature distributions.
    Performs a suitable statistical test for numerical and categorical features.
    Plots feature values and distributions for the two datasets.

Summary

The Data Drift report helps detect and explore changes in the input data.

Requirements

You will need two datasets. The reference dataset serves as a benchmark. We analyze the change by comparing the current production data to the reference data.
The dataset should include the features you want to evaluate for drift. The structure (column names) of both datasets should be identical.
    In the case of a pandas DataFrame, all column names should be strings.
    All feature columns analyzed for drift should have a numerical type (np.number).
    Categorical data can be encoded as numerical labels and specified in the column_mapping.
    The DateTime column is the only exception. If available, it can be used as the x-axis in the plots.
You can choose any two datasets for comparison, but keep in mind that only the reference dataset serves as the basis of comparison.
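The structural requirement can be expressed as a simple check. A minimal sketch in plain Python, where each dataset is a dict mapping column name to a list of values (a stand-in for a DataFrame; the helper name and sample columns are hypothetical):

```python
def check_schema(reference, current):
    """Verify that two datasets share an identical set of string column
    names, as the Data Drift report requires."""
    ref_cols, cur_cols = set(reference), set(current)
    if ref_cols != cur_cols:
        raise ValueError(f"column mismatch: {sorted(ref_cols ^ cur_cols)}")
    non_string = [c for c in ref_cols if not isinstance(c, str)]
    if non_string:
        raise ValueError(f"non-string column names: {non_string}")
    return True

reference = {"TAX": [300, 310, 296], "PTRATIO": [15.0, 16.5, 15.3]}
current = {"TAX": [305, 900, 310], "PTRATIO": [15.5, 21.0, 16.0]}
print(check_schema(reference, current))  # True
```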

How it works

To estimate data drift, we compare the distribution of each individual feature in the two datasets.
We use statistical tests to detect whether the distribution has changed significantly.
All tests use a 0.95 confidence level by default. To use a different confidence level, specify it in the column_mapping configuration (see the example below).
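As an illustration, a common choice for comparing numerical features is the two-sample Kolmogorov-Smirnov test. Below is a minimal pure-Python sketch of its statistic (the maximum distance between the two empirical CDFs); this is illustrative only, not the report's internal implementation:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    d = 0.0
    for v in points:
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0 -- identical samples
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0 -- fully separated samples
```

In practice, the statistic is converted to a p-value and compared against the chosen significance level (1 minus the confidence level).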

How it looks

The report includes 4 components. All plots are interactive.

1. Data Drift Summary

The report returns the share of drifting features and an aggregate Dataset Drift result.
Dataset Drift sets a rule on top of the results of the statistical tests for individual features. By default, Dataset Drift is detected if at least 50% of the features drift at the 0.95 confidence level.
To set custom drift conditions, you need to specify the following parameters in the column_mapping:
    "drift_conf_level" - test confidence level for the individual features (default value 0.95, float)
    "drift_features_share" - share of the drifted features (default value 0.5, float)
Dataset Drift is detected if at least the "drift_features_share" share of the features drift at the defined "drift_conf_level" confidence level.
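The aggregation rule itself is simple to express. A minimal sketch (the function name and the per-feature p-values are hypothetical; this is not the report's internal code):

```python
def dataset_drift(p_values, drift_conf_level=0.95, drift_features_share=0.5):
    """Flag Dataset Drift: a feature drifts when its test p-value falls
    below (1 - confidence level); the dataset drifts when at least the
    given share of features drift."""
    significance = 1.0 - drift_conf_level
    n_drifted = sum(1 for p in p_values.values() if p < significance)
    share = n_drifted / len(p_values)
    return share >= drift_features_share, share

# Hypothetical per-feature p-values at the default 0.95 confidence level:
flag, share = dataset_drift({"age": 0.001, "income": 0.30, "tenure": 0.02, "score": 0.60})
print(flag, share)  # True 0.5
```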
To set drift conditions using the Python interface (e.g., in a Jupyter notebook):
    column_mapping['drift_conf_level'] = 0.99
    column_mapping['drift_features_share'] = 0.5
To set drift conditions using the CLI config:
    "column_mapping": {
        "drift_conf_level": 0.99,
        "drift_features_share": 0.5
    }

2. Data Drift Table

The table lists the drifting features first, sorted by p-value. You can also sort the rows by feature name or type.

3. Data Drift by Feature

By clicking on each feature, you can explore the values mapped in a plot.
    The dark green line is the mean, as seen in the reference dataset.
    The green area covers one standard deviation from the mean.

4. Data Distribution by Feature

You can also zoom in on the distributions to understand what has changed.

When to use this report

Here are a few ideas on when to use the report:
    1. In production: as early monitoring of model quality. In the absence of ground truth labels, you can monitor for changes in the input data. Use it, for example, to decide when to retrain the model, apply business logic on top of the model output, or decide whether to act on predictions. You can combine it with monitoring of model outputs using the Numerical or Categorical Target Drift report.
    2. In production: to debug model decay. Use the tool to explore how the input data has changed.
    3. In an A/B test or trial use: to detect training-serving skew and get the context to interpret test results.
    4. Before deployment: to understand drift in the offline environment. Explore past shifts in the data to define retraining needs and a monitoring strategy. Here is a blog about it.
    5. To find useful features when building a model. You can also use the tool to compare feature distributions in different classes to surface the best discriminants.

JSON Profile

If you choose to generate a JSON profile, it will contain the following information:
    {
      "data_drift": {
        "name": "data_drift",
        "datetime": "datetime",
        "data": {
          "utility_columns": {
            "date": null,
            "id": null,
            "target": null,
            "prediction": null,
            "drift_conf_level": value,
            "drift_features_share": value,
            "nbinsx": {
              "feature_name": value,
              "feature_name": value
            },
            "xbins": null
          },
          "cat_feature_names": [],
          "num_feature_names": [],
          "metrics": {
            "feature_name": {
              "prod_small_hist": [
                [],
                []
              ],
              "ref_small_hist": [
                [],
                []
              ],
              "feature_type": "num",
              "p_value": p_value
            },
            "n_features": value,
            "n_drifted_features": value,
            "share_drifted_features": value,
            "dataset_drift": false
          }
        }
      },
      "timestamp": "timestamp"
    }
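Because the profile is plain JSON, downstream automation can read the aggregate results directly. A minimal sketch using a hypothetical snippet that follows the structure above:

```python
import json

# A hypothetical profile fragment mirroring the structure shown above.
profile = json.loads("""
{
  "data_drift": {
    "data": {
      "metrics": {
        "n_features": 4,
        "n_drifted_features": 2,
        "share_drifted_features": 0.5,
        "dataset_drift": false
      }
    }
  }
}
""")
metrics = profile["data_drift"]["data"]["metrics"]
print(metrics["share_drifted_features"], metrics["dataset_drift"])  # 0.5 False
```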

Histogram plots customization

You can customize the way the distribution plots look for the individual features. It is helpful, for example, if you have NULL or other specific values and want to see them in a separate bin.
To customize the plots, specify the following parameters inside the column_mapping:
    "nbinsx" - to set the number of bins (default value = 10, integer)
    "xbins" - to define the specific bins (default value = None)
You can set different options for each feature. For example, specify "nbinsx" for a subset of the features, "xbins" for another subset, and apply the defaults for the rest. Here is an example.
The Data Drift report has two sets of histograms: 1) preview in the Data Drift table 2) an interactive plot inside the Data Drift table that expands when you click on each feature.
Only "nbinsx", if specified, affects the histogram previews in the Data Drift table.
Both "nbinsx" and "xbins" can influence the interactive plots inside the table. If you set one of the parameters, it defines the plot view. If you set both, "xbins" takes priority: "xbins" defines the interactive plot, while "nbinsx" affects only the preview.
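How an "xbins" spec translates into concrete bin edges can be sketched as follows. This assumes the Plotly-style convention of fixed-width bins defined by start, end, and size; the helper is purely illustrative, not the report's code:

```python
def xbins_edges(start, end, size):
    """Expand an xbins spec (start, end, size) into explicit bin edges,
    assuming fixed-width bins laid out from start until end is covered."""
    if size <= 0:
        raise ValueError("size must be positive")
    edges = [start]
    while edges[-1] < end:
        edges.append(edges[-1] + size)
    return edges

print(xbins_edges(0, 1, 0.5))  # [0, 0.5, 1.0]
```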

Examples

To generate a Dashboard, HTML report, or JSON Profile using Python:
    column_mapping['nbinsx'] = {'TAX': 3, 'PTRATIO': 5}
    column_mapping['xbins'] = {
        'CRIM': dict(start=-10., end=100., size=5.),  # OPTION 1
        'NOX': histogram.XBins(start=-0.5, end=1.5, size=.05)  # OPTION 2 (NB: XBins is not JSON serializable)
    }
To generate an HTML report or JSON profile using CLI (JSON config):
    "column_mapping": {
        "nbinsx": {"TAX": 3, "PTRATIO": 5},
        "xbins": {"CRIM": {"start": 10.0, "end": 100.0, "size": 5.0}}
    }

Data Drift Dashboard Examples

    Browse our examples for sample Jupyter notebooks.
You can also read the initial release blog.