Data Drift

TL;DR: The report detects changes in feature distributions.

  • Performs a suitable statistical test for numerical and categorical features.

  • Plots feature values and distributions for the two datasets.

Summary

The Data Drift report helps detect and explore changes in the input data.

Requirements

You will need two datasets. The reference dataset serves as a benchmark. We analyze the change by comparing the current production data to the reference data.

Each dataset should include the features you want to evaluate for drift. The structure (column names) of both datasets should be identical.

  • If you use a pandas DataFrame, all column names should be strings.

  • All feature columns analyzed for drift should have a numerical type (np.number).

  • Categorical data can be encoded as numerical labels and specified in the column mapping.

  • The DateTime column is the only exception. If available, it can be used as the x-axis in the data plots.

You can choose any two datasets for comparison, but keep in mind that only the reference dataset is used as the basis for comparison.
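The requirements above can be checked before building the report. The following is a minimal sketch, assuming both datasets are pandas DataFrames; the helper name `check_drift_inputs` and its `datetime_col` argument are hypothetical and not part of the tool.

```python
import numpy as np
import pandas as pd


def check_drift_inputs(reference, current, datetime_col=None):
    """Return a list of problems that violate the stated input requirements."""
    problems = []
    # Both datasets should share the same column names.
    if set(reference.columns) != set(current.columns):
        problems.append("column names differ between datasets")
    for name, frame in (("reference", reference), ("current", current)):
        # All column names should be strings.
        if not all(isinstance(col, str) for col in frame.columns):
            problems.append(f"{name}: non-string column names")
        # Feature columns should be numerical; the DateTime column is exempt.
        features = frame
        if datetime_col is not None and datetime_col in frame.columns:
            features = frame.drop(columns=[datetime_col])
        non_numeric = features.select_dtypes(exclude=[np.number]).columns.tolist()
        if non_numeric:
            problems.append(f"{name}: non-numerical columns {non_numeric}")
    return problems


# Hypothetical data: "city" is label-encoded in reference but left as text in current.
reference = pd.DataFrame({"age": [25, 32, 47], "city": [0, 1, 1]})
current = pd.DataFrame({"age": [29, 41, 38], "city": ["a", "b", "a"]})
problems = check_drift_inputs(reference, current)
print(problems)
```

An empty list means the inputs satisfy the checks in this sketch; it does not guarantee that the report will accept them.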

How it works

To estimate the data drift, we compare distributions of each individual feature in the two datasets.

We use statistical tests to detect whether the distribution has changed significantly: one test for numerical features and one for categorical features.

Both tests use a 0.95 confidence level. We plan to add more controls later, but we believe this is a good enough default approach.

Currently, we estimate the data drift for each feature individually. Integral data drift is not evaluated.
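The per-feature comparison described above can be sketched as follows. This is only an illustration: the text does not name the tests, so the sketch assumes a two-sample Kolmogorov-Smirnov test for numerical features and a chi-squared test for categorical ones (both from SciPy). It treats a p-value below 0.05 as drift, to match the 0.95 confidence level. The names `feature_p_value` and `detect_drift` are hypothetical and are not the report's API.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # assumed threshold matching a 0.95 confidence level


def feature_p_value(ref, cur, feature_type):
    """Compare one feature across the two datasets and return a p-value."""
    if feature_type == "num":
        # Assumption: two-sample Kolmogorov-Smirnov test for numerical features.
        return stats.ks_2samp(ref, cur).pvalue
    # Assumption: chi-squared test on category counts for categorical features.
    categories = sorted(set(ref) | set(cur))
    table = np.array([
        [(ref == c).sum() for c in categories],
        [(cur == c).sum() for c in categories],
    ])
    return stats.chi2_contingency(table)[1]


def detect_drift(reference, current, num_features, cat_features):
    """Test each feature individually; no dataset-level drift is computed."""
    results = {}
    for name in num_features + cat_features:
        ftype = "num" if name in num_features else "cat"
        p = feature_p_value(reference[name].dropna(), current[name].dropna(), ftype)
        results[name] = {
            "feature_type": ftype,
            "p_value": float(p),
            "drift": bool(p < ALPHA),
        }
    return results


# Synthetic data: "shifted" moves by two standard deviations, the rest is unchanged.
rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 500)
reference = pd.DataFrame({"stable": base, "shifted": base, "segment": [0, 1] * 250})
current = pd.DataFrame({"stable": base, "shifted": base + 2.0, "segment": [0, 1] * 250})

results = detect_drift(reference, current, ["stable", "shifted"], ["segment"])
for name, row in results.items():
    print(name, row)
```

Each feature gets its own verdict, which mirrors the one-feature-at-a-time evaluation described here.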

How it looks

The report includes 3 components. All plots are interactive.

1. Data Drift Table

The table shows the drifting features first, sorted by p-value. You can also sort the rows by feature name or type.

2. Data Drift by Feature

By clicking on each feature, you can explore the values mapped in a plot.

  • The dark green line is the mean, as seen in the reference dataset.

  • The green area covers one standard deviation from the mean.
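The reference line and band in this plot can be reproduced numerically. The sketch below assumes the band is the reference mean plus or minus one population standard deviation (`ddof=0`), since the text does not say which estimator the plot uses. The helper names are hypothetical.

```python
import numpy as np


def reference_band(reference_values):
    """Return (lower, mean, upper) for a band of one standard deviation."""
    values = np.asarray(reference_values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (assumption)
    return mean - std, mean, mean + std


def share_outside_band(current_values, band):
    """Fraction of current values that fall outside the reference band."""
    lower, _, upper = band
    current = np.asarray(current_values, dtype=float)
    return float(np.mean((current < lower) | (current > upper)))


# Hypothetical values for a single feature.
band = reference_band([1, 2, 3, 4, 5])
outside = share_outside_band([3, 4, 10, 0], band)
print(band, outside)
```

A large share of current values outside the band gives a quick hint about where to look, but it is not a substitute for the statistical test.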

3. Data Distribution by Feature

You can also zoom in on the distributions to understand what has changed.

When to use this report

Here are a few ideas on when to use the report:

  1. Support model maintenance. Decide when to retrain the model, or which features to drop when they are too volatile.

  2. Before acting on predictions. Check that the new data is from the same distribution.

  3. When debugging model decay. If the model quality dropped, use the tool to explore where the change comes from.

  4. In an A/B test or trial. Detect training-serving skew and get better context to interpret test results.

  5. Before deployment. Understand drift in the offline environment. Explore past shifts in the data to define retraining needs and monitoring strategy.

  6. To find useful features when building a model. You can also use the tool to compare feature distributions in different classes to surface the best discriminants.

JSON Profile

If you choose to generate a JSON profile, it will contain the following information:

{
  "data_drift": {
    "name": "data_drift",
    "datetime": "datetime",
    "data": {
      "utility_columns": {
        "date": null,
        "id": null,
        "target": null,
        "prediction": null
      },
      "cat_feature_names": [],
      "num_feature_names": [],
      "metrics": {
        "feature_name": {
          "prod_small_hist": [
            [],
            []
          ],
          "ref_small_hist": [
            [],
            []
          ],
          "feature_type": "num",
          "p_value": p_value
        }
      }
    }
  },
  "timestamp": "timestamp"
}
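Once the profile is saved, the per-feature p-values can be read back with the standard `json` module. The sketch below is only an illustration. It uses a trimmed, hand-written payload that follows the `data_drift` → `data` → `metrics` nesting shown above, with made-up feature names and p-values. The `drifted_features` helper and its 0.05 threshold are assumptions.

```python
import json

# Hypothetical, trimmed profile: feature names and p-values are invented.
profile_json = """
{
  "data_drift": {
    "name": "data_drift",
    "datetime": "datetime",
    "data": {
      "cat_feature_names": ["segment"],
      "num_feature_names": ["age", "income"],
      "metrics": {
        "age": {"feature_type": "num", "p_value": 0.01},
        "income": {"feature_type": "num", "p_value": 0.002},
        "segment": {"feature_type": "cat", "p_value": 0.64}
      }
    }
  }
}
"""


def drifted_features(profile, threshold=0.05):
    """List (feature, p_value) pairs below the threshold, lowest p-value first."""
    metrics = profile["data_drift"]["data"]["metrics"]
    rows = sorted(metrics.items(), key=lambda item: item[1]["p_value"])
    return [(name, m["p_value"]) for name, m in rows if m["p_value"] < threshold]


profile = json.loads(profile_json)
flagged = drifted_features(profile)
print(flagged)
```

Sorting by p-value mirrors the ordering of the Data Drift Table, so the same features appear at the top in both views.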

Examples

  • Browse our examples for sample Jupyter notebooks.

You can also read the initial release blog.