Classification Performance

TL;DR: You can use the pre-built Reports and Test Suites to analyze the performance of a classification model. The Presets work for binary and multi-class classification, both probabilistic and non-probabilistic.

  • Report: for visual analysis or metrics export, use the ClassificationPreset.

  • Test Suite: for pipeline checks, use the MulticlassClassificationTestPreset, BinaryClassificationTopKTestPreset or BinaryClassificationTestPreset.

Use Case

These presets help evaluate and test the quality of classification models. You can use them:

1. To monitor the performance of a classification model in production. You can run the test suite as a regular job (e.g., weekly or when you get the labels) to contrast the model performance against the expectation. You can generate visual reports for documentation and sharing with stakeholders.

2. To trigger or decide on the model retraining. You can use the test suite to check if the model performance is below the threshold to initiate a model update.

3. To debug or improve model performance. If you detect a quality drop, you can use the visual report to explore the model errors and underperforming segments. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g., users from a specific region). You can also combine it with the Data Drift report.

4. To analyze the results of the model test. You can explore the results of an online or offline test and contrast it to the performance in training. You can also use this report to compare the model performance in an A/B test or during a shadow model deployment.

To run performance checks as part of the pipeline, use the Test Suite. To explore and debug, use the Report.
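For example, a scheduled job can act on the Test Suite output programmatically. Below is a minimal sketch, assuming two pandas DataFrames with the reference data and the latest labeled batch; the downstream actions are placeholders, and the exact keys of the dictionary output may differ between Evidently versions.

from evidently.test_suite import TestSuite
from evidently.test_preset import BinaryClassificationTestPreset

# Sketch: a scheduled job that checks the latest labeled batch against the reference.
performance_suite = TestSuite(tests=[BinaryClassificationTestPreset()])
performance_suite.run(reference_data=reference_df, current_data=current_df)  # assumed DataFrames

results = performance_suite.as_dict()
if not results["summary"]["all_passed"]:
    # Hypothetical downstream actions: send an alert, open a ticket,
    # or trigger a retraining pipeline.
    print("Performance checks failed:", results["summary"])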

Classification Performance Report

If you want to visually explore the model performance, create a new Report object and include the ClassificationPreset.

Code example

from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

classification_performance_report = Report(metrics=[
    ClassificationPreset(),
])

classification_performance_report.run(reference_data=bcancer_ref, current_data=bcancer_cur)

classification_performance_report

How it works

This report evaluates the quality of a classification model.

  • Can be generated for a single dataset or compare current performance against a reference (e.g., past performance or an alternative model).

  • Works for binary and multi-class, probabilistic and non-probabilistic classification.

  • Displays a variety of metrics and plots related to the model performance.

  • Helps explore regions where the model makes different types of errors.

Data Requirements

To run this report, you need to have both target and prediction columns available. Input features are optional. Pass them if you want to explore the relations between features and target.

Refer to the column mapping section to see how to pass model predictions and labels in different cases.
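For example, a minimal sketch of a column mapping for a probabilistic multi-class setup, reusing the Report object from the code example above (the column and DataFrame names are assumptions):

from evidently import ColumnMapping

# Sketch: tell Evidently which columns hold the labels and predictions.
# "target" is the true label; for probabilistic multi-class classification,
# "prediction" is the list of columns with predicted probabilities per class.
column_mapping = ColumnMapping(
    target="target",
    prediction=["setosa", "versicolor", "virginica"],  # assumed class probability columns
)

classification_performance_report.run(
    reference_data=reference_df,   # assumed DataFrames
    current_data=current_df,
    column_mapping=column_mapping,
)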

The tool does not yet work for multi-label classification. It expects a single true label.

To generate a comparative report, you will need two datasets.

You can also run this report for a single dataset, with no comparison performed.

How it looks

The report includes multiple components. The composition might vary based on problem type (there are more plots in the case of probabilistic classification). All plots are interactive.

Aggregated visuals in plots. Starting from v0.3.2, all visuals in Evidently Reports are aggregated by default. This decreases the load time and report size for larger datasets. If you work with smaller datasets or samples, you can pass an option to generate plots with raw data. Choose based on the size of your dataset.
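For instance, a sketch of passing the render option (the exact option format may differ between Evidently versions):

from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Sketch: request plots built on raw data instead of aggregated visuals.
# Suitable for smaller datasets or samples.
classification_performance_report = Report(
    metrics=[ClassificationPreset()],
    options={"render": {"raw_data": True}},
)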

1. Model Quality Summary Metrics

Evidently calculates a few standard model quality metrics: Accuracy, Precision, Recall, F1-score, ROC AUC, and LogLoss.

To support the model performance analysis, Evidently also generates interactive visualizations. They help analyze where the model makes mistakes and come up with improvement ideas.
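For reference, the same summary metrics can be reproduced with scikit-learn. A toy sketch for a binary task (replace the arrays with your own data):

from sklearn import metrics
import numpy as np

# Toy binary example: true labels, predicted labels, and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
y_proba = np.array([0.2, 0.9, 0.4, 0.3, 0.8, 0.6])  # probability of the positive class

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)
roc_auc = metrics.roc_auc_score(y_true, y_proba)
logloss = metrics.log_loss(y_true, y_proba)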

2. Class Representation

Shows the number of objects of each class.

3. Confusion Matrix

Visualizes the classification errors and their type.

4. Quality Metrics by Class

Shows the model quality metrics for the individual classes. In the case of multi-class problems, it will also include ROC AUC.

5. Class Separation Quality

A scatter plot of the predicted probabilities shows correct and incorrect predictions for each class.

It serves as a representation of both model accuracy and the quality of its calibration. It also helps visually choose the best probability threshold for each class.

6. Probability Distribution

A similar view to the one above: it shows the distribution of predicted probabilities.

7. ROC Curve

The ROC curve (receiver operating characteristic curve) shows the trade-off between the true positive rate and the false positive rate at different classification thresholds.

8. Precision-Recall Curve

The precision-recall curve shows the trade-off between precision and recall for different classification thresholds.
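Both curves are built from the predicted probabilities evaluated at all possible thresholds. For illustration, a sketch of how the underlying points can be computed with scikit-learn for a binary task (toy data, not Evidently's internal implementation):

from sklearn.metrics import roc_curve, precision_recall_curve
import numpy as np

# Toy binary example: true labels and the predicted probability of the positive class.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_proba = np.array([0.2, 0.9, 0.4, 0.3, 0.8, 0.6])

# Points behind the ROC curve: false positive rate vs. true positive rate.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_proba)

# Points behind the precision-recall curve.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_proba)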

9. Precision-Recall Table

The table shows possible outcomes for different classification thresholds and prediction coverage. If you have two datasets, the table is generated for both.

Each line in the table defines a case when only top-X% predictions are considered, with a 5% step. It shows the absolute number of predictions (Count) and the probability threshold (Prob) that correspond to this combination.

The table then shows the quality metrics for a given combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP).

This helps explore the quality of the model if you choose to act only on some of the predictions.
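To make the logic concrete, here is a minimal sketch of how such a table can be built for a binary task with pandas. This is an illustration on toy data, not Evidently's internal implementation:

import numpy as np
import pandas as pd

# Toy binary example: true labels and the predicted probability of the positive class.
rng = np.random.default_rng(0)
y_proba = rng.random(1000)
y_true = (rng.random(1000) < y_proba).astype(int)

df = pd.DataFrame({"target": y_true, "proba": y_proba}).sort_values("proba", ascending=False)

rows = []
for top in range(5, 101, 5):                 # top-X% of predictions, with a 5% step
    count = int(len(df) * top / 100)         # absolute number of predictions (Count)
    selected = df.head(count)
    threshold = selected["proba"].min()      # probability threshold (Prob) for this cut-off
    tp = int(selected["target"].sum())       # true positives among the selected predictions
    fp = count - tp                          # false positives among the selected predictions
    rows.append({
        "Top(%)": top,
        "Count": count,
        "Prob": round(float(threshold), 3),
        "TP": tp,
        "FP": fp,
        "Precision": round(tp / count, 3),
        "Recall": round(tp / int(df["target"].sum()), 3),
    })

pr_table = pd.DataFrame(rows)
print(pr_table)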

10. Classification Quality by Feature

In this table, we show a number of plots for each feature. To expand the plots, click on the feature name.

In the tab “ALL”, you can see the distribution of classes against the values of the feature. If you compare the two datasets, it visually shows the changes in the feature distribution and in the relationship between the values of the feature and the target.

For each class, you can see the predicted probabilities alongside the values of the feature.

It visualizes the regions where the model makes errors of each type and reveals the low-performance segments. You can compare the distributions and see if the errors are sensitive to the values of a given feature.

Metrics output

You can get the report output as a JSON or a Python dictionary:
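A sketch of the usual export calls on the Report object (check the API reference for your Evidently version):

# Export the computed metrics instead of (or in addition to) the visual report.
report_json = classification_performance_report.json()      # JSON string
report_dict = classification_performance_report.as_dict()   # Python dictionary
classification_performance_report.save_html("classification_report.html")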

See JSON example
{
  "probabilistic_classification_performance": {
    "name": "probabilistic_classification_performance",
    "datetime": "datetime",
    "data": {
      "utility_columns": {
        "date": null,
        "id": null,
        "target": "target",
        "prediction": [
          "label1",
          "label2",
          "label3"
        ]
      },
      "cat_feature_names": [],
      "num_feature_names": [],
      "metrics": {
        "reference": {
          "accuracy": accuracy,
          "precision": precision,
          "recall": recall,
          "f1": f1,
          "roc_auc": roc_auc,
          "log_loss": log_loss,
          "metrics_matrix": {
            "label1": {
              "precision": precision,
              "recall": recall,
              "f1-score": f1,
              "support": support
            },
            "accuracy": accuracy,
            "macro avg": {
              "precision": precision,
              "recall": recall,
              "f1-score": f1,
              "support": support
            },
            "weighted avg": {
              "precision": precision,
              "recall": recall,
              "f1-score": f1,
              "support": support
            }
          },
          "roc_aucs": [
            roc_auc_label_1,
            roc_auc_label_2,
            roc_auc_label_3
          ],
          "confusion_matrix": {
            "labels": [],
            "values": []
          },
          "roc_curve": {
            "label1": {
              "fpr": [],
              "tpr": [],
              "thrs": []
          },  
          "pr_curve": {
            "label1": []
        },
        "current": {
          "accuracy": accuracy,
          "precision": precision,
          "recall": recall,
          "f1": f1,
          "roc_auc": roc_auc,
          "log_loss": log_loss,
          "metrics_matrix": {
            "label1": {
              "precision": precision,
              "recall": recall,
              "f1-score": f1,
              "support": support
          },
          "roc_aucs": [
            roc_auc_label_1,
            roc_auc_label_2,
            roc_auc_label_3
          ],
          "confusion_matrix": {
            "labels": [],
            "values": [],
          },
          "roc_curve": {
            "label1": {
              "fpr": [],
              "tpr": [],
              "thrs": []
          },
          "pr_curve": {
            "label1": []
          }
        }
      }
    }
  },
  "timestamp": "timestamp"
}

Report customization

  • You can perform the analysis of relations between features and target only for selected columns.

  • You can pass relevant parameters to change the way some of the metrics are calculated, such as the decision threshold or K to evaluate precision@K, as sketched after this list. See the available parameters here.

  • If you want to exclude some of the metrics, you can create a custom report by combining the chosen metrics. See the complete list here.
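A sketch of passing such parameters to the preset (the column names and threshold value are assumptions; check the parameters reference for the exact signature in your version):

from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Sketch: limit the feature analysis to selected columns and change
# the decision threshold used to turn probabilities into labels.
custom_classification_report = Report(metrics=[
    ClassificationPreset(
        columns=["age", "tenure"],   # assumed feature names
        probas_threshold=0.7,        # custom decision threshold
    ),
])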

Classification Performance Test Suite

If you want to run classification performance checks as part of a pipeline, you can create a Test Suite and use one of the classification presets. There are several presets for different classification tasks. They apply to multi-class classification, binary classification at top-K, and binary classification, respectively:

MulticlassClassificationTestPreset
BinaryClassificationTopKTestPreset
BinaryClassificationTestPreset

Code example

from evidently.test_suite import TestSuite
from evidently.test_preset import BinaryClassificationTopKTestPreset

binary_topK_classification_performance = TestSuite(tests=[
    BinaryClassificationTopKTestPreset(k=10),
])

binary_topK_classification_performance.run(reference_data=ref, current_data=cur)
binary_topK_classification_performance

How it works

You can use the test presets to evaluate the quality of a classification model when you have the ground truth labels.

  • Each preset computes the quality metrics relevant to the model type and compares them against the defined expectation.

  • They also test for target drift to detect a shift in the distribution of classes and/or probabilities, which might indicate emerging concept drift.

  • For Evidently to generate the test conditions automatically, you should pass the reference dataset (e.g., performance during model validation or a previous period). You can also set the performance expectations manually by passing a custom test condition.

  • If you neither pass the reference dataset nor set custom test conditions, Evidently will compare the model performance to a dummy model.

Head to the All tests table to see the composition of each preset and the default parameters.

Test Suite customization

  • You can set custom test conditions.

  • You can pass relevant parameters to change how some of the metrics are calculated, such as classification decision threshold or K to evaluate precision@K. See the available parameters.

  • If you want to exclude some tests or add additional ones, you can create a custom test suite by combining the chosen tests, as sketched after this list. See the complete list here.
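For instance, a sketch of a custom Test Suite that combines individual classification tests with explicit conditions (the threshold values and DataFrame names are assumptions):

from evidently.test_suite import TestSuite
from evidently.tests import TestAccuracyScore, TestF1Score, TestRocAuc

# Sketch: a custom suite with manually set conditions instead of a preset.
custom_performance_suite = TestSuite(tests=[
    TestAccuracyScore(gte=0.8),   # fail if accuracy drops below 0.8
    TestF1Score(gte=0.75),        # fail if F1 drops below 0.75
    TestRocAuc(),                 # condition derived from the reference data
])

custom_performance_suite.run(reference_data=ref, current_data=cur)
custom_performance_suite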

Examples

  • Browse the examples for sample Jupyter notebooks and Colabs.

  • See the blog post and tutorial "What is your model hiding?", where we analyze the performance of two models with identical ROC AUC to choose between the two.
