Probabilistic Classification Performance
TL;DR: The report analyzes the performance of a probabilistic classification model.
Works for a single model or helps compare two models
Works for binary and multi-class classification
Displays a variety of plots related to the model performance
Helps explore regions where the model makes different types of errors
The Probabilistic Classification Performance report evaluates the quality of a probabilistic classification model. It works for both binary and multi-class classification.
If you have a non-probabilistic classification model, refer to the Classification Performance report instead.
This report can be generated for a single model, or as a comparison. You can contrast your current production model performance against the past or an alternative model.
To run this report, you need to have input features, and both target and prediction columns available.
In the column mapping, you need to specify the names of your Prediction columns. The tool expects a separate column for each class, even for binary classification.
NOTE: Column order in binary classification. For binary classification, class order matters: the tool expects that the target (so-called positive) class is the first in the column_mapping['prediction'] list.
The column names can be numerical labels like "0", "1", "2" or class names like "virginica", "setosa", "versicolor". Each column should contain the predicted probability in the [0, 1] range for the corresponding class.
You can find an example below:
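For instance, here is a minimal sketch of the expected layout, with hypothetical class and column names and a dictionary-style column mapping (as implied by column_mapping['prediction'] above):

```python
import pandas as pd

# Each class gets its own column with predicted probabilities in [0, 1];
# the true labels in "target" use the same class names as the columns.
df = pd.DataFrame({
    "target": ["setosa", "virginica", "versicolor", "setosa"],
    "setosa":     [0.92, 0.05, 0.10, 0.70],
    "virginica":  [0.03, 0.90, 0.15, 0.20],
    "versicolor": [0.05, 0.05, 0.75, 0.10],
})

column_mapping = {}
column_mapping["target"] = "target"
# One entry per class; for binary classification, put the positive class first.
column_mapping["prediction"] = ["setosa", "virginica", "versicolor"]
```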
The Target column should contain the true labels that match the Prediction column names. The tool performs the matching and evaluates the model quality by looking for the names from the "prediction" list inside the Target column.
The tool does not yet work for multi-label classification. It expects a single true label.
To generate a comparative report, you will need two datasets. The reference dataset serves as a benchmark. We analyze the change by comparing the current production data to the reference data.
You can also run this report for a single DataFrame, with no comparison performed. In this case, pass it as reference_data.
The report includes 10 components. All plots are interactive.
We calculate a few standard model quality metrics: Accuracy, Precision, Recall, F1-score, ROC AUC, and LogLoss.
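As a rough illustration (not the report's internal implementation), the same metrics can be reproduced with scikit-learn, reusing the hypothetical df and column_mapping from the sketch above and taking the most probable class as the predicted label:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

classes = column_mapping["prediction"]            # probability columns, one per class
probs = df[classes].to_numpy()
y_true = df["target"].to_numpy()
y_pred = np.array(classes)[probs.argmax(axis=1)]  # predicted label = most probable class

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print("ROC AUC  :", roc_auc_score(y_true, probs, multi_class="ovr", labels=classes))
print("LogLoss  :", log_loss(y_true, probs, labels=classes))
```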
To support the model performance analysis, we also generate interactive visualizations. They help you see where the model makes mistakes and come up with improvement ideas.
Shows the number of objects of each class.
Visualizes the classification errors and their type.
Shows the model quality metrics for the individual classes. In the case of multi-class problems, it will also include ROC AUC.
A scatter plot of the predicted probabilities that shows correct and incorrect predictions for each class.
It serves as a representation of both model accuracy and the quality of its calibration. It also helps visually choose the best probability threshold for each class.
A similar view to the one above: it shows the distribution of predicted probabilities.
The ROC curve (receiver operating characteristic curve) shows the true positive rate against the false positive rate at different classification thresholds.
The precision-recall curve shows the trade-off between precision and recall for different classification thresholds.
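Both curves are built from the predicted probabilities of one class at a time. A sketch with scikit-learn, again reusing the hypothetical df and column_mapping from the earlier example and treating the first listed class as the positive one:

```python
from sklearn.metrics import roc_curve, precision_recall_curve

positive_class = column_mapping["prediction"][0]           # e.g. "setosa"
y_true_binary = (df["target"] == positive_class).astype(int)
y_score = df[positive_class]                               # predicted probability of that class

fpr, tpr, roc_thresholds = roc_curve(y_true_binary, y_score)
precision, recall, pr_thresholds = precision_recall_curve(y_true_binary, y_score)
```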
The table shows possible outcomes for different classification thresholds and prediction coverage. If you have two datasets, the table is generated for both.
Each line in the table defines a case when only top-X% predictions are considered, with a 5% step. It shows the absolute number of predictions (Count) and the probability threshold (Prob) that correspond to this combination.
The table then shows the quality metrics for a given combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP).
This helps explore the quality of the model if you choose to act only on some of the predictions.
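A minimal sketch of this logic (not the report's own implementation), again reusing the hypothetical df and column_mapping and a single positive class:

```python
import numpy as np
import pandas as pd

positive_class = column_mapping["prediction"][0]        # assumed positive class
y_true = (df["target"] == positive_class).astype(int).to_numpy()
scores = df[positive_class].to_numpy()
order = np.argsort(-scores)                             # most confident predictions first

rows = []
for top in range(5, 105, 5):                            # top-5%, top-10%, ..., top-100%
    count = max(1, int(round(len(scores) * top / 100)))
    selected = order[:count]
    threshold = scores[selected].min()                  # probability cut-off at this coverage
    tp = int(y_true[selected].sum())
    fp = count - tp
    rows.append({
        "Top %": top,
        "Count": count,
        "Prob": threshold,
        "TP": tp,
        "FP": fp,
        "Precision": tp / count,
        "Recall": tp / max(1, int(y_true.sum())),
    })

quality_table = pd.DataFrame(rows)
print(quality_table)
```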
In this table, we show a number of plots for each feature. To expand the plots, click on the feature name.
If you compare the two datasets, it visually shows the changes in the feature distribution and in the relationship between the values of the feature and the target.
Then, for each class, we plot the predicted probabilities alongside the values of the feature.
It visualizes the regions where the model makes errors of each type and reveals the low-performance segments. You can compare the distributions and see if the errors are sensitive to the values of a given feature.
1. To analyze the results of the model test. You can explore the results of an online or offline test and contrast it to the performance in training. Though this is not the primary use case, you can use this report to compare the model performance in an A/B test, or during a shadow model deployment.
2. To generate regular reports on the performance of a production model. You can run this report as a regular job (e.g. weekly or at every batch model run) to analyze its performance and share it with other stakeholders.
3. To analyze the model performance on the slices of data. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g. users from a specific region).
4. To trigger or decide on the model retraining. You can use this report to check if your performance is below the threshold to initiate a model update and evaluate if retraining is likely to improve performance.
5. To debug or improve model performance. You can use the Classification Quality table to identify underperforming segments and decide on the ways to address them.
If you choose to generate a JSON profile, it will contain the computed metrics in a machine-readable format.
In the tab “ALL”, we plot the distribution of classes against the values of the feature. This is the “Target Behavior by Feature” plot from the report.
You can set a custom classification threshold or cut the data above a given quantile from the histogram plots in the "Classification Quality by Feature" table.
You can select which components of the report to display or choose to show a short version of the report.
If you want to create a new plot or metric, you can add a custom widget.
Here are our suggestions on when to use it. You can also combine it with other reports to get a comprehensive picture.
Browse our examples for sample Jupyter notebooks.
See our tutorial where we analyze the performance of two models with identical ROC AUC to choose between them.