TL;DR: You can use Evidently to compare the performance of two models on the same data, for example during testing or in a shadow deployment. Here is an example Jupyter notebook.
In this tutorial, we train two classification models that predict employee attrition. Their ROC AUC scores are similar, so we compare them in more detail to see where their performance actually differs. The tutorial uses the Probabilistic Classification report.
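To see why two models with similar ROC AUC can still behave differently, here is a minimal sketch in plain Python. It does not use Evidently or the attrition dataset from the tutorial: the labels and scores are made-up toy values, and `roc_auc` and `precision_recall` are hypothetical helpers written only for this illustration. ROC AUC depends only on how a model ranks the examples, so two models that rank them identically get the same score even when their predicted probabilities sit on different scales.

```python
# Toy illustration (assumed data, not the tutorial's dataset):
# two models rank the examples identically, so ROC AUC is equal,
# but their probabilities differ, so a fixed threshold treats them differently.

def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold=0.5):
    """Precision and recall when scores >= threshold are labeled positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# 1 = employee left, 0 = employee stayed (hypothetical labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical predicted probabilities from two models
scores_a = [0.90, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.20, 0.10, 0.10]
scores_b = [0.60, 0.45, 0.30, 0.15, 0.35, 0.20, 0.10, 0.10, 0.05, 0.05]

for name, scores in [("model A", scores_a), ("model B", scores_b)]:
    auc = roc_auc(y_true, scores)
    precision, recall = precision_recall(y_true, scores)
    print(f"{name}: ROC AUC={auc:.3f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data both models reach the same ROC AUC, yet at a 0.5 threshold model A flags more of the leavers at the cost of a false alarm, while model B is more cautious and misses most of them. This kind of gap is what a single aggregate metric can hide.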