Ranking
Recall
Evidently Metric: RecallTopK.
Recall at K reflects the ability of the recommender or ranking system to retrieve all relevant items within the top K results.
Implemented method:
- Compute recall at K by user. Compute the recall at K for each individual user (or query) by measuring the share of all relevant items in the dataset that appear in the top K results.
- Compute overall recall. Average the results across all users (queries) in the dataset.
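A minimal sketch of this logic (not Evidently's internal implementation), using a hypothetical `recall_at_k` helper over plain lists of recommended item IDs and sets of relevant item IDs:

```python
def recall_at_k(recommended, relevant, k):
    """Share of all relevant items for this user that appear in the top-K results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

# Overall recall at K: average the per-user values.
per_user = [
    recall_at_k(["a", "b", "c"], {"a", "d"}, k=3),  # 1 of 2 relevant items retrieved -> 0.5
    recall_at_k(["x", "y", "z"], {"z"}, k=3),       # 1 of 1 -> 1.0
]
overall_recall_at_k = sum(per_user) / len(per_user)  # 0.75
```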
Precision
Evidently Metric: PrecisionTopK.
Precision at K reflects the ability of the system to suggest items that are truly relevant to the users’ preferences or queries.
Implemented method:
- Compute precision at K by user. Compute the precision at K for each user (or query) by measuring the share of the relevant results within the top K.
- Compute overall precision. Average the results across all users (queries) in the dataset.
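A similar sketch for precision at K, again with a hypothetical `precision_at_k` helper rather than Evidently's own code:

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-K results that are relevant for this user."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Overall precision at K: average the per-user values.
per_user = [
    precision_at_k(["a", "b", "c"], {"a", "d"}, k=3),  # 1 of 3 results is relevant
    precision_at_k(["x", "y", "z"], {"y", "z"}, k=3),  # 2 of 3 results are relevant
]
overall_precision_at_k = sum(per_user) / len(per_user)  # 0.5
```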
F Beta
Evidently Metric: FBetaTopK.
The F Beta score at K combines precision and recall into a single value, providing a balanced measure of a recommendation or ranking system’s performance.
Beta is a parameter that determines the weight assigned to recall relative to precision. Beta > 1 gives more weight to recall, while beta < 1 favors precision.
If beta = 1 (default), it is a traditional F1 score that provides a harmonic mean of precision and recall at K. It provides a balanced estimation, considering both false positives (items recommended that are not relevant) and false negatives (relevant items not recommended).
Range: 0 to 1.
Interpretation: Higher F Beta at K values indicate better overall performance.
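As an illustration, a hypothetical `fbeta_at_k` helper that combines precision and recall at K using the standard weighted harmonic mean formula:

```python
def fbeta_at_k(precision_k, recall_k, beta=1.0):
    """Weighted harmonic mean of precision at K and recall at K."""
    if precision_k == 0.0 and recall_k == 0.0:
        return 0.0
    beta2 = beta ** 2
    return (1 + beta2) * precision_k * recall_k / (beta2 * precision_k + recall_k)

f1 = fbeta_at_k(0.5, 0.25)            # beta = 1: classic F1 at K -> 0.333...
f2 = fbeta_at_k(0.5, 0.25, beta=2.0)  # beta > 1: recall weighted higher -> 0.277...
```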
Mean average precision (MAP)
Evidently Metric: MAP.
MAP (Mean Average Precision) at K assesses the ability of the recommender or retrieval system to suggest relevant items in the top-K results, while placing more relevant items at the top.
Compared to precision at K, MAP at K is rank-aware. It penalizes the system for placing relevant items lower in the list, even if the total number of relevant items at K is the same.
Implemented method:
- Compute Average Precision (AP) at K by user. The Average Precision at K is computed for each user (or query) as an average of precision values at each relevant item position within the top K. To do that, we sum up precision at all values of K when the item is relevant (e.g., Precision@1, Precision@2, ...), and divide it by the total number of relevant items within the top K.
- Compute Mean Average Precision (MAP) at K. Average the results across all users (or queries) in the dataset.
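A sketch of the per-user Average Precision calculation described above, assuming the divisor is the number of relevant items found within the top K (hypothetical helper names, not Evidently's API):

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of the precision values at each relevant position within the top K."""
    hits, precision_sum = 0, 0.0
    for position, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cutoff
    return precision_sum / hits if hits else 0.0

# MAP at K: average the per-user AP values.
ap_values = [
    average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4),  # (1/1 + 2/3) / 2
    average_precision_at_k(["x", "y", "z", "w"], {"w"}, k=4),       # (1/4) / 1
]
map_at_k = sum(ap_values) / len(ap_values)
```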
Mean average recall (MAR)
Evidently Metric: MAR.
MAR (Mean Average Recall) at K assesses the ability of a recommendation system to retrieve all relevant items within the top-K results, averaged over all relevant positions within the top K.
Implemented method:
- Compute the average recall at K by user. Compute and average the recall at each relevant position within the top K for every user (or query). To do that, we sum up the recall at all values of K when the item is relevant (e.g., Recall@1, Recall@2, ...), and divide it by the total number of relevant recommendations within the top K.
- Compute mean average recall at K. Average the results across all users (or queries).
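A corresponding sketch for per-user average recall at K, again with hypothetical helpers (not Evidently's internal code):

```python
def average_recall_at_k(recommended, relevant, k):
    """Average of the recall values at each relevant position within the top K."""
    if not relevant:
        return 0.0
    found, recall_sum = 0, 0.0
    for item in recommended[:k]:
        if item in relevant:
            found += 1
            recall_sum += found / len(relevant)  # recall at this cutoff
    return recall_sum / found if found else 0.0

# MAR at K: average the per-user values.
users = [
    (["a", "b", "c", "d"], {"a", "c"}),  # (1/2 + 2/2) / 2 = 0.75
    (["x", "y", "z", "w"], {"w", "q"}),  # (1/2) / 1 = 0.5
]
mar_at_k = sum(average_recall_at_k(recs, rel, k=4) for recs, rel in users) / len(users)
```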
Normalized Discounted Cumulative Gain (NDCG)
Evidently Metric: NDCG.
NDCG (Normalized Discounted Cumulative Gain) at K reflects the ranking quality, comparing it to an ideal order where all relevant items for each user (or query) are placed at the top of the list.
Implemented method:
- Provide the item relevance score. You can assign a relevance score to each item in each user's (or query's) top-K list. Depending on the model type, it can be a binary outcome (1 is relevant, 0 is not) or a score.
- Compute the discounted cumulative gain (DCG) at K for each user or query. DCG at K measures the quality of the ranking (total relevance) of a list of top-K items. We apply a logarithmic discount to account for diminishing returns from each following item lower on the list. To get the resulting DCG, compute a weighted sum of the relevance scores of all items from the top of the list to position K, with the discount applied.
- Compute the normalized DCG (NDCG). To normalize the metric, we divide the resulting DCG by the ideal DCG (IDCG) at K. Ideal DCG at K represents the maximum achievable DCG when the items are perfectly ranked in descending order of relevance.
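A sketch of DCG and NDCG at K with binary relevance and a linear gain; Evidently's exact gain and discount choices may differ:

```python
import numpy as np

def dcg_at_k(relevance, k):
    """Weighted sum of relevance scores with a logarithmic position discount."""
    scores = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, scores.size + 2))  # positions 1..K -> log2(2)..log2(K+1)
    return float(np.sum(scores / discounts))

def ndcg_at_k(relevance, k):
    """DCG normalized by the ideal DCG (the same items ranked by descending relevance)."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Binary relevance of one user's top-5 list; average the per-user NDCG for the overall value.
ndcg_at_k([1, 0, 1, 0, 0], k=5)  # ~0.92
```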
Hit Rate
Evidently Metric: HitRate.
Hit Rate at K calculates the share of users or queries for which at least one relevant item is included in the top K.
Implemented method:
- Compute “hit” for each user. For each user or query, we evaluate if any of the top-K recommended items is relevant. It is a binary metric equal to 1 if any relevant item is included in the top K, or 0 otherwise.
- Compute average hit rate. The average of this metric is calculated across all users or queries.
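A minimal sketch of the hit rate calculation, with a hypothetical `hit_at_k` helper:

```python
def hit_at_k(recommended, relevant, k):
    """1 if at least one of the top-K recommended items is relevant, else 0."""
    return int(any(item in relevant for item in recommended[:k]))

users = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"q"})]
hit_rate_at_k = sum(hit_at_k(recs, rel, k=3) for recs, rel in users) / len(users)  # 0.5
```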
Mean Reciprocal Rank (MRR)
Evidently Metric: MRR.
Mean Reciprocal Rank (MRR) measures the ranking quality considering the position of the first relevant item in the list.
Implemented method:
- For each user or query, identify the position of the first relevant item in the recommended list.
- Calculate the reciprocal rank, taking the reciprocal of the position of the first relevant item for each user or query (i.e., 1/position). Example: if the first relevant item is at the top of the list, the reciprocal rank is 1; if it is in the 2nd position, the reciprocal rank is ½; if in the 3rd, ⅓, and so on.
- Calculate the mean reciprocal rank (MRR). Compute the average reciprocal rank across all users or queries.
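A sketch of the MRR calculation with a hypothetical `reciprocal_rank` helper (not Evidently's internal code):

```python
def reciprocal_rank(recommended, relevant):
    """1 / position of the first relevant item, or 0 if no relevant item is ranked."""
    for position, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

users = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"z"})]
mrr = sum(reciprocal_rank(recs, rel) for recs, rel in users) / len(users)  # (1/2 + 1/3) / 2
```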
Score Distribution (Entropy)
Evidently Metric: ScoreDistribution.
This metric computes the predicted score entropy. It applies only when the recommendations_type is a score.
Implementation:
- Apply a softmax transformation to the top-K scores for all users.
- Compute the KL divergence (relative entropy in scipy).
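A rough sketch of this computation using scipy; how the per-user scores are pooled here is an assumption for illustration, not the exact Evidently implementation:

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy

# Hypothetical predicted scores of the top-K items for each user.
top_k_scores = np.array([[2.1, 1.4, 0.3], [1.8, 1.7, 0.2]])

# Softmax-transform the per-user top-K scores and pool them into one distribution.
probs = softmax(top_k_scores, axis=1).ravel()

# scipy.stats.entropy with one argument returns the Shannon entropy of the distribution;
# with a second (reference) distribution it returns the KL divergence (relative entropy).
score_entropy = entropy(probs)
```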
RecSys
These metrics are coming soon to the new Evidently API! Check the old docs for now.