For an intro, read Core Concepts and check quickstarts for LLMs or ML.

Text Evals

Summarizes results of text or LLM evals. To score individual inputs, first use descriptors.

Data definition. You may need to map text columns.
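
For example, here is a minimal sketch of summarizing descriptor results with TextEvals, assuming the current Evidently Python API (the DataFrame, column name, and descriptors are illustrative; exact import paths and signatures may differ between versions):

import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength
from evidently.presets import TextEvals

# Illustrative data with a single text column
df = pd.DataFrame({"response": ["Hello, how can I help?", "Please rephrase the question."]})

# Map the text column and compute row-level descriptors first
eval_data = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["response"]),
    descriptors=[TextLength("response"), Sentiment("response")],
)

# TextEvals() then aggregates the descriptor results; pass columns=[...] to limit the summary
report = Report([TextEvals()])
my_eval = report.run(eval_data, None)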

Metric | Description | Parameters | Test Defaults
TextEvals()
Optional:
  • columns
As in Metrics included in ValueStats.

Columns

Use to aggregate descriptor results or check data quality at the column level.

You may need to map column types using Data definition.

Value stats

Descriptive statistics.
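
A minimal sketch of running column-level stats in a Report, assuming the current Evidently Python API (the column name, the quantile parameter, and the Dataset placeholders are illustrative; import paths may differ between versions):

from evidently import Report
from evidently.metrics import MeanValue, MinValue, QuantileValue, ValueStats

report = Report([
    ValueStats(column="age"),                    # small preset with descriptive stats
    MinValue(column="age"),
    MeanValue(column="age"),
    QuantileValue(column="age", quantile=0.9),   # quantile defaults to 0.5 if not set
])

# current_data / reference_data: Dataset objects prepared via the data definition;
# pass a reference dataset to enable the default Test conditions listed below
my_eval = report.run(current_data, reference_data)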

Metric | Description | Parameters | Test Defaults
ValueStats()
  • Small Preset, column-level.
  • Computes various descriptive stats (min, max, mean, quantiles, most common, etc.)
  • Returns different stats based on the column type (text, categorical, numerical, datetime).
Required:
  • column
Optional:
  • No reference. As in individual Metrics.
  • With reference. As in individual Metrics.
MinValue()
  • Column-level.
  • Returns min value for a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the min value differs by more than 10% (+/-).
StdValue()
  • Column-level.
  • Computes the standard deviation of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the standard deviation differs by more than 10% (+/-).
MeanValue()
  • Column-level.
  • Computes the mean value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the mean value differs by more than 10%.
MaxValue()
  • Column-level.
  • Computes the max value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the max value is higher than in the reference.
MedianValue()
  • Column-level.
  • Computes the median value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the median value differs by more than 10% (+/-).
QuantileValue()
  • Column-level.
  • Computes the quantile value of a given numerical column.
  • Defaults to 0.5 if no quantile is specified.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if quantile value differs by more than 10% (+/-).
CategoryCount()

Example:
CategoryCount(
    column="city",
    category="NY")
  • Column-level.
  • Counts occurrences of the specified category or categories.
  • To check the joint share of several categories, pass the list categories=["a", "b"].
  • Metric result: count, share.
Required:
  • column
  • category or categories
Optional:
  • No reference. N/A.
  • With reference. Fails if the specified category is not present.

Column data quality

Column-level data quality metrics.

Data definition. You may need to map column types.
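
A minimal sketch combining several column-level data quality checks, assuming the current Evidently Python API (column names, values, and the Dataset placeholders are illustrative; import paths may differ between versions):

from evidently import Report
from evidently.metrics import (
    InRangeValueCount,
    MissingValueCount,
    OutListValueCount,
    UniqueValueCount,
)

report = Report([
    MissingValueCount(column="age"),
    InRangeValueCount(column="age", left=1, right=18),
    OutListValueCount(column="city", values=["Lon", "NY"]),
    UniqueValueCount(column="city"),
])

# current_data / reference_data: Dataset objects prepared via the data definition
my_eval = report.run(current_data, reference_data)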

Metric | Description | Parameters | Test Defaults
MissingValueCount()
  • Column-level.
  • Counts the number and share of missing values.
  • Metric result: count, share.
Required:
  • column
Optional:
  • No reference: Fails if there are missing values.
  • With reference: Fails if the share of missing values is >10% higher than in reference.
NewCategoriesCount() (Coming soon)
  • Column-level.
  • Counts new categories compared to reference (reference required).
  • Metric result: count, share.
Required:
  • column
Optional:
Expect 0.
MissingCategoriesCount() (Coming soon)
  • Column-level.
  • Counts missing categories compared to reference.
  • Metric result: count, share.
Required:
  • column
Optional:
Expect 0.
InRangeValueCount()

Example:
InRangeValueCount(
    column="age",
    left=1, right=18)
  • Column-level.
  • Counts the number and share of values in the set range.
  • Metric result: count, share.
Required:
  • column
  • left
  • right
Optional:
  • No reference: N/A.
  • With reference: Fails if column contains values out of the min-max reference range.
OutRangeValueCount()
  • Column-level.
  • Counts the number and share of values out of the set range.
  • Metric result: count, share.
Required:
  • column
  • left
  • right
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of min-max reference range.
InListValueCount()
  • Column-level.
  • Counts the number and share of values in the set list.
  • Metric result: count, share.
Required:
  • column
  • values
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of list.
OutListValueCount()

Example:
OutListValueCount(
    column="city",
    values=["Lon", "NY"])
  • Column-level.
  • Counts the number and share of values out of the set list.
  • Metric result: count, share.
Required:
  • column
  • values
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of list.
UniqueValueCount()
  • Column-level.
  • Counts the number and share of unique values.
  • Metric result: values (dict with count, share).
Required:
  • column
Optional:
  • No reference: N/A.
  • With reference: Fails if the share of unique values differs by >10% (+/-).
MostCommonValueCount() (Coming soon)
  • Column-level.
  • Identifies the most common value and provides its count/share.
  • Metric result: value (count, share).
Required:
  • column
Optional:
  • No reference: Fails if most common value share is ≥80%.
  • With reference: Fails if most common value share differs by >10% (+/-).

Dataset

Use for exploratory data analysis and data quality checks.

Data definition. You may need to map column types, ID and timestamp.

Dataset stats

Descriptive statistics.
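
A minimal sketch, assuming the current Evidently Python API (the Dataset placeholders are illustrative; import paths may differ between versions):

from evidently import Report
from evidently.metrics import ColumnCount, RowCount
from evidently.presets import DataSummaryPreset

report = Report([
    DataSummaryPreset(),   # or DataSummaryPreset(columns=["age", "city"]) to limit the scope
    RowCount(),
    ColumnCount(),
])

# current_data / reference_data: Dataset objects prepared via the data definition
my_eval = report.run(current_data, reference_data)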

Metric | Description | Parameters | Test Defaults
DataSummaryPreset()
  • Large Preset.
  • Combines DatasetStats and ValueStats for all or specified columns.
  • Metric result: for all Metrics.
  • Preset page
Optional:
  • columns
As in individual Metrics.
DatasetStats()
  • Small preset.
  • Dataset-level.
  • Calculates descriptive dataset stats, including columns by type, rows, missing values, empty columns, etc.
  • Metric result: for all Metrics.
None
  • No reference: As in included Metrics
  • With reference: As in included Metrics.
RowCount()
  • Dataset-level.
  • Counts the number of rows.
  • Metric result: value.
Optional:
  • No reference: N/A.
  • With reference: Fails if row count differs by >10%.
ColumnCount()
  • Dataset-level.
  • Counts the number of columns.
  • Metric result: value.
Optional:
  • No reference: N/A.
  • With reference: Fails if not equal to reference.

Dataset data quality

Dataset-level data quality metrics.

Data definition. You may need to map column types, ID and timestamp.
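
A minimal sketch of dataset-level data quality checks, assuming the current Evidently Python API (the Dataset placeholders are illustrative; import paths may differ between versions):

from evidently import Report
from evidently.metrics import (
    ConstantColumnsCount,
    DuplicatedRowCount,
    EmptyColumnsCount,
    EmptyRowsCount,
)

report = Report([
    ConstantColumnsCount(),
    DuplicatedRowCount(),
    EmptyColumnsCount(),
    EmptyRowsCount(),
])

# current_data / reference_data: Dataset objects prepared via the data definition
my_eval = report.run(current_data, reference_data)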

Metric | Description | Parameters | Test Defaults
ConstantColumnsCount()
  • Dataset-level.
  • Counts the number of constant columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one constant column.
  • With reference: Fails if count is higher than in reference.
EmptyRowsCount()
  • Dataset-level.
  • Counts the number of empty rows.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one empty row.
  • With reference: Fails if share differs by >10%.
EmptyColumnsCount()
  • Dataset-level.
  • Counts the number of empty columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one empty column.
  • With reference: Fails if count is higher than in reference.
DuplicatedRowCount()
  • Dataset-level.
  • Counts the number of duplicated rows.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one duplicated row.
  • With reference: Fails if share differs by >10% (+/-).
DuplicatedColumnsCount()
  • Dataset-level.
  • Counts the number of duplicated columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one duplicated column.
  • With reference: Fails if count is higher than in reference.
DatasetMissingValueCount()
  • Dataset-level.
  • Calculates the number and share of missing values.
  • Displays the number of missing values per column.
  • Metric result: value.
Required:
  • columns
Optional:
  • No reference: Fails if there are missing values.
  • With reference: Fails if share is >10% higher than reference (+/-).
AlmostEmptyColumnCount() (Coming soon)
  • Dataset-level.
  • Counts almost empty columns (95% empty).
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one almost empty column.
  • With reference: Fails if count is higher than in reference.
AlmostConstantColumnsCount()
  • Dataset-level.
  • Counts almost constant columns (95% identical values).
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one almost constant column.
  • With reference: Fails if count is higher than in reference.
RowsWithMissingValuesCount() (Coming soon)
  • Dataset-level.
  • Counts rows with missing values.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one row with missing values.
  • With reference: Fails if share differs by >10% (+/-)
ColumnsWithMissingValuesCount()
  • Dataset-level.
  • Counts columns with missing values.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one column with missing values.
  • With reference: Fails if count is higher than in reference.

Data Drift

Use to detect distribution drift for text, tabular, or embeddings data, or over computed text descriptors. 20+ drift methods are listed separately for text and tabular data and for embeddings.

Data definition. You may need to map column types, ID and timestamp.

Metric explainers. Understand how data drift works.
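
A minimal sketch, assuming the current Evidently Python API (the column name and method choices are illustrative examples of drift options; import paths may differ between versions):

from evidently import Report
from evidently.metrics import DriftedColumnsCount, ValueDrift
from evidently.presets import DataDriftPreset

report = Report([
    DataDriftPreset(),                               # drift for all columns with default methods
    DriftedColumnsCount(),                           # share of drifted columns only
    ValueDrift(column="age", method="wasserstein"),  # per-column method override
])

# Drift Metrics require a reference dataset
my_eval = report.run(current_data, reference_data)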

Metric | Description | Parameters | Test Defaults
DataDriftPreset()
  • Large Preset.
  • Requires reference.
  • Calculates data drift for all or set columns.
  • Uses the default or set method.
  • Returns drift score for each column.
  • Visualizes all distributions.
  • Metric result: all Metrics.
  • Preset page.
Optional:
  • columns
  • method
  • cat_method
  • num_method
  • per_column_method
  • threshold
  • cat_threshold
  • num_threshold
  • per_column_threshold
See drift options.
  • With reference: Data drift defaults, depending on column type. See drift methods.
DriftedColumnsCount()
  • Dataset-level.
  • Requires reference.
  • Calculates the number and share of drifted columns in the dataset.
  • Each column is tested for drift using the default algorithm or set method.
  • Returns only the total number of drifted columns.
  • Metric result: count, share.
Optional:
  • columns
  • method
  • cat_method
  • num_method
  • per_column_method
  • threshold
  • cat_threshold
  • num_threshold
  • per_column_threshold
See drift options.
  • With reference: Fails if at least 50% of columns are drifted.
ValueDrift()
  • Column-level.
  • Requires reference.
  • Calculates data drift for a defined column (num, cat, text).
  • Visualizes distributions.
  • Metric result: value.
Required:
  • column
Optional:
  • method
  • threshold
See drift options.
  • With reference: Data drift defaults, depending on column type. See drift methods.
MultivariateDrift() (Coming soon)
  • Dataset-level.
  • Requires reference.
  • Computes a single dataset drift score.
  • Default method: share of drifted columns.
  • Metric result: value.
Optional:
  • columns
  • method
See drift options.
  • With reference: Defaults for method. See methods.
EmbeddingDrift() (Coming soon)
  • Column-level.
  • Requires reference.
  • Calculates data drift for embeddings.
  • Requires embedding columns set in data definition.
  • Metric result: value.
Required:
  • embeddings
  • method
See embedding drift options.
  • With reference: Defaults for method. See methods.

Correlations

Use for exploratory data analysis, drift monitoring (correlation changes) or to check alignment between scores (e.g. LLM-based descriptors against human labels).

Data definition. You may need to map column types.

Metric | Description | Parameters | Test Defaults
DatasetCorrelations() (Coming soon)
  • Calculates the correlations between all or set columns in the dataset.
  • Supported methods: Pearson, Spearman, Kendall, Cramer_V.
Optional: N/A
Correlation() (Coming soon)
  • Calculates the correlation between two defined columns.
Required:
  • column_x
  • column_y
Optional:
  • method (default: pearson, available: pearson, spearman, kendall, cramer_v)
  • Test conditions
N/A
CorrelationChanges() (Coming soon)
  • Dataset-level.
  • Reference required.
  • Checks the number of correlation violations (significant changes in correlation strength between columns) across all or set columns.
Optional:
  • columns
  • method (default: pearson, available: pearson, spearman, kendall, cramer_v)
  • corr_diff (default: 0.25)
  • Test conditions
  • With reference: Fails if at least one correlation violation is detected.

Classification

Use to evaluate quality on a classification task (probabilistic, non-probabilistic, binary and multi-class).

Data definition. You may need to map the prediction and target columns and the classification type.

General

Use for binary classification and aggregated results for multi-class.
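
A minimal sketch, assuming the current Evidently Python API. The target/prediction mapping and the boolean visualization flags follow the parameter names in the table below, but the exact classes and signatures may differ between versions:

from evidently import BinaryClassification, DataDefinition, Dataset, Report
from evidently.metrics import Accuracy, F1Score, Precision, Recall, RocAUC

# Map target and prediction columns and the classification type in the data definition
# (df: a pandas DataFrame containing the target and prediction columns; illustrative)
data_definition = DataDefinition(
    classification=[BinaryClassification(target="target", prediction_labels="prediction")],
)
current_data = Dataset.from_pandas(df, data_definition=data_definition)

report = Report([
    Accuracy(),
    Precision(conf_matrix=True),   # at least one visualization flag must be set
    Recall(pr_curve=True),
    F1Score(conf_matrix=True),
    RocAUC(roc_curve=True),
])
my_eval = report.run(current_data, None)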

Metric | Description | Parameters | Test Defaults
ClassificationPreset()
  • Large Preset with many classification Metrics and visuals.
  • See Preset page.
  • Metric result: all Metrics.
Optional: probas_threshold. As in individual Metrics.
ClassificationQuality()
  • Small Preset.
  • Summarizes quality Metrics in a single widget.
  • Metric result: all Metrics.
Optional: probas_threshold. As in individual Metrics.
LabelCount() (Coming soon)
  • Distribution of predicted classes.
  • Can visualize class balance and/or probability distribution.
Required:
  • Set at least one visualization: class_balance, prob_distribution.
Optional:
N/A
Accuracy()
  • Calculates accuracy.
  • Metric result: value.
Optional:
  • No reference: Fails if lower than dummy model accuracy.
  • With reference: Fails if accuracy differs by >20%.
Precision()
  • Calculates precision.
  • Visualizations available: Confusion Matrix, PR Curve, PR Table.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix, pr_curve, pr_table.
Optional:
  • probas_threshold (default: None or 0.5 for probabilistic classification)
  • top_k
  • Test conditions
  • No reference: Fails if Precision is lower than the dummy model.
  • With reference: Fails if Precision differs by >20%.
Recall()
  • Calculates recall.
  • Visualizations available: Confusion Matrix, PR Curve, PR Table.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix, pr_curve, pr_table.
Optional:
  • No reference: Fails if lower than dummy model recall.
  • With reference: Fails if Recall differs by >20%.
F1Score()
  • Calculates F1 Score.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix.
Optional:
  • No reference: Fails if lower than dummy model F1.
  • With reference: Fails if F1 differs by >20%.
TPR()
  • Calculates True Positive Rate (TPR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if TPR is lower than the dummy model.
  • With reference: Fails if TPR differs by >20%.
TNR()
  • Calculates True Negative Rate (TNR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if TNR is lower than the dummy model.
  • With reference: Fails if TNR differs by >20%.
FPR()
  • Calculates False Positive Rate (FPR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if FPR is higher than the dummy model.
  • With reference: Fails if FPR differs by >20%.
FNR()
  • Calculates False Negative Rate (FNR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if FNR is higher than the dummy model.
  • With reference: Fails if FNR differs by >20%.
LogLoss()
  • Calculates Log Loss.
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if LogLoss is higher than the dummy model (equals 0.5 for a constant model).
  • With reference: Fails if LogLoss differs by >20%.
RocAUC()
  • Calculates ROC AUC.
  • Can visualize PR curve or table.
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table, roc_curve.
Optional:
  • No reference: Fails if ROC AUC is ≤ 0.5.
  • With reference: Fails if ROC AUC differs by >20%.
Lift() (Coming soon)
  • Calculates lift.
  • Can visualize lift curve or table.
  • Metric result: value.
Required:
  • Set at least one visualization: lift_table, lift_curve.
Optional:
N/A

Dummy metrics:

By label

Use when you have multiple classes and want to evaluate quality separately.
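
A minimal sketch for multi-class quality by label, assuming the current Evidently Python API (the Dataset placeholders are illustrative; import paths may differ between versions):

from evidently import Report
from evidently.metrics import F1ByLabel, PrecisionByLabel, RecallByLabel, RocAUCByLabel

report = Report([
    PrecisionByLabel(),
    RecallByLabel(),
    F1ByLabel(),
    RocAUCByLabel(),
])

# current_data / reference_data: Dataset objects with mapped target and prediction columns
my_eval = report.run(current_data, reference_data)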

Metric | Description | Parameters | Test Defaults
ClassificationQualityByLabel()
  • Small Preset summarizing classification quality Metrics by label.
  • Metric result: all Metrics.
None. As in individual Metrics.
PrecisionByLabel()
  • Calculates precision by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if Precision is lower than the dummy model.
  • With reference: Fails if Precision differs by >20%.
F1ByLabel()
  • Calculates F1 Score by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if F1 is lower than the dummy model.
  • With reference: Fails if F1 differs by >20%.
RecallByLabel()
  • Calculates recall by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if Recall is lower than the dummy model.
  • With reference: Fails if Recall differs by >20%.
RocAUCByLabel()
  • Calculates ROC AUC by label in multiclass classification.
  • Metric result (dict): label: value
Optional:
  • No reference: Fails if ROC AUC is ≤ 0.5.
  • With reference: Fails if ROC AUC differs by >20%.

Regression

Use to evaluate the quality of a regression model.

Data definition. You may need to map prediction and target columns.
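
A minimal sketch, assuming the current Evidently Python API. The boolean visualization flags follow the parameter names in the table below; the Dataset placeholders are illustrative, and signatures may differ between versions:

from evidently import Report
from evidently.metrics import MAE, RMSE, MeanError, R2Score
from evidently.presets import RegressionQuality

report = Report([
    RegressionQuality(),
    MeanError(error_plot=True),   # at least one visualization flag must be set
    MAE(error_distr=True),
    RMSE(),
    R2Score(),
])

# current_data / reference_data: Dataset objects with mapped target and prediction columns
my_eval = report.run(current_data, reference_data)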

Metric | Description | Parameters | Test Defaults
RegressionPreset()
  • Large Preset.
  • Includes a wide range of regression metrics with rich visuals.
  • Metric result: all metrics.
  • See Preset page.
None. As in individual metrics.
RegressionQuality()
  • Small Preset.
  • Summarizes key regression metrics in a single widget.
  • Metric result: all metrics.
None. As in individual metrics.
MeanError()
  • Calculates the mean error.
  • Visualizations available: Error Plot, Error Distribution, Error Normality.
  • Metric result: mean_error, error_std.
Required:
  • Set at least one visualization: error_plot, error_distr, error_normality.
Optional:
  • No reference / With reference: Expect the Mean Error to be near zero. Fails if the Mean Error is skewed and the condition eq = approx(absolute=0.1 * error_std) is violated.
MAE()
  • Calculates Mean Absolute Error (MAE).
  • Visualizations available: Error Plot, Error Distribution, Error Normality.
  • Metric result: mean_absolute_error, absolute_error_std.
Required:
  • Set at least one visualization: error_plot, error_distr, error_normality.
Optional:
  • No reference: Fails if MAE is higher than the dummy model predicting the median target value.
  • With reference: Fails if MAE differs by >10%.
RMSE()
  • Calculates Root Mean Square Error (RMSE).
  • Metric result: rmse.
Optional:
  • No reference: Fails if RMSE is higher than the dummy model predicting the mean target value.
  • With reference: Fails if RMSE differs by >10%.
MAPE()
  • Calculates Mean Absolute Percentage Error (MAPE).
  • Visualizations available: Percentage Error Plot.
  • Metric result: mean_perc_absolute_error, perc_absolute_error_std.
Required:
  • Set at least one visualization: perc_error_plot.
Optional:
  • No reference: Fails if MAPE is higher than the dummy model predicting the weighted median target value.
  • With reference: Fails if MAPE differs by >10%.
R2Score()
  • Calculates R² (Coefficient of Determination).
  • Metric result: r2score.
Optional:
  • No reference: Fails if R² ≤ 0.
  • With reference: Fails if R² differs by >10%.
AbsMaxError()
  • Calculates Absolute Maximum Error.
  • Metric result: abs_max_error.
Optional:
  • No reference: Fails if absolute maximum error is higher than the dummy model predicting the median target value.
  • With reference: Fails if it differs by >10%.

Dummy metrics:

Ranking

Use to evaluate ranking, search / retrieval or recommendations.

Data definition. You may need to map prediction and target columns and ranking type. Some metrics require additional training data.

Metric explainers. Check ranking metrics explainers.
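
A minimal sketch, assuming the current Evidently Python API (k and the Dataset placeholders are illustrative; prediction, target, and ranking type must be mapped in the data definition, and import paths may differ between versions):

from evidently import Report
from evidently.metrics import MAP, MRR, NDCG, HitRate, PrecisionTopK, RecallTopK

report = Report([
    PrecisionTopK(k=5),
    RecallTopK(k=5),
    MAP(k=5),
    NDCG(k=5),
    MRR(k=5),
    HitRate(k=5),
])

# current_data / reference_data: Dataset objects with mapped prediction and target columns
my_eval = report.run(current_data, reference_data)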

Metric | Description | Parameters | Test Defaults
RecSysPreset()
  • Large Preset.
  • Includes a range of recommendation system metrics.
  • Metric result: all metrics.
  • See Preset page.
None. As in individual metrics.
RecallTopK()
  • Calculates Recall at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if recall > 0.
  • With reference: Fails if Recall differs by >10%.
FBetaTopK()
  • Calculates F-beta score at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if F-beta > 0.
  • With reference: Fails if F-beta differs by >10%.
PrecisionTopK()
  • Calculates Precision at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Precision > 0.
  • With reference: Fails if Precision differs by >10%.
MAP()
  • Calculates Mean Average Precision at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if MAP > 0.
  • With reference: Fails if MAP differs by >10%.
NDCG()
  • Calculates Normalized Discounted Cumulative Gain at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if NDCG > 0.
  • With reference: Fails if NDCG differs by >10%.
MRR()
  • Calculates Mean Reciprocal Rank at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if MRR > 0.
  • With reference: Fails if MRR differs by >10%.
HitRate()
  • Calculates Hit Rate at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Hit Rate > 0.
  • With reference: Fails if Hit Rate differs by >10%.
ScoreDistribution()
  • Computes the predicted score entropy (KL divergence).
  • Applies only when the recommendations_type is a score.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: value.
  • With reference: value.
Personalization() (Coming soon)
  • Calculates Personalization score at the top K recommendations.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Personalization > 0.
  • With reference: Fails if Personalization differs by >10%.
ARP() (Coming soon)
  • Computes Average Recommendation Popularity at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if ARP > 0.
  • With reference: Fails if ARP differs by >10%.
Coverage() (Coming soon)
  • Calculates Coverage at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Coverage > 0.
  • With reference: Fails if Coverage differs by >10%.
GiniIndex() (Coming soon)
  • Calculates Gini Index at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Gini Index < 1.
  • With reference: Fails if Gini Index differs by >10%.
Diversity() (Coming soon)
  • Calculates Diversity at the top K recommendations.
  • Requires item features.
  • Metric result: value.
Required:
  • k
  • item_features
Optional:
  • No reference: Tests if Diversity > 0.
  • With reference: Fails if Diversity differs by >10%.
Serendipity() (Coming soon)
  • Calculates Serendipity at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
  • item_features
Optional:
  • No reference: Tests if Serendipity > 0.
  • With reference: Fails if Serendipity differs by >10%.
Novelty() (Coming soon)
  • Calculates Novelty at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Novelty > 0.
  • With reference: Fails if Novelty differs by >10%.

Relevant for RecSys metrics:

  • no_feedback_user: bool = False. Specifies whether to include the users who did not select any of the items, when computing the quality metric. Default: False.

  • min_rel_score: Optional[int] = None. Specifies the minimum relevance score to consider relevant when calculating the quality metrics for non-binary targets (e.g., if a target is a rating or a custom score).