Text Overview
TL;DR: You can explore and compare text datasets.
Report: for visual analysis or metrics export, use the TextOverviewPreset.
Use case
You can evaluate and explore text data:
1. To monitor input data for NLP models. When you do not have true labels or actuals, you can monitor changes in the input data (data drift) and descriptive text characteristics. You can run batch checks, for example, comparing the latest batch of text data to earlier or training data. You can often combine it with evaluating Prediction Drift.
2. To debug model decay. If you observe a drop in model performance, you can use this report to understand changes in the input data patterns.
3. To perform exploratory data analysis. You can use the visual report to explore the text data you want to use for training. You can also use it to compare any two datasets.
Text Overview Report
If you want to visually explore the text data, you can create a new Report object and use the TextOverviewPreset.
Code example
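A minimal sketch of creating and running the report, assuming the legacy Evidently API (`evidently.report.Report` and `evidently.metric_preset.TextOverviewPreset`); module paths may differ in your installed version. The helper name `run_text_overview` and the column name `Review_Text` are placeholders for illustration.

```python
def run_text_overview(reference_data, current_data, text_column, column_mapping=None):
    """Build a Report with TextOverviewPreset and run it on two pandas DataFrames."""
    # Imports live inside the function so this sketch loads even without Evidently.
    # Assumption: these module paths match your installed Evidently version.
    from evidently.report import Report
    from evidently.metric_preset import TextOverviewPreset

    report = Report(metrics=[TextOverviewPreset(column_name=text_column)])
    report.run(
        reference_data=reference_data,
        current_data=current_data,
        column_mapping=column_mapping,
    )
    return report


# Usage with hypothetical DataFrames `ref` and `cur` that share a "Review_Text" column:
# report = run_text_overview(ref, cur, "Review_Text", column_mapping)
# report                                   # renders inline in a notebook
# report.save_html("text_overview.html")   # or write the report to a file
```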
Note that to calculate text-related metrics, you must also import additional libraries:
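As a sketch, the additional resources are NLTK corpora. The exact list below (an English word list plus WordNet data) is an assumption, so check it against the requirements of your Evidently version. The helper name `download_nltk_resources` is hypothetical.

```python
# Assumption: these NLTK resources back the text descriptors
# (a word list for out-of-vocabulary checks, WordNet data for lemmatization).
NLTK_RESOURCES = ["words", "wordnet", "omw-1.4"]


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each NLTK resource; needs `pip install nltk` and network access."""
    import nltk

    for name in resources:
        nltk.download(name)


# download_nltk_resources()  # run once per environment
```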
How it works
The TextOverviewPreset provides an overview and comparison of text datasets. It:
Generates a descriptive summary of the text columns in the dataset.
Performs data drift detection to compare the two text datasets using the domain classifier approach.
Shows distributions of the text descriptors in two datasets, and their correlations with other features.
Performs drift detection for text descriptors.
Data Requirements
You can pass one or two datasets. The reference dataset serves as a benchmark. Evidently analyzes the change by comparing the current production data to the reference data. If you pass a single dataset, there will be no comparison.
To run this preset, you must have text columns in your dataset. Additional features and prediction/target are optional. Pass them if you want to analyze the correlations with text descriptors.
Column mapping. You must explicitly specify the columns that contain text features in column mapping to run this report.
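A short sketch of such a mapping, assuming the legacy `ColumnMapping` class with a `text_features` field. The helper name and the column names `Review_Text` and `Title` are placeholders.

```python
def make_text_column_mapping(text_columns):
    """Return a column mapping that marks the given columns as text features."""
    # Assumption: `ColumnMapping` is importable from the top-level package
    # in your installed Evidently version.
    from evidently import ColumnMapping

    return ColumnMapping(text_features=list(text_columns))


# Usage with hypothetical column names:
# column_mapping = make_text_column_mapping(["Review_Text", "Title"])
```

Pass the resulting object as `column_mapping` when you run the report.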
How it looks
The report includes 5 components. All plots are interactive.
Aggregated visuals in plots. Starting from v0.3.2, all visuals in Evidently Reports are aggregated by default. This decreases the load time and report size for larger datasets. If you work with smaller datasets or samples, you can pass an option to generate plots with raw data instead. Choose based on the size of your dataset.
1. Text Column Summary
The report first shows the descriptive statistics for the text column(s).
2. Text Descriptors Distribution
The report generates several features that describe different text properties and shows the distributions of these text descriptors.
Text length
Non-letter characters
Out-of-vocabulary words
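To make the three descriptors concrete, here is a plain-Python sketch of how such values could be computed for a single text. It illustrates the idea and is not Evidently's implementation: the function names, the tokenization rule, and the tiny `vocabulary` set are assumptions.

```python
import re


def text_length(text):
    """Number of characters in the text."""
    return len(text)


def non_letter_share(text):
    """Percentage of characters that are neither letters nor whitespace."""
    if not text:
        return 0.0
    non_letters = sum(1 for ch in text if not ch.isalpha() and not ch.isspace())
    return 100.0 * non_letters / len(text)


def oov_share(text, vocabulary):
    """Percentage of words that do not appear in the given vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in vocabulary)
    return 100.0 * unknown / len(words)


vocabulary = {"great", "dress"}  # hypothetical stand-in for a real word list
sample = "Great drezz!!"
descriptors = {
    "text_length": text_length(sample),
    "non_letter_share": non_letter_share(sample),
    "oov_share": oov_share(sample, vocabulary),
}
```

Computing these values for every row turns a text column into numerical columns, which is what makes the distribution plots possible.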
3. Text Descriptors Correlations
If the dataset contains numerical features and/or a target, the report shows the correlations between these features and the text descriptors in the current and reference datasets. This helps detect shifts in the relationship.
Text length
Non-letter characters
Out-of-vocabulary words
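The idea of a shifting relationship can be sketched with a hand-written Pearson correlation between one descriptor (text length) and a numerical feature. Pearson is used here only for illustration, and the review texts and ratings are made-up data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (text, rating) pairs.
reference = [("ok", 3), ("quite good", 4), ("really great dress", 5)]
current = [
    ("fine", 5),
    ("fits a bit small", 3),
    ("poor quality, returned it at once", 1),
]

# Correlate the text length descriptor with the rating in each dataset.
corr_reference = pearson([len(t) for t, _ in reference], [r for _, r in reference])
corr_current = pearson([len(t) for t, _ in current], [r for _, r in current])
```

In this toy data, longer reviews go with higher ratings in the reference set and with lower ratings in the current set, so the sign of the correlation flips.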
4. Text Column Drift
If you pass two datasets, the report performs drift detection using the default data drift method for texts (domain classifier). It returns the ROC AUC of a binary classifier trained to discriminate between the reference and current data. If drift is detected, it also shows the top words that help distinguish between the reference and current datasets.
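The domain classifier idea can be sketched in plain Python. This is not Evidently's implementation: it swaps in a tiny bag-of-words naive Bayes model, uses a fixed half-and-half split, and all names and sample texts are assumptions. It expects at least two texts per dataset.

```python
import math
from collections import Counter


def roc_auc(labels, scores):
    """Share of (current, reference) pairs ranked correctly; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def domain_classifier(reference, current, top_n=3):
    """Train on the first half of each dataset and score the second half.

    Returns (ROC AUC, words typical of current, words typical of reference).
    """
    ref_docs = [text.lower().split() for text in reference]
    cur_docs = [text.lower().split() for text in current]
    ref_train, ref_test = ref_docs[: len(ref_docs) // 2], ref_docs[len(ref_docs) // 2 :]
    cur_train, cur_test = cur_docs[: len(cur_docs) // 2], cur_docs[len(cur_docs) // 2 :]

    ref_counts = Counter(w for doc in ref_train for w in doc)
    cur_counts = Counter(w for doc in cur_train for w in doc)
    vocab = set(ref_counts) | set(cur_counts)
    ref_total = sum(ref_counts.values()) + len(vocab)
    cur_total = sum(cur_counts.values()) + len(vocab)

    # Per-word log-odds with add-one smoothing: above 0 means "looks like current data".
    log_odds = {
        w: math.log((cur_counts[w] + 1) / cur_total)
        - math.log((ref_counts[w] + 1) / ref_total)
        for w in vocab
    }

    def score(doc):
        # Words never seen during training contribute nothing.
        return sum(log_odds.get(w, 0.0) for w in doc)

    labels = [0] * len(ref_test) + [1] * len(cur_test)
    scores = [score(doc) for doc in ref_test + cur_test]
    ranked = sorted(vocab, key=lambda w: (log_odds[w], w))
    return roc_auc(labels, scores), ranked[::-1][:top_n], ranked[:top_n]


# Hypothetical review texts.
reference_texts = [
    "fast delivery great quality",
    "great quality fast shipping",
    "fast shipping great price",
    "great price fast delivery",
]
current_texts = [
    "item broken refund please",
    "refund please item damaged",
    "item damaged need refund",
    "need refund item broken",
]
auc, current_words, reference_words = domain_classifier(reference_texts, current_texts)
```

A ROC AUC near 0.5 means the classifier cannot tell the two datasets apart, while a value close to 1 means the texts are easy to separate. The highest-weighted words play the role of the top words shown in the report.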
5. Text Descriptors Drift
If you pass two datasets, the report also performs drift detection for text descriptors to show statistical shifts in patterns between text characteristics.
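Because descriptors are numerical, their drift can be checked with an ordinary two-sample statistic. The sketch below computes a Kolmogorov-Smirnov statistic on text lengths as one possible illustration. Which test Evidently actually applies depends on its drift defaults and your configuration, and the sample texts are made up.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = same, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical texts: short reference reviews, much longer current reviews.
reference_texts = ["good fit", "nice color", "love it"]
current_texts = [
    "the zipper broke after one week of use",
    "the fabric feels cheap and the seams came apart",
]
ref_lengths = [len(t) for t in reference_texts]
cur_lengths = [len(t) for t in current_texts]
drift_score = ks_statistic(ref_lengths, cur_lengths)
```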
Metrics output
You can also get the report output as a JSON or a Python dictionary.
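A small sketch of both exports, assuming the `json()` and `as_dict()` methods of the legacy `Report` object. The helper name `export_report` is hypothetical.

```python
def export_report(report):
    """Return the report output as a JSON string and as a Python dictionary."""
    # Assumption: `report` is an Evidently Report that has already been run.
    as_json = report.json()
    as_dict = report.as_dict()
    return as_json, as_dict


# Usage with a hypothetical, already computed `report`:
# as_json, as_dict = export_report(report)
# as_dict["metrics"]  # inspect the structure before relying on specific keys
```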
Report customization
You can use a different color schema for the report.
You can create a different report or test suite from scratch, taking this one as an inspiration.
Examples
Head to an example how-to notebook to see an example Text Overview preset and other metrics and tests for text data.