TL;DR: You can explore and compare text datasets.
- Report: for visual analysis or metrics export, use the `TextOverviewPreset`.
You can evaluate and explore text data:
1. To monitor input data for NLP models. When you do not have true labels or actuals, you can monitor changes in the input data (data drift) and descriptive text characteristics. You can run batch checks, for example, comparing the latest batch of text data to earlier or training data. You can often combine it with evaluating Prediction Drift.
2. When debugging model decay. If you observe a drop in model performance, you can use this report to understand changes in the input data patterns.
3. Exploratory data analysis. You can use the visual report to explore the text data you want to use for training. You can also use it to compare any two datasets.
If you want to visually explore the text data, you can create a new Report object and use the `TextOverviewPreset`:

```python
text_overview_report = Report(metrics=[
    TextOverviewPreset(column_name="Review_Text"),  # hypothetical column name; use your own text column
])
```
Note that to calculate text-related metrics, you must also import additional libraries:
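A minimal setup sketch. The exact corpora depend on which descriptors are computed; treat the list below as an assumption and adjust it to your installed Evidently version:

```python
import nltk

# One-time downloads of NLTK corpora commonly used by text descriptors
# (e.g., vocabulary checks and sentiment scoring)
nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')
```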
`TextOverviewPreset` provides an overview and comparison of text datasets.
- Generates a descriptive summary of the text columns in the dataset.
- Performs data drift detection to compare the two datasets using the domain classifier approach.
- Shows distributions of the text descriptors in two datasets, and their correlations with other features.
- Performs drift detection for text descriptors.
- You can pass one or two datasets. The reference dataset serves as a benchmark. Evidently analyzes the change by comparing the current production data to the reference data. If you pass a single dataset, there will be no comparison.
- To run this preset, you must have text columns in your dataset. Additional features and prediction/target are optional. Pass them if you want to analyze the correlations with text descriptors.
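To tell Evidently which columns to treat as text, you can pass a `ColumnMapping` with `text_features`. A sketch, assuming a hypothetical column name:

```python
from evidently import ColumnMapping

# "Review_Text" is a hypothetical column name; replace it with your own
column_mapping = ColumnMapping(text_features=["Review_Text"])
```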
The report includes 5 components. All plots are interactive.
Aggregated visuals in plots. Starting from v0.3.2, all visuals in the Evidently Reports are aggregated by default. This decreases the load time and report size for larger datasets. If you work with smaller datasets or samples, you can pass an option to generate plots with raw data. Choose whether to enable it based on the size of your dataset.
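A sketch of passing the raw-data render option; the option key follows the Evidently docs for aggregated visuals, but verify it against your installed version:

```python
from evidently.report import Report
from evidently.metric_preset import TextOverviewPreset

# Generate plots with raw data instead of aggregated visuals
report = Report(
    metrics=[TextOverviewPreset(column_name="Review_Text")],  # hypothetical column name
    options={"render": {"raw_data": True}},
)
```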
The report first shows the descriptive statistics for the text column(s).
The report generates several features that describe different text properties and shows the distributions of these text descriptors.
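To illustrate what such descriptors look like, here is a simplified, library-free sketch of a few typical ones (text length, word count, share of non-letter characters). This is not Evidently's internal implementation; the function name and the descriptor set are illustrative:

```python
def text_descriptors(text):
    """Compute a few illustrative text descriptors (not Evidently's exact set)."""
    letters = sum(ch.isalpha() for ch in text)
    return {
        "length": len(text),                      # number of characters
        "word_count": len(text.split()),          # whitespace-separated tokens
        "non_letter_share": 1 - letters / max(len(text), 1),
    }

print(text_descriptors("Great product, fast delivery!"))
```

In the report, the distributions of such values are plotted for the reference and current datasets side by side.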
If the dataset contains numerical features and/or a target, the report will show the correlations between features and text descriptors in the current and reference datasets. It helps detect shifts in the relationships.
If you pass two datasets, the report performs drift detection using the default data drift method for texts (domain classifier). It returns the ROC AUC of the binary classifier model that can discriminate between reference and current data. If the drift is detected, it also shows the top words that help distinguish between the reference and current dataset.
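The idea behind the domain classifier can be shown with a deliberately simplified sketch: score each text by how much its word frequencies resemble the current dataset, then compute the ROC AUC of separating "reference" from "current". Evidently's actual implementation trains a real classifier on text features; the functions below are illustrative only:

```python
from collections import Counter

def word_freqs(texts):
    """Relative word frequencies over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def domain_score(text, ref_freq, cur_freq):
    """Higher score = the text looks more like the current dataset."""
    return sum(cur_freq.get(w, 0) - ref_freq.get(w, 0) for w in text.lower().split())

def roc_auc(ref_scores, cur_scores):
    """Probability that a random current text outscores a random reference text."""
    wins = ties = 0
    for r in ref_scores:
        for c in cur_scores:
            if c > r:
                wins += 1
            elif c == r:
                ties += 1
    return (wins + 0.5 * ties) / (len(ref_scores) * len(cur_scores))

reference = ["the delivery was fast", "great product and fast shipping"]
current = ["app keeps crashing on login", "login page crashes constantly"]

ref_freq, cur_freq = word_freqs(reference), word_freqs(current)
ref_scores = [domain_score(t, ref_freq, cur_freq) for t in reference]
cur_scores = [domain_score(t, ref_freq, cur_freq) for t in current]
auc = roc_auc(ref_scores, cur_scores)
print(auc)  # 1.0 -> the two datasets are trivially distinguishable, i.e. drift
```

An AUC near 0.5 means the classifier cannot tell the datasets apart (no drift); an AUC near 1.0 means they are easy to separate (drift detected).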
If you pass two datasets, the report also performs drift detection for text descriptors to show statistical shifts in text characteristics.
You can also get the report output as a JSON or a Python dictionary.
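Assuming a `text_overview_report` object computed as above, the export calls look like this:

```python
# Export the computed report (assumes `text_overview_report` has been run)
json_output = text_overview_report.json()     # JSON string
dict_output = text_overview_report.as_dict()  # Python dictionary
```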