At the moment, Evidently works only with datasets in Pandas DataFrame format. These datasets must fit into memory to be processed correctly.
In this tutorial, you will see how to load and sample data from other data sources into a Pandas DataFrame for further analysis with Evidently.
TensorFlow Datasets
TensorFlow Datasets (the tensorflow_datasets package) supports conversion from a TensorFlow Dataset to a Pandas DataFrame with the as_dataframe method.
For bigger datasets that do not fit into memory, use take to sample before conversion. Check that the dataset is shuffled first; otherwise take returns only the leading records, which may not be a representative sample.
    import tensorflow_datasets as tfds

    MAXIMUM_DATASET_SIZE = 10000  # set up the maximum number of lines in your sample

    # tensorflow_ds is a shuffled Tensorflow Dataset
    pandas_df = tfds.as_dataframe(tensorflow_ds.take(MAXIMUM_DATASET_SIZE))
Note that the as_dataframe method loads everything into memory, so run it on a sample of your dataset to keep the size under control.
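If the source cannot be shuffled cheaply, one alternative (not part of the Evidently or TensorFlow APIs) is reservoir sampling, which draws a uniform sample from a stream in a single pass without knowing its length. The sketch below is a minimal illustration in plain Python; the helper name reservoir_sample and the generated rows are hypothetical stand-ins for your own records.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in for an unshuffled stream of records.
rows = ({"id": i, "label": i % 2} for i in range(100_000))
sample = reservoir_sample(rows, k=1000, seed=42)
```

The resulting list of dictionaries can be passed to pd.DataFrame(sample). For a TensorFlow Dataset, the stream could come from tensorflow_ds.as_numpy_iterator(), assuming each element converts cleanly to a flat record.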
PyTorch DataPipes
To sample data from a PyTorch DataPipe, shuffle it first with shuffle() and take the first batch of the chosen size. This sample can then be converted to a Pandas DataFrame.
See the example with the AG News dataset:
    import pandas as pd
    from torchdata.datapipes.iter import HttpReader

    MAXIMUM_DATASET_SIZE = 10000  # set up the maximum number of lines in your sample

    # Load data to Pytorch Datapipe
    URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
    ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))

    # Shuffle and sample data
    batches = ag_news_train.shuffle().batch(MAXIMUM_DATASET_SIZE)
    sample = next(iter(batches))

    # Load sampled data to Pandas DataFrame
    pandas_df = pd.DataFrame({'text': [el[1] for el in sample],
                              'label': [el[0] for el in sample]})
Note that the schema of the resulting Pandas DataFrame is arbitrary; just make sure to specify the text and target columns with column_mapping later.
PySpark DataFrames
PySpark supports conversion to a Pandas DataFrame with the toPandas() method.
For bigger DataFrames that do not fit into memory, use sample to draw a subset before conversion.
    fraction = 0.5  # set the fraction of original DataFrame to be sampled

    # df_spark is a PySpark DataFrame
    df_pandas = df_spark.sample(withReplacement=False, fraction=fraction, seed=None).toPandas()
To get the same sample on every run, pass a fixed seed value to the sample method.
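Because sample takes a fraction rather than a row count, it can help to derive the fraction from the maximum number of rows you are willing to load. The helper below is a minimal sketch; the name sampling_fraction and the row count of 2,500,000 are hypothetical, and in practice the total would come from something like df_spark.count().

```python
def sampling_fraction(total_rows: int, max_rows: int) -> float:
    """Fraction to pass to `sample` so the expected sample size is about max_rows."""
    if total_rows <= 0:
        # Nothing to downsample; keep everything.
        return 1.0
    # Never exceed 1.0: small DataFrames are taken whole.
    return min(1.0, max_rows / total_rows)


MAXIMUM_DATASET_SIZE = 10000  # maximum number of rows wanted in the sample

# Hypothetical row count; replace with the real size of your DataFrame.
fraction = sampling_fraction(total_rows=2_500_000, max_rows=MAXIMUM_DATASET_SIZE)
print(fraction)  # 0.004
```

Spark does not guarantee that the sampled DataFrame has exactly fraction times the original number of rows, so if a hard upper bound matters, one option is to chain limit(MAXIMUM_DATASET_SIZE) before toPandas().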
Files in a directory
If your data is organized as a separate file for each text, in folders whose names correspond to the class labels, use the following steps to sample data while preserving the balance of classes:
    import os, random
    import pandas as pd

    DIRECTORY_NAME = '/main_directory/'  # define data source directory
    MAXIMUM_DATASET_SIZE = 10000  # set up the maximum number of lines in your sample

    # find the names of classes
    classes_names = [class_name for class_name in os.listdir(DIRECTORY_NAME)
                     if os.path.isdir(os.path.join(DIRECTORY_NAME, class_name))]

    # determine classes sizes
    classes_sizes_dict = {class_name: len(os.listdir(os.path.join(DIRECTORY_NAME, class_name)))
                          for class_name in classes_names}
    total_size = sum(classes_sizes_dict.values())

    # sample objects from classes in the correct proportion
    texts = []
    labels = []
    for class_name in classes_names:
        class_size = classes_sizes_dict[class_name]
        sample_size = int((class_size / total_size) * MAXIMUM_DATASET_SIZE)
        # never request more files than the class contains
        # (random.sample raises ValueError otherwise)
        sample_size = min(sample_size, class_size)
        random_files_names = random.sample(os.listdir(os.path.join(DIRECTORY_NAME, class_name)), sample_size)
        for file_name in random_files_names:
            with open(os.path.join(DIRECTORY_NAME, class_name, file_name), 'r') as f:
                text = f.read()
            texts.append(text)
            labels.append(class_name)

    # load sampled data to Pandas DataFrame
    df_pandas = pd.DataFrame({'text': texts, 'label': labels})
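The per-class sample size in the loop above is truncated with int(), so the total can fall slightly short of MAXIMUM_DATASET_SIZE. If the exact total matters, one possible refinement is largest-remainder allocation, sketched below. The helper name allocate_sample_sizes and the class sizes are hypothetical and only illustrate the arithmetic.

```python
from typing import Dict


def allocate_sample_sizes(class_sizes: Dict[str, int], max_total: int) -> Dict[str, int]:
    """Split max_total across classes proportionally, using largest remainders."""
    total = sum(class_sizes.values())
    if total <= max_total:
        # The whole dataset already fits: take every file.
        return dict(class_sizes)
    # Exact (fractional) share of each class.
    exact = {name: size * max_total / total for name, size in class_sizes.items()}
    # Start from the truncated shares, as in the loop above.
    allocation = {name: int(share) for name, share in exact.items()}
    # Hand out the items lost to truncation, largest fractional part first.
    leftover = max_total - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical class sizes (number of files per class folder).
sizes = {"business": 3334, "sports": 3333, "world": 3333}
allocation = allocate_sample_sizes(sizes, max_total=100)
print(allocation)  # {'business': 34, 'sports': 33, 'world': 33}
```

The returned values could replace sample_size in the sampling loop, with classes_sizes_dict passed as class_sizes.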