Run Evidently on Spark
How to run calculations on Spark.
You can run distributed computation using Spark if you work with large datasets.
Currently, the following Tests, Metrics and Presets are supported:
ColumnDriftMetric()
DataDriftTable()
DatasetDriftMetric()
DataDriftPreset()
TestColumnDrift()
TestShareOfDriftedColumns()
TestNumberOfDriftedColumns()
DataDriftTestPreset()
For drift calculation, the following methods are supported:
chisquare
jensen shannon
psi
wasserstein
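To give a feel for what one of these methods measures, here is a minimal pure-Python sketch of the Population Stability Index (psi) for a categorical column. It is only an illustration of the general formula and is not Evidently's implementation; the function name `psi` and the `eps` floor for empty categories are assumptions made for this example.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-4):
    """Illustrative PSI between two samples of categorical values."""
    categories = set(reference) | set(current)
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    total = 0.0
    for cat in categories:
        # Floor each share at eps (an assumed value) so the logarithm
        # stays defined when a category is missing from one sample.
        ref_share = max(ref_counts[cat] / len(reference), eps)
        cur_share = max(cur_counts[cat] / len(current), eps)
        total += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return total


same = psi(["a", "b", "a", "b"], ["a", "b", "b", "a"])
shifted = psi(["a"] * 50 + ["b"] * 50, ["a"] * 90 + ["b"] * 10)
```

The value is zero when both samples have the same category shares and grows as the shares diverge.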
The following data types are supported:
numerical_features
categorical_features
An example how-to notebook shows how to use Evidently on Spark.
To run Evidently on a Spark DataFrame, you need to specify the corresponding engine in the run() method for the Report calculation.
To import SparkEngine from Evidently, use the following command:
from evidently.spark.engine import SparkEngine
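Pass SparkEngine to the run() method when you run the Report:
from evidently.report import Report
from evidently.metrics import DataDriftTable
from evidently.spark.engine import SparkEngine

# reference and current are Spark DataFrames
spark_report_table = Report(metrics=[
    DataDriftTable()
])
spark_report_table.run(reference_data=reference, current_data=current, engine=SparkEngine)
spark_report_table.show()  # OR spark_report_table.show(mode='inline')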
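The snippet above assumes that reference and current already exist as Spark DataFrames. The following is a hedged end-to-end sketch of how they might be built on a local Spark session. The helper `make_rows`, the column names `amount` and `channel`, the synthetic data, and the explicit `ColumnMapping` are all assumptions for illustration, and the import paths reflect the Evidently version that ships `evidently.spark`.

```python
import random


def make_rows(n, shift=0.0, seed=0):
    """Build n synthetic (amount, channel) rows; shift moves the numerical column."""
    rng = random.Random(seed)
    return [
        (rng.gauss(50.0 + shift, 10.0), rng.choice(["web", "store", "phone"]))
        for _ in range(n)
    ]


def run_spark_drift_report():
    # Imported lazily so make_rows can be used without Spark installed.
    from pyspark.sql import SparkSession
    from evidently import ColumnMapping
    from evidently.metrics import DataDriftTable
    from evidently.report import Report
    from evidently.spark.engine import SparkEngine

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("evidently-spark-sketch")
        .getOrCreate()
    )
    columns = ["amount", "channel"]
    reference = spark.createDataFrame(make_rows(1000, seed=1), schema=columns)
    current = spark.createDataFrame(make_rows(1000, shift=15.0, seed=2), schema=columns)

    # Assumption: declare the supported data types explicitly.
    column_mapping = ColumnMapping(
        numerical_features=["amount"],
        categorical_features=["channel"],
    )

    report = Report(metrics=[DataDriftTable()])
    report.run(
        reference_data=reference,
        current_data=current,
        column_mapping=column_mapping,
        engine=SparkEngine,
    )
    return report
```

Calling `run_spark_drift_report()` requires both pyspark and evidently to be installed; the returned Report can then be displayed with show() as in the snippet above.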
Notebook example on setting Test criticality: