Core concepts and components of the Evidently Python library.
| Type | Example checks |
|---|---|
| 🔡 Text qualities | Length, sentiment, special symbols, pattern matches, etc. |
| 📝 LLM output quality | Semantic similarity, relevance, RAG faithfulness, custom LLM judges, etc. |
| 🛢 Data quality | Missing values, duplicates, min-max ranges, correlations, etc. |
| 📊 Data drift | 20+ tests and distance metrics to detect distribution drift. |
| 🎯 Classification | Accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc. |
| 📈 Regression | MAE, ME, RMSE, error distribution, error normality, error bias, etc. |
| 🗂 Ranking (incl. RAG) | NDCG, MAP, MRR, Hit Rate, etc. |
For example, here is a toy Q&A evaluation dataset:

| Question | Context | Answer |
|---|---|---|
| How old is the universe? | The universe is believed to have originated from the Big Bang that occurred 13.8 billion years ago. | 13.8 billion years old. |
| What’s the lifespan of Baobab trees? | Baobab trees can live up to 2,500 years. They are often called the “Tree of Life”. | Up to 2,500 years. |
| What is the speed of light? | The speed of light in a vacuum is approximately 299,792 kilometers per second (186,282 miles per second). | Close to 299,792 km per second. |
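As an illustration, a small evaluation dataset like the one above can be assembled as a pandas DataFrame before handing it to Evidently (the column names here are just an example):

```python
import pandas as pd

# A toy Q&A dataset matching the table above; column names are illustrative.
eval_data = pd.DataFrame(
    {
        "question": [
            "How old is the universe?",
            "What's the lifespan of Baobab trees?",
            "What is the speed of light?",
        ],
        "context": [
            "The universe is believed to have originated from the Big Bang that occurred 13.8 billion years ago.",
            "Baobab trees can live up to 2,500 years. They are often called the 'Tree of Life'.",
            "The speed of light in a vacuum is approximately 299,792 kilometers per second (186,282 miles per second).",
        ],
        "answer": [
            "13.8 billion years old.",
            "Up to 2,500 years.",
            "Close to 299,792 km per second.",
        ],
    }
)
```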
To run an evaluation, you first create a `Dataset` object. This allows attaching extra meta-information so that your data is processed correctly. You pass your main (`current`) dataset. Optionally, you can prepare a second (`reference`) dataset that will be used during the evaluation. Both must have identical structures.
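For example, a `Dataset` can be built from the DataFrame above. The `DataDefinition` shown here, which marks the columns as text, is illustrative; adjust it to your own schema:

```python
from evidently import Dataset, DataDefinition

# Create an Evidently Dataset from the toy DataFrame above.
# Marking columns as text ensures text-specific evals are applied correctly.
eval_dataset = Dataset.from_pandas(
    eval_data,
    data_definition=DataDefinition(
        text_columns=["question", "context", "answer"],
    ),
)
```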
Once your `Dataset` is ready, you can run evaluations. You can either:

- add row-level `descriptors` to your dataset and then compute a summary Report, or
- compute a Report directly from dataset-level Metrics.
A Descriptor is a row-level score or label that assesses a specific quality of a given text. It’s different from metrics (like accuracy or precision) that give a score for an entire dataset. You can use descriptors to assess LLM outputs in summarization, Q&A, chatbots, agents, RAGs, etc.
Descriptors range from deterministic to complex ML- or LLM-based checks.
A simple example of a descriptor is `TextLength`. A more complex example is a customizable `LLMEval` descriptor, where you prompt an LLM to act as a judge and, for example, label responses as “relevant” or “not relevant”.
Descriptors can also use two texts at once, like checking `SemanticSimilarity` between two columns to compare a new response to a reference one.
You can use built-in descriptors, configure templates (like LLM judges or regular expressions), or add custom checks in Python. Each Descriptor returns a result per row: either a numerical score (like text length or similarity) or a categorical label (like “relevant” or “not relevant”).
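Here is a sketch of adding a few built-in descriptors to the dataset created earlier. The column names, aliases, and the two-column `SemanticSimilarity` signature are illustrative; check the descriptor reference for exact parameters:

```python
from evidently.descriptors import TextLength, Sentiment, SemanticSimilarity

# Attach row-level descriptors: each adds a new scored column to the dataset.
eval_dataset.add_descriptors(descriptors=[
    TextLength("answer", alias="Length"),      # number of characters in the answer
    Sentiment("answer", alias="Sentiment"),    # sentiment score of the answer
    # Two-column check: compares the answer against the retrieved context.
    SemanticSimilarity(columns=["answer", "context"], alias="Similarity"),
])

# Inspect the scored rows as a DataFrame.
print(eval_dataset.as_dataframe())
```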
To summarize the results, you generate a Report. The `TextEvals` Preset summarizes the scores from all text descriptors. Alternatively, you can choose the specific Metrics you want to include.
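A minimal sketch of that flow, reusing the dataset with descriptors from above:

```python
from evidently import Report
from evidently.presets import TextEvals

# Summarize all text descriptors attached to the dataset in one Report.
report = Report([TextEvals()])
my_eval = report.run(eval_dataset, None)  # second argument is an optional reference dataset

my_eval  # renders in a notebook; use my_eval.json() or my_eval.dict() to export
```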
Metrics range from simple column- or dataset-level statistics like `MeanValue` or `MissingValueCount` to complex algorithmic evals like `DriftedColumnsCount`.
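For example, a Report can be put together from individual Metrics instead of a Preset. The column names are illustrative, and drift checks assume you built a second reference `Dataset` the same way as the current one:

```python
from evidently import Report
from evidently.metrics import MeanValue, MissingValueCount, DriftedColumnsCount

# Pick individual dataset-level metrics instead of a Preset.
report = Report([
    MeanValue(column="Length"),          # average text length, from the descriptor added earlier
    MissingValueCount(column="answer"),  # missing values in a single column
    DriftedColumnsCount(),               # needs a reference dataset to compare distributions
])

# reference_dataset: an Evidently Dataset built from reference data (assumed to exist here).
my_eval = report.run(eval_dataset, reference_dataset)
```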
Each Metric computes a single value and has an optional visual representation (or several to choose from). For convenience, there are also small Presets that combine a handful of scores in a single widget. For example, `ValueStats` shows many relevant descriptive statistics for a column at once, `DatasetStats` gives a quick overview of all dataset-level stats, and `ClassificationQuality` computes multiple metrics like Precision, Recall, Accuracy, ROC AUC, etc.
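A sketch of combining such Presets in a Report, assuming `DatasetStats` and `ValueStats` are importable from `evidently.presets` in your Evidently version (check the metric reference for exact names and locations):

```python
from evidently import Report
from evidently.presets import DatasetStats, ValueStats  # assumed import path

report = Report([
    DatasetStats(),               # overview of dataset-level statistics
    ValueStats(column="Length"),  # descriptive statistics for one column
])
my_eval = report.run(eval_dataset, None)
```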
To turn any Metric into a pass/fail check, you can attach a Test condition, like greater than (`gt`) or less than (`lt`). By picking different Metrics to test against, you can formulate fine-grained conditions like “fewer than 10% of texts may fall outside the 10–100 character length range.”
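A sketch of attaching such conditions, assuming the `tests=` parameter on Metrics and the `gt`/`lt` helpers from `evidently.tests`; the column name is illustrative:

```python
from evidently import Report
from evidently.metrics import MinValue, MaxValue
from evidently.tests import gt, lt

# Attach pass/fail conditions to metrics: each becomes a Test in the Report.
report = Report([
    MinValue(column="Length", tests=[gt(10)]),   # fails if the shortest text is 10 characters or fewer
    MaxValue(column="Length", tests=[lt(100)]),  # fails if the longest text is 100 characters or more
])
my_eval = report.run(eval_dataset, None)
```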