Using canary
This guide covers everything from a first install to understanding the statistics behind each alert. No ML monitoring background is assumed — only that you have a model making predictions in production and you want to know when something goes wrong.
TL;DR: Wrap your model. Call .predict() as usual. Check the dashboard at localhost:8501. Yellow means watch it. Red means act.
Installation
pip install canary-ml
Installs canary-ml with its core dependencies: numpy, scipy, scikit-learn, and rich. Requires Python 3.9–3.12.
Keras/TensorFlow model monitoring requires an additional install:
pip install canary-ml[keras]
If you pass a Keras model without TensorFlow installed, canary-ml raises a clear ImportError with the install command.
Quickstart
Three things happen when you wrap a model: canary learns the reference distribution from your training data, writes a baseline snapshot to disk, and fits an anomaly detector. After that, every .predict() call silently runs the tests.
from canary_ml import ModelMonitor
# 1. Wrap once — happens at startup
monitor = ModelMonitor(
    model=your_model,
    reference_data=X_train,
    alert_threshold=0.2,
    log_path="./canary_logs",
    verbose=True,
)
# 2. Use exactly as before — monitoring is a side effect
predictions = monitor.predict(X_new)
# 3. In scripts: wait for the background thread before reading
monitor.wait()
# 4. Inspect the latest report
report = monitor.get_report()
print(report.summary())
# DriftReport | psi=0.41 | features_drifted=3/8 | anomaly_rate=3.2% | ALERT
# 5. Open the dashboard
monitor.serve_dashboard(port=8501)
The model is not modified. If monitoring fails for any reason (disk full, unexpected input shape), the exception is swallowed silently and your predict() call returns normally. Monitoring should never break production.
Configuration
| Parameter | Default | What it controls |
|---|---|---|
| model | required | Any object with a .predict() method |
| reference_data | required | Your training or validation data — the "normal" baseline |
| alert_threshold | 0.2 | PSI score above which a drift alert fires |
| performance_threshold | 0.05 | Accuracy drop (percentage points) below reference that fires a performance alert. Requires predict_proba. |
| anomaly_contamination | 0.05 | Expected fraction of anomalies; the anomaly-rate alert fires at 4× this value |
| categorical_threshold | 20 | Max unique values for a feature to be treated as categorical (chi² instead of KS) |
| store_samples | True | Set False to skip storing raw feature rows in the log (recommended in PII-sensitive environments) |
| log_path | "./canary_logs" | Directory for monitor.jsonl and reference.json |
| verbose | True | Print a rich alert panel to stdout when an alert fires |
| on_alert | None | Callback function called with the DriftReport on alert |
Choosing alert_threshold
The threshold controls how sensitive canary is to input distribution changes. The standard PSI interpretation is: below 0.1, no significant shift; 0.1–0.2, moderate shift worth watching; above 0.2, significant shift.
The default of 0.2 is a widely used rule of thumb. For high-stakes models (fraud, medical), lower it to 0.1. For noisy data with natural variation, raise it to 0.25 to reduce false alarms.
Data flow
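Every call to .predict(X) triggers the monitoring pipeline in the background: PSI, the per-feature KS or chi² tests, anomaly detection, and the confidence estimate are computed for the batch, the result is appended to monitor.jsonl, and an alert fires if a threshold is crossed.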
The whole pipeline runs in a try/except block — if anything fails, your prediction still returns. Monitoring is strictly a side effect.
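To make the "strictly a side effect" idea concrete, here is a minimal sketch of the pattern, not canary's actual implementation. The class SafeMonitorSketch, the check callable, and the errors counter are all hypothetical names invented for this illustration.

```python
import threading


class SafeMonitorSketch:
    """Hypothetical wrapper: predictions return even if monitoring fails."""

    def __init__(self, model, check):
        self.model = model
        self.check = check      # monitoring function; allowed to raise
        self.errors = 0         # number of swallowed monitoring failures
        self._thread = None

    def _run_check(self, X):
        try:
            self.check(X)
        except Exception:
            # Swallow everything: monitoring must never break predict().
            self.errors += 1

    def predict(self, X):
        predictions = self.model.predict(X)
        # Monitoring runs in a background thread, after the prediction is ready.
        self._thread = threading.Thread(target=self._run_check, args=(X,), daemon=True)
        self._thread.start()
        return predictions

    def wait(self):
        # Block until the most recent monitoring run has finished.
        if self._thread is not None:
            self._thread.join()


class DoubleModel:
    def predict(self, X):
        return [2 * x for x in X]


def broken_check(X):
    raise RuntimeError("disk full")


wrapper = SafeMonitorSketch(DoubleModel(), broken_check)
result = wrapper.predict([1, 2, 3])
wrapper.wait()
```

Even though broken_check raises on every call, result still holds the model output; the failure only shows up in the errors counter.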
PSI — Global drift score
The Population Stability Index measures how much the overall distribution of your data has shifted since training. It's the single number that drives the alert threshold.
The intuition
Imagine splitting your training data into 10 buckets by value range, then checking whether new data falls into those buckets in the same proportions. If 30% of training samples fell into bucket 3, but now only 5% do, something has shifted. PSI quantifies that shift across all buckets at once.
How it's calculated
For each of n quantile-based bins, the formula is:
PSI = Σ (actual% − expected%) × ln(actual% / expected%)
Where "expected" is the reference distribution and "actual" is the current batch. Small differences multiply out to near-zero. Large differences grow quickly due to the log ratio.
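The formula can be sketched in a few lines of NumPy. This is an illustrative single-feature version under stated assumptions: the function name psi, the default of 10 bins, and the eps clipping for empty bins are choices made for this example, and how canary combines features into one global score is not shown here.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index for one feature, using quantile bins."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges come from reference quantiles, so each bin holds ~1/n_bins of it.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only; values outside the reference range fall in the end bins.
    inner = edges[1:-1]
    expected = np.bincount(
        np.searchsorted(inner, reference, side="right"), minlength=n_bins
    ) / reference.size
    actual = np.bincount(
        np.searchsorted(inner, current, side="right"), minlength=n_bins
    ) / current.size
    # Clip to avoid log(0) and division by zero for empty bins (eps is an assumption).
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # same distribution as training
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
```

Each bin's term is non-negative, because (actual − expected) and ln(actual / expected) always share the same sign, so shifts in different bins cannot cancel each other out.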
What it means in practice
A PSI alert doesn't mean your model is wrong — it means the inputs it's seeing are different from what it was trained on. This can happen naturally (seasonality, new customer segments) or signal a real problem (data pipeline bug, population shift). The dashboard's feature heatmap helps you identify which features are driving the score.
Example: A fraud detection model trained on 2023 transaction data may see PSI spike in December due to holiday spending patterns. The model might still be accurate — but you should verify, because the inputs are genuinely different.
KS test — Per-feature drift
While PSI gives a global score, the Kolmogorov-Smirnov test runs independently on each continuous feature and tells you exactly which ones have drifted.
The intuition
The KS test asks: "Could these two sets of values have come from the same distribution?" It compares the cumulative distribution functions of the reference data and the current batch, and measures the maximum gap between them.
KS statistic = max|F_reference(x) − F_current(x)|
A statistic of 0 means identical distributions. A statistic of 1 means completely non-overlapping. Values above roughly 0.1–0.15 with a p-value below 0.05 indicate drift.
Reading KS results
| Attribute | Type | What it means |
|---|---|---|
| statistic | float [0–1] | Maximum gap between CDFs. Higher = more drift. |
| p_value | float [0–1] | Probability of seeing a gap at least this large if both samples came from the same distribution. Below 0.05 = statistically significant. |
| drifted | bool | True when p_value < 0.05 |
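As an illustration of how such per-feature results can be produced, the sketch below uses scipy.stats.ks_2samp. The helper name ks_feature_drift and the alpha parameter are assumptions for this example; the result keys mirror the ks_results entries shown in the log format.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_feature_drift(reference, current, alpha=0.05):
    """Run a two-sample KS test per column; returns {column_index: result}."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    results = {}
    for j in range(reference.shape[1]):
        res = ks_2samp(reference[:, j], current[:, j])
        results[str(j)] = {
            "statistic": float(res.statistic),
            "p_value": float(res.pvalue),
            "drifted": bool(res.pvalue < alpha),
        }
    return results


rng = np.random.default_rng(42)
X_ref = rng.normal(size=(2000, 2))
X_cur = rng.normal(size=(2000, 2))
X_cur[:, 0] += 1.0   # shift feature 0 only; feature 1 is left untouched

ks_results = ks_feature_drift(X_ref, X_cur)
```

Feature "0" comes out with a large statistic and a tiny p-value, while feature "1" keeps a small statistic because its distribution did not change.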
How the heatmap uses KS
Each cell in the Feature Drift Map shows the KS statistic for one feature at one batch, colored from green (low, stable) to red (high, drifted). This lets you see which features started drifting first and whether drift is recovering.
Tip: A feature with a high KS statistic but stable PSI might be a minor outlier. A feature driving both high KS and high PSI is where you should focus first.
Chi² test — Categorical features
For categorical features (those with 20 or fewer unique values), canary uses the chi-squared test instead of KS. The KS test requires continuous values; chi² works on counts.
The intuition
If your reference data had 60% class A and 40% class B, and your current batch has 20% A and 80% B, the chi² test detects that imbalance. It compares observed category frequencies against expected ones.
When it fires
Like the KS test, chi² uses a p-value threshold of 0.05. A feature is flagged as drifted when the probability of seeing the current category distribution by chance — if nothing had changed — falls below 5%.
Categorical threshold: Features with ≤ 20 unique values get chi² automatically. Features with more unique values are treated as continuous and use KS.
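The 60/40 versus 20/80 example above can be checked with scipy.stats.chi2_contingency. This is a sketch under assumptions: the helper name chi2_feature_drift is hypothetical, and canary may build its contingency table differently.

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency


def chi2_feature_drift(reference, current, alpha=0.05):
    """Chi-squared test on category counts from two samples."""
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    categories = sorted(set(ref_counts) | set(cur_counts))
    # Rows: reference and current. Columns: one per category.
    table = np.array([
        [ref_counts.get(c, 0) for c in categories],
        [cur_counts.get(c, 0) for c in categories],
    ])
    statistic, p_value, _dof, _expected = chi2_contingency(table)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drifted": bool(p_value < alpha),
    }


reference = ["A"] * 600 + ["B"] * 400   # 60% A, 40% B
current = ["A"] * 40 + ["B"] * 160      # 20% A, 80% B
stable = ["A"] * 120 + ["B"] * 80       # same 60/40 split as the reference

shifted_result = chi2_feature_drift(reference, current)
stable_result = chi2_feature_drift(reference, stable)
```

The shifted batch is flagged, and the batch with the same proportions as the reference is not.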
Anomaly detection
Drift tests compare distributions across a batch. Anomaly detection works at the sample level — it flags individual inputs that look unusual compared to training data.
The ensemble
canary runs two detectors and flags any sample caught by either:
Isolation Forest
Isolation Forest builds a set of random decision trees, each attempting to "isolate" individual samples by randomly splitting on features and values. Anomalies are isolated in very few splits — they're outliers that stand apart from the normal cloud. Normal samples require many splits to isolate because they blend in with the majority.
The detector is fitted once on reference_data at startup. Every new batch is scored against it. Samples below the anomaly score threshold are flagged.
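The fit-once, score-every-batch flow looks roughly like this with scikit-learn's IsolationForest. The data is synthetic and the contamination value simply reuses the documented default of anomaly_contamination; treat it as a sketch, not canary's internal code.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))

# Fit once on the reference data.
detector = IsolationForest(contamination=0.05, random_state=0)
detector.fit(X_train)

# A new batch: 95 ordinary rows plus 5 rows far outside the training cloud.
X_new = np.vstack([
    rng.normal(size=(95, 4)),
    np.full((5, 4), 8.0),
])

labels = detector.predict(X_new)   # +1 = inlier, -1 = anomaly
flagged = labels == -1
anomaly_rate = float(flagged.mean())
```

The five extreme rows at the end of the batch are all flagged. Some ordinary rows may be flagged too, since contamination=0.05 places the score threshold so that about 5% of the training data falls below it.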
Z-score detector
Simpler and faster: for each feature, compute how many standard deviations the value lies from the training mean. Any sample in which at least one feature has |z| > 3 is flagged. In a normal distribution, only about 0.3% of values fall beyond this threshold.
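A z-score detector of this kind fits in a few lines of NumPy. The function name zscore_flags and the guard for constant features are assumptions made for this sketch.

```python
import numpy as np


def zscore_flags(reference, batch, z_max=3.0):
    """Flag rows where any feature lies more than z_max std devs from the reference mean."""
    reference = np.asarray(reference, dtype=float)
    batch = np.asarray(batch, dtype=float)
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard against constant features (assumption)
    z = np.abs((batch - mean) / std)
    return (z > z_max).any(axis=1)


reference = np.array([[0.0, 10.0], [1.0, 12.0], [2.0, 14.0], [3.0, 16.0], [4.0, 18.0]])
batch = np.array([
    [2.0, 14.0],   # exactly at the reference mean
    [2.0, 40.0],   # second feature is about 9 standard deviations away
])
flags = zscore_flags(reference, batch)
```

Because a sample is flagged when any one feature crosses the limit, the per-sample flag rate grows with the number of features: with 8 independent, normally distributed features, roughly 2% of perfectly normal samples would be flagged.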
The anomaly rate
The rate shown in the dashboard is the fraction of samples in the current batch flagged by either detector. A healthy rate is below 2–3%. A rate above 5–10% often indicates a real data quality problem or distribution shift severe enough that individual samples are "foreign" to the model.
High anomaly rate ≠ high PSI, necessarily. You can have a stable overall distribution (low PSI) but a small number of very extreme outliers (high anomaly rate). Both matter, but they signal different problems.
Confidence estimate — Label-free performance estimation
Drift tests tell you the inputs have changed. But the question that matters most is: is the model still making accurate predictions? Normally you can only answer that after collecting ground-truth labels — which can take days or weeks in production.
canary's confidence estimate answers it immediately, using nothing but the model's predicted probabilities. It is most accurate when probabilities are well-calibrated; if your model is overconfident, treat the absolute values as a directional signal — the delta is what matters.
The key insight
If a model predicts class A with 90% confidence, it is — assuming well-calibrated probabilities — right about 90% of the time on those predictions. For any sample, the probability of a correct prediction equals max(p_i) across all classes, because you always predict the most confident class.
# For each sample, expected accuracy contribution = max probability
# Binary: max(p, 1-p) → P(correct)
# Multi-class: max(p_class1, p_class2, ...) → P(correct)
E[accuracy] = mean( max(probas, axis=1) ) # average over batch
This works identically for binary and multi-class classifiers. It requires no labels. It produces an estimate on every batch, in real time.
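As a worked example, the estimate for a small batch can be computed by hand with NumPy. The function name estimated_accuracy and the handling of a 1-D array of positive-class probabilities are assumptions for this sketch.

```python
import numpy as np


def estimated_accuracy(probas):
    """Mean of the per-sample maximum class probability."""
    probas = np.asarray(probas, dtype=float)
    if probas.ndim == 1:
        # Binary case given as P(class 1) only: rebuild both columns.
        probas = np.column_stack([1.0 - probas, probas])
    return float(probas.max(axis=1).mean())


probas = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.5, 0.5],
])
estimate = estimated_accuracy(probas)          # (0.9 + 0.8 + 0.6 + 0.5) / 4
binary_estimate = estimated_accuracy([0.9, 0.2])   # (0.9 + 0.8) / 2
```

The batch of four samples gives an estimate of 0.70. The last row, a 50/50 prediction, pulls the estimate down the most.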
Reference vs current
At monitor creation, canary runs predict_proba(reference_data) once and stores the resulting estimate as reference accuracy — the model's expected performance on its training distribution. On every subsequent .predict() call, it recomputes the estimate and tracks the delta.
The performance alert threshold
The performance_threshold parameter (default 0.05) sets how large a drop triggers an alert. A value of 0.05 means: alert when estimated accuracy has fallen more than 5 percentage points below reference. You can lower it for high-stakes models.
monitor = ModelMonitor(
    model=clf,
    reference_data=X_train,
    performance_threshold=0.03,  # alert at 3pp drop — more sensitive
)
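The comparison behind the alert can be pictured as a simple difference check. The helper below is a hypothetical sketch of that logic, with accuracies expressed as fractions so that 0.05 corresponds to 5 percentage points.

```python
def performance_alert(reference_accuracy, current_accuracy, performance_threshold=0.05):
    """Return (alert, drop): alert is True when the drop exceeds the threshold."""
    drop = reference_accuracy - current_accuracy
    return drop > performance_threshold, drop


alert_large_drop, _ = performance_alert(0.88, 0.74)                         # 14pp drop
alert_small_drop, _ = performance_alert(0.88, 0.86)                         # 2pp drop
alert_sensitive, _ = performance_alert(0.88, 0.84, performance_threshold=0.03)  # 4pp drop
```

With the default threshold, only the 14-point drop alerts. The 4-point drop alerts only once the threshold is lowered to 0.03.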
The calibration caveat
The estimate is most accurate when your model's probabilities are well-calibrated, i.e., when a prediction of 80% confidence really is correct ~80% of the time. Logistic regression and explicitly calibrated models (such as calibrated SVMs) are usually reasonably calibrated out of the box. Tree ensembles (random forests, gradient boosting) and neural networks are often less so. Use sklearn.calibration.CalibratedClassifierCV if you need precise absolute values.
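A minimal calibration sketch with scikit-learn follows. The synthetic dataset, the choice of gradient boosting as the base model, and the isotonic method with 3 folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Wrap a tree ensemble so its probabilities are recalibrated via cross-validation.
base = GradientBoostingClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X, y)

probas = calibrated.predict_proba(X[:5])
row_sums = probas.sum(axis=1)
```

The calibrated object exposes both .predict() and .predict_proba(), so it should be usable wherever the uncalibrated classifier was, including as the model argument when wrapping.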
What this means in practice: Even with a poorly calibrated model, the estimate reliably tracks relative changes — a drop from 0.88 to 0.74 is a real signal even if the absolute values are off. The delta matters more than the absolute number.
No predict_proba? The confidence estimate is silently skipped for models that don't implement predict_proba (most regressors, some classifiers). The Est. Accuracy card will show —. Everything else works normally.
When the confidence estimate catches what PSI misses
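The most valuable scenario: inputs look stable feature by feature, but performance is dropping. This can happen when new samples move toward the model's decision boundary in ways the per-feature tests miss, for example a fraud model facing evolved fraud patterns while transaction amounts and frequencies each remain similar. PSI stays low. KS stays green. But the confidence estimate drops because the model is less sure of its predictions. One limit to keep in mind: the estimate depends only on the model's outputs, so if the inputs are truly unchanged and only the true labels behave differently, it cannot see the change; ground-truth labels are still needed for that case.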
Stat cards
The four cards at the top of the dashboard show the most recent batch's key metrics at a glance.
| Card | What to look for |
|---|---|
| PSI Score | Your main alarm. Green below 0.1, yellow 0.1–0.2, red above 0.2. |
| KS Statistic | The highest KS value across all features. Number of features drifted is shown underneath. |
| Anomaly Rate | Percent of input samples flagged. Normal is under 2–3%. Above 10% needs attention. |
| Est. Accuracy | Confidence-based accuracy estimate (CBPE) and its delta vs. reference. Shows — if the model has no predict_proba. |
Batch size matters: Statistical tests need enough samples to be reliable. With fewer than ~30 samples, KS and chi² p-values can be noisy. If you're calling .predict() one sample at a time, consider buffering batches.
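One way to buffer is a small helper that collects single rows and releases them as a batch. The BatchBuffer class below is a hypothetical sketch, not part of canary-ml, and the minimum size of 30 follows the rule of thumb above.

```python
import numpy as np


class BatchBuffer:
    """Hypothetical helper: collect single rows and release them in batches."""

    def __init__(self, min_batch_size=30):
        self.min_batch_size = min_batch_size
        self._rows = []

    def add(self, row):
        """Add one sample; return a 2-D batch once enough rows accumulate, else None."""
        self._rows.append(np.asarray(row, dtype=float))
        if len(self._rows) < self.min_batch_size:
            return None
        batch = np.vstack(self._rows)
        self._rows = []
        return batch


buffer = BatchBuffer(min_batch_size=3)
released = [buffer.add([i, i * 2]) for i in range(4)]
```

In a service that answers one request at a time, you could serve each response from the underlying model directly and pass every released batch to monitor.predict() purely for monitoring, at the cost of one extra prediction pass per batch.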
Drift timeline
The timeline shows PSI score (yellow) and anomaly rate (violet) across the last 50 batches. It's your first signal that something is changing over time.
Patterns to watch for
- Gradual PSI rise — classic gradual data drift (inputs slowly moving away from the training distribution) or data pipeline degradation. Usually needs retraining.
- Sudden PSI spike — often a data engineering issue: a new data source, encoding change, or upstream schema change.
- Anomaly rate spike without PSI rise — a small number of very bad samples. Check your ingestion pipeline for corrupted records.
- Both rising together — the most serious pattern. Inputs are different AND individual samples are outliers.
Feature drift map
The heatmap shows KS statistics for each feature (rows) across recent batches (columns). It's the fastest way to pinpoint which features are causing drift and when it started.
Reading the colors
| Color | KS range | Meaning |
|---|---|---|
| Green | 0 – 0.05 | Stable. This feature looks like training data. |
| Yellow | 0.05 – 0.2 | Mild to moderate shift. Worth watching. |
| Red | 0.2+ | Severe drift. This feature is a likely driver of the PSI alert. |
Each row shows one feature's full history — nothing in the heatmap is selectable or reorderable. The Distribution Shift panel below always shows the three features with the highest current KS statistic.
Distribution shift
The distribution panel shows KDE (kernel density estimate) curves — smooth continuous approximations of where values tend to fall.
- Grey/dim curve — your baseline (training data). This stays fixed.
- Yellow curve — the most recent batch. You want this to overlap with the baseline.
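To get a feel for what the two curves represent, you can compute them yourself with scipy.stats.gaussian_kde. The synthetic data, the grid range, and the default bandwidth are assumptions here; the dashboard's own KDE settings may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, 500)   # stands in for the training data
current = rng.normal(1.5, 1.0, 500)    # latest batch, shifted to the right

# Evaluate both density estimates on a shared grid so they can be compared.
grid = np.linspace(-5.0, 7.0, 241)
baseline_density = gaussian_kde(baseline)(grid)
current_density = gaussian_kde(current)(grid)

# The peak location is a quick summary of where values concentrate.
baseline_peak = float(grid[np.argmax(baseline_density)])
current_peak = float(grid[np.argmax(current_density)])

# Area under the baseline curve, approximated with a Riemann sum (should be close to 1).
baseline_area = float(baseline_density.sum() * (grid[1] - grid[0]))
```

Plotting baseline_density and current_density against grid reproduces the grey and yellow curves for a single feature; here the current peak sits clearly to the right of the baseline peak.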
What to look for
A simple horizontal shift (mean moved right or left) suggests the feature values have scaled or changed range — possible encoding issue or data source change. A change in shape (one peak becoming two) suggests the model is now seeing a different mix of populations. A very flat current curve with narrow baseline suggests sparse or clipped new data.
The three rows shown are the features with the highest KS statistics in the latest batch — your most-drifted features at this moment.
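Selecting the most-drifted features from a set of per-feature results is a simple sort. The sketch below assumes a ks_results dictionary shaped like the one in the log format; the helper name top_drifted_features and the example values are illustrative.

```python
def top_drifted_features(ks_results, k=3):
    """Return the k feature keys with the highest KS statistic, highest first."""
    ranked = sorted(
        ks_results.items(),
        key=lambda item: item[1]["statistic"],
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


ks_results = {
    "0": {"statistic": 0.38, "p_value": 0.001, "drifted": True},
    "1": {"statistic": 0.04, "p_value": 0.88, "drifted": False},
    "2": {"statistic": 0.21, "p_value": 0.01, "drifted": True},
    "3": {"statistic": 0.12, "p_value": 0.04, "drifted": True},
}
top = top_drifted_features(ks_results)
```

For these example values, the result is features "0", "2", and "3", in that order.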
Alerts & callbacks
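canary fires an alert when PSI exceeds alert_threshold, when the anomaly rate reaches 4× anomaly_contamination, or when estimated accuracy drops more than performance_threshold below reference. By default a rich panel is printed to stdout (verbose=True). Set verbose=False to suppress console output, or pass on_alert for programmatic handling.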
import requests

def notify(report):
    # report is a DriftReport with all metrics attached
    requests.post(
        "https://hooks.slack.com/...",
        json={
            "text": f"Drift alert: PSI={report.psi_score:.2f}, "
                    f"{report.features_drifted} features drifted"
        },
        timeout=5,  # don't let a slow webhook hang the callback
    )

monitor = ModelMonitor(
    model=clf,
    reference_data=X_train,
    on_alert=notify,
)
The callback receives the full DriftReport. Available attributes:
| Attribute | Type | Description |
|---|---|---|
| psi_score | float | Global PSI value |
| drift_detected | bool | True if any feature's KS/chi² p < 0.05 |
| ks_results | dict | Per-feature {statistic, p_value, drifted} |
| features_drifted | int | Count of features with KS p < 0.05 |
| anomaly_rate | float | Fraction of samples flagged as anomalies |
| alert_triggered | bool | True if PSI > threshold, anomaly rate is high, or performance drops |
| alert_reasons | list | Which conditions fired: "drift", "anomaly", "performance" |
| timestamp | str | ISO 8601 timestamp |
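Because alert_reasons names the conditions that fired, a callback can route alerts to different destinations. In the sketch below the channel names are hypothetical, and a SimpleNamespace stands in for a real DriftReport so the example runs on its own.

```python
from types import SimpleNamespace


def route_alert(report):
    """Pick a destination per alert reason (channel names are hypothetical)."""
    routes = {
        "drift": "#ml-drift",
        "anomaly": "#data-quality",
        "performance": "#ml-oncall",
    }
    return [routes[reason] for reason in report.alert_reasons if reason in routes]


# Stand-in for a DriftReport, carrying only attributes listed in the table above.
fake_report = SimpleNamespace(
    psi_score=0.41,
    features_drifted=3,
    anomaly_rate=0.032,
    alert_triggered=True,
    alert_reasons=["drift", "performance"],
)
destinations = route_alert(fake_report)
```

A function like this could be called from inside your on_alert callback to decide where each notification goes.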
API
ModelMonitor
monitor = ModelMonitor(
    model,                        # required — .predict()-compatible
    reference_data,               # required — np.ndarray or pd.DataFrame
    feature_names=None,           # optional column names, used by the dashboard
    alert_threshold=0.2,          # PSI threshold for alert
    performance_threshold=0.05,   # accuracy drop (pp) that fires a perf alert
    anomaly_contamination=0.05,   # expected fraction of anomalies; alert at 4x
    categorical_threshold=20,     # max unique values to treat a feature as categorical
    store_samples=True,           # False to skip storing raw feature rows
    log_path="./canary_logs",     # where to write logs
    verbose=True,                 # default True — set False to suppress output
    on_alert=None,                # callable(DriftReport)
)
| Method | Returns | Description |
|---|---|---|
| .predict(X) | model output | Run prediction + monitoring. Monitoring never raises. |
| .predict_proba(X) | model output | Passthrough to model.predict_proba(); also feeds the confidence estimate. |
| .wait() | — | Block until background monitoring completes. Call before get_report() in scripts. |
| .get_report() | DriftReport or None | Latest monitoring report. None if no predictions yet. |
| .get_history(n=50) | list[dict] | Last n raw log entries. |
| .reset_baseline(new_data) | — | Replace the reference distribution and refit the anomaly detector. |
| .serve_dashboard(port=8501, host="127.0.0.1") | — | Start dashboard in a daemon thread. Non-blocking. Use host="0.0.0.0" to expose it beyond localhost — only if you understand the logged data may include raw features. |
Standalone server
The dashboard can be served independently from existing logs without a running model:
python -m canary_ml.server ./canary_logs 8501
Log format
canary writes two files to log_path:
monitor.jsonl
One JSON object per .predict() call, newline-delimited:
{
  "timestamp": "2026-06-07T14:23:01",
  "psi_score": 0.41,
  "drift_detected": true,
  "ks_results": {
    "0": {"statistic": 0.38, "p_value": 0.001, "drifted": true},
    "1": {"statistic": 0.04, "p_value": 0.88, "drifted": false}
  },
  "features_drifted": 1,
  "anomaly_rate": 0.032,
  "alert_triggered": true,
  "feature_sample": [[...], ...] // up to 500 rows of raw input
}
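Since the file is newline-delimited JSON, it can be read with the standard library alone. The sketch below assumes each line on disk is one plain JSON object (without the explanatory comment shown in the example above); the helper name read_psi_history is hypothetical, and the demo writes its own small file so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def read_psi_history(log_file, n=50):
    """Return (timestamp, psi_score) pairs for the last n log entries."""
    lines = Path(log_file).read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    return [(entry["timestamp"], entry["psi_score"]) for entry in entries[-n:]]


# Demo on a throwaway file with two entries.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "monitor.jsonl"
    log_file.write_text(
        '{"timestamp": "2026-06-07T14:23:01", "psi_score": 0.41}\n'
        '{"timestamp": "2026-06-07T14:24:01", "psi_score": 0.12}\n',
        encoding="utf-8",
    )
    history = read_psi_history(log_file, n=1)
```

Asking for the last entry returns only the most recent timestamp and PSI score.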
reference.json
Up to 500 rows of your reference data, used by the dashboard to render the baseline distribution curves. Written once at monitor creation. Format: array of arrays (rows × features).
Log rotation: canary appends forever. For long-running services, manage rotation externally (logrotate, a cron that truncates the oldest N lines, etc.). The dashboard always reads the most recent entries.
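If you prefer a small script over logrotate, trimming the file to its most recent lines could look like the sketch below. The helper keep_last_lines is hypothetical. It rewrites the file through a temporary copy, so run it while the service is idle: an entry appended during the rewrite could be lost.

```python
import os
import tempfile
from pathlib import Path


def keep_last_lines(path, max_lines):
    """Trim a text file to its last max_lines lines; return how many were removed."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    if len(lines) <= max_lines:
        return 0
    # Write the tail to a sibling file first, then swap it into place.
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    tmp_path.write_text("".join(lines[-max_lines:]), encoding="utf-8")
    os.replace(tmp_path, path)
    return len(lines) - max_lines


# Demo on a throwaway file with five entries.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "monitor.jsonl"
    log_file.write_text(
        "".join(f'{{"batch": {i}}}\n' for i in range(5)), encoding="utf-8"
    )
    removed = keep_last_lines(log_file, max_lines=2)
    remaining = log_file.read_text(encoding="utf-8").splitlines()
```

After the call, only the two newest entries remain in the file.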
Aitor Bazo · 2026 · MIT