canary — Guide & Manual

Using canary

This guide covers everything from a first install to understanding the statistics behind each alert. No ML monitoring background is assumed — only that you have a model making predictions in production and you want to know when something goes wrong.

TL;DR: Wrap your model. Call .predict() as usual. Check the dashboard at localhost:8501. Yellow means watch it. Red means act.

getting started

Installation

pip install canary-ml

Installs canary-ml with its core dependencies: numpy, scipy, scikit-learn, and rich. Requires Python 3.9–3.12.

Keras/TensorFlow model monitoring requires an additional install:

pip install canary-ml[keras]

If you pass a Keras model without TensorFlow installed, canary-ml raises a clear ImportError with the install command.

Quickstart

Three things happen when you wrap a model: canary learns the reference distribution from your training data, writes a baseline snapshot to disk, and fits an anomaly detector. After that, every .predict() call silently runs the tests.

from canary_ml import ModelMonitor

# 1. Wrap once — happens at startup
monitor = ModelMonitor(
    model=your_model,
    reference_data=X_train,
    alert_threshold=0.2,
    log_path="./canary_logs",
    verbose=True,
)

# 2. Use exactly as before — monitoring is a side effect
predictions = monitor.predict(X_new)

# 3. In scripts: wait for the background thread before reading
monitor.wait()

# 4. Inspect the latest report
report = monitor.get_report()
print(report.summary())
# DriftReport | psi=0.41 | features_drifted=3/8 | anomaly_rate=3.2% | ALERT

# 5. Open the dashboard
monitor.serve_dashboard(port=8501)

The model is not modified. If monitoring fails for any reason (disk full, unexpected input shape), the exception is swallowed silently and your predict() call returns normally. Monitoring should never break production.

Configuration

Parameter	Default	What it controls
`model`	required	Any object with a `.predict()` method
`reference_data`	required	Your training or validation data — the "normal" baseline
`alert_threshold`	`0.2`	PSI score above which a drift alert fires
`performance_threshold`	`0.05`	Drop in estimated accuracy below reference that fires a performance alert, expressed as a fraction (`0.05` = 5 percentage points). Requires `predict_proba`.
`anomaly_contamination`	`0.05`	Expected fraction of anomalies; the anomaly-rate alert fires at 4× this value
`categorical_threshold`	`20`	Max unique values for a feature to be treated as categorical (chi² instead of KS)
`store_samples`	`True`	Set `False` to skip storing raw feature rows in the log (recommended in PII-sensitive environments)
`log_path`	`"./canary_logs"`	Directory for `monitor.jsonl` and `reference.json`
`verbose`	`True`	Print a rich alert panel to stdout when an alert fires
`on_alert`	`None`	Callback function called with the `DriftReport` on alert

Choosing alert_threshold

The threshold controls how sensitive canary is to input distribution changes. The standard PSI interpretation is:

0 to 0.1: stable · 0.1 to 0.2: watch · above 0.2: alert (the scale runs to 0.5+)

The default of 0.2 follows the widely used PSI rule of thumb. For high-stakes models (fraud, medical), lower it to 0.1. For noisy data with natural variation, raise it to 0.25 to reduce false alarms.

how it works

Data flow

Every call to .predict(X) triggers this pipeline in the background:

input (X_new batch) → drift (PSI + KS / chi²) → anomaly (IsoForest + z-score) → report (DriftReport) → storage (monitor.jsonl)

The whole pipeline runs in a try/except block — if anything fails, your prediction still returns. Monitoring is strictly a side effect.
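The side-effect guarantee can be pictured with a minimal sketch. This is not canary's implementation: the class `SideEffectMonitor`, the `check` callable, and the `errors` counter are hypothetical stand-ins that show only the pattern of running checks in a background thread inside try/except.

```python
import threading


class SideEffectMonitor:
    """Hypothetical wrapper: predictions first, checks in the background."""

    def __init__(self, model, check):
        self.model = model
        self.check = check      # any callable that inspects a batch
        self.errors = 0         # how many checks failed (never re-raised)
        self._thread = None

    def predict(self, X):
        result = self.model.predict(X)
        self._thread = threading.Thread(target=self._run, args=(X,), daemon=True)
        self._thread.start()
        return result

    def _run(self, X):
        try:
            self.check(X)
        except Exception:
            self.errors += 1    # swallow the failure; the prediction is unaffected

    def wait(self):
        if self._thread is not None:
            self._thread.join()


class EchoModel:
    def predict(self, X):
        return list(X)


def broken_check(X):
    raise OSError("disk full")


monitor = SideEffectMonitor(EchoModel(), broken_check)
predictions = monitor.predict([1, 2, 3])   # returns normally despite the failing check
monitor.wait()
```

The prediction is computed before the check starts, so a failing check can only increment the counter.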

PSI — Global drift score

The Population Stability Index measures how much the overall distribution of your data has shifted since training. It's the single number that drives the alert threshold.

The intuition

Imagine splitting your training data into 10 buckets by value range, then checking whether new data falls into those buckets in the same proportions. If 30% of training samples fell into bucket 3, but now only 5% do, something has shifted. PSI quantifies that shift across all buckets at once.

How it's calculated

For each of n quantile-based bins, the formula is:

PSI = Σ (actual% − expected%) × ln(actual% / expected%)

Where "expected" is the reference distribution and "actual" is the current batch. Small differences multiply out to near-zero. Large differences grow quickly due to the log ratio.
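To make the formula concrete, here is a minimal sketch that scores a single feature with NumPy. It assumes 10 bins whose edges are quantiles of the reference data, and a small `eps` to avoid log(0) on empty bins. Both choices are assumptions for illustration, and how canary combines features into one global score is not shown here.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index for one feature (illustrative sketch)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges are quantiles of the reference data.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    # searchsorted maps each value to a bin index in 0..n_bins-1; values
    # outside the reference range fall into the first or last bin.
    expected = np.bincount(np.searchsorted(edges, reference), minlength=n_bins) / reference.size
    actual = np.bincount(np.searchsorted(edges, current), minlength=n_bins) / current.size
    # Clip so empty bins do not produce log(0) or a division by zero.
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)
psi_same = psi(train, rng.normal(0.0, 1.0, 5000))      # same distribution
psi_shifted = psi(train, rng.normal(1.0, 1.0, 5000))   # mean moved by one std
```

A fresh sample from the same distribution should score well under 0.1, while the shifted sample should land far above the 0.2 alert line.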

Status	PSI range	Meaning
Stable	< 0.1	Distribution is consistent. No action needed.
Moderate	0.1 – 0.2	Some shift. Worth investigating feature-by-feature.
Alert	> 0.2	Significant shift. Model performance may be degrading.

What it means in practice

A PSI alert doesn't mean your model is wrong — it means the inputs it's seeing are different from what it was trained on. This can happen naturally (seasonality, new customer segments) or signal a real problem (data pipeline bug, population shift). The dashboard's feature heatmap helps you identify which features are driving the score.

Example: A fraud detection model trained on 2023 transaction data may see PSI spike in December due to holiday spending patterns. The model might still be accurate — but you should verify, because the inputs are genuinely different.

KS test — Per-feature drift

While PSI gives a global score, the Kolmogorov-Smirnov test runs independently on each continuous feature and tells you exactly which ones have drifted.

The intuition

The KS test asks: "Could these two sets of values have come from the same distribution?" It compares the cumulative distribution functions of the reference data and the current batch, and measures the maximum gap between them.

KS statistic = max|F_reference(x) − F_current(x)|

A statistic of 0 means identical distributions. A statistic of 1 means completely non-overlapping. Values above roughly 0.1–0.15 with a p-value below 0.05 indicate drift.
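The statistic can be computed by hand from the two empirical CDFs and compared against `scipy.stats.ks_2samp`, which also supplies the p-value. This is a sketch on synthetic data; the helper name `ks_statistic` is made up for this example, and whether canary calls `ks_2samp` internally is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic(reference, current):
    """Largest vertical gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 2000)
current = rng.normal(0.5, 1.0, 500)      # mean shifted by half a standard deviation

manual = ks_statistic(reference, current)
result = ks_2samp(reference, current)    # same statistic, plus a p-value
drifted = result.pvalue < 0.05
```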

Reading KS results

Attribute	Type	What it means
`statistic`	float [0–1]	Maximum gap between CDFs. Higher = more drift.
`p_value`	float [0–1]	Probability of seeing a gap at least this large if both samples came from the same distribution. Below 0.05 = statistically significant.
`drifted`	bool	`True` when p_value < 0.05

How the heatmap uses KS

Each cell in the Feature Drift Map shows the KS statistic for one feature at one batch, colored from green (low, stable) to red (high, drifted). This lets you see which features started drifting first and whether drift is recovering.

Tip: A feature with a high KS statistic but stable PSI might be a minor outlier. A feature driving both high KS and high PSI is where you should focus first.

Chi² test — Categorical features

For categorical features (those with 20 or fewer unique values), canary uses the chi-squared test instead of KS. The KS test requires continuous values; chi² works on counts.

The intuition

If your reference data had 60% class A and 40% class B, and your current batch has 20% A and 80% B, the chi² test detects that imbalance. It compares observed category frequencies against expected ones.

When it fires

Like the KS test, chi² uses a p-value threshold of 0.05. A feature is flagged as drifted when the probability of seeing the current category distribution by chance — if nothing had changed — falls below 5%.

Categorical threshold: Features with ≤ 20 unique values get chi² automatically. Features with more unique values are treated as continuous and use KS.
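The 60/40 versus 20/80 example can be run through `scipy.stats.chi2_contingency`. This sketch uses a homogeneity test on a reference-by-current count table. That is one reasonable way to compare category frequencies, not necessarily the exact variant canary applies, and the helper name `chi2_drift` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_drift(reference, current, alpha=0.05):
    """Compare category counts of two samples; return (chi2, p_value, drifted)."""
    reference = np.asarray(reference)
    current = np.asarray(current)
    categories = np.union1d(reference, current)
    # Row 0: reference counts per category. Row 1: current counts.
    table = np.array([
        [np.sum(reference == c) for c in categories],
        [np.sum(current == c) for c in categories],
    ])
    chi2, p_value, _dof, _expected = chi2_contingency(table)
    return float(chi2), float(p_value), bool(p_value < alpha)


reference = np.array(["A"] * 600 + ["B"] * 400)   # 60% A, 40% B
current = np.array(["A"] * 40 + ["B"] * 160)      # 20% A, 80% B
chi2, p_value, drifted = chi2_drift(reference, current)
```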

Anomaly detection

Drift tests compare distributions across a batch. Anomaly detection works at the sample level — it flags individual inputs that look unusual compared to training data.

The ensemble

canary runs two detectors and flags any sample caught by either:

Isolation Forest

Isolation Forest builds a set of random decision trees, each attempting to "isolate" individual samples by randomly splitting on features and values. Anomalies are isolated in very few splits — they're outliers that stand apart from the normal cloud. Normal samples require many splits to isolate because they blend in with the majority.

The detector is fitted once on reference_data at startup. Every new batch is scored against it. Samples below the anomaly score threshold are flagged.
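A minimal sketch of the fit-once, score-every-batch pattern using scikit-learn's `IsolationForest`. The synthetic data and the `contamination=0.05` setting (mirroring the documented `anomaly_contamination` default) are assumptions for illustration; canary's internal settings may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X_reference = rng.normal(0.0, 1.0, size=(1000, 2))

# Fit once on the reference data, as at monitor startup.
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_reference)

X_batch = np.array([[0.1, -0.2],    # typical point
                    [8.0, 8.0]])    # far outside the training cloud
labels = detector.predict(X_batch)             # +1 = inlier, -1 = anomaly
scores = detector.decision_function(X_batch)   # lower = more anomalous
flagged = labels == -1
```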

Z-score detector

Simpler and faster: for each feature, compute how many standard deviations the value is from the training mean. Any sample in which at least one feature has |z| > 3 is flagged. For a normally distributed feature, only about 0.3% of values fall outside this range.
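The z-score rule fits in a few lines of NumPy. This is a sketch: the helper name `zscore_flags` and the guard for constant features are assumptions added for this example.

```python
import numpy as np


def zscore_flags(X_reference, X_batch, threshold=3.0):
    """Flag rows where any feature lies more than `threshold` stds from the mean."""
    mean = X_reference.mean(axis=0)
    std = X_reference.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # constant features: avoid dividing by zero
    z = np.abs((X_batch - mean) / std)
    return (z > threshold).any(axis=1)


rng = np.random.default_rng(3)
X_reference = rng.normal(0.0, 1.0, size=(1000, 3))
X_batch = np.array([[0.0, 0.0, 0.0],    # ordinary sample
                    [0.0, 5.0, 0.0]])   # second feature is about 5 stds out
flags = zscore_flags(X_reference, X_batch)
anomaly_rate = flags.mean()             # fraction of the batch that was flagged
```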

The anomaly rate

The rate shown in the dashboard is the fraction of samples in the current batch flagged by either detector. A healthy rate is below 2–3%. A rate above 5–10% often indicates a real data quality problem or distribution shift severe enough that individual samples are "foreign" to the model.

High anomaly rate ≠ high PSI, necessarily. You can have a stable overall distribution (low PSI) but a small number of very extreme outliers (high anomaly rate). Both matter, but they signal different problems.

Confidence estimate — Label-free performance estimation

Drift tests tell you the inputs have changed. But the question that matters most is: is the model still making accurate predictions? Normally you can only answer that after collecting ground-truth labels — which can take days or weeks in production.

canary's confidence estimate answers it immediately, using nothing but the model's predicted probabilities. It is most accurate when probabilities are well-calibrated; if your model is overconfident, treat the absolute values as a directional signal — the delta is what matters.

The key insight

If a model predicts class A with 90% confidence, it is — assuming well-calibrated probabilities — right about 90% of the time on those predictions. For any sample, the probability of a correct prediction equals max(p_i) across all classes, because you always predict the most confident class.

# For each sample, expected accuracy contribution = max probability
# Binary:     max(p, 1-p)                  → P(correct)
# Multi-class: max(p_class1, p_class2, ...) → P(correct)

E[accuracy] = mean( max(probas, axis=1) )   # average over batch

This works identically for binary and multi-class classifiers. It requires no labels. It produces an estimate on every batch, in real time.
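As a runnable version of the formula above, the sketch below averages the per-sample maximum probability over a tiny batch. The function name `estimated_accuracy` and the probabilities are made up for this example.

```python
import numpy as np


def estimated_accuracy(probas):
    """Mean of the highest class probability per sample."""
    probas = np.asarray(probas, dtype=float)
    return float(np.mean(np.max(probas, axis=1)))


# Three binary predictions, as predict_proba would return them.
probas = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
estimate = estimated_accuracy(probas)   # mean of 0.9, 0.6 and 0.8
```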

Reference vs current

At monitor creation, canary runs predict_proba(reference_data) once and stores the resulting estimate as reference accuracy — the model's expected performance on its training distribution. On every subsequent .predict() call, it recomputes the estimate and tracks the delta.

Metric	Example value	Notes
Reference accuracy	92.1%	Baseline, computed once at init.
Current estimate	86.4%	Latest batch, recomputed on each predict call.
Delta	−5.7pp	Alert fires at −5pp by default.

The performance alert threshold

The performance_threshold parameter (default 0.05) sets how large a drop triggers an alert. A value of 0.05 means: alert when estimated accuracy has fallen more than 5 percentage points below reference. You can lower it for high-stakes models.

monitor = ModelMonitor(
    model=clf,
    reference_data=X_train,
    performance_threshold=0.03,   # alert at 3pp drop — more sensitive
)
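The threshold comparison itself is simple arithmetic. The sketch below uses the example figures from this section (92.1% reference, 86.4% current). The function name `performance_alert` is hypothetical, and the strict "more than" comparison is inferred from the wording above.

```python
def performance_alert(reference_accuracy, current_estimate, threshold=0.05):
    """Return (delta, alert) where alert is True if the drop exceeds threshold."""
    delta = current_estimate - reference_accuracy
    return delta, delta < -threshold


delta, alert = performance_alert(0.921, 0.864)       # drop of 5.7 pp, above the 5 pp default
_, small_drop_alert = performance_alert(0.921, 0.90) # drop of 2.1 pp, below it
```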

The calibration caveat

The estimate is most accurate when your model's probabilities are well-calibrated, that is, when a prediction of 80% confidence really is correct ~80% of the time. Logistic regression is usually reasonably calibrated out of the box, and SVMs can be wrapped in a calibrator. Tree ensembles (random forests, gradient boosting) and neural networks are often less so. Use sklearn.calibration.CalibratedClassifierCV if you need precise absolute values.
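One way to calibrate a classifier before wrapping it is sketched below on synthetic data. The dataset, the random forest, and the `method="sigmoid"` and `cv=3` settings are illustrative choices, not recommendations from canary.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the uncalibrated model; calibration is learned with 3-fold cross-validation.
base = RandomForestClassifier(n_estimators=25, random_state=0)
clf = CalibratedClassifierCV(base, method="sigmoid", cv=3)
clf.fit(X_train, y_train)

probas = clf.predict_proba(X_test)   # calibrated probabilities, one row per sample
```

The calibrated `clf` still exposes `.predict()` and `.predict_proba()`, so it can be passed as `model` to `ModelMonitor` in place of the raw classifier.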

What this means in practice: Even with a poorly calibrated model, the estimate reliably tracks relative changes — a drop from 0.88 to 0.74 is a real signal even if the absolute values are off. The delta matters more than the absolute number.

No predict_proba? The confidence estimate is silently skipped for models that don't implement predict_proba (most regressors, some classifiers). The Est. Accuracy card will show —. Everything else works normally.

When the confidence estimate catches what PSI misses

The most valuable scenario: inputs look stable but performance is dropping. This happens when the relationship between inputs and outputs has changed — for example, a fraud model whose fraud patterns have evolved while the transaction amounts and frequencies remain similar. PSI stays low. KS stays green. But the confidence estimate catches the performance cliff.

reading the dashboard

Stat cards

The four cards at the top of the dashboard show the most recent batch's key metrics at a glance.

Card	What to look for
PSI Score	Your main alarm. Green below 0.1, yellow 0.1–0.2, red above 0.2.
KS Statistic	The highest KS value across all features. Number of features drifted is shown underneath.
Anomaly Rate	Percent of input samples flagged. Normal is under 2–3%. Above 10% needs attention.
Est. Accuracy	Confidence-based accuracy estimate (CBPE) and its delta vs. reference. Shows `—` if the model has no `predict_proba`.

Batch size matters: Statistical tests need enough samples to be reliable. With fewer than ~30 samples, KS and chi² p-values can be noisy. If you're calling .predict() one sample at a time, consider buffering batches.
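If your service receives one sample at a time, a small buffer can collect rows and call predict once enough have arrived. The sketch below is a hypothetical helper, not part of canary. `fake_predict` stands in for `monitor.predict` so the example runs on its own, and it assumes your application can wait for a full batch before receiving predictions.

```python
import numpy as np


class BatchBuffer:
    """Hypothetical helper: collect single rows, predict in batches."""

    def __init__(self, predict_fn, batch_size=100):
        self.predict_fn = predict_fn
        self.batch_size = batch_size
        self._rows = []

    def add(self, row):
        """Queue one row; return predictions when a full batch is flushed, else None."""
        self._rows.append(row)
        if len(self._rows) >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        """Predict on whatever is queued and empty the buffer."""
        if not self._rows:
            return None
        batch = np.asarray(self._rows)
        self._rows = []
        return self.predict_fn(batch)


# Stand-in for monitor.predict, so the sketch runs without canary.
def fake_predict(batch):
    return batch.sum(axis=1)


buffer = BatchBuffer(fake_predict, batch_size=3)
outputs = [buffer.add([i, i]) for i in range(3)]   # third call flushes the batch
```

If predictions cannot be delayed, an alternative is to answer each request from the raw model and pass the buffered rows through the monitor separately, at the cost of running inference twice.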

Drift timeline

The timeline shows PSI score (yellow) and anomaly rate (violet) across the last 50 batches. It's your first signal that something is changing over time.

Patterns to watch for

Gradual PSI rise: classic data drift (a slowly shifting input distribution) or data pipeline degradation. Usually needs retraining.
Sudden PSI spike: often a data engineering issue, such as a new data source, encoding change, or upstream schema change.
Anomaly rate spike without PSI rise: a small number of very bad samples. Check your ingestion pipeline for corrupted records.
Both rising together: the most serious pattern. Inputs are different AND individual samples are outliers.

Feature drift map

The heatmap shows KS statistics for each feature (rows) across recent batches (columns). It's the fastest way to pinpoint which features are causing drift and when it started.

Reading the colors

Color	KS range	Meaning
Green	0 – 0.05	Stable. This feature looks like training data.
Yellow	0.05 – 0.2	Mild to moderate shift. Worth watching.
Red	0.2+	Severe drift. This feature is a likely driver of the PSI alert.

Each row shows one feature's full history — nothing in the heatmap is selectable or reorderable. The Distribution Shift panel below always shows the three features with the highest current KS statistic.
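The same reading can be reproduced from a report's `ks_results` dictionary. The sketch below ranks features by KS statistic and applies the color bands from the table. The helper names, the handling of values exactly on a boundary, and the sample values for features "2" and "3" are assumptions for illustration.

```python
def ks_color(statistic):
    """Map a KS statistic to the heatmap color band."""
    if statistic < 0.05:
        return "green"
    if statistic < 0.2:
        return "yellow"
    return "red"


def top_drifted(ks_results, k=3):
    """Return (feature, statistic, color) for the k highest KS statistics."""
    ranked = sorted(ks_results.items(), key=lambda item: item[1]["statistic"], reverse=True)
    return [(name, res["statistic"], ks_color(res["statistic"])) for name, res in ranked[:k]]


ks_results = {
    "0": {"statistic": 0.38, "p_value": 0.001, "drifted": True},
    "1": {"statistic": 0.04, "p_value": 0.88, "drifted": False},
    "2": {"statistic": 0.12, "p_value": 0.03, "drifted": True},
    "3": {"statistic": 0.21, "p_value": 0.002, "drifted": True},
}
top = top_drifted(ks_results)
```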

Distribution shift

The distribution panel shows KDE (kernel density estimate) curves — smooth continuous approximations of where values tend to fall.

Grey/dim curve — your baseline (training data). This stays fixed.
Yellow curve — the most recent batch. You want this to overlap with the baseline.
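To see what the two curves represent, you can compute them yourself with `scipy.stats.gaussian_kde`. This is a sketch on synthetic data; the dashboard's own KDE method and bandwidth are not documented here, so treat the choice of `gaussian_kde` as an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
baseline = rng.normal(0.0, 1.0, 500)   # stands in for one feature of the training data
latest = rng.normal(2.0, 1.0, 500)     # same feature in the latest batch, shifted right

# Evaluate both density estimates on a shared grid so the curves are comparable.
grid = np.linspace(-4.0, 6.0, 201)
baseline_density = gaussian_kde(baseline)(grid)
latest_density = gaussian_kde(latest)(grid)

baseline_peak = grid[np.argmax(baseline_density)]
latest_peak = grid[np.argmax(latest_density)]
```

Here the latest curve peaks to the right of the baseline, which is the simple horizontal shift case.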

What to look for

A simple horizontal shift (mean moved right or left) suggests the feature values have scaled or changed range — possible encoding issue or data source change. A change in shape (one peak becoming two) suggests the model is now seeing a different mix of populations. A very flat current curve with narrow baseline suggests sparse or clipped new data.

The three rows shown are the features with the highest KS statistics in the latest batch — your most-drifted features at this moment.

advanced

Alerts & callbacks

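canary fires an alert when PSI exceeds alert_threshold, when the anomaly rate reaches 4× anomaly_contamination, or when estimated accuracy drops by more than performance_threshold. By default a rich panel is printed to stdout (verbose=True). Set verbose=False to suppress console output, or pass on_alert for programmatic handling.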

import requests

def notify(report):
    # report is a DriftReport with all metrics attached
    requests.post(
        "https://hooks.slack.com/...",
        json={
            "text": f"Drift alert: PSI={report.psi_score:.2f}, "
                    f"{report.features_drifted} features drifted"
        },
        timeout=5,  # keep a slow webhook from stalling the callback
    )

monitor = ModelMonitor(
    model=clf,
    reference_data=X_train,
    on_alert=notify,
)

The callback receives the full DriftReport. Available attributes:

Attribute	Type	Description
`psi_score`	float	Global PSI value
`drift_detected`	bool	True if any feature's KS/chi² p < 0.05
`ks_results`	dict	Per-feature `{statistic, p_value, drifted}`
`features_drifted`	int	Count of features with KS/chi² p < 0.05
`anomaly_rate`	float	Fraction of samples flagged as anomalies
`alert_triggered`	bool	True if PSI > threshold, anomaly rate is high, or performance drops
`alert_reasons`	list	Which conditions fired: `"drift"`, `"anomaly"`, `"performance"`
`timestamp`	str	ISO 8601 timestamp
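A callback can branch on `alert_reasons` to handle each condition differently. The sketch below only logs; `route_alert` and the message formats are hypothetical, and a `SimpleNamespace` stands in for a real `DriftReport` so the example runs without canary.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("canary_alerts")


def route_alert(report):
    """Build and log one message per alert reason; return the messages."""
    messages = []
    for reason in report.alert_reasons:
        if reason == "drift":
            messages.append(
                f"drift: PSI={report.psi_score:.2f}, {report.features_drifted} features drifted"
            )
        elif reason == "anomaly":
            messages.append(f"anomaly: rate={report.anomaly_rate:.1%}")
        elif reason == "performance":
            messages.append("performance: estimated accuracy dropped")
    for message in messages:
        logger.warning("[%s] %s", report.timestamp, message)
    return messages


# Stand-in object carrying the documented attributes.
fake_report = SimpleNamespace(
    psi_score=0.41, features_drifted=3, anomaly_rate=0.032,
    alert_reasons=["drift"], timestamp="2026-06-07T14:23:01",
)
messages = route_alert(fake_report)
```

In real use you would pass the function as `on_alert=route_alert` when creating the monitor.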

API

ModelMonitor

monitor = ModelMonitor(
    model,                      # required — .predict()-compatible
    reference_data,             # required — np.ndarray or pd.DataFrame
    feature_names=None,         # optional column names, used by the dashboard
    alert_threshold=0.2,        # PSI threshold for alert
    performance_threshold=0.05, # accuracy drop that fires a perf alert (0.05 = 5 pp)
    anomaly_contamination=0.05, # expected fraction of anomalies; alert at 4x
    categorical_threshold=20,   # max unique values to treat a feature as categorical
    store_samples=True,         # False to skip storing raw feature rows
    log_path="./canary_logs",   # where to write logs
    verbose=True,               # default True — set False to suppress output
    on_alert=None,              # callable(DriftReport)
)

Method	Returns	Description
`.predict(X)`	model output	Run prediction + monitoring. Monitoring never raises.
`.predict_proba(X)`	model output	Passthrough to `model.predict_proba()`; also feeds the confidence estimate.
`.wait()`	—	Block until background monitoring completes. Call before `get_report()` in scripts.
`.get_report()`	`DriftReport | None`	Latest monitoring report. None if no predictions yet.
`.get_history(n=50)`	`list[dict]`	Last n raw log entries.
`.reset_baseline(new_data)`	—	Replace the reference distribution and refit the anomaly detector.
`.serve_dashboard(port=8501, host="127.0.0.1")`	—	Start dashboard in a daemon thread. Non-blocking. Use `host="0.0.0.0"` to expose it beyond localhost — only if you understand the logged data may include raw features.

Standalone server

The dashboard can be served independently from existing logs without a running model:

python -m canary_ml.server ./canary_logs 8501

Log format

canary writes two files to log_path:

monitor.jsonl

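One JSON object per .predict() call, newline-delimited. Each entry occupies a single line in the file; the example below is pretty-printed and annotated for readability: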

{
  "timestamp": "2026-06-07T14:23:01",
  "psi_score": 0.41,
  "drift_detected": true,
  "ks_results": {
    "0": {"statistic": 0.38, "p_value": 0.001, "drifted": true},
    "1": {"statistic": 0.04, "p_value": 0.88,  "drifted": false}
  },
  "features_drifted": 1,
  "anomaly_rate": 0.032,
  "alert_triggered": true,
  "feature_sample": [[...], ...]  // up to 500 rows of raw input
}
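Because the file is newline-delimited JSON, it can be read with the standard library alone. The sketch below parses two made-up entries from an in-memory buffer; the helper `read_entries` is hypothetical, and the entries carry only a subset of the fields shown above.

```python
import io
import json


def read_entries(lines):
    """Parse newline-delimited JSON, skipping blank lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            entries.append(json.loads(line))
    return entries


# Two made-up entries standing in for ./canary_logs/monitor.jsonl
sample = io.StringIO(
    '{"timestamp": "2026-06-07T14:23:01", "psi_score": 0.41, "alert_triggered": true}\n'
    '{"timestamp": "2026-06-07T14:24:01", "psi_score": 0.08, "alert_triggered": false}\n'
)
entries = read_entries(sample)
alerts = [e["timestamp"] for e in entries if e["alert_triggered"]]
max_psi = max(e["psi_score"] for e in entries)
```

For a real log, open the file and pass the file object to `read_entries` in place of `sample`.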

reference.json

Up to 500 rows of your reference data, used by the dashboard to render the baseline distribution curves. Written once at monitor creation. Format: array of arrays (rows × features).

Log rotation: canary appends forever. For long-running services, manage rotation externally (logrotate, a cron that truncates the oldest N lines, etc.). The dashboard always reads the most recent entries.
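One possible shape for such a rotation job is sketched below: keep only the newest N lines and swap the trimmed file into place. The helper `keep_last_lines` is hypothetical and not part of canary. It assumes nothing is writing to the log while it runs, since entries appended during the swap would be lost.

```python
import os
import tempfile
from collections import deque


def keep_last_lines(path, n):
    """Rewrite `path` so it contains only its last `n` lines; return lines kept."""
    with open(path, "r", encoding="utf-8") as src:
        tail = deque(src, maxlen=n)        # holds only the newest n lines
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as dst:
        dst.writelines(tail)
    os.replace(tmp_path, path)             # swap the trimmed copy into place
    return len(tail)


# Demo on a throwaway file rather than a real log.
with tempfile.TemporaryDirectory() as workdir:
    demo = os.path.join(workdir, "monitor.jsonl")
    with open(demo, "w", encoding="utf-8") as f:
        f.writelines(f'{{"batch": {i}}}\n' for i in range(10))
    kept = keep_last_lines(demo, 3)
    with open(demo, encoding="utf-8") as f:
        remaining = f.read().splitlines()
```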

Aitor Bazo · 2026 · MIT