AI drift is the silent killer in regulated industries

By Alex Hessami, founder of SealVera · March 2026

Your model didn't change. Your prompt didn't change. But your approval rate quietly dropped 22 points over six weeks and nobody noticed. This is how AI drift happens in production — and why the consequences in regulated industries are worse than anywhere else.

The scenario that should keep you up at night

One of the things that surprised me most while building SealVera's drift monitoring was how often teams had no idea their model's behavior had changed. Not because they weren't paying attention. Because they had no baseline to compare against.

Picture this: a journalist is working on a story about lending disparities. They FOIA some data, run the numbers, and notice that a digital lender's approval rates dropped sharply for applicants in certain zip codes starting around mid-Q3. The lender's team is caught completely off guard. Their code hadn't changed. Their model hadn't changed. Their prompts hadn't changed. They hadn't pushed a single release that touched the underwriting logic in six weeks.

And yet: approval rates had drifted 22 points downward. Thousands of applicants affected. The company's first awareness of this wasn't an internal alert. It was a journalist on the phone.

This isn't a hypothetical designed to scare you. It's a composite of real patterns I've seen — and a scenario that becomes more likely with every passing quarter that AI agents sit in production without behavioral monitoring. The question isn't whether drift happens. It's whether you find out before or after someone else does.

What drift actually means when your model is an LLM

In classical ML, drift has a reasonably clean definition. Your training data had a certain distribution; your live inputs have started looking different; your model's accuracy degrades predictably. You retrain. You move on.

LLM drift in production is messier, and it comes from more vectors simultaneously.

Upstream model updates. If you're calling a foundation model API — GPT-4o, Claude, Gemini, any of them — the model behind that API endpoint can change without your knowledge. Providers push updates to improve safety, capability, or cost. They don't always announce these changes, and they rarely announce them in a way that tells you exactly how your specific prompts are affected. The model you're calling today is not the same model you called six months ago, even if the API version string looks identical.

Prompt sensitivity creep. Prompts that worked well when they were written start producing subtly different outputs over time as the underlying model shifts. A credit-assessment prompt that was tuned for one model checkpoint will behave differently against a checkpoint trained three months later. The prompts didn't change. The outputs did.

Context window and retrieval drift. If your agent retrieves context — from a vector database, a knowledge base, or a document store — the content of that context changes as you update your data. An underwriting agent that pulls current policy guidelines will behave differently after a policy update, even if no one touched the agent's logic. The decision changed because the world it was reading from changed.

Input distribution shift. Your applicant pool isn't static. Economic conditions, seasonal patterns, marketing campaigns — all of it shifts the composition of who's submitting applications. A model calibrated on last year's applicant mix may perform very differently on this year's mix, even if the model itself hasn't changed at all.

The insidious part: these effects compound. A small upstream model update, combined with a modest shift in input distribution, combined with a policy document refresh — none of them dramatic on their own — can produce a decision pattern that looks very different from your baseline. And none of it shows up in your deployment logs, because technically nothing was deployed.

Why regulated industries are uniquely exposed

Every production AI system is exposed to drift. But the consequences scale with what the AI is deciding.

In regulated industries — fintech, insurtech, healthcare — your AI isn't recommending a playlist or ranking search results. It's making consequential decisions about people's access to credit, coverage, and care. Every drifted decision is potentially a compliance event. Every drifted decision that affected a protected class is potentially a fair lending violation. The liability doesn't accumulate when you discover drift. It accumulates from the moment the first drifted decision was made.

A 22-point drop in approval rates over six weeks isn't just a business problem. Under ECOA and the Fair Housing Act in the US, or under the EU AI Act's requirements for high-risk AI systems, that pattern triggers obligations: explanation rights, adverse action notices, audit trails. If you can't demonstrate that your system was operating within expected parameters, you can't demonstrate that it was operating fairly. And if you can't demonstrate fairness retrospectively — for every affected applicant — you have a problem that no amount of engineering can fix after the fact.

The difference between a drift incident and a regulatory enforcement action is, often, whether you caught it yourself or someone else caught it for you.

The detection problem: why you'll find out the wrong way

Most teams running AI in production have monitoring that looks at system health: latency, error rates, API call volumes. Some have model-level monitoring: token usage, response times, maybe a dashboard showing throughput.

Almost none have behavioral monitoring: what is the agent actually deciding, and how does that compare to what it was deciding last week?

The result is a predictable detection pattern. Teams find out about AI model drift in production through one of three channels: a customer complaint that escalates far enough to trigger an internal investigation, an external report (journalist, regulator, researcher) who ran the numbers you weren't running, or a periodic manual audit that happens to land during a drift window. The median time from drift onset to detection in teams without behavioral monitoring is measured in weeks, not hours. By the time you know something is wrong, you have weeks of affected decisions to explain.

The uncomfortable math: If your AI agent processes 500 applications per day and drifts for 42 days before you notice, that's 21,000 decisions made under conditions you didn't know about. Even if only 10% were meaningfully affected, that's 2,100 applicants who may have grounds to request re-review — and a compliance exposure that compounds daily.
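That exposure calculation is worth automating as a first sanity check. A minimal sketch; the function name is illustrative, and the affected fraction is an assumption borrowed from the scenario above, not something you can know without investigating the drift itself:

```python
def drift_exposure(decisions_per_day, days_undetected, affected_fraction):
    """Back-of-envelope count of decisions made during a drift window,
    and the subset that may need re-review. Illustrative only: the
    affected_fraction is an assumed input, not a measured quantity."""
    total = decisions_per_day * days_undetected
    affected = int(total * affected_fraction)
    return total, affected

# The scenario above: 500 applications/day, 42 days undetected, 10% affected.
print(drift_exposure(500, 42, 0.10))  # → (21000, 2100)
```

The point of running this number early is that it turns "we should probably add monitoring" into a concrete liability figure that grows with every day of detection delay.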

What you actually need to catch drift

LLM drift monitoring isn't a fundamentally hard problem. It's an instrumentation problem. Most teams don't have the right telemetry, so they can't detect what they're not measuring. Here's what the measurement layer actually needs to look like:

Baseline decision distributions, per agent. For each AI agent making consequential decisions, establish what "normal" looks like: approval rate, rejection rate, the distribution of scores or confidence levels, breakdown by major applicant segments. This baseline needs to be established early — ideally at launch — and refreshed deliberately when you intentionally change something.

Approval rate tracking over time. This sounds trivially obvious and is almost universally absent. Track your agent's approval rate as a time series. Apply statistical process control: flag when the rate moves outside your expected control limits. A 22-point drop over six weeks should have been visible as a trend within the first two weeks, had anyone been looking at this number.
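The statistical-process-control idea fits in a few lines. This is a minimal sketch, not a production implementation: the function names, the 3-sigma limits, and the 14-day run length for the trend check are all assumptions you would tune to your own decision volume:

```python
from collections import deque
from statistics import mean, stdev

def make_drift_detector(baseline_rates, sigma=3.0, window=14):
    """Flag daily approval rates that breach Shewhart-style control
    limits derived from a baseline window, or that form a sustained
    one-sided run below the baseline center line."""
    center = mean(baseline_rates)
    spread = stdev(baseline_rates)
    upper = center + sigma * spread
    lower = center - sigma * spread
    recent = deque(maxlen=window)

    def check(daily_rate):
        recent.append(daily_rate)
        out_of_control = daily_rate > upper or daily_rate < lower
        # A full window of consecutive below-center days is an early
        # trend signal, even before any single point breaches a limit.
        trend = len(recent) == window and all(r < center for r in recent)
        return out_of_control or trend

    return check
```

A single point outside the limits is the classic alarm; the one-sided run check is what would catch a gradual 22-point slide well before any individual day looks anomalous on its own.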

Confidence scoring and distribution monitoring. If your agent produces a confidence score or probability estimate alongside its decision, monitor the distribution of those scores. Drift often shows up in confidence distributions before it shows up in outcome rates — the model becomes less certain about cases it used to handle confidently, which is an early signal worth catching.
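One common way to quantify a shift in score distributions is the Population Stability Index, a standard metric in credit risk. A minimal sketch, assuming scores lie in [0, 1] and using the conventional rule of thumb that PSI above 0.25 indicates a major shift:

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between a baseline and a current
    sample of confidence scores, assumed to lie in [0, 1].
    Rule of thumb in credit risk: PSI > 0.25 signals a major shift."""
    def fractions(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # clamp s == 1.0 into last bin
            counts[idx] += 1
        total = len(scores)
        # Floor each fraction to avoid log(0) on empty bins.
        return [max(c / total, 1e-6) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Computing PSI on a rolling window against your baseline gives you exactly the early signal described above: the score distribution moves before the approval rate does.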

Anomaly alerts within hours, not weeks. The monitoring has to be automated and it has to have teeth. A weekly report that someone reads on Friday afternoon isn't drift detection — it's drift archaeology. You want alerts that fire within hours of a statistically significant deviation from baseline. The alert should go to someone who can act on it: pause the agent, trigger a review, roll back to a known-good configuration.

What behavioral monitoring looks like in practice

Behavioral monitoring for AI agents works from a simple premise: if the agent's behavior is stable, its decision distribution should be stable. When the distribution shifts beyond expected variance, something changed — and you need to know what.

In practice, you're tracking a few key signals per agent, per time window: approval and rejection rates, the shape of the decision-outcome distribution across applicant segments, and the distribution of confidence scores.

"Normal" is defined empirically from your baseline window. Deviation is flagged when outcomes move outside a confidence interval — typically 2–3 standard deviations from the rolling mean, adjusted for volume. The alert doesn't mean something is definitely wrong; it means something changed and you need to find out why before continuing to accumulate decisions under unknown conditions.

The key design principle: Monitoring should be suspicious of stability, not just instability. If your approval rate has been flat for eight weeks and then suddenly becomes very flat — suspiciously flat, with lower variance than historical — that's also worth investigating. Sometimes drift manifests as a model becoming less responsive to input variation, not just less accurate.
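The "suspiciously flat" case can be checked directly by comparing recent variance against historical variance. A minimal sketch; the function name is illustrative and the 0.25 ratio threshold is an assumed tuning parameter:

```python
from statistics import pvariance

def variance_collapse(history, recent, ratio=0.25):
    """Flag when recent variance falls far below historical variance,
    which can indicate a model that has stopped responding to input
    variation. The ratio threshold is an assumed tuning parameter."""
    hist_var = pvariance(history)
    if hist_var == 0:
        return False  # no historical variation to compare against
    return pvariance(recent) / hist_var < ratio
```

This is the mirror image of the control-limit check: one alarm fires when outcomes move too much, the other when they stop moving at all.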

The cost of finding out the wrong way

There are two ways to discover that your AI agent has been drifting in production. The first is your own monitoring catching it — an alert fires, your team investigates, you identify the cause, you pause and remediate, you re-review affected decisions, you document what happened. That's a bad week. Expensive, stressful, requires explanation to stakeholders. But it's recoverable.

The second is a regulator, a journalist, or a plaintiff's attorney finding it first. Now you're explaining why you didn't know. You're reconstructing a timeline of decisions you don't have the telemetry to explain. You're defending an approval rate pattern that you have no causal account for. The CFPB doesn't have a lot of patience for "we weren't monitoring that." Neither does a fair lending examiner who's looking at a 22-point approval rate drop correlated with applicant geography.

The gap between those two outcomes is almost entirely an instrumentation problem. Teams that have behavioral monitoring in place catch drift in hours and have a clear remediation story. Teams without it catch drift only after weeks, if at all, and face a much harder conversation when it surfaces.

You've invested significantly in building AI that makes good decisions. The monitoring infrastructure to verify that it's still making good decisions — continuously, not just at launch — is not optional in a regulated industry. It's the part that makes everything else defensible.

Catch drift before your regulator does.

SealVera monitors agent behavior continuously — approval rates, decision patterns, confidence distributions. When something shifts, you find out in minutes.

Start free — no credit card