Your AI agent just got subpoenaed. Now what?
It's not hypothetical anymore. Courts are asking for AI decision records. Here's what the discovery process looks like when your decision-maker is an LLM — and what you need to be able to produce.
The scenario nobody planned for
When I was building SealVera, I spent time talking to litigation attorneys about what discovery looks like when AI is involved. What they described was worse than I expected.
Picture this: a regional health insurer has been using an AI system to handle prior authorization requests — the administrative gatekeeping that decides whether a patient's proposed treatment is covered. The system processes thousands of requests a week. It's faster than human reviewers, more consistent, and management loves the throughput numbers.
Then a class action lands. Plaintiffs allege that the AI denied medically necessary claims at a statistically improbable rate for certain diagnostic codes. Their attorneys issue a discovery request. It reads, in part: "All records of AI-assisted decisions including model inputs, outputs, reasoning, model version, and any human review steps, for the period January 2024 through December 2025."
The engineering team pulls the application logs. They find request timestamps and response codes. They find some database entries with decision: "denied". What they don't find: the actual prompt sent to the model. The full patient data that was serialized and passed as context. The model's output before post-processing. Any chain-of-thought or reasoning trace. The model version that was running at 2:17 AM on March 4th, 2024.
They have records of the fact that decisions happened. They have almost nothing about how or why.
This is not a hypothetical. Variants of this scenario are already playing out in U.S. courts, with insurers, lenders, and HR platforms all facing the same uncomfortable question: when an LLM is your decision-maker, what does the paper trail look like?
What discovery actually looks like when the decision-maker is an LLM
Legal discovery for AI-assisted decisions is still evolving, but the contours are becoming clear. Plaintiffs' attorneys — particularly in class actions involving insurance denials, credit decisions, or employment screening — are getting better at asking the right questions.
Expect requests for:
- The complete input to the model: not just a summary, but the actual prompt, system instructions, and any structured data passed as context
- The raw model output: what the model returned before any downstream parsing or business logic modified it
- Reasoning traces: chain-of-thought output, tool calls, intermediate steps in an agentic workflow
- Model provenance: which model, which version, which API endpoint, what temperature and sampling settings were in use at the time of the specific decision
- Human-in-the-loop documentation: evidence of whether a human reviewed the output, when, and what they saw
- Statistical patterns across decisions: aggregate evidence that may establish discriminatory impact
Courts have begun treating AI decision records similarly to how they treat algorithm documentation in financial services — with the expectation that the organization can reconstruct not just what happened, but how the system arrived there. The Federal Rules of Civil Procedure already require preservation of electronically stored information (ESI) reasonably anticipated to be relevant to litigation. Whether your LLM inference logs qualify is no longer a theoretical question.
The three things you need to be able to produce
Strip away the legal complexity and the ask is simple: input, output, reasoning. But each one has nuance that matters in court.
The input. This means the complete, verbatim payload sent to the model — system prompt, user message, any retrieved context chunks, tool definitions, conversation history. Not a reconstruction. Not a template with fields filled in. The actual bytes that the model processed. If your system serializes patient data into a prompt at runtime, you need to capture that serialized form, not the source records. The distinction matters because your post-hoc reconstruction might differ from what actually ran.
The output. The raw completion from the model, before your application code touches it. If you parse the JSON, extract a field, and discard the rest, you've destroyed evidence. If you post-process the response through a rules engine before storing the result, opposing counsel will want to see both the model's words and your transformation of them.
The reasoning. This is where most teams are least prepared. "The model decided" is not an acceptable answer in a courtroom. Judges and juries don't operate on probabilistic black boxes — they need a chain of causation they can follow. For simple completions, this means capturing any chain-of-thought output. For agentic systems, it means logging every tool call, every intermediate state, every branching decision point. If your model uses structured reasoning (thinking tokens, scratchpad outputs, or explicit reasoning fields), those need to be preserved as part of the decision record, not filtered out.
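To make the triad concrete, here's a minimal sketch of what a decision record might capture at inference time. The field names and the `DecisionRecord` type are illustrative assumptions, not a prescribed schema — the point is that input, output, reasoning, and provenance all live in one immutable object created at decision time:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DecisionRecord:
    # The input: the verbatim payload the model processed, not a reconstruction.
    system_prompt: str
    prompt: str
    # The output: the raw completion, before any parsing or business logic.
    raw_output: str
    # The reasoning: chain-of-thought / tool-call trace, preserved verbatim.
    reasoning_trace: list
    # Provenance: captured at inference time, not from config files.
    model: str
    model_version: str
    temperature: float
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = DecisionRecord(
    system_prompt="You are a prior-auth reviewer...",
    prompt="Patient context: ...",       # the serialized form, as sent
    raw_output='{"decision": "denied", "reason": "..."}',
    reasoning_trace=[{"step": 1, "tool": "policy_lookup", "result": "..."}],
    model="example-model",
    model_version="2024-03-01-snapshot",
    temperature=0.0,
)
```

A frozen dataclass doesn't make the record tamper-evident on its own — that comes from signing, covered below — but it forces the discipline of assembling the complete record in one place before anything downstream touches it.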
The uncomfortable truth: "The model decided" has the same legal weight as "a magic 8-ball decided." Courts require human-accountable explanations. If your AI system makes consequential decisions, someone in your organization is accountable for those decisions — and they need records to defend them.
Logs are not records
This is the distinction that most engineering teams miss, and it's the one that will matter most when a litigator scrutinizes your documentation.
A log is a file your application writes to. It's mutable. It can be rotated, archived, modified, or deleted. Most logging infrastructure doesn't even treat this as a problem — log rotation is a feature. Logs are designed for operational debugging, not legal accountability. They have no chain of custody. They carry no guarantee that what you're reading today matches what was written at the time of the decision. Opposing counsel will point this out.
A record is different. A record is signed, timestamped, and written once. Its integrity can be independently verified. It exists as evidence of a specific event at a specific moment in time, and any subsequent modification is detectable. When a lawyer asks for "all records of AI-assisted decisions," the operative word isn't AI — it's records. They mean something with legal weight, not your CloudWatch log group.
The practical implication: you need a system that creates records at decision time, not logs that you hope to reconstruct later.
What tamper-evidence actually means
The term gets thrown around a lot in compliance marketing. Here's what it means in practice.
When you create a decision record, you serialize the relevant data — input, output, reasoning, metadata — into a canonical byte string. You compute a cryptographic hash of that string (SHA-256 is the current standard), then sign the hash using an asymmetric key pair: your private key produces a signature that anyone holding the corresponding public key can verify.
The result: a record that is bound to its content. If a single byte of the record changes after signing — a character in the prompt, a timestamp, a field value — the signature verification fails. You cannot modify the record without breaking it, and you cannot re-sign it without the original private key (whose every use should be logged, audited, and ideally controlled by a third party).
For additional assurance, records can be anchored to an external timestamp authority or a blockchain, creating a proof that the record existed at a specific time that doesn't depend solely on your own infrastructure. This is what courts mean when they ask for tamper-evident records — not password protection, not access controls, but cryptographic commitment.
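The hash-and-verify mechanics can be sketched in a few lines. For brevity this sketch uses a stdlib HMAC in place of a real asymmetric signature — a production system would sign with a private key held in an HSM or KMS, as the text describes — but the canonical-serialization and verification logic is the same shape:

```python
import hashlib
import hmac
import json

def canonical_bytes(record: dict) -> bytes:
    # Canonical serialization: sorted keys, fixed separators, UTF-8.
    # Any variation in whitespace or key order would change the hash.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def seal_record(record: dict, key: bytes) -> dict:
    payload = canonical_bytes(record)
    return {
        "record": record,
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Stand-in for an asymmetric signature (e.g. Ed25519 via a KMS).
        "signature": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }

def verify_record(sealed: dict, key: bytes) -> bool:
    payload = canonical_bytes(sealed["record"])
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, sealed["signature"])
            and hashlib.sha256(payload).hexdigest() == sealed["sha256"])

key = b"demo-only-key"
sealed = seal_record({"decision": "denied", "model": "example-v1"}, key)
assert verify_record(sealed, key)            # intact record verifies

sealed["record"]["decision"] = "approved"    # tamper with one field...
assert not verify_record(sealed, key)        # ...and verification fails
```

The canonical serialization step is easy to overlook and easy to get wrong: if two systems serialize the same record with different key ordering or whitespace, the same content produces different bytes and verification breaks for the wrong reason.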
The key management is where most implementations cut corners. A signature is only as trustworthy as the key. Your signing keys need to be isolated from the systems that create records, rotated on a schedule, and ideally attested by a hardware security module (HSM) or a cloud KMS with independent audit logs.
What to build before you get the letter
The following is the minimum viable compliance posture for any team shipping AI-assisted decisions in healthcare, insurance, lending, or employment. Do this now.
- Intercept at the model boundary. Build or adopt a middleware layer that captures the complete request and response before your application logic processes either. This is the only way to guarantee you're recording what the model actually saw and said.
- Capture model provenance with every record. Model name, version/snapshot identifier, temperature, top-p, any system prompt hash. These must be captured at inference time, not reconstructed from configuration files that may have changed.
- Preserve reasoning traces explicitly. If you use chain-of-thought prompting, capture the thinking output. If you use an agentic framework, log the full trace including tool calls and their results. Don't filter "internal" outputs — they're exactly what courts will ask for.
- Sign records immediately. Don't batch sign. Don't sign at end-of-day. Sign at the moment of creation, before the record touches any mutable storage. Every millisecond of delay is an opportunity for the record to be altered before it's committed.
- Separate your signing infrastructure from your application. If the same system that makes decisions also controls the keys that sign records, the signature proves nothing. Key management needs to be independent, auditable, and ideally third-party attested.
- Define retention policies and test them. Know how long you're keeping what, where it's stored, and how you'd produce a specific decision record given a case ID and a date range. Run a drill. Pretend a lawyer just called — how long does it take your team to produce 10,000 records from 18 months ago?
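The first two items on the list — intercepting at the model boundary and signing at creation — can be sketched as a thin wrapper around the model call. Everything here is illustrative: `sealed_completion`, the stub model, and the signer are hypothetical names, not SealVera's API. The structural point is that capture and sealing happen before any application code parses the response:

```python
import hashlib
import json
from datetime import datetime, timezone

def sealed_completion(call_model, request: dict, sign) -> dict:
    """Intercept at the model boundary: capture the verbatim request,
    the raw response, and a timestamp, then seal before any parsing."""
    raw_response = call_model(request)  # untouched by business logic
    record = {
        "request": request,             # the exact payload sent to the model
        "raw_response": raw_response,   # the raw completion, pre-parsing
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # Sign immediately, before the record touches mutable storage.
    return {"record": record, "signature": sign(payload)}

# Stub model and signer, for illustration only.
def fake_model(request):
    return {"model": "example-v1", "output": "DENIED: policy 4.2"}

sealed = sealed_completion(
    fake_model,
    {"prompt": "review claim 42"},
    lambda payload: hashlib.sha256(payload).hexdigest(),
)
```

Downstream code then works from `sealed["record"]["raw_response"]` — it can parse, transform, and route freely, because the sealed original survives regardless of what the application does next.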
If you're in healthcare: HIPAA already requires audit controls and integrity controls on ePHI. AI decision records that incorporate patient data fall under this requirement. "We use logging" is not a sufficient answer — the rule specifically requires that ePHI not be improperly altered or destroyed.
The cost you're choosing between
There are two scenarios. In the first, you build AI decision record infrastructure before anything goes wrong. The cost is engineering time — a few weeks for a solid implementation, ongoing maintenance, storage for the records. Annoying, but manageable. The value is that you can defend every decision your AI system has ever made. You can respond to a discovery request in days, not months. You can demonstrate to regulators that your system is auditable. You can catch model drift before it becomes a legal problem.
In the second scenario, you get the letter first. Now you're in legal hold. Every engineer who touched the system is spending time with outside counsel instead of shipping features. You're trying to reconstruct, from application logs and database states and maybe some CloudWatch exports, what your model actually received and returned eighteen months ago. Your lawyers are billing by the hour. Your PR team is managing headlines. And at the end of it, the best you can offer the court is a best-effort reconstruction that opposing counsel will spend their opening statement attacking.
The engineering cost of retrofitting AI audit infrastructure after a legal event is typically five to twenty times the cost of doing it upfront — and that's before you factor in legal fees, settlement exposure, and the regulatory scrutiny that tends to follow a public enforcement action.
The law is moving toward your AI systems. The question isn't whether you'll need defensible AI decision records. It's whether you'll have them ready when someone asks.
Build the record before you need it.
SealVera creates cryptographically signed, legally defensible records of every AI decision. Minutes to set up. Valuable before the letter arrives.
Start free — no credit card