Confidence scoring explained: how to make AI defensible in a SOC audit

In a SOC 2 audit, "the model is ninety-six percent accurate" is not an answer. It is a question with a follow-up the team is rarely prepared for: what happens to the other four percent?

Confidence scoring is the mechanism that converts "the model is good" into "the system is defensible." Without it, every AI deployment is one auditor's question away from a finding.

A defensible confidence framework has four components.

  1. A documented threshold per workflow. Not "above eighty percent." A specific number — 0.87, 0.92, whatever the data supports — written into the spec, defended by the test set that produced it, and reviewed quarterly.
  2. A binary action rule at the threshold. Above it, the system acts. Below it, the system stops. There is no "the system flags it for review" without a named queue, an SLA, and a documented owner. "Flagged for review" with no SLA is a polite way of saying nothing happens.
  3. An exception logging schema. Every below-threshold event writes a structured row — input, model output, confidence score, timestamp, exception path taken, eventual resolution. The auditor will sample twenty-five of these. If you cannot reproduce them, the system is not auditable.
  4. A quarterly threshold review. A thirty-minute meeting where the operations owner, the model owner, and the finance or compliance owner review exception logs, drift in the confidence distribution, and any false positives that slipped above the threshold. The review is documented. The threshold either holds or moves with a written reason.
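
As an illustration of the first three components, here is a minimal Python sketch, not the specification from any real deployment. Every name in it (`WorkflowThreshold`, `ExceptionRecord`, `route`, the workflow and queue names) is hypothetical, and the threshold value is a placeholder. It also assumes a score exactly at the threshold counts as "above"; the spec should state that choice explicitly either way.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List, Optional
import json


@dataclass(frozen=True)
class WorkflowThreshold:
    """One documented threshold per workflow (all values are illustrative)."""
    workflow: str
    threshold: float    # the specific number defended by the test set
    review_queue: str   # named queue that receives below-threshold items
    sla_hours: int      # time the queue owner has to resolve an item
    owner: str          # documented owner of the queue


@dataclass
class ExceptionRecord:
    """Structured row written for every below-threshold event."""
    workflow: str
    model_input: str
    model_output: str
    confidence: float
    timestamp: str
    exception_path: str               # which queue the item was sent to
    resolution: Optional[str] = None  # filled in when a reviewer closes it


def route(spec: WorkflowThreshold, model_input: str, model_output: str,
          confidence: float, log: List[ExceptionRecord]) -> str:
    """Binary action rule: act at or above the threshold, otherwise stop and log."""
    if confidence >= spec.threshold:  # assumption: equality counts as "above"
        return "act"
    log.append(ExceptionRecord(
        workflow=spec.workflow,
        model_input=model_input,
        model_output=model_output,
        confidence=confidence,
        timestamp=datetime.now(timezone.utc).isoformat(),
        exception_path=spec.review_queue,
    ))
    return "stop"


def to_json_row(record: ExceptionRecord) -> str:
    """Serialize one exception row so an auditor's sample can be reproduced."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    spec = WorkflowThreshold("payment_posting", 0.92, "ar-exceptions", 24, "ops-lead")
    exceptions: List[ExceptionRecord] = []
    print(route(spec, "remittance 001", "post to invoice A", 0.97, exceptions))
    print(route(spec, "remittance 002", "post to invoice B", 0.61, exceptions))
    print(to_json_row(exceptions[0]))
```

The point of the sketch is that the below-threshold branch does two things at once: it stops the action and writes the row. In a real system the list would be a durable table, and `resolution` would be updated by whoever works the queue.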

In a healthcare receivables automation deployment I led — roughly fifty thousand transactions per month flowing into Sage Intacct via a custom integration — the confidence framework ran approximately sixty-five lines of specification across all transaction types. Every single line had to survive both internal compliance review and the external audit. None of it was about the model. All of it was about what happens when the model is uncertain.

If a model is uncertain and the system has no documented response, the system has no controls. "No controls" is the language an auditor uses to mean "we cannot rely on the financial statements."
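
One way to catch a missing documented response before an auditor does is to check each workflow's spec for gaps. The sketch below is a hypothetical illustration under assumed field names (`WorkflowSpec`, `control_gaps`, `review_queue`, `sla_hours`, `owner`); it is not a compliance tool and says nothing about whether the documented values are the right ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkflowSpec:
    """What a workflow claims about its uncertain cases (fields are illustrative)."""
    workflow: str
    threshold: Optional[float] = None
    review_queue: Optional[str] = None
    sla_hours: Optional[int] = None
    owner: Optional[str] = None


def control_gaps(spec: WorkflowSpec) -> List[str]:
    """Return the names of the missing pieces of the documented response."""
    gaps = []
    # A threshold must be a specific number strictly between 0 and 1.
    if spec.threshold is None or not (0.0 < spec.threshold < 1.0):
        gaps.append("threshold")
    # "Flagged for review" needs a named queue, an SLA, and an owner.
    if not spec.review_queue:
        gaps.append("review_queue")
    if spec.sla_hours is None or spec.sla_hours <= 0:
        gaps.append("sla_hours")
    if not spec.owner:
        gaps.append("owner")
    return gaps


if __name__ == "__main__":
    specs = [
        WorkflowSpec("payment_posting", 0.92, "ar-exceptions", 24, "ops-lead"),
        WorkflowSpec("refund_matching", 0.87),  # threshold only, no response path
    ]
    for spec in specs:
        gaps = control_gaps(spec)
        status = "ok" if not gaps else "missing: " + ", ".join(gaps)
        print(f"{spec.workflow}: {status}")
```

A workflow that reports any gap is one where "flagged for review" has no named destination, which is the situation described above.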

The model's accuracy is a question for your ML team. The confidence framework is a question for your CFO. They are not the same question, and confusing them is how AI deployments that worked perfectly in the demo fail the first real review.
