A model can read a supplier invoice, match it to a purchase order, reconcile the VAT, and tell you — with a self-reported confidence of 0.97 — that it is safe to post. What happens next decides whether you have built a useful tool or a liability. In an invoice-automation system we built for a UK residential interiors business, the answer is deliberately blunt: the model's recommendation is an input, not an instruction. A separate piece of deterministic code reads that recommendation, checks it against a configurable floor and a set of independent rules, and only then decides whether anything reaches the accounting ledger. The AI never writes to the ERP. It cannot.
That single architectural decision — keep the model's judgement and the act of writing to the system of record on opposite sides of a wall — is what this post is about. It is the difference between a model that advises and one that acts. For anything that touches money, only the first is defensible.
Key takeaways
- The AI never writes to the ERP; it returns a schema-constrained proposal, and deterministic code holds the only credential that can post to the accounting ledger.
- A configurable confidence floor (default 0.9) can only withhold automation, never compel a post; a low score always routes the invoice to manual review.
- Held invoices return one of an enumerated catalogue of reason codes (Q1–Q18), so every outcome is named, countable and reviewable, and new ambiguity resolves to a query, not a post.
- Independent deterministic checks — duplicate detection, price tolerance, mandatory attachment, date-age limits — run regardless of the model and behave identically if it is swapped out.
- The reported results (14/14 validation scenarios, 25/25 invoices separated) are labelled tests against known inputs, not measurements of live-production accuracy; the build makes no production-accuracy claim.
The model proposes; deterministic code disposes
In the build, an inbound invoice is treated as a job, not a request. Raw email bodies and attachments are staged, text and images extracted, and the model is asked to return a strict structured object: supplier, invoice number, purchase-order reference, line items, subtotal, VAT, gross total, currency, and a recommended decision. The output is constrained by a schema, so a response that does not match the expected shape is rejected before any logic runs on it. Structured output is not a nicety here; it is the boundary that lets deterministic code reason about the model's answer at all.
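A minimal sketch of that boundary in Python, using only the standard library. The field names follow the list above; the class and function names, the `"post"`/`"query"` labels and the specific validation rules are illustrative assumptions, not the build's actual schema.

```python
from dataclasses import dataclass, fields
from decimal import Decimal, InvalidOperation

ALLOWED_DECISIONS = {"post", "query"}  # hypothetical labels for the recommended decision


class ProposalRejected(ValueError):
    """The model's response did not match the expected shape."""


@dataclass(frozen=True)
class Proposal:
    supplier: str
    invoice_number: str
    po_reference: str
    line_items: tuple
    subtotal: Decimal
    vat: Decimal
    gross_total: Decimal
    currency: str
    decision: str
    confidence: float


def _money(raw: dict, key: str) -> Decimal:
    # Parse via str() so binary float noise never enters the amount.
    try:
        value = Decimal(str(raw[key]))
    except InvalidOperation:
        raise ProposalRejected(f"{key} is not a decimal amount") from None
    if not value.is_finite():
        raise ProposalRejected(f"{key} is not finite")
    return value


def parse_proposal(raw: object) -> Proposal:
    """Reject anything that is not exactly the expected shape before logic runs on it."""
    if not isinstance(raw, dict):
        raise ProposalRejected("response is not an object")
    if set(raw) != {f.name for f in fields(Proposal)}:
        raise ProposalRejected("missing or unexpected fields")
    for key in ("supplier", "invoice_number", "po_reference", "currency", "decision"):
        if not isinstance(raw[key], str):
            raise ProposalRejected(f"{key} must be a string")
    if raw["decision"] not in ALLOWED_DECISIONS:
        raise ProposalRejected("decision is outside the allowed set")
    confidence = raw["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ProposalRejected("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:  # also rejects NaN
        raise ProposalRejected("confidence must be between 0 and 1")
    if not isinstance(raw["line_items"], list):
        raise ProposalRejected("line_items must be a list")
    return Proposal(
        supplier=raw["supplier"],
        invoice_number=raw["invoice_number"],
        po_reference=raw["po_reference"],
        line_items=tuple(raw["line_items"]),
        subtotal=_money(raw, "subtotal"),
        vat=_money(raw, "vat"),
        gross_total=_money(raw, "gross_total"),
        currency=raw["currency"],
        decision=raw["decision"],
        confidence=float(confidence),
    )
```

Downstream code in this sketch only ever receives a `Proposal`, so it never has to reason about free text or a half-formed response.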
What the model returns is a proposal. It might say "post this invoice" with a confidence value attached, or "query this — the quantities do not match the goods receipt." Either way, the proposal is handed to ordinary, testable, version-controlled code. That code, not the model, holds the only credential that can write an accounts-payable document to the ERP.
This is the practical form of a principle we apply across our builds and describe on our trust page: AI assists, and a named control — code or a person — decides. We make the same separation physical in another build, where the database is configured read-only by construction so the model cannot change records it is only meant to read. Here the equivalent guarantee is at the application layer: there is no code path in which the model's output is executed as a write. The write is always performed by deterministic code that has independently satisfied itself the conditions hold.
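The separation can be sketched in a few lines of Python. Everything named here (`FakeLedger`, `dispose`, `conditions_hold`) is a hypothetical stand-in for illustration; the point is only that the sole call to `ledger.post` sits in deterministic code, and the model's proposal is plain data that cannot reach it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """Plain data returned by the model; it carries no ledger handle."""
    invoice_number: str
    decision: str  # "post" or "query" (hypothetical labels)


@dataclass
class FakeLedger:
    """Stand-in for the ERP client, which in a real system would hold the write credential."""
    posted: list = field(default_factory=list)

    def post(self, invoice_number: str) -> None:
        self.posted.append(invoice_number)


def dispose(
    proposal: Proposal,
    conditions_hold: Callable[[Proposal], bool],
    ledger: FakeLedger,
) -> str:
    """The only function that writes; it re-checks conditions itself before posting."""
    if proposal.decision == "post" and conditions_hold(proposal):
        ledger.post(proposal.invoice_number)
        return "posted"
    return "queried"
```

A "post" recommendation is necessary but never sufficient in this sketch: if `conditions_hold` returns false, the ledger is untouched.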
A confidence floor is a governance control, not a tuning knob
The first deterministic check is a configurable confidence floor, defaulting to 0.9. If the model recommends posting an invoice but its own confidence sits below the floor, the code overrides the recommendation and routes the invoice to manual review. The model gets no benefit of the doubt.
Treating the floor as a governance control rather than a performance setting changes how you reason about it:
- It is configurable, and that is the point. A team that wants near-zero false auto-posts sets the floor high and accepts more manual review; a team comfortable with more straight-through processing lowers it deliberately, as a recorded decision with someone accountable for the trade-off. The floor is a dial a human owns, not a constant buried in a prompt.
- It is one-directional. A high confidence score never forces a post — other checks can still hold the invoice — but a low score always prevents an automatic one. Confidence can lose you the auto-post; it can never win you one on its own.
- Coverage gaps trigger the same outcome. If the system cannot fully map an invoice's lines to purchase-order and goods-receipt records, that missing coverage forces a query regardless of confidence. Absence of evidence is a reason to stop, not to proceed.
A model's self-reported confidence is not a calibrated probability and should not be trusted as one. The floor does not assume it is. It uses confidence only to withhold automation — a safe use of an unreliable signal, because the worst case of an over-cautious floor is more manual review, not a bad posting.
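As a rough sketch, the one-directional behaviour fits in a single pure function. The 0.9 default comes from the text above; the function name, the return labels and the `coverage_complete` flag are assumptions made for illustration.

```python
DEFAULT_FLOOR = 0.9  # configurable; owned by a named person, not buried in a prompt


def apply_floor(
    recommended: str,
    confidence: float,
    coverage_complete: bool,
    floor: float = DEFAULT_FLOOR,
) -> str:
    """Return "eligible" or "manual_review" (hypothetical labels).

    "eligible" only means the floor did not object; other checks can still
    hold the invoice. Nothing in this function can force a post.
    """
    if recommended != "post":
        return "manual_review"
    if not coverage_complete:
        # Lines that cannot be mapped to PO and goods-receipt records stop
        # automation regardless of confidence.
        return "manual_review"
    if not (confidence >= floor):
        # Written this way so a NaN confidence also falls through to review.
        return "manual_review"
    return "eligible"
```

For example, `apply_floor("post", 0.97, True)` is eligible under the default floor, while the same call with `floor=0.99` goes to manual review. Raising the floor only ever moves invoices towards a person.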
Every held invoice says why: enumerated reason codes
When the system declines to auto-post, it does not return a shrug. It returns one of an enumerated catalogue of query reason codes — Q1 through Q18 in this build — each naming a specific, anticipated failure mode: PO not found, supplier mismatch, price over tolerance, quantity or VAT mismatch, currency mismatch, duplicate invoice, partial invoice, missing mandatory attachment, and so on.
Enumerated codes do several governance jobs at once.
| Property | What it buys you |
|---|---|
| Every outcome is named | A held invoice carries a specific reason a person can act on, not a vague "needs review" |
| The set is finite and reviewable | Finance and audit can inspect the full catalogue of reasons the system will ever give |
| Codes are countable | You can measure why invoices are held over time, and spot drift or a new failure mode |
| New ambiguity has nowhere to hide | An unanticipated case cannot quietly resolve to "post"; it resolves to a query |
The contrast with free-text explanations matters. A model asked to "explain its decision" will always produce fluent prose, including for decisions that are wrong. A closed set of codes cannot invent a justification on the fly; it can only select from reasons the team has agreed are legitimate. The catalogue itself becomes an artefact a board or auditor can review — the same logic behind keeping an append-only decision ledger rather than trusting after-the-fact narration.
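A closed catalogue maps naturally onto an enumeration. The sketch below uses a handful of the failure modes named above; the member names, the string values and the catch-all `UNCLASSIFIED` member are illustrative assumptions, and the build's actual Q1 to Q18 numbering is not reproduced here.

```python
from collections import Counter
from enum import Enum


class QueryReason(Enum):
    PO_NOT_FOUND = "po_not_found"
    SUPPLIER_MISMATCH = "supplier_mismatch"
    PRICE_OVER_TOLERANCE = "price_over_tolerance"
    DUPLICATE_INVOICE = "duplicate_invoice"
    MISSING_ATTACHMENT = "missing_attachment"
    UNCLASSIFIED = "unclassified"  # hypothetical catch-all: still a query, never a post


def to_reason(label: str) -> QueryReason:
    """Map a label onto the closed set; anything unrecognised stays a query."""
    try:
        return QueryReason(label)
    except ValueError:
        return QueryReason.UNCLASSIFIED


def tally(labels: list) -> Counter:
    """Count held invoices by reason so drift or a new failure mode is visible."""
    return Counter(to_reason(label) for label in labels)
```

Because every value is a member of `QueryReason`, a report of held invoices by reason is a single `tally` call, and a growing `UNCLASSIFIED` count would be a prompt to review the catalogue.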
Independent deterministic checks run regardless of the model
The confidence floor and reason codes sit on top of checks that do not consult the model at all. These run as plain code against the extracted data and the live ERP records:
- Duplicate detection before any posting, so the same invoice cannot enter twice.
- Price tolerance in GBP against the purchase order, with over-tolerance lines held.
- A mandatory PDF attachment requirement, so a posted document always has its source attached.
- Date-age limits, so stale invoices are flagged rather than silently posted.
These checks are deliberately boring, and that is their virtue. They are deterministic, unit-testable, and produce the same result every time for the same input. They do not depend on a model version, a prompt, or a temperature setting. Swap the model out tomorrow and they behave identically. They are the floor beneath the floor: the controls that hold even if the model's judgement is wholly mistaken.
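The four checks listed above might look something like this as plain Python. The tolerance of £0.50, the 90-day age limit, the single-price `Invoice` shape and the failure labels are all assumed values for illustration; the source does not state the build's thresholds. The sketch also treats tolerance as an absolute difference in either direction, which is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    supplier: str
    invoice_number: str
    invoice_date: date
    unit_price_gbp: Decimal
    has_pdf: bool


def run_checks(
    inv: Invoice,
    po_unit_price_gbp: Decimal,
    already_posted: set,
    today: date,
    tolerance_gbp: Decimal = Decimal("0.50"),  # assumed threshold
    max_age_days: int = 90,                    # assumed threshold
) -> list:
    """Return the failed checks; an empty list means none objected.

    No model output is consulted, so the same input gives the same result.
    """
    failures = []
    if (inv.supplier, inv.invoice_number) in already_posted:
        failures.append("duplicate_invoice")
    if abs(inv.unit_price_gbp - po_unit_price_gbp) > tolerance_gbp:
        failures.append("price_over_tolerance")
    if not inv.has_pdf:
        failures.append("missing_attachment")
    if (today - inv.invoice_date).days > max_age_days:
        failures.append("invoice_too_old")
    return failures
```

Passing `today` in as an argument, rather than reading the clock inside the function, keeps the checks unit-testable and repeatable.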
The audit trail is built from the decision, not bolted on after
Because the decision flows through code, the trail is a by-product of how the system works rather than something reconstructed later. The raw payload is staged in blob storage; the job state and the decision rationale — including which reason code was returned and the confidence the model reported — are persisted with timestamps; and when an invoice is posted, the original PDF is attached to the resulting ERP document. Test and production ERP databases are held strictly apart.
For a regulated finance function, that provenance is the asset. Under UK GDPR and the Data Protection Act 2018, the ICO is the lead regulator wherever AI processes personal data, and accountability (showing how a decision was reached) is a standing requirement. The reforms to automated decision-making under the Data (Use and Access) Act 2025 (section 80 in force 5 February 2026, introducing new Articles 22A–22D) relax some constraints on solely-automated decisions but keep the expectation of human review and contestability where it matters. The consultation on the ICO's updated guidance on automated decision-making and profiling closed on 29 May 2026, with final guidance expected in summer 2026; treat it as draft until published. In financial services the same logic applies through existing, technology-neutral rules such as Consumer Duty, the Senior Managers and Certification Regime and model-risk expectations, as we set out in the financial-services playbook. None of these regimes is satisfied by a model that "usually gets it right"; they are satisfied by a system that can show what was decided, why, and by whom.
The design here makes that easy because the model is never the actor. A human approving a queried invoice, or the code auto-posting a clean one, is the recorded decision-maker. The model's contribution — its extraction and its recommendation — is logged as evidence, not authority.
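One way to picture the persisted decision is as a single timestamped record per outcome. This is a sketch under assumptions: the field names, the JSON encoding and the `payload_ref` pointer to the staged payload are hypothetical, not the build's storage format.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def decision_record(
    invoice_number: str,
    outcome: str,                 # e.g. "posted" or "queried" (hypothetical labels)
    reason_code: Optional[str],   # the query code returned, or None when posted
    model_confidence: float,      # logged as evidence, not authority
    decided_by: str,              # the code path or the named human approver
    payload_ref: str,             # pointer to the staged raw payload
    at: Optional[datetime] = None,
) -> str:
    """Serialise one decision as a JSON line suitable for an append-only log."""
    at = at or datetime.now(timezone.utc)
    return json.dumps(
        {
            "invoice_number": invoice_number,
            "outcome": outcome,
            "reason_code": reason_code,
            "model_confidence": model_confidence,
            "decided_by": decided_by,
            "payload_ref": payload_ref,
            "decided_at": at.isoformat(),
        },
        sort_keys=True,
    )
```

Note that `decided_by` is never the model in this sketch: it is either the deterministic code path or the person who approved the queried invoice.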
What the tests show, and what they do not
We can report two controlled results, labelled as exactly that. In a curated SAP AI-validation suite, 14 of 14 scenarios returned the expected decision and query code, with model confidence between 0.97 and 0.99 (dated 20 May 2026, synthetic and curated fixtures). In a separate batch-extraction test on one real multi-invoice PDF, the system correctly separated 25 of 25 invoices (dated 27 May 2026). Both are tests against known inputs.
What they are not: a measure of live-production accuracy. We have no measured production outcomes for this build — no time saved, no straight-through auto-post rate, no invoices-per-month in live use — and we will not invent them. The discovery-stage manual query rates the client reported are their historical baseline, not a result we delivered. The honest claim is narrower than a headline figure and more useful: the controls behave as specified, and when the model is wrong the system fails towards a human rather than a bad posting. The full build is described in the invoice-automation case study.
The transferable rule
Strip away the invoices and the principle generalises to any system where AI output meets a system of record:
- Constrain the output to a schema so deterministic code can reason about it.
- Make the model propose, never act — the credential that writes to the system of record lives only in code.
- Put a configurable floor between recommendation and action, owned by a named person, that can only withhold automation, never compel it.
- Enumerate the reasons for declining, so every held case is named, countable and reviewable, and no new ambiguity can resolve itself into action.
- Run independent deterministic checks that hold even if the model is entirely wrong.
This is what we mean by governance engineered in rather than bolted on. It is not a policy document about responsible AI; it is a wall in the code the model cannot climb. The model is good. The wall is what makes it safe to use.
Last reviewed: 29 May 2026.
If you are putting AI near a system of record and want the decision boundary in code rather than in a prompt, talk to us about how we would draw it.



