Building Regulator-Defensible Enterprise RAG Systems (FCA/PRA/SMCR)

This article defines what regulator-defensible enterprise Retrieval-augmented generation (RAG) looks like in Financial Services (at least in 2025–2026). Rather than focusing on model quality, it frames RAG through the questions regulators actually ask: what information was used, can the answer be reproduced, who is accountable, and how risk is controlled. It sets out minimum standards for context provenance, audit-grade logging, temporal and precedence-aware retrieval, human-in-the-loop escalation, and replayability. The result is a clear distinction between RAG prototypes and enterprise systems that can survive PRA/FCA and SMCR scrutiny.

Executive Summary (TL;DR)

This is what “good” looks like in 2025–2026 when an LLM is allowed anywhere near regulated Financial Services data. Enterprise RAG systems in Financial Services are not judged on whether they produce impressive answers. They are judged on whether the institution can prove — under scrutiny — what the model saw, why it saw it, who approved it, and how the answer was used.

A regulator-defensible RAG system must provide:

  • Context provenance for every retrieved chunk (source, time validity, sensitivity, precedence rank, rule version)
  • Reproducibility (replay the exact context and model configuration that produced any answer)
  • Audit-grade logging for prompt → retrieval → response
  • Controls that map cleanly onto SMCR accountability
  • Human-in-the-loop escalation for high-risk queries
  • A retrieval strategy that respects temporal correctness and source precedence, not just cosine similarity

If you can’t answer the PRA’s core questions about your system in one sitting, you don’t have an enterprise RAG system. You have a prototype.

Part of the “land it early, manage it early” series on SCD2-driven Bronze architectures for regulated Financial Services. It covers regulator-safe RAG with provenance and temporal retrieval, written for AI architects, compliance leads, and platform owners who need explainable AI under SMCR, and it sets out the standards that make RAG audit-grade.

Contents

1. The Five Questions the PRA Will Ask About Your RAG System
2. The Context Provenance Table Schema (Minimum Columns)
3. Retrieval Scoring + Precedence Metadata (Cosine Similarity Is Not Enough)
4. Prompt + Context + Response Logging Standard (Audit-Grade)
5. Human-in-the-Loop Triggers for High-Risk Queries
6. The Reproducibility Guarantee (Replay Is the Standard)
7. Useful vs Regulator-Defensible: Where Most RAG Programmes Fail
8. Conclusion: The New Definition of Enterprise RAG in FS

1. The Five Questions the PRA Will Ask About Your RAG System

When RAG systems are reviewed by regulators, the conversation does not start with architecture diagrams or model choices. It starts with accountability, evidence, and control. The questions that follow are not hypothetical — they reflect how PRA and FCA supervisors probe whether a system is understood, governed, and defensible. Answering them convincingly requires more than confidence; it requires a system that was designed with scrutiny in mind.

This section frames enterprise RAG through those questions, because they are the fastest way to distinguish maturity from experimentation.

You don’t get asked “what model are you using?” first. You get asked the questions that expose whether your system is controlled, evidenced, and reproducible.

1.1 PRA Question 1: What information did the model use to produce this answer?

Before debating model behaviour or output quality, regulators focus on inputs. In Financial Services, the provenance of information matters as much as the answer itself. This question establishes whether the institution can demonstrate control over what evidence was actually used, rather than offering a general description of likely sources.

Not “roughly”. Not “it probably came from these tables”.

They will want:

  • the exact retrieved chunks
  • the exact versions of those chunks
  • the as-of timestamp and point-in-time (PIT) reconstruction mode (if applicable)
  • the originating systems / lineage pointers

1.2 PRA Question 2: Can you reproduce this answer exactly?

Regulated decision-making assumes replayability. Even where outcomes are probabilistic, the evidential chain must be reconstructable. This question probes whether the organisation treats AI outputs as ephemeral convenience, or as artefacts that can be examined and replayed under challenge.

If the answer was used in complaints handling, remediation, AML, risk, or customer communications, you must be able to replay:

  • prompt
  • retrieval set
  • ranking inputs
  • model version and parameters
  • response

“LLMs are non-deterministic” is not an excuse. You manage non-determinism the same way you manage market data variability: control, record, and replay.

1.3 PRA Question 3: How do you prevent unauthorised access to sensitive data?

RAG systems are often described as “read-only”, but in practice they expose a powerful access path to sensitive information. This question tests whether retrieval respects the same access controls, purpose limitations, and data minimisation principles as the underlying platform — or quietly bypasses them.

RAG is a data access surface. It must respect:

  • PII restrictions
  • data minimisation
  • purpose limitation
  • role-based access
  • information barriers

If your vector store has more data than the user is entitled to query, you’ve already failed.

1.4 PRA Question 4: How do you detect and manage errors, bias, and unsafe outputs?

No model is perfect. Regulators are less concerned with the absence of errors than with whether errors are detectable, manageable, and contained. This question examines whether the institution has runtime controls and feedback mechanisms, rather than relying on trust in model behaviour.

This includes:

  • hallucinations (including time leaks)
  • missing context
  • overconfident summarisation
  • policy breaches
  • advice-like outputs

You need runtime controls and post-hoc analysis.

1.5 PRA Question 5: Who is accountable?

Under SMCR, accountability cannot be abstract or collective. This question forces clarity on ownership: of prompts, retrieval logic, data semantics, and business use. A system without named accountability may function technically, but it does not function institutionally.

Under SMCR, “the system did it” is not a thing.

They will ask:

  • who owns prompts?
  • who owns retrieval logic and indexing?
  • who owns the source data and PIT semantics?
  • who owns the user-facing application and the business outcome?

If you cannot map responsibilities to named roles, your governance story isn’t real.

1.6 The one-pager you should have ready

Complex systems that cannot be explained succinctly are rarely well controlled. This subsection reflects a practical truth: if the organisation cannot summarise its RAG system clearly for a regulator, it likely does not understand it well enough itself.

You should be able to produce a single-page summary containing:

  • system purpose and scope
  • data sources and access controls
  • retrieval method and ranking logic
  • logging and reproducibility method
  • risk classification and escalation triggers
  • accountable owners (RACI / SMCR mapping)

If you can’t fit the truth on one page, you probably don’t understand the system well enough yet.

2. The Context Provenance Table Schema (Minimum Columns)

Most RAG discussions treat provenance as an attribute of tooling. In regulated environments, provenance is an artefact — something that must be stored, queried, and retained. This section introduces the minimum schema required to turn “the model saw some context” into a defensible, inspectable record.

What follows is not exhaustive; it is the smallest structure that still supports audit, replay, and accountability.

If there is one artefact that separates “demo RAG” from “enterprise RAG”, it is a context provenance record that is complete and queryable.

At minimum, you need:

2.1 Interaction-level fields

Every RAG interaction exists in a business context. These fields anchor the interaction to a user, role, channel, and purpose, ensuring that outputs can be assessed in light of how and why they were generated.

  • interaction_id (immutable ID)
  • timestamp_utc
  • user_id (or service principal)
  • user_role / entitlement_profile
  • channel (complaints tool, analyst notebook, AML case system, etc.)
  • use_case_id (mapped to a risk category)

2.2 Prompt fields

Prompts encode policy, intent, and constraints. Treating them as transient strings makes it impossible to reason about behaviour later. This subsection establishes prompts as governed inputs with identity and versioning; a minimal hashing sketch follows the list.

  • prompt_text (or stored reference)
  • prompt_hash
  • system_prompt_version (your guardrails policy version)
  • prompt_classification (low/medium/high risk)
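
A minimal sketch of how prompt identity can be established. The normalisation step and the illustrative record values are assumptions, not a prescribed format:

```python
import hashlib
import unicodedata

def prompt_hash(prompt_text: str) -> str:
    """Stable identity for a prompt: normalise, then SHA-256."""
    normalised = unicodedata.normalize("NFC", prompt_text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

user_prompt = "Summarise the complaint history for account 123."
prompt_record = {
    "prompt_text": user_prompt,                      # or a stored reference
    "prompt_hash": prompt_hash(user_prompt),
    "system_prompt_version": "guardrails-2025.03",   # your guardrails policy version
    "prompt_classification": "high",                 # low / medium / high risk
}
```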

2.3 Temporal fields (if you do Temporal RAG)

Where historical truth matters, time must be explicit. These fields ensure that retrieval semantics are anchored to a specific reconstruction mode, rather than left implicit or inferred after the fact.

  • as_of_timestamp
  • reconstruction_mode (AS_KNOWN vs AS_NOW_KNOWN)
  • reconstruction_rule_version

2.4 Retrieval fields (per retrieved chunk)

RAG does not retrieve “information”; it retrieves specific artefacts. This subsection defines the metadata needed to trace each retrieved chunk back to governed source records, precedence rules, and sensitivity classifications.

For each chunk in the retrieval set, store the following (a sketch of the record follows the list):

  • chunk_id
  • retrieval_rank
  • retrieval_score (similarity or relevance score, per the retrieval method)
  • retrieval_method (vector, BM25, hybrid)
  • index_id / index_version
  • source_type (table row, document, policy, email, case note)
  • source_system
  • source_uri (table + key, or doc reference)
  • source_primary_key (business key strongly preferred)
  • effective_from, effective_to (when sourced from temporal tables)
  • precedence_rank (if multiple sources exist)
  • sensitivity_label (PII/PCI/Confidential/etc.)
  • redaction_policy_version
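
As a concrete illustration, the per-chunk columns above map naturally onto a small record type. This is a minimal sketch using a Python dataclass; the enumerated example values in the comments are illustrative, not a fixed vocabulary:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class RetrievedChunkProvenance:
    """Provenance for one chunk in the retrieval set (minimum columns)."""
    chunk_id: str
    retrieval_rank: int
    retrieval_score: float                      # similarity / relevance score
    retrieval_method: str                       # "vector", "bm25", "hybrid"
    index_id: str
    index_version: str
    source_type: str                            # table row, document, policy, email, case note
    source_system: str
    source_uri: str                             # table + key, or document reference
    source_primary_key: str                     # business key strongly preferred
    sensitivity_label: str                      # PII / PCI / Confidential / ...
    redaction_policy_version: str
    effective_from: Optional[datetime] = None   # when sourced from temporal tables
    effective_to: Optional[datetime] = None
    precedence_rank: Optional[int] = None       # if multiple sources exist
```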

2.5 Model + response fields

Models and responses are part of the evidence chain, not a black box at the end of it. These fields make it possible to reason about how an answer was produced and how it was handled operationally.

  • model_id
  • model_version
  • temperature / top_p / relevant parameters
  • response_text (or reference)
  • response_hash
  • safety_flags (policy violations, high-risk markers)
  • human_review_required (boolean)
  • human_approver_id (if applicable)
  • final_disposition (served / blocked / edited / escalated)

This is “minimum viable evidence”. Anything less leaves you unable to prove what happened.
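
Putting the interaction, prompt, temporal, and model/response fields together, one possible shape for that evidence record is sketched below. It assumes the RetrievedChunkProvenance type from section 2.4 and is illustrative of structure, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass(frozen=True)
class RagInteractionRecord:
    """One row per RAG interaction: the 'minimum viable evidence'."""
    # Interaction-level fields
    interaction_id: str
    timestamp_utc: datetime
    user_id: str
    user_role: str
    channel: str                            # complaints tool, analyst notebook, AML case system...
    use_case_id: str                        # mapped to a risk category
    # Prompt fields
    prompt_hash: str
    system_prompt_version: str
    prompt_classification: str
    # Temporal fields (if Temporal RAG)
    as_of_timestamp: Optional[datetime] = None
    reconstruction_mode: Optional[str] = None         # AS_KNOWN vs AS_NOW_KNOWN
    reconstruction_rule_version: Optional[str] = None
    # Retrieval fields (one entry per retrieved chunk, see section 2.4)
    retrieval_set: List["RetrievedChunkProvenance"] = field(default_factory=list)
    # Model + response fields
    model_id: str = ""
    model_version: str = ""
    temperature: Optional[float] = None
    response_hash: str = ""
    safety_flags: List[str] = field(default_factory=list)
    human_review_required: bool = False
    human_approver_id: Optional[str] = None
    final_disposition: str = "served"       # served / blocked / edited / escalated
```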

3. Retrieval Scoring + Precedence Metadata (Cosine Similarity Is Not Enough)

Similarity search optimises relevance, not correctness. In regulated environments, retrieval must respect institutional rules about source authority and conflict resolution. This section explains why ranking purely by similarity is insufficient — and how precedence and entitlement must shape what the model is allowed to see.

Where historical questions are involved, this retrieval hierarchy assumes the use of Temporal RAG patterns described earlier in the series, ensuring that similarity scoring is applied only after temporal correctness and point-in-time reconstruction have been enforced.

In regulated environments, similarity search is only part of retrieval quality. You also need retrieval correctness.

3.1 The retrieval hierarchy you actually want

Defensible retrieval is a pipeline, not a single algorithm. The hierarchy outlined here reflects the order in which constraints should be applied to ensure that only appropriate, temporally correct, and authorised context reaches the model.

A defensible ranking pipeline typically looks like this (a code sketch follows the list):

  1. Entitlement filter (user is allowed to see it)
  2. Temporal filter (if historical question, only retrieve from correct PIT slice)
  3. Precedence weighting (if multi-source conflicts exist)
  4. Similarity scoring (vector / hybrid scoring)
  5. Diversity control (avoid 10 near-duplicate chunks)
  6. Safety gating (exclude disallowed document classes)

3.2 Why precedence must influence retrieval

Where multiple systems represent the same business fact, the institution already has rules for which one wins. This subsection explains why ignoring those rules at retrieval time undermines the integrity of the entire platform, even if the model behaves “correctly”.

If your platform has a multi-source precedence framework (core banking vs CRM vs KYC etc.), then retrieval must respect it.

Otherwise, the LLM will:

  • retrieve a lower precedence record with better “textiness”
  • present it confidently
  • and you will later discover that the narrative is built on the wrong source system

The model isn’t “wrong” — your retrieval is.

3.3 Practical weighting

Perfection is not required to be defensible. This subsection shows how simple weighting approaches can materially improve correctness without introducing unnecessary complexity.

A simple approach (sketched below):

  • similarity score gives you the candidates
  • precedence rank adjusts ordering within candidates
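
As a minimal illustration of that two-stage approach, the sketch below takes the top-k candidates by similarity and lets precedence rank decide the final ordering. The field names are assumptions carried over from the provenance schema in section 2.4:

```python
def rerank_by_precedence(candidates: list, k: int = 20) -> list:
    """Similarity picks the candidate pool; precedence orders within it.

    Assumes each candidate dict carries 'retrieval_score' (higher = more similar)
    and 'precedence_rank' (1 = most authoritative source).
    """
    # Stage 1: similarity score gives you the candidates
    pool = sorted(candidates, key=lambda c: c["retrieval_score"], reverse=True)[:k]
    # Stage 2: precedence rank adjusts ordering within those candidates
    return sorted(pool, key=lambda c: (c["precedence_rank"], -c["retrieval_score"]))
```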

You don’t need sophistication to start; you need correctness.

4. Prompt + Context + Response Logging Standard (Audit-Grade)

RAG systems create a new class of record: AI-mediated narratives that influence regulated outcomes. This section sets out how those interactions should be logged — not as scattered telemetry, but as coherent evidence artefacts that can be retained, queried, and presented.

These AI-mediated decisions and narratives are records in their own right.

Your logging standard should capture the entire chain.

4.1 A pragmatic JSON envelope (interaction log)

Rather than prescribing tooling, this subsection focuses on structure. A single, well-defined envelope simplifies audit, replay, and retention by keeping the entire interaction chain together.

Your system should log a single, structured record that can be stored immutably (append-only), for example (a JSON sketch follows the list):

  • prompt payload (user + system prompt versions)
  • retrieval payload (list of chunks with provenance)
  • model execution payload (model/version/params)
  • response payload (text + safety classification)
  • approval payload (if required)

The goal is not “lots of logs”. The goal is one coherent evidence artefact per interaction.
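
A minimal sketch of such an envelope, with illustrative values only. The payload names mirror the list above and should line up with the provenance schema in section 2:

```python
import json
from datetime import datetime, timezone

envelope = {
    "interaction_id": "example-interaction-0001",
    "logged_at_utc": datetime.now(timezone.utc).isoformat(),
    "prompt": {
        "prompt_hash": "sha256:...",
        "system_prompt_version": "guardrails-2025.03",
        "user_prompt_ref": "prompts/2025/06/example-interaction-0001.json",  # stored reference
    },
    "retrieval": [
        {
            "chunk_id": "chunk-0042",
            "retrieval_rank": 1,
            "retrieval_score": 0.87,
            "source_system": "core_banking",
            "source_uri": "silver.accounts#account_id=123",
            "effective_from": "2024-01-01T00:00:00Z",
            "effective_to": None,
            "sensitivity_label": "PII",
        }
    ],
    "model_execution": {"model_id": "example-model", "model_version": "2025-06", "temperature": 0.0},
    "response": {"response_hash": "sha256:...", "safety_flags": []},
    "approval": {"human_review_required": False, "human_approver_id": None},
}

# One coherent evidence artefact per interaction, written append-only.
print(json.dumps(envelope, indent=2))
```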

4.2 Two-tier storage model

Operational needs and regulatory retention requirements are not the same. This subsection explains how separating fast-access logs from immutable archives supports both without compromise.

A pattern that works well:

  • Fast log store for recent interactions (30–90 days): queryable for operations
  • Immutable archive for long retention: WORM (write-once, read-many) storage or equivalent controls

Remember: if your RAG system is used in complaints or remediation, you may need retention aligned to those processes.
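
The sketch below illustrates the two-tier write using the local filesystem as a stand-in for both tiers. In production the archive tier would be object storage with WORM or retention-lock controls enforced by the platform itself, which application code cannot provide:

```python
import json
from pathlib import Path

def write_interaction_log(envelope: dict,
                          fast_root: str = "logs/fast",
                          archive_root: str = "logs/archive") -> None:
    """Write one envelope to both tiers: queryable fast store + long-retention archive."""
    interaction_id = envelope["interaction_id"]
    day = envelope["logged_at_utc"][:10]   # YYYY-MM-DD partition (ISO-8601 timestamp assumed)

    # Tier 1: fast store, partitioned by day, queryable for operations (30-90 days)
    fast_path = Path(fast_root) / day
    fast_path.mkdir(parents=True, exist_ok=True)
    with open(fast_path / "interactions.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(envelope) + "\n")

    # Tier 2: immutable archive, one object per interaction, never rewritten
    # (WORM / retention controls must be enforced by the storage platform)
    archive_path = Path(archive_root) / day
    archive_path.mkdir(parents=True, exist_ok=True)
    target = archive_path / f"{interaction_id}.json"
    if not target.exists():
        target.write_text(json.dumps(envelope, indent=2), encoding="utf-8")
```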

5. Human-in-the-Loop Triggers for High-Risk Queries

Not all RAG usage is equal. Some outputs merely assist understanding; others influence regulated decisions. This section introduces the idea that risk lies not just in the model, but in how and where its outputs are used — and that escalation should be driven by context, not guesswork.

One of the worst mistakes in FS is assuming that “internal use” means “low risk”.

Internal RAG can generate outputs that:

  • influence regulated decisions
  • are pasted into customer communications
  • become part of remediation evidence
  • shape AML case outcomes

5.1 Practical triggers

Rather than abstract risk scoring, this subsection grounds escalation in recognisable FS scenarios, making it easier to implement consistently across teams.

Require human review when (a sketch of the trigger logic follows the list):

  • query involves complaints, redress, remediation, Consumer Duty
  • query involves AML, sanctions, PEP, fraud investigation narratives
  • query asks for “why” or “justify” in a regulated process
  • query references large exposures, vulnerable customers, suitability/affordability
  • model confidence is low or context is weak
  • retrieved context includes sensitive classes (PII, case notes, protected logs)
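
A minimal sketch of how those triggers might be evaluated at runtime. The keyword lists, sensitivity labels, and thresholds are illustrative placeholders, not a calibrated policy:

```python
HIGH_RISK_TOPICS = {"complaint", "redress", "remediation", "consumer duty",
                    "aml", "sanctions", "pep", "fraud",
                    "large exposure", "vulnerable customer",
                    "suitability", "affordability"}
JUSTIFICATION_MARKERS = {"why", "justify", "rationale"}
SENSITIVE_LABELS = {"PII", "CASE_NOTE", "PROTECTED_LOG"}

def requires_human_review(query: str,
                          retrieved_labels: set,
                          model_confidence: float,
                          context_strength: float) -> bool:
    q = query.lower()
    # Regulated processes and investigation narratives
    if any(topic in q for topic in HIGH_RISK_TOPICS):
        return True
    # "Why" / "justify" style questions inside a regulated process
    if any(marker in q.split() for marker in JUSTIFICATION_MARKERS):
        return True
    # Sensitive classes present in the retrieved context
    if retrieved_labels & SENSITIVE_LABELS:
        return True
    # Weak evidence or low confidence
    if model_confidence < 0.6 or context_strength < 0.5:
        return True
    return False
```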

5.2 The right mental model

Approval is not about validating the model’s intelligence. It is about authorising the use of its output in a specific regulatory context. This subsection reframes human-in-the-loop as a governance control, not a quality check.

You are not “approving the model”. You are approving the use of its output in a regulated context.

That distinction matters.

6. The Reproducibility Guarantee (Replay Is the Standard)

Reproducibility is where many RAG programmes quietly fail. This section sets the bar: if an organisation cannot reconstruct what evidence was used at the time, it cannot defend the outcome, regardless of how reasonable it appeared.

Where historical questions are involved, this reproducibility standard assumes the Temporal RAG patterns described earlier in the series, so that point-in-time context can be reconstructed on demand.

A regulator-defensible RAG system must support the question:

“Reproduce exactly what happened, using the same inputs.”

6.1 What reproducibility requires

Replay is not a single feature; it is an accumulation of versioned components. This subsection clarifies what must be retained and versioned for reproducibility to be meaningful.

You must version and retain (a manifest sketch follows the list):

  • index contents (or index references + snapshot IDs)
  • chunking strategy (version)
  • embedding model version
  • retrieval algorithm version
  • system prompt version
  • model version + parameters
  • PIT reconstruction rules (if temporal)
  • source data snapshots or stable references
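
Collected together, these versions form a replay manifest that is stored alongside each interaction record. A minimal sketch, with field names that mirror the list above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReplayManifest:
    """Everything needed to reconstruct the retrieval context for one interaction."""
    interaction_id: str
    index_snapshot_id: str                  # index contents, or a stable snapshot reference
    chunking_strategy_version: str
    embedding_model_version: str
    retrieval_algorithm_version: str
    system_prompt_version: str
    model_version: str
    model_parameters: dict                  # temperature, top_p, etc.
    pit_reconstruction_rule_version: Optional[str] = None   # if temporal
    source_snapshot_ref: Optional[str] = None               # or stable source references
```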

6.2 Two levels of replay

Exact answer reproduction is desirable but not always necessary. This subsection explains why context replay is the minimum defensible standard — and why it is sufficient to support regulatory review.

  • Context replay (minimum): reproduce the retrieved context set exactly
  • Answer replay (nice-to-have): reproduce the model output exactly (harder due to non-determinism)

In FS, context replay is the minimum bar — it proves what evidence was used.

If you can replay the context, you can defend the decision-making chain even if the model output varies slightly.
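
Context replay can then be checked mechanically: rerun retrieval against the recorded snapshot and compare it with what was logged. A minimal sketch, reusing the ReplayManifest from section 6.1 and assuming a retrieve callable in your stack that can target a named index snapshot (an assumption, not a given):

```python
from typing import Callable, List

def verify_context_replay(manifest: "ReplayManifest",
                          logged_chunk_ids: List[str],
                          retrieve: Callable[..., List[dict]]) -> bool:
    """Re-run retrieval against the recorded snapshot and compare chunk identities."""
    replayed = retrieve(
        index_snapshot_id=manifest.index_snapshot_id,
        chunking_strategy_version=manifest.chunking_strategy_version,
        embedding_model_version=manifest.embedding_model_version,
        retrieval_algorithm_version=manifest.retrieval_algorithm_version,
    )
    replayed_ids = [chunk["chunk_id"] for chunk in replayed]
    # Context replay passes when the evidence set (and its order) matches exactly.
    return replayed_ids == logged_chunk_ids
```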

FS Reality Check

This is where theory meets operational reality. Many organisations only discover the mutability of their vector stores when asked to reconstruct past behaviour. This reality check highlights why index governance must be treated as a data management problem, not an AI curiosity.

Most organisations discover too late that their vector store is a constantly mutating artefact, and they cannot reconstruct what was retrieved last month. That is the moment the RAG system becomes indefensible.

Your index needs versioning semantics. Treat it like a regulated dataset.

7. Useful vs Regulator-Defensible: Where Most RAG Programmes Fail

By this point, the distinction should be clear. This section makes it explicit. It contrasts speed and usefulness with evidence and control, not to criticise experimentation, but to underline why prototypes cannot simply be promoted into regulated workflows.

A final blunt note, because it will save time.

A “useful” RAG system:

  • answers questions
  • makes people faster
  • feels magical

A “regulator-defensible” RAG system:

  • can prove what it used
  • can replay context
  • enforces access controls
  • exposes temporal semantics
  • logs every interaction as an evidence artefact
  • maps responsibilities to accountable owners under SMCR

A surprising number of FS RAG programmes stop at “useful” and assume the rest can be bolted on later. In practice, the bolt-on phase becomes a rewrite.

8. Conclusion: The New Definition of Enterprise RAG in FS

This conclusion brings the argument together. In Financial Services, enterprise RAG is not defined by model capability or user satisfaction, but by whether the institution can explain, replay, and take responsibility for what the system does. This section closes by restating that standard — and the choice organisations face if they are not yet ready to meet it.

In 2025–2026, RAG is not novel. What’s novel is RAG that can survive scrutiny.

A regulator-defensible enterprise RAG system is one where:

  • every answer has provenance
  • every interaction can be replayed
  • every retrieval respects entitlement, time, and precedence
  • every high-risk use case has approval workflows
  • every component has accountable ownership under SMCR

If your institution can do that, then RAG becomes a powerful capability layer on top of Bronze/Silver/Gold/Platinum — not a compliance liability.

And if you can’t, the safest thing to do is to admit you’re still prototyping and keep it out of regulated workflows until the evidence chain is real.