This article argues that probabilistic and graph-based identity techniques are unavoidable in regulated Financial Services, but only defensible when tightly governed. Deterministic entity resolution remains the foundation, providing anchors, constraints, and auditability. Probabilistic scores and identity graphs introduce likelihood and network reasoning, not truth, and must be time-bound, versioned, and replayable. When anchored to immutable history, SCD2 discipline, and clear guardrails, these techniques enhance fraud and AML insight; without discipline, they create significant regulatory risk.
Contents
- 1. Introduction: When Deterministic Identity Breaks Down
- 2. Why Probabilistic Identity Is Unavoidable in Financial Services
- 3. Deterministic Identity as the Foundation, Not the Enemy
- 4. Probabilistic Matching: From Rules to Likelihood
- 5. Graph-Based Identity: From Pairs to Networks
- 6. Time, Versioning, and Belief in Probabilistic Identity
- 7. Graph Evolution and Temporal Reconstruction
- 8. Governance: Making Probability Defensible
- 9. Operating Probabilistic & Graph Identity Safely
- 10. Where This Fits in the Architecture
- 11. Common Failure Modes to Avoid
- 12. Conclusion: Probability Without Discipline Is Risk
1. Introduction: When Deterministic Identity Breaks Down
In regulated Financial Services, identity is rarely a clean, binary problem. Customers move, change names, share devices, open accounts through intermediaries, and appear across systems that were never designed to agree with one another. Deterministic entity resolution handles the obvious cases well, but it fails precisely where regulatory, fraud, and financial crime risks concentrate.
This article explores probabilistic and graph-based identity techniques in Financial Services — not as replacements for deterministic matching, but as controlled extensions of it. It explains how these techniques can be safely applied on top of a Bronze-layer, SCD2-anchored platform in a way that remains explainable, replayable, and defensible under FCA and PRA scrutiny. Written for architects, data platform leads, and risk owners in regulated FS.
The focus is not on novelty, but on governance, time, and accountability.
Part of the “land it early, manage it early” series on SCD2-driven Bronze architectures for regulated Financial Services. This instalment covers governed probabilistic and graph-based ER extensions, written for identity specialists, risk teams, and architects who need matching beyond deterministic rules, and sets out the controls that make advanced ER regulator-defensible.
2. Why Probabilistic Identity Is Unavoidable in Financial Services
Once deterministic matching reaches its limits, Financial Services does not get the option to stop reasoning about identity. Risk does not disappear when certainty does. Instead, it concentrates in precisely those areas where intent, obfuscation, and coordination matter most.
Certain identity questions cannot be answered with keys and rules alone:
- Fraud rings share devices and addresses but not names.
- Beneficial owners intentionally obscure direct relationships.
- Customers reappear after remediation under slightly altered details.
- Networks matter more than individuals in AML scenarios.
In these cases, the question is not “are these records identical?” but “how likely is it that these records refer to the same underlying actor or network?”
Ignoring probabilistic identity does not reduce risk — it simply pushes it downstream into manual processes, spreadsheets, or opaque vendor systems.
3. Deterministic Identity as the Foundation, Not the Enemy
Before discussing probability, likelihood, or graphs, a harder boundary must be drawn. In regulated environments, the question is not whether probabilistic techniques are powerful, but whether they are anchored to something that regulators can trust.
One principle must be explicit:
Probabilistic identity without deterministic grounding is indefensible in regulated environments.
Deterministic entity resolution:
- establishes hard anchors,
- defines known non-matches,
- constrains the search space,
- and provides audit-safe baselines.
Probabilistic methods operate within this framework, never outside it. They propose hypotheses; they do not assert truth.
4. Probabilistic Matching: From Rules to Likelihood
Where deterministic rules fail to assert identity, probabilistic matching reframes the problem. Instead of asking for certainty, it asks how much evidence exists — and how that evidence should be weighed.
Probabilistic matching assigns a likelihood that two records refer to the same entity based on multiple signals.
4.1 Common Probabilistic Models
In regulated FS, the most defensible approaches remain:
- Fellegi–Sunter models (explicit likelihood ratios)
- Logistic regression over engineered features
- Scorecard-style weighted signals
These approaches are favoured because:
- features are interpretable,
- thresholds are explicit,
- outcomes are reproducible.
Black-box models may exist in experimentation layers, but production decisions must be explainable in plain language.
4.2 Signals Commonly Used
Typical probabilistic signals include:
- name similarity scores,
- address proximity and stability,
- device or contact reuse,
- behavioural overlap (login times, channels),
- document reuse across identities.
Each signal is weak alone. Their value emerges in combination.
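As a minimal sketch of how such signals might be combined, the following applies Fellegi–Sunter-style agreement and disagreement weights and explicit, versionable decision thresholds. The m/u probabilities, signal names, and threshold values here are illustrative assumptions, not calibrated production values:

```python
import math

# Hypothetical m/u probabilities per signal (illustrative, not calibrated):
# m = P(signal agrees | same entity), u = P(signal agrees | different entities).
MU = {
    "name_similar":  (0.95, 0.10),
    "address_match": (0.85, 0.05),
    "device_shared": (0.60, 0.01),
}

def fs_score(agreements: dict) -> float:
    """Sum of log2 likelihood ratios (Fellegi-Sunter match weights)."""
    total = 0.0
    for signal, (m, u) in MU.items():
        if agreements[signal]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

def decide(score: float, upper: float = 6.0, lower: float = 0.0) -> str:
    """Explicit threshold bands: link / clerical review / non-link."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "review"

score = fs_score({"name_similar": True, "address_match": True,
                  "device_shared": False})
band = decide(score)
```

Because every weight traces back to a named signal and a stated m/u pair, the resulting decision can be explained signal by signal, which is precisely why this family of models remains defensible.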
5. Graph-Based Identity: From Pairs to Networks
Probabilistic matching still reasons in pairs. But many of the highest-risk identity problems in Financial Services do not exist at the level of two records — they exist in the structure formed by many weak relationships.
Where probabilistic matching scores pairs, graph-based identity reasons about structures.
5.1 Identity as a Graph
In a graph model:
- nodes represent entities, records, devices, documents, accounts;
- edges represent relationships (uses, owns, shares, transacts);
- edge weights represent confidence or frequency.
Identity becomes an emergent property of connectivity rather than direct equality.
5.2 Why Graphs Matter in FS
Graphs are essential when:
- indirect relationships carry risk,
- networks evolve over time,
- single-entity views miss collective behaviour.
Fraud, AML, and organised financial crime almost always manifest as patterns, not records.
In regulated environments, graph techniques are selected not for mathematical novelty but for their ability to produce explainable structures — such as clusters, paths, and shared hubs — that can be articulated and challenged in plain language.
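A minimal stdlib-only sketch of these explainable structures, using hypothetical account, device, and address identifiers: a breadth-first traversal yields candidate identity clusters, and high-degree artefact nodes surface shared hubs:

```python
from collections import defaultdict, deque

# Hypothetical edge list: entities linked to the artefacts they use.
edges = [
    ("acct_1", "device_A"), ("acct_2", "device_A"),
    ("acct_2", "addr_X"),   ("acct_3", "addr_X"),
    ("acct_4", "device_B"),
]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def connected_component(start):
    """BFS: everything reachable from `start` -- a candidate identity cluster."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

def shared_hubs(min_degree=2):
    """Artefacts (devices, addresses) reused by multiple entities."""
    return {n for n, nbrs in adj.items()
            if not n.startswith("acct_") and len(nbrs) >= min_degree}

cluster = connected_component("acct_1")
hubs = shared_hubs()
```

Here acct_1 and acct_3 share no attribute directly, yet fall into one cluster through device_A and addr_X; that chain of shared artefacts is exactly the kind of structure that can be articulated and challenged in plain language.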
6. Time, Versioning, and Belief in Probabilistic Identity
At this point, the discussion shifts. Sections 4 and 5 describe how probabilistic and graph-based identity infer relationships. What follows addresses a different question: how long those inferences are valid, how they change over time, and how belief itself must be treated as a governed, versioned artefact.
Introducing probability and graphs does not relax the platform’s obligations around time. On the contrary, uncertainty increases the importance of temporal discipline, versioning, and historical reconstruction.
Probabilistic and graph-based identity must obey the same temporal discipline as deterministic ER.
6.1 Belief Is Time-Bound
A probability score is not a fact. It is:
- a belief,
- held under a specific model version,
- given the data available at that time.
Therefore:
- scores must be versioned,
- thresholds must be versioned,
- and historical beliefs must remain reconstructable.
6.2 SCD2 for Identity Hypotheses
Just as deterministic entity links are maintained as SCD2 records, probabilistic assertions must be as well:
| entity_id | related_entity_id | score | model_version | effective_from | effective_to |
|---|---|---|---|---|---|
This allows the platform to answer:
- What did we believe at the time?
- What would we believe now?
- Why did a decision differ?
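One way such a table might be maintained and queried is sketched below, with an in-memory list standing in for the SCD2 belief table; the entity IDs, scores, and model-version names are hypothetical:

```python
from datetime import datetime

OPEN_END = datetime.max  # sentinel for the currently-open SCD2 row
history = []             # in-memory stand-in for the SCD2 belief table

def record_belief(entity, related, score, model_version, at):
    """Close any open row for this pair, then open a new versioned row.
    Old beliefs are end-dated, never overwritten."""
    for row in history:
        if (row["entity_id"], row["related_entity_id"]) == (entity, related) \
                and row["effective_to"] == OPEN_END:
            row["effective_to"] = at
    history.append({
        "entity_id": entity, "related_entity_id": related,
        "score": score, "model_version": model_version,
        "effective_from": at, "effective_to": OPEN_END,
    })

def belief_as_of(entity, related, at):
    """What did we believe at time `at`? (as-known-at-the-time replay)"""
    for row in history:
        if (row["entity_id"], row["related_entity_id"]) == (entity, related) \
                and row["effective_from"] <= at < row["effective_to"]:
            return row
    return None

record_belief("E1", "E2", 0.62, "fs_v1", datetime(2024, 1, 1))
record_belief("E1", "E2", 0.91, "fs_v2", datetime(2024, 6, 1))
```

Because the earlier row survives with its original score and model version, a decision taken in March can be replayed against the 0.62 belief that actually drove it, while a decision today uses the current 0.91.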
7. Graph Evolution and Temporal Reconstruction
Once identity is represented as a graph, change no longer happens only at the record level. The graph itself becomes a living structure whose evolution must be governed and explainable.
In practice, this mirrors bi-temporal data management: separating when an identity relationship was believed to be true from when that belief was recorded or revised, allowing platforms to reason about decisions as-known-at-the-time, not as-recomputed-later.
Graphs are not static.
Over time:
- new nodes appear,
- edges strengthen or decay,
- clusters merge or fracture.
A regulated platform must support:
- graph snapshots as-of time,
- recomputation under new models,
- explainable deltas between versions.
This is only possible when graphs are built from immutable Bronze history and versioned inference logic.
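As-of snapshots and explainable deltas follow naturally from SCD2-versioned edges; a sketch with illustrative edge records and dates:

```python
from datetime import datetime

# Hypothetical SCD2-versioned edge records.
edge_history = [
    {"src": "A", "dst": "B",
     "effective_from": datetime(2024, 1, 1),
     "effective_to": datetime(2024, 5, 1)},   # link later retired
    {"src": "A", "dst": "C",
     "effective_from": datetime(2024, 3, 1),
     "effective_to": datetime.max},           # still believed
]

def snapshot(at):
    """Graph as-of `at`: only edges believed valid at that instant."""
    return {(e["src"], e["dst"]) for e in edge_history
            if e["effective_from"] <= at < e["effective_to"]}

def delta(t1, t2):
    """Explainable difference between two as-of snapshots."""
    g1, g2 = snapshot(t1), snapshot(t2)
    return {"added": g2 - g1, "removed": g1 - g2}

d = delta(datetime(2024, 2, 1), datetime(2024, 6, 1))
```

The delta is itself an auditable artefact: it states exactly which relationships appeared or were retired between two points in time, rather than silently replacing one graph with another.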
8. Governance: Making Probability Defensible
The hardest problem in probabilistic identity is not scoring, modelling, or graph construction. It is governance: deciding what probabilistic outputs are allowed to influence, under what conditions, and with what accountability.
The central regulatory challenge is not using probabilistic identity — it is controlling it.
8.1 What Regulators Expect
In the UK, this expectation aligns directly with FCA and PRA requirements around record-keeping, model risk management, and individual accountability under SM&CR. Firms must be able to demonstrate not only what decision was taken, but why it was reasonable given the information and models available at the time — and which Senior Manager was accountable for that framework.
Regulators do not require certainty. They require:
- clear rationale for decisions,
- documented thresholds,
- consistency over time,
- the ability to explain why alternatives were rejected.
8.2 Hard Guardrails
These guardrails exist not to constrain innovation, but to ensure that probabilistic identity remains compatible with audit, supervisory review, and post-incident reconstruction.
In practice, this means:
- probabilistic outputs inform decisions; they do not replace policy,
- high-risk actions require deterministic confirmation or human review,
- models are treated as governed assets, not analytics experiments.
From a supervisory perspective, probabilistic identity models fall squarely under model risk management and SM&CR accountability: model assumptions, thresholds, and permitted uses must be owned, documented, reviewed, and traceable to accountable individuals.
9. Operating Probabilistic & Graph Identity Safely
Even well-governed models fail if they are poorly operated. At scale, probabilistic identity becomes an operational system, not an analytical one — and operational discipline determines whether it remains safe.
To operate safely at scale:
- separate signal generation from decision making,
- store intermediate scores, not just outcomes,
- monitor drift in score distributions and graph structure,
- feed manual reviews back as new evidence, not retroactive truth.
Most failures occur not in modelling, but in operational shortcuts.
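As one concrete example of monitoring drift in score distributions, a Population Stability Index check can be sketched as follows; the bin count and the 0.25 alert threshold are common conventions rather than requirements:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two score samples over fixed bins."""
    def dist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # Floor empty bins at a tiny value to avoid log(0).
        return [max(c / len(xs), 1e-6) for c in counts]
    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Illustrative samples: a baseline window and a later window of match scores.
baseline = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5]
today    = [0.6, 0.7, 0.7, 0.8, 0.9, 0.9]
drifted  = psi(baseline, today) > 0.25  # common alerting threshold
```

A drift alert does not say the model is wrong; it says the population has moved, which should trigger review of thresholds and model version rather than a silent recalibration.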
10. Where This Fits in the Architecture
Probabilistic and graph-based identity should not float freely across the platform. Their value and risk are both determined by where they sit in the data architecture and what layers are allowed to consume them.
In a disciplined FS platform:
- Bronze holds immutable history and deterministic ER.
- Bronze+ / Spine hosts probabilistic scores and graph relationships.
- Silver consumes governed decisions, not raw probabilities.
- Gold / Platinum reason about outcomes, not inference mechanics.
Probabilistic identity augments the spine — it never bypasses it.
11. Common Failure Modes to Avoid
Most regulatory exposure from probabilistic identity does not come from edge cases or advanced techniques. It comes from repeatable, well-intentioned shortcuts that collapse uncertainty into false certainty.
Experience shows recurring mistakes:
- treating probability as truth,
- collapsing model output directly into “golden” views,
- overwriting historical belief with new scores,
- hiding uncertainty behind dashboards.
Each of these creates regulatory exposure.
12. Conclusion: Probability Without Discipline Is Risk
Probabilistic and graph-based identity are not optional in modern Financial Services. But neither are they shortcuts.
When anchored to:
- immutable Bronze history,
- deterministic entity resolution,
- explicit temporal versioning,
- and strong governance,
they become powerful, defensible tools for understanding identity at scale.
When applied casually, they become some of the most dangerous systems in the enterprise.
Used this way, they do not replace certainty where it exists; they extend insight where certainty is impossible. In regulated Financial Services, that discipline is what allows firms to see risk earlier, act proportionately, and still explain their decisions years later, extending regulatory confidence rather than undermining it.