Entity Resolution & Matching at Scale on the Bronze Layer

Entity resolution has become one of the hardest unsolved problems in modern UK Financial Services data platforms. This article sets out a Bronze-layer–anchored approach to resolving customers, accounts, and parties at scale using SCD2 as the temporal backbone. It explains how deterministic, fuzzy, and probabilistic matching techniques combine with blocking, clustering, and survivorship to produce persistent, auditable entity identities. By treating entity resolution as platform infrastructure rather than an application feature, firms can build defensible Customer 360 views, support point-in-time reconstruction, and meet growing FCA and PRA expectations.

Introduction: How UK Financial Services actually build Customer 360, Account 360 and Party 360 without collapsing under their own data

In modern UK Financial Services data platforms, the question “which source wins?” (precedence) is only half the problem. The harder and more fundamental question is: “are these even the same entity?”

Entity resolution (ER) — deciding whether two or more records represent the same real-world customer, account, party, household, or beneficial owner — is now one of the main reasons Customer 360, KYC, and risk modernisation programmes stall or fail.

This article describes how to design Bronze-layer–anchored entity resolution:

  • using SCD2 historical data as the temporal backbone,
  • applying deterministic, fuzzy, and probabilistic matching strategies,
  • maintaining persistent entity identifiers and clusters over time,
  • and doing all of this in a way that is regulator-defensible, rebuildable, and auditable under FCA/PRA scrutiny.

It is written from the perspective of a UK FS data platform where Bronze carries full temporal truth, Silver provides non-SCD current-state views, and higher layers build business context.

Without disciplined entity resolution, “Customer 360” degenerates into a naming convention rather than a defensible construct.

1. Problem Statement: Why Entity Resolution Is Now the Hard Problem

Before discussing techniques or architecture, it is necessary to understand why entity resolution has become one of the most persistent blockers in modern Financial Services data platforms. This section frames ER not as a data quality nuisance, but as a structural challenge created by scale, history, regulation, and fragmentation.

In earlier parts of this series we tackled:

  • SCD2 at the Bronze layer
  • Non-SCD Silver for current state
  • Advanced SCD2 optimisation
  • Golden-source precedence and point-in-time reconstruction

Those patterns answer questions like:

  • “What did our sources say, and when?”
  • “Which source wins for each attribute?”
  • “What did we believe on a given date?”

But they assume we already know which records belong to which real-world entity. In practice:

  • The same person appears under slightly different names and addresses.
  • A customer has accounts across multiple product systems, each with its own identifiers.
  • Corporate entities are duplicated with small variations in LEI, registration, or trading names.
  • Household relationships and beneficial owners are distributed across KYC, CRM, and legacy systems.

So the next unavoidable question is:

Given all of these SCD2’d records from multiple systems, which ones actually represent the same entity — and which ones definitely do not?

That is the role of Entity Resolution (ER):

  • grouping records that refer to the same underlying entity,
  • assigning them a stable, persistent entity_id,
  • and keeping that mapping consistent and explainable over time.

Done well, ER becomes the spine of Customer 360, Account 360, Party 360 and KYC views.
Done badly (or not at all), it quietly undermines every downstream model and regulatory submission.

2. Core Concepts: Records, Entities, Links and Clusters

Entity resolution discussions often fail because basic terms are used loosely or inconsistently. This section establishes a shared vocabulary, allowing the rest of the article to reason precisely about identity, belief, and change over time.

To talk precisely about ER on the Bronze layer, a few basic definitions help.

  • Record
    A single row from a source system at a point in time (often an SCD2 version in Bronze).
  • Entity
    The real-world thing we care about: an individual, a legal entity, an account, a relationship, a household.
  • Entity ID (or “spine ID”)
    A stable, platform-level identifier that groups records believed to represent the same entity.
  • Link
    A relationship between records or between a record and an entity ID. Links often have a type (exact_id_match, high_score_match, manually_confirmed, manually_rejected).
  • Cluster
    The set of records that belong to the same entity_id at a given point in time.

Entity resolution is about managing links and clusters over time, anchored to SCD2’d Bronze records.

Please note: An entity_id represents a belief held by the platform at a point in time, not an immutable truth.

3. Where Entity Resolution Lives in a Medallion Architecture

Where ER is implemented determines whether it becomes a platform capability or a perpetual source of inconsistency. This section explains why ER cannot live purely in applications or Silver models, and why anchoring it close to Bronze is critical for temporal truth and rebuildability.

A common and understandable mistake is to treat entity resolution as purely a Silver or application concern. For example:

  • CRM builds its own “golden customer” logic,
  • KYC builds another,
  • Risk builds yet another,
  • each with partial, conflicting logic.

The result is:

  • three different “Customer 360s”,
  • inconsistent risk numbers,
  • and very unhappy regulators.

In a Bronze/Silver/Gold/Platinum medallion model:

  • Raw
    Landed, unchanged source data (possibly with XML/JSON blobs as discussed in the previous article).
  • Bronze
    SCD2 history per source, plus cross-source temporal truth.
  • Entity Resolution Layer (Bronze+ / Spine)
    A set of tables that map source records → entity_id, built on top of Bronze history.
  • Silver
    Non-SCD current-state, using entity_id as the anchor for Customer/Account/Party views.
  • Gold
    Business KPIs, MI, risk & pricing models using entity_id and stable relationships.
  • Platinum
    Conceptual/semantic models spanning multiple domains.

The key point:

Entity resolution belongs as close to Bronze as possible, because it must:

  • use the full historical signal,
  • be replayable and rebuildable,
  • and underpin all downstream domain views.

ER is not just a feature in one team’s data mart — it is part of the platform’s temporal backbone.

4. Matching Strategies: From Deterministic Keys to Probabilistic Scoring

No single matching technique is sufficient at Financial Services scale. This section introduces the spectrum of matching approaches and explains how mature platforms combine them, rather than treating ER as either a rules problem or a machine learning problem.

At a high level, there are four broad classes of matching technique. Mature platforms normally use a hybrid of all four.

4.1 Deterministic / Exact Matching

Simple, robust, and explainable:

  • Exact match on a shared identifier (e.g. customer_number, card_number, IBAN, LEI).
  • Exact match on identifiers with normalisation (case, whitespace, formatting).
  • Hash-based equality for documents (e.g. same passport number + issuing country).

Used for:

  • intra-system de-duplication,
  • cross-system mapping where IDs are formally shared,
  • strong, unambiguous matches.

Limitations:

  • only works if identifiers truly align;
  • fails when IDs are missing, mistyped, or system-specific.
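
As a concrete illustration, here is a minimal sketch of deterministic matching on a normalised shared identifier. The field and function names (iban, record_id) are illustrative, not prescriptive.

```python
# A minimal sketch of deterministic matching on a normalised shared
# identifier. Field names here are illustrative.

def normalise_iban(raw: str) -> str:
    """Canonicalise an IBAN: strip all whitespace, uppercase."""
    return "".join(raw.split()).upper()

def exact_id_pairs(records: list[dict]) -> list[tuple[str, str]]:
    """Return pairs of record_ids that share the same normalised IBAN."""
    by_iban: dict[str, list[str]] = {}
    for rec in records:
        if rec.get("iban"):
            by_iban.setdefault(normalise_iban(rec["iban"]), []).append(rec["record_id"])
    pairs: list[tuple[str, str]] = []
    for ids in by_iban.values():
        pairs.extend((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs
```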

4.2 Standardised / Normalised Matching

Deterministic rules on standardised values. For example:

  • Normalised names: strip punctuation, fold case, and remove common titles (MR/MRS/DR) and other stop words.
  • Address standardisation: postcodes, house numbers, street synonyms.
  • Phone/email canonicalisation: remove spaces, standard country codes, lowercase emails.

Pairs of records are considered matches if:

  • normalised_name matches, and
  • normalised_dob matches, and
  • normalised_postcode matches.

This is still deterministic, just on canonical forms.
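
A minimal sketch of these canonical forms is below; the stop lists and regex rules are assumptions that a real implementation would govern and tune per jurisdiction.

```python
import re

# Illustrative stop list; real implementations govern and localise these.
TITLES = {"MR", "MRS", "MS", "DR"}

def normalise_name(raw: str) -> str:
    """Uppercase, strip punctuation, drop common titles."""
    tokens = re.sub(r"[^\w\s]", "", raw.upper()).split()
    return " ".join(t for t in tokens if t not in TITLES)

def normalise_postcode(raw: str) -> str:
    """Uppercase and remove internal whitespace for comparison."""
    return re.sub(r"\s+", "", raw.upper())

def normalise_email(raw: str) -> str:
    return raw.strip().lower()

def is_deterministic_match(r1: dict, r2: dict) -> bool:
    """The rule above: canonical name, DOB, and postcode all agree."""
    return (normalise_name(r1["name"]) == normalise_name(r2["name"])
            and r1["dob"] == r2["dob"]
            and normalise_postcode(r1["postcode"]) == normalise_postcode(r2["postcode"]))
```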

4.3 Fuzzy / Similarity-Based Matching

For many fields, especially names and addresses, you need fuzzy logic:

  • Levenshtein / edit distance
  • Jaro–Winkler similarity
  • Token-based similarity (Jaccard, cosine over q-grams)
  • Phonetic algorithms (Soundex, Metaphone)

Example rule:

If Jaro–Winkler(name) ≥ 0.93, same DOB, and postcode edit distance ≤ 1 → candidate match.

These yield a score rather than a binary yes/no, and are especially useful across legacy systems and human-entered data.
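
The example rule above translates directly into code. The sketch below assumes the open-source jellyfish library for string similarity (function names per its current API); the 0.93 threshold is the illustrative one from the rule, not a recommendation.

```python
# Assumes the open-source jellyfish library for string similarity;
# thresholds are illustrative and must be tuned and validated.
import jellyfish

def is_candidate(r1: dict, r2: dict) -> bool:
    name_sim = jellyfish.jaro_winkler_similarity(r1["norm_name"], r2["norm_name"])
    postcode_dist = jellyfish.levenshtein_distance(r1["norm_postcode"], r2["norm_postcode"])
    return name_sim >= 0.93 and r1["dob"] == r2["dob"] and postcode_dist <= 1
```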

4.4 Probabilistic & Graph-Based Matching

At higher maturity, FS firms use:

  • probabilistic models (e.g. Fellegi–Sunter, logistic regression, ML models) to assign match probabilities;
  • graph-based approaches, where signals such as shared addresses, devices, IPs, employers, or family links create a network of related entities.

For example:

  • Two individuals may not look identical, but both:
    • share the same phone,
    • live at the same address,
    • and jointly hold an account.

From a fraud or AML perspective, that cluster is often what matters.

These models are powerful, but in regulated environments they must be kept explainable and deterministic in effect: their outputs must be reproducible, versioned, and auditable, even when the internal mechanics are complex.
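
As an illustration of the probabilistic approach, here is a minimal Fellegi–Sunter-style scorer. The m and u probabilities are invented for illustration; in practice they are estimated from data (for example via EM) and version-controlled like any other matching logic.

```python
import math

# Minimal Fellegi–Sunter-style scorer. For each field, m = P(agreement | same
# entity) and u = P(agreement | different entities). Values below are invented
# for illustration; in practice they are estimated and governed.
M_U = {
    "norm_name":     (0.95, 0.01),
    "dob":           (0.97, 0.002),
    "norm_postcode": (0.90, 0.05),
}

def match_weight(r1: dict, r2: dict) -> float:
    """Sum of log2 likelihood ratios across the compared fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if r1[field] == r2[field]:
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total
```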

5. Designing a Bronze-Layer Entity Resolution Engine

Once the conceptual and architectural placement is clear, the question becomes how to build ER as a repeatable, auditable capability. This section walks through the logical components of an ER engine that operates on SCD2 Bronze data rather than ad-hoc extracts.

Let’s walk through the essential components of an ER engine that sits logically on top of Bronze.

5.1 Inputs: SCD2’d Source Records

Inputs typically include:

  • Customer tables from each system (SCD2 in Bronze)
  • Account tables
  • Party / legal entity tables
  • KYC / AML reference data
  • Identity & onboarding systems

Each record carries:

  • business keys (customer_id, account_id…)
  • SCD2 metadata (effective_from, effective_to, is_current)
  • normalised attributes (name, address, contact, DOB…)
  • source_system identifiers
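
For concreteness, here is a sketch of the record shape the ER engine consumes; the field names are illustrative and would mirror your own Bronze conventions.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative shape of an SCD2'd Bronze record as consumed by the ER engine.
@dataclass(frozen=True)
class BronzeRecord:
    record_id: str        # surrogate key for this SCD2 version
    source_system: str    # e.g. "CORE", "CRM", "KYC"
    business_key: str     # source-native key (customer_id, account_id, ...)
    effective_from: date  # SCD2 validity window
    effective_to: date
    is_current: bool
    norm_name: str        # pre-normalised attributes
    norm_postcode: str
    dob: date | None
```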

5.2 Candidate Generation (Blocking)

Naively comparing every record with every other is quadratic in the number of records, which is computationally infeasible at FS scale.

So we use blocking:

  • build blocks on easily matchable keys: e.g. normalised_postcode, last_name_prefix, partial DOB;
  • only run expensive matching within blocks.

Examples:

  • “all records with the same postcode and same year-of-birth”;
  • “all records sharing the same LEI”;
  • “all records with same last 4 digits of phone + same surname initial”.

Blocking is where many ER engines either succeed or silently miss matches. It must be tuned and tested.

Blocking strategies must be biased toward recall rather than precision: a candidate pair that is never generated is never scored, so the miss is silent and cannot be recovered downstream.
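
A minimal sketch of multi-pass blocking follows. The blocking keys are illustrative; the important property is that several independent passes are unioned, so a miss in one pass can be caught by another.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of multi-pass blocking with unioned candidate pairs. Keys are
# illustrative; real strategies are tuned and tested for recall.
def blocking_keys(rec: dict) -> list[str]:
    keys = []
    if rec.get("norm_postcode") and rec.get("dob"):
        keys.append(f"pc_yob:{rec['norm_postcode']}:{rec['dob'].year}")
    if rec.get("lei"):
        keys.append(f"lei:{rec['lei']}")
    return keys

def candidate_pairs(records: list[dict]) -> set[tuple[str, str]]:
    blocks: dict[str, list[str]] = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["record_id"])
    pairs: set[tuple[str, str]] = set()
    for ids in blocks.values():
        pairs.update(tuple(sorted(p)) for p in combinations(ids, 2))
    return pairs
```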

5.3 Scoring and Classification

Within each block:

  • apply deterministic rules first (hard matches),
  • then fuzzy comparisons (for candidates that aren’t exact),
  • then, optionally, a probabilistic model.

The output is something like:

left_record_id | right_record_id | score | match_type | decision
A | B | 0.99 | EXACT_ID | MATCH
C | D | 0.92 | NAME_DOB_ADDRESS_FUZZ | MATCH
E | F | 0.63 | NAME_ONLY | REVIEW
G | H | 0.10 | LOW_SIMILARITY | NON_MATCH

The decision should be deterministic and version-controlled:

  • thresholds for MATCH/REVIEW/NON_MATCH
  • which rules apply in which jurisdictions or business contexts
  • versioning of the scoring logic itself (for audit and replay)
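
A minimal sketch of such a versioned decision step is below; the thresholds and version label are placeholders for governed, change-controlled configuration.

```python
# Sketch of a versioned decision step. Thresholds and the version label are
# placeholders for governed configuration.
RULESET_VERSION = "er-rules-v1"  # hypothetical version label
THRESHOLDS = {"match": 0.90, "review": 0.60}

def classify(score: float) -> dict:
    if score >= THRESHOLDS["match"]:
        decision = "MATCH"
    elif score >= THRESHOLDS["review"]:
        decision = "REVIEW"
    else:
        decision = "NON_MATCH"
    # Stamping the ruleset version onto every decision enables audit and replay.
    return {"score": score, "decision": decision, "ruleset": RULESET_VERSION}
```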

5.4 Entity Clustering and Persistent IDs

From pairwise matches and non-matches, we need to form clusters and assign a stable entity_id:

  • If A matches B and B matches C → cluster {A, B, C} with entity_id E12345
  • If later D is matched to B → cluster extends to {A, B, C, D}

Important considerations:

  • Ensure clustering is transitive and consistent.
  • Break or re-cluster when a hard non-match (e.g. manually confirmed) is added.
  • Treat clusters as SCD2 entities themselves (see next section).

Clusters should be treated as first-class data structures, not incidental query results.
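
Transitive clustering from pairwise matches is classically done with a union-find (disjoint-set) structure, sketched below. Handling of hard non-matches as cannot-link constraints is deliberately omitted for brevity.

```python
# Union-find (disjoint-set) sketch for forming transitive clusters from
# pairwise MATCH decisions.
def cluster(matches: list[tuple[str, str]]) -> dict[str, set[str]]:
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for a, b in matches:
        union(a, b)

    clusters: dict[str, set[str]] = {}
    for rec in list(parent):
        clusters.setdefault(find(rec), set()).add(rec)
    return clusters
```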

5.5 SCD2 for Entities: Versioning the Cluster over Time

Entities are not static. Over time:

  • new records join a cluster;
  • some links are revoked (manual review, new information);
  • entity may split into two clusters or merge with another.

So we use SCD2 not just for source records, but for entity_id↔record clusters:

entity_id | record_id | effective_from | effective_to | link_type | link_confidence
E12345 | A | 2021-01-01 | 9999-12-31 | MATCH_RULE_01 | 0.99
E12345 | B | 2021-03-10 | 9999-12-31 | MATCH_RULE_01 | 0.99
E12345 | C | 2022-07-15 | 9999-12-31 | MATCH_RULE_02 | 0.93

When something changes (e.g. B is later deemed not to be the same entity):

  • we close off B’s link with a new effective_to;
  • re-cluster if needed;
  • downstream PIT queries can reconstruct what we believed at any point in time.

This is crucial in regulated FS environments.
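
A sketch of the close-off step follows, operating on in-memory rows for clarity; in a real platform this would be an SCD2 update against the link table, with the superseding belief written as a new row.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)

# Sketch of revoking a link without destroying history: the open SCD2 row
# is closed off rather than deleted, so PIT queries still see the old belief.
def close_link(link_rows: list[dict], entity_id: str, record_id: str,
               as_of: date, reason: str) -> None:
    for row in link_rows:
        if (row["entity_id"] == entity_id
                and row["record_id"] == record_id
                and row["effective_to"] == OPEN_END):
            row["effective_to"] = as_of   # close the open version
            row["close_reason"] = reason  # e.g. "manually_rejected"
```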

6. Survivorship and Golden Records

Resolving identity and choosing attribute values are related but distinct problems. This section separates entity resolution from survivorship, clarifying how “golden” views are constructed without corrupting historical truth or rewriting identity decisions.

Entity resolution groups records; survivorship decides which values we take from those records for the entity-level “golden” view.

This interacts directly with your previous golden-source precedence article.

For a given entity_id cluster, at a point in time:

  • decide which record provides each attribute (e.g. Core → legal name, KYC → residency status, CRM → email);
  • apply the precedence matrix at entity level;
  • track which system “won” for each attribute.

This gives a current-state Silver view like:

entity_id | legal_name | address | email | risk_rating | source_legal_name | source_address | source_email
E12345 | Alice Lee | 14 Bishopsgate | alice@domain.com | HIGH | CORE | CORE | CRM

Survivorship rules must be:

  • explicit,
  • version-controlled,
  • and explainable to a regulator.

Survivorship must never retroactively rewrite entity resolution history.
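
To make the mechanics concrete, here is a minimal survivorship sketch driven by an attribute-level precedence matrix; the matrix contents are illustrative stand-ins for governed configuration.

```python
# Sketch of attribute-level survivorship driven by a precedence matrix.
PRECEDENCE = {
    "legal_name": ["CORE", "KYC", "CRM"],
    "residency":  ["KYC", "CORE"],
    "email":      ["CRM", "CORE"],
}

def survive(cluster_records: list[dict]) -> dict:
    """Take each attribute from the highest-precedence source that has it,
    recording which system won for auditability."""
    golden: dict = {}
    for attr, sources in PRECEDENCE.items():
        for source in sources:
            candidates = [r for r in cluster_records
                          if r["source_system"] == source and r.get(attr)]
            if candidates:
                # Tie-breaking within a source (e.g. latest version) omitted.
                golden[attr] = candidates[0][attr]
                golden[f"source_{attr}"] = source
                break
    return golden
```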

7. Temporal Complexity: How History Interacts with Matching

Entity resolution is inherently temporal: identities evolve as new information arrives and old assumptions are corrected. This section explores how time, late data, and reinterpretation interact with matching logic in regulated environments.

Entity resolution is not just about now — it’s about then.

7.1 “State as Known” vs “State as Now Known”

As discussed previously:

  • “State as known on 01 Jan 2022” means what we believed then.
  • “State as now known on 01 Jan 2022” means what we would have believed then, had we possessed our current knowledge and corrected data.

Entity resolution must support both:

  • PIT entity clusters as of a date using historic match rules;
  • recalculated clusters under new, corrected rules for backtesting or remediation.
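
The first of these is a straightforward filter over the SCD2 link table, sketched below. If the link rows record the platform's original beliefs, this yields "state as known"; replaying corrected rules over Bronze yields "state as now known".

```python
from datetime import date

# Sketch of PIT cluster reconstruction from the SCD2 link table: a record
# belongs to an entity on as_of if the link row's validity window covers it.
def clusters_as_of(link_rows: list[dict], as_of: date) -> dict[str, set[str]]:
    clusters: dict[str, set[str]] = {}
    for row in link_rows:
        if row["effective_from"] <= as_of < row["effective_to"]:
            clusters.setdefault(row["entity_id"], set()).add(row["record_id"])
    return clusters
```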

7.2 Late-Arriving Information

When a late KYC correction or identity document arrives:

  • it may dramatically change match behaviour;
  • two entities may merge or split;
  • clusters may be reconfigured historically.

A mature ER engine needs temporal repair capabilities:

  • versioned matching logic;
  • “re-evaluate history from date X using rule version Y”;
  • replay SCD2 clusters and survivorship with new knowledge.

This is where having ER anchored to SCD2 Bronze is invaluable. You have all the raw material to rebuild deterministically.

8. Regulatory & Risk Use Cases

Entity resolution is not an abstract data exercise; it directly underpins regulatory, risk, and customer obligations. This section connects ER capabilities to the concrete questions regulators and risk teams actually ask under scrutiny.

In UK FS, regulators increasingly care not just about what you reported, but whether:

  • your notion of “customer”, “account”, “party” is consistent and defensible;
  • you can show how entities were formed;
  • you can explain why two records were (or were not) treated as the same person.

Entity resolution feeds directly into:

  • KYC / AML: identifying customers across systems and over time, linking to sanctions, PEP lists, adverse media.
  • Consumer Duty & Remediation: making sure you’re aggregating all relevant products, transactions, and advice across the true entity.
  • Credit & Risk Models: correctly aggregating exposures at customer, household, or group level.
  • Fraud & Financial Crime: understanding networks of accounts, devices, and behaviour.
  • Operational Resilience: knowing which customers/accounts are affected by incidents.

Being able to demonstrate:

  • the ER logic,
  • the source attributes used,
  • and the ability to re-run and reconstruct history

is becoming a key part of PRA/FCA expectations.

Regulators increasingly expect firms to demonstrate not just the outcome of entity resolution, but the process by which it was reached, including why alternative matches were rejected.

9. Practical Platform Patterns (Databricks, Snowflake, Others)

While the principles of entity resolution are platform-agnostic, implementation details matter at scale. This section highlights common patterns seen across modern data platforms, without prescribing specific vendor solutions.

This article is platform-agnostic on purpose, but a few recurring patterns are worth calling out.

9.1 Databricks / Delta Lake

  • Use Bronze SCD2 tables as input.
  • Use Spark for blocking, scoring and clustering at scale.
  • Store entity_id↔record links as Delta tables with SCD2 versioning.
  • Use Delta Change Data Feed to incrementally update matches and clusters.
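
As a sketch of the incremental pattern (assuming a Databricks session with CDF enabled on the Bronze table; the table name and the last_processed_version bookmark are illustrative):

```python
# Sketch of incremental ER driven by Delta Change Data Feed.
changed = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version)  # hypothetical bookmark
    .table("bronze.customer_scd2")
    .filter("_change_type IN ('insert', 'update_postimage')")
)
# Re-derive blocking keys for the changed records only, then score them
# against existing clusters instead of re-running ER over full history.
```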

9.2 Snowflake

  • Use Streams & Tasks over SCD2’d base tables to feed incremental ER logic.
  • Use QUALIFY with ROW_NUMBER for precedence and survivorship.
  • Store ER scores and decisions in dedicated tables for auditing.

9.3 BigQuery / Iceberg / Hudi

  • Similar architectural patterns:
    • SCD2 base,
    • blocking and scoring jobs (SQL or Spark/Flink),
    • entity link tables,
    • PIT views built on top.

The key is not the specific engine, but the separation of concerns:

  • SCD2 per domain → ER logic → survivorship → PIT & Silver.

10. Operating & Governing Entity Resolution

Building an ER engine is only the beginning. This section addresses what it takes to operate, monitor, govern, and evolve entity resolution safely over time in a live Financial Services environment.

Building an ER engine once is not enough. You need to run and govern it.

Key practices:

  • Metrics & Monitoring
    • number of entities, average cluster size, match rates, review queue volume;
    • drift in match scores over time.
  • Feedback Loops
    • input from operations, KYC, fraud teams on false positives/negatives;
    • manual confirm/reject signals fed back into models and rules.
  • Versioning and Governance
    • treat ER configuration (rules, thresholds, precedence) as code;
    • subject it to change control, testing, and approval;
    • maintain an audit trail of changes.
  • Data Quality Dependencies
    • ER is only as good as the attributes it sees;
    • invest in name/address/contact standardisation upstream.

Fully automated match decisions without a governed review path are rarely defensible in high-risk use cases.
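
As a small illustration of the monitoring bullet above, basic cluster-health metrics can be derived directly from the currently-open links; the metric set shown is illustrative.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)

# Sketch of basic cluster-health metrics over currently-open links.
def cluster_metrics(link_rows: list[dict]) -> dict:
    sizes: dict[str, int] = {}
    for row in link_rows:
        if row["effective_to"] == OPEN_END:
            sizes[row["entity_id"]] = sizes.get(row["entity_id"], 0) + 1
    n = len(sizes)
    return {
        "entity_count": n,
        "avg_cluster_size": (sum(sizes.values()) / n) if n else 0.0,
        # A sudden jump in max cluster size often signals over-matching.
        "max_cluster_size": max(sizes.values(), default=0),
    }
```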

11. Summary

Entity resolution is where historical truth, identity, and regulatory accountability intersect. This closing section consolidates the patterns described and reinforces why ER must be treated as core platform infrastructure rather than an afterthought.

Entity resolution on the Bronze layer is where:

  • SCD2 temporal truth,
  • multi-source precedence,
  • and real-world identity

finally meet.

Without ER, your SCD2 Bronze is an exquisitely detailed but fragmented archive. With ER:

  • you anchor records to persistent entities;
  • you can build trustworthy Customer/Account/Party 360 views;
  • you can reconstruct who was who when, under what logic;
  • and you can defend that story to regulators and auditors.

The pattern is conceptually simple, but operationally demanding:

  1. SCD2 Bronzes per source system
  2. Blocking + matching on standardised attributes
  3. Deterministic + fuzzy + probabilistic scoring
  4. Clusters and entity_ids with SCD2’d link history
  5. Survivorship rules to construct entity-level “golden” views
  6. Temporal repair and PIT reconstruction when rules or data change

As with the rest of this series, the goal is not to suggest that this is easy or “solved”, but to articulate the patterns that have proven to work in real, regulated UK Financial Services environments — and to do so in a way that can be implemented, tested, and defended.

If you get entity resolution right on the Bronze layer, everything above it — Silver, Gold, Platinum — becomes sharper, more reliable, and much easier to explain.