Tag Archives: Data Quality

WTF is the Fellegi–Sunter Model? A Practical Guide to Record Matching in an Uncertain World

The Fellegi–Sunter model is the foundational probabilistic framework for record linkage… deciding whether two imperfect records refer to the same real-world entity. Rather than enforcing brittle matching rules, it treats linkage as a problem of weighing evidence under uncertainty. By modelling how fields behave for true matches versus non-matches, it produces interpretable scores and explicit decision thresholds. Despite decades of new tooling and machine learning, most modern matching systems still rest on this logic… often without acknowledging it.

Continue reading

Aligning the Data Platform to Enterprise Data & AI Strategy

This article establishes the data platform as the execution engine of Enterprise Data & AI Strategy in Financial Services. It bridges executive strategy and technical delivery by showing how layered architecture (Bronze, Silver, Gold, Platinum), embedded governance, dual promotion lifecycles (North/South and East/West), and domain-aligned operating models turn strategic pillars, architecture & quality, governance, security & privacy, process & tools, and people & culture, into repeatable, regulator-ready outcomes. The result is a platform that delivers control, velocity, semantic alignment, and safe AI enablement at scale.

Continue reading

Production-Grade Testing for SCD2 & Temporal Pipelines

The testing discipline that prevents regulatory failure, data corruption, and sleepless nights in Financial Services. Slowly Changing Dimension Type 2 pipelines underpin regulatory reporting, remediation, risk models, and point-in-time evidence across Financial Services — yet most are effectively untested. As data platforms adopt CDC, hybrid SCD2 patterns, and large-scale reprocessing, silent temporal defects become both more likely and harder to detect. This article sets out a production-grade testing discipline for SCD2 and temporal pipelines, focused on determinism, late data, precedence, replay, and PIT reconstruction. The goal is simple: prevent silent corruption and ensure SCD2 outputs remain defensible under regulatory scrutiny.

Continue reading