Tag Archives: Data Engineering

Series Wrap-Up: Reconstructing Time, Truth, and Trust in UK Financial Services Data Platforms

This series explored how UK Financial Services data platforms can preserve temporal truth, reconstruct institutional belief, and withstand regulatory scrutiny at scale. Beginning with foundational concepts such as SCD2 and event modelling, it developed into a comprehensive architectural pattern centred on an audit-grade Bronze layer, non-SCD Silver consumption, and point-in-time defensibility. Along the way, it addressed operational reality, governance, cost, AI integration, and regulatory expectations. This final article brings the work together, offering a structured map of the series and a coherent lens for understanding how modern, regulated data platforms actually succeed. Taken together, this body of work describes what I refer to as a “land it early, manage it early” data platform architecture for regulated industries.

East/West vs North/South Promotion Lifecycles: How Modern Financial Services Data Platforms Support Operational Stability and Analytical Freedom Simultaneously

This article argues that modern Financial Services (FS) data platforms must deliberately support two distinct but complementary promotion lifecycles. The well-known and well-understood North/South lifecycle provides operational stability, governance, and regulatory safety for customer-facing and auditor-visible systems. In parallel, the East/West lifecycle enables analytical exploration, experimentation, and rapid innovation for data science and analytics teams. By mapping these lifecycles onto layered data architectures (Bronze to Platinum) and introducing clear promotion gates, FS organisations can protect operational integrity while sustaining analytical freedom and innovation.

Production-Grade Testing for SCD2 & Temporal Pipelines

The testing discipline that prevents regulatory failure, data corruption, and sleepless nights in Financial Services. Slowly Changing Dimension Type 2 pipelines underpin regulatory reporting, remediation, risk models, and point-in-time evidence across Financial Services — yet most are effectively untested. As data platforms adopt CDC, hybrid SCD2 patterns, and large-scale reprocessing, silent temporal defects become both more likely and harder to detect. This article sets out a production-grade testing discipline for SCD2 and temporal pipelines, focused on determinism, late data, precedence, replay, and PIT reconstruction. The goal is simple: prevent silent corruption and ensure SCD2 outputs remain defensible under regulatory scrutiny.
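
To make the point-in-time idea concrete, here is a minimal sketch of two checks such a test suite might include: an as-of lookup and an interval-integrity check. The row layout, the sample data, and the function names `as_of` and `interval_defects` are illustrative assumptions, not taken from the article.

```python
from datetime import date

# Hypothetical SCD2 rows: (key, value, valid_from, valid_to).
# Assumption: valid_to is exclusive, and None marks the open (current) version.
ROWS = [
    ("C1", "Leeds", date(2024, 1, 1), date(2024, 6, 1)),
    ("C1", "York", date(2024, 6, 1), None),
]


def as_of(rows, key, when):
    """Return the value that was effective for `key` on `when`, or None."""
    for k, value, start, end in rows:
        if k == key and start <= when and (end is None or when < end):
            return value
    return None


def interval_defects(rows):
    """Report gaps or overlaps between versions, and keys with several open rows."""
    defects = []
    by_key = {}
    for row in rows:
        by_key.setdefault(row[0], []).append(row)
    for key, versions in by_key.items():
        versions.sort(key=lambda r: r[2])  # order each key's versions by valid_from
        for prev, nxt in zip(versions, versions[1:]):
            # Each version should end exactly where the next one starts.
            if prev[3] != nxt[2]:
                defects.append((key, "gap_or_overlap", prev[3], nxt[2]))
        open_rows = [v for v in versions if v[3] is None]
        if len(open_rows) > 1:
            defects.append((key, "multiple_open_rows", len(open_rows), None))
    return defects
```

Run against a replayed or reprocessed table, an empty result from `interval_defects` plus stable `as_of` answers for known dates is one inexpensive signal that history has not silently drifted.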

Event-Driven CDC to Correct SCD2 Bronze in 2025–2026

Broken history often stays hidden until remediation or skilled-person reviews. Why? Event-driven Change Data Capture fundamentally changes how history behaves in a data platform. When Financial Services organisations move from batch ingestion to streaming CDC, long-standing SCD2 assumptions quietly break — often without immediate symptoms. Late, duplicated, partial, or out-of-order events can silently corrupt Bronze history and undermine regulatory confidence. This article sets out what “correct” SCD2 means in a streaming world, why most implementations fail, and how to design Bronze pipelines that remain temporally accurate, replayable, and defensible under PRA/FCA scrutiny in 2025–2026.
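
One way to picture "correct" here is a rebuild whose output depends only on the set of events received, not on the order they arrived in. The sketch below is a simplified illustration under assumed conventions: each event is a hypothetical `(event_id, key, value, event_time)` tuple, redeliveries share an `event_id`, and history is rebuilt in full rather than patched incrementally.

```python
from datetime import datetime

# Hypothetical CDC events: (event_id, key, value, event_time).
EVENTS = [
    ("e1", "C1", "Leeds", datetime(2025, 1, 1)),
    ("e2", "C1", "York", datetime(2025, 3, 1)),
    ("e3", "C1", "York", datetime(2025, 4, 1)),
]


def build_scd2(events):
    """Rebuild SCD2 versions so the result is independent of arrival order."""
    unique = {e[0]: e for e in events}  # drop redelivered events by id
    # Sort by key, then event time, then id as a deterministic tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e[1], e[3], e[0]))
    versions = []
    for _, key, value, ts in ordered:
        last = versions[-1] if versions and versions[-1]["key"] == key else None
        if last and last["value"] == value:
            continue  # no real change, so no new version
        if last:
            last["valid_to"] = ts  # close the previous version
        versions.append(
            {"key": key, "value": value, "valid_from": ts, "valid_to": None}
        )
    return versions
```

Feeding the same events late, shuffled, or duplicated yields the same versions, which is the property a naive apply-in-arrival-order pipeline loses.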

From SCD2 Bronze to a Non-SCD Silver Layer in Other Tech (Iceberg, Hudi, BigQuery, Fabric)

Modern data platforms consistently separate historical truth from analytical usability by storing full SCD2 history in a Bronze layer and exposing a simplified, current-state Silver layer. Whether using Apache Iceberg, Apache Hudi, Google BigQuery, or Microsoft Fabric, the same pattern applies: Bronze preserves immutable, auditable change history, while Silver removes temporal complexity to deliver one row per business entity. Each platform implements this differently, via snapshots, incremental queries, QUALIFY, or Delta MERGE, but the architectural principle remains universal and essential for regulated environments.
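
Stripped of any platform specifics, the Bronze-to-Silver step amounts to "keep the latest version per entity and drop the temporal columns". The sketch below shows that logic in plain Python; the column names (`customer_id`, `valid_from`, `valid_to`) and the sample rows are assumptions for illustration, and it treats the most recent version as current without handling deletions.

```python
# Hypothetical SCD2 Bronze rows. ISO-8601 date strings sort chronologically.
BRONZE = [
    {"customer_id": "C1", "city": "Leeds", "valid_from": "2024-01-01", "valid_to": "2024-06-01"},
    {"customer_id": "C1", "city": "York", "valid_from": "2024-06-01", "valid_to": None},
    {"customer_id": "C2", "city": "Bath", "valid_from": "2024-03-15", "valid_to": None},
]

TEMPORAL_COLUMNS = ("valid_from", "valid_to")


def current_state(bronze_rows):
    """Collapse SCD2 history to one row per entity, without temporal columns."""
    latest = {}
    for row in bronze_rows:
        key = row["customer_id"]
        # Keep only the version with the greatest valid_from for each entity.
        if key not in latest or row["valid_from"] > latest[key]["valid_from"]:
            latest[key] = row
    return [
        {col: val for col, val in latest[key].items() if col not in TEMPORAL_COLUMNS}
        for key in sorted(latest)
    ]
```

On a real platform the same shape would typically be expressed in SQL or a table-format feature rather than in application code; Bronze is read, never modified.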

From SCD2 Bronze to a Non-SCD Silver Layer in Databricks

This article explains a best-practice Databricks lakehouse pattern for transforming fully historical SCD2 Bronze data into clean, non-SCD Silver tables. Bronze preserves complete temporal truth for audit, compliance, and investigation, while Silver exposes simplified, current-state views optimised for analytics and data products. Using Delta Lake features such as MERGE, Change Data Feed, OPTIMIZE, and ZORDER, organisations, particularly in regulated Financial Services, can efficiently maintain audit-grade history while delivering fast, intuitive, consumption-ready datasets.

Operationalising SCD2 at Scale: Monitoring, Cost Controls, and Governance for a Healthy Bronze Layer

This article explains how to operationalise Slowly Changing Dimension Type 2 (SCD2) at scale in the Bronze layer of a medallion architecture, with a focus on highly regulated Financial Services environments. It outlines three critical pillars (monitoring, cost control, and governance) needed to keep historical data trustworthy, performant, and compliant. By tracking growth patterns, preventing meaningless updates, controlling storage and compute costs, and enforcing clear governance, organisations can ensure their Bronze layer remains a reliable audit-grade historical asset rather than an unmanaged data swamp.

Advanced SCD2 Optimisation Techniques for Mature Data Platforms

Advanced SCD2 optimisation techniques are essential for mature Financial Services data platforms, where historical accuracy, regulatory traceability, and scale demands exceed the limits of basic SCD2 patterns. Attribute-level SCD2 significantly reduces storage and computation by tracking changes per column rather than per row. Hybrid SCD2 pipelines, combining lightweight delta logs with periodic MERGEs into the main Bronze table, minimise write amplification and improve reliability. Hash-based and probabilistic change detection eliminate unnecessary updates and accelerate temporal comparison at scale. Together, these techniques enable high-performance, audit-grade SCD2 in platforms such as Databricks, Snowflake, BigQuery, Iceberg, and Hudi, supporting the long-term data lineage and reconstruction needs of regulated UK Financial Services institutions.
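
Hash-based change detection is easy to sketch: hash only the business attributes you track, then compare digests instead of comparing column by column. The example below is an illustration under assumed names (`row_hash`, `has_changed`, and the `TRACKED` column list are hypothetical), using SHA-256 from the Python standard library.

```python
import hashlib
import json

# Assumption: only these attributes count as a business change.
TRACKED = ("customer_id", "city")


def row_hash(row, tracked=TRACKED):
    """Stable SHA-256 digest over the tracked attributes only."""
    # sort_keys gives a canonical order; default=str covers dates and decimals.
    payload = json.dumps({c: row.get(c) for c in tracked}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_changed(current, incoming, tracked=TRACKED):
    """True only when a tracked attribute differs between the two rows."""
    return row_hash(current, tracked) != row_hash(incoming, tracked)
```

Because technical columns such as a load timestamp sit outside `TRACKED`, a re-sent but otherwise identical record produces the same digest and need not open a new SCD2 version.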

Scaling the SCD2 Bronze Layer: Practical Strategies for Financial Services

A rapidly growing SCD2 Bronze layer is an expected outcome of implementing full historical tracking across Financial Services data platforms, where high-frequency attribute changes, noisy upstream systems, and long regulatory retention periods all drive data volume. This article outlines practical engineering strategies to keep SCD2 Bronze efficient and cost-effective, including effective partitioning, change-suppression logic, hot–warm–cold tiering, compaction and lifecycle optimisation, windowed processing, metadata offloading, and upstream data contracts. Advanced approaches, such as attribute-level SCD2, hybrid SCD2/delta-merge patterns, and hash-based change detection, further enhance scalability for mature platforms. The piece concludes that SCD2 growth is not a failure but a natural result of robust governance. With the right architecture and operational discipline, organisations can maintain an auditable, scalable, and regulator-aligned Bronze layer ready for analytics, AI, and Data Mesh.

Using SCD2 in the Bronze Layer with a Non-SCD2 Silver Layer: A Modern Data Architecture Pattern for UK Financial Services

UK Financial Services firms increasingly implement SCD2 history in the Bronze layer while providing simplified, non-SCD2 current-state views in the Silver layer. This pattern preserves full historical auditability for FCA/PRA compliance and regulatory forensics, while delivering cleaner, faster, easier-to-use datasets for analytics, BI, and data science. It separates “truth” from “insight,” improves governance, supports Data Mesh models, reduces duplicated logic, and enables deterministic rebuilds across the lakehouse. In regulated UK Financial Services today, it is the only pattern I have seen that satisfies the full, real-world constraint set with no material trade-offs.

WTF Is SCD? A Practical Guide to Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) are how data systems manage attributes that evolve without constantly rewriting history. They determine whether you keep only the latest value, preserve full historical versions, or maintain a limited snapshot of changes. The classic SCD types (0–3, plus hybrids) define different behaviours, from never updating values, to overwriting them, to keeping every version with timestamps. The real purpose of SCDs is to make an explicit choice about how truth should behave in your analytics: what should remain fixed, what should update, and what historical context matters. Modern data platforms make tracking changes easy, but they don’t make the design decisions for you. SCDs are ultimately the backbone of reliable, temporal, reality-preserving analytics.
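
The difference between overwriting and versioning is easiest to see side by side. Here is a minimal sketch contrasting Type 1 and Type 2 behaviour; the function names and the row layout (`valid_from`, `valid_to`, with `None` marking the current version) are illustrative assumptions.

```python
def apply_type1(table, key, value):
    """Type 1: overwrite in place, so the previous value is lost."""
    table[key] = value


def apply_type2(history, key, value, effective):
    """Type 2: close the open version for `key` and append a new one."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["value"] == value:
                return  # unchanged value: keep the existing open version
            row["valid_to"] = effective  # end-date the old version
    history.append(
        {"key": key, "value": value, "valid_from": effective, "valid_to": None}
    )


# Usage: the same two updates applied under each policy.
type1_table = {}
type2_history = []
for city, effective in [("Leeds", "2024-01-01"), ("York", "2024-06-01")]:
    apply_type1(type1_table, "C1", city)
    apply_type2(type2_history, "C1", city, effective)
```

After both updates, `type1_table` holds only the latest city, while `type2_history` holds two dated versions: the earlier one closed and the later one open.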

Databricks vs Snowflake vs Microsoft Fabric: Positioning the Future of Enterprise Data Platforms

This article extends the Databricks vs Snowflake comparison to include Microsoft Fabric, exploring the platforms’ philosophical roots, architectural approaches, and strategic trade-offs. It positions Fabric not as a direct competitor but as a consolidation play for Microsoft-centric organisations, and introduces Microsoft Purview as the governance layer that unifies divergent estates. Drawing on real enterprise patterns where Databricks underpins engineering, Fabric drives BI adoption, and functional teams risk fragmentation, the piece outlines the “Build–Consume–Govern” model and a phased transition plan. The conclusion emphasises orchestration across platforms, not choosing a single winner, as the path to a governed, AI-ready data estate.

Databricks vs Snowflake: A Critical Comparison of Modern Data Platforms

This article provides a critical, side-by-side comparison of Databricks and Snowflake, drawing on real-world experience leading enterprise data platform teams. It covers their origins, architecture, programming language support, workload fit, operational complexity, governance, AI capabilities, and ecosystem maturity. The guide helps architects and data leaders understand the philosophical and technical trade-offs, whether prioritising AI-native flexibility and open-source alignment with Databricks or streamlined governance and SQL-first simplicity with Snowflake. Practical recommendations, strategic considerations, and guidance by team persona equip readers to choose or combine these platforms to align with their data strategy and talent strengths.

Horkan

a blog by Wayne Horkan
