This article defines how to implement SCD2 in the Bronze layer to safely handle multi-source precedence, out-of-sequence data, partial and full loads, deletions, and transaction patterns in regulated Financial Services. It introduces a metadata-driven approach that preserves temporal truth, prevents ingestion-order corruption, and enables deterministic is_current. The result is a defensible, replayable foundation that simplifies downstream Silver layers and supports point-in-time reconstruction under audit.
Abstract
Modern Financial Services data platforms must reconcile conflicting assertions from multiple systems, often arriving out of sequence, partially, or concurrently. Traditional SCD2 implementations fail under these conditions, resulting in incorrect current-state views and limited auditability.
This article presents a Bronze-layer SCD2 pattern that is resilient to ingestion order, supports explicit source precedence, and models deletions and transactions correctly. By anchoring all logic in event-time and structured metadata, it ensures that institutional belief remains reconstructable.
The approach enables deterministic is_current, prevents version explosion, and creates a clean, governed foundation for Silver-layer consumption.
It operationalises the principle of “land it early, manage it early” for complex, regulated data environments.
Series Overview
This article is part of a follow-on series to the original “land it early, manage it early” work on SCD2 Bronze architectures in UK Financial Services.
The original series focused on the principles of temporal data platforms and how to store, govern, and reconstruct truth over time using SCD2 in the Bronze/temporal tracking layer.
This series moves into a different space: what happens when you actually build and run these systems.
In practice, real platforms must deal with:
- multiple competing sources of authority
- out-of-sequence and late-arriving data
- partial updates and deletions
- operational systems writing outside the platform
These are not edge cases; they are the normal operating conditions.
This article focuses on one of the hardest of these problems: maintaining deterministic temporal truth when data arrives out of order and from multiple sources.
Contents
- Abstract
- Series Overview
- Contents
- 1. Introduction: Why This Problem Exists (And Why It’s Not “Just SCD2”)
- 2. The Core Principle: Bronze Stores Evidence, Not Truth
- 3. The Three Families of Metadata You Must Capture
- 4. The Two Operating Modes (This is the Big Fork)
- 5. Bronze SCD2 Row Model (Schema + Meaning)
- 6. Hashing Strategy: Where the Hash Goes and Why
- 7. Handling Out-of-Sequence Events Correctly
- 8. Full Loads, Partial Loads, and “Over-the-Top” Snapshots
- 9. Deletions: What They Mean and How to Model Them
- 10. Precedence: How Multi-Source Truth Is Resolved
- 11. Transactions vs State: Stop Treating Events Like Dimensions
- 12. Reference PySpark Implementation (Bronze Writer)
- 12.1 Preferred Pattern — Bronze with Source Time
- 12.1.1 Inputs and Assumptions
- 12.1.2 Dedup + Ordering
- 12.1.3 Partial-Load Merge Logic
- 12.1.4 SCD2 Merge Logic
- 12.1.5 Full PySpark Reference Implementation
- 12.1.6 Handling Late Events
- 12.1.7 Handling Deletes
- 12.1.8 Producing is_current Safely
- 12.1.9 What This Preferred Pattern Achieves
- 12.1.10 Production Notes
- 12.2 Fallback Pattern — Bronze without Source Time
- 12.3 When You Must Reject a Source as Non-Defensible
- 12.4 CDC Microbatch Ingestion Pattern (State + Transaction Handling)
- 12.4.1 Inputs and Assumptions
- 12.4.2 CDC Op Mapping Rules
- 12.4.3 Microbatch Processing Flow
- 12.4.4 Reference PySpark Implementation
- 12.4.5 Why This Fits Regulated Financial Services
- 12.4.5.1 Full evidentiary capture
- 12.4.5.2 Out-of-sequence safety
- 12.4.5.3 Replay and idempotency
- 12.4.5.4 Correct handling of partial change
- 12.4.5.5 Separation of fact vs belief
- 12.4.5.6 Limitations and Caveats
- 12.4.5.6.1 Source-time quality is critical
- 12.4.5.6.2 Ordering guarantees vary by source
- 12.4.5.6.3 High-volume considerations
- 12.4.5.6.4 Deletes must be explicit
- 12.4.5.6.5 The Core Principle
- 12.4.6 Worked CDC Example: Insert, Partial Update, Delete, and Out-of-Order Arrival
- 12.4.7 Worked Example: Multi-Source CDC Conflict with Precedence
- 12.4.7.1 Scenario
- 12.4.7.2 The Source Assertions
- 12.4.7.3 Step 1: Bronze Stores All Assertions
- 12.4.7.4 Step 2: Bronze is NOT “the truth”
- 12.4.7.5 Step 3: Silver Applies Precedence
- 12.4.7.6 Step 4: Why This Works
- 12.4.7.7 Step 5: What NOT to Do
- 12.4.7.8 Step 6: Example Silver Query Pattern
- 12.4.7.9 The Core Insight
- 12.4.7.10 Final Takeaway
- 12.4.8 Worked Example: CDC + Partial + Precedence + Delete Across Two Sources
- 12.4.8.1 Scenario
- 12.4.8.2 The CDC Events
- 12.4.8.3 Step 1: Bronze Stores All Assertions in Source-Time Order
- 12.4.8.4 Step 2: Bronze SCD2 History
- 12.4.8.5 Step 3: What Silver Believes at Different Times
- 12.4.8.6 Step 4: Why This Example Matters
- 12.4.8.7 Step 5: Example Silver Resolution Shape
- 12.4.8.8 The Core Lesson
- 12.4.8.9 Final Takeaway
- 12.4.9 Streaming Reality: Watermarks, Replay, and Exactly-Once Illusions
- 12.5 Cross-Compatibility of Bronze Ingestion Patterns (How These Pieces Actually Fit Together)
- 12.5.1 The Compatibility Matrix
- 12.5.2 The Roles Are Not Equal
- 12.5.3 The Canonical Flow (What “Good” Looks Like)
- 12.5.4 The Critical Pairing: 12.1 + 12.4
- 12.5.5 Where 12.2 Fits (And Why It Must Be Contained)
- 12.5.6 Where 12.3 Sits (And Why It Matters)
- 12.5.7 CDC Is Not a Data Model
- 12.5.8 What This Enables
- 12.5.9 The Core Principle
- 12.5.10 Practical Guidance
- 12.5.11 The Outcome
- 12.6 Silver Survivorship Patterns (Attribute-Level Belief Resolution)
- 13. What This Enables in Silver (And Why Silver Gets Simpler)
- 14. Expected Substantiation (Regulator / Audit Lens)
- 15. Summary: “Land It Early, Manage It Early” Applied to Precedence
1. Introduction: Why This Problem Exists (And Why It’s Not “Just SCD2”)
At first glance, this looks like a standard SCD2 problem: track changes over time, maintain history, and derive a current view. In practice, regulated Financial Services platforms face a far more complex reality.
Most failed SCD2 implementations are not wrong because of SQL — they are wrong because they confuse ingestion order with truth.
1.1 Multi-Source Truth
The same entity is asserted by multiple systems — CRM, core banking, onboarding, risk, external providers — each with its own timing, completeness, and authority. These are not clean updates to a single record; they are competing versions of belief.
1.2 Out-of-Sequence Delivery
Data does not arrive in the order it was created. Late events, replays, corrections, and batch delays mean older states can arrive after newer ones. If ordering is based on ingestion, history becomes incorrect.
1.3 ADF Concurrency Storms
When ingestion frameworks (e.g. ADF) process files in parallel, thousands of records for the same entity can arrive simultaneously and out of order. Without deterministic logic, this creates duplicate versions, broken timelines, and incorrect is_current.
1.4 Regulator Expectations (PIT & Belief Reconstruction)
Regulators do not ask “what is true now?” — they ask:
What did you believe at the time, and why?
This requires:
- Point-in-time (PIT) reconstruction
- Evidence of source and authority
- A defensible timeline of changes
Standard SCD2 implementations — especially those based on arrival order — cannot meet this requirement.
2. The Core Principle: Bronze Stores Evidence, Not Truth
To solve this, the architecture must shift from “data processing” to evidence preservation.
2.1 Bronze = All Assertions
The Bronze layer captures:
- Every version received
- From every source
- With full provenance
- Without overwriting or collapsing
It answers:
What did we receive, when, and from whom?
It does not decide which version is “correct”.
2.2 Silver = Resolved Belief
The Silver layer applies:
- Precedence rules
- Business logic
- Data quality constraints
It answers:
Given all available evidence, what do we believe now (or at time X)?
2.3 Why “Overwrite” Is a Compliance Failure
Overwriting data at ingestion destroys the ability to:
- Reconstruct prior belief
- Explain conflicting inputs
- Demonstrate decision lineage
It replaces:
“We received two conflicting assertions and chose one”
with:
“We only ever had one version”
That is not a technical shortcut — it is a loss of evidence.
In regulated environments, that loss is indefensible.
This is why Bronze must be designed as an append-only, temporally ordered evidence store, not a cleaned or corrected dataset.
3. The Three Families of Metadata You Must Capture
To make Bronze SCD2 deterministic, replayable, and regulator-defensible, metadata must be structured deliberately. Three distinct families are required. Mixing them leads to ambiguity, broken timelines, and unexplainable outcomes.
3.1 Provenance / Source Identity
This metadata answers:
Where did this assertion come from, and under what context?
Required fields:
- `source_system` — the originating system (e.g. CRM, Core Banking)
- `source_type` — file, API, stream, database extract
- `source_path` — file path, topic, or table identifier
- `source_file` — file name where applicable
- `ingest_run_id` — pipeline execution identifier
- `ingest_ts` — when the platform received the data
Why it matters:
- Enables precedence rules across systems
- Supports audit and lineage (“which file said this?”)
- Allows reconstruction of ingestion conditions (e.g. ADF batch behaviour)
- Distinguishes duplicate storms from genuine updates
Without provenance, all records look the same — and conflicts cannot be explained.
3.2 Temporal Control Fields (Event vs Ingestion)
This metadata defines when something was true, versus when it was seen.
Required fields:
- `source_event_ts` (or equivalent) — when the source believed this state was valid
- `effective_from` — start of validity (usually = `source_event_ts`)
- `effective_to` — end of validity
- `is_current` — derived (`effective_to IS NULL`)
- `first_seen_ts` — first ingestion time
- `last_seen_ts` — most recent ingestion time
Why it matters:
- Separates truth timeline from processing timeline
- Allows correction of out-of-sequence arrival
- Enables point-in-time reconstruction
- Protects against ingestion-order corruption
Critical distinction:
- `source_event_ts` → governs truth
- `ingest_ts` → governs observability
Confusing the two is the most common cause of broken SCD2 implementations.
3.3 Keys & Hashes (Business Key, Surrogate, Attribute Hash)
This metadata defines identity and change detection.
Required fields:
- `business_key (bk)` — natural key identifying the entity
- `surrogate_key (sk)` — stable internal identifier (usually a hash of `bk`)
- `attr_hash` — hash of mutable attributes
Optional but useful:
- `record_hash` — full row hash (including metadata)
Why it matters:
- `bk` groups all assertions for the same entity
- `sk` stabilises joins and downstream models
- `attr_hash` determines whether a change is real
Hashing rules:
- Hash only mutable attributes
- Exclude metadata and timestamps
- Use consistent null handling
This enables:
- Idempotent ingestion
- Duplicate collapse (ADF storms)
- Prevention of version explosion
Without hashing, every arrival risks becoming a new version.
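The hashing rules above can be sketched in plain Python. This is illustrative only: the attribute names, null sentinel, and delimiter are assumptions, not part of any reference implementation.

```python
import hashlib

# Hypothetical mutable business attributes; metadata and timestamps excluded.
HASHED_ATTRS = ["customer_name", "status", "risk_rating"]

def attr_hash(row: dict) -> str:
    """Deterministic hash of mutable attributes with consistent null handling."""
    parts = []
    for col in HASHED_ATTRS:
        value = row.get(col)
        # A missing value and an explicit None hash to the same sentinel;
        # strings are trimmed so cosmetic whitespace is not a "change".
        parts.append("~NULL~" if value is None else str(value).strip())
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()
```

Because ingestion metadata never enters the hash, replaying the same logical state always produces the same `attr_hash`, which is what makes duplicate collapse and idempotent ingestion possible.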
4. The Two Operating Modes (This is the Big Fork)
All Bronze SCD2 implementations fall into one of two categories. The difference is not technical — it is epistemological.
The fork between source-time and ingestion-time is not a technical decision — it is a statement about what the platform claims it knows.
4.1 Mode A — You Have Source Time
You possess a reliable field such as:
- event timestamp
- effective date
- extract “as-of” timestamp
- source change sequence
This allows you to construct:
Source-time belief
Characteristics:
- Ordering is based on when the source asserted the state
- Late-arriving data can be inserted into the correct historical position
- Full and partial loads can be reconciled correctly
- `is_current` is deterministic and stable
- Point-in-time reconstruction is accurate
This is the required mode for regulated Financial Services.
4.2 Mode B — You Don’t (Ingestion-Time Belief)
No reliable source-time exists, or it is incomplete, inconsistent, or absent.
You are forced to use:
`ingest_ts` as ordering
This creates:
Ingestion-time belief
Characteristics:
- Ordering reflects when data arrived, not when it was true
- Late data cannot be correctly placed in history
- Full loads arriving late may overwrite newer belief
- `is_current` is dependent on arrival order
- Point-in-time reconstruction reflects platform behaviour, not source reality
This mode can still be useful for:
- operational reporting
- exploratory analytics
- non-regulated use cases
But it is not regulator-defensible for:
- audit
- financial reporting
- decision reconstruction
4.3 The Practical Implication
Before writing a single line of SCD2 logic, you must answer:
Do we have reliable source-time?
If yes → build a temporal system of record
If no → build an ingestion-history system, and be explicit about its limits
Everything else — precedence, deletions, out-of-sequence handling — depends on this decision.
5. Bronze SCD2 Row Model (Schema + Meaning)
A Bronze SCD2 row is not just a record of data. It is a record of institutional belief, source provenance, temporal validity, and ingestion evidence. If the row model is incomplete, the platform may still run — but it will not be able to explain itself under scrutiny.
The row model therefore needs to capture four things at once:
- What entity this row refers to
- What was believed
- When that belief was valid
- Where that belief came from
5.1 Required Columns
These are the minimum columns required for a regulator-defensible Bronze SCD2 implementation.
| Column | Meaning | What it substantiates |
|---|---|---|
| `bk` | Canonical business key | Which real-world entity the row relates to |
| `sk` | Stable surrogate key | Internal identity continuity and downstream joins |
| `source_system` | Originating system | Which authority asserted the value |
| `source_type` | File / API / stream / table | Nature of the source |
| `source_path` | File path, topic, or table identifier | Exact provenance of the assertion |
| `ingest_ts` | Time received by platform | When the platform first observed the data |
| `source_event_ts` | Time asserted by source | When the source believed the state was valid |
| `effective_from` | Start of validity | When this version began to apply |
| `effective_to` | End of validity | When this version ceased to apply |
| `is_current` | Current active version flag | Which version is currently effective |
| `attr_hash` | Hash of mutable business attributes | Whether this is genuinely a new version |
| `is_deleted` | Deletion marker | Whether this version represents a logical delete |
These columns are enough to support:
- Version ordering
- Out-of-sequence repair
- Duplicate suppression
- Point-in-time reconstruction
- Logical deletions
- Downstream Silver derivation
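As a sketch, the required columns can be expressed as a plain Python dataclass. This is a shape illustration only (not the reference PySpark schema); here `is_current` is modelled as derived from `effective_to`, consistent with the metadata definition in section 3.2.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BronzeScd2Row:
    """Illustrative shape of the required Bronze SCD2 columns."""
    # Identity
    bk: str                            # canonical business key
    sk: str                            # stable surrogate key (e.g. hash of bk)
    # Provenance
    source_system: str
    source_type: str
    source_path: str
    ingest_ts: datetime
    # Temporal
    source_event_ts: datetime
    effective_from: datetime
    effective_to: Optional[datetime] = None   # None while the version is open
    # Change & control
    attr_hash: str = ""
    is_deleted: bool = False

    @property
    def is_current(self) -> bool:
        # Derived, not asserted: open-ended validity means current.
        return self.effective_to is None
```

Deriving `is_current` rather than trusting a stored flag is one way to keep it deterministic under replay; a materialised column recomputed on every merge achieves the same effect.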
5.2 Recommended Columns
These are not strictly required for SCD2 to function, but they materially improve replayability, auditability, and operational safety.
| Column | Meaning | What it substantiates |
|---|---|---|
| `source_file` | Specific file name | Which artefact contained the record |
| `ingest_run_id` | Pipeline execution ID | Which orchestration run loaded it |
| `first_seen_ts` | First ingestion timestamp | When the platform first encountered this exact version |
| `last_seen_ts` | Most recent ingestion timestamp | Whether it has been replayed or resent |
| `precedence_rank` | Source priority for belief resolution | Why one source wins over another |
| `load_type` | FULL / PARTIAL | Whether missing fields are meaningful |
| `record_hash` | Hash of full row including metadata | Row-level replay and forensic deduplication |
| `entity_type` | STATE / TRANSACTION | Which processing rules apply |
These columns allow the platform to distinguish:
- identical facts replayed multiple times
- genuine change versus duplicate arrival
- partial updates versus full replacement
- source precedence versus temporal order
- state rows versus immutable transactions
5.3 What Each Group of Columns Is For
To keep the design readable, it helps to think of the row model in grouped families.
5.3.1 Identity Columns
- `bk`
- `sk`
- `entity_type`
These answer:
What thing are we talking about, and what kind of thing is it?
5.3.2 Provenance Columns
- `source_system`
- `source_type`
- `source_path`
- `source_file`
- `ingest_run_id`
These answer:
Where did this assertion come from, and how did it enter the platform?
5.3.3 Temporal Columns
- `source_event_ts`
- `ingest_ts`
- `effective_from`
- `effective_to`
- `is_current`
- `first_seen_ts`
- `last_seen_ts`
These answer:
When was this believed, when was it observed, and is it still current?
5.3.4 Change & Control Columns
- `attr_hash`
- `record_hash`
- `is_deleted`
- `precedence_rank`
- `load_type`
These answer:
Has anything meaningful changed, does this row represent deletion, and how should it be resolved against competing assertions?
6. Hashing Strategy: Where the Hash Goes and Why
Hashing is one of the most important controls in Bronze SCD2. Without it, every replay, duplicate, or noisy source update risks being treated as a new version. With it, the platform can distinguish between:
- a genuine state change
- a replay of the same state
- a duplicate caused by concurrent ingestion
- an unchanged row seen again later
Hashing does not create truth. It creates deterministic change detection.
6.1 Attribute Hash vs Record Hash
These serve different purposes and should not be confused.
6.1.1 Attribute Hash (attr_hash)
This is the hash of mutable business attributes only.
Example inputs:
- customer name
- status
- product type
- risk rating
- account state
It must exclude:
- ingestion metadata
- source file name
- load timestamp
- pipeline run id
- technical audit columns
Why:
Because it is used to answer:
Has the business state actually changed?
If attr_hash stays the same, the record is the same belief version even if it arrived ten times.
6.1.2 Record Hash (record_hash)
This is the hash of the entire row payload, often including metadata.
It may include:
- business attributes
- provenance fields
- load metadata
Why:
Because it is used to answer:
Is this exactly the same row we have seen before?
This is helpful for:
- forensic deduplication
- replay traceability
- comparing source artefacts under investigation
In most Bronze SCD2 logic, `attr_hash` drives versioning; `record_hash` supports evidence handling and replay diagnostics.
6.2 Change Detection
The core SCD2 rule is:
A new version is created only when the business key is the same and the `attr_hash` has changed.
This prevents version spam caused by:
- duplicate file processing
- repeated full extracts with identical content
- resubmitted rows from external systems
- concurrent ADF file opening
Typical rule:
- same `bk`
- same `source_event_ts`
- same `attr_hash`

→ same version
If attr_hash differs, that is a candidate new version.
This is why the hash must be applied after standardisation and before SCD2 comparison.
6.3 Idempotency Under Replay
Replay is inevitable in regulated platforms:
- failed pipelines rerun
- source systems resend files
- backfills occur
- out-of-sequence data arrives late
The Bronze process must therefore be idempotent.
Hashing enables idempotency by ensuring that replaying the same logical state does not create a new version.
This means:
- if the row is replayed unchanged → update `last_seen_ts`, do not create a version
- if the row reappears with a different hash → create or repair a version as appropriate
- if a file storm causes thousands of copies → collapse them to one version, preserve evidence in provenance metadata
Without hashing, replay becomes corruption.
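The replay rules above can be reduced to a small in-memory sketch. The dictionary store and field names are illustrative (this is not the reference PySpark writer); the point is that the apply operation is keyed on logical state, so re-running it is harmless.

```python
from datetime import datetime

def apply_assertion(store: dict, incoming: dict, now: datetime) -> str:
    """Idempotent Bronze apply, keyed by (bk, source_event_ts, attr_hash).

    Replaying the same logical state only refreshes last_seen_ts;
    a changed attr_hash surfaces as a candidate new version.
    """
    key = (incoming["bk"], incoming["source_event_ts"], incoming["attr_hash"])
    if key in store:
        store[key]["last_seen_ts"] = now   # replay or duplicate storm copy
        return "replayed"
    store[key] = {**incoming, "first_seen_ts": now, "last_seen_ts": now}
    return "new_version"
```

Running the same file twice, or a thousand concurrent copies of the same row, converges on a single stored version with an updated `last_seen_ts` — the evidence of the replay survives, the version chain does not grow.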
6.4 Where the Hash Should Be Applied
Hashing should be applied:
- After canonicalisation / normalisation
  - trim strings
  - standardise nulls
  - align data types
- Before SCD2 grouping and merge logic
  - so equality is determined on business state, not arrival conditions
- Separately for state and transaction handling
  - state uses `attr_hash` for versioning
  - transactions may use a transaction identifier or event hash for deduplication, but do not become SCD2 versions
This ensures the platform can:
- detect meaningful change
- remain replay-safe
- keep `is_current` stable under concurrency and reprocessing
6.5 Why Hashing Matters So Much Here
In a simple warehouse, hashing is often treated as a performance trick.
In Bronze SCD2 for regulated Financial Services, it is more than that. It is the mechanism that protects the platform from:
- accidental version explosion
- hidden duplicate storms
- replay corruption
- arrival-order bias
Put simply:
Hashing is how the platform distinguishes repeated evidence from changed belief.
6.6 Schema Evolution: Preserving History While the Model Changes
Schema evolution is inevitable in regulated Financial Services. New attributes appear (e.g. KYC enhancements), existing ones change meaning, and some are deprecated. The challenge is not accommodating change — it is doing so without corrupting historical truth.
The key principle is:
A schema change alters what can be asserted going forward — it must not rewrite what was previously known.
6.6.1 The Core Problem
If schema evolution is handled incorrectly, the platform will:
- recompute historical hashes using new columns
- overwrite or reinterpret prior states
- introduce false change events
- break point-in-time (PIT) reconstruction
This results in a system that cannot explain its own history, which is unacceptable under regulatory scrutiny.
6.6.2 Attribute Hash Stability
The attribute hash (attr_hash) must reflect:
only the attributes that were known and populated at the time of the assertion
This leads to two valid strategies:
6.6.2.1 Strategy A — Forward-Compatible Hashing (Recommended)
- Compute `attr_hash` only on attributes present in the incoming record
- Missing attributes are treated as “unknown”, not null
- Historical rows remain unchanged
6.6.2.2 Strategy B — Versioned Hash Definition
- Introduce `hash_version` alongside `attr_hash`
- Each schema version defines its own hashing logic
- Hash comparisons are only valid within the same version
This is more explicit but adds complexity.
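Strategy A can be illustrated with a short sketch (the function and column names here are hypothetical). Because absent attributes are skipped entirely rather than hashed as null, adding a column to the schema never changes the hash of a historical row that never asserted it.

```python
import hashlib

def forward_compatible_hash(row: dict, known_attrs: list[str]) -> str:
    """Strategy A: hash only the attributes actually asserted in the record.

    Attributes absent from the row are treated as "unknown" and skipped,
    so introducing a new column never alters historical hashes.
    """
    parts = [f"{col}={row[col]}" for col in sorted(known_attrs) if col in row]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()
```

Sorting the attribute names keeps the hash independent of column order, which matters when different sources supply overlapping subsets of the schema.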
6.6.3 Adding New Attributes
When a new attribute is introduced:
- do not backfill historical rows unless you have evidence
- do not recompute historical hashes
- treat the attribute as:
  - unknown before introduction
  - known after introduction
This ensures:
- no artificial change events
- no distortion of prior belief
6.6.4 Partial Updates Under Schema Evolution
Schema evolution amplifies the importance of partial merge logic:
- new attributes may only appear in some sources
- older sources may never supply them
The merge rule remains:
Only overwrite an attribute if the incoming record explicitly asserts it
All other attributes are inherited from the latest valid state at that time.
6.6.5 Deleting or Deprecating Attributes
If an attribute is removed:
- retain it in Bronze history
- stop populating it going forward
- optionally mark as deprecated in metadata
Historical queries must still be able to reconstruct prior states including that attribute.
6.6.6 What This Substantiates
With correct schema evolution handling, the platform can prove:
- what attributes were known at any point in time
- when new information became available
- that no historical state was retroactively altered
Without this discipline:
schema evolution becomes silent data corruption.
6.6.7 Packed Payloads, Blobs, and Deferred Unpacking
Schema evolution does not apply only to relational columns. In regulated platforms, XML, JSON, and other packed payloads may be preserved in Bronze as first-class evidentiary artefacts, with selected fields unpacked over time as indexing, querying, or regulatory needs evolve.
This deferred unpacking must not be treated as retroactive truth creation. If a field is extracted from a historical blob for the first time in March 2026, that does not mean the platform “knew” that field in January 2025. It means the evidence already existed, but the interpretation or accessibility changed.
Accordingly:
- raw blobs remain the primary evidence
- unpacked fields are derived interpretations
- historical hashes must not be recomputed simply because more of a payload is now unpacked
- where unpacking logic materially changes, the extraction version or interpretation version should be recorded
7. Handling Out-of-Sequence Events Correctly
Out-of-sequence data is not an edge case — it is the normal operating condition in distributed, regulated data platforms. The Bronze layer must therefore be explicitly designed to reconstruct correct temporal order regardless of arrival order.
7.1 Ordering Rules
The primary rule is simple:
Order by source_event_ts, not ingest_ts
Processing hierarchy:
- `source_event_ts` (primary ordering)
- `ingest_ts` (fallback / observability only)
This ensures:
- Late-arriving data is inserted into the correct historical position
- `is_current` reflects the latest known truth, not the latest arrival
- Point-in-time reconstruction aligns with source reality
If you rely on ingest_ts, you are modelling pipeline behaviour, not business truth.
7.2 Tie-Breaking Rules (Sequence, File, Batch ID)
Real systems often produce records with identical timestamps. Deterministic tie-breaking is therefore essential.
Recommended tie-break hierarchy:
1. `source_sequence_id` (if available)
2. `source_file` or `source_path`
3. `ingest_run_id` (batch / pipeline execution)
4. `ingest_ts` (final fallback)
This ensures:
- Stable ordering across replays
- Deterministic version chains
- Prevention of “flip-flopping” `is_current`
Without tie-breaking, identical timestamps can produce non-repeatable outcomes, which is unacceptable in regulated environments.
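The hierarchy above collapses into a single deterministic sort key. A plain-Python sketch, with field names following the row model and ISO-8601 strings assumed so that lexical order matches time order:

```python
def ordering_key(row: dict) -> tuple:
    """source_event_ts first, then the recommended tie-break hierarchy."""
    return (
        row["source_event_ts"],
        row.get("source_sequence_id", 0),
        row.get("source_file", ""),
        row.get("ingest_run_id", ""),
        row["ingest_ts"],              # final fallback, observability only
    )

# Two rows share a source timestamp; the file name breaks the tie
# deterministically, regardless of arrival (ingest) order.
batch = [
    {"source_event_ts": "2024-01-10", "source_file": "b.csv", "ingest_ts": "t1"},
    {"source_event_ts": "2024-01-10", "source_file": "a.csv", "ingest_ts": "t2"},
    {"source_event_ts": "2024-01-05", "source_file": "c.csv", "ingest_ts": "t3"},
]
ordered = sorted(batch, key=ordering_key)
```

The same tuple works as a window `ORDER BY` in the PySpark implementation; the essential property is that re-sorting any permutation of the same rows yields the same sequence.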
7.3 Repair Logic (Late Arriving Updates)
When late data arrives:
- Identify the correct position in the timeline using `source_event_ts`
- Insert the new version into the historical chain
- Recalculate:
  - `effective_to` of the preceding version
  - `effective_from` of the inserted version
  - downstream versions if necessary
This may require reopening and resequencing existing rows.
Key principle:
History is corrected, not appended blindly
Example:
- Version A (Jan 1)
- Version C (Jan 10) ← already present
- Version B (Jan 5) ← arrives late
Correct result:
- A → effective_to = Jan 5
- B → Jan 5 → Jan 10
- C → remains current
Not:
- A → C → B (arrival order corruption)
This is what makes Bronze repairable and regulator-defensible.
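The A/B/C repair above can be sketched as a chain rebuild (plain Python, illustrative field names; a real implementation would do this per business key inside the Bronze writer):

```python
def rebuild_timeline(versions: list[dict]) -> list[dict]:
    """Re-derive the version chain after a late arrival.

    Sorting by source_event_ts slots the late version into place,
    then effective_from/effective_to are recomputed over the chain.
    """
    ordered = sorted(versions, key=lambda v: v["source_event_ts"])
    for v, nxt in zip(ordered, ordered[1:]):
        v["effective_to"] = nxt["source_event_ts"]   # close each predecessor
    ordered[-1]["effective_to"] = None               # open-ended = current
    for v in ordered:
        v["effective_from"] = v["source_event_ts"]
        v["is_current"] = v["effective_to"] is None
    return ordered

# Versions A and C already exist; B arrives late and is repaired into place.
chain = rebuild_timeline([
    {"version": "A", "source_event_ts": "2024-01-01"},
    {"version": "C", "source_event_ts": "2024-01-10"},
    {"version": "B", "source_event_ts": "2024-01-05"},   # late arrival
])
```

Note that rebuilding from source time is idempotent: replaying the same repair, in any order, converges on the same A → B → C chain with C current.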
8. Full Loads, Partial Loads, and “Over-the-Top” Snapshots
Different ingestion patterns imply different semantics. Treating them the same leads to data loss, false deletions, and broken history.
8.1 Full Snapshot Semantics
A full load represents:
“This is the complete state of the entity (or dataset) at time X.”
Implications:
- Missing records may imply deletion or absence
- Every record should be compared against existing state
- May trigger closure of prior versions
Required metadata:
- `load_type = FULL`
- clear `source_event_ts` or extract timestamp
Full loads are the only safe basis for absence-based inference.
8.2 Partial Feed Semantics
A partial load represents:
“This is a subset of changes or updates.”
Implications:
- Missing fields ≠ null
- Missing records ≠ deletion
- Only included attributes should be updated
Required metadata:
- `load_type = PARTIAL`
This prevents:
- accidental overwrites
- false deletions
- loss of previously known values
Partial feeds must be treated as patches, not replacements.
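Patch semantics can be sketched as a dictionary merge (illustrative only; in practice this runs per attribute group against the latest valid state at the record's source time):

```python
def apply_partial(current_state: dict, patch: dict) -> dict:
    """Partial feeds are patches: only explicitly asserted attributes change.

    Keys absent from the patch are inherited from the current state.
    An explicit None in the patch is still an assertion (a blanking)
    and is applied as such.
    """
    merged = dict(current_state)    # copy, so prior belief is preserved
    merged.update(patch)            # only keys present in the patch overwrite
    return merged
```

The critical distinction this encodes: absence of a key means "not asserted", while presence with a null value means "asserted to be empty" — collapsing the two is how partial feeds silently destroy data.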
8.3 Full-Over-Top Edge Case
This is a common failure mode:
- A newer partial feed updates a record
- A late-arriving full snapshot is then processed
If handled incorrectly, the full snapshot can:
- overwrite newer belief
- reset attributes to older values
- corrupt `is_current`
Correct handling:
- Treat full snapshots using source_event_ts ordering
- Insert them into the correct temporal position
- Do not blindly overwrite current state
This ensures:
- history remains consistent
- newer belief is not lost
- snapshots behave as time-bound assertions, not overrides
8.4 Why “Missing Fields” ≠ Deletion
A missing field can mean:
- not present in source extract
- not included in partial feed
- optional attribute
- upstream omission
It does not automatically mean:
“This value has been deleted”
Deletion must be explicitly signalled via:
- `is_deleted = true`
- status flags
- tombstone events
- absence in a trusted full snapshot
Treating missing data as deletion leads to:
- silent data loss
- incorrect history
- broken auditability
8.5 The Core Principle
Across all of this:
Every row is an assertion with context, not a command to overwrite state
By distinguishing:
- ordering vs arrival
- full vs partial semantics
- absence vs deletion
you ensure that Bronze remains:
- replayable
- explainable
- regulator-defensible
—and that Silver can remain simple, because Bronze has already done the hard work correctly.
9. Deletions: What They Mean and How to Model Them
Deletion in regulated data platforms is not simply “row removed.” It is an assertion about state, and must be treated with the same care as any other change. Incorrect handling of deletions is one of the fastest ways to break auditability.
9.1 Explicit Deletes
These are clear, intentional signals from the source:
- status flags (e.g. `is_deleted = true`)
- tombstone events
- delete operations in CDC streams
Handling:
- create a new SCD2 version
- set `is_deleted = true`
- close the previous version (`effective_to`)
This provides:
“At time X, the source asserted that this entity no longer exists.”
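The handling steps above can be sketched in plain Python (hypothetical field names; timestamps shown as strings for brevity). The key point is that nothing is removed — the delete is itself a new version:

```python
def apply_delete(open_version: dict, delete_event_ts: str) -> tuple[dict, dict]:
    """Model an explicit delete as a new SCD2 version, not a removal.

    The open version is closed at the delete's source time, and a
    tombstone version carrying is_deleted=True becomes current.
    """
    closed = {**open_version,
              "effective_to": delete_event_ts, "is_current": False}
    tombstone = {**open_version,
                 "effective_from": delete_event_ts, "effective_to": None,
                 "is_current": True, "is_deleted": True}
    return closed, tombstone
```

A point-in-time query before `delete_event_ts` still sees the entity; a query after it sees the tombstone — which is exactly the "at time X the source asserted this entity no longer exists" semantics.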
9.2 Soft Deletes
These are indirect indicators of deletion:
- status = “closed”, “inactive”, “cancelled”
- lifecycle transitions
Handling:
- treat as a state change, not a hard delete
- retain the record, update attributes accordingly
Why:
Soft deletes are still part of business history. They must remain queryable and reconstructable.
9.3 “Blanking” Patterns
Common in real systems:
- fields suddenly become null or empty
- entire payloads appear “wiped”
- partial feeds omit previously populated attributes
This does not automatically mean deletion.
Interpretation depends on:
- `load_type` (FULL vs PARTIAL)
- source behaviour
- domain rules
Safe handling:
- treat blanking as a new version
- do not infer deletion unless explicitly defined
- preserve prior values in earlier versions
This avoids:
- silent data loss
- accidental erasure of historical truth
9.4 Silver Implications
Silver is where deletion semantics are resolved into usable views.
Typical patterns:
- filter out `is_deleted = true` for “active” views
- include deleted records for audit or PIT queries
- apply domain logic (e.g. “closed accounts still visible”)
Key principle:
Bronze records deletion assertions. Silver decides what they mean for consumption.
10. Precedence: How Multi-Source Truth Is Resolved
In multi-source environments, conflicting assertions are inevitable. The platform must not pretend they don’t exist — it must capture them and resolve them deterministically.
10.1 Precedence Tables (by Domain / Attribute Group)
Precedence is not global. It varies by:
- domain (Customer, Account, Product)
- attribute group (name, address, risk rating, balance)
Example:
- CRM may own customer contact details
- Core Banking may own account status
- External provider may own credit score
This is typically modelled as:
- a precedence table or ruleset
- mapping `source_system → priority`
- optionally scoped to attribute groups
This enables:
“For this attribute, which source do we trust first?”
10.2 Effective Dating Precedence Rules
Precedence is not just “which source wins.” It must also consider time.
Rules combine:
- `source_event_ts` (when asserted)
- `precedence_rank` (who asserted it)
Typical resolution:
- higher precedence source overrides lower
- if equal precedence → latest `source_event_ts` wins
- ties resolved via deterministic tie-breakers
This ensures:
- consistent outcomes
- explainable decisions
- stable Silver views
Important: precedence does not replace temporal ordering — it operates on top of it.
Bronze preserves all temporally ordered assertions.
Silver then applies precedence to determine which assertion is believed, even if it is not the latest in time.
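The resolution rules above reduce to a deterministic ordering, sketched here in pure Python (in Silver this would be a window function over precedence_rank and source_event_ts; the tie-breaker on source_system is an example, not a rule):

```python
def believed_assertion(assertions):
    """Pick the assertion Silver believes:
    higher precedence first, then latest source_event_ts,
    then a deterministic tie-breaker (source_system name here, as an example)."""
    return max(
        assertions,
        key=lambda a: (a["precedence_rank"], a["source_event_ts"], a["source_system"]),
    )

assertions = [
    {"source_system": "CRM",  "precedence_rank": 100,
     "source_event_ts": "2024-01-01", "address": "1 High St"},
    {"source_system": "Core", "precedence_rank": 50,
     "source_event_ts": "2024-03-01", "address": "2 Low Rd"},
]
```

Here CRM wins despite being older, which is exactly the point: precedence operates on top of temporal ordering, and Bronze still retains both assertions.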
10.3 Bronze Stores Conflicts; Silver Resolves Them
The most important architectural rule:
Bronze does not resolve conflicts
Bronze stores:
- all assertions
- from all sources
- with full provenance
- even when they disagree
Silver applies:
- precedence rules
- survivorship logic
- domain-specific interpretation
This separation ensures:
- no loss of evidence
- full auditability
- ability to revisit decisions if rules change
If conflicts are resolved in Bronze:
- alternative viewpoints are lost
- decisions become irreversible
- regulatory defensibility is weakened
10.4 The Core Principle
Across deletions and precedence:
Bronze captures what was said. Silver decides what it means.
This preserves:
- evidence
- optionality
- explainability
—and allows the platform to evolve its logic without rewriting its history.
11. Transactions vs State: Stop Treating Events Like Dimensions
One of the most common category errors in Bronze design is to treat transactions as if they were slowly changing dimensions. They are not.
A transaction is an immutable event: something that happened once, at a point in time. A dimension or state record is a mutable interpretation of an entity over time. Confusing the two creates false history, broken lineage, and regulator-hostile data models.
11.1 Transaction Handling Pattern
Transactions should be handled as append-only facts.
Typical examples:
- payments
- trades
- orders
- cash movements
- account postings
- customer interactions logged as events
Each transaction row should be identified by a stable transactional identity, for example:
- source transaction ID
- message ID
- event ID
- composite event business key
The Bronze handling pattern is:
- if the transaction has not been seen before → insert it
- if the transaction has been seen before and is identical → update last_seen_ts only
- if the transaction has been seen before but arrives with correction semantics → treat the correction as a new event or adjustment, not as mutation of the original event
This preserves the distinction between:
- what happened
- what we later learned about what happened
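The insert / update-last_seen pattern can be sketched in miniature (pure Python; a Delta implementation would typically use MERGE keyed on the transactional identity):

```python
def observe_transaction(store, event):
    """Replay-safe handling of an immutable event.

    store: dict keyed by (bk, record_hash) standing in for the Bronze table
    event: dict with at least bk, record_hash, ingest_ts
    """
    key = (event["bk"], event["record_hash"])
    if key not in store:
        # first observation -> insert the event
        store[key] = dict(event, first_seen_ts=event["ingest_ts"],
                          last_seen_ts=event["ingest_ts"])
    else:
        # replay / resend -> record the observation, never a new version
        store[key]["last_seen_ts"] = max(store[key]["last_seen_ts"],
                                         event["ingest_ts"])
    return store
```

Replaying the same event twice leaves exactly one row, with first_seen_ts from the first arrival and last_seen_ts from the latest.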
11.2 last_seen Update
For transactions, repeated arrival usually means one of three things:
- replay
- duplicate file ingestion
- source resend / redelivery
That does not justify a new SCD2 version.
Instead, the correct pattern is:
- preserve the original event row
- update last_seen_ts
- optionally update ingest_run_id
- optionally update a replay count / duplicate count
This provides operational evidence that the platform observed the event again, without pretending the event itself changed.
Typical transaction metadata includes:
- first_seen_ts
- last_seen_ts
- source_system
- source_path
- source_file
- event_hash or record_hash
This gives you replay-safe ingestion without corrupting event history.
11.3 Why It’s Not SCD2
SCD2 exists to model changing state.
Examples of state:
- customer address
- account status
- product risk classification
- contract terms
These are beliefs that can change over time while referring to the same entity.
Transactions are different. They do not “change” in the same sense. What changes is:
- enrichment
- classification
- reconciliation status
- investigation outcome
Those changes should be modelled as:
- separate state tables
- linked interpretation tables
- or explicit correction / adjustment events
Not by mutating or versioning the transaction itself as if the original fact changed.
The rule is simple:
Transactions are facts. State is belief. SCD2 is for belief, not fact.
If this boundary is violated, the platform quickly becomes unable to answer basic regulatory questions such as:
- What actually happened?
- What did we believe about it at the time?
- What later changed in our interpretation?
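A sketch of correction-as-new-event answering those questions (the corrects_event_id field is a hypothetical convention for linking an adjustment to its original, not a required schema):

```python
events = [
    {"event_id": "e1", "type": "payment",    "amount": 100, "corrects_event_id": None},
    {"event_id": "e2", "type": "adjustment", "amount": -10, "corrects_event_id": "e1"},
]

def net_effect(events, original_id):
    """What happened (e1) is never mutated; what we later learned (e2)
    is applied as a separate, linked event."""
    return sum(e["amount"] for e in events
               if e["event_id"] == original_id
               or e["corrects_event_id"] == original_id)
```

The original payment remains exactly as recorded, while the interpreted value after correction is derivable at any time.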
12. Reference PySpark Implementation (Bronze Writer)
12.1 Preferred Pattern — Bronze with Source Time
The following implementation pattern is a reference approach for writing Bronze safely in Databricks / PySpark. It supports:
- deterministic ordering
- duplicate suppression
- SCD2 state handling
- transaction handling
- late-arriving events
- logical deletes
- safe is_current
The code is illustrative, but the logic is the important part.
12.1.1 Inputs and Assumptions
Assume the incoming DataFrame contains, at minimum:
- business key columns
- mutable state attribute columns
- source metadata
- source_event_ts where available
- ingest_ts
- entity_type (state or transaction)
- optional delete flag
Assume Bronze is stored in Delta and that impacted keys can be recomputed deterministically.
from pyspark.sql import functions as F, Window as W
BK_COLS = ["customer_id"] # example
ATTR_COLS = ["name", "address", "status", "risk_rating"]
EVENT_TS_COL = "source_event_ts"
BRONZE_TABLE = "bronze.customer_state"
Helper assumptions:
- bk is the canonical business key
- sk is a hash of bk
- attr_hash is the hash of mutable state columns
- entity_type determines whether the record follows SCD2 or event logic
12.1.2 Dedup + Ordering
First standardise keys, assign hashes, and deduplicate repeated identical observations.
def add_identity_and_hashes(df, bk_cols, attr_cols):
df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
df = df.withColumn(
"attr_hash",
F.when(
F.col("entity_type") == "state",
F.sha2(
F.concat_ws(
"||",
*[F.coalesce(F.col(c).cast("string"), F.lit("∅")) for c in attr_cols]
),
256
)
)
)
df = df.withColumn(
"record_hash",
F.sha2(
F.concat_ws(
"||",
F.col("bk"),
F.coalesce(F.col(EVENT_TS_COL).cast("string"), F.lit("∅")),
F.coalesce(F.col("attr_hash"), F.lit("∅")),
F.coalesce(F.col("source_system"), F.lit("∅")),
F.coalesce(F.col("source_path"), F.lit("∅"))
),
256
)
)
return df
For state rows, deduplicate on:
- bk
- source_event_ts
- attr_hash
For transaction rows, deduplicate on:
- bk or transaction ID
- record_hash
def dedup_state(df):
w = W.partitionBy("bk", EVENT_TS_COL, "attr_hash")
return (
df.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w))
.withColumn("rn", F.row_number().over(w.orderBy(F.col("ingest_ts").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
Ordering for state must be based on source event time, never arrival order:
STATE_ORDER = [
F.col(EVENT_TS_COL).asc(),
F.col("source_sequence_id").asc_nulls_last(),
F.col("source_file").asc_nulls_last(),
F.col("ingest_run_id").asc_nulls_last(),
F.col("ingest_ts").asc()
]
That gives deterministic results even when ADF processes many files concurrently.
12.1.3 Partial-Load Merge Logic
Source-time Bronze does not remove the need to handle partial feeds carefully. It simply gives you a stronger temporal foundation on which to do so.
A partial feed means:
“These attributes changed at source time X”
It does not mean:
“Every omitted attribute became null at source time X”
So before hashing and versioning a partial row, the platform must reconstruct the full state as of the source event time by combining:
- the incoming partial assertion
- the latest prior known state for that business key as of source_event_ts
This ensures that:
- missing fields do not become false deletions
- partial updates are treated as patches
- the resulting attr_hash represents full business state, not a lossy payload fragment
12.1.3.1 Source-time partial merge rules
- Find the latest prior state row for the same bk where:
  - effective_from <= source_event_ts
  - and either effective_to > source_event_ts or effective_to IS NULL
- For each mutable attribute:
- if incoming value is present → use incoming value
- if incoming value is absent → inherit prior value
- Recompute attr_hash on the reconstructed full state
- Continue through standard source-time SCD2 ordering and rebuild
12.1.3.2 Reference helper
def merge_partial_state_with_source_time(df_new_partial, df_existing_state, attr_cols, event_ts_col):
    """
    Reconstruct full state for partial loads using the latest known state
    as of the incoming source_event_ts.
    """
    e = df_existing_state.alias("e")
    n = df_new_partial.alias("n")

    # Find candidate prior rows valid at the incoming source_event_ts
    joined = (
        n.join(e, on="bk", how="left")
        .where(
            (F.col("e.effective_from") <= F.col(f"n.{event_ts_col}")) &
            (
                F.col("e.effective_to").isNull() |
                (F.col("e.effective_to") > F.col(f"n.{event_ts_col}"))
            )
        )
    )

    # If multiple prior candidates somehow exist, keep the latest valid one
    w_prior = (
        W.partitionBy("n.bk", F.col(f"n.{event_ts_col}"))
         .orderBy(F.col("e.effective_from").desc())
    )
    latest_prior = (
        joined
        .withColumn("rn_prior", F.row_number().over(w_prior))
        .filter(F.col("rn_prior") == 1)
        .drop("rn_prior")
    )

    select_exprs = [
        F.col("n.bk").alias("bk"),
        F.col("n.sk").alias("sk"),
        F.col(f"n.{event_ts_col}").alias(event_ts_col),
        F.col("n.ingest_ts").alias("ingest_ts"),
        F.col("n.ingest_run_id").alias("ingest_run_id"),
        F.col("n.source_system").alias("source_system"),
        F.col("n.source_type").alias("source_type"),
        F.col("n.source_path").alias("source_path"),
        F.col("n.source_file").alias("source_file"),
        F.col("n.entity_type").alias("entity_type"),
        F.col("n.load_type").alias("load_type"),
        F.col("n.is_deleted").alias("is_deleted"),
        F.col("n.precedence_rank").alias("precedence_rank"),
        F.col("n.source_sequence_id").alias("source_sequence_id"),
    ]

    # Inherit the prior value wherever the incoming partial row is blank
    for c in attr_cols:
        select_exprs.append(
            F.when(
                F.col(f"n.{c}").isNull() | (F.col(f"n.{c}") == F.lit("∅")),
                F.col(f"e.{c}")
            ).otherwise(F.col(f"n.{c}")).alias(c)
        )

    return latest_prior.select(*select_exprs)
This is the crucial piece that prevents source-time Bronze from becoming unsafe under mixed full and partial ingestion patterns.
12.1.4 SCD2 Merge Logic
The safest pattern is not “row-by-row mutate Bronze”.
It is:
- identify impacted business keys
- load existing Bronze for those keys
- union incoming + existing
- rebuild the timeline
- replace only those keys
This is what makes late arrivals and precedence repairable.
def rebuild_scd2_timeline(df_union):
w = W.partitionBy("bk").orderBy(*STATE_ORDER)
df = (
df_union
.withColumn("prev_hash", F.lag("attr_hash").over(w))
.withColumn(
"is_new_version",
F.when(
F.col("prev_hash").isNull() | (F.col("attr_hash") != F.col("prev_hash")),
F.lit(1)
).otherwise(F.lit(0))
)
.withColumn(
"version_grp",
F.sum("is_new_version").over(w.rowsBetween(W.unboundedPreceding, 0))
)
)
# collapse repeated same-hash rows into one version
latest_in_grp = W.partitionBy("bk", "version_grp").orderBy(F.col("ingest_ts").desc())
df_collapsed = (
df.withColumn("grp_effective_from", F.min(EVENT_TS_COL).over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_rn", F.row_number().over(latest_in_grp))
.filter(F.col("grp_rn") == 1)
.drop("grp_rn", "prev_hash", "is_new_version")
.withColumn("effective_from", F.col("grp_effective_from"))
.withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
.withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
.drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
)
final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc())
df_final = (
df_collapsed
.withColumn("effective_to", F.lead("effective_from").over(final_w))
.withColumn("is_current", F.col("effective_to").isNull())
)
return df_final
This produces a proper SCD2 chain.
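The same chaining logic can be seen in miniature in pure Python (illustrative only): collapse consecutive identical hashes into one version, then derive effective_to and is_current from the ordered chain.

```python
def build_chain(observations):
    """observations: list of (source_event_ts, attr_hash), pre-sorted by event time."""
    versions = []
    prev_hash = None
    for ts, h in observations:
        if h != prev_hash:  # a new version only on a real state change
            versions.append({"effective_from": ts, "attr_hash": h})
        prev_hash = h
    for i, v in enumerate(versions):  # close each version with the next one's start
        v["effective_to"] = (versions[i + 1]["effective_from"]
                             if i + 1 < len(versions) else None)
        v["is_current"] = v["effective_to"] is None
    return versions

chain = build_chain([("t1", "A"), ("t2", "A"), ("t3", "B")])
```

The repeated "A" at t2 does not create a version; the chain is A effective t1→t3, then B open-ended and current.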
12.1.5 Full PySpark Reference Implementation
With source time available, the safest implementation pattern is to rebuild the timeline for impacted business keys rather than trying to mutate Bronze row-by-row. This keeps late-arriving data, precedence, and partial updates all repairable.
The flow is:
- prepare and standardise incoming rows
- split transactions from state
- merge partial state rows against prior valid state at source time
- deduplicate identical observations
- pull existing Bronze rows for impacted keys
- rebuild the full source-time timeline
- derive effective_to and is_current
- union transactions and state results
12.1.5.1 Full Reference Flow
from pyspark.sql import functions as F, Window as W

BK_COLS = ["customer_id"]  # example
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
EVENT_TS_COL = "source_event_ts"
BRONZE_TABLE = "bronze.customer_state"

def canonicalise_for_hash(df, attr_cols):
    for c in attr_cols:
        df = df.withColumn(c, F.coalesce(F.col(c).cast("string"), F.lit("∅")))
    return df

def add_identity_and_hashes(df, bk_cols, attr_cols):
    df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
    df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
    df = canonicalise_for_hash(df, attr_cols)
    df = df.withColumn(
        "attr_hash",
        F.when(
            F.col("entity_type") == "state",
            F.sha2(F.concat_ws("||", *[F.col(c) for c in attr_cols]), 256)
        )
    )
    df = df.withColumn(
        "record_hash",
        F.sha2(
            F.concat_ws(
                "||",
                F.col("bk"),
                F.coalesce(F.col(EVENT_TS_COL).cast("string"), F.lit("∅")),
                F.coalesce(F.col("attr_hash"), F.lit("∅")),
                F.coalesce(F.col("source_system"), F.lit("∅")),
                F.coalesce(F.col("source_path"), F.lit("∅")),
                F.coalesce(F.col("source_file"), F.lit("∅"))
            ),
            256
        )
    )
    return df

def prepare_inputs_with_source_time(df_in, bk_cols, attr_cols, event_ts_col):
    df = df_in
    df = df.withColumn("ingest_ts", F.coalesce(F.col("ingest_ts"), F.current_timestamp()))
    df = df.withColumn("ingest_run_id", F.coalesce(F.col("ingest_run_id"), F.lit("UNKNOWN_RUN")))
    df = df.withColumn("source_system", F.coalesce(F.col("source_system"), F.lit("UNKNOWN_SOURCE")))
    df = df.withColumn("source_type", F.coalesce(F.col("source_type"), F.lit("unknown")))
    df = df.withColumn("source_path", F.coalesce(F.col("source_path"), F.lit("UNKNOWN_PATH")))
    df = df.withColumn("source_file", F.coalesce(F.col("source_file"), F.lit("UNKNOWN_FILE")))
    df = df.withColumn("entity_type", F.coalesce(F.col("entity_type"), F.lit("state")))
    df = df.withColumn("load_type", F.coalesce(F.col("load_type"), F.lit("UNKNOWN")))
    df = df.withColumn("is_deleted", F.coalesce(F.col("is_deleted"), F.lit(False)))
    df = df.withColumn("precedence_rank", F.coalesce(F.col("precedence_rank"), F.lit(0)))
    # required for source-time mode
    df = df.withColumn(event_ts_col, F.col(event_ts_col).cast("timestamp"))
    df = df.withColumn("effective_from", F.col(event_ts_col))
    df = add_identity_and_hashes(df, bk_cols, attr_cols)
    return df

def dedup_state_with_source_time(df):
    """
    Collapse duplicate observations of the same state assertion.
    """
    w = W.partitionBy("bk", EVENT_TS_COL, "attr_hash")
    return (
        df.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
          .withColumn("last_seen_ts", F.max("ingest_ts").over(w))
          .withColumn("rn", F.row_number().over(w.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )

def dedup_transactions_with_source_time(df_tx):
    """
    Transactions are immutable events.
    Replays update last_seen_ts, not history.
    """
    w_tx = W.partitionBy("bk", "record_hash")
    return (
        df_tx
        .withColumn("first_seen_ts", F.min("ingest_ts").over(w_tx))
        .withColumn("last_seen_ts", F.max("ingest_ts").over(w_tx))
        .withColumn("rn", F.row_number().over(w_tx.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )

STATE_ORDER = [
    F.col(EVENT_TS_COL).asc(),
    F.col("precedence_rank").desc(),
    F.col("source_sequence_id").asc_nulls_last(),
    F.col("source_file").asc_nulls_last(),
    F.col("ingest_run_id").asc_nulls_last(),
    F.col("ingest_ts").asc()
]

def rebuild_scd2_timeline_with_source_time(df_union):
    """
    Rebuild complete source-time history for impacted keys.
    """
    w = W.partitionBy("bk").orderBy(*STATE_ORDER)
    df = (
        df_union
        .withColumn("prev_hash", F.lag("attr_hash").over(w))
        .withColumn(
            "is_new_version",
            F.when(
                F.col("prev_hash").isNull() | (F.col("attr_hash") != F.col("prev_hash")),
                F.lit(1)
            ).otherwise(F.lit(0))
        )
        .withColumn(
            "version_grp",
            F.sum("is_new_version").over(w.rowsBetween(W.unboundedPreceding, 0))
        )
    )
    latest_in_grp = W.partitionBy("bk", "version_grp").orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())
    df_collapsed = (
        df.withColumn("grp_effective_from", F.min(EVENT_TS_COL).over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_rn", F.row_number().over(latest_in_grp))
          .filter(F.col("grp_rn") == 1)
          .drop("grp_rn", "prev_hash", "is_new_version")
          .withColumn("effective_from", F.col("grp_effective_from"))
          .withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
          .withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
          .drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
    )
    final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc(), F.col("precedence_rank").desc())
    return (
        df_collapsed
        .withColumn("effective_to", F.lead("effective_from").over(final_w))
        .withColumn("is_current", F.col("effective_to").isNull())
    )

def process_bronze_with_source_time(df_in, bronze_table, bk_cols, attr_cols, event_ts_col):
    """
    Main preferred Bronze ingestion function. Guarantees:
    - source-time ordering
    - late-arrival repair
    - partial-load safety
    - deterministic is_current
    - replay-safe transaction handling
    """
    df = prepare_inputs_with_source_time(df_in, bk_cols, attr_cols, event_ts_col)

    # Split transactions from state
    df_tx = df.filter(F.col("entity_type") == "transaction")
    df_state_new = df.filter(F.col("entity_type") == "state")

    # Process transactions
    df_tx_final = dedup_transactions_with_source_time(df_tx)

    # Identify impacted business keys
    impacted_keys = df_state_new.select("bk").distinct()
    existing_state = (
        spark.read.format("delta").table(bronze_table)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, on="bk", how="inner")
    )

    # Split full vs partial state
    df_state_partial = df_state_new.filter(F.col("load_type") == "PARTIAL")
    df_state_full = df_state_new.filter(F.col("load_type") != "PARTIAL")

    # Reconstruct full state for partial rows
    if df_state_partial.limit(1).count() > 0:
        df_state_partial = merge_partial_state_with_source_time(
            df_state_partial,
            existing_state,
            attr_cols,
            event_ts_col
        )
        df_state_partial = add_identity_and_hashes(df_state_partial, bk_cols, attr_cols)

    # Deduplicate incoming state
    df_state_prepared = df_state_full.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_with_source_time(df_state_prepared)

    # Union with existing and rebuild source-time timeline
    df_state_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_timeline_with_source_time(df_state_union)

    # Final combined result
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)
    return df_final
This is the complete preferred pattern. It is the pattern that should be used where regulated reconstruction and source-time defensibility are required.
12.1.6 Handling Late Events
Late events are handled by the rebuild pattern itself.
If an older event arrives later:
- it is inserted into the ordered timeline based on source_event_ts
- previous and next rows are resequenced
- effective_to and is_current are recalculated
There is no special hack required if the rebuild is correct.
Typical flow:
incoming = add_identity_and_hashes(df_in, BK_COLS, ATTR_COLS)
state_in = incoming.filter(F.col("entity_type") == "state")
txn_in = incoming.filter(F.col("entity_type") == "transaction")
impacted_keys = state_in.select("bk").distinct()
existing_state = (
spark.read.table(BRONZE_TABLE)
.filter(F.col("entity_type") == "state")
.join(impacted_keys, on="bk", how="inner")
)
state_union = dedup_state(state_in).unionByName(existing_state, allowMissingColumns=True)
state_final = rebuild_scd2_timeline(state_union)
That is the correct late-arrival repair model.
12.1.7 Handling Deletes
Deletes should be modelled as state transitions, not physical removals.
Three supported patterns:
- Explicit delete flag
  - source provides is_deleted = true
- Tombstone event
  - source emits explicit delete record
- Trusted blanking in full snapshot
  - only if domain rules confirm this means deletion
In all three cases:
- create a new state version
- set is_deleted = true
- do not physically remove history
Silver can then derive active views using:
WHERE is_current = true
AND is_deleted = false
Delete handling in PySpark is simply part of the state row and therefore part of attr_hash if deletion status is business-meaningful.
For example:
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
That ensures a delete becomes a new version.
12.1.8 Producing is_current Safely
is_current must never be loaded directly from source and must never be set by “last row wins”.
It must always be derived from the version chain:
- order by effective_from
- set effective_to = next(effective_from)
- set is_current = effective_to IS NULL
This makes is_current:
- deterministic
- repairable
- immune to ADF concurrency storms
- safe under replay
For transactions, is_current is not meaningful in the SCD2 sense and should generally be null or omitted. The transaction handling pattern is:
def process_transactions(df_tx):
w = W.partitionBy("bk") # or transaction_id if separate
return (
df_tx.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w))
.dropDuplicates(["bk", "record_hash"])
)
That gives:
- insert if unseen
- update last_seen_ts if seen again
- no fake versioning
is_current represents the latest version in the temporal chain, not the resolved truth across sources.
Precedence may select a non-current row as the “believed” version in Silver.
12.1.9 What This Preferred Pattern Achieves
This pattern gives you the things the fallback model cannot.
12.1.9.1 True source-time ordering
Rows are positioned according to when the source asserted them, not when the platform happened to receive them.
12.1.9.2 Late-arrival repair
Older assertions can be inserted back into the correct historical position without corrupting the version chain.
12.1.9.3 Deterministic is_current
is_current becomes a consequence of a rebuilt source-time timeline, not a race condition caused by file arrival order.
12.1.9.4 Safe handling of partial loads
Partial payloads become full reconstructed state assertions before hashing and versioning, rather than accidental null-overwrites.
12.1.9.5 Proper deletion modelling
Deletes, tombstones, and trusted blanking patterns become explicit state transitions in history.
12.1.9.6 Replay-safe transaction handling
Transactions remain immutable events. Replays update last_seen_ts, not fake dimensional versions.
12.1.9.7 Regulator-defensible PIT reconstruction
This is the mode that supports answering:
What did we believe at that time, based on what source, and how did that belief later change?
That is the whole point.
12.1.10 Production Notes
The reference code above is logically correct, but several production considerations should be made explicit.
12.1.10.1 Impacted-key pruning is essential
Do not rebuild all Bronze history on every load. Restrict recomputation to the impacted business keys only.
12.1.10.2 Tie-breakers must be stable
If source_event_ts is not unique, deterministic ordering requires additional fields such as:
- source_sequence_id
- source_file
- ingest_run_id
Without stable tie-breakers, replay may produce different outcomes.
12.1.10.3 Replace-by-key is often safer than row-level merge
For complex late-arriving repair, it is often safer to:
- rebuild all rows for impacted keys
- replace only those keys in Bronze
rather than try to patch individual rows in place.
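One way to implement replace-by-key on Delta is an overwrite scoped by a replaceWhere predicate covering only the impacted keys. This is a sketch: build_replace_where is a hypothetical helper, and very large key sets may be better served by a MERGE or a staging-table swap instead of an IN-list predicate.

```python
def build_replace_where(bk_values):
    """Build a Delta replaceWhere predicate covering only impacted keys."""
    quoted = ", ".join("'{}'".format(v) for v in sorted(bk_values))
    return "bk IN ({})".format(quoted)

# Illustrative write-back (assumes df_rebuilt holds the rebuilt rows
# for exactly these keys, and BRONZE_TABLE as defined earlier):
# (df_rebuilt.write.format("delta")
#     .mode("overwrite")
#     .option("replaceWhere", build_replace_where(impacted))
#     .saveAsTable(BRONZE_TABLE))
```

The predicate guarantees the overwrite cannot touch rows for keys outside the rebuilt set, which is what makes the repair safe to run repeatedly.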
12.1.10.4 Partitioning and clustering still matter
Large Bronze tables should still be optimised physically using:
- partition pruning
- clustering / liquid clustering
- compaction windows
These improve performance without weakening temporal guarantees.
12.1.10.5 Source-time quality must be monitored
Even when a source claims event time, you should still monitor for:
- null source timestamps
- backwards-moving sequences
- impossible event times
- inconsistent partial/full markers
A strong pattern implemented on weak metadata is still weak.
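Those checks can be sketched as simple counters (pure Python for illustration; in practice they would be Spark aggregations emitted as data-quality metrics per source and run):

```python
def source_time_quality(rows):
    """rows: dicts with source_event_ts (may be None) and source_sequence_id,
    in arrival order for one key. Returns simple quality counters."""
    nulls = sum(1 for r in rows if r["source_event_ts"] is None)
    backwards = 0
    prev_seq = None
    for r in rows:
        seq = r.get("source_sequence_id")
        if prev_seq is not None and seq is not None and seq < prev_seq:
            backwards += 1  # sequence moved backwards relative to arrival order
        if seq is not None:
            prev_seq = seq
    return {"null_event_ts": nulls, "backwards_sequences": backwards}
```

Thresholds on these counters can gate whether a source is still trusted to run in source-time mode or must drop to the fallback pattern.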
12.1.10.6 The main caveat
This preferred pattern assumes the source can substantiate:
- reliable source time
- stable business identity
- usable partial/full semantics
- explicit or well-governed deletion behaviour
If it cannot, then the correct response is not to weaken this pattern until it “fits”.
It is to fall back explicitly to the weaker ingestion-time model and state the limitation clearly.
12.2 Fallback Pattern — Bronze without Source Time
Not all upstream systems provide usable source-event time, source extract “as-of” time, or a trustworthy sequence number. In those cases, Bronze cannot model source-time belief. It can only model platform-time belief: what the platform knew, when it learned it.
That is a weaker posture, but it is still materially better than pretending such sources are temporally complete when they are not.
This fallback pattern is designed for exactly those situations. It gives you:
- platform-time ordering
- duplicate suppression
- replay-safe ingestion
- explicit caveats about what cannot be proven
It does not reconstruct true source-time history.
12.2.1 Core rule
If source time is absent, unreliable, or ambiguous, then ingest_ts becomes the platform’s effective time — and that limitation must be made explicit.
12.2.2 What this pattern is for
Use this fallback only when:
- the source does not expose reliable event time
- extracts are unordered or lossy
- snapshots arrive without trustworthy “as-of” semantics
- source sequencing cannot be trusted
This pattern still supports:
- idempotent Bronze ingestion
- duplicate suppression under ADF concurrency storms
- full vs partial load handling
- explicit deletion semantics
- current-state Silver derivation
But it supports them as platform-observed history, not source-authored history.
12.2.3 Inputs and assumptions
Assume the incoming DataFrame contains, at minimum:
- business key columns
- mutable state attributes
- ingestion timestamp
- source metadata
- load type (FULL/PARTIAL) where available
- optional delete flag
- entity type (state or transaction)
The key design choice is:
platform_effective_from = ingest_ts
That is the timeline.
12.2.4 Full PySpark reference implementation
from pyspark.sql import functions as F, Window as W
BK_COLS = ["customer_id"] # example business key columns
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
BRONZE_TABLE = "bronze.customer_state_no_source_time"
def canonicalise_for_hash(df, attr_cols):
"""
Standardise mutable attributes before hashing.
Missing values are normalised explicitly.
"""
for c in attr_cols:
df = df.withColumn(c, F.coalesce(F.col(c).cast("string"), F.lit("∅")))
return df
def add_identity_and_hashes_no_source_time(df, bk_cols, attr_cols):
"""
Builds business key, surrogate key, attribute hash, and record hash
for the ingestion-time fallback model.
"""
df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
df = canonicalise_for_hash(df, attr_cols)
# Hash only mutable business-state attributes
df = df.withColumn(
"attr_hash",
F.when(
F.col("entity_type") == "state",
F.sha2(F.concat_ws("||", *[F.col(c) for c in attr_cols]), 256)
)
)
# Full-row / replay hash including provenance
df = df.withColumn(
"record_hash",
F.sha2(
F.concat_ws(
"||",
F.col("bk"),
F.coalesce(F.col("source_system"), F.lit("∅")),
F.coalesce(F.col("source_path"), F.lit("∅")),
F.coalesce(F.col("source_file"), F.lit("∅")),
F.coalesce(F.col("attr_hash"), F.lit("∅"))
),
256
)
)
return df
def prepare_fallback_inputs(df_in, bk_cols, attr_cols):
"""
Adds required fallback metadata and declares platform-time ordering.
"""
df = df_in
df = df.withColumn("ingest_ts", F.coalesce(F.col("ingest_ts"), F.current_timestamp()))
df = df.withColumn("ingest_run_id", F.coalesce(F.col("ingest_run_id"), F.lit("UNKNOWN_RUN")))
df = df.withColumn("source_system", F.coalesce(F.col("source_system"), F.lit("UNKNOWN_SOURCE")))
df = df.withColumn("source_type", F.coalesce(F.col("source_type"), F.lit("unknown")))
df = df.withColumn("source_path", F.coalesce(F.col("source_path"), F.lit("UNKNOWN_PATH")))
df = df.withColumn("source_file", F.coalesce(F.col("source_file"), F.lit("UNKNOWN_FILE")))
df = df.withColumn("entity_type", F.coalesce(F.col("entity_type"), F.lit("state")))
df = df.withColumn("load_type", F.coalesce(F.col("load_type"), F.lit("UNKNOWN")))
df = df.withColumn("is_deleted", F.coalesce(F.col("is_deleted"), F.lit(False)))
# Fallback mode: platform time becomes effective time
df = df.withColumn("effective_from", F.col("ingest_ts"))
df = add_identity_and_hashes_no_source_time(df, bk_cols, attr_cols)
return df
def dedup_transactions_fallback(df_tx):
"""
Transactions are immutable events.
If seen again, update last_seen_ts; do not create SCD2 versions.
"""
w_tx = W.partitionBy("bk", "record_hash")
return (
df_tx
.withColumn("first_seen_ts", F.min("ingest_ts").over(w_tx))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w_tx))
.withColumn("rn", F.row_number().over(w_tx.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
def merge_partial_state(df_new, df_existing, attr_cols):
"""
For partial feeds, preserve existing attribute values where incoming
rows do not explicitly provide them.
This assumes the source has declared the load as PARTIAL.
"""
existing_alias = df_existing.alias("e")
new_alias = df_new.alias("n")
joined = new_alias.join(existing_alias, on="bk", how="left")
select_exprs = [
F.col("n.bk").alias("bk"),
F.col("n.sk").alias("sk"),
F.col("n.ingest_ts").alias("ingest_ts"),
F.col("n.ingest_run_id").alias("ingest_run_id"),
F.col("n.source_system").alias("source_system"),
F.col("n.source_type").alias("source_type"),
F.col("n.source_path").alias("source_path"),
F.col("n.source_file").alias("source_file"),
F.col("n.entity_type").alias("entity_type"),
F.col("n.load_type").alias("load_type"),
F.col("n.is_deleted").alias("is_deleted"),
F.col("n.effective_from").alias("effective_from"),
]
# preserve previous values where incoming values are blank-ish
for c in attr_cols:
select_exprs.append(
F.when(
(F.col(f"n.{c}") == F.lit("∅")) | F.col(f"n.{c}").isNull(),
F.col(f"e.{c}")
).otherwise(F.col(f"n.{c}")).alias(c)
)
return joined.select(*select_exprs)
def dedup_state_fallback(df_state):
"""
Deduplicate ingestion storms and repeated arrivals for state rows.
Same bk + attr_hash = same belief version in fallback mode,
because no reliable source time exists.
"""
w_dup = W.partitionBy("bk", "attr_hash")
return (
df_state
.withColumn("first_seen_ts", F.min("ingest_ts").over(w_dup))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w_dup))
.withColumn("rn", F.row_number().over(w_dup.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
def rebuild_scd2_fallback(df_union):
    """
    Rebuilds platform-time SCD2 history.
    Ordering is by ingestion time, because source time is unavailable.
    This is operationally consistent but weaker than source-time Bronze.
    """
    order_w = W.partitionBy("bk").orderBy(
        F.col("effective_from").asc(),
        F.col("source_file").asc_nulls_last(),
        F.col("ingest_run_id").asc_nulls_last()
    )
    df = (
        df_union
        .withColumn("prev_hash", F.lag("attr_hash").over(order_w))
        .withColumn(
            "is_new_version",
            F.when(
                F.col("prev_hash").isNull() | (F.col("prev_hash") != F.col("attr_hash")),
                F.lit(1)
            ).otherwise(F.lit(0))
        )
        .withColumn(
            "version_grp",
            F.sum("is_new_version").over(order_w.rowsBetween(W.unboundedPreceding, 0))
        )
    )
    latest_grp_w = W.partitionBy("bk", "version_grp").orderBy(
        F.col("ingest_ts").desc(), F.col("ingest_run_id").desc()
    )
    df_collapsed = (
        df
        .withColumn("grp_effective_from", F.min("effective_from").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_rn", F.row_number().over(latest_grp_w))
        .filter(F.col("grp_rn") == 1)
        .drop("grp_rn", "prev_hash", "is_new_version")
        .withColumn("effective_from", F.col("grp_effective_from"))
        .withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
        .withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
        .drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
    )
    final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc())
    return (
        df_collapsed
        .withColumn("effective_to", F.lead("effective_from").over(final_w))
        .withColumn("is_current", F.col("effective_to").isNull())
    )
def process_bronze_without_source_time(df_in, bronze_table, bk_cols, attr_cols):
    """
    Main fallback ingestion function.
    Guarantees:
    - ingestion-time ordering
    - duplicate suppression
    - replay-safe ingestion
    - partial-load protection (where declared)
    - transaction/event separation
    """
    df = prepare_fallback_inputs(df_in, bk_cols, attr_cols)

    # Separate transactions from state
    df_tx = df.filter(F.col("entity_type") == "transaction")
    df_state_new = df.filter(F.col("entity_type") == "state")

    # Process transactions: insert if unseen, update last_seen if seen
    df_tx_final = dedup_transactions_fallback(df_tx)

    # Identify impacted business keys for state
    impacted_keys = df_state_new.select("bk").distinct()
    existing_state = (
        spark.read.format("delta").table(bronze_table)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, on="bk", how="inner")
    )

    # For declared partial loads, preserve prior attribute values
    df_state_partial = df_state_new.filter(F.col("load_type") == "PARTIAL")
    df_state_full = df_state_new.filter(F.col("load_type") != "PARTIAL")

    if df_state_partial.limit(1).count() > 0:  # or: if not df_state_partial.rdd.isEmpty()
        latest_existing = (
            existing_state
            .withColumn(
                "rn_latest",
                F.row_number().over(W.partitionBy("bk").orderBy(F.col("effective_from").desc()))
            )
            .filter(F.col("rn_latest") == 1)
            .drop("rn_latest")
        )
        df_state_partial = merge_partial_state(df_state_partial, latest_existing, attr_cols)
        df_state_partial = add_identity_and_hashes_no_source_time(df_state_partial, bk_cols, attr_cols)

    df_state_prepared = df_state_full.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_fallback(df_state_prepared)

    # Union with existing and rebuild platform-time SCD2
    df_state_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_fallback(df_state_union)

    # Final combined output
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)
    return df_final
12.2.5 What this fallback pattern achieves
This gives you:
- platform-time ordering, because `effective_from = ingest_ts`
- duplicate suppression, because repeated identical records collapse via `bk + attr_hash`
- replay-safe ingestion, because repeated arrivals update `last_seen_ts` rather than creating spurious history
- partial-feed safety, because declared partial loads preserve prior attribute values
- transaction/event handling, because immutable events are not falsely versioned as SCD2 dimensions
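The duplicate-suppression behaviour can be illustrated without Spark. The following is a minimal pure-Python sketch (a hypothetical `collapse_duplicates` helper over plain dicts, not the production PySpark code): repeated `(bk, attr_hash)` arrivals collapse to a single belief version and only widen the seen-window.

```python
from collections import OrderedDict

def collapse_duplicates(observations):
    """Collapse repeated (bk, attr_hash) arrivals into one belief version.

    Each observation is a dict with bk, attr_hash, and a sortable ingest_ts.
    Replayed identical records only widen first_seen_ts / last_seen_ts;
    they never create a new version.
    """
    collapsed = OrderedDict()
    for obs in observations:
        key = (obs["bk"], obs["attr_hash"])
        if key not in collapsed:
            collapsed[key] = {**obs,
                              "first_seen_ts": obs["ingest_ts"],
                              "last_seen_ts": obs["ingest_ts"]}
        else:
            row = collapsed[key]
            row["first_seen_ts"] = min(row["first_seen_ts"], obs["ingest_ts"])
            row["last_seen_ts"] = max(row["last_seen_ts"], obs["ingest_ts"])
    return list(collapsed.values())

storm = [
    {"bk": "C123", "attr_hash": "h1", "ingest_ts": "2026-03-01T09:00"},
    {"bk": "C123", "attr_hash": "h1", "ingest_ts": "2026-03-01T09:05"},  # replayed delivery
    {"bk": "C123", "attr_hash": "h2", "ingest_ts": "2026-03-02T10:00"},  # genuine change
]
rows = collapse_duplicates(storm)
# two belief versions survive; the replay only moved last_seen_ts forward
```

The window functions in `dedup_state_fallback` express the same idea set-wise over the whole DataFrame.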
12.2.6 Limitations and audit caveats
This pattern is useful, but weaker than source-time Bronze.
It can prove:
- what the platform received
- when the platform received it
- which file/system/run delivered it
- how the platform versioned belief from that point onward
It cannot prove:
- what the source actually believed at a historical moment prior to ingestion
- whether a late full snapshot was older or newer in source terms
- whether out-of-order arrivals reflect source chronology or pipeline delay
- true source-time point-in-time reconstruction
The platform is therefore preserving:
platform-time belief, not source-time belief
That distinction must be documented and visible to audit, risk, and regulators.
12.3 When You Must Reject a Source as Non-Defensible
Some sources are not merely “messy”. They are architecturally non-defensible for regulated use-cases unless their limitations are explicitly quarantined.
A source should be rejected as a source of regulator-defensible truth if it cannot substantiate the minimum controls required for temporal reconstruction.
This does not necessarily mean the source cannot be ingested. It means it must not be treated as capable of supporting:
- point-in-time reconstruction
- authoritative belief replay
- evidentiary decision reconstruction
12.3.1 No source-time
If a source cannot provide any reliable notion of:
- event timestamp
- effective timestamp
- extract “as-of” time
- ordered change sequence
then the platform cannot reconstruct source-time belief.
In that case, the source may still be ingested in fallback mode, but must be classified as:
operationally useful, temporally weak
It must not be presented as regulator-grade temporal evidence.
12.3.2 Ambiguous partial snapshots
A source should be treated as non-defensible if it sends partial updates without making clear:
- whether the payload is partial or full
- whether omitted fields are unchanged, null, or deleted
- whether row absence implies deletion
If partiality is ambiguous, then:
- missing fields can cause false deletes
- missing rows can cause false absence
- downstream belief becomes unrecoverable
A source without explicit load_type or equivalent semantics cannot safely drive Bronze truth in regulated domains.
12.3.3 No stable business key
If the source cannot provide a stable identity for the entity or event, then Bronze cannot:
- group assertions correctly
- deduplicate deterministically
- reconstruct timelines reliably
- distinguish replay from genuine change
In practical terms, if there is no stable bk, there is no safe SCD2.
You can still land such data as evidence, but you cannot build defensible entity history from it.
12.3.4 No delete semantics
If the source cannot distinguish between:
- deletion
- omission
- partial payload
- not-included-in-this-extract
then the platform cannot safely model lifecycle state.
This means:
- active vs inactive state becomes guesswork
- Silver current-state views become unstable
- historical belief cannot be defended
In regulated contexts, ambiguous deletion semantics are a hard control weakness.
12.3.5 Practical reject / quarantine logic
The easiest way to operationalise this is to score the source contract before allowing it into the “defensible Bronze” path.
from pyspark.sql import functions as F
def assess_source_defensibility(df_contract):
    """
    Example contract assessment input:
    one-row dataframe or config-derived dataframe with booleans such as:
    has_source_time, has_stable_bk, has_explicit_load_type, has_delete_semantics
    """
    return (
        df_contract
        .withColumn(
            "defensibility_status",
            F.when(
                (F.col("has_source_time") == True) &
                (F.col("has_stable_bk") == True) &
                (F.col("has_explicit_load_type") == True) &
                (F.col("has_delete_semantics") == True),
                F.lit("DEFENSIBLE")
            ).when(
                (F.col("has_stable_bk") == True),
                F.lit("LIMITED_FALLBACK_ONLY")
            ).otherwise(
                F.lit("REJECT_FOR_AUTHORITATIVE_BRONZE")
            )
        )
        .withColumn(
            "defensibility_reason",
            F.when(F.col("has_stable_bk") != True, F.lit("No stable business key"))
            .when(F.col("has_source_time") != True, F.lit("No reliable source time"))
            .when(F.col("has_explicit_load_type") != True, F.lit("No explicit load semantics"))
            .when(F.col("has_delete_semantics") != True, F.lit("No explicit delete semantics"))
            .otherwise(F.lit("Meets minimum temporal evidentiary standard"))
        )
    )
Example usage:
contract_df = spark.createDataFrame(
    [
        (False, True, False, False)
    ],
    ["has_source_time", "has_stable_bk", "has_explicit_load_type", "has_delete_semantics"]
)
assess_source_defensibility(contract_df).show(truncate=False)
Possible outputs:
- `DEFENSIBLE`
- `LIMITED_FALLBACK_ONLY`
- `REJECT_FOR_AUTHORITATIVE_BRONZE`
12.3.6 The architectural rule
A source may be ingested without source-time.
It may not be presented as regulator-defensible historical truth unless it can substantiate temporal ordering, stable identity, and deletion semantics.
This is the architectural boundary.
Without it, weak sources contaminate strong Bronze and undermine the entire point of preserving temporal truth.
12.4 CDC Microbatch Ingestion Pattern (State + Transaction Handling)
Change Data Capture (CDC) is one of the most common ingestion patterns in modern Financial Services platforms. It introduces a continuous stream of change events — inserts, updates, deletes — typically delivered in microbatches via Kafka, file drops, or replication tooling.
In a Bronze SCD2 architecture, CDC must be treated as:
a stream of assertions, not a sequence of instructions
Each CDC event represents:
- what the source asserted
- when it asserted it
- under what context
The role of Bronze is not to “apply” these changes, but to capture them as evidence, and construct a defensible temporal model from them.
This section extends the preferred source-time pattern (12.1) to CDC ingestion.
12.4.1 Inputs and Assumptions
Assume each microbatch contains CDC events with fields such as:
- business key
- mutable attributes (often partial)
- operation type (`INSERT`, `UPDATE`, `DELETE`)
- source timestamp or sequence (`source_event_ts`)
- ingestion timestamp (`ingest_ts`)
- optional before/after images
Key assumptions:
- `source_event_ts` is reliable (or at least monotonic within source scope)
- CDC payloads are partial by nature
- entity type (`state` vs `transaction`) is known or derivable
- load type is implicitly `PARTIAL`
12.4.2 CDC Op Mapping Rules
CDC events must be translated into state assertions, not executed as mutations.
12.4.2.1 Insert (INSERT / c)
- Represents a new state assertion
- Create new version with `effective_from = source_event_ts`
12.4.2.2 Update (UPDATE / u)
- Represents a changed assertion
- Payload is often partial
- Must reconstruct full state using prior valid version (see 12.1.3)
- Create new SCD2 version
12.4.2.3 Delete (DELETE / d)
- Represents a deletion assertion
- Create new version with `is_deleted = true`
- Do not remove prior history
12.4.2.4 Before / After Images
If available:
- use after image as the assertion
- optionally retain before image for forensic/audit use (not required for SCD2 logic)
If not available:
- treat as partial update
- apply merge logic from 12.1.3
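The mapping rules above can be sketched in plain Python. This is a hypothetical `map_cdc_event` helper over a Debezium-like event dict (`op`, `before`, `after`, `source_event_ts` are assumed field names), not the production pipeline; it shows that every operation, including a delete, becomes an assertion rather than a mutation.

```python
def map_cdc_event(event):
    """Translate one CDC event into a state assertion.

    event: {"op": "c"|"u"|"d", "source_event_ts": ..., "before": {...}, "after": {...}}
    Deletes become is_deleted=True versions; nothing is ever removed.
    """
    op = event["op"]
    if op == "d":
        # delete events usually carry no after-image, so take identity
        # and attributes from the before-image
        payload = event.get("before") or {}
        return {**payload,
                "is_deleted": True,
                "effective_from": event["source_event_ts"],
                "partial": True}
    payload = event.get("after") or {}
    return {**payload,
            "is_deleted": False,
            "effective_from": event["source_event_ts"],
            # inserts carry full state; updates are treated as partial
            "partial": op != "c"}

delete_evt = {"op": "d", "source_event_ts": "2026-03-05T08:30",
              "before": {"customer_id": "C123"}, "after": None}
assertion = map_cdc_event(delete_evt)
# a delete is just another temporally ordered assertion
```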
12.4.3 Microbatch Processing Flow
CDC ingestion should use Structured Streaming with foreachBatch, allowing explicit control over:
- ordering
- deduplication
- provenance
- replay behaviour
The correct pattern is:
- parse CDC payload
- standardise + assign metadata
- map CDC ops to state assertions
- reconstruct partial state (where required)
- deduplicate observations
- rebuild SCD2 timeline for impacted keys
- append transaction events
- write Bronze
12.4.4 Reference PySpark Implementation
def process_cdc_microbatch(microbatch_df, batch_id):
    # 1. Parse CDC structure (example assumes Debezium-style payload)
    df = (
        microbatch_df
        .withColumn("op", F.col("payload.op"))
        # ts_ms is epoch milliseconds; scale to seconds before casting
        .withColumn("source_event_ts", (F.col("payload.source.ts_ms") / 1000).cast("timestamp"))
        # delete events carry no after-image, so fall back to the before-image
        .withColumn("bk", F.coalesce(F.col("payload.after.customer_id"),
                                     F.col("payload.before.customer_id")))
        # extract attributes from payload.after as required
    )

    # 2. Prepare inputs (source-time mode)
    df = prepare_inputs_with_source_time(df, BK_COLS, ATTR_COLS, EVENT_TS_COL)

    # 3. Map CDC operations to state assertions
    df_state = (
        df.filter(F.col("entity_type") == "state")
        .withColumn(
            "is_deleted",
            F.when(F.col("op") == "d", F.lit(True)).otherwise(F.col("is_deleted"))
        )
    )
    df_tx = df.filter(F.col("entity_type") == "transaction")

    # 4. Identify impacted keys
    impacted_keys = df_state.select("bk").distinct()
    existing_state = (
        spark.read.table(BRONZE_TABLE)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, "bk")
    )

    # 5. Split insert vs partial (updates + deletes are partial)
    df_state_insert = df_state.filter(F.col("op") == "c")
    df_state_partial = df_state.filter(F.col("op") != "c")

    # 6. Reconstruct partial state at source time
    if df_state_partial.limit(1).count() > 0:
        df_state_partial = merge_partial_state_with_source_time(
            df_state_partial,
            existing_state,
            ATTR_COLS,
            EVENT_TS_COL
        )
    df_state_partial = add_identity_and_hashes(df_state_partial, BK_COLS, ATTR_COLS)

    # 7. Combine + deduplicate
    df_state_prepared = df_state_insert.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_with_source_time(df_state_prepared)

    # 8. Rebuild source-time timeline
    df_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_timeline_with_source_time(df_union)

    # 9. Transactions (append-only, replay-safe)
    df_tx_final = dedup_transactions_with_source_time(df_tx)
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)

    # 10. Write Bronze
    df_final.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
12.4.4.1 Important: Append vs Rebuild
CDC often tempts teams into row-level mutation:
“apply update, update row, delete row”
That is incorrect for Bronze.
The correct model is:
- append assertions
- rebuild timelines
- derive state
Even under streaming.
The use of foreachBatch is deliberate:
It allows each microbatch to behave like a controlled batch rebuild for impacted keys, preserving determinism.
12.4.5 Why This Fits Regulated Financial Services
This pattern preserves the properties regulators actually care about.
12.4.5.1 Full evidentiary capture
Every CDC event is retained with:
- source timestamp
- operation type
- provenance
12.4.5.2 Out-of-sequence safety
Late or replayed CDC events are:
- inserted into correct temporal position
- not appended blindly
12.4.5.3 Replay and idempotency
Duplicate CDC deliveries:
- collapse via hashing
- update `last_seen_ts`
- do not create false versions
12.4.5.4 Correct handling of partial change
CDC payloads are inherently partial. The merge logic ensures:
- no accidental null overwrites
- no silent data loss
12.4.5.5 Separation of fact vs belief
- state → SCD2 versioned belief
- transactions → append-only facts
12.4.5.6 Limitations and Caveats
CDC is powerful, but not automatically safe.
12.4.5.6.1 Source-time quality is critical
If source_event_ts is:
- missing
- inconsistent
- non-monotonic
then the model degrades to ingestion-time belief.
12.4.5.6.2 Ordering guarantees vary by source
Different CDC systems provide different guarantees:
- some guarantee ordering per key
- others only per partition
- some allow reordering under failure
Tie-breakers must still be applied.
12.4.5.6.3 High-volume considerations
Large CDC streams require:
- partitioning on `bk`
- clustering by `source_event_ts`
- periodic compaction
12.4.5.6.4 Deletes must be explicit
Some CDC feeds omit deletes or encode them ambiguously.
If delete semantics are unclear, the source becomes:
non-defensible for lifecycle modelling
12.4.5.6.5 The Core Principle
CDC does not simplify Bronze.
It amplifies the need to get Bronze right.
You are not applying changes.
You are capturing a stream of assertions and constructing a defensible history from them.
That is what makes the system:
- replayable
- explainable
- regulator-grade
12.4.6 Worked CDC Example: Insert, Partial Update, Delete, and Out-of-Order Arrival
To make the pattern concrete, consider a simple customer record flowing from a CDC source.
Assume the business key is:
customer_id = C123
And the mutable attributes are:
- `name`
- `address`
- `status`
- `is_deleted`
Assume the source emits CDC events with:
- `op` (`c`, `u`, `d`)
- `source_event_ts`
- partial payloads for updates
- provenance metadata
12.4.6.1 The Source Assertions
The source emits four events:
12.4.6.1.1 Event 1 — Initial insert
| Field | Value |
|---|---|
| op | c |
| source_event_ts | 2026-03-01 09:00:00 |
| customer_id | C123 |
| name | Jane Carter |
| address | 12 Market Street |
| status | Active |
| is_deleted | false |
12.4.6.1.2 Event 2 — Partial update
Only the address changes.
| Field | Value |
|---|---|
| op | u |
| source_event_ts | 2026-03-03 10:00:00 |
| customer_id | C123 |
| address | 18 King Street |
Because this is CDC, the payload is partial. It does not repeat name, status, or is_deleted.
12.4.6.1.3 Event 3 — Delete
The source later deletes the customer.
| Field | Value |
|---|---|
| op | d |
| source_event_ts | 2026-03-05 08:30:00 |
| customer_id | C123 |
This is interpreted as a new version with is_deleted = true.
12.4.6.1.4 Event 4 — Late-arriving update
Now imagine an earlier update arrives late due to replication lag.
| Field | Value |
|---|---|
| op | u |
| source_event_ts | 2026-03-02 15:00:00 |
| customer_id | C123 |
| status | Restricted |
This arrives after Events 2 and 3 have already been processed.
That is the real test.
12.4.6.2 Step 1: What Bronze Receives
Bronze receives all four assertions as evidence, with provenance and ingestion metadata.
The crucial point is this:
Bronze does not discard Event 4 just because it arrived late.
Bronze inserts it into the correct place in the source-time timeline.
12.4.6.3 Step 2: Partial Reconstruction
Event 2 and Event 4 are partial updates, so the platform reconstructs the full row as of their source_event_ts.
12.4.6.3.1 Event 2 reconstructed full state
| Attribute | Value |
|---|---|
| name | Jane Carter |
| address | 18 King Street |
| status | Active |
| is_deleted | false |
12.4.6.3.2 Event 4 reconstructed full state
| Attribute | Value |
|---|---|
| name | Jane Carter |
| address | 12 Market Street |
| status | Restricted |
| is_deleted | false |
This is why partial updates must be merged against prior valid state before hashing and versioning.
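A minimal sketch of that merge in plain Python (a hypothetical `reconstruct_full_state` helper over dicts; the real pipeline does this with the window-based merge shown earlier, and real null-vs-absent semantics depend on the source contract):

```python
def reconstruct_full_state(prior_state, partial_payload):
    """Overlay a partial CDC payload on the prior valid full state.

    Attributes absent from the payload keep their prior values, so a
    partial update never nulls out omitted fields before hashing.
    """
    merged = dict(prior_state)
    for attr, value in partial_payload.items():
        if value is not None:  # simplification: treat None as "not asserted"
            merged[attr] = value
    return merged

prior = {"name": "Jane Carter", "address": "12 Market Street",
         "status": "Active", "is_deleted": False}
event_2 = {"address": "18 King Street"}   # partial update: address only
full_2 = reconstruct_full_state(prior, event_2)
event_4 = {"status": "Restricted"}        # late partial update: status only
full_4 = reconstruct_full_state(prior, event_4)
```

Hashing `full_2` and `full_4` (rather than the raw payloads) is what keeps version detection correct.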
12.4.6.4 Step 3: Correct Source-Time Ordering
Even if the events arrive in this ingestion order:
- Event 1
- Event 2
- Event 3
- Event 4 (late)
the Bronze timeline is rebuilt in this source-time order:
- Event 1 — `2026-03-01 09:00:00`
- Event 4 — `2026-03-02 15:00:00`
- Event 2 — `2026-03-03 10:00:00`
- Event 3 — `2026-03-05 08:30:00`
That means effective_from, effective_to, and is_current are recomputed accordingly.
12.4.6.5 Step 4: Final Bronze SCD2 History
The correct Bronze history becomes:
| customer_id | effective_from | effective_to | name | address | status | is_deleted | is_current |
|---|---|---|---|---|---|---|---|
| C123 | 2026-03-01 09:00:00 | 2026-03-02 15:00:00 | Jane Carter | 12 Market Street | Active | false | false |
| C123 | 2026-03-02 15:00:00 | 2026-03-03 10:00:00 | Jane Carter | 12 Market Street | Restricted | false | false |
| C123 | 2026-03-03 10:00:00 | 2026-03-05 08:30:00 | Jane Carter | 18 King Street | Active | false | false |
| C123 | 2026-03-05 08:30:00 | null | Jane Carter | 18 King Street | Active | true | true |
Notice what happened:
- the late-arriving Event 4 was inserted historically
- the prior and subsequent rows were resequenced
- the delete is represented as a new version, not row removal
- `is_current` remains correct
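The resequencing can be sketched in plain Python (no Spark; hypothetical structures, taking the reconstructed full states as inputs): sort the versions by source time, then derive `effective_to` and `is_current` from each row's successor.

```python
def rebuild_timeline(versions):
    """Rebuild effective_from / effective_to / is_current in source-time order.

    Each version dict carries its source-time effective_from; a late
    arrival simply sorts into place and the chain is recomputed.
    """
    rows = sorted(versions, key=lambda v: v["effective_from"])
    for current, nxt in zip(rows, rows[1:]):
        current["effective_to"] = nxt["effective_from"]
        current["is_current"] = False
    rows[-1]["effective_to"] = None
    rows[-1]["is_current"] = True
    return rows

versions = [  # ingestion order: Event 1, Event 2, Event 3, then late Event 4
    {"effective_from": "2026-03-01 09:00:00", "status": "Active",     "is_deleted": False},
    {"effective_from": "2026-03-03 10:00:00", "status": "Active",     "is_deleted": False},
    {"effective_from": "2026-03-05 08:30:00", "status": "Active",     "is_deleted": True},
    {"effective_from": "2026-03-02 15:00:00", "status": "Restricted", "is_deleted": False},
]
timeline = rebuild_timeline(versions)
# the late Event 4 slots in second; only the delete row is current
```

The PySpark `lead`-over-window in the reference implementation performs the same recomputation per business key.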
12.4.6.6 Step 5: What Silver Sees
Silver does not need to solve this complexity. It simply reads the Bronze history.
12.4.6.6.1 Current-state Silver
SELECT *
FROM bronze_customer
WHERE is_current = true
AND is_deleted = false
Result:
- no active record returned for `C123`
Because the customer is now deleted.
12.4.6.6.2 Point-in-time Silver as of 2026-03-02 16:00:00
SELECT *
FROM bronze_customer
WHERE effective_from <= TIMESTAMP '2026-03-02 16:00:00'
AND (effective_to > TIMESTAMP '2026-03-02 16:00:00' OR effective_to IS NULL)
AND is_deleted = false
Result:
| customer_id | name | address | status |
|---|---|---|---|
| C123 | Jane Carter | 12 Market Street | Restricted |
Which is exactly the belief the platform should reconstruct at that time.
12.4.6.7 Why This Example Matters
This example shows four critical truths:
12.4.6.7.1 1. CDC updates are assertions, not commands
The platform does not “apply” an update blindly. It evaluates it as evidence.
12.4.6.7.2 2. Partial updates must be reconstructed
Without reconstruction, Event 2 and Event 4 would null out attributes and corrupt the hash.
12.4.6.7.3 3. Late events must repair history
Without source-time rebuild, Event 4 would arrive “too late” and produce the wrong current and PIT state.
12.4.6.7.4 4. Deletes are state transitions
The delete is not a row removal. It is a final, explicit belief version.
12.4.6.8 The Core Lesson
If Bronze is implemented correctly, even messy CDC behaviour becomes manageable:
- partial updates do not lose state
- late arrivals do not corrupt timelines
- deletes do not erase evidence
- Silver remains simple
That is exactly what “land it early, manage it early” means in practice.
12.4.7 Worked Example: Multi-Source CDC Conflict with Precedence
In real Financial Services systems, multiple sources assert state about the same entity. These assertions may:
- overlap in time
- conflict in value
- arrive out of sequence
Bronze must capture all of them.
Silver must decide what to believe.
12.4.7.1 Scenario
We have two systems:
| Source | Description | Precedence |
|---|---|---|
| CRM | Customer-facing system | 1 (lower) |
| CORE | Core banking system | 2 (higher) |
Assume:
- CORE is the authoritative source for status
- CRM is authoritative for contact data (not shown here)
We focus on:
- `customer_id`
- `status`
12.4.7.2 The Source Assertions
12.4.7.2.1 Event 1 — CRM creates customer
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| source_event_ts | 2026-03-01 09:00 |
| status | Active |
12.4.7.2.2 Event 2 — CORE sets status to Restricted
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| source_event_ts | 2026-03-02 10:00 |
| status | Restricted |
12.4.7.2.3 Event 3 — CRM updates status (conflict)
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| source_event_ts | 2026-03-03 09:00 |
| status | Active |
This directly conflicts with CORE.
12.4.7.2.4 Event 4 — CORE clears restriction (late arrival)
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| source_event_ts | 2026-03-02 18:00 |
| status | Active |
This arrives after Event 3, but is earlier in source time.
12.4.7.3 Step 1: Bronze Stores All Assertions
Bronze does not resolve conflicts.
It produces a full SCD2 history ordered by:
source_event_ts ASC
precedence_rank DESC
12.4.7.3.1 Correct Bronze Timeline
| customer_id | source | precedence | effective_from | status |
|---|---|---|---|---|
| C123 | CRM | 1 | 2026-03-01 09:00 | Active |
| C123 | CORE | 2 | 2026-03-02 10:00 | Restricted |
| C123 | CORE | 2 | 2026-03-02 18:00 | Active |
| C123 | CRM | 1 | 2026-03-03 09:00 | Active |
Notice:
- Event 4 (CORE late arrival) is inserted correctly between Events 2 and 3
- CRM conflict is preserved, not removed
12.4.7.4 Step 2: Bronze is NOT “the truth”
At this point, Bronze contains:
- conflicting assertions
- overlapping beliefs
- multiple valid versions
This is correct.
Bronze answers: what was said, when, and by whom
12.4.7.5 Step 3: Silver Applies Precedence
Silver resolves conflicts using:
- Highest `precedence_rank` wins
- If equal precedence → latest `source_event_ts` wins
Evaluate at Different Points in Time
12.4.7.5.1 As of 2026-03-02 12:00
Available assertions:
- CRM → Active (lower precedence)
- CORE → Restricted (higher precedence)
Silver result:
| status |
|---|
| Restricted |
12.4.7.5.2 As of 2026-03-02 19:00
Available assertions:
- CORE → Active (Event 4)
- CORE → Restricted (Event 2)
Same precedence → latest time wins.
Silver result:
| status |
|---|
| Active |
12.4.7.5.3 As of 2026-03-03 10:00
Available assertions:
- CRM → Active (latest in time, lower precedence)
- CORE → Active (earlier, higher precedence)
Key rule: precedence overrides time
So:
Silver result:
| status |
|---|
| Active (from CORE) |
Even though CRM is later, CORE still governs.
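That rule can be sketched in plain Python (hypothetical structures; a simplification of the SQL pattern in 12.4.7.8): among assertions visible at the as-at time, the highest precedence wins, with latest source time as the tie-breaker.

```python
def resolve_status(assertions, as_at):
    """Pick the winning assertion at a point in time:
    highest precedence_rank first, then latest source_event_ts."""
    visible = [a for a in assertions if a["source_event_ts"] <= as_at]
    return max(visible, key=lambda a: (a["precedence_rank"], a["source_event_ts"]))

assertions = [
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-01 09:00", "status": "Active"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 10:00", "status": "Restricted"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 18:00", "status": "Active"},
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-03 09:00", "status": "Active"},
]
at_1 = resolve_status(assertions, "2026-03-02 12:00")  # CORE Restricted wins
at_2 = resolve_status(assertions, "2026-03-02 19:00")  # latest CORE assertion wins
at_3 = resolve_status(assertions, "2026-03-03 10:00")  # CORE still outranks later CRM
```

Note the simplification: this scans all prior assertions rather than the SQL's effective-window filter, but it yields the same answers for this scenario.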
12.4.7.6 Step 4: Why This Works
12.4.7.6.1 1. Time is preserved independently of precedence
Bronze ordering is always:
when was this asserted?
not:
which source wins?
12.4.7.6.2 2. Precedence is applied at query time (Silver)
This allows:
- rules to evolve
- conflicts to be reinterpreted
- no loss of evidence
12.4.7.6.3 3. Late-arriving data is safe
Event 4:
- arrives late
- is inserted correctly
- affects downstream belief
Without rebuild logic, this would break.
12.4.7.6.4 4. Conflicts remain explainable
You can always answer:
- “CRM said Active at X”
- “CORE said Restricted at Y”
- “We chose CORE because of precedence”
That is exactly what regulators expect.
12.4.7.7 Step 5: What NOT to Do
12.4.7.7.1 Do NOT resolve precedence in Bronze
If you drop lower-precedence rows:
- you lose evidence
- you cannot explain decisions
- you cannot re-run logic
12.4.7.7.2 Do NOT overwrite based on precedence
This collapses:
- multiple assertions → single row
- history → illusion
12.4.7.7.3 Do NOT treat “latest wins” as universal
That only works in:
- single-source systems
- non-regulated environments
12.4.7.8 Step 6: Example Silver Query Pattern
A simplified Silver resolution might look like:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
)
SELECT *
FROM ranked
WHERE rn = 1
This is where:
- precedence
- time
- business rules
come together.
12.4.7.9 The Core Insight
This example demonstrates the most important architectural separation in the entire design:
Bronze preserves competing assertions.
Silver expresses institutional belief.
If you collapse those layers, you lose:
- auditability
- flexibility
- defensibility
12.4.7.10 Final Takeaway
Multi-source CDC is not a problem to eliminate.
It is a reality to model correctly.
If you:
- preserve all assertions (Bronze)
- order them correctly (source time)
- resolve them explicitly (Silver)
then even conflicting, out-of-sequence, multi-source data becomes:
deterministic, explainable, and regulator-grade
12.4.8 Worked Example: CDC + Partial + Precedence + Delete Across Two Sources
The hardest real-world cases are not single-source updates. They are situations where:
- multiple systems assert state about the same entity
- one source is higher precedence than another
- updates are partial
- events arrive out of sequence
- one source later deletes the record
This example shows how Bronze and Silver behave under exactly those conditions.
12.4.8.1 Scenario
We have one customer:
customer_id = C123
Two source systems assert state about that customer:
| Source | Role | Precedence |
|---|---|---|
| CRM | customer-facing workflow system | 1 (lower) |
| CORE | core banking / operational authority | 2 (higher) |
We care about these mutable attributes:
- `name`
- `address`
- `status`
- `is_deleted`
Assume:
- CORE is authoritative for `status`
- CRM is authoritative for contact workflow, but still contributes assertions
- all events are CDC-style partial or full assertions
- Bronze stores all assertions
- Silver resolves current belief
12.4.8.2 The CDC Events
12.4.8.2.1 Event 1 — CRM creates the customer
A full create arrives from CRM.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | c |
| source_event_ts | 2026-03-01 09:00 |
| name | Jane Carter |
| address | 12 Market Street |
| status | Active |
| is_deleted | false |
This becomes the first Bronze version.
12.4.8.2.2 Event 2 — CORE partially updates status
A partial CDC update arrives from CORE.
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| op | u |
| source_event_ts | 2026-03-02 10:00 |
| status | Restricted |
This is partial: only status is present.
To version it correctly, Bronze must reconstruct full state as of 2026-03-02 10:00:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 12 Market Street | Restricted | false |
12.4.8.2.3 Event 3 — CRM partially updates address
CRM later changes the address.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | u |
| source_event_ts | 2026-03-03 09:00 |
| address | 18 King Street |
Again, this is partial. Bronze reconstructs:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 18 King Street | Active | false |
Notice something important:
- CRM has no authority over CORE’s status assertion
- Bronze still stores CRM’s assertion exactly as it was made
- Silver will later decide what to believe
12.4.8.2.4 Event 4 — CORE clears the restriction, but arrives late
Now a CORE event arrives after Event 3 was processed, but with an earlier source time.
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| op | u |
| source_event_ts | 2026-03-02 18:00 |
| status | Active |
This is late-arriving and partial.
Bronze reconstructs:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 12 Market Street | Active | false |
And then inserts it into the correct source-time position, not the arrival position.
12.4.8.2.5 Event 5 — CRM delete arrives later
CRM later emits a delete.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | d |
| source_event_ts | 2026-03-04 12:00 |
Bronze models this as a new version:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 18 King Street | Active | true |
But this is still only a CRM assertion.
That matters.
12.4.8.3 Step 1: Bronze Stores All Assertions in Source-Time Order
Even if the actual ingestion order is:
- Event 1
- Event 2
- Event 3
- Event 5
- Event 4 (late)
Bronze must rebuild history into this source-time order:
| Seq | source_system | precedence | source_event_ts | name | address | status | is_deleted |
|---|---|---|---|---|---|---|---|
| 1 | CRM | 1 | 2026-03-01 09:00 | Jane Carter | 12 Market Street | Active | false |
| 2 | CORE | 2 | 2026-03-02 10:00 | Jane Carter | 12 Market Street | Restricted | false |
| 3 | CORE | 2 | 2026-03-02 18:00 | Jane Carter | 12 Market Street | Active | false |
| 4 | CRM | 1 | 2026-03-03 09:00 | Jane Carter | 18 King Street | Active | false |
| 5 | CRM | 1 | 2026-03-04 12:00 | Jane Carter | 18 King Street | Active | true |
This is the Bronze timeline.
Bronze has done no precedence resolution. It has simply preserved assertions.
12.4.8.4 Step 2: Bronze SCD2 History
If represented as Bronze rows, it would look like:
| customer_id | source_system | precedence_rank | effective_from | effective_to | address | status | is_deleted | is_current |
|---|---|---|---|---|---|---|---|---|
| C123 | CRM | 1 | 2026-03-01 09:00 | 2026-03-02 10:00 | 12 Market Street | Active | false | false |
| C123 | CORE | 2 | 2026-03-02 10:00 | 2026-03-02 18:00 | 12 Market Street | Restricted | false | false |
| C123 | CORE | 2 | 2026-03-02 18:00 | 2026-03-03 09:00 | 12 Market Street | Active | false | false |
| C123 | CRM | 1 | 2026-03-03 09:00 | 2026-03-04 12:00 | 18 King Street | Active | false | false |
| C123 | CRM | 1 | 2026-03-04 12:00 | null | 18 King Street | Active | true | true |
This is not yet “the truth”. It is the full evidence chain.
12.4.8.5 Step 3: What Silver Believes at Different Times
Silver now applies precedence.
Rule set:
- Higher `precedence_rank` wins
- If equal precedence, latest source-time wins
- Deletes are only believed if they come from a source authoritative for deletion in that domain
That last point is crucial.
12.4.8.5.1 As of 2026-03-02 12:00
Visible assertions:
- CRM: Active
- CORE: Restricted
Silver result:
| status | address | deleted |
|---|---|---|
| Restricted | 12 Market Street | false |
Why:
- CORE outranks CRM on status
12.4.8.5.2 As of 2026-03-02 19:00
Visible assertions:
- CRM: Active
- CORE: Restricted
- CORE: Active
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 12 Market Street | false |
Why:
- CORE still outranks CRM
- latest CORE assertion is now Active
12.4.8.5.3 As of 2026-03-03 10:00
Visible assertions:
- CRM changed address to 18 King Street
- CORE latest status is Active
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | false |
Why:
- CRM supplies the latest address assertion
- CORE supplies the latest higher-precedence status assertion
This is the key insight:
Silver may compose belief from multiple Bronze assertions across sources and attributes.
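That composition can be sketched in plain Python (a hypothetical `compose_belief` resolver with an assumed attribute-authority map; not the production logic): each attribute is resolved independently, and precedence only applies where the rule for that attribute says it does.

```python
def compose_belief(assertions, as_at, attr_rules):
    """Compose attribute-level belief from competing assertions.

    attr_rules maps attribute -> ordering rule:
      "precedence": highest precedence_rank wins, then latest source time
      "latest":     latest source time wins regardless of source
    Only assertions that actually carry the attribute are considered.
    """
    visible = [a for a in assertions if a["source_event_ts"] <= as_at]
    belief = {}
    for attr, rule in attr_rules.items():
        carrying = [a for a in visible if attr in a]
        if not carrying:
            continue
        if rule == "precedence":
            winner = max(carrying, key=lambda a: (a["precedence_rank"], a["source_event_ts"]))
        else:  # "latest"
            winner = max(carrying, key=lambda a: a["source_event_ts"])
        belief[attr] = winner[attr]
    return belief

assertions = [
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-01 09:00",
     "address": "12 Market Street", "status": "Active"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 18:00",
     "status": "Active"},
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-03 09:00",
     "address": "18 King Street"},
]
belief = compose_belief(assertions, "2026-03-03 10:00",
                        {"status": "precedence", "address": "latest"})
# status resolved from CORE, address from the latest CRM assertion
```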
12.4.8.5.4 As of 2026-03-04 13:00
Visible assertions now include CRM delete.
Silver must decide:
- is CRM allowed to delete the entity for this domain?
- or is CORE the authoritative lifecycle source?
12.4.8.5.5 Case A — CRM delete is authoritative
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | true |
The customer is treated as deleted.
12.4.8.5.6 Case B — CRM delete is not authoritative
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | false |
The delete assertion remains preserved in Bronze, but is not adopted as belief.
This is why Bronze must store all conflicts and Silver must resolve them.
12.4.8.6 Step 4: Why This Example Matters
This single example shows all the hard parts working together:
12.4.8.6.1 Partial update safety
Events 2, 3, and 4 are partial.
Without reconstruction, they would null out omitted attributes.
12.4.8.6.2 Out-of-sequence repair
Event 4 arrives late but is inserted into the correct historical position.
12.4.8.6.3 Multi-source precedence
CRM and CORE both assert state.
Silver selects belief without destroying Bronze evidence.
12.4.8.6.4 Deletes as assertions
Delete is not row removal.
It is just another temporally ordered assertion.
12.4.8.6.5 Attribute-level authority
One source may govern status, another address, and a third deletion authority.
That decision belongs in Silver/precedence logic, not Bronze mutation.
12.4.8.7 Step 5: Example Silver Resolution Shape
A simplified Silver query pattern might be:
WITH candidate_rows AS (
SELECT *
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
),
ranked_status AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn_status
FROM candidate_rows
WHERE status IS NOT NULL
),
ranked_address AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY source_event_ts DESC
) AS rn_address
FROM candidate_rows
WHERE address IS NOT NULL
)
SELECT
s.customer_id,
s.status,
a.address,
s.is_deleted
FROM ranked_status s
JOIN ranked_address a
ON s.customer_id = a.customer_id
WHERE s.rn_status = 1
AND a.rn_address = 1
This is illustrative only, but it makes the core point:
Silver resolves belief from Bronze assertions.
It does not pretend one source row is magically the whole truth.
12.4.8.8 The Core Lesson
This is the pattern you need for regulated Financial Services:
- Bronze captures all assertions
- source-time ordering repairs late arrival
- partial updates are reconstructed before hashing
- deletes are explicit state transitions
- precedence is applied downstream, not destructively upstream
That is how you get:
truth without pain in Silver, because the pain was handled properly in Bronze
12.4.8.9 Final Takeaway
When CDC, partial updates, precedence, deletes, and out-of-sequence delivery all collide, the architecture either:
- preserves evidence and remains explainable, or
- collapses into ingestion-order folklore
The difference is whether Bronze is treated as:
a mutable current-state table
or
an append-only, temporally ordered record of institutional belief
Only the second survives scrutiny.
12.4.9 Streaming Reality: Watermarks, Replay, and Exactly-Once Illusions
CDC microbatch ingestion introduces operational concerns that are often misunderstood, particularly in regulated environments.
The key principle is:
Correctness under replay matters more than minimising duplication.
12.4.9.1 Exactly-Once Is Not the Goal
In distributed systems:
- exactly-once delivery is difficult to guarantee
- retries, replays, and duplicate delivery are normal
Instead of attempting to eliminate duplicates, the architecture must be:
idempotent by design
This is achieved through:
- deterministic hashing (attr_hash, record_hash)
- deduplication on (business_key, attr_hash, source_event_ts)
- rebuild logic that collapses identical states
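The idempotency mechanism can be sketched as follows. This is a minimal illustration, assuming a simple dict-based event shape; the hashing and deduplication key mirror the bullets above, but function names are illustrative.

```python
import hashlib
import json

def attr_hash(attrs: dict) -> str:
    """Deterministic hash over tracked attributes: keys sorted, values
    serialised canonically, so identical states always hash identically."""
    canonical = json.dumps(attrs, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(events):
    """Drop exact replays: two deliveries of the same assertion share the
    same (business_key, attr_hash, source_event_ts) and collapse to one
    row, so replaying a feed cannot change outcomes."""
    seen, out = set(), []
    for e in events:
        key = (e["business_key"], attr_hash(e["attrs"]), e["source_event_ts"])
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

batch = [
    {"business_key": "C123", "source_event_ts": "2026-03-01T09:00",
     "attrs": {"status": "Active"}},
    {"business_key": "C123", "source_event_ts": "2026-03-01T09:00",
     "attrs": {"status": "Active"}},  # duplicate delivery (replay)
    {"business_key": "C123", "source_event_ts": "2026-03-02T10:00",
     "attrs": {"status": "Restricted"}},
]
print(len(dedupe(batch)))  # 2
```

Because the dedup key is derived entirely from the event's content and source time, it is stable across retries, replays, and restarts; applying dedupe twice yields the same result as applying it once.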
12.4.9.2 Replay Is a Feature, Not a Failure
In regulated Financial Services, replay is expected:
- backfills
- reprocessing after failure
- regulatory reconstruction
The pipeline must produce identical results when replayed.
This is guaranteed if:
- ordering is based on source_event_ts
- deduplication removes duplicate assertions
- SCD2 rebuild logic is deterministic
12.4.9.3 Watermarking and Late Data
Watermarks are often used to discard late data in streaming systems.
In regulated environments, this is dangerous.
If late data is dropped:
- history becomes incomplete
- PIT reconstruction becomes incorrect
- audit defensibility is compromised
Therefore:
Late data should be processed and repaired, not discarded
If operational constraints require a watermark:
- it must be conservative
- and exceptions must be quarantined, not silently dropped
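A quarantine-not-drop watermark can be sketched as a routing step. This is an illustrative shape under assumed names; the seven-day window is an arbitrary example of a "conservative" setting, not a recommendation.

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(days=7)  # deliberately conservative (example value)

def route(events, batch_time):
    """Split a microbatch into in-window events and quarantined late
    arrivals. Nothing is discarded: quarantined events are preserved
    for explicit repair and remain visible to audit."""
    cutoff = batch_time - WATERMARK
    accepted = [e for e in events if e["source_event_ts"] >= cutoff]
    quarantined = [e for e in events if e["source_event_ts"] < cutoff]
    return accepted, quarantined

events = [
    {"id": 1, "source_event_ts": datetime(2026, 3, 4)},
    {"id": 2, "source_event_ts": datetime(2026, 1, 1)},  # very late
]
accepted, quarantined = route(events, datetime(2026, 3, 5))
# every event ends up somewhere; none are silently lost
assert len(events) == len(accepted) + len(quarantined)
```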
12.4.9.4 Microbatch Design Implications
Using foreachBatch:
- each microbatch is treated as a bounded batch
- full SCD2 rebuild logic can be applied safely
- impacted-key pruning limits scope
This avoids:
- unbounded state in streaming engines
- reliance on streaming-native state stores
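The foreachBatch shape can be illustrated in plain Python, standing in for the real Spark batch function. This is a sketch under heavy simplification: Bronze is a list of dicts, and the "merge" is in-memory; in practice these would be Delta table operations. Only the structure matters: append the bounded batch, then rebuild only the impacted keys.

```python
def rebuild_history(events):
    """Deterministic SCD2 rebuild for one key: order by source-event
    time and chain effective_from/effective_to across versions."""
    ordered = sorted(events, key=lambda e: e["source_event_ts"])
    versions = []
    for i, e in enumerate(ordered):
        nxt = ordered[i + 1]["source_event_ts"] if i + 1 < len(ordered) else None
        versions.append({**e, "effective_to": nxt, "is_current": nxt is None})
    return versions

def process_microbatch(bronze_events, batch):
    """The foreachBatch body in miniature: append the bounded batch,
    then rebuild history only for the business keys it touched
    (impacted-key pruning keeps the rebuild scope bounded)."""
    impacted = {e["business_key"] for e in batch}
    bronze_events.extend(batch)
    return {
        k: rebuild_history([e for e in bronze_events if e["business_key"] == k])
        for k in impacted
    }

bronze = [{"business_key": "C123", "source_event_ts": 2, "status": "Restricted"}]
late = [{"business_key": "C123", "source_event_ts": 1, "status": "Active"}]
result = process_microbatch(bronze, late)
# the late event is inserted into its correct historical position
assert [v["status"] for v in result["C123"]] == ["Active", "Restricted"]
```

Because the rebuild is a pure function of the accumulated events, reprocessing the same microbatch (after deduplication) yields identical history, which is exactly the restart property the next section requires.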
12.4.9.5 Failure and Restart Semantics
On failure:
- the same microbatch may be reprocessed
- events may be duplicated
This must not change outcomes.
The system remains correct if:
- ingestion is append-only
- deduplication is deterministic
- rebuild logic is idempotent
12.4.9.6 What This Substantiates
With this approach, the platform can prove:
- that no data was silently dropped
- that reprocessing yields identical results
- that late-arriving events are correctly incorporated
This is essential for:
regulator confidence in operational resilience and data integrity
12.5 Cross-Compatibility of Bronze Ingestion Patterns (How These Pieces Actually Fit Together)
The preceding sections are not alternative implementations to choose between arbitrarily. They are composable patterns with clear roles, and understanding how they interact is critical to building a platform that remains coherent under real-world conditions.
This section clarifies how:
- source-time SCD2 (12.1)
- fallback ingestion (12.2)
- source rejection (12.3)
- CDC microbatch ingestion (12.4)
fit together as a single operating model, not four competing designs.
12.5.1 The Compatibility Matrix
At a high level, the patterns relate as follows:
| Pattern | 12.1 Source-Time SCD2 | 12.2 No Source-Time | 12.3 Reject Source | 12.4 CDC Microbatch |
|---|---|---|---|---|
| 12.1 Source-Time SCD2 | Native model | Fallback downgrade | Protected by rejection | Fully compatible |
| 12.2 No Source-Time | Degraded alternative | Native model | Often leads to rejection | Only if CDC lacks time |
| 12.3 Reject Source | Protects integrity | Prevents weak fallback misuse | Native control | Applied before ingestion |
| 12.4 CDC Microbatch | Ideal pairing | Degraded fallback | Must be validated first | Native delivery pattern |
The asymmetry in this matrix is deliberate. It reflects a hierarchy:
Some patterns define truth, some define delivery, and some define control.
12.5.2 The Roles Are Not Equal
Each section plays a fundamentally different role in the architecture:
| Section | Role | Responsibility |
|---|---|---|
| 12.1 | Truth engine | Constructs temporally ordered, regulator-defensible history |
| 12.2 | Fallback engine | Preserves evidence when source-time is unavailable |
| 12.3 | Governance gate | Prevents non-defensible data entering Bronze |
| 12.4 | Delivery mechanism | Ingests change events into the platform |
This distinction is critical.
A common failure mode is to treat these as interchangeable ingestion strategies. They are not.
12.5.3 The Canonical Flow (What “Good” Looks Like)
The intended operating model is:
CDC / batch / files (source systems)
↓
Validation & rejection (12.3)
↓
Microbatch ingestion (12.4)
↓
Source-time SCD2 reconstruction (12.1)
↓
Bronze (evidence store)
↓
Silver (precedence + belief resolution)
This is the “land it early, manage it early” pipeline in full.
12.5.4 The Critical Pairing: 12.1 + 12.4
The most important combination in this entire section is:
CDC ingestion (12.4) feeding source-time SCD2 reconstruction (12.1)
These two patterns solve different problems:
| Concern | Solved By |
|---|---|
| Continuous ingestion | CDC (12.4) |
| Ordering, repair, and history | SCD2 with source time (12.1) |
Together, they enable:
- out-of-sequence event repair
- idempotent replay of CDC streams
- safe handling of partial updates
- correct modelling of deletes as state transitions
- full auditability of all assertions
Without this pairing:
- CDC alone becomes an overwrite stream
- SCD2 alone becomes a batch-only abstraction detached from reality
12.5.5 Where 12.2 Fits (And Why It Must Be Contained)
The no-source-time pattern (12.2) exists for one reason:
Some sources do not provide reliable temporal information.
It allows ingestion to proceed, but at a cost:
- ordering becomes ingestion-time, not event-time
- point-in-time reconstruction becomes weaker
- audit defensibility is reduced
This is acceptable only if:
- it is explicitly declared
- it is isolated by domain or dataset
- downstream consumers understand the limitation
It must never silently coexist with 12.1 for the same entity without clear boundaries.
12.5.6 Where 12.3 Sits (And Why It Matters)
The rejection pattern (12.3) is not optional governance overhead.
It is what prevents the platform from becoming non-defensible.
It is applied:
- before ingestion
- before SCD2 logic
- before Bronze persistence
Typical rejection triggers include:
- no source-time and no acceptable fallback justification
- ambiguous partial snapshots
- unstable or missing business keys
- undefined delete semantics
Without this control:
Bronze becomes a mixture of incompatible temporal models, and cannot be trusted.
12.5.7 CDC Is Not a Data Model
A subtle but critical point:
CDC (12.4) is a delivery pattern, not a truth model.
CDC provides:
- change events
- operation types (insert/update/delete)
- often partial payloads
It does not provide:
- correct ordering under delay
- complete state reconstruction
- conflict resolution
- precedence handling
Those responsibilities remain entirely within:
12.1 (and 12.2 where unavoidable)
Treating CDC as “the truth” leads directly to:
- overwrite-based pipelines
- loss of historical context
- non-reproducible state
12.5.8 What This Enables
When these patterns are combined correctly, the platform achieves:
- deterministic replay — the same inputs always produce the same history
- out-of-sequence tolerance — late data is repaired, not ignored
- multi-source coexistence — conflicting assertions are preserved, not collapsed
- audit-grade lineage — every state can be traced to its originating assertion
- simplified Silver — belief resolution becomes a query problem, not an ingestion problem
12.5.9 The Core Principle
This entire section reduces to one architectural truth:
Separate how data arrives from how truth is constructed.
- CDC, batch, and files describe arrival
- SCD2 with temporal control describes truth
- precedence and Silver describe belief
If those concerns are mixed, the system becomes fragile.
If they are separated cleanly:
even complex, multi-source, out-of-sequence data becomes explainable and controllable.
12.5.10 Practical Guidance
In implementation terms:
- Default every new ingestion to 12.4 → 12.1
- Use 12.3 aggressively to protect quality
- Allow 12.2 only where explicitly justified
- Never resolve precedence or delete conflicts in Bronze
- Always treat Bronze as an append-only evidence store
12.5.11 The Outcome
When these patterns are used together correctly:
- Bronze becomes a complete, ordered record of assertions
- Silver becomes a deterministic projection of belief
- and the platform as a whole becomes:
regulator-defensible, operationally resilient, and evolution-friendly
12.6 Silver Survivorship Patterns (Attribute-Level Belief Resolution)
Bronze preserves all assertions.
Silver determines what to believe.
In multi-source environments, this is not a simple “latest row wins” problem. Different sources may be authoritative for different attributes, and precedence may vary by domain.
12.6.1 The Core Principle
Belief is resolved per attribute (or attribute group), not per row.
A single “winning row” is often insufficient, because:
- one source may provide the most authoritative status
- another may provide the most recent address
- a third may control lifecycle or deletion
Silver must compose belief from multiple assertions.
12.6.2 Attribute-Level Precedence
Define precedence rules at an attribute or domain level:
| Attribute Group | Authoritative Source | Rule |
|---|---|---|
| status | CORE | highest precedence wins |
| address | CRM | most recent wins |
| deletion | CORE | only CORE can delete |
These rules are:
- explicit
- versioned
- auditable
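One possible shape for such rules is a versioned configuration object. This sketch mirrors the table above; the structure and names are assumptions for illustration, not a mandated schema.

```python
# Explicit, versioned survivorship rules for the customer entity.
# Strategies and source names follow the table above.
SURVIVORSHIP_RULES = {
    "version": "2026-03-01",  # rules themselves are versioned and auditable
    "entity": "customer",
    "attribute_groups": {
        "status":   {"strategy": "precedence", "order": ["CORE", "CRM"]},
        "address":  {"strategy": "recency"},
        "deletion": {"strategy": "authority",  "authorised": ["CORE"]},
    },
}

def rule_for(group: str, rules=SURVIVORSHIP_RULES) -> dict:
    """Look up the resolution rule for an attribute group, so precedence
    decisions are driven by declared configuration, not pipeline code."""
    return rules["attribute_groups"][group]

print(rule_for("status")["strategy"])  # precedence
```

Keeping the rules in declared, versioned configuration means every Silver resolution can cite which rule version it applied, which is what makes precedence decisions explainable after the fact.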
12.6.3 Survivorship Pattern
For each attribute group:
- filter Bronze rows valid at as_at
- rank by:
  - precedence (if applicable)
  - source_event_ts
- select the top-ranked assertion
This produces a resolved view per attribute.
12.6.4 Example Pattern (Conceptual SQL)
WITH candidate_rows AS (
SELECT *
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
),
status_ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn
FROM candidate_rows
WHERE status IS NOT NULL
),
address_ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY source_event_ts DESC
) AS rn
FROM candidate_rows
WHERE address IS NOT NULL
)
SELECT
s.customer_id,
s.status,
a.address,
s.is_deleted
FROM status_ranked s
JOIN address_ranked a
ON s.customer_id = a.customer_id
WHERE s.rn = 1
AND a.rn = 1
This demonstrates:
- precedence applied to status
- recency applied to address
- composition into a single belief
12.6.5 Handling Deletes in Survivorship
Deletes must follow explicit authority rules:
- if an authoritative source asserts deletion → entity is deleted
- if a non-authoritative source asserts deletion → treated as an assertion, not a fact
This prevents:
- accidental data loss
- incorrect lifecycle transitions
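The authority rule maps directly to the Case A / Case B outcomes of the worked example. This is a hedged sketch with illustrative names; the assertion shape is an assumption.

```python
def delete_is_adopted(delete_assertions, authorised_sources):
    """A delete becomes belief only if asserted by an authorised source.
    Delete assertions from other sources remain Bronze evidence, never
    Silver belief."""
    return any(a["source"] in authorised_sources for a in delete_assertions)

crm_delete = [{"source": "CRM", "is_deleted": True}]

# Case A: CRM is authoritative for deletion in this domain
assert delete_is_adopted(crm_delete, {"CRM", "CORE"}) is True
# Case B: only CORE may delete; the CRM assertion is preserved, not believed
assert delete_is_adopted(crm_delete, {"CORE"}) is False
```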
12.6.6 Overlapping Assertions
Different attributes may originate from different timestamps.
This is valid.
Silver does not require:
- a single “coherent row” from one source
It constructs:
the most defensible belief at a given point in time
12.6.7 What This Substantiates
With proper survivorship:
- every attribute can be traced to its originating assertion
- precedence decisions are explainable and reproducible
- belief can be reconstructed at any point in time
This is the final step that turns:
Bronze evidence → institutional belief
12.6.8 The Outcome
When Bronze and Silver are implemented together correctly:
- Bronze absorbs complexity
- Silver expresses clarity
And the platform achieves:
truth preservation without operational or analytical friction
13. What This Enables in Silver (And Why Silver Gets Simpler)
When Bronze is implemented correctly, Silver stops being a place where complexity is introduced and becomes a place where complexity is resolved cleanly. Most of the hard problems — ordering, deduplication, replay safety, provenance — are already handled upstream.
13.1 Clean Current-State Views
Silver can derive current state using a simple, stable rule:
WHERE is_current = true
AND is_deleted = false
Because Bronze guarantees:
- one valid current version per business key
- correct ordering based on source time
- no duplicate versions from replay or concurrency
There is no need for:
- complex windowing logic
- “latest ingestion wins” hacks
- defensive deduplication
13.2 Point-in-Time (PIT) Views
Silver can reconstruct belief at any time using:
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
Because Bronze:
- preserves full version history
- orders by source-event time
- repairs late-arriving data
This enables:
- regulatory reporting (“what did we believe on date X?”)
- investigation replay
- model validation against historical context
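The PIT predicate above can be exercised end-to-end against an in-memory database. This sketch uses SQLite purely for illustration; the table and column names follow the article's bronze_customer examples, and timestamps are stored as sortable text.

```python
import sqlite3

# Minimal bronze_customer shaped like the article's worked example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_customer (
        customer_id TEXT, status TEXT,
        effective_from TEXT, effective_to TEXT
    )
""")
conn.executemany(
    "INSERT INTO bronze_customer VALUES (?, ?, ?, ?)",
    [
        ("C123", "Active",     "2026-03-01 09:00", "2026-03-02 10:00"),
        ("C123", "Restricted", "2026-03-02 10:00", "2026-03-02 18:00"),
        ("C123", "Active",     "2026-03-02 18:00", None),
    ],
)

def status_as_at(as_at: str):
    """Apply the PIT predicate: the version whose validity interval
    contains as_at (open-ended intervals have effective_to IS NULL)."""
    row = conn.execute(
        """
        SELECT status FROM bronze_customer
        WHERE effective_from <= :as_at
          AND (effective_to > :as_at OR effective_to IS NULL)
        """,
        {"as_at": as_at},
    ).fetchone()
    return row[0] if row else None

print(status_as_at("2026-03-02 12:00"))  # Restricted
print(status_as_at("2026-03-03 09:00"))  # Active
```

Because validity intervals are half-open (inclusive start, exclusive end), exactly one version matches any as_at, which is what makes PIT queries deterministic.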
13.3 Regulator Defensibility
Silver becomes defensible because Bronze has retained:
- full provenance (source, file, run)
- complete version chains
- explicit deletion semantics
- conflict visibility across sources
This allows you to answer:
- What data did you have?
- Where did it come from?
- When did you receive it?
- Why did you believe what you believed?
— all without reconstructing logic from scratch.
13.4 “Truth Without Pain”
Most importantly, Silver becomes easy to use:
- analysts query clean current-state tables
- data scientists retrieve PIT datasets without custom logic
- downstream systems consume consistent, stable views
The complexity of:
- multi-source conflict
- temporal repair
- ingestion noise
is absorbed entirely by Bronze.
Silver delivers “truth without pain” because Bronze has already done the painful work correctly.
14. Expected Substantiation (Regulator / Audit Lens)
A regulator does not care how elegant your architecture is. They care what you can prove, under pressure, with evidence.
This section defines what this design substantiates — and where its limits are.
14.1 What You Can Prove
With this Bronze design, you can demonstrate:
1. Full Lineage of Belief
- Every version of every record
- When it was first seen and last seen
- Which source asserted it
2. Point-in-Time Reconstruction
- What was believed at any given time
- How that belief changed over time
3. Source Provenance
- Which system, file, or pipeline produced the data
- Which ingestion run processed it
4. Conflict Visibility
- Multiple competing assertions from different sources
- Evidence of how conflicts were resolved (in Silver)
5. Replay & Processing Behaviour
- Duplicate arrivals
- Reprocessing events
- Pipeline execution context
This is the foundation of:
- s166 responses
- Consumer Duty investigations
- model validation
- audit traceability
14.2 What You Cannot Prove (Without Source-Time)
If source_event_ts is missing, unreliable, or inconsistent, there are hard limits.
You cannot definitively prove:
- the true chronological order of events at the source
- what was actually believed by the source at a given time
- whether a late-arriving record represents correction or delay
- whether a full snapshot reflects a newer or older state
In this case, you can only prove:
What the platform knew, when it knew it
— not what the business should have known.
This distinction is critical in regulatory scenarios.
14.3 What Evidence Is Retained
The Bronze model retains three classes of evidence:
1. Data Evidence
- all attribute values over time
- deletion states
- version history
2. Temporal Evidence
- source-event timestamps (where available)
- ingestion timestamps
- effective dating
3. Provenance Evidence
- source system
- source path / file
- ingestion run
- duplicate/replay observation (first_seen_ts, last_seen_ts)
Together, these form an audit-grade evidentiary chain.
14.4 The Core Principle
From an audit perspective, the design ensures:
You are not claiming truth — you are proving what was believed, when, and why.
That is the standard regulators actually apply.
And this is why Bronze must be built this way — because once the evidence is lost, it cannot be reconstructed later.
15. Summary: “Land It Early, Manage It Early” Applied to Precedence
Precedence, ordering, deletion, and replay are not downstream concerns — they are ingestion-time responsibilities. If they are deferred, they become harder, more ambiguous, and often impossible to resolve correctly.
By applying “land it early, manage it early” at the Bronze layer, the platform:
- captures all competing assertions, not just the chosen one
- preserves temporal integrity, even under out-of-sequence delivery
- enables deterministic precedence resolution, rather than ad hoc fixes
- produces a stable, explainable is_current, immune to ingestion noise
The result is simple but powerful:
Bronze holds evidence. Silver expresses belief. Everything is explainable.
Part of the “land it early, manage it early” series on SCD2-driven Bronze architectures for regulated Financial Services.