This article defines how to implement SCD2 in the Bronze layer to safely handle multi-source precedence, out-of-sequence data, partial and full loads, deletions, and transaction patterns in regulated Financial Services. It introduces a metadata-driven approach that preserves temporal truth, prevents ingestion-order corruption, and enables deterministic is_current. The result is a defensible, replayable foundation that simplifies downstream Silver layers and supports point-in-time reconstruction under audit.
Abstract
Modern Financial Services data platforms must reconcile conflicting assertions from multiple systems, often arriving out of sequence, partially, or concurrently. Traditional SCD2 implementations fail under these conditions, resulting in incorrect current-state views and limited auditability.
This article presents a Bronze-layer SCD2 pattern that is resilient to ingestion order, supports explicit source precedence, and models deletions and transactions correctly. By anchoring all logic in event-time and structured metadata, it ensures that institutional belief remains reconstructable.
The approach enables deterministic is_current, prevents version explosion, and creates a clean, governed foundation for Silver-layer consumption.
It operationalises the principle of “land it early, manage it early” for complex, regulated data environments.
Series Overview
This article is part of a follow-on series to the original “land it early, manage it early” work on SCD2 Bronze architectures in UK Financial Services.
The original series focused on the principles of temporal data platforms and how to store, govern, and reconstruct truth over time using SCD2 in the Bronze/temporal tracking layer.
This series moves into a different space: what happens when you actually build and run these systems.
In practice, real platforms must deal with:
- multiple competing sources of authority
- out-of-sequence and late-arriving data
- partial updates and deletions
- operational systems writing outside the platform
These are not edge cases; they are the normal operating conditions.
This article focuses on one of the hardest of these problems: maintaining deterministic temporal truth when data arrives out of order and from multiple sources.
Contents
- Abstract
- Series Overview
- Contents
- 1. Introduction: Why This Problem Exists (And Why It’s Not “Just SCD2”)
- 2. The Core Principle: Bronze Stores Evidence, Not Truth
- 3. The Three Families of Metadata You Must Capture
- 4. The Two Operating Modes (This is the Big Fork)
- 5. Bronze SCD2 Row Model (Schema + Meaning)
- 6. Hashing Strategy: Where the Hash Goes and Why
- 7. Handling Out-of-Sequence Events Correctly
- 8. Full Loads, Partial Loads, and “Over-the-Top” Snapshots
- 9. Deletions: What They Mean and How to Model Them
- 10. Precedence: How Multi-Source Truth Is Resolved
- 11. Transactions vs State: Stop Treating Events Like Dimensions
- 12. Reference PySpark Implementation (Bronze Writer)
- 12.1 Preferred Pattern — Bronze with Source Time
- 12.1.1 Inputs and Assumptions
- 12.1.2 Dedup + Ordering
- 12.1.3 Partial-Load Merge Logic
- 12.1.4 SCD2 Merge Logic
- 12.1.5 Full PySpark Reference Implementation
- 12.1.6 Handling Late Events
- 12.1.7 Handling Deletes
- 12.1.8 Producing is_current Safely
- 12.1.9 What This Preferred Pattern Achieves
- 12.1.10 Production Notes
- 12.2 Fallback Pattern — Bronze without Source Time
- 12.3 When You Must Reject a Source as Non-Defensible
- 12.4 CDC Microbatch Ingestion Pattern (State + Transaction Handling)
- 12.4.1 Inputs and Assumptions
- 12.4.2 CDC Op Mapping Rules
- 12.4.3 Microbatch Processing Flow
- 12.4.4 Reference PySpark Implementation
- 12.4.5 Why This Fits Regulated Financial Services
- 12.4.5.1 Full evidentiary capture
- 12.4.5.2 Out-of-sequence safety
- 12.4.5.3 Replay and idempotency
- 12.4.5.4 Correct handling of partial change
- 12.4.5.5 Separation of fact vs belief
- 12.4.5.6 Limitations and Caveats
- 12.4.5.6.1 Source-time quality is critical
- 12.4.5.6.2 Ordering guarantees vary by source
- 12.4.5.6.3 High-volume considerations
- 12.4.5.6.4 Deletes must be explicit
- 12.4.5.6.5 The Core Principle
- 12.4.6 Worked CDC Example: Insert, Partial Update, Delete, and Out-of-Order Arrival
- 12.4.7 Worked Example: Multi-Source CDC Conflict with Precedence
- 12.4.7.1 Scenario
- 12.4.7.2 The Source Assertions
- 12.4.7.3 Step 1: Bronze Stores All Assertions
- 12.4.7.4 Step 2: Bronze is NOT “the truth”
- 12.4.7.5 Step 3: Silver Applies Precedence
- 12.4.7.6 Step 4: Why This Works
- 12.4.7.7 Step 5: What NOT to Do
- 12.4.7.8 Step 6: Example Silver Query Pattern
- 12.4.7.9 The Core Insight
- 12.4.7.10 Final Takeaway
- 12.4.8 Worked Example: CDC + Partial + Precedence + Delete Across Two Sources
- 12.4.8.1 Scenario
- 12.4.8.2 The CDC Events
- 12.4.8.3 Step 1: Bronze Stores All Assertions in Source-Time Order
- 12.4.8.4 Step 2: Bronze SCD2 History
- 12.4.8.5 Step 3: What Silver Believes at Different Times
- 12.4.8.6 Step 4: Why This Example Matters
- 12.4.8.7 Step 5: Example Silver Resolution Shape
- 12.4.8.8 The Core Lesson
- 12.4.8.9 Final Takeaway
- 12.4.9 Streaming Reality: Watermarks, Replay, and Exactly-Once Illusions
- 12.5 Cross-Compatibility of Bronze Ingestion Patterns (How These Pieces Actually Fit Together)
- 12.5.1 The Compatibility Matrix
- 12.5.2 The Roles Are Not Equal
- 12.5.3 The Canonical Flow (What “Good” Looks Like)
- 12.5.4 The Critical Pairing: 12.1 + 12.4
- 12.5.5 Where 12.2 Fits (And Why It Must Be Contained)
- 12.5.6 Where 12.3 Sits (And Why It Matters)
- 12.5.7 CDC Is Not a Data Model
- 12.5.8 What This Enables
- 12.5.9 The Core Principle
- 12.5.10 Practical Guidance
- 12.5.11 The Outcome
- 12.6 Silver Survivorship Patterns (Attribute-Level Belief Resolution)
- 13. What This Enables in Silver (And Why Silver Gets Simpler)
- 14. Expected Substantiation (Regulator / Audit Lens)
- 15. Summary: “Land It Early, Manage It Early” Applied to Precedence
1. Introduction: Why This Problem Exists (And Why It’s Not “Just SCD2”)
At first glance, this looks like a standard SCD2 problem: track changes over time, maintain history, and derive a current view. In practice, regulated Financial Services platforms face a far more complex reality.
Most failed SCD2 implementations are not wrong because of SQL — they are wrong because they confuse ingestion order with truth.
1.1 Multi-Source Truth
The same entity is asserted by multiple systems — CRM, core banking, onboarding, risk, external providers — each with its own timing, completeness, and authority. These are not clean updates to a single record; they are competing versions of belief.
1.2 Out-of-Sequence Delivery
Data does not arrive in the order it was created. Late events, replays, corrections, and batch delays mean older states can arrive after newer ones. If ordering is based on ingestion, history becomes incorrect.
1.3 ADF Concurrency Storms
When ingestion frameworks (e.g. ADF) process files in parallel, thousands of records for the same entity can arrive simultaneously and out of order. Without deterministic logic, this creates duplicate versions, broken timelines, and incorrect is_current.
1.4 Regulator Expectations (PIT & Belief Reconstruction)
Regulators do not ask “what is true now?” — they ask:
What did you believe at the time, and why?
This requires:
- Point-in-time (PIT) reconstruction
- Evidence of source and authority
- A defensible timeline of changes
Standard SCD2 implementations — especially those based on arrival order — cannot meet this requirement.
2. The Core Principle: Bronze Stores Evidence, Not Truth
To solve this, the architecture must shift from “data processing” to evidence preservation.
2.1 Bronze = All Assertions
The Bronze layer captures:
- Every version received
- From every source
- With full provenance
- Without overwriting or collapsing
It answers:
What did we receive, when, and from whom?
It does not decide which version is “correct”.
2.2 Silver = Resolved Belief
The Silver layer applies:
- Precedence rules
- Business logic
- Data quality constraints
It answers:
Given all available evidence, what do we believe now (or at time X)?
2.3 Why “Overwrite” Is a Compliance Failure
Overwriting data at ingestion destroys the ability to:
- Reconstruct prior belief
- Explain conflicting inputs
- Demonstrate decision lineage
It replaces:
“We received two conflicting assertions and chose one”
with:
“We only ever had one version”
That is not a technical shortcut — it is a loss of evidence.
In regulated environments, that loss is indefensible.
This is why Bronze must be designed as an append-only, temporally ordered evidence store, not a cleaned or corrected dataset.
3. The Three Families of Metadata You Must Capture
To make Bronze SCD2 deterministic, replayable, and regulator-defensible, metadata must be structured deliberately. Three distinct families are required. Mixing them leads to ambiguity, broken timelines, and unexplainable outcomes.
3.1 Provenance / Source Identity
This metadata answers:
Where did this assertion come from, and under what context?
Required fields:
- `source_system` — the originating system (e.g. CRM, Core Banking)
- `source_type` — file, API, stream, database extract
- `source_path` — file path, topic, or table identifier
- `source_file` — file name where applicable
- `ingest_run_id` — pipeline execution identifier
- `ingest_ts` — when the platform received the data
Why it matters:
- Enables precedence rules across systems
- Supports audit and lineage (“which file said this?”)
- Allows reconstruction of ingestion conditions (e.g. ADF batch behaviour)
- Distinguishes duplicate storms from genuine updates
Without provenance, all records look the same — and conflicts cannot be explained.
3.2 Temporal Control Fields (Event vs Ingestion)
This metadata defines when something was true, versus when it was seen.
Required fields:
- `source_event_ts` (or equivalent) — when the source believed this state was valid
- `effective_from` — start of validity (usually = `source_event_ts`)
- `effective_to` — end of validity
- `is_current` — derived (`effective_to IS NULL`)
- `first_seen_ts` — first ingestion time
- `last_seen_ts` — most recent ingestion time
Why it matters:
- Separates truth timeline from processing timeline
- Allows correction of out-of-sequence arrival
- Enables point-in-time reconstruction
- Protects against ingestion-order corruption
Critical distinction:
- `source_event_ts` → governs truth
- `ingest_ts` → governs observability
Confusing the two is the most common cause of broken SCD2 implementations.
3.3 Keys & Hashes (Business Key, Surrogate, Attribute Hash)
This metadata defines identity and change detection.
Required fields:
- `business_key (bk)` — natural key identifying the entity
- `surrogate_key (sk)` — stable internal identifier (usually a hash of `bk`)
- `attr_hash` — hash of mutable attributes
Optional but useful:
- `record_hash` — full row hash (including metadata)
Why it matters:
- `bk` groups all assertions for the same entity
- `sk` stabilises joins and downstream models
- `attr_hash` determines whether a change is real
Hashing rules:
- Hash only mutable attributes
- Exclude metadata and timestamps
- Use consistent null handling
This enables:
- Idempotent ingestion
- Duplicate collapse (ADF storms)
- Prevention of version explosion
Without hashing, every arrival risks becoming a new version.
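The hashing rules above can be sketched in plain Python. This is illustrative only: the attribute names, null sentinel, and delimiter are assumptions, not part of any reference implementation.

```python
import hashlib

# Hypothetical mutable business attributes; metadata and timestamps excluded.
HASHED_ATTRS = ["customer_name", "status", "risk_rating"]

def attr_hash(row: dict) -> str:
    """Deterministic hash of mutable attributes with consistent null handling."""
    parts = []
    for col in HASHED_ATTRS:
        value = row.get(col)
        # A missing value and an explicit None hash to the same sentinel;
        # strings are trimmed so cosmetic whitespace is not a "change".
        parts.append("~NULL~" if value is None else str(value).strip())
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()
```

Because ingestion metadata never enters the hash, replaying the same logical state always produces the same `attr_hash`, which is what makes duplicate collapse and idempotent ingestion possible.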
4. The Two Operating Modes (This is the Big Fork)
All Bronze SCD2 implementations fall into one of two categories. The difference is not technical — it is epistemological.
The fork between source-time and ingestion-time is not a technical decision — it is a statement about what the platform claims it knows.
4.1 Mode A — You Have Source Time
You possess a reliable field such as:
- event timestamp
- effective date
- extract “as-of” timestamp
- source change sequence
This allows you to construct:
Source-time belief
Characteristics:
- Ordering is based on when the source asserted the state
- Late-arriving data can be inserted into the correct historical position
- Full and partial loads can be reconciled correctly
- `is_current` is deterministic and stable
- Point-in-time reconstruction is accurate
This is the required mode for regulated Financial Services.
4.2 Mode B — You Don’t (Ingestion-Time Belief)
No reliable source-time exists, or it is incomplete, inconsistent, or absent.
You are forced to use:
`ingest_ts` as ordering
This creates:
Ingestion-time belief
Characteristics:
- Ordering reflects when data arrived, not when it was true
- Late data cannot be correctly placed in history
- Full loads arriving late may overwrite newer belief
- `is_current` is dependent on arrival order
- Point-in-time reconstruction reflects platform behaviour, not source reality
This mode can still be useful for:
- operational reporting
- exploratory analytics
- non-regulated use cases
But it is not regulator-defensible for:
- audit
- financial reporting
- decision reconstruction
4.3 The Practical Implication
Before writing a single line of SCD2 logic, you must answer:
Do we have reliable source-time?
If yes → build a temporal system of record
If no → build an ingestion-history system, and be explicit about its limits
Everything else — precedence, deletions, out-of-sequence handling — depends on this decision.
5. Bronze SCD2 Row Model (Schema + Meaning)
A Bronze SCD2 row is not just a record of data. It is a record of institutional belief, source provenance, temporal validity, and ingestion evidence. If the row model is incomplete, the platform may still run — but it will not be able to explain itself under scrutiny.
The row model therefore needs to capture four things at once:
- What entity this row refers to
- What was believed
- When that belief was valid
- Where that belief came from
5.1 Required Columns
These are the minimum columns required for a regulator-defensible Bronze SCD2 implementation.
| Column | Meaning | What it substantiates |
|---|---|---|
| `bk` | Canonical business key | Which real-world entity the row relates to |
| `sk` | Stable surrogate key | Internal identity continuity and downstream joins |
| `source_system` | Originating system | Which authority asserted the value |
| `source_type` | File / API / stream / table | Nature of the source |
| `source_path` | File path, topic, or table identifier | Exact provenance of the assertion |
| `ingest_ts` | Time received by platform | When the platform first observed the data |
| `source_event_ts` | Time asserted by source | When the source believed the state was valid |
| `effective_from` | Start of validity | When this version began to apply |
| `effective_to` | End of validity | When this version ceased to apply |
| `is_current` | Current active version flag | Which version is currently effective |
| `attr_hash` | Hash of mutable business attributes | Whether this is genuinely a new version |
| `is_deleted` | Deletion marker | Whether this version represents a logical delete |
These columns are enough to support:
- Version ordering
- Out-of-sequence repair
- Duplicate suppression
- Point-in-time reconstruction
- Logical deletions
- Downstream Silver derivation
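As a sketch, the required columns can be expressed as a plain Python dataclass. This is a shape illustration only (not the reference PySpark schema); here `is_current` is modelled as derived from `effective_to`, consistent with the metadata definition in section 3.2.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BronzeScd2Row:
    """Illustrative shape of the required Bronze SCD2 columns."""
    # Identity
    bk: str                            # canonical business key
    sk: str                            # stable surrogate key (e.g. hash of bk)
    # Provenance
    source_system: str
    source_type: str
    source_path: str
    ingest_ts: datetime
    # Temporal
    source_event_ts: datetime
    effective_from: datetime
    effective_to: Optional[datetime] = None   # None while the version is open
    # Change & control
    attr_hash: str = ""
    is_deleted: bool = False

    @property
    def is_current(self) -> bool:
        # Derived, not asserted: open-ended validity means current.
        return self.effective_to is None
```

Deriving `is_current` rather than trusting a stored flag is one way to keep it deterministic under replay; a materialised column recomputed on every merge achieves the same effect.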
5.2 Recommended Columns
These are not strictly required for SCD2 to function, but they materially improve replayability, auditability, and operational safety.
| Column | Meaning | What it substantiates |
|---|---|---|
| `source_file` | Specific file name | Which artefact contained the record |
| `ingest_run_id` | Pipeline execution ID | Which orchestration run loaded it |
| `first_seen_ts` | First ingestion timestamp | When the platform first encountered this exact version |
| `last_seen_ts` | Most recent ingestion timestamp | Whether it has been replayed or resent |
| `precedence_rank` | Source priority for belief resolution | Why one source wins over another |
| `load_type` | FULL / PARTIAL | Whether missing fields are meaningful |
| `record_hash` | Hash of full row including metadata | Row-level replay and forensic deduplication |
| `entity_type` | STATE / TRANSACTION | Which processing rules apply |
These columns allow the platform to distinguish:
- identical facts replayed multiple times
- genuine change versus duplicate arrival
- partial updates versus full replacement
- source precedence versus temporal order
- state rows versus immutable transactions
5.3 What Each Group of Columns Is For
To keep the design readable, it helps to think of the row model in grouped families.
5.3.1 Identity Columns
- `bk`
- `sk`
- `entity_type`
These answer:
What thing are we talking about, and what kind of thing is it?
5.3.2 Provenance Columns
- `source_system`
- `source_type`
- `source_path`
- `source_file`
- `ingest_run_id`
These answer:
Where did this assertion come from, and how did it enter the platform?
5.3.3 Temporal Columns
- `source_event_ts`
- `ingest_ts`
- `effective_from`
- `effective_to`
- `is_current`
- `first_seen_ts`
- `last_seen_ts`
These answer:
When was this believed, when was it observed, and is it still current?
5.3.4 Change & Control Columns
- `attr_hash`
- `record_hash`
- `is_deleted`
- `precedence_rank`
- `load_type`
These answer:
Has anything meaningful changed, does this row represent deletion, and how should it be resolved against competing assertions?
6. Hashing Strategy: Where the Hash Goes and Why
Hashing is one of the most important controls in Bronze SCD2. Without it, every replay, duplicate, or noisy source update risks being treated as a new version. With it, the platform can distinguish between:
- a genuine state change
- a replay of the same state
- a duplicate caused by concurrent ingestion
- an unchanged row seen again later
Hashing does not create truth. It creates deterministic change detection.
6.1 Attribute Hash vs Record Hash
These serve different purposes and should not be confused.
6.1.1 Attribute Hash (attr_hash)
This is the hash of mutable business attributes only.
Example inputs:
- customer name
- status
- product type
- risk rating
- account state
It must exclude:
- ingestion metadata
- source file name
- load timestamp
- pipeline run id
- technical audit columns
Why:
Because it is used to answer:
Has the business state actually changed?
If attr_hash stays the same, the record is the same belief version even if it arrived ten times.
6.1.2 Record Hash (record_hash)
This is the hash of the entire row payload, often including metadata.
It may include:
- business attributes
- provenance fields
- load metadata
Why:
Because it is used to answer:
Is this exactly the same row we have seen before?
This is helpful for:
- forensic deduplication
- replay traceability
- comparing source artefacts under investigation
In most Bronze SCD2 logic, `attr_hash` drives versioning; `record_hash` supports evidence handling and replay diagnostics.
6.2 Change Detection
The core SCD2 rule is:
A new version is created only when the business key is the same and the `attr_hash` has changed.
This prevents version spam caused by:
- duplicate file processing
- repeated full extracts with identical content
- resubmitted rows from external systems
- concurrent ADF file opening
Typical rule:
- same `bk`
- same `source_event_ts`
- same `attr_hash`

→ same version
If attr_hash differs, that is a candidate new version.
This is why the hash must be applied after standardisation and before SCD2 comparison.
6.3 Idempotency Under Replay
Replay is inevitable in regulated platforms:
- failed pipelines rerun
- source systems resend files
- backfills occur
- out-of-sequence data arrives late
The Bronze process must therefore be idempotent.
Hashing enables idempotency by ensuring that replaying the same logical state does not create a new version.
This means:
- if the row is replayed unchanged → update `last_seen_ts`, do not create a version
- if the row reappears with a different hash → create or repair a version as appropriate
- if a file storm causes thousands of copies → collapse them to one version, preserve evidence in provenance metadata
Without hashing, replay becomes corruption.
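The replay rules above can be reduced to a small in-memory sketch. The dictionary store and field names are illustrative (this is not the reference PySpark writer); the point is that the apply operation is keyed on logical state, so re-running it is harmless.

```python
from datetime import datetime

def apply_assertion(store: dict, incoming: dict, now: datetime) -> str:
    """Idempotent Bronze apply, keyed by (bk, source_event_ts, attr_hash).

    Replaying the same logical state only refreshes last_seen_ts;
    a changed attr_hash surfaces as a candidate new version.
    """
    key = (incoming["bk"], incoming["source_event_ts"], incoming["attr_hash"])
    if key in store:
        store[key]["last_seen_ts"] = now   # replay or duplicate storm copy
        return "replayed"
    store[key] = {**incoming, "first_seen_ts": now, "last_seen_ts": now}
    return "new_version"
```

Running the same file twice, or a thousand concurrent copies of the same row, converges on a single stored version with an updated `last_seen_ts` — the evidence of the replay survives, the version chain does not grow.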
6.4 Where the Hash Should Be Applied
Hashing should be applied:
- After canonicalisation / normalisation
  - trim strings
  - standardise nulls
  - align data types
- Before SCD2 grouping and merge logic
  - so equality is determined on business state, not arrival conditions
- Separately for state and transaction handling
  - state uses `attr_hash` for versioning
  - transactions may use a transaction identifier or event hash for deduplication, but do not become SCD2 versions
This ensures the platform can:
- detect meaningful change
- remain replay-safe
- keep `is_current` stable under concurrency and reprocessing
6.5 Why Hashing Matters So Much Here
In a simple warehouse, hashing is often treated as a performance trick.
In Bronze SCD2 for regulated Financial Services, it is more than that. It is the mechanism that protects the platform from:
- accidental version explosion
- hidden duplicate storms
- replay corruption
- arrival-order bias
Put simply:
Hashing is how the platform distinguishes repeated evidence from changed belief.
6.6 Schema Evolution: Preserving History While the Model Changes
Schema evolution is inevitable in regulated Financial Services. New attributes appear (e.g. KYC enhancements), existing ones change meaning, and some are deprecated. The challenge is not accommodating change — it is doing so without corrupting historical truth.
The key principle is:
A schema change alters what can be asserted going forward — it must not rewrite what was previously known.
6.6.1 The Core Problem
If schema evolution is handled incorrectly, the platform will:
- recompute historical hashes using new columns
- overwrite or reinterpret prior states
- introduce false change events
- break point-in-time (PIT) reconstruction
This results in a system that cannot explain its own history, which is unacceptable under regulatory scrutiny.
6.6.2 Attribute Hash Stability
The attribute hash (attr_hash) must reflect:
only the attributes that were known and populated at the time of the assertion
This leads to two valid strategies:
6.6.2.1 Strategy A — Forward-Compatible Hashing (Recommended)
- Compute `attr_hash` only on attributes present in the incoming record
- Missing attributes are treated as “unknown”, not null
- Historical rows remain unchanged
6.6.2.2 Strategy B — Versioned Hash Definition
- Introduce `hash_version` alongside `attr_hash`
- Each schema version defines its own hashing logic
- Hash comparisons are only valid within the same version
This is more explicit but adds complexity.
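Strategy A can be illustrated with a short sketch (the function and column names here are hypothetical). Because absent attributes are skipped entirely rather than hashed as null, adding a column to the schema never changes the hash of a historical row that never asserted it.

```python
import hashlib

def forward_compatible_hash(row: dict, known_attrs: list[str]) -> str:
    """Strategy A: hash only the attributes actually asserted in the record.

    Attributes absent from the row are treated as "unknown" and skipped,
    so introducing a new column never alters historical hashes.
    """
    parts = [f"{col}={row[col]}" for col in sorted(known_attrs) if col in row]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()
```

Sorting the attribute names keeps the hash independent of column order, which matters when different sources supply overlapping subsets of the schema.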
6.6.3 Adding New Attributes
When a new attribute is introduced:
- do not backfill historical rows unless you have evidence
- do not recompute historical hashes
- treat the attribute as:
  - unknown before introduction
  - known after introduction
This ensures:
- no artificial change events
- no distortion of prior belief
6.6.4 Partial Updates Under Schema Evolution
Schema evolution amplifies the importance of partial merge logic:
- new attributes may only appear in some sources
- older sources may never supply them
The merge rule remains:
Only overwrite an attribute if the incoming record explicitly asserts it
All other attributes are inherited from the latest valid state at that time.
6.6.5 Deleting or Deprecating Attributes
If an attribute is removed:
- retain it in Bronze history
- stop populating it going forward
- optionally mark as deprecated in metadata
Historical queries must still be able to reconstruct prior states including that attribute.
6.6.6 What This Substantiates
With correct schema evolution handling, the platform can prove:
- what attributes were known at any point in time
- when new information became available
- that no historical state was retroactively altered
Without this discipline:
schema evolution becomes silent data corruption.
6.6.7 Packed Payloads, Blobs, and Deferred Unpacking
Schema evolution does not apply only to relational columns. In regulated platforms, XML, JSON, and other packed payloads may be preserved in Bronze as first-class evidentiary artefacts, with selected fields unpacked over time as indexing, querying, or regulatory needs evolve.
This deferred unpacking must not be treated as retroactive truth creation. If a field is extracted from a historical blob for the first time in March 2026, that does not mean the platform “knew” that field in January 2025. It means the evidence already existed, but the interpretation or accessibility changed.
Accordingly:
- raw blobs remain the primary evidence
- unpacked fields are derived interpretations
- historical hashes must not be recomputed simply because more of a payload is now unpacked
- where unpacking logic materially changes, the extraction version or interpretation version should be recorded
7. Handling Out-of-Sequence Events Correctly
Out-of-sequence data is not an edge case — it is the normal operating condition in distributed, regulated data platforms. The Bronze layer must therefore be explicitly designed to reconstruct correct temporal order regardless of arrival order.
7.1 Ordering Rules
The primary rule is simple:
Order by source_event_ts, not ingest_ts
Processing hierarchy:
- `source_event_ts` (primary ordering)
- `ingest_ts` (fallback / observability only)
This ensures:
- Late-arriving data is inserted into the correct historical position
- `is_current` reflects the latest known truth, not the latest arrival
- Point-in-time reconstruction aligns with source reality
If you rely on ingest_ts, you are modelling pipeline behaviour, not business truth.
7.2 Tie-Breaking Rules (Sequence, File, Batch ID)
Real systems often produce records with identical timestamps. Deterministic tie-breaking is therefore essential.
Recommended tie-break hierarchy:
1. `source_sequence_id` (if available)
2. `source_file` or `source_path`
3. `ingest_run_id` (batch / pipeline execution)
4. `ingest_ts` (final fallback)
This ensures:
- Stable ordering across replays
- Deterministic version chains
- Prevention of “flip-flopping” `is_current`
Without tie-breaking, identical timestamps can produce non-repeatable outcomes, which is unacceptable in regulated environments.
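The hierarchy above collapses into a single deterministic sort key. A plain-Python sketch, with field names following the row model and ISO-8601 strings assumed so that lexical order matches time order:

```python
def ordering_key(row: dict) -> tuple:
    """source_event_ts first, then the recommended tie-break hierarchy."""
    return (
        row["source_event_ts"],
        row.get("source_sequence_id", 0),
        row.get("source_file", ""),
        row.get("ingest_run_id", ""),
        row["ingest_ts"],              # final fallback, observability only
    )

# Two rows share a source timestamp; the file name breaks the tie
# deterministically, regardless of arrival (ingest) order.
batch = [
    {"source_event_ts": "2024-01-10", "source_file": "b.csv", "ingest_ts": "t1"},
    {"source_event_ts": "2024-01-10", "source_file": "a.csv", "ingest_ts": "t2"},
    {"source_event_ts": "2024-01-05", "source_file": "c.csv", "ingest_ts": "t3"},
]
ordered = sorted(batch, key=ordering_key)
```

The same tuple works as a window `ORDER BY` in the PySpark implementation; the essential property is that re-sorting any permutation of the same rows yields the same sequence.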
7.3 Repair Logic (Late Arriving Updates)
When late data arrives:
- Identify the correct position in the timeline using `source_event_ts`
- Insert the new version into the historical chain
- Recalculate:
  - `effective_to` of the preceding version
  - `effective_from` of the inserted version
  - downstream versions if necessary
This may require reopening and resequencing existing rows.
Key principle:
History is corrected, not appended blindly
Example:
- Version A (Jan 1)
- Version C (Jan 10) ← already present
- Version B (Jan 5) ← arrives late
Correct result:
- A → effective_to = Jan 5
- B → Jan 5 → Jan 10
- C → remains current
Not:
- A → C → B (arrival order corruption)
This is what makes Bronze repairable and regulator-defensible.
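The A/B/C repair above can be sketched as a chain rebuild (plain Python, illustrative field names; a real implementation would do this per business key inside the Bronze writer):

```python
def rebuild_timeline(versions: list[dict]) -> list[dict]:
    """Re-derive the version chain after a late arrival.

    Sorting by source_event_ts slots the late version into place,
    then effective_from/effective_to are recomputed over the chain.
    """
    ordered = sorted(versions, key=lambda v: v["source_event_ts"])
    for v, nxt in zip(ordered, ordered[1:]):
        v["effective_to"] = nxt["source_event_ts"]   # close each predecessor
    ordered[-1]["effective_to"] = None               # open-ended = current
    for v in ordered:
        v["effective_from"] = v["source_event_ts"]
        v["is_current"] = v["effective_to"] is None
    return ordered

# Versions A and C already exist; B arrives late and is repaired into place.
chain = rebuild_timeline([
    {"version": "A", "source_event_ts": "2024-01-01"},
    {"version": "C", "source_event_ts": "2024-01-10"},
    {"version": "B", "source_event_ts": "2024-01-05"},   # late arrival
])
```

Note that rebuilding from source time is idempotent: replaying the same repair, in any order, converges on the same A → B → C chain with C current.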
8. Full Loads, Partial Loads, and “Over-the-Top” Snapshots
Different ingestion patterns imply different semantics. Treating them the same leads to data loss, false deletions, and broken history.
8.1 Full Snapshot Semantics
A full load represents:
“This is the complete state of the entity (or dataset) at time X.”
Implications:
- Missing records may imply deletion or absence
- Every record should be compared against existing state
- May trigger closure of prior versions
Required metadata:
- `load_type = FULL`
- clear `source_event_ts` or extract timestamp
Full loads are the only safe basis for absence-based inference.
8.2 Partial Feed Semantics
A partial load represents:
“This is a subset of changes or updates.”
Implications:
- Missing fields ≠ null
- Missing records ≠ deletion
- Only included attributes should be updated
Required metadata:
- `load_type = PARTIAL`
This prevents:
- accidental overwrites
- false deletions
- loss of previously known values
Partial feeds must be treated as patches, not replacements.
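Patch semantics can be sketched as a dictionary merge (illustrative only; in practice this runs per attribute group against the latest valid state at the record's source time):

```python
def apply_partial(current_state: dict, patch: dict) -> dict:
    """Partial feeds are patches: only explicitly asserted attributes change.

    Keys absent from the patch are inherited from the current state.
    An explicit None in the patch is still an assertion (a blanking)
    and is applied as such.
    """
    merged = dict(current_state)    # copy, so prior belief is preserved
    merged.update(patch)            # only keys present in the patch overwrite
    return merged
```

The critical distinction this encodes: absence of a key means "not asserted", while presence with a null value means "asserted to be empty" — collapsing the two is how partial feeds silently destroy data.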
8.3 Full-Over-Top Edge Case
This is a common failure mode:
- A newer partial feed updates a record
- A late-arriving full snapshot is then processed
If handled incorrectly, the full snapshot can:
- overwrite newer belief
- reset attributes to older values
- corrupt `is_current`
Correct handling:
- Treat full snapshots using source_event_ts ordering
- Insert them into the correct temporal position
- Do not blindly overwrite current state
This ensures:
- history remains consistent
- newer belief is not lost
- snapshots behave as time-bound assertions, not overrides
8.4 Why “Missing Fields” ≠ Deletion
A missing field can mean:
- not present in source extract
- not included in partial feed
- optional attribute
- upstream omission
It does not automatically mean:
“This value has been deleted”
Deletion must be explicitly signalled via:
- `is_deleted = true`
- status flags
- tombstone events
- absence in a trusted full snapshot
Treating missing data as deletion leads to:
- silent data loss
- incorrect history
- broken auditability
8.5 The Core Principle
Across all of this:
Every row is an assertion with context, not a command to overwrite state
By distinguishing:
- ordering vs arrival
- full vs partial semantics
- absence vs deletion
you ensure that Bronze remains:
- replayable
- explainable
- regulator-defensible
—and that Silver can remain simple, because Bronze has already done the hard work correctly.
9. Deletions: What They Mean and How to Model Them
Deletion in regulated data platforms is not simply “row removed.” It is an assertion about state, and must be treated with the same care as any other change. Incorrect handling of deletions is one of the fastest ways to break auditability.
9.1 Explicit Deletes
These are clear, intentional signals from the source:
- status flags (e.g. `is_deleted = true`)
- tombstone events
- delete operations in CDC streams
Handling:
- create a new SCD2 version
- set `is_deleted = true`
- close the previous version (`effective_to`)
This provides:
“At time X, the source asserted that this entity no longer exists.”
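The handling steps above can be sketched in plain Python (hypothetical field names; timestamps shown as strings for brevity). The key point is that nothing is removed — the delete is itself a new version:

```python
def apply_delete(open_version: dict, delete_event_ts: str) -> tuple[dict, dict]:
    """Model an explicit delete as a new SCD2 version, not a removal.

    The open version is closed at the delete's source time, and a
    tombstone version carrying is_deleted=True becomes current.
    """
    closed = {**open_version,
              "effective_to": delete_event_ts, "is_current": False}
    tombstone = {**open_version,
                 "effective_from": delete_event_ts, "effective_to": None,
                 "is_current": True, "is_deleted": True}
    return closed, tombstone
```

A point-in-time query before `delete_event_ts` still sees the entity; a query after it sees the tombstone — which is exactly the "at time X the source asserted this entity no longer exists" semantics.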
9.2 Soft Deletes
These are indirect indicators of deletion:
- status = “closed”, “inactive”, “cancelled”
- lifecycle transitions
Handling:
- treat as a state change, not a hard delete
- retain the record, update attributes accordingly
Why:
Soft deletes are still part of business history. They must remain queryable and reconstructable.
9.3 “Blanking” Patterns
Common in real systems:
- fields suddenly become null or empty
- entire payloads appear “wiped”
- partial feeds omit previously populated attributes
This does not automatically mean deletion.
Interpretation depends on:
- `load_type` (FULL vs PARTIAL)
- source behaviour
- domain rules
Safe handling:
- treat blanking as a new version
- do not infer deletion unless explicitly defined
- preserve prior values in earlier versions
This avoids:
- silent data loss
- accidental erasure of historical truth
9.4 Silver Implications
Silver is where deletion semantics are resolved into usable views.
Typical patterns:
- filter out `is_deleted = true` for “active” views
- include deleted records for audit or PIT queries
- apply domain logic (e.g. “closed accounts still visible”)
Key principle:
Bronze records deletion assertions. Silver decides what they mean for consumption.
10. Precedence: How Multi-Source Truth Is Resolved
In multi-source environments, conflicting assertions are inevitable. The platform must not pretend they don’t exist — it must capture them and resolve them deterministically.
10.1 Precedence Tables (by Domain / Attribute Group)
Precedence is not global. It varies by:
- domain (Customer, Account, Product)
- attribute group (name, address, risk rating, balance)
Example:
- CRM may own customer contact details
- Core Banking may own account status
- External provider may own credit score
This is typically modelled as:
- a precedence table or ruleset
- mapping `source_system → priority`
- optionally scoped to attribute groups
This enables:
“For this attribute, which source do we trust first?”
10.2 Effective Dating Precedence Rules
Precedence is not just “which source wins.” It must also consider time.
Rules combine:
- `source_event_ts` (when asserted)
- `precedence_rank` (who asserted it)
Typical resolution:
- higher precedence source overrides lower
- if equal precedence → latest `source_event_ts` wins
- ties resolved via deterministic tie-breakers
This ensures:
- consistent outcomes
- explainable decisions
- stable Silver views
Important: precedence does not replace temporal ordering — it operates on top of it.
Bronze preserves all temporally ordered assertions.
Silver then applies precedence to determine which assertion is believed, even if it is not the latest in time.
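The resolution rules above reduce to a deterministic ordering, sketched here in pure Python (in Silver this would be a window function over precedence_rank and source_event_ts; the tie-breaker on source_system is an example, not a rule):

```python
def believed_assertion(assertions):
    """Pick the assertion Silver believes:
    higher precedence first, then latest source_event_ts,
    then a deterministic tie-breaker (source_system name here, as an example)."""
    return max(
        assertions,
        key=lambda a: (a["precedence_rank"], a["source_event_ts"], a["source_system"]),
    )

assertions = [
    {"source_system": "CRM",  "precedence_rank": 100,
     "source_event_ts": "2024-01-01", "address": "1 High St"},
    {"source_system": "Core", "precedence_rank": 50,
     "source_event_ts": "2024-03-01", "address": "2 Low Rd"},
]
```

Here CRM wins despite being older, which is exactly the point: precedence operates on top of temporal ordering, and Bronze still retains both assertions.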
10.3 Bronze Stores Conflicts; Silver Resolves Them
The most important architectural rule:
Bronze does not resolve conflicts
Bronze stores:
- all assertions
- from all sources
- with full provenance
- even when they disagree
Silver applies:
- precedence rules
- survivorship logic
- domain-specific interpretation
This separation ensures:
- no loss of evidence
- full auditability
- ability to revisit decisions if rules change
If conflicts are resolved in Bronze:
- alternative viewpoints are lost
- decisions become irreversible
- regulatory defensibility is weakened
10.4 The Core Principle
Across deletions and precedence:
Bronze captures what was said. Silver decides what it means.
This preserves:
- evidence
- optionality
- explainability
—and allows the platform to evolve its logic without rewriting its history.
11. Transactions vs State: Stop Treating Events Like Dimensions
One of the most common category errors in Bronze design is to treat transactions as if they were slowly changing dimensions. They are not.
A transaction is an immutable event: something that happened once, at a point in time. A dimension or state record is a mutable interpretation of an entity over time. Confusing the two creates false history, broken lineage, and regulator-hostile data models.
11.1 Transaction Handling Pattern
Transactions should be handled as append-only facts.
Typical examples:
- payments
- trades
- orders
- cash movements
- account postings
- customer interactions logged as events
Each transaction row should be identified by a stable transactional identity, for example:
- source transaction ID
- message ID
- event ID
- composite event business key
The Bronze handling pattern is:
- if the transaction has not been seen before → insert it
- if the transaction has been seen before and is identical → update last_seen_ts only
- if the transaction has been seen before but arrives with correction semantics → treat the correction as a new event or adjustment, not as mutation of the original event
This preserves the distinction between:
- what happened
- what we later learned about what happened
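The insert / update-last_seen pattern can be sketched in miniature (pure Python; a Delta implementation would typically use MERGE keyed on the transactional identity):

```python
def observe_transaction(store, event):
    """Replay-safe handling of an immutable event.

    store: dict keyed by (bk, record_hash) standing in for the Bronze table
    event: dict with at least bk, record_hash, ingest_ts
    """
    key = (event["bk"], event["record_hash"])
    if key not in store:
        # first observation -> insert the event
        store[key] = dict(event, first_seen_ts=event["ingest_ts"],
                          last_seen_ts=event["ingest_ts"])
    else:
        # replay / resend -> record the observation, never a new version
        store[key]["last_seen_ts"] = max(store[key]["last_seen_ts"],
                                         event["ingest_ts"])
    return store
```

Replaying the same event twice leaves exactly one row, with first_seen_ts from the first arrival and last_seen_ts from the latest.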
11.2 last_seen Update
For transactions, repeated arrival usually means one of three things:
- replay
- duplicate file ingestion
- source resend / redelivery
That does not justify a new SCD2 version.
Instead, the correct pattern is:
- preserve the original event row
- update last_seen_ts
- optionally update ingest_run_id
- optionally update a replay count / duplicate count
This provides operational evidence that the platform observed the event again, without pretending the event itself changed.
Typical transaction metadata includes:
- first_seen_ts
- last_seen_ts
- source_system
- source_path
- source_file
- event_hash or record_hash
This gives you replay-safe ingestion without corrupting event history.
11.3 Why It’s Not SCD2
SCD2 exists to model changing state.
Examples of state:
- customer address
- account status
- product risk classification
- contract terms
These are beliefs that can change over time while referring to the same entity.
Transactions are different. They do not “change” in the same sense. What changes is:
- enrichment
- classification
- reconciliation status
- investigation outcome
Those changes should be modelled as:
- separate state tables
- linked interpretation tables
- or explicit correction / adjustment events
Not by mutating or versioning the transaction itself as if the original fact changed.
The rule is simple:
Transactions are facts. State is belief. SCD2 is for belief, not fact.
If this boundary is violated, the platform quickly becomes unable to answer basic regulatory questions such as:
- What actually happened?
- What did we believe about it at the time?
- What later changed in our interpretation?
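A sketch of correction-as-new-event answering those questions (the corrects_event_id field is a hypothetical convention for linking an adjustment to its original, not a required schema):

```python
events = [
    {"event_id": "e1", "type": "payment",    "amount": 100, "corrects_event_id": None},
    {"event_id": "e2", "type": "adjustment", "amount": -10, "corrects_event_id": "e1"},
]

def net_effect(events, original_id):
    """What happened (e1) is never mutated; what we later learned (e2)
    is applied as a separate, linked event."""
    return sum(e["amount"] for e in events
               if e["event_id"] == original_id
               or e["corrects_event_id"] == original_id)
```

The original payment remains exactly as recorded, while the interpreted value after correction is derivable at any time.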
12. Reference PySpark Implementation (Bronze Writer)
12.1 Preferred Pattern — Bronze with Source Time
The following implementation pattern is a reference approach for writing Bronze safely in Databricks / PySpark. It supports:
- deterministic ordering
- duplicate suppression
- SCD2 state handling
- transaction handling
- late-arriving events
- logical deletes
- safe is_current
The code is illustrative, but the logic is the important part.
12.1.1 Inputs and Assumptions
Assume the incoming DataFrame contains, at minimum:
- business key columns
- mutable state attribute columns
- source metadata
- source_event_ts where available
- ingest_ts
- entity_type (state or transaction)
- optional delete flag
Assume Bronze is stored in Delta and that impacted keys can be recomputed deterministically.
from pyspark.sql import functions as F, Window as W
BK_COLS = ["customer_id"] # example
ATTR_COLS = ["name", "address", "status", "risk_rating"]
EVENT_TS_COL = "source_event_ts"
BRONZE_TABLE = "bronze.customer_state"
Helper assumptions:
- bk is the canonical business key
- sk is a hash of bk
- attr_hash is the hash of mutable state columns
- entity_type determines whether the record follows SCD2 or event logic
12.1.2 Dedup + Ordering
First standardise keys, assign hashes, and deduplicate repeated identical observations.
def add_identity_and_hashes(df, bk_cols, attr_cols):
df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
df = df.withColumn(
"attr_hash",
F.when(
F.col("entity_type") == "state",
F.sha2(
F.concat_ws(
"||",
*[F.coalesce(F.col(c).cast("string"), F.lit("∅")) for c in attr_cols]
),
256
)
)
)
df = df.withColumn(
"record_hash",
F.sha2(
F.concat_ws(
"||",
F.col("bk"),
F.coalesce(F.col(EVENT_TS_COL).cast("string"), F.lit("∅")),
F.coalesce(F.col("attr_hash"), F.lit("∅")),
F.coalesce(F.col("source_system"), F.lit("∅")),
F.coalesce(F.col("source_path"), F.lit("∅"))
),
256
)
)
return df
For state rows, deduplicate on:
- bk
- source_event_ts
- attr_hash
For transaction rows, deduplicate on:
- bk or transaction ID
- record_hash
def dedup_state(df):
w = W.partitionBy("bk", EVENT_TS_COL, "attr_hash")
return (
df.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w))
.withColumn("rn", F.row_number().over(w.orderBy(F.col("ingest_ts").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
Ordering for state must be based on source event time, never arrival order:
STATE_ORDER = [
F.col(EVENT_TS_COL).asc(),
F.col("source_sequence_id").asc_nulls_last(),
F.col("source_file").asc_nulls_last(),
F.col("ingest_run_id").asc_nulls_last(),
F.col("ingest_ts").asc()
]
That gives deterministic results even when ADF processes many files concurrently.
12.1.3 Partial-Load Merge Logic
Source-time Bronze does not remove the need to handle partial feeds carefully. It simply gives you a stronger temporal foundation on which to do so.
A partial feed means:
“These attributes changed at source time X”
It does not mean:
“Every omitted attribute became null at source time X”
So before hashing and versioning a partial row, the platform must reconstruct the full state as of the source event time by combining:
- the incoming partial assertion
- the latest prior known state for that business key as of source_event_ts
This ensures that:
- missing fields do not become false deletions
- partial updates are treated as patches
- the resulting attr_hash represents full business state, not a lossy payload fragment
12.1.3.1 Source-time partial merge rules
- Find the latest prior state row for the same bk where:
  - effective_from <= source_event_ts
  - and either effective_to > source_event_ts or effective_to IS NULL
- For each mutable attribute:
- if incoming value is present → use incoming value
- if incoming value is absent → inherit prior value
- Recompute attr_hash on the reconstructed full state
- Continue through standard source-time SCD2 ordering and rebuild
12.1.3.2 Reference helper
def merge_partial_state_with_source_time(df_new_partial, df_existing_state, attr_cols, event_ts_col):
    """
    Reconstruct full state for partial loads using the latest known state
    as of the incoming source_event_ts.
    """
    e = df_existing_state.alias("e")
    n = df_new_partial.alias("n")

    # Find candidate prior rows valid at the incoming source_event_ts
    joined = (
        n.join(e, on="bk", how="left")
        .where(
            (F.col("e.effective_from") <= F.col(f"n.{event_ts_col}")) &
            (
                F.col("e.effective_to").isNull() |
                (F.col("e.effective_to") > F.col(f"n.{event_ts_col}"))
            )
        )
    )

    # If multiple prior candidates somehow exist, keep the latest valid one
    w_prior = (
        W.partitionBy("n.bk", F.col(f"n.{event_ts_col}"))
         .orderBy(F.col("e.effective_from").desc())
    )
    latest_prior = (
        joined
        .withColumn("rn_prior", F.row_number().over(w_prior))
        .filter(F.col("rn_prior") == 1)
        .drop("rn_prior")
    )

    select_exprs = [
        F.col("n.bk").alias("bk"),
        F.col("n.sk").alias("sk"),
        F.col(f"n.{event_ts_col}").alias(event_ts_col),
        F.col("n.ingest_ts").alias("ingest_ts"),
        F.col("n.ingest_run_id").alias("ingest_run_id"),
        F.col("n.source_system").alias("source_system"),
        F.col("n.source_type").alias("source_type"),
        F.col("n.source_path").alias("source_path"),
        F.col("n.source_file").alias("source_file"),
        F.col("n.entity_type").alias("entity_type"),
        F.col("n.load_type").alias("load_type"),
        F.col("n.is_deleted").alias("is_deleted"),
        F.col("n.precedence_rank").alias("precedence_rank"),
        F.col("n.source_sequence_id").alias("source_sequence_id"),
    ]

    # Inherit the prior value wherever the incoming partial row is blank
    for c in attr_cols:
        select_exprs.append(
            F.when(
                F.col(f"n.{c}").isNull() | (F.col(f"n.{c}") == F.lit("∅")),
                F.col(f"e.{c}")
            ).otherwise(F.col(f"n.{c}")).alias(c)
        )

    return latest_prior.select(*select_exprs)
This is the crucial piece that prevents source-time Bronze from becoming unsafe under mixed full and partial ingestion patterns.
12.1.4 SCD2 Merge Logic
The safest pattern is not “row-by-row mutate Bronze”.
It is:
- identify impacted business keys
- load existing Bronze for those keys
- union incoming + existing
- rebuild the timeline
- replace only those keys
This is what makes late arrivals and precedence repairable.
def rebuild_scd2_timeline(df_union):
w = W.partitionBy("bk").orderBy(*STATE_ORDER)
df = (
df_union
.withColumn("prev_hash", F.lag("attr_hash").over(w))
.withColumn(
"is_new_version",
F.when(
F.col("prev_hash").isNull() | (F.col("attr_hash") != F.col("prev_hash")),
F.lit(1)
).otherwise(F.lit(0))
)
.withColumn(
"version_grp",
F.sum("is_new_version").over(w.rowsBetween(W.unboundedPreceding, 0))
)
)
# collapse repeated same-hash rows into one version
latest_in_grp = W.partitionBy("bk", "version_grp").orderBy(F.col("ingest_ts").desc())
df_collapsed = (
df.withColumn("grp_effective_from", F.min(EVENT_TS_COL).over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
.withColumn("grp_rn", F.row_number().over(latest_in_grp))
.filter(F.col("grp_rn") == 1)
.drop("grp_rn", "prev_hash", "is_new_version")
.withColumn("effective_from", F.col("grp_effective_from"))
.withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
.withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
.drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
)
final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc())
df_final = (
df_collapsed
.withColumn("effective_to", F.lead("effective_from").over(final_w))
.withColumn("is_current", F.col("effective_to").isNull())
)
return df_final
This produces a proper SCD2 chain.
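The same chaining logic can be seen in miniature in pure Python (illustrative only): collapse consecutive identical hashes into one version, then derive effective_to and is_current from the ordered chain.

```python
def build_chain(observations):
    """observations: list of (source_event_ts, attr_hash), pre-sorted by event time."""
    versions = []
    prev_hash = None
    for ts, h in observations:
        if h != prev_hash:  # a new version only on a real state change
            versions.append({"effective_from": ts, "attr_hash": h})
        prev_hash = h
    for i, v in enumerate(versions):  # close each version with the next one's start
        v["effective_to"] = (versions[i + 1]["effective_from"]
                             if i + 1 < len(versions) else None)
        v["is_current"] = v["effective_to"] is None
    return versions

chain = build_chain([("t1", "A"), ("t2", "A"), ("t3", "B")])
```

The repeated "A" at t2 does not create a version; the chain is A effective t1→t3, then B open-ended and current.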
12.1.5 Full PySpark Reference Implementation
With source time available, the safest implementation pattern is to rebuild the timeline for impacted business keys rather than trying to mutate Bronze row-by-row. This keeps late-arriving data, precedence, and partial updates all repairable.
The flow is:
- prepare and standardise incoming rows
- split transactions from state
- merge partial state rows against prior valid state at source time
- deduplicate identical observations
- pull existing Bronze rows for impacted keys
- rebuild the full source-time timeline
- derive effective_to and is_current
- union transactions and state results
12.1.5.1 Full Reference Flow
from pyspark.sql import functions as F, Window as W

BK_COLS = ["customer_id"]  # example
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
EVENT_TS_COL = "source_event_ts"
BRONZE_TABLE = "bronze.customer_state"

def canonicalise_for_hash(df, attr_cols):
    for c in attr_cols:
        df = df.withColumn(c, F.coalesce(F.col(c).cast("string"), F.lit("∅")))
    return df

def add_identity_and_hashes(df, bk_cols, attr_cols):
    df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
    df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
    df = canonicalise_for_hash(df, attr_cols)
    df = df.withColumn(
        "attr_hash",
        F.when(
            F.col("entity_type") == "state",
            F.sha2(F.concat_ws("||", *[F.col(c) for c in attr_cols]), 256)
        )
    )
    df = df.withColumn(
        "record_hash",
        F.sha2(
            F.concat_ws(
                "||",
                F.col("bk"),
                F.coalesce(F.col(EVENT_TS_COL).cast("string"), F.lit("∅")),
                F.coalesce(F.col("attr_hash"), F.lit("∅")),
                F.coalesce(F.col("source_system"), F.lit("∅")),
                F.coalesce(F.col("source_path"), F.lit("∅")),
                F.coalesce(F.col("source_file"), F.lit("∅"))
            ),
            256
        )
    )
    return df

def prepare_inputs_with_source_time(df_in, bk_cols, attr_cols, event_ts_col):
    df = df_in
    df = df.withColumn("ingest_ts", F.coalesce(F.col("ingest_ts"), F.current_timestamp()))
    df = df.withColumn("ingest_run_id", F.coalesce(F.col("ingest_run_id"), F.lit("UNKNOWN_RUN")))
    df = df.withColumn("source_system", F.coalesce(F.col("source_system"), F.lit("UNKNOWN_SOURCE")))
    df = df.withColumn("source_type", F.coalesce(F.col("source_type"), F.lit("unknown")))
    df = df.withColumn("source_path", F.coalesce(F.col("source_path"), F.lit("UNKNOWN_PATH")))
    df = df.withColumn("source_file", F.coalesce(F.col("source_file"), F.lit("UNKNOWN_FILE")))
    df = df.withColumn("entity_type", F.coalesce(F.col("entity_type"), F.lit("state")))
    df = df.withColumn("load_type", F.coalesce(F.col("load_type"), F.lit("UNKNOWN")))
    df = df.withColumn("is_deleted", F.coalesce(F.col("is_deleted"), F.lit(False)))
    df = df.withColumn("precedence_rank", F.coalesce(F.col("precedence_rank"), F.lit(0)))
    # required for source-time mode
    df = df.withColumn(event_ts_col, F.col(event_ts_col).cast("timestamp"))
    df = df.withColumn("effective_from", F.col(event_ts_col))
    df = add_identity_and_hashes(df, bk_cols, attr_cols)
    return df

def dedup_state_with_source_time(df):
    """
    Collapse duplicate observations of the same state assertion.
    """
    w = W.partitionBy("bk", EVENT_TS_COL, "attr_hash")
    return (
        df.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
          .withColumn("last_seen_ts", F.max("ingest_ts").over(w))
          .withColumn("rn", F.row_number().over(w.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )

def dedup_transactions_with_source_time(df_tx):
    """
    Transactions are immutable events.
    Replays update last_seen_ts, not history.
    """
    w_tx = W.partitionBy("bk", "record_hash")
    return (
        df_tx
        .withColumn("first_seen_ts", F.min("ingest_ts").over(w_tx))
        .withColumn("last_seen_ts", F.max("ingest_ts").over(w_tx))
        .withColumn("rn", F.row_number().over(w_tx.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )

STATE_ORDER = [
    F.col(EVENT_TS_COL).asc(),
    F.col("precedence_rank").desc(),
    F.col("source_sequence_id").asc_nulls_last(),
    F.col("source_file").asc_nulls_last(),
    F.col("ingest_run_id").asc_nulls_last(),
    F.col("ingest_ts").asc()
]

def rebuild_scd2_timeline_with_source_time(df_union):
    """
    Rebuild complete source-time history for impacted keys.
    """
    w = W.partitionBy("bk").orderBy(*STATE_ORDER)
    df = (
        df_union
        .withColumn("prev_hash", F.lag("attr_hash").over(w))
        .withColumn(
            "is_new_version",
            F.when(
                F.col("prev_hash").isNull() | (F.col("attr_hash") != F.col("prev_hash")),
                F.lit(1)
            ).otherwise(F.lit(0))
        )
        .withColumn(
            "version_grp",
            F.sum("is_new_version").over(w.rowsBetween(W.unboundedPreceding, 0))
        )
    )
    latest_in_grp = W.partitionBy("bk", "version_grp").orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())
    df_collapsed = (
        df.withColumn("grp_effective_from", F.min(EVENT_TS_COL).over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
          .withColumn("grp_rn", F.row_number().over(latest_in_grp))
          .filter(F.col("grp_rn") == 1)
          .drop("grp_rn", "prev_hash", "is_new_version")
          .withColumn("effective_from", F.col("grp_effective_from"))
          .withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
          .withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
          .drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
    )
    final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc(), F.col("precedence_rank").desc())
    return (
        df_collapsed
        .withColumn("effective_to", F.lead("effective_from").over(final_w))
        .withColumn("is_current", F.col("effective_to").isNull())
    )

def process_bronze_with_source_time(df_in, bronze_table, bk_cols, attr_cols, event_ts_col):
    """
    Main preferred Bronze ingestion function. Guarantees:
    - source-time ordering
    - late-arrival repair
    - partial-load safety
    - deterministic is_current
    - replay-safe transaction handling
    """
    df = prepare_inputs_with_source_time(df_in, bk_cols, attr_cols, event_ts_col)

    # Split transactions from state
    df_tx = df.filter(F.col("entity_type") == "transaction")
    df_state_new = df.filter(F.col("entity_type") == "state")

    # Process transactions
    df_tx_final = dedup_transactions_with_source_time(df_tx)

    # Identify impacted business keys
    impacted_keys = df_state_new.select("bk").distinct()
    existing_state = (
        spark.read.format("delta").table(bronze_table)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, on="bk", how="inner")
    )

    # Split full vs partial state
    df_state_partial = df_state_new.filter(F.col("load_type") == "PARTIAL")
    df_state_full = df_state_new.filter(F.col("load_type") != "PARTIAL")

    # Reconstruct full state for partial rows
    if df_state_partial.limit(1).count() > 0:
        df_state_partial = merge_partial_state_with_source_time(
            df_state_partial,
            existing_state,
            attr_cols,
            event_ts_col
        )
        df_state_partial = add_identity_and_hashes(df_state_partial, bk_cols, attr_cols)

    # Deduplicate incoming state
    df_state_prepared = df_state_full.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_with_source_time(df_state_prepared)

    # Union with existing and rebuild source-time timeline
    df_state_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_timeline_with_source_time(df_state_union)

    # Final combined result
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)
    return df_final
This is the complete preferred pattern. It is the pattern that should be used where regulated reconstruction and source-time defensibility are required.
12.1.6 Handling Late Events
Late events are handled by the rebuild pattern itself.
If an older event arrives later:
- it is inserted into the ordered timeline based on source_event_ts
- previous and next rows are resequenced
- effective_to and is_current are recalculated
There is no special hack required if the rebuild is correct.
Typical flow:
incoming = add_identity_and_hashes(df_in, BK_COLS, ATTR_COLS)
state_in = incoming.filter(F.col("entity_type") == "state")
txn_in = incoming.filter(F.col("entity_type") == "transaction")
impacted_keys = state_in.select("bk").distinct()
existing_state = (
spark.read.table(BRONZE_TABLE)
.filter(F.col("entity_type") == "state")
.join(impacted_keys, on="bk", how="inner")
)
state_union = dedup_state(state_in).unionByName(existing_state, allowMissingColumns=True)
state_final = rebuild_scd2_timeline(state_union)
That is the correct late-arrival repair model.
12.1.7 Handling Deletes
Deletes should be modelled as state transitions, not physical removals.
Three supported patterns:
- Explicit delete flag
  - source provides is_deleted = true
- Tombstone event
  - source emits explicit delete record
- Trusted blanking in full snapshot
  - only if domain rules confirm this means deletion
In all three cases:
- create a new state version
- set is_deleted = true
- do not physically remove history
Silver can then derive active views using:
WHERE is_current = true
AND is_deleted = false
Delete handling in PySpark is simply part of the state row and therefore part of attr_hash if deletion status is business-meaningful.
For example:
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
That ensures a delete becomes a new version.
12.1.8 Producing is_current Safely
is_current must never be loaded directly from source and must never be set by “last row wins”.
It must always be derived from the version chain:
- order by effective_from
- set effective_to = next(effective_from)
- set is_current = effective_to IS NULL
This makes is_current:
- deterministic
- repairable
- immune to ADF concurrency storms
- safe under replay
For transactions, is_current is not meaningful in the SCD2 sense and should generally be null or omitted. The transaction handling pattern is:
def process_transactions(df_tx):
w = W.partitionBy("bk") # or transaction_id if separate
return (
df_tx.withColumn("first_seen_ts", F.min("ingest_ts").over(w))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w))
.dropDuplicates(["bk", "record_hash"])
)
That gives:
- insert if unseen
- update last_seen_ts if seen again
- no fake versioning
is_current represents the latest version in the temporal chain, not the resolved truth across sources.
Precedence may select a non-current row as the “believed” version in Silver.
12.1.9 What This Preferred Pattern Achieves
This pattern gives you the things the fallback model cannot.
12.1.9.1 True source-time ordering
Rows are positioned according to when the source asserted them, not when the platform happened to receive them.
12.1.9.2 Late-arrival repair
Older assertions can be inserted back into the correct historical position without corrupting the version chain.
12.1.9.3 Deterministic is_current
is_current becomes a consequence of a rebuilt source-time timeline, not a race condition caused by file arrival order.
12.1.9.4 Safe handling of partial loads
Partial payloads become full reconstructed state assertions before hashing and versioning, rather than accidental null-overwrites.
12.1.9.5 Proper deletion modelling
Deletes, tombstones, and trusted blanking patterns become explicit state transitions in history.
12.1.9.6 Replay-safe transaction handling
Transactions remain immutable events. Replays update last_seen_ts, not fake dimensional versions.
12.1.9.7 Regulator-defensible PIT reconstruction
This is the mode that supports answering:
What did we believe at that time, based on what source, and how did that belief later change?
That is the whole point.
12.1.10 Production Notes
The reference code above is logically correct, but several production considerations should be made explicit.
12.1.10.1 Impacted-key pruning is essential
Do not rebuild all Bronze history on every load. Restrict recomputation to the impacted business keys only.
12.1.10.2 Tie-breakers must be stable
If source_event_ts is not unique, deterministic ordering requires additional fields such as:
- source_sequence_id
- source_file
- ingest_run_id
Without stable tie-breakers, replay may produce different outcomes.
12.1.10.3 Replace-by-key is often safer than row-level merge
For complex late-arriving repair, it is often safer to:
- rebuild all rows for impacted keys
- replace only those keys in Bronze
rather than try to patch individual rows in place.
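One way to implement replace-by-key on Delta is an overwrite scoped by a replaceWhere predicate covering only the impacted keys. This is a sketch: build_replace_where is a hypothetical helper, and very large key sets may be better served by a MERGE or a staging-table swap instead of an IN-list predicate.

```python
def build_replace_where(bk_values):
    """Build a Delta replaceWhere predicate covering only impacted keys."""
    quoted = ", ".join("'{}'".format(v) for v in sorted(bk_values))
    return "bk IN ({})".format(quoted)

# Illustrative write-back (assumes df_rebuilt holds the rebuilt rows
# for exactly these keys, and BRONZE_TABLE as defined earlier):
# (df_rebuilt.write.format("delta")
#     .mode("overwrite")
#     .option("replaceWhere", build_replace_where(impacted))
#     .saveAsTable(BRONZE_TABLE))
```

The predicate guarantees the overwrite cannot touch rows for keys outside the rebuilt set, which is what makes the repair safe to run repeatedly.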
12.1.10.4 Partitioning and clustering still matter
Large Bronze tables should still be optimised physically using:
- partition pruning
- clustering / liquid clustering
- compaction windows
These improve performance without weakening temporal guarantees.
12.1.10.5 Source-time quality must be monitored
Even when a source claims event time, you should still monitor for:
- null source timestamps
- backwards-moving sequences
- impossible event times
- inconsistent partial/full markers
A strong pattern implemented on weak metadata is still weak.
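Those checks can be sketched as simple counters (pure Python for illustration; in practice they would be Spark aggregations emitted as data-quality metrics per source and run):

```python
def source_time_quality(rows):
    """rows: dicts with source_event_ts (may be None) and source_sequence_id,
    in arrival order for one key. Returns simple quality counters."""
    nulls = sum(1 for r in rows if r["source_event_ts"] is None)
    backwards = 0
    prev_seq = None
    for r in rows:
        seq = r.get("source_sequence_id")
        if prev_seq is not None and seq is not None and seq < prev_seq:
            backwards += 1  # sequence moved backwards relative to arrival order
        if seq is not None:
            prev_seq = seq
    return {"null_event_ts": nulls, "backwards_sequences": backwards}
```

Thresholds on these counters can gate whether a source is still trusted to run in source-time mode or must drop to the fallback pattern.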
12.1.10.6 The main caveat
This preferred pattern assumes the source can substantiate:
- reliable source time
- stable business identity
- usable partial/full semantics
- explicit or well-governed deletion behaviour
If it cannot, then the correct response is not to weaken this pattern until it “fits”.
It is to fall back explicitly to the weaker ingestion-time model and state the limitation clearly.
12.2 Fallback Pattern — Bronze without Source Time
Not all upstream systems provide usable source-event time, source extract “as-of” time, or a trustworthy sequence number. In those cases, Bronze cannot model source-time belief. It can only model platform-time belief: what the platform knew, when it learned it.
That is a weaker posture, but it is still materially better than pretending such sources are temporally complete when they are not.
This fallback pattern is designed for exactly those situations. It gives you:
- platform-time ordering
- duplicate suppression
- replay-safe ingestion
- explicit caveats about what cannot be proven
It does not reconstruct true source-time history.
12.2.1 Core rule
If source time is absent, unreliable, or ambiguous, then ingest_ts becomes the platform’s effective time — and that limitation must be made explicit.
12.2.2 What this pattern is for
Use this fallback only when:
- the source does not expose reliable event time
- extracts are unordered or lossy
- snapshots arrive without trustworthy “as-of” semantics
- source sequencing cannot be trusted
This pattern still supports:
- idempotent Bronze ingestion
- duplicate suppression under ADF concurrency storms
- full vs partial load handling
- explicit deletion semantics
- current-state Silver derivation
But it supports them as platform-observed history, not source-authored history.
12.2.3 Inputs and assumptions
Assume the incoming DataFrame contains, at minimum:
- business key columns
- mutable state attributes
- ingestion timestamp
- source metadata
- load type (FULL/PARTIAL) where available
- optional delete flag
- entity type (state or transaction)
The key design choice is:
platform_effective_from = ingest_ts
That is the timeline.
12.2.4 Full PySpark reference implementation
from pyspark.sql import functions as F, Window as W
BK_COLS = ["customer_id"] # example business key columns
ATTR_COLS = ["name", "address", "status", "risk_rating", "is_deleted"]
BRONZE_TABLE = "bronze.customer_state_no_source_time"
def canonicalise_for_hash(df, attr_cols):
"""
Standardise mutable attributes before hashing.
Missing values are normalised explicitly.
"""
for c in attr_cols:
df = df.withColumn(c, F.coalesce(F.col(c).cast("string"), F.lit("∅")))
return df
def add_identity_and_hashes_no_source_time(df, bk_cols, attr_cols):
"""
Builds business key, surrogate key, attribute hash, and record hash
for the ingestion-time fallback model.
"""
df = df.withColumn("bk", F.concat_ws("||", *[F.col(c).cast("string") for c in bk_cols]))
df = df.withColumn("sk", F.sha2(F.col("bk"), 256))
df = canonicalise_for_hash(df, attr_cols)
# Hash only mutable business-state attributes
df = df.withColumn(
"attr_hash",
F.when(
F.col("entity_type") == "state",
F.sha2(F.concat_ws("||", *[F.col(c) for c in attr_cols]), 256)
)
)
# Full-row / replay hash including provenance
df = df.withColumn(
"record_hash",
F.sha2(
F.concat_ws(
"||",
F.col("bk"),
F.coalesce(F.col("source_system"), F.lit("∅")),
F.coalesce(F.col("source_path"), F.lit("∅")),
F.coalesce(F.col("source_file"), F.lit("∅")),
F.coalesce(F.col("attr_hash"), F.lit("∅"))
),
256
)
)
return df
def prepare_fallback_inputs(df_in, bk_cols, attr_cols):
"""
Adds required fallback metadata and declares platform-time ordering.
"""
df = df_in
df = df.withColumn("ingest_ts", F.coalesce(F.col("ingest_ts"), F.current_timestamp()))
df = df.withColumn("ingest_run_id", F.coalesce(F.col("ingest_run_id"), F.lit("UNKNOWN_RUN")))
df = df.withColumn("source_system", F.coalesce(F.col("source_system"), F.lit("UNKNOWN_SOURCE")))
df = df.withColumn("source_type", F.coalesce(F.col("source_type"), F.lit("unknown")))
df = df.withColumn("source_path", F.coalesce(F.col("source_path"), F.lit("UNKNOWN_PATH")))
df = df.withColumn("source_file", F.coalesce(F.col("source_file"), F.lit("UNKNOWN_FILE")))
df = df.withColumn("entity_type", F.coalesce(F.col("entity_type"), F.lit("state")))
df = df.withColumn("load_type", F.coalesce(F.col("load_type"), F.lit("UNKNOWN")))
df = df.withColumn("is_deleted", F.coalesce(F.col("is_deleted"), F.lit(False)))
# Fallback mode: platform time becomes effective time
df = df.withColumn("effective_from", F.col("ingest_ts"))
df = add_identity_and_hashes_no_source_time(df, bk_cols, attr_cols)
return df
def dedup_transactions_fallback(df_tx):
"""
Transactions are immutable events.
If seen again, update last_seen_ts; do not create SCD2 versions.
"""
w_tx = W.partitionBy("bk", "record_hash")
return (
df_tx
.withColumn("first_seen_ts", F.min("ingest_ts").over(w_tx))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w_tx))
.withColumn("rn", F.row_number().over(w_tx.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
def merge_partial_state(df_new, df_existing, attr_cols):
"""
For partial feeds, preserve existing attribute values where incoming
rows do not explicitly provide them.
This assumes the source has declared the load as PARTIAL.
"""
existing_alias = df_existing.alias("e")
new_alias = df_new.alias("n")
joined = new_alias.join(existing_alias, on="bk", how="left")
select_exprs = [
F.col("n.bk").alias("bk"),
F.col("n.sk").alias("sk"),
F.col("n.ingest_ts").alias("ingest_ts"),
F.col("n.ingest_run_id").alias("ingest_run_id"),
F.col("n.source_system").alias("source_system"),
F.col("n.source_type").alias("source_type"),
F.col("n.source_path").alias("source_path"),
F.col("n.source_file").alias("source_file"),
F.col("n.entity_type").alias("entity_type"),
F.col("n.load_type").alias("load_type"),
F.col("n.is_deleted").alias("is_deleted"),
F.col("n.effective_from").alias("effective_from"),
]
# preserve previous values where incoming values are blank-ish
for c in attr_cols:
select_exprs.append(
F.when(
(F.col(f"n.{c}") == F.lit("∅")) | F.col(f"n.{c}").isNull(),
F.col(f"e.{c}")
).otherwise(F.col(f"n.{c}")).alias(c)
)
return joined.select(*select_exprs)
def dedup_state_fallback(df_state):
"""
Deduplicate ingestion storms and repeated arrivals for state rows.
Same bk + attr_hash = same belief version in fallback mode,
because no reliable source time exists.
"""
w_dup = W.partitionBy("bk", "attr_hash")
return (
df_state
.withColumn("first_seen_ts", F.min("ingest_ts").over(w_dup))
.withColumn("last_seen_ts", F.max("ingest_ts").over(w_dup))
.withColumn("rn", F.row_number().over(w_dup.orderBy(F.col("ingest_ts").desc(), F.col("ingest_run_id").desc())))
.filter(F.col("rn") == 1)
.drop("rn")
)
def rebuild_scd2_fallback(df_union):
    """
    Rebuilds platform-time SCD2 history.
    Ordering is by ingestion time, because source time is unavailable.
    This is operationally consistent but weaker than source-time Bronze.
    """
    order_w = W.partitionBy("bk").orderBy(
        F.col("effective_from").asc(),
        F.col("source_file").asc_nulls_last(),
        F.col("ingest_run_id").asc_nulls_last()
    )
    df = (
        df_union
        .withColumn("prev_hash", F.lag("attr_hash").over(order_w))
        .withColumn(
            "is_new_version",
            F.when(
                F.col("prev_hash").isNull() | (F.col("prev_hash") != F.col("attr_hash")),
                F.lit(1)
            ).otherwise(F.lit(0))
        )
        .withColumn(
            "version_grp",
            F.sum("is_new_version").over(order_w.rowsBetween(W.unboundedPreceding, 0))
        )
    )
    latest_grp_w = W.partitionBy("bk", "version_grp").orderBy(
        F.col("ingest_ts").desc(), F.col("ingest_run_id").desc()
    )
    df_collapsed = (
        df
        .withColumn("grp_effective_from", F.min("effective_from").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_first_seen_ts", F.min("first_seen_ts").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_last_seen_ts", F.max("last_seen_ts").over(W.partitionBy("bk", "version_grp")))
        .withColumn("grp_rn", F.row_number().over(latest_grp_w))
        .filter(F.col("grp_rn") == 1)
        .drop("grp_rn", "prev_hash", "is_new_version")
        .withColumn("effective_from", F.col("grp_effective_from"))
        .withColumn("first_seen_ts", F.col("grp_first_seen_ts"))
        .withColumn("last_seen_ts", F.col("grp_last_seen_ts"))
        .drop("grp_effective_from", "grp_first_seen_ts", "grp_last_seen_ts")
    )
    final_w = W.partitionBy("bk").orderBy(F.col("effective_from").asc())
    return (
        df_collapsed
        .withColumn("effective_to", F.lead("effective_from").over(final_w))
        .withColumn("is_current", F.col("effective_to").isNull())
    )
def process_bronze_without_source_time(df_in, bronze_table, bk_cols, attr_cols):
    """
    Main fallback ingestion function.
    Guarantees:
    - ingestion-time ordering
    - duplicate suppression
    - replay-safe ingestion
    - partial-load protection (where declared)
    - transaction/event separation
    """
    df = prepare_fallback_inputs(df_in, bk_cols, attr_cols)

    # Separate transactions from state
    df_tx = df.filter(F.col("entity_type") == "transaction")
    df_state_new = df.filter(F.col("entity_type") == "state")

    # Process transactions: insert if unseen, update last_seen if seen
    df_tx_final = dedup_transactions_fallback(df_tx)

    # Identify impacted business keys for state
    impacted_keys = df_state_new.select("bk").distinct()
    existing_state = (
        spark.read.format("delta").table(bronze_table)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, on="bk", how="inner")
    )

    # For declared partial loads, preserve prior attribute values
    df_state_partial = df_state_new.filter(F.col("load_type") == "PARTIAL")
    df_state_full = df_state_new.filter(F.col("load_type") != "PARTIAL")

    if df_state_partial.limit(1).count() > 0:  # or: if not df_state_partial.rdd.isEmpty()
        latest_existing = (
            existing_state
            .withColumn(
                "rn_latest",
                F.row_number().over(W.partitionBy("bk").orderBy(F.col("effective_from").desc()))
            )
            .filter(F.col("rn_latest") == 1)
            .drop("rn_latest")
        )
        df_state_partial = merge_partial_state(df_state_partial, latest_existing, attr_cols)
        df_state_partial = add_identity_and_hashes_no_source_time(df_state_partial, bk_cols, attr_cols)

    df_state_prepared = df_state_full.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_fallback(df_state_prepared)

    # Union with existing and rebuild platform-time SCD2
    df_state_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_fallback(df_state_union)

    # Final combined output
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)
    return df_final
12.2.5 What this fallback pattern achieves
This gives you:
- platform-time ordering, because `effective_from = ingest_ts`
- duplicate suppression, because repeated identical records collapse via `bk + attr_hash`
- replay-safe ingestion, because repeated arrivals update `last_seen_ts` rather than creating spurious history
- partial-feed safety, because declared partial loads preserve prior attribute values
- transaction/event handling, because immutable events are not falsely versioned as SCD2 dimensions
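The duplicate-suppression behaviour can be illustrated without Spark. The following is a minimal pure-Python sketch (a hypothetical `collapse_duplicates` helper over plain dicts, not the production PySpark code): repeated `(bk, attr_hash)` arrivals collapse to a single belief version and only widen the seen-window.

```python
from collections import OrderedDict

def collapse_duplicates(observations):
    """Collapse repeated (bk, attr_hash) arrivals into one belief version.

    Each observation is a dict with bk, attr_hash, and a sortable ingest_ts.
    Replayed identical records only widen first_seen_ts / last_seen_ts;
    they never create a new version.
    """
    collapsed = OrderedDict()
    for obs in observations:
        key = (obs["bk"], obs["attr_hash"])
        if key not in collapsed:
            collapsed[key] = {**obs,
                              "first_seen_ts": obs["ingest_ts"],
                              "last_seen_ts": obs["ingest_ts"]}
        else:
            row = collapsed[key]
            row["first_seen_ts"] = min(row["first_seen_ts"], obs["ingest_ts"])
            row["last_seen_ts"] = max(row["last_seen_ts"], obs["ingest_ts"])
    return list(collapsed.values())

storm = [
    {"bk": "C123", "attr_hash": "h1", "ingest_ts": "2026-03-01T09:00"},
    {"bk": "C123", "attr_hash": "h1", "ingest_ts": "2026-03-01T09:05"},  # replayed delivery
    {"bk": "C123", "attr_hash": "h2", "ingest_ts": "2026-03-02T10:00"},  # genuine change
]
rows = collapse_duplicates(storm)
# two belief versions survive; the replay only moved last_seen_ts forward
```

The window functions in `dedup_state_fallback` express the same idea set-wise over the whole DataFrame.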
12.2.6 Limitations and audit caveats
This pattern is useful, but weaker than source-time Bronze.
It can prove:
- what the platform received
- when the platform received it
- which file/system/run delivered it
- how the platform versioned belief from that point onward
It cannot prove:
- what the source actually believed at a historical moment prior to ingestion
- whether a late full snapshot was older or newer in source terms
- whether out-of-order arrivals reflect source chronology or pipeline delay
- true source-time point-in-time reconstruction
The platform is therefore preserving:
platform-time belief, not source-time belief
That distinction must be documented and visible to audit, risk, and regulators.
12.3 When You Must Reject a Source as Non-Defensible
Some sources are not merely “messy”. They are architecturally non-defensible for regulated use-cases unless their limitations are explicitly quarantined.
A source should be rejected as a source of regulator-defensible truth if it cannot substantiate the minimum controls required for temporal reconstruction.
This does not necessarily mean the source cannot be ingested. It means it must not be treated as capable of supporting:
- point-in-time reconstruction
- authoritative belief replay
- evidentiary decision reconstruction
12.3.1 No source-time
If a source cannot provide any reliable notion of:
- event timestamp
- effective timestamp
- extract “as-of” time
- ordered change sequence
then the platform cannot reconstruct source-time belief.
In that case, the source may still be ingested in fallback mode, but must be classified as:
operationally useful, temporally weak
It must not be presented as regulator-grade temporal evidence.
12.3.2 Ambiguous partial snapshots
A source should be treated as non-defensible if it sends partial updates without making clear:
- whether the payload is partial or full
- whether omitted fields are unchanged, null, or deleted
- whether row absence implies deletion
If partiality is ambiguous, then:
- missing fields can cause false deletes
- missing rows can cause false absence
- downstream belief becomes unrecoverable
A source without explicit load_type or equivalent semantics cannot safely drive Bronze truth in regulated domains.
12.3.3 No stable business key
If the source cannot provide a stable identity for the entity or event, then Bronze cannot:
- group assertions correctly
- deduplicate deterministically
- reconstruct timelines reliably
- distinguish replay from genuine change
In practical terms, if there is no stable bk, there is no safe SCD2.
You can still land such data as evidence, but you cannot build defensible entity history from it.
12.3.4 No delete semantics
If the source cannot distinguish between:
- deletion
- omission
- partial payload
- not-included-in-this-extract
then the platform cannot safely model lifecycle state.
This means:
- active vs inactive state becomes guesswork
- Silver current-state views become unstable
- historical belief cannot be defended
In regulated contexts, ambiguous deletion semantics are a hard control weakness.
12.3.5 Practical reject / quarantine logic
The easiest way to operationalise this is to score the source contract before allowing it into the “defensible Bronze” path.
from pyspark.sql import functions as F
def assess_source_defensibility(df_contract):
    """
    Example contract assessment input:
    one-row dataframe or config-derived dataframe with booleans such as:
    has_source_time, has_stable_bk, has_explicit_load_type, has_delete_semantics
    """
    return (
        df_contract
        .withColumn(
            "defensibility_status",
            F.when(
                (F.col("has_source_time") == True) &
                (F.col("has_stable_bk") == True) &
                (F.col("has_explicit_load_type") == True) &
                (F.col("has_delete_semantics") == True),
                F.lit("DEFENSIBLE")
            ).when(
                (F.col("has_stable_bk") == True),
                F.lit("LIMITED_FALLBACK_ONLY")
            ).otherwise(
                F.lit("REJECT_FOR_AUTHORITATIVE_BRONZE")
            )
        )
        .withColumn(
            "defensibility_reason",
            F.when(F.col("has_stable_bk") != True, F.lit("No stable business key"))
            .when(F.col("has_source_time") != True, F.lit("No reliable source time"))
            .when(F.col("has_explicit_load_type") != True, F.lit("No explicit load semantics"))
            .when(F.col("has_delete_semantics") != True, F.lit("No explicit delete semantics"))
            .otherwise(F.lit("Meets minimum temporal evidentiary standard"))
        )
    )
Example usage:
contract_df = spark.createDataFrame(
    [
        (False, True, False, False)
    ],
    ["has_source_time", "has_stable_bk", "has_explicit_load_type", "has_delete_semantics"]
)
assess_source_defensibility(contract_df).show(truncate=False)
Possible outputs:
- `DEFENSIBLE`
- `LIMITED_FALLBACK_ONLY`
- `REJECT_FOR_AUTHORITATIVE_BRONZE`
12.3.6 The architectural rule
A source may be ingested without source-time.
It may not be presented as regulator-defensible historical truth unless it can substantiate temporal ordering, stable identity, and deletion semantics.
This is the architectural boundary.
Without it, weak sources contaminate strong Bronze and undermine the entire point of preserving temporal truth.
12.4 CDC Microbatch Ingestion Pattern (State + Transaction Handling)
Change Data Capture (CDC) is one of the most common ingestion patterns in modern Financial Services platforms. It introduces a continuous stream of change events — inserts, updates, deletes — typically delivered in microbatches via Kafka, file drops, or replication tooling.
In a Bronze SCD2 architecture, CDC must be treated as:
a stream of assertions, not a sequence of instructions
Each CDC event represents:
- what the source asserted
- when it asserted it
- under what context
The role of Bronze is not to “apply” these changes, but to capture them as evidence, and construct a defensible temporal model from them.
This section extends the preferred source-time pattern (12.1) to CDC ingestion.
12.4.1 Inputs and Assumptions
Assume each microbatch contains CDC events with fields such as:
- business key
- mutable attributes (often partial)
- operation type (`INSERT`, `UPDATE`, `DELETE`)
- source timestamp or sequence (`source_event_ts`)
- ingestion timestamp (`ingest_ts`)
- optional before/after images
Key assumptions:
- `source_event_ts` is reliable (or at least monotonic within source scope)
- CDC payloads are partial by nature
- entity type (`state` vs `transaction`) is known or derivable
- load type is implicitly `PARTIAL`
12.4.2 CDC Op Mapping Rules
CDC events must be translated into state assertions, not executed as mutations.
12.4.2.1 Insert (INSERT / c)
- Represents a new state assertion
- Create new version with `effective_from = source_event_ts`
12.4.2.2 Update (UPDATE / u)
- Represents a changed assertion
- Payload is often partial
- Must reconstruct full state using prior valid version (see 12.1.3)
- Create new SCD2 version
12.4.2.3 Delete (DELETE / d)
- Represents a deletion assertion
- Create new version with `is_deleted = true`
- Do not remove prior history
12.4.2.4 Before / After Images
If available:
- use after image as the assertion
- optionally retain before image for forensic/audit use (not required for SCD2 logic)
If not available:
- treat as partial update
- apply merge logic from 12.1.3
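The mapping rules above can be sketched in plain Python. This is a hypothetical `map_cdc_event` helper over a Debezium-like event dict (`op`, `before`, `after`, `source_event_ts` are assumed field names), not the production pipeline; it shows that every operation, including a delete, becomes an assertion rather than a mutation.

```python
def map_cdc_event(event):
    """Translate one CDC event into a state assertion.

    event: {"op": "c"|"u"|"d", "source_event_ts": ..., "before": {...}, "after": {...}}
    Deletes become is_deleted=True versions; nothing is ever removed.
    """
    op = event["op"]
    if op == "d":
        # delete events usually carry no after-image, so take identity
        # and attributes from the before-image
        payload = event.get("before") or {}
        return {**payload,
                "is_deleted": True,
                "effective_from": event["source_event_ts"],
                "partial": True}
    payload = event.get("after") or {}
    return {**payload,
            "is_deleted": False,
            "effective_from": event["source_event_ts"],
            # inserts carry full state; updates are treated as partial
            "partial": op != "c"}

delete_evt = {"op": "d", "source_event_ts": "2026-03-05T08:30",
              "before": {"customer_id": "C123"}, "after": None}
assertion = map_cdc_event(delete_evt)
# a delete is just another temporally ordered assertion
```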
12.4.3 Microbatch Processing Flow
CDC ingestion should use Structured Streaming with foreachBatch, allowing explicit control over:
- ordering
- deduplication
- provenance
- replay behaviour
The correct pattern is:
- parse CDC payload
- standardise + assign metadata
- map CDC ops to state assertions
- reconstruct partial state (where required)
- deduplicate observations
- rebuild SCD2 timeline for impacted keys
- append transaction events
- write Bronze
12.4.4 Reference PySpark Implementation
def process_cdc_microbatch(microbatch_df, batch_id):
    # 1. Parse CDC structure (example assumes Debezium-style payload)
    df = (
        microbatch_df
        .withColumn("op", F.col("payload.op"))
        # ts_ms is epoch milliseconds; scale to seconds before casting
        .withColumn("source_event_ts", (F.col("payload.source.ts_ms") / 1000).cast("timestamp"))
        # delete events carry no after-image, so fall back to the before-image
        .withColumn("bk", F.coalesce(F.col("payload.after.customer_id"),
                                     F.col("payload.before.customer_id")))
        # extract attributes from payload.after as required
    )

    # 2. Prepare inputs (source-time mode)
    df = prepare_inputs_with_source_time(df, BK_COLS, ATTR_COLS, EVENT_TS_COL)

    # 3. Map CDC operations to state assertions
    df_state = (
        df.filter(F.col("entity_type") == "state")
        .withColumn(
            "is_deleted",
            F.when(F.col("op") == "d", F.lit(True)).otherwise(F.col("is_deleted"))
        )
    )
    df_tx = df.filter(F.col("entity_type") == "transaction")

    # 4. Identify impacted keys
    impacted_keys = df_state.select("bk").distinct()
    existing_state = (
        spark.read.table(BRONZE_TABLE)
        .filter(F.col("entity_type") == "state")
        .join(impacted_keys, "bk")
    )

    # 5. Split insert vs partial (updates + deletes are partial)
    df_state_insert = df_state.filter(F.col("op") == "c")
    df_state_partial = df_state.filter(F.col("op") != "c")

    # 6. Reconstruct partial state at source time
    if df_state_partial.limit(1).count() > 0:
        df_state_partial = merge_partial_state_with_source_time(
            df_state_partial,
            existing_state,
            ATTR_COLS,
            EVENT_TS_COL
        )
    df_state_partial = add_identity_and_hashes(df_state_partial, BK_COLS, ATTR_COLS)

    # 7. Combine + deduplicate
    df_state_prepared = df_state_insert.unionByName(df_state_partial, allowMissingColumns=True)
    df_state_prepared = dedup_state_with_source_time(df_state_prepared)

    # 8. Rebuild source-time timeline
    df_union = existing_state.unionByName(df_state_prepared, allowMissingColumns=True)
    df_state_final = rebuild_scd2_timeline_with_source_time(df_union)

    # 9. Transactions (append-only, replay-safe)
    df_tx_final = dedup_transactions_with_source_time(df_tx)
    df_final = df_state_final.unionByName(df_tx_final, allowMissingColumns=True)

    # 10. Write Bronze
    df_final.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
12.4.4.1 Important: Append vs Rebuild
CDC often tempts teams into row-level mutation:
“apply update, update row, delete row”
That is incorrect for Bronze.
The correct model is:
- append assertions
- rebuild timelines
- derive state
Even under streaming.
The use of foreachBatch is deliberate:
It allows each microbatch to behave like a controlled batch rebuild for impacted keys, preserving determinism.
12.4.5 Why This Fits Regulated Financial Services
This pattern preserves the properties regulators actually care about.
12.4.5.1 Full evidentiary capture
Every CDC event is retained with:
- source timestamp
- operation type
- provenance
12.4.5.2 Out-of-sequence safety
Late or replayed CDC events are:
- inserted into correct temporal position
- not appended blindly
12.4.5.3 Replay and idempotency
Duplicate CDC deliveries:
- collapse via hashing
- update `last_seen_ts`
- do not create false versions
12.4.5.4 Correct handling of partial change
CDC payloads are inherently partial. The merge logic ensures:
- no accidental null overwrites
- no silent data loss
12.4.5.5 Separation of fact vs belief
- state → SCD2 versioned belief
- transactions → append-only facts
12.4.5.6 Limitations and Caveats
CDC is powerful, but not automatically safe.
12.4.5.6.1 Source-time quality is critical
If source_event_ts is:
- missing
- inconsistent
- non-monotonic
then the model degrades to ingestion-time belief.
12.4.5.6.2 Ordering guarantees vary by source
Different CDC systems provide different guarantees:
- some guarantee ordering per key
- others only per partition
- some allow reordering under failure
Tie-breakers must still be applied.
12.4.5.6.3 High-volume considerations
Large CDC streams require:
- partitioning on `bk`
- clustering by `source_event_ts`
- periodic compaction
12.4.5.6.4 Deletes must be explicit
Some CDC feeds omit deletes or encode them ambiguously.
If delete semantics are unclear, the source becomes:
non-defensible for lifecycle modelling
12.4.5.6.5 The Core Principle
CDC does not simplify Bronze.
It amplifies the need to get Bronze right.
You are not applying changes.
You are capturing a stream of assertions and constructing a defensible history from them.
That is what makes the system:
- replayable
- explainable
- regulator-grade
12.4.6 Worked CDC Example: Insert, Partial Update, Delete, and Out-of-Order Arrival
To make the pattern concrete, consider a simple customer record flowing from a CDC source.
Assume the business key is:
customer_id = C123
And the mutable attributes are:
- `name`
- `address`
- `status`
- `is_deleted`
Assume the source emits CDC events with:
- `op` (`c`, `u`, `d`)
- `source_event_ts`
- partial payloads for updates
- provenance metadata
12.4.6.1 The Source Assertions
The source emits four events:
12.4.6.1.1 Event 1 — Initial insert
| Field | Value |
|---|---|
| op | c |
| source_event_ts | 2026-03-01 09:00:00 |
| customer_id | C123 |
| name | Jane Carter |
| address | 12 Market Street |
| status | Active |
| is_deleted | false |
12.4.6.1.2 Event 2 — Partial update
Only the address changes.
| Field | Value |
|---|---|
| op | u |
| source_event_ts | 2026-03-03 10:00:00 |
| customer_id | C123 |
| address | 18 King Street |
Because this is CDC, the payload is partial. It does not repeat name, status, or is_deleted.
12.4.6.1.3 Event 3 — Delete
The source later deletes the customer.
| Field | Value |
|---|---|
| op | d |
| source_event_ts | 2026-03-05 08:30:00 |
| customer_id | C123 |
This is interpreted as a new version with is_deleted = true.
12.4.6.1.4 Event 4 — Late-arriving update
Now imagine an earlier update arrives late due to replication lag.
| Field | Value |
|---|---|
| op | u |
| source_event_ts | 2026-03-02 15:00:00 |
| customer_id | C123 |
| status | Restricted |
This arrives after Events 2 and 3 have already been processed.
That is the real test.
12.4.6.2 Step 1: What Bronze Receives
Bronze receives all four assertions as evidence, with provenance and ingestion metadata.
The crucial point is this:
Bronze does not discard Event 4 just because it arrived late.
Bronze inserts it into the correct place in the source-time timeline.
12.4.6.3 Step 2: Partial Reconstruction
Event 2 and Event 4 are partial updates, so the platform reconstructs the full row as of their source_event_ts.
12.4.6.3.1 Event 2 reconstructed full state
| Attribute | Value |
|---|---|
| name | Jane Carter |
| address | 18 King Street |
| status | Active |
| is_deleted | false |
12.4.6.3.2 Event 4 reconstructed full state
| Attribute | Value |
|---|---|
| name | Jane Carter |
| address | 12 Market Street |
| status | Restricted |
| is_deleted | false |
This is why partial updates must be merged against prior valid state before hashing and versioning.
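A minimal sketch of that merge in plain Python (a hypothetical `reconstruct_full_state` helper over dicts; the real pipeline does this with the window-based merge shown earlier, and real null-vs-absent semantics depend on the source contract):

```python
def reconstruct_full_state(prior_state, partial_payload):
    """Overlay a partial CDC payload on the prior valid full state.

    Attributes absent from the payload keep their prior values, so a
    partial update never nulls out omitted fields before hashing.
    """
    merged = dict(prior_state)
    for attr, value in partial_payload.items():
        if value is not None:  # simplification: treat None as "not asserted"
            merged[attr] = value
    return merged

prior = {"name": "Jane Carter", "address": "12 Market Street",
         "status": "Active", "is_deleted": False}
event_2 = {"address": "18 King Street"}   # partial update: address only
full_2 = reconstruct_full_state(prior, event_2)
event_4 = {"status": "Restricted"}        # late partial update: status only
full_4 = reconstruct_full_state(prior, event_4)
```

Hashing `full_2` and `full_4` (rather than the raw payloads) is what keeps version detection correct.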
12.4.6.4 Step 3: Correct Source-Time Ordering
Even if the events arrive in this ingestion order:
- Event 1
- Event 2
- Event 3
- Event 4 (late)
the Bronze timeline is rebuilt in this source-time order:
- Event 1 — `2026-03-01 09:00:00`
- Event 4 — `2026-03-02 15:00:00`
- Event 2 — `2026-03-03 10:00:00`
- Event 3 — `2026-03-05 08:30:00`
That means effective_from, effective_to, and is_current are recomputed accordingly.
12.4.6.5 Step 4: Final Bronze SCD2 History
The correct Bronze history becomes:
| customer_id | effective_from | effective_to | name | address | status | is_deleted | is_current |
|---|---|---|---|---|---|---|---|
| C123 | 2026-03-01 09:00:00 | 2026-03-02 15:00:00 | Jane Carter | 12 Market Street | Active | false | false |
| C123 | 2026-03-02 15:00:00 | 2026-03-03 10:00:00 | Jane Carter | 12 Market Street | Restricted | false | false |
| C123 | 2026-03-03 10:00:00 | 2026-03-05 08:30:00 | Jane Carter | 18 King Street | Active | false | false |
| C123 | 2026-03-05 08:30:00 | null | Jane Carter | 18 King Street | Active | true | true |
Notice what happened:
- the late-arriving Event 4 was inserted historically
- the prior and subsequent rows were resequenced
- the delete is represented as a new version, not row removal
- `is_current` remains correct
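The resequencing can be sketched in plain Python (no Spark; hypothetical structures, taking the reconstructed full states as inputs): sort the versions by source time, then derive `effective_to` and `is_current` from each row's successor.

```python
def rebuild_timeline(versions):
    """Rebuild effective_from / effective_to / is_current in source-time order.

    Each version dict carries its source-time effective_from; a late
    arrival simply sorts into place and the chain is recomputed.
    """
    rows = sorted(versions, key=lambda v: v["effective_from"])
    for current, nxt in zip(rows, rows[1:]):
        current["effective_to"] = nxt["effective_from"]
        current["is_current"] = False
    rows[-1]["effective_to"] = None
    rows[-1]["is_current"] = True
    return rows

versions = [  # ingestion order: Event 1, Event 2, Event 3, then late Event 4
    {"effective_from": "2026-03-01 09:00:00", "status": "Active",     "is_deleted": False},
    {"effective_from": "2026-03-03 10:00:00", "status": "Active",     "is_deleted": False},
    {"effective_from": "2026-03-05 08:30:00", "status": "Active",     "is_deleted": True},
    {"effective_from": "2026-03-02 15:00:00", "status": "Restricted", "is_deleted": False},
]
timeline = rebuild_timeline(versions)
# the late Event 4 slots in second; only the delete row is current
```

The PySpark `lead`-over-window in the reference implementation performs the same recomputation per business key.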
12.4.6.6 Step 5: What Silver Sees
Silver does not need to solve this complexity. It simply reads the Bronze history.
12.4.6.6.1 Current-state Silver
SELECT *
FROM bronze_customer
WHERE is_current = true
AND is_deleted = false
Result:
- no active record returned for `C123`
Because the customer is now deleted.
12.4.6.6.2 Point-in-time Silver as of 2026-03-02 16:00:00
SELECT *
FROM bronze_customer
WHERE effective_from <= TIMESTAMP '2026-03-02 16:00:00'
AND (effective_to > TIMESTAMP '2026-03-02 16:00:00' OR effective_to IS NULL)
AND is_deleted = false
Result:
| customer_id | name | address | status |
|---|---|---|---|
| C123 | Jane Carter | 12 Market Street | Restricted |
Which is exactly the belief the platform should reconstruct at that time.
12.4.6.7 Why This Example Matters
This example shows four critical truths:
12.4.6.7.1 1. CDC updates are assertions, not commands
The platform does not “apply” an update blindly. It evaluates it as evidence.
12.4.6.7.2 2. Partial updates must be reconstructed
Without reconstruction, Event 2 and Event 4 would null out attributes and corrupt the hash.
12.4.6.7.3 3. Late events must repair history
Without source-time rebuild, Event 4 would arrive “too late” and produce the wrong current and PIT state.
12.4.6.7.4 4. Deletes are state transitions
The delete is not a row removal. It is a final, explicit belief version.
12.4.6.8 The Core Lesson
If Bronze is implemented correctly, even messy CDC behaviour becomes manageable:
- partial updates do not lose state
- late arrivals do not corrupt timelines
- deletes do not erase evidence
- Silver remains simple
That is exactly what “land it early, manage it early” means in practice.
12.4.7 Worked Example: Multi-Source CDC Conflict with Precedence
In real Financial Services systems, multiple sources assert state about the same entity. These assertions may:
- overlap in time
- conflict in value
- arrive out of sequence
Bronze must capture all of them.
Silver must decide what to believe.
12.4.7.1 Scenario
We have two systems:
| Source | Description | Precedence |
|---|---|---|
| CRM | Customer-facing system | 1 (lower) |
| CORE | Core banking system | 2 (higher) |
Assume:
- CORE is the authoritative source for status
- CRM is authoritative for contact data (not shown here)
We focus on:
- `customer_id`
- `status`
12.4.7.2 The Source Assertions
12.4.7.2.1 Event 1 — CRM creates customer
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| source_event_ts | 2026-03-01 09:00 |
| status | Active |
12.4.7.2.2 Event 2 — CORE sets status to Restricted
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| source_event_ts | 2026-03-02 10:00 |
| status | Restricted |
12.4.7.2.3 Event 3 — CRM updates status (conflict)
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| source_event_ts | 2026-03-03 09:00 |
| status | Active |
This directly conflicts with CORE.
12.4.7.2.4 Event 4 — CORE clears restriction (late arrival)
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| source_event_ts | 2026-03-02 18:00 |
| status | Active |
This arrives after Event 3, but is earlier in source time.
12.4.7.3 Step 1: Bronze Stores All Assertions
Bronze does not resolve conflicts.
It produces a full SCD2 history ordered by:
source_event_ts ASC
precedence_rank DESC
12.4.7.3.1 Correct Bronze Timeline
| customer_id | source | precedence | effective_from | status |
|---|---|---|---|---|
| C123 | CRM | 1 | 2026-03-01 09:00 | Active |
| C123 | CORE | 2 | 2026-03-02 10:00 | Restricted |
| C123 | CORE | 2 | 2026-03-02 18:00 | Active |
| C123 | CRM | 1 | 2026-03-03 09:00 | Active |
Notice:
- Event 4 (CORE late arrival) is inserted correctly between Events 2 and 3
- CRM conflict is preserved, not removed
12.4.7.4 Step 2: Bronze is NOT “the truth”
At this point, Bronze contains:
- conflicting assertions
- overlapping beliefs
- multiple valid versions
This is correct.
Bronze answers: what was said, when, and by whom
12.4.7.5 Step 3: Silver Applies Precedence
Silver resolves conflicts using:
- Highest `precedence_rank` wins
- If equal precedence → latest `source_event_ts` wins
Evaluate at Different Points in Time
12.4.7.5.1 As of 2026-03-02 12:00
Available assertions:
- CRM → Active (lower precedence)
- CORE → Restricted (higher precedence)
Silver result:
| status |
|---|
| Restricted |
12.4.7.5.2 As of 2026-03-02 19:00
Available assertions:
- CORE → Active (Event 4)
- CORE → Restricted (Event 2)
Same precedence → latest time wins.
Silver result:
| status |
|---|
| Active |
12.4.7.5.3 As of 2026-03-03 10:00
Available assertions:
- CRM → Active (latest in time, lower precedence)
- CORE → Active (earlier, higher precedence)
Key rule: precedence overrides time
So:
Silver result:
| status |
|---|
| Active (from CORE) |
Even though CRM is later, CORE still governs.
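That rule can be sketched in plain Python (hypothetical structures; a simplification of the SQL pattern in 12.4.7.8): among assertions visible at the as-at time, the highest precedence wins, with latest source time as the tie-breaker.

```python
def resolve_status(assertions, as_at):
    """Pick the winning assertion at a point in time:
    highest precedence_rank first, then latest source_event_ts."""
    visible = [a for a in assertions if a["source_event_ts"] <= as_at]
    return max(visible, key=lambda a: (a["precedence_rank"], a["source_event_ts"]))

assertions = [
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-01 09:00", "status": "Active"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 10:00", "status": "Restricted"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 18:00", "status": "Active"},
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-03 09:00", "status": "Active"},
]
at_1 = resolve_status(assertions, "2026-03-02 12:00")  # CORE Restricted wins
at_2 = resolve_status(assertions, "2026-03-02 19:00")  # latest CORE assertion wins
at_3 = resolve_status(assertions, "2026-03-03 10:00")  # CORE still outranks later CRM
```

Note the simplification: this scans all prior assertions rather than the SQL's effective-window filter, but it yields the same answers for this scenario.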
12.4.7.6 Step 4: Why This Works
12.4.7.6.1 1. Time is preserved independently of precedence
Bronze ordering is always:
when was this asserted?
not:
which source wins?
12.4.7.6.2 2. Precedence is applied at query time (Silver)
This allows:
- rules to evolve
- conflicts to be reinterpreted
- no loss of evidence
12.4.7.6.3 3. Late-arriving data is safe
Event 4:
- arrives late
- is inserted correctly
- affects downstream belief
Without rebuild logic, this would break.
12.4.7.6.4 4. Conflicts remain explainable
You can always answer:
- “CRM said Active at X”
- “CORE said Restricted at Y”
- “We chose CORE because of precedence”
That is exactly what regulators expect.
12.4.7.7 Step 5: What NOT to Do
12.4.7.7.1 Do NOT resolve precedence in Bronze
If you drop lower-precedence rows:
- you lose evidence
- you cannot explain decisions
- you cannot re-run logic
12.4.7.7.2 Do NOT overwrite based on precedence
This collapses:
- multiple assertions → single row
- history → illusion
12.4.7.7.3 Do NOT treat “latest wins” as universal
That only works in:
- single-source systems
- non-regulated environments
12.4.7.8 Step 6: Example Silver Query Pattern
A simplified Silver resolution might look like:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
)
SELECT *
FROM ranked
WHERE rn = 1
This is where:
- precedence
- time
- business rules
come together.
12.4.7.9 The Core Insight
This example demonstrates the most important architectural separation in the entire design:
Bronze preserves competing assertions.
Silver expresses institutional belief.
If you collapse those layers, you lose:
- auditability
- flexibility
- defensibility
12.4.7.10 Final Takeaway
Multi-source CDC is not a problem to eliminate.
It is a reality to model correctly.
If you:
- preserve all assertions (Bronze)
- order them correctly (source time)
- resolve them explicitly (Silver)
then even conflicting, out-of-sequence, multi-source data becomes:
deterministic, explainable, and regulator-grade
12.4.8 Worked Example: CDC + Partial + Precedence + Delete Across Two Sources
The hardest real-world cases are not single-source updates. They are situations where:
- multiple systems assert state about the same entity
- one source is higher precedence than another
- updates are partial
- events arrive out of sequence
- one source later deletes the record
This example shows how Bronze and Silver behave under exactly those conditions.
12.4.8.1 Scenario
We have one customer:
customer_id = C123
Two source systems assert state about that customer:
| Source | Role | Precedence |
|---|---|---|
| CRM | customer-facing workflow system | 1 (lower) |
| CORE | core banking / operational authority | 2 (higher) |
We care about these mutable attributes:
- `name`
- `address`
- `status`
- `is_deleted`
Assume:
- CORE is authoritative for `status`
- CRM is authoritative for contact workflow, but still contributes assertions
- all events are CDC-style partial or full assertions
- Bronze stores all assertions
- Silver resolves current belief
12.4.8.2 The CDC Events
12.4.8.2.1 Event 1 — CRM creates the customer
A full create arrives from CRM.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | c |
| source_event_ts | 2026-03-01 09:00 |
| name | Jane Carter |
| address | 12 Market Street |
| status | Active |
| is_deleted | false |
This becomes the first Bronze version.
12.4.8.2.2 Event 2 — CORE partially updates status
A partial CDC update arrives from CORE.
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| op | u |
| source_event_ts | 2026-03-02 10:00 |
| status | Restricted |
This is partial: only status is present.
To version it correctly, Bronze must reconstruct full state as of 2026-03-02 10:00:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 12 Market Street | Restricted | false |
12.4.8.2.3 Event 3 — CRM partially updates address
CRM later changes the address.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | u |
| source_event_ts | 2026-03-03 09:00 |
| address | 18 King Street |
Again, this is partial. Bronze reconstructs:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 18 King Street | Active | false |
Notice something important:
- CRM has no authority over CORE’s status assertion
- Bronze still stores CRM’s assertion exactly as it was made
- Silver will later decide what to believe
12.4.8.2.4 Event 4 — CORE clears the restriction, but arrives late
Now a CORE event arrives after Event 3 was processed, but with an earlier source time.
| Field | Value |
|---|---|
| source_system | CORE |
| precedence_rank | 2 |
| op | u |
| source_event_ts | 2026-03-02 18:00 |
| status | Active |
This is late-arriving and partial.
Bronze reconstructs:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 12 Market Street | Active | false |
And then inserts it into the correct source-time position, not the arrival position.
12.4.8.2.5 Event 5 — CRM delete arrives later
CRM later emits a delete.
| Field | Value |
|---|---|
| source_system | CRM |
| precedence_rank | 1 |
| op | d |
| source_event_ts | 2026-03-04 12:00 |
Bronze models this as a new version:
| name | address | status | is_deleted |
|---|---|---|---|
| Jane Carter | 18 King Street | Active | true |
But this is still only a CRM assertion.
That matters.
12.4.8.3 Step 1: Bronze Stores All Assertions in Source-Time Order
Even if the actual ingestion order is:
- Event 1
- Event 2
- Event 3
- Event 5
- Event 4 (late)
Bronze must rebuild history into this source-time order:
| Seq | source_system | precedence | source_event_ts | name | address | status | is_deleted |
|---|---|---|---|---|---|---|---|
| 1 | CRM | 1 | 2026-03-01 09:00 | Jane Carter | 12 Market Street | Active | false |
| 2 | CORE | 2 | 2026-03-02 10:00 | Jane Carter | 12 Market Street | Restricted | false |
| 3 | CORE | 2 | 2026-03-02 18:00 | Jane Carter | 12 Market Street | Active | false |
| 4 | CRM | 1 | 2026-03-03 09:00 | Jane Carter | 18 King Street | Active | false |
| 5 | CRM | 1 | 2026-03-04 12:00 | Jane Carter | 18 King Street | Active | true |
This is the Bronze timeline.
Bronze has done no precedence resolution. It has simply preserved assertions.
12.4.8.4 Step 2: Bronze SCD2 History
If represented as Bronze rows, it would look like:
| customer_id | source_system | precedence_rank | effective_from | effective_to | address | status | is_deleted | is_current |
|---|---|---|---|---|---|---|---|---|
| C123 | CRM | 1 | 2026-03-01 09:00 | 2026-03-02 10:00 | 12 Market Street | Active | false | false |
| C123 | CORE | 2 | 2026-03-02 10:00 | 2026-03-02 18:00 | 12 Market Street | Restricted | false | false |
| C123 | CORE | 2 | 2026-03-02 18:00 | 2026-03-03 09:00 | 12 Market Street | Active | false | false |
| C123 | CRM | 1 | 2026-03-03 09:00 | 2026-03-04 12:00 | 18 King Street | Active | false | false |
| C123 | CRM | 1 | 2026-03-04 12:00 | null | 18 King Street | Active | true | true |
This is not yet “the truth”. It is the full evidence chain.
12.4.8.5 Step 3: What Silver Believes at Different Times
Silver now applies precedence.
Rule set:
- Higher `precedence_rank` wins
- If equal precedence, latest source-time wins
- Deletes are only believed if they come from a source authoritative for deletion in that domain
That last point is crucial.
12.4.8.5.1 As of 2026-03-02 12:00
Visible assertions:
- CRM: Active
- CORE: Restricted
Silver result:
| status | address | deleted |
|---|---|---|
| Restricted | 12 Market Street | false |
Why:
- CORE outranks CRM on status
12.4.8.5.2 As of 2026-03-02 19:00
Visible assertions:
- CRM: Active
- CORE: Restricted
- CORE: Active
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 12 Market Street | false |
Why:
- CORE still outranks CRM
- latest CORE assertion is now Active
12.4.8.5.3 As of 2026-03-03 10:00
Visible assertions:
- CRM changed address to 18 King Street
- CORE latest status is Active
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | false |
Why:
- CRM supplies the latest address assertion
- CORE supplies the latest higher-precedence status assertion
This is the key insight:
Silver may compose belief from multiple Bronze assertions across sources and attributes.
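That composition can be sketched in plain Python (a hypothetical `compose_belief` resolver with an assumed attribute-authority map; not the production logic): each attribute is resolved independently, and precedence only applies where the rule for that attribute says it does.

```python
def compose_belief(assertions, as_at, attr_rules):
    """Compose attribute-level belief from competing assertions.

    attr_rules maps attribute -> ordering rule:
      "precedence": highest precedence_rank wins, then latest source time
      "latest":     latest source time wins regardless of source
    Only assertions that actually carry the attribute are considered.
    """
    visible = [a for a in assertions if a["source_event_ts"] <= as_at]
    belief = {}
    for attr, rule in attr_rules.items():
        carrying = [a for a in visible if attr in a]
        if not carrying:
            continue
        if rule == "precedence":
            winner = max(carrying, key=lambda a: (a["precedence_rank"], a["source_event_ts"]))
        else:  # "latest"
            winner = max(carrying, key=lambda a: a["source_event_ts"])
        belief[attr] = winner[attr]
    return belief

assertions = [
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-01 09:00",
     "address": "12 Market Street", "status": "Active"},
    {"source": "CORE", "precedence_rank": 2, "source_event_ts": "2026-03-02 18:00",
     "status": "Active"},
    {"source": "CRM",  "precedence_rank": 1, "source_event_ts": "2026-03-03 09:00",
     "address": "18 King Street"},
]
belief = compose_belief(assertions, "2026-03-03 10:00",
                        {"status": "precedence", "address": "latest"})
# status resolved from CORE, address from the latest CRM assertion
```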
12.4.8.5.4 As of 2026-03-04 13:00
Visible assertions now include CRM delete.
Silver must decide:
- is CRM allowed to delete the entity for this domain?
- or is CORE the authoritative lifecycle source?
12.4.8.5.5 Case A — CRM delete is authoritative
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | true |
The customer is treated as deleted.
12.4.8.5.6 Case B — CRM delete is not authoritative
Silver result:
| status | address | deleted |
|---|---|---|
| Active | 18 King Street | false |
The delete assertion remains preserved in Bronze, but is not adopted as belief.
This is why Bronze must store all conflicts and Silver must resolve them.
12.4.8.6 Step 4: Why This Example Matters
This single example shows all the hard parts working together:
12.4.8.6.1 Partial update safety
Events 2, 3, and 4 are partial.
Without reconstruction, they would null out omitted attributes.
12.4.8.6.2 Out-of-sequence repair
Event 4 arrives late but is inserted into the correct historical position.
12.4.8.6.3 Multi-source precedence
CRM and CORE both assert state.
Silver selects belief without destroying Bronze evidence.
12.4.8.6.4 Deletes as assertions
Delete is not row removal.
It is just another temporally ordered assertion.
12.4.8.6.5 Attribute-level authority
One source may govern status, another address, and a third deletion authority.
That decision belongs in Silver/precedence logic, not Bronze mutation.
12.4.8.7 Step 5: Example Silver Resolution Shape
A simplified Silver query pattern might be:
WITH candidate_rows AS (
SELECT *
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
),
ranked_status AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn_status
FROM candidate_rows
WHERE status IS NOT NULL
),
ranked_address AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY source_event_ts DESC
) AS rn_address
FROM candidate_rows
WHERE address IS NOT NULL
)
SELECT
s.customer_id,
s.status,
a.address,
s.is_deleted
FROM ranked_status s
JOIN ranked_address a
ON s.customer_id = a.customer_id
WHERE s.rn_status = 1
AND a.rn_address = 1
This is illustrative only, but it makes the core point:
Silver resolves belief from Bronze assertions.
It does not pretend one source row is magically the whole truth.
12.4.8.8 The Core Lesson
This is the pattern you need for regulated Financial Services:
- Bronze captures all assertions
- source-time ordering repairs late arrival
- partial updates are reconstructed before hashing
- deletes are explicit state transitions
- precedence is applied downstream, not destructively upstream
That is how you get:
truth without pain in Silver, because the pain was handled properly in Bronze
12.4.8.9 Final Takeaway
When CDC, partial updates, precedence, deletes, and out-of-sequence delivery all collide, the architecture either:
- preserves evidence and remains explainable, or
- collapses into ingestion-order folklore
The difference is whether Bronze is treated as:
a mutable current-state table
or
an append-only, temporally ordered record of institutional belief
Only the second survives scrutiny.
12.4.9 Streaming Reality: Watermarks, Replay, and Exactly-Once Illusions
CDC microbatch ingestion introduces operational concerns that are often misunderstood, particularly in regulated environments.
The key principle is:
Correctness under replay matters more than minimising duplication.
12.4.9.1 Exactly-Once Is Not the Goal
In distributed systems:
- exactly-once delivery is difficult to guarantee
- retries, replays, and duplicate delivery are normal
Instead of attempting to eliminate duplicates, the architecture must be:
idempotent by design
This is achieved through:
- deterministic hashing (attr_hash, record_hash)
- deduplication on (business_key, attr_hash, source_event_ts)
- rebuild logic that collapses identical states
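The idempotency mechanism can be sketched as follows. This is a minimal illustration, assuming a simple dict-based event shape; the hashing and deduplication key mirror the bullets above, but function names are illustrative.

```python
import hashlib
import json

def attr_hash(attrs: dict) -> str:
    """Deterministic hash over tracked attributes: keys sorted, values
    serialised canonically, so identical states always hash identically."""
    canonical = json.dumps(attrs, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(events):
    """Drop exact replays: two deliveries of the same assertion share the
    same (business_key, attr_hash, source_event_ts) and collapse to one
    row, so replaying a feed cannot change outcomes."""
    seen, out = set(), []
    for e in events:
        key = (e["business_key"], attr_hash(e["attrs"]), e["source_event_ts"])
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

batch = [
    {"business_key": "C123", "source_event_ts": "2026-03-01T09:00",
     "attrs": {"status": "Active"}},
    {"business_key": "C123", "source_event_ts": "2026-03-01T09:00",
     "attrs": {"status": "Active"}},  # duplicate delivery (replay)
    {"business_key": "C123", "source_event_ts": "2026-03-02T10:00",
     "attrs": {"status": "Restricted"}},
]
print(len(dedupe(batch)))  # 2
```

Because the dedup key is derived entirely from the event's content and source time, it is stable across retries, replays, and restarts; applying dedupe twice yields the same result as applying it once.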
12.4.9.2 Replay Is a Feature, Not a Failure
In regulated Financial Services, replay is expected:
- backfills
- reprocessing after failure
- regulatory reconstruction
The pipeline must produce identical results when replayed.
This is guaranteed if:
- ordering is based on source_event_ts
- deduplication removes duplicate assertions
- SCD2 rebuild logic is deterministic
12.4.9.3 Watermarking and Late Data
Watermarks are often used to discard late data in streaming systems.
In regulated environments, this is dangerous.
If late data is dropped:
- history becomes incomplete
- PIT reconstruction becomes incorrect
- audit defensibility is compromised
Therefore:
Late data should be processed and repaired, not discarded
If operational constraints require a watermark:
- it must be conservative
- and exceptions must be quarantined, not silently dropped
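A quarantine-not-drop watermark can be sketched as a routing step. This is an illustrative shape under assumed names; the seven-day window is an arbitrary example of a "conservative" setting, not a recommendation.

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(days=7)  # deliberately conservative (example value)

def route(events, batch_time):
    """Split a microbatch into in-window events and quarantined late
    arrivals. Nothing is discarded: quarantined events are preserved
    for explicit repair and remain visible to audit."""
    cutoff = batch_time - WATERMARK
    accepted = [e for e in events if e["source_event_ts"] >= cutoff]
    quarantined = [e for e in events if e["source_event_ts"] < cutoff]
    return accepted, quarantined

events = [
    {"id": 1, "source_event_ts": datetime(2026, 3, 4)},
    {"id": 2, "source_event_ts": datetime(2026, 1, 1)},  # very late
]
accepted, quarantined = route(events, datetime(2026, 3, 5))
# every event ends up somewhere; none are silently lost
assert len(events) == len(accepted) + len(quarantined)
```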
12.4.9.4 Microbatch Design Implications
Using foreachBatch:
- each microbatch is treated as a bounded batch
- full SCD2 rebuild logic can be applied safely
- impacted-key pruning limits scope
This avoids:
- unbounded state in streaming engines
- reliance on streaming-native state stores
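The foreachBatch shape can be illustrated in plain Python, standing in for the real Spark batch function. This is a sketch under heavy simplification: Bronze is a list of dicts, and the "merge" is in-memory; in practice these would be Delta table operations. Only the structure matters: append the bounded batch, then rebuild only the impacted keys.

```python
def rebuild_history(events):
    """Deterministic SCD2 rebuild for one key: order by source-event
    time and chain effective_from/effective_to across versions."""
    ordered = sorted(events, key=lambda e: e["source_event_ts"])
    versions = []
    for i, e in enumerate(ordered):
        nxt = ordered[i + 1]["source_event_ts"] if i + 1 < len(ordered) else None
        versions.append({**e, "effective_to": nxt, "is_current": nxt is None})
    return versions

def process_microbatch(bronze_events, batch):
    """The foreachBatch body in miniature: append the bounded batch,
    then rebuild history only for the business keys it touched
    (impacted-key pruning keeps the rebuild scope bounded)."""
    impacted = {e["business_key"] for e in batch}
    bronze_events.extend(batch)
    return {
        k: rebuild_history([e for e in bronze_events if e["business_key"] == k])
        for k in impacted
    }

bronze = [{"business_key": "C123", "source_event_ts": 2, "status": "Restricted"}]
late = [{"business_key": "C123", "source_event_ts": 1, "status": "Active"}]
result = process_microbatch(bronze, late)
# the late event is inserted into its correct historical position
assert [v["status"] for v in result["C123"]] == ["Active", "Restricted"]
```

Because the rebuild is a pure function of the accumulated events, reprocessing the same microbatch (after deduplication) yields identical history, which is exactly the restart property the next section requires.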
12.4.9.5 Failure and Restart Semantics
On failure:
- the same microbatch may be reprocessed
- events may be duplicated
This must not change outcomes.
The system remains correct if:
- ingestion is append-only
- deduplication is deterministic
- rebuild logic is idempotent
12.4.9.6 What This Substantiates
With this approach, the platform can prove:
- that no data was silently dropped
- that reprocessing yields identical results
- that late-arriving events are correctly incorporated
This is essential for:
regulator confidence in operational resilience and data integrity
12.5 Cross-Compatibility of Bronze Ingestion Patterns (How These Pieces Actually Fit Together)
The preceding sections are not alternative implementations to choose between arbitrarily. They are composable patterns with clear roles, and understanding how they interact is critical to building a platform that remains coherent under real-world conditions.
This section clarifies how:
- source-time SCD2 (12.1)
- fallback ingestion (12.2)
- source rejection (12.3)
- CDC microbatch ingestion (12.4)
fit together as a single operating model, not four competing designs.
12.5.1 The Compatibility Matrix
At a high level, the patterns relate as follows:
| Pattern | 12.1 Source-Time SCD2 | 12.2 No Source-Time | 12.3 Reject Source | 12.4 CDC Microbatch |
|---|---|---|---|---|
| 12.1 Source-Time SCD2 | Native model | Fallback downgrade | Protected by rejection | Fully compatible |
| 12.2 No Source-Time | Degraded alternative | Native model | Often leads to rejection | Only if CDC lacks time |
| 12.3 Reject Source | Protects integrity | Prevents weak fallback misuse | Native control | Applied before ingestion |
| 12.4 CDC Microbatch | Ideal pairing | Degraded fallback | Must be validated first | Native delivery pattern |
The asymmetry in this matrix is deliberate. It reflects a hierarchy:
Some patterns define truth, some define delivery, and some define control.
12.5.2 The Roles Are Not Equal
Each section plays a fundamentally different role in the architecture:
| Section | Role | Responsibility |
|---|---|---|
| 12.1 | Truth engine | Constructs temporally ordered, regulator-defensible history |
| 12.2 | Fallback engine | Preserves evidence when source-time is unavailable |
| 12.3 | Governance gate | Prevents non-defensible data entering Bronze |
| 12.4 | Delivery mechanism | Ingests change events into the platform |
This distinction is critical.
A common failure mode is to treat these as interchangeable ingestion strategies. They are not.
12.5.3 The Canonical Flow (What “Good” Looks Like)
The intended operating model is:
CDC / batch / files (source systems)
↓
Validation & rejection (12.3)
↓
Microbatch ingestion (12.4)
↓
Source-time SCD2 reconstruction (12.1)
↓
Bronze (evidence store)
↓
Silver (precedence + belief resolution)
This is the “land it early, manage it early” pipeline in full.
12.5.4 The Critical Pairing: 12.1 + 12.4
The most important combination in this entire section is:
CDC ingestion (12.4) feeding source-time SCD2 reconstruction (12.1)
These two patterns solve different problems:
| Concern | Solved By |
|---|---|
| Continuous ingestion | CDC (12.4) |
| Ordering, repair, and history | SCD2 with source time (12.1) |
Together, they enable:
- out-of-sequence event repair
- idempotent replay of CDC streams
- safe handling of partial updates
- correct modelling of deletes as state transitions
- full auditability of all assertions
Without this pairing:
- CDC alone becomes an overwrite stream
- SCD2 alone becomes a batch-only abstraction detached from reality
12.5.5 Where 12.2 Fits (And Why It Must Be Contained)
The no-source-time pattern (12.2) exists for one reason:
Some sources do not provide reliable temporal information.
It allows ingestion to proceed, but at a cost:
- ordering becomes ingestion-time, not event-time
- point-in-time reconstruction becomes weaker
- audit defensibility is reduced
This is acceptable only if:
- it is explicitly declared
- it is isolated by domain or dataset
- downstream consumers understand the limitation
It must never silently coexist with 12.1 for the same entity without clear boundaries.
12.5.6 Where 12.3 Sits (And Why It Matters)
The rejection pattern (12.3) is not optional governance overhead.
It is what prevents the platform from becoming non-defensible.
It is applied:
- before ingestion
- before SCD2 logic
- before Bronze persistence
Typical rejection triggers include:
- no source-time and no acceptable fallback justification
- ambiguous partial snapshots
- unstable or missing business keys
- undefined delete semantics
Without this control:
Bronze becomes a mixture of incompatible temporal models, and cannot be trusted.
12.5.7 CDC Is Not a Data Model
A subtle but critical point:
CDC (12.4) is a delivery pattern, not a truth model.
CDC provides:
- change events
- operation types (insert/update/delete)
- often partial payloads
It does not provide:
- correct ordering under delay
- complete state reconstruction
- conflict resolution
- precedence handling
Those responsibilities remain entirely within:
12.1 (and 12.2 where unavoidable)
Treating CDC as “the truth” leads directly to:
- overwrite-based pipelines
- loss of historical context
- non-reproducible state
12.5.8 What This Enables
When these patterns are combined correctly, the platform achieves:
- deterministic replay — the same inputs always produce the same history
- out-of-sequence tolerance — late data is repaired, not ignored
- multi-source coexistence — conflicting assertions are preserved, not collapsed
- audit-grade lineage — every state can be traced to its originating assertion
- simplified Silver — belief resolution becomes a query problem, not an ingestion problem
12.5.9 The Core Principle
This entire section reduces to one architectural truth:
Separate how data arrives from how truth is constructed.
- CDC, batch, and files describe arrival
- SCD2 with temporal control describes truth
- precedence and Silver describe belief
If those concerns are mixed, the system becomes fragile.
If they are separated cleanly:
even complex, multi-source, out-of-sequence data becomes explainable and controllable.
12.5.10 Practical Guidance
In implementation terms:
- Default every new ingestion to 12.4 → 12.1
- Use 12.3 aggressively to protect quality
- Allow 12.2 only where explicitly justified
- Never resolve precedence or delete conflicts in Bronze
- Always treat Bronze as an append-only evidence store
12.5.11 The Outcome
When these patterns are used together correctly:
- Bronze becomes a complete, ordered record of assertions
- Silver becomes a deterministic projection of belief
- and the platform as a whole becomes:
regulator-defensible, operationally resilient, and evolution-friendly
12.6 Silver Survivorship Patterns (Attribute-Level Belief Resolution)
Bronze preserves all assertions.
Silver determines what to believe.
In multi-source environments, this is not a simple “latest row wins” problem. Different sources may be authoritative for different attributes, and precedence may vary by domain.
12.6.1 The Core Principle
Belief is resolved per attribute (or attribute group), not per row.
A single “winning row” is often insufficient, because:
- one source may provide the most authoritative status
- another may provide the most recent address
- a third may control lifecycle or deletion
Silver must compose belief from multiple assertions.
12.6.2 Attribute-Level Precedence
Define precedence rules at an attribute or domain level:
| Attribute Group | Authoritative Source | Rule |
|---|---|---|
| status | CORE | highest precedence wins |
| address | CRM | most recent wins |
| deletion | CORE | only CORE can delete |
These rules are:
- explicit
- versioned
- auditable
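One possible shape for such rules is a versioned configuration object. This sketch mirrors the table above; the structure and names are assumptions for illustration, not a mandated schema.

```python
# Explicit, versioned survivorship rules for the customer entity.
# Strategies and source names follow the table above.
SURVIVORSHIP_RULES = {
    "version": "2026-03-01",  # rules themselves are versioned and auditable
    "entity": "customer",
    "attribute_groups": {
        "status":   {"strategy": "precedence", "order": ["CORE", "CRM"]},
        "address":  {"strategy": "recency"},
        "deletion": {"strategy": "authority",  "authorised": ["CORE"]},
    },
}

def rule_for(group: str, rules=SURVIVORSHIP_RULES) -> dict:
    """Look up the resolution rule for an attribute group, so precedence
    decisions are driven by declared configuration, not pipeline code."""
    return rules["attribute_groups"][group]

print(rule_for("status")["strategy"])  # precedence
```

Keeping the rules in declared, versioned configuration means every Silver resolution can cite which rule version it applied, which is what makes precedence decisions explainable after the fact.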
12.6.3 Survivorship Pattern
For each attribute group:
- filter Bronze rows valid at as_at
- rank by:
  - precedence (if applicable)
  - source_event_ts
- select the top-ranked assertion
This produces a resolved view per attribute.
12.6.4 Example Pattern (Conceptual SQL)
WITH candidate_rows AS (
SELECT *
FROM bronze_customer
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
),
status_ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY precedence_rank DESC, source_event_ts DESC
) AS rn
FROM candidate_rows
WHERE status IS NOT NULL
),
address_ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY source_event_ts DESC
) AS rn
FROM candidate_rows
WHERE address IS NOT NULL
)
SELECT
s.customer_id,
s.status,
a.address,
s.is_deleted
FROM status_ranked s
JOIN address_ranked a
ON s.customer_id = a.customer_id
WHERE s.rn = 1
AND a.rn = 1
This demonstrates:
- precedence applied to status
- recency applied to address
- composition into a single belief
12.6.5 Handling Deletes in Survivorship
Deletes must follow explicit authority rules:
- if an authoritative source asserts deletion → entity is deleted
- if a non-authoritative source asserts deletion → treated as an assertion, not a fact
This prevents:
- accidental data loss
- incorrect lifecycle transitions
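The authority rule maps directly to the Case A / Case B outcomes of the worked example. This is a hedged sketch with illustrative names; the assertion shape is an assumption.

```python
def delete_is_adopted(delete_assertions, authorised_sources):
    """A delete becomes belief only if asserted by an authorised source.
    Delete assertions from other sources remain Bronze evidence, never
    Silver belief."""
    return any(a["source"] in authorised_sources for a in delete_assertions)

crm_delete = [{"source": "CRM", "is_deleted": True}]

# Case A: CRM is authoritative for deletion in this domain
assert delete_is_adopted(crm_delete, {"CRM", "CORE"}) is True
# Case B: only CORE may delete; the CRM assertion is preserved, not believed
assert delete_is_adopted(crm_delete, {"CORE"}) is False
```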
12.6.6 Overlapping Assertions
Different attributes may originate from different timestamps.
This is valid.
Silver does not require:
- a single “coherent row” from one source
It constructs:
the most defensible belief at a given point in time
12.6.7 What This Substantiates
With proper survivorship:
- every attribute can be traced to its originating assertion
- precedence decisions are explainable and reproducible
- belief can be reconstructed at any point in time
This is the final step that turns:
Bronze evidence → institutional belief
12.6.8 The Outcome
When Bronze and Silver are implemented together correctly:
- Bronze absorbs complexity
- Silver expresses clarity
And the platform achieves:
truth preservation without operational or analytical friction
13. What This Enables in Silver (And Why Silver Gets Simpler)
When Bronze is implemented correctly, Silver stops being a place where complexity is introduced and becomes a place where complexity is resolved cleanly. Most of the hard problems — ordering, deduplication, replay safety, provenance — are already handled upstream.
13.1 Clean Current-State Views
Silver can derive current state using a simple, stable rule:
WHERE is_current = true
AND is_deleted = false
Because Bronze guarantees:
- one valid current version per business key
- correct ordering based on source time
- no duplicate versions from replay or concurrency
There is no need for:
- complex windowing logic
- “latest ingestion wins” hacks
- defensive deduplication
13.2 Point-in-Time (PIT) Views
Silver can reconstruct belief at any time using:
WHERE effective_from <= :as_at
AND (effective_to > :as_at OR effective_to IS NULL)
Because Bronze:
- preserves full version history
- orders by source-event time
- repairs late-arriving data
This enables:
- regulatory reporting (“what did we believe on date X?”)
- investigation replay
- model validation against historical context
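The PIT predicate above can be exercised end-to-end against an in-memory database. This sketch uses SQLite purely for illustration; the table and column names follow the article's bronze_customer examples, and timestamps are stored as sortable text.

```python
import sqlite3

# Minimal bronze_customer shaped like the article's worked example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_customer (
        customer_id TEXT, status TEXT,
        effective_from TEXT, effective_to TEXT
    )
""")
conn.executemany(
    "INSERT INTO bronze_customer VALUES (?, ?, ?, ?)",
    [
        ("C123", "Active",     "2026-03-01 09:00", "2026-03-02 10:00"),
        ("C123", "Restricted", "2026-03-02 10:00", "2026-03-02 18:00"),
        ("C123", "Active",     "2026-03-02 18:00", None),
    ],
)

def status_as_at(as_at: str):
    """Apply the PIT predicate: the version whose validity interval
    contains as_at (open-ended intervals have effective_to IS NULL)."""
    row = conn.execute(
        """
        SELECT status FROM bronze_customer
        WHERE effective_from <= :as_at
          AND (effective_to > :as_at OR effective_to IS NULL)
        """,
        {"as_at": as_at},
    ).fetchone()
    return row[0] if row else None

print(status_as_at("2026-03-02 12:00"))  # Restricted
print(status_as_at("2026-03-03 09:00"))  # Active
```

Because validity intervals are half-open (inclusive start, exclusive end), exactly one version matches any as_at, which is what makes PIT queries deterministic.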
13.3 Regulator Defensibility
Silver becomes defensible because Bronze has retained:
- full provenance (source, file, run)
- complete version chains
- explicit deletion semantics
- conflict visibility across sources
This allows you to answer:
- What data did you have?
- Where did it come from?
- When did you receive it?
- Why did you believe what you believed?
— all without reconstructing logic from scratch.
13.4 “Truth Without Pain”
Most importantly, Silver becomes easy to use:
- analysts query clean current-state tables
- data scientists retrieve PIT datasets without custom logic
- downstream systems consume consistent, stable views
The complexity of:
- multi-source conflict
- temporal repair
- ingestion noise
is absorbed entirely by Bronze.
Silver delivers “truth without pain” because Bronze has already done the painful work correctly.
14. Expected Substantiation (Regulator / Audit Lens)
A regulator does not care how elegant your architecture is. They care what you can prove, under pressure, with evidence.
This section defines what this design substantiates — and where its limits are.
14.1 What You Can Prove
With this Bronze design, you can demonstrate:
1. Full Lineage of Belief
- Every version of every record
- When it was first seen and last seen
- Which source asserted it
2. Point-in-Time Reconstruction
- What was believed at any given time
- How that belief changed over time
3. Source Provenance
- Which system, file, or pipeline produced the data
- Which ingestion run processed it
4. Conflict Visibility
- Multiple competing assertions from different sources
- Evidence of how conflicts were resolved (in Silver)
5. Replay & Processing Behaviour
- Duplicate arrivals
- Reprocessing events
- Pipeline execution context
This is the foundation of:
- s166 responses
- Consumer Duty investigations
- model validation
- audit traceability
14.2 What You Cannot Prove (Without Source-Time)
If source_event_ts is missing, unreliable, or inconsistent, there are hard limits.
You cannot definitively prove:
- the true chronological order of events at the source
- what was actually believed by the source at a given time
- whether a late-arriving record represents correction or delay
- whether a full snapshot reflects a newer or older state
In this case, you can only prove:
What the platform knew, when it knew it
— not what the business should have known.
This distinction is critical in regulatory scenarios.
14.3 What Evidence Is Retained
The Bronze model retains three classes of evidence:
1. Data Evidence
- all attribute values over time
- deletion states
- version history
2. Temporal Evidence
- source-event timestamps (where available)
- ingestion timestamps
- effective dating
3. Provenance Evidence
- source system
- source path / file
- ingestion run
- duplicate/replay observation (first_seen_ts, last_seen_ts)
Together, these form an audit-grade evidentiary chain.
14.4 The Core Principle
From an audit perspective, the design ensures:
You are not claiming truth — you are proving what was believed, when, and why.
That is the standard regulators actually apply.
And this is why Bronze must be built this way — because once the evidence is lost, it cannot be reconstructed later.
15. Summary: “Land It Early, Manage It Early” Applied to Precedence
Precedence, ordering, deletion, and replay are not downstream concerns — they are ingestion-time responsibilities. If they are deferred, they become harder, more ambiguous, and often impossible to resolve correctly.
By applying “land it early, manage it early” at the Bronze layer, the platform:
- captures all competing assertions, not just the chosen one
- preserves temporal integrity, even under out-of-sequence delivery
- enables deterministic precedence resolution, rather than ad hoc fixes
- produces a stable, explainable is_current, immune to ingestion noise
The result is simple but powerful:
Bronze holds evidence. Silver expresses belief. Everything is explainable.
Part of the “land it early, manage it early” series on SCD2-driven Bronze architectures for regulated Financial Services.