Handling Embedded XML/JSON Blobs: From Raw Payloads to Audit-Grade SCD2 Bronze

Introduction: Turning messy semi-structured payloads into compliant, traceable, high-fidelity Bronze data for UK Financial Services

In Financial Services, upstream systems routinely embed XML or JSON payloads inside VARCHAR/CLOB fields. This is not an edge case — it is the norm. Core banking systems emit SWIFT XML in free-text columns; KYC vendors send JSON blobs wrapped inside opaque fields; CRM systems encode customer events as semi-structured objects; legacy platforms export entire records as XML fragments.

Yet a modern FS data platform must:

  • preserve the raw blob intact for lineage and audit
  • flatten the structure for analytics
  • apply SCD2 to meaningful attributes
  • detect schema drift safely
  • ensure regulatory point-in-time reconstruction remains possible
  • avoid Bronze-layer runaway expansion

Most banks fall into one of three failure modes:

  • allow uncontrolled Bronze expansion driven by upstream drift
  • flatten too early and destroy audit fidelity
  • preserve raw blobs indefinitely without extracting meaning

This article provides a battle-tested, regulator-defensible approach used by mature UK banks and payment processors.

1. Why Embedded XML/JSON Is a Bronze-Layer Nightmare

Embedded XML/JSON is a symptom of:

  • legacy systems without schema evolution
  • vendor products delivering “flexible payloads”
  • payment messages where structure varies by product
  • AML/KYC engines adding new fields mid-year
  • CRM systems emitting verbose interaction objects

In a regulated FS context, poorly handled blobs create:

  • broken lineage
  • inconsistent SCD2 records
  • schema explosion
  • failed PIT reconstructions
  • inability to satisfy PRA/FCA audit queries
  • huge Bronze tables with redundant versions

Bronze must maintain raw fidelity, while Silver/Gold must remain stable and consumable. Blob ingestion sits at the fault line between these goals.

2. Common Scenarios in UK Financial Services

2.1 Payments (SWIFT MT, ISO20022 XML)

  • Deeply nested XML
  • Optional fields vary by corridor
  • Arrays of beneficiaries or charges
  • Schema drift during regulatory upgrades (ISO20022 yearly releases)

2.2 AML/KYC Vendor Responses

  • JSON with risk indicators, match scores, document scans
  • Vendor updates add/remove fields without notice
  • Near-duplicate payloads causing version spam

2.3 CRM and Customer Interactions

  • SFDC/CRM JSON describing interactions, tasks, preferences
  • Often includes embedded status/event arrays

2.4 Legacy Exports

  • Entire record dumps in XML
  • Mixed naming conventions
  • Inconsistent casing or nesting

All these require Bronze handling that preserves truth and supports long-term SCD2.

3. Landing Raw Data in Bronze: Do’s and Don’ts

This is the most important rule:

3.1 DO: Land the raw blob exactly as received — append-only and immutable

Your Bronze landing table should look like:

| field | description |
| --- | --- |
| raw_blob | XML/JSON as delivered, unchanged |
| record_key | primary identifier or compound key |
| source_system | where it came from |
| ingest_timestamp | when it arrived |
| schema_version_detected | schema metadata captured at ingest |
| batch_id / event_id | lineage identifiers |

Never mutate raw. Never pre-flatten. Never infer meaning at this stage.
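
As a minimal sketch, assuming Databricks/Delta Lake, a Spark session, and illustrative table and column names, the landing table can be declared append-only at the storage layer:

```python
# Illustrative Delta landing table for raw blobs (all names are assumptions).
spark.sql("""
CREATE TABLE IF NOT EXISTS bronze.raw_blob_landing (
    raw_blob                STRING    COMMENT 'XML/JSON exactly as delivered',
    record_key              STRING    COMMENT 'Primary or compound business key',
    source_system           STRING    COMMENT 'Originating system',
    ingest_timestamp        TIMESTAMP COMMENT 'Arrival time',
    schema_version_detected STRING    COMMENT 'Schema marker captured at ingest',
    batch_id                STRING    COMMENT 'Lineage identifier',
    event_id                STRING    COMMENT 'Lineage identifier'
)
USING DELTA
TBLPROPERTIES ('delta.appendOnly' = 'true')  -- enforce append-only at the table level
""")
```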

3.2 DO: Add metadata, not transformations

Examples:

  • SHA256 hash of the raw blob (for deterministic change detection and replay verification)
  • size, encoding, version markers
  • XML root element name or JSON top-level keys

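A sketch of metadata-only enrichment in PySpark; raw_df and the added column names are assumptions, and the blob itself is never transformed:

```python
from pyspark.sql import functions as F

landed = (
    raw_df  # assumed to carry raw_blob exactly as delivered
    .withColumn("blob_sha256", F.sha2(F.col("raw_blob"), 256))   # deterministic change detection
    .withColumn("blob_size_bytes", F.length(F.col("raw_blob")))
    .withColumn("ingest_timestamp", F.current_timestamp())
    # Cheap structural marker: JSON object vs XML fragment.
    .withColumn(
        "payload_kind",
        F.when(F.trim(F.col("raw_blob")).startswith("{"), F.lit("json"))
         .otherwise(F.lit("xml")),
    )
)
```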

3.3 DO NOT: Parse XML/JSON in the landing step

Parsing at ingest time creates:

  • non-deterministic pipelines
  • version drift
  • irreversible schema errors
  • lost lineage

Raw stays raw. Always.

4. Flattening Strategies: When and How

After raw landing, a parsed_bronze layer extracts structure safely.

Partial flattening dramatically reduces false-positive SCD2 versioning caused by non-material payload changes.

There are three main patterns:

4.1 Full Flatten (Complete Normalisation)

Pros:

  • SQL-friendly
  • Great for reporting

Cons:

  • Breaks when schema changes
  • Requires constant reprocessing
  • Painful for deeply nested arrays

Use only for stable, well-defined schemas.

4.2 Partial Flatten (High-Value Attributes Only)

This is the correct choice for most FS pipelines.

Extract only:

  • identifiers (customer_id, account_id)
  • volatile attributes (risk_score, match_status, address)
  • data needed for SCD2 logic

Leave everything else as a variant/struct column.

This reduces:

  • schema drift
  • excessive SCD2 versioning
  • reprocessing cost

4.3 Hybrid Flattening (Flatten + Retain Raw)

A combined pattern:

  • Extract 5–20 high-value fields → columns
  • Store full parsed struct → variant/struct
  • Keep original raw XML/JSON → raw_blob

This supports:

  • analytics
  • lineage
  • PIT reconstruction

Most mature banks adopt this hybrid.
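
A sketch of the hybrid in PySpark, assuming JSON payloads; the field list and payload_schema are illustrative, not a prescriptive schema:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative schema for the high-value fields only (an assumption).
payload_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("risk_score", DoubleType()),
    StructField("match_status", StringType()),
])

parsed_bronze = (
    landed  # the raw landing DataFrame from section 3
    # Parsed struct kept as a column; widen payload_schema (or infer it)
    # to retain fields beyond the promoted ones.
    .withColumn("payload", F.from_json(F.col("raw_blob"), payload_schema))
    # Promote only the 5-20 high-value attributes to real columns.
    .withColumn("customer_id", F.col("payload.customer_id"))
    .withColumn("risk_score", F.col("payload.risk_score"))
    .withColumn("match_status", F.col("payload.match_status"))
    # raw_blob stays on the row, untouched, for lineage and PIT reconstruction.
)
```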

5. Applying SCD2 to Flattened Data

Blob-derived fields can participate in SCD2.

Key considerations:

5.1 Row-Level vs Attribute-Level SCD2

Row-level SCD2 explodes when:

  • blobs change frequently
  • vendor schemas drift
  • CRM systems issue many near-duplicate updates

Attribute-level SCD2 is preferred:

  • only version specific changed attributes
  • hash at attribute granularity
  • do NOT version the whole blob unless required

Cross-reference: Part 3 (Attribute-Level SCD2).

5.2 Hashing Nested Payloads

When blobs contain structured objects, use:

  • SHA256(raw_blob) for change detection
  • hash(struct) for extracted attributes
  • xxHash64 for fast comparisons

Only create new versions when business-meaningful fields change, not when the payload structure shifts.
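
A sketch of the two hash granularities in PySpark; the column names follow the hybrid example above and are assumptions:

```python
from pyspark.sql import functions as F

hashed = (
    parsed_bronze
    # Whole-payload hash: flags any byte-level change, used for replay verification.
    .withColumn("blob_sha256", F.sha2(F.col("raw_blob"), 256))
    # Attribute-level hash over business-meaningful fields only; cosmetic or
    # structural payload changes leave it unchanged, so no new SCD2 version opens.
    .withColumn(
        "scd2_attr_hash",
        F.xxhash64(
            F.coalesce(F.col("risk_score").cast("string"), F.lit("")),
            F.coalesce(F.col("match_status"), F.lit("")),
        ),
    )
)
```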

5.3 Temporal Compaction for Volatile Payloads

CRM and KYC systems often send:

  • 10 updates in 30 seconds
  • duplicate risk assessments
  • partial patches

Temporal compaction collapses these into a single SCD2 version.

From Part 3:
“Group updates within a time window and emit one SCD2 record.”
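
A sketch of that pattern in PySpark; the 60-second window, the updates DataFrame, and the column names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bucket updates into fixed time windows per business key.
bucketed = updates.withColumn(
    "bucket", F.window(F.col("ingest_timestamp"), "60 seconds")
)

# Keep only the latest record in each bucket: one SCD2 candidate per key per window.
w = Window.partitionBy("record_key", "bucket").orderBy(F.col("ingest_timestamp").desc())
compacted = (
    bucketed
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn", "bucket")
)
```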

6. Handling Inconsistent Schemas & Drift

Drift is inevitable.

Your Bronze must be designed to survive it.

Schema metadata becomes part of your audit evidence, demonstrating controlled evolution rather than uncontrolled change.

6.1 Schema Inference and Evolution

At ingest, automatically detect:

  • missing fields
  • added fields
  • type changes

Store this in a schema_metadata table.
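
A sketch of top-level key drift detection for JSON payloads, assuming a recent Spark version (json_object_keys) and an illustrative schema_metadata table with a key column:

```python
from pyspark.sql import functions as F

# Top-level keys observed in this batch (JSON rows only; see payload_kind above).
observed = (
    landed
    .filter(F.col("payload_kind") == "json")
    .select(F.explode(F.expr("json_object_keys(raw_blob)")).alias("key"))
    .distinct()
)
expected = spark.table("bronze.schema_metadata").select("key")

added_fields   = observed.subtract(expected)   # fields new in this batch
missing_fields = expected.subtract(observed)   # expected fields absent from this batch
```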

6.2 Drift Detection Pipelines

When schema changes:

  • alert data engineering
  • trigger schema evolution
  • produce a versioned schema record

6.3 Fallback Patterns

Use platform-native types:

  • Snowflake VARIANT
  • BigQuery STRUCT
  • Delta Lake MAP/STRUCT
  • Iceberg/Hudi schema evolution

When flattening fails, store parsed struct only; extract later.
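
A sketch of the fallback split, assuming the illustrative names above; with Spark's default PERMISSIVE mode, from_json yields a null struct when a payload cannot be parsed at all:

```python
from pyspark.sql import functions as F

parsed = landed.withColumn("payload", F.from_json(F.col("raw_blob"), payload_schema))

# Rows that parsed cleanly flow on to column promotion.
promoted = parsed.filter(F.col("payload").isNotNull())
# Rows that did not are stored raw-only and re-extracted once the schema catches up.
fallback = parsed.filter(F.col("payload").isNull())
```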

6.4 Backfill & Historical Repair

If schema changes introduce new fields:

  1. Reparse historical blobs
  2. Populate new columns
  3. Adjust SCD2 history
  4. Recompute PITs where needed

Cross-reference: Part 4 (Golden-Source Precedence & PIT).
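
A sketch of steps 1 and 2 as a version-safe Delta MERGE; widened_schema, the table names, and the join keys are assumptions:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Step 1: reparse historical blobs with the widened schema.
reparsed = (
    spark.table("bronze.raw_blob_landing")
    .withColumn("payload", F.from_json(F.col("raw_blob"), widened_schema))
    .withColumn("risk_model_version", F.col("payload.risk_model_version"))
)

# Step 2: populate only the new column on matching historical rows.
(
    DeltaTable.forName(spark, "bronze.parsed_bronze")
    .alias("t")
    .merge(
        reparsed.alias("s"),
        "t.record_key = s.record_key AND t.blob_sha256 = s.blob_sha256",
    )
    .whenMatchedUpdate(set={"risk_model_version": "s.risk_model_version"})
    .execute()
)
```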

7. Platform-Specific Patterns

These examples illustrate platform capabilities, not prescriptive implementations.

7.1 Databricks / Delta Lake

  • from_xml(), from_json()
  • Auto Loader schema evolution
  • Delta Change Data Feed for incremental extractions
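
A sketch of Auto Loader ingestion with schema evolution enabled; this is Databricks-specific, and the paths are placeholders:

```python
# Databricks Auto Loader: tracks the inferred schema and evolves it
# as new fields appear in incoming JSON.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/kyc")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/landing/kyc")                                         # placeholder path
)
```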

7.2 Snowflake

  • PARSE_XML(), PARSE_JSON()
  • VARIANT for hybrid storage
  • Efficient semi-structured flattening with LATERAL FLATTEN

7.3 BigQuery / Iceberg / Hudi

  • Nested STRUCT support
  • Auto schema merge
  • Efficient MERGE for SCD2

8. Regulatory Considerations

The PRA/FCA expect:

  • raw payload intact
  • traceability from attribute back to XML node
  • ability to reconstruct “state as known on date X”
  • defensible lineage across transformations
  • deterministic reprocessing

Do NOT discard fields.

Do NOT flatten away auditability.

9. Real-World Examples (Anonymised)

Case 1 — UK Bank Handling SWIFT XML

  • ISO20022 (MX) messages embedded in CLOB columns
  • Used hybrid extract + struct
  • Reduced Bronze versioning by 65%
  • Enabled AML PIT reconstruction for sanctions cases

Case 2 — AML Vendor Schema Drift

  • Vendor added new <risk_model_v3> node
  • Drift detection caught change
  • New attribute added without breaking existing pipelines
  • Backfilled 4 years of history with version-safe merge

Case 3 — CRM JSON Expansion

  • CRM pushed multiple partial updates per session
  • Temporal compaction collapsed 17 updates to 1
  • Bronze size reduced by 80%

10. Conclusion: Turn Messy Blobs into Audit-Grade Bronze

Embedded XML/JSON is not a problem — it is a reality.

Bronze is not a dumping ground — it is a regulatory contract with the future.

The maturity comes from:

  • landing raw blobs safely
  • flattening only what matters
  • applying attribute-level SCD2
  • managing schema drift gracefully
  • maintaining audit-grade lineage

Handled correctly, XML/JSON becomes a first-class Bronze asset, not a liability. In regulated FS, this is the only safe path to ensure historical fidelity, analytical clarity, and regulator confidence.