Contents
- Introduction: Turning messy semi-structured payloads into compliant, traceable, high-fidelity Bronze data for UK Financial Services
- 1. Why Embedded XML/JSON Is a Bronze-Layer Nightmare
- 2. Common Scenarios in UK Financial Services
- 3. Landing Raw Data in Bronze: Do’s and Don’ts
- 4. Flattening Strategies: When and How
- 5. Applying SCD2 to Flattened Data
- 6. Handling Inconsistent Schemas & Drift
- 7. Platform-Specific Patterns
- 8. Regulatory Considerations
- 9. Real-World Examples (Anonymised)
- 10. Conclusion: Turn Messy Blobs into Audit-Grade Bronze
Introduction: Turning messy semi-structured payloads into compliant, traceable, high-fidelity Bronze data for UK Financial Services
In Financial Services, upstream systems routinely embed XML or JSON payloads inside VARCHAR/CLOB fields. This is not an edge case — it is the norm. Core banking systems emit SWIFT XML in free-text columns; KYC vendors send JSON blobs wrapped inside opaque fields; CRM systems encode customer events as semi-structured objects; legacy platforms export entire records as XML fragments.
Yet a modern FS data platform must:
- preserve the raw blob intact for lineage and audit
- flatten the structure for analytics
- apply SCD2 to meaningful attributes
- detect schema drift safely
- ensure regulatory point-in-time reconstruction remains possible
- avoid Bronze-layer runaway expansion
Most banks fall into one of three failure modes:
- allow uncontrolled Bronze expansion driven by upstream drift
- flatten too early and destroy audit fidelity
- preserve raw blobs indefinitely without extracting meaning
This article provides a battle-tested, regulator-defensible approach used in mature UK banks and payment processors.
1. Why Embedded XML/JSON Is a Bronze-Layer Nightmare
Embedded XML/JSON is a symptom of:
- legacy systems without schema evolution
- vendor products delivering “flexible payloads”
- payment messages where structure varies by product
- AML/KYC engines adding new fields mid-year
- CRM systems emitting verbose interaction objects
In a regulated FS context, poorly handled blobs create:
- broken lineage
- inconsistent SCD2 records
- schema explosion
- failed PIT reconstructions
- inability to satisfy PRA/FCA audit queries
- huge Bronze tables with redundant versions
Bronze must maintain raw fidelity, while Silver/Gold must remain stable and consumable. Blob ingestion sits at the fault line between these goals.
2. Common Scenarios in UK Financial Services
2.1 Payments (SWIFT MT, ISO 20022 XML)
- Deeply nested XML
- Optional fields vary by corridor
- Arrays of beneficiaries or charges
- Schema drift during regulatory upgrades (ISO 20022 annual releases)
2.2 AML/KYC Vendor Responses
- JSON with risk indicators, match scores, document scans
- Vendor updates add/remove fields without notice
- Near-duplicate payloads causing version spam
2.3 CRM and Customer Interactions
- SFDC/CRM JSON describing interactions, tasks, preferences
- Often includes embedded status/event arrays
2.4 Legacy Exports
- Entire record dumps in XML
- Mixed naming conventions
- Inconsistent casing or nesting
All these require Bronze handling that preserves truth and supports long-term SCD2.
3. Landing Raw Data in Bronze: Do’s and Don’ts
This is the most important rule:
3.1 DO: Land the raw blob exactly as received — append-only and immutable
Your Bronze landing table should look like:
| field | description |
|---|---|
| raw_blob | XML/JSON as delivered, unchanged |
| record_key | primary identifier or compound key |
| source_system | where it came from |
| ingest_timestamp | when it arrived |
| schema_version_detected | metadata captured at ingest |
| batch_id / event_id | lineage identifiers |
Never mutate raw. Never pre-flatten. Never infer meaning at this stage.
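The landing table above can be sketched as a simple record constructor. This is a minimal, platform-agnostic Python illustration (field and function names are assumptions, not a prescribed schema): the blob is stored byte-for-byte, and everything else is metadata added alongside it.

```python
import hashlib
from datetime import datetime, timezone

def build_landing_record(raw_blob: str, record_key: str,
                         source_system: str, batch_id: str) -> dict:
    """Wrap a raw payload for append-only Bronze landing.

    The blob is stored exactly as received; the remaining fields are
    metadata captured at ingest, never derived by parsing the payload.
    """
    return {
        "raw_blob": raw_blob,  # unchanged payload, never mutated
        "record_key": record_key,
        "source_system": source_system,
        "ingest_timestamp": datetime.now(timezone.utc).isoformat(),
        "raw_sha256": hashlib.sha256(raw_blob.encode("utf-8")).hexdigest(),
        "batch_id": batch_id,
    }

rec = build_landing_record('{"customer_id": "C1"}', "C1", "kyc_vendor", "batch-001")
```

Because the record is append-only and carries its own hash, replays and duplicate deliveries can be verified deterministically.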
3.2 DO: Add metadata, not transformations
Examples:
- SHA256 hash of the raw blob (for deterministic change detection and replay verification)
- size, encoding, version markers
- XML root element name or JSON top-level keys
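Capturing structural metadata like the root element or top-level keys can be sketched as follows. A minimal Python example using only the standard library; the format detection heuristic is an assumption for illustration:

```python
import json
import xml.etree.ElementTree as ET

def payload_metadata(raw_blob: str) -> dict:
    """Capture lightweight structural metadata at ingest.

    The blob itself is not transformed or stored differently --
    this only records what shape arrived.
    """
    meta = {"size_bytes": len(raw_blob.encode("utf-8"))}
    if raw_blob.lstrip().startswith("<"):
        meta["format"] = "xml"
        meta["root_element"] = ET.fromstring(raw_blob).tag
    else:
        meta["format"] = "json"
        meta["top_level_keys"] = sorted(json.loads(raw_blob).keys())
    return meta
```

This metadata is what later feeds drift detection: a change in root element or top-level keys is an early, cheap signal.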
3.3 DO NOT: Parse XML/JSON in the landing step
Parsing at ingest time creates:
- non-deterministic pipelines
- version drift
- irreversible schema errors
- lost lineage
Raw stays raw. Always.
4. Flattening Strategies: When and How
After raw landing, a parsed_bronze layer extracts structure safely.
Partial flattening dramatically reduces false-positive SCD2 versioning caused by non-material payload changes.
There are three main patterns:
4.1 Full Flatten (Complete Normalisation)
Pros:
- SQL-friendly
- Great for reporting
Cons:
- Breaks when schema changes
- Requires constant reprocessing
- Painful for deeply nested arrays
Use only for stable, well-defined schemas.
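To make the trade-off concrete, here is a recursive full-flatten sketch in plain Python (an illustration, not any platform's built-in). Note how array elements become positionally indexed columns, which is exactly why this breaks down for variable-length nested structures:

```python
def full_flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into dot-delimited columns.

    Arrays are indexed positionally -- a second beneficiary creates a
    whole new set of columns, one reason full flattening is fragile.
    """
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(full_flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            flat.update(full_flatten(v, f"{prefix}{i}."))
    else:
        flat[prefix[:-1]] = obj
    return flat

payment = {"amt": {"ccy": "GBP", "value": 100},
           "benef": [{"name": "A"}, {"name": "B"}]}
full_flatten(payment)
# {'amt.ccy': 'GBP', 'amt.value': 100, 'benef.0.name': 'A', 'benef.1.name': 'B'}
```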
4.2 Partial Flatten (High-Value Attributes Only)
This is the correct choice for most FS pipelines.
Extract only:
- identifiers (customer_id, account_id)
- volatile attributes (risk_score, match_status, address)
- data needed for SCD2 logic
Leave everything else as a variant/struct column.
This reduces:
- schema drift
- excessive SCD2 versioning
- reprocessing cost
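A partial-flatten step might look like the following sketch. The high-value paths are assumptions for illustration; the point is that only named paths become columns, while the full parsed object rides along as a struct:

```python
import json

# Illustrative only: which paths count as "high value" is a per-pipeline decision.
HIGH_VALUE_PATHS = {
    "customer_id": ["customer", "id"],
    "risk_score": ["risk", "score"],
    "match_status": ["risk", "match_status"],
}

def partial_flatten(raw_blob: str) -> dict:
    """Promote only SCD2-relevant fields to columns; keep the full
    parsed object in a struct column so nothing is lost."""
    parsed = json.loads(raw_blob)
    row = {}
    for col, path in HIGH_VALUE_PATHS.items():
        node = parsed
        for step in path:
            node = node.get(step, {}) if isinstance(node, dict) else {}
        row[col] = None if isinstance(node, dict) else node
    row["payload_struct"] = parsed  # everything else, untouched
    return row

blob = ('{"customer": {"id": "C42"}, '
        '"risk": {"score": 71, "match_status": "REVIEW"}, '
        '"extra": {"v": 3}}')
partial_flatten(blob)["risk_score"]  # 71
```

New vendor fields land inside `payload_struct` without touching the column set, which is what keeps drift from cascading into SCD2.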
4.3 Hybrid Flattening (Flatten + Retain Raw)
A combined pattern:
- Extract 5–20 high-value fields → columns
- Store full parsed struct → variant/struct
- Keep original raw XML/JSON → raw_blob
This supports:
- analytics
- lineage
- PIT reconstruction
Most mature banks adopt this hybrid.
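The three tiers can be sketched as a single row shape (a minimal illustration; column names are assumptions):

```python
import hashlib
import json

def hybrid_bronze_row(raw_blob: str, extracted_cols: dict) -> dict:
    """Three tiers in one row: promoted columns for analytics,
    parsed struct for later extraction, raw blob for audit and PIT."""
    return {
        **extracted_cols,                        # 5-20 high-value columns
        "payload_struct": json.loads(raw_blob),  # full parsed object
        "raw_blob": raw_blob,                    # untouched original
        "raw_sha256": hashlib.sha256(raw_blob.encode("utf-8")).hexdigest(),
    }

row = hybrid_bronze_row('{"risk": {"score": 71}}', {"customer_id": "C42"})
```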
5. Applying SCD2 to Flattened Data
Blob-derived fields can participate in SCD2.
Key considerations:
5.1 Row-Level vs Attribute-Level SCD2
Row-level SCD2 explodes when:
- blobs change frequently
- vendor schemas drift
- CRM systems issue many near-duplicate updates
Attribute-level SCD2 is preferred:
- only version specific changed attributes
- hash at attribute granularity
- do NOT version the whole blob unless required
Cross-reference: Part 3 of this series, on attribute-level SCD2.
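Attribute-level change detection can be sketched as follows (a minimal illustration; which attributes are tracked is a per-pipeline decision, and the field names here are assumptions):

```python
def attribute_changes(current: dict, incoming: dict, tracked: set) -> dict:
    """Return only the tracked attributes whose values actually changed.

    Untracked payload noise (vendor timestamps, internal metadata)
    never triggers a new SCD2 version.
    """
    return {
        attr: incoming[attr]
        for attr in tracked
        if attr in incoming and incoming[attr] != current.get(attr)
    }

current  = {"risk_score": 55, "address": "1 Bank St", "vendor_ts": "10:01"}
incoming = {"risk_score": 71, "address": "1 Bank St", "vendor_ts": "10:02"}
attribute_changes(current, incoming, {"risk_score", "address"})
# {'risk_score': 71}  -- vendor_ts churn ignored, address unchanged
```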
5.2 Hashing Nested Payloads
When blobs contain structured objects, use:
- SHA256(raw_blob) for change detection
- hash(struct) for extracted attributes
- xxHash64 for fast comparisons
Only create new versions when business-meaningful fields change, not when payload structure or formatting shifts.
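The distinction between byte-level and semantic change detection can be sketched like this (standard-library Python; xxHash64 is omitted here as it is not in the stdlib):

```python
import hashlib
import json

def raw_hash(raw_blob: str) -> str:
    """Byte-level hash: flags any change at all, including key reordering."""
    return hashlib.sha256(raw_blob.encode("utf-8")).hexdigest()

def semantic_hash(attrs: dict) -> str:
    """Hash a canonical serialisation of the extracted attributes,
    so key order and whitespace cannot create spurious SCD2 versions."""
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same business content, different byte layout:
a = '{"score": 71, "status": "REVIEW"}'
b = '{"status": "REVIEW", "score": 71}'
raw_hash(a) == raw_hash(b)                                     # False
semantic_hash(json.loads(a)) == semantic_hash(json.loads(b))   # True
```

Use the raw hash for replay verification and the semantic hash for SCD2 versioning decisions.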
5.3 Temporal Compaction for Volatile Payloads
CRM and KYC systems often send:
- 10 updates in 30 seconds
- duplicate risk assessments
- partial patches
Temporal compaction collapses these into a single SCD2 version.
From Part 3:
“Group updates within a time window and emit one SCD2 record.”
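A windowed compaction can be sketched as follows. This is a simplified last-write-wins merge under an assumed 60-second window; real pipelines must also decide how partial patches interact with deletes:

```python
from datetime import datetime, timedelta

def compact(updates, window_seconds=60):
    """Collapse bursts of updates within a time window into one record
    (later partial patches win per field), ready to emit one SCD2 version."""
    updates = sorted(updates, key=lambda u: u["ts"])
    out, window_start = [], None
    for u in updates:
        if window_start is None or u["ts"] - window_start > timedelta(seconds=window_seconds):
            out.append(dict(u))
            window_start = u["ts"]
        else:
            out[-1] = {**out[-1], **u}  # merge the partial patch
    return out

t0 = datetime(2024, 1, 1, 10, 0, 0)
burst = [
    {"ts": t0, "score": 50},
    {"ts": t0 + timedelta(seconds=10), "score": 55},
    {"ts": t0 + timedelta(seconds=25), "status": "REVIEW"},
]
compact(burst)  # one merged record: score 55, status REVIEW
```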
6. Handling Inconsistent Schemas & Drift
Drift is inevitable.
Your Bronze must be designed to survive it.
Schema metadata becomes part of your audit evidence, demonstrating controlled evolution rather than uncontrolled change.
6.1 Schema Inference and Evolution
At ingest, automatically detect:
- missing fields
- added fields
- type changes
Store this in a schema_metadata table.
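The detection step can be sketched as a diff between the last recorded schema and the observed payload shape (a minimal illustration; the schema representation as a field-to-type-name map is an assumption):

```python
def detect_drift(known: dict, observed_payload: dict) -> dict:
    """Compare an observed payload's top-level shape against the last
    recorded schema (field name -> type name) and classify the drift."""
    observed = {k: type(v).__name__ for k, v in observed_payload.items()}
    return {
        "added":   sorted(set(observed) - set(known)),
        "missing": sorted(set(known) - set(observed)),
        "type_changes": {
            k: (known[k], observed[k])
            for k in set(known) & set(observed)
            if known[k] != observed[k]
        },
    }

known = {"score": "int", "status": "str"}
detect_drift(known, {"score": "71", "status": "REVIEW", "risk_model_v3": {}})
# {'added': ['risk_model_v3'], 'missing': [], 'type_changes': {'score': ('int', 'str')}}
```

Each non-empty result would be written to the schema_metadata table as a new versioned schema record.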
6.2 Drift Detection Pipelines
When schema changes:
- alert data engineering
- trigger schema evolution
- produce a versioned schema record
6.3 Fallback Patterns
Use platform-native types:
- Snowflake VARIANT
- BigQuery STRUCT
- Delta Lake MAP/STRUCT
- Iceberg/Hudi schema evolution
When flattening fails, store parsed struct only; extract later.
6.4 Backfill & Historical Repair
If schema changes introduce new fields:
- Reparse historical blobs
- Populate new columns
- Adjust SCD2 history
- Recompute PITs where needed
Cross-reference: Part 4 (Golden-Source Precedence & PIT).
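The reparse-and-populate step above is only possible because Bronze retained every raw blob. A minimal sketch (path and column names are illustrative):

```python
import json

def backfill_column(bronze_rows, new_col, path):
    """Reparse retained raw blobs to populate a newly added column.

    Rows whose historical payloads lack the field get None rather
    than a fabricated value, preserving point-in-time truth.
    """
    for row in bronze_rows:
        node = json.loads(row["raw_blob"])
        for step in path:
            node = node.get(step) if isinstance(node, dict) else None
            if node is None:
                break
        row[new_col] = node
    return bronze_rows

history = [{"raw_blob": '{"risk": {"model": "v3", "score": 12}}'},
           {"raw_blob": '{"risk": {"score": 9}}'}]
backfill_column(history, "risk_model", ["risk", "model"])
# row 0 gets "v3"; row 1 gets None -- older payloads lacked the field
```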
7. Platform-Specific Patterns
These examples illustrate platform capabilities, not prescriptive implementations.
7.1 Databricks / Delta Lake
- from_xml(), from_json() for parsing embedded payloads
- Auto Loader schema evolution
- Delta Change Data Feed for incremental extractions
7.2 Snowflake
- PARSE_XML(), PARSE_JSON() for parsing embedded payloads
- VARIANT for hybrid storage
- Efficient semi-structured flattening with LATERAL FLATTEN
7.3 BigQuery / Iceberg / Hudi
- Nested STRUCT support
- Auto schema merge
- Efficient MERGE for SCD2
8. Regulatory Considerations
The PRA/FCA expect:
- raw payload intact
- traceability from attribute back to XML node
- ability to reconstruct “state as known on date X”
- defensible lineage across transformations
- deterministic reprocessing
Do NOT discard fields.
Do NOT flatten away auditability.
9. Real-World Examples (Anonymised)
Case 1 — UK Bank Handling SWIFT XML
- SWIFT MT and ISO 20022 (MX) messages embedded in CLOB fields
- Used hybrid extract + struct
- Reduced Bronze versioning by 65%
- Enabled AML PIT reconstruction for sanctions cases
Case 2 — AML Vendor Schema Drift
- Vendor added a new <risk_model_v3> node
- Drift detection caught the change
- New attribute added without breaking existing pipelines
- Backfilled 4 years of history with version-safe merge
Case 3 — CRM JSON Expansion
- CRM pushed multiple partial updates per session
- Temporal compaction collapsed 17 updates to 1
- Bronze size reduced by 80%
10. Conclusion: Turn Messy Blobs into Audit-Grade Bronze
Embedded XML/JSON is not a problem — it is a reality.
Bronze is not a dumping ground — it is a regulatory contract with the future.
The maturity comes from:
- landing raw blobs safely
- flattening only what matters
- applying attribute-level SCD2
- managing schema drift gracefully
- maintaining audit-grade lineage
Handled correctly, XML/JSON becomes a first-class Bronze asset, not a liability. In regulated FS, this is the only safe path to ensure historical fidelity, analytical clarity, and regulator confidence.