The Problem
8+ years of data locked in Excel files no one could query.
A convenience store's entire transaction history — spanning 2017 to 2025 — existed as hundreds of semi-structured .xls and .xlsx files with inconsistent schemas. Transaction metadata (ID and date) was stored only in Excel header rows, not repeated on each product line-item, making it impossible to query without manual reconstruction. No aggregations, no trends, no dashboards.
The goal: build a fully automated pipeline that ingests, cleans, and models this data into a production-grade analytics layer — with zero manual intervention after initial setup.
Architecture
Bronze → Silver → Gold Medallion layers.
The pipeline follows the Medallion Architecture pattern — three progressive layers of data quality, each built as a separate PySpark notebook running on Microsoft Fabric's Synapse Data Engineering runtime.
Layer 01
Bronze — Raw Ingestion
- Idempotent loading — tracks processed files via _source_path (see the sketch after this list)
- Header sanitization to lowercase snake_case
- Removes empty Excel rows/columns with dropna(how='all')
- Extracts data_year from folder structure
- Append-only — never re-processes historical files
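A minimal sketch of the idempotent check and the header sanitization, assuming spark is the active SparkSession, landing_paths is a list of newly landed file paths, and df is a freshly loaded DataFrame (all three names are illustrative, not the pipeline's actual ones):

# Hypothetical idempotent check — skip any file whose path is already
# recorded in the Bronze table's _source_path column.
already_loaded = {
    row["_source_path"]
    for row in spark.table("bronze_details_transactions")
                    .select("_source_path").distinct().collect()
}
new_files = [path for path in landing_paths if path not in already_loaded]

# Header sanitization — rename every column to lowercase snake_case.
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])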
Layer 02
Silver — Validation & Hierarchy
- Forward-fill Window functions reconstruct transaction IDs
- Hybrid timestamp parser handles two date formats (see the sketch after this list)
- Smart Revenue Rule: fees vs. discount differentiation
- Quarantine gate for failed quality checks
- Incremental watermark — only processes new Bronze records (also sketched below)
- Auto-generates date dimension table
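A minimal sketch of the hybrid timestamp parser and the watermark, where df is a Bronze DataFrame mid-transformation, the column names (raw_date, _ingested_at) are illustrative, and the two date formats are stand-ins for the real ones:

from pyspark.sql import functions as F

# Hybrid parser — try each known format; coalesce keeps the first that parses.
# With Spark's default (non-ANSI) parsing, to_timestamp yields null on a mismatch.
df = df.withColumn(
    "transaction_timestamp",
    F.coalesce(
        F.to_timestamp("raw_date", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("raw_date", "yyyy-MM-dd HH:mm"),
    ),
)

# Incremental watermark — promote only Bronze rows newer than the latest
# _ingested_at already present in Silver.
high_water = spark.table("silver_details_transactions").agg(F.max("_ingested_at")).first()[0]
bronze_new = spark.table("bronze_details_transactions")
if high_water is not None:
    bronze_new = bronze_new.where(F.col("_ingested_at") > F.lit(high_water))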
Layer 03
Gold — Analytics & Aggregates
- Monthly P&L with countDistinct for accuracy (see the sketch after this list)
- Product Hero analysis by profit contribution
- Daily operational heatmap by day-of-week
- Market Basket Discovery via self-join association mining
- Basket health metrics by fiscal year
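A minimal sketch of the P&L aggregate, assuming a fiscal_month column alongside the fiscal_year shown in Diagram 4 below (the output aliases are illustrative):

from pyspark.sql import functions as F

# countDistinct is what keeps the transaction count honest: a basket spans
# many line-item rows, so counting rows would inflate the total.
gold_monthly_pnl = (
    spark.table("silver_details_transactions")
    .groupBy("fiscal_year", "fiscal_month")
    .agg(
        F.sum("gross_revenue").alias("total_revenue"),
        F.sum("gross_profit").alias("total_profit"),
        F.countDistinct("transaction_id").alias("transaction_count"),
    )
)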
Architecture Diagrams
Pipeline visualized — end to end.
Four diagrams covering the full data flow, table relationships, orchestration sequence, and semantic model.
Diagram 1 — Medallion Pipeline End to End
flowchart TD
A(["POS Excel exports\n.xls / .xlsx — 8+ years\n~676K raw records"])
A -->|"File-drop trigger\nData Factory"| B
subgraph BRONZE ["Bronze Layer — Raw Ingestion"]
B["01_bronze_layer\nIdempotent load · header sanitize\n_source_path tracking · dropna"]
end
BRONZE -->|"_ingested_at watermark"| SILVER
subgraph SILVER ["Silver Layer — Validation & Hierarchy"]
D["02_silver_layer\nForward-fill Window · Smart Revenue Rule\nIncremental watermark · Quality gate\nAuto date dimension · 351,275 rows promoted"]
end
SILVER -->|"Full recalculation"| GOLD
subgraph GOLD ["Gold Layer — Analytics & Aggregates"]
E["03_gold_layer\nMonthly P&L · Product heroes\nDaily heatmap · Market basket"]
end
GOLD --> F(["Power BI\nDirect Lake mode\nSub-second queries"])
BRONZE -->|"Binary copy"| G(["Archive\nFiles/Bronze/Archive"])
BRONZE -->|"Wildcard purge"| H(["Landing zone cleared"])
Full data flow from raw POS Excel files through Bronze → Silver → Gold layers to Power BI. The pipeline is triggered automatically on file-drop and cleans up the landing zone after processing.
Diagram 2 — Data Factory Orchestration
flowchart LR
T(["File-drop trigger\nFiles/Bronze/Landing"]) --> B
B["Execute Bronze\nIdempotent ingest · sanitize"] --> S
S["Execute Silver\nIncremental watermark\nSmart Revenue Rule · quality gate"] --> G
G["Execute Gold\nFull recalculation · overwrite\nAll 5 Gold tables refreshed"] --> AR & DL
AR["Copy to Archive\nBinary mode — preserves .xls/.xlsx"]
DL["Delete landing files\nWildcard purge · directory kept"]
AR & DL --> PBI
PBI(["Power BI Direct Lake\nDashboard auto-synced\nNo manual refresh needed"])
Fully automated execution sequence: Bronze → Silver → Gold notebooks run sequentially, then original files are archived and the landing zone is cleared — ready for the next drop.
Diagram 3 — Delta Table Schema Registry
flowchart LR
subgraph BRONZE ["Bronze"]
B1["bronze_details_transactions\nRaw line-items"]
B2["bronze_tran_transactions\nRaw headers"]
end
subgraph SILVER ["Silver"]
S1["silver_details_transactions\nFACT · 351,275 rows"]
S2["silver_tran_summary\nDIMENSION"]
S3["silver_date_dimension\nAuto-generated"]
S4["quarantine_details\nFailed quality gate"]
end
subgraph GOLD ["Gold"]
G1["gold_monthly_pnl"]
G2["gold_product_performance"]
G3["gold_daily_sales"]
G4["gold_basket_analysis"]
G5["gold_basket_health"]
end
B1 -->|"Forward-fill Window"| S1
B1 -->|"Failed gate"| S4
B2 -->|"Standardize & dedup"| S2
S1 --> S3 & G1 & G2 & G3 & G4
S2 --> G5
All 11 Delta tables across the three layers — showing which Bronze tables feed which Silver tables, and which Silver tables power which Gold aggregates.
Diagram 4 — Star Schema Semantic Model
erDiagram
silver_details_transactions {
string transaction_id FK
timestamp transaction_timestamp
string product_name
int quantity
decimal gross_revenue
decimal gross_profit
int fiscal_year
}
silver_tran_summary {
string transaction_id PK
decimal total_basket_value
int items_in_basket
string payment_method
}
silver_date_dimension {
date calendar_date PK
int year
string month_name
string day_of_week
}
gold_product_performance {
string product_name PK
int total_units_sold
decimal total_profit
decimal avg_retail_price
}
gold_monthly_pnl {
int fiscal_year PK
int fiscal_month PK
decimal total_net_revenue
decimal profit_margin_pct
}
silver_tran_summary ||--o{ silver_details_transactions : "1 header → many line-items"
silver_date_dimension ||--o{ silver_details_transactions : "1 date → many transactions"
gold_product_performance ||--o{ silver_details_transactions : "1 product → many sales"
Formal Star Schema powering Power BI Direct Lake. One-to-Many relationships with single-directional cross-filtering prevent query ambiguity and optimize Spark read performance.
Key Engineering Decisions
The hard problems that made this production-grade.
The Forward-Fill Problem: Legacy Excel exports store the Transaction ID and Date only in the first row of each transaction block; every product line-item below it has null values. No join or group-by can reconstruct this — it takes a Spark Window function partitioned by file source and ordered by original row index, propagating the last non-null value down to every line-item.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window specification — partitioned per file, ordered by original Excel row
window_spec = (
    Window.partitionBy("_source_path")
    .orderBy(F.col("original_row_index").cast("int"))
    .rowsBetween(Window.unboundedPreceding, 0)
)

# Forward-fill Transaction ID from the header row to all product rows below it,
# ignoring repeated literal "ID" header cells and nulls
df_filled = df.withColumn(
    "final_transaction_id",
    F.last(
        F.when((F.col("id") != "ID") & (F.col("id").isNotNull()), F.col("id")),
        ignorenulls=True,
    ).over(window_spec),
)
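The frame is what makes this a forward-fill: rowsBetween(Window.unboundedPreceding, 0) means each row sees every prior row in its file, so F.last(..., ignorenulls=True) carries the most recent non-null header value down, while partitionBy("_source_path") stops the fill from leaking across files.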
The Smart Revenue Rule: The raw data had a single adjustment column that contained two completely different things depending on the value — Massachusetts bottle deposits (regulatory fees ≤ $0.05, which should be added to revenue) and promotional discounts (> $0.05, which should be excluded). Treating them the same would cause every revenue figure to be wrong.
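A minimal sketch of the rule, assuming the adjustment sits in a column named adjustment_amount and deposits are stored as positive dollar amounts (both assumptions, not the pipeline's actual names):

from pyspark.sql import functions as F

# Smart Revenue Rule (illustrative columns): adjustments of $0.05 or less are
# bottle deposits — regulatory fees added to revenue; anything larger is a
# promotional discount and is excluded from the revenue figure.
df = df.withColumn(
    "net_revenue",
    F.when(
        F.col("adjustment_amount") <= 0.05,
        F.col("gross_revenue") + F.col("adjustment_amount"),
    ).otherwise(F.col("gross_revenue")),
)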
Market Basket Discovery: To find which products are bought together most frequently, the pipeline performs a self-join on transaction_id and filters with product_a < product_b (lexicographical ordering) to prevent duplicate pairs from appearing — a clean association rule mining implementation at scale in pure PySpark.
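A minimal sketch of the pair mining, using the fact-table columns from Diagram 4 (transaction_id, product_name) and assuming spark is the active session:

from pyspark.sql import functions as F

line_items = spark.table("silver_details_transactions")
pairs = (
    line_items.select("transaction_id", F.col("product_name").alias("product_a"))
    .join(
        line_items.select("transaction_id", F.col("product_name").alias("product_b")),
        on="transaction_id",
    )
    # product_a < product_b keeps one row per unordered pair and drops self-pairs.
    .where(F.col("product_a") < F.col("product_b"))
    .groupBy("product_a", "product_b")
    .agg(F.countDistinct("transaction_id").alias("baskets_together"))
    .orderBy(F.desc("baskets_together"))
)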