Featured Project · Data Engineering · Feb 2026

Retail Intelligence
& Data Platform

An end-to-end Medallion Architecture pipeline on Microsoft Fabric transforming 8+ years of legacy retail POS data into a high-performance Star Schema for Power BI analytics.

  • 676K raw records ingested
  • 351K rows promoted to Silver
  • 8+ years of POS data modeled
  • 11 Delta tables across 3 layers
The Problem

8+ years of data locked in Excel files no one could query.

A convenience store's entire transaction history — spanning 2017 to 2025 — existed as hundreds of semi-structured .xls and .xlsx files with inconsistent schemas. Transaction metadata (ID and date) was stored only in Excel header rows, not repeated on each product line-item, making it impossible to query without manual reconstruction. No aggregations, no trends, no dashboards.

The goal: build a fully automated pipeline that ingests, cleans, and models this data into a production-grade analytics layer — with zero manual intervention after initial setup.

Architecture

Bronze → Silver → Gold Medallion layers.

The pipeline follows the Medallion Architecture pattern — three progressive layers of data quality, each built as a separate PySpark notebook running on Microsoft Fabric's Synapse Data Engineering runtime.

Layer 01
Bronze — Raw Ingestion
  • Idempotent loading — tracks processed files via _source_path
  • Header sanitization to lowercase snake_case
  • Removes empty Excel rows/columns with dropna(how='all')
  • Extracts data_year from folder structure
  • Append-only — never re-processes historical files
Layer 02
Silver — Validation & Hierarchy
  • Forward-fill Window functions reconstruct transaction IDs
  • Hybrid timestamp parser handles two date formats
  • Smart Revenue Rule: fee vs. discount differentiation
  • Quarantine gate for failed quality checks
  • Incremental watermark — only processes new Bronze records
  • Auto-generates date dimension table
Layer 03
Gold — Analytics & Aggregates
  • Monthly P&L with countDistinct for accuracy
  • Product Hero analysis by profit contribution
  • Daily operational heatmap by day-of-week
  • Market Basket Discovery via self-join association mining
  • Basket health metrics by fiscal year
Architecture Diagrams

Pipeline visualized — end to end.

Four diagrams covering the full data flow, table relationships, orchestration sequence, and semantic model.

Diagram 1 — Medallion Pipeline End to End
flowchart TD
    A(["POS Excel exports\n.xls / .xlsx — 8+ years\n~676K raw records"])
    A -->|"File-drop trigger\nData Factory"| B
    subgraph BRONZE ["Bronze Layer — Raw Ingestion"]
        B["01_bronze_layer\nIdempotent load · header sanitize\n_source_path tracking · dropna"]
    end
    BRONZE -->|"_ingested_at watermark"| SILVER
    subgraph SILVER ["Silver Layer — Validation & Hierarchy"]
        D["02_silver_layer\nForward-fill Window · Smart Revenue Rule\nIncremental watermark · Quality gate\nAuto date dimension · 351,275 rows promoted"]
    end
    SILVER -->|"Full recalculation"| GOLD
    subgraph GOLD ["Gold Layer — Analytics & Aggregates"]
        E["03_gold_layer\nMonthly P&L · Product heroes\nDaily heatmap · Market basket"]
    end
    GOLD --> F(["Power BI\nDirect Lake mode\nSub-second queries"])
    BRONZE -->|"Binary copy"| G(["Archive\nFiles/Bronze/Archive"])
    BRONZE -->|"Wildcard purge"| H(["Landing zone cleared"])
Full data flow from raw POS Excel files through Bronze → Silver → Gold layers to Power BI. The pipeline is triggered automatically on file-drop and cleans up the landing zone after processing.
Diagram 2 — Data Factory Orchestration
flowchart LR
    T(["File-drop trigger\nFiles/Bronze/Landing"]) --> B
    B["Execute Bronze\nIdempotent ingest · sanitize"] --> S
    S["Execute Silver\nIncremental watermark\nSmart Revenue Rule · quality gate"] --> G
    G["Execute Gold\nFull recalculation · overwrite\nAll 5 Gold tables refreshed"] --> AR & DL
    AR["Copy to Archive\nBinary mode — preserves .xls/.xlsx"]
    DL["Delete landing files\nWildcard purge · directory kept"]
    AR & DL --> PBI
    PBI(["Power BI Direct Lake\nDashboard auto-synced\nNo manual refresh needed"])
Fully automated execution sequence: Bronze → Silver → Gold notebooks run sequentially, then original files are archived and the landing zone is cleared — ready for the next drop.
Diagram 3 — Delta Table Schema Registry
flowchart LR
    subgraph BRONZE ["Bronze"]
        B1["bronze_details_transactions\nRaw line-items"]
        B2["bronze_tran_transactions\nRaw headers"]
    end
    subgraph SILVER ["Silver"]
        S1["silver_details_transactions\nFACT · 351,275 rows"]
        S2["silver_tran_summary\nDIMENSION"]
        S3["silver_date_dimension\nAuto-generated"]
        S4["quarantine_details\nFailed quality gate"]
    end
    subgraph GOLD ["Gold"]
        G1["gold_monthly_pnl"]
        G2["gold_product_performance"]
        G3["gold_daily_sales"]
        G4["gold_basket_analysis"]
        G5["gold_basket_health"]
    end
    B1 -->|"Forward-fill Window"| S1
    B1 -->|"Failed gate"| S4
    B2 -->|"Standardize & dedup"| S2
    S1 --> S3 & G1 & G2 & G3 & G4
    S2 --> G5
All 11 Delta tables across the three layers — showing which Bronze tables feed which Silver tables, and which Silver tables power which Gold aggregates.
Diagram 4 — Star Schema Semantic Model
erDiagram
    silver_details_transactions {
        string transaction_id FK
        timestamp transaction_timestamp
        string product_name
        int quantity
        decimal gross_revenue
        decimal gross_profit
        int fiscal_year
    }
    silver_tran_summary {
        string transaction_id PK
        decimal total_basket_value
        int items_in_basket
        string payment_method
    }
    silver_date_dimension {
        date calendar_date PK
        int year
        string month_name
        string day_of_week
    }
    gold_product_performance {
        string product_name PK
        int total_units_sold
        decimal total_profit
        decimal avg_retail_price
    }
    gold_monthly_pnl {
        int fiscal_year PK
        int fiscal_month PK
        decimal total_net_revenue
        decimal profit_margin_pct
    }
    silver_tran_summary ||--o{ silver_details_transactions : "1 header → many line-items"
    silver_date_dimension ||--o{ silver_details_transactions : "1 date → many transactions"
    gold_product_performance ||--o{ silver_details_transactions : "1 product → many sales"
Formal Star Schema powering Power BI Direct Lake. One-to-Many relationships with single-directional cross-filtering prevent query ambiguity and optimize Spark read performance.
Key Engineering Decisions

The hard problems that made this production-grade.

The Forward-Fill Problem: Legacy Excel exports store the Transaction ID and Date only in the first row of each transaction block; every product line-item below it holds nulls. A naive fill can't repair this, because the only link between a line-item and its header is the original row order within each source file — so the fix is a Spark Window function partitioned by file source and ordered by original row index, propagating the last non-null value down to every line-item.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window specification — partitioned per file, ordered by original Excel row
window_spec = Window.partitionBy("_source_path") \
                    .orderBy(F.col("original_row_index").cast("int")) \
                    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Forward-fill Transaction ID from header row to all product rows below it
df_filled = df.withColumn(
    "final_transaction_id",
    F.last(F.when(
        (F.col("id") != "ID") & (F.col("id").isNotNull()), F.col("id")
    ), ignorenulls=True).over(window_spec)
)
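
A related mechanism worth sketching is the Silver layer's incremental watermark: each run promotes only Bronze records newer than the latest value already processed. The notebook does this comparison in PySpark on the `_ingested_at` column; the pure-Python helper below is a hypothetical illustration of the same filter logic, not the pipeline's code.

```python
# Illustrative incremental-watermark filter — function and variable
# names are hypothetical; the real pipeline filters a Bronze Delta
# table by _ingested_at in PySpark.
def new_since_watermark(bronze_rows, last_watermark):
    """Keep only rows ingested after the stored high-water mark."""
    return [r for r in bronze_rows if r["_ingested_at"] > last_watermark]

rows = [
    {"id": 1, "_ingested_at": "2025-01-01T00:00:00"},
    {"id": 2, "_ingested_at": "2025-02-10T00:00:00"},
]

# Only records ingested after the watermark are reprocessed;
# ISO-8601 timestamps compare correctly as strings.
fresh = new_since_watermark(rows, "2025-01-15T00:00:00")
```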

The Smart Revenue Rule: The raw data had a single adjustment column that contained two completely different things depending on the value — Massachusetts bottle deposits (regulatory fees ≤ $0.05, which should be added to revenue) and promotional discounts (> $0.05, which should be excluded). Treating them the same would cause every revenue figure to be wrong.
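In the notebook this split is a single conditional column; the plain-Python sketch below shows the same classification using the $0.05 deposit threshold from the rule above. The function and constant names are illustrative, not the pipeline's.

```python
# Sketch of the Smart Revenue Rule — names here are hypothetical.
# Adjustments at or below the MA bottle-deposit amount ($0.05) are
# regulatory fees and are added to revenue; anything larger is a
# promotional discount and is kept out of the revenue figure.
DEPOSIT_THRESHOLD = 0.05

def adjusted_revenue(line_price: float, adjustment: float) -> float:
    """Apply the fee/discount split to one line-item's revenue."""
    if abs(adjustment) <= DEPOSIT_THRESHOLD:
        return line_price + abs(adjustment)  # bottle deposit: count as revenue
    return line_price                        # discount: excluded from revenue
```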

Market Basket Discovery: To find which products are bought together most frequently, the pipeline performs a self-join on transaction_id and filters with product_a < product_b (lexicographical ordering) to prevent duplicate pairs from appearing — a clean association rule mining implementation at scale in pure PySpark.
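The same pairing logic can be shown in miniature without Spark: generating combinations over each basket's sorted product list guarantees the `product_a < product_b` ordering that the self-join filter enforces. The toy transactions below are hypothetical data for illustration only.

```python
from collections import Counter
from itertools import combinations

# Toy baskets — hypothetical data; the pipeline derives the same pairs
# with a PySpark self-join on transaction_id filtered by
# product_a < product_b.
transactions = {
    "T1": ["coffee", "donut", "newspaper"],
    "T2": ["coffee", "donut"],
    "T3": ["coffee", "newspaper"],
}

pair_counts = Counter()
for products in transactions.values():
    # combinations() over the sorted, deduplicated basket emits each
    # unordered pair exactly once, in lexicographic order — mirroring
    # the product_a < product_b filter of the self-join.
    pair_counts.update(combinations(sorted(set(products)), 2))

# ("coffee", "donut") and ("coffee", "newspaper") each occur in 2 baskets
top_pairs = pair_counts.most_common(2)
```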

Delta Table Schema

11 tables across 3 layers.

Bronze
  • bronze_details_transactions · Raw line-items with metadata
  • bronze_tran_transactions · Raw transaction headers
Silver
  • silver_details_transactions · Fact table — 351,275 rows
  • silver_tran_summary · Transaction-level dimension
  • silver_date_dimension · Auto-generated date table
  • quarantine_details · Failed quality gate records
Gold
  • gold_monthly_pnl · Monthly P&L aggregation
  • gold_product_performance · Product Hero rankings
  • gold_daily_sales · Day-of-week heatmap
  • gold_basket_analysis · Product co-occurrence pairs
  • gold_basket_health · Yearly basket depth metrics
Tech Stack

Built on Microsoft Fabric.

  • Microsoft Fabric · Platform / OneLake
  • PySpark / Spark 3.4 · Distributed compute
  • Delta Lake · ACID storage layer
  • Spark SQL · Query & aggregation
  • Data Factory Pipelines · Orchestration
  • Power BI Direct Lake · BI / dashboards
  • Python / Pandas · File ingestion
  • Star Schema Design · Semantic model
Results

From unqueryable files to sub-second dashboards.

  • 351K rows promoted to the Silver analytics layer
  • 100% reconciliation with physical store records via the Smart Revenue Rule
  • 0 manual steps after pipeline setup — fully automated file-drop to dashboard
  • <1s Power BI query response time via Direct Lake mode