The Problem
8+ years of data locked in Excel files no one could query.
A convenience store's entire transaction history — spanning 2017 to 2025 — existed as hundreds of semi-structured .xls and .xlsx files with inconsistent schemas. Transaction metadata (ID and date) was stored only in Excel header rows, not repeated on each product line-item, making it impossible to query without manual reconstruction. No aggregations, no trends, no dashboards.
The goal: build a fully automated pipeline that ingests, cleans, and models this data into a production-grade analytics layer — with zero manual intervention after initial setup.
Architecture
Bronze → Silver → Gold Medallion layers.
The pipeline follows the Medallion Architecture pattern — three progressive layers of data quality, each built as a separate PySpark notebook running on Microsoft Fabric's Synapse Data Engineering runtime.
Layer 01
Bronze — Raw Ingestion
- Idempotent loading — tracks processed files via _source_path (see the sketch after this list)
- Header sanitization to lowercase snake_case
- Removes empty Excel rows/columns with dropna(how='all')
- Extracts data_year from folder structure
- Append-only — never re-processes historical files
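A minimal sketch of the idempotent check and the header sanitization, assuming spark is the active SparkSession, landing_paths is a list of newly landed file paths, and df is a freshly loaded DataFrame (all three names are illustrative, not the pipeline's actual ones):

# Hypothetical idempotent check — skip any file whose path is already
# recorded in the Bronze table's _source_path column.
already_loaded = {
    row["_source_path"]
    for row in spark.table("bronze_details_transactions")
                    .select("_source_path").distinct().collect()
}
new_files = [path for path in landing_paths if path not in already_loaded]

# Header sanitization — rename every column to lowercase snake_case.
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])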
Layer 02
Silver — Validation & Hierarchy
- Forward-fill Window functions reconstruct transaction IDs
- Hybrid timestamp parser handles two date formats (see the sketch after this list)
- Smart Revenue Rule: fees vs. discount differentiation
- Quarantine gate for failed quality checks
- Incremental watermark — only processes new Bronze records (also sketched below)
- Auto-generates date dimension table
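A minimal sketch of the hybrid timestamp parser and the watermark, where df is a Bronze DataFrame mid-transformation, the column names (raw_date, _ingested_at) are illustrative, and the two date formats are stand-ins for the real ones:

from pyspark.sql import functions as F

# Hybrid parser — try each known format; coalesce keeps the first that parses.
# With Spark's default (non-ANSI) parsing, to_timestamp yields null on a mismatch.
df = df.withColumn(
    "transaction_timestamp",
    F.coalesce(
        F.to_timestamp("raw_date", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("raw_date", "yyyy-MM-dd HH:mm"),
    ),
)

# Incremental watermark — promote only Bronze rows newer than the latest
# _ingested_at already present in Silver.
high_water = spark.table("silver_details_transactions").agg(F.max("_ingested_at")).first()[0]
bronze_new = spark.table("bronze_details_transactions")
if high_water is not None:
    bronze_new = bronze_new.where(F.col("_ingested_at") > F.lit(high_water))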
Layer 03
Gold — Analytics & Aggregates
- Monthly P&L with countDistinct for accuracy (see the sketch after this list)
- Product Hero analysis by profit contribution
- Daily operational heatmap by day-of-week
- Market Basket Discovery via self-join association mining
- Basket health metrics by fiscal year
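A minimal sketch of the P&L aggregate, assuming a fiscal_month column alongside the fiscal_year shown in Diagram 4 below (the output aliases are illustrative):

from pyspark.sql import functions as F

# countDistinct is what keeps the transaction count honest: a basket spans
# many line-item rows, so counting rows would inflate the total.
gold_monthly_pnl = (
    spark.table("silver_details_transactions")
    .groupBy("fiscal_year", "fiscal_month")
    .agg(
        F.sum("gross_revenue").alias("total_revenue"),
        F.sum("gross_profit").alias("total_profit"),
        F.countDistinct("transaction_id").alias("transaction_count"),
    )
)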
Architecture Diagrams
Pipeline visualized — end to end.
Four diagrams covering the full data flow, table relationships, orchestration sequence, and semantic model.
Diagram 1 — Medallion Pipeline End to End
flowchart TD
A(["POS Excel exports\n.xls / .xlsx — 8+ years\n~676K raw records"])
A -->|"File-drop trigger\nData Factory"| B
subgraph BRONZE ["Bronze Layer — Raw Ingestion"]
B["01_bronze_layer\nIdempotent load · header sanitize\n_source_path tracking · dropna"]
end
BRONZE -->|"_ingested_at watermark"| SILVER
subgraph SILVER ["Silver Layer — Validation & Hierarchy"]
D["02_silver_layer\nForward-fill Window · Smart Revenue Rule\nIncremental watermark · Quality gate\nAuto date dimension · 351,275 rows promoted"]
end
SILVER -->|"Full recalculation"| GOLD
subgraph GOLD ["Gold Layer — Analytics & Aggregates"]
E["03_gold_layer\nMonthly P&L · Product heroes\nDaily heatmap · Market basket"]
end
GOLD --> F(["Power BI\nDirect Lake mode\nSub-second queries"])
BRONZE -->|"Binary copy"| G(["Archive\nFiles/Bronze/Archive"])
BRONZE -->|"Wildcard purge"| H(["Landing zone cleared"])
Full data flow from raw POS Excel files through Bronze → Silver → Gold layers to Power BI. The pipeline is triggered automatically on file-drop and cleans up the landing zone after processing.
Diagram 2 — Data Factory Orchestration
flowchart LR
T(["File-drop trigger\nFiles/Bronze/Landing"]) --> B
B["Execute Bronze\nIdempotent ingest · sanitize"] --> S
S["Execute Silver\nIncremental watermark\nSmart Revenue Rule · quality gate"] --> G
G["Execute Gold\nFull recalculation · overwrite\nAll 5 Gold tables refreshed"] --> AR & DL
AR["Copy to Archive\nBinary mode — preserves .xls/.xlsx"]
DL["Delete landing files\nWildcard purge · directory kept"]
AR & DL --> PBI
PBI(["Power BI Direct Lake\nDashboard auto-synced\nNo manual refresh needed"])
Fully automated execution sequence: Bronze → Silver → Gold notebooks run sequentially, then original files are archived and the landing zone is cleared — ready for the next drop.
Diagram 3 — Delta Table Schema Registry
flowchart LR
subgraph BRONZE ["Bronze"]
B1["bronze_details_transactions\nRaw line-items"]
B2["bronze_tran_transactions\nRaw headers"]
end
subgraph SILVER ["Silver"]
S1["silver_details_transactions\nFACT · 351,275 rows"]
S2["silver_tran_summary\nDIMENSION"]
S3["silver_date_dimension\nAuto-generated"]
S4["quarantine_details\nFailed quality gate"]
end
subgraph GOLD ["Gold"]
G1["gold_monthly_pnl"]
G2["gold_product_performance"]
G3["gold_daily_sales"]
G4["gold_basket_analysis"]
G5["gold_basket_health"]
end
B1 -->|"Forward-fill Window"| S1
B1 -->|"Failed gate"| S4
B2 -->|"Standardize & dedup"| S2
S1 --> S3 & G1 & G2 & G3 & G4
S2 --> G5
All 11 Delta tables across the three layers — showing which Bronze tables feed which Silver tables, and which Silver tables power which Gold aggregates.
Diagram 4 — Star Schema Semantic Model
erDiagram
silver_details_transactions {
string transaction_id FK
timestamp transaction_timestamp
string product_name
int quantity
decimal gross_revenue
decimal gross_profit
int fiscal_year
}
silver_tran_summary {
string transaction_id PK
decimal total_basket_value
int items_in_basket
string payment_method
}
silver_date_dimension {
date calendar_date PK
int year
string month_name
string day_of_week
}
gold_product_performance {
string product_name PK
int total_units_sold
decimal total_profit
decimal avg_retail_price
}
gold_monthly_pnl {
int fiscal_year PK
int fiscal_month PK
decimal total_net_revenue
decimal profit_margin_pct
}
silver_tran_summary ||--o{ silver_details_transactions : "1 header → many line-items"
silver_date_dimension ||--o{ silver_details_transactions : "1 date → many transactions"
gold_product_performance ||--o{ silver_details_transactions : "1 product → many sales"
Formal Star Schema powering Power BI Direct Lake. One-to-Many relationships with single-directional cross-filtering prevent query ambiguity and optimize Spark read performance.
Key Engineering Decisions
The hard problems that made this production-grade.
The Forward-Fill Problem: Legacy Excel exports store the Transaction ID and Date only in the first row of each transaction block; every product line-item below it has null values. No join or group-by can reconstruct this — it takes a Spark Window function partitioned by file source and ordered by original row index, propagating the last non-null value down to every line-item.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window specification — partitioned per file, ordered by original Excel row
window_spec = (
    Window.partitionBy("_source_path")
    .orderBy(F.col("original_row_index").cast("int"))
    .rowsBetween(Window.unboundedPreceding, 0)
)

# Forward-fill Transaction ID from the header row to all product rows below it,
# ignoring repeated literal "ID" header cells and nulls
df_filled = df.withColumn(
    "final_transaction_id",
    F.last(
        F.when((F.col("id") != "ID") & (F.col("id").isNotNull()), F.col("id")),
        ignorenulls=True,
    ).over(window_spec),
)
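The frame is what makes this a forward-fill: rowsBetween(Window.unboundedPreceding, 0) means each row sees every prior row in its file, so F.last(..., ignorenulls=True) carries the most recent non-null header value down, while partitionBy("_source_path") stops the fill from leaking across files.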
The Smart Revenue Rule: The raw data had a single adjustment column that contained two completely different things depending on the value — Massachusetts bottle deposits (regulatory fees ≤ $0.05, which should be added to revenue) and promotional discounts (> $0.05, which should be excluded). Treating them the same would cause every revenue figure to be wrong.
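A minimal sketch of the rule, assuming the adjustment sits in a column named adjustment_amount and deposits are stored as positive dollar amounts (both assumptions, not the pipeline's actual names):

from pyspark.sql import functions as F

# Smart Revenue Rule (illustrative columns): adjustments of $0.05 or less are
# bottle deposits — regulatory fees added to revenue; anything larger is a
# promotional discount and is excluded from the revenue figure.
df = df.withColumn(
    "net_revenue",
    F.when(
        F.col("adjustment_amount") <= 0.05,
        F.col("gross_revenue") + F.col("adjustment_amount"),
    ).otherwise(F.col("gross_revenue")),
)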
Market Basket Discovery: To find which products are bought together most frequently, the pipeline performs a self-join on transaction_id and filters with product_a < product_b (lexicographical ordering) to prevent duplicate pairs from appearing — a clean association rule mining implementation at scale in pure PySpark.
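A minimal sketch of the pair mining, using the fact-table columns from Diagram 4 (transaction_id, product_name) and assuming spark is the active session:

from pyspark.sql import functions as F

line_items = spark.table("silver_details_transactions")
pairs = (
    line_items.select("transaction_id", F.col("product_name").alias("product_a"))
    .join(
        line_items.select("transaction_id", F.col("product_name").alias("product_b")),
        on="transaction_id",
    )
    # product_a < product_b keeps one row per unordered pair and drops self-pairs.
    .where(F.col("product_a") < F.col("product_b"))
    .groupBy("product_a", "product_b")
    .agg(F.countDistinct("transaction_id").alias("baskets_together"))
    .orderBy(F.desc("baskets_together"))
)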