Data Engineering · Real-Time Pipeline

NYC MTA Bus
Reliability Tracker

A real-time data pipeline that automatically collects, processes, and analyzes NYC bus performance — detecting delays, ghost buses, and bus bunching across 4 major routes.

926K+ arrivals analyzed
64.6% system on-time rate
40 pytest tests passing
88% false positive reduction
The Problem

3 problems nobody was measuring.

The MTA tells you when your bus should arrive. We built a system that tells you what's actually happening.

Ghost Buses

A bus appears in the MTA app with a promised arrival time — then vanishes before reaching your stop. Riders are left stranded with zero warning.

45 detected per day
Schedule Delays

Buses consistently running late with no historical data for riders to plan around. The worst stop averages 16.5 minutes of delay across 23,000 observations.

+16.5 min worst stop
Bus Bunching

Instead of one bus every 10 minutes, three buses arrive together then nothing for 30 minutes. Q58 has 664 confirmed bunching events in a single day.

664 events on Q58
System Design

Pipeline architecture.

A 4-layer data engineering pipeline with a zero-conflict read replica pattern for concurrent access between the pipeline writer and the dashboard reader.

Diagram 1 — System Architecture
flowchart TD
    A["MTA Bus Time API\nSIRI VehicleMonitoring\n4 routes - every 60 seconds"]
    B["Ingestion Layer\ningestion/ingest.py\nPython - requests - loguru"]
    C["DuckDB Writer\nmta_bus.db\nraw_bus_snapshots"]
    D["Transform Layer\ntransforms/transform.py\nDelay - Ghost - Bunching"]
    E["bus_arrivals\ndelays calculated"]
    F["ghost_buses\ndisappearances flagged"]
    G["bunching_events\nclusters detected"]
    H["Read Replica\nmta_bus_reader.db\natomic copy"]
    I["Overview"]
    J["Ghost Buses"]
    K["Bunching"]
    L["Route Analysis"]
    M["Live Map"]
    A -->|"60s poll"| B
    B -->|"raw snapshots stored"| C
    C -->|"incremental process"| D
    D --> E
    D --> F
    D --> G
    E -->|"atomic copy"| H
    F -->|"atomic copy"| H
    G -->|"atomic copy"| H
    H -->|"read only"| I
    H -->|"read only"| J
    H -->|"read only"| K
    H -->|"read only"| L
    H -->|"live MTA API"| M
The pipeline runs continuously — ingesting from the MTA API, transforming raw GPS data into delay and detection records, copying atomically to a read replica, and serving 5 live dashboard pages. The dashboard never writes to the database and never conflicts with the pipeline writer.
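The atomic copy is the key to the zero-conflict pattern. A minimal sketch of that step, assuming the writer has checkpointed or closed its connection before the copy runs (the temp-file name is illustrative, not the repo's exact code):

import shutil
from pathlib import Path

WRITER_DB = Path("mta_bus.db")         # written by the pipeline
READER_DB = Path("mta_bus_reader.db")  # read-only copy served to the dashboard

def refresh_read_replica() -> None:
    """Copy the writer DB to a temp file, then atomically swap it into place.

    Path.replace is an atomic rename on the same filesystem, so dashboard
    readers see either the old replica or the new one, never a partial file.
    Assumes the pipeline writer is not mid-write during the copy.
    """
    tmp = READER_DB.with_name(READER_DB.name + ".tmp")
    shutil.copy2(WRITER_DB, tmp)   # full file copy of the DuckDB database
    tmp.replace(READER_DB)         # atomic rename over the old replica

if __name__ == "__main__":
    refresh_read_replica()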
Diagram 2 — Database Schema
erDiagram
    RAW_BUS_SNAPSHOTS {
        BIGINT snapshot_id PK
        TIMESTAMP captured_at
        VARCHAR vehicle_id
        VARCHAR route
        VARCHAR direction
        DOUBLE latitude
        DOUBLE longitude
        VARCHAR next_stop_name
        TIMESTAMP aimed_arrival
        TIMESTAMP expected_arrival
        DOUBLE distance_to_stop
    }
    BUS_ARRIVALS {
        BIGINT arrival_id PK
        VARCHAR vehicle_id
        VARCHAR route
        VARCHAR stop_name
        DOUBLE delay_minutes
        BOOLEAN is_late
        DATE date
        INTEGER hour
    }
    GHOST_BUSES {
        VARCHAR vehicle_id PK
        VARCHAR route
        TIMESTAMP last_seen_at
        DOUBLE distance_at_disappear
        BOOLEAN is_ghost
    }
    BUNCHING_EVENTS {
        VARCHAR route PK
        VARCHAR vehicle_1_id
        VARCHAR vehicle_2_id
        DOUBLE distance_between_m
        BOOLEAN is_bunched
        TIMESTAMP timestamp
    }
    ROUTE_RELIABILITY {
        VARCHAR route PK
        INTEGER hour_of_day
        DOUBLE on_time_pct
        DOUBLE avg_delay_minutes
        INTEGER total_arrivals
    }
    RAW_BUS_SNAPSHOTS ||--o{ BUS_ARRIVALS : "transform pipeline"
    RAW_BUS_SNAPSHOTS ||--o{ GHOST_BUSES : "ghost detection"
    RAW_BUS_SNAPSHOTS ||--o{ BUNCHING_EVENTS : "bunching detection"
    BUS_ARRIVALS ||--o{ ROUTE_RELIABILITY : "aggregated into"
Medallion architecture — 3 layers. Raw layer stores unmodified API responses. Cleaned layer stores processed detections. Metrics layer stores pre-aggregated statistics for fast dashboard queries.
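As a rough sketch of how the raw and metrics layers could be created in DuckDB (table and column names follow Diagram 2; the actual DDL in the repo may differ, and the aggregation assumes bus_arrivals has already been populated by the transform layer):

import duckdb

# Connect to the writer database; DuckDB creates the file if it doesn't exist.
con = duckdb.connect("mta_bus.db")

# Raw layer: one row per vehicle per 60-second poll, stored as received.
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_bus_snapshots (
        snapshot_id      BIGINT,
        captured_at      TIMESTAMP,
        vehicle_id       VARCHAR,
        route            VARCHAR,
        direction        VARCHAR,
        latitude         DOUBLE,
        longitude        DOUBLE,
        next_stop_name   VARCHAR,
        aimed_arrival    TIMESTAMP,
        expected_arrival TIMESTAMP,
        distance_to_stop DOUBLE
    )
""")

# Metrics layer: pre-aggregated per route and hour for fast dashboard reads.
con.execute("""
    CREATE OR REPLACE TABLE route_reliability AS
    SELECT route,
           hour AS hour_of_day,
           100.0 * avg(CASE WHEN NOT is_late THEN 1 ELSE 0 END) AS on_time_pct,
           avg(delay_minutes) AS avg_delay_minutes,
           count(*)           AS total_arrivals
    FROM bus_arrivals
    GROUP BY route, hour
""")
con.close()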
Technology

Every tool chosen for a reason.

Not just because it was popular — each has a specific engineering justification.

Python 3.11 (primary): Core language for all pipeline components
DuckDB (storage): Zero-config analytical database — no server process needed
Streamlit (dashboard): 5-page interactive live dashboard at localhost:8501
Docker (deployment): Two containers, shared volume, single command startup
pytest (testing): 40 tests across 4 files, 76% coverage, 2.66s runtime
Plotly (charts): Interactive charts, heatmaps, hourly line charts
Folium (maps): Live GPS map, colored route markers, click popups
Apache Airflow (orchestration): Production DAG with 4 tasks, retry logic, ready to deploy
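A hedged sketch of what a 4-task DAG with retry logic could look like; the task names and placeholder callables are guesses for illustration, not the repo's actual DAG file:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in the real pipeline these would wrap
# ingestion/ingest.py and transforms/transform.py.
def ingest(): ...
def transform(): ...
def refresh_replica(): ...
def run_quality_checks(): ...

default_args = {
    "retries": 2,                          # retry on transient API failures
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="mta_bus_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="*/5 * * * *",                # cadence is illustrative (Airflow 2.4+ arg)
    catchup=False,
    default_args=default_args,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_replica = PythonOperator(task_id="refresh_read_replica", python_callable=refresh_replica)
    t_quality = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)

    t_ingest >> t_transform >> t_replica >> t_quality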
Algorithms

Detection methodology.

Multi-layer filtering reduced false positives by 88%. The flowchart below shows how ghost bus detection decides, at each poll, whether a vehicle that vanished from the feed is a credible ghost or just a brief signal loss.

Diagram 3 — Ghost Bus Detection Algorithm
flowchart LR
    START["API Poll\nevery 60s"] --> CHECK{"Has expected\narrival?"}
    CHECK -->|"No"| NEXT["Next poll"]
    CHECK -->|"Yes"| TRACK["Track vehicle\n+ distance + time"]
    TRACK --> NEXT
    NEXT --> PRESENT{"Still in\nAPI feed?"}
    PRESENT -->|"Yes"| UPDATE["Update record"] --> NEXT
    PRESENT -->|"No"| GAP{"Gap >\n10 min?"}
    GAP -->|"No"| SKIP["Signal loss\nSkip"] --> NEXT
    GAP -->|"Yes"| DIST{"Distance >\n500m?"}
    DIST -->|"No"| NORMAL["Bus arrived\nnormally"]
    DIST -->|"Yes"| GHOST["GHOST BUS\nDETECTED"]
    GHOST --> DASHBOARD["Logged to\ndashboard"]
Two quality gates prevent false positives: the 10-minute minimum gap eliminates brief GPS signal interruptions, and the 500m distance threshold ensures the bus genuinely failed to arrive. Initial count was 15,648 — after these filters: 45 credible detections.
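A minimal sketch of the two gates, assuming each tracked vehicle carries its last-seen timestamp and its last reported distance to the promised stop (field names are illustrative):

from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=10)   # gate 1: ignore brief GPS signal loss
MIN_DISTANCE_M = 500.0            # gate 2: bus must still be far from its stop

@dataclass
class TrackedVehicle:
    vehicle_id: str
    route: str
    last_seen_at: datetime
    distance_to_stop_m: float     # distance to promised stop at last sighting

def is_ghost(vehicle: TrackedVehicle, now: datetime) -> bool:
    """A vehicle counts as a ghost only if it has been missing from the feed
    for more than 10 minutes AND was still more than 500 m from its stop."""
    gone_long_enough = (now - vehicle.last_seen_at) > MIN_GAP
    never_arrived = vehicle.distance_to_stop_m > MIN_DISTANCE_M
    return gone_long_enough and never_arrived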
Data Quality

From 342 events to 40 — 88% false positive reduction.

The most important part of this project. We never accepted our first numbers.

Ghost Buses

First count: 15,648/day — obviously wrong. Root cause: the algorithm counted every raw snapshot instead of tracking each vehicle to completion. Fix: 10-minute gap + 500m distance threshold.

15,648 → 45 (99.7% reduction)
Bus Bunching

First count: 342 events for B46 — suspiciously high. Fixed with 3 filters: a same-direction check, a 50m–500m distance range, and deduplication of repeated vehicle pairs within 5-minute windows (sketched below).

342 → 40 (88% reduction)
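Roughly, those three filters could look like the sketch below: haversine distance between same-direction vehicle pairs, a 50–500 m band, and a set of already-reported pairs for deduplication. Function and field names are illustrative, not the repo's actual code.

from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two GPS points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def find_bunching(snapshots: list[dict], seen_pairs: set) -> list[dict]:
    """snapshots: rows from one poll with vehicle_id, route, direction, latitude, longitude.
    seen_pairs: (route, vehicle_a, vehicle_b) tuples already reported in the
    current 5-minute window, used for deduplication."""
    events = []
    for i, a in enumerate(snapshots):
        for b in snapshots[i + 1:]:
            # Filter 1: only same-route, same-direction vehicles can bunch.
            if a["route"] != b["route"] or a["direction"] != b["direction"]:
                continue
            d = haversine_m(a["latitude"], a["longitude"], b["latitude"], b["longitude"])
            # Filter 2: only pairs within the 50-500 m band count as bunched.
            if not 50.0 <= d <= 500.0:
                continue
            # Filter 3: skip pairs already reported in this 5-minute window.
            pair = (a["route"], *sorted((a["vehicle_id"], b["vehicle_id"])))
            if pair in seen_pairs:
                continue
            seen_pairs.add(pair)
            events.append({"route": a["route"], "vehicles": pair[1:], "distance_m": round(d, 1)})
    return events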
Delay Calculation

First query returned all NULLs for aimed_arrival. Root cause: the missing API parameter VehicleMonitoringDetailLevel=calls. After the fix: 926,551 arrivals processed (request sketch below).

0 → 926,551 arrivals measurable
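For that fix, the ingestion request looks roughly like this. The endpoint URL and LineRef format are assumptions based on the public MTA Bus Time SIRI documentation; the exact call in ingestion/ingest.py may differ.

import os
import requests

API_URL = "https://bustime.mta.info/api/siri/vehicle-monitoring.json"  # assumed endpoint

def fetch_route(route: str) -> dict:
    """Poll SIRI VehicleMonitoring for one route.

    VehicleMonitoringDetailLevel=calls is required to receive MonitoredCall,
    which carries the aimed and expected arrival times; without it the
    aimed_arrival column ends up entirely NULL."""
    params = {
        "key": os.environ["MTA_API_KEY"],
        "LineRef": route,                         # e.g. "MTA NYCT_M15" (agency-prefixed id)
        "VehicleMonitoringDetailLevel": "calls",  # the parameter that was missing
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()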
Key Findings

What 926,551 arrivals revealed.

8 days of continuous data collection across 4 NYC bus routes.

Real findings from real data
64.6%
System-wide on-time rate. 35.4% of buses run late, early, or disappear entirely.
+16.5 min
Average delay at worst stop — Palmetto St/Myrtle Av on Q58 — across 23,000 observations.
88%
False positive reduction through 3-layer quality filtering. Initial count: 342. Final: 40.
Route on-time performance
BX12: 68.4% (grade B)
M15: 67.7% (grade B)
B46: 55.5% (grade C)
Q58: 37.2% (grade D)
Delay distribution — all arrivals
Early: 27.2%
On time: 39.0%
Slightly late: 14.1%
Late: 13.1%
Very late (20+ min): 6.6%
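A classification helper for these buckets might look like the sketch below. Only the 20-minute "very late" cutoff is stated above; the other thresholds are assumptions for illustration.

def delay_bucket(delay_minutes: float) -> str:
    """Map a signed delay (negative = early) to the dashboard's categories.
    All thresholds except the 20-minute cutoff are assumed."""
    if delay_minutes < -1:
        return "Early"
    if delay_minutes <= 5:
        return "On time"
    if delay_minutes <= 10:
        return "Slightly late"
    if delay_minutes < 20:
        return "Late"
    return "Very late (20+ min)"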
Quality Assurance

40 tests, 76% coverage.

Automated tests validate data quality, detection algorithms, pipeline integration, and API connectivity.

test_api.py
5 tests
API returns HTTP 200
Response is valid JSON
Vehicle data present
Required fields exist
MonitoredCall has arrival time
test_data_quality.py
8 tests
No null vehicle IDs
Coordinates within NYC bounds
Delay values in valid range
Boolean flags consistent
On-time rate realistic (10–95%)
test_detection.py
11 tests
Bunching distance 50–500m
Bus cannot bunch with itself
Ghost distance > 50m
Haversine same point = 0m
Times Sq to Empire State = ~1,350m
test_pipeline.py
11 tests
All 4 tables exist
All required columns present
Read replica exists and readable
API key configured
DB file configured correctly
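As a flavor of what these checks could look like (a sketch, not the repo's actual test files; it assumes the read replica exists at mta_bus_reader.db and uses a rough NYC bounding box):

import duckdb
import pytest

READER_DB = "mta_bus_reader.db"

@pytest.fixture(scope="module")
def con():
    # Read-only connection to the replica, mirroring how the dashboard reads it.
    connection = duckdb.connect(READER_DB, read_only=True)
    yield connection
    connection.close()

def test_coordinates_within_nyc_bounds(con):
    # Any GPS fix outside a rough NYC bounding box indicates bad data.
    out_of_bounds = con.execute("""
        SELECT count(*) FROM raw_bus_snapshots
        WHERE latitude  NOT BETWEEN 40.4 AND 41.0
           OR longitude NOT BETWEEN -74.3 AND -73.6
    """).fetchone()[0]
    assert out_of_bounds == 0

def test_on_time_rate_realistic(con):
    # A system-wide on-time rate outside 10-95% suggests a broken calculation.
    rate = con.execute(
        "SELECT 100.0 * avg(CASE WHEN NOT is_late THEN 1 ELSE 0 END) FROM bus_arrivals"
    ).fetchone()[0]
    assert 10.0 <= rate <= 95.0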
$ pytest tests/ -v
tests/test_api.py::TestMTAAPI::test_api_returns_200 PASSED
tests/test_data_quality.py::TestRawDataQuality::test_coordinates_within_nyc_bounds PASSED
tests/test_detection.py::TestHaversineFormula::test_known_distance PASSED
tests/test_pipeline.py::TestReadReplica::test_reader_db_exists PASSED
... 36 more tests ...
======================== 40 passed, 1 skipped in 2.66s ========================
Deployment

One command to run everything.

Fully containerized with Docker. Airflow DAG ready for production deployment.

Quick start
$ git clone https://github.com/Malav4217/MTA
$ echo "MTA_API_KEY=your_key" > .env
$ docker-compose up -d
# Dashboard live at http://localhost:8501
Reflection

What this project taught me.

Question your own numbers
15,648 ghost buses in a day is obviously wrong. The willingness to investigate and fix your own data is what separates good data engineers from bad ones.
Architecture patterns matter
The read replica pattern eliminates DB contention with almost no added complexity. It's the same idea as a PostgreSQL read replica — implemented here in about 20 lines of Python.
Data tells stories
The Q58 is bad across every dimension — delays, bunching, ghost buses. That's not random noise. That's a specific route with specific operational problems a pipeline can identify.