A real-time data pipeline that automatically collects, processes, and analyzes NYC bus performance — detecting delays, ghost buses, and bus bunching across 4 major routes.
The MTA tells you when your bus should arrive. We built a system that tells you what's actually happening.
A bus appears in the MTA app with a promised arrival time, then vanishes before reaching your stop. Riders are left stranded with zero warning.
45 detected per day
Buses consistently running late with no historical data for riders to plan around. The worst stop averages 16.5 minutes of delay across 23,000 observations.
+16.5 min worst stop
Instead of one bus every 10 minutes, three buses arrive together then nothing for 30 minutes. Q58 has 664 confirmed bunching events in a single day.
664 events on Q58
4-layer data engineering pipeline with zero-conflict read replica pattern for concurrent access between pipeline writer and dashboard reader.
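One way a read replica pattern like this can work is for the writer to publish a snapshot of its database that the dashboard opens read-only, so the two processes never touch the same file. The sketch below is an illustration only: it assumes a single-file embedded database (SQLite is used here because it ships with Python; the project's actual store is not stated above), and the names `publish_replica` and `open_replica` are hypothetical.

```python
import os
import sqlite3
import tempfile


def publish_replica(primary: sqlite3.Connection, replica_path: str) -> None:
    """Snapshot the writer's database and swap it in as the reader's copy."""
    replica_dir = os.path.dirname(os.path.abspath(replica_path))
    # Build the snapshot next to the replica so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=replica_dir, suffix=".tmp")
    os.close(fd)
    snapshot = sqlite3.connect(tmp_path)
    try:
        primary.backup(snapshot)  # consistent copy of the committed state
    finally:
        snapshot.close()
    # Replace the old replica in one step; readers see the old or the new file, never a partial one.
    os.replace(tmp_path, replica_path)


def open_replica(replica_path: str) -> sqlite3.Connection:
    """Open the replica read-only so the dashboard cannot write to it."""
    return sqlite3.connect(f"file:{replica_path}?mode=ro", uri=True)
```

The pipeline would call `publish_replica` after each batch commit, and the dashboard would call `open_replica` per query or per refresh. The trade-off in this sketch is that the dashboard lags the writer by at most one publish interval.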
Each technology was chosen for a specific engineering justification, not just because it was popular.
Multi-layer filtering reduced bunching false positives by 88% (342 → 40 events for B46). For ghost buses, two quality gates prevent false positives: a 10-minute minimum gap rules out brief GPS signal loss, and a 500m distance threshold confirms the bus genuinely failed to arrive.
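A minimal sketch of how the two ghost-bus gates could be combined. It assumes each vehicle's last sighting has already been reduced to a timestamp and a distance to the stop it was promised at, and it interprets the 500m threshold as "last seen more than 500m from the stop". The `LastSighting` record and both function names are hypothetical, not the project's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=10)   # shorter absences are treated as GPS signal loss
MIN_DISTANCE_M = 500.0            # closer than this counts as having reached the stop


@dataclass(frozen=True)
class LastSighting:
    vehicle_id: str
    stop_id: str
    seen_at: datetime             # last time the vehicle appeared in the feed
    distance_to_stop_m: float     # distance to the promised stop at that moment


def is_ghost(sighting: LastSighting, now: datetime) -> bool:
    """Apply both gates: missing long enough AND last seen far from the stop."""
    missing_long_enough = now - sighting.seen_at >= MIN_GAP
    never_reached_stop = sighting.distance_to_stop_m > MIN_DISTANCE_M
    return missing_long_enough and never_reached_stop


def find_ghosts(sightings: list[LastSighting], now: datetime) -> list[str]:
    """Return the vehicle IDs that pass both gates."""
    return [s.vehicle_id for s in sightings if is_ghost(s, now)]
```

Requiring both conditions is what keeps the count low: a bus that drops out of the feed for three minutes fails the first gate, and a bus last seen 120m from the stop fails the second.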
The most important part of this project. We never accepted our first numbers.
First count: 15,648/day, obviously wrong. Root cause: algorithm counted every raw snapshot instead of tracking vehicle completion. Fix: 10-min gap + 500m threshold.
15,648 → 45 (99.7% reduction)
First count: 342 events for B46, suspiciously high. Fixed with 3 filters: direction check, 50m–500m range, and duplicate pair deduplication across 5-min windows.
342 → 40 (88% reduction)
First query returned ALL NULL for aimed_arrival. Root cause: missing API parameter VehicleMonitoringDetailLevel=calls. After fix: 926,551 arrivals processed.
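For the aimed_arrival fix, the request only needs the extra query parameter. The sketch below builds such a URL. The source confirms only `VehicleMonitoringDetailLevel=calls`; the endpoint, the `key` and `LineRef` parameters, and the `MTA NYCT_B46` route identifier reflect our understanding of the public MTA Bus Time SIRI API and should be treated as assumptions. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for the MTA Bus Time SIRI VehicleMonitoring call.
BASE_URL = "https://bustime.mta.info/api/siri/vehicle-monitoring.json"


def build_vehicle_monitoring_url(api_key: str, line_ref: str) -> str:
    """Build a request URL that asks for call-level detail (schedule times per stop)."""
    params = {
        "key": api_key,        # assumed parameter name for the API key
        "LineRef": line_ref,   # assumed route filter, e.g. "MTA NYCT_B46"
        # Without this parameter the query described above returned NULL for aimed_arrival.
        "VehicleMonitoringDetailLevel": "calls",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_vehicle_monitoring_url("YOUR_API_KEY", "MTA NYCT_B46")
```

Building the URL in one place makes it easy to assert in a test that the detail-level parameter is always present, which is the kind of regression this bug invites.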
8 days of continuous data collection across 4 NYC bus routes.
Automated tests validate data quality, detection algorithms, pipeline integration, and API connectivity.
Fully containerized with Docker. Airflow DAG ready for production deployment.