Data Engineering · Real-Time Pipeline

NYC MTA Bus
Reliability Tracker

A real-time data pipeline that automatically collects, processes, and analyzes NYC bus performance — detecting delays, ghost buses, and bus bunching across 4 major routes.

926K+ arrivals analyzed
64.6% system on-time rate
40 pytest tests passing
88% false positive reduction
The Problem

3 problems nobody was measuring.

The MTA tells you when your bus should arrive. We built a system that tells you what's actually happening.

Ghost Buses

A bus appears in the MTA app with a promised arrival time — then vanishes before reaching your stop. Riders are left stranded with zero warning.

45 detected per day
Schedule Delays

Buses consistently running late with no historical data for riders to plan around. The worst stop averages 16.5 minutes of delay across 23,000 observations.

+16.5 min worst stop
Bus Bunching

Instead of one bus every 10 minutes, three buses arrive together then nothing for 30 minutes. Q58 has 664 confirmed bunching events in a single day.

664 events on Q58
System Design

Pipeline architecture.

A 4-layer data engineering pipeline with a zero-conflict read replica pattern for concurrent access between the pipeline writer and the dashboard reader.

Diagram 1 — System Architecture
flowchart TD
    A["MTA Bus Time API\nSIRI VehicleMonitoring\n4 routes - every 60 seconds"]
    B["Ingestion Layer\ningestion/ingest.py\nPython - requests - loguru"]
    C["DuckDB Writer\nmta_bus.db\nraw_bus_snapshots"]
    D["Transform Layer\ntransforms/transform.py\nDelay - Ghost - Bunching"]
    E["bus_arrivals\ndelays calculated"]
    F["ghost_buses\ndisappearances flagged"]
    G["bunching_events\nclusters detected"]
    H["Read Replica\nmta_bus_reader.db\natomic copy"]
    I["Overview"]
    J["Ghost Buses"]
    K["Bunching"]
    L["Route Analysis"]
    M["Live Map"]
    A -->|"60s poll"| B
    B -->|"raw snapshots stored"| C
    C -->|"incremental process"| D
    D --> E
    D --> F
    D --> G
    E -->|"atomic copy"| H
    F -->|"atomic copy"| H
    G -->|"atomic copy"| H
    H -->|"read only"| I
    H -->|"read only"| J
    H -->|"read only"| K
    H -->|"read only"| L
    H -->|"live MTA API"| M
The pipeline runs continuously — ingesting from the MTA API, transforming raw GPS data into delay and detection records, copying atomically to a read replica, and serving 5 live dashboard pages. The dashboard never writes to the database and never conflicts with the pipeline writer.
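The atomic copy is the key to the zero-conflict pattern. A minimal sketch of that step, assuming the writer has checkpointed or closed its connection before the copy runs (the temp-file name is illustrative, not the repo's exact code):

import shutil
from pathlib import Path

WRITER_DB = Path("mta_bus.db")         # written by the pipeline
READER_DB = Path("mta_bus_reader.db")  # read-only copy served to the dashboard

def refresh_read_replica() -> None:
    """Copy the writer DB to a temp file, then atomically swap it into place.

    Path.replace is an atomic rename on the same filesystem, so dashboard
    readers see either the old replica or the new one, never a partial file.
    Assumes the pipeline writer is not mid-write during the copy.
    """
    tmp = READER_DB.with_name(READER_DB.name + ".tmp")
    shutil.copy2(WRITER_DB, tmp)   # full file copy of the DuckDB database
    tmp.replace(READER_DB)         # atomic rename over the old replica

if __name__ == "__main__":
    refresh_read_replica()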
Diagram 2 — Database Schema
erDiagram
    RAW_BUS_SNAPSHOTS {
        BIGINT snapshot_id PK
        TIMESTAMP captured_at
        VARCHAR vehicle_id
        VARCHAR route
        VARCHAR direction
        DOUBLE latitude
        DOUBLE longitude
        VARCHAR next_stop_name
        TIMESTAMP aimed_arrival
        TIMESTAMP expected_arrival
        DOUBLE distance_to_stop
    }
    BUS_ARRIVALS {
        BIGINT arrival_id PK
        VARCHAR vehicle_id
        VARCHAR route
        VARCHAR stop_name
        DOUBLE delay_minutes
        BOOLEAN is_late
        DATE date
        INTEGER hour
    }
    GHOST_BUSES {
        VARCHAR vehicle_id PK
        VARCHAR route
        TIMESTAMP last_seen_at
        DOUBLE distance_at_disappear
        BOOLEAN is_ghost
    }
    BUNCHING_EVENTS {
        VARCHAR route PK
        VARCHAR vehicle_1_id
        VARCHAR vehicle_2_id
        DOUBLE distance_between_m
        BOOLEAN is_bunched
        TIMESTAMP timestamp
    }
    ROUTE_RELIABILITY {
        VARCHAR route PK
        INTEGER hour_of_day
        DOUBLE on_time_pct
        DOUBLE avg_delay_minutes
        INTEGER total_arrivals
    }
    RAW_BUS_SNAPSHOTS ||--o{ BUS_ARRIVALS : "transform pipeline"
    RAW_BUS_SNAPSHOTS ||--o{ GHOST_BUSES : "ghost detection"
    RAW_BUS_SNAPSHOTS ||--o{ BUNCHING_EVENTS : "bunching detection"
    BUS_ARRIVALS ||--o{ ROUTE_RELIABILITY : "aggregated into"
Medallion architecture — 3 layers. Raw layer stores unmodified API responses. Cleaned layer stores processed detections. Metrics layer stores pre-aggregated statistics for fast dashboard queries.
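As a rough sketch of how the raw and metrics layers could be created in DuckDB (table and column names follow Diagram 2; the actual DDL in the repo may differ, and the aggregation assumes bus_arrivals has already been populated by the transform layer):

import duckdb

# Connect to the writer database; DuckDB creates the file if it doesn't exist.
con = duckdb.connect("mta_bus.db")

# Raw layer: one row per vehicle per 60-second poll, stored as received.
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_bus_snapshots (
        snapshot_id      BIGINT,
        captured_at      TIMESTAMP,
        vehicle_id       VARCHAR,
        route            VARCHAR,
        direction        VARCHAR,
        latitude         DOUBLE,
        longitude        DOUBLE,
        next_stop_name   VARCHAR,
        aimed_arrival    TIMESTAMP,
        expected_arrival TIMESTAMP,
        distance_to_stop DOUBLE
    )
""")

# Metrics layer: pre-aggregated per route and hour for fast dashboard reads.
con.execute("""
    CREATE OR REPLACE TABLE route_reliability AS
    SELECT route,
           hour AS hour_of_day,
           100.0 * avg(CASE WHEN NOT is_late THEN 1 ELSE 0 END) AS on_time_pct,
           avg(delay_minutes) AS avg_delay_minutes,
           count(*)           AS total_arrivals
    FROM bus_arrivals
    GROUP BY route, hour
""")
con.close()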
Technology

Every tool chosen for a reason.

Not just because it was popular — each has a specific engineering justification.

Python 3.11 (primary): Core language for all pipeline components
DuckDB (storage): Zero-config analytical database — no server process needed
Streamlit (dashboard): 5-page interactive live dashboard at localhost:8501
Docker (deployment): Two containers, shared volume, single command startup
pytest (testing): 40 tests across 4 files, 76% coverage, 2.66s runtime
Plotly (charts): Interactive charts, heatmaps, hourly line charts
Folium (maps): Live GPS map, colored route markers, click popups
Apache Airflow (orchestration): Production DAG with 4 tasks, retry logic, ready to deploy
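A hedged sketch of what a 4-task DAG with retry logic could look like; the task names and placeholder callables are guesses for illustration, not the repo's actual DAG file:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in the real pipeline these would wrap
# ingestion/ingest.py and transforms/transform.py.
def ingest(): ...
def transform(): ...
def refresh_replica(): ...
def run_quality_checks(): ...

default_args = {
    "retries": 2,                          # retry on transient API failures
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="mta_bus_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="*/5 * * * *",                # cadence is illustrative (Airflow 2.4+ arg)
    catchup=False,
    default_args=default_args,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_replica = PythonOperator(task_id="refresh_read_replica", python_callable=refresh_replica)
    t_quality = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)

    t_ingest >> t_transform >> t_replica >> t_quality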
Algorithms

Detection methodology.

Multi-layer filtering reduced false positives by 88%. The flowchart below shows how ghost bus detection decides, at each poll, whether a vehicle that vanished from the feed is a credible ghost or just a brief signal loss.

Diagram 3 — Ghost Bus Detection Algorithm
flowchart LR
    START["API Poll\nevery 60s"] --> CHECK{"Has expected\narrival?"}
    CHECK -->|"No"| NEXT["Next poll"]
    CHECK -->|"Yes"| TRACK["Track vehicle\n+ distance + time"]
    TRACK --> NEXT
    NEXT --> PRESENT{"Still in\nAPI feed?"}
    PRESENT -->|"Yes"| UPDATE["Update record"] --> NEXT
    PRESENT -->|"No"| GAP{"Gap >\n10 min?"}
    GAP -->|"No"| SKIP["Signal loss\nSkip"] --> NEXT
    GAP -->|"Yes"| DIST{"Distance >\n500m?"}
    DIST -->|"No"| NORMAL["Bus arrived\nnormally"]
    DIST -->|"Yes"| GHOST["GHOST BUS\nDETECTED"]
    GHOST --> DASHBOARD["Logged to\ndashboard"]
Two quality gates prevent false positives: the 10-minute minimum gap eliminates brief GPS signal interruptions, and the 500m distance threshold ensures the bus genuinely failed to arrive. Initial count was 15,648 — after these filters: 45 credible detections.
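A minimal sketch of the two gates, assuming each tracked vehicle carries its last-seen timestamp and its last reported distance to the promised stop (field names are illustrative):

from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=10)   # gate 1: ignore brief GPS signal loss
MIN_DISTANCE_M = 500.0            # gate 2: bus must still be far from its stop

@dataclass
class TrackedVehicle:
    vehicle_id: str
    route: str
    last_seen_at: datetime
    distance_to_stop_m: float     # distance to promised stop at last sighting

def is_ghost(vehicle: TrackedVehicle, now: datetime) -> bool:
    """A vehicle counts as a ghost only if it has been missing from the feed
    for more than 10 minutes AND was still more than 500 m from its stop."""
    gone_long_enough = (now - vehicle.last_seen_at) > MIN_GAP
    never_arrived = vehicle.distance_to_stop_m > MIN_DISTANCE_M
    return gone_long_enough and never_arrived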
Data Quality

From 342 events to 40 — 88% false positive reduction.

The most important part of this project. We never accepted our first numbers.

Ghost Buses

First count: 15,648/day — obviously wrong. Root cause: the algorithm counted every raw snapshot instead of tracking each vehicle to completion. Fix: 10-minute gap + 500m distance threshold.

15,648 → 45 (99.7% reduction)
Bus Bunching

First count: 342 events for B46 — suspiciously high. Fixed with 3 filters: a same-direction check, a 50m–500m distance range, and deduplication of repeated vehicle pairs within 5-minute windows (sketched below).

342 → 40 (88% reduction)
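Roughly, those three filters could look like the sketch below: haversine distance between same-direction vehicle pairs, a 50–500 m band, and a set of already-reported pairs for deduplication. Function and field names are illustrative, not the repo's actual code.

from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two GPS points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def find_bunching(snapshots: list[dict], seen_pairs: set) -> list[dict]:
    """snapshots: rows from one poll with vehicle_id, route, direction, latitude, longitude.
    seen_pairs: (route, vehicle_a, vehicle_b) tuples already reported in the
    current 5-minute window, used for deduplication."""
    events = []
    for i, a in enumerate(snapshots):
        for b in snapshots[i + 1:]:
            # Filter 1: only same-route, same-direction vehicles can bunch.
            if a["route"] != b["route"] or a["direction"] != b["direction"]:
                continue
            d = haversine_m(a["latitude"], a["longitude"], b["latitude"], b["longitude"])
            # Filter 2: only pairs within the 50-500 m band count as bunched.
            if not 50.0 <= d <= 500.0:
                continue
            # Filter 3: skip pairs already reported in this 5-minute window.
            pair = (a["route"], *sorted((a["vehicle_id"], b["vehicle_id"])))
            if pair in seen_pairs:
                continue
            seen_pairs.add(pair)
            events.append({"route": a["route"], "vehicles": pair[1:], "distance_m": round(d, 1)})
    return events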
Delay Calculation

First query returned all NULLs for aimed_arrival. Root cause: the missing API parameter VehicleMonitoringDetailLevel=calls. After the fix: 926,551 arrivals processed (request sketch below).

0 → 926,551 arrivals measurable
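For that fix, the ingestion request looks roughly like this. The endpoint URL and LineRef format are assumptions based on the public MTA Bus Time SIRI documentation; the exact call in ingestion/ingest.py may differ.

import os
import requests

API_URL = "https://bustime.mta.info/api/siri/vehicle-monitoring.json"  # assumed endpoint

def fetch_route(route: str) -> dict:
    """Poll SIRI VehicleMonitoring for one route.

    VehicleMonitoringDetailLevel=calls is required to receive MonitoredCall,
    which carries the aimed and expected arrival times; without it the
    aimed_arrival column ends up entirely NULL."""
    params = {
        "key": os.environ["MTA_API_KEY"],
        "LineRef": route,                         # e.g. "MTA NYCT_M15" (agency-prefixed id)
        "VehicleMonitoringDetailLevel": "calls",  # the parameter that was missing
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()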
Key Findings

What 926,551 arrivals revealed.

8 days of continuous data collection across 4 NYC bus routes.

Real findings from real data
64.6%
System-wide on-time rate. 35.4% of buses run late, early, or disappear entirely.
+16.5 min
Average delay at worst stop — Palmetto St/Myrtle Av on Q58 — across 23,000 observations.
88%
False positive reduction through 3-layer quality filtering. Initial count: 342. Final: 40.
Route on-time performance
BX12: 68.4% (grade B)
M15: 67.7% (grade B)
B46: 55.5% (grade C)
Q58: 37.2% (grade D)
Delay distribution — all arrivals
Early: 27.2%
On time: 39.0%
Slightly late: 14.1%
Late: 13.1%
Very late (20+ min): 6.6%
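A classification helper for these buckets might look like the sketch below. Only the 20-minute "very late" cutoff is stated above; the other thresholds are assumptions for illustration.

def delay_bucket(delay_minutes: float) -> str:
    """Map a signed delay (negative = early) to the dashboard's categories.
    All thresholds except the 20-minute cutoff are assumed."""
    if delay_minutes < -1:
        return "Early"
    if delay_minutes <= 5:
        return "On time"
    if delay_minutes <= 10:
        return "Slightly late"
    if delay_minutes < 20:
        return "Late"
    return "Very late (20+ min)"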
Quality Assurance

40 tests, 76% coverage.

Automated tests validate data quality, detection algorithms, pipeline integration, and API connectivity.

test_api.py
5 tests
API returns HTTP 200
Response is valid JSON
Vehicle data present
Required fields exist
MonitoredCall has arrival time
test_data_quality.py
8 tests
No null vehicle IDs
Coordinates within NYC bounds
Delay values in valid range
Boolean flags consistent
On-time rate realistic (10–95%)
test_detection.py
11 tests
Bunching distance 50–500m
Bus cannot bunch with itself
Ghost distance > 50m
Haversine same point = 0m
Times Sq to Empire State = ~1,350m
test_pipeline.py
11 tests
All 4 tables exist
All required columns present
Read replica exists and readable
API key configured
DB file configured correctly
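As a flavor of what these checks could look like (a sketch, not the repo's actual test files; it assumes the read replica exists at mta_bus_reader.db and uses a rough NYC bounding box):

import duckdb
import pytest

READER_DB = "mta_bus_reader.db"

@pytest.fixture(scope="module")
def con():
    # Read-only connection to the replica, mirroring how the dashboard reads it.
    connection = duckdb.connect(READER_DB, read_only=True)
    yield connection
    connection.close()

def test_coordinates_within_nyc_bounds(con):
    # Any GPS fix outside a rough NYC bounding box indicates bad data.
    out_of_bounds = con.execute("""
        SELECT count(*) FROM raw_bus_snapshots
        WHERE latitude  NOT BETWEEN 40.4 AND 41.0
           OR longitude NOT BETWEEN -74.3 AND -73.6
    """).fetchone()[0]
    assert out_of_bounds == 0

def test_on_time_rate_realistic(con):
    # A system-wide on-time rate outside 10-95% suggests a broken calculation.
    rate = con.execute(
        "SELECT 100.0 * avg(CASE WHEN NOT is_late THEN 1 ELSE 0 END) FROM bus_arrivals"
    ).fetchone()[0]
    assert 10.0 <= rate <= 95.0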
$ pytest tests/ -v
tests/test_api.py::TestMTAAPI::test_api_returns_200 PASSED
tests/test_data_quality.py::TestRawDataQuality::test_coordinates_within_nyc_bounds PASSED
tests/test_detection.py::TestHaversineFormula::test_known_distance PASSED
tests/test_pipeline.py::TestReadReplica::test_reader_db_exists PASSED
... 36 more tests ...
======================== 40 passed, 1 skipped in 2.66s ========================
Deployment

One command to run everything.

Fully containerized with Docker. Airflow DAG ready for production deployment.

Quick start
$ git clone https://github.com/Malav4217/MTA
$ echo "MTA_API_KEY=your_key" > .env
$ docker-compose up -d
# Dashboard live at http://localhost:8501
Reflection

What this project taught me.

Question your own numbers
15,648 ghost buses in a day is obviously wrong. The willingness to investigate and fix your own data is what separates good data engineers from bad ones.
Architecture patterns matter
The read replica pattern eliminates DB contention with almost no added complexity. It's the same idea as a PostgreSQL read replica — implemented here in about 20 lines of Python.
Data tells stories
The Q58 is bad across every dimension — delays, bunching, ghost buses. That's not random noise. That's a specific route with specific operational problems a pipeline can identify.