Comprehensive EDA, ML modeling, and cross-city insights from 910,522 cleaned ride records (June 2016 – Feb 2017)
| Model | R² Score | MAE ($) | RMSE ($) | Status |
|---|---|---|---|---|
| Gradient Boosting Regressor | 0.9964 | 0.29 | 0.47 | 🏆 BEST |
| Extra Trees Regressor | 0.8916 | 1.65 | 2.60 | Baseline |
Trip distance is the strongest single predictor of fare (r = 0.849). A linear fit on distance alone explains about 72% of fare variance (r² ≈ 0.72) before any other feature is considered.
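The step from the correlation to the variance figure is a single squaring. The snippet below is a small arithmetic check using only the reported r value; it does not touch the ride data.

```python
# Pearson correlation between trip distance and fare, as reported above.
r = 0.849

# For a linear fit on a single feature, the share of variance explained is r squared.
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f} ({r_squared:.0%} of variance)")
```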
10.2% of rides experience surge pricing. At the peak surge multiplier (6x), fares rise by up to 500% over the base fare, reflecting an acute demand-supply imbalance during events and late nights.
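The 500% figure follows from the multiplier if the fare scales linearly with the surge factor. That linear scaling is an assumption of this sketch, and `surge_increase_pct` is a hypothetical helper name, not something taken from the project code.

```python
def surge_increase_pct(multiplier: float) -> float:
    """Percentage increase over the base fare, assuming fare = base * multiplier."""
    return (multiplier - 1.0) * 100.0


# A 1.0x multiplier means no surge; 6.0x is the peak value reported above.
for m in (1.0, 1.5, 2.0, 6.0):
    print(f"{m:.1f}x surge -> +{surge_increase_pct(m):.0f}% over base fare")
```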
Demand peaks between midnight and 2 AM (nightlife), with a further peak in the early morning around 4 AM. Sunday is the busiest day, which suggests that Austin's entertainment district drives much of the ride-hailing usage.
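A minimal sketch of how hourly and weekday peaks like these can be counted from pickup timestamps. The timestamps and their format are made up for illustration and are not drawn from the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pickup timestamps (illustrative only, not real ride records).
pickups = [
    "2016-10-01 00:15:00",
    "2016-10-02 01:40:00",
    "2016-10-02 00:05:00",
    "2016-10-02 04:10:00",
    "2016-10-03 17:30:00",
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

parsed = [datetime.strptime(p, "%Y-%m-%d %H:%M:%S") for p in pickups]

# Count rides per hour of day and per day of week.
rides_per_hour = Counter(ts.hour for ts in parsed)
rides_per_weekday = Counter(DAY_NAMES[ts.weekday()] for ts in parsed)

busiest_hour, _ = rides_per_hour.most_common(1)[0]
busiest_day, _ = rides_per_weekday.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00, busiest day: {busiest_day}")
```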
95.1% of rides use the REGULAR category. The SUV and PREMIUM categories command fare premiums of 15-40%, which may indicate relatively price-inelastic demand for premium services.
Gradient Boosting achieves R² = 0.9964 with an MAE of just $0.29. Duration and distance are the top features, while surge factor and time of day introduce non-linear effects that tree ensembles can capture.
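For reference, the three metrics reported for the models can be computed as follows. This is a plain-Python sketch of the standard definitions; `regression_metrics` is a hypothetical helper and the fares are invented, so the numbers it produces are unrelated to the reported results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for paired lists of actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "rmse": rmse}


# Hypothetical fares in dollars (illustrative only).
actual = [8.0, 12.5, 20.0, 6.0]
predicted = [8.5, 12.0, 19.0, 6.5]

metrics = regression_metrics(actual, predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```

RMSE penalizes large errors more heavily than MAE, so RMSE is always at least as large as MAE, as it is for both models in the results table.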
| City | Records | Share |
|---|---|---|
| 🏛️ Washington DC | 2,574,807 | 63.5% |
| 🤠 Austin, TX | 909,830 | 22.4% |
| 🗽 New York City | 199,957 | 4.9% |
| 🌉 San Francisco | 191,128 | 4.7% |
| 🏙️ Chicago | 179,205 | 4.4% |
| Total Combined | 4,054,927 | 100% |
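The shares in the table can be reproduced from the record counts alone. The sketch below uses only the counts listed above; because each share is rounded to one decimal place, the displayed percentages sum to 99.9% rather than exactly 100%.

```python
# Record counts per city, copied from the table above.
records = {
    "Washington DC": 2_574_807,
    "Austin, TX": 909_830,
    "New York City": 199_957,
    "San Francisco": 191_128,
    "Chicago": 179_205,
}

total = sum(records.values())

# Each city's share of the combined dataset, in percent.
shares = {city: 100.0 * count / total for city, count in records.items()}

for city, share in shares.items():
    print(f"{city:<15} {records[city]:>10,} {share:5.1f}%")
print(f"{'Total':<15} {total:>10,}")
```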