Predicting Congestion at Intersections

Driving is the preferred mode of travel for many people, and we all want to reach our destinations with as little time stuck in traffic as possible. The ability to predict traffic at intersections lets us plan a route ahead of time and avoid busy streets and intersections. In this project, we train machine learning models to predict how long it takes to cross an intersection and how congested that intersection is.

Data Analysis

Our dataset comes from the Kaggle competition BigQuery-Geotab Intersection Congestion.

Overview of the Dataset

Statistics

Number of features: 13

Features:
Row ID, Intersection ID, Latitude, Longitude, Entry Street Name, Exit Street Name, Entry Heading, Exit Heading, Hour, IsWeekend, Month, Path, City

Target outputs (each reported at the 20th, 40th, 50th, 60th, and 80th percentiles):
Total Time Stopped, Time from First Stop, Distance to First Stop

Number of records: 857,409

Number of unique intersections: 1,991

Features of concern

Target Output

Correlation
[Figure: correlation between the target outputs]

Visualizing the data

Example

[Figure: the North Avenue and Techwood Drive intersection]

Average Waiting Time by City

[Figures: average waiting time by city; standard deviation by city]

Average First Stopping Distance from the Intersection by City

[Figures: average distance to first stop by city; standard deviation by city]

Average Time to Pass the Intersection since First Stop by City

[Figures: average time from first stop by city; standard deviation by city]

Wait time by day: weekdays and weekends

[Figures: wait times on weekdays; wait times on weekends]

By Street

[Figures: entry streets with the highest traffic; exit streets with the highest traffic]

Most Travelled Path

[Figure: paths with the most traffic]

Heat Map of Waiting Times

Our Approach

Note: We reduced the target outputs from the 6 mentioned above to 2: Total Time Stopped and Distance to First Stop.

Feature Engineering

Correlation
[Figure: correlation between the engineered features]

Machine Learning Models

Regression Models

First we tried linear regression, then we applied regularization techniques, including lasso regression and elastic net regression. We also experimented with polynomial regression of degree 2. After tuning the hyperparameters to improve results on the validation set, we observed that the models still performed quite similarly, as shown below.
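The sketch below shows how these baselines could be set up with scikit-learn; the variable names (X, y), split ratio, and regularization strengths are illustrative assumptions rather than our exact tuned values.

```python
# A minimal sketch of the regression baselines, assuming the features are in
# X (n_samples x n_features) and the two targets in y (n_samples x 2).
# Split ratio and regularization strengths are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "polynomial (deg 2)": make_pipeline(PolynomialFeatures(degree=2),
                                        LinearRegression()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    print(f"{name}: validation RMSE = {rmse:.3f}")
```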

Performances (RMSE)
[Figure: RMSE of the regression models]

Gradient Boosting Machines

We tried the CatBoost and LightGBM regressors. For CatBoost, we applied techniques such as bagging and L2 regularization. For LightGBM, we transformed the dataset into its Dataset matrices and specified the categorical features when predicting the outputs. LightGBM performed better than CatBoost.
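A minimal LightGBM sketch of this setup follows; the column names, parameter values, and per-target loop are assumptions for illustration, not our exact configuration.

```python
# A minimal LightGBM sketch: build Dataset matrices, mark the categorical
# features, and train one booster per target. Column names and parameter
# values are assumptions for illustration.
import lightgbm as lgb

cat_cols = ["EntryStreetName", "ExitStreetName", "EntryHeading",
            "ExitHeading", "City"]  # assumed categorical columns
for col in cat_cols:                # LightGBM expects pandas category dtype
    X_train[col] = X_train[col].astype("category")
    X_val[col] = X_val[col].astype("category")

params = {"objective": "regression", "metric": "rmse",
          "learning_rate": 0.05, "num_leaves": 63}

boosters = {}
for i, target in enumerate(["TotalTimeStopped", "DistanceToFirstStop"]):
    train_set = lgb.Dataset(X_train, label=y_train[:, i],
                            categorical_feature=cat_cols)
    val_set = lgb.Dataset(X_val, label=y_val[:, i], reference=train_set)
    boosters[target] = lgb.train(params, train_set, num_boost_round=1000,
                                 valid_sets=[val_set],
                                 callbacks=[lgb.early_stopping(50)])
```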

Performances (RMSE)
[Figure: RMSE of the gradient boosting models]

Neural Network

We built a simple neural network consisting of three linear layers with ReLU as the activation function. The input to the network has size (batch size × 11) and the output has size (batch size × 2): we trained on 11 features and predicted 2 target values. The RMSE is 78.574, which beats the benchmark (RMSE = 80) but is slightly worse than LightGBM.
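A minimal sketch of this architecture in PyTorch follows; the hidden-layer width and optimizer settings are assumptions, as the report only fixes the input size (11), output size (2), and ReLU activations.

```python
# A minimal sketch of the network described above: three linear layers with
# ReLU activations, mapping 11 input features to 2 targets. The hidden width
# of 64 is an assumption; the report does not state it.
import torch
import torch.nn as nn

class CongestionNet(nn.Module):
    def __init__(self, in_features=11, hidden=64, out_features=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x):  # x: (batch_size, 11) -> (batch_size, 2)
        return self.layers(x)

model = CongestionNet()
criterion = nn.MSELoss()   # RMSE reported as the square root of this loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```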

Performance (RMSE)

Random Forest

We applied a grid search to find the optimal set of parameters, such as the number of estimators, the maximum depth, and the number of jobs. Our best random forest regression model achieved an RMSE of 75.001, the best among all the models we tried.
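A sketch of this grid search with scikit-learn is shown below; the grid values are illustrative, not the exact grid we searched.

```python
# A minimal sketch of the grid search over random forest hyperparameters.
# The grid values below are illustrative, not the exact grid we searched;
# n_jobs only controls parallelism and does not affect predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, None],
}

search = GridSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, "RMSE:", -search.best_score_)
```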

Performance (RMSE)

Comparing all Models

[Figure: RMSE of all models]

Discussion

Overall, we used supervised learning and tested four categories of models. All models produced similar results; among them, random forest, the neural network, LightGBM, and CatBoost all beat the benchmark (RMSE = 80). The random forest regression model performed best, placing in the top 30% of the Kaggle leaderboard.

During training we also found that k-fold cross-validation (k = 5) gave over-optimistic results, so we used only a random split to train our models. This gave us more consistent RMSE scores when comparing predictions on the validation set with predictions on the test set.
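For illustration, the two validation schemes can be compared as in the sketch below, assuming X, y, and an unfitted estimator are defined as in the earlier sketches.

```python
# A sketch contrasting the two validation schemes, assuming X, y, and an
# unfitted estimator `model` are defined as in the earlier sketches.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# k-fold (k = 5): gave over-optimistic RMSE estimates in our experiments
kfold_rmse = -cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error").mean()

# single random split: the scheme we ultimately used
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)
model.fit(X_train, y_train)
split_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
```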

References

  1. Google BigQuery. "BigQuery-Geotab Intersection Congestion." Kaggle, July 2019. Accessed 30 October 2019.
  2. Kohavi, Ron. "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." International Joint Conference on Artificial Intelligence, 1995, pp. 1137-1145.
  3. Kotsiantis, Sotiris B., and I. Zaharakis. "Supervised Machine Learning: A Review of Classification Techniques." Emerging Artificial Intelligence Applications in Computer Engineering (2007): 3-24.
  4. Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M. Ingersoll. "An Introduction to Logistic Regression Analysis and Reporting." The Journal of Educational Research 96.1 (2002): 3-14.