The pandemic has pushed the world toward digital services and greatly influenced online shopping habits. According to one survey, online retail sales in the US increased by 40% in 2020. As e-commerce sales and demand grow, predicting delivery dates correctly helps retailers earn customers' trust. The accuracy of shipping estimates therefore plays a significant role in providing a hassle-free, trustworthy customer experience. Despite the importance of the issue, little research has been done on this topic in the machine learning community.

What Are We Trying to Do?

This project applies machine learning to build a model that accurately predicts delivery dates for items sold by a retailer, using a dataset of shipping details. For our analysis, we trained multiple machine learning models, including Linear Regression, Random Forest, Gradient Boosting Model (GBM), and Support Vector Machine (SVM), along with a final ensemble model. To score the models, we look at RMSE (root-mean-square error).

Methodology

A total of 4.1M data values, including timestamp, payment received date, and many others, were used to train the random forest model.

The model's hyperparameters were tuned using RandomizedSearchCV from scikit-learn.

To tune the random forest, we searched over hyperparameters such as the number of estimators, minimum samples per leaf, maximum depth, minimum impurity threshold, maximum leaf nodes, and minimum samples per split.

After tuning, the model scored best with the number of estimators set to 200, a maximum depth of 30, a minimum of 4 samples per leaf, and a minimum of 2 samples per split.
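The tuning step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the search ranges are hypothetical, and mapping "minimum impurity" to scikit-learn's `min_impurity_decrease` is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the shipping dataset (assumption: the real
# features and target are not reproduced here).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space over the hyperparameters named in the text.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [5, 10, 30, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "max_leaf_nodes": [None, 50, 200],
    "min_impurity_decrease": [0.0, 0.01, 0.1],
}

# Randomly sample 5 configurations and score each with 3-fold CV on RMSE.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_rmse = -search.best_score_  # scorer is negated, so flip the sign
print(best_params, round(best_cv_rmse, 2))
```

A randomized search samples a fixed number of configurations instead of trying every combination, which keeps tuning affordable when the dataset has millions of values.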

Model Evaluation

Because this is a regression problem, the models were evaluated using root mean squared error (RMSE), mean absolute error (MAE), R2 score, and mean squared error (MSE).
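As a small illustration of these four metrics, the sketch below computes them with scikit-learn on toy values. The helper name `regression_report` and the numbers are made up for the example and are not taken from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Toy delivery times in days (illustrative values only).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

report = regression_report(y_true, y_pred)
print(report)
```

Because RMSE is the square root of MSE, the two always rank models the same way; RMSE is simply expressed in the same unit as the target.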

Several models were implemented to determine which is best suited to predicting delivery dates accurately.

Decision tree, K-nearest neighbors, gradient boosting, random forest, ridge regression, lasso regression, and linear regression models were used to predict delivery dates for customers.

Based on R2 score, mean absolute error, and root mean squared error, Random Forest, Linear Regression, and CatBoost regression performed best of all the models.
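A comparison of this kind can be sketched as a loop that fits each candidate and collects the same metrics. The example below is only an illustration under stated assumptions: it uses synthetic data, default or hypothetical settings, and scikit-learn estimators only (CatBoost is left out to avoid an extra dependency).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the shipping dataset (assumption).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models with illustrative settings, not the project's settings.
models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "linear": LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

# Rank models from lowest to highest RMSE on the held-out split.
ranking = sorted(results, key=lambda name: results[name]["rmse"])
for name in ranking:
    print(name, results[name])
```

The ranking on synthetic data says nothing about the real dataset; the point is only that every model is scored on the same held-out split with the same metrics.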

Hyperparameter tuning was then performed on these three models to find the parameters that give the highest accuracy.

[Figure: Scores before hyperparameter tuning]

After hyperparameter tuning, Random Forest achieved an R2 score about 20% higher than before tuning, with an MSE of 2.54, an RMSE of 1.59, and an MAE of 0.97.

Since the R2 score did not change drastically, an ensemble approach was used to combine the predictions of the selected models. Ensemble regression obtained an R2 score of 0.74 with an MSE of 2.53, about 0.1% better than the random forest regressor.
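One common way to build such an ensemble for regression is scikit-learn's `VotingRegressor`, which averages the predictions of its members. The sketch below assumes that approach; the source does not state which ensemble method was used, the data is synthetic, and `GradientBoostingRegressor` stands in for CatBoost.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shipping dataset (assumption).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Members mirror the three selected model families; settings are illustrative.
ensemble = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("lr", LinearRegression()),
        ("gb", GradientBoostingRegressor(random_state=0)),  # CatBoost stand-in
    ]
)
ensemble.fit(X_train, y_train)
ensemble_pred = ensemble.predict(X_test)

# With no weights, the ensemble output is the mean of the member predictions.
member_preds = np.column_stack(
    [est.predict(X_test) for est in ensemble.estimators_]
)

ensemble_mse = float(mean_squared_error(y_test, ensemble_pred))
ensemble_r2 = float(r2_score(y_test, ensemble_pred))
print(round(ensemble_mse, 2), round(ensemble_r2, 3))
```

Averaging tends to help most when the members make different kinds of errors, which is consistent with an ensemble of similar-scoring models giving only a small gain.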

[Figure: Scores after hyperparameter tuning]

Conclusion

We compared various machine learning models, including Linear Regression, Random Forest, Gradient Boosting Model (GBM), and Support Vector Machine (SVM). After evaluating these models, we concluded that random forest was the best choice, with an RMSE of 1.59, far better than the other models. We predicted delivery dates for customers' orders with an accuracy of 95%. By ensuring that customers receive accurate delivery dates, we hope to maximize customer satisfaction and enhance the user experience.

The code is available on GitHub:

https://github.com/vraj1231/DATA245_-Delivery_time_prediction
