The prediction of new Covid-19 cases in Poland with machine learning models

Adam Chwila

doi:10.59170/stattrans-2023-020


Adam Chwila, University of Economics in Katowice, Katowice

ORCID: https://orcid.org/0000-0003-4671-4298

Statistics in Transition new series, vol. 24, no. 2, 2023, pp. 59-83

Published online: 15 March 2023

Citation: Chwila, A., 2023. The prediction of new Covid-19 cases in Poland with machine learning models. Statistics in Transition new series, 24(2), pp. 59-83. https://doi.org/10.59170/stattrans-2023-020



ABSTRACT

The COVID-19 pandemic has had a huge impact on both the global economy and everyday life in countries all over the world. In this paper, we propose several machine learning approaches to forecasting new confirmed COVID-19 cases: LASSO regression, Gradient Boosted (GB) regression trees, Support Vector Regression (SVR), and the Long Short-Term Memory (LSTM) neural network. The methods are applied in two variants: to data prepared for the whole of Poland and to data prepared separately for each of the 16 voivodeships (NUTS 2 regions). All models are trained in two variants: with 5-fold time-series cross-validation and with a single split into train and test subsets. The computations use official statistics from government reports covering the period from April 2020 to March 2022. We propose a setup of 16 model selection scenarios to detect the model with the best ex-post prediction accuracy. The scenarios differ in the machine learning model (four options), the method of hyperparameter selection (two options) and the data setup (two options). For LASSO and SVR, the most accurate scenario is the single train/test split with data for the whole of Poland, while for LSTM and GB trees it is cross-validation with data for the whole of Poland. Among the best scenarios for each model, the lowest ex-post RMSE is obtained for SVR. For this best-performing model, the outcome is interpreted with Shapley values, which make it possible to present the impact of the auxiliary variables in the machine learning model on the actual predicted value. Knowledge of the factors that have the strongest impact on the number of new infections can help companies plan their economic activity during the turbulent times of a pandemic. We propose to identify and compare the most important variables that affect predictions on both the train and test datasets of the model.
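The scenario setup described in the abstract can be illustrated with a short Python sketch. It enumerates the 16 combinations of model, hyperparameter selection method and data setup, and generates five time-ordered folds. The scenario labels and the function names `build_scenarios` and `expanding_window_splits` are hypothetical. The expanding-window fold layout is also an assumption: the abstract states only that 5-fold time-series cross-validation was used, not how the folds were cut.

```python
from itertools import product

# Illustrative labels only; they are not identifiers taken from the paper.
MODELS = ["LASSO", "GB", "SVR", "LSTM"]
SELECTION_METHODS = ["time_series_cv_5_fold", "single_train_test_split"]
DATA_SETUPS = ["whole_poland", "per_voivodeship"]


def build_scenarios():
    """Return every (model, selection method, data setup) combination."""
    return list(product(MODELS, SELECTION_METHODS, DATA_SETUPS))


def expanding_window_splits(n_obs, n_folds=5):
    """Yield (train_indices, test_indices) pairs in time order.

    Assumed scheme: the series is cut into n_folds + 1 consecutive blocks.
    Fold k trains on the first k blocks and tests on the block that
    follows, so no test observation ever precedes its training data.
    """
    block = n_obs // (n_folds + 1)
    if block == 0:
        raise ValueError("too few observations for the requested folds")
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any observations left over by the division.
        test_end = n_obs if k == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


scenarios = build_scenarios()
splits = list(expanding_window_splits(n_obs=120, n_folds=5))

print(len(scenarios))  # 16
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In the single train/test variant, only one such cut would be used, with all earlier observations forming the training subset. In either variant the ordering constraint is the same: the model is always evaluated on observations that come after the ones it was fitted on.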

KEYWORDS

machine learning, time series, COVID-19, forecasting, economic activity
