Implementation of K-Nearest Neighbor using the oversampling technique on mixed data for the classification of household welfare status

Nur Mutmainnah Djafar; Achmad Fauzan

doi:https://doi.org/10.59170/stattrans-2024-007

Nur Mutmainnah Djafar, Department of Statistics, Faculty of Mathematics and Natural Science, Universitas Islam Indonesia, Indonesia

Achmad Fauzan, Department of Statistics, Faculty of Mathematics and Natural Science, Universitas Islam Indonesia, Indonesia. ORCID: https://orcid.org/0000-0002-0533-5518

Statistics in Transition new series, vol. 25, no. 1, 2024, pages 109-124

Published online: 6 March 2024

Citation: Djafar, N. M., Fauzan, A., 2024. Implementation of K-Nearest Neighbor using the oversampling technique on mixed data for the classification of household welfare status. Statistics in Transition new series, 25(1), pp. 109-124. https://doi.org/10.59170/stattrans-2024-007.

ABSTRACT

Welfare is closely related to poverty and socio-economic disparities in a society. According to data from the Central Bureau of Statistics, Kulon Progo had the highest poverty rate in the Special Region of Yogyakarta province, Indonesia, with an increasing trend every year from 2019 to 2021. Kulon Progo also had a low poverty line (after Gunung Kidul) compared with the other regencies/cities in the province. This study aimed to classify household welfare status in Kulon Progo in March 2021 using the K-Nearest Neighbor (KNN) method. Because the poor and non-poor classes were imbalanced, and class imbalance affects classification results, particularly predictions, an oversampling technique was employed. Three oversampling techniques were compared: Random Oversampling (RO), Adaptive Synthetic sampling (ADASYN), and the Synthetic Minority Oversampling Technique (SMOTE). Of the three, RO with k = 5 performed best, with sensitivity, specificity, G-mean, and accuracy of 0.643, 0.805, 0.719, and 78.873%, respectively. The classification model therefore performed well enough to classify household welfare status, especially for the poor (minority) class.
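To make the oversampling step and the reported metrics concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `random_oversample` and `evaluate` are hypothetical, and the G-mean is assumed to be the geometric mean of sensitivity and specificity, which is consistent with the figures above: sqrt(0.643 × 0.805) ≈ 0.719.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest one (plain random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def evaluate(y_true, y_pred, positive):
    """Sensitivity, specificity, G-mean and accuracy for a binary problem.
    Assumes both classes occur in y_true (otherwise a division by zero)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),  # assumed definition
        "accuracy": (tp + tn) / len(pairs),
    }


# Consistency check on the reported figures: sqrt(0.643 * 0.805), rounded.
reported_g_mean = round(math.sqrt(0.643 * 0.805), 3)
```

In a workflow of this kind, oversampling is usually applied to the training split only, so that the test set keeps the original class proportions; whether the study followed exactly this protocol is not stated in the abstract.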

KEYWORDS

ADASYN, KNN, random oversampling, SMOTE, welfare
