Evaluation of selected approaches to clustering categorical variables

Zdeněk Šulc; Hana Řezanková

DOI: https://doi.org/10.59170/stattrans-2014-039

Zdeněk Šulc, Department of Statistics and Probability, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic.

Hana Řezanková, Department of Statistics and Probability, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic.

Statistics in Transition new series, vol. 15, 2014, no. 4, pages 591-610. Published online: 1 December 2014.

ABSTRACT

This paper examines how recently proposed similarity measures perform in clustering categorical variables. It compares clustering results obtained with three recently developed similarity measures (the IOF, OF and Lin measures) against results obtained with two association measures for nominal variables (Cramér’s V and the uncertainty coefficient) and with the simple matching coefficient (the overlap measure). To eliminate the influence of a particular linkage method on the structure of the final clusters, three linkage methods are examined (complete, single and average). The resulting groups (clusters) of variables can serve as a basis for dimensionality reduction, e.g. by choosing one variable from a given group as a representative of the whole group. The quality of the resulting clusters is evaluated by within-cluster variability, expressed by the WCM coefficient, and by dendrogram analysis. The examined similarity measures are compared and evaluated using two real data sets from a social survey.
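The general workflow described in the abstract (measure the association between pairs of nominal variables, turn it into a dissimilarity, then merge variables hierarchically under a chosen linkage) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it uses Cramér's V alone, the transformation `1 - V` into a dissimilarity is an assumption, the function names `cramers_v` and `cluster_variables` and the toy data are hypothetical, and the IOF, OF, Lin and WCM computations from the paper are not reproduced.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two nominal variables given as equal-length sequences."""
    n = len(x)
    row, col, joint = Counter(x), Counter(y), Counter(zip(x, y))
    k = min(len(row), len(col))
    if k < 2:
        return 0.0  # assumption: a constant variable is treated as unassociated
    chi2 = 0.0
    for a, row_count in row.items():
        for b, col_count in col.items():
            expected = row_count * col_count / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def cluster_variables(data, n_clusters, linkage="average"):
    """Agglomerative clustering of variables (dict: name -> list of categories).

    Dissimilarity between two variables is taken as 1 - V (an assumption).
    """
    aggregate = {
        "single": min,                                  # nearest pair
        "complete": max,                                # farthest pair
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    names = list(data)
    dissim = {
        frozenset(pair): 1.0 - cramers_v(data[pair[0]], data[pair[1]])
        for pair in combinations(names, 2)
    }

    def between(c1, c2):
        return aggregate([dissim[frozenset((a, b))] for a in c1 for b in c2])

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest linkage dissimilarity.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (hypothetical): "a" and "b" are perfectly associated, as are "c" and "d".
toy = {
    "a": ["x", "x", "y", "y", "z", "z"],
    "b": ["p", "p", "q", "q", "r", "r"],
    "c": ["m", "n", "m", "n", "m", "n"],
    "d": ["u", "v", "u", "v", "u", "v"],
}
for method in ("complete", "single", "average"):
    print(method, cluster_variables(toy, 2, method))
```

On this toy data all three linkage methods group `a` with `b` and `c` with `d`; on real survey data the linkages can disagree, which is why the paper examines all three. One variable per resulting group could then be kept as its representative.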

KEYWORDS

variable clustering, nominal variables, association measures, similarity measures

REFERENCES

ANDERBERG, M. R., (1973). Cluster Analysis for Applications. Academic Press, New York.

BORIAH, S., CHANDOLA, V., KUMAR, V., (2008). Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 8th International Conference on Data Mining. SIAM, pp. 243–254.

CHANDOLA, V., BORIAH, S., KUMAR, V., (2009). A framework for exploring categorical data. In: Proceedings of the 9th International Conference on Data Mining. SIAM, pp. 187–198.

CHAVENT, M., KUENTZ, V., LIQUET, B., SARACCO, J., (2012). ClustOfVar: An R package for the clustering of variables. Journal of Statistical Software, 50(13):1–16. Available at: [Accessed: 16 October 2014].

CHAVENT, M., KUENTZ, V., SARACCO, J., (2010). A partitioning method for the clustering of categorical variables. In: Locarek-Junge, H., Weihs, C., eds, Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin Heidelberg, pp. 91–99.

D’ENZA, A. I., GREENACRE, M. J., (2012). Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In: Advanced Statistical Methods for the Analysis of Large Data-Sets. Springer, Berlin Heidelberg, pp. 453–463.

EVERITT, B. S., LANDAU, S., LEESE, M., STAHL, D., (2011). Cluster Analysis, 5th edn, Wiley, Chichester.

GAN, G., MA, C., WU, J., (2007). Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM, Philadelphia.

GORDON, A. D., (1999). Classification, 2nd edn, Chapman & Hall/CRC, Boca Raton.

GREENACRE, M. J., (2010). Correspondence analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(5):613–619.

JOLLIFFE, I. T., (2002). Principal Component Analysis, 2nd edn, Springer, New York.

LIN, D., (1998). An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 296–304.

PALLA, K., KNOWLES, D. A., GHAHRAMANI, Z., (2012). A nonparametric variable clustering model. In: Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q., eds, Advances in Neural Information Processing Systems 25. NIPS Foundation. Available at: [Accessed 16 October 2014].

PAYNE, T. R., EDWARDS, P., (1999). Dimensionality reduction through correspondence analysis. Available at: [Accessed 16 October 2014].

ŘEZANKOVÁ, H., LÖSTER, T., HÚSEK, D., (2011). Evaluation of categorical data clustering. In: Mugellini, E., Szczepaniak, P. S., Pettenati, M. C. et al., eds, Advances in Intelligent Web Mastering 3. Springer Verlag, Berlin, pp. 173–182.

ŘEZANKOVÁ, H., (2014). Nominal variable clustering and its evaluation. In: Proceedings of the 8th International Days of Statistics and Economics. Melandrium, Slaný, pp. 1293–1302. Available at: <http://msed.vse.cz/msed_2014/article/276-Rezankova-Hana-paper.pdf> [Accessed 5 November 2014].

SPARCK-JONES, K., (1972, 2002). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21. Later: Journal of Documentation, 60(5):493–502.

ŠULC, Z., ŘEZANKOVÁ, H., (2014). Evaluation of recent similarity measures for categorical data. In: Proceedings of the 17th International Conference Applications of Mathematics and Statistics in Economics. Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu, Wrocław, pp. 249–258. Available at: <http://www.amse.ue.wroc.pl/papers/Sulc,Rezankova.pdf> [Accessed 5 November 2014].