The effect of binary data transformation in categorical data clustering

Jana Cibulková; Zdenek Šulc; Sergej Sirota; Hana Rezanková

doi:10.21307/stattrans-2019-013

Jana Cibulková, Zdenek Šulc, Sergej Sirota, Hana Rezanková
Department of Statistics and Probability, University of Economics, Prague, Czech Republic

Statistics in Transition new series, vol. 20, no. 2, 2019, pages 33-47
Published online: 2 July 2019
DOI 10.21307/stattrans-2019-013

ABSTRACT

This paper focuses on hierarchical clustering of categorical data and compares two approaches to this task. The first, and very common, approach transforms the categorical variables into sets of binary dummy variables and then applies similarity measures designed for binary data. These measures are well studied and are available in both commercial and non-commercial software. However, the binary transformation may cause a loss of information in the data or slow down the computations. The second approach uses similarity measures developed directly for categorical data. These measures are less well studied than the binary ones and are not implemented in commercial software. The two approaches are compared on generated data sets with categorical variables, and the results are evaluated using both internal and external evaluation criteria. The aim of the paper is to show that the binary transformation is not necessary when clustering categorical data, since the second approach yields clustering results that are at least comparable in quality to those of the first.
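To make the contrast between the two approaches concrete, the following minimal Python sketch dummy-codes a tiny categorical data set and compares pairwise similarities before and after the transformation. It is an illustration only and not the authors' code: the toy data, the function names, and the choice of measures (simple matching and Jaccard on the binary side, the plain overlap measure on the categorical side) are all assumptions made for this example.

```python
def dummy_encode(rows):
    """Turn each (variable, category) pair into one 0/1 column."""
    n_vars = len(rows[0])
    columns = [
        (j, category)
        for j in range(n_vars)
        for category in sorted({row[j] for row in rows})
    ]
    return [[1 if row[j] == category else 0 for j, category in columns]
            for row in rows]


def simple_matching(x, y):
    """Share of binary positions (0-0 or 1-1) on which x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """Share of 1-1 matches among positions where at least one value is 1."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 1.0


def overlap(x, y):
    """Share of categorical variables on which x and y take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical toy data: three nominal variables (colour, size, flag).
rows = [
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("blue", "large", "no"),
    ("green", "small", "no"),
]

binary = dummy_encode(rows)  # 3 + 2 + 2 = 7 dummy columns

sim_categorical = overlap(rows[0], rows[1])
sim_binary_sm = simple_matching(binary[0], binary[1])
sim_binary_jaccard = jaccard(binary[0], binary[1])

print(len(binary[0]), sim_categorical, sim_binary_sm, sim_binary_jaccard)
```

In this toy case the first two objects agree on two of the three original variables, yet the same pair receives different similarity values once the variables are expanded into seven dummy columns, and the value depends on which binary measure is applied. The sketch also shows why the transformation enlarges the data matrix: the number of columns grows from the number of variables to the total number of categories.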

KEYWORDS

hierarchical cluster analysis, nominal variable, binary variable, categorical data, similarity measures, evaluation criteria, generated data
