When data are missing from a statistical survey or from administrative sources, imputation is frequently used to fill the gaps and to reduce much of the bias that these gaps can introduce into aggregated estimates. This paper presents research on the efficiency of model-based imputation in business statistics, where the explanatory variable is a complex measure constructed by taxonomic methods. The proposed approach involves selecting, from a set of possible explanatory variables for the imputed information, those that fit best in terms of variation and correlation, and then replacing them with a single complex measure (meta-feature) that exploits their whole informational potential. This meta-feature is constructed as a function of a median distance of given objects from the benchmark of development. A simulation study and an empirical study were used to verify the efficiency of the proposed approach. The paper also presents five related imputation techniques: ratio imputation, regression imputation, regression imputation with iteration, predictive mean matching and the propensity score method. The second study presented in the paper involved a simulation of missing data using IT business data from California State University, Los Angeles, USA. The results show that models with a strong dependence on functional form assumptions can be improved by using a complex measure to summarize the predictor variables rather than the variables themselves (raw or normalized).
complex measure, ratio imputation, regression imputation, predictive mean matching, propensity score method
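The idea of replacing several predictors with one complex measure and then imputing from it can be illustrated with a minimal Python sketch. It is not the paper's procedure: the coordinate-wise median/MAD normalisation, the choice of the per-feature maximum as the benchmark of development, the rescaling constant 2.5, the function names `complex_measure` and `regression_impute`, and the synthetic data are all assumptions made here for illustration, and every feature is assumed to be a stimulant (larger is better).

```python
import numpy as np


def complex_measure(X):
    """Collapse the columns of X (objects x features) into one score per object.

    Assumptions (not taken from the paper): all features are stimulants,
    normalisation is coordinate-wise by median and MAD, and the benchmark
    is the largest normalised value of each feature.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    Z = (X - med) / mad                       # robust normalisation
    benchmark = Z.max(axis=0)                 # hypothetical "ideal" object
    # Median (over features) of each object's distance from the benchmark.
    d = np.median(np.abs(Z - benchmark), axis=1)
    d_med = np.median(d)
    d_mad = np.median(np.abs(d - d_med))
    # Rescale so that objects closer to the benchmark get larger scores.
    return 1.0 - d / (d_med + 2.5 * d_mad)


def regression_impute(y, m):
    """Fill NaNs in y with predictions from a simple linear fit of y on m."""
    y = np.asarray(y, dtype=float).copy()
    m = np.asarray(m, dtype=float)
    missing = np.isnan(y)
    slope, intercept = np.polyfit(m[~missing], y[~missing], deg=1)
    y[missing] = intercept + slope * m[missing]
    return y


# Synthetic illustration: three skewed predictors and one target with gaps.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=0.5, size=(60, 3))
y_true = 10.0 + 5.0 * X.sum(axis=1) + rng.normal(scale=0.5, size=60)
y = y_true.copy()
y[rng.choice(60, size=12, replace=False)] = np.nan

m = complex_measure(X)
y_imputed = regression_impute(y, m)
gaps = np.isnan(y)
print("mean absolute imputation error:", np.mean(np.abs(y_imputed - y_true)[gaps]))
```

The single score `m` stands in for all three predictors in the imputation model, so the regression has one explanatory variable regardless of how many raw features were selected.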