Missing data is a nuisance in statistics. Real donor imputation is one way to handle item nonresponse. Each unit with missing values is matched to a pool of donor units that have similar values on auxiliary variables. The missing value is then replaced by a copy of the corresponding observed value from a randomly drawn donor. Such methods can, to some extent, protect against nonresponse bias, but the bias also depends on the estimator and on the nature of the data. We adopt techniques from kernel estimation to combat this bias. Motivated by Pólya urn sampling, we sequentially update the set of potential donors with units that have already been imputed, and we use multiple imputations via the Bayesian bootstrap to account for imputation uncertainty. Simulations with a single auxiliary variable show that our imputation method performs almost as well as competing methods when the data are linear, and better when the data are nonlinear, especially with large samples.
Bayesian bootstrap, boundary and nonresponse bias, missing data, multiple imputation, Pólya urn models, real donor imputation.
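
The ingredients named in the abstract (kernel-weighted donor selection, a donor pool that grows as units are imputed, and Bayesian bootstrap weights across multiple imputations) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names, the choice of the Epanechnikov kernel, the fixed bandwidth, the nearest-donor fallback, and the redrawing of Bayesian bootstrap weights at every imputation step are all assumptions made only for illustration.

```python
import random


def epanechnikov(u):
    """Epanechnikov kernel; zero outside |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def bayesian_bootstrap_weights(n, rng):
    """Dirichlet(1, ..., 1) weights, drawn as normalized Exp(1) variates."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def impute_once(x, y, bandwidth, rng):
    """Return one completed copy of y; None marks a missing value.

    Hypothetical helper: a single auxiliary variable x is assumed.
    """
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    if not donors:
        raise ValueError("at least one observed value is required")
    completed = list(y)
    missing = [i for i, yi in enumerate(y) if yi is None]
    rng.shuffle(missing)  # random imputation order (an assumption)
    for i in missing:
        # Bayesian bootstrap weight times kernel weight in the auxiliary variable.
        bb = bayesian_bootstrap_weights(len(donors), rng)
        w = [b * epanechnikov((x[i] - xd) / bandwidth)
             for b, (xd, _) in zip(bb, donors)]
        if sum(w) == 0.0:
            # No donor inside the bandwidth: fall back to the nearest donor(s).
            dmin = min(abs(x[i] - xd) for xd, _ in donors)
            w = [1.0 if abs(x[i] - xd) == dmin else 0.0 for xd, _ in donors]
        value = rng.choices([yd for _, yd in donors], weights=w, k=1)[0]
        completed[i] = value
        # Polya-urn-style update: the imputed unit becomes a potential donor.
        donors.append((x[i], value))
    return completed


def multiple_imputation(x, y, m=5, bandwidth=1.0, seed=0):
    """Return m completed data sets produced by independent imputation passes."""
    rng = random.Random(seed)
    return [impute_once(x, y, bandwidth, rng) for _ in range(m)]


x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [1.0, None, 2.1, None, 4.2, 5.0]
completed_sets = multiple_imputation(x, y, m=3, bandwidth=1.0, seed=42)
```

Because every imputed value is copied from a donor, each completed data set contains only values that were actually observed. The spread of an estimate across the `m` completed sets then gives a rough indication of imputation uncertainty.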