We consider the problem of predicting a function of misclassified binary variables. We observe that the naive predictor, which ignores misclassification errors, is unbiased whenever the probabilities of false positives and false negatives are identical, even if the total misclassification error is high. Otherwise, the bias of the naive predictor depends on the misclassification distribution, and its magnitude can be large in certain cases. We correct the bias of the naive predictor using a double sampling idea, in which both inaccurate and accurate measurements of the binary variable are taken on all units of a sample drawn from the original data by a probability sampling scheme. Using this additional information and design-based sample survey theory, we derive a bias-corrected predictor. We also examine the cases in which the new bias-corrected predictors improve on the naive predictor in terms of mean square error (MSE).
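The following Python sketch illustrates the two ideas above for the simplest target, a population total of a binary variable. It is an illustration under stated assumptions, not the predictor derived in the paper: we read "probabilities of false positives and false negatives" as the joint probabilities P(y* = 1, y = 0) and P(y* = 0, y = 1), and we use a generic design-based difference (Horvitz-Thompson) correction. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def naive_total(y_obs):
    """Naive predictor of the total: sum the error-prone indicators as if exact."""
    return sum(y_obs)


def expected_naive_bias(n0, n1, fp_rate, fn_rate):
    """Expected bias of the naive total under a simple misclassification model.

    Assumption: fp_rate = P(y*=1 | y=0), fn_rate = P(y*=0 | y=1), with n0 true
    zeros and n1 true ones. The bias vanishes when fp_rate*n0 == fn_rate*n1,
    i.e. when the joint false-positive and false-negative probabilities agree.
    """
    return fp_rate * n0 - fn_rate * n1


def corrected_total(y_obs, audited, pi):
    """Difference-type correction (illustrative, not the paper's formula).

    audited: dict mapping sampled unit index -> accurate value y_i
    pi:      dict mapping unit index -> first-order inclusion probability
    """
    # Horvitz-Thompson estimate of the total error sum(y*_i - y_i).
    bias_hat = sum((y_obs[i] - y_true) / pi[i] for i, y_true in audited.items())
    return naive_total(y_obs) - bias_hat


def mean_corrected_over_srs(y_true, y_obs, n):
    """Average corrected_total over every simple random sample of size n."""
    N = len(y_true)
    pi = {i: n / N for i in range(N)}
    totals = []
    for s in combinations(range(N), n):
        audited = {i: y_true[i] for i in s}
        totals.append(corrected_total(y_obs, audited, pi))
    return sum(totals) / len(totals)


if __name__ == "__main__":
    # 18% of units are misclassified in expectation, yet the bias is zero
    # because the joint probabilities are equal (0.1*0.9 == 0.9*0.1).
    print(expected_naive_bias(n0=900, n1=100, fp_rate=0.1, fn_rate=0.9))

    y_true = [1, 1, 0, 0, 0, 1]
    y_obs = [1, 0, 1, 1, 0, 1]  # one false negative, two false positives
    print(naive_total(y_obs), sum(y_true))
    print(mean_corrected_over_srs(y_true, y_obs, n=2))
```

Averaging the corrected predictor over all simple random samples of a fixed size recovers the true total up to floating-point error, which is the design-unbiasedness property the correction relies on; whether the correction also lowers the MSE depends on the sample size and on how large the naive bias is.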
binary classification, double sampling, finite population sampling, misclassification, linkage error, sampling design