Thresholding nonprobability units in combined data for efficient domain estimation

Terrance D. Savitsky; Matthew R. Williams; Vladislav Beresovsky; Julie Gershunskaya

DOI: https://doi.org/10.59139/stattrans-2025-013

Terrance D. Savitsky, Office of Survey Methods Research, U.S. Bureau of Labor Statistics, USA. ORCID: https://orcid.org/0000-0003-1843-3106
Matthew R. Williams, RTI International, USA. ORCID: https://orcid.org/0000-0001-8894-1240
Vladislav Beresovsky, Office of Survey Methods Research, U.S. Bureau of Labor Statistics, USA. ORCID: https://orcid.org/0009-0002-8375-5195
Julie Gershunskaya, OEUS Statistical Methods Division, U.S. Bureau of Labor Statistics, USA. ORCID: https://orcid.org/0000-0002-0096-186X

Statistics in Transition new series, vol. 26, no. 2, 2025, pages 1-19
Published online: 13 June 2025

Citation: Savitsky T. D., Williams M. R., Beresovsky V., Gershunskaya J., 2025. Thresholding nonprobability units in combined data for efficient domain estimation. Statistics in Transition new series, 26(2), pp. 1-19; https://doi.org/10.59139/stattrans-2025-013

ABSTRACT

Quasi-randomization approaches estimate latent participation probabilities for units from a nonprobability (convenience) sample. Estimating participation probabilities for convenience units allows them to be combined with units from the randomized survey sample to form a survey-weighted domain estimate. One leverages convenience units for domain estimation under the expectation that estimation precision and bias will improve relative to using the survey sample alone; however, convenience sample units that differ greatly in their covariate support from the survey sample units may inflate estimation bias or variance. This paper develops a method to threshold, or exclude, convenience units to minimize the variance of the resulting survey-weighted domain estimator. We compare our thresholding method with other thresholding constructions in a simulation study for two classes of datasets defined by the degree of overlap between survey and convenience samples on covariate support. We show that excluding convenience units that each express a low probability of appearing in both reference and convenience samples reduces estimation error.
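The general idea of thresholding convenience units before forming a weighted domain estimate can be illustrated with a minimal Python sketch. This is not the authors' estimator or their variance-minimizing threshold rule. It assumes, purely for illustration, that each unit already has a reference-sample inclusion probability `pi_r` and an estimated convenience participation probability `pi_c`, that the two selections are independent (so the probability of appearing in both samples is `pi_r * pi_c` and the probability of appearing in at least one is `pi_r + pi_c - pi_r * pi_c`), and that the cutoff is supplied by the user. The names `Unit`, `threshold_convenience`, and `weighted_domain_mean` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    y: float           # response value for a unit in the domain of interest
    pi_r: float        # assumed inclusion probability in the reference (survey) sample
    pi_c: float        # assumed estimated participation probability in the convenience sample
    convenience: bool  # True if the unit was observed in the convenience sample


def joint_prob(u: Unit) -> float:
    """Probability of appearing in both samples, ASSUMING independent selection."""
    return u.pi_r * u.pi_c


def union_prob(u: Unit) -> float:
    """Probability of appearing in at least one sample, ASSUMING independent selection."""
    return u.pi_r + u.pi_c - u.pi_r * u.pi_c


def threshold_convenience(units: List[Unit], cutoff: float) -> List[Unit]:
    """Keep every survey unit; keep a convenience unit only if its
    joint probability is at least `cutoff` (an illustrative rule)."""
    return [u for u in units if not u.convenience or joint_prob(u) >= cutoff]


def weighted_domain_mean(units: List[Unit]) -> float:
    """Hajek-type mean with weights 1 / union_prob over the pooled units.
    Weights are NOT re-estimated after thresholding in this sketch."""
    weights = [1.0 / union_prob(u) for u in units]
    return sum(w * u.y for w, u in zip(weights, units)) / sum(weights)


# Toy data (made up for illustration): the third unit has a very low
# joint probability and therefore a very large weight.
units = [
    Unit(y=10.0, pi_r=0.5, pi_c=0.5, convenience=False),
    Unit(y=20.0, pi_r=0.5, pi_c=0.5, convenience=True),
    Unit(y=100.0, pi_r=0.01, pi_c=0.02, convenience=True),
]
kept = threshold_convenience(units, cutoff=0.01)

print("mean without thresholding:", weighted_domain_mean(units))
print("mean with thresholding:   ", weighted_domain_mean(kept))
```

In the toy data the excluded unit carries a weight many times larger than the others, so it dominates the unthresholded estimate. A real analysis would choose the cutoff by a principled criterion and would likely need to re-estimate weights after exclusion; both steps are outside this sketch.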

KEYWORDS

survey sampling, nonprobability sampling, data combining, quasi-randomization, thresholding units, Bayesian hierarchical modeling
