Zheng Xu https://orcid.org/0000-0003-0311-7004

© Zheng Xu. Article available under the CC BY-SA 4.0 licence

ARTICLE


ABSTRACT

Logistic regression is widely used in complex data analysis. When the predictors are observed at the individual level but the response only at the aggregate level, the logistic regression model can still be estimated by Maximum Likelihood Estimation (MLE), with the joint likelihood function formed from Poisson binomial distributions. When this complicated likelihood function is maximized directly, however, the performance of the MLE deteriorates as the number of predictors grows. In this article, we propose an expectation-maximization (EM) algorithm that avoids direct maximization of the complicated likelihood function. Simulation studies have been conducted to compare the performance of our EM estimator with that of estimators proposed in the literature, and two real data-based studies illustrate the use of the different estimators. Our EM estimator proves efficient for the logistic regression problem with an aggregate-level response and individual-level predictors.
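As a concrete illustration of the setting the abstract describes (a minimal sketch, not the paper's implementation): under a logistic model each individual has its own success probability, and the aggregate count of successes in a group then follows a Poisson binomial distribution, whose probability mass function can be computed with a standard dynamic-programming recursion. The function and variable names below are illustrative only.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """PMF of the number of successes among independent Bernoulli(p_i) trials,
    computed by the standard O(n^2) dynamic-programming recursion."""
    pmf = np.array([1.0])
    for pi in p:
        # P(S = k) after adding trial i: either k successes so far and a failure,
        # or k-1 successes so far and a success.
        pmf = np.append(pmf, 0.0) * (1.0 - pi) + np.append(0.0, pmf) * pi
    return pmf  # pmf[k] = P(S = k)

def aggregate_loglik(beta, X, s):
    """Log-likelihood contribution of one group: an aggregate count s of
    successes, given individual-level predictor rows X and coefficients beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # individual success probabilities
    return np.log(poisson_binomial_pmf(p)[s])
```

Directly maximizing the sum of such group-level log-likelihoods over beta is the "direct MLE" approach the abstract refers to; the proposed EM algorithm instead treats the unobserved individual responses as missing data.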

KEYWORDS

expectation-maximization algorithm, missing values, Poisson binomial distribution, logistic regression, data aggregation, numerical optimization.

REFERENCES

Agresti, A., (2013). Categorical Data Analysis, Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, USA.

Ahmed, M., (2023). Maternal Health Risk. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DP5D.

Ahmed, M., Kashem, M. A., Rahman, M. and Khatun, S., (2020). Review and analysis of risk factor of maternal health in remote area using the Internet of Things (IoT). URL https://api.semanticscholar.org/CorpusID:214577407.

Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., West, M., et al., (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7, pp. 453–464.

Brooks, S. P., Morgan, B. J. T., (2018). Optimization Using Simulated Annealing. Journal of the Royal Statistical Society Series D: The Statistician, 44(2), pp. 241–257. ISSN 2515-7884.

Chen, X., Dempster, A. and Liu, J., (1994). Weighted finite population sampling to maximize entropy. Biometrika, 81, pp. 457–469.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J., (2009). Wine Quality. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C56S3T.

Fernandez, M., Williams, S., (2010). Closed-form expression for the Poisson-binomial probability density function. IEEE Trans. Aerosp. Electron. Syst., 46, pp. 803–817.

Fletcher, R., (1970). A new approach to variable metric algorithms. Comput. J., 13, pp. 317–322.

Fletcher, R., Reeves, C., (1964). Function minimization by conjugate gradients. Comput. J., 7, pp. 149–154.

Geamsakul W., Yoshida T., Ohara K., Motoda H., Yokoi H., and Takabayashi K., (2005). Constructing a decision tree for graph-structured data and its applications. Fundamenta Informaticae, 66(1–2), pp. 131–160.

Getoor, L., Mihalkova, L., (2011). Learning statistical models from relational data. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 1195–1198.

Givens, G., Hoeting, J., (2012). Computational Statistics, Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, USA.

Hastie, T., Tibshirani, R. and Friedman, J., (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer series in statistics. Springer, Berlin, Germany. ISBN 9780387848846.

Henaff, M., Bruna, J. and LeCun, Y., (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hilbe, J., (2009). Logistic Regression Models. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, Boca Raton, Florida, USA. ISBN 9781420075779.

Hong, Y., (2013). On computing the distribution function for the Poisson binomial distribution. Comput. Stat. Data Anal., 59, pp. 41–51.

Kalivas, J. H., (1992). Optimization using variations of simulated annealing. Chemometrics and Intelligent Laboratory Systems, 15(1), pp. 1–12. ISSN 0169-7439.

Lambora, A., Gupta, K. and Chopra, K., (2019). Genetic algorithm-a literature review. In 2019 international conference on machine learning, big data, cloud and parallel computing (COMITCon), pp. 380–384. IEEE.

McLachlan, G. J. and Krishnan, T., (2007). The EM algorithm and extensions. John Wiley & Sons, New York City, USA.

Mercer, T. R., Salit, M., (2021). Testing at scale during the COVID-19 pandemic. Nature Reviews Genetics, 22(7), pp. 415–426.

Nelder, J., Mead, R., (1965). A simplex method for function minimization. Comput. J., 7, pp. 308–313.

Primo, D. M., Jacobsmeier, M. L. and Milyo, J., (2007). Estimating the impact of state policies and institutions with mixed-level data. State Politics & Policy Quarterly, 7(4), pp. 446–459.

Saramago, P., Sutton, A. J., Cooper, N. J. and Manca, A., (2012). Mixed treatment comparisons using aggregate and individual participant level data. Statistics in medicine, 31(28), pp. 3516–3536.

Wang, Y., (1993). On the number of successes in independent trials. Stat. Sin., 3, pp. 295–312.

Wei, G. C., Tanner, M. A., (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411), pp. 699–704.

Xu, Z., (2023). Logistic regression based on individual-level predictors and aggregate-level responses. Mathematics, 11(3), p.746.

Zhai, Y., Liu, B., (2006). Structured data extraction from the web based on partial tree alignment. IEEE Transactions on Knowledge and Data Engineering, 18(12), pp. 1614–1628.

© 2019–2025 Copyright by Statistics Poland, some rights reserved. Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0).