The new generation of Large Language Models, based on Generative Pre-trained Transformers (GPT), can be useful for automatic text annotation and sentiment analysis. However, such models tend to absorb bias from their training data, which can distort results. In this paper, the GPT-4o-mini model by OpenAI is tested for the presence of geographical, political and gender bias in the case of Polish economic news headlines. It is found that the model consistently assigns different sentiment scores to the same sentence depending on the country mentioned. A remedy to this problem is proposed, in which references to countries and nationalities are masked using the GPT model before scoring. Some differences in sentiment scores resulting from explicit references to gender or political parties are also identified, although these types of bias are considerably weaker than the geographical bias.
large language models, geographical bias, gender bias, political bias, sentiment analysis
Ahuja, K., Diddee, H., Hada, R., Ochieng, M., Ramesh, K., Jain, P., Nambi, A., Ganu, T., Segal, S., Axmed, M., Bali, K. and Sitaram, S., (2023). MEGA: Multilingual Evaluation of Generative AI, in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 4232–4267. https://aka.ms/MEGA.
Aslan, F., (2024). Bias assessment in Large Language Models, PhD thesis, Tilburg University, Tilburg.
Caliskan, A., Bryson, J. J. and Narayanan, A., (2017). Semantics derived automatically from language corpora contain human-like biases, Science, 356, pp. 183–186. https://doi.org/10.1126/science.aal4230.
Curry, N., Baker, P. and Brookes, G., (2024). Generative AI for corpus approaches to discourse studies: A critical evaluation of ChatGPT, Applied Corpus Linguistics, 4(1). https://doi.org/10.1016/j.acorp.2023.100082.
Dac Lai, V., Trung Ngo, N., Pouran Ben Veyseh, A., Man, H., Dernoncourt, F., Bui, T. and Huu Nguyen, T., (2023). ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning, in Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 13171–13189.
Debess, I. N., Simonsen, A. and Einarsson, H., (2024). Good or Bad News? Exploring GPT-4 for Sentiment Analysis for Faroese on a Public News Corpora, Technical report, ELRA Language Resource Association. https://huggingface.co/datasets/hafsteinn/.
Etxaniz, J., Azkune, G., Soroa, A., Lopez De Lacalle, O. and Artetxe, M., (2023). Do Multilingual Language Models Think Better in English?, in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), vol. 2, Association for Computational Linguistics, Mexico City, pp. 550–564.
Fatouros, G., Soldatos, J., Kouroumali, K., Makridis, G. and Kyriazis, D., (2023). Transforming sentiment analysis in the financial domain with ChatGPT, Machine Learning with Applications, 14, 100508.
Garg, N., Schiebinger, L., Jurafsky, D. and Zou, J., (2018), Word embeddings quantify 100 years of gender and ethnic stereotypes, Proceedings of the National Academy of Sciences of the United States of America, 115(16), pp. 3635–3644.
Gilardi, F., Alizadeh, M. and Kubli, M., (2023). ChatGPT outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences of the United States of America, 120(30).
Han, X., Baldwin, T. and Cohn, T., (2022). Balancing out Bias: Achieving Fairness Through Balanced Training, in Y. Goldberg, Z. Kozareva and Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pp. 11335–11350. https://doi.org/10.18653/v1/2022.emnlp-main.779.
Huang, P.-S., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., Maini, V., Yogatama, D. and Kohli, P., (2020). Reducing Sentiment Bias in Language Models via Counterfactual Evaluation, arXiv. http://arxiv.org/abs/1911.03064.
Kheiri, K., Karimi, H., (2023). SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning, arXiv. http://arxiv.org/abs/2307.10234.
Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., Bielaniewicz, J., Gruza, M., Janz, A., Kanclerz, K., Kocoń, A., Koptyra, B., Mieleszczenko-Kowszewicz, W., Miłkowski, P., Oleksy, M., Piasecki, M., Radliński, Ł., Wojtasik, K., Woźniak, S. and Kazienko, P., (2023). ChatGPT: Jack of all trades, master of none, Information Fusion, 99, 101861. https://doi.org/10.1016/j.inffus.2023.101861.
Kristensen-McLachlan, R. D., Canavan, M., Kardos, M., Jacobsen, M. and Aaroe, L., (2023). Chatbots Are Not Reliable Text Annotators, arXiv. http://arxiv.org/abs/2311.05769.
Krugmann, J. O. and Hartmann, J., (2024). Sentiment Analysis in the Age of Generative AI, Customer Needs and Solutions, 11(1).
Lee, S., Kim, D., Jung, D., Park, C. and Lim, H., (2024). Exploring Inherent Biases in LLMs within Korean Social Context: A Comparative Analysis of ChatGPT and GPT-4, in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 4, pp. 93–104.
Liang, P. P., Wu, C., Morency, L.-P. and Salakhutdinov, R., (2021). Towards Understanding and Mitigating Social Biases in Language Models, ICML. https://arxiv.org/abs/2106.13219.
Liu, R., Jia, C., Wei, J., Xu, G., Wang, L. and Vosoughi, S., (2021). Mitigating Political Bias in Language Models Through Reinforced Calibration, in The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21). https://doi.org/10.48550/arXiv.2104.14795.
Liu, Z., (2025). Cultural Bias in Large Language Models: A Comprehensive Analysis and Mitigation Strategies, Journal of Transcultural Communication, 3(2), pp. 224–244. https://www.degruyterbrill.com/document/doi/10.1515/jtc-2023-0019/html.
Liyanage, C. R., Gokani, R. and Mago, V., (2024). GPT-4 as an X data annotator: Unraveling its performance on a stance classification task, PLoS ONE, 19.
Manvi, R., Khanna, S., Burke, M., Lobell, D. and Ermon, S., (2024). Large Language Models are Geographically Biased, in Proceedings of the 41st International Conference on Machine Learning, 1409, pp. 34654–34669.
Nadeem, M., Bethke, A. and Reddy, S., (2021). StereoSet: Measuring stereotypical bias in pretrained language models, in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, pp. 5356–5371.
Ollion, É., Shen, R., Macanovic, A. and Chatelain, A., (2023). ChatGPT for Text Annotation? Mind the Hype! https://doi.org/10.31235/osf.io/x58kn.
Orgad, H. and Belinkov, Y., (2023). BLIND: Bias Removal With No Demographics, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 8801–8821. https://doi.org/10.18653/v1/2023.acl-long.490.
Pangakis, N., Wolken, S. and Fasching, N., (2023). Automated Annotation with Generative AI Requires Validation, arXiv. http://arxiv.org/abs/2306.00176.
Radaideh, M. I., Kwon, H. and Radaideh, M. I., (2025). Fairness and Social Bias Quantification in Large Language Models for Sentiment Analysis, Knowledge-Based Systems, 319, 113569. https://doi.org/10.1016/j.knosys.2025.113569.
Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M. and Goldberg, Y., (2020). Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7237–7256.
Retzlaff, N., (2024). Political Biases of ChatGPT in Different Languages, Preprints.org. www.preprints.org.
Rozado, D., (2020). Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types, PLoS ONE, 15(4).
Rozado, D., (2023). The Political Biases of ChatGPT, Social Sciences, 12(3), 148. https://doi.org/10.3390/socsci12030148.
Srinivasan, N., Perumalsamy, K., Sridhar, K., Rajendran, G. and Kumar, A. A., (2024). Comprehensive Study on Bias In Large Language Models, International Refereed Journal of Engineering and Science, 13(2), pp. 77–82.
Utama, P. A., Moosavi, N. S. and Gurevych, I., (2020). Towards Debiasing NLU Models from Unknown Biases, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 7597–7610.
Worldometer, (2024). Worldometer GDP per capita dataset. https://www.worldometers.info/gdp/gdp-per-capita/, accessed: 23.09.2024.
Zhao, J., Zhou, Y., Li, Z., Wang, W. and Chang, K.-W., (2018). Learning Gender-Neutral Word Embeddings, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4847–4853. http://arxiv.org/abs/1809.01496.
Zhu, S., Wang, W. and Liu, Y., (2024). Quite Good, but Not Enough: Nationality Bias in Large Language Models – A Case Study of ChatGPT, arXiv. http://arxiv.org/abs/2405.06996.