Enhancing Thematic Classification and Semantic Consistency in Albanian Text Using Synthetic Data and Transformer-Based Embedding Models


Authors

  • Leona Rexhaj Polytechnic University of Tirana
  • Luela Prifti Polytechnic University of Tirana

Keywords:

Academic Text Analysis, Synthetic Academic Data, Semantic Similarity, E5-Large, Albanian Language, Natural Language Processing

Abstract

This paper proposes an end-to-end computational framework for the thematic classification and semantic
analysis of Albanian text, focusing on the Mathematical Sciences. Because structured, annotated
Albanian academic corpora are scarce, the study introduces a data augmentation approach that combines real
student theses with synthetic texts generated by large language models (LLMs). In the real dataset, the
dominant academic sections are Rezultate (Results; 463), Përfundime (Conclusions; 357), Hyrje
(Introduction; 117), and Metodologji (Methodology; 63). After the synthetic material is integrated, most
sections grow substantially, particularly the methodological and theoretical ones, while Rezultate decreases
slightly because the synthetic documents are more theoretical in nature. Semantic coherence was evaluated
using cosine similarity over embeddings from the multilingual E5-Large encoder. For the real dataset, the
strongest semantic alignment was found between Rezultate and Përfundime (0.8171), followed by Hyrje and
Metodologji (0.7923) and by Analizë teorike (Theoretical analysis) and Modelim matematik (Mathematical
modeling) (0.7864), consistent with the expected structural and conceptual proximity of these sections.
After augmentation, the corresponding values were 0.8101, 0.7881, and 0.7902, each within 0.01 of its
pre-augmentation value. These results indicate that the generated synthetic material preserves academic
structure and maintains coherent thematic relationships, and that transformer-generated synthetic data can
enrich Albanian academic corpora without degrading semantic or structural consistency. The proposed
framework provides a replicable approach for large-scale text processing in low-resource languages,
supporting future research tasks such as document categorization, academic writing analysis, and digital
humanities.
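
The section-to-section comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how embeddings were aggregated per section, so averaging them into one centroid per section is an assumption made here, as are the function names and the toy 3-dimensional vectors, which stand in for real E5-Large embeddings (no encoder is called).

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def section_similarity(
    embeddings: Dict[str, List[Vector]], first: str, second: str
) -> float:
    """Cosine similarity between the centroids of two sections.

    Assumption: one centroid per section; the paper may aggregate differently.
    """
    return cosine_similarity(centroid(embeddings[first]), centroid(embeddings[second]))


# Hypothetical toy vectors, NOT real model output. In practice each vector
# would be the encoder's embedding of one text segment from that section.
toy_embeddings: Dict[str, List[Vector]] = {
    "Rezultate": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Përfundime": [[0.7, 0.3, 0.1], [0.9, 0.2, 0.0]],
    "Hyrje": [[0.1, 0.9, 0.3]],
}

if __name__ == "__main__":
    for first, second in [("Rezultate", "Përfundime"), ("Rezultate", "Hyrje")]:
        score = section_similarity(toy_embeddings, first, second)
        print(f"{first} / {second}: {score:.4f}")
```

Running the same comparison on the real and the augmented corpus, then checking that the scores stay close, mirrors the consistency check the abstract reports.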

Author Biographies

Leona Rexhaj, Polytechnic University of Tirana

Department of Mathematical Engineering, Albania

Luela Prifti, Polytechnic University of Tirana

Department of Mathematical Engineering, Albania

Published

2025-12-03

How to Cite

Rexhaj, L., & Prifti, L. (2025). Enhancing Thematic Classification and Semantic Consistency in Albanian Text Using Synthetic Data and Transformer-Based Embedding Models. International Journal of Advanced Natural Sciences and Engineering Researches, 9(12), 107–112. Retrieved from https://as-proceeding.com/index.php/ijanser/article/view/2943

Section

Articles