Modeling Word Frequency Distributions in Albanian Scientific Texts


Authors

  • Lediana Ndreca (Kola) Polytechnic University of Tirana
  • Luela Prifti Polytechnic University of Tirana

Keywords:

Zipf law, Rank–Frequency Analysis, stop words, rare terms, optimal interval fit

Abstract

Zipf’s law has numerous applications in Natural Language Processing (NLP) and serves as a foundational tool for understanding the statistical behavior of a language. By describing the relationship between word frequency and rank, it provides insights that inform a wide range of NLP tasks. This study estimates the word probability distribution in a dataset of Albanian scientific doctoral theses. The log–log plot of the word rank–frequency distribution follows Zipf’s law, and the most frequent words in the corpus are identified.
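For reference, the rank–frequency relationship is commonly written as below. The symbols are assumptions chosen for illustration and are not taken from the paper: f(r) is the frequency of the word at rank r, C is a corpus-dependent constant, and α is the exponent, which is close to 1 in the classical statement of the law.

```latex
f(r) = \frac{C}{r^{\alpha}}
\quad\Longleftrightarrow\quad
\log f(r) = \log C - \alpha \log r
```

Taking logarithms turns the power law into a straight line with slope −α, which is why the fit is usually assessed on a log–log plot.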

Removing stop words did not improve the fit to Zipf’s law, whereas removing rare terms (those with frequency 1) did, producing a linear relationship on the log–log scale that explains over 96.83% of the variance. The paper introduces an optimal interval of 300 points that fits Zipf’s law, which can facilitate the extraction of key terms and contribute to the classification of these texts within larger corpora in future studies.
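The kind of analysis summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer (a simple `\w+` regular expression), the ordinary least-squares fit on log rank versus log frequency, and the function names `rank_frequency`, `fit_zipf`, and `drop_hapax` are all hypothetical choices made here.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted in descending order (rank 1 first).

    Assumption: tokens are runs of Unicode word characters, lowercased.
    """
    words = re.findall(r"\w+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def drop_hapax(freqs):
    """Remove rare terms, i.e. words that occur exactly once."""
    return [f for f in freqs if f > 1]


def fit_zipf(freqs):
    """Least-squares fit of log f = intercept + slope * log r.

    Ranks are taken as 1..len(freqs). Returns (slope, intercept, r2),
    where r2 is the share of variance explained by the line.
    Requires at least two frequencies.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # If all frequencies are equal, the flat line fits exactly.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r2
```

Applied to a corpus, `fit_zipf(rank_frequency(text))` and `fit_zipf(drop_hapax(rank_frequency(text)))` can be compared to see how removing rare terms changes the slope and the explained variance. A stop-word comparison would follow the same pattern, with the filtering done on the tokens before counting.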

Author Biographies

Lediana Ndreca (Kola), Polytechnic University of Tirana

Department of Mathematical Engineering

Luela Prifti, Polytechnic University of Tirana

Department of Mathematical Engineering

Published

2026-03-15

How to Cite

Ndreca (Kola), L., & Prifti, L. (2026). Modeling Word Frequency Distributions in Albanian Scientific Texts. International Journal of Advanced Natural Sciences and Engineering Researches, 10(3), 101–106. Retrieved from https://as-proceeding.com/index.php/ijanser/article/view/3075

Section

Articles