Modeling Word Frequency Distributions in Albanian Scientific Texts
Keywords:
Zipf law, Rank–Frequency Analysis, stop words, rare terms, optimal interval fit
Abstract
Zipf’s law has numerous applications in Natural Language Processing (NLP) and serves as a foundational tool for understanding the statistical behavior of a language. By describing the relationship between word frequency and rank, it provides insights that inform a wide range of NLP tasks. This study estimates the word probability distribution in a dataset of Albanian scientific doctoral theses. The log–log plot of the word rank–frequency distribution fits Zipf’s law, and the most frequent words in the corpus are identified.
Removing stop words did not improve the fit to Zipf’s law, whereas removing rare terms (frequency = 1) did, producing a linear relationship that explains over 96.83% of the variance. The paper introduces an optimal interval of 300 points over which the distribution fits Zipf’s law; this interval can facilitate the extraction of key terms and contribute to the classification of these texts within larger corpora in future studies.
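The kind of analysis summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rank_frequency`, `fit_zipf`, `best_window`), the ordinary least-squares fit in log–log space, and the sliding-window search by highest R² are all assumptions made for illustration. Under Zipf’s law, frequency is proportional to a negative power of rank, so log frequency plotted against log rank should be close to a straight line, and R² measures the share of variance that line explains.

```python
import math
from collections import Counter


def rank_frequency(tokens, min_count=1):
    """Return word frequencies in descending order (rank 1 first).

    Words occurring fewer than min_count times are dropped, so
    min_count=2 removes the rare terms with frequency = 1.
    """
    counts = Counter(tokens)
    return sorted((c for c in counts.values() if c >= min_count), reverse=True)


def fit_zipf(freqs, start=0, stop=None):
    """Least-squares fit of log(f) = intercept + slope * log(rank).

    Only the slice freqs[start:stop] is fitted; ranks keep their
    original 1-based positions. Returns (slope, intercept, r2).
    """
    window = freqs[start:stop]
    xs = [math.log(start + i + 1) for i in range(len(window))]
    ys = [math.log(f) for f in window]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 is undefined when every frequency in the window is equal;
    # this sketch treats that case as 0.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2


def best_window(freqs, width):
    """Hypothetical search: slide a fixed-width window over the ranks
    and keep the one with the highest R^2.

    Returns (start_index, slope, r2), or None if freqs is shorter
    than width.
    """
    best = None
    for start in range(len(freqs) - width + 1):
        slope, _, r2 = fit_zipf(freqs, start, start + width)
        if best is None or r2 > best[2]:
            best = (start, slope, r2)
    return best
```

With a tokenized corpus in `tokens`, calling `fit_zipf(rank_frequency(tokens, min_count=2))` fits the distribution after removing frequency-1 terms, and `best_window(freqs, 300)` looks for a 300-point interval. Whether the reported interval was found this way is not stated in the abstract.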