Original scientific paper
The Text Vocabulary Size Law. Heaps' Law and Determining Text Vocabulary Size in Croatian Language
Miroslav TUĐMAN
Abstract
The existing formulation of Heaps' Law for the size of a text's
vocabulary, Vr(n) = Kn^β, is not universal, so the law needs to be
redefined before it can be used to analyse a different
language corpus. The analysis of a corpus of texts in the Croatian
language confirms the hypothesis that the share of
functional items (F) in a text is constant and amounts to 21% of
the text size n (in English texts, functional items make up 26%).
The author proves that the percentage of functional items
in a text can be used as the value of the parameter K, and that
K is a constant for each language
corpus. Empirical research has confirmed the author's thesis that
the number of functional items in a text can be calculated with
the formula F = nK/100, and that the frequency of the
most frequent item (MF) is given by MF = n(K/100)^2.
The other parameter of Heaps' Law can also
be accurately determined: β = log K/100. The author therefore
proposes a new form of the text vocabulary size law: Vr(n) = (Kn)^β.
The number of words appearing only once in the text (hapax
legomena, HL) can be calculated with the formula HL = ((Kn)/2)^β.
Research confirms a very high correlation between the calculated
and actual values of the vocabulary size, as well as between the
calculated and actual numbers of words occurring only once in the
text. Interpreted and
defined in this way, the text vocabulary size law makes it possible
to calculate a text's vocabulary size in any language, provided
the constant percentage of functional words for that language is
known. Beyond determining the size of the text's vocabulary, this
interpretation of the law also allows the calculation of the number
of functional items in the text, the frequency of the most frequent
word in the text, and the number of words occurring only once in
the text's vocabulary.
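
The four formulas in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, K is taken as a percentage (21 for Croatian, 26 for English, as stated above), and β is passed in explicitly because the expression "β = log K/100" does not make the base or grouping of the logarithm clear as printed here.

```python
def functional_items(n: int, K: float) -> float:
    """F = nK/100: estimated number of functional items in a text of n words."""
    return n * K / 100


def most_frequent_item(n: int, K: float) -> float:
    """MF = n(K/100)^2: estimated frequency of the most frequent item."""
    return n * (K / 100) ** 2


def vocabulary_size(n: int, K: float, beta: float) -> float:
    """Vr(n) = (Kn)^beta: the proposed form of the vocabulary size law."""
    return (K * n) ** beta


def single_occurrence_words(n: int, K: float, beta: float) -> float:
    """HL = ((Kn)/2)^beta: estimated number of words occurring only once."""
    return (K * n / 2) ** beta


if __name__ == "__main__":
    n = 10_000          # text length in words (illustrative value)
    K = 21              # percentage of functional items reported for Croatian
    beta = 0.68         # placeholder exponent, NOT a value taken from the paper

    print("F  =", functional_items(n, K))
    print("MF =", round(most_frequent_item(n, K), 2))
    print("Vr =", round(vocabulary_size(n, K, beta)))
    print("HL =", round(single_occurrence_words(n, K, beta)))
```

For a Croatian text of 10,000 words, the first two formulas give F = 2,100 functional items and MF of about 441 occurrences of the most frequent item. Under these formulas the ratio HL/Vr depends only on β, since ((Kn)/2)^β / (Kn)^β = (1/2)^β.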
Keywords
Hrčak ID:
16266
URI
Publication date:
30 April 2005