Original scientific paper

Building a Croatian language stemmer

Ivan Pandžić (ORCID: orcid.org/0000-0002-7741-8996); Institut za hrvatski jezik i jezikoslovlje, Ulica Republike Austrije 16, HR-10000 Zagreb


Full text: Croatian, PDF, 816 KB

pp. 301-327

Abstract

The paper presents two conservative Croatian language stemmers, k2 and k3. These stemmers are based on the k1 stemmer, an aggressive Croatian language stemmer presented by Nikola Ljubešić in a 2007 paper. By introducing an expanded set of rules that use derivational morphemes of nouns, verbs, and adjectives to determine the stems of words, we hoped to create a more efficient stemmer. To test whether the k2 and k3 stemmers were more efficient than the k1 stemmer, we calculated their precision, recall, and F1-score on a 9775-token corpus and compared the results with the precision, recall, and F1-score of the k1 stemmer.
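
As a rough illustration of the evaluation measures named above, the following minimal sketch computes precision, recall, and F1-score from raw counts using the standard definitions. The abstract does not state how the paper counts a correct or incorrect stemming decision, so the function name `precision_recall_f1`, its parameters, and the counts used here are hypothetical and are not results from the paper.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch, not something taken from the paper).
    """
    predicted = true_pos + false_pos   # everything the stemmer marked as positive
    actual = true_pos + false_neg      # everything that should have been positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only; NOT figures from the paper.
p, r, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
print(p, r, f1)
```

Comparing two stemmers under this scheme would mean computing the three values for each stemmer on the same annotated corpus and comparing them side by side.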

Keywords

rule-based stemming; computational linguistics; natural language processing; Croatian language

Hrčak ID:

150047

URI

https://hrcak.srce.hr/150047

Publication date:

29 December 2015