
Original scientific paper

Building a Croatian language stemmer

Ivan Pandžić (ORCID: orcid.org/0000-0002-7741-8996); Institut za hrvatski jezik i jezikoslovlje, Ulica Republike Austrije 16, HR-10000 Zagreb


Full text: Croatian (PDF, 816 KB)

Pages: 301-327




Abstract

The paper presents two conservative Croatian language stemmers, k2 and k3. Both are based on k1, an aggressive Croatian language stemmer presented by Nikola Ljubešić in a 2007 paper. By introducing an expanded set of rules that use the derivational morphemes of nouns, verbs, and adjectives to determine word stems, we hoped to create a more efficient stemmer. To test whether k2 and k3 were more efficient than k1, we calculated their precision, recall, and F1-score on a 9,775-token corpus and compared the results with the precision, recall, and F1-score of the k1 stemmer.
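
For readers unfamiliar with the three measures named in the abstract, the following is a minimal sketch of how precision, recall, and F1-score are conventionally computed from counts of true positives, false positives, and false negatives. The abstract does not say how the paper maps stemmer output onto these counts, so the function name `precision_recall_f1`, its parameters, and the example counts are illustrative assumptions, not the author's procedure or results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    tp, fp, fn are true positives, false positives, and false negatives.
    How a stemmer's output is turned into these counts is an assumption
    left to the caller; the abstract does not specify it.
    """
    # Precision: share of the decisions made that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of the correct decisions that were actually made.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the call.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these definitions, an aggressive stemmer that conflates many word forms would be expected to trade precision for recall, while a conservative one would do the reverse; the F1-score summarizes the two in a single figure.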

Keywords

rule-based stemming; computational linguistics; natural language processing; Croatian language

Hrčak ID:

150047

URI

https://hrcak.srce.hr/150047

Publication date:

29.12.2015.

