Filologija, No. 58, 2012.
Original scientific paper
Functional lexicography of an online spellchecker
Šandor Dembitz
orcid.org/0000-0002-0642-845X
Fakultet elektrotehnike i računarstva
Abstract
Online spellchecking offers a unique opportunity to continually improve a spellchecker's linguistic functionality through interaction with its community of users. This opportunity is crucial for languages that are not central to natural language processing (NLP), such as Croatian, because it helps close the gap in NLP tools between them and NLP-central languages (English, Japanese, German, French, Russian, Mandarin Chinese, etc.). The paper discusses this opportunity using Hascheck as an example. Hascheck started as the first public Croatian spellchecker, operating with a very modest dictionary of 100,000 Croatian common word-types. Through learning, the dictionary has grown to 830,000 common word-types and 600,000 name-types, acronyms, abbreviations, etc. This growth is the result of processing a corpus of 260 million tokens, the largest corpus ever processed in Croatia for lexicographic purposes. All of this was made possible by the Learning System built into the spellchecker's software environment, which converts the language competence of individual users into a collective value. The Learning System is highly automated, but its results do not enter Hascheck’s dictionary without human supervision, which is needed to maintain precision. The supervision pays special attention to potentially valid words that are close to frequent or potentially frequent misspellings or typos. The abundance of collected data allows mathematical modeling of many aspects of Hascheck’s life, and these models are also presented in the paper.
Keywords
spellchecker; language corpus; learning index; text coverage; Heaps’ law
Hrčak ID:
98051
Publication date:
28.1.2013.