
Original scientific paper

Functional lexicography of an online spellchecker

Šandor Dembitz (ORCID: orcid.org/0000-0002-0642-845X); Fakultet elektrotehnike i računarstva


Full text: Croatian, PDF, 474 KB

pp. 55-98


Abstract

Online spellchecking offers a unique opportunity to continuously improve a spellchecker’s linguistic functionality through interaction with its community of users. This opportunity is crucial for spellchecking in languages that are non-central for natural language processing (NLP), such as Croatian, because it helps close the gap in NLP tools between them and NLP-central languages (English, Japanese, German, French, Russian, Mandarin Chinese, etc.). The paper discusses this opportunity using Hascheck as an example. Hascheck started as the first Croatian public spellchecker, operating with a very modest dictionary of 100,000 Croatian common word-types. Through learning, the dictionary has grown to 830,000 common word-types and 600,000 name-types, acronyms, abbreviations, etc. This growth is the result of processing a corpus of 260 million tokens, the largest corpus ever processed in Croatia for lexicographic purposes. All of this was made possible by the Learning System built into the spellchecker’s software environment, which converts individual users’ language competence into collective value. The Learning System is highly automated, but its results do not enter Hascheck’s dictionary without human supervision, which is needed to maintain precision. The supervision pays special attention to potentially valid words that might be close to frequent or potentially frequent misspellings or typos. The abundance of collected data allows mathematical modeling of many aspects of Hascheck’s life, which is also presented in the paper.

Keywords

spellchecker; language corpus; learning index; text coverage; Heaps’ law

Hrčak ID:

98051

URI

https://hrcak.srce.hr/98051

Publication date:

28 January 2013

Data in other languages: Croatian
