Functional lexicography of an online spellchecker

Dembitz, Šandor

Filologija, No. 58, 2012.

Original scientific paper

Šandor Dembitz orcid.org/0000-0002-0642-845X ; Fakultet elektrotehnike i računarstva

Full text: Croatian, PDF, 474 KB

pp. 55-98

Abstract

Online spellchecking offers a unique opportunity to continually improve a spellchecker’s linguistic functionality through interaction with its community of users. This opportunity is crucial for spellchecking in languages that are not central to natural language processing (NLP), such as Croatian, because it helps close the gap in NLP tools between them and NLP-central languages (English, Japanese, German, French, Russian, Mandarin Chinese, etc.). The paper discusses this opportunity using Hascheck as an example. Hascheck started as the first Croatian public spellchecker, operating with a very modest dictionary of 100,000 Croatian common word-types. Through learning, the dictionary has grown to 830,000 common word-types and 600,000 name-types, acronyms, abbreviations, etc. This growth is the result of processing a corpus of 260 million tokens, the largest corpus ever processed in Croatia for lexicographic purposes. All of this was made possible by the Learning System built into the spellchecker’s software environment, which converts the language competence of individual users into collective value. The Learning System is highly automated, but its results do not enter Hascheck’s dictionary without human supervision, which is needed to preserve precision. The supervision pays special attention to potentially valid words that might be close to frequent or potentially frequent misspellings or typos. The abundance of collected data allows mathematical modeling of many aspects of Hascheck’s life, and these models are also presented in the paper.
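The supervision step described in the abstract can be pictured with a small sketch. It is not Hascheck's actual procedure, which the abstract does not specify. It assumes that "close to a misspelling" means a Levenshtein edit distance of at most 1, and the names `edit_distance`, `triage`, and `max_distance`, as well as the English sample words, are hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def triage(candidates, known_misspellings, max_distance=1):
    """Split learned candidates into (needs_closer_review, routine).

    Assumption: a candidate is risky when it lies within max_distance
    edits of any known frequent misspelling. Both groups would still
    go to a human supervisor; this only orders the review work.
    """
    needs_review, routine = [], []
    for word in candidates:
        close = any(edit_distance(word, m) <= max_distance
                    for m in known_misspellings)
        (needs_review if close else routine).append(word)
    return needs_review, routine


# Hypothetical data: "ten" is one edit away from the typo "teh".
risky, routine = triage(["ten", "lexicography"], {"teh", "recieve"})
print(risky, routine)
```

Under these assumptions, a candidate such as "ten" would be flagged for closer inspection because accepting it could hide the frequent typo "teh", while "lexicography" would pass through as a routine case.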

Keywords

spellchecker; language corpus; learning index; text coverage; Heaps’ law

Hrčak ID:

98051

URI

https://hrcak.srce.hr/98051

Publication date:

28 January 2013
