Functional lexicography of an online spellchecker

Dembitz, Šandor

Filologija, No. 58, 2012.

Original scientific paper

Šandor Dembitz orcid.org/0000-0002-0642-845X ; Fakultet elektrotehnike i računarstva

Full text: Croatian, PDF, 474 KB

Pages: 55-98

Abstract

Online spellchecking offers a unique opportunity to improve a spellchecker’s linguistic functionality continuously through interaction with its community of users. This opportunity is crucial for spellchecking in languages that are non-central for natural language processing (NLP), such as Croatian, because it helps close the gap in NLP tools between them and the NLP-central languages (English, Japanese, German, French, Russian, Mandarin Chinese, etc.). The paper discusses this opportunity using Hascheck as an example. Hascheck started as the first Croatian public spellchecker, operating with a very modest dictionary of 100,000 Croatian common word-types. Through learning, the dictionary has grown to 830,000 common word-types and 600,000 name-types, acronyms, abbreviations, etc. This growth is the result of processing a corpus of 260 million tokens, the largest corpus ever processed in Croatia for lexicographic purposes. All of this was made possible by the Learning System built into the spellchecker’s software environment, which converts the language competence of individual users into a collective value. The Learning System is highly automated, but its results do not enter Hascheck’s dictionary without human supervision, which is needed to maintain precision. The supervision pays special attention to potentially valid words that are close to frequent or potentially frequent misspellings or typos. The abundance of collected data allows mathematical modeling of many aspects of Hascheck’s life, and these models are also presented in the paper.
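The keywords name Heaps’ law, which in its usual form relates the number of distinct word-types V to the number of tokens N processed as V = K * N^beta. The following is a minimal sketch of how such a relation might be fitted to corpus growth data by least squares on a log-log scale. It is not the paper’s own procedure: the function names `fit_heaps` and `predict_types` are hypothetical, and the sample numbers are synthetic values generated from K = 50 and beta = 0.5, not Hascheck measurements.

```python
import math


def fit_heaps(tokens, types):
    """Fit V = K * N**beta by ordinary least squares on log V versus log N.

    tokens: corpus sizes N (token counts) at successive checkpoints
    types:  distinct word-type counts V observed at the same checkpoints
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                        # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)    # intercept, mapped back from log space
    return k, beta


def predict_types(k, beta, n_tokens):
    """Predicted number of word-types after n_tokens tokens."""
    return k * n_tokens ** beta


# Synthetic illustration only (assumed K = 50, beta = 0.5); not data from the paper.
sample_tokens = [10_000, 100_000, 1_000_000, 10_000_000]
sample_types = [50 * n ** 0.5 for n in sample_tokens]

k, beta = fit_heaps(sample_tokens, sample_types)
print(f"K = {k:.2f}, beta = {beta:.3f}")
```

With real data, the checkpoints would be cumulative token and type counts recorded as the corpus grows; a beta below 1 reflects that new word-types appear more slowly as more text is processed.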

Keywords

spellchecker; language corpus; learning index; text coverage; Heaps’ law

Hrčak ID:

98051

URI

https://hrcak.srce.hr/98051

Publication date:

28.1.2013.
