25 years of Hašek

Dembitz, Šandor

Jezik : Periodical for the Culture of the Standard Croatian Language, Vol. 66 No. 4-5, 2019.

Original scientific paper

Full text: Croatian, PDF, 829 KB

pages 138-150

Abstract

Hašek is a Croatian online spellchecker that has been in continuous operation since March 21, 1994,
and is now available at https://ispravi.me/. In its 25 years of operation, Hašek has processed
nearly 30 million texts, which form a corpus of more than 7 billion tokens. By comparison,
all books ever published in Croatian form a corpus of fewer than 20 billion tokens.
As a web-embedded tool, Hašek has taken advantage of many web-based services, including
learning. Thanks to this learning capability, its dictionary has grown from an initial 100
thousand to more than 2 million word types. Another aspect of learning was the creation
and regular updating of the Croatian n-gram system. Unlike Google, whose n-gram systems
are based on the WaC (Web as Corpus) approach and cut-off criteria, the Croatian n-grams
were extracted from processed texts by a lexical criterion: each n-gram constituent must
be confirmed by the spellchecker as valid in Croatian spelling. This difference in approach
made the Croatian n-gram system comparable in size to the largest Google n-gram systems.
Unfortunately, the advantages of online spellchecking for rapid breakthroughs into much
more sophisticated areas of language technology were not recognized by Croatian decision
makers, with some consequences mentioned in the paper.
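
The lexical criterion described in the abstract can be illustrated with a minimal Python sketch. This is not Hašek's actual implementation: the function names, the toy dictionary, and the sample tokens are all hypothetical, and a simple set lookup stands in for the real spellchecker. The sketch keeps an n-gram only if every constituent passes the validity check, and shows a frequency cut-off filter next to it for contrast.

```python
from collections import Counter
from typing import Callable, Sequence


def extract_ngrams(
    tokens: Sequence[str],
    n: int,
    is_valid: Callable[[str], bool],
) -> Counter:
    """Count n-grams whose every constituent passes the validity check.

    This mirrors the lexical criterion: an n-gram is kept only if the
    spellchecker (here, the is_valid callback) accepts each of its tokens.
    """
    counts: Counter = Counter()
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        if all(is_valid(token) for token in window):
            counts[window] += 1
    return counts


def apply_cutoff(counts: Counter, min_count: int) -> Counter:
    """Contrasting approach: keep n-grams by frequency alone (a cut-off)."""
    return Counter({ng: c for ng, c in counts.items() if c >= min_count})


# Hypothetical stand-in for a spellchecker dictionary (assumption).
TOY_DICTIONARY = {"dobar", "dan", "svima"}


def toy_is_valid(token: str) -> bool:
    return token.lower() in TOY_DICTIONARY


# "dann" is a deliberate misspelling, so any bigram containing it is dropped.
tokens = "dobar dan svima dobar dann svima".split()
bigrams = extract_ngrams(tokens, 2, toy_is_valid)
```

With the lexical criterion, a rare but correctly spelled n-gram is retained even if it occurs only once, whereas a frequency cut-off such as `apply_cutoff(bigrams, 2)` would discard it. This is one plausible reading of why the two approaches yield systems of comparable size from corpora of very different sizes.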

Keywords

Hašek; spellchecking; learning; Google; n-gram systems

Hrčak ID:

237727

URI

https://hrcak.srce.hr/237727

Publication date:

1.12.2019.

Article data in other languages: Croatian
