25 years of Hašek

Šandor Dembitz

pp. 138-150
Hašek is a Croatian on-line spellchecker that has operated continuously since March 21, 1994,
nowadays at the address In its 25 years of operation, Hašek has processed
nearly 30 million texts, which form a corpus of more than 7 billion tokens. By comparison,
all books ever published in Croatian form a corpus of fewer than 20 billion tokens.
As a WWW-embedded tool, Hašek took advantage of many web-based services, including
learning. Thanks to Hašek’s learning capability, its dictionary grew from an initial 100
thousand to more than 2 million word-types. Another aspect of learning was the creation
and regular updating of the Croatian n-gram system. Unlike Google, whose n-gram systems
are based on the WaC (Web as Corpus) approach and cut-off criteria, the Croatian n-grams
were extracted from processed texts by a lexical criterion: each n-gram constituent must
be confirmed by the spellchecker as valid in Croatian spelling. This difference in approach
made the Croatian n-gram system comparable in size to the largest Google n-gram systems.
Unfortunately, the advantages of on-line spellchecking for rapid breakthroughs into much
more sophisticated language technology areas were not recognized by Croatian decision
makers, with some consequences mentioned in the paper.
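
The contrast between the lexical criterion and a cut-off criterion described above can be illustrated with a minimal Python sketch. It is not Hašek's actual implementation, which the abstract does not describe. The function names (`tokenize`, `extract_ngrams`, `cutoff_filter`), the whitespace tokenizer, the three-word lexicon, and the threshold of 2 are all assumptions made purely for illustration.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def extract_ngrams(tokens, n, is_valid):
    """Count n-grams whose every constituent passes is_valid.

    is_valid stands in for the spellchecker's verdict on a word
    (an assumption; here it is a simple set-membership test).
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if all(is_valid(word) for word in gram):
            counts[gram] += 1
    return counts


def cutoff_filter(counts, min_count):
    """Keep only n-grams seen at least min_count times (cut-off criterion)."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


# Toy lexicon of valid word-types; "dna" below plays the role of a misspelling.
lexicon = {"dobar", "dan", "svima"}
tokens = tokenize("dobar dan svima dobar dna svima")

# Lexical criterion: every constituent must be in the lexicon.
lexical = extract_ngrams(tokens, 2, lambda w: w in lexicon)

# Cut-off criterion: accept any token, then drop rare n-grams.
all_bigrams = extract_ngrams(tokens, 2, lambda w: True)
frequent = cutoff_filter(all_bigrams, 2)
```

In this toy input the lexical criterion keeps the three bigrams made only of valid words, even though each occurs once, while a cut-off of 2 discards every bigram. That is one plausible reading of why a lexically filtered system can retain many rare n-grams that a frequency threshold would drop.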

Keywords
Hašek; spellchecking; learning; Google; n-gram systems

