Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool

Garabík, Radovan

doi:10.31724/rihjj.46.2.8

Rasprave Instituta za hrvatski jezik, Vol. 46 No. 2, 2020.

Professional paper

https://doi.org/10.31724/rihjj.46.2.8


Radovan Garabík ; Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences

Full text: English, PDF, 1,013 KB

pp. 603-618


Abstract

The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface for querying the models and visualizing the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available: one for a combination of part of speech and lemma, one for raw word forms, and one based on the fastText algorithm, which uses subword vectors and is therefore not limited to whole or known words when finding semantic relations. The article describes the interface and its major modes of functionality; it does not attempt a detailed linguistic analysis of the presented examples.
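As a rough illustration of the two ideas the abstract relies on, the sketch below shows (a) how a word can be split into character n-grams, which is the kind of subword unit that lets a fastText-style model handle unknown words, and (b) how "semantic similarity" is commonly measured as cosine similarity between vectors. The function names, the n-gram lengths, and the three-dimensional toy vectors are all assumptions made up for this example; they are not taken from the article or from the Aranea models.

```python
import math

def char_ngrams(word, n_min=3, n_max=4):
    """Return character n-grams of `word`, wrapped in < and > boundary markers.

    The n-gram range here is an illustrative assumption, not the setting
    used for the models described in the article.
    """
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank all other words in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "cat":   [0.8, 0.2, 0.1],
    "car":   [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}

if __name__ == "__main__":
    print(char_ngrams("cat", 3, 3))
    for word, score in most_similar("dog", toy_vectors):
        print(f"{word}\t{score:.3f}")
```

With these toy values, the nearest neighbour of "dog" is "cat", because their vectors point in nearly the same direction; a lexicographic interface of the kind described would present such ranked neighbour lists for real corpus-trained vectors.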

Keywords

corpus; word embedding; vector similarity; semantic similarity; web corpora; visualization

Hrčak ID:

245458

URI

https://hrcak.srce.hr/245458

Publication date:

30 October 2020
