Compilation and Exploitation of Parallel Corpora

Erjavec, Tomaž

doi:10.2498/cit.2003.02.02

Journal of computing and information technology, Vol. 11 No. 2, 2003.

Original scientific paper

https://doi.org/10.2498/cit.2003.02.02

Full text: English, pdf, 114 Kb

pp. 93-102

Abstract

With more and more text being available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
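
As a rough illustration of the annotation chain and the lexicon extraction named in the abstract, the following is a minimal sketch and not the paper's actual tooling. It assumes regular-expression tokenisation and segmentation, a toy 1:1 positional sentence alignment, and the Dice coefficient as the co-occurrence score; all function names and the example sentences are hypothetical. Tagging and lemmatisation are left out, since they depend on trained third-party tools.

```python
import re
import unicodedata
from collections import Counter
from itertools import product


def normalise(text: str) -> str:
    """Unicode-normalise the text and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> list[str]:
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenise(sentence: str) -> list[str]:
    """Lower-cased word tokens; punctuation is dropped."""
    return re.findall(r"\w+", sentence.lower())


def align(src_sents: list[str], tgt_sents: list[str]) -> list[tuple[str, str]]:
    """Toy 1:1 alignment by position (assumption: equal sentence counts)."""
    if len(src_sents) != len(tgt_sents):
        raise ValueError("toy aligner needs equal sentence counts")
    return list(zip(src_sents, tgt_sents))


def dice_lexicon(pairs, threshold=0.9):
    """Score word pairs by the Dice coefficient over aligned sentence pairs."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s, t = set(tokenise(src)), set(tokenise(tgt))
        src_freq.update(s)           # sentences containing each source word
        tgt_freq.update(t)           # sentences containing each target word
        joint.update(product(s, t))  # sentence pairs containing both words
    scores = {
        (a, b): 2 * n / (src_freq[a] + tgt_freq[b])
        for (a, b), n in joint.items()
    }
    return {pair: sc for pair, sc in scores.items() if sc >= threshold}


# Hypothetical English-Slovene toy texts, invented for illustration only.
english = normalise("The dog sleeps.  The cat sleeps. The dog eats.")
slovene = normalise("Pes spi. Mačka spi.   Pes je.")
bitext = align(segment(english), segment(slovene))
lexicon = dice_lexicon(bitext)
for (en_word, sl_word), score in sorted(lexicon.items()):
    print(en_word, sl_word, score)
```

On this toy input the frequent word "the" is filtered out because it co-occurs with every target word about equally. Real corpora need a proper sentence aligner that handles 1:2 and 2:1 correspondences, and lemmatised input to cope with inflection.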

Hrčak ID: 44755

URI: https://hrcak.srce.hr/44755

Publication date: 30.6.2003.
