Filologija, No. 38-39, 2002.
Original scientific paper
Identification of translational equivalents in Croatian-English parallel corpus
Marko Tadić
Odsjek za lingvistiku, Filozofski fakultet Sveučilišta u Zagrebu, Zagreb
Krešimir Šojat
Zavod za lingvistiku, Filozofski fakultet Sveučilišta u Zagrebu, Zagreb
Abstract
This paper investigates the possibilities for identifying translational equivalents (TEs) in the Croatian-English parallel corpus, which is aligned at the sentence level and was collected at the Institute of Linguistics, Faculty of Philosophy, University of Zagreb. In the first part, TEs between single words are identified by generating all possible word pairs in which the first word comes from the source language and the second from the target language. Only sentences with 1:1 alignment were included in the processing. The statistical measure of Mutual Information (MI) was applied to the generated word pairs, yielding the statistically relevant co-occurrences. Pairs with a high MI value are considered good TE candidates. In the second part of the paper, multi-word units (MWUs; here only MWUs with two elements) are identified by applying the same statistical measure within both the source (Croatian) and the target (English) language. MI was then applied to pairs of word pairs, yielding possible candidates for translational patterns. High MI values revealed pairs of words in the source language that were regularly translated by a fixed pair of words in the target language, even though the MI values for the monolingual pairs in each language were extremely low. The paper aims to show how the use of statistical methods in parallel corpus processing can facilitate the detection of collocations (possible multi-word terms) and their TEs. At the same time, corresponding co-textual examples of word usage are provided in both the source and the target language. This is relevant for multilingual lexicographers as dictionary writers, and for translators as the most important group of dictionary users.
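The pair-generation and scoring step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: MI is taken in its pointwise form, log2(P(s,t) / (P(s) P(t))), with probabilities estimated as the fraction of 1:1 aligned sentence pairs in which a unit occurs; the names `mi_scores`, `bigrams`, and `toy_corpus` are hypothetical; and the four sentence pairs are invented toy data, not taken from the corpus.

```python
import math
from collections import Counter
from itertools import product


def bigrams(tokens):
    """Return adjacent two-word units (a stand-in for two-element MWUs)."""
    return list(zip(tokens, tokens[1:]))


def mi_scores(aligned, units=lambda tokens: tokens, min_count=1):
    """Score every co-occurring (source unit, target unit) pair.

    aligned: list of (source_tokens, target_tokens), one entry per
             1:1 aligned sentence pair.
    units:   function that extracts the units to pair up; the default
             uses single words, `bigrams` uses pairs of words.
    Assumption: pointwise MI with sentence-pair-level counts.
    """
    n = len(aligned)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned:
        # Count each unit at most once per sentence pair.
        src_units = set(units(src_tokens))
        tgt_units = set(units(tgt_tokens))
        src_count.update(src_units)
        tgt_count.update(tgt_units)
        # All possible pairs: first element from source, second from target.
        pair_count.update(product(src_units, tgt_units))

    scores = {}
    for (s, t), c in pair_count.items():
        if c < min_count:
            continue
        # log2( (c/n) / ((count_s/n) * (count_t/n)) )
        scores[(s, t)] = math.log2(c * n / (src_count[s] * tgt_count[t]))
    return scores


# Invented toy data, for illustration only.
toy_corpus = [
    (["crna", "kava"], ["black", "coffee"]),
    (["crna", "mačka"], ["black", "cat"]),
    (["bijela", "kava"], ["white", "coffee"]),
    (["bijela", "mačka"], ["white", "cat"]),
]

if __name__ == "__main__":
    word_scores = mi_scores(toy_corpus)
    for pair, score in sorted(word_scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

Passing `units=bigrams` scores pairs of word pairs instead of single words, which loosely mirrors the second step of the abstract. On a real corpus, a `min_count` threshold above 1 is usually advisable, since pointwise MI tends to overrate pairs that are seen only once.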
Keywords
Croatian-English parallel corpus; multi-word units; translational equivalents; word alignment; mutual information
Hrčak ID: 173315
Publication date: 20.5.2002.