Original scientific paper
Error-Tagging of CroLTeC (Electronic Learner Corpus of Croatian as a Foreign Language)
The paper describes the error-tagging scheme developed for the CroLTeC learner corpus (http://nlp.ffzg.hr/resources/corpora/croltec/), the first electronic learner corpus of Croatian as a foreign language. CroLTeC contains essays collected from 755 students with 36 different mother tongues, the most prominent of which are Spanish, English, German, Polish, Chinese, French, and Arabic. It consists of 4,747 essays, of which 1,217 were born digital, while 3,530 were scanned, transcribed in RTF format, and converted into XML format. CroLTeC has a total of 1,054,287 tokens. The essays were collected at all six levels of the Common European Framework of Reference for Languages (CEFR) at Croaticum – Center for Croatian as Second and Foreign Language at the Faculty of Humanities and Social Sciences in Zagreb, Department of Information Sciences, Natural Language Processing group. All CroLTeC essays contain metadata about the title, number, and type of essay (homework, part of an exam or field class, etc.). The data were lemmatized and annotated with morphosyntactic tags using the ReLDI tagger (Ljubešić et al., 2016). The corpus is also searchable by the learner's age, sex, language proficiency level, and mother tongue.
The error-tagging scheme is partially based on the scheme of Šolar (the Developmental Corpus of Slovene) and on the error coding of the Cambridge Learner Corpus, and it is further tailored to the Croatian language. The goal of developing the scheme is to build a sub-corpus that will serve as a repository of authentic data about learners' interlanguage. It should enable researchers and teachers of Croatian as a foreign language to explore the interlanguage, to discover which aspects of the grammar are the most difficult to master, and to tailor teaching materials to different groups of learners (according not only to their Croatian language proficiency level but also to their first language). Finally, the error-tagged sub-corpus should serve as a starting point for designing computer-aided tools that correct lexical errors and errors in verb tenses, phrasal verbs, and collocations.