Original scientific paper

https://doi.org/10.22210/suvlin.2022.094.01

Corpus compilation for digital humanities in lower–resourced languages: A practical look at compiling thematic digital media corpora in Serbian, Croatian and Slovenian

Ksenija Bogetić ; Research Centre of the Slovenian Academy of Sciences and Arts, Slovenia
Vuk Batanović (ORCID: orcid.org/0000-0003-2639-9091) ; Innovation Center of the School of Electrical Engineering, University of Belgrade, Serbia
Nikola Ljubešić (ORCID: orcid.org/0000-0001-7169-9152) ; Jožef Stefan Institute, Ljubljana; Faculty of Computer and Information Science, University of Ljubljana, Slovenia


pages 129-152


Abstract

The digital era has unlocked unprecedented possibilities for compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even researchers who do not use any specific techniques of corpus linguistics increasingly draw on some sort of corpus for empirically–grounded social–scientific analysis (sometimes dubbed ‘corpus–assisted discourse analysis’ or ‘corpus–based critical discourse analysis’, cf. Hardt–Mautner 1995; Baker 2016). In the post–Yugoslav space, recent corpus developments have brought table–turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist – partly because the background of these issues changes fast, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts.

In this paper we aim to discuss some possible solutions to these difficulties by presenting a step–by–step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South–Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.

Hrčak ID:

289474

URI

https://hrcak.srce.hr/289474

Publication date:

29.12.2022.
