A Corpus-Linguistic Analysis of Sportske novosti

Stojanov, Tomislav; Vučić, Zoran

Filologija, No. 59, 2012.

Original scientific paper

Tomislav Stojanov orcid.org/0000-0002-6972-6518 ; Institut za hrvatski jezik i jezikoslovlje
Zoran Vučić

Full text: Croatian (PDF, 785 KB)

Pages: 103-129

Abstract

The paper examines the role of corpora in linguistic research, using two Croatian language corpus interfaces, Philologic and Bonito, as examples of tools for linguistic inquiries into the relation between document and content and into the level of character and information display. For specialized linguistic search queries we built a sports newspaper database from the online texts of Sportske novosti (http://sportske.jutarnji.hr/), containing 3.6 million tokens published between April 2008 and July 2009.
We present computational procedures for information retrieval and n-gram SQL/regex queries that extract token co-frequencies and reveal phrases, collocations and more stable syntagmemes. The JavaScript wiring library WireIt is used to visualize token frequencies in the browser.
We compared the output with Google search results, on the basis of which we identify seven shortcomings of Google search for linguistic investigation and conclude that our approach could produce unique results in linguistic research.
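
To illustrate the kind of n-gram SQL/regex query the abstract mentions, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `tokens` table (document id, position, token) in an in-memory SQLite database, a simple regex tokenizer, and two invented sample sentences that are not taken from the Sportske novosti corpus. Bigram co-frequencies are obtained by joining the table with itself on adjacent positions.

```python
import re
import sqlite3

# Hypothetical sample sentences; NOT taken from the Sportske novosti corpus.
SAMPLE_TEXTS = [
    "Dinamo je pobijedio Hajduk u derbiju.",
    "Hajduk je pobijedio Rijeku, a Dinamo je pobijedio Osijek.",
]


def tokenize(text):
    """Lowercase the text and return its word tokens (regex-based)."""
    return re.findall(r"\w+", text.lower())


def build_token_table(texts):
    """Store one row per token: (doc_id, position in document, token)."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema for this sketch only.
    conn.execute("CREATE TABLE tokens (doc_id INTEGER, pos INTEGER, token TEXT)")
    for doc_id, text in enumerate(texts):
        rows = [(doc_id, pos, tok) for pos, tok in enumerate(tokenize(text))]
        conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    return conn


def bigram_counts(conn, min_count=1):
    """Count adjacent token pairs within each document via a self-join."""
    query = """
        SELECT a.token, b.token, COUNT(*) AS freq
        FROM tokens AS a
        JOIN tokens AS b
          ON a.doc_id = b.doc_id AND b.pos = a.pos + 1
        GROUP BY a.token, b.token
        HAVING COUNT(*) >= ?
        ORDER BY freq DESC, a.token, b.token
    """
    return conn.execute(query, (min_count,)).fetchall()


conn = build_token_table(SAMPLE_TEXTS)
top_bigrams = bigram_counts(conn, min_count=2)
for first, second, freq in top_bigrams:
    print(first, second, freq)
```

Raising `min_count` keeps only recurrent pairs, which are the candidates for collocations; longer n-grams could be counted the same way by adding one more join per extra position.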

Keywords

text search; SQLite; information retrieval; Google search; corpus linguistics; Sportske novosti; n-gram; collocation; Croatian language

Hrčak ID:

98089

URI

https://hrcak.srce.hr/98089

Publication date:

12 March 2013
