Original scientific paper

https://doi.org/10.31745/s.69.1

ON THE QUESTION OF THE APPLICATION OF STATISTICAL METHODS TO SEARCH FOR COLLOCATIONS AND COLLIGATIONS IN OLD SLAVONIC TEXTS (IN GLAGOLITIC MANUSCRIPTS FROM THE CORPUS »manuscripts.ru«)

Виктор A. БАРАНОВ (ORCID: orcid.org/0000-0003-1730-6359); Izhevsk State Technical University named after M.T. Kalashnikov, Izhevsk (Russia)


Full text: Russian, PDF (481 KB)

page 1-33


Abstract

The paper addresses the methodology used to search for fixed collocations in the collection of Glagolitic texts in the historical corpus Manuscript: Slavic written heritage (manuscripts.ru) and to evaluate their stability. It demonstrates how the n-gram module can extract collocations that consist of words and their textual forms or lemmas, with different numbers of components and different frequencies of occurrence. The analysis covers digrams and trigrams that were extracted using the statistical measure of Mutual Information (MI) and that occur in several manuscripts of the collection at once.
Particular attention is given to n-grams with high MI values. Because of the way the measure works, the highest values belong to collocations that are rare in the collection. Analysis of such digrams based on word forms made it possible to identify coherent grammatical structures, that is, colligations. Trigrams consisting of textual forms are shown to be not only grammatical but also semantic units, that is, collocations. Digrams whose components are lemmas take various forms: preposition-noun collocations, preposition-possessive pronoun collocations and other attributive constructions, relative verb-noun constructions, etc. Analysis of these groups identified both colligations and collocations. Extraction of trigrams on the basis of lemmas was the most productive: most of the first few dozen collocations with maximum MI values are grammatical and semantic units or parts of them. The paper concludes that statistical methods are effective for extracting collocations and colligations from corpora of medieval Slavonic manuscripts.
A comprehensive solution to this problem requires different types of n-grams: two-component and three-component, based on textual forms and on lemmas, with free and with fixed component order. The presence of grammatical and semantic units repeated across different manuscripts leads to the conclusion that such collocations are supra-textual in nature.
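
The behaviour of the measure described above, where rare pairs receive the highest scores, can be illustrated with a minimal sketch. The abstract does not give the exact formula used by the n-gram module of manuscripts.ru, so the code assumes the common pointwise form MI = log2(P(w1, w2) / (P(w1) P(w2))). The function names `bigram_mi` and `shared_bigrams` and the toy token lists are hypothetical and are not taken from the paper or the corpus.

```python
import math
from collections import Counter


def bigram_mi(tokens):
    """Score each adjacent pair with MI = log2(P(w1,w2) / (P(w1) * P(w2))).

    Assumption: the standard pointwise mutual information formula;
    the corpus module may compute the measure differently.
    """
    n_uni = len(tokens)
    n_bi = n_uni - 1  # number of adjacent pairs in the sequence
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores


def shared_bigrams(manuscripts, min_docs=2):
    """Return the bigrams that occur in at least `min_docs` token lists."""
    doc_freq = Counter()
    for tokens in manuscripts:
        # Count each bigram once per manuscript, however often it repeats.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return {bg for bg, k in doc_freq.items() if k >= min_docs}


# Toy data (placeholder tokens, not real Glagolitic word forms).
toy = "a b a b a b a b c d".split()
scores = bigram_mi(toy)
# The pair ("c", "d") occurs once, and so do its components,
# so it outranks the frequent pair ("a", "b").
top_pair = max(scores, key=scores.get)

toy_manuscripts = [["a", "b", "c"], ["a", "b", "d"], ["x", "y"]]
common = shared_bigrams(toy_manuscripts)
```

In this toy sequence the single occurrence of `("c", "d")` scores higher than the four occurrences of `("a", "b")`, which mirrors the observation that the greatest MI values belong to rare collocations. Filtering by the number of manuscripts, as in `shared_bigrams`, is one simple way to keep only pairs that recur across texts.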

Keywords

textual corpus; manuscripts.ru; Glagolitic manuscript; linguistic statistics; n-gram module; collocation; colligation

Hrčak ID:

231473

URI

https://hrcak.srce.hr/231473

Publication date:

30.12.2019.
