Original scientific paper
Language Historical and Computational Linguistic Aspects of the Descriptions and Norming of Dashes in the Croatian Language
Tomislav Stojanov
orcid.org/0000-0002-6972-6518
Institut za hrvatski jezik i jezikoslovlje
Abstract
This paper describes one of two punctuation marks (dashes and quotation marks) that deviate significantly from the relationship of one character per (Unicode) semantic value. While quotation marks have multiple graphemes (eight, specifically) for one semantic value, dashes typically have two graphemes (a short and a long dash) that cover as many as 11 (Unicode and Latin) dash characters. While the criterion of line length has typically been highly prominent in orthography manuals, it appears in the presented categorization only on the sixth hierarchical level.
Aside from two new Unicode dash characters (the two-em dash and three-em dash, Unicode 6.1, January 2012) having been standardized in the meantime, a different methodology and a comparison of the historical linguistic and computational linguistic aspects have broadened the knowledge of dash characters in the Croatian language as described in Portada-Stojanov (2009). A categorization is presented that is sensitive to the dichotomy of graphic representation and meaning and that divides all dash characters into five hierarchical levels. Among the 44 Unicode horizontal and unbroken dash characters, a division by type, time, functionality, direction, and line height has resulted in 11 contemporary Latin alphabetic horizontal central characters, among which each language written in the Latin alphabet chooses its own. The semantic value and usage of all Unicode dash graphemes have been described.
On the other hand, the paper also describes dash characters from the perspective of Croatian historical linguistics and orthography. In comparison to the rich repository of standardized Unicode dash characters, it has been shown that orthographic standards are significantly reductive. Orthographic norming of dash characters is divided into two periods and three groups, depending on their graphemic form (the first and second generation of orthography manuals) and terminology (the pre-standard phase and the two standard norming schools, depending on the acceptance of the terminological pairs “spojnica – crtica” and “crtica – crta”).
The comparative historical linguistic and computational linguistic research, together with the contrastive analysis of the Unicode standardization of dash characters against traditional orthographic descriptions of dash characters, was intended to highlight (i) the need for a broader, interdisciplinary approach to describing written linguistic practice, (ii) the insufficiency of descriptions in primary and secondary school orthography manuals for modern writing, and (iii) the insufficiency of the existing Croatian codification of both terminological schools. The paper claims that, for orthography manuals to be called scholarly, computer writing should be better described and a differentiation between characters and graphemes should be introduced on the level of punctuation. One of the areas in which orthography manuals could bring themselves technologically up to date is the writing of compound words at the beginning of a broken line, and the paper provides eight reasons to abandon the current tradition.
Analysis has shown that it would be justified to base dash codification on three or four characters, which reduces the 11 Latin Unicode characters to basic groups of dashes – the short, medium, long, and very long dashes, referred to as c1, c2, c3 and c4.
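As a rough illustration of the Unicode repertoire the abstract refers to, the following minimal sketch lists every character that the Unicode Character Database assigns to the general category Pd ("Punctuation, dash"), using Python's standard `unicodedata` module. This is an assumption made for illustration only: Pd is Unicode's own property, not the five-level categorization presented in the paper, so the resulting list need not match the paper's counts of 44 or 11 characters. The function name `dash_punctuation` is hypothetical, and the output depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def dash_punctuation():
    """Return (code point, name) pairs for characters in general category Pd.

    Note: Pd is Unicode's "Punctuation, dash" category. It is used here as a
    stand-in and is NOT the categorization proposed in the paper.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Pd":
            # Fall back to a placeholder if a character has no name.
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    for cp, name in dash_punctuation():
        print(f"U+{cp:04X}  {name}")
```

On an interpreter whose Unicode data is version 6.1 or later, the list includes U+2E3A TWO-EM DASH and U+2E3B THREE-EM DASH, the two characters the abstract mentions as newly standardized. The mathematical minus sign (U+2212) is not included, because Unicode classifies it as a math symbol rather than as dash punctuation.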
Keywords
Croatian language; orthography; linguography; Unicode; dash; hyphen
Hrčak ID: 141876
Publication date: 17.7.2015.