Language Historical and Computational Linguistic Aspects of the Descriptions and Norming of Dashes in the Croatian Language

Stojanov, Tomislav

Rasprave Instituta za hrvatski jezik, Vol. 41 No. 1, 2015.

Original scientific paper

Tomislav Stojanov orcid.org/0000-0002-6972-6518 ; Institut za hrvatski jezik i jezikoslovlje

page 127-161

Abstract

This paper describes one of the two punctuation marks (dashes and quotation marks) that deviate significantly from the relationship of one character per (Unicode) semantic value. While quotation marks have multiple graphemes (eight, specifically) for one semantic value, dashes typically have two graphemes (a short and a long dash) that cover as many as 11 (Unicode and Latin) dash characters. While the criterion of line length has typically been highly prominent in orthography manuals, in the categorization presented here it appears only on the sixth hierarchical level.
Aside from two new Unicode dash characters (the two-em dash and three-em dash, Unicode 6.1, January 2012) having been standardized in the meantime, a different methodology and a comparison of the linguistic-historical and computational linguistic aspects have broadened the knowledge of dash characters in the Croatian language described in Portada-Stojanov (2009). A categorization is presented that is sensitive to the dichotomy of graphic representation and meaning and that divides all dash characters into five hierarchical levels. Among the 44 Unicode horizontal and unbroken dash characters, a division by type, time, functionality, direction, and line height yields 11 contemporary Latin alphabetic horizontal central characters, from which each language written in the Latin alphabet chooses its own. The semantic value and usage of all Unicode dash graphemes are described.
On the other hand, the paper also describes dash characters from the perspective of Croatian historical linguistics and orthography. In comparison to the rich repository of standardized Unicode dash characters, orthographic standards are shown to be significantly reductive. Orthographic norming of dash characters is divided into two periods and three groups, depending on their graphemic form (the first and second generation of orthography manuals) and terminology (the pre-standard phase and the two standard norming schools, depending on the acceptance of the terminological pairs “spojnica – crtica” and “crtica – crta”).
The comparative historical linguistic and computational linguistic research, together with the contrastive analysis of the Unicode standardization of dash characters and traditional orthographic descriptions of them, was intended to highlight (i) the need for a broader, interdisciplinary approach to describing written linguistic practice, (ii) the insufficiency of descriptions in primary and secondary school orthography manuals for modern writing, and (iii) the insufficiency of the existing Croatian codification of both terminological schools. It is claimed that, for orthography manuals to be called scholarly, computer writing should be better described and a differentiation between characters and graphemes should be introduced on the level of punctuation. One of the areas in which orthography manuals could bring themselves technologically up to date is the issue of the writing of compound words at the beginning of a broken line, and the paper provides eight reasons to abandon the current tradition.
Analysis has shown that it would be justified to base dash codification on three or four characters, which reduces the 11 Latin Unicode characters to basic groups of dashes – the short, medium, long, and very long dashes, referred to as c₁, c₂, c₃ and c₄.
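
As a rough illustration of the distinction between Unicode characters and orthographic graphemes drawn above, the following minimal sketch lists every code point that Unicode itself assigns to the general category Pd (dash punctuation), using Python's standard unicodedata module. This is an assumption-laden aid and not the paper's method: the Pd property is Unicode's own classification, the function name dash_punctuation is hypothetical, and the resulting list is not expected to match the paper's sets of 44 and 11 characters, which follow its own hierarchical criteria. The output also depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def dash_punctuation():
    """Return (code point, name) pairs for all characters in category Pd.

    Pd is the Unicode general category "Punctuation, dash". It is used
    here only as an approximation of "dash character"; the paper's own
    categorization uses different criteria.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Pd":
            # Fall back to a placeholder if a character has no name.
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    for cp, name in dash_punctuation():
        print(f"U+{cp:04X}  {name}")
```

The listing includes the two characters added in Unicode 6.1 that the abstract mentions (the two-em dash and three-em dash), while a character such as the minus sign is absent because Unicode classifies it as a mathematical symbol and not as dash punctuation.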

Keywords

Croatian language; orthography; linguography; Unicode; dash; hyphen

Hrčak ID:

141876

URI

https://hrcak.srce.hr/141876

Publication date:

17.7.2015.
