Original scientific paper
https://doi.org/10.20532/cit.2023.1005719
N-gram Language Model for Chinese Function-word-centered Patterns
Jie Song; School of Foreign Languages, Zhejiang University of Finance & Economics, Hangzhou, China
Yixiao Liu; School of International Studies, Zhejiang University, Hangzhou, China
Yunhua Qu*; School of International Studies, Zhejiang University, Hangzhou, China
* Corresponding author.
Abstract
N-gram language modelling, a proven and effective method in NLP, is widely used to calculate the probability of a sentence in natural language. A language pattern, a concept from pattern grammar, is a linguistic unit at a level between the word/character and the sentence. This research combines the language-model approach with language patterns for the first time, using the N-gram model to study the patterns. Chinese function-word-centered patterns are extracted from the LCMC corpus and aligned into pattern chains. A language model is then trained on these chains to investigate the properties and distribution of Chinese function words, the interaction between content words and function words, and the interaction between patterns. The results indicate that the texts contain approximately 10,000 function-word-centered patterns, which follow an exponential distribution. The research summarizes the most common function-word-centered and content-word-centered patterns and discusses the interactions between patterns on the basis of corpus data. The bigram language model of these patterns reflects the restrictions imposed by function words. In addition, the research adopts an innovative method to visualize the interactions between patterns. By addressing the level between word/character and sentence, the research reveals basic Chinese pattern categories and the interactions between them, which contributes to Chinese linguistic research and to the efficiency of NLP.
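To make the idea of a bigram model over pattern chains concrete, here is a minimal sketch in Python. It is not the authors' implementation: the pattern labels, the toy chains, and the function names `train_bigram` and `chain_probability` are invented for illustration, and the estimates are plain maximum-likelihood counts without smoothing. It assumes each pattern chain is simply a list of pattern labels.

```python
from collections import Counter, defaultdict

# Boundary markers for the start and end of a pattern chain (an assumption
# of this sketch, not something specified in the abstract).
BOS, EOS = "<s>", "</s>"


def train_bigram(chains):
    """Estimate P(current pattern | previous pattern) by relative frequency."""
    bigram_counts = defaultdict(Counter)
    for chain in chains:
        padded = [BOS, *chain, EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[prev][cur] += 1

    model = {}
    for prev, nexts in bigram_counts.items():
        total = sum(nexts.values())
        model[prev] = {cur: n / total for cur, n in nexts.items()}
    return model


def chain_probability(model, chain):
    """Multiply the bigram probabilities along a chain; unseen bigrams give 0."""
    padded = [BOS, *chain, EOS]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= model.get(prev, {}).get(cur, 0.0)
    return prob


# Hypothetical pattern chains; the labels are illustrative only and are not
# taken from the LCMC corpus or from the paper.
toy_chains = [
    ["n 的 n", "v 了 n"],
    ["n 的 n", "在 n v"],
    ["在 n v", "v 了 n"],
]

model = train_bigram(toy_chains)
seen = chain_probability(model, ["n 的 n", "v 了 n"])
unseen = chain_probability(model, ["v 了 n", "n 的 n"])
```

In this toy setting a chain ordering that occurs in the data receives a non-zero probability, while an ordering that never occurs receives zero. This is the sense in which a bigram model can reflect restrictions on which patterns follow one another. A model trained on real data would normally add smoothing so that unseen bigrams are not ruled out entirely.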
Keywords
N-gram language model, language pattern, function word
Hrčak ID:
313206
Publication date:
8 January 2024