A Machine Learning Classification Algorithm for Vocabulary Grading in Chinese Language Teaching

Yinbing ZHANG, Jihua SONG*, Weiming PENG*, Dongdong GUO, Tianbao SONG

Abstract: Vocabulary grading is of great importance in Chinese vocabulary teaching. This paper starts with an analysis of the lexical attributes that affect lexical complexity, followed by the extraction of lexical attribute information from the constructed word-formation knowledge base, the construction of mapping functions corresponding to the lexical attributes, and the quantitative representation of the attributes that forms the basis for vocabulary grading. On this basis, machine learning classification algorithms are applied to the Chinese vocabulary grading problem. Vocabulary grading models based on common machine learning classification algorithms are compared, the importance of Chinese vocabulary attributes is measured with different feature selection methods, and a vocabulary grading model is constructed that combines a classification algorithm with attribute-importance selection. A comparison of the experimental results shows that the classification model based on the support vector machine (SVM) algorithm and the top six attributes by feature-selection importance achieved the best effect. To improve vocabulary grading further, the importance scores of the lexical attributes produced by several feature selection algorithms were fused by averaging, and a grading experiment was then conducted with the Bagging + SVM ensemble algorithm and the top six attributes. The experimental results show that this combination achieved a better effect.


INTRODUCTION
In essence, language is a tool for human communication. Whether people can communicate depends on their understanding of semantics, and words are the most important carrier of semantics and indispensable to learning a language well. As Chomsky [1] indicated in his discussion of language systems, a person who has a language has access to detailed information about the words of that language. Vocabulary is one of the three elements of language and an essential part of learning and mastering a language. The importance of vocabulary learning is self-evident: David Wilkins [2] said that "without grammar very little can be conveyed, without vocabulary nothing can be conveyed".
In the field of international Chinese language teaching, the new HSK syllabus is the main basis and guiding document for test construction and the assessment of Chinese-language ability. It has a wide influence, and after several revisions it has greatly improved compared with the original edition. However, it still has some shortcomings. As Zhang Jinjun [3] indicated, the design of the vocabulary levels still needs improvement: which words should be accepted, which should be abandoned, and which should be placed at a higher level all need further detailed research and adjustment. Each revision and adjustment of the syllabus consumes a great deal of manpower and material resources, yet for various reasons each revision may still not achieve satisfactory results. Therefore, it is both necessary and significant to study vocabulary grading in the Chinese language teaching field.
The vocabulary grading problem based on different lexical attributes can be regarded as a classification problem: each word is mapped to an attribute vector by extracting its characteristic attribute values, and a machine learning classification model then determines the level of the word, assigning it to one of six levels from level 1 to level 6. In this study, we use common machine learning classification algorithms on the set of lexical attribute vectors to conduct a vocabulary grading experiment.
During the experiment, combining the features widely used in [4][5][6][7][8] to assess the comprehensive complexity of vocabulary, cross-validation is used to predict the vocabulary levels for Chinese language teaching; that is, the vocabulary and its corresponding attribute vector set are divided into two parts: one part is used as the training set to train the classifier, and the other is used as the validation set to test the performance of the classification model. A schematic diagram of the vocabulary grading model based on machine learning classification algorithms used in this study is shown in Fig. 1.
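As a concrete illustration of this setup, the sketch below builds the kind of attribute-vector dataset described above and runs one 80/20 split with scikit-learn. The data are synthetic stand-ins, not the study's 2885-word set, so the printed accuracy is illustrative only.

```python
# Minimal sketch of the grading pipeline: attribute vectors in, levels out.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 9))      # one row per word: nine attribute mapping values in [0, 1]
y = rng.integers(1, 7, 300)   # vocabulary level, 1-6

# 80/20 split: train the classifier, then validate on the held-out words
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1, gamma=0.1).fit(X_train, y_train)
pred = clf.predict(X_test)
print(f"accuracy: {(pred == y_test).mean():.3f}")
```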

RELATED WORK
The machine learning classification algorithm is widely used in the field of natural language processing. There have been many studies on, for example, text classification, text readability prediction, complex vocabulary recognition, and vocabulary grading.
Regarding text classification, Pingpeng Yuan et al. [9] proposed a multi-class text classification algorithm called SVM-KNN by combining a support vector machine (SVM) with the KNN classification algorithm. First, SVM is used to recognize the category boundary, then KNN is used to classify the document boundary, which overcomes the defects of SVM and KNN, and improves the performance of multi-category text classification. In [10], a new text classification algorithm based on thesaurus knowledge and the KNN classification algorithm was proposed, and good classification results were obtained. In [11], a text classification algorithm was proposed based on a backpropagation neural network (BPNN) and modified BPNN (MBPNN). In [12], the text classification algorithms based on machine learning were summarized, and the principles and applications of text classification, text clustering, and text mining were introduced in detail.
Regarding text readability research, Rohit Kate [13] discussed the application of common machine learning classification algorithms to text readability prediction, and the natural language text readability prediction system constructed achieved relatively satisfactory results. Sarah Schwarm et al. [14] combined SVM with traditional reading-level measures, a statistical language model, and other language processing tools to predict the readability of English news text, achieving better prediction results than traditional methods. Sun Gang [15] proposed a text readability prediction method based on linear regression after comprehensively considering surface, lexical, grammatical, and other text features, and demonstrated the effectiveness of the method through experiments.
Regarding complex vocabulary recognition, Matthew Shardlow [16] compared different methods for the automatic recognition of complex vocabulary; compared with the other methods, the accuracy of automatic recognition based on SVM was slightly higher. Lucia Specia et al. [17] trained an SVM that can rank words by complexity to select the words that need to be simplified. Muralidhar Pantula et al. [18] trained a machine learning classification model with 19 internal vocabulary features and achieved 84.75% accuracy on the given experimental dataset.
Regarding vocabulary syllabus development or vocabulary grading, Gala et al. [19] selected 9 of the 27 internal attributes of vocabulary that can best predict the vocabulary level, trained an SVM classifier, and used the method of five-fold cross-validation on the experimental data. The average accuracy of the three classification results was 62%. Additionally, Gala et al. [20] conducted a comparative study of the classification of words in MANULEX and FLELEX based on an SVM. They used four categories of 49 features, including spelling features, morphological features, semantic features, and statistical features, to train three and six categories of SVM classifiers. The accuracies of the final experimental results were 63% and 43%, respectively. The study in [21] showed that the length of English words does not seem to predict lexical complexity, and only two frequency features were used to obtain very good results.
The purpose of this study is to explore the application of machine learning classification algorithms to Chinese teaching vocabulary grading based on an analysis of lexical attributes. In Section 3, the data resources for this study are introduced, the extraction of lexical attribute information from the constructed word-formation knowledge base is completed, the mapping functions corresponding to the lexical attributes are constructed, and the quantitative representation of the attributes that forms the basis for vocabulary grading is obtained. In Section 4, the evaluation indexes for the vocabulary classification models are first provided; then a comparative analysis of the vocabulary grading results of different classification algorithms is performed; additionally, to obtain a better effect, the importance scores of the lexical attributes from a variety of feature selection algorithms are fused by averaging. Finally, a summary of this study is provided in Section 5.

EXPERIMENTAL DATA ACQUISITION
To facilitate the comparison and analysis of the experimental results, we need to choose a comparison object with high recognition or authority. For this study, we chose as the research object the intersection of the vocabulary covered by the eight sets of textbooks planned by Hanban and the vocabulary of the new HSK syllabus, which includes 2885 words. Detailed information about the eight selected textbooks is shown in Tab. 1.
To study vocabulary grading, it is necessary to analyze the lexical attributes that affect it. Based on previous studies, we choose the Chinese character word formation attribute, general vocabulary attributes, and statistical vocabulary attribute as the characteristic attributes of Chinese vocabulary's comprehensive complexity. The Chinese character word formation attribute includes average strokes and structural types of Chinese characters; general vocabulary attributes include part of speech (POS), number of syllables, and word-formation structure; statistical vocabulary attributes include frequency, number of word senses, average number of morpheme senses, and morpheme word formation ability.
To apply lexical attributes to the quantitative representation of comprehensive vocabulary complexity, we need to map the lexical attributes using the mapping function. To eliminate the difference in dimensions and the value range between different attribute representations, and make the mapping value of the mapping function meet the requirements of standardization, we set the mapping value between 0 and 1 while constructing the mapping function.
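As one way to realize such a mapping, the sketch below shows a generic piecewise function that maps an attribute value to a score in [0, 1]. The breakpoints and mapped values here are hypothetical; the paper derives its own from corpus statistics and the HSK level proportions (Tabs. 2-13).

```python
# Generic piecewise mapping: attribute value -> standardized score in [0, 1].
def piecewise_map(x, breakpoints, values):
    """Return values[i] for the first breakpoint with x <= breakpoints[i];
    values has one more entry than breakpoints, used for the open last interval."""
    for b, v in zip(breakpoints, values):
        if x <= b:
            return v
    return values[-1]

# e.g. a hypothetical average-strokes mapping (not the paper's actual f1)
f1 = lambda s: piecewise_map(s, [4, 7, 10, 13], [0.2, 0.4, 0.6, 0.8, 1.0])
print(f1(3.5), f1(8), f1(20))  # 0.2 0.6 1.0
```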

Word Formation Chinese Character Attribute
(1) Average strokes. Average strokes refers to the average number of strokes of all Chinese characters in a word. For example, the average strokes of "中 (zhōng; middle)" is 4, of "什么 (shénme; what)" is 3.5, and of "消耗 (xiāohào; consume)" is 10.
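The average-strokes computation can be sketched as follows; the stroke-count table is a small hypothetical stand-in for the word-formation knowledge base.

```python
# Hypothetical stroke-count lookup standing in for the knowledge base.
STROKES = {"中": 4, "什": 4, "么": 3, "消": 10, "耗": 10}

def avg_strokes(word):
    """Average stroke count over the characters of a word."""
    return sum(STROKES[ch] for ch in word) / len(word)

print(avg_strokes("什么"))  # 3.5
print(avg_strokes("消耗"))  # 10.0
```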
In English, the number of letters in a word is often used to evaluate its complexity, as discussed in [22]. Similarly, the average strokes of a Chinese word has an important influence on its complexity. Based on the average strokes of the words in the corpus and the proportion of each vocabulary level in the HSK syllabus, the piecewise mapping function f₁(x) for average strokes is defined as shown in Tab. 2.
(2) Structural types of Chinese characters. In this study, the structural type of a Chinese character refers to its first-level structure, of which there are 13 types: left-right, top-bottom, left-middle-right, top-middle-bottom, full surround, upper-left surround, lower-left surround, upper-right surround, surround open at the bottom, surround open at the right, surround open at the top, overlapping, and single-component.
In [23], a detailed statistical analysis was performed of 2905 Chinese characters' structure types in the list of "graded Chinese characters". The corresponding relationship between structural types of Chinese characters, storage symbols, and examples in this study is shown in Tab. 3.
Considering the frequency distribution and the proportion of each vocabulary level in the HSK syllabus, the piecewise mapping function f₂(x) for the structural types of Chinese characters is defined as shown in Tab. 4. The structural-type mapping value of a word is the average of the mapping values of its characters. For example, the structural type sequence of the characters of "现在 (xiànzài; now)" is "⿰⿸", and its mapping value is (1/3 + 1)/2 = 0.6667.
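The per-word averaging can be sketched as follows; the two structural-type values mirror only the 现在 example above and are not the full f₂(x) table.

```python
# Hypothetical f2 values for two structural types, matching the 现在 example.
TYPE_VALUE = {"⿰": 1 / 3, "⿸": 1.0}

def structure_value(type_seq):
    """Average the per-character structural-type mapping values over a word."""
    return sum(TYPE_VALUE[t] for t in type_seq) / len(type_seq)

print(round(structure_value("⿰⿸"), 4))  # 0.6667
```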

Vocabulary General Attributes
(1) POS. Generally, Chinese POS are divided into 14 categories: nouns, verbs, adjectives, numerals, quantifiers, pronouns, distinguishing words, adverbs, prepositions, conjunctions, auxiliaries, interjections, modal words, and onomatopoeia. The POS and tagging symbol set used in this study are shown in Tab. 5. Considering the proportional distribution of each POS, the POS piecewise mapping function f₃(x) is defined as shown in Tab. 6.
(2) Number of syllables. In this study, the number of syllables refers to the number of Chinese characters contained in a word. In Chinese, disyllabic words are the most common, accounting for the largest proportion, and monosyllabic words are the most complex. According to the statistical distribution of the number of syllables, the piecewise mapping function f₄(x) for the number of syllables is defined as shown in Tab. 7.

Vocabulary Statistical Attributes
(1) Frequency. Frequency refers to how often a word appears in a specific corpus. Many studies have shown a close relationship between the frequency of a word and its complexity; frequency is perhaps the most commonly used attribute for expressing complexity [27]. In this study, word frequency statistics are based on the eight sets of textbooks listed above. According to the proportion of each vocabulary level in the HSK syllabus and the word frequencies in the corpus, the piecewise mapping function f₆(x) for the frequency attribute is constructed as shown in Tab. 10.
(2) Number of word senses. In this study, the number of word senses refers to the number of different senses of a word in the Modern Chinese Dictionary. The fewer the senses, the easier the word is to understand and the lower its difficulty; conversely, the more senses a word has, the harder it is to distinguish and understand. Based on this idea, combined with the distribution of the number of word senses and the proportion of each vocabulary level in the HSK syllabus, the mapping function f₇(x) for the number of word senses is defined as shown in Tab. 11.
(3) Average number of morpheme senses. The average number of morpheme senses refers to the average number of senses per morpheme of a word. Through statistical analysis, the mapping function f₈(x) for the average number of morpheme senses is defined as shown in Tab. 12.
(4) Morpheme word-formation ability. Morpheme word-formation ability refers to the average number of times the morphemes of a word appear as morphemes of other words. Based on the statistical distribution of morpheme word-formation ability and the proportion of each vocabulary level in the HSK syllabus, the mapping function f₉(x) for morpheme word-formation ability is defined as shown in Tab. 13.

EXPERIMENT AND ANALYSIS

Evaluation Index
Combined with the purpose of vocabulary grading, this research selects several evaluation indexes to compare and evaluate the prediction results of the classification algorithm, such as accuracy, approximate accuracy, root mean square error, Kappa coefficient, Pearson correlation coefficient and so on.
(1) Accuracy. The accuracy of vocabulary grading refers to the proportion of words whose predicted grades are consistent with those in the vocabulary syllabus out of the total number of predicted words. The calculation formula is as follows:

$$\mathrm{Accuracy} = \frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{n_{\mathrm{samples}}} \mathbf{1}(\hat{y}_i = y_i)$$

where $y_i$ is the actual level of the $i$-th word in the syllabus, $\hat{y}_i$ is its predicted level, and $n_{\mathrm{samples}}$ is the number of words involved in the evaluation.
(2) Approximate accuracy. The approximate accuracy of vocabulary grading refers to the proportion of words whose predicted grades are similar to those in the syllabus out of the total number of predicted words; two grades are defined as similar here when the absolute value of their difference is not more than 1. The calculation formula is as follows:

$$\mathrm{ApproxAccuracy} = \frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{n_{\mathrm{samples}}} \mathbf{1}(|\hat{y}_i - y_i| \le 1)$$

(3) Root mean square error (RMSE). The root mean square error is the square root of the average squared error between the predicted and actual vocabulary levels, and measures the deviation between them:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{n_{\mathrm{samples}}} (\hat{y}_i - y_i)^2}$$

(4) Kappa coefficient (KC). The Kappa coefficient is a statistical measure of consistency that can be used to evaluate the accuracy of multi-class classification models; it represents the proportion by which the classification errors are reduced compared with a completely random classification:

$$\mathrm{KC} = \frac{p_o - p_e}{1 - p_e}, \quad p_o = \frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{C} x_{i,i}, \quad p_e = \frac{1}{n_{\mathrm{samples}}^2} \sum_{i=1}^{C} a_i b_i$$

where $n_{\mathrm{samples}}$ is the number of words involved in the experiment, $x_{i,i}$ is the number of words correctly predicted in class $i$ (corresponding to the elements on the diagonal of the prediction confusion matrix), $C$ is the number of vocabulary levels, $a_i$ is the actual number of words at level $i$, and $b_i$ is the predicted number of words at level $i$.
(5) Pearson correlation coefficient (PCC). The Pearson correlation coefficient measures the linear correlation between the predicted and actual vocabulary levels:

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n_{\mathrm{samples}}} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n_{\mathrm{samples}}} (y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{n_{\mathrm{samples}}} (\hat{y}_i - \bar{\hat{y}})^2}}$$

where $y_i$ is the actual level of a word in the syllabus, $\bar{y}$ is the average of the $y_i$, $\hat{y}_i$ is the predicted level, $\bar{\hat{y}}$ is the average of the $\hat{y}_i$, and $n_{\mathrm{samples}}$ is the number of words involved in the evaluation.
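The five indexes above can be sketched directly with NumPy as follows; scikit-learn's accuracy_score and cohen_kappa_score are ready-made equivalents for two of them.

```python
import numpy as np

def accuracy(y, y_hat):
    return np.mean(y == y_hat)

def approx_accuracy(y, y_hat):
    # predicted level within one level of the actual level
    return np.mean(np.abs(y - y_hat) <= 1)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def kappa(y, y_hat, levels=6):
    n = len(y)
    p_o = np.mean(y == y_hat)                     # observed agreement
    a = np.bincount(y, minlength=levels + 1)      # actual counts per level
    b = np.bincount(y_hat, minlength=levels + 1)  # predicted counts per level
    p_e = np.sum(a * b) / n ** 2                  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def pcc(y, y_hat):
    return np.corrcoef(y, y_hat)[0, 1]

y = np.array([1, 2, 3, 4, 5, 6, 1, 2])
y_hat = np.array([1, 2, 4, 4, 6, 6, 2, 2])
print(accuracy(y, y_hat), approx_accuracy(y, y_hat))  # 0.625 1.0
```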

Based on Several Common Machine Learning Classification Algorithms
Several commonly used machine learning classification algorithms were used in the experiment: LR, LDA, KNN, CART, NB, and SVM [28]. The experimental data were randomly divided, with 80% of the data in the training set and 20% in the test set. The experiments were repeated, and the average of the results was used to evaluate the classification models. The following is a comparative analysis of the classification results based on the new HSK syllabus vocabulary levels. To observe the stability of the classifiers, the average accuracy and average approximate accuracy of each algorithm were compared over 30, 100, and 300 iterations. The average accuracy over 30 and 100 iterations is shown in Fig. 2, the average accuracy and average approximate accuracy over 300 iterations are shown in Fig. 3, and detailed results for each evaluation index over 300 iterations are shown in Tab. 16. The results in Tab. 16 show that SVM achieved the best effect of the six classification algorithms, with an accuracy of 42.42% and an approximate accuracy of 87.61%; its RMSE was 1.0623, KC was 0.2307, and PCC was 0.6764. However, the accuracy of all six classification algorithms was not very high. The reason is that the HSK syllabus vocabulary is divided into six levels, a relatively fine-grained division, so it is reasonable for a specific word to be assigned to its current level or an adjacent one; this was also verified by the approximate accuracy index. The approximate accuracies of all six classification models were greater than 75%, with SVM and LDA reaching 87.61% and 87.24%, respectively.
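The comparison protocol can be sketched as follows: repeated random 80/20 splits with the six scikit-learn classifiers, averaging accuracy per classifier. Synthetic data stand in for the 2885-word attribute vectors, so the numbers will not reproduce Tab. 16.

```python
# Repeated random 80/20 splits over the six classifiers used in the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {"LR": LogisticRegression(max_iter=1000),
          "LDA": LinearDiscriminantAnalysis(),
          "KNN": KNeighborsClassifier(),
          "CART": DecisionTreeClassifier(),
          "NB": GaussianNB(),
          "SVM": SVC()}

rng = np.random.default_rng(0)
X = rng.random((300, 9))
y = rng.integers(1, 7, 300)

scores = {name: [] for name in models}
for it in range(30):  # 30 iterations; the paper also runs 100 and 300
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=it)
    for name, m in models.items():
        scores[name].append(m.fit(Xtr, ytr).score(Xte, yte))

for name in models:
    print(name, round(float(np.mean(scores[name])), 4))
```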
Compared with the results in [30], the accuracy and approximate accuracy of the six classification algorithms in this study were all greater, to a certain extent. In particular, for the SVM classification algorithm, the accuracy was 42.42%, 14.38 percentage points higher than the 28.04% in [30], and the approximate accuracy was 87.61%, 15.13 percentage points higher than the 72.48% in [30].

Experimental Effect Improvement
Each classification algorithm has its own characteristics. To improve the experimental results, we can improve two aspects of the experiment: feature selection of lexical attributes and classification algorithm integration.

Classification Effect Improvement Based on Feature Importance Selection of Different Feature Selection Algorithms
To make full use of fewer features and improve the classification effect, feature selection is necessary. Through feature selection, we can reduce the number of features and the influence of irrelevant and redundant features on the classification effect, which also makes the factors that influence classification easier to understand. At the same time, we can reduce space and time costs and improve performance. In [29], the authors indicated that an appropriate feature selection algorithm with an appropriate number of selected features does not degrade the classification effect of the classifier and may even improve it.
Feature selection methods can be divided into three categories: filter, wrapper, and embedded. Different feature selection algorithms produce different importance measurements for the lexical attributes. In this study, a variety of feature selection methods were used to calculate the importance of the lexical attributes. To eliminate the influence of the differences between feature selection algorithms and improve the rationality of the importance measurement, the importance scores calculated by each feature selection algorithm were standardized using the min-max method:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Based on the standardized results, the average value was used to express the importance of each lexical attribute; the details are shown in Tab. 17. According to the importance measurements of the different feature selection methods, the attributes ranked by importance were numbers 6, 1, 2, 7, 9, 3, 4, 8, and 5. Comparative experiments showed that the best classification effect was obtained when the top six attributes by importance were selected. According to the importance of the lexical attributes obtained from expert knowledge in [30], the top six attributes were numbers 6, 2, 7, 4, 8, and 1. A comparison of the two rankings shows some differences in the order of lexical attribute importance obtained by the two approaches. The comparison results for the three lexical attribute groups are shown in Tab. 18.
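The standardization and average fusion can be sketched as follows; the raw importance scores are hypothetical numbers chosen so that the fused ranking matches the order 6, 1, 2, 7, 9, 3, 4, 8, 5 reported in Tab. 17.

```python
import numpy as np

def min_max(v):
    """Min-max standardization: rescale a score vector to [0, 1]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# Rows: three feature-selection methods; columns: the nine lexical attributes.
# Hypothetical scores, crafted only so the fused ranking matches Tab. 17.
raw = np.array([[0.9, 0.8, 0.3, 0.2, 0.05, 1.2, 0.6, 0.15, 0.5],
                [8.0, 7.0, 3.0, 2.5, 1.0, 9.0, 5.0, 1.5, 4.0],
                [0.7, 0.6, 0.25, 0.2, 0.1, 0.8, 0.45, 0.15, 0.35]])

fused = np.mean([min_max(r) for r in raw], axis=0)  # average fusion
top6 = np.argsort(fused)[::-1][:6] + 1              # 1-based attribute numbers
print(top6)  # [6 1 2 7 9 3]
```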
As Tab. 18 shows, comparing the classification results of the three attribute groups: with all attributes, the LR and LDA classification algorithms achieved relatively good results; with the top six attribute group from [30], KNN and CART achieved relatively good results; and with the top six attribute group from feature selection, NB and SVM achieved relatively good results. Among all the experimental results, the SVM classification algorithm with the top six attributes from feature selection achieved the best classification effect, with an accuracy of 42.76% and an approximate accuracy of 87.02%. Compared with the accuracy of 28.04% and approximate accuracy of 74.48% in [30], the classification effect greatly improved. The reasons can be summarized in two aspects. First, through feature selection we could choose, from the combination of lexical attributes, the set that contributed most to classification, which was more conducive to improving the classification effect. Second, SVM has good learning ability for small-sample, high-dimensional classification, obtaining a low error rate and making good classification decisions for data points outside the training set. For the six classification algorithms, the iteration curves of accuracy and approximate accuracy based on the top six attributes from feature selection are shown in Fig. 4.

Vocabulary Grading Effect Improvement Based on the Integration Algorithm
Based on the classification algorithms used in the above experiments, combined with the Bagging integration algorithm, the effect of vocabulary grading was improved. Integration algorithms are a common way to improve experimental results; at present, popular integration algorithms mainly include the bagging algorithm, the boosting algorithm, and the voting algorithm.

(1) Vocabulary grading effect improvement based on CART and the Bagging integration algorithm
To measure the importance of the lexical attributes based on CART, the "DecisionTreeClassifier()" model in "sklearn.tree" of Python was selected, and the "feature_importances_" property of the model was used to represent the importance of the lexical attributes. Because the results returned each time differed slightly, to express the importance of the lexical attributes more stably, the average was calculated over 300 repetitions of the experiment, and the results were standardized using the min-max method. The importance measurements for the lexical attributes are shown in Tab. 19. The bagging algorithm used in the experiment was implemented with the "BaggingClassifier" in scikit-learn. A comparison of the experimental results based on CART and Bagging + CART is shown in Tab. 20, and the accuracy and approximate accuracy iteration curves based on the Bagging + CART algorithm and the top three attribute group are shown in Fig. 5. The results in Tab. 20 demonstrate that, compared with the other three classification results in that table, the accuracy and approximate accuracy based on the Bagging + CART algorithm and the top three attribute group improved accordingly.
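The Bagging + CART setup can be sketched with scikit-learn as follows; the data are synthetic, so the printed scores are illustrative rather than those of Tab. 20.

```python
# CART alone vs. CART bagged with bootstrap resampling.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 9))
y = rng.integers(1, 7, 300)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

cart = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(Xtr, ytr)
print("CART:", cart.score(Xte, yte), "Bagging+CART:", bag.score(Xte, yte))

# feature_importances_ of the fitted tree gives the attribute importances
print(np.round(cart.feature_importances_, 3))
```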

(2) Vocabulary grading effect improvement based on the Bagging + SVM integration algorithm
For the SVM classification algorithm, the "GridSearchCV()" function was used to tune the parameters, and the best combination was C = 1, gamma = 0.1, kernel = 'rbf'. For the bagging integration algorithm, the "BaggingClassifier()" module in "sklearn.ensemble" in Python was selected. A comparison of the experimental results based on SVM and Bagging + SVM is shown in Tab. 21, and the accuracy and approximate accuracy iteration curves based on the Bagging + SVM algorithm and the top six attributes by feature selection are shown in Fig. 6. Compared with the other two classification results in Tab. 21, the accuracy and approximate accuracy based on the Bagging + SVM algorithm and the top six attribute group improved accordingly. The reasons are that, in addition to the advantages of SVM described above, the "Bagging + SVM" vocabulary grading model draws new datasets from the original dataset by sampling with replacement and trains an SVM classifier on each; multiple weak learners are thus combined into a strong learner, which gave vocabulary classification a better effect.
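The parameter search and the Bagging + SVM ensemble can be sketched as follows; the data are synthetic, so the selected parameters and scores are illustrative only, though the paper's best combination (C = 1, gamma = 0.1, kernel = 'rbf') is in the search grid.

```python
# Grid search over SVM parameters, then bagging the tuned SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 9))
y = rng.integers(1, 7, 300)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10],
                            "gamma": [0.01, 0.1, 1],
                            "kernel": ["rbf"]}, cv=3).fit(Xtr, ytr)
best = grid.best_params_

# Bag the tuned SVM: bootstrap samples, one SVM per sample, majority vote
bag = BaggingClassifier(SVC(**best), n_estimators=10, random_state=0).fit(Xtr, ytr)
print("best params:", best, "Bagging+SVM accuracy:", bag.score(Xte, yte))
```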

CONCLUSION
This paper started with an analysis of the lexical attributes that affect vocabulary grading, followed by the extraction of lexical attribute information from the constructed word-formation knowledge base, the construction of the mapping functions corresponding to the lexical attributes, and the quantitative representation of the attributes that forms the basis for vocabulary grading. Guided by this, a vocabulary grading model based on common machine learning classification algorithms was constructed, covering LR, LDA, KNN, CART, NB, and SVM. In the experiments, the importance of the lexical attributes was measured using different methods, and the results demonstrated that the frequency of words in the corpus played an extremely important role; in addition to frequency, the number of word senses and the average number of strokes of the Chinese characters were also important. To improve the effect of vocabulary grading, the importance scores of the lexical attributes produced by a variety of feature selection algorithms were fused by averaging, and the vocabulary grading experiment was then conducted with the Bagging integration algorithm. The experimental results demonstrated that the combination of feature selection and the Bagging integration algorithm achieved a better effect. Because only nine vocabulary attributes were used in the grading experiments, which limited the grading effect to a certain extent, in a follow-up study we will explore more lexical attributes to further improve vocabulary grading.