Text Classification with a Hybrid Model Based on Deep Learning

Abstract: Deep learning has been widely applied in many fields, but research on its use for text classification is still relatively scarce. This paper exploits the strong representation-learning ability of deep learning, proposes a hybrid model based on deep learning, and designs a text classifier built on this hybrid model. The hybrid model combines two common deep learning models, the sparse autoencoder and the deep belief network. It consists of three parts: the first two layers are built from sparse autoencoders, the middle part is a three-layer deep Convolutional Neural Network (CNN), and finally Softmax regression serves as the classification layer. To test the classification performance of the classifier based on the deep learning hybrid model, experiments were conducted on the English data set 20Newsgroup and the Chinese data set Fudan University Chinese Corpus. In the English text classification experiment, the classifier based on the deep learning hybrid model achieved a high classification accuracy. To further verify its performance, a comparative experiment with the naive Bayes classifier, the K-Nearest Neighbor (KNN) classifier and the Support Vector Machine (SVM) classifier demonstrates that the classifier based on the deep learning hybrid model outperforms all three. In the Chinese text classification experiment, the Fudan University Chinese corpus was tested and a good classification result was obtained. The influence of different parameter settings on classification accuracy is also discussed.


INTRODUCTION
Since the widespread application of Internet technology, people have faced the severe problem of information explosion. Information on the Internet keeps increasing, and it grows rapidly, at a geometric rate [1][2][3]. The Internet carries an astonishing amount of information, and the world is submerged in it. The Internet has become a key tool for most people to search for or acquire information, making it essential to people's daily lives and work. Among the vast amount of information the Internet provides, how to find valuable information accurately and quickly has become an important issue. Text contains a great deal of valuable information [4][5][6]. Classifying published text is one of the important ways to analyze data; improving the efficiency and quality of text usage makes it possible to organize and manage text effectively. Text classification refers to analyzing the content of a text and determining which of a set of given categories the text belongs to [7][8]. In the early days, people relied on manual classification of texts. This traditional method was time-consuming and laborious, unable to deal with massive amounts of text, and its standards were hard to unify because human factors made the classification results unstable. At present, the main methods of text classification are statistics and machine learning, which have made much progress and entered a stage of rapid development [9][10]. Text classification is still a hot research topic for many researchers. The main idea is to apply a text classification algorithm that learns from known samples, classifies unknown texts using the learned rules, and finally outputs the text categories. Text classification can handle large numbers of texts, reduce the consumption of manpower and material resources, and enable users to obtain valuable content quickly and efficiently.
It provides convenience for follow-up research work and raises text information processing to a new level. With continued in-depth exploration of the field, text classification has been widely adopted in search engines, digital libraries, email filtering and other fields.
(a) Search engines. The search engine is an indispensable tool in people's lives for getting information from the Internet. The Internet consists of a huge number of web pages, and it is difficult for people to find the information they want. The function of a search engine is to quickly classify the information on the Internet and to screen relevant information from each category. This involves text classification, which classifies texts according to their contents and then manages each class separately. When a user queries for information, the search engine provides a retrieval service, retrieves the relevant information the user wants from the corresponding classified information, and presents it as a results page.
(b) Digital libraries. Because information technology has developed steadily and rapidly, the digital library has become the development direction of most libraries. Classifying acquired text has likewise become one of the important techniques for retrieving information. When a library classifies books, it adopts text classification technology, which can manage books effectively and reduce the tedious work of librarians. Digitalization makes the library convenient for readers and enables them to obtain all kinds of library information from different places.
(c) Mail filtering. With the development of the Internet, e-mail has provided great convenience for communicating with other people, but the existence of spam also adds trouble to daily life. Text classification technology can filter out junk messages so that users avoid their interference. A spam classifier is trained according to the characteristics of spam; it filters out junk messages and keeps only the messages users need, so that users' daily lives are free from interference. It can be seen from the above applications in various fields that research on text classification has important theoretical and practical significance. As a basic task, text classification provides an effective guarantee for deep mining of the valuable information in text.

RELATED WORK
2.1 Research Status of Text Classification
Text classification was first proposed in the 1960s. Manual classification, performed by professional researchers, is the earliest method of text classification. It wasted much manpower and many resources and was limited by the number of professional researchers; specific classification problems had to be formulated and solved by specific researchers. By the 1990s, the number of texts was exploding, and the proposal and development of machine learning attracted many researchers: a text classifier first trains on a large data set to establish a mathematical model, and then automatically classifies new sample data. Research on text classification was carried out abroad from the middle of the 20th century. In 1957, Luhn put forward the idea of applying word frequency statistics to text classification, which laid the foundation for the field. Then Maron et al. successively put forward the probability model and the factorization model algorithm, which advanced text classification technology. In 1970, Salton et al. put forward the vector space model, which can represent text well. During this period, text classification mainly used knowledge engineering, which depends on rules formulated by experts. However, formulating such rules takes a great deal of time and energy, which prevented the method from being popularized. In the 1990s, with the rapid development of the Internet, there was an urgent need to classify more and more texts of different kinds. Machine learning methods emerged at this time and were quickly applied to text classification. Text classification based on machine learning does not need a manually constructed classifier: it finds distinguishing features among texts by learning from samples, summarizes these features, and automatically generates a text classifier according to the learned rules.
Text classification using machine learning is superior to knowledge engineering in accuracy and efficiency; it gradually replaced the knowledge engineering approach and became the mainstream. Experimental analysis found that the new text classifiers were comparable to professional researchers in classification accuracy, so machine learning became the common approach to text classification at that time. In 1971, Rocchio proposed a new linear classifier [11]. In 1979, van Rijsbergen put forward new concepts in the field of information retrieval, such as the evaluation criteria precision and recall, and applied them to text classification. In 1995, Vapnik proposed the Support Vector Machine. Thorsten Joachims applied the linear-kernel support vector machine to text classification for the first time, and to this day the theory and application of support vector machines still have great influence on text classification. After 1995, Yoav Freund and Robert E. Schapire published work on AdaBoost; Schapire proposed the AdaBoost algorithm framework and carried out experimental verification. Later, scholars designed many similar algorithms based on this framework, and these algorithms achieved much in text classification research [12,26]. Joachims proposed a text classification algorithm based on the support vector machine in 1997, which started an upsurge of theory and research on applying support vector machines to text classification. Compared with traditional classification methods, using existing natural language processing tools suffers from error accumulation during processing [13]. In 2014, Zeng D J et al. put forward a method for learning text semantic features based on a deep convolutional neural network.
Based on the degree of correlation between apparent and latent semantics and document categories, this method handles the classification of irregular texts, such as Chinese short texts from the web, well [14]. In 2018, Li H M et al. proposed a short text classification model that uses a dense network to represent text directly [15]. In 2019, Wang Gensheng, Huang Xuejian and Chloe Wang optimized the text classification algorithm by modifying word vector weights and by manually building a dictionary, respectively; however, its learning time complexity is much higher than that of traditional methods and needs further improvement [16]. In 2019, Jin W Z proposed a text classification method based on a deep learning feature fusion model [17]. Although machine learning has made extremely important achievements in text classification, and research on text classification once stagnated, the characteristics of text classification itself have opened new development directions for machine learning [18][19][20][21], so text classification is still an extremely important research direction in NLP.

Overview of Text Classification
Text classification is a kind of supervised learning. Given a set of training documents D = {d1, d2, …, dm}, each document in the set has a category label. The rules linking the category label to the attributes of each document are found through supervised learning, and the rules are then used to obtain category labels for new documents.
Text classification can be defined in the following mathematical form: given a set of documents D, di represents the i-th document, and there are m documents in D. Assuming a set of document categories C = {c1, c2, …, cn}, there is a mapping f: D × C → {0, 1}, where f(di, cj) = 1 if document di belongs to category cj, and f(di, cj) = 0 if it does not. The task of text classification is to learn a classifier F that approximates f as closely as possible. The text classification process consists of a training process and a classification process, as shown in Fig. 1. In the training process, the training text goes through the steps shown in Fig. 1, which are the basis of text classification, and the classifier is then trained continuously by the selected classification algorithm. In the classification process, a test document is processed by the steps shown in Fig. 1 and then passed to the trained classifier, which identifies the category of the test document.
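As a concrete illustration of this train-then-classify process, the following toy sketch (with made-up word-count features and category names, not the paper's classifier) learns one centroid per category from labelled training documents and assigns a new document to the nearest centroid:

```python
# Toy nearest-centroid classifier illustrating the train/classify
# split of Fig. 1. Feature vectors and labels are hypothetical.
def centroid(vectors):
    # component-wise mean of a list of equal-length vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    # Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# training documents as word-count feature vectors, grouped by label
train = {
    "sports": [[5, 0, 1], [4, 1, 0]],
    "finance": [[0, 6, 2], [1, 5, 3]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}

def classify(doc_vec):
    # assign the new document to the category with the nearest centroid
    return min(centroids, key=lambda c: distance(doc_vec, centroids[c]))

label = classify([4, 0, 2])   # lies close to the "sports" centroid
```

The training phase here is just computing centroids; real classifiers replace this step with a learned model, but the overall train-then-predict flow is the same.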

Figure 1 Process diagram of text classification
Text contains a large amount of unstructured or semi-structured information that cannot easily be recognized by the classifier, so text documents must be preprocessed to remove useless information. Text preprocessing turns text into structured information that the computer can operate on. It is the initial stage of text classification, and its results strongly influence the classification results. Text preprocessing includes denoising, word segmentation and stopword removal. In English, spaces and punctuation marks are commonly used for word segmentation. English also requires stemming, which unifies words with the same semantics but slightly different forms into one form; it mainly targets singular and plural forms of nouns, comparative forms of adjectives and adverbs, and the various tense forms of verbs. Stopword removal deletes pronouns, prepositions, conjunctions and other features unrelated to classification; these stop words are irrelevant to the meaning of the document.
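The preprocessing steps above can be sketched as follows. The stopword list and suffix rules here are illustrative stand-ins; a real system would use a full stopword list and a proper stemmer such as the Porter stemmer:

```python
import re

# Minimal English preprocessing: tokenize on letters, lowercase,
# drop stopwords, then apply a very crude suffix-stripping "stem".
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "it", "is"}

def crude_stem(word):
    # strip a few common suffixes (plural -s/-es, -ing, -ed);
    # a stand-in for a real stemming algorithm
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # segmentation + denoising
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("The cats were playing in the gardens.")
# → ["cat", "were", "play", "garden"]
```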

Common Models of Deep Learning
2.3.1 Autoencoder
The autoencoder is a kind of unsupervised learning: a new network reconstructed from a neural network [22,23]. It automatically learns data features by making the output reproduce the input. By continuously adjusting the weights of each layer during training, each hidden layer becomes another representation of the input data and can be used as its features. Compared with principal component analysis, which is limited to linear dimensionality reduction, the autoencoder can use a nonlinear neural network to reduce the dimensionality of features. An autoencoder consists of an encoder and a decoder: the output of the encoder applied to the original data is used as the input of the decoder, and the decoder's output represents the original data in another form, as shown in Fig. 2 below.
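A minimal sketch of this encode-decode training loop, using a single hidden layer and plain gradient descent on toy data. The layer sizes, learning rate and iteration count are illustrative choices, not the paper's settings:

```python
import numpy as np

# One-hidden-layer autoencoder: encode 8-dim inputs into a 3-dim code,
# decode back, and minimize reconstruction error by gradient descent.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = rng.random((50, 8))                    # 50 toy samples, 8 features
n_in, n_hid = 8, 3
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)   # encoder
W2 = rng.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)    # decoder
lr = 0.5

def forward(X):
    H = sigmoid(X @ W1 + b1)               # hidden code (the learned features)
    Y = sigmoid(H @ W2 + b2)               # reconstruction of the input
    return H, Y

err0 = np.mean((forward(X)[1] - X) ** 2)   # error before training
for _ in range(2000):
    H, Y = forward(X)
    dY = (Y - X) * Y * (1 - Y)             # gradient at output pre-activation
    dH = (dY @ W2.T) * H * (1 - H)         # backpropagated into hidden layer
    W2 -= lr * H.T @ dY / len(X); b2 -= lr * dY.mean(0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(0)
err1 = np.mean((forward(X)[1] - X) ** 2)   # error after training
```

After training, `H` can be used as a compressed nonlinear feature representation of the input, which is exactly how the hybrid model uses its autoencoder layers.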

Convolutional Neural Network (CNN)
The convolutional neural network was put forward by LeCun in 1989 and is applied successfully in speech recognition and image recognition. It is essentially a multilayer perceptron that can recognize images well [27]. Because of its special structure, it is highly robust to transformations of the input such as translation and scaling.
A CNN [24] is composed of one or more convolution layers and a fully connected layer at the top, and also includes the associated weights and pooling layers. This structure enables the CNN to exploit the two-dimensional structure of the input data. A typical structure is shown in Fig. 3. First, the input features are convolved in the C1 layer and transformed into feature maps after passing through three filters. The feature maps are then weighted, biased, and finally passed through the Sigmoid function to generate the S2-layer feature maps. The resulting feature maps are processed in the same way to obtain the C3 and S4 feature maps in turn. This feature mapping makes the extracted features insensitive to position, with the Sigmoid function used as the activation function. The C layers in the middle are feature extraction layers: each neuron is connected to local nodes in the previous layer to extract local features.
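The convolve, bias, sigmoid, pool pattern of the C and S layers described above can be sketched as follows; this is a toy single-channel example with random values, not the paper's network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(img, kernel):
    # "valid" 2D convolution: slide the kernel over every position
    # where it fits entirely inside the image
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def pool2x2(fmap):
    # 2x2 mean pooling: downsample the feature map by averaging
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((8, 8))                  # toy single-channel input
kernel = rng.normal(0, 0.5, (3, 3))       # one 3x3 filter
bias = 0.1

c1 = sigmoid(conv2d_valid(img, kernel) + bias)  # C layer: 6x6 feature map
s2 = pool2x2(c1)                                # S layer: 3x3 after pooling
```

Pooling is what gives the position-insensitivity mentioned above: small shifts of a feature inside a 2x2 window leave the pooled value nearly unchanged.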

EXPERIMENTAL RESULT ANALYSIS
3.1 Data Set
Standard foreign-language text classification corpora include Reuters-21578, 20Newsgroups, OHSUMED and WebKB; standard Chinese corpora include TanCorp. These data sets can be downloaded for free. In this paper, the 20Newsgroups data set shown in Tab. 1 is selected for the English text classification experiment. It is a text data set compiled by Lang in 1995 and contains the message texts of 20 Usenet newsgroups (20 categories), 19,997 articles in total: one newsgroup contains 997 messages, and each of the others has 1000. This data set is a typical single-label text classification corpus.

Text Classification Experiment
A certain number of texts are selected at random from the English data set 20Newsgroup and preprocessed. Text preprocessing is implemented on the Eclipse platform in Java. From the preprocessed documents, 30% are randomly selected as the test set, and the rest form the training set. The feature dimension of each document is 1500. The classifier based on the deep learning hybrid model is implemented in MATLAB. Because the original feature dimension is 1500, the sparse autoencoder layer has 1500 input nodes. After a sparse autoencoder with 3000-1500 hidden nodes, the data is compressed by a three-layer deep belief network with 200-100-20 hidden nodes. Finally, the Softmax layer outputs, for each document in the test set, the probability of belonging to each category, and the category with the highest probability is taken as the prediction. After text classification, the best accuracy of each category is obtained, as shown in Fig. 5 and Tab. 2. To further verify the performance of text classification based on the deep learning hybrid model, this paper compares the proposed classifier SDBN with the naive Bayes, KNN and support vector machine classifiers. In the comparison experiment, the same data set is used for the training and test sets, and the texts of both sets are preprocessed. In the naive Bayes experiment, the naive Bayes classifier [25] in MATLAB is used to obtain the classification accuracy on the test set. In the KNN experiment, the knnclassify classifier built into MATLAB is used. In the support vector machine experiment, LIBSVM, an open-source support vector machine package, is used.
Fig. 5 shows the results of the comparison experiment, from which it can be seen that the performance of the text classifier SDBN based on the deep learning hybrid model is slightly better than that of the other classifiers. From the Fudan University Chinese Text Classification Corpus, four kinds of documents, namely economy, sports, computer and agriculture, are selected as the training and test sets of the Chinese experiment. The classifier based on the deep learning hybrid model is implemented in MATLAB, and 30% of the documents in the preprocessed Chinese data set are randomly selected as the test set. Features of 1000 dimensions are selected as the raw data. First, a sparse autoencoder with 2000-1000 hidden nodes is used; then a three-layer deep belief network with 200-100-20 hidden nodes compresses the data, with 200 BP fine-tuning iterations. Finally, the Softmax layer outputs the probability that each test document belongs to each category, and the category with the highest probability is the predicted category. Tab. 3 shows the best accuracy after text classification. Tab. 4 to Tab. 11 show comparisons of the precision, recall and F1 value of BRCNN and ACNN on the MR, Subj, SST, SST2, IMDB, TREC, CR and MPQA data sets respectively. As can be seen from these tables, on all data sets used in this paper there is little difference between the precision, recall and F1 values of BRCNN and ACNN and their variants. This shows that there is no case in which a large number of samples of one category are predicted as other categories, causing low recall, and no case in which a large number of samples of other categories are predicted as one category, causing low precision. The model's predictions are therefore relatively balanced, and even unbalanced data can be predicted well.
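The Softmax output layer used in both experiments can be sketched as follows. The weights here are random stand-ins for trained parameters, and the sizes mirror the 20-dim DBN output feeding a small number of categories:

```python
import numpy as np

# Softmax regression layer: map a low-dimensional code to one
# probability per document category; predict the most probable class.
def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_features, n_classes = 20, 4  # e.g. 20-dim code, 4 Chinese-corpus categories
W = rng.normal(0, 0.1, (n_features, n_classes))  # stand-in trained weights
b = np.zeros(n_classes)

code = rng.random(n_features)          # output of the previous (DBN) layer
probs = softmax(code @ W + b)          # probability of each category
predicted = int(np.argmax(probs))      # category with the highest probability
```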
For example, the data in the TREC data set is unbalanced, but it too can be predicted well. This shows that the precision and recall values of these models in the tables above lie near the "balance point" on the P-R curve. This is mainly because 0.5 is usually used as the threshold for judging categories in binary classification tasks, and this paper also uses 0.5 as the classification threshold in the experiments. It also shows that BRCNN and ACNN are suitable for text classification tasks. In practice, we sometimes pay more attention to the precision of classification and sometimes to the recall. The classification threshold can then be set to different values according to the specific situation to obtain different precision or recall values. For example, reducing the classification threshold yields a higher recall, but the precision is relatively reduced; conversely, increasing the classification threshold increases the precision, but the recall decreases. On the whole, the precision and recall of the model are relatively balanced near the "balance point". In practice, one can draw the P-R curve first and then set the classification threshold according to the situation.
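The threshold trade-off described above can be checked directly on hypothetical scores and labels (these values are made up for illustration, not from the paper's experiments):

```python
# Compute precision and recall for a binary classifier at a given
# decision threshold on its output scores.
def precision_recall(scores, labels, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))        # true positives
    fp = sum(p and not y for p, y in zip(preds, labels))    # false positives
    fn = sum((not p) and y for p, y in zip(preds, labels))  # false negatives
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2]  # hypothetical scores
labels = [1, 1, 0, 1, 0, 1, 0, 0]                    # hypothetical truth

p_lo, r_lo = precision_recall(scores, labels, 0.3)   # low threshold
p_hi, r_hi = precision_recall(scores, labels, 0.7)   # high threshold
# lowering the threshold raises recall; raising it raises precision
```

Sweeping the threshold from 0 to 1 and plotting `(rec, prec)` pairs traces out exactly the P-R curve discussed above.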

CONCLUSION
To evaluate the classification performance of the classifier based on the deep learning hybrid model, experiments were conducted on the English data set 20Newsgroup and the Chinese data set Fudan University Chinese Corpus. In the English text experiment, the classifier based on the deep learning hybrid model achieved a classification accuracy of 91%. To further verify its performance, the comparison experiment with the naive Bayes, KNN and support vector machine classifiers shows that the classification effect of the deep learning hybrid model is slightly better than that of the SVM, KNN and naive Bayes classifiers. In the Chinese text experiment, the Fudan University Chinese corpus was tested, a good classification result was obtained, and the influence of different parameters on the classification accuracy was discussed.
In future work, we will extend this study: we can improve the algorithms in the deep learning model and try other deep learning models, such as CNN, to learn features for text classification. In a word, text classification based on the deep learning hybrid model has good application prospects.
As deep learning theory continues to improve, new research results will feed into further work on this topic, which will greatly improve the performance of text classifiers based on the deep learning hybrid model.