An Online Word Vector Generation Method Based on Incremental Huffman Tree Merging

Abstract: Aiming at the high real-time processing requirements of large amounts of online text data in natural language processing applications, an online word vector model generation method based on incremental Huffman tree merging is proposed. Keeping the inherited Huffman tree of the existing word vector model unchanged, a new Huffman tree is constructed for the incoming words such that it shares no leaf node with the inherited tree. The Huffman tree is then updated by a node merging method, so that, based on the existing word vector model, each word still has a unique encoding for the hierarchical softmax computation. Finally, the incremental word vector model is generated by training a neural network on top of the hierarchical softmax model. The experimental results show that the method realizes online word vector model generation based on incremental learning with less training time and better performance.


INTRODUCTION
Since the word2vec model was proposed in [1], academia and industry have set off a wave of research in the field of natural language processing (NLP) in recent years [2]. The distributed vectors of words represent certain semantic or grammatical features, which have been shown to be useful in various NLP tasks such as text classification [3, 4] and semantic analysis [5-7]. However, since the initial feature vectors of words are randomly initialized, the generated word vectors differ from run to run. If each corpus is trained separately, the resulting word vectors introduce errors into subsequent classification, clustering, prediction, and other processing tasks. In particular, when incremental small-sample data needs to be processed, retraining all the words takes a great deal of time and computational resources. It is therefore necessary to combine the old corpus and the new corpus for incremental training.
In order to generate word vectors faster, Pennington et al. (2014) proposed GloVe, a global word vector model [8], which does not need to traverse the corpus for local word vector optimization but instead performs global optimization on a co-occurrence matrix. Ji et al. (2019) parallelized word2vec training on shared and distributed memory [9], reducing network communication and keeping the model effectively synchronized as the number of nodes increases.
However, traditional word2vec, GloVe, and parallel training are static, batch learning modes, which do not support incremental learning and are suitable only for offline training. As mentioned above, in practice the training corpus is usually not available all at once but is obtained gradually over time. Retraining all the data after a new corpus arrives takes a great deal of time and space. Especially when the new corpus is small relative to the existing corpus, retraining the word vectors on the combined corpus is far less efficient. An incremental learning method is therefore practical and necessary.
As is known, word2vec employs two techniques called hierarchical softmax and negative sampling [10, 11]. Hierarchical softmax constructs a Huffman tree that indexes all the words in a corpus as leaves, while negative sampling is developed based on noise contrastive estimation. It has been empirically shown that hierarchical softmax performs better on infrequent words, while negative sampling performs better on frequent words. Because of this good performance on rare words, which benefits further incremental training on new corpora, the hierarchical softmax model is widely adopted in incremental learning. Peng et al. (2017) proposed an incremental training model based on hierarchical softmax for new corpora [12], in which a new Huffman tree is constructed and the existing Huffman tree is changed in order to update all the word vectors and parameters; stochastic gradient based optimization is then performed over both the new and the old Huffman trees.
In this paper, a more practical method for online word vector model generation based on incremental learning is proposed. While keeping the inherited Huffman tree of the existing word vector model unchanged, a new Huffman tree is constructed for the incoming words such that it shares no leaf node with the inherited tree. An updated Huffman tree is then constructed by a node merging method: the leaf node with the shortest path in the inherited Huffman tree is found and used as the root of the new Huffman tree. Thus, based on the existing word vector model, each word still has a unique encoding for the hierarchical softmax computation [13]. On top of the hierarchical softmax model, distributed vectors are generated using the Continuous Bag-of-Words (CBOW) [14] or Skip-gram [15] methods, just as in full retraining. Under this method, the previously trained model is maximally reused: for online learning, the Huffman codes of all words in the original corpus are retained with the Huffman tree unchanged, and only the new corpus data is dynamically updated.
The incremental Huffman tree merging method can vectorize all words (including new words), so that the relationships between words can be quantitatively measured and explored [16]. The experimental results show that the method realizes online word vector model generation based on incremental learning simply and effectively.

INCREMENTAL HUFFMAN TREE MERGING
As mentioned above, the hierarchical softmax model is selected for incremental learning in online processing tasks [17]. The Huffman tree is constructed according to the frequency with which each word occurs in the corpus. By default, the left branch (coded as 0) is treated as the positive class and the right branch (coded as 1) as the negative class. The leaf nodes of the Huffman tree are all the words in the corpus; the higher a word's frequency, the closer its leaf is to the root node.
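As a concrete illustration (not taken from the paper), the following minimal Python sketch builds such a frequency-based Huffman tree and assigns the 0/1 codes; the names Node, build_huffman, and codes are our own illustrative choices.

```python
import heapq
import itertools

class Node:
    """Huffman tree node: leaves carry a word, inner nodes carry child links."""
    def __init__(self, freq, word=None, left=None, right=None):
        self.freq, self.word, self.left, self.right = freq, word, left, right

def build_huffman(word_freqs):
    """Build a Huffman tree from {word: frequency}; frequent words end near the root."""
    tiebreak = itertools.count()  # prevents heapq from ever comparing Node objects
    heap = [(f, next(tiebreak), Node(f, w)) for w, f in word_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)  # the two lowest-frequency subtrees ...
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak),
                              Node(f1 + f2, left=n1, right=n2)))  # ... are merged
    return heap[0][2]

def codes(node, prefix=""):
    """Return {word: Huffman code}, with left branches coded 0 and right branches 1."""
    if node.word is not None:  # leaf
        return {node.word: prefix}
    table = codes(node.left, prefix + "0")
    table.update(codes(node.right, prefix + "1"))
    return table
```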
Take the word $w_2$ in Fig. 1 as an example. The input word is denoted as $w$, and the word vector fed into the root node of the Huffman tree at the input layer is $x_w$. On the path from the root node to the leaf node where $w$ is located (shown as the bold nodes), the total number of nodes is $l_w$. For word $w$ in the Huffman tree, starting from the root node, the $i$-th node passed through is denoted $p_i^w$, its corresponding Huffman code is $d_i^w \in \{0, 1\}$ (the root node carries no code), and the parameter vector of the corresponding inner node is $\theta_i^w$.

Gradient Optimization
The hierarchical softmax model uses a binary tree to represent all words in the vocabulary. The probability of a word is defined as the probability that a random walk starting from the root ends at the leaf node of that word. At each inner node, the probabilities of going left and going right must be assigned. When $d_i^w = 0$, the (left) branch is treated as positive, and when $d_i^w = 1$, the (right) branch is treated as negative. The probability of going left or right at node $p_{i-1}^w$ is therefore

$$P(d_i^w = 0 \mid x_w, \theta_{i-1}^w) = \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w), \qquad P(d_i^w = 1 \mid x_w, \theta_{i-1}^w) = 1 - \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w) \qquad (1)$$

or, written as one expression,

$$P(d_i^w \mid x_w, \theta_{i-1}^w) = \left[\sigma(x_w^{\mathrm{T}} \theta_{i-1}^w)\right]^{1 - d_i^w} \left[1 - \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w)\right]^{d_i^w} \qquad (2)$$

where $\sigma(\cdot)$ is the sigmoid function. Then for a target output word $w$, its likelihood is the product of the probabilities along its path:

$$P(w \mid x_w) = \prod_{i=2}^{l_w} P(d_i^w \mid x_w, \theta_{i-1}^w) \qquad (3)$$

For a target output word $w$, the log-likelihood function is

$$E = \log P(w \mid x_w) = \sum_{i=2}^{l_w} \left[(1 - d_i^w)\log \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w) + d_i^w \log\left(1 - \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w)\right)\right] \qquad (4)$$

Taking the derivative of $E$ with respect to the vector representation of the inner node $\theta_{i-1}^w$ gives

$$\frac{\partial E}{\partial \theta_{i-1}^w} = \left(1 - d_i^w - \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w)\right) x_w \qquad (5)$$

In the same way, the gradient expression of $x_w$ is as follows:

$$\frac{\partial E}{\partial x_w} = \sum_{i=2}^{l_w} \left(1 - d_i^w - \sigma(x_w^{\mathrm{T}} \theta_{i-1}^w)\right) \theta_{i-1}^w \qquad (6)$$
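The gradients in Eqs. (5) and (6) translate directly into a gradient ascent step. The following is a minimal NumPy sketch under our own naming (sigmoid and hs_update are not from the paper); it updates the inner-node vectors θ along one word's path in place and returns the accumulated gradient for $x_w$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_update(x_w, path_thetas, path_codes, eta):
    """One gradient ascent step of E (Eq. (4)) for a single target word.

    x_w         -- input vector (for CBOW, the averaged context of Eq. (7))
    path_thetas -- inner-node parameter vectors theta along the root-to-leaf path
    path_codes  -- Huffman codes d (0 or 1) paired with each theta
    eta         -- learning rate
    """
    grad_x = np.zeros_like(x_w)
    for theta, d in zip(path_thetas, path_codes):
        g = eta * (1 - d - sigmoid(x_w @ theta))  # shared factor of Eqs. (5) and (6)
        grad_x += g * theta                       # accumulate Eq. (6) before theta moves
        theta += g * x_w                          # ascend along Eq. (5), in place
    return grad_x
```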

Incremental Huffman Tree Merging Algorithm
The Huffman tree is constructed based on word frequencies, so once the corpus grows, the Huffman tree changes accordingly. The difficulty of word2vec online learning is how to create an incremental Huffman tree that still gives each word a unique Huffman code. In an incremental learning method based on a fully updated Huffman tree, almost half of the leaves could move from one side of the tree to the other, corresponding in the worst case to completely reversing the order of frequencies. This means the encoding of certain words changes, which affects the subsequent network learning results.
Retraining an incremental learning model over massive data for every update is time consuming and resource intensive. This paper therefore proposes an incremental learning method based on a tree merging algorithm. The algorithm is suitable for small-sample online incremental learning over massive data, without requiring large computing resources to retrain the entire sample.
Combined with an in-memory database, a production system based on this algorithm can support online real-time text processing applications. The Huffman tree merging schematic diagram is shown in Fig. 2.
There are two corpora, the existing corpus1 = ($w_1$, $w_2$, …, $w_v$) and the newly collected incremental corpus2. The existing Huffman tree is kept unchanged as the inherited tree, and the leaf node with the shortest path in the inherited tree is found to serve as the root position of the new Huffman tree. The updated Huffman tree thus contains both the unchanged inherited word nodes and the new corpus word nodes to be trained. For example, suppose word $w_1$ is encoded as 0001 in the inherited Huffman tree. When the new corpus also contains $w_1$, the new Huffman tree does not contain that word node, so $w_1$ is still encoded as 0001 as before, avoiding re-learning over all corpora. Then the shortest-path leaf node of the inherited Huffman tree is found, and the root node of the new Huffman tree is merged into it. Thus, based on the existing word vector model, each word still has a unique encoding for the hierarchical softmax computation. Through the tree merging method, the old tree model is maximally reused, saving the processing time of fully building a new tree. The core of the algorithm is to ensure that the original Huffman tree is unchanged. To reduce the computation in the subsequent neural network layer, the shortest-path node of the inherited Huffman tree is set as the root of the new tree. Finally, appropriate word vectors are found for all leaf nodes, and parameters $\theta$ for all internal nodes, so as to maximize the likelihood of the training samples.
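Reusing the Node class from the earlier sketch, the merging step could be realized as below. This is one possible reading under an assumption the paper leaves implicit: when the new tree's root is grafted onto the shortest-path leaf, the word that occupied that leaf is kept as a sibling leaf, so it retains a unique (one bit longer) code while every other inherited code stays intact.

```python
from collections import deque

def shortest_path_leaf(root):
    """Breadth-first search: the first leaf dequeued has the shortest root-to-leaf path."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.word is not None:  # leaf reached
            return node
        queue.extend([node.left, node.right])

def merge_trees(inherited_root, new_root):
    """Graft the new corpus's Huffman tree onto the inherited tree.

    All inherited codes stay valid; only the word at the graft point has its
    code extended by one bit (our assumption, see the text above).
    """
    graft = shortest_path_leaf(inherited_root)
    displaced = Node(graft.freq, word=graft.word)  # keep the displaced word as a leaf
    graft.word = None                              # the old leaf becomes an inner node
    graft.left, graft.right = new_root, displaced  # new subtree on one side
    return inherited_root
```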

WORD2VEC BASED ON INCREMENTAL LEARNING
After the corpus is collected online, the updated Huffman tree is obtained. Based on the updated Huffman tree, word vectors are generated using the traditional models, namely the original Continuous Bag-of-Words (CBOW) and Skip-gram models, as shown in Fig. 3.

Figure 3 The CBOW and Skip-gram model structures
The specific implementation details can be found in introductions to the word2vec principle. Here we only outline the differences, paying particular attention to the features used.

CBOW Model
In the CBOW method, the central word is predicted by the surrounding words, and the gradient ascent algorithm adjusts the surrounding words according to the prediction result of the central word. When training is completed, each word has served as the central word and the word vectors of its surrounding words have been adjusted, thus yielding the word vectors of all words in the whole text. The input layer value is the average of the $2c$ word vectors around word $w$:

$$x_w = \frac{1}{2c} \sum_{i=1}^{2c} x_i \qquad (7)$$

The hidden layer parameter vector $\theta_{j-1}^w$ and the input vector $x_w$ are updated using the gradient ascent method based on Eqs. (5) and (6), as shown in Eqs. (8) and (9):

$$\theta_{j-1}^w \leftarrow \theta_{j-1}^w + \eta \left(1 - d_j^w - \sigma(x_w^{\mathrm{T}} \theta_{j-1}^w)\right) x_w \qquad (8)$$

$$x_i \leftarrow x_i + \eta \sum_{j=2}^{l_w} \left(1 - d_j^w - \sigma(x_w^{\mathrm{T}} \theta_{j-1}^w)\right) \theta_{j-1}^w, \quad i = 1, 2, \ldots, 2c \qquad (9)$$
where $\eta$ is the learning rate of gradient ascent.
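Putting Eqs. (7)-(9) together, one CBOW step could look like the following sketch (cbow_step and the reuse of hs_update from above are our own construction, not the paper's code):

```python
def cbow_step(context_vecs, path_thetas, path_codes, eta):
    """One CBOW training step for a single centre word.

    context_vecs -- the 2c word vectors x_i surrounding the centre word w
    path_thetas, path_codes -- the centre word's Huffman path (see hs_update)
    """
    x_w = np.mean(context_vecs, axis=0)                    # Eq. (7): average the window
    grad_x = hs_update(x_w, path_thetas, path_codes, eta)  # Eq. (8): thetas move in place
    for x_i in context_vecs:                               # Eq. (9): every context word
        x_i += grad_x                                      # receives the same gradient
```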

Skip-gram Model
Skip-gram uses a central word to predict the surrounding words. In Skip-gram mode, gradient ascent is likewise used to continuously adjust the word vectors according to the prediction results of the surrounding words. After all the texts are traversed, the word vectors of all words are obtained. $x_w$ is the word vector corresponding to the word $w$; note that there are $2c$ word vectors $x_i$ around $x_w$ here. Since the contexts are mutual, while expecting $P(x_i \mid x_w)$, $i = 1, 2, \ldots, 2c$, to be maximized, $P(x_w \mid x_i)$ is expected to be maximized as well. Thus in one iteration window, not only the word vector $x_w$ is updated, but also all $2c$ context vectors $x_i$, $i = 1, 2, \ldots, 2c$. In this way the overall iteration is more balanced. The specific learning method is the same as Eqs. (8) and (9), and is not repeated here.
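A corresponding sketch for one Skip-gram window, following the common word2vec convention in which each context vector serves as the input against the centre word's Huffman path (skipgram_step is our own illustrative name):

```python
def skipgram_step(centre_path, context_vecs, eta):
    """One Skip-gram window: all 2c context vectors x_i are adjusted against the
    centre word's path, realizing the 'contexts are mutual' update balance."""
    path_thetas, path_codes = centre_path
    for x_i in context_vecs:
        x_i += hs_update(x_i, path_thetas, path_codes, eta)  # same rules as Eqs. (8), (9)
```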
In Skip-gram, each word is influenced by its surrounding words: whenever it falls inside a window, it is predicted and adjusted against each of the $2c$ context positions. Therefore, when the amount of data is small, or rare words occur only a few times, these multiple adjustments make the word vectors relatively more accurate.

THE ANALYSIS OF EXPERIMENTAL RESULTS
To verify the efficiency and effectiveness of incremental training for online word vector generation, an experimental evaluation has been performed on real data. Model performance covers many aspects [18]; the experiments in this paper consider two of them: training time and the semantic similarity of the distributed vectors. The experimental results are compared with the regular word2vec technique and with the existing incremental learning method based on global Huffman tree updating.

Training Time and Efficiency
For the near real-time streaming news corpus learning task, the size of each new update fluctuates little, remaining around 10 kB per processing interval. The collected news data is divided into several sets, and 100 kB, 1 MB, 10 MB, and 100 MB of data are selected as the initial training corpora, i.e., the existing corpus of the previous sections.
For traditional global training, the old and new corpora are combined as a whole, and the original CBOW and Skip-gram training models are run on the combination. For incremental training, starting from the models trained on the initial corpora of different sizes, the incremental learning algorithm is run to update the Huffman tree, the parameters, and the word vectors without a global rebuild.
The comparison results can be seen in Fig. 4. The four curves represent global learning based on CBOW, global learning based on Skip-gram, incremental learning based on CBOW, and incremental learning based on Skip-gram.
In global learning mode, as the amount of training data increases, the global training time grows longer and longer. Although the added 10 kB is small compared with the original training sizes of 100 kB to 100 MB, the amount of data involved in retraining remains large. For small-sample online learning, the incremental mode greatly reduces training time. Moreover, Skip-gram is an order of magnitude slower than CBOW: comparing the two shows that for each context word, Skip-gram must update the parameters along the path from the root to that word. Furthermore, incremental training for both CBOW and Skip-gram benefits from the algorithm and is faster than the global training mode.

There are two ways to update the Huffman tree in incremental training. One rebuilds the tree in a global pattern; the other is the online Huffman tree merging method proposed in this paper. Fig. 5 shows the time comparison of the two Huffman tree update modes. The tree merging method greatly reduces Huffman tree coding time. It is worth noting that the Huffman tree generation time at 1 MB is less than at 100 kB, because the new 10 kB of data overlaps more with the 1 MB corpus, so fewer nodes need to be updated.

Word Similarity
The most important purpose of the word2vec model is to generate word vectors that map words into high-dimensional vector spaces [19, 20], where vector operations between words correspond to semantics. Next, word similarity evaluation benchmarks are used to evaluate the correctness of the proposed incremental training algorithm, as shown in Fig. 6.
Both modes are capable of extracting word vector features. In the incremental learning mode based on global Huffman tree update, the word vectors of all words are generated from the reconstructed Huffman tree codes, so word similarity changes with each new incremental corpus. In the incremental learning mode based on Huffman tree merging, the feature distribution of the word vectors shows a better effect and can be better used for downstream applications such as text classification and emotion recognition.

CONCLUSION
In text processing, small-sample incremental learning is an important application scenario. For large data volumes and high real-time performance requirements, an online word vector model generation method based on incremental Huffman tree merging is proposed.
Keeping the inherited Huffman tree of the existing word vector model unchanged, a new Huffman tree is constructed for the incoming words such that it shares no leaf node with the inherited tree. An updated Huffman tree is then constructed by a node merging method: the leaf node with the shortest path in the inherited Huffman tree is found and used as the root of the new Huffman tree. Thus, based on the existing word vector model, each word still has a unique encoding for the hierarchical softmax computation. Finally, the incremental word vector model is generated by training a neural network on top of the hierarchical softmax model. The experimental results show that the method realizes online word vector model generation based on incremental learning with less training time and better performance.