A Named Entity Recognition Method Enhanced with Lexicon Information and Text Local Feature

Abstract: At present, Named Entity Recognition (NER) is one of the fundamental tasks for extracting knowledge from traditional Chinese medicine (TCM) texts. The variable length of TCM entities and the linguistic characteristics of TCM texts blur the boundaries of TCM entities. In addition, better extraction and exploitation of local text features can improve the accuracy of named entity recognition. In this paper, we propose a TCM NER model with lexicon information and text local feature enhancement. In this model, a lexicon is introduced to encode the characters in the text and obtain a context-sensitive global semantic representation of the text. A convolutional neural network (CNN) and a gate joined collaborative attention network form a text local feature extraction module that captures important local semantic features. Experiments were conducted on two TCM-domain datasets, on which the model achieves F1 values of 91.13% and 90.21%, respectively.


INTRODUCTION
Traditional Chinese medicine (TCM) is one of the common means of modern medical treatment. With the continuous development of medical technology, TCM needs to continuously improve its knowledge system. Through thousands of years of inheritance, TCM has accumulated a large amount of valuable experience and literature containing a wealth of scientific TCM knowledge. Ancient TCM literature is the crystallization of the wisdom of generations of TCM practitioners, containing clinical knowledge such as diagnosis, prescription and medication rules, which has important practical significance for modern clinical treatment. In recent years, how to extract useful TCM knowledge from unstructured ancient TCM texts and apply it in practice has received considerable attention from researchers.
Named Entity Recognition (NER) [1], as a method of knowledge acquisition, refers to identifying entities with special meaning in unstructured texts and classifying them into predefined meaningful categories. Generally, the TCM named entity recognition task is defined as a sequence labelling problem that aims to assign a label to each character in the input text sequence. The variable length of TCM entities and the linguistic characteristics of TCM texts blur entity boundaries. For example, "abdominal pain and bloating" is a symptom entity that contains the shorter symptom entity "abdominal pain". This problem makes many named entity recognition models designed for English corpora ineffective on datasets of ancient Chinese medicine books. In view of the unclear boundaries of TCM entities, existing TCM named entity recognition methods mainly use combined character and word embeddings as the input of the model [2, 3] to help the model represent word features more accurately. In these methods, the words are obtained with a Chinese word segmentation tool, while the language of ancient Chinese medicine books is mostly classical Chinese, which differs greatly from modern Chinese; this leads to incorrect word segmentation and thus degrades model performance. In addition, segmentation-based methods cannot recover all the potential words of a sentence, which also affects the accuracy of the model's predictions in practical applications. Therefore, accurately capturing all potential words in ancient TCM texts is key to improving model performance.
In addition, at the model construction level, most existing methods focus on extracting the contextual semantic features of text. For example, Zhang et al. [4] and Deng et al. [5] use BiLSTM to capture contextual semantic features, and Ma et al. [6] designed a multi-granularity text encoder to extract contextual semantic features from multiple dimensions. These works focus on extracting the overall contextual information of the text. By analyzing the language characteristics of ancient TCM texts, we find that some TCM knowledge is expressed in the form of phrases without a complete grammatical structure. Therefore, when building the model, we should consider both the contextual semantic features and the local semantic features of the input text, and make full use of their respective advantages to improve the overall performance of the model. The lexicon [7] is used to enhance the entity boundary discrimination of the model, and the local features of the text are used to enhance its semantic discrimination.
In named entity recognition, feature extraction from text is very important: the effectiveness and richness of the feature information often determine the performance of the entity recognition model. In recent years, many studies have shown that attention mechanisms [8, 9] and gate mechanisms [10, 11] achieve good results in feature extraction and selection. An attention mechanism can distinguish, through weights, the degree to which information is attended, determining which features are emphasized; therefore, we use an attention mechanism to fuse character information and word information. A gate mechanism controls the flow of information in the model through weighting functions and can adaptively select or discard information according to its importance; therefore, we apply gate mechanisms to text feature fusion and local feature enhancement.
Based on the above analysis, we propose a TCM named entity recognition model with lexicon information and text local feature enhancement. Specifically, at the model input level, we match the input text against a TCM lexicon to obtain all potential word information and generate a corresponding word sequence for the input text sequence. To use all the matched word information efficiently, we apply an attention mechanism to dynamically extract the most relevant word semantics for each character. For local feature modelling, we propose a local feature awareness network composed of two parts: a dual-channel convolutional neural network (CNN) unit, which captures local semantic features of different granularity, and a gate joined collaborative attention network, which lets local information of different granularity reinforce each other. In addition, while capturing local features, the dual-channel CNN unit can also capture latent word semantics, compensating for the impact of words missing from the lexicon on model performance. We also designed a complementary fusion gate to adaptively fuse contextual semantic features with local semantic features. The Puji Fang and the Materia Medica are authoritative works on TCM prescriptions and Chinese herbal medicine, and we used them as experimental data. The main contributions of this paper are as follows.
(1) To obtain better model performance on the TCM domain corpus, we created a TCM lexicon containing 102,800 words covering prescription names, herb names, symptom names, tongue names, pulse names, syndrome names and other related terms, and generated a word embedding vector representation for each word using a self-attention mechanism and positional features.
(2) We proposed a two-branch model architecture that jointly utilizes a lexicon fusion approach and a local feature extraction approach for the NER task.
(3) Experiments on two TCM datasets show that our method improves the F1 value by 0.51% and 0.73% over the best baseline. The ablation experiments and analysis further justify the proposed method.

METHODS
In this section, we first introduce the construction of the lexicon and the word vectors, then describe the architecture of the TCM named entity recognition model based on multi-feature fusion, and give an example to show how our model works.

TCM Lexicon and Word Vector Construction
The vocabulary in the TCM lexicon constructed in this paper is derived from the TCM-related words in Sogou's thesaurus, TCM medical cases, Chinese herbal medicine manuals and other related documents; in total, 102,800 TCM words were collected. Since traditional word segmentation tools perform poorly on ancient TCM texts, the word embedding vector representations of the vocabulary cannot be generated directly from a large-scale corpus. Therefore, we first used Word2vec [12] to convert each character of a word into a $d$-dimensional embedding vector, trained on a large-scale TCM text corpus, and concatenated these embedding vectors to form a feature matrix representation of the word:

$\text{word\_embedding} = [\mathrm{Word2vec}(c_1); \mathrm{Word2vec}(c_2); \dots; \mathrm{Word2vec}(c_k)]$

where $\text{Word} = \{c_1, c_2, \dots, c_k\}$ is the vocabulary item to be vectorized and $\text{word\_embedding}$ is the feature matrix after vectorization. Then, to account for the position of each character within the word, a positional feature vector $p_t$ is added to each character embedding to obtain the final feature matrix. The positional features of the $t$-th character are calculated as:

$p_{(t, 2i)} = \sin\!\left(t / 10000^{2i/d}\right), \quad p_{(t, 2i+1)} = \cos\!\left(t / 10000^{2i/d}\right)$

where $i \in [0, d/2)$ and $d$ denotes the dimensionality of the input character vector. Finally, we encode the feature matrix using the self-attention mechanism [13] and obtain the final word embedding vector representation $\text{word\_vec}$ by average pooling:

$\text{word\_vec} = \mathrm{mean\text{-}pooling}(\mathrm{SelfAttention}(\text{word\_embedding} + p))$

where $\mathrm{SelfAttention}$ denotes the self-attention mechanism and $\mathrm{mean\text{-}pooling}$ denotes average pooling.
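To make this construction concrete, the following is a minimal PyTorch sketch of the pipeline, assuming a stand-in character embedding table in place of the Word2vec vectors trained on the TCM corpus; the dimensionality, head count, and vocabulary size are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class WordVectorBuilder(nn.Module):
    """Sketch of the word-vector construction: character embeddings plus
    sinusoidal positional features, encoded by self-attention and mean-pooled."""
    def __init__(self, char_vocab_size: int, d: int = 128, n_heads: int = 4):
        super().__init__()
        # Stand-in for the Word2vec character embeddings trained on a TCM corpus.
        self.char_emb = nn.Embedding(char_vocab_size, d)
        self.self_att = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.d = d

    def positional_features(self, length: int) -> torch.Tensor:
        # p[t, 2i] = sin(t / 10000^(2i/d)), p[t, 2i+1] = cos(t / 10000^(2i/d))
        t = torch.arange(length).unsqueeze(1).float()
        i = torch.arange(0, self.d, 2).float()
        angle = t / torch.pow(10000.0, i / self.d)
        p = torch.zeros(length, self.d)
        p[:, 0::2] = torch.sin(angle)
        p[:, 1::2] = torch.cos(angle)
        return p

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (1, word_len) character indices of a single lexicon word
        x = self.char_emb(char_ids) + self.positional_features(char_ids.size(1))
        encoded, _ = self.self_att(x, x, x)  # self-attention over characters
        return encoded.mean(dim=1)           # average pooling -> word_vec

builder = WordVectorBuilder(char_vocab_size=6000)
word_vec = builder(torch.tensor([[11, 42]]))  # a two-character word
print(word_vec.shape)                          # torch.Size([1, 128])
```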

Model Framework
The overall framework of our proposed model is shown in Fig. 1. It consists of four main components: an encoder module, a lexicon enhanced text global feature extraction module, a text local feature extraction module and a decoder module. The encoder module produces vectorized representations of the input character and word sequences. The lexicon enhanced text global feature extraction module integrates words into the character information, aiming to use word information to enrich sentence semantics and clarify word boundary information. The text local feature extraction module captures potential word information and local features in the original semantic information, and the decoder module adaptively fuses features and predicts the class of entity labels.

1) Word Sequence Construction
Chinese sentences are usually expressed as sequences of characters and do not explicitly contain word information. This prevents the model from modelling word information clearly, resulting in poor performance. To make full use of the word information in a sentence, we use the lexicon to identify all potential words in the sentence and generate a new word sequence. Specifically, given a lexicon and a sentence, we build a lexicon tree (trie) for the lexicon, then traverse all characters in the sentence and obtain all words by matching against the lexicon tree, forming a word sequence in turn. For the example sentence "哮喘且喉有痰声" shown in Fig. 1, lexicon matching yields the word sequence {"哮喘", "喉有痰声"}.
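The following is a minimal, self-contained sketch of trie-based lexicon matching; the per-position cap of 3 matches and the discarding of single-character words follow the implementation details reported later, while the exact policy for overlapping matches is an assumption.

```python
class LexiconTrie:
    """Minimal trie sketch for finding all potential lexicon words
    in a character sequence."""
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["#"] = w  # end-of-word marker

    def match_sentence(self, sentence, max_matches=3):
        matched = []
        for start in range(len(sentence)):
            node, found = self.root, []
            for ch in sentence[start:]:
                if ch not in node:
                    break
                node = node[ch]
                if "#" in node and len(node["#"]) > 1:  # discard single chars
                    found.append(node["#"])
            matched.extend(found[:max_matches])  # cap matches per position
        return matched

trie = LexiconTrie(["哮喘", "喉有痰声"])
print(trie.match_sentence("哮喘且喉有痰声"))  # ['哮喘', '喉有痰声']
```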

2) Encoder and Embedding
We used a pre-trained model for the vectorized representation of sentences, which addresses the drawbacks of traditional static word vector methods; the pre-trained model contains rich prior knowledge, which is useful for NER tasks [14, 15]. Given a sentence $S = \{x_1, x_2, \dots, x_n\}$, where $n$ denotes the length of the input text sequence, the sentence is represented as an embedding sequence $E^c = \{e^c_{CLS}, e^c_1, \dots, e^c_i, \dots, e^c_n\}$, with $e^c_i$ denoting the vector representation of each character and $e^c_{CLS}$ denoting the overall semantic information of the sentence. The encoding process of the sentence can be represented formally as:

$E^c = \mathrm{BERT}(S), \quad E^c \in \mathbb{R}^{(n+1) \times d}$

where $d$ denotes the dimensionality of the hidden states output by the pre-trained model. We used the word vectors constructed above to convert the generated word sequence $W = \{w_1, \dots, w_j, \dots, w_m\}$ into the word embedding sequence $E^{w_1} = \{e^w_1, \dots, e^w_j, \dots, e^w_m\}$, where $m$ denotes the length of the word sequence and $d_1$ denotes the dimensionality of the word vectors. The word sequence encoding process can be represented formally as:

$E^{w_1} = \mathrm{WordVec}(W)$

where $\mathrm{WordVec}$ denotes the lookup of the pre-constructed word vectors.
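As a concrete sketch of the character encoding step, the snippet below uses the HuggingFace `transformers` package with the `bert-base-chinese` checkpoint; the paper reports a Chinese BERT-base encoder, but the specific checkpoint name and library are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

# Character-level encoding sketch: Chinese BERT tokenizes most text at the
# character level, matching the per-character embedding sequence E^c.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "哮喘且喉有痰声"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
E_c = outputs.last_hidden_state  # (1, n + 2, d): [CLS], n characters, [SEP]
print(E_c.shape)                 # d = 768 for BERT-base
```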

Lexicon Enhanced Text Global Feature Extraction Module

1) Char-word Attention (CW-ATT)
The input sentence yields a word sequence after lexicon matching, and our goal is to fuse the word information with the character information. Inspired by the study of Li et al. [16], we proposed the character-word attention mechanism. Unlike Li et al., we perform the attention computation directly between the character sequence and the word sequence, instead of splicing the word sequence onto the character sequence and then performing self-attention, which has the advantage of reducing the computational cost.
The character-word attention takes the character embedding sequence $E^c$ and the word embedding sequence $E^{w_1}$ as input. Before calculating the attention weights, the dimensions of the two vectors need to be aligned, i.e., the word embedding sequence is projected as:

$E^{w_2} = \tanh(E^{w_1} W_1 + b_1)$

and the attention matrix is obtained as:

$Att = \mathrm{softmax}\big(\tanh(E^c W_2 + b_2)\,(E^{w_2})^{\top}\big)$ (8)

where $W_1$ and $W_2$ are trainable weight matrices and $b_1$ and $b_2$ are bias matrices. Then we computed the weighted word information from the attention matrix $Att$ and the aligned word embedding sequence:

$Z = Att\,E^{w_2}$

Finally, we injected the weighted information into the character embedding sequence:

$E^{c \cdot w} = E^c + Z$

Subsequently, $E^{c \cdot w}$ was encoded using dropout and layer normalization layers.
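The sketch below implements this character-word attention in PyTorch, assuming single-head dot-product attention with a linear alignment layer and a dropout rate of 0.1; the paper's exact projection scheme and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class CharWordAttention(nn.Module):
    """Sketch of CW-ATT: each character attends over the matched lexicon-word
    embeddings, and the weighted word semantics are added back to the
    character embeddings."""
    def __init__(self, d_char: int, d_word: int):
        super().__init__()
        self.align = nn.Linear(d_word, d_char)   # W1, b1: dimension alignment
        self.query = nn.Linear(d_char, d_char)   # W2, b2
        self.norm = nn.LayerNorm(d_char)
        self.drop = nn.Dropout(0.1)

    def forward(self, E_c: torch.Tensor, E_w1: torch.Tensor) -> torch.Tensor:
        # E_c: (batch, n, d_char) characters; E_w1: (batch, m, d_word) words
        E_w2 = torch.tanh(self.align(E_w1))                     # (batch, m, d_char)
        q = torch.tanh(self.query(E_c))                         # (batch, n, d_char)
        att = torch.softmax(q @ E_w2.transpose(1, 2), dim=-1)   # (batch, n, m)
        Z = att @ E_w2                                          # weighted word info
        return self.norm(self.drop(E_c + Z))                    # inject into chars
```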
2) Transformer Encoder

The character-word attention incorporates word information into the character information, but the contexts do not yet interact sufficiently, so the current character embedding sequence $E^{c \cdot w}$ cannot express the full semantics of the sentence. We used the encoder part of the Transformer to let the word-enhanced character embedding sequence $E^{c \cdot w}$ interact fully. The Transformer encoder consists of a multi-head attention layer, a feed-forward neural network layer, and layer normalization with residual connections, and has good global modelling and parallel computing capability.

Given the character embedding sequence $E^{c \cdot w}$, we first added position embedding information to $E^{c \cdot w}$ to obtain the embedding sequence $H$, and then input $H$ into the Transformer encoder:

$H' = \mathrm{LN}(H + \mathrm{MultiHAtt}(H))$
$H_G = \mathrm{LN}(H' + \mathrm{FFN}(H'))$

where $\mathrm{MultiHAtt}$ is the multi-head attention mechanism, $\mathrm{LN}$ is the layer normalization, and $\mathrm{FFN}$ is the feed-forward neural network. The equations for multi-head attention are as follows:

$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h, \quad \mathrm{MultiHAtt}(H) = [\mathrm{head}_1; \dots; \mathrm{head}_N]\, W_O$

where $Q_h$, $K_h$ and $V_h$ are obtained by mapping the embedding sequence $H$, and $h$ denotes the index of the head.
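A standard Transformer encoder as shipped with PyTorch realizes exactly this computation; the layer count, head count, and feed-forward width below are illustrative assumptions rather than the settings in Tab. 2.

```python
import torch
import torch.nn as nn

# Context-interaction sketch over the word-enhanced character sequence H.
d_model, n_heads, n_layers = 768, 8, 2
layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

H = torch.randn(1, 7, d_model)  # (batch, n, d): E^{c.w} plus position info
H_G = encoder(H)                # lexicon-enhanced global contextual features
print(H_G.shape)                # torch.Size([1, 7, 768])
```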

Text Local Feature Extraction Module
There are many special cases in ancient TCM texts; for example, "牛黄解毒丸" and "牛黄" belong to the prescription and materia medica categories, respectively, so local feature modelling plays an important role in TCM named entity recognition. CNNs [17, 18] are widely used in image processing because of their efficient local modelling performance. Therefore, we used CNNs to model the local features of the sentences. Local feature modelling can capture not only local semantic information but also potential word information, which helps the model clarify entity boundary information.
The analysis of the linguistic features of ancient TCM texts shows that TCM knowledge entities usually consist of two-character and four-character words, which is also argued in the literature [19]. Starting from these textual characteristics, we designed a two-channel CNN to capture local features of different granularity by setting up one-dimensional CNNs with window sizes of 2 and 4, respectively:

$H_k = \tanh(\mathrm{CNN}(E^{c \cdot w}; W_k, b_k)), \quad k \in \{2, 4\}$

where $H_k \in \mathbb{R}^{n \times d_k}$ is the local feature vector generated by the one-dimensional CNN, $k$ denotes the size of the convolution kernel, and $W_k$ and $b_k$ denote the learnable weight matrix and bias parameters, respectively.
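The dual-channel unit can be sketched as follows; the 'same'-style padding that keeps the sequence length is an implementation assumption.

```python
import torch
import torch.nn as nn

class DualChannelCNN(nn.Module):
    """Sketch of the two-channel local feature extractor: 1-D convolutions
    with window sizes 2 and 4, matching the two-character/four-character
    entity pattern described in the text."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # Padding keeps the output length >= n; the extra step is sliced off.
        self.conv2 = nn.Conv1d(d_in, d_out, kernel_size=2, padding=1)
        self.conv4 = nn.Conv1d(d_in, d_out, kernel_size=4, padding=2)

    def forward(self, E: torch.Tensor):
        # E: (batch, n, d_in) -> Conv1d expects (batch, d_in, n)
        x = E.transpose(1, 2)
        n = E.size(1)
        H2 = torch.tanh(self.conv2(x))[:, :, :n].transpose(1, 2)
        H4 = torch.tanh(self.conv4(x))[:, :, :n].transpose(1, 2)
        return H2, H4  # (batch, n, d_out) each
```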
The local text semantic information of interest differs across entity types. Therefore, enhancing the relevant local textual feature representations can identify entities more accurately. With this motivation, we proposed a gate joined collaborative attention network based on the collaborative attention network [20], which lets local features of different granularity interact and uses a gate mechanism to control the information flow, enhancing local features while preserving the semantic features of the underlying vocabulary. The structure of the gate joined collaborative attention network is shown in Fig. 2, and the specific implementation process is given in Eqs. (14) to (18):

$C = \tanh(H_2 W_s H_4^{\top})$ (14)
$A = \mathrm{softmax}(C)\, H_4, \quad B = \mathrm{softmax}(C^{\top})\, H_2$ (15)
$G_A = \sigma(\mathrm{FF}(A)), \quad G_B = \sigma(\mathrm{FF}(B))$ (16)
$H_A = \mathrm{FF}(H_2 + G_A \odot A), \quad H_B = \mathrm{FF}(H_4 + G_B \odot B)$ (17)
$H_L = [H_A; H_B]$ (18)

where $C$ denotes the similarity matrix, $A$ and $B$ are the outputs of the collaborative attention network, $\sigma$ is the sigmoid activation function, and $\mathrm{FF}$ denotes the feed-forward neural network. Specifically, Eq. (14) calculates the similarity matrix between local features of different granularity, denoted as $C$, and the interaction feature matrices, denoted as $A$ and $B$, are then computed separately using Eq. (15). The flow of information is controlled by the gates in Eq. (16), which serve to retain the common semantic features related to the current semantic features while forgetting the irrelevant common semantic features to avoid information redundancy, denoted as $G_A$ and $G_B$. The gated features are incorporated into the local feature vectors by Eq. (17), and the final feature vector representations $H_A$ and $H_B$ are obtained by a feed-forward neural network. Finally, the local feature vector $H_L$ is calculated using Eq. (18).
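A compact PyTorch sketch of this network follows; since the gating and fusion details are only partly recoverable from the text, the bilinear similarity, the linear gate layers, and the concatenation in the final step are assumptions consistent with the equations above.

```python
import torch
import torch.nn as nn

class GateJoinedCoAttention(nn.Module):
    """Sketch of the gate joined collaborative attention network: similarity
    between the two granularities, cross-attention interaction, sigmoid gates
    to filter redundant common features, then fusion."""
    def __init__(self, d: int):
        super().__init__()
        self.W_s = nn.Parameter(torch.randn(d, d) * 0.02)  # bilinear similarity
        self.ff_gate_a = nn.Linear(d, d)
        self.ff_gate_b = nn.Linear(d, d)
        self.ff_out_a = nn.Linear(d, d)
        self.ff_out_b = nn.Linear(d, d)

    def forward(self, H2: torch.Tensor, H4: torch.Tensor) -> torch.Tensor:
        # H2, H4: (batch, n, d) local features with window sizes 2 and 4
        C = torch.tanh(H2 @ self.W_s @ H4.transpose(1, 2))    # Eq. (14)
        A = torch.softmax(C, dim=-1) @ H4                     # Eq. (15)
        B = torch.softmax(C.transpose(1, 2), dim=-1) @ H2
        G_A = torch.sigmoid(self.ff_gate_a(A))                # Eq. (16)
        G_B = torch.sigmoid(self.ff_gate_b(B))
        H_A = self.ff_out_a(H2 + G_A * A)                     # Eq. (17)
        H_B = self.ff_out_b(H4 + G_B * B)
        return torch.cat([H_A, H_B], dim=-1)                  # Eq. (18): H_L
```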

Decoder Module

1) Complementary Fusion Gate
In this section, we designed the complementary fusion gate to control the information flow and select the important information from the contextual semantic features and the local semantic features. As shown in Fig. 1, the inputs of the complementary fusion gate are the contextual semantic feature representation $H_G$, which fuses the lexicon information, and the local semantic feature representation $H_L$, which does not. The output of the complementary fusion gate is:

$z = \sigma(W_{z_1} H_G + W_{z_2} H_L + b_z), \quad H = z \odot H_G + (1 - z) \odot H_L$ (19)

where $\sigma(\cdot)$ is the sigmoid function, $W_{z_1}$ and $W_{z_2}$ are the learnable weight matrices, and $b_z$ is the bias matrix. The feature matrices $H_G$ and $H_L$ are transformed into the same dimension before the calculation of Eq. (19).
The complementary fusion gate aims to further enhance the lexicon-fused contextual semantic feature representation using the latent word semantic features captured by the dual-channel local feature-aware network and the local semantic features of different granularity. We believed that directly incorporating $H_L$ into $H_G$, or assigning weights only to $H_L$, would result in information redundancy, so we assign weights to both $H_G$ and $H_L$ to achieve an adaptive balance between the two.
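A minimal sketch of the gate, assuming linear projections handle the dimension alignment mentioned above:

```python
import torch
import torch.nn as nn

class ComplementaryFusionGate(nn.Module):
    """Sketch of the complementary fusion gate: a sigmoid gate z weights the
    lexicon-enhanced contextual features H_G and (1 - z) weights the local
    features H_L, balancing the two adaptively (Eq. (19))."""
    def __init__(self, d_g: int, d_l: int, d: int):
        super().__init__()
        self.proj_g = nn.Linear(d_g, d)   # bring H_G and H_L to one dimension
        self.proj_l = nn.Linear(d_l, d)
        self.W_z1 = nn.Linear(d, d, bias=False)
        self.W_z2 = nn.Linear(d, d)       # carries the bias b_z

    def forward(self, H_G: torch.Tensor, H_L: torch.Tensor) -> torch.Tensor:
        H_G, H_L = self.proj_g(H_G), self.proj_l(H_L)
        z = torch.sigmoid(self.W_z1(H_G) + self.W_z2(H_L))
        return z * H_G + (1 - z) * H_L    # complementary weighting
```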
2) Decoder

The conditional random field (CRF) [21] is able to consider the dependencies between successive labels and to enhance the constraint information between adjacent labels, which can be learned automatically during model training. Therefore, this paper used a CRF as the decoder to obtain the globally optimal tag sequence. For a TCM sentence $X = \{x_1, \dots, x_n\}$ with label sequence $y = \{y_1, \dots, y_n\}$, the score of the sequence can be expressed as:

$s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} T_{y_i, y_{i+1}}$

where $P_{i, y_i}$ denotes the predicted label probability of character $x_i$ and $T$ is the transition matrix. The conditional probability of the actual output label sequence $y$ is:

$p(y \mid X) = \frac{\exp(s(X, y))}{\sum_{\tilde{y} \in Y_X} \exp(s(X, \tilde{y}))}$

Finally, the optimal sequence [22] is calculated using the Viterbi algorithm [23], with the objective loss function:

$L = -\log p(y \mid X) + \frac{\lambda}{2} \lVert \theta \rVert^2$

where $\lambda$ is the hyperparameter of the L2 regularization and $\theta$ denotes the trainable parameters.
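The CRF loss and Viterbi decoding can be sketched with the third-party `pytorch-crf` package; both the package choice and the tag count are assumptions, since the paper does not name its CRF implementation or tagging scheme.

```python
import torch
from torchcrf import CRF  # third-party pytorch-crf package (an assumption)

num_tags = 9  # e.g., BIO tags over a few TCM entity types plus 'O' (illustrative)
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 7, num_tags)        # P: per-character tag scores
tags = torch.randint(0, num_tags, (2, 7))      # gold label sequences
mask = torch.ones(2, 7, dtype=torch.bool)

loss = -crf(emissions, tags, mask=mask)        # negative log-likelihood
best_paths = crf.decode(emissions, mask=mask)  # Viterbi decoding
```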

EXPERIMENTAL SETTINGS

Datasets
To evaluate our model, we conducted experiments on two TCM datasets: the TCM prescription dataset and the TCM materia medica dataset. The TCM prescription dataset was derived from the Puji Fang, and the TCM materia medica dataset was derived from the Materia Medica. The data were manually labelled under the guidance of TCM experts, and the detailed statistics are shown in Tab. 1.

Baseline
To verify the effectiveness of our approach, we compare it with the following models. BiLSTM-CRF [5] is one of the baseline models for the TCM NER task. BERT-BiLSTM-CRF [4] is one of the commonly used models for named entity recognition. Both of these models contain a BiLSTM module, which consists of a forward LSTM and a backward LSTM; LSTM [24, 25] is well suited to modelling textual data. LEBERT [26] integrates lexicon information into the lower BERT layers and achieved state-of-the-art results for Chinese sequence labelling. Ma et al. [27] proposed to use label information to match entities in text and achieved good results on low-resource Chinese datasets.

Evaluation Metrics
Our evaluation metrics follow previous work [28], i.e., precision (P), recall (R) and F1 value [29]. We performed five runs for each implementation and report the average as the final experimental result.
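Entity-level P/R/F1 can be computed with the third-party `seqeval` package, as in the sketch below; the package choice, the BIO tagging scheme, and the label names are illustrative assumptions rather than the paper's reported setup.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Hypothetical BIO-tagged gold and predicted sequences for one sentence.
y_true = [["B-SYM", "I-SYM", "O", "B-HERB", "I-HERB"]]
y_pred = [["B-SYM", "I-SYM", "O", "B-HERB", "O"]]

print(precision_score(y_true, y_pred))  # entity-level precision
print(recall_score(y_true, y_pred))     # entity-level recall
print(f1_score(y_true, y_pred))         # entity-level F1
```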

Implementation Details
We used a pre-trained BERT-base (Chinese) model as the encoder for the character sequences, with an initial learning rate of 2e-5 for BERT and a learning rate of 1e-4 for the other parameters in the model. The detailed hyperparameter settings of the model are shown in Tab. 2.
In the experiments, the maximum number of lexicon tree word matches was set to 3, and the matching process discards words consisting of a single character.
Our model was implemented in PyTorch and trained on an NVIDIA GeForce RTX 2080Ti (11 GB) GPU.
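The two-learning-rate setup can be expressed as optimizer parameter groups, as sketched below with the reported rates (2e-5 for the BERT encoder, 1e-4 for everything else); the module layout with a `bert` submodule and the use of AdamW are assumptions.

```python
import torch.nn as nn
from torch.optim import AdamW

class NERModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(8, 8)  # stand-in for the BERT encoder
        self.head = nn.Linear(8, 4)  # stand-in for the remaining layers

model = NERModel()
bert_params = [p for n, p in model.named_parameters() if n.startswith("bert")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert")]
optimizer = AdamW([
    {"params": bert_params, "lr": 2e-5},   # reported BERT learning rate
    {"params": other_params, "lr": 1e-4},  # reported rate for other parameters
])
```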

EXPERIMENTS

Comparisons with Baseline Models
Tab. 3 shows the results of our model and the comparison models on the TCM prescription and TCM materia medica datasets. It can be observed that our model achieves the best F1 value, which indicates that it effectively exploits both the lexicon information and the captured local information to improve performance. The comparison between BERT-BiLSTM-CRF and BiLSTM-CRF shows that the pre-trained model plays an important role in improving performance. The core assumption of Ma et al.'s method is that label names carry information about the meaning of the labels; the label names in the TCM datasets are abstract generalizations, which leads to lower recognition accuracy for this method. The LEBERT model outperforms the BERT-BiLSTM-CRF model significantly, showing that the lexicon and the word vectors created in this paper are effective. Overall, our model achieved improvements of 0.51% and 0.73% on the TCM prescription and TCM materia medica datasets, respectively.

Ablation Studies
In this section, we carried out detailed ablation experiments on the TCM prescription and TCM materia medica datasets to show that each component or design in our model plays a vital role in entity extraction performance. As shown in Tab. 4, we use equations to denote the removal or replacement of a component in the model. We take BERT-Transformer-CRF as the base model and mark it as M. From Tab. 4, we can draw the following conclusions: (1) Using the Transformer encoder to capture the contextual semantic features is reasonable. Most previous work used BiLSTM to capture context information, so we replaced the Transformer encoder with BiLSTM to form a new model, called M1. As Tab. 4 shows, variant #2 outperforms #1, so using the Transformer encoder improves performance.
(2) Lexicon information enhancement is crucial to the performance of the model. We added the char-word attention to the base model to form a variant recorded as M+CW-ATT. Compared with M+CW-ATT, the F1 value of M decreases by 0.81% and 1.08% on the two datasets, which indicates that lexicon information improves model performance and also shows that the proposed character-word attention is effective.
(3) Local feature extraction is effective. We formed a new model, M+CW-ATT+TLFE, by adding the text local feature extraction (TLFE) module to the M+CW-ATT model. The F1 value of variant #4 increases by 0.81% and 0.94%, respectively, which indicates the importance of potential word information and local semantic features in the TCM NER task.
(4) Complementary feature fusion is crucial. We investigated the effect of the complementary fusion gate on model performance. Variant #5 directly merges context features and local features and is recorded as M2. Variant #6 assigns weights only to the local features before feature fusion and is recorded as M3. Variant #7 uses our complementary fusion gate to fuse context features and local features. The results show that variant #7 achieves the best results, consistent with our intuition about information redundancy.

Error Analysis
To fully understand the strengths and weaknesses of the model, we performed an error analysis. Through experiments, we found that the model is less effective at identifying symptom-like entities in the TCM data (taking the TCM prescription dataset as an example); Tab. 5 shows a 3.61% improvement in F1 value for our model over the BERT-BiLSTM-CRF model on this entity type. The lack of clear entity boundaries in Chinese text and the relatively complex structure and variable length of symptom entities lead to poorer recognition results. Although using lexicon information and acquiring potential word information improves entity recognition, further capturing boundary information accurately is the goal of the next stage of research.

CONCLUSIONS
In this paper, we proposed a TCM NER model with lexicon information and text local feature enhancement to deal with the problem of TCM entity boundary ambiguity in the NER task. Lexicon information is incorporated into the model so that it can better learn entity boundary information, and a text local feature extraction module is designed to capture latent lexicon information and local semantic features in sentences. Furthermore, a complementary fusion gate is proposed to adaptively fuse features. Our work provides a practical method for TCM practitioners to automatically extract valuable knowledge from ancient TCM texts. In future work, we will further investigate the TCM entity boundary blurring problem and the fine-grained TCM NER task to extract knowledge from ancient TCM texts more accurately.



Figure 2
Gate joined collaborative attention mechanism

Table 1
Datasets statistics

Table 2
Experimental parameter settings

Table 3
Experimental results

Table 4
Results of ablation experiments

Table 5
Effectiveness of different models for symptom recognition