Intelligent Case Assignment Method Based on the Chain of Criminal Behavior Elements

Abstract: Case assignment is the process by which a court assigns cases to specific judges. Traditional case assignment methods, based on the facts of a case, analyze the semantic structure of the case only weakly and do not consider judges' expertise. By analyzing judges' trial logic, we find that the order of criminal behaviors affects the final judgment. To solve these problems, we regard intelligent case assignment as a text-matching problem and propose an intelligent case assignment method based on the chain of criminal behavior elements. This method introduces the chain of criminal behavior elements to enhance the structured semantic analysis of the case. We build a BCTA (Bert-CNN-Transformer-Attention) model to achieve intelligent case assignment. The model integrates a judge's expertise into the judge's representation and thus recommends the most compatible judge for the case. Compared with the traditional case assignment methods, our BCTA model obtains an absolute improvement of 84% in P@1. Compared with other classic text matching models, it achieves an absolute improvement of 4% in P@1 and 9% in macro-F1. Experiments conducted on a real-world dataset demonstrate the superiority of our method.


INTRODUCTION
With the rapid development of big data and artificial intelligence technology, in the judicial field, countries all over the world are advancing judicial intelligence research. In this process, judicial intelligence assistance is a hot research issue of smart justice. In the complicated trial procedure, the assignment of cases is the starting point for cases to enter the trial procedure and the key to the reasonable allocation of judge resources. With the increasingly prominent contradiction of "more cases and fewer people", the study of an intelligent case assignment method, which is to protect the interests of the parties and ensure the compatibility between cases assigned and judges' professional ability, has great theoretical significance and application value.
According to surveys, the traditional case assignment mechanism has two modes: manual and random. In manual case assignment, the chief judge appoints judges; in random case assignment, a computer assigns judges. The traditional mechanism has two shortcomings. First, it analyzes the semantic structure of the case only weakly. Specifically, methods based only on the facts of cases tend to ignore the order of criminal behaviors, even though the same criminal behaviors in a different order often lead to different trial results. In the two Chinese cases shown in Fig. 1, both involved only murder and rape, but the different order of occurrence led to different trial results. In the first case, the defendant first raped and then murdered the victim and was convicted of intentional homicide and rape. In the second, the defendant first murdered and then raped the victim and was convicted of insulting a corpse and intentional homicide. Second, judges' professional ability is not taken into consideration when cases are assigned, which can easily lead to a mismatch between the assigned cases and judges' professional ability. Therefore, developing an intelligent case assignment method that solves these problems has great theoretical significance and application value.
During long-term judicial practice, courts have accumulated a large amount of judgment document data. A judgment document is composed of the facts of a case, the results of the judgment, the judges, and so on. Judgment documents are usually presented in text form and contain important case information and knowledge. How to extract valuable information from this text data to serve intelligent case assignment is the key issue in current research. By analyzing judges' trial logic, we find that judges focus on the order of criminal behaviors when making a final judgment. In this process, the behavioral elements of a case play a key role. The behavioral elements of a case comprise the behavior words that describe the course of the case and the key elements related to those behavior words. Generally, a case can be regarded as a series of temporally ordered behavioral elements. Thus, analyzing the elements of criminal behavior together with their temporal relationships helps to model the semantic information of cases, and our work is based on this observation. The main contributions of this paper are as follows:
(1) This is the first work to regard intelligent case assignment as a text-matching problem between cases and judges and to introduce a deep neural network to realize intelligent case assignment in the legal domain.
(2) This is the first work focusing on the importance of the elements of criminal behavior and the order of criminal behaviors. We propose to build the chain of criminal behavior elements with temporal relationships to model case structured semantic information, which enhances the ability to represent cases.
(3) We build a BCTA (Bert-CNN-Transformer-Attention) model to achieve intelligent case assignment. The model integrates information about a judge's expertise into the judge's representation and recommends the most compatible judge for the case. Experiments show that this method significantly improves the accuracy of case assignment.
The rest of this paper is organized as follows. Section 2 presents related works. Section 3 provides the essential definitions of intelligent case assignment. Section 4 describes our models. Section 5 discusses the experiments and results. The conclusion is given in Section 6.

RELATED WORKS
Our work is related to several research areas, including text representation, text matching models, and judicial intelligence research.
Text representation is a basic task underlying many NLP tasks. Text representation models can be roughly classified into three categories. The first category focuses on text features, and the second category is based on topic features. Examples include VSM [1], LDA [2], LSI [3] and SSI [4]. These two types of text representation methods cannot model the position information and context information of words. The third category comprises text representation models based on neural networks. Compared with previous methods, neural network-based text representation solves the problems of high-dimensional sparseness and lack of semantic association in the represented text. Word vectors are also called word embeddings. Hinton et al. [5] proposed the concept of distributed representation, and Bengio et al. [6] proposed a neural network language model (NNLM), which opened the study of neural network-based distributed representation methods for words. The most commonly used model is Word2Vec [7,8]. The distributed representation of words has developed greatly since Word2Vec. Pennington et al. [9] proposed GloVe to learn word vectors from global statistics. Bojanowski et al. [10] proposed FastText to learn the morphological information of words. In these models, the representation of a word is independent of its context. With ELMo [11], BERT [12] and other models, text representation considers not only the morphological information of a word but also its context and semantic information. Recently, in the field of artificial intelligence and law, various neural network architectures such as CNN [13] and RNN [14] have been used for document embedding. Jiang et al. [15] use deep reinforcement learning methods to improve classification accuracy. Kang et al. [16] use CNN and GRU to improve performance. Legal cases are often represented in text form.
The key to our work is how to represent judges and cases. In view of the superiority of neural network-based text representation methods, we use neural network-based text representation methods to represent judges and cases.
In intelligent case assignment, the facts of a case are the input and are matched against all judges in the court. This task can be regarded as a text-matching task. Many NLP tasks can be formulated as a matching problem between two texts, and many deep learning models have been proposed for text matching and ranking. These text matching models can be classified into three categories. The first category is deep learning models based on single-semantic document representation, which first learn the vector representations of two documents independently and then use a function (such as the vector dot product, cosine similarity or an MLP network) to calculate the similarity between the learned feature vectors. Typical models include DSSM (deep structured semantic models) [17]. However, they cannot capture the interactive information between texts and usually cannot achieve good performance. The second category is deep learning models based on multi-semantic document representation, which generate text vector representations for matching from multiple angles and at multiple granularities and can effectively reduce information loss. For example, MV-LSTM (multi-variable LSTM) [18] uses Bi-LSTM to encode the text and lets the vectors at each position of the two sentences interact. However, these models do not utilize the intrinsic structural properties of texts. The third category models the matching pattern directly, paying attention both to how the text is represented and to the dependence between text pairs. For example, Yin et al. [19] construct a similarity matrix of two sentences and then apply convolution to the matrix to extract features. Radford et al. [20] use RNN to build a Siamese network and use attention to capture the interactive information of two sentences. Models in the third category consider the matching degree and the matching structure at the same time and achieve significant improvements in multiple text matching tasks.
Following prior work on text matching models, we build an interaction-based matching model to fully capture the semantic information and inherent structural information of texts.
The application of artificial intelligence technology in law has become an important aspect of judicial intelligence research. To date, some achievements have been made in judicial intelligence research, which focuses on the task of legal judgment prediction (LJP). For a given case, the task of LJP aims to empower machines to predict the judgment results (e.g., law articles, charges, and prison terms) of the case. Inspired by the success of deep learning techniques [13,14,21] on NLP tasks, researchers attempt to employ neural models to handle judgment prediction tasks. Some popular neural network methods are used in an automatic charge prediction task [22][23][24], and there are some works focusing on identifying applicable law articles for a given case [25][26][27]. In addition, some researchers focus on other areas of justice such as entity recognition [28,29], court opinion generation [30] and analysis [31].

PROBLEM FORMULATION
In this section, some notations and terminologies will be introduced, followed by the essential definitions of intelligent case assignment.
Legal Cases Legal cases are ultimately presented in the form of judgment documents. By analyzing judges' trial logic and the composition of a judgment document, we find that the facts of a case are the key information of the case. Therefore, we extract the facts of the case to represent the case. We denote the facts of a case as a word sequence $fact = \{w_1, w_2, \dots, w_n\}$, where $n$ is the number of words.
Intelligent Case Assignment The purpose of intelligent case assignment is to recommend judges automatically for cases. First, for each case, we construct the chain of criminal behavior elements $cbe = \{e_1, e_2, \dots, e_m\}$, where $m$ is the number of criminal behavior elements in the case. We use the Language Technology Platform (LTP) (https://github.com/HIT-SCIR/ltp) to extract the $m$ behavior elements from the facts of the case. LTP is developed by the Social Computing and Information Retrieval Research Center of Harbin Institute of Technology. Then, the feature vector of the facts of the case, $v_{fact}$, and the feature vector of the criminal behavior element chain, $v_{chain}$, are joined together to represent the case, $v_{case}$. Formally, let $v_{case} = v_{fact} \oplus v_{chain}$ denote the feature vector of the case, and let $v_{judge} = [fg_1, fg_2, \dots, fg_k]$ denote the judges' feature matrix, where $fg_i$ represents the feature vector of the i-th judge and $k$ is the number of judges in the court. Given a training dataset $D = [\langle v_{case}, v_{judge} \rangle]$, we aim to train a model $F(\cdot)$ that recommends a trial judge for any test case.

OUR METHODOLOGY
In this paper, we build a BCTA (Bert-CNN-Transformer-Attention) model to realize intelligent case assignment. The architecture of this model is shown in Fig. 2. Our method consists of three parts: the representation of the case, the representation of the judge, and the match between the case and the judge. The representation of the case is generated by the behavior chain encoder and the fact encoder. The judge's representation is generated by the judge encoder. The match between the judge and the case is realized by the case assignment module. In the following subsections, we discuss our method in detail.

Figure 2
The framework of BCTA model

Case Representation Method
To solve the problem of weakly structured semantic analysis of the case in the traditional case assignment method, we construct the chain of criminal behavior elements with temporal relationships to enhance the ability to represent cases. We first extract the behavior elements from the facts of the case and then construct the chain of criminal behavior elements according to the order of behavior elements. Below, we first introduce the construction method of the chain of criminal behavior elements and then the representation method of the case.
The process of constructing the chain of criminal behavior elements is shown in Fig. 3. We use LTP to extract the elements of criminal behavior and to build the relationships between elements. The specific steps are as follows: first, we use LTP to preprocess the facts of the case; second, we use the semantic role labeling toolkit in LTP to perform semantic analysis on sentences; third, we filter elements according to Chinese grammar rules; and finally, we construct the chain of criminal behavior elements with temporal relationships according to the order in which the elements occur. For instance, as shown in Fig. 4, LTP extracts three criminal behavior words, namely, selling, arrested, and confiscated, and marks the semantic relationships between the elements as "A0", "A1", etc. Here "A0" represents the agent of the behavior, "A1" the recipient of the behavior, "TMP" the time when the behavior occurs, "LOC" the place where the behavior occurs, "BNF" the beneficiary of the behavior, and "ADV" an adverbial modifier of the behavior. Therefore, according to the temporal relationship of the behavior elements, we obtain the chain of criminal behavior elements in this case as (selling drugs, the public security organs arrested the defendant Wei Pengpeng, the public security organs confiscated 2 grams of drugs).
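The final step, ordering the extracted elements into a chain, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `BehaviorElement` container, the `position` field and the phrase rendering (agent, behavior word, recipient) are assumptions made for the example, and the input is written by hand rather than produced by LTP.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorElement:
    # Hypothetical container; this is not the output format of LTP.
    position: int        # order of the behavior word in the case facts
    predicate: str       # behavior word, e.g. "arrested"
    roles: dict = field(default_factory=dict)  # role label -> text


def build_chain(elements):
    """Sort behavior elements by occurrence and render each as a phrase."""
    chain = []
    for elem in sorted(elements, key=lambda e: e.position):
        # Assumed rendering: agent (A0), behavior word, recipient (A1).
        parts = [elem.roles.get("A0", ""), elem.predicate, elem.roles.get("A1", "")]
        chain.append(" ".join(p for p in parts if p))
    return chain


# Hand-written elements mirroring the example from Fig. 4, given out of order.
elements = [
    BehaviorElement(12, "confiscated",
                    {"A0": "the public security organs", "A1": "2 grams of drugs"}),
    BehaviorElement(3, "selling", {"A1": "drugs"}),
    BehaviorElement(7, "arrested",
                    {"A0": "the public security organs", "A1": "the defendant Wei Pengpeng"}),
]
chain = build_chain(elements)
print(chain)
```

Sorting by position is what carries the temporal relationship into the chain; the grammar-rule filtering step is not modeled here because the rules are not specified in the text.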

Behavior Chain Encoder
For a given case, $fact = \{w_1, w_2, \dots, w_n\}$ is the word sequence of the facts of the case, where $n$ is the number of words, and $cbe = \{e_1, e_2, \dots, e_m\}$ is the chain of criminal behavior elements of the case, where $m$ is the number of behavior elements; an example of an element is "the public security organs". The behavior chain encoder is shown in Fig. 2. Next, we introduce its layers.
BERT Layer To transform each element in the behavior element chain into a vector, commonly used word embedding methods include a random look-up table and a pre-trained language model. To capture stronger semantic information, BERT is used here to obtain the feature representation of each behavior element and to learn the semantic information within each element. For the chain of criminal behavior elements $cbe = \{e_1, e_2, \dots, e_m\}$, each element is randomly mapped to a vector, giving a sequence whose i-th entry is the vector representation of the i-th behavior element. After the BERT layer, we obtain a feature representation of the chain in which each vector is output by the BERT hidden layer, where $h$ is the dimension of the BERT hidden layer.
CharCNN Layer To model the semantic relationship between behavior elements and the global dependency of the chain, a CharCNN layer follows the BERT layer. The dependency between the elements is learned through the convolution window. Usually, multiple convolution windows of different sizes are set to obtain feature vectors with information at different granularities. If there are three convolution windows of different sizes, the obtained feature vector matrix is $C = [c_1, c_2, c_3]$.
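The multi-window convolution, together with the max pooling that the paper applies to the convolution output, can be illustrated with a small sketch. It is an assumption-laden toy in plain Python: the ReLU activation, the random weights and the helper names `conv_max_pool` and `multi_window_features` are chosen for the example and are not taken from the paper.

```python
import random


def conv_max_pool(vectors, kernel, bias=0.0):
    """Apply one convolution filter over a sequence and max-pool the result.

    vectors: m element vectors, each a list of h floats
    kernel:  w weight vectors (window size w), each a list of h floats
    """
    w = len(kernel)
    responses = []
    for start in range(len(vectors) - w + 1):
        window = vectors[start:start + w]
        total = bias
        for weights, vec in zip(kernel, window):
            total += sum(a * x for a, x in zip(weights, vec))
        responses.append(max(0.0, total))  # ReLU activation (assumed)
    # Max pooling over positions; 0.0 if the sequence is shorter than the window.
    return max(responses) if responses else 0.0


def multi_window_features(vectors, kernels_by_width):
    """Concatenate one pooled feature per filter across all window widths."""
    features = []
    for width in sorted(kernels_by_width):
        for kernel in kernels_by_width[width]:
            features.append(conv_max_pool(vectors, kernel))
    return features


rng = random.Random(0)
h, m, filters_per_width = 4, 8, 2  # toy sizes, far smaller than in the paper
chain_vectors = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(m)]
kernels = {}
for width in (2, 3, 5, 7):  # filter widths reported in the experimental settings
    kernels[width] = [
        [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(width)]
        for _ in range(filters_per_width)
    ]
features = multi_window_features(chain_vectors, kernels)
print(len(features))  # one pooled feature per filter
```

Each window width sees a different number of neighboring elements, which is how windows of different sizes yield features at different granularities.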
Pooling Layer After the CNN layer, to obtain more valuable features in the behavior element chain, a pooling operation is performed on the output of the convolution. Pooling operations include maximum pooling, average pooling, and so on. In this paper, max pooling is applied to the behavior element chain.
Fully Connected Layer After the BERT embedding operation, the convolution operation, and the pooling operation, the raw input sequence of the behavior element chain is transformed into a high-level abstract feature vector. Then, a fully connected layer, denoted by Conn, is adopted to give a global regulation. BERT and CharCNN can thus learn the dependencies within and between behavioral elements. We regard BERT, CharCNN, the pooling layer and the fully connected layer together as the embedding layer of the behavior chain encoder. The output of the embedding layer is denoted by element embeddings. Inspired by BERT's position embeddings, we encode the temporal relationships given by the order in which behavior elements occur, denoted by timing embeddings. The type of each behavior element is also marked when behavior elements and their relationships are extracted. We aggregate the type feature of the elements as an external feature into the representation of the behavior element chain, denoted by type embeddings. We use a randomly initialized lookup table for these embeddings.
Transformer Layer A transformer can be regarded as a graph attention network (GAT) [32]. To better model the temporal relationship of the criminal behavior element chain, we apply a transformer encoder to the output of the embedding layer. A transformer encoder computes the representation of each word through an attention mechanism with respect to the surrounding words.
For the given behavior chain $cbe = \{e_1, e_2, \dots, e_m\}$, we can obtain a semantic vector $behave = \{b_1, b_2, \dots, b_m\}$.
Attention Layer The attention mechanism [33][34][35] lets the model capture the overall dynamics of the input sequence. Inspired by [36,37], we use the attention mechanism as an attention pooling mechanism, which attends to the key elements in the chain of criminal behavior elements and retains the most meaningful information of the facts of the case. The attention weights are computed with a learnable parameter $W_s$. In summary, the attention layer outputs the final vector representation of the criminal behavior element chain, $v_{chain}$.
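Attention pooling of this kind can be sketched in a few lines. The scoring function below (a dot product with a single learnable vector, followed by a softmax) is one common form and is an assumption; the paper's exact formula is not reproduced here, and `attention_pool` and `w_s` are illustrative names.

```python
import math


def attention_pool(vectors, w_s):
    """Pool a sequence of vectors into one vector with softmax attention.

    vectors: m vectors of dimension h (e.g. transformer outputs b_1..b_m)
    w_s:     parameter vector of dimension h (assumed form of W_s)
    Returns the pooled vector and the attention weights.
    """
    scores = [sum(w * x for w, x in zip(w_s, v)) for v in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(vectors[0])
    pooled = [sum(a * v[d] for a, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


behave = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v_chain, weights = attention_pool(behave, w_s=[2.0, 0.0])
print(weights)  # the first element gets the largest weight
```

Elements that score highly against `w_s` dominate the pooled vector, which is the sense in which the layer attends to the key elements of the chain.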

Fact Encoder
In this section, the fact encoder is used to obtain the semantic vector of the facts of a case. For a given case, $fact = \{w_1, w_2, \dots, w_n\}$ is the word sequence of the facts of the case, where $n$ is the number of words. To capture more valuable features, we also use CNN and BERT to encode the facts of the case. The fact encoder reuses the structure of the embedding layer in the behavior chain encoder, and its output is the feature vector $v_{fact}$. However, $v_{chain}$ and $v_{fact}$ are high-order feature vectors obtained independently, so the interaction between them has not yet been modeled. Therefore, a multilayer perceptron (MLP) is used to provide global regulation between them. The MLP layer maps $v_{case}$ from dimension $2h$ to $h$. The process can be formalized as $v_{case} = \mathrm{MLP}(v_{fact} \oplus v_{chain})$, where $\oplus$ represents the concatenation operation.

Judge Representation Method
To address the failure of traditional case assignment methods to consider judges' expertise, we integrate the features of the cases that a judge is good at trying into the judge's representation to highlight the judge's expertise. Judges have heard numerous cases in the past and have been involved in different causes. Differences in judges' expertise and experience result in different trial quality for different cases. We assume that the cases a judge tries with high quality are the cases the judge is good at. Using the features of such cases can reflect the judge's expertise.
In 2011, the Supreme People's Court of China published 31 indicators to evaluate the overall quality of court trials. Based on these, we consider how to evaluate the trial quality of individual judges. After analysis, we select 3 of the 31 indicators: the rate of first-instance revised judgment and retrial (一审发改重审率), denoted by α; the average trial time of cases (案均审理时间), denoted by β; and the rate of cases closed within the statutory normal trial period (法定正常审限内结案率), denoted by θ. From these indicators we calculate a weight $w_{ij}$ that represents the trial quality of judge $i$ for case type $j$, where $i = 1, \dots, n$ indexes the judges and $j = 1, \dots, m$ indexes the causes. One is added to the numerator and denominator to smooth the formula and prevent the result from being zero. For any judge and any type of case, the trial quality weight $w$ can be calculated according to Eq. (13). Comparing a judge's trial quality weights across case types, the case type with the highest weight is the one the judge is good at. In this way, we can determine the types of cases that any judge is good at. Then, we can generate the experimental labeled data set, which is in the form of <case, judge>.
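Once the weights are available, picking the case type each judge is good at is an argmax per judge. The sketch below assumes the weights have already been computed; the judge names, case types and numbers are invented for illustration, and `best_case_type` is a hypothetical helper.

```python
def best_case_type(weights):
    """Map each judge to the case type with the highest trial quality weight.

    weights: {judge: {case_type: w_ij}}
    """
    return {judge: max(by_type, key=by_type.get)
            for judge, by_type in weights.items()}


# Invented example weights; real values would come from Eq. (13).
weights = {
    "judge_1": {"theft": 0.42, "drug trafficking": 0.77, "fraud": 0.31},
    "judge_2": {"theft": 0.65, "drug trafficking": 0.20, "fraud": 0.58},
}
expertise = best_case_type(weights)
print(expertise)
```

Cases of the selected type tried by that judge then supply the <case, judge> pairs used as labeled data.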
We extract the features of the cases that a judge is good at trying to represent the judge, so as to integrate the judge's expertise into the judge's representation. As shown in Fig. 2, we use the judge encoder to encode the judge text to obtain the judge's feature vector. Because the judge's text also includes the facts of the cases the judge is good at, the combination of BERT and CNN is also used here to obtain more valuable feature vectors. As in the fact encoder, for the text of the i-th judge, $f_i = \{s_1, s_2, s_3, \dots\}$, the judge encoder outputs $fg_i \in \mathbb{R}^h$, the feature vector of the i-th judge. If there are $k$ judges in the court, we obtain the feature vector matrix $v_{judge} = [fg_1, fg_2, \dots, fg_k]$ by means of the judge encoder.

Case Assignment Module
Given a training dataset $D = [\langle v_{case}, v_{judge} \rangle]$, we aim to maximize the accuracy of recommending judges based on the facts of a case. Based on $v_{case}$ and $v_{judge} = [fg_1, fg_2, \dots, fg_k]$, we calculate the matching degree between the case and each judge through the cosine similarity [38]. The result is a vector in $\mathbb{R}^k$, each value of which reflects the matching degree between the case and one judge. For any case, the judge with the largest matching value is the best judge to try the case.
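The matching step can be written out directly. This sketch uses hand-made two-dimensional vectors in place of the encoder outputs; `cosine` and `recommend_judge` are illustrative names, and returning a score of 0 for a zero vector is a choice made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend_judge(v_case, v_judge):
    """Return the index of the best-matching judge and all k matching scores."""
    scores = [cosine(v_case, fg) for fg in v_judge]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


v_case = [1.0, 0.0]
v_judge = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]  # k = 3 toy judge vectors
best, scores = recommend_judge(v_case, v_judge)
print(best, scores)
```

Because cosine similarity ignores vector length, the second judge matches the case perfectly even though its vector is twice as long.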
For training, we use a cross-entropy loss function. We employ Adam [39] for optimization and apply dropout to every semantic vector to prevent overfitting.
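For reference, the standard multi-class cross-entropy over the $k$ judges has the form below. This is the textbook definition and not necessarily the exact expression used by the authors: the symbol $s_i$ for the matching score of the i-th judge, the softmax normalization into $\hat{y}_i$, and the one-hot label $y_i$ are assumptions introduced here.

```latex
\hat{y}_i = \frac{\exp(s_i)}{\sum_{j=1}^{k} \exp(s_j)}, \qquad
\mathcal{L} = -\sum_{i=1}^{k} y_i \log \hat{y}_i
```

Under these assumptions the loss for one case reduces to $-\log \hat{y}_c$, where $c$ is the index of the labeled judge.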

EXPERIMENTS

Dataset and Settings
Dataset Currently, there are no publicly available datasets for intelligent case assignment. In this paper, we collect and construct an intelligent case assignment dataset. The dataset consists of criminal cases published by the Chinese government on China Judgements Online (http://wenshu.court.gov.cn). The data generation is divided into four steps:
Step 1, rule-based methods are used to extract the facts of a case and the judges. By analyzing the data, we find that the facts of a criminal case lie between "the trial has been completed" (现已审理终结) and "this court considers" (本院认为). Thus, the facts of a case can be extracted directly by means of rule matching. The judge of the case is identified by the marker "presiding judge" (审判长), so the name that follows this marker is extracted as the judge.
Step 2, the data are cleaned. First, content irrelevant to the case is deleted; second, the text is normalized; and finally, records with empty or garbled case facts are deleted.
Step 3, the weight of trial quality is calculated. The source data are structured data stored in a dictionary, which contains the fields "whether to send back for retrial", "whether to revise the judgment at the second instance", and "trial duration". From these field values, we can use Eq. (13) to calculate the judge's trial quality weight in various cases.
Step 4, data are generated. The trial quality weights of the same judge for different case types are compared to find the case type with the highest trial quality weight. Then, we generate the experimental labelled data set, which is in the form of <case, judge>. There are a total of 10979 cases and 11 judges in the labelled data set.
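The rule matching in Step 1 can be sketched with two regular expressions. The patterns below are an assumption about how such rules might look, not the authors' rules: in particular, the judge pattern assumes a name of two to four Chinese characters, and the sample document is invented.

```python
import re

# Facts lie between the two marker phrases quoted in Step 1.
FACT_PATTERN = re.compile(r"现已审理终结(.*?)本院认为", re.S)
# The name follows the "presiding judge" marker; the marker is often printed
# with spaces between characters. The 2-4 character name length is assumed.
JUDGE_PATTERN = re.compile(r"审\s*判\s*长[\s:：]*([\u4e00-\u9fff]{2,4})")


def extract_fact_and_judge(document):
    """Return (facts, presiding judge), with None for anything not found."""
    fact_match = FACT_PATTERN.search(document)
    judge_match = JUDGE_PATTERN.search(document)
    fact = fact_match.group(1).lstrip("。，, \n").rstrip() if fact_match else None
    judge = judge_match.group(1) if judge_match else None
    return fact, judge


# Invented miniature judgment document.
sample = (
    "本案现已审理终结。经审理查明，被告人贩卖毒品。"
    "本院认为，被告人的行为构成贩卖毒品罪。\n"
    "审 判 长 王某某\n审判员 李某"
)
fact, judge = extract_fact_and_judge(sample)
print(fact)
print(judge)
```

Documents in which either marker is missing yield `None` for the facts and would be removed during the cleaning in Step 2.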
To prevent sample imbalance, we split each judge's cases into a training set, a test set and a validation set at a ratio of 8:1:1.
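A per-judge split of this kind might look as follows. The function name, the fixed seed and the use of integer arithmetic for the 8:1:1 boundaries are choices made for this sketch; the toy samples stand in for the real <case, judge> pairs.

```python
import random
from collections import defaultdict


def split_per_judge(samples, seed=0):
    """Split (case, judge) pairs 8:1:1 within each judge.

    Returns (train, test, valid). Splitting inside each judge keeps every
    judge represented in all three sets.
    """
    by_judge = defaultdict(list)
    for case, judge in samples:
        by_judge[judge].append((case, judge))
    rng = random.Random(seed)
    train, test, valid = [], [], []
    for judge in sorted(by_judge):
        items = by_judge[judge]
        rng.shuffle(items)
        n_train = len(items) * 8 // 10
        n_test = len(items) // 10
        train += items[:n_train]
        test += items[n_train:n_train + n_test]
        valid += items[n_train + n_test:]   # remainder goes to validation
    return train, test, valid


samples = [(f"case_a{i}", "judge_a") for i in range(100)]
samples += [(f"case_b{i}", "judge_b") for i in range(50)]
train, test, valid = split_per_judge(samples)
print(len(train), len(test), len(valid))
```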
The experimental data are shown in Tab. 1.
Settings We employ Adam as the optimizer and set dropout to 0.1 to prevent overfitting. The vocabulary size is capped at 5000. For BERT, we set the maximum sentence length to 512. For CNN, we set the number of filters to 128 and the filter widths to {2, 3, 5, 7}. For the chain of criminal behavior elements, we set the maximum number of elements in a case to 64 and the maximum length of each behavior element to 16. We set the batch size to 32 for all models and train every model for 50 epochs. The learning rate is 2e-5, and the hidden size is 128. We employ macro-precision (MP), macro-recall (MR), macro-F1 (MF), and P@1 as our evaluation metrics.
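The four metrics can be computed as in the sketch below. Two conventions are assumed because the text does not state them: macro-F1 is taken as the mean of the per-class F1 scores, and P@1 is the fraction of cases whose top-ranked judge is the labeled judge.

```python
def macro_metrics(y_true, y_pred):
    """Return (macro-precision, macro-recall, macro-F1) over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def precision_at_1(y_true, y_top1):
    """Fraction of cases whose top-ranked judge equals the labeled judge."""
    return sum(1 for t, p in zip(y_true, y_top1) if t == p) / len(y_true)


y_true = [0, 0, 1, 1]   # labeled judge per case (toy data)
y_pred = [0, 1, 1, 1]   # top-ranked judge per case
mp, mr, mf = macro_metrics(y_true, y_pred)
p1 = precision_at_1(y_true, y_pred)
print(mp, mr, mf, p1)
```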

Comparison with Traditional Case Assignment Methods
We first compare our method with the traditional case assignment methods. The case assignment methods commonly used in Chinese courts are lottery case assignment and balanced case assignment. In lottery case assignment, the court gives every judge a unique number. For any newly accepted case, the court uses a computer program to select a random number, and the judge with that number is assigned the case. Balanced case assignment builds on lottery assignment and additionally ensures that each judge handles the same number of cases. Our experimental results compared with those of the traditional case assignment methods are shown in Tab. 2. For the lottery and balanced case assignment methods, 100 sets of experiments are performed, and the average over the 100 sets is taken as the final P@1 value. The proposed case assignment method achieves the best experimental result, with a P@1 that is 84% higher than those of both the lottery case assignment and the balanced case assignment. The reason is that the proposed method integrates the expertise of judges into the judges' representation, so as to ensure compatibility between the assigned cases and judges' professional ability. The lottery case assignment and the balanced case assignment essentially generate random numbers to obtain the assignment results. There are 11 judges in total, so the probability of randomly selecting any one of the 11 numbers from 0 to 10 is one in eleven, which is approximately 9.09%.
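The 9.09% figure for the lottery baseline can be checked with a short simulation. The sketch assumes a synthetic test set with evenly distributed labels and models only lottery assignment; balanced assignment is not simulated, and the set size of 1100 is arbitrary.

```python
import random


def lottery_p_at_1(true_judges, num_judges, rng):
    """P@1 of one lottery run: each case gets a uniformly random judge."""
    hits = sum(1 for t in true_judges if rng.randrange(num_judges) == t)
    return hits / len(true_judges)


def average_over_runs(true_judges, num_judges, runs=100, seed=0):
    """Average P@1 over repeated lottery runs, as in the 100-run protocol."""
    rng = random.Random(seed)
    total = sum(lottery_p_at_1(true_judges, num_judges, rng) for _ in range(runs))
    return total / runs


true_judges = [i % 11 for i in range(1100)]   # synthetic labels, 11 judges
average = average_over_runs(true_judges, num_judges=11)
print(round(average, 4), round(1 / 11, 4))
```

The simulated average stays close to 1/11 regardless of how the labels are distributed, since a uniform draw hits any fixed label with probability 1/11.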

Comparison with Mainstream Matching Models
In this paper, intelligent case assignment is regarded as a text-matching problem. Many classic models have been proposed for text matching, such as ESIM [40], BIMPM [41], and ABCNN [19]. These classic text matching models take two sentences as input and output a label that identifies the relationship between the two sentences. In this experiment, we reproduce the three matching models, ESIM, BIMPM, and ABCNN, and apply them to case assignment. The experimental results are shown in Tab. 3. It can be seen from Tab. 3 that the proposed model is far better than the classic text matching models at case assignment. The main reason is that we model the structured semantic information of the case by constructing the chain of criminal behavior elements, whereas the other methods only use neural networks to extract the text features of the case, which loses the temporal information of the behavior elements. Moreover, case texts are long, and the ESIM and BIMPM models, which have LSTM at their core, perform poorly on long-sequence texts: when extracting features from long text, the recurrent mechanism of LSTM makes it pay more attention to the end of the sequence, yet the ending content of the facts of a case rarely contains key information of the case. Thus, the ESIM and BIMPM models have the worst case assignment performance. The ABCNN model performs better than the ESIM and BIMPM models because it can capture more global and key information. The ABCNN model takes both word representations and phrase-level representations as model input and performs attention calculations on the results after convolution, while the ESIM and BIMPM models only take word representations as model input, resulting in a loss of semantic information.

Ablation Analysis
To further illustrate the significance of the chain of criminal behavior elements and to explore how the individual criminal behavior elements influence performance, we conduct three sets of ablation experiments on our model. The first removes the criminal behavior element chain, shown as −v chain in Tab. 4. The second removes a single element type from the criminal behavior element chain. For example, −PRE in Tab. 4 means that the behavior word elements are removed when constructing the chain of criminal behavior elements. Similarly, we can build a BCTA(−type) model, where "type" stands for "A0" or "A1". The third removes the temporal features from the chain of criminal behavior elements, shown as −TE in Tab. 4. The experimental results are shown in Tab. 4. If the criminal behavior element chain is removed, the accuracy of case assignment is greatly reduced, with a drop of nearly 3% in P@1 and 6% in MF. Additionally, when a single element type or feature ("A0", "A1", "PRE", "TE") is removed, removing the criminal behavior words or the temporal features hurts accuracy much more than removing the other elements, although less than removing the criminal behavior element chain completely. The experimental results show that constructing the chain of criminal behavior elements with temporal relationships enhances the ability to represent cases and improves the accuracy of case assignment.

CONCLUSIONS
In this paper, we propose an intelligent case assignment method based on the chain of criminal behavior elements. This method builds the chain of criminal behavior elements to model the structured semantic information of the case, avoiding the loss of case semantics. We build a BCTA model to realize intelligent case assignment, which integrates information about a judge's expertise into the judge's representation. The experimental results show that the proposed method significantly improves the accuracy of case assignment. When recommending a judge, the accuracy reaches 95.47%, and the macro-average F1 value reaches 91.09%. Our method can effectively avoid manual intervention, recommend the most compatible judge for the case, and shorten the case trial process.