Route Restoration Method for Sparse Taxi GPS trajectory based on Bayesian Network

: In order to improve the availability of taxi GPS big data, we restore the chosen route for the sparse taxi GPS trajectory in this work. A trajectory restoration method based on Bayesian network is proposed. Compared with the traditional research solely based on time-spatial variables, this method additionally considers the characteristics of empty/heavy taxi status, weather conditions, drivers, vehicle running and other factors to carry out route restoration. A field case of grid network in Ningbo is taken to verify the applicability of the method, using the taxi GPS trajectory data from Ningbo Taxi Information Management Platform. The case results show that the accuracy of Bayesian network method based on multiple factors reaches 91.4%. Its performance is superior to the Multivariate logistic regression model. In addition, the proposed method is especially suitable for scenarios with a high missing rate of track data, such as a scene with timespan of about 5 min between neighbour trajectories.


INTRODUCTION
With the progress and development of information & communication technologies as well as the popularization of navigation-related smart devices, it is convenient to obtain the GPS data nowadays. If we expand the taxi GPS data relation with other data bases, we can acquire abundant vehicle state information, such as latitude and longitude coordinates, license plate number, travel time, empty/heavy taxi status, real-time speed, the name of road section etc. [1]. Assuming the high-frequency GPS is acquired, traveller position could match with the road network. Through map-matching, we can achieve the judgment of the traveller's route trajectory. Further, we can understand the path-selection behaviour of travellers by combining with individual characteristics [2][3][4].
It is convenient to restore vehicles' travel route by using GPS data with fine time granularity. However, influenced by several factors such as the quality problem of the positioning equipment and the interference of the surrounding environment, the taxi GPS sampling rate is sometimes low. Yuan et al. [5] present the statistical distribution of the sampling intervals of the GPS trajectories generated by more than10000 taxies in Beijing in a week. The average time interval of the data set is 3.27 minutes. According to the result, only 34% of the data sampling interval is less than 1 minute. Low sampling would negatively affect the route restoration. For example, there is a car running at a speed of 50 km/h. The situation, no GPS data within certain 2 minutes, would lead to no vehicle track information in the 1.6 km interval. If the car is running in a multi-intersection zone, there are lots of route options and it is a hard work to restore the actual route chosen by this driver. This issue stands out more especially in urban areas, where road network is composed of short links and vehicles can travel on many different road segments in few time intervals. This kind of problem can be named "restoring the real route based on the sparse GPS trajectory". It is a special type of map-matching. For the situation of high GPS missing rate, we cannot align a sequence of observed user positions with digital road network straightly. If this problem can be solved well, the cost of data collection and storage will be reduced.

LITERATURE REVIEW
The sparse taxi GPS data are mainly generated in the following four scenarios: (1) the uncertainty of taxi data transmission when it travels around tall buildings, tunnels, canyons and elevated roads; (2) the positioning system of a mobile phone or car GPS device is turned off and cannot transmit location information; (3) in order to save the communication and energy cost, the taxis usually report their GPS positions to the dispatching centre with low sampling-rate data [6]; (4) GPS data lost in the process of transmission. For the last three sparse data scenarios, we can use the historical taxi GPS data to restore the travel route. It is also the scenario and strategy we focus on in this paper.
Shortest-path method is one of the earliest methods proposed to deal with route restoration based on sparse GPS data, such as the work by Bierlaire et al. [7]. Some papers use other algorithms, such as the modified A* shortest route algorithm [8].Their method is successfully applied in sparse road network, where few optional routes are needed. But when applied to the complex urban road network, it is difficult to accurately restore the actual track by the shortest route method, because there are many candidate routes in the research range.
Brakatsoulas et al. [9] propose a map matching algorithm based on curve similarity. This algorithm uses Average Fréchet Distance (AFD) to measure the matching degree of GPS sequence and candidate section sequence, and the route with the highest matching degree is used as the final matching route. The example shows that the algorithm can be applied to the sparse GPS data with missing rate of about 30s and obtain good matching accuracy, but the application of this method is very complex.
A series of map-matching methods based on HMM are widely used in low-frequency sampling rate data sets [10]. Lou Y et al. proposed ST-Matching algorithm based on HMM [6], which can restore the vehicles' travel route by combining spatially geometric and topological structures. Compared with AFD-based algorithm, ST-Matching improves accuracy as well as running time. Furthermore, an Interactive Voting-based Map Matching (IVMM) algorithm [5] is proposed based on ST-Matching. The IVMM performs better than ST-Matching, because its voting measure could take into account the interaction of candidate mapping points. But the shortest-path method is used to restore the route between the adjacent GPS points in the IVMM algorithm, so the matching accuracy for complex road network is not high. Experiments show that the accuracy of IVMM algorithm is only 70% when the sampling precision is 2 minutes. There are lots of other hybrid HMM methods in the path-restoration work, such as Ozdemir et al. [11], Taguchi et al. [12], etc. However, these methods may not be capable for the condition of over-high GPS missing rate.
As to the condition of trajectory span time of more than 2 minutes, other methods are generally used to solve the path-restoration problem. In the study of the track patching for incomplete vehicle location data, Wang et al. [13] constructed the utility function by considering the matching score between travel time and distance, nonlinear rate of candidate route. The proposed utility is combined with Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS) evaluation method to achieve the track patching. Their method requires that the field should be grid network. Zhao et al. [14] proposed a distributionally robust optimization to restore route for vehicle trajectory data. However, their data is obtained by experimentally connected vehicle, and the route-choice behavior may not be the same with reallife condition. Fu et al. [15] proposed an optimization algorithm for multi feature matching of road network. Starting from shape, distance and semantics, this algorithm can more accurately describe the feature differences between matching pairs of roads, and build regression matching model of road network by training similar feature sample sets.
In addition, some scholars improve the matching accuracy by combining GPS with multivariate data from other sensors such as WIFI [16], cellular fingerprint [17], inertial sensor [18] etc. Although these advanced devices embedded with road semantics [19] could improve the accuracy of route restoration, it also increases the cost of data acquisition.
In addition, some of the above methods do not take into account the characteristics of regional traffic operation, driver's route selection behavior characteristics, and other environmental factors, so their trajectory restoration methods are not necessarily suitable for road network scenarios with high GPS track missing rate. This paper mainly aims at the situation of high data missing rate (the time of no track point reaches 5 minutes or more), based on the collection of urban taxi GPS data, comprehensively considers the factors such as time, space, driver characteristics, environmental characteristics, operation characteristics, etc., and reconstructs the route track combining with the actual road network situation. Specifically, this paper uses the dense GPS data combined with the driver data, weather conditions and other related attributes to train the samples, so as to establish the Bayesian network model of route restoration, and then use the model to predict the sparse GPS trajectory route.

PROBLEM DESCRIPTION
Trajectory restoration process in our study could be described as follows. Firstly As to the situation that adjacent two points pi and p i+1 be across multiple road segments in the floating car GPS log, it means this point pair represent the sparse feature of GPS trajectory. The second step is the route restoration process; it includes several sub-steps. First, extract historical GPS data which has the same starting and ending links as input samples, and conduct data correction. Then, acquire the related attribute values for the vehicles, drivers and other factors in both matched links of upstream and downstream GPS points. Finally, use the Bayesian network model to restore the route trajectory of the vehicle.

BAYESIAN NETWORK MODEL
Bayesian network is a directed graphical model based on probabilistic reasoning. It expresses the association of variables in the problem through a visual network model, which is suitable for the representation of uncertain knowledge. The Bayesian network consists of a set of nodes U = {X 1 , X 2 , …, X n }, directed arcs and the network parameter Θ. A directed arc connects the parent node and its child node, which uses a quantitative, probabilistic causality measure to represent its attribute value. The network parameter represents the conditional probability set from each parent node to its child node in the network.
In this formula: pa(X i ) represents the parent node set corresponding to the child node X i ; The work of factor choice and feature extraction is the first step to construct the Bayesian network model. In other word about this step, we should measure the node variables by qualitative or quantitative methods. Structural learning is the second step. It means we should determine the structure of the Bayesian network by the methods of domain expert, prior knowledge and machine learning. Fitting the network parameters under the established network structure is the third step. Network inference is the fourth step, where we can calculate the posterior probability of the target situation.
Structure learning is the most critical step in the construction of Bayesian network model, in which machine learning is considered to be an effective method especially in a data-rich environment. As to the machine learning method of Bayesian network structure learning, score search method is generally used. Its principle is to find the best Bayesian network structure according to certain search strategy and scoring criteria. The typical methods based on score search include exhaustive search, K 2 algorithm [20] and hill-climbing algorithm, in which K 2 algorithm is widely used. The basic idea of K 2 algorithm is as follows. First use a graph without arcs as an initial input; then examine the next node in turn according to the node significance, and each time we decide whether the previous node needs to be the parent node of the current node until all the nodes are inspected. When the added node is under inspection, the maximum of scoring function is taken as the goal to optimize the current network topology G.
There are mainly two scoring functions for the Bayesian network structure learning. One is based on the Bayesian statistics and the other is related to Mutual Information. Bayesian scoring first gives the prior probability P(G) of the network structure G; then it calculates the posterior probability of G with Bayesian formula under the given data set D. The structure with the maximum posterior probability is the optimal network structure. The scoring function based on Mutual information theory mainly includes Bayesian Information Criterion (BIC) and Minimum Description Length (MDL). The scoring function is composed of the optimal parameter loglikelihood and the penalty term. Optimal parameter loglikelihood is used to evaluate the fitting degree of structure and data. Penalty term is used to avoid the overfitting of the network and make the network relatively simple. But sometimes the network structure is too simple, and there is no parent-child relationship between nodes with obvious logical correlation [21,22]. When the number of network nodes or the amount of sample data is too large, BIC scoring function should be the option.
Due to the moderate amount of network nodes and sample data in this paper, we applied K 2 algorithm to obtain the structure of Bayesian network and used the Bayesian scoring function in structure learning. The Bayesian scoring function is as follows: In this formula: D represents a set of sample data; G represents the network morphology; U represents the nodes of network; P(G) represents the priori probability of network morphology G (it is assumed that the probability of network morphological occurrence obeys the uniform distribution.); q i represents the number of total states for ' t X s parent nodes; r i represents the number of total states for node X i ; N ijk is the total sample number corresponding to the fixed feature values of child-parent nodes for certain combinatory structure. The variable

INDICATORS FOR MODEL TESTING 5.1 Accuracy
We use the sample ratio of correct route restoration to the total number to reflect the accuracy of the model. Its definition is as shown in Eq. (3). In this formula, F ii means that the model correctly identifies the route i. M and F represent the total number of candidate paths and samples separately.

Recall Ratio
The recall ratio represents the accuracy of specific path identification. It is used to analyse the effectiveness of the model to identify certain path. The recall ratio for path i is defined as shown in Eq. (4). The numerator represents the times of correct match about path i by the model [23]. The denominator represents the sample size of choosing path i in true world.

Precision Ratio
The precision ratio is used to evaluate the level of the model about distinguishing specific categories of samples from other samples. The definition of the precision ratio corresponding to route i identification is as follows.

Receiver Operating Characteristic (ROC) Curve
The ROC curve is frequently used to evaluate the predictive performance for the models of binary outcomes. The ROC curve is a graphical plot of the true positive rate (TPR) on the y-axis against false positive rate (FPR) on the x-axis for a threshold running from 0 to 1 [24]. The area under the ROC curve (AUC) can be used as an evaluation measure of the predictive performance. The AUC falls between 0 and 1. A larger AUC indicates the better predictive performance. Tab. 1 is the evaluation standard of ROC curve.
In this paper, the TPR represents the correct judgement ratio of a designated route, in terms of (times of correct judgement for designated route / sample size of designated route); FPR represents the sample occupancy of misjudgement, in terms of (times of error judgement for designated route from other samples/(1-sample size of designated route)). The threshold herein is used for a designated route judgement. We would match the GPS trajectories to the designated route if the probability (obtained by Bayesian network) was larger than the threshold rather than 0.5.

Mutual Information
The mutual information [25] indicates whether the two variables X and Y have a strong relationship. The formula is as follows. If X is independent of Y, then P(X, Y) = P(X) (Y), I(X, Y) = 0 and it means that X is not related to Y. This index would be used to analyse the sensitivity of the nodes in Bayesian network. Specifically, we analyse the sensitivity of factor nodes to route-choice node to test the rationality of factor selection in our model.

DATA SOURCE AND PREPROCESSING
The data used in this paper is obtained from the Ningbo Taxi Service Information Management Platform, which has taxi GPS track data with sampling period of 15 seconds and the basic information about the drivers. This paper selects the road network around Ningbo Sports Center as the research scope. As is shown in Fig. 1, we choose an upstream link and downstream link as origin link and destination link separately to search for GPS trajectory samples. After eliminating abnormal data such as unreasonable travel time, vehicle speeds and detour routes, we obtain 15248 valid data, and the valid rate is 90 percent. The data range is from December 9, 2017 to December 31, 2018, with an average of 41 valid daily data. We apply the ArcGIS software to match these data to the road network and correlate with the track route. We obtain 18 alternative routes.
In order to reasonably select finite routes as the alternative set, we set the following principles: first order the route according to the chosen probability; then the last route within the alternative set should be no less than the selection probability of the remaining route set. The study found that if the 6 routes of the highest choice ratios were included in the analysis, the above principles could be satisfied. In addition, in order to put all the routes into comparison and meantime limits choice number of the routes, we set the 7th route as the final alternative route. This route was actually the set of remaining routes, which was not identified in Fig. 2 because it contained so many community roads.
In order to test the performance of Bayesian network model in trajectory restoration, the GPS trajectory sequence between origin and destination is deleted to create a scenario of sparse GPS trajectory sample.  One more problem is how to unify the origin and destination locations of all samples? We map the GPS locations of upstream and downstream to the center of the road section to achieve this aim. The mapping process needs to correct the instantaneous velocity and timestamp of the starting and ending point. Fig. 2 illustrates this method. As is shown in Fig. 2, if S is the required starting point, then we should map other GPS data to this link center. First, we need to find the two adjacent GPS points. For example, p 1 and p 2 are the points that are collected when vehicle i and j passes through the starting road section; p' 1 and p' 2 are the latter valid adjacent sample points of p 1 and p 2 separately. We can calculate the speed of the vehicle i at the link center S as follows: In this formula: p is.v is the speed of the vehicle i at the link center S; l i1 represents the distance between p 1 and p' 1 ; Δt is the sampling time interval.
The timestamp correction result of the calculated vehicle i at point S is as follows: In this formula: p is.t is the timestamp of vehicle i at S of the route; p' is.t is the timestamp of vehicle i at the nearest GPS point from S; l i2 is the distance between the nearest GPS point and S. δ is a dummy variable. If the nearest point is at the upstream of S, then δ = 1; otherwise, δ = −1.

SELECTION OF INFLUENCING FACTORS AND THEIR FEATURES FOR ROUTE RESTORATION
For sparse GPS trajectories, various factors associated with route selection behavior are summarized as follows: 1) Operational characteristics. Taxi with various in-vehicle state will choose different routes. For instance, empty vehicles prefer to choose the route around the hot spot area [26]. In addition, the empty vehicle will drive slowly to observe potential guests.
2) Speed characteristics. There is a spatial correlation in traffic state between link and route [27]. As to each route, the taxi must pass through the starting and ending sections. The speeds on these two sections can reflect the traffic state of each route in certain degree. 3) Environmental characteristics. Weather conditions have a great influence on driver's route choice [28]. The source of weather data is from the network. 4) Driver characteristics. The route choice behaviours of a driver vary even in the same environment. These behaviours mainly depend on a driver's own characteristics (such as age, driving experience, gender, etc.); Zhang et al. use Logit model to find that young or short driving-years drivers tend to change route more frequently, and the elderly or long driving-years drivers are easier to adhere to a specific route [29].
Obviously, the drivers focus more on the taxi invehicle empty/heavy status, vehicle speed and weather condition when choosing a travel route. Based on the relevant literature, expert experience and further correlation analysis, we select 10 node variables including taxi in-vehicle empty/heavy state X 1 , weather condition X 2 , time period X 3 , travel time X 4 , starting section speed X 5 , ending section speed X 6 , driving years X 7 , driving ages X 8 , driver gender X 9 and the root node X 10 (represents the alternative routes for driver to choose). These variables are used as network input nodes. We extract the features of the aforementioned variables by considering sample proportion. After careful work for the test sample, we acquire discrete features for these variables in Tab. 2. Besides, we provide the variable classification and feature-related sample proportion as shown in tab.
The feature occupancies out of all the samples in the above Tab. 2 are reasonable, because their values are not far from real-world data. As to the taxi in-vehicle state, our research field is right in the downtown, which leads to a high heavy occupancy (74%) compared to the whole city. Some other feature occupancies of the indicators are relatively low, such as the occupancies of severe weather, peak hour, female drivers, which are reasonable. The feature distributions of remaining indicators are wellproportioned, such as travel time, start and end speeds, driving years, driving ages.
When we analyze the variable features of the same route-choice samples, we find these samples could correspond to lots of features for the same variable. For instance, as to the samples of route 2, the four travel-time features occupy 12.3%, 17%, 12.8% and 57.9% respectively. This phenomenon demonstrates that we should not only use sole or few factors to do the route restoration problem of GPS trajectories. We should use lots of comprehensive factors to improve the accuracy of this work.

RESULTS OF BAYESIAN NETWORK MODEL
In order to test the validity of Bayesian network structure model, the sample data are divided into training set and test set according to the proportion of 3:1. The training set and the test set contained 11436 and 3812 samples separately. In this paper, we use the expert knowledge combined with the K2 algorithm to set up the Bayesian network structure. First, the expert knowledge and the correlation ranking are used for the training data to determine the order of the variable nodes: weather, driver age, driver gender, time period, ending section speed, driving years, taxi in-vehicle state, starting section speed, travel time and route choice. Second, we set 4 as the maximum number of parent nodes for each node. Third, we use the Bayesian Networks Toolbox (BNT) in the software MATLAB to complete the programming of the K 2 algorithm, and acquire a Bayesian network structure as shown in Fig. 3.

TEST OF MODEL VALIDITY 9.1 Accuracy Test
According to the established Bayesian network and conditional probability table set, we used the test set to restore the route choice. According to the result, we calculated the recall and precision ratios. The confusion matrix of route identification is shown in the left of Tab. 3. Totally, the accuracy of the test set achieves 91.4%. The ratios of the recall and precision are both higher than 87%. As to route 2, these ratios even achieve 93%. It can be confirmed that the path-restoration accuracy of the model is high and reasonable.

ROC Curve
We calculate the TPR and FPR in terms of each threshold. We first draw the seven ROC curves respectively and then obtain the comprehensive ROC curve by averaging the six ROC curves. The AUC value of the comprehensive ROC is 0.8445. According to the evaluation standard of ROC curve, the accuracy of the Bayesian network performs well.
As shown in Fig. 4, the curve of route 7 is the closest to the upper left corner, so the prediction accuracy of this route is pretty good. The routes covered by it all have a certain circumambulation, so their travel time is significantly longer than the other 6 routes. Then the routes 1 and 6 are followed. When we take the rush hour data to do the analysis, we find that most of the drivers choosing route 6 have more than 15-year driving experience. It indicates that experienced drivers are prone to choose route 6 to avoid severe congestion in this area. As to route 1, the corresponding drivers mainly come from the driving experiences of less than 5 years and more than 10 years. As a result, we can infer that considering driving years they could make the prediction of route 1 and 6 more accurately.

Sensitivity Analysis
We use the mutual information value to analyse the significance of each factor in the Bayesian network. The mutual information value of each variable with respect to the target variable is calculated. Fig. 5 shows the test results of mutual information. The larger the mutual information value is, the greater the variable is to the model. It can be seen that the ranking of top 5 factor significance is as follows: travel time (0.28 bit), driving years (0.16 bit), taxi in-vehicle state (0.12 bit), starting section speed (0.10 bit).
We carry on Bayesian network reasoning to measure the influence of each factor on the match probability of specific route. The sensitivity is calculated by comparing the match probabilities between the whole samples and the samples of the specific factor feature. As shown in Tab. 4, we can know the sensitivity of each route selection with respect to each factor. Travel time, driving years and starting section speed is sensitive to the target variable. Sensitivity of other variables to the target variable is relatively weak. It is basically consistent with the analysis of aforementioned factor-significance ranking. In addition, the sensitivities of taxi in-vehicle state to route 2 and 5 are relatively high.

Influence of Trajectory Missing Rate on Model Prediction Accuracy
In order to study the influence of the trajectory missing rate on the restoration accuracy of our model, we take p 3 , p 4 , p 5 to replace previous destination p 2 separately in Fig.  6. The travel times between origin and destination are around 8 minutes, 12 minutes, 15 minutes respectively. After choosing these data missing rates, the AUC value is 0.825, 0.787 and 0.56 separately (see Fig. 7). When the missing rate approaches 12 minutes from 8 minutes, the decrease of AUC value is not much significant, and its evaluation remains "good". It indicates that our method can be used for the data missing rate of 5 minutes or above.

CONCLUSION
We propose a route restoration method with respect to sparse GPS trajectory. This Bayesian network model takes lots of route-choice factors into account. This multi-factor feature is our innovation when compared to other route restoration methods. We use the taxi GPS data in Ningbo City of China to test our model. It is found that these factors can really influence the route restoration and the accuracy of our model performs better than MLR model. 1) We used the K2 algorithm to establish the Bayesian network structure based on the data of training set. The total accuracy is more than 90%. It also shows that the driver's characteristics (especially the driving years) are highly correlated with the driver's route selection.
2) Via analysis of the ROC curve of the prediction results it is indicated that our method can be used for the data missing rate of 5 minutes or above.
In general, the proposed method shows its ability and great potential for route restoration using the low-quality GPS trajectory data. However, the case employed in this study is a simple one. In fact, the roads of several denser areas are complicated in the Ningbo dataset, some branchroad segments are not well connected but there are GPS trajectories in these roads. More tests should be conducted to generate candidate routes for choice. Hence, replacing this case with a more sophisticated case is expected to lead to further improvements in future studies.