A New Time Series Similarity Measurement Method Based on Fluctuation Features

Time series similarity measurement is one of the fundamental tasks in time series data mining, and there are many studies on time series similarity measurement methods. However, the majority of them only calculate the distance between equal-length time series, and also cannot adequately reflect the fluctuation features of time series. To solve this problem, a new time series similarity measurement method based on fluctuation features is proposed in this paper. Firstly, the fluctuation features extraction method of time series is introduced. By defining and identifying fluctuation points, the fluctuation points sequence is obtained to represent the original time series for subsequent analysis. Then, a new similarity measurement (D_SM) is put forward to calculate the distance between different fluctuation points sequences. This method can calculate the distance of unequal-length time series, and it includes two main steps: similarity matching and the distance calculation based on similarity matching. Finally, the experiments are performed on some public time series using agglomerative hierarchical clustering based on D_SM. Compared to some traditional time series similarity measurements, the clustering results show that the proposed method can effectively distinguish time series with similar shapes from different classes and get a visible improvement in clustering accuracy in terms of F-Measure.


INTRODUCTION
Time series data arise in a variety of domains [1], including financial data [2], environmental data [3,4], telecommunication data [5], and medical data [6]. A time series is a set of observations arranged in sequence according to their time of occurrence [7]. Data mining is a mechanism for discovering hidden knowledge in vast amounts of data, and aims to extract interesting rules or patterns from huge databases [8]. It includes several research directions: clustering [9], similarity search [10], classification [11], and prediction [12]. Among them, clustering plays an essential role in the field of data mining, and it can also be considered as a pre-processing stage for other data mining tasks, especially decision making [13,14].
Time series similarity measurement, which is used to calculate the distance between two time series, is the basis for time series clustering [15]. The common time series similarity measurements are divided as follows [16,17]: Euclidean distance (ED), Dynamic Time Warping (DTW) distance, segmented representation distance, symbolic distance, model distance, and compression distance.
In time series similarity measurement, Euclidean distance [18] is the most commonly used distance, but it cannot measure time series of unequal length. In 1994, Berndt and Clifford [19] first introduced the DTW distance, which was widely used in speech recognition, to the study of time series similarity measurement. DTW allows a certain degree of offset on the time axis and can measure time series of unequal length. However, its computational complexity is high, and it does not satisfy the triangle inequality [20].
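To make the cost of DTW concrete, the following is a minimal sketch of the textbook dynamic-programming formulation, not code from the cited works. The function name `dtw_distance` and the use of the absolute difference as the local cost are assumptions made for illustration. The nested loops show why the running time grows with the product of the two sequence lengths.

```python
def dtw_distance(a, b):
    """Classic DTW between two numeric sequences of possibly unequal length.

    Assumption: the local cost is the absolute difference |a_i - b_j|.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # A point may be matched to several points of the other series,
            # which is what allows the offset on the time axis.
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Two series of unequal length with the same shape align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because one point can be aligned with several points of the other series, the matching is not one-to-one, which is the property the proposed method later changes.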
The similarity measurement based on segmented representation distance, such as Piecewise Linear Approximation (PLA) [21], Piecewise Aggregate Approximation (PAA) [22], and Derivative Segment Approximation (DSA) [23], segments a long time series into several short sequences and uses the features of the segmented sequences to represent the original time series. PLA uses many short line segments, which yields only a rough approximation, so the representation is not accurate. PAA requires the degree of dimensionality reduction to be defined in advance. Moreover, its segments are of fixed length and are each represented by their mean value, which ignores important information such as the shape changes and key points of the time series.
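The limitation of PAA described above can be seen in a few lines. This is a minimal sketch, assuming the series length is an exact multiple of the number of segments; the function name `paa` is illustrative only.

```python
def paa(values, segments):
    """Piecewise Aggregate Approximation: mean of each fixed-length segment.

    Assumption: len(values) is divisible by `segments`, so that all
    segments have the same length.
    """
    n = len(values)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    size = n // segments
    return [sum(values[k * size:(k + 1) * size]) / size
            for k in range(segments)]


# A rising and a falling segment collapse to the same mean value,
# so the shape information inside each segment is lost.
print(paa([1, 3, 5, 7, 2, 2], 3))
print(paa([3, 1, 7, 5, 2, 2], 3))
```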
The similarity measurement based on symbolic distance converts the original time series into a string sequence, and then calculates the distance between string sequences. Symbolic aggregate approximation (SAX) [24] is the most typical symbolic representation method. Because SAX is a symbolic representation method based on PAA, it also inherits the shortcomings of PAA.
The similarity measurement based on model distance includes the Auto-Regressive Model (AR) [25], the Auto-Regressive and Moving Average Model (ARMA) [26], and the Hidden Markov Model (HMM) [27]. This method describes the original time series by estimating appropriate parameters to fit a model, and then uses the distance between the parameters as a similarity index. The disadvantage of the model-based method is that the time series must be assumed in advance to satisfy certain assumptions.
Compression-based Dissimilarity Measure (CDM) [28] is one of the similarity measurements based on compression distance. It combines the results of bioinformatics and compression theory, and is suitable for the measurement and discovery of subsequence similarity. However, the calculation process of this method is very complicated and time-consuming. Many parameters need to be set correctly and reasonably, so its application is limited.
To avoid the shortcomings of the above time series similarity measurement methods, this paper proposes a new time series similarity measurement method based on fluctuation features. On the one hand, this method extracts the fluctuation features of time series to reduce the dimension of the original time series. On the other hand, it can also calculate the distance between unequal-length time series.
The rest of this article is structured as follows. Section 2 introduces the method of fluctuation points identification to represent the fluctuation features of time series. Section 3 proposes a new similarity measurement to calculate the distance between fluctuation points sequences. In Section 4, agglomerative hierarchical clustering based on the proposed distance is used to perform experiments on some public time series, and the clustering results are analyzed. Section 5 provides conclusions.

FLUCTUATION FEATURES EXTRACTION METHOD OF TIME SERIES
In this section, the fluctuation features extraction method of time series is proposed, which mainly includes three steps: identification of extreme points, selection of extreme points, and determination of fluctuation points. After completing these steps, the fluctuation points are obtained to represent the fluctuation features of time series.

Identification of Extreme Points
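In this subsection, we first introduce the definition of an extreme point.
Definition 1 (Extreme Point) Given a time series with n data points, the i-th point is called an extreme point if it is a local maximum or a local minimum with respect to its two neighbouring points, where i = 2, 3, …, n−1. The starting and ending points of a time series are also considered as extreme points.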
A time series is used as an example for discussion. According to the above definition, extreme points can be identified (Fig. 1).

Figure 1 Identification of Extreme Points in a time series
After obtaining the extreme points, the attributes of them are marked, where the attribute of the maximum point is 1 and the attribute of the minimum point is −1.
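The identification and marking of extreme points can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: strict inequalities are used for local maxima and minima (plateaus are ignored), and the attribute of each endpoint is inferred from its single neighbour, which the text does not specify. The name `extreme_points` and the tuple layout `(index, value, attribute)` are hypothetical.

```python
def extreme_points(values):
    """Return (index, value, attr) for each extreme point of a series.

    attr is 1 for a maximum and -1 for a minimum.
    Assumptions: strict comparisons; endpoint attributes are decided by
    comparing each endpoint with its only neighbour.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    points = [(0, values[0], 1 if values[0] > values[1] else -1)]
    for i in range(1, n - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            points.append((i, cur, 1))    # local maximum
        elif cur < prev and cur < nxt:
            points.append((i, cur, -1))   # local minimum
    points.append((n - 1, values[-1], 1 if values[-1] > values[-2] else -1))
    return points


series = [1, 3, 2, 5, 4, 4.5, 0]
for point in extreme_points(series):
    print(point)
```

In this toy series every point is an extreme point, and the attributes alternate between 1 and −1 along the sequence.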

Selection of Extreme Points
In Fig. 1, we can see that some extreme point distributions are concentrated, and the fluctuations are relatively small in some subsequences.
Definition 2 (Candidate Fluctuation Point) For an extreme points sequence $E = \{e_1, e_2, \ldots, e_m\}$ and a given threshold set $\varepsilon = \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_q\}$, if two adjacent points $e_{j-1}$, $e_j$ in the sequence $E$ satisfy $|e_j - e_{j-1}| > \varepsilon_k$, the point $e_j$ is called a candidate fluctuation point.
Here $j = 2, 3, \ldots, m$, $\varepsilon_k$ is a certain threshold in the threshold set $\varepsilon$, and the starting point of a time series is also considered as a candidate fluctuation point.
According to the above definition, the extreme points with small changes are filtered to obtain candidate fluctuation points (Fig. 2).
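The filtering step of Definition 2 can be sketched as below, assuming a single threshold `eps` (one element of the threshold set) and extreme points stored as hypothetical `(index, value, attribute)` tuples. Each point is compared with its predecessor in the extreme points sequence, as the definition states.

```python
def candidate_points(extremes, eps):
    """Keep extreme points whose change from the previous extreme exceeds eps.

    extremes: list of (index, value, attr) tuples in time order.
    The starting point is always kept, following Definition 2.
    """
    if not extremes:
        return []
    kept = [extremes[0]]
    for prev, cur in zip(extremes, extremes[1:]):
        if abs(cur[1] - prev[1]) > eps:
            kept.append(cur)
    return kept


extremes = [(0, 1, -1), (1, 3, 1), (2, 2, -1), (3, 5, 1),
            (4, 4, -1), (5, 4.5, 1), (6, 0, -1)]
print(candidate_points(extremes, eps=1.0))
```

With `eps=1.0` the small dips at indices 2 and 4 are filtered out, which leaves two adjacent maxima (indices 1 and 3) in the candidate sequence.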

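Definition 3 (Fluctuation Point) For a candidate fluctuation points sequence $C = \{c_1, c_2, \ldots, c_p\}$, if two adjacent points $c_{z-1}$, $c_z$ satisfy $Attr_{c_{z-1}} \cdot Attr_{c_z} = -1$, the point $c_z$ is called a fluctuation point.
Here $z = 2, 3, \ldots, p$, and $Attr_{c_{z-1}} \cdot Attr_{c_z} = -1$ means that the attributes of $c_{z-1}$ and $c_z$ are opposite, that is, one point is a maximum and the other is a minimum. Meanwhile, the starting point of a time series is also considered as a fluctuation point.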

Figure 3 Determination of Fluctuation Points in a time series
The candidate fluctuation points inherit the attributes of the extreme points. In the extreme point sequence, the product of the attributes of two adjacent points is −1. However, after deleting some extreme points with small changes, the product of the attributes of two adjacent points in the candidate fluctuation point sequence may be 1, so a further operation is needed to get the fluctuation points. For $Attr_{c_{z-1}} \cdot Attr_{c_z} = 1$, there are two cases, handled as follows: (1) if the attributes of the two adjacent points are both 1, meaning that they are both maximum points, delete the smaller of the two; (2) if the attributes of the two adjacent points are both −1, meaning that they are both minimum points, delete the larger of the two.
Applying these operations to the candidate fluctuation points yields the fluctuation points (Fig. 3).
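The two cases above can be sketched as a single pass over the candidate sequence. This is an illustration, not the authors' code: points are hypothetical `(index, value, attribute)` tuples, and the sketch assumes that the starting point is never replaced, since the text states that it is always a fluctuation point but does not say how a conflict with its neighbour is resolved.

```python
def fluctuation_points(candidates):
    """Resolve adjacent candidates that share the same attribute.

    Two adjacent maxima: keep the larger. Two adjacent minima: keep the smaller.
    Assumption: the starting point is always kept, so a conflicting
    second point is dropped instead.
    """
    result = []
    for point in candidates:
        if not result or result[-1][2] != point[2]:
            result.append(point)          # attributes alternate: accept
            continue
        if len(result) == 1:
            continue                      # never replace the starting point
        prev = result[-1]
        if point[2] == 1:
            keep_new = point[1] > prev[1]  # both maxima: larger survives
        else:
            keep_new = point[1] < prev[1]  # both minima: smaller survives
        if keep_new:
            result[-1] = point
    return result


candidates = [(0, 1.0, -1), (1, 3.0, 1), (3, 5.0, 1), (6, 0.0, -1)]
print(fluctuation_points(candidates))
```

After this pass the attributes of adjacent points alternate again, so the result is a sequence of maxima and minima in strict alternation.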
In summary, the flowchart of the fluctuation points identification algorithm is given in Fig. 4, where n is the number of data points in the time series, m is the number of extreme points, p is the number of candidate fluctuation points, and q is the number of thresholds in the threshold set. Obviously, m ≤ n, p ≥ 1, q ≥ 1.
The sequence which consists only of fluctuation points is called fluctuation points sequence and represents the fluctuation features of time series. For fluctuation points

TIME SERIES SIMILARITY MEASUREMENT BASED ON FLUCTUATION FEATURES
Traditional time series similarity measurements, such as Euclidean distance, which is a point-to-point calculation method, are usually used to measure equal-length sequences. DTW distance can be used to measure time series of unequal length, but its computational complexity is high and it is prone to over-warping. As its calculation method is not point-to-point, it cannot satisfy the triangle inequality of distance.
Since the lengths of any two fluctuation points sequences are usually unequal, and their points also correspond to different time points, a new time series similarity measurement for unequal-length time series is proposed in this section. It includes two main steps: similarity matching and the distance calculation based on similarity matching.

Similarity Matching
For time series, most of the traditional similarity measurements use a point-to-point matching mode, which requires the matched points to have the same time stamp. This matching mode is defined as precise matching. Precise matching is too rigid to effectively match time series with a similar shape, whereas DTW allows a certain degree of offset on the time axis. Therefore, we take advantage of this property of DTW and propose the following similarity matching method. On the one hand, this method allows the time series to be matched with a certain degree of offset on the time axis. On the other hand, it keeps each pair of matched points in a one-to-one relationship, so that it satisfies the triangle inequality of distance.
Given two fluctuation points sequences $X = \{x_1, x_2, \ldots, x_m\}$ with $x_i = (t_i, v_i, Attr_i)$ and $Y = \{y_1, y_2, \ldots, y_n\}$ with $y_j = (s_j, u_j, Attr_j)$, of equal or unequal length, a pair of points $x_i$ and $y_j$ must meet both of the following conditions to be matched successfully. This matching mode is defined as similarity matching.
Condition 1: $t_i \in [s_j - \varepsilon, s_j + \varepsilon]$.
Condition 2: $Attr_i = Attr_j$.
Here $t_i$ and $s_j$ represent the time stamps of $x_i$ and $y_j$, $v_i$ and $u_j$ represent the values of $x_i$ and $y_j$, $Attr_i$ and $Attr_j$ represent the attributes of $x_i$ and $y_j$, and $\varepsilon$ is the threshold that controls how much shift is allowed on the time axis.
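A minimal sketch of similarity matching is given below. It applies the two matching rules stated in Step 1 of Algorithm 1 (time stamps within ε of each other, and equal attributes). The text requires matched points to be one-to-one but does not say how ties are resolved, so the greedy rule used here (each point of X takes the closest still-unmatched point of Y, scanning X in time order) is an assumption, as are the function name and the `(time, value, attribute)` tuple layout.

```python
def similarity_match(xs, ys, eps):
    """Return index pairs (i, j) of matched fluctuation points.

    xs, ys: lists of (time, value, attr) tuples.
    A pair matches when |t_i - s_j| <= eps and the attributes are equal.
    Assumption: greedy one-to-one assignment, closest time stamp first.
    """
    matches = []
    used = set()                      # indices of ys already matched
    for i, (t, _, attr_x) in enumerate(xs):
        best_j, best_gap = None, None
        for j, (s, _, attr_y) in enumerate(ys):
            if j in used or attr_x != attr_y:
                continue
            gap = abs(t - s)
            if gap <= eps and (best_gap is None or gap < best_gap):
                best_j, best_gap = j, gap
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches


X = [(0, 1.0, -1), (3, 5.0, 1), (6, 0.0, -1)]
Y = [(1, 0.5, -1), (4, 4.0, 1), (5, 4.5, 1), (9, 0.2, -1)]
print(similarity_match(X, Y, eps=2))
```

In this toy example the last point of X stays unmatched because the only remaining minimum in Y lies more than ε away on the time axis.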

The Distance Calculation Method based on Similarity Matching
After similarity matching, we need to calculate the distance between the two fluctuation points sequences. Firstly, the distance between each pair of matched points is calculated by Eq. (1).
Where $x_i$ and $y_j$ are a pair of matched points, $t_i$ and $s_j$ represent the time stamps of $x_i$ and $y_j$, and $v_i$ and $u_j$ represent the values of $x_i$ and $y_j$.
Next, in this subsection, some concepts are proposed, such as fluctuation degree, information weight, and the distance calculation method based on similarity matching is given as well.
Then the distance of X and Y based on similarity matching is: where n is the number of matched points.
Based on the given concepts, we summarise the distance calculation method based on similarity matching (D_SM), and the algorithm steps are as follows.

Algorithm 1 The Distance Calculation Method based on Similarity Matching (D_SM)
Input: Fluctuation points sequences X and Y.
Output: The distance between X and Y.
Step 1: Similarity matching. Set the neighbourhood to $\varepsilon$; if $t_i \in [s_j - \varepsilon, s_j + \varepsilon]$ and $Attr_i = Attr_j$, $x_i$ of X and $y_j$ of Y will be matched. As a result, the similarity matching sequence is obtained.
Step 2: After similarity matching, the distance between each pair of matched points is calculated by Eq. (1).
Step 3: Calculate FD of X and Y according to Eq. (2), and then calculate IW according to Eq. (3). At last, calculate SMD of X and Y using Eq. (4).
Step 4: According to Eq. (5), calculate the distance of X and Y.
The above similarity measurement algorithm is not only suitable for the similarity measurement between fluctuation points sequences, but also for the similarity measurement between unequal-length time series.

EXPERIMENTAL STUDIES
In this section, time series clustering is used to demonstrate the performance of the proposed similarity measurement. The clustering method used is agglomerative hierarchical clustering, because it does not require the number of classes to be defined in advance and its results can be visualized as a dendrogram. The experiments are carried out on the UCR time series datasets [29].
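As an illustration of how agglomerative hierarchical clustering consumes a precomputed distance matrix (such as one filled with D_SM values), a dependency-free sketch follows. The linkage criterion is not stated in the text, so average linkage is assumed here, and the function name `agglomerative_merges` is hypothetical. The returned merge history is the information a dendrogram displays.

```python
def agglomerative_merges(dist):
    """Average-linkage agglomerative clustering on a distance matrix.

    dist: symmetric list-of-lists, dist[i][j] = distance between items i, j.
    Returns the merge history as (members_a, members_b, linkage_distance).
    Assumption: average linkage; ties go to the first pair found.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return merges


# Toy matrix: items 0, 1 are close to each other, as are items 2, 3.
D = [[0, 1, 10, 11],
     [1, 0, 9, 10],
     [10, 9, 0, 1],
     [11, 10, 1, 0]]
for step in agglomerative_merges(D):
    print(step)
```

Because the function only needs pairwise distances, any similarity measurement, including one defined on unequal-length sequences, can be plugged in without changing the clustering code.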

Experiment 1 on Face All Dataset
In experiment 1, the FaceAll dataset is used for clustering. As shown in Fig. 7, we randomly select nine time series from three classes. The similarity measurements used for comparison are Euclidean distance and DTW distance. The dataset shown in Fig. 7 exhibits three clusters: C1 = {1, 2, 3}, C2 = {4, 5, 6}, C3 = {7, 8, 9}. The first third of each time series in cluster C1 has obvious fluctuations, while the remainder fluctuates only slightly. In cluster C2, the time series have significant fluctuations over their entire length. The shape of the time series in cluster C3 is similar to that in cluster C1, but the last two-thirds change more obviously.

Experiment 2 on Swedish Leaf Dataset
In experiment 2, we randomly select nine time series from three classes in the SwedishLeaf dataset for clustering, as shown in Fig. 9. The similarity measurements used for comparison are Euclidean distance and DTW distance. Fig. 10 shows the clustering results of the different methods. In Fig. 9, there are three clusters: C1 = {1, 2, 3}, C2 = {4, 5, 6}, C3 = {7, 8, 9}. The trend and shape of the nine time series are very similar, and the difference between the three classes is not very obvious.

Experiment 3 on Synthetic Control Dataset
In experiment 3, we randomly select nine time series from three classes in the Synthetic Control dataset for clustering, as shown in Fig. 11. The similarity measurements used for comparison are Euclidean distance and DTW distance. Fig. 12 shows the clustering results of the different methods.

Experiment 4 on Four UCR Datasets
In experiment 4, experiments are conducted on four datasets. Tab. 2 shows the basic information of these datasets.
The compared methods [30] are DTW, PLA_DTW, PAA_DTW, SAX_DTW, and DSA_DTW. In clustering analysis, F-Measure is usually used as an evaluation metric to validate the quality of the clustering results [31].
After clustering, we calculate the F-Measure of each result. Tab. 3 presents the values obtained for the 6 clustering methods.
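The exact F-Measure formula is not reproduced in this text, so the sketch below uses a commonly used clustering F-Measure as an assumption: for each true class, take the best F1 score over all clusters, then average these scores weighted by class size. The function name `clustering_f_measure` is illustrative.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-Measure of a clustering.

    Assumption: F = sum_i (n_i / n) * max_j F1(class i, cluster j),
    with precision n_ij / n_j and recall n_ij / n_i.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# A clustering that recovers the classes exactly scores 1.0,
# regardless of how the cluster ids are named.
print(clustering_f_measure([0, 0, 0, 1, 1, 1], [5, 5, 5, 7, 7, 7]))
```

Values closer to 1 indicate that each true class is recovered by some cluster with both high precision and high recall.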
As shown in Tab. 3, D_SM achieves the highest F-Measure on all datasets except Trace. In particular, on the Synthetic Control and CBF datasets, the F-Measure value of D_SM is much higher than that of the other methods. On the Trace dataset, the F-Measure value of D_SM is only 0.1 less than that of DSA. To make it easier to compare the clustering results of these methods, we rank the F-Measure values of all the approaches and then calculate their average ranking. The average ranking of D_SM is the highest.
Based on the above experimental results, the following conclusions can be summarised. In experiments 1, 2, and 3, our proposed similarity measurement (D_SM) can effectively distinguish time series with similar shapes from different classes compared to Euclidean distance and DTW distance. In experiment 4, compared to DTW, PLA_DTW, PAA_DTW, SAX_DTW, and DSA_DTW, D_SM achieves the highest average ranking on all datasets in terms of F-Measure, which means that D_SM has significantly improved the accuracy of the clustering results.

CONCLUSIONS
In this paper, we propose a new time series similarity measurement based on fluctuation features, which can calculate the distance between unequal-length time series. Firstly, the fluctuation features extraction method of time series is proposed, which mainly includes three steps: identification of extreme points, selection of extreme points, and determination of fluctuation points. After completing these steps, the fluctuation points are obtained to represent the fluctuation features of time series, and the fluctuation points sequence that consists of them can replace the original time series. Since the lengths of any two fluctuation points sequences are usually unequal, a new similarity measurement (D_SM) is proposed to calculate the distance between fluctuation points sequences. This method allows the matched time series to have a certain degree of offset on the time axis, and it also satisfies the triangle inequality of distance. Finally, experiments are performed on some public time series using agglomerative hierarchical clustering based on the proposed method. The clustering results show that the proposed method can effectively distinguish time series with similar shapes from different classes and significantly improves the clustering accuracy.
However, the fluctuation features extraction method and the proposed time series similarity measurement both rely on setting reasonable thresholds. This problem will be addressed in future research.