Clustering Algorithm Based on Sparse Feature Vector without Specifying Parameter

Parameter setting is an essential factor affecting algorithm performance in data mining techniques. CABOSFV is an efficient clustering algorithm which can cluster binary data with sparse features, but it is challenging to specify the threshold parameter. To solve the difficulty of parameter decision, a clustering algorithm based on sparse feature vector without specifying parameter (CASP) is proposed in this paper. The calculation method of an upper limit of threshold is firstly defined to determine the range of threshold. Furthermore, we use the sparseness index to sort the data and conduct the clustering process based on the adjusted sparse feature vector after data sorting. An interval search strategy is adopted to find a suitable threshold within the defined threshold range, and the clustering result with the selected suitable parameter is the outcome. Experiments on 7 UCI datasets demonstrate that the clustering results of the CASP algorithm are superior to other baselines in terms of both effectiveness and efficiency. CASP not only simplifies the parameter decision process, but also obtains desirable clustering results quickly and stably, which shows the practicability of the algorithm.


INTRODUCTION
Clustering is an important method for identifying the natural structures of datasets [1]. As a fundamental technique in data mining [2,3], clustering analysis aims at dividing data objects into several groups such that data objects in each group are similar to one another and dissimilar to data objects in different groups [4,5]. Over the years, clustering algorithms have been widely used in data analysis in different domains, such as text data [6,7], customer data [8,9], image data [10,11] and medical data [12,13].
Clustering algorithm based on sparse feature vector (CABOSFV) [14] is an efficient clustering algorithm for binary data with sparse features. The similarity of a set is measured by the defined Sparse Feature Dissimilarity (SFD). Moreover, the Sparse Feature Vector (SFV) is exploited to compress sparse data effectively. Using the additivity of SFV, clustering can be completed by scanning the data only once. Hence, the CABOSFV algorithm has high computing efficiency and good clustering performance [15]. CABOSFV has since attracted extensive attention, and it has been applied in many areas such as customer knowledge discovery [15], text mining [16,17], traditional Chinese medicine [18] and Intelligent Miner (I-MINER) [19].
However, the CABOSFV algorithm has two shortcomings: first, the clustering result is sensitive to the data input order; second, a threshold parameter needs to be given in advance, which directly affects the final clustering result. Some improved algorithms have been developed to solve these problems. Improved CABOSFV clustering considering data sort (CABOSFV_CS) [20] defines the sparseness index of an object, and experiments verify that the accuracy of clustering results improves if the data objects are sorted in ascending order according to the sparseness index. CABOSFV_CS gives a solution that weakens the sensitivity to data order, but the problem of threshold parameter determination has not been substantially resolved.
The threshold b of CABOSFV is the upper limit of the SFD within a cluster. It is a crucial parameter, similar to the number of clusters k in the clustering problem [21,22]. If the threshold is too large, it is easy to merge different clusters; if the threshold is too small, it is easy to split the same cluster. As a result, the selection of threshold b plays a decisive role in the clustering process. However, this parameter is usually determined empirically, and there are no criteria for determining it. Thus, how to determine the threshold more reasonably becomes an essential and challenging task for CABOSFV-based algorithms.
Hierarchical clustering algorithm for binary data based on cosine similarity (HABOC) [23] is an improved version of CABOSFV which exploits a hierarchical clustering procedure and does not need the threshold parameter to be specified in advance. Although HABOC can produce clustering results without a pre-set threshold parameter, the time complexity of hierarchical clustering is high, which sacrifices the efficiency advantage of CABOSFV, namely that clustering is completed in one scan. High dimensional data clustering algorithm based on extended dissimilarity (CABOSFV_D) [24] proposes a calculation method of extended dissimilarity, which makes the clustering process more accurate. The same work [24] also presents a method to determine the threshold b, which subjectively sets the initial threshold range to (0, 3). In fact, the threshold parameter may take any value in (0, +∞) and varies from dataset to dataset. Thus, the determination of the threshold parameter remains to be further studied.
Therefore, a clustering algorithm based on sparse feature vector without specifying parameter (CASP) is proposed in this paper. Firstly, the calculation method of an upper limit of the threshold is defined to determine the threshold range. Next, the sparseness index and adjustment index are used to improve the stability and reliability of clustering. Then, a certain search strategy is adopted to find a suitable parameter within the initial threshold range, and the final partition result is obtained. Finally, the experimental results demonstrate the superiority of the proposed method.
The key contributions of this paper are as follows: (1) We give a method to calculate an upper limit of threshold, which theoretically reduces the threshold range to a definite interval. The proposed CASP method includes parameter decision, which improves the practicability of the algorithm. (2) The CASP algorithm combines the sparseness index and adjusted sparse feature vector to make the clustering results stable and reliable. (3) We evaluate the performance of CASP with several UCI datasets, and the experiments verify that CASP outperforms existing improved CABOSFV algorithms and classical categorical data clustering algorithm K-modes [25]. Moreover, CASP shows high computing efficiency. Therefore, the proposed CASP is promising for its practical application value.
The remaining sections of this paper are organized as follows. Some preliminaries are first presented in Section 2. Section 3 defines an Upper Bound of the SFD Threshold and describes the details of the proposed method. Then, extensive experiments are presented in Section 4, where seven UCI datasets and three external clustering validation indices are used to verify the performance of the algorithms. Finally, conclusions are summarized in Section 5.

RELATED WORK AND PRELIMINARIES
In this section, some techniques for parameter determination in clustering algorithms are reviewed firstly. Then, we briefly review some preliminaries including CABOSFV, CABOSFV_CS and CABOSFV_D.

Techniques for Parameter Determination in Clustering Algorithms
As a fundamental technique in data mining, clustering can help humans understand and utilize data. In clustering analysis, parameter selection is one of the key factors that determine whether the clustering algorithm is effective. The number of clusters, usually denoted as k, is a vital parameter in most clustering algorithms [26]. Thus, most research on clustering algorithm parameters focuses on how to determine the number of clusters k. A typical method is to evaluate a clustering validity index and optimize it as a function of the number of clusters [27]. Some studies use likelihood-based methods, such as the Bayesian Information Criterion (BIC) and Akaike's Information Criterion (AIC), to estimate the correct k value in the context of a likelihood function [28,29]. Recently, machine learning techniques have been employed to estimate the number of clusters. Ünlü et al. proposed a weighted consensus clustering scheme which uses four different indices to estimate the correct number of clusters [30]. Pimentel et al. proposed a new methodology using meta learning to recommend the number of clusters [31]. Another direction is to use methods that do not require the a priori definition of the number of clusters. Instead of the number of clusters, the CABOSFV algorithm requires a threshold parameter to be set. However, existing results on estimating k cannot be applied directly to CABOSFV, because the candidate set of k is usually finite, whereas the threshold ranges over an infinite set of real values. Therefore, the CASP algorithm is proposed, which includes the determination of the threshold parameter. Users can get desirable clustering results without specifying the parameter.

CABOSFV Algorithm
CABOSFV is a sparse feature-based clustering algorithm which can cluster sparse data described by binary variables [14]. A binary variable is a categorical variable with only two values (usually expressed as 0 and 1). In real-life datasets, a categorical variable usually takes two or more values. A multivalued variable can be converted into binary variables by one-hot encoding, as shown in Tab. 1. Therefore, the CABOSFV algorithm can also be used to cluster categorical data. All of the following descriptions are based on the assumption that categorical data has been converted to binary data.
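As a small illustration of the one-hot conversion mentioned above, the following Python sketch turns rows of categorical values into rows of 0/1 values. The function name `one_hot` and the sample rows are hypothetical and are not taken from Tab. 1.

```python
def one_hot(rows):
    """Convert rows of categorical values into rows of 0/1 values.

    Each distinct value of each column becomes one binary attribute.
    """
    n_cols = len(rows[0])
    # Distinct values of every column, in sorted order for reproducibility.
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    # Human-readable name of every binary attribute, e.g. "0=red".
    names = [f"{j}={v}" for j in range(n_cols) for v in categories[j]]
    encoded = [
        [1 if row[j] == v else 0 for j in range(n_cols) for v in categories[j]]
        for row in rows
    ]
    return names, encoded


# Hypothetical example: two categorical attributes (colour, size).
rows = [("red", "small"), ("blue", "small"), ("green", "large")]
names, encoded = one_hot(rows)
for row in encoded:
    print(row)
```

Every encoded row contains exactly one 1 per original attribute, so the binary data becomes sparser as the number of categories grows.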
In CABOSFV, Sparse Feature Dissimilarity is defined to measure the similarity of data objects in a cluster. The algorithm also applies the Sparse Feature Vector to compress the data effectively. To make it more concrete, consider a dataset with n objects, each described by m attributes with the value of 1 or 0 (known as the sparse feature). X is a subset of the dataset, and the number of objects in X is denoted as |X|. In subset X, the number of attributes with sparse feature values of 1 for all objects is a, and the corresponding attribute number set is S; the number of attributes whose sparse feature values are not all the same is e, and the corresponding attribute number set is NS. The Sparse Feature Dissimilarity of X is defined as:
$$SFD(X) = \frac{e}{|X| \cdot a} \quad (1)$$
The Sparse Feature Vector (SFV) is defined as:
$$SFV(X) = (|X|, S, NS, SFD(X)) \quad (2)$$
Moreover, by using the additivity of SFV, the SFV of a merged new set is calculated directly. It is worth mentioning that the CABOSFV algorithm does not need to calculate and compare the differences between every two data objects one by one. It only needs one data scan to get the clustering results, so CABOSFV is particularly efficient. However, the clustering results are affected by the data input order and the threshold parameter. CABOSFV_CS addresses the sensitivity to data input order, and CABOSFV_D addresses the selection of the threshold parameter. These two algorithms are introduced in the following subsection.
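The additivity of the SFV can be made concrete with a short Python sketch. It assumes SFD(X) = e / (|X| · a), where a counts the attributes equal to 1 for all objects and e counts the attributes whose values are not all the same. The merge rule is derived from these definitions rather than quoted from [14]: an attribute is all-1 in the merged set only if it is all-1 in both parts, and it is mixed if it has at least one 1 but is not all-1. The class and function names are hypothetical, and returning infinity when a = 0 is a convention chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFV:
    size: int        # |X|, number of objects in the set
    S: frozenset     # attributes equal to 1 for every object in the set
    NS: frozenset    # attributes whose values are not all the same

    @property
    def sfd(self):
        a = len(self.S)
        if a == 0:
            # Assumed convention: no shared 1-attribute means "entirely different".
            return float("inf")
        return len(self.NS) / (self.size * a)


def sfv_of_object(obj):
    """SFV of a single binary object: all of its 1-attributes are in S."""
    ones = frozenset(j for j, v in enumerate(obj) if v == 1)
    return SFV(1, ones, frozenset())


def merge(x, y):
    """SFV of the union of two disjoint sets, computed from their SFVs only."""
    S = x.S & y.S
    NS = (x.S | x.NS | y.S | y.NS) - S
    return SFV(x.size + y.size, S, NS)


xy = merge(sfv_of_object([1, 1, 0, 0]), sfv_of_object([1, 0, 1, 0]))
print(xy.size, sorted(xy.S), sorted(xy.NS), xy.sfd)
```

Because `merge` never touches the original objects, a cluster can absorb new objects using only its stored summary, which is what makes a single scan of the data possible.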

A Review of CABOSFV_CS and CABOSFV_D
To solve the problem that the clustering quality of CABOSFV is affected by the order of data input, CABOSFV_CS proposes the concept of the sparseness index to describe the sparse feature of data [20]. Experiments on real data show that the clustering performance can be improved effectively by sorting data in ascending order of sparseness index. CABOSFV_CS provides a practical and straightforward solution to the data order sensitivity problem. Therefore, this sorting method is employed in this paper.
CABOSFV_D is a high dimensional data clustering algorithm based on extended dissimilarity [24]. CABOSFV_D introduces the adjustment index p to extend the original Sparse Feature Dissimilarity. The extended dissimilarity can prevent data objects from being assigned to a larger cluster, which makes the clustering process more accurate. In addition, CABOSFV_D is implemented with bit sets to improve the efficiency of the algorithm. At the end of [24], the authors present a method for determining the threshold b. The method first sets the initial threshold range to an interval, such as (0, 3). Then, using a step length such as 0.1 as the increment, multiple experiments are conducted, and the parameter with the best clustering result is selected as the final input parameter of the algorithm. The problem with this approach is that the initial threshold range is set empirically. Furthermore, a fixed threshold range is not suitable for all datasets, so it is unreasonable to set the threshold range to the same interval without considering the actual structure of the dataset. In this paper, we give a method to determine the threshold range according to the specific dataset, with the aim of simplifying the user's effort in determining the threshold parameter.

CASP ALGORITHM
In this section, the proposed CASP algorithm will be introduced in detail. Firstly, we define an Upper Bound of the SFD Threshold (TUB) to determine the threshold range. Then, we introduce the adjusted sparse feature vector after data sorting. Finally, the specific steps of CASP algorithm are described.

Determination of SFD Threshold Range
The Sparse Feature Dissimilarity (SFD) describes the degree of similarity among all objects in a set. The threshold b is a parameter of the algorithm, which represents the upper limit of SFD in a set. SFD and b jointly determine whether the current object can be added to a cluster. For different datasets, the calculated SFD is quite different, so a fixed threshold does not adapt to different datasets. As a result, there is no unified empirical standard for the selection of threshold b. It is necessary to find the appropriate threshold range from the given dataset.
As the threshold b changes, the clustering results vary. When the value of b is very large, the clustering result no longer changes with b. In this case, b has no limiting effect on the SFD of a set. That is, we can find such a relatively large value of b as an upper bound of the threshold. Assuming that the maximum SFD of any subset of a dataset is $SFD_{max}$, we expect to find a value equal to or slightly greater than $SFD_{max}$. According to the definition $SFD(X) = e / (|X| \cdot a)$, the larger the e in the numerator is, and the smaller the |X| and a in the denominator are, the larger SFD is. From this perspective, we define the calculation method of an Upper Bound of the SFD Threshold range as follows.
Definition 1 (An Upper Bound of the SFD Threshold, TUB) Given a dataset X with n objects, each of them is described by m binary attributes (with values of 0 or 1). The number of attributes that equal 1 for all objects in X is represented as a. The number of attributes that equal 0 for all objects in X is indicated by z. An Upper Bound of the SFD Threshold, denoted as TUB, is defined as:
$$TUB = \begin{cases} \dfrac{m - z - a}{2a}, & a > 0 \\ \dfrac{m - z - 1}{2}, & a = 0 \end{cases} \quad (3)$$
In Eq. (3), the "2" in the denominator means that when two sets are merged, there are at least two objects in the new set. The "a" refers to the number of attributes with all values of 1 in the dataset X.
If a > 0, when two sets are merged, the number of attributes with all values of 1 in the new set is at least a. The "m-z-a" means the number of attributes that equal 1 for some objects and equal 0 for other objects in the dataset X. Then the number of attributes with values that are not all the same in a subset of X does not exceed m-z-a.
If a = 0, when two sets can be merged into one set, there is at least one attribute with all values of 1 in the new set; otherwise, the two sets are considered entirely different and cannot be merged into a new set. The number of attributes with values that are not all the same in the merged new set does not exceed m-z-1.
TUB represents the maximum set dissimilarity that a subset of a dataset may achieve. Obviously, the minimum dissimilarity of a set is 0. Therefore, the range of the SFD threshold can be obtained as (0, TUB). The following case shows how to calculate the TUB in detail.
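Definition 1 and the two cases discussed above can be sketched in Python. The formula used here, (m - z - a) / (2a) for a > 0 and (m - z - 1) / 2 for a = 0, is read off the explanation of the numerator and the denominator, so it should be treated as one reading of Eq. (3). The function name is hypothetical, and the dataset is assumed to be non-empty with at least one attribute that is not all 0.

```python
def threshold_upper_bound(data):
    """Upper bound of the SFD threshold for a list of binary rows.

    a: number of attributes equal to 1 for all objects
    z: number of attributes equal to 0 for all objects
    """
    n = len(data)
    m = len(data[0])
    col_sums = [sum(row[j] for row in data) for j in range(m)]
    a = sum(1 for s in col_sums if s == n)
    z = sum(1 for s in col_sums if s == 0)
    if a > 0:
        # At most m - z - a mixed attributes, at least 2 objects, at least a shared 1s.
        return (m - z - a) / (2 * a)
    # a = 0: a mergeable set still needs one shared 1-attribute.
    return (m - z - 1) / 2


# Hypothetical dataset: attribute 0 is all 1, attribute 4 is all 0.
data = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
]
print(threshold_upper_bound(data))  # (5 - 1 - 1) / (2 * 1) = 1.5
```

The bound only needs one pass over the columns, so computing it adds little cost before the clustering itself.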
For the example dataset X, the calculation gives TUB = 5. Thus, the threshold range of dataset X is (0, 5).
Table 1 Converting categorical attributes to binary attributes

Adjusted Sparse Feature Vector after Data Sorting
Due to the sensitivity of CABOSFV to the order of data input, the proposed CASP algorithm firstly sorts the data objects according to the sparseness index [20], which is described as follows.
Definition 2 (The Sparseness Index of an Object, SIO) Suppose a dataset X has n objects, each of which is characterized by binary attributes. For object i, its sparseness index is denoted as: where m b represents the number of attributes whose value is equal to 1 in object i.
For dataset X, the sparseness index of each object is calculated and the objects are sorted in ascending order. The sorted dataset is $X_{sort}$. The adjusted sparse feature vector after data sorting will be introduced next.
Definition 3 (The Adjusted Sparse Feature Vector, ASFV) For the sorted dataset $X_{sort}$, X' is one of its subsets, and the number of objects in X' is recorded as |X'|. The number of attributes that equal 1 for all objects in X' is represented as A and the corresponding attribute set is S. The number of attributes that equal 1 for some objects and equal 0 for other objects in X' is denoted as E and the corresponding attribute set is NS. The adjustment index p is a constant integer greater than or equal to 1.
The adjusted sparse feature vector, namely ASFV, is defined as: Where ESFD(X') represents the extended sparse feature dissimilarity of X', which is defined as: According to [24], the adjustment index p in Eq. (6) is usually between (1, 4). When two sets, X' and Y', are merged, the ASFV of the new set can be calculated directly as follows: in which: The CASP algorithm exploits the sparseness index to weaken the sensitivity of data order. At the same time, the clustering based on the adjusted sparse feature vector can make the clustering process more accurate and improve the clustering effectiveness. CASP considers both data order sensitivity and rationality of data allocation. Therefore, it will be more stable and reliable than traditional CABOSFV algorithms.

Clustering Process without Specifying Parameter
When the threshold b takes different values within the given range, we can get different partitions. We find the best result from these partitions and the corresponding threshold b is the selected parameter. The proposed CASP algorithm mainly contains the following procedures: first, calculate the SFD threshold range according to Definition 1; then, sort the data according to the sparseness index; next, conduct the clustering process based on the adjusted sparse feature vector and search for a suitable parameter in the defined threshold range; finally, output the final clustering result.
The computational complexity of CASP is $O(I \cdot k \cdot n)$, where I is the number of iterations, k is the number of clusters, and n is the number of data objects. I is generally small. Therefore, as long as k is significantly less than n, the computational time of CASP grows linearly with n, which makes the algorithm efficient and simple. The proposed CASP algorithm combines the strengths of CABOSFV_CS and CABOSFV_D. Moreover, when using CASP for clustering, the clustering results can be obtained by inputting only the dataset, without setting any parameter in advance. During the parameter determination, our method narrows the threshold search scope from (0, +∞) to (0, TUB). As a result, an appropriate parameter can be located quickly and accurately, and a desirable clustering result can then be obtained.
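To illustrate why each candidate threshold costs roughly k · n comparisons, the following Python sketch runs a one-scan, threshold-based clustering for several values of b. It is a simplified stand-in, not the CASP procedure itself: it uses the plain SFD, e / (|X| · a), instead of the extended dissimilarity, it omits the sparseness-index sort, and it tries evenly spaced thresholds instead of the interval search strategy. The rule of joining the cluster with the smallest resulting SFD is also an assumption, and all function names are hypothetical.

```python
def sfd_after_adding(cluster, ones):
    """SFD of a cluster if one more object were added, plus the merged summary."""
    size, S, NS = cluster
    new_S = S & ones
    if not new_S:
        # No shared 1-attribute: treated as entirely different, cannot merge.
        return float("inf"), None
    new_NS = (S | NS | ones) - new_S
    return len(new_NS) / ((size + 1) * len(new_S)), (size + 1, new_S, new_NS)


def one_pass_cluster(data, b):
    """Assign each object, in input order, to the best cluster with SFD <= b."""
    clusters = []  # each cluster is summarised as (size, S, NS)
    labels = []
    for obj in data:
        ones = frozenset(j for j, v in enumerate(obj) if v == 1)
        best = None
        for idx, cluster in enumerate(clusters):  # k comparisons per object
            sfd, merged = sfd_after_adding(cluster, ones)
            if sfd <= b and (best is None or sfd < best[0]):
                best = (sfd, idx, merged)
        if best is None:
            clusters.append((1, ones, frozenset()))
            labels.append(len(clusters) - 1)
        else:
            clusters[best[1]] = best[2]
            labels.append(best[1])
    return labels


def sweep(data, upper, steps=10):
    """Run the one-pass clustering for evenly spaced thresholds inside (0, upper)."""
    return {upper * i / steps: one_pass_cluster(data, upper * i / steps)
            for i in range(1, steps)}


data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
for b, labels in sweep(data, upper=1.5, steps=3).items():
    print(b, labels)
```

With I candidate thresholds, k clusters and n objects, the nested loops perform on the order of I · k · n comparisons. Choosing which of the resulting partitions to keep is left to the caller in this sketch.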

EXPERIMENTS
In order to verify the validity of our proposed CASP algorithm, extensive experiments are carried out based on several UCI datasets. In section 4.1, seven UCI datasets and three evaluation metrics are introduced. Section 4.2 describes benchmarks and experimental design. Section 4.3 presents the experimental results and evaluates the performance of CASP algorithm.

Datasets and Evaluation Metrics
In the experiment, seven datasets viz., Zoo, Soybean (Small), Congressional Voting Records, Solar Flare, Audiology (Standardized), Lymphography, and Breast Cancer are selected from UCI Machine Learning Repository [32] for algorithm verification. These datasets are all categorical data with binary or categorical attributes.
We remove data objects with missing values. The detailed information of datasets is described in Tab. 2.
The true class labels of these seven UCI datasets are known, so several external clustering validation indices are used to evaluate the clustering performance. The Rand index (RI), Fowlkes-Mallows index (FMI) and Normalized Mutual Information (NMI) are employed in the experiment, as listed in Tab. 3. These indices are commonly used to measure how well clustering partitions match external standards.
More concretely, RI is the proportion of object pairs that are handled correctly, that is, pairs from the same true class that are assigned to the same cluster and pairs from different true classes that are assigned to different clusters; FMI is the geometric mean of precision and recall; NMI reflects the consistency of the true label distribution and the clustering result label distribution. RI, FMI and NMI are all between 0 and 1. The greater the value of RI, FMI or NMI is, the more consistent the clustering result is with the real situation. If two data objects with the same true labels are assigned to the same cluster, the number of such object pairs is denoted as TP. If two data objects with different true labels are assigned to different clusters, the number of such object pairs is denoted as TN. If two data objects with different true labels are assigned to the same cluster, the number of such object pairs is denoted as FP. If two data objects with the same true labels are assigned to different clusters, the number of such object pairs is denoted as FN. MI is the mutual information between the true labels and the result labels, and H is the information entropy.
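The pair counts TP, TN, FP and FN described above translate directly into code. The Python sketch below uses the standard definitions RI = (TP + TN) / (TP + TN + FP + FN) and FMI = TP / sqrt((TP + FP)(TP + FN)). NMI is omitted because several normalizations are in common use and the one in Tab. 3 is not assumed here. The function names and labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count object pairs as TP, TN, FP, FN by comparing true and cluster labels."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1  # different true labels, same cluster
        else:
            fn += 1  # same true labels, different clusters
    return tp, tn, fp, fn


def rand_index(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def fowlkes_mallows(tp, tn, fp, fn):
    if tp == 0:
        return 0.0
    # Geometric mean of precision tp/(tp+fp) and recall tp/(tp+fn).
    return tp / sqrt((tp + fp) * (tp + fn))


counts = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
print(counts, rand_index(*counts), fowlkes_mallows(*counts))
```

Enumerating all pairs is quadratic in the number of objects, which is acceptable for datasets of the size used here; a contingency-table formulation would be preferable for larger data.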

Benchmarks and Experimental Design
Some binary or categorical data clustering algorithms, including CABOSFV_CS, CABOSFV_D, HABOC and K-modes, are selected for comparison with the proposed CASP algorithm. All algorithms are described in Tab. 4. CABOSFV_CS proposes a data sorting method to improve CABOSFV. CABOSFV_D is a high dimensional data clustering algorithm based on extended dissimilarity. HABOC is an improved version of CABOSFV; it is a hierarchical clustering procedure that does not require a pre-set threshold parameter. K-modes is a representative partition-based clustering algorithm for categorical data. It should be noted that CABOSFV is not included in the baselines. This is because CABOSFV_CS, CABOSFV_D and HABOC are improved versions of CABOSFV, and it has been shown in [20,23,24] that the clustering effectiveness of these improved algorithms is better than that of CABOSFV.
All of these algorithms except CASP need parameters to be set in advance. Since the proposed parameter determination method is suitable for CABOSFV-based algorithms which need to determine the threshold, both CABOSFV_CS and CABOSFV_D use this method to determine the threshold and get the final clustering result. For HABOC and K-modes, the number of clusters is varied over {2, 3, …, 25}, and the best clustering results are selected as the final results for algorithm comparison. In particular, CABOSFV_D is sensitive to data order, so the algorithm is repeated ten times with randomly ordered datasets and the average of the clustering results is taken as the final result.

Results and Discussions
The experiments are carried out on a personal computer with the Windows 10 operating system, an Intel Core i5 8250u CPU and 8 GB of memory. All algorithms are implemented in MATLAB.
The clustering results of the CASP algorithm and the baseline algorithms on seven datasets with three metrics are reported in Tab. 5 to Tab. 7, and each table corresponds to one evaluation metric. The best results for each dataset are indicated in bold. The last row of each table represents the average performance of each algorithm on the seven datasets. As seen in Tab. 5 to Tab. 7, CASP shows the best clustering performance on most datasets among all the compared methods. More specifically, CASP gets the best results with all three metrics on five of the seven datasets, namely ZO, SO, SF, LY and BC. HABOC achieves the best results with all three metrics on two of the seven datasets, namely SO and VO. On dataset VO, CASP ranks third in terms of RI and FMI and second in terms of NMI. On dataset AU, although CASP does not perform as well as K-modes and HABOC on the FMI metric, it performs best on the other two metrics. From the last row of each table, it is clear that CASP gets the best average on all three metrics compared with the baseline algorithms.
Moreover, we plot the average performance of each algorithm in Fig. 1 to compare these algorithms. In Fig. 1, the three subgraphs a, b, and c respectively represent the average results of all algorithms on the three metrics RI, FMI and NMI. Fig. 1 shows that the performance ranking of the algorithms is consistent across the three metrics. No matter which metric is used, CASP outperforms the other algorithms, and CABOSFV_D is second only to CASP. CABOSFV_CS and HABOC rank third and fourth, respectively. K-modes and HABOC obtain similar clustering results.
It can be inferred from the above analysis that the threshold determination method proposed in this paper is effective for CABOSFV-based algorithms which need to determine the threshold. Furthermore, the overall clustering effectiveness of CASP is better than that of the baselines, including existing improved CABOSFV algorithms and the classical categorical data clustering algorithm K-modes, which supports the superiority of the proposed approach.
In addition, the performance of CASP is compared with the other algorithms in terms of execution time. The running time of the five algorithms on each dataset is recorded in Tab. 8, and we can see that CASP has the lowest running time on most datasets. The last row of the table presents the average running time of each algorithm over the seven datasets, and we plot the average running time in Fig. 2 for comparison. Tab. 8 shows that CASP has the lowest average running time. As seen in Fig. 2, the average running times of CASP, CABOSFV_CS, CABOSFV_D and K-modes are similar and much smaller than that of HABOC. HABOC adopts the hierarchical clustering framework with a time complexity of $O(n^3)$. Therefore, HABOC is computationally expensive. Compared with HABOC, the proposed method not only solves the problem of parameter selection, but also maintains the efficiency of the algorithm.
In summary, considering both clustering effectiveness and computational complexity, CASP obtains better clustering performance than the baseline algorithms. With the development of information technology, the volume of data generated in real-world applications is increasing. CASP is well suited to processing such large-scale data because of its low computational complexity. Moreover, CASP is able to determine its parameter automatically, which is convenient for users.

CONCLUSION
Determining the threshold parameter is an essential but difficult step in CABOSFV-based clustering algorithms, and it directly affects the stability of clustering. A clustering algorithm based on sparse feature vector without specifying parameter is proposed in this study to simplify the user's process of trying parameters. By defining an upper bound of the SFD threshold, the threshold range can be determined theoretically rather than empirically. In addition, CASP defines the adjusted sparse feature vector (ASFV), which combines the sparseness index and adjustment index to improve the stability and accuracy of the clustering. When using the proposed CASP algorithm for clustering, only the input dataset is needed to get the final clustering result, which makes the algorithm simpler and more practical. Based on the experiments on 7 UCI datasets, we compare the performance of the proposed CASP with several existing clustering techniques, including CABOSFV_CS, CABOSFV_D, HABOC, and K-modes. The experimental results with three external evaluation metrics indicate that the CASP algorithm has better clustering effectiveness than the baseline algorithms. CASP not only solves the difficulty of deciding the threshold parameter, but also has high computing efficiency. Moreover, the clustering results are stable and reliable. It is therefore well suited to practical applications.
In the proposed CASP algorithm, a suitable threshold parameter can be determined to acquire final clustering results. However, the parameter found by this method is usually a relatively suitable parameter, not necessarily the optimal one. It is difficult to define an evaluation function that measures the clustering results of CASP. Therefore, the optimization of the CASP evaluation function needs further study. In future research, we will try some intelligent algorithms to solve this problem.