CUBOS: An Internal Cluster Validity Index for Categorical Data

Abstract: Internal cluster validity indices are a powerful tool for evaluating clustering performance. Designing such indices for categorical data is challenging because it is difficult to measure the distance between categorical attribute values. Existing efforts ignore the relationship between different categorical attribute values and the detailed distribution information between data objects. To solve these problems, we propose a novel index called Categorical data cluster Utility Based On Silhouette (CUBOS). Specifically, we first clarify the superiority of the Silhouette index paradigm in exploring the details of clustering results. Then, inspired by Category Distance, we propose the Improved Distance metric for Categorical data (IDC) to measure the distance between categorical data objects exactly. Finally, the Silhouette index paradigm and IDC are combined to construct CUBOS, which overcomes the aforementioned shortcomings and produces more accurate evaluation results than other baselines, as shown by experimental results on several UCI datasets.


INTRODUCTION
Clustering is one of the most important tasks in data mining and machine learning. It partitions a dataset into clusters such that data objects are similar to those in the same cluster and dissimilar to those in different clusters, in order to identify the natural structure of the data and mine potentially useful information hidden in it [1]. It has been applied in many real-world domains, including pattern recognition [2], customer segmentation [3], anomaly detection [4,5] and trending topic detection [6]. Since most real-world data lacks labels or other external information, it is hard to identify which clustering algorithms or parameter configurations yield the optimal clustering result. To this end, internal cluster validity indices, which evaluate clustering performance using only the structure of the clustering results, without reference labels or other external information, have attracted considerable research attention [7].
Internal cluster validity indices evaluate clustering performance by considering only the clustered data. They can be broadly classified into numerical data-specific methods and categorical data-specific methods. Numerical data-specific methods are the internal cluster validity indices applied to evaluate the clustering performance of numerical data, and categorical data-specific methods are the indices used to evaluate the clustering performance of categorical data.
Numerical data-specific methods have been studied relatively thoroughly. They evaluate clustering results according to intra-cluster compactness and inter-cluster separation. Many internal cluster validity indices for numerical data have been proposed, such as the Dunn index (D) [8], Calinski-Harabasz index (CH) [9], I index [10], Davies-Bouldin index (DB) [11] and Silhouette index (S) [12]. These indices measure compactness and separation by computing the distance between numerical data objects or centroids, so they are able to reflect the microscopic distribution information between data objects in clustering results and produce relatively accurate evaluation results [13].
For categorical data, it is difficult to compute distance directly. The methods used to measure the similarity or dissimilarity between two categorical data objects, or within a categorical cluster, can be divided into three types: simple matching-based, probability-based and entropy-based approaches. The simple matching-based approach computes the dissimilarity between two categorical data objects according to whether their attribute values are identical; it is typically used in the well-known K-modes algorithm [14]. The probability-based approach measures the similarity or dissimilarity of a categorical cluster by computing the probability of identical attribute values among the data objects in the cluster. The entropy-based approach relies on the association between entropy and clusters: a cluster of similar data objects has lower entropy than a cluster of dissimilar data objects. COOLCAT is a traditional entropy-based categorical data clustering algorithm [15]. All three types of measurement approaches are essentially rooted in the identity of categorical attribute values. Moreover, most of the existing internal cluster validity indices for categorical data rely on these similarity or dissimilarity measurement approaches.
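As a minimal illustration of the simple matching-based approach described above, the following Python sketch counts the attributes on which two categorical data objects differ. The function name `simple_matching` and the toy objects are illustrative assumptions and are not taken from [14].

```python
def simple_matching(x, y):
    """Number of attributes on which two categorical objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)


# Toy objects with three categorical attributes (hypothetical data).
x = ("red", "small", "round")
y = ("red", "large", "round")
z = ("blue", "large", "square")

print(simple_matching(x, y))  # x and y differ on one attribute
print(simple_matching(x, z))  # x and z differ on all three attributes
```

Every pair of different values contributes the same amount, 1, so the measure cannot express that some values of an attribute are closer to each other than others. This is the limitation that motivates the distance metric developed in this paper.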
Several internal cluster validity indices for categorical data have been studied, such as the Cluster Cardinality Index (CCI) [16], Categorical Data Clustering with Subjective factors (CDCS) [17], Information Entropy (IE) [15], Category Utility (CU) [18] and New Condorcet Criterion (NCC) [19]. Among them, CCI and NCC rely on the simple matching-based approach to measure compactness and separation, CDCS and CU rely on the probability-based approach, and IE relies on the entropy-based approach. Because the kernel of these approaches is the identity of categorical attribute values, the existing internal cluster validity indices for categorical data share two deficiencies. One is that the indices only take into account whether attribute values differ, without considering the relationship between different attribute values. The other is that most of them measure compactness and separation only through the similarity or dissimilarity of a whole cluster and cannot measure the similarity or dissimilarity between two categorical data objects, so more detailed information about clustering results cannot be explored. In this paper, we limit our scope to improving internal cluster validity indices for categorical data so as to overcome these two deficiencies.
To explore more details hidden in the clustering results of categorical data, we develop a distance metric for categorical data, called the Improved Distance metric for Categorical data (IDC), which computes the distance between two categorical data objects while considering the relationship between different attribute values. Moreover, the paradigm of the Silhouette (S) index, which obtains significantly better evaluation results than other existing internal cluster validity indices for numerical data [20], is combined with the proposed IDC to construct a novel internal cluster validity index for categorical data, called Categorical data cluster Utility Based On Silhouette (CUBOS).
The main contributions of this paper are summarized as follows. First, we analyse the characteristics of several representative internal cluster validity indices for numerical data and illustrate the essence of each index visually to demonstrate the superiority of the Silhouette (S) index. Second, inspired by the Category Distance presented in an existing work [21], we develop a novel distance metric for categorical data, IDC, which satisfies the distance conditions (non-negativity, symmetry and triangular inequality). IDC computes the distance between two categorical data objects while considering the relationship between different attribute values. Finally, we propose CUBOS, an internal cluster validity index for categorical data that combines IDC with the paradigm of the S index. It not only measures the distance between two categorical data objects accurately but also explores detailed distribution information in clustering results.

RELATED WORK
In this section, we review several typical internal cluster validity indices for numerical data and categorical data and analyse their respective characteristics.

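Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset of $n$ numerical data objects and $\pi = \{C_1, C_2, \ldots, C_k\}$ a clustering result, where $k$ is the number of clusters. The number of data objects in cluster $C_j$ is $n_j$, $j = 1, 2, \ldots, k$. $c$ is the centroid of dataset $X$, and $c_j$ is the centroid of cluster $C_j$.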
(1) Dunn index (D) The Dunn index is formulated as follows:
$$D = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\ j \ne i} \left( \frac{\min_{x \in C_i,\ y \in C_j} d(x, y)}{\max_{1 \le l \le k} \left\{ \max_{x, y \in C_l} d(x, y) \right\}} \right) \right\},$$
where $d(x, y)$ is the distance between data objects $x$ and $y$. The numerator represents the inter-cluster separation as the minimum distance between two data objects in different clusters, and the denominator represents the intra-cluster compactness as the maximum distance between two data objects in the same cluster. A large D value indicates good clustering performance. The distribution diagram of the D index is shown in Fig. 1(a). There are four clusters; the black points in each circle represent the data objects belonging to that cluster and the gray point represents the centroid of each cluster. The solid straight line and the dotted straight line indicate the intra-cluster compactness and the inter-cluster separation, respectively. The D index evaluates the clustering performance based on only two distances, namely the maximum distance within a cluster and the minimum distance between clusters, without considering other distribution information, which results in relatively inaccurate evaluation results.
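The description of the D index above (minimum distance between objects of different clusters divided by the maximum distance between objects of the same cluster) can be sketched in Python as follows. This is a minimal sketch assuming Euclidean distance and at least one cluster with two or more distinct points; the function name `dunn_index` and the toy clusters are illustrative.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of numeric points."""
    # Separation: smallest distance between two objects in different clusters.
    min_between = min(
        math.dist(x, y)
        for ci, cj in combinations(clusters, 2)
        for x in ci
        for y in cj
    )
    # Compactness: largest distance between two objects in the same cluster.
    max_within = max(
        math.dist(x, y)
        for c in clusters
        for x, y in combinations(c, 2)
    )
    return min_between / max_within


# Two toy clusters (hypothetical data).
clusters = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
d_value = dunn_index(clusters)
print(d_value)
```

Only two pairwise distances determine the result, which illustrates why the index ignores most of the distribution information in a clustering result.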
(2) Calinski-Harabasz index (CH) The Calinski-Harabasz index is given as follows:
$$CH = \frac{\sum_{i=1}^{k} n_i\, d^2(c_i, c) / (k - 1)}{\sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, c_i) / (n - k)},$$
where $n$ is the number of data objects in the dataset and $n_i$ is the number of data objects in cluster $C_i$. The numerator represents the inter-cluster separation as the weighted average of the squared distance from the centroid of each cluster to the centroid of the dataset, and the denominator represents the intra-cluster compactness as the squared distance from each data object in a cluster to its centroid. Similarly, a large CH value indicates good clustering performance.
The distribution diagram of the CH index is shown in Fig. 1(b). The white point is the centroid of the dataset. Compared with the D index, the CH index considers the distribution of all data objects. However, the CH index only focuses on centroid-based relationships, i.e. the distance from a data object to its centroid and the distance from the centroid of a cluster to the centroid of the dataset, and not on the relationship between data objects. As a result, the CH index cannot accurately evaluate the clustering performance in some cases. On the one hand, the CH index might judge the inter-cluster separation to be good when each cluster is far from the centroid of the dataset even though some clusters are close to each other. On the other hand, the intra-cluster compactness may be judged to be good when each data object is close to its corresponding centroid even though some data objects in the same cluster are far away from each other.
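To make the centroid-based nature of the CH index concrete, the following Python sketch computes it in its standard form, the between-cluster dispersion divided by (k - 1) over the within-cluster dispersion divided by (n - k). It assumes Euclidean distance; the helper `centroid`, the function name `calinski_harabasz` and the toy clusters are illustrative.

```python
import math


def centroid(points):
    """Component-wise mean of a list of numeric points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def calinski_harabasz(clusters):
    """CH index for a list of clusters, each a list of numeric points."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    c_all = centroid(all_points)
    # Separation: size-weighted squared distance of cluster centroids
    # to the centroid of the whole dataset.
    between = sum(len(c) * math.dist(centroid(c), c_all) ** 2 for c in clusters)
    # Compactness: squared distance of every object to its own centroid.
    within = sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))


# Two toy clusters (hypothetical data).
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
ch_value = calinski_harabasz(clusters)
print(ch_value)
```

No distance between two data objects is ever computed; every term involves a centroid, which is the property criticized in the paragraph above.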
(3) I index (I) The I index is defined as follows:
$$I = \left( \frac{1}{k} \cdot \frac{\sum_{x \in X} d(x, c)}{\sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)} \cdot \max_{i, j} d(c_i, c_j) \right)^{p},$$
where $p$ is a fixed exponent (commonly 2). The inter-cluster separation is measured according to the distance from each data object to the centroid of the dataset and the maximum distance between the centroids of clusters, and the intra-cluster compactness is measured by the distance between each data object and its corresponding centroid. The maximum I index value indicates the optimal clustering result. The distribution diagram of the I index is shown in Fig. 1(c). It is very similar to the distribution diagram of the CH index, except that the I index also measures the distance from each data object to the centroid of the dataset. Although the I index considers more distribution information than the CH index, it still evaluates the clustering performance based on centroid-based distances, and this kind of distance neglects the relationship between clusters or data objects.
(4) Davies-Bouldin index (DB) The Davies-Bouldin index is given as follows:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \left\{ \frac{\frac{1}{n_i} \sum_{x \in C_i} d(x, c_i) + \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j)}{d(c_i, c_j)} \right\},$$
where $n_i$ is the number of data objects in cluster $C_i$. The DB index evaluates the clustering performance by measuring the performance of each cluster separately, based on the similarity of the cluster with its most similar cluster, and averaging over all clusters. A small DB index value indicates good clustering performance.
The distribution diagram of the DB index is shown in Fig. 1(d). It only shows the distances between data objects or centroids involved in measuring the performance of one cluster. The DB index measures the intra-cluster compactness by computing the distance from data objects to the centroid, in the same way as the CH index, so it can likewise produce an inaccurate evaluation result. Furthermore, it measures the inter-cluster separation by computing the distance between centroids, which cannot capture the distance between two clusters exactly, for example when the centroids of two clusters are far apart but their boundaries are actually close to each other, as shown in Fig. 1(d).
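The DB computation described above can be sketched in Python as follows. The sketch assumes the standard form of the index (average scatter of two clusters divided by the distance between their centroids, maximized over the other clusters and averaged over all clusters) and Euclidean distance; `centroid`, `davies_bouldin` and the toy clusters are illustrative names and data.

```python
import math


def centroid(points):
    """Component-wise mean of a list of numeric points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """DB index for a list of clusters, each a list of numeric points."""
    cents = [centroid(c) for c in clusters]
    # Scatter: average distance of the objects of a cluster to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Similarity of cluster i with its most similar other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two toy clusters (hypothetical data).
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
db_value = davies_bouldin(clusters)
print(db_value)
```

The separation term depends only on `math.dist(cents[i], cents[j])`, so two clusters with distant centroids but nearly touching boundaries still receive a favourable score.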
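(5) Silhouette index (S) The Silhouette index is formulated as follows:
$$S = \frac{1}{k} \sum_{i=1}^{k} \left\{ \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max\{b(x), a(x)\}} \right\},$$
$$a(x) = \frac{1}{n_i - 1} \sum_{y \in C_i,\ y \ne x} d(x, y), \qquad b(x) = \min_{j \ne i} \left\{ \frac{1}{n_j} \sum_{y \in C_j} d(x, y) \right\},$$
where $n_i$ is the number of data objects in cluster $C_i$ and $d(x, y)$ is the distance between data objects $x$ and $y$. The S index evaluates the clustering performance by measuring the performance of each data object. The compactness of a data object, $a(x)$, is the average distance from the data object to the other data objects in the same cluster. The separation of a data object, $b(x)$, is the minimum over the other clusters of the average distance from the data object to the data objects in that cluster. A large S index value indicates good clustering performance.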
The distribution diagram of the S index is shown in Fig. 1(e). It only shows the distances related to the compactness and separation of one data object, drawn as a black point with a red edge. The distances computed in the S index are between data objects and do not involve the centroids, and compared with the D index, more data objects are taken into account. Therefore, the S index considers more distribution information and can produce much more accurate evaluation results.
The five typical internal cluster validity indices for numerical data introduced above fall into two groups: the CH, I and DB indices evaluate clustering performance through centroid-based distances, whereas the D and S indices evaluate clustering performance through data object-based distances. Since centroid-based distances neglect the relationship between data objects and therefore reflect the true distribution of clustering results inaccurately, the CH, I and DB indices cannot produce precise evaluation results. Although the D index is based on data object-based distances, it only takes into account a small part of the distribution information, which leads to a defective reflection of the overall distribution of clustering results. The S index evaluates the clustering performance of each data object based on data object-based distances, so the true distribution information is reflected as much as possible, and the S index can produce much more precise evaluation results than the other indices. Based on this, we exploit the paradigm of the S index to construct a novel internal cluster validity index for categorical data.
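The object-level computation of the S index can be sketched in Python as follows. This is a minimal sketch: scores are averaged per cluster and then over clusters, Euclidean distance is the default, and a data object in a singleton cluster is given score 0, which is a convention assumed here and not stated in the text. The name `silhouette_index` and the toy clusters are illustrative.

```python
import math


def silhouette_index(clusters, dist=math.dist):
    """Cluster-averaged Silhouette index for a list of clusters of points."""
    cluster_scores = []
    for i, ci in enumerate(clusters):
        scores = []
        for p, x in enumerate(ci):
            same = [y for q, y in enumerate(ci) if q != p]
            if not same:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            # Compactness: average distance to the other objects of the cluster.
            a = sum(dist(x, y) for y in same) / len(same)
            # Separation: smallest average distance to another cluster.
            b = min(
                sum(dist(x, y) for y in cj) / len(cj)
                for j, cj in enumerate(clusters)
                if j != i
            )
            top = max(a, b)
            scores.append((b - a) / top if top > 0 else 0.0)
        cluster_scores.append(sum(scores) / len(scores))
    return sum(cluster_scores) / len(cluster_scores)


# Two toy clusters (hypothetical data).
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
s_value = silhouette_index(clusters)
print(round(s_value, 4))
```

Every term is a distance between two data objects and no centroid is involved, so the same routine works with any pairwise distance passed as `dist`.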

Internal Cluster Validity Indices for Categorical Data
In the following, $k$ is the number of clusters and $|C_i|$ is the number of data objects in cluster $C_i$.
(1) Cluster Cardinality Index (CCI) The Cluster Cardinality Index is defined in [16].
(2) Categorical Data Clustering with Subjective factors (CDCS) CDCS is built from two terms. The term intra(π) represents the intra-cluster compactness of a clustering result π and is computed from the probabilities of the values on each attribute $a_d$ in each cluster $C_i$. The term inter(π) represents the inter-cluster separation and is computed from $Sim(C_i, C_j)$, the similarity between cluster $C_i$ and cluster $C_j$, which in turn depends on the probability of value $v_{Xd}^{q}$ on attribute $a_d$ in cluster $C_i$. The idea of measuring the similarity between two categorical clusters is that the more identical attribute values the two clusters share, the more similar they are. The best clustering results are indicated by the largest CDCS values.
(3) Information Entropy (IE) The IE index [15] evaluates the clustering performance by exploiting information entropy theory. The basic idea is that the entropy of a cluster of similar data objects is lower than that of a cluster of dissimilar data objects. Therefore, smaller IE index values indicate better clustering results.
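The idea behind the IE index can be illustrated with a small Python sketch. It assumes a COOLCAT-style expected entropy: the entropy of a cluster is taken as the sum of the Shannon entropies of its attributes (attributes treated as independent, logarithm base 2), and the clusters are weighted by their relative sizes. These choices, the function names and the toy data are assumptions for illustration and may differ from the exact definition used in the compared index.

```python
import math
from collections import Counter


def cluster_entropy(cluster):
    """Sum of per-attribute Shannon entropies (in bits) of one cluster."""
    n = len(cluster)
    total = 0.0
    for d in range(len(cluster[0])):
        counts = Counter(obj[d] for obj in cluster)
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total


def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Two hypothetical partitions of the same four objects.
pure = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
mixed = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]

print(expected_entropy(pure))   # identical objects grouped together
print(expected_entropy(mixed))  # dissimilar objects grouped together
```

The partition that groups identical objects gets the lower value, in line with the rule that smaller IE values indicate better clustering results. The computation only uses value frequencies within a cluster, so it carries no information about how close two different values are.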
(4) Category Utility (CU) The CU index [18] evaluates the clustering performance by measuring the identity of attribute values within a cluster. A larger CU index value indicates a better clustering result.
(5) New Condorcet Criterion (NCC) The NCC index [19] combines, for each cluster $C_i$, a compactness term and a separation term: $S_{intra}(C_i)$ denotes the intra-cluster compactness of cluster $C_i$, and $D_{inter}(C_i)$ denotes the inter-cluster separation of cluster $C_i$. Larger NCC index values indicate better clustering results.
From the above description of internal cluster validity indices for categorical data, two facts emerge. One is that the CCI, CDCS, IE and CU indices measure similarity or dissimilarity based on probability according to their definitions. This measurement method only pays attention to the number of occurrences of attribute values and ignores the relationship between different attribute values. The NCC index measures the distance between two data objects based on the matching of all compared attribute values, so it is also incapable of identifying the relationship between different attribute values. The other is that the CCI, CDCS, IE and CU indices evaluate the clustering performance based on the similarity or dissimilarity of a cluster and do not measure the similarity or dissimilarity of individual data objects. Therefore, more detailed distribution information of clustering results cannot be discovered.
Having analysed the characteristics of several typical internal cluster validity indices for numerical and categorical data, we conclude that the Silhouette (S) index obtains more accurate evaluation results than the other internal cluster validity indices for numerical data. Therefore, we exploit the paradigm of the S index to construct a new internal cluster validity index for categorical data. In addition, most of the existing internal cluster validity indices for categorical data have two deficiencies. One is that they overlook the relationship between different attribute values. The other is that they cannot discover detailed distribution information between data objects. To overcome these two deficiencies, we propose a new distance metric for categorical data, IDC, which reflects the relationship between different attribute values and satisfies the distance conditions. With this distance metric, more detailed distribution information in the clustering results can be explored. Moreover, a novel internal cluster validity index for categorical data, CUBOS, is developed by combining IDC and the paradigm of the S index.

CATEGORICAL DATA CLUSTER UTILITY BASED ON SILHOUETTE
Our proposed internal cluster validity index for categorical data, CUBOS, consists of two components: (a) the Improved Distance metric for Categorical data (IDC), inspired by the Category Distance in an existing related work; and (b) the new internal cluster validity index CUBOS, constructed by combining IDC and the paradigm of the Silhouette index. To present our index clearly, we first review and discuss the Category Distance that inspires our research.

Discussion on Category Distance
The Category Distance was proposed in [21]. It relies on the weights of the values on each categorical attribute and no longer depends on the independence assumption that there is no relationship between the values on the same attribute. To define a distance formula satisfying the distance conditions (non-negativity, symmetry and triangular inequality), [21] provides a general distance metric paradigm, Eq. (16), for two values c and c' on the same attribute. The paradigm is built from two dissimilarity terms for an attribute value c: ρ(c), which applies when two data objects take the identical value c on the attribute, and a counterpart term, which applies when the two data objects take different values, one of which is c. It was proved that any distance metric meeting this paradigm satisfies the distance conditions.
The Category Distance proposed in their work meets this paradigm. Its two dissimilarity terms are defined in Eq. (17) from a weight assigned to each attribute value $v_{Xd}^{l}$, which reflects the contribution of $v_{Xd}^{l}$ to the distance computation. The exponent 1/β > 1 controls the strength of the contribution of attribute values. According to Eq. (16) and Eq. (17), the Category Distance, Eq. (18), was developed. In their work, the computation of the distance metric in Eq. (18) was converted into an optimization problem, namely the optimization of the attribute value weights, which was solved by a clustering algorithm. That means the distance between two categorical attribute values cannot be computed directly but is fused into the clustering procedure.
The Category Distance discards the independence assumption between categorical attribute values in order to reflect the relationship between different values on the same categorical attribute, and it satisfies the three distance conditions, so it can be applied flexibly within the paradigms of the more fully studied internal cluster validity indices for numerical data. Nevertheless, Category Distance has two defects. On the one hand, according to Eq. (18), data objects sharing an identical categorical attribute value are still treated as heterogeneous, which causes the distance between two data objects with identical values on all categorical attributes to be greater than 0. On the other hand, the distance computation and the clustering procedure are integrated, so it is impossible to compute the distance separately for other tasks.
In this paper, we improve the Category Distance to overcome its two defects. Firstly, we develop a computation method for the weights of attribute values based on the whole dataset X, under the assumption that uncommon attribute values contribute more weight than common attribute values. This idea is consistent with information theory, in which events with lower occurrence probability provide more information than events with higher occurrence probability. Secondly, we adjust the general distance metric paradigm in Eq. (16) so that the distance between two data objects with identical values on all categorical attributes is 0. Finally, we propose the Improved Distance metric for Categorical data (IDC) based on the developed computation method for the weights of attribute values and the adjusted general distance metric paradigm.

Improved Distance Metric for Categorical Data
(1) Developed computation method for weights of attribute values
The developed computation method for the weights of attribute values is given by Eq. (19) and Eq. (20). It uses the occurrence number of each value $v_{Xd}^{q}$ in dataset X and, for a value $v_{Xd}^{l}$, the set of all values on the d-th categorical attribute in dataset X whose probabilities are equal to or smaller than the probability of $v_{Xd}^{l}$. The weight computation method is derived from the similarity measure proposed by Goodall [22], and it reflects the relationship between different attribute values by giving greater weights to uncommon attribute values.
(2) Adjusted general distance metric paradigm The adjusted paradigm, Eq. (21), differs from Eq. (16) in only one respect: the distance between two identical attribute values is changed to 0. Consequently, the distance between two data objects with identical values on all categorical attributes equals 0. Moreover, any distance metric ψ applying this paradigm, where ψ(a, b) is the distance between attribute values a and b, satisfies the distance conditions. Eq. (21) obviously satisfies non-negativity and symmetry. The triangular inequality ψ(a, b) ≤ ψ(a, c) + ψ(c, b) is verified through five cases, according to which of a, b and c coincide. Based on the developed computation method for the weights of attribute values and the adjusted general distance metric paradigm, we can propose a distance metric for categorical data to be applied in the S index for evaluating clustering performance.
(3) Improved Distance metric for Categorical data (IDC) The Improved Distance metric for Categorical data (IDC) applies the adjusted paradigm with the weights λ_X(c), which are computed according to Eq. (19) and Eq. (20). The exponent 1/β controls the strength of the weights.
IDC discards the assumption that the different values on the same attribute are independent of each other and can express their relationship. Additionally, IDC satisfies the distance conditions, so it can be applied directly in the existing distance-based internal cluster validity indices.

Categorical Data Cluster Utility Based on Silhouette
Considering the superiority of the Silhouette (S) index over other internal cluster validity indices for numerical data, we combine IDC and the paradigm of the S index to develop an internal cluster validity index for categorical data named Categorical data cluster Utility Based On Silhouette (CUBOS). CUBOS takes the form of the S index, with the distance between two data objects computed by IDC. The CUBOS index inherits the strength of the S index, which evaluates the clustering performance using data object-based distances and thus exposes as much of the detailed distribution information in clustering results as possible. Besides, the IDC used in the CUBOS index computes an exact distance between two categorical data objects that satisfies the distance conditions, rather than just estimating their similarity or dissimilarity. Meanwhile, IDC considers the relationship between different values on the same categorical attribute and no longer relies on the independence assumption.
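The way a Silhouette-style index with a categorical distance is used to choose between candidate partitions can be sketched in Python as follows. This sketch does not implement IDC: a simple mismatch count, `hamming`, stands in for the distance, and a real use would pass an IDC implementation as `dist` instead. The function names, the toy objects and the convention of score 0 for singleton clusters are assumptions for illustration.

```python
from collections import defaultdict


def hamming(x, y):
    """Stand-in categorical distance: number of mismatching attributes."""
    return sum(a != b for a, b in zip(x, y))


def silhouette_with_distance(objects, labels, dist):
    """Silhouette-style score for labelled objects under a given distance."""
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)
    cluster_means = []
    for lab, idxs in members.items():
        scores = []
        for p in idxs:
            same = [q for q in idxs if q != p]
            if not same:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            a = sum(dist(objects[p], objects[q]) for q in same) / len(same)
            b = min(
                sum(dist(objects[p], objects[q]) for q in other) / len(other)
                for other_lab, other in members.items()
                if other_lab != lab
            )
            top = max(a, b)
            scores.append((b - a) / top if top > 0 else 0.0)
        cluster_means.append(sum(scores) / len(scores))
    return sum(cluster_means) / len(cluster_means)


# Four hypothetical objects and two candidate partitions.
objects = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "flat"),
]
candidates = {"good": [0, 0, 1, 1], "bad": [0, 1, 0, 1]}
scores = {
    name: silhouette_with_distance(objects, labels, hamming)
    for name, labels in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

The partition with the larger score is selected, which is how an internal index picks a clustering result without reference labels. Passing the distance as a parameter keeps the Silhouette paradigm separate from the choice of categorical distance.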

EXPERIMENTAL RESULTS
Extensive experiments on several datasets from UCI are conducted to illustrate the effectiveness of CUBOS.

Experimental Datasets
Five typical categorical datasets from UCI are selected for the experiments. Tab. 1 lists these datasets. The BC and CVR datasets contain missing values, and we delete the data objects containing missing values before clustering.

Evaluation Metrics
External cluster validity indices assess the consistency between clustering labels and true labels, so they can be used to evaluate the performance of internal indices. However, different external indices may lead to different measurement results for the same clustering results. Therefore, we use seven external cluster validity indices, namely Accuracy (A), Adjusted Rand Index (ARI), F-measure (F), Micro-p (M), Normalized Mutual Information (NMI), Purity (P) and Rand Index (RI) [23,24], to evaluate the performance of the clustering results selected by internal indices, as shown in Tab. 2.
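Two of the external indices listed above, Purity and the Rand Index, can be sketched in Python as follows. The sketch uses their standard definitions: Purity sums, over clusters, the size of the largest class inside each cluster and divides by n, and the Rand Index is the fraction of object pairs on which the class labels and the cluster labels agree. The function names and the toy labels are illustrative.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_labels):
    """Fraction of objects belonging to the majority class of their cluster."""
    counts = Counter(zip(cluster_labels, true_labels))
    best = {}
    for (cluster, _), count in counts.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)


def rand_index(true_labels, cluster_labels):
    """Fraction of object pairs treated consistently by classes and clusters."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        agree += same_class == same_cluster
        pairs += 1
    return agree / pairs


# Hypothetical labels for six data objects.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]

p_value = purity(true_labels, cluster_labels)
ri_value = rand_index(true_labels, cluster_labels)
print(p_value, ri_value)
```

In this toy example one object is assigned to the wrong cluster, so both indices fall below 1. Because the two indices penalize the error differently, the example also shows why several external indices are used side by side.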

Baselines and Experimental Configurations
We compare the proposed CUBOS index with five baselines, which are introduced in Tab. 3. The K-modes algorithm is used to conduct clustering with the number of clusters ranging from 2 to n, where n is the number of data objects in the dataset. We preset the parameter β ∈ {0.05, 0.1, 0.15, …, 0.95} for the CUBOS index.

Performance Comparison
The evaluation results are reported in Tab. 4 to Tab. 10. The decimals in the tables are the evaluation scores for the clustering results chosen by each internal index, and the integers in brackets indicate the ranking of the internal indices by effectiveness.
First, we focus on the evaluation results with A as the metric. In Tab. 4, CUBOS, IE and CU obtain better evaluation results than the other indices, and CCI performs worst. Although CUBOS, IE and CU each rank first on three datasets, CUBOS ranks second and third on its remaining two datasets, which is better than the rankings of IE and CU on their remaining two datasets. Thus, CUBOS is relatively more effective than the other indices with A as the metric.
Next, ARI is used as the evaluation metric. In Tab. 5, CUBOS ranks first on all datasets, and its effectiveness significantly surpasses that of the other indices. CCI is the second-best index, and IE is the worst index, ranking low on all datasets.
With respect to F (see Tab. 6), CUBOS is again the most effective index. CCI is slightly worse than CUBOS, and IE is the worst index.
We now focus on the results shown in Tab. 7. CUBOS obtains the best evaluation results with M as the metric, although CCI performs equally well: both rank first on these five datasets. Additionally, IE is still the worst-performing index, and its best ranking is only fifth place.
Next, we focus on Tab. 8. CUBOS also ranks first on all datasets with NMI as the metric. CU ranks first on four datasets, and CCI and NCC perform best on three datasets. IE performs worst.
With regard to P, Tab. 9 shows that CUBOS, IE and CU each rank first three times, and CCI performs poorly on all datasets. On the two datasets where it does not rank first, CUBOS performs better than IE and CU do on their remaining two datasets.
Finally, Tab. 10 shows the evaluation results with RI as the metric. CUBOS performs excellently compared with the other indices. CCI and CU are the second-best indices, and IE is the worst index. From the above analysis, CUBOS consistently chooses a better clustering partition than the other internal indices for categorical data, whichever external index is used.
Furthermore, we count how often each internal index obtains each ranking in the experiments, as shown in Fig. 2. CUBOS ranks first most frequently, its worst ranking is third, and it ranks second or third on only a few datasets. Therefore, the performance of the CUBOS index proposed in this paper is significantly superior to that of the other indices.

CONCLUSION
In this paper, we present a new internal cluster validity index for categorical data named CUBOS, in which a distance metric for categorical data, IDC, is derived from the Category Distance in an existing work and the paradigm of the S index is used to construct CUBOS. The proposed index considers the relationship between different categorical attribute values and measures the distance between two categorical data objects exactly. Furthermore, combining the paradigm of the S index with IDC allows much more detailed distribution information in the clustering results of categorical data to be explored and more precise evaluation results to be obtained. Experimental results on several UCI datasets show that CUBOS outperforms the other internal cluster validity indices for categorical data that were compared. This demonstrates the reliable performance of our index and suggests wide applicability in practice.

Figure 1 Distribution diagrams


Case 1: a = b = c. According to Eq. (21), when a = b, ψ(a, b) = 0. Similarly, ψ(a, c) = 0 and ψ(c, b) = 0. Hence, ψ(a, b) ≤ ψ(a, c) + ψ(c, b).
Case 2: a = b, a ≠ c and b ≠ c. We have ψ(a, b) = 0, and ψ(a, c) + ψ(c, b) ≥ 0 by non-negativity. Hence, ψ(a, b) ≤ ψ(a, c) + ψ(c, b).

In Tab. 2, a dataset with n data objects from $k_t$ classes is considered, and the number of data objects which are from class j and partitioned into cluster i is $n_{ij}$. Additionally, a refers to the number of data object pairs that belong to different classes and are clustered into different clusters, and b refers to the number of data object pairs that belong to the same class and are clustered into the same cluster. The larger the values of the external indices are, the better the performance of the clustering results chosen by internal indices for categorical data.

Figure 2 The number of rankings of the internal cluster validity indices compared

Table 1 Summary of datasets

Table 2 External cluster validity indices used in the experiments

Table 3 Summary of algorithms to be compared

Table 4 Evaluation of all indices with A as metric

Table 5 Evaluation of all indices with ARI as metric

Table 6 Evaluation of all indices with F as metric

Table 7 Evaluation of all indices with M as metric

Table 8 Evaluation of all indices with NMI as metric

Table 9 Evaluation of all indices with P as metric