Clustering Single-cell RNA-sequencing Data based on Matching Clusters Structures

Single-cell sequencing technology can generate RNA-sequencing data at the single-cell level, and one important analysis task for such data is to identify cell types without supervised information. Clustering is an unsupervised approach that can help find new insights into biology, especially for exploring the biological functions of specific cell types. However, it is challenging for traditional clustering methods to obtain high-quality cell-type recognition results. In this research, we propose a novel Clustering method based on Matching Clusters Structures (MCSC) for identifying cell types in single-cell RNA-sequencing data. Firstly, MCSC obtains two different groups of clustering results from the same K-means algorithm, since its initial centroids are randomly selected. Then, for one group, MCSC uses shared nearest neighbour information to calculate a label transition matrix, which denotes the label transition probability between any two initial clusters. Each initial cluster may be reassigned if the merging result after label transition satisfies a consensus function that maximizes the structural matching degree of the two groups of clustering results. In essence, MCSC may be interpreted as a label training process. We evaluate the proposed MCSC on five commonly used datasets and compare it with several classical and state-of-the-art algorithms. The experimental results show that MCSC outperforms the other algorithms.


INTRODUCTION
Single-cell sequencing is a recently developed technique for better understanding cellular heterogeneity [1,2], and it generates RNA-sequencing data that consist of cells with many genes. Single-cell RNA-sequencing data analysis has attracted much attention in the field of bioinformatics, especially for identifying cell types. However, analysing single-cell RNA-sequencing data effectively is a big challenge. This is because single-cell RNA-sequencing data have very high dimensionality and a high level of noise [3]; only some dimensions (gene expression levels) differ much, i.e., most attributes may not be helpful for identifying cell types. A simple and easy-to-understand way of analysing single-cell RNA-sequencing data is to use clustering algorithms, which are unsupervised learning methods that do not use class labels. More specifically, clustering algorithms group data points into multiple clusters with an objective function or a cluster-structure hypothesis, such as the K-means clustering algorithm [4], the density-based spatial clustering algorithm with noise [5], the affinity propagation clustering algorithm [6] and the spectral clustering algorithm [7]. However, the above-mentioned traditional clustering algorithms do not work well for analysing single-cell RNA-sequencing data, because traditional metrics (such as Euclidean distance) lose validity when data points become sparse in high-dimensional space. An alternative similarity metric is based on shared nearest neighbours, which have proven to be an effective and robust way of describing relationships between data points in high-dimensional space [8]. Concretely, the shared nearest neighbours are the intersection of the neighbour lists of a pair of data points.
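To illustrate this idea, the following minimal sketch (not part of the original paper; the function name and the parameter name `knum` are our assumptions) counts, for every pair of points, how many of their nearest neighbours coincide:

```python
import numpy as np

def shared_nearest_neighbours(X, knum=5):
    """Count shared nearest neighbours for every pair of points:
    the size of the intersection of their knum-nearest-neighbour lists."""
    n = len(X)
    # pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude each point itself
    nbrs = np.argsort(d, axis=1)[:, :knum]  # knum nearest neighbours per point
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(nbrs[i]) & set(nbrs[j]))
            snn[i, j] = snn[j, i] = shared
    return snn
```

Points in the same dense region share many neighbours even when their raw Euclidean distance is uninformative, which is why this measure is more robust in high dimensions.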
There exist some methods that group data into clusters based on shared nearest neighbours. Guha et al. proposed a robust clustering algorithm for categorical attributes that uses the number of neighbouring points to cluster categorical data [9]. Jarvis et al. built a near-neighbour list of every data point so as to compute similarities [10]. Ertoz et al. proposed an improved density-based clustering algorithm based on shared nearest neighbours to identify clusters of varying densities and shapes [11]. Based on these previous successful applications, shared nearest neighbours have proven capable of better revealing the relationships among data points in high-dimensional space [12].
Based on the advantages of shared nearest neighbour similarity, we propose a novel clustering algorithm called Matching Clusters Structures-based Clustering algorithm (MCSC). Five commonly used public real-world datasets are used to evaluate the proposed MCSC and we compare it with four classical methods (Spectral clustering algorithm, K-means clustering algorithm, principal component analysis [13], and t-distributed stochastic neighbour embedding [14]) as well as two state-of-the-art methods (they will be described in Section 2).
The rest of this paper is organized as follows. In Section 2, we review related work. In Section 3, we present the details of our proposed MCSC method. In Section 4, we report the experimental results with discussions. In Section 5, we conclude the paper and propose future work.

RELATED WORKS
Clustering is an important approach to identifying cell types from single-cell RNA-sequencing data, and it has attracted great attention from many researchers.
To the best of our knowledge, many algorithms have been proposed to identify cell types and help find new insights into biology. To name a few, Jiang et al. designed a similarity measure based on differentiability correlation between cell pairs and combined it with hierarchical clustering to form a variance analysis-based clustering algorithm, which can find the true number of clusters automatically and identify cell types efficiently [15]. Wolf et al. developed a scalable toolkit for clustering single-cell RNA-sequencing data [16]. Kiselev et al. proposed a single-cell consensus clustering method, which is a useful tool for unsupervised clustering [17]. Nikolenko et al. introduced a novel algorithm based on Hamming graphs and Bayesian sub-clustering for error correction in single-cell sequencing data [18]. Aibar et al. developed a computational method for simultaneous gene regulatory network reconstruction and cell-state identification from single-cell RNA-sequencing data [19]. Seyoung Park and Hongyu Zhao use multiple doubly stochastic similarity matrices to learn a similarity measure, called the MultiPle similarity Sparse Spectral Clustering algorithm (MPSSC) [20]. Xu et al. proposed a clustering algorithm incorporating a shared nearest neighbour graph and quasi-clique recognition methods to identify cell types from single-cell transcriptomes [21]. Wang et al. proposed a Single-cell Interpretation method via Multi-kernel LeaRning (SIMLR), which improves the visualization and interpretability of single-cell RNA-sequencing data [22]. Both SIMLR, proposed by Wang et al., and MPSSC, proposed by Park et al., are based on metric learning. We select SIMLR and MPSSC as benchmarking models because they are well-recognised algorithms.
Previous methods did not consider an unsupervised learning method that combines different initial clusters into a unified cluster based on their structural matching degree. In this research, we propose a novel clustering method based on matching clusters structures, namely MCSC. It combines multiple grouping results of the same dataset with the aim of producing superior results.

CLUSTERING BASED ON MATCHING CLUSTERS STRUCTURES (MCSC)
In this section, we present the proposed MCSC, which uses a consensus function based on matching clusters structures to decide whether two initial clusters are merged into one. MCSC first uses the K-means clustering algorithm to generate the initial clusters. For the two groups of K-means results, R_i and R_j, we design a novel consensus function based on shared nearest neighbours to train the results of K-means. Based on the shared nearest neighbour information between different initial clusters, one cluster may be merged into another, and the consensus function determines whether the merging process is reasonable. Finally, the categories of some original initial clusters will change, and we take the final results as output. To illustrate our algorithm more intuitively, the flow chart of the proposed MCSC is shown in Fig. 1.
We first introduce two basic tools: K-means clustering algorithm and a popular external evaluation criterion called Normalized Mutual Information (NMI) in Sections 3.1 and 3.2. The details of our algorithm are described in Section 3.3. The time complexity of the proposed MCSC is given in Section 3.4.

K-means
We choose the well-known K-means clustering algorithm as the initial clusters' generation method [23]. The objective function of K-means is defined as follows:

J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2,

where \mu_i is the centroid of the initial cluster C_i and k denotes the number of centroids. Note that the k initial centroids are randomly selected, so the clustering results may differ even though the parameter k is fixed, as shown in Tab. 1, where F-measure is chosen as the evaluation metric. F-measure is a commonly used evaluation metric: a pair of points (x_i, x_j) is counted as TP if they have the same label and are grouped into the same cluster, as FP if they have different labels but are grouped into the same cluster, and as FN if they have the same label but are grouped into different clusters.
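A minimal sketch of this objective (function and argument names are ours, assuming NumPy arrays for the data, labels and centroids):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances of each point
    to the centroid of its assigned cluster."""
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```

Because K-means only locally minimises this objective from random starting centroids, two runs with the same k can converge to different values, which is exactly the variability MCSC exploits.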

Precision = #TP / (#TP + #FP)

Recall = #TP / (#TP + #FN)

F-measure = (2 × Precision × Recall) / (Precision + Recall)

In the above equations, #TP represents the number of point pairs belonging to TP, #FP denotes the number of pairs belonging to FP and #FN denotes the number of pairs belonging to FN. In brief, we use R_i to denote the results of the i-th run of K-means with the parameter k fixed; possibly R_i ≠ R_{i+1}. Many methods have been proposed to improve K-means, such as automatically selecting k or the centroids.
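The pairwise counting scheme above can be sketched directly (a toy illustration; the function name is ours):

```python
from itertools import combinations

def pairwise_f_measure(labels_true, labels_pred):
    """F-measure over pairs of points: TP = same label & same cluster,
    FP = different label & same cluster, FN = same label & different cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_label = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_label and same_cluster:
            tp += 1
        elif not same_label and same_cluster:
            fp += 1
        elif same_label and not same_cluster:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Note that the measure depends only on which pairs are grouped together, so it is invariant to a relabelling of the clusters.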

Normalized Mutual Information (NMI)
In the proposed MCSC algorithm, we take Normalized Mutual Information (NMI) as a consensus function to measure the structural similarity of any two groups of clusters [24]. NMI is a popular external evaluation criterion for cluster quality. For the ground-truth A and a group of clustering results B of a dataset D, the unique values in A define a vector X and the unique values in B define a vector Y. The NMI value of A and B is then defined as follows:

NMI(A, B) = \frac{\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}}{\sqrt{H(X) H(Y)}}, \quad H(X) = -\sum_{x \in X} p(x) \log p(x),

where p(x) denotes the probability of x in A, p(y) denotes the probability of y in B, and p(x, y) denotes their joint probability. Usually, we use the ground-truth and the clustering results to compute the NMI value, whose range is [0, 1]. The closer the NMI value is to 1, the higher the quality of the clustering result.
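A small frequency-based estimate of this quantity (a sketch; the square-root normalisation is one common convention, and the function name is ours):

```python
import numpy as np
from collections import Counter

def nmi(a, b):
    """NMI(A, B) = I(X;Y) / sqrt(H(X) H(Y)),
    estimated from the label frequencies of two labelings a and b."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # mutual information I(X;Y)
    mi = sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    # entropies H(X) and H(Y)
    ha = -sum((c / n) * np.log(c / n) for c in pa.values())
    hb = -sum((c / n) * np.log(c / n) for c in pb.values())
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Two labelings that induce the same partition score 1 even if the label names differ, which is what makes NMI usable as a consensus function between two K-means runs.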

Clustering Based on Matching Clusters Structures (MCSC)
In this section, we present the steps of the proposed MCSC.
Step 1: obtain two groups of clustering results R_i and R_j by K-means with the same parameter k.
We run K-means twice with the parameters unchanged. We thus obtain two groups of results, R_i and R_j, which contain the grouping information. MCSC first deals with R_i and then uses R_j to train R_i with a consensus function.
Step 2: calculate the label transition matrix T.

Shared nearest neighbours can effectively represent the structure of high-dimensional data [25]. In MCSC, we use them to find the relationship between two initial clusters. We use A_{ij} to denote the j-th nearest neighbour of data point x_i. The similarity between data points x_i and x_j is defined as follows:

S(x_i, x_j) = \sum_{A_{im} = A_{jn}} (knum - m)(knum - n),

where knum denotes the number of nearest neighbours considered for each data point, the sum runs over the shared nearest neighbours of x_i and x_j, and m and n denote the positions of a shared nearest neighbour in the nearest-neighbour lists of x_i and x_j (the i-th and j-th rows of A), respectively. When S(x_i, x_j) > θ, data points x_i and x_j are connected and are very likely to be merged into one cluster. The parameter θ is user-defined. We use con(x_i, x_j) to denote the connected state of data points x_i and x_j.

For the initial clusters C_i in R_i, we make a strong hypothesis that a bigger initial cluster is more likely to be a major part of a natural cluster. Thus, clusters with a small number of data points may be merged into bigger ones. MCSC decides which initial clusters are micro-clusters (mc) based on their number of data points: the micro-clusters are the initial clusters with a relatively small size |C_i|, where |C_i| denotes the number of data points in the initial cluster C_i. We refer to the remaining initial clusters as core clusters (cc). We then propose a transition matrix to encode the shared nearest neighbour information between micro-clusters (mc) and core clusters (cc). The transition matrix T between mc and cc is defined as follows:

T_{ij} = \frac{\#con(x_i, x_j)}{\sum_{C_l \in cc} \#con(x_i, x_l)}, (11)

where #con(x_i, x_j) denotes the number of connected point pairs with x_i ∈ C_i, x_j ∈ C_j, C_i ∈ mc, C_j ∈ cc, and R_i = cc ∪ mc. T_{ij} describes the probability that the micro-cluster C_i is merged into the core cluster C_j. Obviously, for each micro-cluster C_i, we may merge it into the core cluster C_j with the maximum probability.
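The transition-matrix step can be sketched as follows (a reconstruction under our assumptions: `con` is a boolean connectivity matrix obtained from S(x_i, x_j) > θ, and we row-normalise the connection counts; function and variable names are ours, not the paper's):

```python
import numpy as np

def label_transition_matrix(labels, con, micro, core):
    """T[a, b]: probability that micro-cluster micro[a] is merged into
    core cluster core[b], computed as the number of connected point pairs
    between the two clusters, normalised over all core clusters."""
    T = np.zeros((len(micro), len(core)))
    for a, ci in enumerate(micro):
        pts_i = np.where(labels == ci)[0]
        for b, cj in enumerate(core):
            pts_j = np.where(labels == cj)[0]
            # count connected pairs between cluster ci and cluster cj
            T[a, b] = con[np.ix_(pts_i, pts_j)].sum()
        s = T[a].sum()
        if s > 0:
            T[a] /= s   # rows sum to 1 when any connection exists
    return T
```

Each row of T then points every micro-cluster at the core cluster it is most strongly connected to.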
To assess the rationality of the merging process, we employ the well-known NMI metric, which measures the structural similarity of two groups of clusters. If the NMI value between the results after the merging process and the other group of K-means results (R_j) increases, the merge is successful. The proposed MCSC then continues to assign the remaining micro-clusters. The whole label-training procedure is given in the next step.
Step 3: train the results of K-means with the consensus function.
For the two groups of results R_i and R_j generated by K-means, we use the well-established cluster validity index NMI to construct a consensus function:

R_{i+1} = \arg\max_{R'} \, NMI(R', R_j), (12)

where R' ranges over the results obtained from R_i by merging micro-clusters into core clusters. As before, the unique values in R_i define a vector X and the unique values in R_j define a vector Y. We take one group of clustering results as the ground-truth in order to train the other group. When all the data points of C_i are merged into C_j in R_i, the merging process is considered reasonable if NMI(R_i, R_j) increases.
In short, each micro-cluster C_i is merged into the most suitable core cluster, and MCSC obtains the results that satisfy Eq. (12). The solution procedure of Eq. (12) is presented in Algorithm 1. MCSC has three parameters: k, θ and knum. Parameter k denotes the number of centroids; parameter θ is the cut-off value on the similarity between data points, which decides which pairs are regarded as connected, con(x_i, x_j); parameter knum denotes the number of nearest neighbours. In our implementation, we set θ = 0 and knum = 5, while k depends on the dataset.
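A greedy sketch of this label-training loop (our reconstruction: the processing order of micro-clusters and the tie-breaking are assumptions, `score` stands for any NMI implementation, and all names are ours):

```python
import numpy as np

def train_labels(Ri, Rj, T, micro, core, score):
    """Try to merge each micro-cluster into its highest-probability core
    cluster (per the transition matrix T) and keep the merge only if the
    consensus score(labels, Rj) increases."""
    labels = Ri.copy()
    best = score(labels, Rj)
    for a, ci in enumerate(micro):
        cj = core[int(np.argmax(T[a]))]   # most probable target core cluster
        trial = labels.copy()
        trial[trial == ci] = cj           # merge micro-cluster ci into cj
        s = score(trial, Rj)
        if s > best:                      # accept only if consensus improves
            labels, best = trial, s
    return labels
```

Each accepted merge reduces the number of clusters by one, so the loop terminates after at most |mc| merges.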
After the above training process, we obtain the final results R_{i+1}. In practice, if R_i is closer to the ground-truth, the final results will be better.

Complexity Analysis
The MCSC algorithm is based on K-means, whose time complexity is O(nkt), where n is the number of data points, k the number of clusters and t the number of iterations. Calculating the label transition matrix takes O(n^2) time, and the label training process takes O(n) time. Thus, the overall time complexity of the proposed MCSC is O(n^2). In terms of time complexity, MCSC is not higher than the other benchmarking models.

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we use five single-cell RNA datasets to evaluate the performance of our proposed method and analyse the results.
We take NMI as the evaluation index. The code of SC, K-means, PCA, t-SNE, SIMLR and MPSSC can be downloaded online. We use MATLAB R2014a to implement our algorithm and report the best results among 100 trials. Note that we use the raw data without pre-processing.
In the meantime, we report the running time of all algorithms, including the four classical methods (Spectral Clustering (SC), K-means, Principal Component Analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE)) and the two state-of-the-art methods (SIMLR and MPSSC). The running time on the first four datasets is measured in seconds and on the last in minutes. The values of parameter k are selected as before.

Discussion
We propose a novel clustering method based on matching clusters structures, namely MCSC. MCSC can improve the results of clustering algorithms through a label training process. Different initial-cluster generation methods may have a great impact on the results. We select K-means as the initial-cluster generation method, which may obtain unsatisfactory results when the structure of the dataset is not convex.
The centroid selection of K-means is random, so we need to run the experiments many times. In fact, among 100 experimental runs, several results are better than those of the other algorithms. In our implementation, we fix the parameters θ and knum, so users only need to adjust the parameter k. The k value is usually larger than the number of classes.

Deng [26]       0.7602   0.8666   0.0027
Treutlein [27]  0.6860   0.8286   0.0059
Pollen [28]     0.9183   0.9534   0.0006
Tasic [29]      0.4455   0.4746   0.0005
Buettner [30]   0.5846   0.7594   0.0038

As shown in Fig. 2 and Tab. 3, the maximum NMI values of MCSC are better than those of the other algorithms. For the Deng and Treutlein datasets, our algorithm has obvious advantages because their structure is suitable for the K-means algorithm. The advantage is slightly smaller for the Pollen and Tasic datasets. The structure of the Buettner dataset is non-convex and K-means easily obtains suboptimal results on it, so the NMI value of MCSC is lower than those of SIMLR and MPSSC.
From the viewpoint of running time, all the algorithms were run on the same device and software. The more advanced algorithms, including MCSC, have no advantage in running time. SC, K-means and PCA take the least time on all five datasets, while SIMLR, MPSSC and MCSC take more time because they build on basic algorithms. On all datasets, the running time of MCSC is close to that of t-SNE, SIMLR and MPSSC.
Overall, the proposed MCSC obtains better results with comparable running time in most cases.

CONCLUSION AND FUTURE WORK
In this research, we propose a novel method for improving clustering results based on matching clusters structures without true labels. However, the performance of MCSC depends on the algorithm that generates the initial clusters. If the structure of a dataset is non-convex, MCSC may obtain unsatisfactory results because it uses K-means to generate the initial clusters. In future work, we plan to further improve the label training process and choose more appropriate clustering algorithms to obtain the initial clusters.