The key idea of our algorithm is to shift the calculation of density from individual points to discrete bins defined by a uniform grid. In this way, the clustering step scales with the number of non-empty bins, which is significantly smaller than the number of points in lower-dimensional data sets. The overall time complexity of our algorithm is therefore dominated by the binning step, which is of order O(N). This is significantly better than the time complexity of DBSCAN, which is of order O(N log N). The definitions and the algorithm are presented in the following subsections.
Definition
The key terms involved in the algorithm are defined in this subsection. A graphical example can be found in Fig. 1, and a short code sketch of these definitions follows the list.
- \(N_{bin}\) is the number of equally sized bins in each dimension. In theory, there are \((N_{bin})^{d}\) bins in the data space, where d is the number of dimensions. However, in practice we only consider the non-empty bins. The number of non-empty bins (N) can be far smaller than \((N_{bin})^{d}\), especially for high-dimensional data. Each non-empty bin is assigned an integer index i = 1…N.
- \(Bin_{i}\) is labelled by a tuple of d positive integers \(C_{i}=(C_{i1},C_{i2},C_{i3},\ldots,C_{id})\), where \(C_{i1}\) is the coordinate (the bin index) in dimension 1. For example, if \(Bin_{i}\) has coordinate \(C_{i}=(2,3,5)\), this bin is located in the second bin in dimension 1, the third bin in dimension 2 and the fifth bin in dimension 3.
- The distance between \(Bin_{i}\) and \(Bin_{j}\) is defined as
$$ Dist(C_{i}, C_{j})=\sqrt{\sum_{k=1}^{d}\left(C_{ik}-C_{jk}\right)^{2}} $$
(1)
- \(Bin_{i}\) and \(Bin_{j}\) are defined to be directly connected if \(Dist(C_{i},C_{j}) \leqslant \sqrt{\epsilon}\), where ε is a user-specified parameter.
- \({Den}_{b}(C_{i})\) is the density of \(Bin_{i}\), defined as the number of points in \(Bin_{i}\).
- \({Den}_{c}(C_{i})\) is the collective density of \(Bin_{i}\), calculated by
$$ {Den}_{c}(C_{i})=\sum_{\{j \,|\, {Bin}_{j} \text{ and } {Bin}_{i} \text{ are directly connected}\}}{Den}_{b}(C_{j}) $$
(2)
- \(Bin_{i}\) is a core bin if
  1. \({Den}_{b}(C_{i})\) is larger than \(MinDen_{b}\), a user-specified parameter;
  2. \({Den}_{b}(C_{i})\) is larger than the densities of ρ% of its directly connected bins, where ρ is a user-specified parameter;
  3. \({Den}_{c}(C_{i})\) is larger than \(MinDen_{c}\), a user-specified parameter.
- \(Bin_{i}\) is a border bin if it is not a core bin but is directly connected to a core bin.
- \(Bin_{i}\) is an outlier bin if it is neither a core bin nor a border bin.
- \(Bin_{a}\) and \(Bin_{b}\) are in the same cluster if they satisfy one of the following conditions:
  1. they are directly connected and at least one of them is a core bin;
  2. they are not directly connected but are connected by a sequence of directly connected core bins.
- Two points are in the same cluster if they belong to the same bin or if their corresponding bins belong to the same cluster.
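Read literally, these definitions map onto a few lines of code. The Python sketch below is illustrative only and is not FlowGrid's implementation: the containers `coords` and `den_b`, the parameter names, and the reading of condition 2 (denser than ρ% of the directly connected bins) are our assumptions.

```python
# Minimal sketch of the bin-level definitions above; not FlowGrid's actual code.
# `coords` maps each non-empty bin index i to its integer coordinate tuple C_i,
# and `den_b` maps i to the number of points in that bin (Den_b).

def directly_connected(c_i, c_j, eps):
    """Bin_i and Bin_j are directly connected if Dist(C_i, C_j) <= sqrt(eps),
    i.e. the squared Euclidean distance between their coordinates is <= eps."""
    return sum((a - b) ** 2 for a, b in zip(c_i, c_j)) <= eps

def collective_density(i, coords, den_b, eps):
    """Den_c(C_i): the sum of Den_b over every bin directly connected to Bin_i (Eq. 2)."""
    return sum(den_b[j] for j in coords
               if directly_connected(coords[i], coords[j], eps))

def is_core_bin(i, coords, den_b, eps, min_den_b, rho, min_den_c):
    """Check the three core-bin conditions; condition 2 is read here as being
    denser than at least rho per cent of the directly connected bins."""
    if den_b[i] <= min_den_b:                                        # condition 1
        return False
    neighbours = [j for j in coords
                  if j != i and directly_connected(coords[i], coords[j], eps)]
    denser_than = sum(den_b[i] > den_b[j] for j in neighbours)
    if neighbours and denser_than < rho / 100.0 * len(neighbours):   # condition 2
        return False
    return collective_density(i, coords, den_b, eps) > min_den_c     # condition 3
```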
Algorithm
Algorithm 1 describes the key steps of FlowGrid. It starts by normalising the values in each dimension to range between 1 and \((N_{bin}+1)\), and uses the integer part of each normalised value as the coordinate of its corresponding bin. The SearchCore algorithm is then applied to discover the core bins and their directly connected bins. Once the core bins and connections are found, Breadth-First Search (BFS) is used to group the connected bins into clusters. Finally, each cell is assigned the label of its corresponding bin.
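As a rough illustration of the binning and BFS steps, the sketch below assumes the data sit in a NumPy array with one row per cell; the function names, the `connections` mapping and the set of `core_bins` are our own constructions rather than FlowGrid's documented interface.

```python
# Illustrative sketch of the binning and BFS grouping described above.
from collections import defaultdict, deque
import numpy as np

def assign_bins(data, n_bin):
    """Normalise each dimension to [1, n_bin + 1) and use the integer part of the
    normalised value as the bin coordinate; return coordinate -> cell indices."""
    mins, maxs = data.min(axis=0), data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    scaled = 1.0 + n_bin * (data - mins) / span
    coords = np.clip(np.floor(scaled).astype(int), 1, n_bin)
    bins = defaultdict(list)
    for row, c in enumerate(map(tuple, coords)):
        bins[c].append(row)
    return bins

def bfs_clusters(core_bins, connections):
    """Group bins into clusters by breadth-first search, expanding only through
    core bins; `connections[b]` lists the bins directly connected to b."""
    labels, next_label = {}, 0
    for start in core_bins:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            b = queue.popleft()
            for nb in connections.get(b, ()):
                if nb not in labels:
                    labels[nb] = next_label
                    if nb in core_bins:   # border bins take the label but do not expand
                        queue.append(nb)
        next_label += 1
    return labels
```

In this sketch, `assign_bins` yields both the bin coordinates and the per-bin point counts needed for \({Den}_{b}\), and `bfs_clusters` returns one label per bin, which is then propagated to the cells in that bin.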
Evaluation
Procedure
FlowGrid aims to be an ultrafast and accurate clustering algorithm for very large flow cytometry data, so both its accuracy and its scalability need to be evaluated. The benchmark data sets from Flow-CAP [3], the multi-centre CyTOF data from Li et al. [9] and the SeaFlow project [10] are selected to compare the performance of FlowGrid against other state-of-the-art algorithms: FlowSOM, FlowPeaks and FLOCK. These three algorithms are chosen because they are widely used, generally considered fast, and achieve good accuracy.
Three benchmark data sets from Flow-CAP [3] are selected for evaluation: the Diffuse Large B-cell Lymphoma (DLBL), Hematopoietic Stem Cell Transplant (HSCT), and Graft versus Host Disease (GvHD) data sets. Each data set contains 10-30 samples with 3-4 markers, and each sample includes 2,000-35,000 cells.
The multi-centre CyTOF data set from Li et al. [9] provides a labelled data set with 16 samples. Each sample contains 40,000-70,000 cells and 26 markers. Since only 8 of the 26 markers are identified as relevant in the original paper [9], only these 8 markers are used for clustering.
We also use three data sets from the SeaFlow project [10], each of which contains many samples. Instead of analysing the samples independently, we analyse the concatenated data sets as in the original paper [10]; these concatenated data sets contain 12.7, 22.7 and 23.6 million cells respectively. Each data set includes 15 features, but the original study uses only four of them for clustering analysis: forward scatter (small and perpendicular), phycoerythrin, and chlorophyll (small) [10].
In the evaluation, we treat the manual gating labels as the gold standard for measuring clustering quality. In the pre-processing step, we apply the inverse hyperbolic sine function with a factor of 5 to transform the multi-centre data and the SeaFlow data. As the Flow-CAP and multi-centre CyTOF data contain many samples and we treat each sample as a data set, we run all algorithms on each sample. Performance is measured by the ARI and the runtime, which are reported as arithmetic means (\(\bar{x}\)) and standard deviations (sd). For the SeaFlow data, we treat each concatenated data set as a single data set, and all algorithms are applied to these concatenated data sets.
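For concreteness, we read this transform as arcsinh(x/5), the usual inverse hyperbolic sine with a cofactor of 5; the snippet below is a minimal sketch under that assumption.

```python
# Sketch of the pre-processing transform, assuming it is arcsinh(x / 5).
import numpy as np

def arcsinh_transform(data, cofactor=5.0):
    """Apply x -> arcsinh(x / cofactor) element-wise to a (cells x markers) array."""
    return np.arcsinh(np.asarray(data, dtype=float) / cofactor)
```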
To evaluate the scalability of each algorithm, we down-sample the largest concatenated data set from the SeaFlow project, generating 10 sub-sampled data sets in which the numbers of cells range from 20 thousand to 20 million.
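One possible way to generate such sub-sampled data sets is sketched below; the random sampling and the helper name are illustrative, not a description of the exact procedure used.

```python
# Illustrative down-sampling of a large data set (rows = cells) into smaller subsets.
import numpy as np

def subsample_sets(data, sizes, seed=0):
    """Draw one random subset of `data` for each requested size."""
    rng = np.random.default_rng(seed)
    return [data[rng.choice(len(data), size=s, replace=False)] for s in sizes]

# e.g. 10 sizes spaced logarithmically between 20 thousand and 20 million cells
sizes = np.logspace(np.log10(2e4), np.log10(2e7), num=10).astype(int)
```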
Performance measure
The efficiency is measured by the runtime, while the clustering performance is measured by the Adjusted Rand Index (ARI). ARI is the corrected-for-chance version of the Rand index [11]. Although it can take negative values when the index is lower than its expected value, it tends to be more robust than many other measures such as the F-measure and the Rand index.
ARI is calculated as follows. Given a set S of n elements and two groupings of cluster labels for these elements (one of ground-truth labels and one of predicted labels), namely \(X=\{X_{1},X_{2},\ldots,X_{r}\}\) and \(Y=\{Y_{1},Y_{2},\ldots,Y_{s}\}\), the overlap between X and Y can be summarised by \(n_{ij}\), where \(n_{ij}=|X_{i} \cap Y_{j}|\) denotes the number of elements in common between \(X_{i}\) and \(Y_{j}\).
$$ ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}}{\frac{1}{2}\left[ \sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2} \right] - \left[ \sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}} $$
where \(a_{i}=\sum_{j} n_{ij}\) and \(b_{j}=\sum_{i} n_{ij}\).
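In practice, the ARI can be computed with an off-the-shelf implementation; the toy example below uses scikit-learn's `adjusted_rand_score`, which computes the same quantity as the equation above.

```python
# Computing the ARI between manual gating labels and predicted cluster labels.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]   # toy ground-truth (manual gating) labels
pred_labels = [0, 0, 1, 2, 2, 2]   # toy predicted cluster labels
print(adjusted_rand_score(true_labels, pred_labels))  # ARI is at most 1 (perfect agreement)
```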
Experimentation
FlowGrid is publicly available as an open-source program on GitHub. FlowSOM and FlowPeaks are available as R packages from Bioconductor. The source code of FLOCK is downloaded from its SourceForge repository. To reproduce all the comparisons presented in this paper, the source code and data can be downloaded from the GitHub repository FlowGrid_compare. We run all experiments on a six-core 2.60 GHz CPU with 32 GB of RAM.
FlowPeaks and FLOCK provide automated versions without any user-input parameters. FlowSOM requires one user-supplied parameter (k, the number of clusters in the meta-clustering step). FlowGrid requires two user-supplied parameters (\(N_{bin}\) and ε). To optimise the results, we try many values of k for FlowSOM and many combinations of \(N_{bin}\) and ε for our algorithm.
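As an illustration of this parameter sweep, the sketch below assumes a hypothetical wrapper `run_flowgrid(data, bin_n=..., eps=...)` that returns one cluster label per cell; it is not FlowGrid's documented API.

```python
# Hypothetical parameter sweep: try each (bin_n, eps) pair and keep the best ARI.
from itertools import product
from sklearn.metrics import adjusted_rand_score

def best_flowgrid_setting(data, gating_labels, bin_n_values, eps_values, run_flowgrid):
    """`run_flowgrid` is an assumed wrapper returning one cluster label per cell."""
    best_params, best_ari = None, -1.0
    for bin_n, eps in product(bin_n_values, eps_values):
        pred = run_flowgrid(data, bin_n=bin_n, eps=eps)
        ari = adjusted_rand_score(gating_labels, pred)
        if ari > best_ari:
            best_params, best_ari = (bin_n, eps), ari
    return best_params, best_ari
```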