The key idea of our algorithm is to replace the calculation of density on individual points with a calculation on discrete bins defined by a uniform grid. In this way, the clustering step of the algorithm scales with the number of nonempty bins, which is significantly smaller than the number of points in lower-dimensional data sets. The overall time complexity of our algorithm is therefore dominated by the binning step, which is on the order of O(N). This is significantly better than the time complexity of DBSCAN, which is on the order of O(N log N). The definitions and algorithm are presented in the following subsections.
Definition
The key terms involved in the algorithm are defined in this subsection. A graphical example can be found in Fig. 1.

N_{bin} is the number of equally sized bins in each dimension. In theory, there are (N_{bin})^{d} bins in the data space, where d is the number of dimensions. In practice, however, we only consider the nonempty bins. The number of nonempty bins (N) is much smaller than (N_{bin})^{d}, especially for high-dimensional data. Each nonempty bin is assigned an integer index i=1…N.
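As an illustrative sketch (not the authors' implementation), the nonempty bins can be stored in a hash map keyed by coordinate tuples, so memory scales with the number of nonempty bins N rather than with (N_{bin})^{d}; the function name `bin_points` is hypothetical:

```python
from collections import Counter

def bin_points(points):
    """Map each point to its bin coordinate tuple and count points per bin.

    Assumes each value has already been normalised to [1, N_bin + 1), so
    the integer part of the value is the bin index in that dimension.
    Only nonempty bins are stored, so memory scales with the number of
    nonempty bins N rather than with (N_bin)**d.
    """
    return Counter(tuple(int(v) for v in p) for p in points)

# Three 2-d points; the first two fall in the same bin.
dens = bin_points([(1.2, 3.7), (1.9, 3.1), (4.5, 2.2)])
# dens[(1, 3)] is the Den_b of bin (1, 3)
```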

Bin_{i} is labelled by a tuple of d positive integers C_{i}=(C_{i1},C_{i2},C_{i3},…,C_{id}), where C_{ik} is the coordinate (the bin index) in dimension k. For example, if Bin_{i} has coordinate C_{i}=(2,3,5), this bin is located in the second bin of dimension 1, the third bin of dimension 2 and the fifth bin of dimension 3.

The distance between Bin_{i} and Bin_{j} is defined as
$$ Dist(C_{i}, C_{j})=\sqrt{\sum_{k=1}^{d}\left(C_{ik}-C_{jk}\right)^{2}} $$
(1)

Bin_{i} and Bin_{j} are defined to be directly connected if \(Dist(C_{i},C_{j}) \leqslant \sqrt {\epsilon }\), where ε is a user-specified parameter.
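A minimal sketch of Eq. (1) and the direct-connection test, assuming bins are represented as coordinate tuples (the names `bin_distance` and `directly_connected` are hypothetical):

```python
import math

def bin_distance(ci, cj):
    """Euclidean distance between two bin coordinate tuples (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, cj)))

def directly_connected(ci, cj, eps):
    """Bins are directly connected if Dist(Ci, Cj) <= sqrt(eps)."""
    return bin_distance(ci, cj) <= math.sqrt(eps)
```

With ε = 2, for example, diagonal neighbours at distance √2 are directly connected, while bins two steps apart in one dimension (distance 2) are not.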

Den_{b}(C_{i}) is the density of Bin_{i}, which is defined as the number of points in Bin_{i}.

Den_{c}(C_{i}) is the collective density of Bin_{i}, calculated by
$$ {Den}_{c}(C_{i})=\sum_{\{j \,:\, {Bin}_{j} \text{ and } {Bin}_{i} \text{ are directly connected}\}}{Den}_{b}(C_{j}) $$
(2)
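Equation (2) can be sketched as below. One modelling assumption made here: since Dist(C_{i}, C_{i}) = 0 ≤ √ε, each bin is treated as directly connected to itself, so its own Den_{b} is included in the sum. The quadratic scan over nonempty bins is for clarity only:

```python
def collective_density(dens, eps):
    """Den_c for every nonempty bin (Eq. 2): the sum of Den_b over all
    directly connected bins. Dist(Ci, Ci) = 0 <= sqrt(eps), so each bin
    counts its own Den_b. A real implementation would enumerate only the
    fixed set of integer offsets with squared norm at most eps instead
    of scanning all nonempty bins.
    """
    den_c = {}
    for ci in dens:
        den_c[ci] = sum(
            dens[cj] for cj in dens
            if sum((a - b) ** 2 for a, b in zip(ci, cj)) <= eps
        )
    return den_c
```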

Bin_{i} is a core bin if

1. Den_{b}(C_{i}) is larger than MinDen_{b}, a user-specified parameter;

2. Den_{b}(C_{i}) is larger than that of ρ% of its directly connected bins, where ρ is a user-specified parameter;

3. Den_{c}(C_{i}) is larger than MinDen_{c}, a user-specified parameter.

Bin_{i} is a border bin if it is not a core bin but it is directly connected to a core bin.

Bin_{i} is an outlier bin if it is neither a core bin nor a border bin.
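The three core-bin conditions and the border/outlier rules can be sketched as follows. This assumes a precomputed `neighbours` map of directly connected bins (excluding the bin itself), and interprets condition 2 as Bin_{i} being denser than ρ% of its directly connected bins; the function name is hypothetical:

```python
def classify_bins(dens, den_c, neighbours, min_den_b, min_den_c, rho):
    """Label each nonempty bin as 'core', 'border', or 'outlier'.

    dens       : Den_b per bin coordinate tuple
    den_c      : Den_c per bin coordinate tuple
    neighbours : directly connected bins of each bin (excluding itself)
    """
    labels = {}
    for ci, nbrs in neighbours.items():
        denser = sum(dens[ci] > dens[cj] for cj in nbrs)
        is_core = (dens[ci] > min_den_b                              # condition 1
                   and (not nbrs or denser >= rho / 100 * len(nbrs)) # condition 2
                   and den_c[ci] > min_den_c)                        # condition 3
        labels[ci] = 'core' if is_core else 'outlier'
    # A non-core bin directly connected to a core bin is a border bin.
    for ci, nbrs in neighbours.items():
        if labels[ci] != 'core' and any(labels[cj] == 'core' for cj in nbrs):
            labels[ci] = 'border'
    return labels
```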

Bin_{a} and Bin_{b} are in the same cluster if they satisfy one of the following conditions:

1. they are directly connected and at least one of them is a core bin;

2. they are not directly connected but are connected by a sequence of directly connected core bins.

Two points are in the same cluster if they belong to the same bin or their corresponding bins belong to the same cluster.
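The grouping rules above can be sketched with a breadth-first search that expands clusters only through core bins, so that border bins can join a cluster but never extend it (a sketch under the same assumptions as before, not the authors' code):

```python
from collections import deque

def cluster_bins(labels, neighbours):
    """Assign a cluster id to every bin; outlier bins keep id None.

    BFS starts from each unvisited core bin; border bins are absorbed
    into the cluster but are not expanded further, matching the rule
    that clusters grow only through sequences of core bins.
    """
    cluster = {ci: None for ci in labels}
    next_id = 0
    for start in labels:
        if labels[start] != 'core' or cluster[start] is not None:
            continue
        cluster[start] = next_id
        queue = deque([start])
        while queue:
            ci = queue.popleft()
            for cj in neighbours[ci]:
                if labels[cj] != 'outlier' and cluster[cj] is None:
                    cluster[cj] = next_id
                    if labels[cj] == 'core':  # only core bins expand further
                        queue.append(cj)
        next_id += 1
    return cluster
```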
Algorithm
Algorithm 1 describes the key steps of FlowGrid. First, the values in each dimension are normalised to range between 1 and (N_{bin}+1), and the integer part of each normalised value is used as the coordinate of its corresponding bin. Next, the SearchCore algorithm is applied to discover the core bins and their directly connected bins. Once the core bins and connections are found, breadth-first search (BFS) is used to group the connected bins into clusters. Finally, the cells are labelled by the labels of their corresponding bins.
Evaluation
Procedure
FlowGrid aims to be an ultrafast and accurate clustering algorithm for very large flow cytometry data, so both its accuracy and its scalability need to be evaluated. The benchmark data sets from FlowCAP [3], the multicentre CyTOF data from Li et al. [9] and the SeaFlow project [10] are selected to compare the performance of FlowGrid against other state-of-the-art algorithms: FlowSOM, FlowPeaks, and FLOCK. These three algorithms are chosen because they are widely used, are generally considered to be quite fast, and have good accuracy.
Three benchmark data sets from FlowCAP [3] are selected for evaluation: the Diffuse Large B-cell Lymphoma (DLBCL), Hematopoietic Stem Cell Transplant (HSCT), and Graft versus Host Disease (GvHD) data sets. Each data set contains 10–30 samples with 3–4 markers, and each sample includes 2,000–35,000 cells.
The multicentre CyTOF data set from Li et al. [9] provides a labelled data set with 16 samples. Each sample contains 40,000–70,000 cells and 26 markers. Since only 8 of the 26 markers were determined to be relevant in the original paper [9], only these 8 markers are used for clustering.
We also use three data sets from the SeaFlow project [10], each containing many samples. Instead of analysing the samples independently, we analyse the concatenated data sets as in the original paper [10]; these concatenated data sets contain 12.7, 22.7 and 23.6 million cells respectively. Each data set includes 15 features, but the original study uses only four of them for clustering analysis: forward scatter (small and perpendicular), phycoerythrin, and chlorophyll (small) [10].
In the evaluation, we treat the manual gating labels as the gold standard for measuring clustering quality. In the preprocessing step, we apply the inverse hyperbolic sine function with a factor of 5 to transform the multicentre data and the SeaFlow data. As the FlowCAP and multicentre CyTOF data contain many samples, we treat each sample as a data set and run all algorithms on each sample. Performance is measured by ARI and runtime, reported as arithmetic means (\(\bar {x}\)) and standard deviations (sd). For the SeaFlow data, we treat each concatenated data set as a single data set, and all algorithms are applied to these concatenated data sets.
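The preprocessing transform mentioned above can be sketched as follows (the function name is hypothetical; a factor of 5 matches the setting used in this evaluation):

```python
import math

def arcsinh_transform(values, factor=5):
    """Apply the inverse hyperbolic sine transform x -> asinh(x / factor),
    a standard variance-stabilising preprocessing step for cytometry data.
    """
    return [math.asinh(v / factor) for v in values]
```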
To evaluate the scalability of each algorithm, we downsample the largest concatenated data set from the SeaFlow project, generating 10 subsampled data sets in which the numbers of cells range from 20 thousand to 20 million.
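One simple way to generate such subsampled data sets (a sketch of uniform random sampling without replacement, not necessarily the exact procedure used here):

```python
import random

def subsample(cells, size, seed=0):
    """Draw a uniform random subsample of `size` cells without
    replacement; a fixed seed keeps the subsample reproducible."""
    rng = random.Random(seed)
    return rng.sample(cells, size)
```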
Performance measure
Efficiency is measured by runtime, while clustering performance is measured by the Adjusted Rand Index (ARI), the corrected-for-chance version of the Rand index [11]. Although the ARI can be negative when the index falls below its expected value, it tends to be more robust than many other measures such as the F-measure and the Rand index.
ARI is calculated as follows. Given a set S of n elements and two groupings of these elements (one of ground-truth labels and one of predicted labels), namely X={X_{1},X_{2},…,X_{r}} and Y={Y_{1},Y_{2},…,Y_{s}}, the overlap between X and Y can be summarised by n_{ij}, the number of objects in common between X_{i} and Y_{j}: n_{ij}=|X_{i}∩Y_{j}|.
$${\begin{aligned} ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2} \right] - \left[\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}} \end{aligned}} $$
where \(a_{i}=\sum _{j} n_{ij}\) and \(b_{j}=\sum _{i} n_{ij}\)
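The formula above can be implemented directly from the contingency table; a minimal sketch, not tied to any particular library:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from two equal-length label lists, following the
    corrected-for-chance formula: (index - expected) / (max - expected)."""
    n = len(truth)
    nij = Counter(zip(truth, pred))   # contingency table counts n_ij
    a = Counter(truth)                # row sums a_i
    b = Counter(pred)                 # column sums b_j
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1.0 regardless of how the cluster ids are permuted, which is why ARI is a suitable measure against manual gating labels.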
Experimentation
FlowGrid is publicly available as an open-source program on GitHub. FlowSOM and FlowPeaks are available as R packages from Bioconductor. The source code of FLOCK is downloaded from its SourceForge repository. To reproduce all the comparisons presented in this paper, the source code and data can be downloaded from the GitHub repository FlowGrid_compare. We ran all experiments on a six-core 2.60 GHz CPU with 32 GB of RAM.
FlowPeaks and FLOCK provide automated versions without any user-input parameters. FlowSOM requires one user-supplied parameter (k, the number of clusters in the meta-clustering step). FlowGrid requires two user-supplied parameters (N_{bin} and ε). To optimise the results, we try many values of k for FlowSOM and many combinations of N_{bin} and ε for our algorithm.