 Methods
 Open Access
 Published:
Ultrafast clustering of singlecell flow cytometry data using FlowGrid
BMC Systems Biology volume 13, Article number: 35 (2019)
Abstract
Background
Flow cytometry is a popular technology for quantitative singlecell profiling of cell surface markers. It enables expression measurement of tens of cell surface protein markers in millions of single cells. It is a powerful tool for discovering cell subpopulations and quantifying cell population heterogeneity. Traditionally, scientists use manual gating to identify cell types, but the process is subjective and is not effective for large multidimensional data. Many clustering algorithms have been developed to analyse these data but most of them are not scalable to very large data sets with more than ten million cells.
Results
Here, we present a new clustering algorithm that combines the advantages of densitybased clustering algorithm DBSCAN with the scalability of gridbased clustering. This new clustering algorithm is implemented in python as an open source package, FlowGrid. FlowGrid is memory efficient and scales linearly with respect to the number of cells. We have evaluated the performance of FlowGrid against other stateoftheart clustering programs and found that FlowGrid produces similar clustering results but with substantially less time. For example, FlowGrid is able to complete a clustering task on a data set of 23.6 million cells in less than 12 seconds, while other algorithms take more than 500 seconds or get into error.
Conclusions
FlowGrid is an ultrafast clustering algorithm for large singlecell flow cytometry data. The source code is available at https://github.com/VCCRI/FlowGrid.
Background
Recent technological advancement has made it possible to quantitatively measure the expression of a handful of protein markers in millions of cells in a flow cytometry experiment [1]. The ability to profile such a large number of cells allows us to gain insights into cellular heterogeneity at an unprecedented resolution. Traditionally, cell types are identified based on manual gating of several markers in flow cytometry data. Manual gating relies on visual inspection of a series of two dimensional scatter plots, which makes it difficult to discover structure in high dimensions. It also suffers subjectivity, in terms of the order in which pairs of protein markers are explored, and the inherent uncertainty of manually drawing the cluster boundaries [2]. An emerging solution is to use unsupervised clustering algorithms to automatically identify clusters in potentially multidimensional flow cytometry data.
The Flow Cytometry Critical Assessment of Population Identification Methods (FlowCAP) challenge has compared the performance of many flow cytometry clustering algorithms [3]. In the challenge, ADIcyt has the highest accuracy but has a long runtime, which makes it impractical for routine usage. Flock [4] maintains a high accuracy and reasonable runtime. After the challenge, several algorithms have been built for flow cytometry data analysis such as FlowPeaks [5], FlowSOM [6] and BayesFlow [7].
FlowPeaks and Flock are largely based on kmeans clustering. kmeans clustering requires the number of clusters (k) to be defined prior to the analysis. It is hard to determine a suitable k in practice. FlowPeaks performs kmeans clustering with a large initial k, and iteratively merges nearby clusters that are not separated by low density regions into one cluster. Flock utilises grids to identify high density regions, which the algorithm then uses to identify initial cluster centres for kmeans clustering. This gridbased method of identifying high density region allows kmeans clustering to converge much quicker compared to using random initialisation of cluster centres, and also directly identifies a suitable value for k. FlowSOM starts with training SelfOrganising Map (SOM), followed by consensus hierarchical clustering of the cells for metaclustering. In the algorithm, the number of clusters (k) is required for metaclustering.
BayesFlow uses a Bayesian hierarchical model to identify different cell populations in one or many samples. The key benefit of this method is its ability to incorporate prior knowledge, and captures the variability in shapes and locations of populations between the samples [7]. However, BayesFlow tends to be computational expensive as Markov Chain Monte Carlo sampling requires a large number of iterations. Therefore, BayesFlow is often impractical for flow cytometry data sets of realistic size.
These algorithms perform well on the FlowCAP data sets, but they may not be scalable to larger data sets that we are dealing with nowadays – those with tens of millions of cells. Aiming to quantify cell population heterogeneity in huge data sets, we have to develop an ultrafast and scalable clustering algorithm.
In this paper, we present a new clustering algorithm that combines the benefit of DBSCAN [8] (a widelybased densitybased clustering algorithm) and a gridbased approach to achieve scalability. DBSCAN is fast and can detect clusters with complex shapes in the presence of outliers [8]. DBSCAN starts with identifying core points that have a large number of neighbours within a userdefined region. Once the core points are found, nearby core points and closely located noncore points are grouped together to form clusters. This algorithm will identify clusters that are defined as highdensity regions that are separated by the lowdensity regions. However, DBSCAN is memory inefficient if the data set is very large, or has large highly connected components.
To reduce the computational search space and memory requirement, our algorithm extends the idea of DBSCAN by using equalspaced grids like Flock. We implemented our algorithm in an open source python package called FlowGrid. Using a range of real data sets, we demonstrate that FlowGrid is much faster than other stateoftheart flow cytometry clustering algorithms, and produce similar clustering results. The detail of the algorithm is presented in the Methods section.
Methods
The key idea of our algorithm is to replace the calculation of density from individual points to discrete bins as defined by a uniform grid. This way, the clustering step of the algorithm will scale with the number of nonempty bins, which is significantly smaller than the number of points in lower dimensional data sets. Therefore the overall time complexity of our algorithm is dominated by the binning step, which is in the order of O(N). This is significantly better than the time complexity of DBSCAN, which is in the order of O(N log N). The definition and algorithm are presented in the following subsections.
Definition
The key terms involved in the algorithm are defined in this subsection. A graphical example can be found in Fig. 1.

N_{bin} is the number of equally sized bins in each dimension. In theory, there are (N_{bin})^{d} bins in the data space, where d is the number of dimensions. However, in practice, we only consider the nonempty bins. The number of nonempty bins (N) is less than (N_{bin})^{d}, especially for high dimensional data. Each nonempty bin is assigned an integer index i=1…N.

Bin_{i} is labelled by a tuple with d positive integers C_{i}=(C_{i1},C_{i2},C_{i3},…,C_{id}) where C_{i1} is the coordinate (the bin index) at dimension 1. For example, if Bin_{i} has coordinate C_{i}=(2,3,5), this bin is located in second bin in dimension 1, third bin in dimension 2 and the fifth bin in dimension 3.

The distance between Bin_{i} and Bin_{j} is defined as
$$ Dist(C_{i}, C_{j})=\sqrt{\sum_{k=1}^{d}\left(C_{ik}C_{jk}\right)^{2}} $$(1) 
Bin_{i} and Bin_{j} are defined to be directly connected if \(Dist(C_{i},C_{j}) \leqslant \sqrt {\epsilon }\), where ε is a userspecified parameter.

Den_{b}(C_{i}) is the density of Bin_{i}, which is defined as the number of points in Bin_{i}.

Den_{c}(C_{i}) is the collective density of Bin_{i}, calculated by
$$ {Den}_{c}(C_{i})=\sum_{\{j{Bin}_{j} \text{ and} {Bin}_{i} \text{ are directly connected}\}}{Den}_{b}(C_{j}) $$(2) 
Bin_{i} is a core bin if

1
Den_{b}(C_{i}) is larger than MinDen_{b}, a userspecified parameter.

2
Den_{b}(C_{i}) is larger than ρ% of its directly connected bins, where ρ is a userspecified parameter.

3
Den_{c}(C_{i}) is larger than MinDen_{c}, a userspecified parameter.

1

Bin_{i} is a border bin if it is not a core bin but it is directly connected to a core bin.

Bin_{i} is an outlier bin, if it is not a core bin nor a border bin.

Bin_{a} and Bin_{b} are in the same cluster, if they satisfy one of the following conditions:

1
they are directly connected and at least one of them is core bin;

2
they are not directly connected but are connected by a sequence of directly connected core bins.

1

Two points are in the same cluster, if they belong to the same bin or their corresponding bins belong to the same cluster.
Algorithm
Algorithm 1 describes the key steps of FlowGrid, starting with normalising the values in each dimension to range between 1 and (N_{bin}+1). Then, we use the integer part of the normalised value as the coordinate of its corresponding bin. Then, the SearchCore algorithm is applied to discover the core bins and their directly connected bins. Once the core bins and connections are found, Breadth First Search(BFS) is used to group the connected bins into a cluster. The cells are labelled by the label of their corresponding bins.
Evaluation
Procedure
FlowGrid aims to be an ultrafast and accurate clustering algorithm for very large flow cytometry data. Therefore, both the accuracy and scalability performance need to be evaluated. The benchmark data sets from FlowCAP [3], the multicentre CyTOF data from Li et al. [9] and the SeaFlow project [10] are selected to compare the performance of FlowGrid against other stateoftheart algorithms, FlowSOM, FlowPeaks, and FLOCK. These three algorithms are chosen because they are widely used, are generally considered to be quite fast, and have good accuracy.
Three benchmark data sets from FlowCAP [3] are selected for evaluation, including the Diffuse Large Bcell Lymphoma (DLBL), Hematopoietic Stem Cell Transplant (HSCT), and Graft versus Host Disease(GvHD) data set. Each data set contains 1030 samples with 34 markers, and each sample includes 2,00035,000 cells.
The multicentre CyTOF data set from Li et al. [9] provides a labelled data set with 16 samples. Each samples contains 40,00070,000 cells and 26 markers. Since only 8 out fo 26 markers are determined to be relevant markers in the original paper [9], only these 8 markers are used for clustering.
We also use three data sets from the SeaFlow project [10] and they contain many samples. Instead of analysing the independent samples, we analyse the concatenated data sets as the original paper [10] and these concatenated data sets contain 12.7, 22.7 and 23.6 millions of cells respectively. Each data sets include 15 features but the original study only uses four features for clustering analysis. The four features are forward scatter (small and perpendicular), phycoerythrin, and chlorophyll (small) [10].
In the evaluation, we treat the manual gating label as the gold standard for measuring the quality of clustering. In the preprecessing step, we apply the inverse hyperbolic function with the factor of 5 to transform the multicentre data and the SeaFlow data. As the FlowCAP and multicentre CyTOF data contain many samples and we treat each sample as a data set, we run all algorithms on each sample. The performances are measured by the ARI and runtime, which are reported by the arithmetic means (\(\bar {x}\)) and standard deviation (sd). For the Seaflow data sets, we treat each concatenated data set as a data set. In the evaluation, all algorithms are applied on these concatenated data sets.
To evaluate the scalability of each algorithm, we downsample the largest concatenated data set from the SeaFlow project, generating 10 subsampled data sets in which the numbers of cells range from 20 thousand to 20 million.
Performance measure
The efficiency performance is measured by the runtime while the clustering performance is measured by Adjusted Rand Index (ARI). ARI is used to measure the clustering performance. ARI is the correctedforchance version of the Rand index [11]. Although it may result in negative values if the index is less than expected, it tends to be more robust than many other measures like Fmeasure and Rand index.
ARI is calculated as follow. Given a set S of n elements, and two groups of cluster labels (one group of ground truth label and one group of predicted labels) of these elements, namely X={X_{1},X_{2},…,X_{r}} and Y={Y_{1},Y_{2},…,Y_{s}}, the overlap between X and Y can be summarized by n_{ij} where n_{ij} denotes the number of objects in common between X_{i} and Y_{j}: n_{ij}=X_{i}∩Y_{j}.
where \(a_{i}=\sum _{j} n_{ij}\) and \(b_{j}=\sum _{i} n_{ij}\)
Experimentation
FlowGrid is publicly available as an open source program on GitHub. FlowSOM and FlowPeaks are available as R packages from Bioconductor. The source code of Flock is downloaded from its Sourceforge repository. To reproduce all the comparisons presented in this paper, the source code and data can be downloaded from the GitHub repository FlowGrid_compare. We run all the experiments on six 2.60 GHz cores CPU with 32 G RAM.
FlowPeaks and Flock provide automated version without any userinput parameter. FlowSOM requires one usersupplied parameter (k, the number of clusters in metaclustering step). FlowGrid requires two usersupplied parameters (bin_{n} and ε). To optimise the result, we try many k for FlowSOM and many combinations of bin_{n} and ε for our algorithm.
Results
Performance comparison
Table 1 summaries the performance of our algorithm and three other algorithms – FlowSOM, FlowPeaks, and Flock in terms of runtime. Our algorithm is substantially faster than other clustering algorithms in all the data sets. This improvement of runtime is especially substantial in the Seaflow data sets. FLOCK and FlowPeaks sometimes fail to complete in some of the data sets. In a data set of 23.6 million cells, FlowSOM completes the execution in 572 s, whereas FlowGrid completes the execution in only 12 s. This is a 50 × speed up. Table 2 summaries the clustering accuracy performance. In FlowCAP and the multicentre data sets, FlowGrid shares the similar clustering accuracy (in terms of ARI) with other clustering algorithms but in Seaflow data sets, FlowGrid gives higher accuracy than other clustering algorithms.
Figure 2 shows that the clustering results of our algorithm and three other algorithms in a HSCT sample. FlowGrid, FlowSOM and FlowPeaks recover similar number of clusters, and the clustering results are largely similar. Flock generates too many clusters in this case. It is important to note that FlowGrid also identifies cells that do not belong to a main cluster (i.e., a high density region). These cells can be viewed as ’outliers’, and are labelled as ’1’ in Fig. 2. This is a feature that is not present in other clustering algorithms.
Scalability analysis
To further evaluate the scalability of the algorithms, we subsample one Seaflow data set and the sampled data sets range from 20 thousand to 20 million cells. Figure 3 shows the scalability of our algorithm and three other algorithms. Flock has a low runtime when processing a small data set, but its runtime dramatically increases to 6640 s for a 20 millioncell data set. FlowPeaks and FlowSOM share similar scalability but FlowPeaks is not able to execute 20 million data set. Our algorithm have the best performance in the evaluation as FlowGrid is faster than other algorithm in all the sampled data by an order of magnitude.
Parameter robustness analysis
Like other densitybased clustering algorithm, parameter setting is important. In our experience, Bin_{n} and ε are datasetdependent. We recommend trying out different combinations of Bin_{n} between 4 and 15, and ε between 1 and 5. To pick the best parameter combinations, some prior knowledge is helpful such as the expected number of clusters and the proportion of outliers which should be less than 10% in our experience.
We found that other parameters, namely MinDen_{b}, MinDen_{c} and ρ are mostly robust across a wide range of values.
To demonstrate this robustness, we used the benchmark data sets from FlowCAP for a parameter sensitivity analysis. For these experiments, we first set 3, 40, 85, 4 and 1 as the default value for MinDen_{b}, MinDen_{c}, ρ, Bin_{n} and ε, respectively. In each experiment, we only change one parameter to test its sensitivity to the overall classification result. The performance is measured by ARI and runtime. In the first experiment, we varied MinDen_{b} from 1 to 50 while fixing other parameters. In the second experiment, we varied MinDen_{c} ranging from 10 to 300 while fixing other parameters. In the third experiment, we varied ρ ranging from 70 to 95 while fixing other parameters.
Figure 4 demonstrates that the clustering accuracy and runtime are largely insensitive to MinDen_{b}, MinDen_{c} and ρ across a large range of parameter values. The experiments are applied to all the benchmark data sets from FlowCAP and similar results are observed in all the benchmark data sets. In our experiments, when MinDen_{b}, MinDen_{c} and ρ are set to be 3, 40 and 85 respectively, FlowGrid maintains good clustering performance and excellent runtime. They are therefore set as the default parameters for FlowGrid.
Discussion
In this paper, we have developed an ultrafast clustering algorithm, FlowGrid, for singlecell flow cytometry analysis, and compared it against other stateoftheart algorithms such as Flock, FlowSOM and FlowPeaks. FlowGrid borrows ideas from DBSCAN for detection of high density regions and outliers. It does not only perform well in the presence of outliers, but also have great scalability without getting into memory issues. It is both time efficient and memory efficient. FlowGrid shares similar clustering accuracy with stateoftheart flow cytometry clustering algorithms, but it is substantially faster than them. With any given number of markers, the runtime of FlowGrid scales linearly with the number of cells, which is a useful property for extremely large data sets.
MinDen_{b} and MinDen_{c} are density threshold parameters to reduce the search space of high density bins. If the parameters are set very low, the runtime may fractionally increase but the accuracy is not likely to be affected. However, if the parameters are set very high, the runtime will also fractionally decreases but it may lead to separation of real clusters and create spurious outliers. In any case, we showed that the performance of FlowGrid is generally robust against changes in MinDen_{b}, MinDen_{c} and ρ.
The current implementation of FlowGrid is already very fast for most practical purposes. In the future, if the data size grows even larger, it is possible to further speed up FlowGrid by parallelising the binning step of the algorithm, which is currently the most computationally intensive step of the algorithm.
Abbreviations
 ARI:

Adjusted rand index
 BFS:

Breadth first search
 CyTOF:

Mass cytometry
 DBSCAN:

Densitybased spatial clustering of applications with noise
 FlowCAP:

Flow cytometry critical assessment of population identification methods
 SOM:

Selforganising map
References
 1
Weber LM, Robinson MD. Comparison of clustering methods for highdimensional singlecell flow and mass cytometry data. Cytom Part A. 2016; 89(12):1084–96.
 2
Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of highdimensional immunology data. Nat Rev Immunol. 2016; 16(7):449.
 3
Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH, Consortium F, Consortium D, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods. 2013; 10(3):228.
 4
Qian Y, Wei C, EunHyung Lee F, Campbell J, Halliley J, Lee JA, Cai J, Kong YM, Sadat E, Thomson E, et al. Elucidation of seventeen human peripheral blood Bcell subsets and quantification of the tetanus response using a densitybased method for the automated identification of cell populations in multidimensional flow cytometry data. Cytom Part B Clin Cytom. 2010; 78(S1):69–82.
 5
Ge Y, Sealfon SC. flowPeaks: a fast unsupervised clustering for flow cytometry data via kmeans and density peak finding. Bioinformatics. 2012; 28(15):2052–8.
 6
Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, Dhaene T, Saeys Y. FlowSOM: Using selforganizing maps for visualization and interpretation of cytometry data. Cytom Part A. 2015; 87(7):636–45.
 7
Johnsson K, Wallin J, Fontes M. BayesFlow: latent modeling of flow cytometry cell populations. BMC Bioinformatics. 2016; 17(1):25.
 8
Ester M, Kriegel HP, Sander J, Xu X, et al. A densitybased algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96).Portland: AAAI Press: 1996. p. 226–231.
 9
Li H, Shaham U, Stanton KP, Yao Y, Montgomery RR, Kluger Y. Gating mass cytometry data by deep learning. Bioinformatics. 2017; 33(21):3423–30.
 10
Hyrkas J, Clayton S, Ribalet F, Halperin D, Virginia Armbrust E, Howe B. Scalable clustering algorithms for continuous environmental flow cytometry. Bioinformatics. 2015; 32(3):417–23.
 11
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193–218.
Acknowledgments
We thank members of the Ho Laboratory for their valuable comments.
Funding
This work was supported in part by funds from the New South Wales Ministry of Health, a National Health and Medical Research Council Career Development Fellowship (1105271 to JWKH), and a National Heart Foundation Future Leader Fellowship (100848 to JWKH). Publication charge is supported by the Victor Chang Cardiac Research Institute.
Availability of data and materials
∙ Project Name: FlowGrid
∙ Project Home Page: https://github.com/VCCRI/FlowGrid
∙ Operating Systems: Unix, Mac, Windows
∙ Programming Languages: Python
∙ Other Requirements: sklearn, numpy
∙ License: MIT Public License
∙ Any Restrictions to Use By NonAcademics: None
About this supplement
This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume13supplement2.
Author information
Affiliations
Contributions
XY and JWKH initiated and designed the project. XY implemented the algorithm, carried out all the experiments, and wrote the paper. JWKH revised the paper. Both authors approved the final version of the paper.
Corresponding author
Correspondence to Joshua W. K. Ho.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 Clustering
 Flow cytometry
 Single cell
 DBSCAN