 Research
 Open Access
A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics
 Longlong Liao^{1, 2},
 Kenli Li^{3} (corresponding author),
 Keqin Li^{4},
 Canqun Yang^{1, 2} and
 Qi Tian^{5}
https://doi.org/10.1186/s12918-018-0630-6
© The Author(s) 2018
 Published: 22 November 2018
Abstract
Background
While there are a large number of bioinformatics datasets for clustering, many of them are incomplete, i.e., some data samples are missing attribute values needed by clustering algorithms. A variety of clustering algorithms have been proposed in the past years, but they are usually limited to clustering complete datasets. Besides, conventional clustering algorithms cannot obtain a trade-off between accuracy and efficiency of the clustering process since many essential parameters are determined by the human user’s experience.
Results
The paper proposes a Multiple Kernel Density Clustering algorithm for Incomplete datasets called MKDCI. The MKDCI algorithm consists of recovering missing attribute values of input data samples, learning an optimally combined kernel for clustering the input dataset, reducing dimensionality with the optimal kernel based on multiple basis kernels, detecting cluster centroids with the Isolation Forests method, assigning clusters with arbitrary shape and visualizing the results.
Conclusions
Extensive experiments on several well-known clustering datasets in the bioinformatics field demonstrate the effectiveness of the proposed MKDCI algorithm. Compared with existing density clustering algorithms and parameter-free clustering algorithms, the proposed MKDCI algorithm tends to automatically produce clusters of better quality on incomplete datasets in bioinformatics.
Keywords
 Density clustering
 Matrix completion
 Unsupervised multiple kernel learning
 Dimensionality reduction
 Outlier detection
Background
Any non-uniform data contains an underlying structure due to the heterogeneity of the data. The process of identifying this structure by grouping the data samples is called clustering, and the resulting groups are called clusters. The grouping is usually based on similarity measurements defined for the data samples. Clustering provides a meaningful data analysis method for data mining and data classification on large-scale data samples, and it is mostly used as an unsupervised learning method in a wide range of areas, for example, bioinformatics, biomedicine and pattern recognition. It aims at finding hidden structure, identifying clusters with similar characteristics in given datasets, grouping similar samples into the same cluster, and assigning dissimilar samples to different clusters. Thus, over the past years, a number of clustering algorithms have been proposed and improved. The most popular clustering methods include partition-based (e.g., k-means [1] and k-means++ [2]), density-based (e.g., DBSCAN [3], DENCLUE [4] and CFSFDP [5]), graph-based (e.g., Spectral [6]), and hierarchical (e.g., BIRCH [7] and ROCK [8]) methods.
Most clustering algorithms proposed during the past few years assume that the input dataset is complete; they are not directly applicable if the input dataset is incomplete, i.e., attribute values of some elements in the dataset are missing. In reality, many large-scale datasets are incomplete for various reasons. Thus, it is essential to make the proposed clustering algorithm work on incomplete datasets by recovering the missing attribute values of incomplete samples in the input datasets. Besides, compared with other clustering methods, the clusters in density clustering are areas with a higher density than their neighbors and a relatively large dissimilarity from other higher-density samples of the given dataset; they may also have an arbitrary shape in the attribute space. However, most existing density clustering algorithms are effective only when the human user sets appropriate parameters, for example, the distance threshold and the minimum number of samples needed to form a cluster. The quality of the clustering results is significantly affected by these input parameters, and users have to guess them through several exploratory runs, which is inconvenient.
Traditional multiple kernel learning (MKL) methods are supervised because the kernel learning task requires the class labels of the training data samples. Nevertheless, class labels may not be available beforehand in some real-world scenarios, e.g., unsupervised learning tasks such as clustering and dimension reduction. Unsupervised Multiple Kernel Learning (UMKL) does not require class labels of training data as a conventional multiple kernel learning task does. Instead, it learns an optimal kernel from multiple predefined basis kernels and an unlabeled dataset [9].
In a previous study, we proposed a density clustering approach with multiple kernels for high-dimensional bioinformatics datasets [10]. However, that initial study did not examine the multiple kernel density clustering approach on incomplete datasets in detail. In this work, we present a Multiple Kernel Density Clustering algorithm for Incomplete datasets in bioinformatics, called MKDCI. In the MKDCI method, the incomplete dataset is completed with a matrix completion method based on sparse self-representation, then the cluster centroids are automatically spotted with the Isolation Forests method, and clusters with an arbitrary shape are obtained by the proposed multiple kernel density clustering method. Unlike existing density clustering algorithms, the MKDCI algorithm automatically determines the relevant parameters for clustering incomplete datasets, including the optimal value of the cutoff distance, the optimal combination of multiple basis kernels, and the number of clusters and centroids. Besides overcoming the limitation of determining many critical parameters manually during the clustering process, the proposed MKDCI algorithm works on high-dimensional incomplete datasets and obtains clustering results with improved accuracy and stability. The main properties of the MKDCI algorithm are as follows:

It recovers the missing attribute values in the input dataset by utilizing matrix completion based on sparse self-representation, instead of directly filling the missing attributes with average values or deleting the data samples with missing attributes from the input dataset.

It learns an optimal kernel based on multiple predefined basis kernels with a UMKL method, and obtains the optimal value of the cutoff distance d_{c} with entropic affinity, rather than adopting the strategy for determining the parameter d_{c} described in [5].

It automatically detects cluster centroids of the given dataset by using the Isolation Forests method [11], based on the distribution of the local density ρ_{i} of each data sample and its minimum distance δ_{i} from other data samples with higher density.

It clusters high-dimensional data samples and visualizes the results efficiently with Multiple Kernel t-Distributed Stochastic Neighbor Embedding (MKtSNE).
The remaining parts of the paper are organized as follows. The next section gives a brief overview of the existing literature on matrix completion, density clustering algorithms and parameter-free clustering algorithms. Then, the proposed MKDCI algorithm is discussed thoroughly, including the formal definition of the problem, its steps, and its mathematical properties. The final section introduces the selected bioinformatics clustering datasets and their preprocessing, describes the implementation techniques of MKDCI and the quality evaluation metrics, and discusses the extensive experimental evaluation and its results.
Related work
It is difficult to perform clustering on incomplete datasets in which some data samples contain missing attribute values, but missing value imputation can be used to predict the missing attribute values by reasoning from the observed attribute values of other data samples. Consequently, the effectiveness of missing value imputation depends on the observed attribute values of other data samples in the incomplete dataset, and the imputation of missing attribute values affects the clustering performance. To perform k-means clustering on incomplete datasets, the similarity between two incomplete data samples is measured with the distribution of the incomplete attributes [1]. Collective Kernel Learning [12] collectively completes the kernel matrices of incomplete datasets by inferring hidden sample similarity from multiple incomplete datasets. However, it is limited to multiple incomplete datasets that share common data samples and together cover all data samples, i.e., there are no missing data samples in the intersection set of data samples coming from all incomplete datasets.
Matrix completion recovers an incomplete matrix in which some of the elements are missing. Linear matrix completion methods assume that the given data come from linear transformations of a low-dimensional subspace and that the data matrix is low-rank. The low-rank property is used to recover the missing elements of the data matrix by minimizing the matrix rank, and the missing elements of a low-rank matrix can be recovered with high probability under constraints on the missing rate, matrix rank, and sampling scheme [13]. Matrix factorization and rank minimization are two classic linear matrix completion methods. The main idea of matrix factorization based methods is that an m×n matrix of rank r can be factorized into two smaller matrices of size m×r and r×n, where r<min(m,n), and the missing elements are predicted by finding such a pair of matrices [14]. Rank minimization based methods rely on the nuclear norm, which is the sum of the singular values of a matrix, and a number of extensions of the nuclear norm are used to complete matrices with missing elements. For example, the Schatten p-norm [15], defined as the p-th root of the sum of the p-th powers of the singular values, is used to recover incomplete matrices.
The nuclear norm is the special case of the Schatten p-norm with p=1. The truncated nuclear norm [16] is the nuclear norm minus the sum of the largest few singular values, and it tends to approximate the matrix rank better than the nuclear norm because the largest few singular values contain important information and should be preserved. The iteratively reweighted nuclear norm algorithm [17] was proposed to solve the Schatten p-norm low-rank minimization problem, and the evaluation results show that the Schatten p-norm outperformed other nonconvex nonsmooth extensions of rank minimization. Besides, a sparse self-representation based matrix completion method has been proposed for predicting missing elements of incomplete matrices drawn from multiple subspaces [18].
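The norms mentioned above can be illustrated with a short NumPy sketch. The function names are chosen here for illustration only and do not come from the cited works; the example merely evaluates the definitions on a small diagonal matrix whose singular values are known.

```python
import numpy as np

def schatten_p_norm(A, p=1.0):
    """(sum_i sigma_i^p)^(1/p) over the singular values sigma_i of A."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(sigma ** p) ** (1.0 / p))

def truncated_nuclear_norm(A, r):
    """Nuclear norm minus the sum of the r largest singular values."""
    sigma = np.linalg.svd(A, compute_uv=False)  # returned in descending order
    return float(np.sum(sigma[r:]))

# Singular values of this matrix are 3, 2 and 1.
A = np.diag([3.0, 2.0, 1.0])
nuclear = schatten_p_norm(A, p=1.0)    # p = 1 gives the nuclear norm
frobenius = schatten_p_norm(A, p=2.0)  # p = 2 gives the Frobenius norm
tail = truncated_nuclear_norm(A, r=1)  # drops the largest singular value
```

With p=1 the function returns the nuclear norm, which is why the nuclear norm is described as a special case of the Schatten p-norm.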
Since the k-means clustering approach was proposed, hundreds of new clustering methods have been introduced in the literature. Especially in the last 20 years, many variants of classical clustering problems have been studied, such as partition-based clustering, hierarchical clustering, graph-based clustering and density-based clustering.
The key idea of partition-based clustering methods is to initially partition the dataset into k clusters and then iteratively improve the clustering by reassigning data samples to a more appropriate cluster. One of the most widely used clustering algorithms of this kind is k-means [1], owing to its efficiency and logical simplicity. The k-means algorithm randomly selects k samples as the initial cluster centroids and assigns each remaining sample to the nearest cluster according to a distance metric between the sample and the cluster centroids, such as the Euclidean distance or the Mahalanobis distance. Then, it iteratively recomputes the centroid of each cluster and reassigns the samples to the newly computed centroids, until the cluster assignment no longer changes between iterations. k-means tends to generate approximately equal-sized clusters in order to minimize intra-cluster distances, and it performs poorly when the given dataset has a distribution of complex shape. k-means++ [2] improves the performance of k-means by optimizing the initial seeding: it reduces the variability of the clustering results by using a distance-based probabilistic approach to select the k initial centroids. However, most partition-based clustering methods have the serious shortcoming that the clustering performance relies heavily on the initial parameter k, and they tend to obtain a locally optimal result rather than a global one.
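The k-means iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration with random seeding and Euclidean distance, not the k-means++ seeding of [2]; the function name, the `seed` argument and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial centroids, then alternate
    assignment and centroid update until the labels stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Euclidean distance from every sample to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignment no longer changes
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return labels, centroids

# Two small, well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, k=2)
```

The number of clusters k still has to be supplied by the caller, which is exactly the dependence on the initial parameter k noted above.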
Hierarchical clustering algorithms can be classified into two main categories: divisive and agglomerative. Divisive clustering algorithms start with all samples in one cluster and then recursively divide the cluster into smaller ones until the expected clusters are produced. In contrast, agglomerative approaches, such as BIRCH [7] and ROCK [8], initialize every sample as its own cluster and then iteratively merge pairs of clusters until the expected number of clusters is obtained. Unfortunately, they are sensitive to the cluster shape and slower than partition-based clustering methods.
Graph-based clustering algorithms represent the non-uniform data samples as a graph, where a vertex denotes a data sample and the weight of an edge denotes the similarity between the two data points it connects. A graph cut method is then applied to cut the whole graph into several subgraphs, each of which is a cluster. Spectral clustering is a widely used graph-based clustering algorithm, and it can be implemented efficiently with standard linear algebra methods [19]. The main shortcoming of graph-based clustering algorithms is their high computational cost.
Density-based clustering algorithms take the points with higher density as the cluster centroids over the distribution of data samples [20]. Data samples that have a higher density over a region form a cluster. Examples include DBSCAN [3], DENCLUE [4] and CFSFDP [5].
The DBSCAN algorithm uses the distance between data samples to create a neighborhood relation, requires prior information on the radius and the minimum number of points needed to form a cluster, and has shown good clustering performance on arbitrarily shaped distributions of data samples. However, DBSCAN has two shortcomings: (1) The clustering results depend heavily on the maximum radius of a neighborhood and the minimum number of data samples contained in this neighborhood, and these two parameters are difficult for human users to determine. (2) Because it assumes that clusters have similar densities, DBSCAN tends to produce unintended clustering results on datasets with varying densities. DBSCAN-GM [21] tries to find suitable parameters for DBSCAN by using Gaussian Means to find a radial distance and a minimum number of points to form clusters. Hierarchical Density-Based Spatial Clustering (HDBSCAN) [22] forms clusters of different densities with varying epsilon values and is more robust to parameter selection.
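To make the role of the two DBSCAN parameters concrete, the following is a compact sketch of the algorithm on a full distance matrix. It assumes that `min_pts` counts the point itself and that noise is labelled -1; both conventions, like the function name, are choices made for this illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: grow clusters from core points; -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
              [50.0, 50.0]])
labels = dbscan(X, eps=1.5, min_pts=3)
```

Changing `eps` or `min_pts` in this toy example changes which points are treated as core points, which illustrates why the results depend so strongly on these two values.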
The DENCLUE [4] algorithm uses Gaussian kernel density estimation to define clusters and assigns the data samples that share the same local density maximum to the same cluster. Because it relies on a hill climbing approach, it may take unnecessarily small steps in the beginning and never converges exactly to the maximum. DENCLUE 2.0 [23] introduces a new hill climbing method for Gaussian kernels, which adjusts the step size automatically at no extra cost; the procedure converges precisely towards a local maximum by reducing it to a special case of the expectation maximization algorithm. It needs fewer iterations and can be accelerated, but the accuracy of the clustering results decreases.
“Clustering by fast search and find of density peaks (CFSFDP)” [5] is a classic density clustering algorithm, which can generate clusters regardless of the density distribution and the dimensionality of the data samples. The method is efficient because the whole clustering process iterates over the data points only once, and it can correctly recognize clusters regardless of their shape. However, this approach has several limitations: (1) It requires manual determination of a cutoff threshold in the decision graph to determine the density peaks. The cutoff threshold is a cutoff distance used to calculate the local density of each data point, and it is set by users according to their experience. Choosing the cutoff threshold for a given dataset is usually inefficient and difficult in two special cases. In one case, it is hard to decide whether data points with lower (or higher) local density and higher (or lower) relative distance should be chosen as density peaks. In the other case, one cluster is erroneously divided into multiple subclusters when there is more than one density peak in the same cluster. (2) The clustering results are influenced by the kernel function used in the dissimilarity computation, such as the Gaussian kernel, Exponential kernel, Truncated kernel, Gravity kernel, etc. (3) For large-scale datasets, the input distance matrix of the CFSFDP algorithm often exceeds the memory of personal computers.
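The two quantities that CFSFDP plots in its decision graph can be sketched as follows. The example uses the simple cutoff kernel (the number of neighbours closer than the cutoff distance) and, for the densest point, the convention of taking its largest distance to any other point; the function and variable names are illustrative assumptions.

```python
import numpy as np

def density_peaks_quantities(dist, d_c):
    """Local density rho_i (cutoff kernel) and distance delta_i to the
    nearest sample of higher density, from a full distance matrix."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    delta = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        if higher.size:
            delta[i] = dist[i, higher].min()
        else:
            delta[i] = dist[i].max()  # convention for the densest point(s)
    return rho, delta

# Five points on a line: a group around 1 and a pair near 10.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])
rho, delta = density_peaks_quantities(dist, d_c=1.5)
```

Both outputs depend directly on `d_c`, which is the cutoff distance that has to be chosen manually in the original method.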
Kernel clustering algorithms, such as kernel k-means and spectral clustering, can capture the nonlinear structure inherent in various datasets, and they usually achieve better clustering performance and identify arbitrarily shaped clusters. Spectral clustering is a weighted variant of the kernel k-means clustering algorithm. However, the performance of single kernel methods is largely determined by the choice of kernel function. Unfortunately, the most appropriate kernel function for the target clustering task is often unknown in advance, and an exhaustive search is time-consuming when the user-defined pool of basis kernels is large [24].
Besides, single kernel methods tend to fail to fully utilize the heterogeneous features of the datasets, although most data samples are represented by multiple groups of features. Therefore, multiple kernel methods have been proposed to fully leverage the different features of the clustering datasets. They can learn an appropriate kernel efficiently to make kernel k-means clustering more robust and to improve it in various scenarios [25]. Multiple kernel learning algorithms attempt to optimize the combined kernel by maximizing the centralized kernel alignment between the combined kernel and the ideal kernel [26]. These multiple kernel clustering algorithms belong to supervised kernel learning and require the class labels of training data samples.
Unlike the clustering algorithms above, Parameter Free Clustering (PFClust) [27] can automatically cluster data and identify a suitable number of clusters without requiring the human user to specify any parameters. It partitions the input dataset into a number of clusters that share some common attributes, such as their minimum expectation value and the variance of the intra-cluster similarity. However, it performs poorly on high-dimensional datasets.
Methods
Let the input dataset X^{n×d}={x_{1},x_{2},…,x_{n}} be a set containing n data samples, each with d attributes. A dataset is called high-dimensional [28] when the number of attributes of each data sample is larger than ten, i.e., d>10. Given several predefined basis kernel functions, e.g., the Gaussian kernel, Exponential kernel, and Laplace kernel, the proposed MKDCI algorithm aims to generate a cluster partition D={D_{1},D_{2},…,D_{k}} with 0<k<n for the data samples in the input dataset X, such that data samples in the same cluster are more similar to each other than to samples in different clusters. The proposed MKDCI algorithm is illustrated in Algorithm 1.
Completion of incomplete datasets based on matrix completion
A variety of bioinformatics datasets are naturally organized in matrix form because a matrix provides a convenient way of storing and analyzing a wide range of bioinformatics data samples. However, many bioinformatics datasets are incomplete in practical scenarios; in other words, there are missing values in the matrix form of the dataset. The missing values usually arise from failures in the data sampling process. Matrix completion [29] is an effective method to fill in the missing elements of an incomplete matrix and recover the entire matrix form of bioinformatics datasets.
Conventional matrix completion approaches are based on rank minimization; they are limited to low-rank incomplete matrices whose data samples are drawn from a single low-dimensional subspace. The approach of completing a matrix based on sparse self-representation [18] can recover matrices with the following properties: (a) the dimensions of each element in the matrices are unknown; (b) the incomplete matrix is a high-rank or full-rank matrix, and is not limited to a low-rank matrix.
Thus, each element in an incomplete matrix is represented by a linear combination of the values of other elements in the matrix, provided that the angles among these data points are small enough. The missing elements can then be recovered with matrix completion based on sparse self-representation by solving the optimization problems shown in Eqs. (3) and (4).
which is least-square self-representation based matrix completion. Setting the diagonal elements of S to zero prevents a data sample from being reconstructed by itself.
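The self-representation idea can be illustrated with a simplified alternating scheme. This is only a sketch under stated assumptions and not the solver of [18]: it assumes that data samples are stored as columns, that missing entries are marked with NaN, and that a ridge-regularized least-squares fit (with the hypothetical regularization weight `lam`) replaces the sparse formulation. Each column is regressed on all other columns, which keeps the diagonal of S at zero.

```python
import numpy as np

def complete_by_self_representation(X, lam=0.1, n_iter=20):
    """Fill NaN entries of X (d attributes x n samples, samples as columns)
    by alternately fitting X ~ X S with diag(S) = 0 and re-imputing."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Start from the per-attribute mean of the observed values.
    row_means = np.nanmean(X, axis=1, keepdims=True)
    Z = np.where(observed, X, row_means)
    d, n = Z.shape
    for _ in range(n_iter):
        S = np.zeros((n, n))
        for j in range(n):
            others = np.delete(np.arange(n), j)  # exclude column j itself
            A = Z[:, others]
            coef = np.linalg.solve(A.T @ A + lam * np.eye(n - 1), A.T @ Z[:, j])
            S[others, j] = coef
        # Keep observed entries, replace missing ones by the reconstruction.
        Z = np.where(observed, X, Z @ S)
    return Z

rng = np.random.default_rng(0)
full = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 12))
X_missing = full.copy()
X_missing[0, 3] = np.nan
X_missing[4, 7] = np.nan
completed = complete_by_self_representation(X_missing)
```

The observed entries are never modified; only the missing positions are overwritten by the current reconstruction in each pass.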
Learning an optimal kernel using unsupervised multiple kernel learning
Kernel function
A kernel function \(k:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}\) corresponds to a nonlinear mapping \(\Phi :\mathcal {X} \rightarrow \mathcal {H}\) to a Hilbert space \(\mathcal {H}\), called a feature space, such that \(k(x,x^{\prime })=\langle \Phi (x),\Phi (x^{\prime })\rangle \). Since an inner product is a measure of the similarity of two vectors Φ(x) and Φ(x^{′}), the kernel function k(·,·) is often interpreted as a similarity measure between points of the input space \(\mathcal {X}\). An important advantage of a kernel function k(·,·) is efficiency: computing k(x,x^{′}) is often significantly cheaper than computing the inner product of Φ(x) and Φ(x^{′}) in the Hilbert space \(\mathcal {H}\).
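The efficiency argument can be checked on a small example. The sketch below uses the homogeneous polynomial kernel of degree 2, chosen here only because its feature map is easy to write down explicitly; it is not one of the basis kernels named in this paper.

```python
import numpy as np

def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, evaluated directly in the input space."""
    return float(np.dot(x, y) ** 2)

def poly2_feature_map(x):
    """Explicit feature map: all pairwise products x_i * x_j (d^2 features)."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
k_direct = poly2_kernel(x, y)
k_explicit = float(np.dot(poly2_feature_map(x), poly2_feature_map(y)))
```

Both values agree, but the direct evaluation needs d multiplications while the explicit map builds d^2 features first.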
Kernel learning method
Kernel learning methods embed the input data into a Hilbert space by specifying the inner product between each pair of data points. They are formulated as convex optimization problems, which have a single global optimum and do not require heuristic choices of learning rates, starting configurations or other parameters.
Thus, the kernel matrix is a symmetric positive semidefinite matrix that contains as its entries the inner products between all pairs of input data points \(x_{i}\in \mathcal {X}\), and it determines the relative positions of those data points in the Hilbert space \(\mathcal {H}\).
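These properties can be verified numerically. The following sketch builds a Gaussian kernel matrix for random data and inspects its symmetry and eigenvalues; the bandwidth parameterization exp(-d^2 / (2 sigma^2)) and the function name are assumptions of the example.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative rounding errors
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = gaussian_kernel_matrix(X, sigma=1.5)
eigenvalues = np.linalg.eigvalsh(K)  # real eigenvalues of the symmetric matrix
```

Up to rounding error, all eigenvalues are nonnegative, which is the numerical counterpart of positive semidefiniteness.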
Unsupervised multiple kernel learning
Multiple Kernel Learning (MKL) methods [30] aim at learning a linear combination of a set of predefined basis kernels to identify an optimal kernel for the corresponding application. Compared with conventional kernel methods that use a single predefined kernel function, MKL methods have the advantages of automatic kernel parameter tuning and the capability of combining heterogeneous data. To choose the most suitable kernel and exploit heterogeneous features of input datasets, MKL methods construct a few candidate kernels and merge them to form a consensus kernel [26]. Traditional MKL algorithms are supervised because the optimal kernel learning task requires the class labels of the training data samples. However, these class labels may not be available before executing the MKL task in some real-world scenarios, such as clustering and dimension reduction. Unsupervised Multiple Kernel Learning (UMKL) determines a linear combination of multiple basis kernels by learning from unlabeled data samples, and the generated kernel can be used in data mining tasks, such as clustering and classification, as it is supposed to provide an integrated feature of the input datasets [31]. Thus, to apply multiple kernels to clustering, MKDCI obtains an optimal kernel with the UMKL method.
where each candidate kernel k(·,·) is the combination of m basis kernels {k_{1},…,k_{m}}, and μ_{t} is the coefficient (weight) of the t-th basis kernel.
 1)
A suitably combined kernel enables each training data sample to be reconstructed from the localized bases weighted by the kernel values, i.e., for each data sample x_{i}, the optimal kernel minimizes the approximation error \(\left \| x_{i} - {\sum \nolimits }_{j} x_{j}k(x_{i},x_{j}) \right \|\).
 2)
An ideal kernel induces kernel values that coincide with the original topology of the unlabeled training dataset, i.e., the optimal kernel minimizes the distortion over all training data samples \({\sum \nolimits }_{ij}k(x_{i},x_{j})\parallel x_{i} - x_{j} \parallel ^{2}\).
where k_{ij}=k(x_{i},x_{j}), the target kernel k and local bases set \(\mathcal {B}_{i}\) will be optimized by UMKL, the parameter γ_{1} controls the tradeoff between the coding error and the locality distortion, and γ_{2} controls the size of local basis set \(\mathcal {B}_{i}\).
where the optimal kernel matrix K is determined by \([\mathbf {K}]_{ij}= {\sum \nolimits }_{t=1}^{m} \mu _{t}k_{t}(x_{i},x_{j}), 1\leq i,j\leq n\), B≤n denotes the size of \(\mathcal {B}_{i}\) for each data sample x_{i}, ∘ denotes the element-wise multiplication of two matrices, ∥·∥_{F} denotes the Frobenius norm of a matrix, and tr denotes the trace of a matrix.
To apply the UMKL method, the input dataset is split by random sampling into a training set and a test set with a ratio of 70:30, i.e., they account for 70% and 30% of the entire input dataset, respectively. For each predefined basis kernel, a kernel matrix is computed on the training data samples, giving m kernel matrices; the parameters γ_{1} and B are estimated by cross-validation on the training data samples, and the above optimization problem can be solved with the algorithm discussed in [31]. Thus, by training on the unlabeled input dataset with the UMKL method, an optimally combined kernel k(·,·) together with the weights μ_{t} of the predefined basis kernels is learned. The learned optimal kernel can be used to compute the local density of each data sample in the input dataset and to reduce the dimensionality of high-dimensional datasets.
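The combination of basis kernels and the two criteria listed above can be evaluated with a short sketch. It does not learn the weights μ_{t} (the solver of [31] is not reproduced here) and it ignores the local bases sets by letting every sample contribute to the reconstruction; the kernel parameterizations, the fixed weights and all function names are assumptions of the example.

```python
import numpy as np

def pairwise_sq_dists(X):
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def basis_kernels(X, sigma=1.0):
    """Two illustrative basis kernels stacked into an (m, n, n) array."""
    d2 = pairwise_sq_dists(X)
    gaussian = np.exp(-d2 / (2.0 * sigma ** 2))
    laplace = np.exp(-np.sqrt(d2) / sigma)
    return np.stack([gaussian, laplace])

def combine(kernels, mu):
    """[K]_ij = sum_t mu_t * k_t(x_i, x_j)."""
    return np.tensordot(mu, kernels, axes=1)

def umkl_criteria(X, K):
    """Reconstruction error sum_i ||x_i - sum_j x_j K_ij||^2 and
    locality distortion sum_ij K_ij ||x_i - x_j||^2."""
    reconstruction = float(np.sum((X - K @ X) ** 2))
    distortion = float(np.sum(K * pairwise_sq_dists(X)))
    return reconstruction, distortion

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
kernels = basis_kernels(X, sigma=2.0)
mu = np.array([0.3, 0.7])  # fixed weights, chosen by hand for illustration
K = combine(kernels, mu)
reconstruction, distortion = umkl_criteria(X, K)
```

A UMKL solver would search over `mu` (and the local bases) to trade these two quantities off against each other instead of fixing the weights by hand.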
Computation of the optimal parameters
The radius of attenuation is regarded as the impact scope of the optimal kernel function k(·,·), and the value of the cutoff distance threshold d_{c} is determined by this radius because a data point only influences the other data points inside its radius. In a normal distribution, most data points fall within three standard deviations of the expectation [32], and the radius of attenuation is \(\frac {3\sigma }{\sqrt {2}}\) for each point of the data field. Thus, the parameter σ is taken as the value at which the entropy H reaches its smallest value, and \(\frac {3\sigma }{\sqrt {2}}\) is chosen as the optimal cutoff distance threshold d_{c} in the proposed MKDCI algorithm.
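The selection rule can be sketched as a grid search. The definition of the entropy H is not reproduced in this text, so the sketch assumes a data-field potential φ_{i} = Σ_{j} exp(-(d_{ij}/σ)^2) and the Shannon entropy of the normalized potentials; this form is an assumption, chosen because it is consistent with the radius 3σ/√2. The candidate grid and function names are likewise hypothetical.

```python
import numpy as np

def potential_entropy(dist, sigma):
    """Shannon entropy of the normalized data-field potentials (assumed form)."""
    phi = np.exp(-(dist / sigma) ** 2).sum(axis=1)  # includes the self term
    p = phi / phi.sum()
    return float(-(p * np.log(p)).sum())

def select_cutoff_distance(dist, sigmas):
    """Pick the sigma with the smallest entropy and return (sigma, d_c)."""
    entropies = [potential_entropy(dist, s) for s in sigmas]
    best_sigma = float(sigmas[int(np.argmin(entropies))])
    return best_sigma, 3.0 * best_sigma / np.sqrt(2.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
               rng.normal(5.0, 0.5, size=(30, 2))])
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
sigmas = np.linspace(0.1, 5.0, 30)
sigma, d_c = select_cutoff_distance(dist, sigmas)
```

Only the final relation d_c = 3σ/√2 is taken from the text; everything that determines σ in this sketch depends on the assumed entropy.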
Dimensionality reduction of input data samples
where k(·,·) is the optimally combined kernel function for the input dataset which is obtained by the UMKL method.
Thus, a faithful representation in the two-dimensional space can be found for each data sample in the input dataset with the MKtSNE method. The method preserves both local and global information of the data samples in the corresponding low-dimensional space [34] and is suitable for large-scale datasets with several attributes.
Calculation of local density and minimum distance
Estimation of cluster centroids
Detecting suitable cluster centroids is the critical step of the proposed MKDCI algorithm for generating the optimum clustering results. In the MKDCI algorithm, the cluster centroids are the data samples with higher local density ρ_{i} and larger relative distance δ_{i}; the parameter θ_{i}=ρ_{i}×δ_{i} combines the local density ρ_{i} and the relative distance δ_{i} of each data sample into a single quantity.
Since outliers are few and differ from the other data samples in the dataset, outlier detection methods can be used to automatically detect cluster centroids based on the set of local density ρ_{i} and parameter θ_{i} in the MKDCI algorithm. Thus, cluster centroids with larger θ_{i} are automatically detected by searching for outliers in the set of θ_{i} values with the outlier detection method. Nevertheless, data samples with high ρ_{i} and low δ_{i}, as well as those with low ρ_{i} and high δ_{i}, may also be assigned a high θ_{i}, so false cluster centroids may be generated when only the set of θ_{i} values is searched. Therefore, the outliers in the set of δ_{i} values are also searched with the outlier detection method, and the potential cluster centroids are determined by the intersection of the two sets of outliers detected from θ_{i} and δ_{i}.
There might be multiple potential cluster centroids that lie a short relative distance from each other, so the false cluster centroids should be deleted. First, the potential cluster centroids are sorted in descending order of their local density, and the first one is taken as the first actual cluster centroid. If the minimum distance between another potential cluster centroid and the known actual cluster centroids is shorter than the cutoff distance threshold d_{c}, the potential cluster centroid is removed from the set of potential cluster centroids and becomes a member of the corresponding cluster. Otherwise, the potential cluster centroid is recognized as a new actual cluster centroid that forms another cluster. Finally, the actual cluster centroids are obtained by refining the potential cluster centroids in this way.
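The intersection step can be sketched with the `IsolationForest` class of scikit-learn, used here as one possible implementation of the Isolation Forests method [11]. The sketch keeps only outliers that lie above the median, since a forest flags unusual values on either side; this filter, the default contamination setting, the helper names and the synthetic ρ and δ values are assumptions of the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def upper_outliers(values, seed=0):
    """Indices flagged as outliers by an isolation forest, restricted to
    values above the median (large theta or delta)."""
    values = np.asarray(values, dtype=float)
    forest = IsolationForest(n_estimators=100, random_state=seed)
    flagged = forest.fit_predict(values.reshape(-1, 1)) == -1
    return set(np.flatnonzero(flagged & (values > np.median(values))).tolist())

def potential_centroids(rho, delta, seed=0):
    """Intersection of the outliers found in theta = rho * delta and in delta."""
    theta = np.asarray(rho, dtype=float) * np.asarray(delta, dtype=float)
    return sorted(upper_outliers(theta, seed) & upper_outliers(delta, seed))

# Synthetic example: 200 ordinary samples plus three density peaks.
rng = np.random.default_rng(0)
rho = rng.uniform(1.0, 5.0, size=203)
delta = rng.uniform(0.05, 0.3, size=203)
rho[:3] = [20.0, 22.0, 24.0]
delta[:3] = [5.0, 6.0, 7.0]
candidates = potential_centroids(rho, delta)
```

The returned list may contain more indices than the true peaks, which is why a further refinement of the potential centroids is still needed.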
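The refinement procedure above can be written as a short greedy loop. The function and argument names are illustrative; `dist` is assumed to be the full pairwise distance matrix and `candidates` the indices of the potential cluster centroids.

```python
import numpy as np

def refine_centroids(candidates, rho, dist, d_c):
    """Visit candidates in descending density and keep one only if it is at
    least d_c away from every centroid that has already been accepted."""
    order = sorted(candidates, key=lambda i: rho[i], reverse=True)
    actual = []
    for c in order:
        if not actual or min(dist[c, a] for a in actual) >= d_c:
            actual.append(c)
    return actual

points = np.array([0.0, 0.5, 10.0])
dist = np.abs(points[:, None] - points[None, :])
rho = np.array([5.0, 4.0, 3.0])
actual = refine_centroids([0, 1, 2], rho, dist, d_c=1.0)
```

In the example, the second candidate lies within d_c of the densest one and is therefore dropped, while the third forms a new cluster.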
Assignment of data samples
Second, to recognize the noise points, a border region is defined for each cluster as the set of data samples that are assigned to the cluster D_{k} but lie within the cutoff distance threshold d_{c} of data samples assigned to other clusters D_{l} (l≠k). For the cluster D_{k}, the MKDCI algorithm searches for the lowest density ρ_{b} within its border region; the data samples that belong to the cluster D_{k} and have a local density higher than ρ_{b} are kept as the data samples of this cluster, and the other data samples in the cluster D_{k} are regarded as noise. Thus, the assignment of data samples is completed in a single step, in contrast with other clustering algorithms in which the generation of correct clusters usually needs to be optimized iteratively.
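The noise rule can be sketched as follows, implementing the description above literally: ρ_{b} is the lowest density in the border region and samples whose density does not exceed it are marked as noise. A cluster without a border region is assumed to contain no noise; that choice and the names used are assumptions of the example.

```python
import numpy as np

def mark_noise(labels, rho, dist, d_c):
    """Boolean mask of noise samples, given cluster labels, local densities,
    the pairwise distance matrix and the cutoff distance d_c."""
    labels = np.asarray(labels)
    rho = np.asarray(rho, dtype=float)
    is_noise = np.zeros(len(labels), dtype=bool)
    for k in np.unique(labels):
        inside = labels == k
        # Members of cluster k within d_c of a sample from another cluster.
        near_other = (dist[:, ~inside] < d_c).any(axis=1)
        border = inside & near_other
        if border.any():
            rho_b = rho[border].min()
            is_noise |= inside & (rho <= rho_b)
    return is_noise

points = np.array([0.0, 1.0, 4.5, 5.0, 9.0, 10.0])
dist = np.abs(points[:, None] - points[None, :])
labels = [0, 0, 0, 1, 1, 1]
rho = [3, 3, 1, 1, 3, 3]
noise = mark_noise(labels, rho, dist, d_c=1.0)
```

Each cluster is processed once, so no iterative optimization is involved.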
Results
In this section, the test datasets and the corresponding preprocessing methodology are described first. Then, the implementation technique of the MKDCI algorithm is explored, i.e., the input distance matrix is calculated with the split-apply-combine strategy so that the proposed clustering algorithm can efficiently process high-dimensional datasets with millions of data samples. Finally, the evaluation metrics, extensive experiments and their results are discussed in detail.
Datasets and preprocessing
 (1)
The Primary Biliary Cirrhosis (PBC) dataset contains the follow-up laboratory data for each studied patient with a fatal chronic liver disease of unknown cause. Between 1974 and 1984, a double-blinded randomized clinical trial was conducted in primary biliary cirrhosis of the liver, recording a large number of clinical, biochemical, serologic, and histologic parameters. The dataset also records the survival status of the studied patients in 1986.
 (2)
The Anuran Calls (MFCCs) dataset is used to recognize anuran species through their calls. It is a multi-label dataset with three labels, and its 7195 syllables belong to 4 different families, 8 genera, and 10 species. In the following experiments, the species labels served as the ground truth labels.
 (3)
Diffuse large Bcell lymphoma (DLBCLB) dataset contains the data samples deriving from germinal center cells, which can be distinguished from their immunoglobulin gene rearrangements, morphologic, molecular characteristics and clinical presentation. Disease staging and choice of treatment, including the type, number, sequence of chemotherapy agents and the need for consolidative radiation therapy, should be made base on these clinical factors, which collectively determine response to therapy and survival.
 (4)
The other four bioinformatics datasets derive from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) including Wine, Breast Cancer Wisconsin Diagnostic (WDBC), Mice Protein Expression (MPE) and Epileptic Seizure Recognition (ESR) dataset. Wine dataset contains the results of a chemical analysis of wines grown in the same region but derived from three different cultivars, and the analysis determines the quantities of 13 constituents found in each type of wines. WDBC dataset consists of features which were computed from digitized images of FNA tests on a breast mass. MPE dataset consists of the expression levels of proteins/protein modifications that produced detectable signals in the nuclear fraction of the cortex. ESR dataset is a preprocessed and restructured/reshaped version of a very commonly used dataset featuring epileptic seizure detection.
Summary of the properties of the datasets (k: number of clusters; dim: number of attributes; N: number of data samples)
Properties  PBCA  PBCR  MFCCs  DLBCLB  Wine  WDBC  MPEA  MPER  ESR
k  4  4  10  3  3  3  8  8  5
dim  18  18  22  643  13  30  80  80  178
N  624  624  7195  180  178  569  1080  1080  11500
Computation of distance matrix for largescale datasets
Computing and storing the entire distance matrix for a large-scale dataset with millions of data samples is memory intensive, and the matrix tends to exceed the memory capacity of current personal computers. The split-apply-combine strategy [36] breaks up a big matrix into manageable chunks, operates on each chunk independently and then pulls the chunks together. Thus, the proposed MKDCI algorithm utilizes the split-apply-combine strategy to calculate and store the distance matrix, which makes the algorithm applicable to large-scale bioinformatics datasets.
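A minimal sketch of the split-apply-combine idea for a distance matrix is given below. The helper names (`split_rows`, `apply_chunk`, `distance_blocks`, `combine`) are hypothetical, and the Euclidean distance stands in for whatever similarity measure the algorithm actually uses. Only one block of rows is held in memory at a time; each block could be written to disk instead of being combined in memory.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Sequence[float]
Block = Tuple[int, List[List[float]]]


def split_rows(n: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Split: yield [start, stop) row ranges that cover 0..n-1."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def apply_chunk(data: Sequence[Point], start: int, stop: int) -> List[List[float]]:
    """Apply: distances from rows start..stop-1 to every sample."""
    return [[math.dist(data[i], q) for q in data] for i in range(start, stop)]


def distance_blocks(data: Sequence[Point], chunk_size: int) -> Iterator[Block]:
    """Lazily produce (first_row_index, block) pairs, one chunk at a time."""
    for start, stop in split_rows(len(data), chunk_size):
        yield start, apply_chunk(data, start, stop)


def combine(blocks: Iterable[Block]) -> List[List[float]]:
    """Combine: stitch the blocks together (only when the result fits in memory)."""
    full: List[List[float]] = []
    for _start, block in blocks:
        full.extend(block)
    return full


if __name__ == "__main__":
    points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    for first_row, block in distance_blocks(points, chunk_size=2):
        print(first_row, block)
```

Because `distance_blocks` is a generator, peak memory is bounded by `chunk_size` rows of the matrix rather than by the full n × n matrix.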
Evaluation metrics of clustering quality
Three types of clustering evaluation metrics are widely used, namely contingency table-based measures, pairwise measures and entropy-based measures. Contingency table-based measures, such as accuracy, error rate and F-measure (Fm), assume that the ground-truth clustering labels are known a priori. Pairwise measures, such as the Adjusted Rand Index (ARI) and the Adjusted Rand Error (aRe), utilize the partition information and the clustering labels over all pairs of data samples. Entropy-based measures, such as Adjusted Mutual Information (AMI) and Normalized Mutual Information (NMI), use the concept of entropy together with the ground-truth clustering labels to evaluate the clustering results.
Fm and aRe alone are not suitable for comparing different clustering algorithms on datasets with numerous noisy data samples, because they only take already clustered data samples into account. Hence, AMI and NMI are additionally employed to quantify the amount of information shared between the clusters obtained by the clustering algorithm and the given ground-truth clusters in the datasets. Thus, four different metrics are jointly used in this section to evaluate the quality of different clustering algorithms: Fm, aRe, NMI and AMI.
To compute the clustering evaluation metrics, assume that the set C is the distribution of the ground-truth clustering labels in the input dataset, which contains n data samples and is partitioned into t subsets {C_{1},…,C_{t}}. Meanwhile, the distribution of the clustering results is the set D={D_{1},…,D_{k}}, which is obtained by a clustering algorithm applied to the same dataset.
Fm and aRe
Accuracy can be ambiguous, because it only evaluates the exactness of individual clusters, regardless of the overall number of clusters. Thus, the larger the number of identified clusters is, the higher the accuracy will be. Similarly, the error rate only calculates the mispredicted ratio of individual clusters, regardless of the total number of clusters, so that clusters with more mispredicted samples have a higher error rate. In contrast, Fm takes the overall number of clusters into account and keeps a balance between the overall number of clusters and the accuracy (or error rate) of individual clusters.
The Fm measures the success of retrieving the ground-truth clusters C in terms of the precision and recall of the clustering results D produced by the clustering algorithm, whereby the perfect clustering result is denoted by Fm=1. It is computed from the following quantities:
m_{i,j}=|C_{i}∩D_{j}| is the number of data samples in the ground-truth cluster C_{i} assigned to the generated cluster D_{j} by a clustering algorithm,
m_{i,all}=|C_{i}| is the total number of data samples in the ground-truth cluster C_{i},
m_{all,j}=|D_{j}| is the total number of data samples in the generated cluster D_{j},
m_{all,all}=|C| is the total number of data samples in the dataset except the data samples that are difficult to cluster, i.e., the values of their ground-truth clustering labels are 1.
A perfect clustering algorithm generates predicted clusters that are identical to the ground-truth clusters. aRe is defined as 1−ARI, so the perfect clustering result is denoted by aRe=0.
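The two measures can be illustrated with a short sketch built on the contingency counts m_{i,j}, m_{i,all} and m_{all,j}. It rests on two assumptions. First, Fm is taken in its common contingency-table form, where each ground-truth cluster is matched with the generated cluster that maximizes 2·m_{i,j}/(m_{i,all}+m_{all,j}) and the matches are weighted by m_{i,all}/m_{all,all}. Second, ARI is the standard Hubert-Arabie adjusted Rand index. The function names are hypothetical, and every sample is counted (no labels are excluded).

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def contingency(truth: Sequence[Hashable], pred: Sequence[Hashable]):
    """Return (m_ij, m_i_all, m_all_j) as Counters."""
    return Counter(zip(truth, pred)), Counter(truth), Counter(pred)


def f_measure(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Weighted best-match F-measure (assumed form, see lead-in)."""
    m, row, col = contingency(truth, pred)
    n = len(truth)  # m_all_all; this sketch counts every sample
    total = 0.0
    for i, m_i in row.items():
        # 2PR/(P+R) with P = m_ij/m_all_j and R = m_ij/m_i_all
        best = max(2.0 * m[(i, j)] / (m_i + m_j) for j, m_j in col.items())
        total += (m_i / n) * best
    return total


def adjusted_rand_error(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """aRe = 1 - ARI, so 0 means the two partitions are identical."""
    m, row, col = contingency(truth, pred)
    n = len(truth)
    sum_ij = sum(comb(v, 2) for v in m.values())
    sum_i = sum(comb(v, 2) for v in row.values())
    sum_j = sum(comb(v, 2) for v in col.values())
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:
        return 0.0  # degenerate case, e.g. both partitions are one cluster
    ari = (sum_ij - expected) / (max_index - expected)
    return 1.0 - ari


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [5, 5, 7, 7, 9, 9]  # same partition under different label names
    print(f_measure(truth, pred), adjusted_rand_error(truth, pred))
```

Because ARI can be negative for partitions that agree less than expected by chance, aRe computed this way can exceed 1.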
NMI and AMI
It is hard to decide whether the assignment in one clustering result is right or wrong for the given dataset by directly comparing the clustering results with the corresponding ground-truth clusters sample by sample. Therefore, an effective way to evaluate the quality of the clustering results is to measure the relationships between each pair of data samples in the dataset. For each pair of data samples that share at least one cluster in the overlapping clustering results, pairwise measures estimate whether predicting this pair to be in the same cluster is correct with respect to the true underlying categories in the dataset.
In the definitions of NMI and AMI, I(X,Y) is the mutual information between the ground-truth labels X and the clustering results Y; it is a nonnegative quantity upper bounded by the entropies H(X) and H(Y). H(X) and H(Y) are the entropies of X and Y, respectively, max{H(X),H(Y)} denotes the larger of the two entropies, and E(I(X,Y)) is the expected value of I(X,Y). N_{ij} denotes the number of data samples belonging to both cluster C_{i} and cluster D_{j}, and N_{i} and N_{j} denote the number of data samples in the clusters C_{i} and D_{j}, respectively. The values of NMI and AMI range from 0 to 1. Larger values indicate better clustering results, and a value of 1 indicates that the two clusterings are identical.
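As an illustration of the entropy-based measures, the sketch below computes NMI from the counts N_{ij}, N_{i} and N_{j}, assuming the max-normalized form NMI = I(X,Y)/max{H(X),H(Y)} that the description above suggests. The function names are hypothetical, natural logarithms are used, and returning 1.0 when both labelings contain a single cluster is a convention chosen for this example. AMI, which additionally needs the chance term E(I(X,Y)), is not covered here.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H of a labeling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X,Y) computed from the joint counts N_ij and marginals N_i, N_j."""
    n = len(x)
    n_ij = Counter(zip(x, y))
    n_i = Counter(x)
    n_j = Counter(y)
    return sum((c / n) * log(n * c / (n_i[a] * n_j[b]))
               for (a, b), c in n_ij.items())


def nmi_max(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """NMI normalized by max{H(X), H(Y)} (assumed form, see lead-in)."""
    denom = max(entropy(x), entropy(y))
    if denom <= 0.0:
        return 1.0  # both labelings are a single cluster
    return mutual_information(x, y) / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(nmi_max(truth, [1, 1, 0, 0]))  # same partition, labels swapped
    print(nmi_max(truth, [0, 1, 0, 1]))  # statistically independent partition
```

Normalizing by the larger entropy keeps the score at or below 1, because I(X,Y) never exceeds either H(X) or H(Y).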
Evaluation results
Quality comparison of different clustering algorithms on bioinformatics datasets
Algorithm  Metric  PBCA  PBCR  MFCCs  DLBCLB  Wine  WDBC  MPEA  MPER  ESR
MKDCI  Fm  0.351  0.360  0.728  0.749  0.652  0.858  0.470  0.482  0.491
MKDCI  aRe  0.956  0.953  0.406  0.526  0.704  0.382  0.693  0.689  0.852
MKDCI  NMI  0.351  0.362  0.692  0.532  0.414  0.495  0.538  0.554  0.446
MKDCI  AMI  0.070  0.076  0.615  0.496  0.379  0.453  0.429  0.438  0.219
DBSCAN (MinPts=4, ε_{1})  Fm  0.660  0.665  0.509  0.510  0.576  0.811  0.448  0.452  0.350
DBSCAN (MinPts=4, ε_{1})  aRe  0.999  0.998  0.858  0.956  0.772  0.602  0.796  0.794  0.967
DBSCAN (MinPts=4, ε_{1})  NMI  0.023  0.026  0.221  0.054  0.361  0.395  0.492  0.499  0.060
DBSCAN (MinPts=4, ε_{1})  AMI  0.005  0.005  0.124  0.039  0.269  0.295  0.347  0.347  0.003
HDBSCAN (MinPts=4)  Fm  0.623  0.627  0.785  0.565  0.620  0.853  0.265  0.271  0.332
HDBSCAN (MinPts=4)  aRe  0.998  0.998  0.260  0.985  0.715  0.386  0.926  0.923  0.989
HDBSCAN (MinPts=4)  NMI  0.029  0.032  0.686  0.174  0.386  0.469  0.518  0.523  0.082
HDBSCAN (MinPts=4)  AMI  0.019  0.020  0.613  0.115  0.353  0.373  0.335  0.337  0.020
DENCLUE2.0 (ε_{2}, h=std(X)/5)  Fm  0.023  0.025  0.415  0.493  0.372  0.007  0.304  0.308  0.650
DENCLUE2.0 (ε_{2}, h=std(X)/5)  aRe  0.997  0.996  0.983  0.987  0.908  0.998  0.708  0.699  0.685
DENCLUE2.0 (ε_{2}, h=std(X)/5)  NMI  0.344  0.347  0.105  0.184  0.385  0.322  0.472  0.478  0.472
DENCLUE2.0 (ε_{2}, h=std(X)/5)  AMI  0.061  0.064  0.018  0.114  0.122  0.002  0.392  0.396  0.201
PFClust  Fm  0.315  0.320  0.375  0.442  0.373  0.432  0.202  0.207  0.271
PFClust  aRe  0.981  0.978  0.887  0.993  0.971  0.988  0.998  0.998  0.872
PFClust  NMI  0.002  0.002  0.123  0.043  0.033  0.019  0.024  0.028  0.135
PFClust  AMI  0.001  0.001  0.094  0.001  0.001  0.007  0.006  0.007  0.111
Parameters  ε_{1}  24.657  24.657  0.306  19.819  3.626  20.413  2.221  2.221  1.426
Parameters  ε_{2}  19.591  19.591  0.306  0.413  6.552  1.426  0.432  0.432  1.853
Discussion
Compared with the PFClust algorithm, the proposed MKDCI algorithm significantly improves the quality of parameter-free clustering. Meanwhile, the MKDCI algorithm also automatically generates clustering results of higher quality on most of the high-dimensional bioinformatics datasets. The reason is that the utilized UMKL method can obtain the optimal map between the high-dimensional data samples and the low-dimensional data samples, and the MKDCI algorithm automatically determines the optimally combined kernel function and the similarity measure for dimensionality reduction and density clustering, respectively. Compared with filling the missing attribute values of data samples with the average value, matrix completion slightly improves the performance of the clustering algorithm, but the improvement in the performance of the MKDCI algorithm is mainly attributed to the optimization of the combined kernels learned with UMKL. However, for some of the evaluation metrics on the PBC, MFCCs and ESR datasets, such as Fm and aRe, the quality of the clustering results generated by the MKDCI algorithm is lower than that of the results generated by the HDBSCAN and DENCLUE2.0 algorithms. This is because these evaluation metrics only take already clustered data samples into account. The other four density clustering approaches need their parameters to be determined manually beforehand, so their clustering results heavily depend on the user’s experience, whereas the MKDCI algorithm does not require users to determine critical parameters. Thus, the proposed MKDCI is an efficient unsupervised learning algorithm. It is especially suitable for analyzing high-dimensional bioinformatics data samples in a wide variety of applications, since it determines an optimal linear combination of multiple basis kernels by learning from the unlabeled dataset and automatically completes the clustering process without critical parameters being determined manually by users in advance.
Conclusions
The proposed MKDCI algorithm provides an automatic density clustering approach with multiple kernels for bioinformatics datasets. It is especially suitable for large-scale incomplete datasets in bioinformatics because it combines the advantages of the density clustering method, the prediction of missing attribute values of data samples with the matrix completion method, the UMKL method for unlabeled training data samples, and the detection of cluster centroids based on the Isolation Forests method. The quality of the proposed MKDCI algorithm is evaluated with several well-known evaluation metrics. The results on multiple bioinformatics datasets with missing attribute values show that the MKDCI algorithm generates better clustering results than most of the density clustering methods and the PFClust parameter-free clustering method. However, the optimal kernel used in the MKDCI algorithm is only a combination of three prespecified basis kernels; the performance of the clustering could be improved by utilizing more basis kernels to obtain a more suitable kernel function. Meanwhile, due to the sensitivity and privacy of bioinformatics datasets, a privacy-preserving clustering method based on differential privacy is another promising topic for future research.
Declarations
Acknowledgements
The research was partially funded by the Program of National Natural Science Foundation of China (Grant No. 61751204), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the Natural Science Foundation of Jiangsu Higher Education Institutions (Grant No. 18KJB520022), and the National Key R&D Program of China (Grant Nos. 2016YFB0201303, 2016YFB0200201).
Funding
Publication costs were funded by the Program of National Natural Science Foundation of China (Grant No. 61751204).
Availability of data and materials
All data generated or analyzed during this study are included in this published article. The datasets used in this study can be downloaded from the public websites.
About this supplement
This article has been published as part of BMC Systems Biology Volume 12 Supplement 6, 2018: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2017: systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume12supplement6.
Authors’ contributions
LL and KLL conceived the study and wrote the manuscript. QT gave helpful suggestions and helped to revise the English. All authors provided valuable advice in developing the proposed method and modifying the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. AbdAllah L, Shimshoni I. K-means over incomplete datasets using mean Euclidean distance. In: Perner P, editor. Machine Learning and Data Mining in Pattern Recognition. Cham: Springer: 2016. p. 113–127.
2. Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07. New Orleans: Society for Industrial and Applied Mathematics: 2007. p. 1027–35.
3. Anant R, Sunita J, Jalal AS, Manoj K. A density based algorithm for discovering density varied clusters in large spatial databases. Int J Comput Appl. 2011; 3(6):1–4.
4. Hinneburg A, Keim DA. An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98. New York: AAAI Press: 1998. p. 58–65.
5. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014; 344(6191):1492–6. https://doi.org/10.1126/science.1242072.
6. Borg A, Lavesson N, Boeva V. Comparison of Clustering Approaches for Gene Expression Data. In: Twelfth Scandinavian Conference on Artificial Intelligence: 2013. p. 55–64. https://doi.org/10.3233/978161499330855.
7. Zhang T, Ramakrishnan R, Livny M. Birch: An efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96. New York: ACM: 1996. p. 103–114. https://doi.org/10.1145/233269.233324.
8. Guha S, Rastogi R, Shim K. Rock: A robust clustering algorithm for categorical attributes. Inf Syst. 2000; 25(5):345–66. https://doi.org/10.1016/S03064379(00)000223.
9. Wang J, Zhuang J, Hoi SCH. Unsupervised multiple kernel learning. J Mach Learn Res. 2011; 20:129–44.
10. Liao L, Li K, Li K, Tian Q, Yang C. Automatic density clustering with multiple kernels for high-dimension bioinformatics data. In: Workshop of IEEE BIBM 2017. Kansas City: IEEE: 2017.
11. Liu FT, Ting KM, Zhou Z-H. Isolation-based anomaly detection. ACM Trans Knowl Discov Data. 2012; 6(1):3:1–3:39. https://doi.org/10.1145/2133360.2133363.
12. Shao W, Shi X, Yu PS. Clustering on multiple incomplete datasets via collective kernel learning. In: IEEE 13th International Conference on Data Mining. 2013. p. 1181–1186. https://doi.org/10.1109/ICDM.2013.117.
13. Liu G, Li P. Low-rank matrix completion in the presence of high coherence. IEEE Trans Sig Process. 2016; 64(21):5623–33. https://doi.org/10.1109/TSP.2016.2586753.
14. Wen Z, Yin W, Zhang Y. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Math Program Comput. 2012; 4(4):333–61.
15. Nie F, Wang H, Huang H, Ding C. Joint Schatten p-norm and ℓ_{p}-norm robust matrix completion for missing value recovery. Knowl Inf Syst. 2015; 42(3):525–44.
16. Liu Q, Lai Z, Zhou Z, Kuang F, Jin Z. A truncated nuclear norm regularization method based on weighted residual error for matrix completion. IEEE Trans Image Process. 2016; 25(1):316–30. https://doi.org/10.1109/TIP.2015.2503238.
17. Lu C, Tang J, Yan S, Lin Z. Generalized nonconvex nonsmooth low-rank minimization. In: IEEE Conference on Computer Vision and Pattern Recognition. 2014. p. 4130–4137. https://doi.org/10.1109/CVPR.2014.526.
18. Fan J, Chow TWS. Matrix completion by least-square, low-rank, and sparse self-representations. Pattern Recog. 2017; 71:290–305. https://doi.org/10.1016/j.patcog.2017.05.013.
19. Rohe K, Chatterjee S, Yu B. Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat. 2011; 39(4):1878–915.
20. Fahim A. A clustering algorithm based on local density of points. IJMECS. 2017; 9:9–16.
21. Smiti A, Elouedi Z. Dbscan-gm: An improved clustering method based on gaussian means and dbscan techniques. In: IEEE 16th International Conference on Intelligent Engineering Systems (INES). 2012. p. 573–578. https://doi.org/10.1109/INES.2012.6249802.
22. Campello RJGB, Moulavi D, Sander J. Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G, editors. Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer: 2013. p. 160–172.
23. Hinneburg A, Gabriel HH. Denclue 2.0: Fast clustering based on kernel density estimation. In: Berthold MR, Shawe-Taylor J, Lavrač N, editors. Advances in Intelligent Data Analysis VII. Berlin, Heidelberg: Springer: 2007. p. 70–80.
24. Liu X, Li M, Wang L, Dou Y, Yin J, Zhu E. Multiple kernel k-means with incomplete kernels. In: AAAI. San Francisco: IEEE: 2017.
25. Li T, Dou Y, Liu X, Zhao Y, Lv Q. Multiple kernel clustering with corrupted kernels. Neurocomputing. 2017; 267:447–54. https://doi.org/10.1016/j.neucom.2017.06.044.
26. Gönen M, Alpaydın E. Multiple kernel learning algorithms. J Mach Learn Res. 2011; 12:2211–68.
27. Mavridis L, Nath N, Mitchell JB. Pfclust: a novel parameter free clustering algorithm. BMC Bioinformatics. 2013; 14(1):213. https://doi.org/10.1186/1471210514213.
28. Kriegel HP, Kröger P, Sander J, Zimek A. Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Disc. 2011; 1(3):231–40. https://doi.org/10.1002/widm.30.
29. Xiao G, Li K, Li K. Reporting l most influential objects in uncertain databases based on probabilistic reverse top-k queries. Inf Sci. 2017; 405:207–26. https://doi.org/10.1016/j.ins.2017.04.028.
30. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. J Mach Learn Res. 2006; 7:1531–65.
31. Mariette J, Villa-Vialaneix N. Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics. 2017; 34(6):1009–1015.
32. Barany I, Vu V. Central limit theorems for gaussian polytopes. Ann Probab. 2008; 36(5):1998. https://doi.org/10.1214/07AOP378.
33. van der Maaten LJP, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008; 9(11):2579–605.
34. Güngör E, Özmen A. Distance and density based clustering algorithm using gaussian kernel. Expert Syst Appl. 2017; 69:10–20. https://doi.org/10.1016/j.eswa.2016.10.022.
35. Manoj K, Kannan KS. Comparison of methods for detecting outliers. Publ Econometriques. 2013; 4(9):43–53.
36. Wickham H. The split-apply-combine strategy for data analysis. J Stat Softw Artic. 2011; 40(1):1–29. https://doi.org/10.18637/jss.v040.i01.
37. Li K, Yang W, Li K. Performance analysis and optimization for spmv on gpu using probabilistic modeling. IEEE Trans Parallel Distrib Syst. 2015; 26(1):196–205. https://doi.org/10.1109/TPDS.2014.2308221.
38. Li K, Tang X, Veeravalli B, Li K. Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems. IEEE Trans Comput. 2015; 64(1):191–204. https://doi.org/10.1109/TC.2013.205.
39. Li K, Tang X, Li K. Energy-efficient stochastic task scheduling on heterogeneous computing systems. IEEE Trans Parallel Distrib Syst. 2014; 25(11):2867–76. https://doi.org/10.1109/TPDS.2013.270.
40. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010; 11:2837–54.