An improved sparse representation model with structural information for Multicolour Fluorescence In-Situ Hybridization (M-FISH) image classification

Background Multicolour Fluorescence In-Situ Hybridization (M-FISH) images are used to detect chromosomal abnormalities such as translocations, deletions, duplications and inversions. The technique uses mixed colours of fluorochromes to paint whole chromosomes for rapid detection of chromosome rearrangements. The M-FISH data sets used in our research are obtained from microscopic scanning of a metaphase cell labelled with five different fluorochromes and a DAPI stain. The reliability of the technique depends on accurate classification of chromosomes (24 classes for male and 23 classes for female) from M-FISH images. However, imaging noise, misalignment between channels and other imaging problems always introduce some classification error, which leads to incorrect detection of chromosomal abnormalities. Accurately classifying the different types of chromosomes from M-FISH images is therefore a challenging problem.
Methods This paper presents a novel sparse representation model that uses structural information for the classification of M-FISH images. In our previous work, a sparse representation based classification model was proposed that employed only individual pixel information. By using the structural information of neighbouring pixels together with the information of the pixel itself, the novel approach extends the previous one to the regional case. Based on Orthogonal Matching Pursuit (OMP), we developed a simultaneous OMP (SOMP) algorithm to derive an efficient solution of the improved sparse representation model that incorporates the structural information.
Results The p-value of the comparison between the two models shows that the newly proposed model incorporating the structural information is significantly superior to our previous one. In addition, we evaluated the effect of several parameters, such as sparsity level, neighbourhood size and training sample size, on the classification accuracy.
Conclusions The comparison with our previously used sparse model demonstrates that the improved sparse representation model is more effective than the previous one for classifying chromosomes in M-FISH images and hence for detecting chromosome abnormalities.


Background
Chromosomal abnormalities (e.g., changes in chromosome number and translocations of structures) can cause genetic diseases and cancers. To detect these life-threatening diseases, the multicolour Fluorescence In-Situ Hybridization (M-FISH) technique uses different colours to paint human chromosomes, so that these abnormalities can be analyzed simultaneously [1,2]. This cytogenetic approach uses N fluorochromes to label a metaphase cell, which gives $2^N - 1$ different combinations for differentiating the types of chromosomes. With N = 5 there are $2^5 - 1 = 31$ combinations, which is enough to differentiate the 24 types of human chromosomes. Therefore, the S Gold (F), S Green (G), S Aqua (A), Red (R) and S Red (Y) fluorochromes are used to paint the chromosomes. The painted chromosomes are illuminated by light of a specific wavelength, and the fluorochromes on the chromosomes emit fluorescent light of distinct wavelengths that can be detected by the microscope. To acquire images of the different fluorescence colours, 5 different emission filters were employed to block the other fluorescence colours and keep the valid emission light. Figure 1 illustrates an M-FISH image set collected by a microscope with a CCD camera. The last image in Figure 1 is the DAPI channel, which shows all the chromosomes in a cell. For each fluorescence channel, one image is generated and the chromosomes are detected as the pixels with high intensity. Ideally, a chromosome can be dyed with at least two fluorochromes, for example, S Green (G) and DAPI. Such a chromosome should be visible only in the G and DAPI channels, but it is sometimes observed in other channels because of spectral mixing and inhomogeneous background [3]. In practice, it is therefore extremely challenging to identify the chromosomes accurately from an M-FISH image set.
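The combinatorial labelling count mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the single-letter channel names are taken from the text, while the variable names are our own.

```python
from itertools import combinations

# Channel letters as used in the text for the five fluorochromes.
fluorochromes = ["F", "G", "A", "R", "Y"]

# Every non-empty subset of fluorochromes is one possible labelling.
labellings = [
    combo
    for size in range(1, len(fluorochromes) + 1)
    for combo in combinations(fluorochromes, size)
]

n_labellings = len(labellings)          # equals 2**5 - 1
enough_for_24 = n_labellings >= 24      # 24 chromosome types must be distinguishable
```

With four fluorochromes only 15 combinations would be available, which is why at least five are needed for 24 classes.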
To detect the chromosomal abnormalities associated with genetic diseases or cancers with the M-FISH technique, it is important to improve the accuracy of chromosome classification. Before classification, preprocessing methods [3][4][5][6][7] are needed to increase the accuracy by reducing the noise in the original images. There are two major types of classifiers: the pixel-by-pixel classifier [8][9][10] and the region-based classifier [6,7]. For classification, we have proposed a Bayesian classifier [11] and sparse representation based classification (SRC) [12]. For segmentation, we have developed the Adaptive Fuzzy C-Means (AFCM) segmentation method [6]. To bring the imaging technique into clinical use, further effort is needed to improve the classification accuracy.
Sparse representation methods, including compressive sensing, have been widely studied recently in applied mathematics and signal/image processing for their advantages in processing high dimensional data [13,14]. There are many algorithms for solving sparse models, e.g., greedy algorithms such as Matching Pursuit (MP) [15] and OMP [16], as well as Homotopy [17]. Recently, Multiple Measurement Vectors (MMV) based models have also been proposed to recover a set of vectors that share a common support. Such models find wide application in many research fields, e.g., multiple signal classification (MUSIC) [18], blind multiband signal reconstruction [19] and compressive diffuse optical tomography [20]. Motivated by these efforts on the MMV problem, we proposed a novel sparse representation model that incorporates structural information into the classification of the M-FISH image set, which was reported in our preliminary study [21]. This improved model considers the correlations of neighbouring pixels, which often share the same features and belong to the same class. By using information both from the neighbourhood of a pixel and from the different spectral channels, the proposed sparse model gives better classification results than the sparse model we used before [12].
The paper is organized as follows. First, we introduce the SRC model without structural information and then propose an improved sparse model together with the corresponding algorithm (i.e., SOMP) for estimating its solution. Next, we apply the improved model to M-FISH classification and compare it with the conventional sparse model employed in our previous work [12]. Finally, the paper concludes with a short summary and discussion of the proposed model.

Methods
The SRC model has been successfully used in many fields (e.g., hyperspectral imaging classification [22] and M-FISH chromosome classification [12]). Before introducing the improved sparse model, we first review the sparse model and show how to apply it to M-FISH image data analysis. Then, we present the improved sparse model with structural information for M-FISH chromosome classification, which uses the correlated information of the neighbouring pixels within a region. Finally, we describe the numerical algorithm, SOMP, for solving this improved model.

SRC algorithm for M-FISH data
A general sparse model is shown in Eq. (1),

$$ y = Ax \tag{1} $$

where y is a vector of observations, A is a matrix consisting of features from different classes, and x is the vector of coefficients corresponding to the observation vector y. If the observation y belongs to a particular class, the coefficients in x will have a few non-zero entries concentrated in the region corresponding to that class, whereas the rest will be zero; i.e., x is a sparse vector with many zero entries. Figure 2 shows the schematic diagram of the sparse model. In Figure 2, matrix A consists of features from three different classes, which are represented by yellow, red and green respectively, and x is a sparse vector with non-zero entries in the red region and zero entries in the white regions. Given an observation vector y, the sparse vector x can be obtained from the optimization model shown in Eq. (2).

Figure 2 The schematic diagram of a sparse model used for classification. A represents the feature matrix, and y is a vector to be classified. In matrix A, three different colours represent three different classes. x is a sparse vector with many zero entries (white) and a few non-zero entries (red) corresponding to the particular class to which y belongs.

Assuming we have m classes (m = 24 in our case) and each training pixel corresponds to an n-dimensional (n = 5) feature vector $a_j$, $j = 1, 2, \ldots, N_i$, the feature matrix A can be represented by

$$ A = [A_1, A_2, \ldots, A_m], \qquad A_i = [a_1, a_2, \ldots, a_{N_i}] \in \mathbb{R}^{n \times N_i} $$

Here $N_i$ is the number of training pixels from the i-th class. In matrix A, the number of columns is $N = \sum_{i=1}^{m} N_i$. Based on the sparse model in Eq. (1), a test sample y can be approximated by a sparse solution $\hat{x}$ with non-zero coefficients corresponding to a particular class using Eq. (2):

$$ \hat{x} = \arg\min_{x} \|y - Ax\|_2 \quad \text{subject to} \quad \|x\|_p \le K_0 \tag{2} $$

where $y \in \mathbb{R}^n$ is the test pixel to be classified; $\|x\|_p$, $p \in [0, 1]$, is the $L_p$ norm of x, which is used to shrink the solution $\hat{x} \in \mathbb{R}^N$ so that only a small percentage of its coefficients are non-zero, i.e., so that the solution is sparse. By specifying the value of $K_0$, we can obtain solutions with different sparsity levels. For simplicity, we take p = 0, where $\|x\|_0$ is the $L_0$ norm of x, i.e., the number of non-zero coefficients in x.
After estimating the solution of Eqs. (1)-(2), we classify a test sample y as follows:

$$ \mathrm{class}(y) = \arg\min_{i = 1, \ldots, m} \|y - A_i \hat{x}_i\|_2 \tag{3} $$

where m is the number of classes and $\hat{x}_i$ is the part of the sparse solution corresponding to class i. The test sample y is assigned to the class for which the distance between y and the estimate $A_i \hat{x}_i$ is minimum.
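The single-pixel SRC procedure described above (a sparse fit with p = 0 followed by the minimum-residual rule of Eq. (3)) can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names `omp` and `src_classify`, the toy prototypes and the noise level are all hypothetical, and the data are synthetic rather than taken from the M-FISH database.

```python
import numpy as np

def omp(A, y, k0):
    """Greedy OMP: select up to k0 columns of A to approximate y."""
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k0):
        # Score each column by its correlation with the current residual.
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = -np.inf      # never select a column twice
        support.append(int(np.argmax(scores)))
        # Re-fit all selected columns jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

def src_classify(A, labels, y, k0):
    """Assign y to the class with the smallest class-wise residual."""
    x = omp(A, y, k0)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y - A[:, labels == c] @ x[labels == c]) for c in classes
    ]
    return int(classes[int(np.argmin(residuals))])

# Synthetic toy data (assumption): 3 classes, n = 5 channels, 4 training pixels each.
rng = np.random.default_rng(0)
prototypes = np.array([[1, 0, 0, 0, 0],
                       [0, 1, 1, 0, 0],
                       [0, 0, 0, 1, 1]], dtype=float)
labels = np.repeat([1, 2, 3], 4)
A = prototypes[labels - 1].T + 0.05 * rng.normal(size=(5, 12))
A /= np.linalg.norm(A, axis=0)                  # unit-norm columns
y = prototypes[1] + 0.05 * rng.normal(size=5)   # noisy pixel drawn from class 2

x_hat = omp(A, y, k0=2)
predicted = src_classify(A, labels, y, k0=2)
```

Here `k0` plays the role of the sparsity level $K_0$; the loop stops once that many columns have been selected.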

Improved sparse model with structural information for M-FISH data analysis
In Eq. (1), y is a feature vector consisting of the 5-channel spectral information at a single pixel. In practice, however, a pixel usually shares the same features with its neighbouring pixels, which is the case for the M-FISH image set. The neighbouring pixels with similar intensity values form the nearest neighbourhood of the central pixel $y_5$, as illustrated in Figure 3.
The classification accuracy of a pixel-by-pixel classifier and its robustness to noise can be improved by considering the structural information of a pixel within a neighbouring region. Therefore, we propose a new sparse model with structural information that uses the information of the neighbouring pixels simultaneously instead of a single pixel, as shown in Eq. (4):

$$ Y = [y_1, \ldots, y_9] = A\,[x_1, \ldots, x_9] = AX \tag{4} $$

where $y_1, \ldots, y_9$ are the test samples within a neighbourhood that form the matrix Y, $y_5$ is the central pixel, and $x_1, \ldots, x_9$ are the vectors of corresponding weights that form the matrix X. Eq. (4) shows that $y_1, \ldots, y_9$ share the same features in matrix A but have different weights. Figure 4 shows the schematic diagram of the improved sparse model with structural information based on Eq. (4).
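To make the construction of the test matrix Y concrete, the following sketch stacks the 5-channel pixels of a square window into the columns of Y. The function name `neighbourhood_matrix`, the `(channels, height, width)` array layout and the clipping of the window at the image border are our assumptions; the text does not say how border pixels are handled.

```python
import numpy as np

def neighbourhood_matrix(stack, row, col, half=1):
    """Return the n x s matrix Y for the window centred at (row, col).

    stack has shape (n_channels, height, width). The window has side
    2*half + 1 and is clipped at the image border (an assumption), so s
    can be smaller than (2*half + 1)**2 near the edges.
    """
    r0, r1 = max(row - half, 0), min(row + half + 1, stack.shape[1])
    c0, c1 = max(col - half, 0), min(col + half + 1, stack.shape[2])
    window = stack[:, r0:r1, c0:c1]
    # Flatten the spatial axes in row-major order: one column per pixel.
    return window.reshape(stack.shape[0], -1)

# Synthetic 5-channel image of size 4 x 4 (values are arbitrary).
stack = np.arange(5 * 4 * 4, dtype=float).reshape(5, 4, 4)
Y = neighbourhood_matrix(stack, 1, 1)          # 3 x 3 window, s = 9
Y_corner = neighbourhood_matrix(stack, 0, 0)   # clipped window, s = 4
```

For an interior pixel and a 3 x 3 window, column index 4 of `Y` (zero-based) is the central pixel, i.e., $y_5$ in the notation of the text.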
Since matrix X is a row-wise sparse matrix, as shown in Figure 4, the improved model is an extension of our previous sparse model (1) that considers multiple pixels simultaneously. With this improved sparse model, we propose to use the following optimization for the solution:

$$ \hat{X} = \arg\min_{X} \|Y - AX\|_F \quad \text{subject to} \quad \|X\|_{0,q} \le K_0 \tag{5} $$

where $Y = [y_1, \ldots, y_j, \ldots, y_s]$ is a test matrix instead of the test vector in the SRC model [12]. The test matrix contains s test pixels within a neighbouring region, which are assumed to be spatially correlated. The row-sparse solution $\hat{X} = [x_1, \ldots, x_j, \ldots, x_s]$ corresponds to the input matrix Y. The columns $x_j$, $j = 1, \ldots, s$, share the same non-zero support. They are obtained by Eq. (5) with the following regularization term:

$$ \|X\|_{0,q} = \sum_{i} I\left(\|x^i\|_q > 0\right) \tag{6} $$

where $\|X\|_{0,q}$ is the number of non-zero rows of X, and $x^i$ is the i-th row of X. $I(\|x^i\|_q > 0)$ is an indicator function that has the value 1 if $\|x^i\|_q > 0$ and 0 otherwise. In this work, we set q = 2. The solution vectors $x_j$, $j = 1, \ldots, s$, have row-wise sparsity (i.e., their non-zero entries lie in the same rows), which reflects the high correlation of the neighbouring pixels.

Figure 3 The schematic diagram of the neighbouring region. Each square represents a pixel in an M-FISH image, corresponding to five-channel spectral information. $y_5$ is a central pixel surrounded by its eight neighbouring pixels $y_1, y_2, y_3, y_4, y_6, y_7, y_8$ and $y_9$. The size of the neighbourhood can vary.

The decision rule of the improved model is similar to that of Eq. (3). After we obtain $\hat{X}$, we employ Eq. (7) to determine the class to which the central pixel $y_c$ of the neighbourhood belongs:

$$ \mathrm{class}(y_c) = \arg\min_{i = 1, \ldots, m} \|Y - A_i \hat{X}_i\|_F \tag{7} $$

where $y_c$ is the central pixel of a neighbourhood and $\|Y - A_i \hat{X}_i\|_F$ is the residual between the input matrix Y, consisting of the neighbouring pixels around $y_c$, and the product of the sub-matrix $A_i$ and the corresponding part $\hat{X}_i$ of the solution. The class with the minimum residual is the class to which the central pixel belongs.

Algorithms for the solution of the improved sparse model
There are many approximate algorithms for solving the optimization problems in Eqs. (2) and (5). When p equals 0 (i.e., the $L_0$ norm), greedy algorithms (e.g., MP, OMP) [15,16] are employed to solve Eq. (2). The simultaneous OMP (SOMP) algorithm [23] is employed to solve Eq. (5), in place of the OMP algorithm used for Eq. (2), and the details of SOMP are described in Table 1. At each iteration, the algorithm picks one column $a_k$ from the training matrix A, namely the column whose projection onto the current residual matrix has the maximum q-norm. Once the column is selected, it is included in the re-estimation of the signal $\hat{X}$, which yields a new, reduced residual. The algorithm continues until the solution reaches the pre-specified sparsity level.
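The iteration described above can be sketched as follows. This is a hedged illustration written from the prose description only (Table 1 is not reproduced here), so details such as the least-squares re-estimation step and the exclusion of already selected columns are our assumptions; the names `somp` and `somp_classify` and the synthetic data are hypothetical.

```python
import numpy as np

def somp(A, Y, k0, q=2):
    """Simultaneous OMP: all columns of Y share one support of size k0."""
    residual = Y.astype(float).copy()
    support = []
    coef = np.zeros((0, Y.shape[1]))
    for _ in range(k0):
        # q-norm, over the s pixels, of each atom's correlation with the residual.
        scores = np.linalg.norm(A.T @ residual, ord=q, axis=1)
        if support:
            scores[support] = -np.inf      # never select a column twice
        support.append(int(np.argmax(scores)))
        # Re-estimate the coefficients of all selected atoms jointly.
        coef, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        residual = Y - A[:, support] @ coef
    X = np.zeros((A.shape[1], Y.shape[1]))
    X[support, :] = coef                   # only k0 rows are non-zero
    return X

def somp_classify(A, labels, Y, k0):
    """Assign the neighbourhood to the class with the smallest Frobenius residual."""
    X = somp(A, Y, k0)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(Y - A[:, labels == c] @ X[labels == c, :]) for c in classes
    ]
    return int(classes[int(np.argmin(residuals))])

# Synthetic toy data (assumption): 3 classes, n = 5 channels, s = 9 neighbours.
rng = np.random.default_rng(0)
prototypes = np.array([[1, 0, 0, 0, 0],
                       [0, 1, 1, 0, 0],
                       [0, 0, 0, 1, 1]], dtype=float)
labels = np.repeat([1, 2, 3], 4)
A = prototypes[labels - 1].T + 0.05 * rng.normal(size=(5, 12))
A /= np.linalg.norm(A, axis=0)
Y = prototypes[2][:, None] + 0.05 * rng.normal(size=(5, 9))   # class-3 window

X_hat = somp(A, Y, k0=2)
predicted = somp_classify(A, labels, Y, k0=2)
nonzero_rows = int(np.count_nonzero(np.any(X_hat != 0, axis=1)))
```

Because the support is chosen once per iteration for the whole window, an isolated noisy pixel cannot select its own atoms, which is the mechanism by which the neighbourhood suppresses isolated misclassifications.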

M-FISH database
We have collaborated with Advanced Digital Imaging Research (ADIR; League City, Texas, USA) to establish the M-FISH image database, which is a valuable resource for chromosome imaging studies [18]. The database is publicly available from [24]. A set of images from five different fluorescence channels and a DAPI channel was acquired by microscopy; an example is shown in Figure 1. In addition, to evaluate the classification accuracy, an experienced cytogeneticist provided a ground truth image, which is shown in Figure 5(a) in the form of pseudo colours, where different colours indicate different types of chromosomes. There are 24 different classes in total, covering male and female chromosomes.
In the ground truth images, the background pixels were labelled with 0 and the pixels in regions of overlap were labelled with 255. The remaining pixels were labelled with numbers from 1 to 24, which discriminate the different types of chromosomes. The ground truth is employed to verify the accuracy of classification algorithms on the M-FISH image set.
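The label convention above (0 for background, 255 for overlap, 1 to 24 for chromosome classes) can be handled with a simple boolean mask. The sketch below assumes that background and overlap pixels are set aside when selecting labelled chromosome pixels, which the text does not state explicitly; the small `truth` array is synthetic.

```python
import numpy as np

BACKGROUND, OVERLAP = 0, 255

# Synthetic ground-truth image (assumption), not taken from the database.
truth = np.array([[0, 1, 1, 255],
                  [0, 2, 255, 24],
                  [0, 0, 3, 3]])

# Pixels that carry a usable chromosome label in the range 1..24.
valid = (truth != BACKGROUND) & (truth != OVERLAP)
class_ids = np.unique(truth[valid])
n_valid = int(valid.sum())
```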

Segmentation of chromosome regions
In M-FISH images, the background usually contains most of the pixels, but the chromosomal regions are of most interest. Therefore, to separate the chromosomal regions from the background and improve the efficiency of the classification, a mask was generated from the DAPI channel, which shows all the chromosomes in a cell. The AFCM method we proposed in [5] was employed for this purpose. The mask was then applied to the other five channels, so that the chromosome regions were extracted and the pixels outside the mask were removed. Figure 6 shows an image of a DAPI channel and how the mask is generated by the segmentation.
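Applying the mask to the five fluorescence channels amounts to boolean indexing. The sketch below takes the binary mask as given (the AFCM segmentation itself is not implemented here); the function name `apply_mask` and the `(channels, height, width)` layout are our assumptions.

```python
import numpy as np

def apply_mask(channels, mask):
    """Keep only the masked pixels; returns an array (num_kept_pixels, n_channels)."""
    return channels[:, mask].T

# Synthetic 5-channel image of size 2 x 3 and a hand-written mask (assumption).
channels = np.arange(5 * 2 * 3, dtype=float).reshape(5, 2, 3)
mask = np.array([[True, False, False],
                 [False, True, True]])
features = apply_mask(channels, mask)
```

Each row of `features` is the 5-dimensional feature vector of one chromosome pixel, in row-major order of the mask.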

M-FISH training and testing data
The improved sparse model with structural information was applied to the classification of M-FISH image data. Twenty cells (10 male, 10 female) were chosen from our database [24]. The features of the different types of chromosomes were constructed by randomly sampling pixels from the M-FISH images to form the training matrix A, such that the sampled features satisfy the sparsity concentration index (SCI) proposed by [25]. SCI is used to measure the sparsity concentration of the feature vectors. Matrix A is an n×N matrix, in which n represents the spectral dimension of the pixels and N represents the number of training features. In the case of M-FISH image data, n equals 5. After the training matrix was constructed, the rest of the pixels were taken as testing data to validate our proposed classification method.
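The random split into training features and test pixels can be sketched as below. This illustration samples the same fraction of pixels from every class, which is our assumption, and it does not implement the SCI check of [25]; the function name `build_training_matrix` and the synthetic pixel data are hypothetical.

```python
import numpy as np

def build_training_matrix(pixels, labels, fraction, rng):
    """Sample a fraction of the pixels of each class as columns of A.

    pixels has shape (num_pixels, n); returns A (n x N), the class label
    of every column of A, and the indices of the remaining test pixels.
    """
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_train = max(1, int(round(fraction * idx.size)))
        train_idx.extend(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    A = pixels[train_idx].T            # columns are 5-channel feature vectors
    return A, labels[train_idx], test_idx

# Synthetic data (assumption): 60 pixels, n = 5 channels, 3 classes of 20 pixels.
rng = np.random.default_rng(0)
pixels = rng.random((60, 5))
labels = np.repeat([1, 2, 3], 20)
A, train_labels, test_idx = build_training_matrix(pixels, labels, 0.10, rng)
```

With a 10% fraction, two pixels per class are drawn, so A has N = 6 columns and the remaining 54 pixels form the test set.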

The analysis of the classification results with different models
Both the sparse model incorporating the structural information and our previously used sparse model [12] were tested and compared on our M-FISH data set. Figures 5(b) and 5(c) show the classification results of the two models on the same cell, with and without the use of structural information respectively. It can be seen that there are more isolated spots in the chromosomal regions of Figure 5(c) than in those of Figure 5(b). These isolated spots are mostly misclassifications, which can be effectively corrected by using the improved sparse model with structural information. The ratio of correct classification (RCC) is defined as follows:

$$ \mathrm{RCC} = \frac{\text{number of correctly classified pixels}}{\text{number of all pixels in a chromosomal region}} \tag{8} $$

Table 2 shows the RCC of the different types of chromosomes for one M-FISH image set. The RCC of the improved sparse model with structural information is generally greater than that of our previously used sparse model. Figure 7 compares the classification results of both models on each cell in terms of RCC. It can be seen that the classification accuracy of the improved sparse model with structural information (in red) is greater than that of the previously used sparse model [12] (in blue). Therefore, by using the structural information of the neighbouring region, the improved sparse model can increase the classification accuracy for the M-FISH image set.

Figure 5 The classification results of using two different models. The first figure (a) is the ground truth used to verify the classification results. The second one (b) is the classification result using the improved sparse model with structural information. The third one (c) is the classification result using our previous model [12]. The classification results (i.e., types of chromosomes) are visualized in the form of pseudo colours.
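The RCC of Eq. (8) is straightforward to compute from a ground-truth image and a predicted label image. In the sketch below, the chromosomal region is taken to be every pixel that is neither background (0) nor overlap (255); excluding overlap pixels is our assumption. The function name `rcc` and the small arrays are illustrative only.

```python
import numpy as np

def rcc(truth, predicted, background=0, overlap=255):
    """Ratio of correctly classified pixels within the chromosomal region."""
    region = (truth != background) & (truth != overlap)
    return float(np.mean(predicted[region] == truth[region]))

# Synthetic label images (assumption), not taken from the database.
truth = np.array([[0, 1, 1],
                  [2, 2, 255]])
predicted = np.array([[5, 1, 2],
                      [2, 2, 7]])
score = rcc(truth, predicted)   # 3 of the 4 chromosome pixels are correct
```

A per-chromosome RCC, as reported in Table 2, would restrict `region` further to the pixels of a single class.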

Significance analysis of the new sparse model with structural information
Statistical analysis using a paired-sample t-test was performed to assess the significance of the difference between the two models. The null hypothesis is that there is no difference between the two models. Figure 8 shows the results of the statistical analysis based on the results in Figure 7. The improved sparse model with structural information has a greater mean and a smaller standard deviation, 76.72 ± 9.3 (the left box plot in Figure 8), than the previous sparse model, 72.94 ± 9.82 (the right box plot in Figure 8). The significance level (i.e., p-value) of this analysis is less than 1e-6. Therefore, by incorporating the structural information available in the neighbourhood of each pixel, the improved sparse model significantly outperforms our previous sparse model.
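For readers who want to reproduce this kind of comparison, the paired-sample t statistic can be computed with the standard library alone. The RCC values below are made-up placeholders, not the per-cell results of Figure 7, and the function name `paired_t_statistic` is our own.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical per-cell RCC values (percent) for the two models.
rcc_improved = [78.0, 80.0, 75.0, 70.0]
rcc_previous = [74.0, 77.0, 70.0, 68.0]
t_stat, dof = paired_t_statistic(rcc_improved, rcc_previous)
```

The p-value follows from the t distribution with `dof` degrees of freedom; a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.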

Effects of parameters used
There are three important parameters involved in the improved sparse model: the neighbourhood size (s), the sparsity level ($K_0$) and the training sample size ($N_i$). The accuracy of the classification results can be affected by these three parameters, and hence it is worthwhile to study their effects. Figure 9 shows how the RCC is affected by different values of $K_0$ and s. When $K_0$ is fixed, the RCC rises with increasing neighbourhood size s up to a certain threshold (e.g., s = 121). This indicates that the use of correlated information within a window can generally increase the classification accuracy. However, if the window is too large, there is a high probability that irrelevant pixels or pixels from other chromosomes will be included, which tends to increase the classification error. An appropriate window size is therefore needed. A neighbourhood size of s = 9 is recommended based on our experiments. When the neighbourhood size s is fixed, Figure 9 shows that a smaller value of the sparsity level $K_0$ gives a greater classification accuracy. In addition, the correct classification ratio for the M-FISH image is affected by the training sample size $N_i$ for both models, as shown in Figure 10. A number of different percentages of training samples were selected: 1%, 3%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, and 50%. In Figure 10, the correct classification ratios of the two models are represented by stars and triangles respectively. The results show that the correct classification ratios increase with the size of the training sample for both models, which is reasonable.

Figure 7 The RCC of two models. The red line is the RCC of using the new sparse model incorporating the structural information for 20 different cells. The green line is the RCC of using our previous model (i.e., general sparse model) for the same 20 cells.

Figure 8 Statistical analysis of two different sparse model based classification methods. The first box plot shows the quartiles and ranges of the classification results of the new sparse model incorporating the structural information, while the second one shows those using our previous model.

Figure 9 The effect of $K_0$ and s on the RCC. Different lines represent the accuracy of classifying an M-FISH image for different sparsity levels ($K_0$) and different neighbourhood sizes s. $K_0$ is changed from 1 to 10 and s is changed from 1 to 169. When $K_0$ equals 1 to 5, there is almost no difference among different neighbourhood sizes s. However, when $K_0$ is greater than 5, the RCC changes dramatically with different sparsity levels.

Figure 10 The effect of $N_i$ on the RCC. The star line represents the RCC of the new sparse model incorporating the structural information with respect to different percentages of training samples, whereas the triangle line represents the RCC of our previous model [12]. The results of the improved sparse model are much better than those of our previous model. In addition, the classification ratio increases with the number of training samples.

Conclusions and discussion
The sparse model based classifier that we proposed before [12] used pixel-by-pixel classification and overlooked structural information, so its results contain many more isolated spots, which lowers the classification accuracy. In this paper we proposed an improved sparse model, in which the information of a central pixel and of its neighbouring pixels is used simultaneously for improved classification. This is validated by the comparison of chromosomal classification accuracy between the two models on a real M-FISH database [24]. The comparison (as illustrated by Figure 5) shows that there are more isolated spots (i.e., misclassifications) in the classification results of our previous model [12] than in those of the new sparse model incorporating the structural information. The correct classification ratio in Table 2 also shows the improved accuracy of the improved sparse model. The statistical comparison between the two models indicates that the new sparse model with structural information is superior to the previously used sparse model, with a significance level of less than 1e-6. The effects of the parameters used in the model on the classification accuracy were also investigated. We have shown how the sparsity level ($K_0$), the neighbourhood size (s) and the training sample size ($N_i$) affect the RCC of our improved sparse model incorporating structural information, and how the training sample size ($N_i$) affects the RCC of our previously used model as well as the improved model. A sparsity level of $K_0 \le 5$ and a neighbourhood size of s = 9 are recommended based on our experiments.
In summary, all the results show that our proposed improved sparse model incorporating structural information can significantly improve the classification accuracy compared with the general sparse model that we proposed before [12]. This will in turn improve the M-FISH imaging technique for detecting chromosome abnormalities and help to better diagnose genetic diseases and cancers.