SysNet provides interactive analysis and graphic visualization of molecular expression data. There are two major functionalities in the current version of this software: interactive analysis of molecular correlations and comparative analysis of 'omics expression data.
Interactive analysis of molecular correlation – Measures of molecular correlation are descriptive statistics that represent the degree of relationship between two or more variables; they are not themselves inferential statistical tests. Parametric and nonparametric statistical methods are available for correlation measurement [21]. The parametric method is based on assumptions that include 1) the subjects are randomly selected from the population; 2) the number of subjects is large enough to represent the distribution of the population; and 3) the variables have a bivariate normal distribution. Nonparametric or parameter-free methods do not rely on the estimation of parameters (such as the mean or the standard deviation) that describe the distribution of the variable of interest in the population.
SysNet has implemented both parametric and non-parametric pairwise measures, including the parametric Pearson product-moment correlation ($r_p$), the non-parametric Spearman correlation ($r_s$) and the non-parametric Kendall's coefficient of rank correlation ($\tau$).
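The Pearson product-moment correlation coefficient ($r_p$) assumes that a linear function best describes the relationship between two variables. It can be used to evaluate data for n subjects, each of which has contributed a score on two variables designated as X and Y. $r_p$ is calculated as follows:
$$ r_p = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}} \tag{1} $$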
Spearman's rank-order correlation ($r_s$) is a bivariate measure of correlation employed with rank-order data. It determines the degree to which a monotonic relationship exists between two variables. Equation (2) shows the $r_s$ calculation for n subjects where each subject has an X and a Y score. The ranks of the n subjects' scores on the X and Y variables are recorded in $R_X$ and $R_Y$, respectively. $d = R_X - R_Y$ contains a difference score for each subject that is obtained by subtracting a subject's rank on the Y variable from the subject's rank on the X variable.
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \tag{2} $$
The non-parametric Kendall's coefficient of rank correlation ($\tau$) is also a bivariate measure of correlation employed with rank-order data. In this case, one assumes that data are in the form of the following two pairs of observations expressed in a rank-order format: a) $(R_{X_i}, R_{Y_i})$ (that represents the ranks on variables X and Y for the i-th subject, respectively); and b) $(R_{X_j}, R_{Y_j})$ (that represents the ranks on variables X and Y for the j-th subject, respectively). If the sign of the difference $(R_{X_i} - R_{X_j})$ is the same as the sign of the difference $(R_{Y_i} - R_{Y_j})$, a pair of ranks is said to be concordant. If the sign of the difference $(R_{X_i} - R_{X_j})$ is not the same as the sign of the difference $(R_{Y_i} - R_{Y_j})$, a pair of ranks is said to be discordant. $\tau$ is calculated as follows:
$$ \tau = \frac{n_C - n_D}{n(n-1)/2} \tag{3} $$
where $n_C$ is the number of concordant pairs of ranks, $n_D$ is the number of discordant pairs of ranks, and $n(n-1)/2$ is the total number of possible pairs of ranks.
The Kendall coefficient $\tau$ is equivalent to Spearman's $r_s$ with regard to the underlying assumptions, and the two are comparable in terms of statistical power. However, Spearman's $r_s$ and Kendall's $\tau$ are usually not identical in magnitude because the underlying logic and computational formulas are different. Importantly, Kendall's $\tau$ and Spearman's $r_s$ may lead to different interpretations. Spearman's $r_s$ can be thought of as the regular Pearson product-moment correlation coefficient in terms of proportion of variability accounted for, except that Spearman's $r_s$ is computed from ranks. Kendall's $\tau$, on the other hand, represents the difference between the probabilities that in the observed data two variables are in the same order versus different orders.
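As an illustration of the three coefficients described above, the following minimal Python sketch computes them with the standard textbook formulas. It is not SysNet's implementation (which is written in Visual C++); the function names and the sample data are hypothetical, and the Spearman shortcut formula shown is exact only when there are no tied ranks.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson r from raw sums: covariance term over the product of spreads."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return numerator / denominator

def spearman(x, y):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def kendall(x, y):
    """tau = (concordant - discordant) / (n * (n - 1) / 2)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical expression values for two molecules in five samples.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson(x, y), spearman(x, y), kendall(x, y))
```

For these data the Spearman and Kendall values differ in magnitude even though both are computed from the same ranks, which is the point made in the preceding paragraph.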
In an 'omics global profiling experiment, multiple samples (subjects) are analyzed and many molecules (observations) are detected in each sample. These molecules can be proteins, metabolites and/or metal ions, etc., depending on the experimental design. Even though the experimental analyses vary significantly among different types of 'omics research, the final expression data are similar: multiple molecules are detected in each sample, and each detected molecule has a numerical value indicating the relative expression level of that molecule in the sample. The molecular expression data are then organized as a data table. For example, each column represents a sample while each row stores the expression values of a specific detected molecule in each sample. We selected a relatively simple tabular ionomics dataset to illustrate the capability of SysNet. The software can be applied for visualization and correlation of data from all such high-volume molecular expression experiments, including proteomics and metabolomics.
For interactive analysis of intermolecular correlations, SysNet focuses on expression data from a single experiment, where multiple subjects are used for analysis. By default, SysNet calculates the Pearson product-moment correlation ($r_p$) between every pair of molecules. The user is able to select other methods based on the nature of the data. These measurements are computed dynamically and stored in RAM. The molecular correlation is displayed in a 'main window' that is divided into two panels (Fig. 2). The left panel lists all molecules measured while the right panel displays the molecular correlation. The circumference of each circle is proportional to the number of molecules to be displayed.
Molecular correlation analysis evaluates the concentration change of different molecules in all samples. The maximum number of pairwise correlations among n molecules is n(n-1)/2. In our ionomics experimental setup, 17 elements are measured for each sample. Figure 2 displays correlation networks for four Arabidopsis strain experimental groups (ler2, col0, 152–54 and fpt2) with just 68 elements displayed. This visualization becomes extremely busy if thousands of correlations are displayed. For this reason, we implemented three methods for visual analysis of large numbers of correlations: the first filters correlations based on correlation strength, the second creates a larger image using zooming functions, and the third enables the user to move a molecule (node) or an experiment category (circle) around to facilitate visualization. In the first approach, the two sliding bars at the bottom of the screen determine the correlation coefficient values used to filter the data displayed. All molecules having at least one correlation coefficient higher than the filtering criteria are displayed as nodes. The user can adjust the filter values either by moving the sliding bars or by entering a number at the bottom of the right panel. Molecular and correlation information is automatically updated on the graph in response to user changes. In the second approach, SysNet changes the size of the correlation map with zooming functions that enable the user to perform focused analyses. In the third approach, the user can re-arrange the correlation map by simply selecting a circle or node and dragging it to another panel location. SysNet displays all nodes on a circle by default. Figure 2 is a screenshot showing that nodes 21 and 23 have been moved from their default locations on the circle to other screen locations for easy visualization.
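The pairwise computation and strength-based filtering described above can be sketched as follows. This is a hedged illustration, not SysNet's code: the function names and the expression table are hypothetical, and the sketch assumes that the two sliders act as a lower and an upper bound on the absolute correlation coefficient, which the text does not state explicitly.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson r, clamped to [-1, 1] to absorb floating-point rounding."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    numerator = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    spread_x = sum(a * a for a in x) - sx * sx / n
    spread_y = sum(b * b for b in y) - sy * sy / n
    r = numerator / math.sqrt(spread_x * spread_y)
    return max(-1.0, min(1.0, r))

def correlation_edges(table):
    """table maps molecule name -> expression values, one per sample.

    Returns one coefficient per unordered pair, i.e. n(n-1)/2 entries."""
    return {(a, b): pearson(table[a], table[b])
            for a, b in combinations(sorted(table), 2)}

def filter_network(edges, low, high):
    """Keep edges with low <= |r| <= high (assumed slider semantics);
    a molecule is shown as a node only if it keeps at least one edge."""
    kept = {pair: r for pair, r in edges.items() if low <= abs(r) <= high}
    nodes = sorted({molecule for pair in kept for molecule in pair})
    return nodes, kept

# Hypothetical expression table: four elements measured in four samples.
table = {
    "Fe": [1, 2, 3, 4],
    "Zn": [2, 4, 6, 8],
    "Cd": [4, 3, 2, 1],
    "Mn": [1, 3, 2, 3],
}
edges = correlation_edges(table)
nodes, kept = filter_network(edges, 0.9, 1.0)
print(len(edges), nodes)
```

With four molecules there are six pairwise correlations; raising the lower bound to 0.9 drops the hypothetical Mn node because none of its correlations reach that strength.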
Molecular profiling 'omics experiments include a large number of molecules, only some of which are of interest to biologists. For this reason, SysNet enables the user to add or remove a molecule by changing the status of the check box in the left panel. If a molecule is unchecked in the left panel, the node in the right panel representing that molecule and all correlation edges connected to that molecule will disappear and the entire correlation network will be re-arranged. If an unchecked molecule in the left panel is checked, that molecule will be randomly inserted into the corresponding graphic display and the entire correlation network updated.
Three models are available for display of the correlation map in the main window: multiple circles, single circle, and heat map. The multiple-circle display enables effective usage of screen space (Fig. 2). Each circle of the multiple-circle display shows all molecules belonging to a single experimental group (the Arabidopsis strains col0, ler2, fpt2 and 152–54, in this case). In the single-circle display (Fig. 3a), all molecules with the same EIN recorded in the input database or data files are displayed in one circle, with breaks in the circle representing divisions between the different experimental groups. All molecules from the same experimental group are displayed in the same arc. Each arc and molecular node can be re-arranged to ease visualization.
The disadvantage of the circular display is the overlap of molecular indexes (software-assigned numbers that represent molecules in a graphic display), which may obscure visualization of correlations with these molecules. It is easier to see correlation patterns in a heat map display when dealing with large numbers of molecules. For example, three intense color regions are apparent along the diagonal, indicating elements that are strongly correlated within experimental groups (Fig. 3b; highlighted with dotted circles). It is not easy to recognize this pattern in the circular display (Fig. 2 and Fig. 3a). The disadvantage of the heat map is that all molecules are displayed on one axis, so that it is difficult to see details of correlations for a single molecule if a large number of molecules are included. This problem is overcome in SysNet by creation of a large correlation map using the zooming functions.
Two color schemes are implemented to visualize the correlation strength: normal (Fig. 2) and high contrast (Fig. 3). The normal color scheme focuses only on the absolute value of correlation strength, with white indicating zero and red indicating a correlation strength of 1. The high contrast color scheme differentiates positive correlations (green lines) from negative correlations (red lines).
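One plausible way to realize the two mappings is sketched below. The linear white-to-red interpolation, the RGB encoding, the treatment of a zero coefficient as positive, and the function names are all assumptions made for illustration; the text specifies only the endpoint colors.

```python
def normal_color(r):
    """Map |r| in [0, 1] to an RGB triple from white (0) to red (1).

    Linear interpolation is an assumption; the sign of r is ignored."""
    strength = min(abs(r), 1.0)
    fade = round(255 * (1 - strength))
    return (255, fade, fade)

def high_contrast_color(r):
    """Green for positive correlations, red for negative ones.

    A coefficient of exactly zero is treated as positive here (assumption)."""
    return (0, 255, 0) if r >= 0 else (255, 0, 0)

print(normal_color(0.0), normal_color(-1.0), high_contrast_color(-0.3))
```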
To investigate the details of a specific molecule of interest, SysNet provides two visualization methods. When the user clicks a node (i.e., a molecule of interest) in either the circular or the heat map layout in the main window, a molecular window pops up with a list of details for that molecule in the left panel and information about correlated molecules in the right panel (Fig. 4). The filtering criteria for molecular correlation coefficients in this window are the same as those specified in the main window and are indicated in the upper right of the screen. Multiple sorting functions are provided for the correlated molecules (right panel), including sorting by molecular index, correlation value (Correl) and molecule name. With a double-click on the selected molecule, SysNet brings up a web browser displaying the search results for the corresponding molecule from public databases relevant to the type of molecule. For example, the current version of SysNet displays protein information from the UniProt database [22], metabolite information from KEGG [23], and gene information from GenBank [24]. The user can highlight a molecule displayed in the right panel with a single click. A molecular information window for the highlighted molecule can then be invoked by clicking the 'Show Element' button. Correlations for that molecule are displayed in another window upon clicking the 'Show Correlation' button.
SysNet also allows the user to view details of a correlation by clicking on a correlation edge on the graph in the main window to invoke a correlation window, which displays details of the two correlated molecules and a graph showing the expression levels of the two molecules measured in different samples. Figure 5 shows a correlation between the elements Li and P in the ler2 strain. Information on these two elements is displayed in the two list boxes on the left. There are 12 ionomic samples from the Arabidopsis strain ler2. Each dot in the middle graphic display represents the expression levels of Li (x-axis) and P (y-axis) in one sample. An apparent negative correlation of these elements in this strain is indicated in the graphic; as P levels increase, Li levels decrease and vice versa. The table of critical values for a selected statistical test is automatically displayed on the right side of the screen to enable the user to evaluate the significance of the current correlation. The molecular window and the correlation window may also call each other with the 'Show Correlated' button, enabling the user to toggle between these information resources.
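To show how a critical-value table is used with a correlation coefficient, the sketch below applies the standard t-test for a Pearson correlation with n - 2 degrees of freedom. This is an assumption about which test might be selected; the text does not say which tables SysNet displays. The coefficient of -0.7 is a made-up value, not one reported for Li and P, while 2.228 is the tabulated two-tailed critical t for alpha = 0.05 and 10 degrees of freedom (12 samples).

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def critical_r(t_crit, n):
    """Smallest |r| that reaches a given critical t for n samples."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

T_CRIT_DF10 = 2.228   # two-tailed, alpha = 0.05, df = 10 (standard t table)
n_samples = 12        # number of ler2 samples mentioned in the text
r_observed = -0.7     # hypothetical coefficient, for illustration only

threshold = critical_r(T_CRIT_DF10, n_samples)
significant = abs(r_observed) > threshold
print(round(threshold, 3), significant)
```

Comparing |r| with the tabulated critical r is equivalent to comparing |t| with the tabulated critical t, so either table supports the same decision.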
Comparative analysis of omics expression data – SysNet also enables researchers to interrogate comparative molecular expression studies. This may include any study that monitors molecular behavior under different conditions: platform comparisons, treatments, drug effects, time lapse, etc. Multiple samples are typically analyzed in parallel for 'omics studies, as is the case with our ionomics study. This experimental design enables scientists to understand both the technical and inter-sample variation. For SysNet comparative analyses, all expression data to be compared are concatenated into a single expression data table, where EIN is used to differentiate data for comparison.
SysNet aligns molecules based on molecular name and experimental group. The aligned molecules are displayed in multiple concentric circles, where each circle includes all molecules measured in the same comparative experiment, i.e., having the same EIN. Each circle in the graphic is separated into multiple segments representing the different experimental groups (Fig. 6). The experimental information panel is displayed as a tree structure on the left side of the screen. Each root of this structure is an EIN, which comprises information from multiple experimental groups such as col0, ler2, fpt2 and 152–54. Each experimental group contains information on each molecule analyzed in the experiment, e.g., molecular index and molecular name. The molecular information is static, but the user can change the check status of an EIN or experimental group to decide whether the related molecular information should be displayed in the graphic panel.
Red coloring in Fig. 6 indicates molecules detected in every experimental group in the comparative experiments, while black indicates a molecule that was missing from at least one experimental group. If a molecule is not detected in any experimental group, or the molecule is deselected in the experimental information panel, that node does not appear in the graphic display. An index number for each molecule detected in a comparative experiment is displayed in the outermost circle. The designated index number may be employed to find molecular and experimental information in the experimental information panel.
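The coloring rule can be expressed with set operations, as in the following sketch. It assumes that "black" marks a molecule found in some but not all groups; the function name and the detection lists are hypothetical.

```python
def node_colors(detected):
    """detected maps experimental group -> set of molecule names found there.

    Returns a color per molecule: red if present in every group, black if
    missing from at least one. Molecules found nowhere get no entry."""
    groups = list(detected.values())
    in_every_group = set.intersection(*groups)
    in_any_group = set.union(*groups)
    return {name: ("red" if name in in_every_group else "black")
            for name in sorted(in_any_group)}

# Hypothetical detection lists for three of the experimental groups.
detected = {
    "col0": {"Fe", "Zn", "Cd"},
    "ler2": {"Fe", "Zn"},
    "fpt2": {"Fe", "Cd"},
}
colors = node_colors(detected)
print(colors)
```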
Displaying all molecules in multiple concentric circles enables experimental information for each molecule to be easily categorized by location on the circle. This design also enables the user to perform interactive visual data analysis by simply clicking on the node representing each molecule of interest. However, the concentric circle display becomes congested with large numbers of molecules. To address this problem, the SysNet zoom function may be employed to display the concentric circles in a larger graph. The zoom function is invoked by a single right mouse click.
The user can focus on the behavior of a single molecule in multiple experiments. When the user clicks a node on the graph of the comparative window (Fig. 6), a multiple-panel 'Molecular Evolution' window appears that displays the expression information for that molecule in each experiment (Fig. 7a). The upper left 'dataset panel' displays EIN, experimental groups, and individual experimental samples. The upper graphic shows the behavior of the molecule of interest in multiple samples, including the response range and the average and median expression levels in each comparative experiment displayed. The user may add or remove molecules using check boxes in the dataset panel. If an EIN is unchecked, expression level information for all samples in that comparative experiment is set to zero in the graphic display. This information is also reflected in the lower left 'sample list' panel, which displays all samples being analyzed in an experimental group; the group is highlighted in the dataset panel by a single click on it. The molecular expression level in each sample is displayed in the lower graphic. In this graphic, the user may remove the molecular response detected in a sample by unchecking the specific sample box in the sample list panel. This information is automatically updated in the two graphics on this screen.
Molecular expression levels detected by analytical instruments may be affected by many factors during data acquisition and analysis. We used Sprent's equation [21] to find statistical outliers in sample replicate experiments:
$$ \frac{|X_i - M|}{MAD} > Max \tag{4} $$
where $X_i$ is the molecular expression value being evaluated as a potential outlier, and M is the median of the molecular expression data in all samples. MAD is the median absolute deviation, i.e., the median of the absolute deviations $|X_i - M|$, and Max is the threshold value that must be exceeded to conclude that the value $X_i$ is an outlier. The value Max is set as 50, which is extremely likely to identify molecular expression data that deviates from the mean by more than three standard deviations.
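The outlier rule of Equation (4), which flags a value when its absolute deviation from the median, divided by the median absolute deviation, exceeds Max, can be sketched in a few lines of Python. The function name and the replicate values are hypothetical, the threshold is left as a parameter, and returning no outliers when MAD is zero is a choice made here rather than something stated in the text.

```python
from statistics import median

def sprent_outliers(values, max_ratio):
    """Return indexes i where |X_i - M| / MAD > max_ratio.

    M is the median of the values and MAD the median of |X_i - M|."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    if mad == 0:
        # More than half the values are identical; the ratio is undefined.
        return []
    return [i for i, v in enumerate(values) if abs(v - m) / mad > max_ratio]

# Hypothetical replicate measurements of one element; the last one is extreme.
replicates = [10, 11, 9, 10, 12, 10, 11, 200]
flagged = sprent_outliers(replicates, max_ratio=50)
print(flagged)
```

Because both M and MAD are medians, a single extreme replicate barely shifts them, which is why the rule remains usable on small replicate sets.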
Molecular expression data points identified as outliers are highlighted in red in the lower graphic of the Molecular Evolution screen (Fig. 7a). The user can remove outliers by unchecking the corresponding sample names in the sample list panel. Figure 7b displays molecular behavior after the samples containing outlier molecular expression data (S1 and S8) have been removed. Manually removing samples containing outlier molecular expression data can be an inefficient method of data selection when dealing with a large number of molecules. Therefore, SysNet automatically removes all samples containing outlier molecular expression data and unchecks the check box of each such sample in the left panel. The user can re-visit these outliers by checking the corresponding sample box. The graph in the upper central portion of Figure 7a displays the molecular concentration evolution for a time course study. In our example, this graph displays the dependence of the concentration of the element Cd on the concentration of Fe in the growth medium.
SysNet also provides quantitative modeling to evaluate the profile of molecular responses. We have implemented algorithms to model chemical kinetics for first-order, second-order and third-order chemical reactions, evaluated on a molecule-by-molecule basis. Chemical kinetics describes how the rate of a reaction varies with the concentrations of the various reactants in the system. The rate of reaction is proportional to the rates of change in the concentrations of the reactants and products; that is, the rate is proportional to the time derivative of a concentration. This approach can be used to model simple biological processes. More sophisticated models will be implemented in the future.
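For orientation, the sketch below evaluates the integrated rate laws for the simplest case of each order, a single reactant A that obeys d[A]/dt = -k[A]^n with n = 1, 2 or 3. The single-reactant assumption, the function name and the numerical values are illustrative only; the text does not describe how SysNet parameterizes or fits its kinetic models.

```python
import math

def concentration(a0, k, t, order):
    """Concentration of A at time t for d[A]/dt = -k * [A]**order.

    order 1: [A] = a0 * exp(-k t)
    order 2: 1/[A] = 1/a0 + k t
    order 3: 1/[A]^2 = 1/a0^2 + 2 k t
    """
    if order == 1:
        return a0 * math.exp(-k * t)
    if order == 2:
        return 1.0 / (1.0 / a0 + k * t)
    if order == 3:
        return 1.0 / math.sqrt(1.0 / a0 ** 2 + 2.0 * k * t)
    raise ValueError("order must be 1, 2 or 3")

# Hypothetical initial concentration and rate constant (arbitrary units).
for order in (1, 2, 3):
    print(order, concentration(a0=1.0, k=1.0, t=1.0, order=order))
```

Plotting ln[A], 1/[A] or 1/[A]^2 against time gives a straight line for first-, second- or third-order decay respectively, which is a common way to judge which order describes a measured profile.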
The implemented visual analysis approaches are non-quantitative and are used in cases where the molecular concentration profile cannot be modeled based on accurate and absolute quantification. In our study, we investigate the metal ion concentration change in growth medium with different Fe concentrations. Many biological processes are involved in establishing the final concentration of each metal ion, and in many cases quantification of molecular expression levels for each of these biological processes is not available. The visual analysis approach, however, enables us to identify the trends of metal element absorption as the Fe concentration in the growth medium increases. SysNet implements three functions for visual analysis: no fitting, robust fitting and chi-square fitting (Fig. 7) [25]. Both the robust and the chi-square methods fit the molecular response to a straight line. Analysis of all elements in each group indicated that the concentration of Fe in the growth medium differentially affects elemental profiles in the col0, fpt2, 152–54 and ler2 experimental groups. For example, with increasing Fe concentration in the growth medium, the concentrations of Cd, Co and As in mutant 152–54 decrease. This suggests that the absorption pathways of Cd, Co and As ions are related to the Fe concentration of the growth medium in the 152–54 mutant. The concentrations of the other elements did not show a significant dependency on the concentration of Fe in the growth medium. It is interesting that the concentration of Fe in the plant does not vary significantly as the concentration of Fe in the growth medium increases. This indicates that the process of absorption of elemental ions is selective. Details of these experimental analyses related to the mechanisms of elemental ion absorption will be reported separately. We have also employed SysNet to study protein and metabolite correlation networks in proteomics and metabolomics data sets.
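The difference between a chi-square and a robust straight-line fit can be illustrated as follows. With equal measurement uncertainties, a chi-square fit reduces to ordinary least squares, which is what the first function computes. The robust estimator shown is Theil-Sen (the median of pairwise slopes), swapped in here because it is simple to state; the text does not say which robust method SysNet uses, and the data points are hypothetical.

```python
from statistics import mean, median

def least_squares_line(x, y):
    """Chi-square fit with equal uncertainties on y (ordinary least squares)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_line(x, y):
    """Robust fit: median of pairwise slopes, then median intercept."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    slope = median(slopes)
    intercept = median([b - slope * a for a, b in zip(x, y)])
    return slope, intercept

# Hypothetical data: an element level falling with Fe level, plus one
# aberrant measurement in the last sample.
fe_level = [0, 1, 2, 3, 4]
element_level = [10, 8, 6, 4, 30]

print(least_squares_line(fe_level, element_level))
print(theil_sen_line(fe_level, element_level))
```

In this example the single aberrant point reverses the sign of the least-squares slope, while the robust fit still reports the downward trend, which is the practical reason for offering both options.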
The current version of SysNet was developed in Microsoft Visual Studio .NET using Visual C++. Most data file types and database sources can be employed as its input. The system is therefore open for analyses by the vast majority of users. To further expand the application of SysNet, we plan to develop a Unix version of SysNet using Java.