Flow chart for scoring the probability that each gene perturbation causes disease phenotypes. (A) Decision tree-based phenotype classifiers produce raw probability scores from expanded and filtered phenotype descriptions (Phenotype Ontology terms). The scores are averaged over 100 trees trained with different random negative gene sets, penalized by phenotype commonality, and finally summed over all the phenotypes for a disease. See text and Additional File 1 for details. (B) An example decision tree for the phenotype "MP00001556" learned from the MGI data. Starting from the root node at the top, a gene that is annotated with the phenotype shown in a node (ellipse) follows the "Y" branch; otherwise it follows the "N" branch. Each leaf node (rectangle) shows the number of GSP(+) and GSN(-) genes in the training set that reach that leaf through the splits in its parent nodes. (C) ROC curves from 10-fold cross-validation of the decision tree-based phenotype classifiers on 10 randomly selected phenotypes (blue dot-dashed lines), HT phenotypes (green dotted lines) and T2D phenotypes (red dashed lines). The black solid line indicates random expectation. Sensitivity = TP/(TP+FN) is plotted on the y-axis and the false positive rate (1-specificity = FP/(FP+TN)) on the x-axis, where TPs (true positives) are positive predictions that belong to the gold standard positives (GSPs) and FNs (false negatives) are negative predictions that belong to the GSPs; correspondingly, FPs (false positives) and TNs (true negatives) are positive and negative predictions, respectively, that belong to the gold standard negatives (GSNs).
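As an illustration of the axis definitions in panel (C), the following minimal sketch computes one ROC point (false positive rate, sensitivity) at a single score threshold. The function name `roc_point`, the threshold convention (score >= threshold counts as a positive prediction), and the toy data are assumptions made for this example; they are not taken from the study.

```python
def roc_point(scores, labels, threshold):
    """Return (false_positive_rate, sensitivity) at one score threshold.

    scores: predicted probabilities, one per gene.
    labels: True for gold standard positives (GSP), False for negatives (GSN).
    Assumption: a gene is predicted positive when its score >= threshold.
    """
    tp = fn = fp = tn = 0
    for score, is_gsp in zip(scores, labels):
        predicted_positive = score >= threshold
        if is_gsp and predicted_positive:
            tp += 1  # positive prediction that is a GSP
        elif is_gsp:
            fn += 1  # negative prediction that is a GSP
        elif predicted_positive:
            fp += 1  # positive prediction that is a GSN
        else:
            tn += 1  # negative prediction that is a GSN
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return false_positive_rate, sensitivity


# Hypothetical toy data for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
toy_labels = [True, False, True, False, False, False]
print(roc_point(toy_scores, toy_labels, 0.5))
```

Sweeping the threshold from high to low and plotting the resulting points traces a ROC curve of the kind shown in panel (C).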