NR6WT cells expressing human EGF receptor (EGFR) were maintained in modified Eagle's medium-α containing (MEMα) 7.5% fetal bovine serum (FBS) and 1% of each of the following: penicillin/streptomycin, L-Glutamine, non-essential amino acids and sodium pyruvate (all from GIBCO). The medium contained 350 μg/ml of G418 as a selection agent for human EGFR. Cells were quiesced in a medium containing 0.5% dialyzed FBS for 24 hours before addition of EGF. The MDA-MB-231 invasive human breast cancer cell line was maintained in RPMI 1640 medium (GIBCO) containing 10% FBS and 1% penicillin/streptomycin. Migration and immunoblotting assays were conducted by quiescing the cells in a medium containing 0.5% dialyzed FBS for 24 hours prior to experimentation.
Reagents and antibodies
Antibodies used to detect activated status of EGFR (phosphorylated Tyr1173), ERK (phosphorylated Thr202/Tyr204), PKCδ (phosphorylated ser643), and myosin light chain (phosphorylated ser19) were obtained from Cell Signaling technology (Danvers, MA). Activated status of phospholipase Cγ (PLCγ) was probed using a rabbit polyclonal antibody against phosphorylated tyrosine 783 residue obtained from Santa Cruz Biotechnology (Santa Cruz, CA). ML-7 was utilized as MLCKinase specific inhibitor and was purchased from Calbiochem, EMD Biosciences (La Jolla, CA).
Preparation of fibronectin-coated surfaces
Fibronectin coating concentrations of the surfaces were 0.1, 0.3, 1 and 3 μg/ml. Tissue culture plates were incubated with fibronectin at required concentrations diluted in PBS at room temperature for a period of 2 hours. The plates were washed once with PBS and incubated with 1% bovine serum albumin for another 1 hour to block non-specific protein binding during the course of the experiment. The plates were washed three times with PBS and cells plated directly in complete growth medium over these surfaces.
Quantitative immunoblotting for signaling protein data
NR6WT mouse fibroblasts engineered to express human EGFR were utilized for our baseline modeling studies. These cells are derived from the 3T3 lineage, are devoid of an endogenous EGF receptor and serve as an excellent model system to study EGFR mediated cell migratory events. Equal number of NR6 WT cells were plated on fibronectin coated surfaces and allowed to grow in MEMα containing 7.5% fetal bovine serum (FBS) for 24 hours, by which time cells reached about 90% confluence. Subsequently, cells were quiesced in media containing 0.5% dialyzed FBS for another 24 hours, to minimize the effect of exogenous growth factors present in the serum. Cells were either lysed in the quiescent medium without any exogenous human EGF or stimulated with 10 nM (saturating concentration) of human EGF for either five minutes, one hour or 16 hours. Such time frames were selected to capture the entire spectrum of signaling protein activation during the motility response . After stimulation, cells were washed once with ice cold PBS, and then lysed in lysis buffer containing 50 mM HEPES, pH 7.4, 150 mM NaCl, 1% Triton X-100, 1 mM Na Vanadate and 10% glycerol supplemented with protease inhibitors including 1 μg/ml Leupeptin, 1 μg/ml Aprotinin and 1 mM phenylmethylsulfonylfluoride (PMSF). Cell lysates were quantified using Biorad protein assay. Equal amount of total proteins were mixed with the loading buffer containing 4% SDS (w/v), 0.1 M Tris-HCl, pH 6.8, 20% glycerol, 0.2% Bromophenol blue and 5% β-mercaptoethanol, boiled for 5 minutes and then loaded on either 7.5% (for analysis of pPKCδ, pERK, pEGFR, pPLCγ) or 15% (for pMLC) SDS polyacrylamide gels. Cell lysates were resolved by electrophoresis and subsequently transferred onto nitrocellulose membranes, after which, membranes were immunoblotted with specific antibodies to detect the specific proteins or their activated phospho-protein forms. Immunoblots were quantified with the NIH image analysis densitometry software. The software generates an area plot for each protein band, the density of which represents the amount of the protein in each lane. In the signaling protein experiments, the quantitative values generated represented the activated status of a protein since the proteins detected were in their activated or phosphorylated state. At least 5 replicates were analyzed for each protein at each timepoint; all immunoblots performed were analyzed to capture the full extent of the noise inherent in such measurements .
Prior to polynomial modeling and decision tree analysis, the data were thoroughly preprocessed by normalization and quality-control approaches described in . First, densities in each band were divided by the value of the first lane (Fn = 0.1 and EGF = 0) for each immunoblot. After this between-band normalization, the numbers within an immunoblot become comparable to other immunoblots since the experimental conditions in each of the experiment were kept constant. For quality control purposes, the bands were also within-band normalized: all protein conditions in a band without exogenous EGF were divided by the value with EGF = 0 and Fn = 0.1, while all protein conditions in a band with exogenous EGF were divided by the value with EGF = 1 and Fn = 0.1. The within-band normalization ensures that proteins under the same EGF condition within a band are comparable. Prior to normalization all basal values below 250 were converted to 250 in order to prevent division by a small value that is likely due to noise. After normalization, all the values were log2-transformed.
Normalization was followed by the ANOVA based quality control approach and statistical outliers were discarded . Each variable (signaling protein) had at least five replicate values (except PKCδ for 16 h that had four replicates) after quality control for polynomial modeling.
Development of computational model
Our goal was to create a predictive model that is able to predict migration speed as a function of signaling proteins, and provide insight on what signaling proteins could key elements governing migration. Accordingly, we chose the decision tree methodology since decision trees both show the predictive structure of the signaling proteins and are fairly accurate classifiers . As there are eight observations across EGF and fibronectin concentrations per variable, a classifier based on these data only would be weak. Thus, we first used polynomial modeling to find parametric models for the variables to capture protein activity as a function of fibronectin. These models were then used to simulate data in an interpolative manner across fibronectin concentrations and used in the classification.
Polynomial interpolation of signaling protein data set
Prediction algorithms in general require large training and validation data sets to ensure that the resulting predictor is reliable and the results reproducible. Therefore, we developed mathematical models that capture signaling protein activity and migration speed profiles as a function of fibronectin concentrations. Variables (signaling proteins and migration speed) were modeled using the polynomial function family. Polynomial functions family was chosen because it allows for modeling of a large spectrum of different trends. To choose degree for a polynomial model, we applied normalized maximum-likelihood (NML) approach, which is an implementation of the minimum description length (MDL) principle and aims at describing the data best without overfitting . Technical details of the NML approach in estimating polynomial degrees are derived and discussed in .
The polynomial models were constructed separately for the values with or without exogenous EGF. As the first value (no exogenous EGF and fibronectin concentration of 0.1 μg/ml) in each immunoblot was used in normalization, the polynomial modeling for data without exogenous EGF was done with three data points, whereas data with exogenous EGF was modeled with four values. Accordingly, the maximum polynomial degree in the NML modeling step was set to two. The resulting polynomial estimates (β) and squared pooled standard errors (spooled) used in the simulations are given in Additional File 1.
We used the resulting polynomial models to create 10000 simulated training sets (58002 cases in each data set) and 1000 validation data sets (5802 cases in each data set). Data for each signaling protein and migration speed were then discretized using the Lloyds algorithm , which minimizes the average quantization noise power and is essentially the same as the k-means clustering method. Thus the only parameter needed in the Lloyds discretization method is the number of discrete categories. In this study the number of discrete categories was chosen to be the number of the polynomial estimates for 5 min data set. For example, EGFR for 5 min has three parameters, so EGFR is discretized to low (0), medium (1) and high (2) phosphorylation levels. We have illustrated the discrete regions for cell migration speed and MLC in Figure 2.
Decision tree construction
Decision tree predictors aim to uncover the predictive structure of a classification or prediction problem while still maintaining good prediction accuracy. Here, we used the classification and regression trees (CART) approach . A more detailed description of the use of the CART in modeling migration speed using signaling proteins is given in .
The CART results in a decision tree where interior nodes represent signaling proteins and leaves migration speed classes. Each interior node is actually a question that splits the data into two subsets. For example, the first question in the 1 h decision tree (see Figure 3) is whether activity of EGFR is low (0). Accordingly, all cases where EGFR is low go to left (29005 cases), while the rest (28995 cases) go right. The rule "EGFR is low" results in 20790 cases having slow migration speed of 22883 cases belonging to slow migration category (91%). Further, as in the data set split to the right there are only 8211 cases belonging to medium speed and 4 to fast speed classes, the data are not split further and the rule "EGFR is low" predicts slow migration speed. If EGFR is medium or high, however, the set of 22883 cases is split further until sufficiently good prediction accuracy is achieved. The parameters for the decision tree learning were as follows. Purity function was the Gini-index, variables having more than five cases were considered for a split and prior probability for i th class was obtained by dividing the number of the cases of i th class by the total amount of observations. The cost of a misclassification from high to low speed was 2, medium to high or low was 1, and the cost for correct classification was 0. After constructing a decision tree, we applied the cost-complexity pruning method  to avoid over-fitting. All computations were performed using MATLAB v6.5 with Statistics toolbox.
We simulated 10000 training data sets and used them to learn decision tree predictors. These 10000 decision tree predictors were then applied to 1000 independent validation data sets and the predictor giving the best classification accuracy was chosen. For 5 min, 1 h and 16 h data sets, the best decision tree predictors achieved 70%, 75% and 57% accuracy, respectively.
In vitro migration assay
Cell migration was measured as the distance traveled by the cells into a cellular area. Cells were seeded in 6-well tissue culture plates for a period of 24 hours in growth medium. Cells were quiesced for another 24 hours in serum free medium at which time cells formed a confluent monolayer. A denuded area was created by scraping with a pipet tip, washed three times with phosphate buffered saline (PBS) to remove dead cells, and kept under serum free conditions throughout the experiment. EGF at 10 nM (and inhibitors or diluent as indicated) was added to the serum free medium. Cells were then photographed using an inverted microscope immediately following scraping (0-hour condition) and 24 hours later (24-hour condition) in exactly same three different areas. The photographs were merged and analyzed using Adobe Photoshop program to determine the average distance traveled by the cells in 24 hours. All experiments were performed in triplicate.
Single cell tracking for cell speed analysis
For final validation of cell migration, individual cell speeds were measured using time-lapse videomicroscopy. 6,000 cells were plated on each fibronectin-coated DeltaT imaging dish (Bioptechs) in 2 ml of assay medium containing 0.5% dialyzed FBS and 1% BSA. 16 hours post-seeding, the medium was replaced with 3.2 ml of fresh assay medium. In migration versus fibronectin validation studies, the replacement medium contained 10 nM EGF. In MLC inhibition studies, the replacement medium contained 0, 2, 4, or 10 μM ML-7 (MLCK inhibitor), and 10 nM EGF was added 45 minutes after ML-7 exposure. The plates were then sealed with a vacuum grease-lined coverglass lid and placed in a heated stage insert for a Ludl 99S008 motorized stage on a Zeiss Axiovert 35 microscope. Three fields of cells, with five to ten cells per field, were tracked by recording an image for each field every 15 minutes for up to 20 hours. Individual cell speeds were calculated using Visible (Reify Corporation, Cambridge, MA), which determines speeds by generating instantaneous velocity vectors for each pixel of the image that is associated with a cell. We found that cell speeds reach a steady-state 4–6 hours after adding EGF as previously reported , and as such the reported speed ± SEM for each condition is an average of 15–20 cells' speeds at each time point between 6 to 8 hours.