Modeling the protein folding process is crucial in understanding not only how proteins fold and function, but also how they misfold triggering many devastating diseases (e.g., Mad Cow and Alzheimer’s [1]).
Knowledge of the stability, folding, kinetics, and detailed mechanics of the folding process may help provide insight into how and why the protein misfolds. Since the process is difficult to experimentally observe, computational methods are critical.
Traditional computational approaches for generating folding trajectories such as molecular dynamics [2], Monte Carlo methods [3], and simulated annealing [4] provide a single, detailed, high-quality folding pathway at a large computational expense. As such, they cannot be practically used to study global properties of the folding landscape or to produce multiple folding pathways. The use of massive computational resources, such as tens of thousands of PCs in the Folding@Home project [5, 6] have helped improve the time overhead involved but still are unable to handle very large proteins. Statistical mechanical models have been applied to compute statistics related to the folding landscape [7, 8]. While computationally more efficient, they do not produce individual pathway trajectories and are limited to studying global averages of the folding landscape.
Robotics-based motion planning techniques, including the Probabilistic Roadmap Method (PRM), have been successfully applied to protein folding [9–11]. They construct a roadmap, or model, of the folding landscape by sampling conformations and connecting neighboring ones together with feasible transitions using a simple local planner. They can generate multiple folding pathways efficiently (e.g., a few hours on a desktop PC) enabling the study of both individual folding trajectories and global landscape properties.
While promising, making good choices for each of the algorithmic steps remains difficult. Machine learning approaches have been used to dynamically decide which approach to take for generating samples and connecting them together. These approaches generally learn globally and can perform well in homogeneous spaces or partitioned spaces where each partition is homogeneous [12]. Preliminary work applied connection learning to protein folding simulations [13], but with no way to ensure a good partitioning of the landscape, the results were only comparable to methods with no learning involved.
We present Local Adaptive Neighbor Connection (ANC-local) that localizes learning to within the vicinity of the current conformation being connected. When choosing a connection method (i.e., the neighbor selection method and local planner combination), we first dynamically determine a neighborhood around the conformation under consideration. Then, the performance history within this neighborhood is used to bias learning. Our method adapts both over time and to local regions without any prior knowledge about the methods involved. This approach has been successfully used in robotics [14], and here we adapt it to protein folding.
We compare ANC-local’s performance to three distance-based connection methods and to global learning over 23 proteins of varying secondary structure makeup with 52–114 residues. We examine both the time to build roadmaps and the resulting trajectory quality. We further look at the local planner success rate to understand performance changes between methods. Our results confirm that learning is necessary, as no individual method is the best choice for all proteins. We also show that ANC-local generates better quality trajectories in comparable time than the best connection method for each individual input and outperforms global learning.
We next describe some preliminaries and related work in further detail, including experimental protein dynamics, the protein model used, PRMs for protein folding, and several key components such as candidate neighbor selection methods and distance metrics. We also discuss existing machine learning techniques for PRMs and for protein motion and analysis.
Experimental protein dynamics
There have been several advances in experimental techniques to study protein dynamics and motion including circular dichroism, fluorescence experiments, hydrogen exchange and pulse labeling, NMR spectroscopy, and time-resolved X-ray crystallography. We briefly discuss each in turn.
Circular dichroism (CD) is a spectroscopic technique used to investigate the structure and conformational changes of proteins [15]. By informing on binding and folding properties, CD provides information about the protein’s biological functions. The CD signal occurs when chromophores in an asymmetrical environment interact with polarized light. In the case of proteins, the main chromophores are the peptide bonds as they absorb polarized light in the far-UV wavelength region (i.e., below 240 nm).
Fluorescence spectroscopy analyzes the emission of fluorophores in the protein as the protein undergoes conformational change [16], such as during folding or upon binding. These fluorophores act as indicators of the state of the local environment, e.g., how structured the portion of the protein is near the fluorophore. As almost all proteins have natural fluorophores (i.e., tyrosine and tryptophan residues), fluorescence spectroscopy has broad applicability.
Hydrogen exchange mass spectrometry and pulse labeling can investigate protein folding by identifying which parts of the structure are most exposed or most protected [17]. From this data, one can infer which portions of the protein fold first and which are last to form, up to the millisecond timescale.
NMR spectroscopy, another experimental tool often used to study protein dynamics, is a technique used to determine a compound’s unique structure. It identifies the carbon-hydrogen framework of an organic compound and has been used to study side-chain motion and backbone motion [18]. See [19] for a recent review of current techniques.
X-ray crystallography obtains a three dimensional molecular structure from a crystal [20]. A purified sample at high concentration is crystallized and the resulting crystals are exposed to an x-ray beam. This produces a pattern of diffraction spots. The intensities of these spots can be used to determine the structure factors from which an electron density map can be calculated.
While experimental methods can probe some fine-grained details of protein motion, they are time intensive and limit the time scales they can access. In addition, experimental methods may not be able to be applied to all proteins, e.g., some proteins naturally precipitate out and cannot be analyzed. Simulations, instead, affords the opportunity to study such proteins and others much faster (hours vs. days) with computational resources which will potentially save both time and money.
Protein model
Proteins are sequences of amino acids, or residues. We model the protein as a linkage where only the ϕ and ψ torsional angles are flexible, a standard modeling assumption [21]. A potential energy function models the many interactions that affect the protein’s behavior [2]. This function helps quantify how energetically feasible a given conformation is.
In this work, we employ a coarse-grained potential function [9] which help define some characteristics of our modeling and they state that- If the atoms are too close to each other (less than 2.4Å in sampling and 1.0Å in connecting), the conformation is unfeasible; otherwise, the energy is calculated by:
$$ U_{tot} = \sum\limits_{constraints}K_{d}\{[(d_{i} - d_{0})^{2} + {d_{c}^{2}}]^{1/2} - d_{c}\} + E_{hp} $$
(1)
where K
d
is 100 kJ/mol, d
i
is the length on the ith constraint, E
hp
is the hydrophobic interaction, and d
0=d
c
=2Å as in [2]. The coarse grain model has been shown to produce qualitatively similar results as all-atoms models faster [22].
PRM for protein folding
The Probabilistic Roadmap Method (PRM) [23] is a robotics motion planning algorithm that first randomly samples robot (or protein) conformations, retains valid ones, and then connects neighboring samples together with feasible motions (or transitions). To apply PRMs to proteins, the robot is replaced with a protein model and collision detection computations are replaced with potential energy calculations [9–11, 24].
Sampling
Protein conformations, or samples, are randomly generated with bias around the native state, the functional and most energetically stable state. Samples are iteratively perturbed, starting from the native state, and retained if energetically feasible by the following probability:
$$ P(q) =\left\{ \begin{array}{ll} 1 & \text{if}~E(q) < E_{min} \\ \frac{E_{max} - E(q)}{E_{max} - E_{min}} & \text{if}~ E_{min} < E(q) \leq E_{max} \\ 0 & \text{if}~E(q) > E_{max} \end{array} \right. $$
(2)
where E
min
is the energy of the open chain and E
max
is 2 E
min
. We use rigidity analysis to focus perturbations on flexible portions as detailed in [25].
Connection
Once a set of samples is created, they must be connected together with feasible transitions to form a roadmap, or model of the folding landscape. Connecting all possible pairs of samples is computationally unfeasible, and it has been shown that only connecting the k-closest neighbors results in a roadmap of comparable quality [26].
Given a pair of samples, we compute a transition between them by a straight-line interpolation of all the ϕ and ψ torsional angles. Straight-line local planning involves the fewest number of intermediates to check for validity and has been shown to be a sufficient measure of transition probability; i.e., it can accurately predict secondary structure formation order [9, 22].
We assign an edge weight to reflect the energetic feasibility of the transition as \(\sum _{i=0}^{n-1}-log(P_{i})\) where P
i
is the probability to transit from intermediate conformation c
i
to c
i+1 based on their energy difference Δ
E
i
=E(c
i+1)−E(c
i
):
$$ P_{i} =\left\{ \begin{array}{ll} e^{\frac{-\Delta E_{i}}{kT}} & \text{if}~\Delta E_{i} > 0 \\ 1 & \text{if}~\Delta E_{i} \leq 0 \end{array} \right. $$
(3)
where k is the Boltzmann constant and T is the temperature. This allows the most energetically feasible paths to be extracted by standard shortest path algorithms.
Validation by secondary structure formation order
Proteins are composed of secondary structure elements (i.e., α-helices and β-strands). Experimental methods, such as hydrogen exchange mass spectrometry and pulse labeling, can investigate protein folding by identifying which parts of the structure are most exposed or most protected [27]. From this data, one can infer the secondary structure formation order.
In [9, 21, 22], we compared the secondary structure formation order of folding pathways extracted from our maps to experimental results [28] by clustering paths together if they have the same formation ordering. We return a stable roadmap when the distribution of secondary structure formation orderings along the folding pathways in the graph stabilizes, i.e., the percentage of pathways following a given ordering does not vary between successive graphs by more than 30 %. As our roadmaps contain multiple pathways, we estimate the probability of a particular secondary structure formation order occurring by the percentage of roadmap pathways that contain that particular formation order. The roadmap corroborates experimental data when the dominant formation order (i.e., the one with the greatest percentage) is in agreement.
Candidate neighbor selection methods
Recall that only neighboring (or nearby) samples are attempted for connection because it is unfeasible to attempt all possible connections. Typically, conformations that are more similar are more energetically feasible to connect.
There have been a number of methods proposed for locating candidate neighbors for connection. The most common is the k-closest method which returns the k closest neighbors to a sample using a distance metric. This can be implemented in a brute force manner taking O(k logn)-time per node, totaling O(n
k logn)-time for connection. A similar approach is the r-closest method which returns all neighbors within a radius r of the node as determined by some distance metric.
Other methods use data structures to more efficiently compute nearest neighbors. Metric Trees [29] organize the nodes in a spatial hierarchical manner by iteratively dividing the set into two equal subsets resulting in a tree with O(logn) depth. However, as the dataset dimensionality increases, their performance decreases [30]. KD-trees [31] extend the intuitive binary tree into a D-dimensional data structure which provides a good model for problems with high dimensionality. However, a separate data structure needs to be stored and updated.
Approximate neighbor finding methods address the running time issue by instead returning a set of approximate k-closest neighbors. These include spill trees [30], MPNN [32], and Distance-based Projection onto Euclidean Space [33]. These methods usually provide a bound on the approximation error.
In this paper, we work with proteins with a higher dimensionality (104 to 228 degrees of freedom) than approximate methods can handle. Note, however, that there is nothing inherent in our approach that precludes the use of approximate methods.
Distance metrics
The distance metric plays an important role in determining the best connections to attempt. It is a function δ that computes some “distance” between two conformations a=〈a
1,a
2,…,a
d
〉 and b=〈b
1,b
2,…,b
d
〉, i.e., \(\delta (a, b) \rightarrow \mathbb {R}\), where d is the dimension of a conformations. Here, a
1… and b
1… are the ϕ and ψ torsional angles for each protein conformation. A good distance metric generally predicts how likely it is that a pair of nodes can be successfully connected. Their success is dependent on the nature of the problem studied. We use the following set of distance metrics commonly used for motion planning:
Euclidean distance metric
The Euclidean distance metric captures the amount of physical movement (around the torsional angles) that conformation a would undertake to move to a conformation b. This distance is computed by measuring the difference in the ϕ and ψ angle pairs of the two conformations:
$$ {{} {\begin{aligned} \delta_{\text{Eucl}}(a,b) = \sqrt{\frac{\left({{\phi_{1}^{a}} - {\phi_{1}^{b}}}\right)^{2} + \left({{\psi_{1}^{a}} - {\psi_{1}^{b}}}\right)^{2} +... + \left({{\phi_{n}^{a}} - {\phi_{n}^{b}}}\right)^{2} + \left({{\psi_{n}^{a}} - {\psi_{n}^{b}}}\right)^{2}}{2n}}. \end{aligned}}} $$
(4)
Cluster rigidity distance metric
Rigidity analysis [34] computes which parts of a structure are rigid and flexible based on the constraints present. It may be used to define a rigidity map r, which marks residue pairs i,j if they are in the same rigid cluster.
Rigidity maps provide a convenient way to define a rigidity distance metric, between two conformations a and b where n is the number of residues:
$$ \delta_{\text{Rig}}(a,b) = \sum\limits_{0 \leq i < j \leq 2n} (r_{a}(i,j) \neq r_{b}(i,j)). $$
(5)
More details may be found in [25].
Root mean square distance metric
The protein model has 6 atoms for each amino acid. Thus, a protein with n amino acids will have 6n atoms. Denoting the coordinates of these atoms as x
1 to x
6n
, the root mean square distance (RMSD) between conformations a and b is:
$$ {{} {\begin{aligned} \delta_{\text{RMSD}}(a,b)=\sqrt{\frac{\left({{x_{1}^{a}} - {x_{1}^{b}}}\right)^{2} + \left({{x_{2}^{a}} - {x_{2}^{b}}}\right)^{2} +... + \left({x_{6n}^{a} - x_{6n}^{b}}\right)^{2}}{6n}}. \end{aligned}}} $$
(6)
Least RMSD (lRMSD) is the minimum RMSD over all rigid body superpositions of a and b.
Machine learning for protein analysis and motion
Machine learning algorithms have been employed to predict protein folds, estimate folding rates, and study folding motions. We highlight a few relevant techniques here.
Protein fold recognition
Protein fold recognition involves identifying the correct structural fold from among a set of known template protein structures for a given protein sequence. Fold recognition is essential for template-based protein structure modeling. The fold recognition problem is defined as a binary classification problem of predicting whether or not the unknown fold of the input protein is similar to an already known template from a protein structure library.
RF-Fold uses random forests, a highly scalable classification method, to recognize protein folds [35]. A random forest is composed of many decision trees that are each trained on datasets of target-template protein pairs. RF-Fold recognition rate is comparable to the best performance in fold recognition at the family, superfamily, and fold levels.
DN-Fold is another fold recognition technique, but it uses a deep learning neural network as a basis for learning [36]. A deep learning network has many more layers than a typical neural network. In addition, they may be trained through unsupervised learning. Deep learning was applied to fold prediction by restating the problem as predicting if a given target-template pair belonged to the same fold. They showed that DN-Fold achieved comparable performance over a wide variety of methods at all three fold levels.
Folding rate prediction
In addition to predicting the fold of a protein, it is useful to estimate its folding rate. This is important when studying properties such as stability and classifying kinetics. Characteristics of the protein structure, such as contact order and total contact distance, affect the folding rate. However, the precise relationship between these characteristics and the rate are unknown. A back-propagation neural network was used to quantify this relationship [37]. Their results showed that correlations exist between these properties and the folding rate with relative errors for predicted results lower than competing methods.
Simulating protein motion trajectory
Machine learning has also been applied to studying protein folding trajectories. In [38] they use unsupervised learning to cluster similar states and basins present in the folding landscape. They then use this clustering to construct an exploration bias to speed up molecular dynamics simulations. Specifically, the exploration bias guides the next basin to jump to in the simulation while ensuring that the entire conformation space is explored. They provide simulation results for an alanine trajectory.
Machine learning for PRMs
Many techniques use machine learning to improve PRM performance. In this section we briefly highlight some of these methods.
Learning sampling methods
In Feature Sensitive Motion Planning [39], the planning space is recursively subdivided and machine learning is used to characterize the resulting partitions and select an appropriate PRM variant to use in each. A key strength of this approach is its ability to map workspace/C-Space topologies for planners to work in. However, it does not adapt planner choices over time.
HybridPRM [40] uses reinforcement learning to adaptively select the appropriate sampling method over time. It does so by maintaining a selection probability for each method and updates these probabilities based on the method’s past performance. While learning adapts over time, it is global. It does not perform well when the planning space is heterogeneous, as is the case for most protein folding landscapes.
RESAMPL [12] is similar in spirit to Feature Sensitive Motion Planning, but it dynamically generates local regions to plan in. Instead of using supervised learning, it uses local region information (e.g., entropy of neighboring samples) to make decisions about how and where to sample, and which samples to connect together.
While the classification of a region may change over time as it is explored, it’s placement does not. Thus it cannot adequately adapt if the initial region placement or resolution is not sufficient.
Learning connection methods
Prior work [41] adaptively selects the appropriate connection method to use over time. As the roadmap is built, it records the performance of several connection methods and with this history, decides which to employ by maintaining a selection probability for each. The main weakness of Adaptive Neighbor Connection (ANC-global) is that it bases its decisions on the performance of connection methods over the entire planning space. This is problematic in protein landscapes that are naturally heterogeneous. Therefore, to obtain better results, it became necessary to first partition the space into smaller (and hopefully homogeneous) regions. This puts greater burden on the user, particularly as the dimensionality of the problem increases. While ANC-global was applied to proteins, its performance was limited and so a local learning approach is needed.
Learning from trajectories
Some methods have been proposed to learn from previous experience. For example, the Lightning framework [42] executes two components in parallel: a traditional planning from scratch approach and an approach that extracts and repairs paths from a path history library. It uses the result of the fastest component as the final solution and then adds it to the path history library for future use. Note that as the size of the library grows, it becomes impractical to add additional paths to it.
Apprenticeship Learning [43] also uses existing trajectories to plan motion, but instead aims to learn good trade-offs between different cost functions that describe properties of the trajectories. It learns these trade-offs via inverse reinforcement learning The premise is to learn from a small set of demonstration trajectories instead of a large path library.