Fragments
The fragment library generated for this study contains fragment pairs of length 10, 15, 20 and 30, with maximum allowed gap lengths of 2, 3, 4 and 6 respectively. All fragments are based on pairwise comparisons between structural domains as defined by SCOP. The pairs are scored for similarity purely on structural grounds, using the coordinates of the C-alpha atoms, to avoid bias from sequence similarity. For each pair of domains, all possible fragment pairs of the given lengths are first screened and aligned using a method similar to the pre-filter used by MAMMOTH [4]. Each fragment pair with an alignment score above a threshold is then superimposed, giving the C-alpha RMSD for that pair.
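To give a sense of the size of the search space before screening, the following minimal sketch enumerates candidate fragment pairs between two domains. It assumes fragments are contiguous residue windows identified by their start index; the function names are hypothetical, and the gap handling, the MAMMOTH-like pre-filter and the superposition step are not modelled here.

```python
# Fragment lengths and their maximum allowed gap lengths, as given in the text.
MAX_GAP = {10: 2, 15: 3, 20: 4, 30: 6}


def fragment_windows(domain_length, fragment_length):
    """Start indices of every contiguous window of the given length.

    Assumption: a fragment is a contiguous run of residues; domains shorter
    than the fragment length yield no windows.
    """
    return range(max(domain_length - fragment_length + 1, 0))


def candidate_pairs(len_a, len_b, fragment_length):
    """Yield (start_a, start_b) for every window pair between two domains."""
    for start_a in fragment_windows(len_a, fragment_length):
        for start_b in fragment_windows(len_b, fragment_length):
            yield start_a, start_b


def count_candidates(len_a, len_b):
    """Number of candidate fragment pairs per fragment length."""
    return {
        length: sum(1 for _ in candidate_pairs(len_a, len_b, length))
        for length in MAX_GAP
    }
```

For two domains of 40 and 25 residues, `count_candidates(40, 25)` gives 496, 286, 126 and 0 candidate pairs for lengths 10, 15, 20 and 30. The count grows with the product of the domain lengths, which is why a fast screening step before superposition is useful.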
Age estimates
Age estimates for protein folds or superfamilies are generated using fold recognition of structural domains on a set of completed genomes. The occurrence patterns of these predictions are analysed with a parsimony algorithm to estimate an age for each superfamily; for more details see [3]. The age of a superfamily is a score in the range [0.0, 1.0], with 1.0 indicating that the superfamily was estimated to be present at the root of the species tree (oldest), and 0.0 indicating that it was estimated to have arisen at the leaf level (youngest). Here an 'old' fold is defined as a fold with an age of 1.0, and a 'young' fold as a fold with an age < 0.5.
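The two age classes defined above can be written as a small helper. This is a sketch only: the function name and the "intermediate" label for ages from 0.5 up to (but not including) 1.0 are assumptions made for illustration, since the text names only the 'old' and 'young' classes.

```python
def classify_age(age):
    """Map an age score in [0.0, 1.0] to the classes used in the text."""
    if not 0.0 <= age <= 1.0:
        raise ValueError("age must lie in [0.0, 1.0]")
    if age == 1.0:
        return "old"           # present at the root of the species tree
    if age < 0.5:
        return "young"
    return "intermediate"      # hypothetical label, not defined in the text
```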
Linking Folds
Some fragments might be over-represented (e.g. because secondary structure is not considered); the number of shared fragments therefore needs to be normalised for the number of times a fragment occurs. Friedberg and Godzik (2005) used a superfamily-based normalisation to overcome this problem [2]. We use a similar approach, although the fragment pairs in this study are based on structural similarity only, whereas Friedberg and Godzik (2005) used a combination of sequence and structural similarity. A link between two superfamilies (I and J) is established when f(I, J) > 0.1, which is calculated as:
Here Sim(A, B) is the number of shared fragments between two sets of domains (e.g. superfamilies), and A is the set of all domains. In this study we do not consider self-similarity of superfamilies.
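The counting and thresholding steps can be sketched as follows. The data layout is an assumption: shared fragments are represented as a list of (domain, domain) tuples, one per shared fragment pair, and superfamilies as a mapping from a name to a set of domain identifiers. The normalised score f is deliberately left as a caller-supplied function, so the sketch does not commit to any particular normalisation; all function names are hypothetical.

```python
from itertools import combinations


def sim(shared_pairs, set_a, set_b):
    """Count shared fragment pairs with one domain in set_a and the other in set_b."""
    count = 0
    for dom_x, dom_y in shared_pairs:
        if (dom_x in set_a and dom_y in set_b) or (dom_y in set_a and dom_x in set_b):
            count += 1
    return count


def find_links(superfamilies, f, threshold=0.1):
    """Return pairs of distinct superfamilies whose score f(I, J) exceeds the threshold.

    `superfamilies` maps a name to a set of domain identifiers; `f` takes two
    such sets and returns the normalised score (supplied by the caller).
    Only distinct pairs are visited, so self-similarity is never considered.
    """
    links = []
    for name_i, name_j in combinations(sorted(superfamilies), 2):
        if f(superfamilies[name_i], superfamilies[name_j]) > threshold:
            links.append((name_i, name_j))
    return links
```

Because the comparison is strict, a pair scoring exactly 0.1 is not linked.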