The organization of domains in proteins obeys Menzerath-Altmann’s law of language

Shahzad, Khuram; Mittenthal, Jay E.; Caetano-Anollés, Gustavo

doi:10.1186/s12918-015-0192-9

Research article
Open access
Published: 11 August 2015


Khuram Shahzad¹,
Jay E. Mittenthal² &
Gustavo Caetano-Anollés¹,³

BMC Systems Biology volume 9, Article number: 44 (2015)


Abstract

Background

The combination of domains in multidomain proteins enhances their function and structure but lengthens the molecules and increases their cost at cellular level.

Methods

The dependence of domain length on the number of domains a protein holds was surveyed for a set of 60 proteomes representing free-living organisms from all kingdoms of life. Distributions were fitted using non-linear functions and fitted parameters interpreted with a formulation of decreasing returns.

Results

We find that domain length decreases with increasing number of domains in proteins, following the Menzerath-Altmann (MA) law of language. Highly significant negative correlations exist for the set of proteomes examined. Mathematically, the MA law is expressed as a power law relationship that unfolds when molecular persistence P is a function of domain accretion. P holds two terms, one reflecting the matter-energy cost of adding domains and extending their length, the other reflecting how domain length and number impinge on information and biophysics. The pattern of diminishing returns can therefore be explained as a frustrated interplay between the strategies of economy, flexibility and robustness, matching previously observed trade-offs in the domain makeup of proteomes. Proteomes of Archaea, Fungi and to a lesser degree Plants show the largest push towards molecular economy, each at their own economic stratum. Fungi increase domain size in single domain proteins while reinforcing the pattern of diminishing returns. In contrast, Metazoa, and to lesser degrees Protista and Bacteria, relax economy. Metazoa achieves maximum flexibility and robustness by harboring compact molecules and complex domain organization, offering a new functional vocabulary for molecular biology.

Conclusions

The tendency of parts to decrease their size when systems enlarge is universal for language and music, and now for parts of macromolecules, extending the MA law to natural systems.

Background

“Life is a relationship between molecules, not a property of any one molecule” Emile Zuckerkandl and Linus Pauling [1]

Early last century, Paul Menzerath proposed a generality for language constructs [2]. He found that longer syllables contained shorter articulated sounds and later revealed that words with more syllables were phonetically shorter. He summarized his findings with the motto: “the greater the whole, the smaller its constituents” (“Je größer das Ganze, desto kleiner die Teile”) [3]. These qualitative statements were later elaborated mathematically by Gabriel Altmann [4] and supported by statistical analyses of many languages and linguistic and phonetic relationships of many types. One general formulation of the accepted Menzerath-Altmann’s (MA) law that adds the effect of hierarchy in the makeup of parts [4] follows eq. (1)

$$ y(x)=A{x}^b{e}^{-cx} $$

(1)

with y(x) being the length of the parts, x representing the length of the system (or constructs of parts), and A, b and c fitting parameters. x can also represent a discrete variable describing the number of parts that make up the system. A more general formulation adds dependences on additional variables [5]. y(x) is generally measured by counting parts defined at a deeper level of the system’s organization (e.g., amino acids of domains). This general formulation of the law accommodates the effects of multi-level structure that is typical of language. Two special cases of the equation occur when b = 0 or c = 0. The first mathematical formulation describes how the length or size of parts y(x) decreases monotonically with the length or size of systems. However, the second formulation, eq. (2)

$$ y(x)=A{x}^b $$

(2)

is the most commonly used equation of the MA law, since it enables computation of fitting parameters in log-log plots. This equation delimits a curve of a general two-parameter power law form.
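As an illustration of why eq. (2) is convenient, the following minimal sketch fits both forms of the law by linear least squares. Taking logarithms of eq. (1) gives ln y = ln A + b ln x − c x, which is linear in ln A, b and c; setting c = 0 reduces it to a straight line in a log-log plot. The function name `fit_ma_law` and the synthetic values are assumptions made here for illustration and are not part of the original analysis.

```python
import numpy as np

def fit_ma_law(x, y, with_exponential=False):
    """Least-squares fit of y = A * x**b * exp(-c*x) after taking logs.

    With with_exponential=False the c term is dropped, which gives
    eq. (2), a straight line in a log-log plot.
    Returns (A, b, c); c is 0.0 for the two-parameter form.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix for ln y = ln A + b * ln x - c * x
    columns = [np.ones_like(x), np.log(x)]
    if with_exponential:
        columns.append(-x)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    A = float(np.exp(coef[0]))
    b = float(coef[1])
    c = float(coef[2]) if with_exponential else 0.0
    return A, b, c

# Synthetic, noise-free data generated from eq. (2) with assumed values.
k = np.arange(1, 9)
z = 200.0 * k ** -0.25
A, b, c = fit_ma_law(k, z)
print(round(A, 3), round(b, 3), c)
```

On noise-free data generated from eq. (2), the three-parameter fit (`with_exponential=True`) should return a value of c close to zero, which is one simple way to check whether the exponential term is needed.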

Language-like behavior has been extended to music [6] and recently to genomes [7–10], making the MA law a generality of both natural and human-made systems. In biology, Menzerath’s tendency of the mean size of the parts to decrease as the number of parts increases in a system was shown to be expressed at the cellular and biomolecular level as negative correlations between the mean chromosome length and the number of chromosomes or the size of genomes [7, 8] and mean exon size and the number of exons [9]. Very recently, quantitative linguistic distribution models and statistical analyses have also been used to explore the self-organization of coding and non-coding genomic components [11] and amino acid length distributions of proteins [12]. Here we report that the organization of structural domains in proteins obeys the MA law at the proteome level.

Protein molecules are eminently modular [13]. Recurrent substructures appear in different molecular contexts. This is particularly evident when considering the structural domains of proteins. Domains are 3-dimensional (3D) atomic arrangements of elements of secondary structure that fold into well-packed structural units [14, 15] and are evolutionarily conserved [16–18]. They fold and function largely independently and contribute to overall protein stability by establishing a multiplicity of intramolecular interactions [19]. In evolution, domains combine in multidomain proteins by fusion or excise by fission processes, driven mostly by the forces of genome rearrangement [20]. Consequently, the resultant ‘architectures’ afford functional diversity drawn from both domain structure and domain organization [21]. This fact is made evident by wide co-option of ancient enzymatic activities in metabolic networks [22]. The dynamics of the complex evolutionary mechanics of domain combination results in global patterns of domain gain and loss that materialize differently in the proteomes of the three superkingdoms of life, Archaea, Bacteria and Eukarya [23]. Moreover, phylogenomic analyses of protein domain structures in hundreds of proteomes have shown that the bulk of multidomain proteins appeared explosively quite late in evolution [20]. The rise of domain organization possibly impacted constraints imposed on early proteins by folding speed and protein flexibility [24]. Domain combinations also affected the length of domains and proteins [25, 26], with younger domains exhibiting simpler and smaller structures [27].

Multidomain proteins, which globally make a significant minority (26–32 %) of proteins in proteomes (they are highly represented in eukaryotes), have on average substantially smaller domains than single domain proteins [25]. This trend persists despite proteins of bacterial and archaeal microbes evolving reductively relative to those of eukaryotes by significant shortening of non-domain linker sequences that do not affect domain length. Here we explore how the number of domains in proteins impacts the length of domains. Using a selected set of proteomes sampled from the three superkingdoms we dissect significant law-abiding reductive patterns operating at the proteome level. Our results uncover the important role of cellular economy, as it imposes strong evolutionary pressure on domain structure and organization and biases trade-off relationships needed for organismal persistence.

Results and discussion

The longer the protein the smaller its structural domains

We studied the dependence of the average domain length (z_k) of a protein on the number of protein domains it holds (k) for a set of 60 proteomes representing organisms in superkingdoms Archaea and Bacteria and kingdoms Metazoa, Fungi, Plants and Protista of superkingdom Eukarya. Every one of the 60 proteomes examined showed a significant negative correlation between average domain lengths and numbers of domains in proteins, both in logarithmic scale, when using a weighted nonlinear least-squares curve fitting approach (Table 1). To avoid fitting artifacts due to a small minority of proteins harboring a high number of domains, we excluded the terminal outliers while retaining an average of 99.44 % (±0.91 SD) (range 96.8–100 %) of entries. Figure 1 shows an example plot describing tight correlation in the proteomic data of Homo sapiens. The linear regression lines in the log-log plots showed high coefficients of determination (R²) with values ranging 0.85–1.00 and significant F test-derived correlations (F test; F = 11.5–2714; p < 0.0001–0.133; only 3 proteomes had p-values higher than 0.05) (Table 1). Since R² > 0.85 values are assumed to indicate satisfying fits and F-test outliers may result from methodological weaknesses of the regression statistics [27], both statistics support in concert significant goodness of the regression fits over ranges of k. In all cases, domain length decreased monotonically with number of domains in proteins, delimiting an MA law for proteomes. Slopes (b) in the log-log plots ranged −0.113 to −0.404 (Table 1), making explicit the negative correlation typical of the MA power law.
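The fitting procedure described above can be sketched as follows. This is a simplified stand-in, not the authors' code: it uses a weighted linear fit in log-log space rather than a nonlinear solver, it assumes the weights are the protein counts for each k, and it trims terminal outliers with a hypothetical `keep_fraction` threshold. The function names and the input values are illustrative assumptions.

```python
import numpy as np

def weighted_loglog_fit(k, z_mean, n_proteins):
    """Weighted linear fit of log10(z_k) against log10(k).

    Weights are the protein counts per domain number (an assumption).
    np.polyfit multiplies residuals by w before squaring, so sqrt(counts)
    is passed to weight the squared residuals by the counts themselves.
    Returns (slope b, intercept length L1, weighted R^2).
    """
    lx = np.log10(np.asarray(k, dtype=float))
    ly = np.log10(np.asarray(z_mean, dtype=float))
    w = np.asarray(n_proteins, dtype=float)
    slope, intercept = np.polyfit(lx, ly, 1, w=np.sqrt(w))
    pred = slope * lx + intercept
    ss_res = np.sum(w * (ly - pred) ** 2)
    ss_tot = np.sum(w * (ly - np.average(ly, weights=w)) ** 2)
    return float(slope), float(10 ** intercept), float(1 - ss_res / ss_tot)

def trim_terminal_outliers(k, n_proteins, keep_fraction=0.99):
    """Boolean mask keeping the smallest k values that together cover
    at least keep_fraction of all proteins."""
    order = np.argsort(k)
    counts = np.asarray(n_proteins, dtype=float)[order]
    cum = np.cumsum(counts) / counts.sum()
    n_keep = int(np.searchsorted(cum, keep_fraction) + 1)
    mask = np.zeros(len(counts), dtype=bool)
    mask[order[:n_keep]] = True
    return mask

# Hypothetical proteome summary (not data from Table 1).
k = np.array([1, 2, 3, 4, 5, 6, 12])
counts = np.array([6000, 2500, 900, 400, 150, 40, 10])
z_mean = 220.0 * k ** -0.3

mask = trim_terminal_outliers(k, counts)
b, L1, r2 = weighted_loglog_fit(k[mask], z_mean[mask], counts[mask])
print(mask.tolist(), round(b, 3), round(L1, 1), round(r2, 4))
```

Weighting by counts keeps sparsely populated high-k bins from dominating the fit, which is the same concern that motivates excluding the terminal outliers.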

Table 1 Summary table of correlation data for the 60 proteomes examined


Following elaborations by Meyer [28], we consider two levels i and j of a system to be ‘MA-related’ when (i) the system is hierarchically structured with n + 1 levels of organization and i > j > n, (ii) a significant fit of the relation between the length x of a higher level i sub-system and the average length y(x) of the parts of a lower level j sub-system exists, and (iii) immediate parts and subsystems (level i parts and level i + 1 subsystems) are stochastically independent. Specifically, length x of subsystem i (proteins in proteomes) can be measured by counting terminal (lowest) level n parts (amino acids) or by counting the number of level-j subsystems (domains). Table 1 therefore shows that domain parts and protein subsystems measured using terminal amino acid parts are MA-related at the proteome system level. We note that the evaluation of 60 proteomes appropriately samples the diversity of the cellular world and meets in every case the fitting requirements of the MA-relationship. It reveals a power law-generating stochastic behavior that is likely universal for proteomes and follows the MA law in a hierarchical system of molecular structure. However, its study only gains empirical interest if a rationale for the MA behavior can be envisioned.

Menzerath-Altmann’s law links trade-offs between determinants of persistence

Altmann suspected that the MA law was “somehow connected with the principle of least effort or with some not yet known principle of balance recompensating lengthening on one hand with shortening on the other” [4]. Here we put forth the hypothesis that the MA law represents a tendency towards economy in a trade-off relationship, where improvement in one property occurs at the expense of others. We will therefore unfold empirical patterns at protein and proteome levels that would support our rationale and mathematical formulations.

In order to interpret the fitting parameters of the MA law in linguistics, a statistical mechanics approach can be used that makes use of classical particle physics to describe words in text [29]. In the absence of a similar approach for protein domain organization, we start by defining a persistence function, which provides a heuristic argument for interpreting the MA power law. We introduce a principle of decreasing returns in domain organization to explain the MA-dependency of Table 1. The principle states that the persistence of a system (P) is related to two terms, a cost describing the energy-matter investment in the molecule (P_C) that depends both on k, the number of domains in a protein, and z_k, the average length of a domain [corresponding to x and y of eq. (2)], and a term describing the flexibility and robustness of the molecular system (P_FR) that depends on L₁, the length of single domain proteins [i.e., the intercept, which corresponds to A of eq. (2); Table 1], b, the slope (which describes the decreasing return in domain length z_k with increasing k) and k. Persistence follows eq. (3)

$$ P = {P}_C + {P}_{FR}=-k{z}_k + \frac{L_1}{b+1}{k}^{b+1} $$

(3)

The derivative of the persistence function P with respect to k, when set equal to zero, gives the power law version of the MA formulation [eq. (2)] of eq. (4)

$$ {z}_k=A{k}^b $$

(4)

with A = L₁. The function P is not always positive; it becomes negative for sufficiently large k or z_k, beyond the curve P = 0 in the (k, z_k) plane. However, eq. (4) corresponds to a ridge of maximum values for P between this curve and the k and z_k axes. Thus eq. (4) maximizes the persistence function P. Substituting eq. (4) into eq. (3), we get along the ridge eq. (5)

$$ {P}_{max}=-{L}_1{k}^{b+1} + \frac{L_1}{b+1}{k}^{b+1}={L}_1{k}^{b+1}\left(\frac{1}{b+1}-1\right)=-{L}_1{k}^{b+1}\left(\frac{b}{b+1}\right) $$

(5)
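For reference, the step from eq. (3) to eq. (4) can be written out explicitly. This is a sketch that assumes, as the derivation implies, that z_k is held fixed while differentiating with respect to k and that −1 < b < 0:

$$ \frac{\partial P}{\partial k}=-{z}_k + {L}_1{k}^b=0\kern1em \Rightarrow \kern1em {z}_k={L}_1{k}^b,\kern2em \frac{\partial^2P}{\partial {k}^2}=b{L}_1{k}^{b-1}<0 $$

The second derivative is negative whenever b < 0, so the stationary curve is a ridge of maxima rather than a valley of minima.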

Given eqs. (3) and (5), the flexibility plus robustness-to-cost ratio R depends on slope b, following eq. (6)

$$ R=\left|\frac{P_{FR}}{P_C}\right|=\frac{1}{b+1} $$

(6)

Steeper slopes (more negative b, −1 < b < 0) give bigger R ratios, which suggest increased trade-offs benefitting flexibility and robustness over economy in the frustrated landscape of molecular persistence. As we will now elaborate, this agrees with b representing a measure of structural and functional cooperativity among domains as these accrete in proteins and extend their length.
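A short numerical check of eqs. (3), (4) and (6) is sketched below. The parameter values L₁ = 200 and b = −0.25 are assumed for illustration and are not taken from Table 1; the function names are likewise hypothetical. The scan confirms that, for a fixed z_k, P peaks where z_k = L₁k^b, and the loop evaluates R at the two ends of the slope range reported for the log-log fits.

```python
import numpy as np

def persistence(k, z_k, L1, b):
    """Eq. (3): cost term -k*z_k plus flexibility-robustness term."""
    return -k * z_k + L1 / (b + 1.0) * k ** (b + 1.0)

def fr_to_cost_ratio(b):
    """Eq. (6): R = |P_FR / P_C| evaluated on the ridge z_k = L1 * k**b."""
    return 1.0 / (b + 1.0)

# Assumed illustrative values, not taken from Table 1.
L1, b = 200.0, -0.25
z_fixed = L1 * 4.0 ** b          # ridge value of z_k at k = 4

# Scan k at fixed z_k: P should peak where z_k = L1 * k**b, i.e. at k = 4.
k_grid = np.linspace(1.0, 10.0, 9001)
k_best = float(k_grid[np.argmax(persistence(k_grid, z_fixed, L1, b))])
print(round(k_best, 3))

# R for the two ends of the slope range reported in the log-log fits.
for slope in (-0.113, -0.404):
    print(slope, round(fr_to_cost_ratio(slope), 3))
```

With these slopes R rises from roughly 1.13 to roughly 1.68, which makes concrete the statement that steeper slopes shift the balance towards flexibility and robustness.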

Multidomain proteins provide both structural and functional plasticity, including an increased repertoire of active, regulatory, allosteric and binding sites, an increased landscape of intramolecular stabilizing interactions, enhanced molecular flexibility, and the option of distributing functions among the different domains [21, 30]. The combination of domains in multidomain proteins by genomic rearrangements, gains and losses manifests quite late in evolution [13, 20], suggesting that domain accretion in proteins is a derived evolutionary trait that benefits the increasing tasks of evolving multi-level molecular and cellular organization. Domains stabilize proteins in multidomain proteins mainly through interaction between hydrophobic residues in inter-domain interfaces [19]. The energy of these interactions scales linearly with the surface area of domain-domain interfaces, which depends on the size of the protein. Interactions also enhance the stability of individual domains, which constrains mutational substitution of interacting residues. This matches the broad observation that surface residues are less conserved in proteins when compared to those that are buried in the structural core (e.g., [31]). A recent comparison of number of buried residues normalized to the radius of gyration of domain structure has shown that younger domains tend to have higher surface area to volume ratio than older counterparts [27]. Since in general, younger domains engage in massive domain combinatorics [13], then multidomain proteins must be enriched in domains with relatively more stable structural cores. Thus, increases in k must result in increases of domain cooperativity during folding and consequent increases of protein stability.

If the proteome imparts limits to cellular behavior, then a number of crucial biophysical properties of proteins could constrain proteomic and cellular make up. Biophysical considerations have established that many properties of single-domain proteins, including folding rate and collapse, protein stability and size, and diffusion coefficients, simply depend on chain length and are important for the growth and fitness of the cell [32–35]. Scaling and distribution relationships reveal that folding rate, collapse, size, stability and diffusion of proteins depend simply on chain length [33]. While proteomes were marginally stable to denaturation, the function of cells appeared rate-limited not only by protein synthesis but also by the diffusional transport of proteins (which could explain compartmentalization in eukaryotic organisms) and the folding kinetics of the slowest-folding proteins of the cells. The dependence of cellular processes on protein folding and length is not a surprise. Length is a fundamental biophysical property of biopolymers as they self-assemble to maximize thermodynamic dissipation of energy [35]. Proteins transition abruptly into the folded state through a remarkably cooperative and frustrated process. Hydrophobic residues are buried to form the globular core and charged and polar residues that extend protein structure are exposed. This process exhibits remarkable universal behavior. Folding rates of both proteins and RNA scale as e^√L, with L representing the length of the polymer. Similarly, the folding and collapse transitions, which coincide, exhibit a cooperative behavior Ω that scales with L^1.22 [35]. Folding cooperativity thus scales with protein length and, consequently, with k in multidomain proteins.

We reiterate that the persistence function P for proteins and proteomes of eq. (3) depends solely on the length and number of domains, and can be apportioned into two separate terms. The first term reflects the matter-energy cost of lengthening domains by addition of amino acids or lengthening proteins by domain accretion. This cost is mainly imposed by protein synthesis, diffusion and folding and delimited by the mass-energy equivalence imparted by biochemistry. For example, shorter proteins that retain maximum rates of function and have similar kinetic characteristics incur lower metabolic costs of translation [36], as long as the trade-off maximizes cell physiology and growth rates. We note however that the intensity of protein length reductive pressure decreases if the fraction of cellular mass of the protein decreases. This would be particularly significant for highly diverse proteomes (e.g., Eukarya) and macromolecular crowding environments that maximize diffusion rates and the kinetic efficiencies of proteins [25]. Similarly, domain length follows a narrow distribution [37], limited by the benefits of fast folding of shorter proteins and the stability offered by burial of hydrophobic residues of structural cores of sufficient size. The second term of P reflects the benefits of larger domains and multidomain proteins, which contribute intramolecular interactions and provide additional structural and functional bases for increasing information flux through the system and enhancing flexibility and robustness. Borrowing from Yafremava et al. [38], we here define flexibility broadly as those structural and functional mechanisms that respond to changes internal and external to the molecular system and require processing of information. More flexible systems are generally larger, harbor more complex functionalities, and are more diverse in finding trade-off solutions. 
We define robustness as mechanisms that use information to maintain structure and function despite external influence and protect molecules from malfunction. Robustness includes stability but refers to broader processes that are passive from an information point of view. Information in molecules is stored in intramolecular and intermolecular interactions necessary for molecular function and stability [39]. In domain combinations, information also materializes in the combinatorics of domains, which manifests at chain and 3-D levels, and can be equated with language information [21].

The persistence function therefore makes a mathematically explicit framework of persistence strategies for biomolecular systems, in which economy, flexibility and robustness engage in various trade-off solutions. This framework defines a ‘triangle of persistence’, which has the potential to successfully explain organismal diversity [38]. Figure 2 summarizes the framework as it applies to domain structure and organization.

Patterns of decreasing returns in proteomes of kingdoms and superkingdoms

The MA-law imposes patterns of decreasing returns for domain lengths of proteins of a proteome. These patterns relate to protein domain make up, domain function, and evolutionary pressures imposed on the proteome as an interacting body of the cell. Analyses of domain length in proteins sampled from many proteomes (e.g., a set of PDB structures [37]) may not reveal the MA relationship because the scaling patterns are global and proteome-centric. Conversely, simple comparative analyses show that the complements of protein domains in the four kingdoms of Eukarya and in superkingdoms Archaea and Bacteria hold very distinctive distributions of molecular functions [40] and domain rearrangements [20]. Thus, it is expected that specific patterns of decreasing returns will exist for those groups. We therefore plotted slope (b) versus intercept (L₁) for each proteome that we studied with the goal of dissecting the contributions of economy and length of domains in single domain proteins that are characteristics of organismal groups (Fig. 3a). The lengths of single domain proteins L₁ act as upper bounds for the MA’s ‘shortening’ principle of domain length, establishing a flexibility-robustness stratum for a proteome in the triangle of persistence. Slopes ranged from −0.045 for Medicago truncatula (Plants) to −0.404 for Branchiostoma floridae (Metazoa). Intercepts ranged from 183 for Medicago truncatula to 247 for Aspergillus nidulans (Fungi). Most fungi exhibited the largest intercepts and a substantial number of plants and metazoans showed the smallest. Higher intercepts should be interpreted as larger ‘starting’ domain sizes fostering opportunities for flexibility and robustness but counteracted by increased burdens of cost. Most metazoans showed the steepest slopes and a substantial number of plants and protists the shallowest. 
Steepest slopes should be interpreted as stronger ‘push’ towards flexibility and robustness and corresponding ‘counter-push’ towards economy in domain organization. Proteomes distributed in the plot following a fan-like pattern, with the top segment of the semi-circle occupied by Fungi, Protista-Bacteria-Plants, and Archaea, in that order, and the bottom part by Metazoa. Plants and Protista occupied the fan handle.

We find that proteomes in the plot showed higher linear correlations for Fungi, Archaea and Plants (R² = 0.59-0.87; F = 11.4-54.6; p < 0.0001-0.01), the lowest correlation for Bacteria (R² = 0.36; F = 4.51; p = 0.067), and no significant trends for Metazoa and Protista (R² = 0.04-0.08; F = 0.34−0.68; p = 0.432-0.575). Since slopes of proteome groups in the slope b versus intercept L ₁ plots increase with single-domain length (intercept L ₁) and increasing linear fits, we hypothesize that this increasing trend, which is maximal in Fungi, describes a ‘compressible’ property capable of reducing domain length (L _k) when additional domains are accreted in proteins (k > 1). In other words, proteomes like those of fungi that exhibit on average longer domains in single domain proteins are capable of considerable length reduction as domains accrete in proteins. In turn, those that have shorter average single domain proteins relax the reductive tendency in multidomain proteins. Given the theoretical link that exists between b and both domain cooperativity and stability elaborated above, and the high surface area to volume ratio detected in new emergent proteins [27], we propose that the ‘compressible’ property is associated with contact density in domain structures, i.e., the fraction of buried sites in the atomic structure. Contact density correlates positively with evolutionary rate, measured as substitutions in protein sequence, without being confounded by gene expression levels [41]. Consequently, the larger numbers of contacts buried in the structures of larger domains, such as those of fungi, are prone to increased structural change. This could accelerate the reduction of the length of secondary structures by domain accretion in multidomain proteins, as accretion increases buried surface area. 
Since domains in a multidomain protein are translated at the same rate, the effect of gene expression levels on sequence change homogenizes differences in evolutionary rates of domains in multidomain proteins [42]. Thus, increases in evolutionary rates with domain number should extend to the entire protein. We note that both fungi and plants, as a group, are subject to increased levels of genomic rearrangements (via high recombination rates or transposon activities), when compared to metazoan, bacterial and archaeal microbes. This could result in increased insertion-deletion (indel) dynamics in regions of secondary structure that would decrease the length of these segments in evolution. Moreover, organismal groups such as Archaea and Fungi are subjected to strong reductive evolutionary pressures [43] that manifest in highly reduced proteins and proteomes [25]. This trend adds ‘compression’ tendencies to the length of multidomain proteins in this group, even if the lengths of single domain proteins are on average low.

We also plotted total number of domains in proteomes versus intercept (L ₁) to reveal the effect of reductive evolution at proteome level on starting domain size of organismal groups (Fig. 3b). As expected, the proteomes of the microbial superkingdoms were highly reduced, an evolutionary tendency imposed by an early pressure of demanding microbial lifestyles to reduce protein complements [38, 43]. However, proteomes of Bacteria showed larger L ₁ values than those of Archaea, uncovering additional reductive evolutionary constraints imposed on the archaeal microbes by lifestyle and history. With exception of Fungi, the rest of eukaryotic kingdoms relaxed reductive evolutionary constraints. Metazoa showed the largest repertoires and low L ₁ domain lengths. Fungi showed the smallest repertoires and the largest L ₁ values. All organismal groups in the plot were clearly dissected but none showed significant correlations (R ² = 0.001-0.136).

Patterns of domain length over-representation in single domain proteins

The effective average protein length (L_e) represents the sum of the length of individual domain constituents of a protein, without considering linkers and terminal non-domain sequences. We calculated L_e for each proteome using weights M_k, the number of proteins with k domains, and averaging over all k up to K’, the largest value of k on the linear part of the log-log plot. The plot L₁ versus L_e (Fig. 4a) showed linear correlations with low goodness-of-fit for proteomes in all kingdoms and superkingdoms (R² = 0.42–0.85; F = 5.92–44.74; p = 0.0002–0.041) with the exception of Bacteria (R² = 0.37; F = 4.66; p = 0.063). All trend lines clustered together quite tightly showing an expected overall increase of L₁ with increasing L_e. The slopes, which vary from 0.352 to 0.866, represent the fraction of total domain length apportioned to single domain proteins (L₁/L_e). Slopes show the disproportionately large representation of single domain proteins in microbial proteomes that hold only a limited repertoire of multidomain proteins. Slopes are maximal in Fungi and Archaea (0.866 and 0.742), intermediate in Plants and Bacteria (0.545 and 0.458) and minimal in Protista and Metazoa (0.352 and 0.390). Thus, Fungi and Archaea have significant overrepresentation of the length of single domain proteins, a feature that correlates with the high ‘compressible’ property revealed in Fig. 3a and the fact that they represent the organismal groups subjected to highest reductive tendencies in microbial and eukaryotic superkingdoms, respectively, revealed in Fig. 3b. The steepness of slopes follows the Fungi–Archaea > Plants–Bacteria > Protista–Metazoa trend of the slope versus intercept plot. Similarly, the best supported linear fits correlate with proteomes harboring larger proteins resulting from larger single domain proteins. Archaea is the superkingdom harboring the most reduced protein domain repertoires and the shortest proteins [25, 43]. 
This reductive trend is likely the result of mass economy and growth rate optimization. It is therefore unsurprising that it is costly for archaeal proteins to add more domains to a single domain protein; L ₁ takes more of L _e. A similar trend exists in fungi, especially in ascomycetous yeast, which already show significant reductive trends compared to other fungi and other eukaryotes [40] (Nasir, A. and Caetano-Anollés, unpublished). In our study, ascomycetes that include unicellular yeasts and dimorphic fungi that switch between unicellular and hyphal phases, have on average higher L ₁ (236 ± 6) and steeper slopes (−0.259 ± 0.018) than the rest of fungi examined (220 ± 10 and −0.202 ± 0.045), supporting the reductive trend visible in Fig. 3b. Within Eukarya, fungi also show maximum reductive evolutionary tendencies in the repertoire of domains and associated functions, when these are defined at fold superfamily level of structural classification (see Table S1 in [40]).
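One plausible reading of the L_e calculation described above is sketched below. It assumes that a protein with k domains contributes k·z_k residues of domain sequence and that L_e is the M_k-weighted mean of this quantity for k up to K’. That formula, the function name `effective_protein_length` and the input values are assumptions made for illustration; the exact expression used in the study is not given in this section.

```python
import numpy as np

def effective_protein_length(k, z_mean, n_proteins, k_max):
    """Weighted mean of summed domain length per protein, up to k_max.

    A protein with k domains contributes k * z_k residues of domain
    sequence; proteins are weighted by M_k, the count with k domains.
    This reading of L_e is an assumption made for illustration.
    """
    k = np.asarray(k, dtype=float)
    z = np.asarray(z_mean, dtype=float)
    m = np.asarray(n_proteins, dtype=float)
    keep = k <= k_max
    return float(np.sum(m[keep] * k[keep] * z[keep]) / np.sum(m[keep]))

# Hypothetical proteome summary (not data from the study).
k = [1, 2, 3, 4]
z_mean = [200.0, 170.0, 150.0, 140.0]
n_proteins = [700, 200, 80, 20]
L_e = effective_protein_length(k, z_mean, n_proteins, k_max=3)
print(round(L_e, 1), round(z_mean[0] / L_e, 3))
```

The second printed value is the ratio L₁/L_e for this toy proteome, the quantity that the slopes of Fig. 4a summarize across each organismal group.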

We also plotted L _e versus slope b again revealing linear correlations for Fungi and Archaea with low goodness-of-fit (R ² = 0.63-0.66; F = 13.48-15.22; p = 0.0045-0.0063) but non-significant fits for the rest (Fig. 4b). Most correlations showed that b became steeper with increasing L _e. This is expected since larger proteins must impose increased pressure to fulfill the decreasing return strategy of the MA law and the principle of maximum economy. Remarkably, groups showing the more significant linear correlations (Fungi and Archaea) showed maximum slopes in the plot, matching patterns observed in Fig. 3a. Thus, the marked reductive evolutionary trends of Archaea and Fungi that manifest at proteome level carry over to the length of individual proteins, supporting a previous study of reductive evolution [25]. We note that in Fig. 4b, the slope of the archaeal group is steeper (−0.0044) than that of fungi (−0.0025), revealing additional reductive constraints that are imposed on the akaryotic microbial superkingdom, which is significantly marked and unfolded very early in protein evolution [43]. This is also evident in the plot of Fig. 3b.

Conclusions

Processes of diminishing returns manifest when systems search for optimality. The closer to the optimum condition, the more difficult the effort invested in attaining it. For example, laboratory optimization of an arylesterase function in an in vitro evolution experiment revealed strong diminishing returns on enzymatic activity [44]. The first mutations in the bacterial population accounted for most improvements and the last ones simply reinforced the effects of early ones. In general, experiments that unfold new molecular functions also reveal the existence of evolutionary trade-offs between stability and function (e.g., [45]). Here we uncover similar processes of diminishing returns and trade-offs operating during molecular accretion of domains in proteins.

Menzerath’s insight suggested the existence of a universal tendency of parts to decrease their size when systems enlarge. The MA law appears universal for language and music. Our study extends its validity to biological parts and systems. In language, constituents of language constructs, such as the phonemes of words, are dynamic. They change as language unfolds in human history. Similarly, parts of biological systems, such as the domains of proteins, change in molecular evolution. In the case of domains, they increase or decrease in length and accrete in multidomain proteins by the pervasive effects of mutations and genomic rearrangements. We now find that protein domain length decreases with increasing number of domains in the proteins of proteomes. The existence of an MA law in protein domain organization can be explained as the consequence of the frustrated interaction between the strategies of economy, flexibility and robustness. The MA law represents a power law relationship that manifests when unfolding molecular persistence P as a function of domain accretion, measured as number of domains k in proteins. P holds two terms, one reflecting the matter-energy cost of adding domains and extending their length in proteins, the other reflecting how domain length and number impinge on information and the flexibility and robustness of the molecular system. Thus, our persistence function describes a frustrated landscape in a ‘persistence triangle’ with vertices representing the three main strategies.

A previous analysis of proteome makeup revealed that organisms in kingdoms and superkingdoms preferentially use flexibility and robustness properties in trade-off relationships with economy as they face environmental uncertainties and negotiate survival [38]. Archaea and the more flexible Bacteria gravitate towards the triangle’s economy vertex. In turn, eukaryotic organisms trade economy for flexibility and robustness as they massively expand biological repertoires and levels of organization. Protista occupy a saddle manifold separating Archaea and Bacteria from multicellular organisms. Plants and the more flexible Fungi are less affected by the positive feedback loop that pushes Metazoa towards maximum flexibility. Our mathematical formulations of persistence, which explain the MA power law, manifest similar trade-off relationships in the proteins of proteomes (Figs. 3b and 4b). Archaea, Fungi and to a lesser degree Plants show the largest push towards economy, each at their own economic stratum. Fungi increase domain size in single domain proteins while reinforcing the pattern of diminishing returns in multidomain proteins. Archaea and Plants follow the same strategy but relax the push towards larger single domain size. In contrast, Metazoa, and to lesser degrees Protista and Bacteria, relax the MA pattern of economy returns within a broad range of single domain sizes. Metazoa achieve maximum flexibility and robustness in proteins by generating compact molecules with a large number of domains and a multiplicity of combinations. This strategy implemented by Metazoa offers a new vocabulary for molecular functions in biology and new levels of structural organization.

Methods

We selected 60 proteomes of free-living species from the highly curated dataset of Wang et al. [25], which holds ~3 million sequences (from 745 proteomes) with structural domains assigned using hidden Markov models (HMMs) of structural recognition in SUPERFAMILY [46]. Species covered the superkingdoms Archaea and Bacteria and the four main kingdoms of Eukarya: Protista, Plants, Fungi and Metazoa (animals). Protein entries were retrieved trusting the reliability and robustness of HMMs that were used to delimit domains, the low probability of cryptic domains matching non-domain linker sequences (P < 0.0001) that could affect assignments of sequences to multi-domain protein groups, and the absence of biases imposed on length estimates by superkingdom-specific Markovian models [25]. A flat file was created with information about protein ID, domain ID defined at superfamily level, domain length and whole protein length. We averaged domain lengths (Y_k^j) for each domain number (k) across the selected proteins. Eqs. (7) and (8) were then used to calculate the mean value (z_k) and variance (s_k²), respectively.

$$ {z}_k = \frac{{\displaystyle {\sum}_{j=1}^{M_k}}{Y}_k^j}{M_k} $$

(7)

$$ {\left({s}_k\right)}^2=\frac{{\displaystyle {\sum}_{j=1}^{M_k}}{\left({Y}_k^j-{z}_k\right)}^2}{M_k-1} $$

(8)

where z_k = mean value of Y_k^j for a given k, Y_k^j = the jth domain length value recorded for proteins with k domains, M_k = number of proteins with k domains, k = number of domains in a protein, j = index of the Y_k^j values, running from 1 to M_k, and (s_k)² = variance.
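As an illustration of eqs. (7) and (8), the following minimal sketch groups domain length values by domain number k and returns M_k, the mean and the sample variance for each group. The function name `summarize_by_domain_number` and the toy records are hypothetical and are not taken from the dataset of Wang et al. [25]; the input layout (pairs of k and a domain length value) is an assumption about how the flat file could be read.

```python
from collections import defaultdict
from statistics import mean, variance


def summarize_by_domain_number(records):
    """Return {k: (M_k, z_k, s2_k)} from (k, domain_length) pairs.

    z_k is the arithmetic mean (eq. 7) and s2_k the sample variance
    with an M_k - 1 denominator (eq. 8). Hypothetical helper.
    """
    groups = defaultdict(list)
    for k, length in records:
        groups[k].append(length)

    summary = {}
    for k in sorted(groups):
        values = groups[k]
        m_k = len(values)
        z_k = mean(values)
        # The sample variance is undefined for a single observation.
        s2_k = variance(values) if m_k > 1 else float("nan")
        summary[k] = (m_k, z_k, s2_k)
    return summary


# Toy data (invented for illustration only).
toy_records = [(1, 200.0), (1, 180.0), (1, 220.0), (2, 150.0), (2, 130.0)]
toy_summary = summarize_by_domain_number(toy_records)
```

Groups with a single value get a variance of NaN here, so they would need separate handling before any weighting step.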

The graphs of k versus z_k were plotted with both axes on a log₁₀ scale. To avoid biases introduced by a small minority of proteins harboring a large number of domains (outliers with k > K’ domains), we excluded proteins with more than K’ domains and used the rest to fit the lines. K’ was chosen by eye with the goal of maximizing both R² and the number of proteins retained. Initial boundaries for the optimization were R² > 0.8 and > 95 % of protein entries retained. Analysis of several proteomes in preliminary studies showed that the by-eye choice of K’, judged by marked departures from a line, gives a nearly optimal fit. For example, inclusion of proteins with K’ ≥ 14 domains of H. sapiens in the example of Fig. 1 (up to the maximum of 20) decreases the R² statistic from 0.91 to 0.7. In turn, selecting K’ ≤ 6 domains decreases the number of proteins retained from 99.7 to 95 %. This brackets the K’ = 13 domain boundary by k = ±7.
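The trade-off between R² and the fraction of proteins retained can be tabulated for each candidate cutoff. The sketch below is not the by-eye procedure described above; it only assumes that k values are sorted in ascending order, computes an unweighted R² in log₁₀ space for each truncated series, and reports the share of proteins kept. The names `r_squared_loglog` and `scan_cutoffs` and the toy numbers are hypothetical.

```python
import math


def r_squared_loglog(ks, zs):
    """Unweighted R^2 of a straight line through (log10 k, log10 z_k)."""
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(z) for z in zs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # For simple linear regression, R^2 is the squared correlation.
    return sxy ** 2 / (sxx * syy)


def scan_cutoffs(ks, zs, ms, min_points=3):
    """Return (cutoff, R^2, fraction of proteins retained) per cutoff.

    ks must be sorted ascending; ms holds M_k for each k.
    """
    total = sum(ms)
    rows = []
    for n in range(min_points, len(ks) + 1):
        r2 = r_squared_loglog(ks[:n], zs[:n])
        retained = sum(ms[:n]) / total
        rows.append((ks[n - 1], r2, retained))
    return rows


# Toy series: a clean power law with one outlying point at k = 5.
toy_ks = [1, 2, 3, 4, 5]
toy_zs = [200.0 * k ** -0.3 for k in toy_ks[:4]] + [300.0]
toy_ms = [500, 300, 150, 45, 5]
for cutoff, r2, retained in scan_cutoffs(toy_ks, toy_zs, toy_ms):
    print(cutoff, round(r2, 3), round(retained, 3))
```

In the toy series the outlying point lowers R² only when the cutoff includes it, while excluding it costs a small share of the proteins, which is the kind of departure from a line the by-eye choice looks for.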

Lines were fitted in log space to eq. (9)

$$ {z}_k={L}_1{k}^b $$

(9)

using the Excel solver for weighted and non-weighted least squares of Harris [47], which fits experimental data using non-linear functions. For the solver input, we used k (k = 1 to K’), z_k, standard errors of the means (Y_err), and the weight of the kth value (w_k) to calculate the slope (b), intercept (L₁) and their respective standard errors of the means (SEM). We used eqs. (10) and (11) to calculate Y_err and w_k:

$$ {Y}_{err}=\kern0.5em \sqrt{\frac{{\left({s}_k\right)}^2}{M_k}} $$

(10)

$$ {w}_k = \frac{M_k}{{\left({s}_k\right)}^2} $$

(11)
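A minimal sketch of the weighted fit follows. It is not the Excel solver of Harris [47]. It swaps in a closed-form weighted linear regression on log₁₀-transformed values, and it assumes that the weights w_k of eq. (11) are applied directly to the log-transformed points, which is a simplification. The function names and toy values are hypothetical.

```python
import math


def sem_and_weight(s2_k, m_k):
    """Standard error of the mean (eq. 10) and weight (eq. 11)."""
    y_err = math.sqrt(s2_k / m_k)
    w_k = m_k / s2_k  # equal to 1 / y_err**2
    return y_err, w_k


def weighted_loglog_fit(ks, zs, ws):
    """Weighted least squares of log10 z_k on log10 k.

    Returns (b, L1): the slope and the back-transformed intercept,
    i.e. the fitted value of z_k at k = 1.
    """
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(z) for z in zs]
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    log_l1 = y_bar - b * x_bar
    return b, 10 ** log_l1


# Toy check: an exact power law is recovered whatever the weights.
toy_ks = [1, 2, 3, 4, 5]
toy_zs = [200.0 * k ** -0.3 for k in toy_ks]
toy_ws = [5.0, 4.0, 3.0, 2.0, 1.0]
b_hat, l1_hat = weighted_loglog_fit(toy_ks, toy_zs, toy_ws)
```

Because M_k usually falls steeply with k, the weights of eq. (11) let the well-populated low-k points dominate the fit, which limits the influence of sparsely sampled high-k means.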

Effective average protein lengths (L_e) were calculated using eq. (12):

$$ {L}_e=\frac{{\displaystyle {\sum}_{k=1}^{K\hbox{'}}}\left(k*{M}_k*{z}_k\right)}{{\displaystyle {\sum}_{k=1}^{K\hbox{'}}}{M}_k} $$

(12)
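Eq. (12) can be checked with a few lines of code. The sketch below assumes a mapping from k to the pair (M_k, z_k); the function name `effective_length` and the toy numbers are hypothetical.

```python
def effective_length(summary, k_max):
    """Effective average protein length L_e (eq. 12).

    summary maps k -> (M_k, z_k); only k = 1..k_max contributes,
    where k_max plays the role of the cutoff K'.
    """
    numerator = 0.0
    denominator = 0
    for k, (m_k, z_k) in summary.items():
        if 1 <= k <= k_max:
            # k * z_k is the summed domain length of a k-domain protein.
            numerator += k * m_k * z_k
            denominator += m_k
    return numerator / denominator


# Toy data: 3 single-domain proteins with mean domain length 200 and
# 2 two-domain proteins with mean domain length 140.
toy_summary = {1: (3, 200.0), 2: (2, 140.0)}
toy_le = effective_length(toy_summary, k_max=2)
```

For the toy data the numerator is 1·3·200 + 2·2·140 = 1160 and the denominator is 5, so L_e = 232.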

We used the F statistics of Proc GLM (SAS, SAS Inst. Inc., Cary, NC) to test the linear relationship between k vs. z_k, b vs. L₁, genome size vs. L₁, L_e vs. b and L_e vs. L₁. We report dependencies that are most useful for biological interpretation. In particular, L₁ describes the average length of single domain proteins, which serves to define an upper bound for the MA-dependency of a proteome. In turn, L_e describes the sum of the length of individual domain constituents of a protein, which is an indicator of mass economy for growth rate optimization. An example of a regression model is given by eq. (13)

$$ {V}_{ij}={L}_1+b{U}_i + {\varepsilon}_{ij} $$

(13)

where V_ij is the observation of the ith effect and the jth replication, U_i is the ith effect, and ε_ij is a random error term of the ith effect and jth replication, assumed to be normally and independently distributed, NID(0, σ²), i.e., normal, independent and identically distributed errors.
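The F statistic for a simple linear model of the form of eq. (13) can be sketched without statistical software. This is not SAS Proc GLM. It is a plain ordinary least squares calculation with one predictor, and it stops at the F value because the p-value requires the F distribution, which the Python standard library does not provide. The function name `simple_regression_f` and the toy values are hypothetical.

```python
def simple_regression_f(us, vs):
    """Ordinary least squares of V on U with the model F statistic.

    Returns (intercept, slope, F), where F = (SSR / 1) / (SSE / (n - 2)).
    """
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suu = sum((u - u_bar) ** 2 for u in us)
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    slope = suv / suu
    intercept = v_bar - slope * u_bar
    # Residual and regression sums of squares.
    sse = sum((v - (intercept + slope * u)) ** 2 for u, v in zip(us, vs))
    sst = sum((v - v_bar) ** 2 for v in vs)
    ssr = sst - sse
    f_stat = ssr / (sse / (n - 2))
    return intercept, slope, f_stat


# Toy data (invented): a nearly linear relationship.
toy_u = [1.0, 2.0, 3.0, 4.0]
toy_v = [2.1, 3.9, 6.2, 7.8]
intercept_hat, slope_hat, f_hat = simple_regression_f(toy_u, toy_v)
```

A large F relative to the F(1, n − 2) reference distribution indicates that the slope differs from zero. A perfectly linear input would give SSE = 0 and a division by zero, so real use would need a guard for that case.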

Availability of supporting data

A file with the proteomic data of Wang et al. [25] analyzed in this study can be found at LabArchives: http://dx.doi.org/10.6070/H4513W6X.

Abbreviations

HMMs: Hidden Markov models
MA: Menzerath-Altmann

References

1. Zuckerkandl E, Pauling L. Molecular disease, evolution, and genic heterogeneity. In: Kasha M, Pullman B, editors. Horizons in Biochemistry. New York: Academic; 1962. p. 189–225.
2. Menzerath P. Über einige phonetische Probleme. In: Actes du Premier Congrès International de Linguistes. Leiden: Sijthoff; 1928. p. 104–5.
3. Menzerath P. Die Architektonik des Deutschen Wortschatzes. Bonn: Dümmler; 1954.
4. Altmann G. Prolegomena to Menzerath’s law. Glottometrika. 1980;2:1–10.
5. Strauss S, Altmann G. Hierarchic relations. In: Altmann G, Köhler R, Vulanović R, editors. Encyclopedia of linguistic laws; 2006. http://lql.uni-trier.de/index.php/Main_Page Accessed 15 Feb 2015.
6. Boroda MG, Altmann G. Menzerath’s law in musical texts. Musikometrica. 1991;3:1–13.
7. Ferrer-i-Cancho R, Forns N. The self-organization of genomes. Complexity. 2010;15:34–6.
8. Baixeries J, Hernández-Fernández A, Ferrer-i-Cancho R. Random models of Menzerath-Altmann law in genomes. Biosystems. 2012;107:167–73.
9. Li W. Menzerath’s law at the gene-exon level in the human genome. Complexity. 2012;17:49–53.
10. Ferrer-i-Cancho R, Forns N, Hernández-Fernández A, Bel-Enguix G, Baixeries J. The challenges of statistical patterns of language: The case of Menzerath’s law in genomes. Complexity. 2013;18:11–7.
11. Eroglu S. Self-organization of genic and intergenic sequence lengths in genomes: Statistical properties and linguistic coherence. Complexity. 2014. doi:10.1002/cplx.21563.
12. Eroglu S. Language-like behavior of protein length distribution in proteomes. Complexity. 2014;20:12–21.
13. Caetano-Anollés G, Wang M, Caetano-Anollés D, Mittenthal JE. The origin, evolution and structure of the protein world. Biochem J. 2009;417:621–37.
14. Wetlaufer DB. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc Natl Acad Sci U S A. 1973;70:697–701.
15. Richardson JS. The anatomy and taxonomy of protein structure. Adv Protein Chem. 1981;34:167–339.
16. Janin J, Wodak SJ. Structural domains in proteins and their role in the dynamics of protein function. Prog Biophys Mol Biol. 1983;42:21–78.
17. Murzin A, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins for the investigation of sequences and structures. J Mol Biol. 1995;247:536–40.
18. Riley M, Labedan B. Protein evolution viewed through Escherichia coli protein sequences: Introducing the notion of a structural segment of homology, the module. J Mol Biol. 1997;268:857–68.
19. Bhaskara RM, Srinivasan N. Stability of domain structures in multi-domain proteins. Sci Rep. 2011;1:40.
20. Wang M, Caetano-Anollés G. The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure. 2009;17:66–78.
21. Bashton M, Chothia C. The generation of new protein functions by the combination of domains. Structure. 2007;15:85–99.
22. Kim HS, Mittenthal JE, Caetano-Anollés G. Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution. J Integr Bioinform. 2013;10:214.
23. Nasir A, Kim KM, Caetano-Anollés G. Global patterns of domain gain and loss in superkingdoms. PLoS Comput Biol. 2014;10:e1003452.
24. Debès C, Wang M, Caetano-Anollés G, Gräter F. Evolutionary optimization of protein folding. PLoS Comput Biol. 2013;9:e1002861.
25. Wang M, Kurland CG, Caetano-Anollés G. Reductive evolution of proteomes and protein structures. Proc Natl Acad Sci U S A. 2011;108:11954–8.
26. Edwards H, Abeln S, Deane CM. Exploring fold preferences of new-born and ancient protein superfamilies. PLoS Comput Biol. 2013;9:e1003325.
27. Grotjahn R. Evaluating the adequacy of regression models: some potential pitfalls. Glottometrika. 1993;13:121–72.
28. Meyer P. Two semi-mathematical asides on Menzerath-Altmann’s law. In: Grzybek P, Köhler R, editors. Exact methods in the study of language and text: Dedicated to Gabriel Altmann on the occasion of his 75th birthday. Hague: Mouton de Gruyter; 2007. p. 449–60.
29. Eroglu S. Parameters of the Menzerath-Altmann law: Statistical mechanical interpretation as applied to a linguistic organization. J Stat Phys. 2014;157:392–405.
30. Han J-H, Batey S, Nickson AA, Teichmann SA, Clarke J. The folding and evolution of multidomain proteins. Nature Rev Mol Cell Biol. 2007;8:319–30.
31. Conant GC, Stadler PF. Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009;26:1155–61.
32. Thirumalai D, O’Brien EP, Morrison G, Hyeon C. Theoretical perspectives on protein folding. Annu Rev Biophys. 2010;39:159–83.
33. Dill KA, Ghosh K, Schmit JD. Physical limits of cells and proteomes. Proc Natl Acad Sci U S A. 2011;108:17876–82.
34. Kepp KP, Dasmeh P. A model of proteostatic energy cost and its use in analysis of proteome trends and sequence evolution. PLoS One. 2014;9:e90504.
35. Thirumalai D. Universal relationships in the self-assembly of proteins and RNA. Phys Biol. 2014;11:053005.
36. Ehrenberg M, Kurland CG. Costs of accuracy determined by a maximal growth rate constraint. Q Rev Biophys. 1984;17:45–82.
37. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distributions can predict domain boundaries. Bioinformatics. 2000;16:613–8.
38. Yafremava LS, Wielgos M, Thomas S, Nasir A, Wang M, Mittenthal JE, et al. A general framework of persistence strategies for biological systems helps explain domains of life. Front Genet. 2013;4:16.
39. Caetano-Anollés G, Mittenthal JE. Exploring the interplay of stability and function in protein evolution. Bioessays. 2010;32:655–8.
40. Nasir A, Naeem A, Khan MJ, Lopez-Nicora HD, Caetano-Anollés G. Annotation of protein domains reveals remarkable conservation in the functional make up of proteomes across superkingdoms. Genes. 2011;2:869–911.
41. Zhou T, Drummond DA, Wilke CO. Contacts density affects protein evolutionary rate from bacteria to animals. J Mol Evol. 2008;66:395–404.
42. Wolf MY, Wolf YI, Koonin EV. Comparable contributions of structural-functional constraints and expression level to the rate of protein sequence evolution. Biol Direct. 2008;3:40.
43. Wang M, Yafremava LS, Caetano-Anollés D, Mittenthal JE, Caetano-Anollés G. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 2007;17:1572–85.
44. Tokuriki N, Jackson CJ, Afriat-Journou L, Wyganowski KT, Tang R, Tawfik DS. Diminishing returns and tradeoffs constrain the laboratory optimization of an enzyme. Nature Commun. 2012;3:1257.
45. Nagatani RA, Gonzalez A, Shoichet BK, Brinen LS, Babbitt PC. Stability for function trade-offs in the enolase superfamily “catalytic module”. Biochemistry. 2007;46:6688–95.
46. Wilson D, Madera M, Vogel C, Chothia C, Gough J. The SUPERFAMILY database in 2007: Families and functions. Nucleic Acids Res. 2007;35:D308–13.
47. Harris DC. Nonlinear least-squares curve fitting with Microsoft Excel Solver. J Chem Ed. 1998;75:119.

Acknowledgments

We thank Minglei Wang for help with genomic data and Marcos Santana Mendoza for preliminary analyses. Research was supported in part with funds from the University of Illinois and grants from the National Science Foundation (OISE-1132791) and the United States Department of Agriculture (ILLU-483-625) to GCA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies. We thank members and friends of the Evolutionary Bioinformatics lab for valuable discussions.

Author information

Authors and Affiliations

Illinois Informatics Institute, Urbana, IL, 61801, USA
Khuram Shahzad & Gustavo Caetano-Anollés
Department of Cell and Developmental Biology, Urbana, IL, 61801, USA
Jay E. Mittenthal
Department of Crop Sciences, Evolutionary Bioinformatics Laboratory, University of Illinois, 332 NSRC, Urbana, IL, 61801, USA
Gustavo Caetano-Anollés

Corresponding author

Correspondence to Gustavo Caetano-Anollés.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors designed experiments and analyzed the data. GCA wrote the paper with the help of all authors. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Shahzad, K., Mittenthal, J.E. & Caetano-Anollés, G. The organization of domains in proteins obeys Menzerath-Altmann’s law of language. BMC Syst Biol 9, 44 (2015). https://doi.org/10.1186/s12918-015-0192-9

Received: 25 February 2015
Accepted: 30 July 2015
Published: 11 August 2015
DOI: https://doi.org/10.1186/s12918-015-0192-9
