# Dynamics of the discovery process of protein-protein interactions from low content studies

- Zichen Wang†^{1, 2, 3}
- Neil R. Clark†^{1, 2, 3}
- Avi Ma’ayan^{1, 2, 3}

**Received:** 15 January 2015

**Accepted:** 29 May 2015

**Published:** 6 June 2015

## Abstract

### Background

Thousands of biological and biomedical investigators study the functional role of single genes and their protein products in normal physiology and in disease. The findings from these studies are reported in research articles that stimulate new research. It is now established that a complex regulatory network controls human cellular fate, and this community of researchers is continually unraveling the topology of this network. Attempts to integrate results from such accumulated knowledge resulted in literature-based protein-protein interaction networks (PPINs) and pathway databases. These databases are widely used by the community to analyze new data collected from emerging genome-wide studies, with the assumption that the data within these literature-based databases are the ground truth and contain no biases. While suspicion of research focus biases is growing, concrete proof of them is still missing. Such bias is difficult to prove because the real PPINs are mostly unknown.

### Results

Here we analyzed the longitudinal discovery process of literature-based mammalian and yeast PPINs to observe that these networks are discovered non-uniformly. The pattern of discovery is related to a theoretical concept proposed by Kauffman called “expanding the adjacent possible”. We introduce a network discovery model which explicitly includes the space of possibilities in the form of a true underlying PPIN.

### Conclusions

Our model strongly suggests that research focus biases exist in the observed discovery dynamics of these networks. In summary, more care should be taken when using PPIN databases for the analysis of newly acquired data, and when drawing on prior knowledge to design new experiments.

## Background

Protein-protein interaction networks (PPINs) are an abstract representation of the body of knowledge about the known physical interactions between proteins within cells of an organism. In these networks, proteins are the nodes and their known physical interactions (PPIs) are the links. Literature-based PPINs and pathway databases are central in computational systems biology since they summarize accumulated knowledge and are reused for various types of analyses. For example, PPINs can be used to predict disease genes and identify disease related pathways or modules [1–5], applied to predict gene/protein function [6, 7] and predict undiscovered PPIs [8]. Commonly, lists of genes and proteins identified experimentally by high content profiling methods use literature curated PPINs and pathway databases for enrichment analyses [9], or such lists are seeded within PPINs to identify functional subnetworks, and this helps to provide global biological context to the identified gene lists [10, 11]. Inclusion of PPINs was shown to improve the quality of inferred co-expression networks and the prioritization of genes that harbor mutations and copy number variations to better correlate these with disease [12–14].

There are several reasons to suspect that literature-based PPINs and pathway databases contain research focus biases. For instance, the uneven availability of tools such as mouse models or quality antibodies enables the study of some genes and proteins over others [15]. However, so far, concrete proof that such discovery bias really exists has not been reported. It is difficult to prove that such bias exists because the real PPINs are mostly unknown. One null model for the discovery of any network is a uniform, uncorrelated exploration of all links and nodes without bias. An alternative model can simulate the network discovery process whereby a discovery in one region of the network predisposes the expansion of related discoveries. Such models can be compared to empirical observations. Tria et al. [16] empirically observed that with open data resources, such as online music catalogues and Wikipedia pages, one discovery spurs another. They then quantified their observation with the theoretical concept of “the adjacent possible” proposed by Kauffman [17]. This concept was first proposed in the context of biological evolution and technological evolution [18, 19]. Tria et al. were able to observe counterparts of Heaps’ law, whereby the number of discoveries made increases sub-linearly, and Zipf’s law, whereby the rank distribution of the frequencies of the discovered elements follows a power law [16]. These observations were illuminated with a model based on Polya’s urn [20–22], which was able to unify Heaps’ and Zipf’s laws and capture the correlations in the discoveries without explicit reference to the unknown space of possibilities to which the concept of “the adjacent possible” refers.
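The urn process with reinforcement and triggering mentioned above can be illustrated with a minimal sketch. It follows the general scheme described for Tria et al. [16] (a drawn ball is returned with extra copies; a ball drawn for the first time also adds brand-new colors to the urn), but the function name `urn_with_triggering` and the parameters `rho`, `nu` and `n0` are illustrative choices, not values taken from this article.

```python
import random


def urn_with_triggering(steps, rho=4, nu=2, n0=3, seed=0):
    """Simulate a Polya-type urn with reinforcement and triggering.

    Returns a list whose t-th entry is the number of distinct colors
    drawn after t + 1 draws (the analogue of a Heaps-type growth curve).
    rho, nu and n0 are hypothetical illustration parameters.
    """
    rng = random.Random(seed)
    urn = list(range(n0))          # n0 balls, each of a different color
    next_color = n0                # label for the next never-used color
    seen = set()                   # colors drawn at least once
    distinct = []
    for _ in range(steps):
        ball = rng.choice(urn)
        urn.extend([ball] * rho)   # reinforcement: rho extra copies
        if ball not in seen:       # a novelty was drawn
            seen.add(ball)
            # triggering: nu + 1 balls of brand-new colors enter the urn
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        distinct.append(len(seen))
    return distinct


if __name__ == "__main__":
    curve = urn_with_triggering(2000)
    print(curve[9], curve[99], curve[-1])
```

Plotting `curve` against the draw index on log-log axes is one way to check for the sub-linear growth that the text associates with Heaps’ law.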

Here we used the PubMed IDs associated with protein-protein interactions (PPIs) as a time-stamp to temporally resolve the discovery dynamics of mammalian and yeast PPINs extracted manually from low-content published studies. We observe the counterparts of Heaps’ and Zipf’s laws in the discovery of these mammalian and yeast PPINs. Furthermore, we identify individual proteins which exhibit accelerated or decelerated discovery rates. We then propose an original model which is related to Polya’s urn. The model features “reinforcement”, rich-get-richer type dynamics, together with “triggering”, whereby novel discoveries open up the possibility of a subset of new discoveries. Our model is the first network discovery model to explicitly incorporate a space of possibilities, which is the basis of Kauffman’s “adjacent possible”, in the form of an underlying network. Our model captures the observed dynamics of PPIN discovery, and provides strong suggestive evidence that research focus biases exist within the patterned discovery of the yeast and mammalian PPINs.

## Methods

### Construction of the mammalian PPIN

The yeast (*Saccharomyces cerevisiae*) PPIN was downloaded from iRefWeb 4.1 [23] by including only experimental physical interactions, filtering out unary interactions, and excluding from most analyses 82,391 PPIs from publications associated with more than 10 interactions. The resulting yeast PPIN has 9678 PPIs between 3154 proteins, extracted from 6208 publications, with discovery times ranging from June 1946 to November 2011.
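The low-content filter described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(pmid, protein_a, protein_b)`, the function name `low_content_ppis`, the reading of “unary” as self-interactions, and the use of the smallest PubMed ID as the discovery time-stamp are all assumptions rather than details given in the text.

```python
from collections import defaultdict


def low_content_ppis(records, max_per_publication=10):
    """Keep PPIs reported by publications with at most
    `max_per_publication` interactions.

    records: iterable of (pmid, protein_a, protein_b) tuples
             (hypothetical layout).
    Returns {(protein_a, protein_b): earliest_pmid} with each pair sorted.
    """
    by_pmid = defaultdict(set)
    for pmid, a, b in records:
        if a != b:  # assumption: "unary" records are self-interactions
            by_pmid[pmid].add(tuple(sorted((a, b))))

    first_seen = {}
    for pmid in sorted(by_pmid):          # PMID order as a proxy for time
        pairs = by_pmid[pmid]
        if len(pairs) > max_per_publication:
            continue                      # high-content study: skip
        for pair in pairs:
            first_seen.setdefault(pair, pmid)
    return first_seen


if __name__ == "__main__":
    demo = [(100, "A", "B"), (100, "A", "A"), (300, "B", "A")]
    demo += [(200, f"X{i}", "Y") for i in range(11)]
    print(low_content_ppis(demo))
```

Raising `max_per_publication` to 50, 100 or 1000 would give progressively more inclusive networks of the kind compared later in the Results.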

Mammalian PPIN resources

PPI database | PMID | Publication coverage | PPIs | Latest publication time |
---|---|---|---|---|
BIND | 12519993 | 10069 | 15895 | 2010 Aug. |
BioCarta | NA | 1 | 189 | 1994 Jun. |
BioGrid | 16381927 | 22277 | 131438 | 2013 Nov. |
DIP | 10592249 | 491 | 873 | 2004 Feb. |
Ewing et al. | 17353931 | 1 | 3585 | 2007 Jan. |
HPRD | 14681466 | 18515 | 35433 | 2010 Aug. |
InnateDB | 18766178 | 3028 | 6052 | 2011 Jun. |
IntAct | 14681455 | 3300 | 54248 | 2013 Jun. |
KEA | 19176546 | 6790 | 16193 | 2010 Jun. |
KEGG | 18077471 | 1 | 7207 | 2000 Jan. |
MINT | 17135203 | 1265 | 11750 | 2009 Oct. |
MIPS | 14681354 | 170 | 323 | 2004 Jan. |
PDZBase | 15513994 | 141 | 234 | 2003 Jul. |
PPID | 21516116 | 1980 | 2904 | 2003 May |
SNAVI | 16099987 | 1059 | 1156 | 2006 Jan. |
Stelzl et al. | 16169070 | 1 | 1560 | 2005 Sep. |
Rual et al. | 16189514 | 1 | 4225 | 2005 Oct. |
Total | NA | 37015 | 185068 | 2013 Nov. |

### Entropy calculation

The entropy of the discovery sequence was calculated for each protein *i* with known degree \( {\tilde{k}}_i \) by:

\[ {S}_i=-{\displaystyle \sum_{j=1}^{{\tilde{k}}_i}}\frac{f_{ij}}{{\tilde{k}}_i} \log \left(\frac{f_{ij}}{{\tilde{k}}_i}\right) \]

where \( f_{ij} \) is the number of discovered PPIs involving protein *i* in the *j*th interval of time. The time intervals are defined by taking the span from the time at which protein *i* was first observed until the final observation in the whole dataset, and dividing it into \( {\tilde{k}}_i \) equal-sized bins. This entropy measure was also normalized by dividing by the maximum possible entropy \( \log \left({\tilde{k}}_i\right) \).

### Random data permutations

In order to compare the entropy and interval distributions to a null distribution based on uniform randomization of the data, we destroyed the original data order while preserving the frequency distributions by employing random permutations. The first reshuffling method acts globally in time by randomly reassigning the time index to PPI discoveries. The second reshuffling method is local in that it only randomly reassigns time indices from the first appearance of the protein under consideration.
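The two reshuffling schemes can be sketched as follows. The event layout `(time_index, ppi)` and the function names are hypothetical, and the local variant assumes that the event in which the protein first appears stays in place while only later time indices are permuted; the text does not state whether that first event is itself included in the shuffle.

```python
import random


def global_shuffle(events, seed=0):
    """Randomly reassign time indices across all PPI discoveries."""
    rng = random.Random(seed)
    times = [t for t, _ in events]
    rng.shuffle(times)
    shuffled = [(t, ppi) for t, (_, ppi) in zip(times, events)]
    return sorted(shuffled, key=lambda e: e[0])


def local_shuffle(events, protein, seed=0):
    """Reassign time indices only after the first appearance of `protein`.

    Assumption: the first event involving `protein` keeps its position.
    """
    rng = random.Random(seed)
    ordered = sorted(events, key=lambda e: e[0])
    first = next((i for i, (_, ppi) in enumerate(ordered) if protein in ppi),
                 None)
    if first is None:
        return ordered                    # protein never observed
    head, tail = ordered[:first + 1], ordered[first + 1:]
    times = [t for t, _ in tail]
    rng.shuffle(times)
    tail = [(t, ppi) for t, (_, ppi) in zip(times, tail)]
    return head + sorted(tail, key=lambda e: e[0])


if __name__ == "__main__":
    demo = [(t, ("P%d" % (t % 3), "Q%d" % t)) for t in range(10)]
    print(global_shuffle(demo, seed=1)[:3])
    print(local_shuffle(demo, "P1", seed=1)[:3])
```

Both functions keep the set of PPIs and the set of time indices unchanged, so the frequency distributions are preserved while the temporal order is destroyed.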

### Generation of artificial networks for the network discovery model

Underlying networks for the PPI discovery model were generated by five different algorithms which resulted in networks with various global properties. In order to approximate the size of the true underlying mammalian PPIN, we constructed artificial networks with 25,000 nodes and tuned the parameters of the different network construction models to produce networks that have ~650,000 links. These numbers agree with a recent estimate of the size of the human PPIN [24].

Properties of the artificial network models

Networks | Nodes | Edges | Clustering coefficient | Power-law exponent | Connected components |
---|---|---|---|---|---|
BA graph | 25000 | 649324 | 0.011 | 1.9 | 1 |
BA cluster graph | 25000 | 649304 | 0.182 | 2 | 1 |
Duplication-Divergence | 25000 | 655271 | 0 | 1.7 | 1 |
Erdős-Rényi | 25000 | 650069 | 0.002 | NA | 1 |
Complete graph | 1000 | 499500 | 1 | NA | 1 |
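As a rough illustration of how generator parameters can be tuned toward ~650,000 links, the sketch below works through the arithmetic for three of the models and includes a small preferential-attachment generator. It assumes a Barabási-Albert variant that starts from `m` unconnected seed nodes, in which case `m = 26` gives exactly the 649,324 edges listed for the BA graph; whether this is the setting the authors used is not stated. All function names are hypothetical.

```python
import random


def ba_edge_count(n, m):
    """Edges in a BA graph grown from m unconnected seed nodes:
    each of the n - m later nodes attaches with m links."""
    return m * (n - m)


def er_link_probability(n, target_edges):
    """Link probability giving `target_edges` expected edges in G(n, p)."""
    return target_edges / (n * (n - 1) / 2)


def complete_edge_count(n):
    """Number of edges in a complete graph on n nodes."""
    return n * (n - 1) // 2


def barabasi_albert_edges(n, m, seed=0):
    """Small preferential-attachment generator (illustrative only)."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))   # the first new node links to the m seeds
    repeated = []              # each node listed once per incident link
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:  # degree-proportional choice of m targets
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


if __name__ == "__main__":
    print(ba_edge_count(25000, 26))
    print(er_link_probability(25000, 650000))
    print(complete_edge_count(1000))
    print(len(barabasi_albert_edges(200, 5)))
```

The Erdős-Rényi link probability of about 0.002 also matches the clustering coefficient listed for that network, as expected for a uniformly random graph.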

### A model of protein-protein interaction network discovery

The true underlying PPIN is represented as a graph *G*(*V*, *E*), where the vertices *V* correspond to the set of all proteins and the edges *E* correspond to the set of all true PPIs. We examine five different network structures in order to study their effect on network discovery dynamics as described above. For a given PPIN, edges are “discovered” by a random choice. At a given time step, the probability of discovering the true link between vertices *i* and *j* is given by \( {\mu}_{ij}\propto \mu \left({\tilde{k}}_i,{\tilde{k}}_j\right) \), where \( {\tilde{k}}_x \) is the currently known degree of vertex *x*. The form of the function *μ* determines the nature of the discovery process in this model, for example,

In this case only links which are connected to at least one previously discovered protein can possibly become discovered.

Where \( {d}_i \) is the true degree of node *i*, and the factor of 2 arises because each link is shared by two nodes. In this case we do not expect any significant acceleration of growth for the nodes, i.e., we expect to discover interactions involving any given protein at a roughly constant rate.
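A minimal sketch of such a discovery process is given below. The exact forms of *μ* used in the article are not reproduced in this text, so both weight functions here are assumptions: `biased` uses the sum of the currently known degrees, which is zero for links with no previously discovered endpoint and therefore matches the first case described above, and `uniform` gives every hidden link the same chance. The names `discover`, `biased` and `uniform` are hypothetical.

```python
import random


def discover(true_edges, steps, weight, seed=0):
    """Reveal edges of a hidden true network, one per time step.

    weight(ki, kj) is the unnormalized probability of discovering a link
    whose endpoints currently have known degrees ki and kj.
    Returns the discovered edges in order of discovery.
    """
    rng = random.Random(seed)
    hidden = list(true_edges)
    known_degree = {}
    order = []
    for _ in range(min(steps, len(hidden))):
        weights = [weight(known_degree.get(i, 0), known_degree.get(j, 0))
                   for i, j in hidden]
        if sum(weights) == 0:
            # nothing reachable yet (e.g. first step): choose uniformly
            weights = [1.0] * len(hidden)
        idx = rng.choices(range(len(hidden)), weights=weights, k=1)[0]
        i, j = hidden.pop(idx)
        known_degree[i] = known_degree.get(i, 0) + 1
        known_degree[j] = known_degree.get(j, 0) + 1
        order.append((i, j))
    return order


def uniform(ki, kj):
    """Unbiased exploration: every hidden link is equally likely."""
    return 1.0


def biased(ki, kj):
    """Assumed biased form: weight grows with the known degrees."""
    return float(ki + kj)


if __name__ == "__main__":
    path = [(i, i + 1) for i in range(20)]
    print(discover(path, 5, biased, seed=1))
    print(discover(path, 5, uniform, seed=1))
```

Recording the known degree of each node after every step would give the per-protein growth curves needed to look for accelerated or constant discovery rates.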

### Community structure analysis

Communities in the networks were identified by optimizing the modularity *Q*, defined as:

\[ Q=\frac{1}{2m}{\displaystyle \sum_{ij}}\left[{a}_{ij}-\frac{d_i{d}_j}{2m}\right]\delta \left({c}_i,{c}_j\right) \]

where \( {c}_i \) is the community to which node *i* is assigned, \( m = \frac{1}{2}{\displaystyle \sum_{ij}}{a}_{ij} \), the *δ*-function *δ*(*u*, *v*) is 1 if *u* = *v* and 0 otherwise, \( {a}_{ij} \) denotes the element of the symmetric adjacency matrix *A* of the graph *G*, and \( {d}_i \), \( {d}_j \) are the degrees of nodes *i* and *j*, respectively. This unsupervised algorithm involves modularity optimization by local changes to communities and aggregation of communities to build new communities. As a result, the algorithm generates a hierarchy of community structures. In practice, a Python implementation of this algorithm named “python-louvain” was applied.
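The modularity score that the algorithm optimizes can be evaluated directly from its definition for a given partition. The sketch below is a plain-Python illustration for small unweighted, undirected graphs; it is not the python-louvain implementation, and the function name `modularity` and its arguments are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs without duplicates or self-loops.
    community: dict mapping each node to its community label.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    two_m = sum(degree.values())          # equals 2m
    total = 0.0
    for i in adjacency:
        for j in adjacency:
            if community[i] == community[j]:   # delta(c_i, c_j)
                a_ij = 1.0 if j in adjacency[i] else 0.0
                total += a_ij - degree[i] * degree[j] / two_m
    return total / two_m


if __name__ == "__main__":
    # two triangles joined by a single bridge edge
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    merged = {n: "a" for n in range(6)}
    print(modularity(edges, split), modularity(edges, merged))
```

For the two-triangle example, separating the triangles gives *Q* = 5/14, while placing all nodes in one community gives *Q* = 0, so the split partition is preferred.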

## Results

To examine these possibilities we compared the observed distribution of proteins with accelerated or decelerated rates to the distributions observed for random permutations of the same data (Fig. 2c-f). Similar null distributions were also examined by Tria et al. [16] in a completely different context. This analysis shows that significantly more proteins grow super-linearly than would be expected by chance. This is indicative of correlations in the discovery process of PPIs: discoveries involving particular proteins tend to arrive in bursts separated by short time intervals. To explore whether the correlated discovery of PPIs is a unique property of the low-content PPINs, we constructed mammalian and yeast PPINs by increasing the threshold for the maximum number of PPIs per publication from 10 to 50, to 100, to 1000, and finally with no threshold at all. Observing the distribution of the discovery intervals for PPIs, we see that after including the high-content studies, the distribution of intervals is similar to the distribution for randomly permuted data (Fig. 2g-h and Additional file 2: Figure S2). Interestingly, the entropy measure still shows a difference between randomly shuffled discoveries and networks discovered by low- and high-content methods combined. We believe that this may be an artifact of the sparse data from high-content PPIs, or a new type of bias within PPI data collected by high-content methods. For example, PPIs from mass-spectrometry proteomics are known to be biased toward detecting large, abundant or sticky proteins.

In principle, all parts of a PPIN are discoverable and a uniform exploration is theoretically possible. However, in practice, the discovery process appears to be correlated. In order to illuminate the dynamics of PPIN discovery we introduce a simple model. With reference to Kauffman’s “expanding the adjacent possible” [17], we explicitly incorporate the space of possibilities in the form of an underlying true network. We begin with a random uniform exploration process, and then, by modulating the probability of discovering links based on the already discovered network, we study the effect research focus biases can have on the dynamics of the network discovery process. A schematic representation of this model is shown in Additional file 3: Figure S3. Although the true PPIN is unknown, we can examine the effect of global network properties within this model.

Furthermore, we notice that accelerating nodes only occur in the models where the underlying networks have a power-law degree distribution (Additional file 8: Figure S8). This illustrates the relevance of the underlying network structure: the topology of the space of possibilities appears to have an impact on the discovery process. We note that the difference between the biased and unbiased models is not as marked as that seen in the real PPI discovery data (Additional file 8: Figure S8). Nevertheless, the comparison indicates that the discovery of the real networks contains biases.

## Discussion

By time-resolving the mammalian and yeast literature-based PPINs we identified a clear pattern in the PPI discovery process. This pattern is consistent with a biased discovery process which exhibits properties of reinforcement, whereby commonly studied proteins are more likely to be further studied in the near future, and of triggering, whereby discoveries spur related discoveries in the PPI network neighborhood. We introduced a model of PPI network discovery which supports the idea that research focus bias is relevant in the discovery process of mammalian and yeast PPIs. The model demonstrates that biased network discovery can explain the existence of many more proteins whose degree is accelerating compared with the number of such proteins in more random discovery processes. Such trends should be considered when reusing PPI data for the interpretation of new results, for drawing conclusions about the underlying biology, and for making decisions about the next set of experiments. A recent publication by Schnoes et al. [31] suggested that significant biases exist in the discovery of gene functional annotations, and that these have a significant effect on the interpretation of such annotations and their application to biological investigations. Here we extend this observation to the discovery of PPIs.

Our model of PPI network discovery also revealed that an underlying network with the scale-free property is necessary for the appearance of proteins with super-linear degree growth, which supports the hypothesis that the topology of the real PPINs is scale-free [25, 32, 33]. Interestingly, the local clustering of the underlying network does not seem to play a role in the emergence of biases during the discovery process. Notably, the observed bias is stronger in mammalian than in yeast PPINs in terms of the ratio of proteins with super-linear degree growth. One explanation for this is that the discovered mammalian PPIN is further from saturation than the yeast PPIN, which is supported by the estimated sizes of the human and yeast PPINs [24]. To explore whether the effects of research focus bias introduced in low-content studies can be reduced, we included PPIs from high-throughput studies. We observed that the overall reinforcement and triggering effects on the discovery process are mitigated. However, those effects can still be detected in the discovery of PPIs for many individual proteins (Additional file 2: Figure S2), suggesting that the inclusion of high-content studies helps to some extent to reduce the research focus bias in low-content PPINs.

## Conclusions

Recent studies demonstrate that experimental methods that identify many reliable PPIs in a single study show more uniform distribution of PPIs [3, 34]. However, current high cost, requirement for specific skills, and years of concentrated efforts, are still great obstacles toward making such profiling experiments more widely applied and accepted. In principle, the shift toward genome-wide system-level biology is expected to correct and better inform our current understanding of the real PPINs. In addition, the view of binary PPI is limited. It is now well established that most proteins within cells work as a part of macro-molecular complexes, and thus we expect that the in-silico reconstruction of such complexes will become more central, while less emphasis will be placed on the identification and reuse of binary PPIs. Nevertheless, methods that correct for research focus biases can potentially improve the use of such PPIN and pathway databases for their various computational applications.

## Declarations

### Acknowledgments

This research was supported by NIH grants: R01GM098316, U54CA189201 and U54HL127624 to AM.

## References

1. Cordeddu V, Di Schiavi E, Pennacchio LA, Ma’ayan A, Sarkozy A, Fodale V, et al. Mutation of SHOC2 promotes aberrant protein N-myristoylation and causes Noonan-like syndrome with loose anagen hair. Nat Genet. 2009;41(9):1022–6.
2. Lim J, Hao T, Shaw C, Patel AJ, Szabó G, Rual J-F, et al. A protein–protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell. 2006;125(4):801–14.
3. Vidal M, Cusick ME, Barabási A-L. Interactome networks and human disease. Cell. 2011;144(6):986–98.
4. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
5. Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein–protein interactions. J Med Genet. 2006;43(8):691–8.
6. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003;21(6):697–700.
7. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007;3(1):88.
8. Yu H, Paccanaro A, Trifonov V, Gerstein M. Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006;22(7):823–9.
9. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13.
10. Berger SI, Posner JM, Ma’ayan A. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinformatics. 2007;8(1):372.
11. Antonov AV, Dietmann S, Rodchenkov I, Mewes HW. PPI spider: a tool for the interpretation of proteomics data in the context of protein–protein interaction networks. Proteomics. 2009;9(10):2740–9.
12. Neale BM, Kou Y, Liu L, Ma’ayan A, Samocha KE, Sabo A, et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature. 2012;485(7397):242–5.
13. Jia P, Zheng S, Long J, Zheng W, Zhao Z. dmGWAS: dense module searching for genome-wide association studies in protein–protein interaction networks. Bioinformatics. 2011;27(1):95–102.
14. Califano A, Butte AJ, Friend S, Ideker T, Schadt E. Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat Genet. 2012;44(8):841–7.
15. Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH. Too many roads not taken. Nature. 2011;470(7333):163–5.
16. Tria F, Loreto V, Servedio VDP, Strogatz SH. The dynamics of correlated novelties. arXiv preprint arXiv:1310.1953. 2013.
17. Kauffman SA. Investigations: the nature of autonomous agents and the worlds they mutually create. In: Santa Fe Institute. 1996.
18. Johnson S. Where good ideas come from: the natural history of innovation. UK: Penguin; 2010.
19. Wagner A, Rosen W. Spaces of the possible: universal Darwinism and the wall between technological and biological innovation. J R Soc Interface. 2014;11(97):20131190.
20. Johnson NL, Kotz S. Urn models and their application: an approach to modern discrete probability theory. New York: Wiley; 1977.
21. Mahmoud H. Pólya urn models. CRC Press; 2008.
22. Pólya G. Sur quelques points de la théorie des probabilités. In: Annales de l'institut Henri Poincaré: 1930. Presses universitaires de France: 117–161.
23. Turner B, Razick S, Turinsky AL, Vlasblom J, Crowdy EK, Cho E, et al. iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database. 2010;2010:baq023.
24. Stumpf MP, Thorne T, de Silva E, Stewart R, An HJ, Lappe M, et al. Estimating the size of the human interactome. Proc Natl Acad Sci U S A. 2008;105(19):6959–64.
25. Barabási A-L, Albert R. Emergence of scaling in random networks. Science. 1999;286(5439):509–12.
26. Holme P, Kim BJ. Growing scale-free networks with tunable clustering. Phys Rev E. 2002;65(2):026107.
27. Ispolatov I, Krapivsky PL, Yuryev A. Duplication-divergence model of protein interaction network. Phys Rev E. 2005;71(6):061911.
28. Batagelj V, Brandes U. Efficient generation of large random networks. Phys Rev E. 2005;71(3):036113.
29. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.
30. Newman MEJ. Analysis of weighted networks. Phys Rev E. 2004;70(5):056131.
31. Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol. 2013;9(5):e1003063.
32. Barabási A-L, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13.
33. Han J-DJ, Dupuy D, Bertin N, Cusick ME, Vidal M. Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol. 2005;23(7):839–44.
34. Yu H, Tardivo L, Tam S, Weiner E, Gebreab F, Fan C, et al. Next-generation sequencing to generate interactome datasets. Nat Methods. 2011;8(6):478–80.

## Copyright

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.