### Exponential random graphs

Exponential Random Graph Models (ERGMs) are a means of generating an ensemble of networks with prescribed statistical properties, with the aim of modeling real-world networks. ERGMs were introduced by Holland and Leinhardt [9], have a long history in complex systems research, and are commonly applied in statistics and the social sciences. ERGMs have been used for modeling complex brain networks [10], but their use in computational systems biology is scant. Park and Newman [11] showed that ERGMs are the natural extension of the fundamentals of statistical mechanics to network modeling, and here we adopt this formalism. The set *G* represents the sample space of graphs in the model, and {*x*_{i}}, *i* = 1 … *r*, represents a set of *r* empirical observations. A probability distribution *P*(*G*) can then be defined over the elements *g* in *G* such that the expectation value of each *x*_{i} takes its empirical value,

$$\langle x_i \rangle \equiv \sum_{g \in G} P(g)\, x_i(g) = x_i, \qquad i = 1 \ldots r. \tag{1}$$

The best choice of probability distribution is the one which satisfies the empirical constraints while admitting no further information about the model graphs; this is achieved by maximizing the Gibbs entropy,

$$S = -\sum_{g \in G} P(g) \ln P(g).$$

This leads to a probability distribution which is the network equivalent of the Boltzmann distribution,

$$P(g) = \frac{e^{-H(g)}}{Z},$$

where *H*(*g*) is the graph Hamiltonian function with Lagrange multipliers *θ*_{i},

$$H(g) = \sum_{i=1}^{r} \theta_i\, x_i(g),$$

and *Z* is the partition function over the set of graphs *G*,

$$Z = \sum_{g \in G} e^{-H(g)}.$$

This probability distribution defines the exponential random graph model: an ensemble of networks which obey the mean constraints of Equation 1, but which are otherwise maximally disordered.
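As a concrete numerical illustration of this formalism (not taken from the paper), the sketch below enumerates every simple undirected graph on three vertices and uses a single hypothetical observable, the edge count *m*(*g*), so that the Hamiltonian reduces to *H*(*g*) = *θ m*(*g*). It computes the Boltzmann weights, the partition function *Z*, and the ensemble mean of the observable.

```python
from itertools import combinations, product
from math import exp

def ergm_edge_count(n, theta):
    """Exact ERGM over all simple undirected graphs on n vertices,
    with the single observable x(g) = number of edges, so that the
    Hamiltonian is H(g) = theta * x(g)."""
    pairs = list(combinations(range(n), 2))   # all possible edges
    graphs, weights = [], []
    for mask in product([0, 1], repeat=len(pairs)):
        m = sum(mask)                         # edge count of this graph
        graphs.append(mask)
        weights.append(exp(-theta * m))       # Boltzmann weight e^{-H(g)}
    Z = sum(weights)                          # partition function
    probs = [w / Z for w in weights]          # P(g) = e^{-H(g)} / Z
    mean_edges = sum(p * sum(g) for p, g in zip(probs, graphs))
    return probs, mean_edges

probs, mean_edges = ergm_edge_count(3, 0.0)
# theta = 0 gives the uniform distribution over all 8 graphs: <m> = 1.5
```

With this observable each edge is independently present with probability 1/(1 + e^θ), so the model coincides with an Erdős–Rényi random graph; richer observables introduce dependence between edges.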

### Inference approach

The inference problem addressed here is now phrased in terms of ERGMs. Given an unknown underlying network, *G*_{u}, with vertices, *V*, we consider a set, *C* = {*c*_{i}}, composed of many subsets of the network's vertices, such that *C* is a field over the set of vertices. The sets *c*_{i} consist of vertices which are empirically observed to be related in the network; this could be, for example, proteins identified in a mass-spectrometry proteomics pull-down, members of a cell signaling pathway, or co-authors of a publication. The central assumption is that each of the sets *c*_{i} identifies a locally connected subgraph of the underlying network. In these terms, the network inference problem we pose is thus: given the set of *N*_{C} subsets, *C*, and no other information, to what extent can we infer the underlying network *G*_{u}?

To give a specific example, we may consider the results from HT-IP/MS, after appropriate filtering, as identifying a locally connected region of the underlying human protein-protein interactome network [7]. We hypothesize that we can resolve the underlying PPI network given enough observations of sets of proteins identified in different pull-downs. In these terms, the proteins would be represented by the vertices *v*_{i}, the list of proteins identified in each pull-down experiment would correspond to one element of *C*, and the underlying PPI network would be represented by *G*_{u}. We are aware that we are searching for a static configuration of the network, whereas the underlying connectivity in complex systems, including PPI and gene-regulatory networks, is dynamic: it may change over time and under different conditions.

We define our observable graph functions *x*_{i}(*G*) in the following way,

$$x_i(G) = \begin{cases} 1, & \text{if the vertices of } c_i \text{ are locally connected in } G,\\ 0, & \text{otherwise.} \end{cases}$$

If we interpret each set as providing coarse local information on the connectivity of the underlying *G*_{u}, such that we have a confidence, *α*, that the elements in each line are locally connected, then the constraints on the ensemble are the following,

$$\langle x_i(G) \rangle = \alpha, \qquad i = 1 \ldots N_C,$$

in which case the maximum entropy probability distribution *P*(*G*) identifies an ensemble of networks which have a connectivity consistent with the known input set *C* but which are otherwise maximally disordered; we shall refer to this ensemble as *G*_{ens}. In our studies of the properties of our inference approach on synthetic networks, we shall generate data which identifies locally connected regions of the underlying network, and so we shall take the value *α* = 1. When we apply our inference approach to data from systems biology, the value of *α* will typically be less than unity, as there will generally be only a finite degree of confidence in the connectivity of each local region identified in the data.
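One possible implementation of such an indicator observable, under the assumption that "locally connected" means the vertices of *c*_{i} induce a connected subgraph (one plausible reading; connectivity through vertices outside the set would be a weaker alternative), is a breadth-first search restricted to the subset:

```python
from collections import deque

def subset_connected(edges, subset):
    """Indicator observable x_i(G): returns 1 if the vertices in `subset`
    induce a connected subgraph of G (given as a set of edge pairs),
    else 0."""
    subset = set(subset)
    adj = {v: set() for v in subset}
    for u, v in edges:                 # keep only edges inside the subset
        if u in subset and v in subset:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(subset))
    seen = {start}
    queue = deque([start])
    while queue:                       # breadth-first search from `start`
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return int(seen == subset)         # 1 iff BFS reached every vertex
```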

The above constraints leading to the probability distribution *P*(*G*) can also be seen in terms of a dependence graph, *D*, for the corresponding ERGM. Each set *c*_{i} contains vertices between which a set of edges is assumed to exist in the underlying network in order to form a locally connected subgraph. Over the ERGM ensemble, the presence of each of these edges is conditionally dependent upon the others. These edges form a complete subgraph of *D* and therefore, according to the Hammersley-Clifford theorem, define the probability distribution *P*(*G*).

In the approach presented here we make no attempt to infer directionality; hence, we take the sample set, *G*, to consist of simple undirected graphs. The data contained in *C* constitutes an accumulation of coarse local information on the connectivity of the underlying network. Even with large amounts of this class of data the network often remains underdetermined, so there exists a whole ensemble of networks which are consistent with the data. In order to infer finer details of the underlying network, we adopt some of the philosophy of network modeling: we take a sample of this ensemble and then use its properties for inference. Here we take the sample numerically, using an algorithm which generates a random sample of the ensemble. The algorithm ensures that each sampled network has the local connectivity consistent with the data while minimizing the number of edges. The edge minimization ensures the strongest signal from the data, in the sense that edges only appear in the ensemble as required by the data.

The algorithm works by generating a random sample, of size *N*_{g}, of the ensemble of networks *G*_{ens} consistent with the data. According to the assumptions of the inference, the GMT (gene matrix transposed) file contains a number *N*_{C} of lines, *c*_{i}, each of which consists of vertices which are locally connected in the underlying network. The algorithm first randomly permutes the order of the lines in the GMT file; then, starting with the empty graph, which consists of the set of nodes (all distinct entities included in the data) and the empty set of edges *E*_{i} = {}, it builds a network by taking each line and introducing a minimal number of random links that connect the vertices in that line. The pseudo-code for the algorithm is then:

For i = 1 to *N*_{g}

&nbsp;&nbsp;&nbsp;&nbsp;Randomly permute the order of the lines in the GMT file

&nbsp;&nbsp;&nbsp;&nbsp;*E*_{i} = {} (start with a graph with no links)

&nbsp;&nbsp;&nbsp;&nbsp;For j = 1 to *N*_{C}

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Randomly introduce a minimal number of edges between the vertices of *c*_{j} such that they are connected, appending each new edge to the set *E*_{i}

&nbsp;&nbsp;&nbsp;&nbsp;End For j

End For i

Calculate the mean adjacency matrix of the ensemble *G*_{ens}
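The loop above can be sketched in Python. The union-find bookkeeping below is an implementation detail not specified in the pseudo-code: it treats vertices already connected anywhere in the growing graph as connected (one reading of "minimal"), so each line only contributes the edges needed to merge its remaining components.

```python
import random
from collections import defaultdict

def sample_mean_adjacency(lines, n_samples, seed=0):
    """Sample n_samples minimal-edge networks consistent with the lines
    (lists of vertex labels) and return each edge's ensemble frequency."""
    rng = random.Random(seed)
    vertices = sorted({v for line in lines for v in line})
    counts = defaultdict(int)
    for _ in range(n_samples):
        parent = {v: v for v in vertices}      # union-find forest

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        edges = set()
        order = list(lines)
        rng.shuffle(order)                     # permute the GMT lines
        for line in order:
            comps = defaultdict(list)          # line vertices by component
            for v in line:
                comps[find(v)].append(v)
            groups = list(comps.values())
            rng.shuffle(groups)
            merged = groups[0]
            for g in groups[1:]:               # k groups -> k-1 new edges
                u, w = rng.choice(merged), rng.choice(g)
                edges.add(frozenset((u, w)))   # one random merging edge
                parent[find(w)] = find(u)
                merged = merged + g
        for e in edges:
            counts[e] += 1
    return {tuple(sorted(e)): c / n_samples for e, c in counts.items()}
```

Each returned frequency estimates the probability that the corresponding edge appears in a uniformly random draw from *G*_{ens}.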

A sample of random networks generated in this fashion constitutes a random sample of *G*_{ens}. The properties of this ensemble are then used to infer the underlying network *G*_{u}. Specifically, we calculate the mean adjacency matrix over the ensemble; each element corresponds to the probability that the edge is present in a uniformly random draw from the ensemble, and is interpreted as indicative of the accumulated evidence for the presence of that edge in the underlying network.
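The mean adjacency matrix can then be turned into a concrete edge prediction, for instance by ranking candidate edges by their ensemble frequency. The cutoff below is a hypothetical choice for illustration; the text only claims the frequencies are indicative of edge presence.

```python
def infer_edges(mean_adj, threshold=0.5):
    """Rank candidate edges by their ensemble frequency and keep those
    whose estimated probability meets a (hypothetical) cutoff.
    `mean_adj` maps edge tuples to frequencies in [0, 1]."""
    ranked = sorted(mean_adj.items(), key=lambda kv: kv[1], reverse=True)
    return [edge for edge, p in ranked if p >= threshold]

# Hypothetical frequencies from a sampled ensemble:
freqs = {(0, 1): 0.95, (1, 2): 0.40, (2, 3): 1.00}
print(infer_edges(freqs))  # prints [(2, 3), (0, 1)]
```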