Our method is to detect the key drivers which drive the diseases related networks in the microbial community. The key drivers can be microbial species or phylogenetic gene markers. For simplicity, we present our method with nodes as genes in the subsequent descriptions.

The detection of key drivers consists of following steps (see Fig. 1). First, we construct a MEN to represent the relationship between genes based on microbial abundance profiles and infer the weight of each edge. Second, we cluster the genes and partition the MEN into multiple subnetworks. Third, we analyze the phenotype variables and extract the delegated phenotype. By computing the associations between subnetworks and delegated phenotype, we obtain subnetworks that are most related to the disease. Last, based on PageRank, we identify actors with top influence over others in each subnetwork as key drivers.

### Inference method

In KDiamend, we provide two ways to compute distances between genes. The first one is PCC, which is the most popular way to capture similarities between genes. In addition, inspired from the inference of gene regulatory network in RNA analysis and mixture models in the microbiome analysis, we adopted the MMI and normalization processes in Context Likelihood of Relatedness(CLR) [29]. MI, which uses the mutual dependency and common uncertainty as for the measurement of connection between genes, does not assume linear, or continuous dependence like correlation [19, 30], so it can detect interactions which might be missed by PCC. MMI, under the assumption of the Gaussian mixture model, is for dealing with the high proportion of zero counts issue. The adopted CLR, which considers the context of the whole network and eliminates noises from the background, makes MMI more tolerant for noises when measuring the interactions.

At the beginning, we have an abundance matrix *E*, which contains abundance value of n genes in m samples. For each gene *i*, we have a vector of *X*_{
i
}=(*E*_{i1},*E*_{i2},…,*E*_{
im
}). The strength of the relationship between gene *i* and gene *j* can be measured by PCC:

$$ PCC(i,j)=\frac{cov(X_{i},X_{j})}{\sigma_{i},\sigma_{j} }, $$

(1)

where *c**o**v*(*X*_{
i
},*X*_{
j
})is the covariance of *X*_{
i
} and *X*_{
j
}, and *σ*_{
i
} is the standard deviation of *X*_{
i
}. The adjacency matrix A of network can be generated from *A*_{
ij
}=*P**C**C*(*i*,*j*). Then the distance between gene *i* and gene *j* can be interpreted as *D*_{
ij
}=1−|*P**C**C*(*i*,*j*)|.

Apart from PCC, we also implemented MMI [23]. First, we decomposed *X*_{
i
} into *c*_{
i
} bins as *X*_{i,1}, …, \(X_{i, c_{i}}\). In this case, *X*_{
i
} was distributed according to the following function:

$$ f_{X_{i}}(x)=\sum_{k=1}^{c_{i}}\pi_{i, k}g_{X_{i, k}}(x), $$

(2)

where \(g_{X_{i, k}}(x)\) (1≤*k*≤*c*_{
i
}) denotes the density function for *C*_{i,k}, and *π*_{i,k} is the proportion for each sample in *C*_{i,k}. As proved in former work [23], assuming that *X*_{i,k} fits in a Gaussian distribution, we can estimate the mutual information between *X*_{
i
} and *X*_{
j
} as:

$$ MMI(X_{i},X_{j})=MMI^{O}(X_{i},X_{j})+MMI^{I}(X_{i},X_{j}). $$

(3)

The “outer” MI, *M**M**I*^{O}(*X*_{
i
},*X*_{
j
}), captures discretized dependency, while the “inner” MI, *M**M**I*^{I}(*X*_{
i
},*X*_{
j
}), refers to the weighted aggregation of MI for each bin.

After computing MMI between all the genes and get a matrix *M*, we normalized the distance between gene *i* and gene *j* by: \(CLR(Z_{i},Z_{j}) = \sqrt {\left (Z_{i}^{2},Z_{j}^{2}\right)}\) where *Z*_{
i
} and *Z*_{
j
} are z-scores of *M*_{
ij
} taking *M*_{
i
} and *M*_{
j
} as background respectively [29]. Then we applied hierarchical clustering and partitioned the MEN into multiple subnetworks according to the distance between genes.

### Delegated phenotype

To best capture phenotype change in disease networks from healthy networks, we generated delegated phenotype by rotating Coordinates in the PC space of phenotype variable matrix *S*. Each row in *S* represents a sample while each column refers to a phenotype variable, such as gender, age, disease state, etc. If the properties are nonnumerical, we converted data into numbers before further analysis. Suppose one column *v* in *S* indicates whether each sample is collected from a person with or without this disease. That is *v*=(*v*_{1},*v*_{2},…,*v*_{
k
},…,*v*_{
m
}),*v*_{
k
}∈(*Y*,*N*),1<*k*<*m* where *m* is the number of samples, Y indicates that this sample is collected from a person with the disease, and N means not.

We applied principal component analysis (PCA) to conclude *S*. We consider the first two principal components (PCs) to be enough for explaining disease variability, since the number of phenotype variables is relatively small in our test data. When phenotype data are more complicated, we may need extra analysis to decide the number of PCs we use to conclude delegated phenotype. We investigated the first two PCs and plotted samples in the coordinate of PC1 and PC2, regarding every sample as a point. Consequently, we got m points and each point refers to one sample. The coordinates for point k, is expressed as (*x*_{
k
},*y*_{
k
}). We rotated *PC* to *P**C*^{′} and make sure it has the largest correlation with *v* upon rotation. *P**C*^{′} can best explain variability related to the disease. The angle of rotation is the *θ* that maximize *f*(*θ*).

$$\begin{array}{*{20}l} {}f(\theta)\! =\!\! \sum_{k, v_{k} \in N}(x_{k} cos \theta_{k} + y_{i} sin \theta_{k}) \,-\, \sum_{k, v_{k} \in Y}(x_{k} cos \theta_{k} + y_{k} sin \theta_{k}) \end{array} $$

(4)

That is to say, the angle between *P**C*1 and *P**C*1^{′} is:

$$\begin{array}{*{20}l} \theta=arctan \frac{\sum_{k, v_{k} \in N}x_{k}-\sum_{k, v_{k} \in Y} x_{k}}{\sum_{k, v_{k} \in N}y_{k}-\sum_{k, v_{k} \in Y} y_{k}} \end{array} $$

(5)

For example (see delegated phenotype in Fig. 1), blue points represent *v*_{
k
}=*Y* and red points represent *v*_{
k
}=*N*. By calculating *θ*, we got the line which implies the direction most correlated to the disease state.

We acquired delegated phenotype *P**C*^{′}, which has the largest correlation with the disease state and outperforms other single variables on explaining the variability of phenotype information at the meantime. For every subnetwork, we concluded the abundance pattern of genes as eigengenes [31]. Then we bridged the subnetworks to phenotype information by calculating the correlation between eigengenes and *P**C*^{′}. Subnetworks which have strong relationships to *P**C*^{′} are extracted as disease-relevant subnetworks.

### Identifying Key driver

Last, in every extracted disease relevant subnetwork, we applied network topology analysis and assigned every gene a PageRank score. PageRank (PR) ranks the nodes in a graph according to the structures of links with others and is used by Google’s search engine to compute rankings of websites. In this algorithm, the score for one node can be affected by its neighbors [32] and if one’s neighbors have high scores, its score increases iteratively [33].

As stated in [34], letting *F*_{
u
} be the nodes linking to the node, *B*_{
u
} be the nodes linked from it, and *N*_{
u
}=|*F*_{
u
}| be the magnitude of *F*_{
u
}. Besides, considering there might be other factors towards the ranking, let *E*(*u*) be the vector concerned with some of the rank. Then, the PageRank of the node is defined as.

$$ R(u)=c\sum_{\upsilon \in B_{u}} \frac{R(\upsilon)}{N_{\upsilon} }+cE(u) $$

(6)