With the development of high-throughput technologies, large amounts of metagenomic data have been produced, especially through sequencing of the 16S ribosomal RNA gene, used as a proxy for taxa abundances in a microbial community. These data have shown how gastrointestinal (GI) microbes respond and adapt to different situations [1], how alterations of the microbial community affect the development and functioning of the immune and metabolic systems [2], and, globally, how departures from homeostasis (dysbiosis, as opposed to eubiosis) in this compartment are predictive of disease. Typical approaches to analyzing these data evaluate the α-diversity of Operational Taxonomic Units (OTUs, computational proxies for species) within each sample, using the Shannon [3] and Simpson [4] indices, to understand the microbial population structure. The rationale is that greater variability offers a larger spectrum of microbial molecular functions and hence of responses to environmental variations [5]; conversely, the criterion is supported by the limited α-diversity observed in inflammatory bowel disease [6] and obesity [7].
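As an illustration of the α-diversity measures mentioned above, the following minimal sketch computes the Shannon index, H = -Σ p_i ln p_i, and the Gini-Simpson form of the Simpson index, 1 - Σ p_i², from the OTU counts of a single sample. The function names and the choice of the natural logarithm and of the Gini-Simpson variant are assumptions made for this example; they are not taken from any specific package.

```python
import math
from typing import List, Sequence


def relative_abundances(counts: Sequence[float]) -> List[float]:
    """Convert raw OTU counts of one sample into proportions (zeros dropped)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample contains no reads")
    return [c / total for c in counts if c > 0]


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon index H = -sum(p_i * ln p_i); natural log is an assumption here."""
    return -sum(p * math.log(p) for p in relative_abundances(counts))


def gini_simpson_index(counts: Sequence[float]) -> float:
    """Gini-Simpson index 1 - sum(p_i^2), one common variant of the Simpson index."""
    return 1.0 - sum(p * p for p in relative_abundances(counts))
```

For a perfectly even sample of four OTUs, such as `[25, 25, 25, 25]`, the Shannon index equals ln 4 and the Gini-Simpson index equals 0.75; a sample dominated by a single OTU scores lower on both.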
Along the same lines, an imbalance in the physiologic abundances of Bacteroides and Firmicutes is regarded as a measure of the inflammatory state of the system and a proxy for dysbiosis, owing to the relative increase of facultative anaerobic microbes able to exploit byproducts of the host inflammatory processes [8].
From a different perspective, differential analyses compute microbial variations and highlight OTUs whose abundances change significantly between two conditions. This is followed by annotation of OTUs to taxa and a manual search for known organisms whose functions within the host environment help to shed light, for example, on the mechanisms that trigger or sustain a disease.
Worldwide, large efforts are ongoing to complete the taxonomy of mammalian microbes, with a particular focus on their effects on health and disease (Human Microbiome Project, HMP), in synergy with metatranscriptomic and metaproteomic analyses to elucidate functional information [9]. Nevertheless, little is known to date. As a result, despite the possibility of screening GI microbiomes at relatively low cost and with minimal invasiveness, it remains difficult to gain a global understanding of the beneficial or deleterious effect of a condition, because interpretation is limited to the bacteria whose functions are known. This leaves unaddressed, for example, the impact of a novel therapy on the GI tract and, in the long run, on the immune and metabolic systems.
While awaiting a more complete characterization of the bacteria in the human GI microbiome, we propose to add a layer of interpretation by quantifying, in statistical terms, the variation in pathogen composition with respect to a baseline. This provides an informed basis for further screening of specific strains.
In fact, microbiology has accumulated a remarkable amount of information on harmful bacteria. Beyond the long and well known Mycobacterium tuberculosis [10], more recent findings have shown that previously unsuspected noncommunicable diseases are also affected by bacterial alterations. This has led to the characterization of Porphyromonas gingivalis [11] in the mouth microbiome and Prevotella copri [12] in the GI microbiome as drivers of rheumatoid arthritis (RA), while, conversely, Lactobacilli-rich food has been reported to improve RA symptoms [13].
As a result, it is possible to define bacteria as harmful when explicitly associated with a disease, or harmless (rather than beneficial, from a conservative perspective) otherwise. Such information is not yet centralized, and we here offer a first curated database of this type of classification (part of the eudysbiome package, also provided as Additional file 1: Table S1 for convenience).
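To make the harmful/harmless idea concrete, the sketch below tallies abundance changes with respect to a baseline into a 2×2 table (harmful or harmless versus increased or decreased) and applies a two-sided Fisher's exact test to it. This is only an illustration under stated assumptions and not the eudysbiome implementation: the set `HARMFUL_GENERA` is a hypothetical lookup built from the species named in the text rather than the curated table, all function names are invented for this example, and treating read-count differences as independent observations is a simplification.

```python
from math import comb
from typing import Dict, List

# Hypothetical lookup (assumption): genera of the disease-associated species
# named in the text; a real analysis would use a curated annotation table.
HARMFUL_GENERA = {"Mycobacterium", "Porphyromonas", "Prevotella"}


def classify(genus: str) -> str:
    """Conservative rule: harmful if explicitly flagged, harmless otherwise."""
    return "harmful" if genus in HARMFUL_GENERA else "harmless"


def contingency(baseline: Dict[str, int],
                condition: Dict[str, int]) -> Dict[str, List[int]]:
    """Sum count differences into [increased, decreased] per class."""
    table = {"harmful": [0, 0], "harmless": [0, 0]}
    for genus in set(baseline) | set(condition):
        delta = condition.get(genus, 0) - baseline.get(genus, 0)
        if delta > 0:
            table[classify(genus)][0] += delta
        elif delta < 0:
            table[classify(genus)][1] += -delta
    return table


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    low, high = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

With `table = contingency(baseline, condition)`, a call such as `fisher_exact_two_sided(*table["harmful"], *table["harmless"])` then asks whether increases are distributed differently between harmful and harmless genera than decreases are.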
This approach addresses two current gaps: on the one hand, the lack of efficient and automated use of pathogenic potential information; on the other, the lack of a genus annotation strategy capable of compensating for the paucity of information available at the OTU level. Namely, we address these issues by (i) centralizing the available pathogenic annotation resources and (ii) devising a definition of pathogenic genera, both implemented in a statistical pipeline available as a Bioconductor package offering tabular and graphical output.
Two words of caution are in order for the usage of this approach. First, to offer the most detailed annotation we rely on OTUs/species (see Methods); this, however, implies a number of unknown/unannotated elements, which are discarded from further analyses to avoid bias in the results. Second, the abundance of pathogens must be put into context. For example, healthy and long-lived hunter-gatherer populations are characterized by GI microbiomes with higher α-diversity than urban populations [14], and this diversity includes numerous pathogens; however, when comparing the effects of treatments on a clinically uniform set of patients, an increased abundance of pathogens represents an added risk of comorbidity in individuals whose general health is already debilitated. As in any omic analysis, it is recommended to follow up such global harmless/harmful trends by manual investigation of the emerging strains (as is done, for example, in transcriptomics with the manual inspection of the genes identified in a statistically significant Gene Ontology biological function).
Globally, this approach should be considered integrative and complementary to the existing ones, shedding additional light on the effects of maladies, treatments and other external inputs on the host-microbiome supra-organism. To illustrate the usability and informativeness of this approach, we apply it to the analysis of the GI microbiome of patients affected by rheumatoid arthritis (RA), a model for chronic, inflammatory and autoimmune diseases that is spreading at a very fast pace and whose microbial composition is continuously being unveiled. Given its incidence (1 % worldwide) and its exemplary characteristics (model disease), our results represent not only an important example of application but also meaningful findings per se.