Construction of a high-quality 16S rRNA gene sequence reference database including uncultured taxa
A high-quality 16S rRNA gene sequence reference database that includes uncultured taxa was constructed using 16S rRNA gene sequences obtained from the Ribosomal Database Project (RDP) [12]. The 1,345,732 16S rRNA gene sequences of Bacteria and Archaea in RDP (release 11, update 2) were retrieved using the following parameters: Strain = Both, Source = Both, Size ≥1200, and Quality = Good. We obtained these sequences with three different file formats: multiply aligned fasta files, unaligned fasta files, and GenBank files. To eliminate contaminating tRNA gene sequences, internal transcribed spacer sequences (ITSs), and 23S rRNA gene sequences from this reference database, we performed a search of the 1,345,732 sequences with BLAST+ [19] (version 2.2.27; −max_target_seqs 100 and -parc_identity 97) against the previous VITCOMIC 16S rRNA gene sequence reference database [20]. This previous database contains only 16S rRNA gene sequences for which the 5′-end was adjusted to six bases before the 8F primer motif sequence (TTGATCCT) [21] and the 3′-end, adjusted to the end of the anti-Shine–Dalgarno sequence (CACCTCCTTN) [22]; therefore, the 5′- and 3′-end trimming of each sequence for optimization of VITCOMIC2 was easily performed using the sequence identity search results. After trimming the 1,345,732 16S rRNA gene sequences, we eliminated (i) 476 sequences for which the length was abnormally long (> 1660 bp) or short (< 1120 bp) according to the distribution of sequence length (Additional file 1), and (ii) 1159 sequences that contained ≥10 bases of homopolymer.
Because the RDP database contains many nearly identical sequences, we performed genus-level sequence clustering using USEARCH [23] version 6.0.307 (−-cluster_smallmem --id 0.94, −-query_cov 0.9, and --target_cov 0.9). Since USEARCH/UCLUST version 6 uses a greedy algorithm, the order of input sequences is very important [23]. Therefore, before conducting sequence clustering, we have sorted sequences whose origins at the genus-level taxonomy were not designated as “unclassified” by their sequence lengths. Then, “unclassified” sequences were sorted by their sequence lengths, and added on the end of the sequence file. By sorting sequences as described above, in most cases, we can avoid the “unclassified” sequences in the RDP database become a representative sequence of a cluster. The resulting 63,956 representative sequences of clusters were examined to determine whether any were PCR chimeric sequences using UCHIME [24] (version 6.0.307 de novo mode and reference mode). In the UCHIME reference mode, gold.fa sequence data [25] were used as the reference. We discarded the 706 representative sequences that were flagged by both the de novo and reference modes of UCHIME as chimeric sequences. In addition, we discarded the 34,273 representative sequences whose origins at the genus-level taxonomy were designated as “unclassified”. Consequently, we obtained a database that consisted of 28,977 high-quality 16S rRNA gene sequences (Additional file 2 and Additional file 3).
Allocation of references in VITCOMIC2
In the previous version of VITCOMIC, VITCOMIC visualized overall phylogenetic composition of a sample in one circular diagram using 16S rRNA gene sequences of genome sequenced prokaryote species as reference [7]. However, the number of genome sequenced prokaryote species are rapidly grown. In addition, the uncultured taxa whose genomes are not sequenced yet are also important to describe microbial communities. To include many reference sequences in VITCOMIC2, we need to separate a VITCOMIC circular diagram to two types of circular diagrams representing overall phylogenetic composition of a microbial community based on phylum composition and detailed phylogenetic composition in each phylum of the community based on genus composition. Since the separation of phylum and genus composition in different circular diagrams often cause difficulty to understand overall phylogenetic composition of a microbial community, we completely replaced a Perl and PostScript based static circular diagram in VITCOMIC to JavaScript Raphael library based interactive circular diagrams in VITCOMIC2.
VITCOMIC2 constructs a circular diagram representing overall phylogenetic composition of a microbial community based on phylum composition with the following procedure. (i) In each phylum, 30 sequences were randomly chosen from the high-quality 16S rRNA gene sequence database. As the phylum Proteobacteria is extraordinarily diverse [26], we randomly chose 30 sequences from every class of Proteobacteria. For any phylum having < 30 sequences, all available sequences were chosen. (ii) We thus obtained 1184 sequences from 44 phyla/classes. To calculate phylogenetic distances among them, the 1184 sequences were aligned using MAFFT [27] version 6.864b with default parameters. After constructing a multiple sequence alignment, genetic distances between sequences were calculated using the dnadist program in PHYLIP [28] version 3.69 with Kimura’s two-parameter model of base substitution [29]. The phylogenetic tree was constructed using the neighbor-joining method in the neighbor program of PHYLIP version 3.69. (iii) To construct the circular diagram from the phylogenetic tree, we randomly chose one sequence from Gammaproteobacteria to use as the initial sequence for subsequent tree topology scanning. The tree topology scanning is our concise method to represent tree topology (distance and cladogram) information in one line as follows. The sequence that was most distantly related to the initial sequence was identified by comparing the phylogenetic distances between the initial sequence and every other sequence (Fig. 1). The second most distantly related sequence was identified in the peripheral clade that contained the most unrelated sequence, and this process was reiterated to generate an ordered list of related sequences. (iv) The position on the circular diagram of each sequence in the list was determined based on its phylogenetic distance to its nearest relative. (v) Each phylum/class position in the circular diagram was determined by calculating an average position among sequences that belong to that phylum/class.
To understand the detailed taxonomic composition in each phylum, VITCOMIC2 also constructs a circular diagram that represents the genus composition of the sample as described by the following example. (i) The phylogenetic tree within each of the 44 phyla/classes was constructed by using all sequences belonging to each phylum/class in the high-quality 16S rRNA gene sequence database, i.e., the same method that was used for the phylum example. (ii) The same method was again used to construct the circular diagram within each of the 44 phyla/classes with a random choice of one sequence to serve as the initial sequence. If all the sequences derived from the same genus belonged to a single cluster, then the genus position in the circular diagram was the cluster position. Conversely, when the sequences derived from the same genus belonged to more than two clusters, the genus position in the circular diagram was determined by calculating an average position among the clusters.
A rapid sequence identity search using graphics processing units (GPUs)
The overall workflow of VITCOMIC2 for processing data from user-uploaded 16S rRNA gene amplicon sequencing and metagenomic shotgun sequencing is described in Fig. 2. The inference of the nearest relative for each query sequence in VITCOMIC2 was performed with a modified version of CLAST [30], a GPU-based high-speed nucleotide sequence identity search tool that we developed. We made the following modifications to CLAST to enable us to specifically search 16S rRNA gene sequences. (i) We completely separated the processes of database indexing and alignment calculation. Because VITCOMIC2 does not update the reference sequence database frequently, pre-indexing the reference database is more efficient for daily calculations. (ii) We changed the method used to estimate the amount of GPU memory consumed. In the original CLAST program, total nucleotide lengths of sequences were used to estimate the amount of the GPU memory consumed. In the modified version of CLAST for VITCOMIC2, referred to as CLASTV, the total number of sequences was used because there was little variation in the lengths of the query or database sequences. We compared the calculation speed between VITCOMIC2 and VITCOMIC by using 1,119,519 16S rRNA gene sequences sequenced by the 454 GS FLX sequencing platform obtained from Turnbaugh et al. [31] as query sequences. For VITCOMIC we used one process of BLAST+ with the parameters (−evalue 1e-08 and -num_alignments 100) and for VITCOMIC2 we used one process of CLASTV with the thresholds (identity ≥80% and alignment length ≥ 50 bp). The speed test was carried out on a machine with a Xeon X5670 2.93 GHz CPU and a Tesla C2050 GPU.
Normalizing differences in 16S rRNA gene copy number among taxa
Although some microbial genomes contain only one copy of the 16S rRNA gene, most microbes contain two or more copies [13]. Therefore, the taxonomic abundance of a microbial community directly inferred from 16S rRNA gene sequence abundance is generally biased because of differences in 16S rRNA gene copy number among taxa [32]. To avoid this bias in VITCOMIC2, we implemented the following method to normalize 16S rRNA gene copy number differences among taxa. Previous work suggested that phylogenetically cross-related taxa tend to have a similar number of 16S rRNA gene copies in the genome [33]. To obtain 16S rRNA gene copy number information for the phylogenetically related taxa of the user sequence, VITCOMIC2 conducts a sequence identity search using CLASTV against a 16S rRNA gene copy number reference database (16S rRNA-CN DB) of the genome-sequenced strains, which was constructed as follows. (i) The 16S rRNA gene sequences for each genome-sequenced strain were identified using RNAmmer [34] against the genomic sequence obtained from the NCBI Genome Database [35] in April 2014. (ii) One 16S rRNA gene sequence was randomly sampled per species because there are only small sequence differences among 16S rRNA genes within a single genome and across genomes within a single species [13]. We obtained 1505 16S rRNA gene sequences, which thus comprised the 16S rRNA-CN DB (Additional file 4). The sequence most related to the user sequence was then identified based on the results of a sequence identity search against the 16S rRNA-CN DB. VITCOMIC2 assumes that the 16S rRNA gene copy number is identical between the genomes of the user sequence and the nearest sequence in the 16S rRNA-CN DB.
We used a leave-one-out approach to validate the accuracy of the VITCOMIC2 16S rRNA gene copy number inference method. To identify the most related sequence, we randomly chose one sequence from the 16S rRNA-CN DB and conducted a sequence identity search with BLAST+ (−max_target_seqs 100) against this database (lacking the query sequence). Comparing the 16S rRNA gene copy number between the genomes of the query sequence and the nearest sequence, we calculated the error ratio for inferring the 16S rRNA gene copy number for all 1505 sequences from the 16S rRNA-CN DB.
Taxonomic composition analysis using the 16S rRNA gene sequences extracted from metagenomic shotgun sequencing data
To evaluate the specificity of the VITCOMIC2 extraction method for 16S rRNA gene sequences from metagenomic shotgun sequencing data, we compared the VITCOMIC2 method with a common extraction method using BLAST+ against the original RDP sequence database. Because we needed to eliminate the effect of different sensitivities between the two databases (i.e., the effect of different taxonomic coverage between the two databases on this comparison), we designated the 1,345,256 sequences in the VITCOMIC2 16S rRNA gene sequence reference database before conducting homopolymer filtering and 94% identity clustering as the reference sequence database. For comparison, we used soil metagenomic shotgun sequencing data (DRR001464) [18], as sequenced by the Illumina Genome Analyzer IIx sequencer. For the alternative sequence extraction method, the 47,195,934 sequences from DRR001464 were searched against the original RDP sequence database using BLAST+ with E-value <1e-8. The sequences extracted from the VITCOMIC2-based method and the original RDP sequence database-based method were searched against the NCBI-nt database [36] obtained as of July 2014 using BLAST+ with E-value <1e-4; each hit was manually verified with regard to gene function to evaluate whether the specific extracted sequence was indeed a 16S rRNA gene.
Validation of accuracy of the genus assignments and the 16S rRNA gene copy number normalization by using mock community metagenomic data
We used the Illumina HiSeq metagenomic sequencing data of Singer et al. [37] (read length = 150 bp) of a mock community of 26 genome-sequenced strains (23 genera) for the validation of accuracy of the genus assignments and the 16S rRNA gene copy number normalization.
The molarity of the DNA of each species in the mock community was previously measured [37]. In addition, the 16S rRNA gene copy number in the genome of each species is described in the RefSeq genome sequence database [36]. Therefore, the theoretical abundance of 16S rRNA gene sequences from each species in the mock community metagenomic sequencing data was calculated by multiplying the molarity of the DNA of each species and the 16S rRNA gene copy number in the genome. We compared the genus compositions of communities resulting from (i) the theoretical composition based on the molarity of the DNA of each species, (ii) the theoretical composition based on multiplying the molarity of the DNA of each species and the 16S rRNA gene copy number of the genome, (iii) 16S rRNA gene sequence-based composition calculated by VITCOMIC2, and (iv) 16S rRNA gene sequence-based composition calculated by VITCOMIC2 with 16S rRNA gene copy number normalization.
In addition, we compared genus composition inference accuracies of the Singer et al. mock community sequencing data among VITCOMIC2 and three other software (MAPSeq version 1.2 [38], SortMeRNA version 2.1b [39], and RiboTagger version 0.8.1 [40]). To separate the reference sequence databases differences and the genus composition inference accuracies of tools, we used the VITCOMIC2 high-quality 16S rRNA gene reference sequence database as a reference sequence database for MAPSeq and SortMeRNA. In the VITCOMIC2, MAPSeq, and SortMeRNA results, we only used hits with the threshold (identity ≥94%, alignment length ≥ 100 bp, and top hit) for genus assignments. In the RiboTagger analysis, we specified a parameter “-r v4”.
Species inference based on the nearly full-length 16S rRNA gene sequence reconstructed from metagenomic shotgun sequencing data
The ecological niches of taxa are sometimes different among species in the same genera [41]. However, the partial 16S rRNA gene sequences obtained from the metagenomic or amplicon sequencing analyses by the short read sequencers usually do not contain enough information for species identification [42]. Therefore, we developed a function to reconstruct nearly full-length 16S rRNA gene sequences from metagenomic shotgun sequencing data in VITCOMIC2. The reconstruction workflow is as follows. (i) Based on CLASTV results in VITCOMIC2, 16S rRNA gene reads from metagenomic shotgun sequencing data that are assigned to the same reference sequence are extracted and assembled by MEGAHIT version 1.1.1 with default parameters. (ii) After contig length (≥800 bp) and read coverage (≥10) filtering, nearly full-length 16S rRNA gene sequences are obtained. (iii) BLAST-based sequence identity searches are conducted using the nearly full-length 16S rRNA gene sequences against the type strain 16S rRNA gene sequences in the VITCOMIC2 16S rRNA gene sequence reference database. (iv) The names of species are assigned to the nearly full-length 16S rRNA gene sequences when ≥97% sequence identities are observed in the BLAST-based sequence identity searches from (iii). The accuracy of the above species inference method was evaluated by using Illumina HiSeq metagenomic data of the mock community [37].
VITCOMIC2 web system
The VITCOMIC2 web server [43] can accept either a fasta or a fastq file of one 16S amplicon or metagenomic shotgun sequencing sample with read length ≥ 100 bp as the query. Users can choose whether to include 16S rRNA gene copy number normalization and nearly full-length 16S rRNA gene sequence reconstruction. After uploading an input file, VITCOMIC2 conducts analyses as described in Fig. 2. (i) VITCOMIC2 conducts a CLASTV-based sequence identity search against the VITCOMIC2 high-quality 16S rRNA gene reference sequence database with the threshold (identity ≥80% and alignment length ≥ 100 bp, and top hit). (ii) To exclude rRNA gene sequences of eukaryotes and mitochondria, VITCOMIC2 also conducts CLASTV-based sequence identity searches against the 18S rRNA gene and rRNA gene sequences of eukaryotes and mitochondria obtained from Comparative RNA Web Site version 2 [44] with the same threshold. (iii) If users didn’t choose to conduct 16S rRNA gene copy number normalizations and the nearly full-length 16S rRNA gene sequence reconstructions, a genus composition tab-separated text file by setting identity threshold ≥94% and VITCOMIC2 phylum and genus composition figures are created by using information of CLASTV hits which were in (i)'s analysis but were not in (ii)'s analysis. (iv) In case of conducting 16S rRNA gene copy number normalizations, VITCOMIC2 conducts a CLASTV-based sequence identity search against the VITCOMIC2 16S rRNA-CN DB after (i) and (ii) analyses. Using the VITCOMIC2 16S rRNA-CN DB information, each hit in (i)'s analysis is normalized as described above section. After that, a normalized genus composition tab-separated text file and VITCOMIC2 phylum and genus composition figures are created. (v) In case of conducting nearly full-length 16S rRNA gene sequence reconstructions from metagenomic shotgun sequencing data, MEGAHIT assembling and BLAST-based sequence identity search against the type strain 16S rRNA gene sequences in VITCOMIC2 reference sequence database are performed as described above section. When the VITCOMIC2 analysis is finished, VITCOMIC2 sends an Email to the user with a download link to the results which include a VITCOMIC2 visualization result, a genus composition text file, reconstructed nearly full-length 16S rRNA gene sequences fasta files and species assignment result text files, and a text file for the VITCOMIC2 Comparison. Users can conduct multiple-sample comparisons by uploading VITCOMIC2 analysis result files for selected samples to the VITCOMIC2 Comparison web server [45].