Parallel-META: efficient metagenomic data analysis based on high-performance computation
© Su et al.; licensee BioMed Central Ltd. 2012
Published: 16 July 2012
Skip to main content
© Su et al.; licensee BioMed Central Ltd. 2012
Published: 16 July 2012
Metagenomics method directly sequences and analyses genome information from microbial communities. There are usually more than hundreds of genomes from different microbial species in the same community, and the main computational tasks for metagenomic data analyses include taxonomical and functional component examination of all genomes in the microbial community. Metagenomic data analysis is both data- and computation- intensive, which requires extensive computational power. Most of the current metagenomic data analysis softwares were designed to be used on a single computer or single computer clusters, which could not match with the fast increasing number of large metagenomic projects' computational requirements. Therefore, advanced computational methods and pipelines have to be developed to cope with such need for efficient analyses.
In this paper, we proposed Parallel-META, a GPU- and multi-core-CPU-based open-source pipeline for metagenomic data analysis, which enabled the efficient and parallel analysis of multiple metagenomic datasets and the visualization of the results for multiple samples. In Parallel-META, the similarity-based database search was parallelized based on GPU computing and multi-core CPU computing optimization. Experiments have shown that Parallel-META has at least 15 times speed-up compared to traditional metagenomic data analysis method, with the same accuracy of the results http://www.computationalbioenergy.org/parallel-meta.html.
The parallel processing of current metagenomic data would be very promising: with current speed up of 15 times and above, binning would not be a very time-consuming process any more. Therefore, some deeper analysis of the metagenomic data, such as the comparison of different samples, would be feasible in the pipeline, and some of these functionalities have been included into the Parallel-META pipeline.
The total number of microbial cells on earth is huge: approximate estimation of them is 1030 , and the genomes of these vastly unknown communities of microbes might contain a large number of novel genes with useful functions. However, more than 99% of microbe species were unknown and un-cultivable , making traditional isolation and cultivation process non-applicable. Analysis of their metagenomic data is the direct and efficient way to analyse all microbes in the community . The metagenomic approach has made it possible better understanding of microbial diversity as well as their functions. And the broad applications of metagenomic research, including environmental sciences, bioenergy research and human health, have made it an increasingly popular research area.
Metagenomics researches were based on sequencing data from 16S rRNA amplicon, or large-scale shot-gun whole-genome metagenomic sequencing. Early 16S rRNA-based metagenomic survey of microbial communities focused on 16S ribosomal RNA sequences which are relatively short, often conserved within a species, and different between species. The16S rRNA-based metagenomic survey has already produced data for analysis of microbial communities of Sargasso Sea , acid mine drainage biofilm  and human gut microbiome . Facilitated with Next-Generation-Sequencing (NGS) techniques , current metagenomic research has been advanced rapidly. NGS techniques could produce millions of reads at very high speed with relatively low price, thus it enables sequencing at much greater depth. Based on NGS techniques and high performance computational analysis methods, many large-scale metagenomic research projects have been conducted , thus made the large-scale metagenomic research the mainstream. In this paper, we were focusing on data analysis for shot-gun whole-genome metagenomic sequencing, in which computational methods play very important roles, especially the similarity-based database search.
The primary goal of metagenomic research is the assessment of taxonomic and functional diversity of microbial communities. Based on NGS data, metagenomic data analysis is both data- and computing-intensive. Therefore, high-performance computing is needed for metagenomic data analysis, especially for projects involving many metagenomic samples.
Traditional high performance computing platform only use CPU cluster. For high computing speed, CPU computing platform always has large amount of high performance CPUs, which also accompanied with high cost and high power consumption. However, with the increase of data size, it becomes more and more difficult for current CPU cluster to satisfy the requirement of the fast-developing metagenomic research. The computing speed of metagenomic data analysis would be accelerated significantly by the combination of GPU computing and parallel CPU computing. For GPU computing, the GPGPU(General Purpose Graphic Process Unit) hardware and CUDA(Compute Unified Device Architecture) software would be the method of choice. CUDA is a massive parallel computing architecture model. Based on nVIDIA (Santa Clara, CA) GPGPUs and SIMT(Single Instruction Multiple Threads), it enables dramatic increases in computing performance by parallel computing with huge number of stream processors. For parallel CPU computing, multi-core CPU could be utilized by implementation of multi-threaded parallel programming.
Large databases of reference sequences, such as Greengenes , SILVA  and RDP  already exist for metagenomic sequence analysis. As most of the microbial communities are still unknown, these databases are also updating frequently. For computational analysis of metagenomic data, the most important tasks include taxonomic and functional analyses. A crucial step in the taxonomic analysis of large-scale metagenomic data is "binning", in which the metagenomic sequences were assigned to phylogenetic groups according to their taxonomic origins at different resolutions: from "kingdom" to "genus" level. There are two categories of binning methods: similarity-based methods that align reads to reference databases, and composition-based methods that use composition patterns (GC content, k-mer frequency, etc.) to cluster reads. The similarity-based methods classify sequences based on sequence homology, which is determined by reference database searches using general purpose alignment tools such as BLAST . The most frequently used similarity-based metagenomic data binning methods include MEGAN , CARMA  and Sort-ITEM . Most of these software could be used on PC workstation. However, similarity-based methods rely on reference databases that contain sequences of known genomes, so these methods cannot classify the majority of sequences that were from unknown genomes without close references. In contrast, composition-based methods analyse intrinsic sequence features such as GC content, codon usage and k-mer frequency, and compare these features with reference genome sequence of known taxonomic origins. The most frequently used composition-based metagenomic data binning methods include TETRA  and PhyloPhythia . Recently, some all-in-one metagenomic data analysis pipelines were introduced, such as Phyloshop  and QIIME . The web-based metagenomic annotation platforms, such as MG-RAST  and CAMERA  were also designed to analyse metagenomic data. However, the increasing number of metagenome data analysis projects needs more and more computational power, which become an increasingly large huddle for the efficient process of metagenome datasets by current pipelines.
CUDA is a massive parallel computing model based on GPGPU (GPU for short) to solve the rapid increasing data computing problem. It is presented by nVIDIA in 2006 with the G80 series GPU. Different form the traditional GPUs, which are consisted of rendering pipeline of Vertex Engine and Pixel Engine, a CUDA enabled GPU is composed of several SMs (Stream Multiprocessors). The amount of SM depends on the model of GPU. For example, nVIDIA Tesla has 30 SMs, and nVIDIA Quadro FX 880M has 6 SMs. In a single SM, there are also several stream processors and a shared memory which can be accessed by these processors in the same SM. For G80/GT100/GT200 series GPU, one SM is composed by 8 stream processors. The latency of the shared memory is quite low, so it is always used as cache. There is also an on-board memory (Global Memory) which can be shared by all the stream processors in a GPGPU. As GPU cannot directly access the RAM of a computer system, data should be transferred from RAM to Global Memory before GPU computation.
In a computer system equipped with GPU, the CPU system is called host, and the GPU system is called device. CUDA provides a series of APIs which can be invocated by host programs. As GPU cannot directly access the system memory of CPU and Hard disk, data should be transferred from the system memory (RAM) to the Global Memory of GPU by CUDA APIs. Then the stream processors of GPU can exchange data with the Global Memory and Shared Memory.
CPU core is the key part made of Monocrystalline silicon on which instructions can be executed. Before 2005, for normal CPUs, there was only 1 core on one single CPU chipset, which limited the development of the computing capability and efficiency. Engineers used the method that integrating several CPU cores on a chipset to solve these problems. Instructions can be executed on those cores parallelly at the same time. This method not only enhanced the computing capability of CPU, but also reduced the TDP (Thermal Design Power) for cutting down the working voltage and clock rate of CPU cores and the application of power management technology.
The latest architecture of Intel multi-core CPU is Nehalem. Nehalem Xeon 5000 series have such architecture features: (a) four cores integrated into one CPU, (b) hyper-threading technology supports 8 threads at most, (c) each core has a 64 KB L1 cache and 256 KB L2 cache, with an 8 MB L3 cache is shared by all cores and (d) Turbo Boost technology, dynamically adjusts the work frequent of cores.
In this work, the hardware used was one single node of the GPU computing platform of QIBEBT, CAS (Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy Sciences) computing platform that had the following configuration: CPU: Dual Intel Xeon X5645 2.66 GHz with 12 cores, GPU: nVIDIA Tesla C2070 with 448 processors and 6G DDR5 ECC on board memory, RAM: 72 GB RDIMM DDR3. For this system, the total float computing capability of CPU is 89.6Gflops, and the total float computing capability of GPU is up to 1Tflops.
Parallel-META is an integrated metagenomic data analysis pipeline developed by QIBEBT, CAS which enables parallel analysis of large metagenomic data and comparison among multiple samples. This pipeline includes four steps: (1) 16S rRNA extraction part: to predict 16S rRNA fragments from metagenomic sequences by HMM (Hidden Markov Model) search of HMMER  and then extract out those fragments based on the results of prediction, (2) 16S rRNA mapping part: to map the 16S rRNA fragments extracted by last step onto the database of Greengenes  core set using megaBLAST  as the alignment tool for their identification, (3) Classification part: to classify the 16S rRNA fragments based on their assignment of the taxonomical terms and mapping to the phylogenetic tree of each sample, (4) Multi-sample comparison part: to compare the taxonomical structure of all samples on different biological levels. After these steps, Parallel-META reports the classification, length distribution, summary of the taxonomic assignments of 16S rRNA fragment sequences and structure difference at different phylogenetic levels among all samples.
In the 16S rRNA extraction part, we used GPU-HMMER  component to implement parallel 16S rRNA prediction on both original sequences and complement sequences instead of traditional HMMER which is based on CPU. The core of HMM search is the Viterbi Algorithm, which is used to compute the most probable path through a given state HMM. Different from CPU computing that performs HMM search by serially executing the loops of Viterbi Algorithm. In GPU-HMMER, the loops are parallelized and expanded into some sub processes. Then each process was mapped to a thread on a stream processor of GPU. As GPU enables the activities of huge number of threads at the same time, Viterbi Algorithm can be done in a much shorter time on GPU than on CPU. Similarly, when computing the complement sequences from the original sequences, each single sequence was also mapped to one stream processor and large number of sequences can be transformed at the same time.
The next step -16S rRNA mapping has been divided into three parts: Problem Decomposition, Parallel Computing, and Result Combination. In the first part, the output data gained from the 16S rRNA extraction were decomposed into sub data files with similar size. Then in the second part, each thread could directly find its input data from the original file and perform the megaBLAST search parallelly by multi-threads programming on CPU. After that, sub results were merged together to get the final result.
After the two steps above, Parallel-META will parse the result of mapping, classify the 16S rRNA fragments of each sample, and then compare the construction of each sample on different biological level. For classification, all samples will be mapped onto one phylogenetic tree with the percentage of each sample on each level. Finally Parallel-META will visualize the common phylogenetic tree of all samples and the comparison of all components with the normalized proportions.
Statistics of metagenomic datasets (Dataset 1)
16S rRNA number
In addition, we used 1,968 sets of 16S rRNA targeted sequences  (Dataset 2) with 68,667,837 sequences and total size of 11,878.4 MB to test the performance of 16S rRNA targeted sequences analysis. The 16S rRNA sequences were generated from the largest human microbiota of two individuals at four body sites over 396 time-points by Illumina GAIIx.
In the experiments, Firstly, we measured the speed-up of Parallel-META based on GPGPU and multi-core CPU against serial version on single core of single CPU on Dataset 1. Secondly, we tested the Parallel-META pipeline on Dataset 2. Finally, we analysed the comparison result and complete common phylogenetic tree by using Dataset 1. All experiments were performed on one single node of the GPU computing cluster with dual Intel Xeon X5645 CPU (12 cores and 24 threads in total), 72 GB DDR3 RDIMM RAM, and nVIDIA Tesla C2070 GPU(448 stream processors and 6 GB on board memory).
We ran the Parallel-META with each metagenomic sequence file as input by parallel mode using GPGPU and multi-core CPU and serial mode using one single core of single CPU to compare the speed of two different methods. To reduce the effect of system-wise randomness and noises on the results, each input data were executed three times to get the average results, and the average results were compared. For metagenomic data, we test the performance in 2 steps:
Here N i and S i were the sequence number and speed-up of input i, and M represented the sample number which is 4 in this experiment, respectively. By this we can get the average speed-up of GPGPU based 16S rRNA extraction of 14.71.
Here R i and S i were the 16S rRNA number and speed-up of input i, and M represented the sample number which is 4 in this experiment, respectively. Therefore, the average speed-up of the16S rRNA mapping was 19.00.
We made the 1,968 16S rRNA sequence files as input files for Parallel-META to evaluate the performance of 16S rRNA targeted sequences analysis. Rather than predicting and extracting 16S rRNA from metagenomic data first, for targeted sequences, Parallel-Meta will skip the 16S rRNA extraction step in the whole pipeline and then continue the last steps.
As the input sequence count quite huge (more than 60 million), this experiment can indicate the computing capability of Parallel-META on massive input data. We also executed each input data three times to get the average results to reduce the effect of system-wise randomness and noises on the results. For the 1,968 input files with 68,667,837 16S rRNA sequences, the average total analysis time of Parallel-META was 6,073 minutes and 51.918 seconds.
Given multiple samples, Parallel-META could parse the result of all samples together and map the classification information of one common phylogenetic tree. In the consensus phylogenetic tree, every level of the nodes represents one biological level, therefore from root level to leaf level there are at most six levels which represent phylum, class, order, family, genus and species. For visualization, after the name of each node of the phylogenetic tree there is a bar-chart indicating percentage distrubution of every input samples on this node.
Traditional metagenomic data analyses were conducted on single PC or CPU cluster, based on which handling multiple large metagenomic datasets is becoming more and more difficult. In this work, we have tried to utilize GPU computing and multi-core CPU computing to boost the speed of metagenomic data analysis, and proposed a novel pipeline that enabled the parallel processing of large metagenomic datasets. The Parallel-META pipeline has been applied on several metagenomic data analysis projects for human-associated bacterial communities, such as oral disease-causing microbial community analysis . Several folds of speed-up have been observed, while the sensitivity and discrepancy power were not compromised. With current 10 to more than 15 times of speed-up, some deeper analysis of the metagenomic data, such as the comparison of different samples, would be feasible.
Current Parallel-META pipeline could be improved in different ways. Firstly, the megaBLAST search part could also be implemented on GPU architecture, so that the efficiency of this time-consuming part could be further improved. Secondly, as metagenomic datasets are of different types and sources, the parameters for analysis would be different for each metagenomic dataset. These parameters could be trained based on running Parallel-META on a large amount of different metagenomic datasets, which in turn could improve the accuracy of Parallel-META. Thirdly, the Parallel-META framework could be extended to work with multiple search engines and databases so as to be applicable to different types of metagenomic datasets. Finally, as a general-purpose metagenomic data analysis pipeline, Parallel-META could also incorporate component-based binning methods, which might also significantly improve the speed for clustering metagenomic short reads .
Compliment to the high-performance computational pipeline is the high-performance database management system. The high-performance database management system would not only store large amount of results by high-performance computational pipeline, but also facilitate deeper data mining of metagenomic data. Such high-performance database management system would also be incorporated into the next-generation high-performance computational platform for metagenomic data analysis.
We thank Huimin Li from USTC, Xingzhi Chang and YinheQiao of QIBEBT, CAS for their support in arrangement of computing facilities. We also thank nVIDIA to provide us with the nVIDIA Tesla C2070 GPU card and their helpful discussion of the problem. This research is supported in part by Chinese Academy of Sciences' e-Science grant INFO-115-D01-Z006, Ministry of Science and Technology's high-tech (863) grant 2009AA02Z310 and NSFC grant 61103167.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 1, 2012: Selected articles from The 5th IEEE International Conference on Systems Biology (ISB 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S1.