FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes.

Beifang Niu, Zhengwei Zhu, Limin Fu, Sitao Wu, Weizhong Li.

Abstract

SUMMARY: Fragment recruitment, a process of aligning sequencing reads to reference genomes, is a crucial step in metagenomic data analysis. The available sequence alignment programs are either slow or insufficient for recruiting metagenomic reads. We implemented an efficient algorithm, FR-HIT, for fragment recruitment. We applied FR-HIT and several other tools, including BLASTN, MegaBLAST, BLAT, LAST, SSAHA2, SOAP2, BWA and BWA-SW, to recruit four metagenomic datasets from different types of sequencers. On average, FR-HIT and BLASTN recruited significantly more reads than the other programs, while FR-HIT is about two orders of magnitude faster than BLASTN. FR-HIT is slower than the fastest programs, SOAP2, BWA and BWA-SW, but it recruited 1-5 times more reads. AVAILABILITY: http://weizhongli-lab.org/frhit.

Year: 2011 PMID: 21505035 PMCID: PMC3106194 DOI: 10.1093/bioinformatics/btr252

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Metagenomic data provide a more comprehensive picture of the microbial world. An important step toward that understanding is to compare the raw sequencing reads against the available microbial genomes to analyze the phylogenetic composition, genes and functions of the samples. Such a procedure, referred to as fragment recruitment, was introduced in the Global Ocean Sampling (GOS) metagenomics study (Rusch ). Sequences from metagenomic samples exhibit great differences from the available genomes. Although there are thousands of available complete microbial genomes, they hardly cover the broad and diverse species in many metagenomic samples. A typical metagenomic dataset may have hundreds or thousands of species, and many of them are novel. Therefore, it is critical for fragment recruitment methods to align reads to homologous genomes. In the GOS study, BLAST (Altschul ) was used for fragment recruitment. However, it is too slow to handle large datasets. The explosion of next-generation sequencing data stimulated the development of new mapping programs, such as SOAP (Li ), Bowtie (Langmead ), BWA (Li and Durbin, 2009) and many others. These programs are several orders of magnitude faster than BLAST, but they can only identify very stringent similarities that tolerate only a few mismatches and gaps. These mapping programs are therefore insufficient for fragment recruitment. Slightly slower programs such as BLAT (Kent, 2002), SSAHA2 (Ning ) and LAST (Kielbasa ) can recruit more reads than the mapping programs, but their fragment recruiting capacities are still limited. In this article, we present a new fragment recruitment method, FR-HIT. Given reference genomes, metagenomic reads, and sequence identity and alignment length cutoffs, the goal of FR-HIT is to align the most reads to references with minimal computational time.

2 METHODS AND IMPLEMENTATION

FR-HIT first constructs a k-mer hash table for the reference genome sequences. Then for each query, it performs seeding, filtering and banded alignment to identify the alignments to reference sequences that meet user-defined cutoffs.

2.1 Constructing k-mer hash table

The reference genome sequences are converted into a k-mer hash table. The default value of k is 11 and can be adjusted from 8 to 12. We include overlapping k-mers at an equidistant step from reference sequences. A reference sequence of length m contains (m − k)/(k − p)+1 k-mers with an overlap of p bases. Here, p is also a user-adjustable parameter. The hash table stores the indexes of reference sequences and the offset positions of k-mers on reference sequences.
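The indexing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FR-HIT's C++ implementation: the function names `build_kmer_table` and `expected_kmer_count` are hypothetical, the table is a plain dictionary, and the k-mer count formula is read with integer division.

```python
from collections import defaultdict


def build_kmer_table(references, k, p):
    """Index k-mers sampled every (k - p) bases from each reference.

    references: list of sequence strings (hypothetical input format).
    Returns a dict mapping k-mer -> list of (reference index, offset).
    """
    if not 0 <= p < k:
        raise ValueError("overlap p must satisfy 0 <= p < k")
    step = k - p  # consecutive sampled k-mers overlap by p bases
    table = defaultdict(list)
    for ref_index, seq in enumerate(references):
        for offset in range(0, len(seq) - k + 1, step):
            table[seq[offset:offset + k]].append((ref_index, offset))
    return table


def expected_kmer_count(m, k, p):
    """(m - k)/(k - p) + 1 from the text, assuming integer division."""
    return (m - k) // (k - p) + 1 if m >= k else 0


table = build_kmer_table(["ACGTACGTAC"], k=4, p=2)
sampled = sum(len(positions) for positions in table.values())
```

With the toy reference of length 10, `k=4` and `p=2`, k-mers are taken at offsets 0, 2, 4 and 6, which matches the count given by the formula.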

2.2 Seeding

Seeding identifies candidate blocks, which are fragments of reference sequences that can be potentially aligned with the query. For each query, we count all its overlapping k-mers and scan the k-mer hash table to collect the k-mers shared by reference sequences. We identify pieces of reference sequences that the query can be aligned to. These pieces are anchored by the shared k-mers. For a reference, any cluster of ≥ 2 pieces within b bases will derive a candidate block. This block covers all the pieces in that cluster and has extra b bases at each end. Here, b is the bandwidth to be introduced in Section 2.4. If two candidate blocks overlap, they are joined together into one candidate block. We repeat this until no overlapping blocks are observed.
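The clustering and merging of pieces can be sketched as follows. This is a minimal illustration, not the authors' code: pieces are assumed to be half-open `(start, end)` intervals on one reference, "within b bases" is interpreted as a gap of at most `b` between consecutive piece starts, and the names `candidate_blocks` and `merge_overlapping` are hypothetical.

```python
def merge_overlapping(blocks):
    """Join overlapping half-open intervals until none overlap."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def candidate_blocks(pieces, b, ref_len):
    """Derive candidate blocks on one reference from anchored pieces.

    Assumption: two pieces belong to the same cluster when their start
    positions differ by at most b bases.
    """
    blocks, cluster = [], []

    def emit():
        # Only clusters of >= 2 pieces derive a block, padded by b bases
        # at each end and clipped to the reference boundaries.
        if len(cluster) >= 2:
            start = max(0, min(s for s, _ in cluster) - b)
            end = min(ref_len, max(e for _, e in cluster) + b)
            blocks.append((start, end))

    for piece in sorted(pieces):
        if cluster and piece[0] - cluster[-1][0] > b:
            emit()
            cluster = []
        cluster.append(piece)
    emit()
    return merge_overlapping(blocks)


single = candidate_blocks([(100, 150), (105, 155), (400, 450), (160, 210)], 10, 1000)
joined = candidate_blocks([(100, 150), (105, 155), (170, 220), (175, 225)], 10, 1000)
```

In the first call the isolated pieces at 160 and 400 derive no block; in the second, two padded blocks overlap and are joined into one.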

2.3 Filtering

Filtering removes the candidate blocks that do not enclose qualified alignments. K-mer filtering was originally used in QUASAR (Burkhardt ). Two sequences of length n with Hamming distance ϵ share at least n + 1 − (ϵ + 1)k common k-mers (Jokinen and Ukkonen, 1991; Owolabi and Mcgregor, 1988). Here, ϵ is the number of mismatches in an alignment. Based on user-defined length and sequence identity cutoffs, we calculate the number of mismatches and reject the candidate blocks that do not have enough common k-mers. In this step, the length of a k-mer is 4.
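The filtering bound can be worked through numerically. The sketch below is an assumption-laden illustration: the text does not state how the mismatch count is derived from the cutoffs, so it is taken here as the largest whole number of mismatches allowed by the identity cutoff over the alignment length, and the function names are hypothetical. The identity is passed as an integer percentage to avoid floating-point rounding.

```python
def min_shared_kmers(n, identity_percent, k=4):
    """Lower bound n + 1 - (eps + 1) * k on shared k-mers.

    Assumption: eps = floor(n * (100 - identity_percent) / 100).
    A non-positive bound carries no filtering power, so it is clipped to 0.
    """
    eps = n * (100 - identity_percent) // 100
    return max(0, n + 1 - (eps + 1) * k)


def passes_filter(shared_kmers, n, identity_percent, k=4):
    """Keep a candidate block only if it shares enough k-mers."""
    return shared_kmers >= min_shared_kmers(n, identity_percent, k)


bound_30bp = min_shared_kmers(30, 80)        # eps = 6  -> 31 - 7 * 4
bound_75bp = min_shared_kmers(75, 80)        # eps = 15 -> 76 - 16 * 4
bound_k11 = min_shared_kmers(30, 80, k=11)   # 31 - 7 * 11 is negative
```

Under these assumptions, a 30 bp alignment at 80% identity must share at least 3 4-mers, while the same bound computed with k = 11 drops to zero, which suggests one reason a short k-mer is used in this step.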

2.4 Banded alignment

FR-HIT performs banded alignments (Pearson and Lipman, 1988) between the query and the candidate blocks that passed the filter. The bandwidth is also a user-defined value. For each candidate block, the band that contains the most shared k-mers is used. If a reference sequence has multiple candidate blocks, these blocks are sorted by the number of shared k-mers in decreasing order. Banded alignments are performed in this order, and if t banded alignments do not recruit this query, no more banded alignment is tried for this reference. Here, t is a parameter with default value of 10.
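Two parts of this step, choosing the band and limiting the number of attempts, can be sketched as below. The alignment itself is left as a caller-supplied function, and several details are assumptions rather than facts from the text: shared k-mers are given as `(query_offset, block_offset)` pairs, a band is a window of `bandwidth` consecutive diagonals, and the limit `t` is read as the total number of failed alignments on one reference. The names `best_band` and `recruit_on_reference` are hypothetical.

```python
def best_band(hits, bandwidth):
    """Return (lowest diagonal, hit count) of the densest band.

    hits: (query_offset, block_offset) pairs of shared k-mers.
    A hit lies on diagonal block_offset - query_offset.
    """
    diagonals = sorted(block_off - query_off for query_off, block_off in hits)
    best = (None, 0)
    left = 0
    for right, diag in enumerate(diagonals):
        # Shrink the window until it spans fewer than `bandwidth` diagonals.
        while diag - diagonals[left] >= bandwidth:
            left += 1
        count = right - left + 1
        if count > best[1]:
            best = (diagonals[left], count)
    return best


def recruit_on_reference(blocks, align, t=10):
    """Try blocks in decreasing order of shared k-mers, stopping after t failures.

    blocks: list of (shared_kmer_count, block) for one reference.
    align:  caller-supplied function returning an alignment or None.
    """
    alignments, failures = [], 0
    for _, block in sorted(blocks, key=lambda item: item[0], reverse=True):
        result = align(block)
        if result is not None:
            alignments.append(result)
            continue
        failures += 1
        if failures >= t:
            break
    return alignments


band = best_band([(0, 10), (5, 15), (10, 21), (3, 40)], bandwidth=4)

tried = []

def toy_align(block):
    # Stand-in aligner: only block "b9" yields a qualified alignment.
    tried.append(block)
    return block if block == "b9" else None

found = recruit_on_reference([(2, "b2"), (9, "b9"), (5, "b5"), (1, "b1")], toy_align, t=2)
```

In the toy run, the block with the fewest shared k-mers is never aligned because two failures have already been recorded.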

2.5 Implementation

FR-HIT is written in C++ and distributed at http://weizhongli-lab.org/frhit with documentation and testing data. FR-HIT takes reference sequences in FASTA format and queries in FASTA or FASTQ format and produces recruitment results. If a query hits multiple references or multiple locations of a reference, FR-HIT reports all these alignments. Currently, FR-HIT does not support reads in color space.

3 RESULTS

We applied FR-HIT to four metagenomic datasets and compared it with BLASTN, MegaBLAST, SOAP2, BWA, BWA-SW, SSAHA2, BLAT and LAST. The first dataset has 1 million 75 bp Illumina reads from MetaHIT sample MH0006 (Qin ). The other three datasets are from 454 GS20, GSFLX and Titanium platforms, with 688 590, 288 735 and 502 399 reads, respectively. Their average lengths are 99, 233 and 345 bp, respectively. The GS20 and GSFLX datasets were downloaded from CAMERA (Sun ) under IDs SCUMS_SMPL_Arctic and BATS_SMPL_174-2. The Titanium data were from NCBI under accession SRR029691. For the Illumina dataset, we used the 194 human gut genomes from the MetaHIT study as references. For the 454 datasets, we used the 1985 completed bacterial genome sequences downloaded from NCBI in April 2010 as references. The two reference databases are 1.139 and 3.823 GB in size.

A read is considered recruited if it is aligned to a reference with ≥ 30 bp and ≥ 80% identity. Such cutoffs represent a basic need for fragment recruitment: to recruit more reads while preventing obviously spurious hits. More discussion and examples of parameters are available in Supplementary Material. Parameters of all the programs are listed in Supplementary Table S2. The CPU time and the number of recruited reads are shown in Figure 1 and Supplementary Table S3. FR-HIT's results with different parameters are provided in Supplementary Table S4.

On average, FR-HIT and BLASTN recruited significantly more reads than the other programs, and FR-HIT is ~ 2 orders of magnitude faster than BLASTN. FR-HIT is slower than the fastest mapping programs, SOAP2, BWA and BWA-SW, but it recruited 1–5 times more reads. In these tests, FR-HIT shows a better recruitment rate and speed than SSAHA2. FR-HIT is slightly slower than MegaBLAST, BLAT and LAST, but it recruited many more reads than they did. Using the Illumina data as an example, BLASTN recruited 475 584 reads in 7168 min. SOAP2 used 1.5 min, but only recruited 141 417 reads.
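The recruitment criterion used in these tests (≥ 30 bp aligned at ≥ 80% identity) can be written as a small predicate. This is an illustrative sketch only: identity is assumed to be the number of matching positions divided by the alignment length, and the function name `is_recruited` and its parameters are hypothetical.

```python
def is_recruited(aligned_length, matches, min_length=30, min_identity_percent=80):
    """Return True if an alignment meets the length and identity cutoffs.

    Assumption: identity = matches / aligned_length, compared in integer
    arithmetic to avoid floating-point rounding at the boundary.
    """
    if aligned_length < min_length:
        return False
    return matches * 100 >= min_identity_percent * aligned_length


short_hit = is_recruited(29, 29)       # perfect identity, but too short
borderline = is_recruited(30, 24)      # exactly 30 bp at 80% identity
below_cutoff = is_recruited(75, 59)    # 59/75 is just under 80%
```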

Fig. 1.

Recruitment rate and speed of FR-HIT and other programs. The x-axis (logarithmic scale) is CPU minutes on AMD Opteron 8380 Shanghai 2.5 GHz processors; the y-axis is the number of recruited reads. SOAP2 and BWA, short read mapping tools, were only used on the Illumina data.

FR-HIT recruited 523 868 reads in 45 min. Metagenomic data contain many novel species, so 49–64% of reads cannot be recruited by FR-HIT. Due to the use of overlapping k-mers, FR-HIT needs more memory than other programs. It used ~ 4 and 8 GB for the two reference databases in these tests.

Funding: Supported by Awards (R01RR025030 and R01HG005978) from National Center for Research Resources and National Human Genome Research Institute.

Conflict of Interest: none declared.

REFERENCES

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. SSAHA: a fast search method for large DNA databases.

Authors: Z Ning; A J Cox; J C Mullikin
Journal: Genome Res Date: 2001-10 Impact factor: 9.043

3. Adaptive seeds tame genomic sequence comparison.

Authors: Szymon M Kiełbasa; Raymond Wan; Kengo Sato; Paul Horton; Martin C Frith
Journal: Genome Res Date: 2011-01-05 Impact factor: 9.043

4. A human gut microbial gene catalogue established by metagenomic sequencing.

Authors: Junjie Qin; Ruiqiang Li; Jeroen Raes; Manimozhiyan Arumugam; Kristoffer Solvsten Burgdorf; Chaysavanh Manichanh; Trine Nielsen; Nicolas Pons; Florence Levenez; Takuji Yamada; Daniel R Mende; Junhua Li; Junming Xu; Shaochuan Li; Dongfang Li; Jianjun Cao; Bo Wang; Huiqing Liang; Huisong Zheng; Yinlong Xie; Julien Tap; Patricia Lepage; Marcelo Bertalan; Jean-Michel Batto; Torben Hansen; Denis Le Paslier; Allan Linneberg; H Bjørn Nielsen; Eric Pelletier; Pierre Renault; Thomas Sicheritz-Ponten; Keith Turner; Hongmei Zhu; Chang Yu; Shengting Li; Min Jian; Yan Zhou; Yingrui Li; Xiuqing Zhang; Songgang Li; Nan Qin; Huanming Yang; Jian Wang; Søren Brunak; Joel Doré; Francisco Guarner; Karsten Kristiansen; Oluf Pedersen; Julian Parkhill; Jean Weissenbach; Peer Bork; S Dusko Ehrlich; Jun Wang
Journal: Nature Date: 2010-03-04 Impact factor: 49.962

5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. Improved tools for biological sequence comparison.

Authors: W R Pearson; D J Lipman
Journal: Proc Natl Acad Sci U S A Date: 1988-04 Impact factor: 11.205

7. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

8. Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource.

Authors: Shulei Sun; Jing Chen; Weizhong Li; Ilkay Altintas; Abel Lin; Steve Peltier; Karen Stocks; Eric E Allen; Mark Ellisman; Jeffrey Grethe; John Wooley
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

9. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific.

Authors: Douglas B Rusch; Aaron L Halpern; Granger Sutton; Karla B Heidelberg; Shannon Williamson; Shibu Yooseph; Dongying Wu; Jonathan A Eisen; Jeff M Hoffman; Karin Remington; Karen Beeson; Bao Tran; Hamilton Smith; Holly Baden-Tillson; Clare Stewart; Joyce Thorpe; Jason Freeman; Cynthia Andrews-Pfannkoch; Joseph E Venter; Kelvin Li; Saul Kravitz; John F Heidelberg; Terry Utterback; Yu-Hui Rogers; Luisa I Falcón; Valeria Souza; Germán Bonilla-Rosso; Luis E Eguiarte; David M Karl; Shubha Sathyendranath; Trevor Platt; Eldredge Bermingham; Victor Gallardo; Giselle Tamayo-Castillo; Michael R Ferrari; Robert L Strausberg; Kenneth Nealson; Robert Friedman; Marvin Frazier; J Craig Venter
Journal: PLoS Biol Date: 2007-03 Impact factor: 8.029

10. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937
