Ying Yang, Xiao-Tao Jiang, Tong Zhang.
Abstract
The rapid development of next-generation sequencing (NGS) has dramatically expanded the use of metagenomics. Functional annotation is a major step in metagenomic studies, but fast annotation of functional genes remains a challenge because of the deluge of NGS data and ever-expanding databases. In this study, a hybrid annotation pipeline previously proposed for taxonomic assignment was evaluated for annotating specific functional genes in metagenomic sequences, such as antibiotic resistance genes, arsenic resistance genes and key genes in nitrogen metabolism. When searching the small protein databases for these specific functional genes, the hybrid approach using UBLAST and BLASTX is 44-177 times faster than direct BLASTX, at the cost of missing a small portion (<1.8%) of the target sequences found by direct BLASTX. Unlike direct BLASTX, the time the hybrid annotation pipeline requires depends on the abundance of the target genes. The hybrid annotation pipeline is therefore better suited to annotating specific functional genes than to comprehensive functional gene annotation.
Year: 2014 PMID: 25347677 PMCID: PMC4210140 DOI: 10.1371/journal.pone.0110947
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of UBLAST and RAPSearch2 in ARGs annotation using the optimized ARDB.
| Test dataset (a) | Average sequence length (bp) | Time (min), UBLAST (b) | Time (min), RAPSearch2 (b) | ARGs-like sequences, BLASTX (c) | ARGs-like sequences, UBLAST (c) | ARGs-like sequences, RAPSearch2 (c) | Sequences overlapping with BLASTX, UBLAST | Sequences overlapping with BLASTX, RAPSearch2 | Annotation accession numbers overlapping with BLASTX, UBLAST (d) | Annotation accession numbers overlapping with BLASTX, RAPSearch2 (d) |
| R_INF | 100 | 8 | 92 | 6,646 | 7,011 | 6,467 | 6,566 | 6,301 | 2,466 | 1,973 |
| R_AS | 100 | 6 | 93 | 322 | 341 | 291 | 308 | 277 | 141 | 95 |
| R_ADS | 100 | 7 | 93 | 361 | 388 | 347 | 351 | 327 | 127 | 99 |
| T_INF | 162 | 14 | 164 | 6,972 | 6,970 | 6,556 | 6,856 | 6,465 | 2,931 | 2,436 |
| T_AS | 167 | 11 | 170 | 254 | 248 | 236 | 241 | 228 | 102 | 98 |
| T_ADS | 178 | 12 | 177 | 336 | 330 | 304 | 326 | 300 | 134 | 120 |
a: Each dataset contained 10 million sequences. b: Both UBLAST and RAPSearch2 were run with 1 thread in the search against the optimized ARDB. The UBLAST sensitivity parameter was set to -accel 0.5. c: ARGs-like sequences were searched with an E-value cutoff of 1e-5, then further filtered by sequence identity ≥90% and hit length ≥25 aa. d: The number of sequences with the same annotation (accession number) as BLASTX among the sequences shared between UBLAST/RAPSearch2 and BLASTX.
Figure 1. Process of the hybrid annotation pipeline using UBLAST and BLASTX.
Potentially matching sequences are first identified by ultra-fast UBLAST using an E-value cutoff and are then extracted. These candidate sequences are further identified and annotated by BLASTX using cutoffs for E-value, sequence identity and hit length.
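The three steps in Figure 1 can be sketched as plain filtering logic. This is a minimal illustration, not the authors' implementation: it assumes both tools write 12-column BLAST-style tabular output (query, subject, percent identity, alignment length, ..., E-value, bit score), and the function names, the demo reads and the best-hit rule (lowest E-value per read) are hypothetical. The thresholds (E-value 1e-5, identity ≥90%, hit length ≥25 aa) are the ones stated in the table footnotes.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length (aa)
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column BLAST-style tabular lines (assumed output format)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10])))
    return hits


def candidate_ids(prefilter_hits: Iterable[Hit], max_evalue: float = 1e-5) -> Set[str]:
    """Step 1: keep reads with a fast-search hit passing the E-value cutoff."""
    return {h.query for h in prefilter_hits if h.evalue <= max_evalue}


def extract_reads(reads: Dict[str, str], ids: Set[str]) -> Dict[str, str]:
    """Step 2: pull the candidate reads out of the full dataset."""
    return {rid: seq for rid, seq in reads.items() if rid in ids}


def confirm(hits: Iterable[Hit], max_evalue: float = 1e-5,
            min_identity: float = 90.0, min_length: int = 25) -> Dict[str, Hit]:
    """Step 3: apply all three cutoffs and keep one hit per read.

    Keeping the lowest-E-value hit is an assumption made for this sketch.
    """
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.evalue <= max_evalue and h.identity >= min_identity and h.length >= min_length:
            if h.query not in best or h.evalue < best[h.query].evalue:
                best[h.query] = h
    return best


# Hypothetical demo data: three reads, two pass the fast prefilter.
reads = {"read1": "ATGGCT", "read2": "ATGCCC", "read3": "ATGAAA"}
ublast_lines = [
    "read1\tprotA\t95.0\t30\t1\t0\t1\t90\t1\t30\t1e-8\t60.0",
    "read2\tprotB\t60.0\t28\t11\t0\t1\t84\t1\t28\t1e-6\t45.0",
    "read3\tprotC\t50.0\t20\t10\t0\t1\t60\t1\t20\t0.5\t20.0",
]
candidates = extract_reads(reads, candidate_ids(parse_tabular(ublast_lines)))

# Only the candidates would be passed to the slower search; its (hypothetical)
# output is then filtered with the strict cutoffs.
blastx_lines = [
    "read1\tprotA\t95.0\t30\t1\t0\t1\t90\t1\t30\t1e-8\t60.0",
    "read2\tprotB\t60.0\t28\t11\t0\t1\t84\t1\t28\t1e-6\t45.0",
]
annotated = confirm(parse_tabular(blastx_lines))
print(sorted(candidates), sorted(annotated))
```

Because the slow search only sees the extracted candidates, its cost scales with the number of reads that pass the prefilter rather than with the size of the full dataset.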
Time consumed by single BLASTX and by the hybrid annotation pipeline.
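| Data | Database | Single BLASTX time (min) | Hybrid pipeline: UBLAST time (min) | Hybrid pipeline: BLASTX time (min) | Hybrid pipeline: total time (min) | Fold increase in speed |
| R_INF_1 | The optimized ARDB (2,998 protein sequences) | 1,522 | 8 | 9 | 17 | 90 |
| R_AS_1 | The optimized ARDB (2,998 protein sequences) | 1,597 | 6 | 3 | 9 | 177 |
| R_ADS_1 | The optimized ARDB (2,998 protein sequences) | 1,524 | 7 | 3 | 10 | 152 |
| T_INF | The optimized ARDB (2,998 protein sequences) | 2,102 | 14 | 34 | 48 | 44 |
| T_AS | The optimized ARDB (2,998 protein sequences) | 2,276 | 11 | 23 | 34 | 67 |
| T_ADS | The optimized ARDB (2,998 protein sequences) | 2,320 | 12 | 25 | 37 | 63 |
| R_INF_1 | Arsenic (103,954 protein sequences) | 18,969 | 101 | 84 | 185 | 103 |
| R_AS_1 | Arsenic (103,954 protein sequences) | 18,422 | 82 | 75 | 157 | 117 |
| R_ADS_1 | Arsenic (103,954 protein sequences) | 19,086 | 85 | 55 | 140 | 136 |
| T_INF | Arsenic (103,954 protein sequences) | 28,084 | 141 | 349 | 490 | 57 |
| T_AS | Arsenic (103,954 protein sequences) | 27,191 | 133 | 329 | 462 | 59 |
| T_ADS | Arsenic (103,954 protein sequences) | 29,612 | 147 | 367 | 514 | 58 |
| R_INF_1 | Nitrogen_KEGG (63,791 protein sequences) | 21,228 | 89 | 250 | 339 | 63 |
| R_AS_1 | Nitrogen_KEGG (63,791 protein sequences) | 22,115 | 86 | 244 | 330 | 67 |
| R_ADS_1 | Nitrogen_KEGG (63,791 protein sequences) | 21,841 | 85 | 188 | 273 | 80 |
| T_INF | Nitrogen_KEGG (63,791 protein sequences) | 30,528 | 156 | 434 | 590 | 52 |
| T_AS | Nitrogen_KEGG (63,791 protein sequences) | 30,629 | 159 | 437 | 596 | 51 |
| T_ADS | Nitrogen_KEGG (63,791 protein sequences) | 33,567 | 169 | 445 | 614 | 55 |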
*Both BLASTX and UBLAST were run with 1 thread. The UBLAST sensitivity parameter was set to -accel 0.5.
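The fold increase reported in the table is consistent with dividing the single BLASTX time by the total hybrid time (UBLAST plus the follow-up BLASTX) and rounding to the nearest integer. The short sketch below checks that reading against three rows of the optimized ARDB block; the function name is hypothetical.

```python
def fold_speedup(direct_min: float, ublast_min: float, blastx_min: float) -> float:
    """Ratio of direct BLASTX time to total hybrid pipeline time."""
    hybrid_total = ublast_min + blastx_min
    return direct_min / hybrid_total


# (single BLASTX, hybrid UBLAST, hybrid BLASTX) times in minutes, from the table.
rows = {
    "R_INF_1": (1522, 8, 9),
    "R_AS_1": (1597, 6, 3),
    "T_INF": (2102, 14, 34),
}

for name, (direct, ublast, blastx) in rows.items():
    print(name, ublast + blastx, round(fold_speedup(direct, ublast, blastx)))
# R_INF_1 17 90
# R_AS_1 9 177
# T_INF 48 44
```

The two extremes of the 44-177 range quoted in the abstract both come from the optimized ARDB searches (T_INF and R_AS_1).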
Differences between the annotation results of the hybrid annotation pipeline and direct BLASTX.
| Target genes | Dataset | BLASTX (a) | Hybrid annotation pipeline (b) | Number of overlapped sequences | Overlap percentage in direct BLASTX (%) | Unique to direct BLASTX | Same annotation | Different annotation |
| ARGs | R_INF | 6,646 | 6,624 | 6,624 | 99.7 | 22 | 6,624 | 0 |
| ARGs | R_AS | 322 | 320 | 320 | 99.4 | 2 | 320 | 0 |
| ARGs | R_ADS | 361 | 361 | 361 | 100 | 0 | 361 | 0 |
| ARGs | T_INF | 6,972 | 6,940 | 6,940 | 99.5 | 32 | 6,940 | 0 |
| ARGs | T_AS | 254 | 253 | 253 | 99.6 | 1 | 253 | 0 |
| ARGs | T_ADS | 336 | 334 | 334 | 99.4 | 2 | 334 | 0 |
| Arsenic resistance genes | R_INF | 15,240 | 15,190 | 15,190 | 99.7 | 50 | 15,189 | 1 |
| Arsenic resistance genes | R_AS | 3,132 | 3,128 | 3,128 | 99.9 | 4 | 3,128 | 0 |
| Arsenic resistance genes | R_ADS | 5,191 | 5,163 | 5,163 | 99.5 | 28 | 5,163 | 0 |
| Arsenic resistance genes | T_INF | 16,178 | 16,016 | 16,016 | 99.0 | 162 | 16,015 | 1 |
| Arsenic resistance genes | T_AS | 2,082 | 2,069 | 2,069 | 99.4 | 13 | 2,069 | 0 |
| Arsenic resistance genes | T_ADS | 5,088 | 4,998 | 4,998 | 98.2 | 90 | 4,998 | 0 |
| Nitrogen metabolism | R_INF | 49,408 | 49,327 | 49,327 | 99.8 | 81 | 49,327 | 0 |
| Nitrogen metabolism | R_AS | 33,674 | 33,617 | 33,617 | 99.8 | 57 | 33,617 | 0 |
| Nitrogen metabolism | R_ADS | 25,428 | 25,368 | 25,368 | 99.8 | 60 | 25,368 | 0 |
| Nitrogen metabolism | T_INF | 51,437 | 51,331 | 51,331 | 99.8 | 106 | 51,331 | 0 |
| Nitrogen metabolism | T_AS | 32,429 | 32,374 | 32,374 | 99.8 | 55 | 32,374 | 0 |
| Nitrogen metabolism | T_ADS | 23,256 | 23,135 | 23,135 | 99.5 | 121 | 23,135 | 0 |
a: ARGs-like sequences were searched by direct BLASTX with an E-value cutoff of 1e-5, then further filtered by sequence identity ≥90% and hit length ≥25 aa. b: The cutoff for the first-step UBLAST was an E-value of 1e-5. The cutoff for the third-step BLASTX was an E-value of 1e-5, followed by filtering on sequence identity ≥90% and hit length ≥25 aa.
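The overlap percentage and the share of sequences missed by the hybrid pipeline follow directly from the counts in the table. The sketch below recomputes them for two rows, assuming the percentage is the overlapped count divided by the direct BLASTX count; the function name is hypothetical.

```python
from typing import Tuple


def overlap_stats(direct_hits: int, overlapped: int) -> Tuple[float, int, float]:
    """Return (overlap %, sequences unique to direct BLASTX, missed %)."""
    missed = direct_hits - overlapped
    overlap_pct = round(100.0 * overlapped / direct_hits, 1)
    missed_pct = round(100.0 * missed / direct_hits, 2)
    return overlap_pct, missed, missed_pct


# Arsenic resistance genes, T_ADS: the lowest overlap in the table.
print(overlap_stats(5088, 4998))  # (98.2, 90, 1.77)

# ARGs, R_INF.
print(overlap_stats(6646, 6624))  # (99.7, 22, 0.33)
```

The T_ADS arsenic row gives the largest missed share, about 1.77%, which matches the "<1.8%" figure in the abstract.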
Figure 2. Number of potential ARGs-like sequences and number of overlapped sequences in UBLAST and direct BLASTX.
Potential ARGs-like sequences were selected by UBLAST using different E-value cutoffs (1e-1, 1e-2, 1e-3, 1e-4 and 1e-5) and different UBLAST sensitivity settings (-accel 0.5, -accel 0.8 and -accel 1) in dataset R_33.