Daniel H Huson, Rewati Tappu, Adam L Bazinet, Chao Xie, Michael P Cummings, Kay Nieselt, Rohan Williams.
Abstract
BACKGROUND: Microbiome sequencing projects typically collect tens of millions of short reads per sample. Depending on the goals of the project, the short reads can either be subjected to direct sequence analysis or be assembled into longer contigs. The assembly of whole genomes from metagenomic sequencing reads is a very difficult problem. However, for some questions, only specific genes of interest need to be assembled. This is then a gene-centric assembly where the goal is to assemble reads into contigs for a family of orthologous genes.
Keywords: Functional analysis; Sequence assembly; Software; String graph
Year: 2017 PMID: 28122610 PMCID: PMC5267372 DOI: 10.1186/s40168-017-0233-2
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Induced DNA overlap edges. If two reads r and s both have a protein alignment to the same reference protein p, then this defines an overlap edge between the corresponding nodes if the induced DNA alignment has 100% identity. This induced DNA alignment is of length 12, as we ignore any induced gaps.
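The overlap test described in the Fig. 1 caption can be illustrated with a minimal sketch. It assumes, purely for illustration, that each read's protein alignment is available as a mapping from reference protein positions to the codon the read contributes at that position; positions where a read has a gap are simply absent. The function name `induced_overlap_length` and the example reads are hypothetical and are not taken from MEGAN.

```python
from typing import Dict, Optional

# Hypothetical representation: reference protein position -> codon (3 nt)
# contributed by the read at that position. Gapped positions are omitted.
Codons = Dict[int, str]


def induced_overlap_length(r: Codons, s: Codons) -> Optional[int]:
    """Return the induced DNA alignment length in nucleotides if the two
    reads agree at every shared reference position, otherwise None."""
    shared = set(r) & set(s)  # positions covered by both reads, gaps ignored
    if not shared:
        return None
    for pos in shared:
        if r[pos] != s[pos]:  # any mismatch breaks the 100% identity rule
            return None
    return 3 * len(shared)


# Read r has a gap at reference position 13; read s covers 11..16.
r = {10: "ATG", 11: "GCT", 12: "AAA", 14: "GGC", 15: "TTT"}
s = {11: "GCT", 12: "AAA", 13: "CCC", 14: "GGC", 15: "TTT", 16: "TGA"}
t = {**s, 14: "GGA"}  # same as s, but with one mismatching codon

edge_rs = induced_overlap_length(r, s)  # four shared codons
edge_rt = induced_overlap_length(r, t)  # mismatch at position 14
```

Here `r` and `s` share four ungapped codons, so the induced alignment spans 12 nucleotides, while `r` and `t` disagree at one codon and would not be joined by an edge. Any minimum overlap length required before an edge is accepted is not stated in the caption and is left out of this sketch.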
Fig. 2 Alignment viewer. (a) Alignment of 13,623 reads against one of the reference sequences representing bacterial rpoB, as displayed in MEGAN's alignment viewer. (b) More detailed view in which nucleotides that do not match the consensus sequence are highlighted in color. (c) Reads are ordered by contig membership and decreasing length of contigs.
For each gene family studied, we report the KEGG orthology group, number of reads assigned to that group by DIAMOND, number of reference gene sequences that exist in the synthetic community, and number of reference genes “detected” by each method: MEGAN, IDBA-UD, Ray, SOAPdenovo, and Xander
| Gene family | KEGG | Reads | References | MEGAN | IDBA-UD | Ray | SOAP | Xander |
|---|---|---|---|---|---|---|---|---|
| Acetyl-CoA C-acetyltransferase | K00626 | 58,135 | 64 | | 16 | 12 | 12 | 23 |
| Archaeal rpoB1 | K03044 | 17,875 | 16 | 7 | 7 | 6 | 3 | |
| Archaeal rpoB2 | K03045 | 12,025 | 16 | | | 6 | 5 | 5 |
| Cell division protein | K03531 | 45,881 | 48 | 37 | | 12 | 7 | 12 |
| Bacterial rpoB | K03043 | 105,212 | 64 | 43 | 16 | 12 | 13 | |
| Phenylalanyl-tRNA synthetase alpha subunit | K01889 | 44,779 | 64 | | 56 | 47 | 51 | 48 |
| Phenylalanyl-tRNA synthetase beta subunit | K01890 | 73,072 | 64 | | 50 | 42 | 38 | 35 |
| Phosphoribosylformylglycinamidine cyclo ligase | K01933 | 31,919 | 64 | 58 | | 46 | 45 | 54 |
| Ribonuclease HII | K03470 | 18,707 | 64 | 54 | | 53 | 45 | 48 |
| Ribosomal protein L1 | K02863 | 24,190 | 64 | | 49 | 45 | 48 | 53 |
| Ribosomal protein L10 | K02864 | 23,970 | 64 | | 48 | 55 | 57 | 55 |
| Ribosomal protein L11 | K02867 | 17,113 | 64 | | 50 | 51 | | 59 |
| Ribosomal protein L13 | K02871 | 17,642 | 64 | | 54 | 53 | 57 | 45 |
| Ribosomal protein L14 | K02874 | 13,435 | 64 | 56 | 42 | 49 | 58 | |
| Ribosomal protein L15 | K02876 | 13,087 | 64 | | 56 | 50 | 55 | 55 |
| Ribosomal protein L16 | K02878 | 10,058 | 64 | | 34 | 36 | 44 | 44 |
| Ribosomal protein L18 | K02881 | 14,856 | 64 | | 48 | 56 | | 55 |
| Ribosomal protein L2 | K02886 | 29,849 | 64 | | 54 | 46 | 55 | 57 |
| Ribosomal protein L22 | K02890 | 15,875 | 64 | | 54 | 55 | 57 | 51 |
| Ribosomal protein L24 | K02895 | 11,786 | 64 | | 46 | 56 | 58 | 44 |
| Ribosomal protein L25 | K02897 | 12,941 | 64 | 41 | 41 | 39 | | 41 |
| Ribosomal protein L29 | K02904 | 4,913 | 64 | 29 | 8 | 33 | | 8 |
| Ribosomal protein L3 | K02906 | 30,192 | 64 | | 47 | 51 | 57 | 51 |
| Ribosomal protein L4 | K02926 | 14,539 | 64 | | 41 | 39 | 43 | |
| Ribosomal protein L5 | K02931 | 20,533 | 64 | | 58 | 55 | 59 | 58 |
| Ribosomal protein L6 | K02933 | 20,645 | 64 | 58 | 41 | 56 | 59 | |
| Ribosomal protein S10 | K02946 | 11,327 | 64 | | 42 | 48 | | 54 |
| Ribosomal protein S11 | K02948 | 10,793 | 64 | 47 | 43 | 51 | 52 | |
| Ribosomal protein S12 | K02950 | 14,199 | 64 | | 41 | 48 | 60 | 58 |
| Ribosomal protein S13 | K02952 | 13,975 | 64 | 59 | 46 | 56 | | 58 |
| Ribosomal protein S15 | K02956 | 10,795 | 64 | 54 | 16 | 43 | | 50 |
| Ribosomal protein S17 | K02961 | 10,235 | 64 | 58 | 36 | 49 | 44 | |
| Ribosomal protein S19 | K02965 | 12,479 | 64 | | 39 | 51 | | 58 |
| Ribosomal protein S2 | K02967 | 25,926 | 64 | | 46 | 41 | 53 | 48 |
| Ribosomal protein S3 | K02982 | 25,722 | 64 | | 46 | 48 | 57 | 57 |
| Ribosomal protein S5 | K02988 | 21,761 | 64 | | 55 | 53 | 53 | 56 |
| Ribosomal protein S7 | K02992 | 20,520 | 64 | 60 | 42 | 54 | 60 | |
| Ribosomal protein S8 | K02994 | 14,543 | 64 | | 57 | 58 | 60 | 57 |
| Ribosomal protein S9 | K02996 | 12,927 | 64 | 59 | 52 | 52 | 58 | |
| Signal recognition particle protein | K03110 | 27,386 | 64 | 35 | | 36 | 19 | 46 |
| Two-component system | K03407 | 47,904 | 64 | | 17 | 15 | 15 | 27 |
| Mean absolute deviation | | | | | 19.73 | 18.24 | 15.41 | 14.17 |
Best results are shown in bold. The mean absolute deviation between the number of reference genes and the number detected by each method is reported as a summary statistic.
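As a rough illustration of the summary statistic, the sketch below computes a mean absolute deviation between reference counts and detected counts. The function name is hypothetical, and the three rows used (bacterial rpoB, ribosomal protein L14, and ribosomal protein L6, MEGAN column) are only a small subset of the table, so the result is not expected to match the value reported for the full table.

```python
from typing import Sequence


def mean_absolute_deviation(references: Sequence[int],
                            detected: Sequence[int]) -> float:
    """Average of |references[i] - detected[i]| over all gene families."""
    if len(references) != len(detected) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(ref - det) for ref, det in zip(references, detected))
    return total / len(references)


# Subset of the table: number of reference genes and MEGAN detections for
# bacterial rpoB, ribosomal protein L14, and ribosomal protein L6.
references = [64, 64, 64]
megan_detected = [43, 56, 58]

mad_subset = mean_absolute_deviation(references, megan_detected)
```

A lower value means that, on average, a method's count of detected reference genes is closer to the number of reference genes present in the synthetic community.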
Fig. 3 Summary of number of contigs produced. For each gene family along the x-axis, we plot the number of contigs of length ≥200 bp produced by each assembler.
Fig. 4 Reference gene coverage heat map. For each assembler, each gene family (rows), and each reference gene sequence associated with a species in the synthetic community (columns), we indicate the percentage of the reference gene covered by the longest contig. We also plot the average percent coverage per gene family for all assemblers.
Fig. 5 Reference gene coverage summary. For each gene family along the x-axis, we plot the number of reference gene sequences detected by each method.