Bas E Dutilh, Noriko Cassman, Katelyn McNair, Savannah E Sanchez, Genivaldo G Z Silva, Lance Boling, Jeremy J Barr, Daan R Speth, Victor Seguritan, Ramy K Aziz, Ben Felts, Elizabeth A Dinsdale, John L Mokili, Robert A Edwards.
Abstract
Metagenomics, or sequencing of the genetic material from a complete microbial community, is a promising tool to discover novel microbes and viruses. Viral metagenomes typically contain many unknown sequences. Here we describe the discovery of a previously unidentified bacteriophage present in the majority of published human faecal metagenomes, which we refer to as crAssphage. Its ~97 kbp genome is six times more abundant in publicly available metagenomes than all other known phages together; it comprises up to 90% and 22% of all reads in virus-like particle (VLP)-derived metagenomes and total community metagenomes, respectively; and it totals 1.68% of all human faecal metagenomic sequencing reads in the public databases. The majority of crAssphage-encoded proteins match no known sequences in the database, which is why it was not detected before. Using a new co-occurrence profiling approach, we predict a Bacteroides host for this phage, consistent with Bacteroides-related protein homologues and a unique carbohydrate-binding domain encoded in the phage genome.
Year: 2014 PMID: 25058116 PMCID: PMC4111155 DOI: 10.1038/ncomms5498
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. Schematic representation of the circular crAssphage genome.
The genome contains 80 ORFs that were predicted with Glimmer (ref. 56) trained on Caudovirales. The total coverage of each nucleotide in the F2T1 metagenome, and in all public metagenomes in MG-RAST (ref. 49), is indicated (466 human faecal and 2,440 other metagenomes, as determined by blastn mapping: ≥75 bp aligned with ≥95% identity, see Methods). Green bars indicate the 36 regions that were validated by long-range PCR (see Table 2 and Supplementary Table 1). Selected regions of several PCR amplicons (indicated as light green regions in the green bars) were sequenced by Sanger dideoxynucleotide sequencing to validate that the amplicons were indeed derived from the crAssphage genome (Supplementary Table 1). See Supplementary Fig. 6 for the fully annotated figure.
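The mapping criterion in the caption (at least 75 bp aligned at 95% identity or higher) can be applied to BLAST hits before counting per-nucleotide coverage. The Python sketch below is illustrative only and is not the authors' pipeline: it assumes hits in the standard 12-column BLAST tabular layout (`-outfmt 6`), and the function names `passes_threshold` and `coverage_from_blast_tab` are hypothetical. It ignores wraparound at the ends of the circular genome.

```python
from collections import Counter


def passes_threshold(pident, aln_len, min_ident=95.0, min_len=75):
    """True if a hit meets the caption's criterion (>=75 bp, >=95% identity)."""
    return aln_len >= min_len and pident >= min_ident


def coverage_from_blast_tab(lines, min_ident=95.0, min_len=75):
    """Count per-position depth on the subject from BLAST outfmt 6 lines.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    depth = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        pident, aln_len = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if not passes_threshold(pident, aln_len, min_ident, min_len):
            continue
        # Minus-strand hits have sstart > send, so order the coordinates.
        lo, hi = sorted((sstart, send))
        for pos in range(lo, hi + 1):
            depth[pos] += 1
    return depth


example_hits = [
    "read1\tcrAss\t98.0\t100\t2\t0\t1\t100\t11\t110\t1e-40\t170",
    "read2\tcrAss\t90.0\t100\t10\t0\t1\t100\t11\t110\t1e-20\t120",  # identity too low
    "read3\tcrAss\t99.0\t60\t1\t0\t1\t60\t200\t259\t1e-20\t100",    # alignment too short
    "read4\tcrAss\t96.0\t80\t3\t0\t1\t80\t150\t71\t1e-30\t140",     # minus strand
]
depth = coverage_from_blast_tab(example_hits)
```

Only `read1` and `read4` pass the filter here, so positions where both align have a depth of two.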
Table 1. CrAssphage ORFs with homology to known proteins or domains.
| ORF | Function | | |
|---|---|---|---|
| orf00014 | Hypothetical protein | Phages | Ambiguous |
| orf00017 | Uracil-DNA glycosylase | | |
| orf00016 | DNA helicase | | |
| orf00018 | DNA polymerase | | |
| orf00025 | DNA primase/helicase | | |
| orf00029 | DNA ligase | | |
| orf00031 | Deoxynucleoside monophosphate kinase | | |
| orf00032 | Baseplate hub | | |
| orf00033 | Thymidylate synthase complementing protein ThyX | | |
| orf00035 | Hypothetical protein | | |
| orf00037 | Phage/plasmid-related protein | | |
| orf00038 | Deoxyuridine 5'-triphosphate nucleotidohydrolase | | |
| orf00039 | Endonuclease | | |
| orf00040 | Deoxyuridine 5'-triphosphate nucleotidohydrolase | | |
| orf00042 | Glutaredoxin/thioredoxin | Phages | Ambiguous |
| orf00047 | Hypothetical protein | | |
| orf00050 | Plasmid replication protein domain | | |
| orf00052 | Phage-structural protein | | |
| orf00053 | Phage-structural protein | | |
| orf00056 | Hypothetical protein | | |
| orf00065 | Hypothetical protein | | |
| orf00066 | Phage-related protein | | |
| orf00070 | Predicted protein | | |
| orf00071 | Predicted protein | | |
| orf00072 | Hypothetical protein | | |
| orf00073 | Phage-related protein | | |
| orf00074 | Phage-structural protein, contains BACON domains | | |
| orf00075 | Phage-structural protein | | |
| orf00077 | Recombination endonuclease subunit | | |
| orf00076 | Phage-related protein | | |
| orf00086 | Phage-structural protein | | |
| orf00088 | Phage-structural protein | | |
| orf00091 | Phage-structural protein | | |
| orf00092 | Hypothetical protein | | |
| orf00093 | DNA helicase | | |
| orf00094 | Endolysin | | |
| orf00095 | Endolysin | Phage | |
| orf00096 | Phage-related protein | | |
| orf00102 | Plasmid replication protein domain | | |
ORF, open reading frame.
Function and taxonomy information of the hits are displayed. For details see Supplementary Data 2.
Figure 2. CRISPR spacers similar to regions of the crAssphage genome.
CRISPR spacers were identified in 2,773 complete bacterial genomes from Genbank, and in 404 genomes of intestinal isolates from HMP and MetaHIT. The CRISPR spacers that were most similar to the crAssphage genome were found in Prevotella intermedia 17 (Genbank genomes) and in Bacteroides sp. 20_3 (HMP and MetaHIT genomes). Conserved A, C, G, and T nucleotides are displayed in red, green, yellow and blue, respectively.
Figure 3. Phage–host prediction based on co-occurrence across metagenomes.
Unrooted co-occurrence cladogram of correlated depth profiles across 151 HMP faecal metagenomes (ref. 16) of the crAssphage, two known Bacteroides fragilis-infecting phages (refs 22, 26), and 404 potential hosts. Colours indicate bacterial phyla. The phages are indicated with blue dashed lines. See Supplementary Fig. 7 for the fully annotated figure.
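The idea behind co-occurrence profiling is that a phage and its host should rise and fall together across samples, so their depth profiles correlate. The Python sketch below illustrates that idea under stated assumptions: Pearson correlation is used purely for illustration (this record does not say which correlation measure the authors used), and `pearson`, `rank_hosts` and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length depth profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)


def rank_hosts(phage_profile, host_profiles):
    """Sort candidate hosts by correlation with the phage, best first."""
    scores = {
        name: pearson(phage_profile, profile)
        for name, profile in host_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy depth profiles across five metagenomes (invented numbers).
phage = [0, 5, 10, 0, 20]
hosts = {
    "host_A": [1, 11, 21, 1, 41],   # tracks the phage
    "host_B": [20, 10, 5, 20, 0],   # anti-correlated
    "host_C": [3, 3, 3, 3, 3],      # constant
}
ranked = rank_hosts(phage, hosts)
```

In this toy input `host_A` ranks first because its depth is a linear function of the phage depth. A cladogram like the one in the figure would additionally require clustering all pairwise correlations, which is not shown here.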
Figure 4. Abundance ubiquity plot of phage genomes in public metagenomes.
Reads from 2,944 publicly available shotgun metagenomes (ref. 49) were aligned to a database of 1,193 phage genomes (see Methods). The average depth of aligned reads per nucleotide of the phage genome (abundance) is plotted against the number of metagenomes it is found in (ubiquity). See Supplementary Data 5 for details.
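The two plotted quantities can be written down compactly. The Python sketch below is one possible reading of the caption, not the authors' code: it assumes abundance is the total number of aligned bases across all metagenomes divided by the phage genome length, and that a phage counts as "found" in a metagenome when at least one base aligns. The function name and the example numbers are hypothetical.

```python
def abundance_ubiquity(aligned_bases_by_metagenome, genome_length):
    """Return (abundance, ubiquity) for one phage genome.

    abundance: average depth per nucleotide, pooled over all metagenomes
               (assumed definition).
    ubiquity:  number of metagenomes with at least one aligned base
               (assumed definition).
    """
    total_bases = sum(aligned_bases_by_metagenome.values())
    abundance = total_bases / genome_length
    ubiquity = sum(1 for bases in aligned_bases_by_metagenome.values() if bases > 0)
    return abundance, ubiquity


# Invented example for a 97,065 bp genome in three metagenomes.
aligned = {"mgA": 194_130, "mgB": 0, "mgC": 97_065}
abundance, ubiquity = abundance_ubiquity(aligned, 97_065)
```

With these invented counts the pooled depth is three aligned bases per genome position, and the phage is present in two of the three metagenomes.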
Figure 5. Normalized coverage plot of the crAssphage genome in 940 public metagenomes.
Rows are metagenomes, with the sequence volume in nucleotides indicated to the right (see Supplementary Data 4 for the order and detailed annotations of the metagenomes). The x axis of the heat map displays the 97,065 bp length of the crAssphage genome sequence. The colour bar indicates the percentage of nucleotides in each metagenome that aligns to each position. Black arrowheads at the top of the figure indicate metaviromic islands (ref. 50). Details are available in Supplementary Data 4. Note that some of the metagenomes at the bottom of the plot that are annotated as ‘Plant-associated’ are also faecal metagenomes (ref. 69).
Table 2. PCR primer pairs designed to validate the crAssphage genome sequence.
| Pair | Forward primer | Start | Reverse primer | End |
|---|---|---|---|---|
| 1 | 5′-GTGACGAGAGGTATTGAATGTGGA-3′ | 1,785 | 5′-GCTATAAGTCCAGCAGCAAAAGG-3′ | 6,793 |
| 2 | 5′-TGACTAGCTTGCTTCCATCCT-3′ | 6,004 | 5′-GCACTACGTCCATCTTGAGTACCA-3′ | 11,923 |
| 3 | 5′-CAGGTGAACGTAAACCTGTTCCT-3′ | 9,512 | 5′-ACTCATACCAGCAAATGAAGGCA-3′ | 14,930 |
| 4 | 5′-ATGGTGCTCGTGAAATTGCT-3′ | 13,526 | 5′-GCTTTACGCTGAGCAATCGT-3′ | 17,858 |
| 5 | 5′-GCACCGGTATTGCAAAGGCT-3′ | 17,801 | 5′-CTCCAAATCCTTTGTTTCCACGT-3′ | 25,822 |
| 6 | 5′-TGCTATTTGGCAAACTGCTGG-3′ | 23,445 | 5′-ATCATGCTGACCGTCTTGCT-3′ | 29,713 |
| 7 | 5′-GTAGCGAAGCGGAGCGTTCTA-3′ | 28,192 | 5′-TATGGAACGAGCTGCTGGTG-3′ | 30,897 |
| 8 | 5′-ATTCACCAGCAGCTCGTTCC-3′ | 30,875 | 5′-TGAATGGCGTTCAGCAGGCT-3′ | 36,058 |
| 9 | 5′-AGCTATTCCGCCCTCACTCAA-3′ | 35,019 | 5′-TGCTAAGATTGGTCGTGTAGCT-3′ | 40,185 |
| 10 | 5′-TGAGGAACTTCTTGCTGACGA-3′ | 38,017 | 5′-ACTTAAAGGTGATGCTCGACGT-3′ | 43,360 |
| 11 | 5′-TCAGGTATTGTTCCATCCTCC-3′ | 42,662 | 5′-CAAGATACTAGTTGGAGAGCTGCT-3′ | 47,956 |
| 12 | 5′-CTGCAAAACCAATAGCTGTACCA-3′ | 47,220 | 5′-GGTGGTATTGCTCAACCTATTGG-3′ | 52,314 |
| 13 | 5′-AGAGTAGGTTGACCTGGGCCT-3′ | 50,678 | 5′-AGGTTATGGTGGGCTACAAGAT-3′ | 54,585 |
| 14 | 5′-TGGTCTTGTGCAGCTTGAGC-3′ | 53,477 | 5′-TATGCCCGATGATTGTTGTCCT-3′ | 58,673 |
| 15 | 5′-GACCAGAACGACCTCCACTA-3′ | 57,353 | 5′-TCTTGATGGTCGAGTTGATGCT-3′ | 62,669 |
| 16 | 5′-CACGAATACGTTGTTGCAAACCT-3′ | 60,536 | 5′-ATCGGTACTGCACTTGGTGC-3′ | 66,345 |
| 17 | 5′-ACCAGCCGTAAACATCTTTTCCA-3′ | 65,848 | 5′-AGTATTGGAGCAACAGGTGGA-3′ | 71,818 |
| 18 | 5′-AGCAGGAACAGCTTTACGAGTA-3′ | 69,175 | 5′-TTGCTAGTCTTGATGGAGATGGT-3′ | 74,902 |
| 19 | 5′-GTGGCACTTATTCAGTACCACCA-3′ | 74,161 | 5′-CAGAATTAGGCTTCCCATTGAACG-3′ | 79,628 |
| 20 | 5′-CGAAGTTTAGCAATAGGCTGCCA-3′ | 78,705 | 5′-AGGCTCTATTGGTTTGCAGGT-3′ | 83,203 |
| 21 | 5′-TAGCAAGACGCTCAGCTTCTC-3′ | 82,364 | 5′-GTTTGCTGAACGTCGTATGTTGAC-3′ | 87,950 |
| 22 | 5′-TCCATACGTTCTTCAGCTTGATTC-3′ | 86,454 | 5′-AGATGATGCTGGTGGAGAAACTT-3′ | 91,978 |
| 23 | 5′-GTCCAACCTTGCCAAGTAGGA-3′ | 91,910 | 5′-TGACCATCAGTACAGATGCGTCTA-3′ | 638 |
| 24 | 5′-AGCGTCAAGTGCTTCACTTG-3′ | 95,395 | 5′-CGAAGTCCACCATCAGCAGT-3′ | 3,167 |
Numbers in the first column correspond to the bands in Supplementary Table 1. Numbers in the Start and End columns refer to the position on the crAssphage genome.
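Because the genome is circular, primer pairs 23 and 24 have an End position smaller than their Start position: their amplicons run across the sequence origin. The Python sketch below computes expected amplicon lengths from the Start and End columns. It assumes both coordinates are inclusive positions on the 97,065 bp sequence given in the Figure 5 caption; the function name `amplicon_length` is hypothetical, and the lengths are derived arithmetic, not values reported in the source.

```python
GENOME_LENGTH = 97_065  # bp, from the Figure 5 caption


def amplicon_length(start, end, genome_length=GENOME_LENGTH):
    """Length of the amplicon from start to end on a circular genome.

    Assumes inclusive coordinates; the modulo handles amplicons that
    wrap around the origin (end < start).
    """
    return (end - start) % genome_length + 1


pair_1 = amplicon_length(1_785, 6_793)     # ordinary amplicon
pair_23 = amplicon_length(91_910, 638)     # wraps around the origin
pair_24 = amplicon_length(95_395, 3_167)   # wraps around the origin
```

Under these assumptions each of the three amplicons comes out at roughly 5 to 6 kbp, which fits the long-range PCR described in the Figure 1 caption.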