Cheng-Han Chung, Michael H Walter, Luobin Yang, Shu-Chuan Grace Chen, Vern Winston, Michael A Thomas.
Abstract
BACKGROUND: Most tailed bacteriophages (phages) feature linear dsDNA genomes. Characterizing novel phages requires an understanding of complete genome sequences, including the definition of the physical ends of the genome. RESULTS: We sequenced 48 Bacillus cereus phage isolates and analyzed next-generation sequencing (NGS) data to resolve the genome configuration of these novel phages. Most assembled contigs featured reads that mapped to both contig ends and formed circularized contigs. Independent assemblies of 31 nearly identical I48-like Bacillus phage isolates showed that the assembly programs tended to cut circularized contigs at random positions. However, currently available assemblers were not capable of reporting the underlying phage genome configuration from sequence data. To identify the genome configuration of sequenced phages in silico, a terminus prediction method was developed based on 'neighboring coverage ratios' and 'read edge frequencies' derived from read alignment files. Termini were confirmed by primer walking and supported by phylogenetic inference of large DNA terminase protein sequences.
Keywords: Bacteriophage; Direct terminal repeat; Genome packaging mechanisms; Neighboring coverage ratio; Phage genome configuration; Read edge frequency; Terminus prediction
Year: 2017 PMID: 28472946 PMCID: PMC5418689 DOI: 10.1186/s12864-017-3744-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Illustration of the two major characteristics of phage genome sequencing data used for terminus prediction: neighboring coverage ratio (NCR) and read edge frequency. Phage I12 is used as an example of how NCRs are selected as potential boundaries of terminal repeats. Each dot represents the log-transformed NCR at a given nucleotide position with a 100-nucleotide window size. The two horizontal dashed lines mark the thresholds of 1.8 and the reciprocal of 1.8. NCRs greater than 1.8 or less than the reciprocal of 1.8 are collected as a subset of hits (green dots). Within this subset, hits for which at least one window has a coverage exceeding 1.8 times the genome coverage are considered significant hits (blue dots). Finally, the local highest and local lowest significant hits are considered potential boundaries of terminal repeats (red dots). a The whole-contig NCR of the I12 isolate. b The NCR between nucleotide positions 68,500 and 72,000. c Every mapped read has one coordinate at its 5′ end (5′ read edge position) and one at its 3′ end (3′ read edge position). The counts at every read edge position were used as one of the indicators for terminus prediction
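The selection process described in the Fig. 1 caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the NCR at a position is the mean coverage of the 100-nt window starting at that position divided by the mean coverage of the 100-nt window ending just before it, that "genome coverage" is the mean per-base coverage of the contig, and that positions are 0-based list indices. The function names `significant_hits` and `potential_boundaries` are hypothetical.

```python
from itertools import accumulate

def significant_hits(coverage, window=100, threshold=1.8):
    """Return [(position, NCR)] for positions that pass both filters.

    Assumption: NCR = mean coverage of the window to the right of the
    position divided by the mean coverage of the window to its left.
    """
    prefix = [0] + list(accumulate(coverage))   # prefix sums for fast window means
    genome_cov = prefix[-1] / len(coverage)     # mean coverage over the contig
    hits = []
    for pos in range(window, len(coverage) - window + 1):
        left = (prefix[pos] - prefix[pos - window]) / window
        right = (prefix[pos + window] - prefix[pos]) / window
        if left == 0:
            continue                            # ratio undefined without left coverage
        ratio = right / left
        outside = ratio > threshold or ratio < 1 / threshold
        high = max(left, right) > threshold * genome_cov
        if outside and high:
            hits.append((pos, ratio))
    return hits

def _extreme(run):
    """Pick the highest ratio of a rising run or the lowest of a falling run."""
    pick = max if run[0][1] > 1 else min
    return pick(run, key=lambda hit: hit[1])

def potential_boundaries(hits):
    """Group adjacent significant hits and keep the local extreme of each group."""
    boundaries, run = [], []
    for pos, ratio in hits:
        new_run = run and (pos != run[-1][0] + 1 or (ratio > 1) != (run[-1][1] > 1))
        if new_run:
            boundaries.append(_extreme(run))
            run = []
        run.append((pos, ratio))
    if run:
        boundaries.append(_extreme(run))
    return boundaries

# Synthetic example: a 300-nt region at 2.5x the background coverage.
coverage = [100] * 1000 + [250] * 300 + [100] * 1000
print(potential_boundaries(significant_hits(coverage)))
```

On the synthetic coverage profile the two reported positions are the start and the end of the high-coverage region, which is the behavior the caption describes for terminal repeat boundaries.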
Genome similarity among representative isolates
| | | Q8 | SBP8a | Q11 |
|---|---|---|---|---|
| I48 | Query Coverage; Identities | 1700/158180; 72% | 151518/158819; 94% | 70/26005; 81% |
| | Overall Identity | 0.774% | 89.679% | 0.002% |
| Q8 | Query Coverage; Identities | | 27614/158819; 75% | 57/26005; 89% |
| | Overall Identity | | 13.040% | 0.196% |
| SBP8a | Query Coverage; Identities | | | 367/26005; 94% |
| | Overall Identity | | | 1.327% |
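Several "Overall Identity" entries in the table above are consistent with multiplying the query coverage fraction by the percent identity (for example, 151518/158819 × 94% ≈ 89.679%). The record does not state this formula, so the sketch below is an inference from the numbers, and the function name `overall_identity` is hypothetical.

```python
def overall_identity(aligned_length, query_length, percent_identity):
    """Overall identity in percent, assumed to be coverage fraction x percent identity."""
    return aligned_length / query_length * percent_identity

# I48 vs SBP8a, using the values from the table
print(round(overall_identity(151518, 158819, 94), 3))
```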
Fig. 2 Map of the coverage distribution, neighboring coverage ratio (NCR) and read edge frequencies of phage isolate I13. a Illustration of the hypothetical genome configuration of I13 with terminal repeats. Filled squares indicate the direct terminal repeat of the phage genome. b Coverage distribution over the I13 sequence contig. The lower dashed line represents the average coverage of the I13 sequencing reads. The upper dashed line represents 1.8 times the average coverage. c NCR over the I13 sequence contig with window size = 100 bp. The dashed lines indicate the NCR cut-offs of 1.8 and the reciprocal of 1.8 after base-2 logarithmic transformation [−0.848, 0.848]. d 5′ and 3′ read edge frequencies from the I13 sequencing reads. Filled black squares indicate the frequencies of 5′ read edge positions. Open triangles indicate the frequencies of 3′ read edge positions
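The read edge frequencies plotted in panel d can be tallied from alignment coordinates roughly as follows. This is a sketch under stated assumptions: each alignment is given as a hypothetical `(leftmost, rightmost)` pair of contig coordinates, the leftmost coordinate is counted as the 5′ read edge and the rightmost as the 3′ read edge, and strand handling (which this record does not describe) is ignored. The names `read_edge_frequencies` and `top_edge` are illustrative.

```python
from collections import Counter

def read_edge_frequencies(alignments):
    """Count how often each contig position is a 5' or a 3' read edge.

    `alignments` is an iterable of (leftmost, rightmost) contig coordinates;
    treating leftmost as the 5' edge is an assumption of this sketch.
    """
    five_prime, three_prime = Counter(), Counter()
    for leftmost, rightmost in alignments:
        five_prime[leftmost] += 1
        three_prime[rightmost] += 1
    return five_prime, three_prime

def top_edge(counts):
    """Return (position, frequency) of the most frequent edge, or None if empty."""
    return counts.most_common(1)[0] if counts else None

# Toy alignments: three reads start at position 10 and three end at position 59.
reads = [(10, 59), (10, 80), (12, 59), (10, 59)]
five, three = read_edge_frequencies(reads)
print(top_edge(five), top_edge(three))
```

The most frequent 5′ and 3′ edge positions returned by `top_edge` correspond to the REP and REF columns of the terminus prediction summary table.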
Summary of terminus prediction on selected isolates in this study and nine published phages
| Phage | Sequencer | # reads | Coverage | Contig size | Contig form | 5′ hit (a) | 5′ NCR (b) | 5′ REP (c) | 5′ REF (d) | 3′ hit (e) | 3′ NCR (f) | 3′ REP (g) | 3′ REF (h) | 5′ terminus flanking sequence | 3′ terminus flanking sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Novel phages | | | | | | | | | | | | | | | |
| SBP8a | | 38455 | 87.27 | 158794 | Circular | | 1.874 | | 117 | | 0.159 | | 238 | 5′-TCAGGTAGAA | AGAAAAACCT-3′ |
| SBP8a | | 4662176 | 4670.87 | 158822 | Circular | 29634 | 1.889 | 30078 | 2434 | 32372 | 0.526 | 29983 | 2455 | N/A | N/A |
| I13 (i) | | 186893 | 345.92 | 157905 | Circular | | 2.036 | | 110 | | 0.42 | | 274 | 5′-AAACCGTATG | AGAAAAACCT-3′ |
| I48 | | 2975178 | 4042.41 | 157912 | Circular | 72862 | 1.819 | 73296 | 1332 | 74149 | 0.524 | 73201 | 1261 | N/A | N/A |
| Q8 | | 72799 | 115.53 | 158180 | Circular | | 68.497 | | 965 | no hit | N/A | 156836 | 114 | 5′-AGGTTTTGTG | N/A |
| Q11 | | 2411528 | 15285.66 | 26005 | Linear | no hit | N/A | 15118 | 6717 | no hit | N/A | 15126 | 6090 | N/A | N/A |
| Published phages with suggested packaging mechanism | | | | | | | | | | | | | | | |
| Direct terminal repeat | | | | | | | | | | | | | | | |
| Adelynn | | 250419 | 215.43 | 162356 | Circular | | 3.346 | | 261 | | 0.499 | | 171 | 5′-GGGTTTTTAT | CCGCCTACCC-3′ |
| Nigalana | | 35451 | 98.06 | 160174 | Circular | | 2.296 | | 121 | | 0.542 | | 95 | 5′-AGGTTTTTCT | CGTTCTACCT-3′ |
| Troll | | 32949 | 29.17 | 163019 | Circular | no hit | N/A | 62795 | 7 | no hit | N/A | 43962 | 9 | N/A | N/A |
| Circular permutation | | | | | | | | | | | | | | | |
| Breeniome | | 138039 | 60.10 | 154434 | Circular | no hit | N/A | 26061 | 17 | no hit | N/A | 48207 | 19 | N/A | N/A |
| Teardrop | | 9807 | 27.05 | 155389 | Circular | no hit | N/A | 10666 | 26 | | 0.497 | | 28 | N/A | CCGCTCCGTT-3′ |
| Zeenon | | 203150 | 179.03 | 155292 | Circular | no hit | N/A | 139104 | 23 | no hit | N/A | 10008 | 22 | N/A | N/A |
| 3′ overhangs | | | | | | | | | | | | | | | |
| Equemioh13 | | 150088 | 394.34 | 53042 | Circular | | 4.803 | | 276 | no hit | N/A | 41030 | 191 | 5′-TGCGGCCGCC | N/A |
| Zetzy | | 9157 | 80.92 | 48463 | Linear | | 3.321 | | 116 | no hit | N/A | 34586 | 70 | 5′-CCTGTGCGCC | N/A |
| Lilith | | 215716 | 668.86 | 50827 | Circular | no hit | N/A | 5180 | 89 | no hit | N/A | 3846 | 113 | N/A | N/A |
The numbers in bold font indicate the significant hits of potential termini
(a) 5′ hit: nucleotide position of the significant NCR hit at the 5′ boundary of the high-coverage region
(b) 5′ NCR: the ratio at the 5′ significant NCR hit
(c) 5′ REP: 5′ read edge position with the highest frequency
(d) 5′ REF: read edge frequency at the 5′ REP
(e) 3′ hit: nucleotide position of the significant NCR hit at the 3′ boundary of the high-coverage region
(f) 3′ NCR: the ratio at the 3′ significant NCR hit
(g) 3′ REP: 3′ read edge position with the highest frequency
(h) 3′ REF: read edge frequency at the 3′ REP
(i) I13 is one of the I48-like isolates
Comparison of terminal positions predicted from NGS data with those identified by primer walking
| Name | Sequencer | Contig size | Predicted 5′ ter. (a) | Predicted 3′ ter. (b) | Primer walking 5′ ter. | Distance (c) | 5′ terminus flanking sequence | Primer walking 3′ ter. | Distance (c) | 3′ terminus flanking sequence | Length of DTR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SBP8a | | 158794 | 111794 | 114616 | 111796 | +2 | 5′-AGGTAGAACG | 114616 | 0 | AGAAAAACCT-3′ | 2821 |
| I13 (d) | | 157905 | 111610 | 114359 | 111537 | −73 | 5′-AGGTAGAACG | 114359 | 0 | AGAAAAACCT-3′ | 2823 |
| I22 (d) | | 157889 | 106421 | 109166 | 106348 | −73 | 5′-AGGTAGAACG | 109169 | +3 | AGAAAAACCT-3′ | 2822 |
| Q8 | | 158180 | 156549 | 5025 (g) | 156549 | 0 | 5′-AGGTTTTGTG | 5099 | +74 | GGGTCTACCC-3′ | 6731 |
| Q10 (e) | | 158174 | 86402 | 93056 (g) | 86402 | 0 | 5′-AGGTTTTGTG | 93130 | +74 | GGGTCTACCC-3′ | 6729 |
| Q11 | | 26005 | N/A | N/A | −9 (h) | N/A | 5′-AAAATGTAAA | +43 (h) | N/A | ATATACATTT-3′ | N/A |
| I46 (f) | | 24896 | N/A | N/A | −740 (h) | N/A | 5′-AAAATGTAAA | +421 (h) | N/A | ATATACATTT-3′ | N/A |

Predicted termini are from NGS data; primer walking termini were identified by primer walking sequencing
(a) 5′ ter.: nucleotide coordinate of the 5′ terminus
(b) 3′ ter.: nucleotide coordinate of the 3′ terminus
(c) Distance: position difference relative to the prediction
(d) I13 and I22 are I48-like isolates
(e) Q10 is one of the Q8-like isolates
(f) I46 is one of the Q11-like isolates
(g) Locating the physical ends of Q8-like strains was assisted by designing primers several hundred bases upstream of the common predicted hits
(h) Positions outside the contig sequence were calculated relative to the coordinates of the contig sequence from NGS data. A negative coordinate lies 5′ upstream of the first bp of the contig; a positive coordinate lies 3′ downstream of the last bp of the contig
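The "distance" and "length of DTR" columns of the comparison table can be reproduced with simple coordinate arithmetic. The sketch below assumes 1-based inclusive coordinates and a circular contig in which a repeat may wrap past the last base (as the Q8 row suggests); both assumptions are inferred from the table values rather than stated in this record, and the function names are hypothetical.

```python
def distance_to_prediction(identified, predicted):
    """Signed offset of the primer-walking terminus from the NGS prediction."""
    return identified - predicted

def dtr_length(five_ter, three_ter, contig_size):
    """Length of the direct terminal repeat between two termini.

    Assumes 1-based inclusive coordinates on a circular contig, so a repeat
    whose 3' terminus lies before its 5' terminus wraps around the contig end.
    """
    if three_ter >= five_ter:
        return three_ter - five_ter + 1
    return (contig_size - five_ter + 1) + three_ter

# SBP8a: repeat lies inside the contig
print(dtr_length(111796, 114616, 158794))
# Q8: repeat wraps around the end of the circularized contig
print(dtr_length(156549, 5099, 158180))
```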
Fig. 3 Maximum likelihood phylogeny of large terminase amino acid sequences. The protein sequence alignment was generated with ClustalW2 [65]. The phylogeny was reconstructed using the maximum likelihood method based on the Poisson correction model. Numbers next to internal nodes indicate the bootstrap value divided by the trial size of 1000. Phage names are shown at the tips of the phylogeny. The root was chosen arbitrarily for visualization purposes. Arrows: three novel Bacillus phages, SBP8a, I48 and Q8. *, +, &: nine phages with suggested types of genome terminus