Lizhi Zhou1, Hai Yu1, Kaihang Wang2, Tingting Chen2, Yue Ma2, Yang Huang2, Jiajia Li2, Liqin Liu2, Yuqian Li2, Zhibo Kong2, Qingbing Zheng1, Yingbin Wang1, Ying Gu1,2, Ningshao Xia1,2, Shaowei Li3,4.
Abstract
BACKGROUND: The Escherichia coli ER2566 strain (NZ_CP014268.2) was developed as a BL21(DE3) derivative strain and has been widely used for recombinant protein expression. However, like many other current RefSeq annotations, the annotation of the ER2566 strain was incomplete, with missing gene names and miscellaneous RNAs, as well as uncorrected annotations of some pseudogenes. Here, we performed a systematic reannotation of the ER2566 genome by combining multiple annotation tools with manual revision to provide a comprehensive understanding of the E. coli ER2566 strain, and used high-throughput sequencing to explore how the strain adapted under external pressure.
Keywords: Engineer bacteria; Escherichia coli ER2566; Genome reannotation; Transcriptome sequencing
Year: 2020 PMID: 32546194 PMCID: PMC7296898 DOI: 10.1186/s12864-020-06818-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Flowchart depicting the pipeline and methods used for bacterial genome reannotation of the E. coli strain ER2566
Fig. 2 Examples of the differences between the original RefSeq annotation and our reannotation. a In the reannotation, one pseudogene (RS16270) was identified as two genes, insA and insB, which show strong homology to the insertion element protein, IS1. b In the reannotation, two pseudogenes were re-identified as two genes (lacZ1 and lacZ2), whereas the hypothetical protein was reannotated and shown to be highly homologous with the DNA-directed RNA polymerase gene ECBD_2906 from E. coli strain BL21-DE3
Fig. 3 Comparison between the BL21(DE3) genome and the ER2566 genome. From the outside in, the two outermost rings, representing the plus strand and the minus strand respectively, show features extracted from the BL21(DE3) genome GenBank file (GenBank: CP001509.3); the next ring shows the positions of BLAST hits between the BL21(DE3) genome and the ER2566 genome detected by Blastn. The height of each line in this third ring is proportional to the percent identity of the hit, and overlapping hits render as darker lines. The next two rings show GC content and GC skew
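The two innermost rings in Fig. 3 plot GC content and GC skew. As a minimal sketch of what those tracks represent, the following uses the conventional definitions: GC content as (G + C) / length and GC skew as (G - C) / (G + C), computed over sliding windows. The function names, the window size, and the step are illustrative assumptions; the source does not state which tool or window settings produced the figure.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """Conventional GC skew, (G - C) / (G + C); 0.0 when there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def window_stats(seq: str, window: int, step: int):
    """Yield (start, gc_content, gc_skew) for each full window.

    `window` and `step` are hypothetical parameters chosen by the caller;
    trailing bases that do not fill a whole window are skipped.
    """
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        yield start, gc_content(chunk), gc_skew(chunk)
```

On a real genome one would pass the full chromosome sequence and a window of a few kilobases; the toy sizes used in testing are only there to make the arithmetic easy to follow.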
Overview of the differences between the original annotation, the reannotation and BL21(DE3) annotation
| | Original annotation (NZ_CP014268.2) | Reannotation | BL21(DE3) |
|---|---|---|---|
| Genome length | 4,478,958 bp | 4,478,958 bp | 4,558,953 bp |
| Plasmids | None | None | None |
| G + C% | 50.81% | 50.81% | 50.83% |
| Genes (total) | 4364 | 4627 | 4440 |
| Protein-coding genes | 4054 | 4202 | 4197 |
| Pseudogenes | 194 | 71 | 70 |
| tRNAs | 85 | 87 | 85 |
| rRNAs | 22 | 22 | 22 |
| Miscellaneous RNAs (a) | 9 | 245 | 66 |
| Backbone genes | 4170 (4054 protein-coding genes, 85 tRNA genes, 22 rRNAs and 9 misc. RNAs) | 4556 (4202 protein-coding genes, 87 tRNA genes, 22 rRNAs and 245 misc. RNAs) | 4370 (4197 protein-coding genes, 85 tRNA genes, 22 rRNAs and 66 misc. RNAs) |
a: The concept of miscellaneous RNA includes ncRNA, tmRNA and all other ncRNAs
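The counts in the table above can be cross-checked arithmetically. The sketch below copies the per-category counts from the table and confirms that each "Backbone genes" figure equals the sum of protein-coding genes, tRNAs, rRNAs and miscellaneous RNAs. It also checks that total genes equal backbone genes plus pseudogenes; that second relation is an observation about the numbers rather than something the source states. All names in the code are illustrative.

```python
# Counts copied from the annotation comparison table.
ANNOTATIONS = {
    "Original annotation": {
        "total": 4364, "protein_coding": 4054, "pseudo": 194,
        "tRNA": 85, "rRNA": 22, "misc_RNA": 9, "backbone": 4170,
    },
    "Reannotation": {
        "total": 4627, "protein_coding": 4202, "pseudo": 71,
        "tRNA": 87, "rRNA": 22, "misc_RNA": 245, "backbone": 4556,
    },
    "BL21(DE3)": {
        "total": 4440, "protein_coding": 4197, "pseudo": 70,
        "tRNA": 85, "rRNA": 22, "misc_RNA": 66, "backbone": 4370,
    },
}


def backbone_total(counts: dict) -> int:
    """Sum the four categories listed in the 'Backbone genes' row."""
    return (counts["protein_coding"] + counts["tRNA"]
            + counts["rRNA"] + counts["misc_RNA"])


def is_consistent(counts: dict) -> bool:
    """True if the backbone row and the total row both add up.

    The 'total = backbone + pseudogenes' relation is an assumption
    inferred from the table values, not stated in the source.
    """
    return (backbone_total(counts) == counts["backbone"]
            and counts["backbone"] + counts["pseudo"] == counts["total"])


for name, counts in ANNOTATIONS.items():
    print(name, backbone_total(counts), is_consistent(counts))
```

Seen this way, the reannotation's larger gene total comes mainly from the 236 additional miscellaneous RNAs (245 versus 9) and the 148 additional protein-coding genes (4202 versus 4054), partly offset by 123 fewer pseudogenes (71 versus 194).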
Fig. 4 Flowchart of variant calling, combining read mapping and de novo assembly
Fig. 5 RNA-seq for variant calling under pressure from overexpression. a) The experimental design. Each group (B37, without plasmid; Y37, with pTO-T7 plasmid overexpressed) had three biological replicates. b) Visualization of BAM files of B37 (left panel) and Y37 (right panel) in the Integrative Genomics Viewer. Based on the reannotation, one mutation was identified at position 1,094,824 (C > T), located within the 3′ non-coding region of the transcription factor gene lacI. c) Mutation detected by Sanger sequencing of the B37 and Y37 genomic samples
Statistical analysis of RNA-seq data
| Sample | Run | Raw sequence reads | Unidentified reads (a) | HPV16L1 reads (b) | E. coli reads (b) |
|---|---|---|---|---|---|
| B37 | 1 | 22,781,394 (100%) | 250,596 (1%) | 0 (0%) | 22,334,273 (99%) |
| B37 | 2 | 22,763,848 (100%) | 318,694 (1%) | 0 (0%) | 22,312,577 (99%) |
| B37 | 3 | 29,884,996 (100%) | 597,700 (2%) | 0 (0%) | 29,276,302 (98%) |
| Y37 | 1 | 28,292,578 (100%) | 4,526,812 (16%) | 12,731,660 (45%) | 11,034,106 (39%) |
| Y37 | 2 | 25,214,878 (100%) | 4,286,528 (17%) | 11,346,696 (45%) | 9,581,654 (38%) |
| Y37 | 3 | 22,521,416 (100%) | 3,378,212 (15%) | 10,134,638 (45%) | 9,008,566 (40%) |
a: Three biological replicates of each sample were analyzed. Unidentified RNA-seq reads could include unidentified nucleotides (Ns), short reads, low-quality reads, unaligned recombinant protein reads and a larger number of mRNAs from plasmids. b: HPV16L1 reads and E. coli reads are reads assigned to human papillomavirus and Escherichia coli, respectively
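The percentages reported for the Y37 replicates follow directly from the read counts. As a minimal sketch, the code below copies the three Y37 rows from the table, expresses each category as a whole-number percentage of raw reads, and checks that the three categories account for every raw read. The dictionary layout and function names are illustrative assumptions, not part of the original analysis pipeline.

```python
# Read counts copied from the Y37 rows of the RNA-seq table.
Y37_RUNS = {
    1: {"raw": 28_292_578, "unidentified": 4_526_812,
        "hpv16l1": 12_731_660, "ecoli": 11_034_106},
    2: {"raw": 25_214_878, "unidentified": 4_286_528,
        "hpv16l1": 11_346_696, "ecoli": 9_581_654},
    3: {"raw": 22_521_416, "unidentified": 3_378_212,
        "hpv16l1": 10_134_638, "ecoli": 9_008_566},
}


def read_percentages(run: dict) -> dict:
    """Each read category as a rounded percentage of raw reads."""
    raw = run["raw"]
    return {name: round(100 * count / raw)
            for name, count in run.items() if name != "raw"}


def categories_sum_to_raw(run: dict) -> bool:
    """True if unidentified + HPV16L1 + E. coli reads equal raw reads."""
    return run["unidentified"] + run["hpv16l1"] + run["ecoli"] == run["raw"]


for run_id, run in Y37_RUNS.items():
    print(run_id, read_percentages(run), categories_sum_to_raw(run))
```

In all three Y37 replicates, HPV16L1 reads account for about 45% of raw reads, exceeding the 38 to 40% assigned to E. coli.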