Erika Sallet, Brice Roux, Laurent Sauviac, Marie-Francoise Jardinaud, Sébastien Carrère, Thomas Faraut, Fernanda de Carvalho-Niebel, Jérôme Gouzy, Pascal Gamas, Delphine Capela, Claude Bruand, Thomas Schiex.
Abstract
The availability of next-generation sequences of transcripts from prokaryotic organisms offers the opportunity to design a new generation of automated genome annotation tools not yet available for prokaryotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool that combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the prediction process. This enables the automated prediction of coding sequences (CDSs), untranslated regions, transcription start sites (TSSs) and non-coding RNA (ncRNA, sense and antisense) genes. EuGene-P was used to comprehensively and accurately annotate the genome of the nitrogen-fixing bacterium Sinorhizobium meliloti strain 2011, leading to the prediction of 6308 CDSs as well as 1876 ncRNAs. Among them, 1280 appeared as antisense to a CDS, which supports recent findings that antisense transcription activity is widespread in bacteria. Moreover, 4077 TSSs upstream of protein-coding or non-coding genes were precisely mapped, providing valuable data for the study of promoter regions. By looking for RpoE2-binding sites upstream of annotated TSSs, we were able to extend the S. meliloti RpoE2 regulon by ∼3-fold. Altogether, these observations demonstrate the power of EuGene-P to produce a reliable and high-resolution automatic annotation of prokaryotic genomes.
Keywords: RNA-Seq; genome annotation; prokaryotes; rhizobium
Year: 2013 PMID: 23599422 PMCID: PMC3738161 DOI: 10.1093/dnares/dst014
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
RNA-Seq libraries used for annotation
| GEO sample code | RNA samples | RNA fraction | Biological replicate number | Sequencing process | Number of unambiguously mapped reads or paired-reads |
|---|---|---|---|---|---|
| GSM1078108 | Nodule | Long | 1 | pe 2 × 54 nt | 79 339 |
| GSM1078109 | Nodule | Long | 2 | pe 2 × 54 nt | 103 025 |
| GSM1078110 | Nodule | Long | 3 | pe 2 × 54 nt | 55 825 |
| GSM1078111 | Nodule | Short | 1 | pe 2 × 54 nt | 785 009 |
| GSM1078112 | Nodule | Short | 2 | pe 2 × 54 nt | 1 503 684 |
| GSM1078113 | Nodule | Short | 3 | pe 2 × 54 nt | 1 465 610 |
| GSM1078114 | Bacteria mid-exponential phase | Long | 1 | se 1 × 50 nt | 4 158 264 |
| GSM1078115 | Bacteria mid-exponential phase | Long | 2 | se 1 × 50 nt | 4 154 232 |
| GSM1078116 | Bacteria mid-exponential phase | Long | 3 | se 1 × 50 nt | 2 873 524 |
| GSM1078117 | Bacteria mid-exponential phase | Short | 1 | pe 2 × 50 nt | 4 792 283 |
| GSM1078118 | Bacteria mid-exponential phase | Short | 2 | pe 2 × 50 nt | 5 390 729 |
| GSM1078119 | Bacteria mid-exponential phase | Short | 3 | pe 2 × 50 nt | 9 061 874 |
| GSM1078120 | Bacteria stationary phase | Long | 1 | se 1 × 50 nt | 2 102 607 |
| GSM1078121 | Bacteria stationary phase | Long | 2 | se 1 × 50 nt | 3 171 844 |
| GSM1078122 | Bacteria stationary phase | Long | 3 | se 1 × 50 nt | 2 953 260 |
| GSM1078123 | Bacteria stationary phase | Short | 1 | pe 2 × 50 nt | 11 368 031 |
| GSM1078124 | Bacteria stationary phase | Short | 2 | pe 2 × 50 nt | 5 960 882 |
| GSM1078125 | Bacteria stationary phase | Short | 3 | pe 2 × 50 nt | 5 559 756 |
All RNA samples were depleted of ribosomal RNA using the RiboMinus™ protocol and separated into short (<200 nt) and long (>200 nt) fractions. Note that nodule libraries contain a mixture of S. meliloti and M. truncatula transcriptomes. The figures given here correspond to S. meliloti sequence reads only.
pe, paired ends; se, single end.
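To get a feel for the sequencing depth available per condition, the replicate counts in the table above can be summed for each sample type and RNA fraction. The following is a minimal sketch, not part of the original work: the names `LIBRARIES` and `total_reads_per_condition` are illustrative, and the numbers are simply copied from the table.

```python
from collections import defaultdict

# (sample, RNA fraction, replicate, unambiguously mapped reads or read pairs),
# copied from the library table above.
LIBRARIES = [
    ("Nodule", "Long", 1, 79_339),
    ("Nodule", "Long", 2, 103_025),
    ("Nodule", "Long", 3, 55_825),
    ("Nodule", "Short", 1, 785_009),
    ("Nodule", "Short", 2, 1_503_684),
    ("Nodule", "Short", 3, 1_465_610),
    ("Bacteria mid-exponential phase", "Long", 1, 4_158_264),
    ("Bacteria mid-exponential phase", "Long", 2, 4_154_232),
    ("Bacteria mid-exponential phase", "Long", 3, 2_873_524),
    ("Bacteria mid-exponential phase", "Short", 1, 4_792_283),
    ("Bacteria mid-exponential phase", "Short", 2, 5_390_729),
    ("Bacteria mid-exponential phase", "Short", 3, 9_061_874),
    ("Bacteria stationary phase", "Long", 1, 2_102_607),
    ("Bacteria stationary phase", "Long", 2, 3_171_844),
    ("Bacteria stationary phase", "Long", 3, 2_953_260),
    ("Bacteria stationary phase", "Short", 1, 11_368_031),
    ("Bacteria stationary phase", "Short", 2, 5_960_882),
    ("Bacteria stationary phase", "Short", 3, 5_559_756),
]


def total_reads_per_condition(libraries):
    """Sum mapped reads over biological replicates for each (sample, fraction) pair."""
    totals = defaultdict(int)
    for sample, fraction, _replicate, reads in libraries:
        totals[(sample, fraction)] += reads
    return dict(totals)


totals = total_reads_per_condition(LIBRARIES)
for (sample, fraction), reads in totals.items():
    print(f"{sample:32s} {fraction:5s} {reads:>12,d}")
```

The totals make the depth gap visible: the nodule long-RNA libraries together hold a few hundred thousand S. meliloti reads, whereas each free-living condition holds several million.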
Figure 1. A prokaryotic genomic sequence and the corresponding annotation defined as a sequence of typed regions. Each region has a specific label (or state) that defines its type. Beyond coding regions (e.g. CDS1) and intergenic regions (IG), an annotation may identify untranslated transcribed regions at the extremities of coding transcripts (5′ and 3′ UTRs), untranslated internal regions (or UIR, between two CDSs in a transcript) and ncRNA genes. Specific region types are also used to label overlapping CDSs. In this figure, the region labelled CDS1:3 corresponds to the overlap of a CDS in frame 1 with another CDS in frame 3.
Figure 2. The different states and possible transitions between these states used inside EuGene-P.
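The idea of an annotation as a sequence of typed regions with restricted transitions can be sketched in a few lines. This is an illustration only: the `ALLOWED` table below is a hypothetical, single-strand simplification inferred from the region types named in the Figure 1 caption, not the actual EuGene-P state graph of Figure 2, and the names `Region` and `validate` are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical transition table (assumption, NOT the EuGene-P state graph):
# each label maps to the labels allowed to follow it on one strand.
ALLOWED = {
    "IG": {"5UTR", "CDS", "ncRNA"},
    "5UTR": {"CDS"},
    "CDS": {"UIR", "3UTR", "IG", "CDS_OVERLAP"},
    "CDS_OVERLAP": {"CDS"},   # e.g. the CDS1:3 overlap region of Figure 1
    "UIR": {"CDS"},           # untranslated internal region between two CDSs
    "3UTR": {"IG"},
    "ncRNA": {"IG"},
}


@dataclass(frozen=True)
class Region:
    label: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def validate(regions, genome_length):
    """Return a list of problems; empty if the regions tile the sequence
    without gaps or overlaps and every transition is in ALLOWED."""
    if not regions:
        return ["no regions"]
    problems = []
    if regions[0].start != 1:
        problems.append("annotation does not start at position 1")
    if regions[-1].end != genome_length:
        problems.append("annotation does not end at the last position")
    for region in regions:
        if region.start > region.end:
            problems.append(f"empty region {region.label}")
    for prev, cur in zip(regions, regions[1:]):
        if cur.start != prev.end + 1:
            problems.append(f"gap or overlap between {prev.label} and {cur.label}")
        if cur.label not in ALLOWED.get(prev.label, set()):
            problems.append(f"transition {prev.label} -> {cur.label} not allowed")
    return problems


# A two-gene transcript: 5' UTR, CDS, internal untranslated region, CDS, 3' UTR.
operon = [
    Region("IG", 1, 100), Region("5UTR", 101, 130), Region("CDS", 131, 430),
    Region("UIR", 431, 450), Region("CDS", 451, 900), Region("3UTR", 901, 950),
    Region("IG", 951, 1000),
]
broken = [Region("IG", 1, 100), Region("3UTR", 101, 200)]

print(validate(operon, 1000))
print(validate(broken, 200))
```

Strand, reading frame and genome circularity are deliberately left out of this sketch; a real gene finder would also score each labelling rather than merely accept or reject it.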
Figure 3. Graphical representation of a genomic region in Apollo. Apollo represents the annotation on both strands (upper and lower part of the figure) as well as the expression level of the mapped RNA-Seq data from short and long RNA libraries in exponential and stationary growth phase conditions. Reads mapped on the plus strand are shown in colour, and reads mapped on the minus strand are in grey. The Y-axis represents the number of reads summed from triplicates. The upper limit was set at 300 reads. This region contains several annotated non-coding (in green) and protein-coding (in blue) genes; full blue squares correspond to CDSs and open blue squares correspond to 5′ and 3′ UTRs.
Modifications performed on the automatic annotation during the manual curation process
| Type | 5′ ends | 3′ ends | CDS starts | CDS stops | Creations | Removals | Total number of modified genes |
|---|---|---|---|---|---|---|---|
| Coding genes | 350 | 275 | 135 | 2 | 19 | 194 | 835 |
| Non-coding genes | 31 | 151 | – | – | 180 | 252 | 604 |
Structural annotation of the S. meliloti 2011 genome
| Feature | Number |
|---|---|
| CDSs (total number) | 6308 |
| New (when compared with Sm1021) | 125 |
| tRNAs | 55 |
| tRNA primary transcripts | 28 |
| rRNAs | 9 |
| ncRNAs | 1876 |
| Antisense to a protein-coding gene | 1281 |
| TSSs (total number) | 4840 |
| Predicted with high confidence | 4077 |
| Predicted with low confidence | 763 |
| Insertion sequences | 94 |
| Repeated elements | 618 |
| RIME | 209 |
| MOTIF | 256 |
| Sm-1 repeat | 21 |
| Sm-2 repeat | 8 |
| Sm-3 repeat | 4 |
| Sm-4 repeat | 73 |
| Sm-5 repeat | 47 |
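The sub-counts in this table can be cross-checked against their totals. The sketch below assumes that the two confidence rows are a breakdown of the TSS total and that the RIME, MOTIF and Sm-1 to Sm-5 rows are a breakdown of the repeated elements; that reading is an inference from the table layout. Variable names are illustrative.

```python
# Counts copied from the structural annotation table above.
tss_total, tss_high, tss_low = 4840, 4077, 763
repeats_total = 618
repeat_families = {
    "RIME": 209, "MOTIF": 256, "Sm-1": 21, "Sm-2": 8,
    "Sm-3": 4, "Sm-4": 73, "Sm-5": 47,
}
ncrna_total, ncrna_antisense = 1876, 1281

# Assumed breakdowns: each group of sub-rows should add up to its total row.
assert tss_high + tss_low == tss_total
assert sum(repeat_families.values()) == repeats_total

# Share of predicted ncRNAs that lie antisense to a protein-coding gene.
antisense_share = ncrna_antisense / ncrna_total
print(f"antisense ncRNAs: {antisense_share:.1%} of all ncRNAs")
```

Both breakdowns add up, and roughly two thirds of the predicted ncRNAs are antisense to a protein-coding gene.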
Congruence between TSS annotation and the published literature
| Fraction of annotated TSS^a matching: | Experimentally mapped TSS^b | |
|---|---|---|
| RpoD | 22/27 | 63/89 |
| RpoH1 and/or RpoH2 | 45/67 | 49/69 |
| RpoE1 and/or RpoE4 | 3/4 | – |
| RpoE2 | 1/1 | 29/35 |
| RpoN | 3/4 | 5/6^c |
All mapped or predicted promoter sequences are available in Supplementary Table S8.
^a Genes for which no TSS was annotated in the current study were not retained for this table.
^b Data extracted from [67,85,86] (RpoD), [65] (RpoH1/H2), [39,40,70] (RpoE2), [66] (RpoE1/E4), [87–94] (RpoN).
^c As the coordinates of RpoN TSS predicted by Dombrecht et al. [94] were not described in their paper, we kept the promoters carrying the most obvious −24/−12 RpoN-binding sequences.
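The match fractions in the congruence table are easier to compare as percentages. The following sketch converts each cell and pools the rows per column; the pooled figure is only a rough summary added here, not a statistic reported in the source, and the names `CONGRUENCE`, `percent` and `pooled` are illustrative. Columns are referred to by position (0 for the first data column, 1 for the second).

```python
# (matched, total) pairs copied from the congruence table; None marks the "–" cell.
CONGRUENCE = {
    "RpoD": ((22, 27), (63, 89)),
    "RpoH1 and/or RpoH2": ((45, 67), (49, 69)),
    "RpoE1 and/or RpoE4": ((3, 4), None),
    "RpoE2": ((1, 1), (29, 35)),
    "RpoN": ((3, 4), (5, 6)),
}


def percent(pair):
    """Convert a (matched, total) pair to a percentage, or None for an empty cell."""
    if pair is None:
        return None
    matched, total = pair
    return 100.0 * matched / total


def pooled(column):
    """Sum matched and total counts over all rows that have a value in this column."""
    pairs = [row[column] for row in CONGRUENCE.values() if row[column] is not None]
    return sum(m for m, _ in pairs), sum(t for _, t in pairs)


for sigma, row in CONGRUENCE.items():
    cells = ["n/a" if p is None else f"{percent(p):.0f}%" for p in row]
    print(f"{sigma:20s} {cells[0]:>5s} {cells[1]:>5s}")

for column in (0, 1):
    matched, total = pooled(column)
    print(f"column {column}: {matched}/{total} = {100.0 * matched / total:.0f}%")
```

Pooling treats every sigma factor as equally reliable, which the differing sample sizes (1 to 89 TSSs per cell) do not really justify, so the per-row percentages remain the more informative view.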