Stéphanie Bocs, Antoine Danchin, Claudine Médigue.
Abstract
BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of complete genome sequences, which allow useful comparisons with close relatives during annotation, accurate gene prediction remains difficult. A major reason is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detecting translation initiation sites and/or selecting the correct coding regions remains difficult unless appropriate biological knowledge about the structure of a gene is embedded in the approach.
Year: 2002 PMID: 11879526 PMCID: PMC77393 DOI: 10.1186/1471-2105-3-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overall strategy for the (re-)annotation of CDSs in bacterial genomes. The procedure involves four main steps (see text), the last being performed on potential New Genes having a coding probability above 0.4 (list lst-NG>=Sure-Pc), and on annotated Genes Not Found having a coding probability below 0.2 (list lst-GNF
Examples of potential New Genes integrated as new SWISS-PROT entries
| Entry | Accession | Description | Seq Length (aa) |
|---|---|---|---|
| RL21_AERPE | P58077 | | 107 |
| RL29_AERPE | P58085 | | 66 |
| RL34_AERPE | P58026 | | 95 |
| GYRB_ARCFU | O29720 | | 632 |
| EX7S_CHLTR | P58001 | PROBABLE EXODEOXYRIBONUCLEASE VII SMALL SUBUNIT (EC 3.1.11.6) (EXONUCLEASE VII SMALL SUBUNIT) | 72 |
| YD5A_METJA | P58018 | HYPOTHETICAL PROTEIN MJ135.1 | 364 |
| SECG_MYCGE | P58061 | PROBABLE | 77 |
| RL31_PYRHO | P58189 | | 95 |
| RS27_PYRHO | P58078 | | 65 |
| SUI1_PYRHO | P58193 | | 99 |
| Y56A_THEMA | P58008 | HYPOTHETICAL PROTEIN TM0562.1 | 192 |
| YB5A_THEMA | P58009 | HYPOTHETICAL PROTEIN TM1158.1 | 240 |
| YV6A_VIBCH | P58093 | HYPOTHETICAL PROTEIN VCA0360.1 | 80 |
Figure 2. Assignment of a status to some additional CDSs. A. The annotated Genes Not Found by the AMIGA method (CDSd). B. The potential AMIGA New Genes (CDSa). The procedure takes into account the length of the CDS, its coding probability, the results of a similarity search in the non-redundant protein databank, and overlaps between adjacent CDSs, one being an AMIGA CDS (CDSa) and the other a databank CDS (CDSd) (see text). Although all situations are investigated in the procedure, some paths are clearly preferred (thick arrows): for example, a CDSa of the lst-NG>=Sure-Pc list is often found with no overlap with a CDSd. In this case, the CDSa often has a length below 300 bp and either no similarity (AMBIGUOUS status) or similarity (NEW status) with proteins in the databank. If a CDSa does overlap a CDSd, the latter often has a weak coding probability and no similarity with proteins in the databank (in this case, the CDSa has the NEW status). It is therefore extremely rare to find a CDSa of the lst-NG>=Sure-Pc list overlapping a CDSd that has a strong coding probability when the overlap between the two CDSs is also large (broken arrows). For A. pernix and P. horikoshii, the threshold for the CDSd length was set to 600 bp instead of 300 bp. This choice is motivated by the annotation procedure used by the authors of the genome sequences (see text). (L) length; (Pc) coding probability; (lst-NG>=Sure-Pc) list of CDSa having a coding probability above 0.4; (lst-GNF
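
The preferred paths described in the Figure 2 caption for a potential New Gene (CDSa) can be sketched in a few lines of Python. This is only an illustration of the caption's wording, not the AMIGA implementation: the `CDS` class, the function names, and the `UNRESOLVED` placeholder are hypothetical, and treating a "weak" coding probability as below 0.2 is an assumption borrowed from the Figure 1 threshold. Cases the caption does not spell out are deliberately left undecided.

```python
from dataclasses import dataclass
from typing import Optional

SURE_PC = 0.4   # threshold named in the captions for the lst-NG>=Sure-Pc list
WEAK_PC = 0.2   # assumption: "weak" coding probability taken from Figure 1


@dataclass
class CDS:
    """Hypothetical record for a coding sequence (CDSa or CDSd)."""
    length_bp: int
    coding_prob: float
    has_similarity: bool  # hit in the non-redundant protein databank


def in_sure_list(cds_a: CDS) -> bool:
    # The list name uses ">=", so the boundary value is included here.
    return cds_a.coding_prob >= SURE_PC


def status_for_new_gene(cds_a: CDS,
                        overlapping_cds_d: Optional[CDS] = None) -> Optional[str]:
    """Return a status for a CDSa along the paths the caption describes."""
    if not in_sure_list(cds_a):
        return None  # not examined in this step
    if overlapping_cds_d is None:
        # No overlap with a databank CDS: similarity decides the status.
        return "NEW" if cds_a.has_similarity else "AMBIGUOUS"
    if (overlapping_cds_d.coding_prob < WEAK_PC
            and not overlapping_cds_d.has_similarity):
        # The overlapped databank CDS is weakly supported.
        return "NEW"
    # Rare case (broken arrows): the caption gives no outcome for it.
    return "UNRESOLVED"
```

For instance, `status_for_new_gene(CDS(240, 0.6, False))` follows the no-overlap, no-similarity path and yields the AMBIGUOUS status.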