Allison Piovesan1, Maria Caracausi1, Marco Ricci2, Pierluigi Strippoli1, Lorenza Vitale3, Maria Chiara Pelleri1.
Abstract
We have developed GeneBase, a full parser of the National Center for Biotechnology Information (NCBI) Gene database, which generates a fully structured local database with an intuitive, user-friendly graphical interface for personal computers. Features of all annotated eukaryotic genes are accessible through three main software tables, which include for each entry details such as the gene summary, the gene exon/intron structure and the specific Gene Ontology attributions. The structuring of the data, the creation of additional calculation fields and the integration with nucleotide sequences allow users to make many types of comparisons and calculations that are useful for data retrieval and analysis. We provide an original example analysis of the existing introns across all available species, through which the classic biological problem of the 'minimal intron' may find a solution using available data. Based on all currently available data, we can define the shortest known eukaryotic GT-AG intron length, setting the physical limit at the 30 base pair intron belonging to the human MST1L gene. This 'model intron' will shed light on the minimal recognition elements required for conventional splicing to function. Remarkably, this size is consistent with the sum of the splicing consensus sequence lengths.
Keywords: NCBI Gene; computational biology; gene data parsing; minimal intron; personal computer software
Year: 2015 PMID: 26581719 PMCID: PMC4675715 DOI: 10.1093/dnares/dsv028
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Flow diagram illustrating the data parsing involved in the GeneBase development. ‘Gene_Summary’, ‘Gene_Table’ and ‘Gene_Ontology’ are the three main related software tables.
Figure 2. Screenshot of the GeneBase ‘Gene_Table’ interface. Fields are described in detail in the software documentation. Exon and coding exon sequences are in distinct fields to allow independent management of the data (see Materials and methods). This figure is available in black and white in print and in colour at DNA Research online.
Statistical analysis of exon and intron lengths
| | Number of records (a) | Mean length (bp) | Standard deviation (bp) | Minimum length (bp) (b) | Maximum length (bp) |
|---|---|---|---|---|---|
| Exons | 1,396,026 | 308 | 613 | 2 | 91,671 |
| Coding exons | 1,252,462 | 206 | 325 | 2 | 27,708 |
| Introns | 1,219,806 | 3,820 | 15,693 | 30 | 1,160,411 |
The analysis was carried out considering only ‘Gene_Table’ records belonging to gene entries with a ‘REVIEWED’ or ‘VALIDATED’ RefSeq status and with an ‘NM_’ or ‘NR_’ type of corresponding RefSeq RNA accession number, omitting entries marked as ‘not in current annotation release’ in the ‘Genome_Annotation_Status’ field (see Materials and methods). Mean and standard deviation values were obtained from the ‘Reports’ database table calculation fields. Lengths are given in base pairs (bp).
(a) Common exons and introns belonging to multiple transcript variants are counted multiple times. The existence of intronless genes and the fact that terminal exons are not followed by an intron account for the lower number of introns in comparison with exons.
(b) Minimum exon and intron length determination is subject to the annotation artifacts described in the text, so only the manually verified data are shown here.
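The relationship between exon and intron counts noted in footnote (a) can be illustrated with a minimal sketch. It assumes a hypothetical record layout (1-based inclusive `(start, end)` exon coordinates on one strand, non-overlapping and non-adjacent), which is not GeneBase's actual schema. Under that assumption a transcript with n exons yields n − 1 introns, so the difference between the exon and intron record counts in the table would correspond to the number of transcript records; this is an inference from the footnote, not a figure reported in the source.

```python
from typing import List, Tuple

# Hypothetical layout: (start, end), 1-based and inclusive.
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of a single transcript.

    Assumes the exons do not overlap and do not abut each other.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def length(interval: Interval) -> int:
    """Length in bp of an inclusive interval."""
    start, end = interval
    return end - start + 1


# Toy transcript with three exons (coordinates invented for illustration).
toy_exons = [(101, 250), (281, 400), (1001, 1200)]
toy_introns = introns_from_exons(toy_exons)
print(toy_introns, [length(i) for i in toy_introns])

# Record counts taken from the table above.
exon_records = 1_396_026
intron_records = 1_219_806
print("exon records minus intron records:", exon_records - intron_records)
```

The last terminal exon of each transcript has no following intron, which is why the list comprehension pairs each exon only with its successor.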
Statistical analysis of intron length of some representative organisms
| Organism (Taxonomy ID) | Number of introns | Mean length (bp) | Standard deviation (bp) | Minimum length (bp) (a) | Maximum length (bp) |
|---|---|---|---|---|---|
| | 124,533 | 169 | 194 | — | 11,602 |
| | 107,605 | 324 | 803 | 39 | 100,913 |
| | 434 | 1,883 | 5,592 | 35 | 53,400 |
| | 58,480 | 1,657 | 5,841 | 40 | 189,627 |
| | 4,174 | 2,849 | 7,646 | — | 160,644 |
| | 10,306 | 2,281 | 5,578 | — | 170,685 |
| | 14,706 | 2,856 | 9,719 | — | 351,090 |
| | 85,507 | 5,622 | 20,369 | — | 1,041,985 |
| H. sapiens | 155,222 | 7,386 | 24,002 | 30 | 1,160,411 |
The analysis was carried out considering a set of non-redundant ‘Gene_Table’ records belonging to gene entries with a ‘REVIEWED’ or ‘VALIDATED’ RefSeq status and with an ‘NM_’ or ‘NR_’ type of corresponding RefSeq RNA accession number, omitting entries marked as ‘not in current annotation release’ in the ‘Genome_Annotation_Status’ field (see Materials and methods). Mean and standard deviation values were obtained from the ‘Reports’ database table calculation fields. Lengths are given in base pairs (bp).
(a) Minimum length determination is subject to the annotation artifacts described in the text, so only the manually verified data for C. elegans, T. castaneum, D. melanogaster and H. sapiens are shown here.
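The record selection described in the table notes can be sketched as a small filter. The class and attribute names below are hypothetical stand-ins for the corresponding ‘Gene_Table’ fields, the example records are invented, and treating "non-redundant" as "one record per organism and genomic interval" is an assumption, since the exact deduplication rule is given in Materials and methods rather than here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IntronRecord:
    # Hypothetical field names; GeneBase's real schema may differ.
    refseq_status: str
    rna_accession: str
    genome_annotation_status: str
    tax_id: int
    chromosome: str
    start: int
    end: int


def passes_filter(rec: IntronRecord) -> bool:
    """Apply the three criteria stated in the table notes."""
    return (
        rec.refseq_status in {"REVIEWED", "VALIDATED"}
        and rec.rna_accession.startswith(("NM_", "NR_"))
        and rec.genome_annotation_status != "not in current annotation release"
    )


def non_redundant(records: Iterable[IntronRecord]) -> List[IntronRecord]:
    """Keep the first passing record per (organism, chromosome, start, end).

    Assumption: introns shared by several transcript variants have
    identical coordinates and should be counted once.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec.tax_id, rec.chromosome, rec.start, rec.end)
        if passes_filter(rec) and key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Invented example records (accessions and coordinates are placeholders).
examples = [
    IntronRecord("REVIEWED", "NM_0000000.0", "current", 1, "1", 100, 129),
    IntronRecord("REVIEWED", "NM_0000001.0", "current", 1, "1", 100, 129),
    IntronRecord("PROVISIONAL", "NM_0000002.0", "current", 1, "1", 500, 520),
    IntronRecord("VALIDATED", "XM_0000003.0", "current", 1, "2", 10, 60),
    IntronRecord("VALIDATED", "NR_0000004.0",
                 "not in current annotation release", 1, "2", 70, 90),
]
print(len(non_redundant(examples)))
```

Only the first example survives: the second duplicates its interval, and the remaining three each fail one of the stated criteria.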
Number of retrieved GeneBase records and corresponding intron lengths
| Intron length queried (bp) | Number of retrieved records | | | | | |
|---|---|---|---|---|---|---|
| | REVIEWED | VALIDATED | PREDICTED | PROVISIONAL | INFERRED | EMPTY |
| 1–10 | 21 | 5 | 2 | 2,822 | 1 | 954 |
| 11–20 | 22 | 1 | 4 | 8,250 | 1 | 7,291 |
| 21–30 | 63 | 13 | 37 | 32,487 | 2 | 4,269 |
| 31–40 | 442 | 20 | 38 | 58,242 | 8 | 3,947 |
| Total | 548 | 39 | 81 | 101,801 | 12 | 16,461 |
Lengths are given in base pairs (bp). The total number of ‘Gene_Table’ records with a currently annotated intron length between 1 and 40 bp is 118,942. Common introns belonging to multiple transcript variants are counted multiple times.
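As a consistency check on the table, the per-status totals and the overall figure of 118,942 records can be recomputed from the individual rows. The snippet below is only an arithmetic sketch over the values printed in the table; the dictionary layout is an illustrative choice.

```python
# Counts copied from the table, keyed by queried intron length range (bp).
counts = {
    "1-10":  {"REVIEWED": 21,  "VALIDATED": 5,  "PREDICTED": 2,
              "PROVISIONAL": 2_822,  "INFERRED": 1, "EMPTY": 954},
    "11-20": {"REVIEWED": 22,  "VALIDATED": 1,  "PREDICTED": 4,
              "PROVISIONAL": 8_250,  "INFERRED": 1, "EMPTY": 7_291},
    "21-30": {"REVIEWED": 63,  "VALIDATED": 13, "PREDICTED": 37,
              "PROVISIONAL": 32_487, "INFERRED": 2, "EMPTY": 4_269},
    "31-40": {"REVIEWED": 442, "VALIDATED": 20, "PREDICTED": 38,
              "PROVISIONAL": 58_242, "INFERRED": 8, "EMPTY": 3_947},
}

# "Total" row and overall count as stated in the table and its note.
stated_totals = {"REVIEWED": 548, "VALIDATED": 39, "PREDICTED": 81,
                 "PROVISIONAL": 101_801, "INFERRED": 12, "EMPTY": 16_461}
stated_grand_total = 118_942

# Sum each status column across the four length ranges.
column_totals = {
    status: sum(row[status] for row in counts.values())
    for status in stated_totals
}
grand_total = sum(column_totals.values())

print(column_totals == stated_totals, grand_total == stated_grand_total)

# Share of short-intron records with a REVIEWED or VALIDATED status.
curated = column_totals["REVIEWED"] + column_totals["VALIDATED"]
print(f"{curated} of {grand_total} records ({curated / grand_total:.2%})")
```

The REVIEWED and VALIDATED columns together account for 587 records, under 1% of all records with an annotated intron length of 1 to 40 bp.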
Minimal introns validated through bioinformatic analysis (one representative intron for each available organism; see Supplementary Tables S1, S2 and S3 for more details)
| Gene symbol (Gene ID) | Organism (Taxonomy ID) | RefSeq RNA accession number | Intron number | Position (a) | Intron length (bp) | Previous exon 3′ | Intron sequence 5′-3′ (b) | Following exon 5′ | GenBank accession numbers (c) |
|---|---|---|---|---|---|---|---|---|---|
| MST1L | H. sapiens | NM_001271733.1 | 9 | CDS | 30 | gca | gtgagtccctggtgctcccggccccgccag ****** ***** | g | AY192149.1 |
| | | NM_001162528.1 | 8 | CDS | 35 | aag | gtaaaaatctaatcacatacccacccccattcaag **** ****:** ** | c | EU937812.1 |
| | | NM_070722.3 | 2 | CDS | 39 | caa | gtacgttttgagaaatatattttattcaatgaatcatag *** ** *** :** * ** | a | CK586343.1 CK586324.1 CK586322.1 CK579517.1 |
| | | NM_176410.3 | 2 | CDS | 40 | cag | gtgagctcaaagccaacaaagtcagccatcgtcttatcag ***** ** * :** ***** | a | X14459.1 BT021332.1 |
(a) CDS: coding sequence.
(b) Bold: donor (5′) and acceptor (3′) splice sites. Underline: possible pyrimidine-rich region. Colon: possible branch point. Asterisks: eukaryotic conserved bases (eukaryotic consensus sequence taken as reference: MAGGTRAGT…YNYYRAYY…YYYYYYYYYYYNYCAGG, where M = A/C, R = A/G, Y = C/T, N = any base).[16,24]
(c) Some example independent RNA sequences encompassing the intron and thus validating its existence.
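The table entries lend themselves to two simple checks, sketched below: that each intron sequence has the stated length and GT-AG boundaries, and how many bases of the reference consensus fall inside the intron. The second part rests on a labeled assumption about where the exon/intron boundaries sit within the consensus (the leading `MAG` of the donor motif and the final `G` of the acceptor motif are taken as exonic); the source only gives the consensus string itself. The script prints the resulting sum so it can be compared with the 30 bp minimal intron.

```python
# Intron sequences copied from the table, keyed by RefSeq RNA accession.
MINIMAL_INTRONS = {
    "NM_001271733.1": (30, "gtgagtccctggtgctcccggccccgccag"),
    "NM_001162528.1": (35, "gtaaaaatctaatcacatacccacccccattcaag"),
    "NM_070722.3": (39, "gtacgttttgagaaatatattttattcaatgaatcatag"),
    "NM_176410.3": (40, "gtgagctcaaagccaacaaagtcagccatcgtcttatcag"),
}


def check_intron(stated_length: int, seq: str) -> dict:
    """Compare the sequence with its stated length and GT-AG boundaries."""
    return {
        "length_ok": len(seq) == stated_length,
        "gt_ag": seq.startswith("gt") and seq.endswith("ag"),
    }


for accession, (stated_length, seq) in MINIMAL_INTRONS.items():
    print(accession, check_intron(stated_length, seq))

# Reference consensus elements as given in footnote (b).
DONOR = "MAGGTRAGT"
BRANCH = "YNYYRAYY"
ACCEPTOR = "YYYYYYYYYYYNYCAGG"

# Assumption: "MAG" lies in the upstream exon, the last "G" in the
# downstream exon, so they do not count towards the intron length.
EXONIC_DONOR_BASES = 3
EXONIC_ACCEPTOR_BASES = 1

intronic_consensus_length = (
    (len(DONOR) - EXONIC_DONOR_BASES)
    + len(BRANCH)
    + (len(ACCEPTOR) - EXONIC_ACCEPTOR_BASES)
)
print("intronic consensus bases:", intronic_consensus_length)
```

Counting only intronic positions is a design choice made for this sketch; including the exonic flanking bases would give a larger figure that describes the full recognition context, not the intron alone.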