Allison Piovesan1, Maria Caracausi1, Francesca Antonaros1, Maria Chiara Pelleri2, Lorenza Vitale1.
Abstract
We release GeneBase 1.1, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared to its predecessor GeneBase (1.0), GeneBase 1.1 now allows dynamic calculation and summarization in terms of median, mean, standard deviation and total for many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). GeneBase 1.1 thus offers the opportunity to perform analyses of the main gene structure parameters also following the search for any set of genes with the desired characteristics, allowing unique functionalities not provided by the NCBI Gene itself. In order to show the potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimations strongly underestimating the length of human genes, a 'mean' human protein-coding gene is 67 kbp long, has eleven 309 bp long exons and ten 6355 bp long introns. Median, mean and extreme values are provided for many other features offering an updated reference source for human genome studies, data useful to set parameters for bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves.
Database URL: http://apollo11.isto.unibo.it/software/
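The four summary measures named in the abstract (median, mean, standard deviation and total) can be illustrated with a minimal Python sketch. This is not how GeneBase 1.1 computes them: the function name `summarize` and the example lengths are hypothetical, and the use of the sample standard deviation is an assumption, since the record does not say whether the sample or population formula was used.

```python
from statistics import mean, median, stdev


def summarize(lengths):
    """Return median, mean, sample SD and total for a list of lengths in bp."""
    return {
        "median": median(lengths),
        "mean": mean(lengths),
        "sd": stdev(lengths),  # assumption: sample SD (n - 1 denominator)
        "total": sum(lengths),
    }


# Hypothetical gene lengths in bp, chosen for illustration only;
# they are not taken from GeneBase or NCBI Gene.
example_lengths = [1200, 26000, 54000, 310000]
summary = summarize(example_lengths)
```

With a single very long gene in the list, the mean is pulled well above the median, which is the same kind of skew the gene length table shows (median 26 288 bp against a mean of 66 577 bp for protein-coding genes).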
Year: 2016 PMID: 28025344 PMCID: PMC5199132 DOI: 10.1093/database/baw153
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. (A) Gene type composition of GeneBase 1.1 Human entries for a total of 59 801 genes and (B) for 22 451 ‘REVIEWED’ or ‘VALIDATED’ genes with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript (genes not in current annotation release are excluded). Gene type labels are derived from the ‘Gene_Type’ field of the GeneBase 1.1 Human ‘Gene_Summary’ table as annotated in NCBI Gene as follows: protein-coding, pseudo (pseudogenes), ncRNA (non-coding RNA), snoRNA (small nucleolar RNA), snRNA (small nuclear RNA), rRNA (ribosomal RNA), tRNA (transfer RNA), ‘other’ and ‘unknown’.
Figure 2. Number of ‘REVIEWED’ or ‘VALIDATED’ genes with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript in GeneBase 1.1 Human (genes not in current annotation release are excluded) divided into protein-coding genes, pseudogenes and non-coding genes (which include genes for ribosomal RNAs, small nucleolar RNAs, small nuclear RNAs and non-coding RNAs) for each human chromosome. See Table 1 and Supplementary Table S2 for more details.
Human chromosome lengths and number of genes
| Chromosome | Length (Mb) | Number of genes per chromosome in NCBI Genome | Number of genes per chromosome in GeneBase 1.1 without selection | Number of ‘REVIEWED’ and ‘VALIDATED’ genes per chromosome in GeneBase 1.1 Human |
|---|---|---|---|---|
| 1 | ||||
| 2 | 242.19 | 7746 | 4215 | 1455 |
| 3 | 198.30 | 5855 | 3287 | 1255 |
| 4 | 190.21 | 4905 | 2649 | 865 |
| 5 | 181.54 | 5006 | 2795 | 1063 |
| 6 | 170.81 | 5843 | 3364 | 1205 |
| 7 | 159.35 | 5426 | 3010 | 1059 |
| 8 | 145.14 | 4292 | 2383 | 807 |
| 9 | 138.39 | 4670 | 2509 | 879 |
| 10 | 133.80 | 4385 | 2379 | 897 |
| 11 | 135.09 | 5571 | 3235 | 1304 |
| 12 | 133.28 | 4942 | 2731 | 1140 |
| 13 | 114.36 | 2880 | 1537 | |
| 14 | 107.04 | 3965 | 2220 | 675 |
| 15 | 101.99 | 3791 | 2081 | 716 |
| 16 | 90.34 | 3763 | 2206 | 902 |
| 17 | 83.26 | 4686 | 2694 | 1293 |
| 18 | 80.37 | |||
| 19 | 58.62 | 4480 | 2694 | 1416 |
| 20 | 64.44 | 2600 | 1435 | 631 |
| 21 | ||||
| 22 | 50.82 | 504 | ||
| X | 156.04 | 3690 | 2416 | 938 |
| Y | 57.23 | |||
| Total | 3088.27 | 50 704 | 59 245 | 22 448 |
Mb: megabase.
These columns show numbers reported at http://www.ncbi.nlm.nih.gov/genome accessed on 19 January 2016 (when NCBI Gene entries were also downloaded for parsing and import in GeneBase 1.1).
This column shows the number of gene entries with ‘REVIEWED’ or ‘VALIDATED’ RefSeq status, with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript, excluding ‘not in current annotation release’ records in the ‘Gene_Summary’ table of GeneBase (Methods; Supplementary Methods File).
The remaining 556 genes (to reach the total of 59 801 genes) are 65 genes with an unknown chromosome and 491 genes with an empty ‘Chromosome’ field.
The three remaining genes (to reach the total of 22 451 genes) are three pseudogenes with unknown location (Gene IDs: 100233156, 283788, 389834).
Bold: minimum and maximum values for each column. Underlined: the three smallest numbers of genes per chromosome, beyond chrY.
See Supplementary Table S2 for more details on gene types.
Known human nuclear gene numbers and lengths
| | Protein-coding genes | Non-coding genes |
|---|---|---|
| Total number of entries | 18 255 | 4196 |
| Median number per chr | 737 per chr | 175 per chr |
| Mean number per chr | 761 per chr | 175 per chr |
| SD | 400 per chr | 75 per chr |
| Minimum number | 44 (chrY) | 40 (chrY) |
| Maximum number | 1876 (chr1) | 383 (chr1) |
| Median length | 26 288 bp | 11 155 bp |
| Mean length | 66 577 bp | 33 803 bp |
| SD | 130 398 bp | 68 302 bp |
| Shortest | 189 bp ( | 60 bp ( |
| Longest | 2 473 559 bp ( | 1 033 350 bp ( |
| Total length | 1 215 363 666 bp | 141 838 888 bp |
SD: standard deviation; chr: chromosome; bp: base pair.
We considered only protein-coding or non-coding genes with ‘REVIEWED’ or ‘VALIDATED’ RefSeq status, with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript, excluding ‘not in current annotation release’ records in the ‘Gene_Summary’ and ‘Genes’ tables of GeneBase 1.1 Human as explained in the ‘Methods’ section and in the Supplementary Methods File.
The mean, minimum and maximum gene numbers are also available in Supplementary Table S2.
Figure 3. Exon (A) and intron (B) length distributions considering GeneBase 1.1 Human ‘Gene_Table’ records with a ‘VALIDATED’ or ‘REVIEWED’ RefSeq status, with an ‘NM_’ (protein-coding RNAs, continuous lines) or ‘NR_’ (non-coding RNAs, dotted lines) type of corresponding RefSeq RNA accession number, belonging to ‘REVIEWED’ or ‘VALIDATED’ genes excluding those not in current annotation release.
Human protein-coding transcript (mRNA), exon and intron numbers and lengths
| | | mRNAs | Exons | Coding Exons | Introns |
|---|---|---|---|---|---|
| Number | Total of entries | 37 608 | 412 641 | 384 289 | 375 033 |
| non-redundant: 147 484 | non-redundant: 138 736 | non-redundant: 134 497 | |||
| Median | 4.0 | 8.0 | 8.0 | 7.0 | |
| per gene | per transcript | per transcript | per transcript | ||
| Mean | 5.4 | 11.0 | 10.2 | 10.0 | |
| per gene | per transcript | per transcript | per transcript | ||
| SD | 5.4 | 9.9 | 9.9 | 8.9 | |
| per gene | per transcript | per transcript | per transcript | ||
| Min | 1 | 1 | 1 | 1 | |
| (3984 genes) | (948 transcripts, 942 genes) | (2560 transcripts, 1856 genes) | (1724 transcripts, 1441 genes) | ||
| Max | 28 | 363 | 362 | 362 | |
| ( | ( | ( | ( | ||
| Length | Median | 2787 bp | 133 bp non-redundant: 141 bp | 122 bp | 1632 bp |
| 126 bp | non-redundant: 122 bp | non-redundant: 1710 bp | |||
| Mean | 3392 bp | 309 bp non-redundant: 365 bp | 163 bp | 6355 bp | |
| 161 bp | non-redundant: 171 bp | non-redundant: 6990 bp | |||
| SD | 2600 bp | 725 bp non-redundant: 810 bp | 256 bp | 20 649 bp | |
| 214 bp | non-redundant: 290 bp | non-redundant: 23 493 bp | |||
| Shortest | 186 bp | 2 bp | 1 bp | 30 bp | |
| ( | ( | ( | ( | ||
| Longest | 109 224 bp | 24 927 bp | 21 693 bp | 1 160 411 bp | |
| ( | ( | ( | ( | ||
| Total | 127 583 379 bp | 127 583 379 bp | 62 554 408 bp | 2 383 497 318 bp | |
| non-redundant: 53 827 863 bp | non-redundant: 23 698 355 bp | non-redundant: 940 173 183 bp |
SD: standard deviation; min: minimum; max: maximum; chr: chromosome; bp: base pair.
We considered only protein-coding genes with ‘REVIEWED’ or ‘VALIDATED’ RefSeq status, with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript, excluding ‘not in current annotation release’ records in GeneBase 1.1 Human software. Non-coding RNAs produced by a protein-coding gene are excluded by selecting only transcripts with an ‘NM_’ RNA accession number type (Methods and Supplementary Methods File). In particular, mRNA data were derived from the ‘Genes’ and ‘Transcripts’ tables, while exon and intron data were derived from the ‘Gene_Table’ and ‘Reports’ tables. Exon and intron non-redundant sets were obtained by counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms. A comprehensive analysis including both non-coding and protein-coding genes is available in Supplementary Table S3.
In this column, numbers and lengths are shown considering only the protein-coding portion of exons, including stop codons.
These values were calculated excluding 37 608 records corresponding to the last exon, which is usually the longest one (Supplementary Figure S2). (Values considering last exons only: 1264 bp median, 1782 bp mean, 1709 bp SD. Values considering non-redundant last exons only: 1177 bp median, 1716 bp mean, 1683 bp SD.)
Exon and intron minimum lengths are adjusted following manual curation as described (6) because the starting database contains some artefactual data.
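The ‘mean’ protein-coding gene quoted in the abstract (eleven 309 bp exons and ten 6355 bp introns, about 67 kbp) can be reconstructed from the mean values in the table above. The short Python check below is only a rough sketch: multiplying mean counts by mean lengths is not guaranteed to reproduce a mean gene length, and the exon and intron means are computed per transcript rather than per gene, so close agreement with the tabulated values is all that should be expected. The variable names are illustrative.

```python
# Mean values taken from the protein-coding transcript table above.
MEAN_EXONS_PER_TRANSCRIPT = 11    # 11.0 exons per transcript
MEAN_INTRONS_PER_TRANSCRIPT = 10  # 10.0 introns per transcript
MEAN_EXON_LENGTH_BP = 309
MEAN_INTRON_LENGTH_BP = 6355

# Total exonic and intronic sequence of a gene built from these means.
exonic_bp = MEAN_EXONS_PER_TRANSCRIPT * MEAN_EXON_LENGTH_BP
intronic_bp = MEAN_INTRONS_PER_TRANSCRIPT * MEAN_INTRON_LENGTH_BP
reconstructed_gene_bp = exonic_bp + intronic_bp

print(exonic_bp, intronic_bp, reconstructed_gene_bp)
```

The reconstructed gene is 66 949 bp long, close to the 66 577 bp mean length reported for protein-coding genes, and its 3399 bp of exonic sequence is close to the 3392 bp mean mRNA length.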
Human non-coding RNA, exon and intron numbers and lengths
| | | Non-coding RNAs | Exons | Introns |
|---|---|---|---|---|
| Number | Total of entries | 7933 | 47 521 | 39 588 |
| non-redundant: 24 783 | non-redundant: 21 297 | |||
| Median | 1 per gene | 4.0 | 3.0 | |
| per transcript | per transcript | |||
| Mean | 1.3 per gene | 6.0 | 5.0 | |
| per transcript | per transcript | |||
| SD | 0.9 per gene | 5.8 per transcript | 4.8 per transcript | |
| Min | 1 (4728 genes) | 1 | 1 | |
| (554 transcripts and genes) | (908 transcripts, 792 genes) | |||
| Max | 52 ( | 51 | 50 | |
| ( | ( | |||
| Length | Median | 1787 bp | 141 bp non-redundant: 153 bp | 1855 bp |
| 125 bp | non-redundant: 2089 bp | |||
| Mean | 2168 bp | 362 bp non-redundant: 405 bp | 7897 bp | |
| 177 bp | non-redundant: 9873 bp | |||
| SD | 2037 bp | 863 bp non-redundant: 979 bp | 22 390 bp | |
| 237 bp | non-redundant: 26 807 bp | |||
| Shortest | 60 bp | 4 bp | 31 bp | |
| ( | ( | ( | ||
| Longest | 91 671 bp | 91 671 bp | 499 303 bp | |
| ( | ( | ( | ||
| Total | 17 200 281 bp | 17 200 281 bp | 312 610 461 bp | |
| non-redundant: 10 040 755 bp | non-redundant: 210 269 817 bp |
SD: standard deviation; min: minimum; max: maximum; chr: chromosome; bp: base pair.
We considered only genes with ‘REVIEWED’ or ‘VALIDATED’ RefSeq status, with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript, excluding ‘not in current annotation release’ records in GeneBase 1.1 Human software. Here only transcripts with an ‘NR_’ RNA accession number type (Methods and Supplementary Methods File) are selected, corresponding to 5890 genes (since non-coding RNAs can also be transcribed from a protein-coding gene). In particular, transcript data were derived from the ‘Genes’ and ‘Transcripts’ tables, while exon and intron data were derived from the ‘Gene_Table’ and ‘Reports’ tables. Exon and intron non-redundant sets were obtained by counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms. A comprehensive analysis including both non-coding and protein-coding genes is available in Supplementary Table S3.
These values were calculated excluding 7933 records corresponding to the last exon, which is usually the longest one (Supplementary Figure S2). (Values considering last exons only: 821 bp median, 1287 bp mean, 1775 bp SD. Values considering non-redundant last exons only: 676 bp median, 1161 bp mean, 1851 bp SD.)
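The ‘non-redundant’ figures in the two tables above count each exon or intron once even when it appears in several transcript isoforms. A minimal Python sketch of that idea follows. It assumes that two exons are ‘the same’ when they share chromosome, strand, start and end coordinates; the record does not state the exact criterion used, and the `Exon` type, the function names and the example coordinates are all hypothetical.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    transcript: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def non_redundant(exons):
    """Keep one exon per (chrom, strand, start, end) group, in first-seen order."""
    seen = {}
    for exon in exons:
        key = (exon.chrom, exon.strand, exon.start, exon.end)
        seen.setdefault(key, exon)  # later duplicates are ignored
    return list(seen.values())


def total_length(exons):
    """Sum of exon lengths in bp, using inclusive coordinates."""
    return sum(e.end - e.start + 1 for e in exons)


# Two hypothetical isoforms that share their first exon.
exons = [
    Exon("tx1", "chr1", "+", 100, 199),
    Exon("tx1", "chr1", "+", 300, 449),
    Exon("tx2", "chr1", "+", 100, 199),
    Exon("tx2", "chr1", "+", 500, 549),
]
unique_exons = non_redundant(exons)
```

In this toy example the redundant set has 4 exons totalling 400 bp, while the non-redundant set has 3 exons totalling 300 bp, which mirrors why the non-redundant totals in the tables are much smaller than the per-transcript totals.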
Human mRNA region numbers and lengths
| | 5′ UTR | CDS | 3′ UTR |
|---|---|---|---|
| Median length | 203 bp | 1278 bp | 938 bp |
| Mean length | 259 bp | 1663 bp | 1470 bp |
| SD | 228 bp | 1901 bp | 1620 bp |
| Shortest | 0 bp | 75 bp ( | 0 bp |
| Longest | 14 705 bp ( | 107 976 bp ( | 24 505 bp ( |
| Total length | 9 740 061 bp | 62 554 408 bp | 55 288 737 bp |
SD: standard deviation; UTR: untranslated region; CDS: coding DNA sequence; bp: base pair; chr: chromosome.
We considered only genes with ‘REVIEWED’ or ‘VALIDATED’ RefSeq status, with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript, excluding ‘not in current annotation release’ records in GeneBase 1.1 Human software. In particular, data were derived from the ‘Transcripts’ table. Here only transcripts with an ‘NM_’ RNA accession number type (Methods and Supplementary Methods File) are selected. 5′ and 3′ UTR minimum lengths are subject to the quality of the RefSeq annotation.
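Because an mRNA consists of its 5′ UTR, CDS and 3′ UTR, the mean lengths of the three regions should add up to the mean mRNA length when all are computed over the same transcripts. The small Python check below uses the rounded means from the table above; the variable names are illustrative, and rounding could in principle introduce a difference of a base pair or two.

```python
# Mean region lengths in bp, taken from the mRNA region table above.
MEAN_5UTR_BP = 259
MEAN_CDS_BP = 1663
MEAN_3UTR_BP = 1470

# The mean of a sum equals the sum of the means, so this should match
# the mean mRNA length reported for protein-coding transcripts.
mean_mrna_bp = MEAN_5UTR_BP + MEAN_CDS_BP + MEAN_3UTR_BP

print(mean_mrna_bp)
```

The sum is 3392 bp, which matches the mean mRNA length of 3392 bp reported in the protein-coding transcript table.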