Allison Piovesan1, Francesca Antonaros1, Lorenza Vitale1, Pierluigi Strippoli1, Maria Chiara Pelleri2, Maria Caracausi1.
Abstract
OBJECTIVE: A well-known limitation of genome browsers is that the large amount of genome and gene data is not organized as a searchable database, which hampers full management of numerical data and free calculations. Because the data deposited in genomic repositories increase continuously, periodic revision and analysis of their content is recommended. Using GeneBase, a software tool with a graphical interface that imports and processes National Center for Biotechnology Information (NCBI) Gene database entries, we provide tabulated spreadsheets, updated to 2019, of the human nuclear protein-coding gene data set, ready to be used for any type of analysis of genes, transcripts and gene organization.
Keywords: Gene statistics; Human genes; Protein-coding genes
Year: 2019 PMID: 31164174 PMCID: PMC6549324 DOI: 10.1186/s13104-019-4343-8
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Number and length of known human nuclear protein-coding genes and protein-coding transcripts (mRNAs)
| | Protein-coding genes (a) | mRNAs (b) |
|---|---|---|
| Number | | |
| Total entries | 19,116 | 49,632 |
| Median | N/A | N/A |
| Mean | Per chr: 797 | N/A |
| SD | N/A | N/A |
| Min | chrY: 47; chr21: 228 | N/A |
| Max | chr1: 1952 | N/A |
| Length | | |
| Median | 26,018 bp | 2938 bp |
| Mean | 66,646 bp | 3522 bp |
| SD | 131,781 bp | 2557 bp |
| Shortest | 189 bp ( | 186 bp ( |
| Longest | 2,473,592 bp ( | 109,224 bp ( |
| Total | 1,274,002,474 bp | 174,797,813 bp |
SD: standard deviation; chr: chromosome; min: minimum; max: maximum; bp: base pair
(a) Values for protein-coding genes were calculated with Excel functions in the Genes.xlsx file, which contains data exported from the GeneBase “Genes” and “Gene_Summary” tables (records retrieved by searching for the nuclear protein-coding gene type with a REVIEWED or VALIDATED gene RefSeq status and a REVIEWED or VALIDATED transcript RefSeq status, excluding records annotated as “not in current annotation release”). The min and max numbers of genes per chr were derived with the filter function in the Excel Genes.xlsx file. The mean number per chr was calculated by dividing the total number of genes by 24 (22 autosomes, chrX and chrY).
(b) Values were calculated with Excel functions in the Transcripts.xlsx file, which contains data exported from the GeneBase “Transcripts” table (records with a VALIDATED or REVIEWED RefSeq status and an “NM_” type RefSeq RNA accession number, belonging to genes with a VALIDATED or REVIEWED RefSeq status, excluding “not in current annotation release” records). Gene locations were retrieved manually from the GeneBase “Gene_Summary” table. N/A: not applicable.
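The means in the table above can be cross-checked against the reported totals. The following is a minimal sketch in Python, not the authors' Excel workflow; the constants are copied from the table, while the helper name `mean_half_up` and the assumption that the published means were rounded half up are illustrative choices.

```python
from decimal import Decimal, ROUND_HALF_UP

# Totals copied from the table above.
TOTAL_GENES = 19_116
TOTAL_MRNAS = 49_632
TOTAL_GENE_LENGTH_BP = 1_274_002_474
TOTAL_MRNA_LENGTH_BP = 174_797_813
CHROMOSOMES = 24  # 22 autosomes + chrX + chrY, as stated in the table notes


def mean_half_up(total: int, count: int) -> int:
    """Return total / count rounded to the nearest integer, halves rounded up.

    Python's built-in round() rounds halves to the nearest even number,
    so round(796.5) gives 796; Decimal with ROUND_HALF_UP avoids that.
    (Hypothetical helper; the rounding rule is an assumption.)
    """
    quotient = Decimal(total) / Decimal(count)
    return int(quotient.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


mean_genes_per_chr = mean_half_up(TOTAL_GENES, CHROMOSOMES)
mean_gene_length_bp = mean_half_up(TOTAL_GENE_LENGTH_BP, TOTAL_GENES)
mean_mrna_length_bp = mean_half_up(TOTAL_MRNA_LENGTH_BP, TOTAL_MRNAS)

print(mean_genes_per_chr)   # 797
print(mean_gene_length_bp)  # 66646
print(mean_mrna_length_bp)  # 3522
```

Under these assumptions the three results agree with the tabulated means (797 genes per chr, 66,646 bp and 3522 bp).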
Number and length of human exons and introns in protein-coding transcripts
| | Exons (E) | Coding exons (a) | Introns (I) |
|---|---|---|---|
| Number | | | |
| Total entries | 562,164 | 512,303 | 512,530 |
| Total non-redundant entries | 159,652 | 151,285 | 148,092 |
| Median per transcript | 9.0 | 8.0 | 8.0 |
| Mean per transcript | 11.3 | 10.3 | 10.3 |
| SD per transcript | 9.6 | 9.6 | 8.6 |
| Min per transcript | 1 (1074 transcripts; 1068 genes) | 1 (3157 transcripts; 2117 genes) | 1 (1960 transcripts; 1572 genes) |
| Max per transcript | 363 ( | 362 ( | 362 ( |
| Length | | | |
| Median | 131 bp; not last (b): 124 bp | 120 bp | 1747 bp |
| Median non-redundant | 142 bp; not last (b): 130 bp | 121 bp | 1742 bp |
| Mean | 311 bp; not last (b): 159 bp | 160 bp | 6938 bp |
| Mean non-redundant | 371 bp; not last (b): 177 bp | 171 bp | 7397 bp |
| SD | 744 bp; not last (b): 205 bp | 254 bp | 22,163 bp |
| SD non-redundant | 828 bp; not last (b): 242 bp | 293 bp | 24,263 bp |
| Shortest | 2 bp ( | 1 bp (e.g., | 26 bp ( 30 bp ( |
| Longest | 27,303 bp ( | 21,693 bp ( | 1,160,411 bp ( |
| Total | 174,797,813 bp | 82,144,360 bp | 3,555,747,074 bp |
| Total non-redundant | 59,281,518 bp | 25,840,698 bp | 1,095,434,245 bp |
The median, mean, SD, min and max numbers of exons or coding exons per transcript were calculated with Excel functions in the Transcripts.xlsx file (containing data exported from the GeneBase “Transcripts” table, i.e. records with a VALIDATED or REVIEWED RefSeq status and an “NM_” type RefSeq RNA accession number, belonging to genes with a VALIDATED or REVIEWED RefSeq status, excluding “not in current annotation release” records). The number of introns per transcript was estimated as (number of exons − 1). The minimum number of introns per transcript was found after excluding mono-exonic genes. The number of genes with one exon can be retrieved by filtering Excel rows for Exons_per_RNA equal to 1, copying the retrieved gene symbols into a new sheet and applying the Excel “Advanced Filter” option “Unique records only”. The number of genes with one intron can be found with the same procedure, filtering Excel rows for Exons_per_RNA greater than 1.
Length values were calculated with Excel functions in the Gene_Table.xlsx file, which contains data exported from the GeneBase “Gene_Table” table (retrieved as above). When calculations were performed on filtered data, the Excel “AGGREGATE” function was used. The non-redundant exon and intron sets were obtained by counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms, i.e. by filtering for Excel rows containing “Yes” in the corresponding Non_Redundant column. Where “non-redundant” is not specified, values were calculated on the total number of entries. The total number of entries was calculated in the Gene_Table.xlsx file with the Excel “Count number” function for each column containing length_bp values, filtering to select non-redundant entries when indicated. The total length of each feature was calculated in the Gene_Table.xlsx file with the Excel “Sum” function for each column, filtering to select non-redundant entries when indicated.
SD: standard deviation; min: minimum; max: maximum; chr: chromosome; bp: base pair
(a) In this column, numbers and lengths refer only to the protein-coding portion of exons, including stop codons.
(b) These values were calculated excluding records corresponding to the last exon, which is usually the longest one, by filtering for Excel rows not containing “Yes” in the Last_Exon column.
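The filtering steps described in the notes to this table can be sketched outside Excel. The following Python example is an illustration under stated assumptions, not the authors' procedure: the rows are invented toy data, and the names `Gene_Symbol`, `Exon_Length_bp` and `summarize` are hypothetical. Only `Exons_per_RNA`, `Non_Redundant` and `Last_Exon` are column names taken from the notes.

```python
from statistics import mean, median

# Invented toy rows standing in for Transcripts.xlsx (one row per transcript).
transcripts = [
    {"Gene_Symbol": "GENE_A", "Exons_per_RNA": 1},
    {"Gene_Symbol": "GENE_A", "Exons_per_RNA": 1},  # second isoform, same gene
    {"Gene_Symbol": "GENE_B", "Exons_per_RNA": 2},
    {"Gene_Symbol": "GENE_C", "Exons_per_RNA": 5},
]

# Filter Exons_per_RNA equal to 1, then keep unique gene symbols
# (the role played by "Unique records only" in the Excel procedure).
mono_exonic = [t for t in transcripts if t["Exons_per_RNA"] == 1]
genes_with_one_exon = {t["Gene_Symbol"] for t in mono_exonic}

# Introns per transcript = exons - 1, computed after excluding mono-exonic rows.
multi_exonic = [t for t in transcripts if t["Exons_per_RNA"] > 1]
introns_per_transcript = [t["Exons_per_RNA"] - 1 for t in multi_exonic]
min_introns = min(introns_per_transcript)

# Invented toy rows standing in for Gene_Table.xlsx (one row per exon entry).
exon_rows = [
    {"Exon_Length_bp": 100, "Non_Redundant": "Yes", "Last_Exon": ""},
    {"Exon_Length_bp": 100, "Non_Redundant": "", "Last_Exon": ""},  # same exon, other isoform
    {"Exon_Length_bp": 150, "Non_Redundant": "Yes", "Last_Exon": ""},
    {"Exon_Length_bp": 900, "Non_Redundant": "Yes", "Last_Exon": "Yes"},
]


def summarize(rows):
    """Count, total, median and mean of exon lengths for the given rows."""
    lengths = [r["Exon_Length_bp"] for r in rows]
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
    }


all_exons = summarize(exon_rows)
non_redundant = summarize([r for r in exon_rows if r["Non_Redundant"] == "Yes"])
not_last = summarize([r for r in exon_rows if r["Last_Exon"] != "Yes"])

print(len(mono_exonic), len(genes_with_one_exon), min_introns)
print(all_exons, non_redundant, not_last)
```

Computing each statistic on an explicitly filtered list plays the role that the “AGGREGATE” function plays on filtered Excel rows: only the rows that pass the filter contribute to the result.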