Mihaela Pertea, Alaina Shumate, Geo Pertea, Ales Varabyou, Florian P Breitwieser, Yu-Chi Chang, Anil K Madugundu, Akhilesh Pandey, Steven L Salzberg.
Abstract
We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.
Keywords: GTEx; Human gene count; RNA sequencing; Transcriptome; Transcriptome assembly
Year: 2018 PMID: 30486838 PMCID: PMC6260756 DOI: 10.1186/s13059-018-1590-2
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 One of 224 new protein-coding genes (CHS.7402) discovered in this study. This 4-exon gene occurs on the forward strand of chromosome 10 at the coordinates shown. The exon lengths are 134, 30, 136, and 663 bp (left to right), with the narrower rectangles indicating the 5′ and 3′ UTR regions. The intron lengths (not shown to scale) are 18,098, 1086, and 1956 bp. The sequence alignment at the bottom shows, top to bottom, the protein sequences from CHS.7402, long-tailed macaque, rhesus macaque, marmoset, white-faced capuchin, ass, Przewalski’s horse, white rhinoceros, and wild boar. The full-length human protein sequence is shown
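The exon and intron lengths in the Fig. 1 caption are enough to work out the spliced transcript length and the genomic span of CHS.7402. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and are not taken from the CHESS pipeline.

```python
# Exon and intron lengths (bp) as listed in the Fig. 1 caption for CHS.7402.
exon_lengths = [134, 30, 136, 663]
intron_lengths = [18098, 1086, 1956]


def gene_span(exons, introns):
    """Genomic span of a gene in which exons and introns alternate.

    A gene with n exons has n - 1 introns, so the span is simply the
    sum of all exon and intron lengths.
    """
    if len(introns) != len(exons) - 1:
        raise ValueError("expected exactly one fewer intron than exons")
    return sum(exons) + sum(introns)


# Spliced length (UTRs included, since the caption gives full exon lengths).
transcript_length = sum(exon_lengths)
span = gene_span(exon_lengths, intron_lengths)

print(f"spliced transcript length: {transcript_length} bp")  # 963 bp
print(f"genomic span: {span} bp")                            # 22103 bp
```

Introns account for most of the locus: the spliced transcript is under 1 kb, while the gene spans roughly 22 kb on chromosome 10.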
The number of human genes and transcripts in the new CHESS (Comprehensive Human Expressed SequenceS) database built from 9795 RNA-seq experiments, with comparisons to the RefSeq database. ncRNA, noncoding RNA; lncRNA, long noncoding RNA gene; miscRNA, miscellaneous RNA
| Type of gene | Number in RefSeq | Number in CHESS |
|---|---|---|
| Protein-coding genes | 20,054 | 20,352 |
| ncRNA genes | | |
| - lncRNA | 14,788 | 18,887 |
| - Antisense | 23 | 2144 |
| - miscRNA | 1217 | 1228 |
| Total gene counts | 36,082 | 42,611 |
| Transcripts in protein-coding genes | 127,718 | 266,331 |
| Transcripts in ncRNA genes | | |
| - lncRNA | 28,015 | 49,892 |
| - Antisense | 28 | 2688 |
| - miscRNA | 2005 | 4347 |
| Total transcripts | 157,766 | 323,258 |
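The totals in the table above can be cross-checked against the per-category rows and against the abstract. The snippet below is a small consistency check, with the counts copied from the table; the dictionary layout and helper name are assumptions made for illustration.

```python
# Counts copied from the RefSeq vs. CHESS comparison table.
GENE_COUNTS = {
    "RefSeq": {"protein_coding": 20054, "lncRNA": 14788, "antisense": 23, "miscRNA": 1217},
    "CHESS": {"protein_coding": 20352, "lncRNA": 18887, "antisense": 2144, "miscRNA": 1228},
}
TRANSCRIPT_COUNTS = {
    "RefSeq": {"protein_coding": 127718, "lncRNA": 28015, "antisense": 28, "miscRNA": 2005},
    "CHESS": {"protein_coding": 266331, "lncRNA": 49892, "antisense": 2688, "miscRNA": 4347},
}
REPORTED_TOTALS = {
    "RefSeq": {"genes": 36082, "transcripts": 157766},
    "CHESS": {"genes": 42611, "transcripts": 323258},
}


def column_totals(db):
    """Sum the per-category rows for one database column."""
    return {
        "genes": sum(GENE_COUNTS[db].values()),
        "transcripts": sum(TRANSCRIPT_COUNTS[db].values()),
    }


for db, reported in REPORTED_TOTALS.items():
    computed = column_totals(db)
    status = "ok" if computed == reported else "MISMATCH"
    print(db, computed, status)

# Noncoding gene count in CHESS, as quoted in the abstract (22,259).
chess_noncoding = sum(
    n for biotype, n in GENE_COUNTS["CHESS"].items() if biotype != "protein_coding"
)
print("CHESS noncoding genes:", chess_noncoding)
```

Both columns sum to their reported totals, and the three noncoding categories in CHESS add up to the 22,259 noncoding genes stated in the abstract.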
Protein-coding genes from RefSeq that were not expressed in any of the 9795 RNA-seq samples from GTEx
| NCBI gene ID | Gene name | Location | Product |
|---|---|---|---|
| 101927562 | LOC101927562 | chr11 1554607–1556457 | Uncharacterized^a |
| 101929097 | LOC101929097 | chr19 2511219–2513571 | Uncharacterized^a |
| 107987231 | LOC107987231 | chr16 29973622–29974648 | Uncharacterized^a |
| 101928589 | LOC101928589 | chrX 110175773–110177788 | Uncharacterized |
| 728072 | CT47A5 | chrX 120963026–120966446 | Cancer/testis antigen family 47 member A5 |
| 728049 | CT47A8 | chrX 120948422–120951842 | Cancer/testis antigen family 47 member A8 |
| 728042 | CT47A9 | chrX 120943561–120946981 | Cancer/testis antigen family 47 member A9 |
| 245927 | DEFB113 | chr6 49968677–49969625 | Defensin beta 113 |
| 51206 | GP6 | chr19 55013705–55038264 | Glycoprotein VI platelet |
| 102723822 | LOC102723822 (GTPBP4/NGB) | Unplaced KI270752.1 8198–27137 | Nucleolar GTP-binding protein 1-like |
^a These genes were removed from RefSeq by NCBI after publication of a preliminary version of these findings
Genes and transcripts in CHESS (v2.1) that are also found in either RefSeq (rel 108) or GENCODE (v27) (columns 2 and 5) and that are unique to CHESS (columns 3 and 6)
| Gene biotype | Genes: shared by RefSeq or GENCODE | Genes: novel in CHESS | Genes: novel + FANTOM | Transcripts: shared by RefSeq or GENCODE | Transcripts: novel in CHESS | Transcripts: novel + FANTOM |
|---|---|---|---|---|---|---|
| Protein coding | 20,128 | 224 | 26 | 169,959 | 96,372 | 23,102 |
| LncRNA | 16,216 | 2671 | 1407 | 34,222 | 15,670 | 5840 |
| Antisense | 598 | 1546 | 494 | 637 | 2051 | 606 |
| MiscRNA | 1227 | 1 | 1 | 2284 | 2063 | 476 |
The columns labeled “Novel + FANTOM” show the subset of CHESS genes and transcripts that are not found in RefSeq or GENCODE but that are present in the FANTOM gene catalog
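The shared and novel columns partition each CHESS category, so the novel transcript counts should add up to the 116,156 novel transcripts quoted in the abstract. A short sketch of that check follows, together with the share of novel entries that also appear in FANTOM; the tuple layout and function names are illustrative assumptions.

```python
# Each value: (shared with RefSeq or GENCODE, novel in CHESS, novel and also in FANTOM).
GENES = {
    "protein_coding": (20128, 224, 26),
    "lncRNA": (16216, 2671, 1407),
    "antisense": (598, 1546, 494),
    "miscRNA": (1227, 1, 1),
}
TRANSCRIPTS = {
    "protein_coding": (169959, 96372, 23102),
    "lncRNA": (34222, 15670, 5840),
    "antisense": (637, 2051, 606),
    "miscRNA": (2284, 2063, 476),
}


def novel_total(table):
    """Total number of entries that are unique to CHESS."""
    return sum(novel for _shared, novel, _fantom in table.values())


def fantom_fraction(table):
    """Fraction of CHESS-unique entries that are also present in FANTOM."""
    in_fantom = sum(fantom for _shared, _novel, fantom in table.values())
    return in_fantom / novel_total(table)


novel_genes = novel_total(GENES)
novel_transcripts = novel_total(TRANSCRIPTS)

print("novel genes:", novel_genes)
print("novel transcripts:", novel_transcripts)  # 116156, as in the abstract
print(f"novel genes also in FANTOM: {fantom_fraction(GENES):.1%}")
print(f"novel transcripts also in FANTOM: {fantom_fraction(TRANSCRIPTS):.1%}")
```

The FANTOM columns are subsets of the novel columns, which is why they are divided by the novel totals rather than added to them.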
Fig. 2 The number of (a) introns and (b) transcripts shared by and unique to all combinations of the CHESS (v2.1), RefSeq (rel 108), and GENCODE databases (v28). For this comparison, only transcripts and introns assembled directly by the CHESS pipeline were included. The CHESS database also includes additional transcripts that were added directly from RefSeq and GENCODE (see main text)
Fig. 3 (a) The number of novel protein-coding and lncRNA genes that were differentially expressed between males and females, for each of the GTEx tissues that had both male and female samples. All tissues except kidney had at least 10 samples for each sex; kidney had 9 female and 29 male samples. (b) The number of novel protein-coding and lncRNA genes in CHESS that were upregulated in each of the 31 GTEx tissues as compared to the remaining tissues
Fig. 4 Multiple sequence alignments of novel CHESS protein-coding genes CHS.57705 (a) and CHS.24083 (b), each compared to five other primates, with annotated MS/MS spectra validating the identified peptides IDISFHR (a) and QLLTGAR (b) as shown on the right
Fig. 5 Summary of the computational pipeline used to align and assemble all 9795 RNA-seq samples