Matsuyuki Shirota, Kengo Kinoshita.
Abstract
The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon-intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences.
Year: 2016 PMID: 27589963 PMCID: PMC5009343 DOI: 10.1093/database/baw124
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 3. The alternative allele frequencies of the 1K genomes SNVs, classified by the concordance and discordance statuses between GRCh38, RefSeq mRNA, UniProt protein and GRCh37. The reference allele is defined by the GRCh38 sequence. (A) Both RefSeq and UniProt represent the reference allele, (B) RefSeq represents the reference allele but UniProt represents the alternative allele, (C) RefSeq represents the alternative allele but UniProt represents the reference allele, (D) both RefSeq and UniProt represent the alternative allele. The two histograms in each of panels A, B and D show the SNVs for which GRCh37 represents the reference (left) and alternative (right) alleles. Panel C has only one histogram, because no SNVs were found for which RefSeq and GRCh37, but not UniProt, represent the alternative allele.
Figure 1. Alternative allele frequencies (AAFs) of the genomic positions in which RefSeq mRNA represents the alternative allele. (A and B) Density distributions of alternative allele frequencies for each 5% frequency interval based on (A) 1K genomes and (B) ESP6500 data. Green and orange bars represent the distributions for all of the SNVs in the data set and those of the SNVs corresponding to the difference between RefSeq mRNA and GRCh38, respectively. Red and blue borders show non-synonymous and synonymous substitutions, respectively. Insets represent the densities for each 1% frequency interval within 0–5% and 95–100%. (C) AAFs for the discrepancy loci between 1K genomes and ESP6500 data. Synonymous and non-synonymous substitutions are shown in green and orange, respectively.
Figure 2. Counts of variants with large alternative allele frequencies. (A and B) The number of variants for each 1% of alternative allele frequency is plotted for (A) 1K genomes and (B) ESP6500. Red and blue lines indicate non-synonymous and synonymous variants, respectively. Thin and thick lines indicate all and discordant variants, respectively. (C) Fraction of discordant variants for each 1% alternative allele frequency bin. Red and magenta lines indicate non-synonymous and synonymous variants for 1K genomes, respectively, and blue and cyan lines are those for ESP6500, respectively.
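The per-bin fraction plotted in Figure 2C can be illustrated with a minimal sketch. This is not the authors' code: the function name `discordant_fraction_by_bin`, the `(aaf, is_discordant)` input layout and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict

def discordant_fraction_by_bin(variants, n_bins=100):
    """Return {bin index: fraction of discordant variants}.

    variants: iterable of (aaf, is_discordant) pairs (hypothetical layout),
    with 0.0 <= aaf <= 1.0. With n_bins=100 each bin spans 1% of AAF.
    """
    totals = defaultdict(int)
    discordant = defaultdict(int)
    for aaf, is_discordant in variants:
        # An AAF of exactly 1.0 is placed in the last bin.
        b = min(int(aaf * n_bins), n_bins - 1)
        totals[b] += 1
        if is_discordant:
            discordant[b] += 1
    # Only bins that contain at least one variant are reported.
    return {b: discordant[b] / totals[b] for b in sorted(totals)}

# Toy input, invented for illustration only (not data from the study).
toy = [(0.004, False), (0.006, False), (0.993, True), (0.997, False), (1.0, True)]
fractions = discordant_fraction_by_bin(toy)
```

Empty bins are omitted rather than reported as zero, so that a bin without variants is not mistaken for a bin without discordance.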
The numbers of concordant and discordant non-synonymous SNVs of 1K genomes data
| GRCh37 | RefSeq | UniProt | Occurrence |
|---|---|---|---|
| Ref | Ref | Ref | 584 059 |
| Alt | Ref | Ref | 76 |
| Ref | Alt | Ref | 404 |
| Alt | Alt | Ref | 0 |
| Ref | Ref | Alt | 93 |
| Alt | Ref | Alt | 186 |
| Ref | Alt | Alt | 33 |
| Alt | Alt | Alt | 34 |
Each row represents one combination of allele types for GRCh37 (the previous reference genome), RefSeq and UniProt, together with the number of SNVs showing that combination. ‘Ref’ and ‘Alt’ indicate that the reference sequence of the column is the same as and different from the sequence of GRCh38, respectively.
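The occurrences in the table can be tallied per reference to see how often each one differs from GRCh38. The sketch below uses only the counts listed above; the variable and function names are hypothetical and it is an illustration, not the authors' procedure.

```python
# Occurrence counts copied from the table above.
# Keys are (GRCh37, RefSeq, UniProt) allele types relative to GRCh38.
counts = {
    ("Ref", "Ref", "Ref"): 584059,
    ("Alt", "Ref", "Ref"): 76,
    ("Ref", "Alt", "Ref"): 404,
    ("Alt", "Alt", "Ref"): 0,
    ("Ref", "Ref", "Alt"): 93,
    ("Alt", "Ref", "Alt"): 186,
    ("Ref", "Alt", "Alt"): 33,
    ("Alt", "Alt", "Alt"): 34,
}
SOURCES = ("GRCh37", "RefSeq", "UniProt")

def alt_total(source):
    """Number of SNVs at which `source` carries the alternative allele."""
    i = SOURCES.index(source)
    return sum(n for key, n in counts.items() if key[i] == "Alt")

alt_totals = {s: alt_total(s) for s in SOURCES}

# SNVs at which at least one of the three sources differs from GRCh38.
discordant_total = sum(n for key, n in counts.items() if "Alt" in key)
```

With the table's numbers, GRCh37, RefSeq and UniProt differ from GRCh38 at 296, 471 and 346 non-synonymous SNVs, respectively, and 826 SNVs are discordant in at least one of them.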
Figure 4. Protein 3D structures including rare variants. (A) Superimposition of two quinolinic acid phosphoribosyl transferase (QPTR) hexamers. (B) Dimer–dimer interface of QPTR surrounding the variant residue A195T. Chains A of PDB entries 2JBM and 2KWW are green and purple, respectively, and their chains C are magenta and orange, respectively. (C) A rare variant in QPTR. (D) Structure of HOMER protein homologue 3 (HOMER3) coiled-coil domains and (E) close-up view with the S342 and E345 residues. (F) A rare variant in HOMER3. (G, H) Sequence comparisons of (G) QPTR and (H) HOMER3.
Figure 5. Base frequency in pooled RNA-seq data of the GEUVADIS project for regions surrounding discordant variants for (A) NM_031968.2 (NARF), (B) NM_000828.4 (GRIA3) and NM_000829.3 (GRIA4). A, T, G and C bases are coloured green, red, orange and blue, respectively, in the nucleotide sequences and bar plots. The bar plots indicate the fraction of each base occurring in the pooled RNA-seq data of 462 individuals from the GEUVADIS project. The amino acid sequences for these regions and the amino acid changes that occur due to possible A-to-G RNA-editing events are shown below the bar plot, with synonymous changes shown in grey. Grey inverted triangles on the DNA sequences represent two SNVs within these regions, which are both rare (0.02% in 1,000 genomes phase 3) and are G-to-A changes. The vertical dashed lines indicate exon–intron boundaries, and the grey letters indicate the intronic bases in the DNA sequences and the RNA bases from neighbouring exons of the transcript sequences. Asterisks indicate the discordant variants, and colons and dots indicate possible RNA-editing with G-base fractions of 0.05 and 0.01, respectively.
Missing gene segments in GRCh38
| GENE | Description | #1 | Chr | Strand | Positions | #2 | mRNA |
|---|---|---|---|---|---|---|---|
| ABO | Abo blood group (transferase a, alpha 1-3-n-acetylgalactosaminyltransferase; transferase b, alpha 1-3-galactosyltransferase) (abo) | 1 | chr9 | − | 133257521:133257521 | G | NM_020469.2 |
| THSD7B | Thrombospondin, type i, domain containing 7b (thsd7b) | 1 | chr2 | + | 137451025:137563217 | T | NM_001080427.1 |
| TMEM247 | Transmembrane protein 247 (tmem247) | 1 | chr2 | + | 46484425:46484425 | A | NM_001145051.2 |
| IFNL4 | Interferon, lambda 4 (gene/pseudogene) (ifnl4) | 1 | chr19 | − | 39248513:39248515 | C | NM_001276254.2 |
| PKD1L2 | Polycystic kidney disease 1-like 2 (pkd1l2) | 1 | chr16 | − | 81127878:81127878 | G | NM_001278425.1, NM_052892.3 |
| NR1H2 | Nuclear receptor subfamily 1, group h, member 2 (nr1h2) | 3 | chr19 | + | 50378567:50378567 | ACA | NM_007121.5, NM_001256647.1 |
| ALMS1 | Alstrom syndrome 1 (alms1) | 3 | chr2 | + | 73385903:73385903 | GGA | NM_015120.4 |
| BCL6B | B-cell cll/lymphoma 6, member b (bcl6b) | 3 | chr17 | + | 7024730:7024730 | CAG | NM_181844.3 |
| HTT | Huntingtin (htt) | 6 | chr4 | + | 3074935:3074935 | GCAGCA | NM_002111.7 |
| LACTBL1 | Lactamase, beta-like 1 (lactbl1) | 7 | chr1 | − | 22965430:- | ATGAAGA | NM_001289974.1 |
| NPIPB3 | Nuclear pore complex interacting protein family, member b3 (npipb3) | 11 | chr16 | − | 21404379:21404384, 21404506:21404509 | AGCTC, GCTCAC | NM_130464.2 |
| IGFBP2 | Insulin-like growth factor binding protein 2, 36kda (igfbp2) | 9 | chr2 | + | 216633587:216633587 | CGCTGCTGC | NM_000597.2 |
| MOB2 | Mob kinase activator 2 (mob2) | 16 | chr11 | − | 1480886:- | ATGGACTGGCTCATGG | NM_053005.5 |
| MROH8 | Maestro heat-like repeat family member 8 (mroh8) | 29 | chr20 | − | 37179390:37179390 | AAGAGTGCCGGCCGCGGGGCCCTGTCTAT | NM_152503.5, NM_213631.2, NM_213632.2 |
| RYBP | Ring1 and yy1 binding protein (rybp) | 30 | chr3 | − | 72446623:- | ATGACCATGGGCGACAAGAAGAGCCCGACC | NM_012234.6 |
| NRG1 | Neuregulin 1 (nrg1)hrg-beta1 | 35 | chr8 | + | −:32595825 | – | NM_001159995.1, NM_001159999.1 |
| SHANK3 | Sh3 and multiple ankyrin repeat domains 3 (shank3) | 42 | chr22 | + | 50695048:50697563 | – | NM_033517.1 |
| CASP8AP2 | Caspase 8 associated protein 2 (casp8ap2) | 48 | chr6 | + | 89871390:89873802 | – | NM_012115.3, NM_001137667.1, NM_001137668.1 |
| FAM101B | Family with sequence similarity 101, member b (fam101b) | 209 | chr17 | − | 445940:- | – | NM_182705.2 |
| NBPF1 | Neuroblastoma breakpoint family, member 1 (nbpf1) | 220 | chr1 | − | – | – | NM_017940.4 |
| SAMD1 | Sterile alpha motif domain containing 1 (samd1) | 318 | chr19 | − | 14090084:14090084 | – | NM_138352.1 |
| MUC19 | Mucin 19, oligomeric (muc19) | 487 | chr12 | + | – | – | NM_173600.2 |
| MUC2 | Mucin 2, oligomeric mucus/gel-forming (muc2) | 3153 | chr11 | + | – | – | NM_002457.3 |
| FCGBP | Fc fragment of igg binding protein (fcgbp) | 3603 | chr19 | − | – | – | NM_003890.2 |
#1: Number of missing bases.
#2: Missing mRNA base.
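For rows that list the missing bases explicitly, the reported count (#1) can be checked against the length of the listed sequence (#2). The sketch below does this for a subset of rows copied from the table. It also flags counts that are a multiple of three, on the assumption that such a segment, restored as one contiguous block, would not shift the reading frame. The `check` helper and the `in_frame` label are hypothetical, not part of the original analysis.

```python
# (gene, number of missing bases, missing mRNA bases) copied from the table.
# The two NPIPB3 segments are joined, and sequences that wrap across
# table lines (MOB2, MROH8, RYBP) are rejoined.
rows = [
    ("ABO", 1, "G"),
    ("NR1H2", 3, "ACA"),
    ("HTT", 6, "GCAGCA"),
    ("LACTBL1", 7, "ATGAAGA"),
    ("NPIPB3", 11, "AGCTC" + "GCTCAC"),
    ("IGFBP2", 9, "CGCTGCTGC"),
    ("MOB2", 16, "ATGGACTGGCTC" + "ATGG"),
    ("MROH8", 29, "AAGAGTGCCGGC" + "CGCGGGGCCCTGTCTAT"),
    ("RYBP", 30, "ATGACCATGGGC" + "GACAAGAAGAGCCCGACC"),
]

def check(rows):
    """Compare each reported count with the sequence length."""
    report = []
    for gene, n_missing, seq in rows:
        report.append({
            "gene": gene,
            "length_matches": len(seq) == n_missing,
            # Assumption: a multiple of three keeps the reading frame.
            "in_frame": n_missing % 3 == 0,
        })
    return report

report = check(rows)
in_frame_genes = [r["gene"] for r in report if r["in_frame"]]
```

For these rows every listed sequence has the reported length, and NR1H2, HTT, IGFBP2 and RYBP have a missing-base count that is a multiple of three.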