Edward Wijaya, Martin C Frith, Paul Horton, Kiyoshi Asai.
Abstract
Human gene catalogs are fundamental to the study of human biology and medicine, but they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously identified genes, but no research has been done on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome. We systematically test the effect of known polymorphisms with this procedure. Polymorphisms can not only disrupt ORFs, they can also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference, whose protein-coding status is supported by homology to known proteins. On average 10% of these genes are located in genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.
Year: 2013 PMID: 23349826 PMCID: PMC3551959 DOI: 10.1371/journal.pone.0054210
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The standard model of genetic information transfer in molecular biology.
Panel (a) shows that the transfer begins with the DNA being transcribed into mRNA, and continues with protein being synthesized using information in the mRNA as a template (translation). We investigate the effect of polymorphic modification of the mRNA. Panel (b) depicts how the new, longer ORF is formed. The starting position of the new ORF in the mRNA is before that of the original ORF. The new ORF may or may not overlap the original ORF, and if it does overlap, it is in a different reading frame, so that the proteins are completely different.
Figure 2. Workflows for finding protein-coding genes.
Panel (a) describes the workflow of gene-finding without applying human polymorphism and (b) with human polymorphism. The values inside the brackets refer to the number of mRNAs, ORFs and genes respectively. The final number of genes in workflow (b) refers to the genes whose ORFs change after modification; in workflow (a) no such change applies. For the second workflow (b), two main sources of data are used: human mRNA sequences and polymorphism data (dbSNP 131). Based on the polymorphism information we redefine the mRNA sequences. From the modified mRNA sequences we derive the longest ORFs. These ORFs are further refined by filtering them based on significant homology to Swiss-Prot and proximity to the 5′ UTR. Finally we construct the genes from the refined ORFs.
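The core of workflow (b), applying a polymorphism to an mRNA and then taking the longest ORF, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy sequence, the requirement that an ORF run from `ATG` to an in-frame stop codon, and the 0-based half-open coordinates are all assumptions made here. The Swiss-Prot homology and 5′ UTR filters are omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the given strand.

    Coordinates are 0-based and half-open; returns None if no ORF exists.
    (Assumed ORF definition; the source does not spell out its exact rule.)
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


def apply_variant(seq, pos, ref, alt):
    """Replace the reference allele at 0-based position pos with alt."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]


# Toy mRNA (hypothetical): the only reference ORF is a short one near the 3' end.
REFERENCE_MRNA = "ACGAAACCCGGGTTTAAATAACATGCCCTGA"
# A C->T substitution at position 1 creates an upstream ATG in another frame.
MODIFIED_MRNA = apply_variant(REFERENCE_MRNA, 1, "C", "T")

ORF_BEFORE = longest_orf(REFERENCE_MRNA)
ORF_AFTER = longest_orf(MODIFIED_MRNA)
```

In this toy case the new longest ORF starts upstream of the original one and lies in a different reading frame, which is the situation depicted in panel (b) of Figure 1.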
Genomic positions of mRNA for the examples shown in Figures 3, 4 and 5.
| | AK124706 | AK127273 | AY129028 |
| Chr. name | chrUn_gl000222 | chr7 | chr14 |
| Strand | - | + | - |
| mRNA pos. in chr. | 25,008–28,821 | 128,295,697–128,299,178 | 93,403,259–93,406,150 |
| dbSNP ref. | rs66651466 | rs71162510 | rs8011546 |
| ORF pos. after modification | 28,249–28,585 | 128,296,043–128,296,637 | 93,404,946–93,405,303 |
| ORF pos. before modification | 27,864–28,166 | 128,298,398–128,298,944 | 93,404,502–93,404,811 |
| 5′ UTR | 236 | 347 | 847 |
| E-value of new ORF alignment to Swiss-Prot | 3.21E-10 | 4.74E-50 | 1.66E-08 |
Figure 3. Modification by polymorphism of mRNA AK124706 and its ORFs.
In the reference genome the modification is caused by an insertion (rs66651466) with ‘AT’ as the allele. The initial longest ORF before modification has length 302 bp. The new longest ORF has length 336 bp, and it aligns to Swiss-Prot Integrin beta-5 protein (Acc:P18084). Annotation of start/stop codon in the translation process and alleles that cause the change can be found in Figure 2.
Figure 4. Modification by polymorphism of mRNA AK127273 and its ORFs.
The initial longest ORF before modification has length 546 bp. The longest ORF after modification has length 594 bp. The polymorphism responsible for the modification is an indel (rs71162510) which replaces the reference genome allele ‘C’ with ‘TGCCCC’.
Figure 5. Modification by polymorphism of mRNA AY129028 and its ORFs.
The initial longest ORF before modification has length 309 bp, and after has length 357 bp. The polymorphism that effects the modification is a SNP (rs8011546) which replaces the reference allele ‘G’ with ‘A’.
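The ORF lengths quoted in the captions of Figures 3 to 5 can be checked against the genomic coordinates listed in the table of mRNA positions. The sketch below assumes that length is simply end minus start of each listed range; the dictionary layout and names are illustrative only.

```python
# ORF coordinates (start, end) copied from the table of genomic positions.
ORF_COORDS = {
    "AK124706": {"before": (27_864, 28_166), "after": (28_249, 28_585)},
    "AK127273": {"before": (128_298_398, 128_298_944), "after": (128_296_043, 128_296_637)},
    "AY129028": {"before": (93_404_502, 93_404_811), "after": (93_404_946, 93_405_303)},
}


def span_length(coords):
    """Length of a coordinate range, taken here as end minus start (assumption)."""
    start, end = coords
    return end - start


# Lengths in bp before and after modification, per mRNA accession.
ORF_LENGTHS = {
    accession: {state: span_length(coords) for state, coords in spans.items()}
    for accession, spans in ORF_COORDS.items()
}
```

Under this assumption the computed lengths agree with the values given in the three captions.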
Figure 6. Cumulative allele frequency from 11 populations.
In panel (a) we plot the allele percentage of new ORFs and in panel (b) the allele percentage of all HapMap data in the UCSC Genome Browser. The percentage axis in panel (b) is based on Allele1, chosen arbitrarily.
ORFs and genes that changed after modification.
| | ORFs | Genes |
| Unchanged | 202,232 | 26,491 |
| Changed | 18,726 | 5,737 |
An ORF is said to be new, or to have changed, if it shares no same-frame codons with the initial ORF before modification in the mRNA. The genes are constructed by merging the ORFs (from different mRNAs) that overlap on the same strand of a chromosome.
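The two rules in this note, the "no shared same-frame codons" criterion and the merging of overlapping ORFs into genes, can be sketched as below. The function names, the tuple layouts, and the use of 0-based half-open coordinates are assumptions for illustration. The sketch also assumes both ORFs are expressed in one common mRNA coordinate system, which for indels would require mapping positions between the original and modified sequence.

```python
from collections import defaultdict


def shares_same_frame_codon(orf_a, orf_b):
    """True if two ORFs (start, end) overlap by a codon in the same frame."""
    (a_start, a_end), (b_start, b_end) = orf_a, orf_b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    return overlap >= 3 and (a_start - b_start) % 3 == 0


def is_new_orf(orf_after, orf_before):
    """An ORF counts as new if it shares no same-frame codon with the old one."""
    return not shares_same_frame_codon(orf_after, orf_before)


def merge_orfs_into_genes(orfs):
    """Merge ORFs (chrom, strand, start, end) that overlap on the same strand."""
    by_key = defaultdict(list)
    for chrom, strand, start, end in orfs:
        by_key[(chrom, strand)].append((start, end))

    genes = []
    for (chrom, strand), spans in sorted(by_key.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start < cur_end:  # overlaps the current gene: extend it
                cur_end = max(cur_end, end)
            else:  # gap: close the current gene and start a new one
                genes.append((chrom, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        genes.append((chrom, strand, cur_start, cur_end))
    return genes


# Hypothetical ORFs: two overlap on chr7 '+', one lies on the opposite strand.
EXAMPLE_GENES = merge_orfs_into_genes([
    ("chr7", "+", 100, 400),
    ("chr7", "+", 350, 700),
    ("chr7", "-", 360, 500),
    ("chr14", "-", 10, 40),
])
```

Because several ORFs can collapse into one gene, the number of changed genes in the table (5,737) is smaller than the number of changed ORFs (18,726).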
Effect of randomization on ORF and gene prediction.
| Polymorphism type | ORFs | Genes |
| Real | 18,726 | 5,737 |
| Random | 3,004 | 1,330 |
Similar to the previous table, the figures refer to the number of ORFs that changed after modification with real and random polymorphisms. These figures are reported after validating the ORFs through Swiss-Prot homology.
Number of new genes after modification by polymorphism with and without overlap with each of 12 other gene sets.
| Gene Finder | No Overlap | With Overlap |
| acembly | 365 | 5,372 |
| ccdsGene | 713 | 5,024 |
| ensGene | 473 | 5,264 |
| geneid | 541 | 5,196 |
| genscan | 449 | 5,288 |
| hinv70Coding | 284 | 5,453 |
| knownGene | 452 | 5,285 |
| nscanGene | 521 | 5,216 |
| refGene | 521 | 5,216 |
| sgpGene | 498 | 5,239 |
| vegaGene | 1,762 | 3,975 |
| xenoRefGene | 622 | 5,115 |
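The "on average 10%" figure in the abstract can be reproduced from this table by averaging the no-overlap fraction over the 12 gene sets. The snippet below is a small worked check; the variable names are illustrative, and taking a plain unweighted mean across catalogs is an assumption about how the average was formed.

```python
TOTAL_NEW_GENES = 5_737

# "No Overlap" counts copied from the table, keyed by gene set.
NO_OVERLAP = {
    "acembly": 365, "ccdsGene": 713, "ensGene": 473, "geneid": 541,
    "genscan": 449, "hinv70Coding": 284, "knownGene": 452, "nscanGene": 521,
    "refGene": 521, "sgpGene": 498, "vegaGene": 1_762, "xenoRefGene": 622,
}

# Fraction of new genes with no overlap, per gene set.
NO_OVERLAP_FRACTION = {
    name: count / TOTAL_NEW_GENES for name, count in NO_OVERLAP.items()
}

# Unweighted mean across the 12 gene sets (about 0.105, i.e. roughly 10%).
MEAN_NO_OVERLAP_FRACTION = sum(NO_OVERLAP_FRACTION.values()) / len(NO_OVERLAP_FRACTION)
```

The fractions range from about 5% (hinv70Coding) to about 31% (vegaGene), so the average hides considerable variation between catalogs.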
Number of new genes (after modification by human polymorphism) not found by the Ensembl gene finder (ensGene), with and without overlap with genes of other species.
| Species (assembly) | With Overlap | No Overlap |
| chicken (galGal3) | 65 | 408 |
| medaka (oryLat2) | 68 | 405 |
| zebrafish (danRer7) | 71 | 402 |
| zebrafinch (taeGut1) | 72 | 401 |
| panda (ailMel1) | 76 | 397 |
| tetraodon (tetNig2) | 77 | 396 |
| frog (xenTro2) | 78 | 395 |
| lizard (anoCar1) | 78 | 395 |
| fugu (fr2) | 80 | 393 |
| stickleback (gasAcu1) | 81 | 392 |
| cat (felCat3) | 81 | 392 |
| pig (susScr2) | 83 | 390 |
| orangutan (ponAbe2) | 86 | 387 |
| chimpanzee (panTro2) | 87 | 386 |
| marmoset (calJac3) | 89 | 384 |
| rhesus (rheMac2) | 89 | 384 |
| rabbit (oryCun2) | 90 | 383 |
| guinea pig (cavPor3) | 92 | 381 |
| horse (equCab2) | 93 | 380 |
| elephant (loxAfr3) | 94 | 379 |
| mouse (mm9) | 94 | 379 |
| cow (bosTau4) | 96 | 377 |
| rat (rn4) | 97 | 376 |
| dog (canFam2) | 101 | 372 |
| opossum (monDom5) | 104 | 369 |
Molecular function gene ontology of new ORFs after modification.
| Description | P-Value | FDR q-value | Enrichment (N,B,n,b) |
| ATP binding | 1.49E-33 | 5.71E-30 | 1.46 (16846,1442,5262,659) |
| adenyl ribonucleotide binding | 5.77E-33 | 1.11E-29 | 1.45 (16846,1466,5262,666) |
| binding | 9.74E-33 | 1.25E-29 | 1.09 (16846,11419,5262,3897) |
| adenyl nucleotide binding | 2.80E-32 | 2.69E-29 | 1.45 (16846,1475,5262,667) |
| catalytic activity | 1.12E-27 | 8.63E-25 | 1.19 (16846,5141,5262,1909) |
| protein binding | 1.40E-25 | 8.98E-23 | 1.14 (16846,7073,5262,2519) |
| purine ribonucleoside triphosphate binding | 1.90E-25 | 1.04E-22 | 1.35 (16846,1776,5262,751) |
| purine ribonucleotide binding | 4.93E-25 | 2.36E-22 | 1.35 (16846,1806,5262,760) |
| ribonucleotide binding | 4.93E-25 | 2.10E-22 | 1.35 (16846,1806,5262,760) |
| purine nucleotide binding | 1.67E-24 | 6.40E-22 | 1.34 (16846,1818,5262,762) |
| nucleotide binding | 5.94E-21 | 2.07E-18 | 1.27 (16846,2318,5262,921) |
| nucleoside phosphate binding | 6.92E-21 | 2.21E-18 | 1.27 (16846,2319,5262,921) |
| small molecule binding | 8.24E-21 | 2.43E-18 | 1.26 (16846,2485,5262,978) |
| ion binding | 1.70E-19 | 4.65E-17 | 1.19 (16846,3833,5262,1426) |
| cation binding | 2.72E-19 | 6.95E-17 | 1.19 (16846,3825,5262,1422) |
| metal ion binding | 4.41E-19 | 1.06E-16 | 1.19 (16846,3754,5262,1397) |
| kinase activity | 4.11E-18 | 9.27E-16 | 1.47 (16846,756,5262,347) |
| phosphotransferase activity, alcohol group as acceptor | 1.79E-17 | 3.82E-15 | 1.48 (16846,704,5262,325) |
| protein kinase activity | 7.02E-17 | 1.42E-14 | 1.51 (16846,592,5262,280) |
| transferase activity, transferring phosphorus-containing groups | 7.95E-17 | 1.52E-14 | 1.42 (16846,875,5262,387) |
Top 20 terms ranked by the P-value of overrepresentation against the background set. In the last column, ‘Enrichment (N, B, n, b)’, $N$ is the total number of genes, $B$ is the total number of genes associated with the corresponding GO term (description), $n$ is the number of genes in the target set, and $b$ is the number of genes in the intersection. Enrichment $= (b/n)/(B/N)$. Note that the total number of target genes ($n$) in the last column can be less than or equal to the number of input genes, because GOrilla normalises the input gene names against its gene database.
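The enrichment values in the last column can be recomputed from the four counts. The sketch below assumes the enrichment is defined as (b/n)/(B/N), the ratio of the GO term's frequency in the target set to its frequency in the background; this definition reproduces the tabulated values. The function name is illustrative.

```python
def enrichment(N, B, n, b):
    """Fold enrichment of a GO term: (b / n) / (B / N).

    N: total number of genes in the background
    B: background genes annotated with the GO term
    n: genes in the target set
    b: target-set genes annotated with the GO term
    """
    return (b / n) / (B / N)


# Counts copied from the 'ATP binding' and 'kinase activity' rows.
ATP_BINDING = enrichment(16846, 1442, 5262, 659)
KINASE_ACTIVITY = enrichment(16846, 756, 5262, 347)
```

An enrichment of 1.0 would mean the term is exactly as frequent among the new ORFs as in the background, so values such as 1.09 for ‘binding’ indicate only a slight overrepresentation despite the very small P-value.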
Biological process gene ontology of new ORFs after modification.
| Description | P-Value | FDR q-value | Enrichment (N,B,n,b) |
| macromolecule modification | 1.23E-13 | 1.51E-10 | 1.23 (16846,1977,5262,762) |
| phosphorylation | 5.86E-13 | 6.48E-10 | 1.43 (16846,628,5262,280) |
| phosphate-containing compound metabolic process | 9.70E-13 | 9.75E-10 | 1.37 (16846,798,5262,342) |
| phosphorus metabolic process | 9.70E-13 | 8.94E-10 | 1.37 (16846,798,5262,342) |
| protein phosphorylation | 1.14E-12 | 9.73E-10 | 1.44 (16846,575,5262,259) |
| cellular protein modification process | 3.48E-12 | 2.75E-09 | 1.23 (16846,1884,5262,721) |
| protein modification process | 3.48E-12 | 2.57E-09 | 1.23 (16846,1884,5262,721) |
| regulation of biological process | 1.63E-11 | 1.12E-08 | 1.08 (16846,7733,5262,2615) |
| protein metabolic process | 6.10E-11 | 3.97E-08 | 1.16 (16846,2921,5262,1061) |
| regulation of biological quality | 3.45E-10 | 2.12E-07 | 1.19 (16846,2036,5262,759) |
| transmembrane receptor protein tyrosine kinase signaling pathway | 4.16E-10 | 2.42E-07 | 1.42 (16846,489,5262,217) |
| response to stimulus | 6.81E-10 | 3.77E-07 | 1.10 (16846,5534,5262,1901) |
| regulation of cellular process | 8.46E-10 | 4.46E-07 | 1.08 (16846,7330,5262,2470) |
| enzyme linked receptor protein signaling pathway | 2.26E-09 | 1.14E-06 | 1.34 (16846,658,5262,276) |
| cellular protein metabolic process | 3.41E-09 | 1.64E-06 | 1.17 (16846,2348,5262,856) |
| cellular response to stimulus | 4.72E-09 | 2.18E-06 | 1.12 (16846,4170,5262,1453) |
| organelle organization | 5.27E-09 | 2.33E-06 | 1.20 (16846,1654,5262,621) |
| regulation of response to stimulus | 6.76E-09 | 2.87E-06 | 1.17 (16846,2110,5262,774) |
| cellular component organization | 7.02E-09 | 2.88E-06 | 1.14 (16846,3139,5262,1115) |
| peptidyl-tyrosine phosphorylation | 8.21E-09 | 3.24E-06 | 2.17 (16846,59,5262,40) |
Top 20 terms ranked by the P-value of overrepresentation against the background set.
Cellular component gene ontology of new ORFs after modification.
| Description | P-Value | FDR q-value | Enrichment (N,B,n,b) |
| cell part | 2.57E-32 | 3.30E-29 | 1.08 (16846,12147,5262,4108) |
| intracellular part | 6.23E-28 | 4.00E-25 | 1.08 (16846,11593,5262,3922) |
| organelle | 2.22E-18 | 9.52E-16 | 1.11 (16846,7819,5262,2703) |
| intracellular organelle | 3.61E-18 | 1.16E-15 | 1.11 (16846,7802,5262,2696) |
| cytoplasmic part | 2.28E-17 | 5.85E-15 | 1.12 (16846,6470,5262,2268) |
| cytosol | 1.01E-16 | 2.16E-14 | 1.24 (16846,2261,5262,878) |
| cytoplasm | 1.03E-14 | 1.88E-12 | 1.16 (16846,3850,5262,1398) |
| membrane-bounded organelle | 3.36E-14 | 5.39E-12 | 1.10 (16846,6897,5262,2377) |
| intracellular membrane-bounded organelle | 3.74E-14 | 5.33E-12 | 1.10 (16846,6892,5262,2375) |
| organelle part | 8.10E-10 | 1.04E-07 | 1.09 (16846,6066,5262,2070) |
| intracellular organelle part | 9.22E-10 | 1.08E-07 | 1.09 (16846,5980,5262,2042) |
| nucleus | 1.36E-09 | 1.46E-07 | 1.11 (16846,4598,5262,1597) |
| non-membrane-bounded organelle | 1.55E-09 | 1.53E-07 | 1.19 (16846,1920,5262,715) |
| intracellular non-membrane-bounded organelle | 1.55E-09 | 1.42E-07 | 1.19 (16846,1920,5262,715) |
| nuclear part | 6.90E-08 | 5.91E-06 | 1.15 (16846,2412,5262,866) |
| cytoplasmic vesicle part | 7.30E-07 | 5.86E-05 | 1.37 (16846,390,5262,167) |
| cell junction | 1.60E-06 | 1.21E-04 | 1.26 (16846,694,5262,274) |
| Golgi apparatus | 5.61E-06 | 4.00E-04 | 1.26 (16846,629,5262,248) |
| cell projection | 8.08E-06 | 5.46E-04 | 1.22 (16846,819,5262,313) |
| nucleoplasm | 1.10E-05 | 7.05E-04 | 1.21 (16846,912,5262,344) |
Top 20 terms ranked by the P-value of overrepresentation against the background set.