To Know How a Gene Works, We Need to Redefine It First but then, More Importantly, to Let the Cell Itself Decide How to Transcribe and Process Its RNAs.

Yuping Jia¹, Lichan Chen², Yukui Ma¹, Jian Zhang³, Ningzhi Xu⁴, Dezhong Joshua Liao².

Abstract

Recent genomic and ribonomic research reveals that our genome produces a stupendous amount of non-coding RNAs (ncRNAs), including antisense RNAs, and that many genes contain other gene(s) in their introns. Since ncRNAs either regulate the transcription, translation or stability of mRNAs or directly exert cellular functions, they should be regarded as the fourth category of RNAs, after ribosomal, messenger and transfer RNAs. These and other research advances challenge the current concept of gene and raise a question as to how we should redefine gene. We can either consider each tiny part of the classically-defined gene, such as each mRNA variant, as a "gene", or, alternatively and oppositely, regard a whole genomic locus as a "gene" that may contain intron-embedded genes and produce different types of RNAs and proteins. Each of the two ways to redefine gene not only has its strengths and weaknesses but also has its particular concern on the methodology for the determination of the gene's function: Ectopic expression of complementary DNA (cDNA) in cells has in the past decades provided us with a great deal of detail about the functions of individual mRNA variants, and will make the data less conflicting with each other if just a small part of a classically-defined gene is considered as a "gene". On the other hand, genomic DNA (gDNA) will better help us in understanding the collective function of a genomic locus. In our opinion, we need to be more cautious in the use of cDNA and in the explanation of data resulting from cDNA, and, instead, should make delivery of gDNA into cells routine in determination of genes' functions, although this demands some technological innovation.

Keywords: Complementary DNA; Gene definition; Gene function; Genomic locus; non-coding RNA

Substances: RNA

Year: 2015 PMID: 26681921 PMCID: PMC4671999 DOI: 10.7150/ijbs.13436

Source DB: PubMed Journal: Int J Biol Sci ISSN: 1449-2288 Impact factor: 6.580

Introduction

In the history of biology, it was long thought that each individual gene served a single phenotypic trait, which could be rephrased more mechanistically as “one gene for one function”. However, we now know that most, if not all, mammalian genes have multiple functions. One major mechanism for a gene to differentiate one function from another is for itself to be present in a different form. As described previously 1-3, this “different form” could be a polymorphism or mutation, a different RNA transcript initiated from or terminated at a different genomic site, a different cis- or trans-spliced product of a transcript, a different protein isoform due to usage of a different start codon or stop codon, or a different product of post-translational process such as proteolysis, phosphorylation or glycosylation. In this regard, the traditional concept of “one gene for one function” still, generally speaking, holds true if “one gene” is considered as one form of the gene. In other words, it may be the definition of gene, but not this concept, that needs to be amended instead. If a small part of a currently-defined gene, which is actually a genomic locus, is redefined as a “gene”, many advances in genomic, ribonomic and proteomic research that cannot be accommodated by the current gene definition become easily understandable, and a lot of data on the functions of most mammalian genes that are conflicting with each other will no longer be controversial. For example, it is understandable that two splice-derived mRNA variants would have different functions because they are two different “genes”. However, if we consider part of a classically-defined gene, such as an mRNA variant, as a “gene” and continue to mainly use complementary DNA (cDNA) to characterize its function, as we have done for decades, we will not be able to learn the collective function of the gene as a genomic locus and in turn the function of the whole mammalian genome. 
In this essay, we discuss these fundamental issues raised by the recent genomic, ribonomic and proteomic findings and by the omnipresent controversial data on genes' functions.

Advance in genomic research requires a new definition of gene

Recent research on the human genome has yielded two revelations. One is that only about 1.5% of the human genomic sequence belongs to the exome, which is the sum of exons for encoding proteins, while the remaining 98.5% of the genome belongs to introns or intergenic sequence 4-7. However, without a strong rationale, currently only those genomic loci that encode a peptide of 100 or more amino acids (AAs) are considered as genes 8-12, although even much shorter peptides, as short as 11 AAs 13-16, are known to be functional as well 17-22. If we consider that all genomic loci are genes as long as they encode a peptide, no matter how short it is, the number of genes will be greatly increased 14;23, in turn increasing the protein-coding regions and reciprocally decreasing the intergenic regions. Actually, we anticipate that many currently unannotated genes as shown in figure 1 may later be confirmed to be authentic. Another new finding is that in the human genome there are many more genes than we had previously realized that contain one or more other genes in their introns, either on the Watson strand or the Crick strand of the DNA double helix, with some examples illustrated in figure 1. On many occasions these intron-embedded or nested genes are pseudogenes. Since virtually the whole genome is transcribed 4-7, virtually all pseudogenes are expressed. Actually, increasing evidence suggests that most pseudogenes may be functional, either via their protein products 24;25 or via their non-coding RNA (ncRNA) products as detailed below 26-28. There is another situation wherein the nested gene is not located in an intron but is on the opposite strand of the DNA double helix, with exemplar cases indicated by the arrows pointing to the opposite direction in figure 1. A well-characterized example is that the DNA strand opposite to the one coding for N-myc gene, i.e. its Crick strand, also encodes a gene that is thus termed N-Cym according to the nomenclature of antisense 29.
The existence of gene(s) within another gene challenges the current definition of gene, as in this case the gene is actually a genomic locus that harbors two or more genes. We herein refer to those genes that contain other gene(s) as “parental genes”. A parental gene and the other gene(s) it contains, especially when they are encoded on the same strand of the DNA double helix and thus have the same orientation, are likely to be expressed concomitantly. Exploration of the biological consequence of simultaneous expression of the genes in the same genomic locus is still largely lacking, but would be intriguing. Moreover, whether the hypothetical 3'-to-5' polymerization of DNA or RNA 30;31 widely exists in some organisms awaits exploration, as it, if confirmed to widely exist, would also be incongruous with the current concept of gene.
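The parental/nested relationship described above can be sketched as a simple interval test. This is a minimal illustration only: the `Gene` record, the gene names and all coordinates are hypothetical, and the test checks containment within the parental gene's overall span rather than within a specific intron, since no intron coordinates are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # hypothetical chromosome coordinate, inclusive
    end: int     # hypothetical chromosome coordinate, inclusive
    strand: str  # "+" for the Watson strand, "-" for the Crick strand


def nested_genes(genes):
    """Return (parental, nested, same_strand) for each gene lying fully inside another."""
    pairs = []
    for parent in genes:
        for child in genes:
            if child is parent:
                continue
            inside = parent.start <= child.start and child.end <= parent.end
            identical_span = (child.start, child.end) == (parent.start, parent.end)
            if inside and not identical_span:
                pairs.append((parent.name, child.name, parent.strand == child.strand))
    return pairs


# A made-up locus: one parental gene, two nested genes, one neighbor.
locus = [
    Gene("PARENT", 1_000, 90_000, "+"),
    Gene("NESTED_SAME", 20_000, 25_000, "+"),
    Gene("NESTED_ANTI", 40_000, 48_000, "-"),
    Gene("NEIGHBOR", 95_000, 120_000, "+"),
]

for parental, nested, same_strand in nested_genes(locus):
    orientation = "same strand" if same_strand else "opposite strand"
    print(f"{nested} is nested in {parental} ({orientation})")
```

The `same_strand` flag separates the two situations the text distinguishes: nested genes in the same orientation as the parental gene, which are likely to be expressed concomitantly, and genes encoded on the opposite strand.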

Fig 1

Examples copied from the NCBI database in which a gene (indicated by a red arrow) contains other gene(s) (indicated by grey arrows). In the cases that two arrows point to the opposite directions, the genes are encoded by the opposite strands of the DNA double helix. A: The CTNNA1 (α-catenin) gene has the LRRTM2 gene and an unannotated gene (LOC105379193) embedded in its introns. B: The RB1 gene has the LPAR6 gene and several pseudogenes embedded in its introns. C: The REEP5 gene has the SRP19 and ERSP1 genes and the XBP1P1 pseudogene embedded in its introns. D: The genomic region for the oncogenic PVT1 ncRNA harbors the TMEM75 gene and several microRNAs as well.

Non-coding RNAs can be considered the fourth category of RNA

RNAs are traditionally categorized into 1) ribosomal RNAs (rRNAs), 2) messenger RNAs (mRNAs), and 3) transfer RNAs (tRNAs) that are synthesized by RNA polymerases I, II and III, respectively. Exceptions exist but are traditionally considered rare. For instance, the 5S rRNA is synthesized by RNA polymerase III instead, and the human mitochondrial 12S rRNA encodes a 16-amino-acid peptide called MOTS-c 32 whereas the 16S rRNA encodes another short peptide called humanin 33;34. However, recent advances in ribonomic research reveal that besides these three classical categories, a huge number of RNA polymerase II products do not encode proteins and thus are not mRNAs. Based on the length of their sequences, these non-coding RNAs as RNA polymerase II products are divided into the long and short groups. Short ncRNAs are mainly regulatory, i.e. regulating transcription, translation, or degradation of mRNAs, such as microRNAs (miRNAs) and small interfering RNAs (siRNAs). Long ncRNAs can be further dichotomized into one group that is regulatory, similar to the short ncRNAs, and another group that directly exerts cellular functions, such as the PVT-1 RNA that is oncogenic 35 and the Xist-and-Tsix pair of RNAs that regulate the inactivation of the X chromosome 36-38. Most short and long ncRNAs are processed from intergenic transcripts or are byproducts of cis-splicing that processes pre-mRNAs to mRNAs 39, although one may argue that circular RNAs (circRNAs), which are also long ncRNAs and are recently gaining momentum in ribonomics, are products, but not byproducts, of cis-splicing 40-42. Moreover, as we have discussed before 43;44, it is now a well-accepted notion that mammalian genomes are pervasively transcribed from both strands of the DNA double helix and thus produce a huge amount of antisense RNAs, as best exemplified by the aforementioned Xist and Tsix, each of which is the antisense of the other.
Many of these antisenses are long and non-coding but have regulatory functions 45. One caveat that needs to be given is that there is actually no clear demarcation between regulatory and functional ncRNAs, not only because “regulation” itself is a function as well but also because some long ncRNAs, such as some antisense RNAs and circRNAs, are not only regulatory but also functional. Another caveat is that there are other types of ncRNAs that are not described above, such as PIWI-interacting RNAs (piRNAs) 46;47, extracellular RNAs (exRNAs) 48;49, and small nuclear RNAs (snRNAs) that include small nucleolar RNAs (snoRNAs) 50;51. These and the abovementioned ncRNAs are overlapping in the mechanisms for their production and their functions, have been extensively reviewed in the literature, and thus will not be described herein in detail in order to avoid digression. We propose that all ncRNAs should be considered as the fourth group of RNAs, although some of them, such as some snRNAs, may initially be transcribed by RNA polymerase III, but not II.

Other ribonomic advances also suggest a need of a new gene definition

A gene should be studied at the transcriptional, post-transcriptional (e.g. splicing), translational, post-translational, and functional steps. Each of these steps has steps of its own as well. The study of transcription analyzes the regulatory element of the gene, which is mainly localized at the upstream, i.e. 5', side of the transcription initiation site, although for many genes the 5'-untranslated region (UTR), introns, 3'-UTR and even the 3'-intergenic region are also involved 52;53. It is common that transcription of a gene can be initiated from or terminated at an alternative site, as seen in the RSK4, CDK4 and Smarca2 genes 54-56. Such alternative initiation or termination of transcription, when combined with alternative splicing, can generate not only different mRNA variants of the same gene but also different genes' mRNAs, as best exemplified by the p15, p16 and p19 tumor suppressors that are generated in this way from the human INK4A locus 57;58. Sometimes, the alternative initiation site resides in an upstream gene while the alternative termination site resides in a downstream gene, as described and illustrated before 44;59. These situations are irreconcilable with the current concept of gene and make it debatable whether the resulting RNA is a chimeric product of the two neighboring genes or a product of an unannotated gene 44. Over 95% of the human genes contain exons and introns 60, with a gene having about 9 exons and 8 introns on average 61. During cis-splicing in human cells, on average about 91% of the transcript is severed to be introns, with the remaining 9% of the transcript as exons encoding proteins 62, which collectively corresponds to about 1.5% of the human genomic sequence as aforementioned 4-7. Although many more genes will likely be identified or annotated in the future, as discussed above, the non-coding sequence will still remain the overwhelming majority.
This does not mean that almost the whole genome is junk, but, instead, it indicates that its vast majority is assigned to be regulatory, i.e. of control. Reiterated, one of the reasons for only such a short portion of the genome being used to encode proteins is because the cells need to create a deluge of regulatory RNAs. In some way, the RNA world resembles a departmental community in the biomedical academy in the aspect of the relationship between the ncRNAs and the coding mRNAs: ncRNAs as the regulators or controllers act as professors wearing neat suits and ties to give instructions, while their lab members (graduate students, postdocs and technicians), who are the academic counterpart of the blue-collar working class wearing gloves and lab coats (although lab coats are usually white), act like mRNA-derived proteins to produce data as told, as depicted in figure 2.

Fig 2

Illustration of the relationship between ncRNAs and mRNAs with an academic biomedical department as an analogy. Our genome assigns only 1.5% of its sequence to mRNAs that encode proteins, which conduct cellular functions and thus resemble the cellular counterpart of the blue-collar working class. The remaining 98.5% of the genome is non-coding but is also transcribed to RNAs as regulators of cellular functions, mainly via control of mRNAs, thus resembling the white-collar class. In a biomedical department, most scientific achievements, with their various rewards, are credited to professors, with the tiny leftover credited with acknowledgements (such as diplomas) to those graduate students, postdocs or technicians who are the academic counterpart of the blue-collar working class employed and told by the professors to produce the actual data in the labs or animal rooms. Therefore, those who provide the direction are considered more important than those who provide the labor. Today the main focus of the biomedical fraternity is still on proteins as before, but it is probably time to shift more attention to the governing 98.5%.

RNA transcripts from nearly 95% of the human genes undergo alternative cis-splicing to produce multiple mRNA variants 60. In disease situations that have mutations, such as in cancers, many more genes' transcripts undergo alternative cis-splicing, because at least 14% of the mutations affect or occur at splice sites 63;64, and some cancer-specific cis-splices are recurrent 65. The human gene that produces the largest number of cis-spliced products may be titin, as alternative cis-splicing of its 363 exons in the skeletal muscle results in over one million RNA variants by estimation 66. The Drosophila gene Dscam also has 95 exons that undergo alternative cis-splicing to produce over 38,000 mRNA variants 67. We therefore surmise that in animal cells most pre-mRNAs can be cis-spliced to many more mature mRNAs than we can imagine or can find in the literature or in the database of the US National Center for Biotechnology Information (NCBI). This is very possible considering that an exon could be as short as only three nucleotides, such as an exon in some mRNA variants of the mouse Ncam gene 68 shown in the NCBI database. A lot of genes have a huge number of expressed sequence tags (ESTs) showing many unannotated mRNA variants (http://www.ncbi.nlm.nih.gov/ieb/research/acembly/), which could provide unofficial support for this conjecture. For some genes, trans-splicing may be involved, which can produce mRNAs with duplicated exons 69;70, as seen in some estrogen receptor alpha variants 71-73. It has been estimated that about 10% of the genes in the human, fly and worm contain tandem duplication of exons 74, although we opine that the actual figure may be smaller. Trans-splicing can also engender chimeric RNA 44, which may be bicistronic and may contain a transcript from the opposite strand of the DNA double helix or even a transcript from another chromosome.
For instance, in mouse testis the Msh4 gene is expressed to seven mRNA variants that involve transcripts from four different chromosomes, and one of the mRNAs is bicistronic while another contains antisense sequence 75. Obviously, the current concept of gene cannot accommodate any of the above-described situations.
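The variant counts quoted above for Dscam follow from simple combinatorics, which the sketch below works through. The per-cluster exon counts (12, 48, 33 and 2 alternatives for exons 4, 6, 9 and 17) are taken from the general Dscam literature rather than from this essay, and the cassette-exon function is only an illustrative upper bound under the assumption of fully independent exon choices; it is not how the titin estimate cited above was derived.

```python
from math import prod

# Alternative exons in each mutually exclusive cluster of Drosophila Dscam.
# Assumption: these counts come from the general literature, not from this essay.
DSCAM_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}


def mutually_exclusive_variants(clusters):
    """Number of mRNAs when exactly one alternative exon is chosen per cluster."""
    return prod(clusters.values())


def cassette_variants(n_cassette_exons):
    """Upper bound when each cassette exon is independently included or skipped."""
    return 2 ** n_cassette_exons


print(sum(DSCAM_CLUSTERS.values()))                # 95 alternative exons in total
print(mutually_exclusive_variants(DSCAM_CLUSTERS)) # 38016 possible mRNA variants
print(cassette_variants(20))                       # 1048576
```

Under the independence assumption, as few as 20 freely skippable exons would already exceed one million combinations, which illustrates how a gene with hundreds of exons could in principle reach the variant numbers discussed in the text.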

Advances in proteomic research suggest a need of new gene definition as well

The human genome encompasses only slightly over 20,000 genes by the current estimation 76;77. This number seems to be too small to explain the many complex biological functions and the very diverse social activities of the human being. For this discrepancy, besides the aforementioned reason that there are many genes awaiting identification or annotation, a major reason may be that many genomic loci are not considered as genes because their transcripts do not contain a long-enough open reading frame (ORF). However, this may be a problem of the translation algorithm that is formulated based on the current concept of gene. Indeed, today's algorithm cannot translate some human mitochondrial RNAs 78 and cannot explain why some RNAs with multiple stop codons, such as the Mig-7 mRNA 79-81, can still be expressed to a protein. The current algorithm cannot translate the mRNAs containing the CAG and GGGGCC repeats either 82-84. About half of the human mRNAs contain upstream ORFs (uORFs) in the region regarded by the current algorithm as the 5'-UTR (Fig 3), and in most cases one mRNA contains several uORFs 85. Many uORFs may be translated to peptides 85-87, such as the one in the yeast SPO24 mRNA 88;89 and the one in the dendritically localized Shank1 mRNA 90, but the peptides may be too short to be noticed 91. Besides regulation of the mRNA stability 85, these uORFs determine which start codon should be used 86, including non-AUG ones 92, such as CUG and even the questionable AUA 93-95. Some uORFs may lead to translation of an N-terminally extended or truncated protein isoform as well 10;91.
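How a single mRNA can harbor uORFs, the annotated ORF and in-frame downstream starts can be sketched with a naive scan of all three reading frames. This is a minimal illustration, not the annotation algorithm used by any database: the function names, the labels and the toy sequence are hypothetical, only ATG and CTG are treated as start codons, and sequence context, read-through and non-canonical stops are ignored.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, starts=("ATG", "CTG"), min_codons=2):
    """Return (start, end, frame, start_codon) for each start codon with an in-frame stop.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    ORFs that share a stop codon are reported separately.
    """
    seq = seq.upper()
    orfs = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in starts:
            continue
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                if (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3, i % 3, seq[i:i + 3]))
                break
    return orfs


def classify(orf, cds_start, cds_end):
    """Label an ORF relative to the annotated coding region (hypothetical labels)."""
    start, end, _frame, _codon = orf
    if start == cds_start and end == cds_end:
        return "annotated ORF"
    if start < cds_start:
        return "uORF"
    if end == cds_end and (start - cds_start) % 3 == 0:
        return "in-frame downstream start"
    return "AltORF"


# Toy mRNA (as DNA): a short uORF, then an annotated ORF with an internal in-frame ATG.
TOY_MRNA = "GGATGAAATAGCCATGGCCATGGGTTGA"
CDS_START, CDS_END = 13, 28  # hypothetical annotation of the toy sequence

for orf in find_orfs(TOY_MRNA):
    print(orf, classify(orf, CDS_START, CDS_END))
```

Even this 28-nucleotide toy yields three candidate ORFs, which hints at why a real mRNA of several kilobases can carry the many uORFs and alternative starts described above.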

Fig 3

Illustration of multiple ORFs in a given mRNA. Top panel: In the wt human CDK4 mRNA (copied from the NCBI database as a DNA sequence), as an example, all ATGs and CTGs as the most possible start codons are highlighted in red color while the three canonical stop codons (TAA, TAG and TGA) are shaded with yellow color. The ATG and TGA of the annotated CDK4 ORF are italicized and boldfaced with green color, while all in-frame downstream ATGs that may initiate N-terminally truncated CDK4 protein isoforms are highlighted in green color. Some (but not all, to avoid overwhelming the picture) ORFs that are initiated from out-of-frame ATGs and thus encode non-CDK4 peptides or proteins are underlined, with red underlining indicating the ORFs outside the CDK4 coding region, green underlining indicating an ORF overlapping with the CDK4 C-terminus, and black underlining indicating the ORFs within the CDK4 coding region. Some of these non-CDK4 AltORFs also contain some shorter out-of-frame AltORFs, which are displayed in yellow letters. Bottom panel: Although the current translation algorithm assigns only one ORF (long red bar, referred herein to as “annotated” ORF) to one mRNA (long black arrow), the mRNA also has two uORFs (short green bar) at the 5'UTR and an out-of-frame AltORF at the 3'UTR (long green bar). Moreover, there are many other short AltORFs (blue bar) that are not in frame with one another or with the annotated one. Some of these AltORFs may overlap with the nearby ones and contain some even shorter AltORFs (short yellow bar).

Some mRNAs have been known to contain alternative ORFs (AltORFs) that are irrelevant to the annotated or wild type (wt) ORF, as illustrated in figure 3, although the current algorithm allows only one authentic ORF for one mRNA and considers these AltORFs untranslatable. For example, the XL-exon of the XLαs/Gαs gene in the human and rat encodes a protein completely different from the XLαs/Gαs protein 96. In the wt human Ataxin-1 mRNA, an out-of-frame ORF overlaps with the wt one 97. There are a few other known cases of AltORFs in the literature, including the PRNP and the T cell epitopes 98;99. We surmise that AltORFs may pervasively exist in human mRNAs to dramatically enlarge the protein repertoire (Fig 3), and this conjecture is supported by some bioinformatic data 100. Unfortunately, like those peptides translated from short uORFs 91, many proteins or peptides translated from AltORFs may be too short to be catchable as well 10. Even within the annotated ORF of a given mRNA, on most occasions there are many in-frame start codons that are downstream of the canonical ATG and may initiate translation via different mechanisms, such as the use of an IRES (internal ribosome entry site), to produce N-terminally truncated protein isoforms, as we discussed before 54-56;101 and as shown in figure 3 for the human CDK4. Some short protein isoforms of c-Myc, P53 and RB1 are good examples 102-104. In addition, translation initiation may occur via the so-called +1 or -1 frame-shift mechanisms 105-107, which convert some ncRNAs to mRNAs and may even lead to production of a completely different protein. Most of these alternative initiations of translation occur more often in stressed situations such as in cancer and other diseases 85;108-110. 
Termination of translation is as sophisticated as the above-described initiation and elongation, in part because in some situations the three canonical stop codons (UAA, UAG and UGA) may be read through or may instead encode glutamine, tyrosine, pyrrolysine, leucine, cysteine, tryptophan or selenocysteine 111;112. The NCBI has already listed tryptophan, selenocysteine and pyrrolysine as three new AAs in proteins 112, but few translation algorithms have included them. AGG and AGA encode not only arginine but also glycine and serine and can, at least in human mitochondria, serve as stop codons as well 113;114. Moreover, AAG and AAA encode not only lysine but also asparagine, whereas CUG encodes not only leucine but also serine, besides serving as a start codon 112. Translation termination can be different in different subcellular locations as well. For instance, in human cells the humanin mRNA is translated to a 24-AA peptide in the cytoplasm but to a 21-AA peptide instead in the mitochondria 115;116. It is worth noting that a drug called G418 is often used to assist selection of cell clones in cell culture but many peers do not realize that G418 is a potent stop codon suppressor that helps the cells in selecting another stop codon 117-120. Just like the initiation and elongation of translation, stop codon read-through and alternative codon usage occur more often in stressed situations as well 121. In most proteomic studies, including ours 1, there is always a large portion, often varying between 10-50% 122, of the peptides that cannot be matched to the references and thus are unknown. Although the reasons behind this phenomenon are multiple, including technical ones and gene mutations or polymorphisms, the above-described unannotated genes, imperfect translation algorithms, and alternative codon usages may also account for many of these unmatchable peptides.
A new definition of gene is required as the basis for constructing a better algorithm to solve these problems of protein translation. An imperative task is to determine whether the omnipresent uORFs and AltORFs are really translated to functional proteins or peptides. For example, our conjecture awaits determination as to whether the wt CDK4 mRNA is translated to not only the wt CDK4 protein but also, in different situations, different N-terminally truncated CDK4 isoforms and many other proteins encoded by those uORFs and AltORFs underlined in figure 3.
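The dependence of the translated product on the codon table, and on stop codon read-through, can be sketched as follows. This is an illustration under stated assumptions: the codon assignments are those of the standard and vertebrate mitochondrial genetic codes as generally tabulated, the `readthrough` parameter is a hypothetical simplification that recodes a stop codon everywhere in the sequence, and the toy sequence is invented and is not the humanin mRNA.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# Vertebrate mitochondrial code: TGA encodes Trp, ATA encodes Met, AGA/AGG are stops.
VERT_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}


def translate(seq, table, readthrough=None):
    """Translate from position 0 until the first stop codon.

    `readthrough` optionally maps stop codons to an amino acid,
    e.g. {"TGA": "U"} to decode TGA as selenocysteine (a simplification).
    """
    table = {**table, **(readthrough or {})}
    peptide = []
    for i in range(0, len(seq) - 2, 3):
        aa = table[seq[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


# Toy coding sequence (hypothetical): ATG AAA TGA GGC AGA TTT TAA
TOY = "ATGAAATGAGGCAGATTTTAA"

print(translate(TOY, STANDARD))                # MK
print(translate(TOY, VERT_MITO))               # MKWG
print(translate(TOY, STANDARD, {"TGA": "U"}))  # MKUGRF
```

The same nucleotide sequence yields three different peptides depending on the decoding rules, which is the kind of ambiguity a single fixed translation algorithm cannot capture when matching peptides from proteomic data.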

Ectopic expression of cDNA skips RNA processing and is thus often misleading

Ever since the start of the widespread use of the techniques of reverse transcription and polymerase chain reactions three decades ago, genes have been identified from mRNA, and rarely from genomic DNA (gDNA) as was done previously. Nowadays, after a gene is identified, its function is determined using two opposite strategies, i.e. to increase and to decrease its expression in vitro or in vivo with the original level as the reference. Decreased levels are achieved by knocking out the gene (often just interrupting its ORF) or knocking down its mRNA levels by ectopically expressing miRNAs or siRNAs, whereas increased levels are reached by ectopically expressing a cDNA (Fig 4). Almost without exception, one cDNA of the gene is inserted into a vector and then delivered into cells in culture or in an animal by, for example, transfection, infection or a transgenic technique. Inside the cells, the cDNA is transcribed to RNA again. In most cases, the cDNA contains only one ORF and thus the RNA is translated to only one protein isoform, although sometimes the cDNA construct is bicistronic with the other cistron encoding a tracer peptide. If the gene in question is known to express multiple mRNA variants, each of the corresponding cDNAs is separately constructed into a vector and then delivered to the target cells for study of its function.

Fig 4

Illustration of how a gene functions by producing different RNAs. Left panel: Flowchart of the routine in studying a gene's function, with emphasis on the ectopic expression approach. Sequencing RT products leads to identification of a gene's mRNA in a cDNA form. Aligning its sequence with gDNA will localize it to a chromosome, which allows us to knock-in or knockout the gene. Continuing to sequence more cDNAs will identify other mRNA variants, which allows us not only to knock down the expression of one, some or all of the variants using such as siRNA but also to ectopically express the mRNAs using cDNAs. For ectopic expression, each cDNA will be cloned into a vector and introduced to cells in culture or in an animal, and the resulting data are used to evaluate the function of this cDNA. Right panel: A gene, which may be expressed in two different cell types (A and B), has two alternative initiation sites and two alternative termination sites for transcription, permitting it to produce four different transcripts. One, some or all four transcripts may have a long 5'-UTR that may harbor multiple uORFs and/or an even-longer 3'-UTR that may contain AltORFs. In one cell type, e.g. normal cells, splicing of one transcript retains all five exons, thus annotated as the wt mRNA, or alternative splicing produces three mRNA variants. In another cell type, e.g. in cancer or another organ or at another developmental stage, the transcripts are spliced to a partly different spectrum of mRNA variants. Some of the mRNAs encode AltORFs as well, resulting in a total of six AltORFs in the two cell types. Moreover, the intron 2 encodes another gene, and its transcripts may be spliced to a wt mRNA with 3 exons (I1, I2 and I3) or, alternatively, to two other mRNA variants in the two cell types. The intron sequences may be processed to different ncRNAs, although only miRNAs and siRNAs are shown for simplicity. 
More complexly, part of the Crick strand of the DNA may be transcribed to some antisense RNAs as well. Therefore, the global picture about the function of this gene or genomic locus is a collective (but not simply additive) effect of the six mRNA variants and six AltORFs of the parental gene, the three mRNA variants of the nested gene, and all the ncRNAs (miRNAs, siRNAs, piRNAs, snRNAs, exRNAs, circRNAs, and antisense RNAs) in these two cell types. If the parental or the nested gene encodes a transcription factor or a membrane receptor, different heterodimers may be formed among the protein isoforms of the same gene to exert functions as well.

The above-described procedure for study of genes' functions actually deprives the cell of its right to regulate the RNA production and various RNA processes, especially splicing. In other words, the cell may have its preferred mRNA variant(s) and ratio(s) among different variants to be expressed to better exert the function of the gene in the particular situation, but we do not allow the cell to decide. Instead, we force the cell to express one variant at a time and later piece together all the shreds of result from single cDNAs as the “global” picture of the gene's function. This current routine has so many flaws that it comes close to malpractice on, probably, most occasions: First, ectopic expression of a cDNA elides the alternative initiation or termination of transcription, thus depriving the cell of its right to produce different transcripts to cope with the particular situation. Second, in the case where there exists intron-embedded or nested gene(s), the function of the to-be-studied gene, or more correctly the genomic locus that contains two or more genes, should be partly contributed by the nested gene(s). Ectopic expression of an intron-less cDNA deletes this contribution. Third, intron sequences, once nipped during splicing, may be processed to regulatory RNAs, especially miRNAs and siRNAs, to elicit functions, but these functions are lost because cDNAs lack introns. It seems to be possible for us to add back the regulatory RNAs such as siRNAs to correct this weakness. However, we have shown that during cis-splicing of the mouse p53 pre-mRNA, different introns are spliced in different orders to produce different variants with retention of different introns 123. Therefore, we have no way of knowing in any situation which regulatory RNAs from which introns should be produced, and thus should be added back to the system when we ectopically express a cDNA.
Fourth, each individual mRNA (cDNA) may have different functions when it is present alone and when it is accompanied by its sibling mRNAs and/or intron-derived regulatory RNAs. This possibility is greatly increased when the gene encodes a transcription factor or a membrane receptor that often functions by forming heterodimers among different isoforms. A cDNA-derived single protein isoform cannot form such heterodimers and may heterodimerize with the protein isoforms of the endogenous origin to muddle things up. Fifth, in general we still know much less about translation, relative to transcription 9. The existence of multiple uORFs in the 5'-UTR and the existence of the IRESs within the ORF can greatly affect the selection of the start codon, resulting in an N-terminally extended or truncated protein isoform, whereas a +1/-1 frame-shift mechanism may render an ncRNA coding and may even produce a completely different protein. These three mechanisms are under the sway of the length of 5'-UTR. Moreover, the 3'-UTR also greatly influences the decision on whether the translation elongation needs to 1) read through the canonical stop codon, 2) use the stop codon to encode an AA, or 3) utilize a downstream stop codon that can be a canonical one or an alternative one, as described above. All these options are often trimmed away when the cDNA is cloned into the vector, not only because usually very short 5'- and 3'-UTRs are retained in the construct but also because the artificial promoter in the vector, usually from a cytomegalovirus (CMV), ignores different translation algorithms. In addition, trimming away the 5'- and 3'-UTRs trims away many AltORFs as well. Therefore, when a cDNA is expressed from a vector, it is very likely to function differently from its corresponding mRNA that has a full length of the 5'- and 3'-UTRs. Sixth, it is likely that a gene's function is synergistically or antagonistically different from the simple addition of all single mRNAs.
If trans-splicing is also involved, chimeric RNAs and bicistronic mRNAs may be generated. Probably on many occasions, all the proteins translated from different annotated ORFs, different uORFs and different AltORFs interact in a complex manner to determine the gene's function (Fig 4). The six scenarios described above, plus many others not mentioned, raise the complexity to a much higher order and make it impossible for us to produce a genuine picture of the gene's functions by piecing together the results from cDNAs that are expressed piecemeal. Adding data from the opposite approach, i.e. down-regulating the expression (Fig 4), will help greatly but still cannot correct the flaws of cDNA use described above. Besides, knockdown techniques have their own defects and weaknesses as well, such as the off-target problems of the routine RNA-silencing methods 124-127. Moreover, in many gene-knockout animals, only the ORF for the target (usually the wt) protein is interrupted, whereas the RNAs or parts of the RNAs may still be expressed. These defects are of great concern in today's ribonomic research but will not be detailed herein, to avoid distraction. To use an analogy, the achievements of each research group are made via collaboration among all teammates (technicians, graduate students, postdocs and the principal investigator) in a highly synergistic manner. Therefore, it is largely wrong to divide the achievements among different teammates in proportions that we think are reasonable, and it is even more wrong to attribute all achievements only to the principal investigator, who is roughly equivalent to the wt mRNA.
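The fifth point above, that trimming the 5'-UTR during cloning also removes uORFs, can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions, not a method from this article: the nucleotide sequences are invented, the names `find_uorfs` and `build_construct` are hypothetical, and a uORF is simplified to an ATG that begins upstream of the annotated start codon and has an in-frame stop codon somewhere downstream. Start-codon context, IRESs, frame-shifting and non-ATG initiation are all ignored.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Invented toy sequences (assumptions, not real transcripts).
# The 5'-UTR contains two small uORFs: ATG CCC TAA and ATG GCT TGA.
FULL_UTR = "GGC" + "ATGCCCTAA" + "GGAGC" + "ATGGCTTGA" + "CCGCCACC"
MAIN_ORF = "ATGGCCAAAGGGTAA"


def find_uorfs(seq: str, main_start: int) -> List[Tuple[int, int]]:
    """Return (start, end) of every ATG that begins upstream of main_start
    and reaches an in-frame stop codon; end is exclusive."""
    seq = seq.upper()
    uorfs = []
    for start in range(main_start):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this upstream ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


def build_construct(utr: str, orf: str, keep: int) -> Tuple[str, int]:
    """Keep only the last `keep` nucleotides of the 5'-UTR, mimicking the
    short UTR retained in a typical cDNA construct. Returns the sequence
    and the position of the annotated start codon within it."""
    kept = utr[len(utr) - keep:] if keep > 0 else ""
    return kept + orf, len(kept)


full_seq, full_start = build_construct(FULL_UTR, MAIN_ORF, keep=len(FULL_UTR))
trimmed_seq, trimmed_start = build_construct(FULL_UTR, MAIN_ORF, keep=8)

print("full-length 5'-UTR:", find_uorfs(full_seq, full_start))
print("trimmed 5'-UTR:   ", find_uorfs(trimmed_seq, trimmed_start))
```

With these toy sequences, the full-length transcript reports two uORFs and the trimmed construct reports none, so any influence the uORFs might have on start-codon selection is no longer available to the cell.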

Then, how should we redefine gene and determine its function?

Today's routine practice of expressing cDNAs piecemeal in target cells is actually a good and efficient strategy for determining the function of individual mRNA variants and the corresponding protein isoforms, i.e. individual forms, of a “gene”. It is undeniable that we have learned a great many details about the functions of our genome by using cDNA, and much of the knowledge so obtained has been verified by a variety of means, such as clinical observations and manipulations in patients or in experimental animals. If we consider a part of a classically-defined gene, such as an RNA variant (coding or non-coding) or a phosphorylation state of a protein, as an individual “gene”, our data on the function of each “gene” would conflict with each other far less. For instance, it becomes understandable that a truncated protein isoform translated from a cis-spliced mRNA variant differs in function from the wt protein, as they are products of two different “genes”. However, considering a small section of the classically-defined gene as a “gene” may make it more difficult to understand the collective function of a genomic locus as a whole because, as discussed above, each genomic locus functions via complex synergies and antagonisms among different types of RNAs and among different proteins or protein isoforms produced from the locus. These complex collaborative and antagonistic interactions among various gene elements are also the main reason for the omnipresent controversy in the data on the functions of most genes documented in the literature. 
Actually, because of these omnipresent pros and cons, the whole biomedical community has become used to, and enjoys, saying “on one hand…, but on the other hand…” Even worse, although we know that the conflicting information may be largely due to the use of cDNA, and that data resulting from cDNA use somewhat distort the picture of the functions of the currently-defined gene in question, we have no way of knowing the extent of the distortion. In our opinion, piecemeal ectopic expression of single cDNAs will not lead us to an undistorted picture of the functions of genes defined as individual genomic loci, and in many cases probably not even close to one. We favor the gene definition given by the current Wikipedia: “a broad, modern working definition of a gene is any discrete region of heritable, genomic sequence which affects an organism's traits by being expressed as a functional product or by regulating expression.” For simplicity, a gene is herein referred to as a genomic locus, with its activities depicted in an oversimplified manner in Fig 4, which shows that it functions at a much higher level of complexity than cDNAs can reveal.

Concluding remarks

We are now in a post-genomic era, which requires a new gene definition to accommodate the recent advances in genomic, ribonomic and proteomic research. In the past decades, we have learned a great deal from cDNA about the functions of individual mRNA variants and protein isoforms. However, in most cases, our knowledge about the function of each genomic locus as a whole gene is a distorted and unfaithful picture containing plenty of conflicting information. The distortion and the conflicting data may arise mainly because alternatives occur at all levels, including alternative initiation and termination of transcription and translation, alternative codon usage during translation elongation, alternative ORFs within a given mRNA, etc. In most cases, piecemeal ectopic expression of cDNA cannot mimic these alternatives, and piecing together the resulting data cannot lead us to the collective function of a gene as a genomic locus. To obtain an undistorted picture of a gene's function, we should take much greater caution when using a cDNA and when interpreting the data resulting from its use. Instead, we should ectopically express gDNA, so that the cells can decide how to transcribe and then process (mainly splice) the transcript(s) in order to better cope with the particular situation. The gDNA construct may be placed under the control of a physiological or a viral (such as CMV) promoter, so as to address different aspects of transcription initiation. Delivery of a gDNA into cells is still difficult because gDNAs are usually very large, but it is doable with the available technology 128,129. For instance, clones of bacterial artificial chromosomes 130-132 and of mouse and human artificial chromosomes 133-136 are available for this purpose. Actually, delivery of gDNA into cells in culture, and even in an animal, has already been used to correct genetic disorders in the lab 137. 
It is probably time for us to put more effort into renovating gDNA delivery technology and to make such delivery into cells routine in our exploration of genes' functions in both physiological and pathological situations.

137 in total

Review 1. Alternative translational products and cryptic T cell epitopes: expecting the unexpected.

Authors: On Ho; William R Green
Journal: J Immunol Date: 2006-12-15 Impact factor: 5.422

Review 2. Divorcing ARF and p53: an unsettled case.

Authors: Charles J Sherr
Journal: Nat Rev Cancer Date: 2006-08-17 Impact factor: 60.716

Review 3. A gripping tale of ribosomal frameshifting: extragenic suppressors of frameshift mutations spotlight P-site realignment.

Authors: John F Atkins; Glenn R Björk
Journal: Microbiol Mol Biol Rev Date: 2009-03 Impact factor: 11.056

4. Isoforms of wild type proteins often appear as low molecular weight bands on SDS-PAGE.

Authors: Ju Zhang; Xiaomin Lou; Haihong Shen; Lucas Zellmer; Yuan Sun; Siqi Liu; Ningzhi Xu; D Joshua Liao
Journal: Biotechnol J Date: 2014-07-04 Impact factor: 4.677

5. Repairing faulty genes by aminoglycosides: development of new derivatives of geneticin (G418) with enhanced suppression of diseases-causing nonsense mutations.

Authors: Igor Nudelman; Dana Glikin; Boris Smolkin; Mariana Hainrichson; Valery Belakhov; Timor Baasov
Journal: Bioorg Med Chem Date: 2010-03-27 Impact factor: 3.641

6. Detecting and characterizing circular RNAs.

Authors: William R Jeck; Norman E Sharpless
Journal: Nat Biotechnol Date: 2014-05 Impact factor: 54.908

7. Humanin peptide suppresses apoptosis by interfering with Bax activation.

Authors: Bin Guo; Dayong Zhai; Edelmira Cabezas; Kate Welsh; Shahrzad Nouraini; Arnold C Satterthwait; John C Reed
Journal: Nature Date: 2003-05-04 Impact factor: 49.962

Review 8. Complex alternative splicing.

Authors: Jung Woo Park; Brenton R Graveley
Journal: Adv Exp Med Biol Date: 2007 Impact factor: 2.622

9. Targeting migration inducting gene-7 inhibits carcinoma cell invasion, early primary tumor growth, and stimulates monocyte oncolytic activity.

Authors: Aaron P Petty; Stephen E Wright; Kathleen A Rewers-Felkins; Michelle A Yenderrozos; Beth A Vorderstrasse; J Suzanne Lindsey
Journal: Mol Cancer Ther Date: 2009-08-11 Impact factor: 6.261

Review 10. Humanin and age-related diseases: a new link?

Authors: Zhenwei Gong; Emir Tas; Radhika Muzumdar
Journal: Front Endocrinol (Lausanne) Date: 2014-12-04 Impact factor: 5.555
