Literature DB >> 33510865

Interpopulation differences of retroduplication variations (RDVs) in rice retrogenes and their phenotypic correlations.

Haiyue Zeng1,2,3, Xingyu Chen3, Hongbo Li4, Jun Zhang5, Zhaoyuan Wei1,2, Yi Wang1,2.   

Abstract

Retroduplication variation (RDV), a type of retrocopy polymorphism, is considered to have essential biological significance, but its effect on gene function and species phenotype is still poorly understood. To this end, we analyzed the retrocopies and RDVs in 3,010 rice genomes. We calculated the RDV frequencies in the genome of each rice population; detected the mutated, ancestral and expressed retrogenes in rice genomes; and analyzed their RDV influence on rice phenotypic traits. Collectively, 73 RDVs were identified, and 14 RDVs in ancestral retrogenes can significantly affect rice phenotypes. Our research reveals that RDV plays an important role in rice migration, domestication and evolution. We think that RDV is a good molecular breeding marker candidate. To our knowledge, this is the first study on the relationship between retrogene function, expression, RDV and species phenotype.
© 2021 The Authors.

Entities:  

Keywords:  BP, Biological process; CC, Cellular component; FPKM, Fragments per kilobase per million mapped reads; GO, Gene ontology; Indel, Insertion-deletion; MF, Molecular function; MST, monosaccharide transporter; ORF, Open reading frame; RDV; RDV, Retroduplication variation; Retrocopy; Retrogene; Retroposition; Rice

Year:  2021        PMID: 33510865      PMCID: PMC7811064          DOI: 10.1016/j.csbj.2020.12.046

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   7.271


Introduction

Increasing attention has been paid to retrocopy and retroposition, which have been recently reported to be vital to the formation of new genes. Retrocopies are defined as gene copies originating by reverse transcription of mRNA and are subsequently inserted into the genome (Fig. 1A). This above process is called retroposition, which generally occurs by enzymes derived from retrotransposable elements, thereby giving rise to intronless copies of existing genes [1]. The source gene of the duplicated mRNA is generally named the parental gene. For a long time, retrocopies were considered as “junk DNA” due to a lack of regulatory elements as well as the presence of functional disruptions (including in-frame terminations, frameshift mutations and 5’ truncations) [2]. However, at present, retrocopies are called “seeds of evolution” due to their significant contributions to genomic evolution [3]. Numerous studies have shown that retrocopies may gain functionality and become so-called retrogenes. Retrogenes might have intact open reading frames (ORFs) with protein-coding functions [4] or lead to new protein domain formation through fusion with other genes [5], [6]. Functional retrogenes can encode proteins that may be similar or different from those encoded by their parental genes [7], and they can also be used to replace exons or produce chimeric transcripts [8]. It was found that retrogenes significantly contribute to the ongoing generation of new genes over evolutionary processes [7], [9], [10] and affect the expression of parental genes by retrogene-originated siRNA [11], [12], [13] and miRNA [14]. Meanwhile, parental genes are likely to become pseudogenes, and their function is subsequently completely lost and are further replaced by retrocopies, which can turn retrogenes into “orphans” (with the absence of parental genes in the genome) [15]. Therefore, retrocopies can also be utilized as useful phylogenetic markers, shedding novel light on evolutionary processes such as sex chromosome origination [16] or alterations in germline gene expression and retrotransposable element activities [7].
Fig. 1

(A) Formation of a retrocopy: the mRNA transcribed from the gene is reverse transcribed into cDNA and subsequently inserted into the genome. (B) Formation of an RDV: an insertion or deletion in a retrocopy.

(A) Formation of a retrocopy: the mRNA transcribed from the gene is reverse transcribed into cDNA and subsequently inserted into the genome. (B) Formation of an RDV: an insertion or deletion in a retrocopy. To study the function of retrogenes, researchers usually analyze their expression through RNA-Seq data. Baertsch et al. [17] studied human retrogenes and found that 726 retrogenes can be expressed. They believe that these genes may play an important role after expression. Navarro et al. [18] found approximately 3,600 expressible retrogenes in seven primates. These expressible retrogenes are often close to a normal gene, and their expression is tissue-specific. They also believe that the expression of the retrogene and the parental gene is not collinear. Interestingly, Zhong et al. [19] found 306 retrogenes in zebrafish. Among them, the expressible retrogenes and their parental genes have strong collinearity in expression. Du et al. [20] obtained 219 expressible retrogenes in the coelacanth fish genome, of which the newly formed retrogenes had a lower expression value than the old retrogenes. In the plant genome, Sakai et al. [21] analyzed the expression of 150 retrogenes in rice in 7 tissues, and the results showed that the expression level of retrogene is often lower than that of the parental gene, but there is a certain collinearity. Abdelsamad et al. [22] believe that retrogenes in Arabidopsis are abundantly expressed in pollen and play an important role in pollen development. Retroduplication variations (RDVs) are a type of polymorphism with the insertion or deletion of a retrocopy from individual genomes (Fig. 1B). Three studies [6], [23], [24] have uncovered 208 RDVs in the human genome, two of which [23], [24] facilitate the re-establishment of the phylogenetic tree of human populations, confirming the significance of RDV polymorphisms as genomic markers for a population’s history. Moreover, Kabza et al. [25] developed a new approach to detect novel retrocopies not annotated in the reference genome and focused on identifying RDV caused by retrocopy deletion. By studying human genomes from 17 populations, they found 193 RDVs and analyzed their distribution among populations, which has important implications for studying human evolution and migration. With the exception of these studies, there is no research on the RDVs of other species. Moreover, RDV may affect the function and expression of retrogenes, which is directly reflected in the phenotypes of the species. However, no research has been conducted on the relationship between retrogene function, expression, RDV and species phenotype. Due to the above various unresolved questions, we aimed to identify RDVs caused by the different indels in functional retrogenes of rice. RNA-Seq data from the Rice Expression Database [26] were utilized to evaluate the expression of retrogenes and their parental genes. The 3,000 Rice Genomes Project [27], [28], [29] was adopted for assessment of the RDVs among rice populations. The mutated, ancestral and expressed retrogenes were detected in rice genomes, and their RDV influence on rice phenotypes was analyzed with statistical methods.

Results

Summary of data for retrocopies in rice and nonsynonymous (Ka)/synonymous (Ks) analysis

To analyze RDVs in rice, first, we need to scan retrocopies in the genome. We obtained all 74 retrocopies of rice (Oryza sativa japonica) from RetrogeneDB [30]. Although previous studies have suggested that retrocopies in rice might be more numerous [21], [31], retrocopies in RetrogeneDB are high quality because they were annotated based on a very rigorous analysis, followed by a manual inspection to ensure maximum data quality. Compared with the parental gene, all retrocopies met the following conditions: length of the alignment at least 150 bp, minimum of 50% coverage, minimum of 50% identity, and loss of at least two introns. Then, we obtained 37 retrocopies with a high coverage of over 90% with their parental genes (Fig. 2A), and 6 retrocopies had a high identity of over 90% (Fig. 2B). Forty-nine retrocopies were annotated as functional retrogenes whose status was classified as “KNOWN_PROTEIN_CODING” in RetrogeneDB. Twenty-five retrocopies were neither annotated as protein coding genes nor overlapped with annotated pseudogenes; they were determined to be “NOVEL” retropseudogenes (Fig. 2C). Fifty-nine retrocopies had conserved ORFs (Fig. 2D), which means that these retrocopies contained no frameshifts or stop codons.
Fig. 2

Summary of 74 rice retrocopies (Table S1). (A) Parental gene protein coverage within the retrocopy-parental alignment. (B) Retrocopy-parental gene alignment percent identity. (C) Ratio of functional retrogenes and retropseudogenes. (D) Ratio of retrocopy with and without conserved ORF. (E) Ka/Ks distribution of functional retrogenes and retropseudogenes.

Summary of 74 rice retrocopies (Table S1). (A) Parental gene protein coverage within the retrocopy-parental alignment. (B) Retrocopy-parental gene alignment percent identity. (C) Ratio of functional retrogenes and retropseudogenes. (D) Ratio of retrocopy with and without conserved ORF. (E) Ka/Ks distribution of functional retrogenes and retropseudogenes. To detect the selection pressure of protein levels for retrocopies, we calculated the Ka, Ks and substitution rate (Ka/Ks) between retrocopies and their parental genes; if Ka/Ks ≤ 0.5, then a retrocopy is considered to be negatively selected [32], [33]. Table 1 shows the different Ka/Ks distributions of functional retrogenes and retropseudogenes. Their proportions show a clear distribution difference. For the functional retrogenes, 44.90% had Ka/Ks < 0.5, while 76.00% of retropseudogenes had Ka/Ks in a range of 0.5–1.2, indicating that most functional retrogenes were affected by negative selection and most retropseudogenes were neutrally selected. We also found that the Ka/Ks values of retropseudogenes were concentrated at 0.5–0.6, while those of functional retrogenes were concentrated at 0.1–0.3. The geometric mean of the Ka/Ks value for the rice retropseudogene was 0.5168, which was lower than the expected value (1.0) [34]. The reason why certain retropseudogenes are under significant selection pressure may be that they remove deleterious mutations by selective cutting and maintaining their functionality, or they were originally functional but later turned into pseudogenes due to harmful mutations. In addition, some retrocopies had extremely large Ka/Ks values, indicating that they are strongly positively selected.
Table 1

. Selective pressure of rice retrocopies.

GeneKa/Ks < 0.5Ka/Ks = 0.5 ~ 1.2Ka/Ks > 1.2Geometric mean
Functional retrogene22/49 (44.90%)16/49 (32.65%)11/49 (22.45%)0.3747
Retropseudogene3/25 (12.00%)19/25 (76.00%)3/25 (12.00%)0.5168
. Selective pressure of rice retrocopies.

Estimates of the years of retrocopy origin and conservation analysis

To assess the formation history and identify ancestral retrocopies, we used the molecular clock to calculate the origin time. We assume the Ks of rice genes to be 6.5 × 10−9 substitutions per silent site per year [35]; then, the results show that a large number of new retrocopies in the rice genome were produced within the last 20 million years, later than the differentiation time of indica rice and japonica rice (20–40 million years ago) [36], [37], [38], [39] (Fig. 3A). This shows that after the differentiation of indica and japonica rice, the number of retrocopies has exploded. This is consistent with previous studies in mammals [33], fish [40], [41] and other plants [31], proving that retrocopy plays an important role in species formation and differentiation. Interestingly, retropseudogenes are found at both ends of the timeline, indicating that some retropseudogenes newly formed in the past 20 million years and that some represent ancient ancestral genes.
Fig. 3

(A) Origin time of retrocopy based on a molecular clock. (B) Retroposition in Tracheophyta. Left boxes indicate the number of retrocopies originating in a certain ancestral genome; right boxes reveal the number of retrocopies in the Oryza sativa japonica genome. All detected ancestral retrocopies are marked by red triangles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

(A) Origin time of retrocopy based on a molecular clock. (B) Retroposition in Tracheophyta. Left boxes indicate the number of retrocopies originating in a certain ancestral genome; right boxes reveal the number of retrocopies in the Oryza sativa japonica genome. All detected ancestral retrocopies are marked by red triangles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Moreover, we performed comparative analysis across 25 Tracheophyta species to identify retrocopy orthologs that originated prior to rice subspecies. OrthoVenn2 [42] was used to identify retrocopies originating in certain lineage ancestors, followed by checking orthologous genes in major plant lineages. We subsequently identified 72 retroposition events in the Oryza sativa ancestral genome, followed by 50 retropositions in Oryzinae, 196 in Poaceae, and 18 in commelinids. In comparison, we detected 117 retrocopies common to Pentapetalae and 65 shared by tomato and potato (Fig. 3B). We also determined the presence of 48 retrocopies in more than one other genome, that is, 48 retrocopies of Oryza sativa japonica were ancestral. Of these, 4 originated in the common ancestor Oryza sativa, 7 originated in Oryzinae, and 16 were extremely ancient with a presence in more than one genome of analyzed Mesangiospermae (Fig. 3B). This is consistent with the results of the molecular clock analysis, proving that rice retrocopies mostly originated from very old ancestors or newly formed Oryza sativa japonica.

Identification of RDVs from known retrogenes in the rice genome

To analyze the retrocopy variation caused by retrogene indels in rice populations, all 74 rice retrocopies were used, and the genomic variation data were obtained from the 3,000 Rice Genomes Project [27], [28], [29]. To avoid the complexity of variation, we focused on only the indels of biallelic variation. Each RDV is an indel of at least 8 bp of retrogene sequence in certain analyzed genomes. As a result, 73 indels influencing 35 retrogenes were identified by mapping rice retrocopies to genomic variants (Table S2), and only one RDV (RDVD_osat_62) was detected in the retropseudogene (Fig. 4A). Then, the allele frequencies of detected RDVs were calculated in different populations, including geographic regions and subpopulations, with RDVs specific information, were displayed in Table S3. It was observed that the distribution differences of some RDVs had obvious characteristics among populations. The differences in regions were manifested as the spread of frequency values from the center to the surroundings, and among the subpopulations, there were obvious differences in the magnitude of frequency values among the five populations (i.e. Xian/Ind, Geng/Jap, cA(Aus), cB(Bas) and admix).
Fig. 4

(A) Seventy-three RDVs in 35 retrogenes. (B) Frequencies of RDVD_osat_9_1 in different subpopulations. (C) Frequencies of RDVD_osat_9_1 in different geographic regions.

(A) Seventy-three RDVs in 35 retrogenes. (B) Frequencies of RDVD_osat_9_1 in different subpopulations. (C) Frequencies of RDVD_osat_9_1 in different geographic regions. For example, RDVD_osat_9_1 (one deletion in retro_osat_9), which was detected in the largest number of rice samples, can be observed that the Xian/Ind and cA(Aus) populations had the highest absence rate (approximately over 75%), while the Geng/Jap and cB(Bas) populations had the lowest absence rate (approximately <10%), and the admix population had a moderate absence rate (approximately 50%) (Fig. 4B). This shows that RDVD_osat_9_1 plays an important role in the differentiation of the rice populations. Moreover, the proportion of detected RDVD_osat_9_1 in the South Asian population was the highest at 71.60%, with approximately 55%-60% of alleles in the East Asian, Southeast Asian and African populations; approximately 25%-30% of alleles in the American, West Asian and Oceanian populations; and approximately 5% in the European population (Fig. 4C). The most likely explanation for this phenomenon is the emergence of a new retroposition occurring in Asia, which spread to other continents, or it might be a deletion originating in Europe. We prefer to accept the first hypothesis, which is consistent with the origin, evolution, and migration routes of rice based on previous research that cultivated rice originated and was domesticated in Asia [43].

Gene ontology (GO) enrichment analysis of mutated, ancestral and expressed retrogenes

To study the relationship between RDV and functional retrogenes in depth, we set three criteria, mutated, ancestral and expressed, to screen for important retrogenes in the rice genome. These retrogenes are derived from ancestors and have undergone domestication and natural selection periods; they are mutated but still remain in the rice genome, so their produced RDVs may have important biological significance. We considered a retrogene to be expressed if it had a normalized expression value of over 1 FPKM (fragments per kilobase per million mapped reads) in at least one normal tissue. These criteria were met by 29 retrogenes, and most of the expressed retrogenes met these criteria (Fig. 5A).
Fig. 5

(A) The Venn diagram shows the number of retrogenes that are expressed, ancestral and mutated. (B) GO enrichment analysis. BP (biological process), CC (cellular component) and MF (molecular function). Enriched GO terms were selected by Fisher’s exact test and FDR < 0.05.

(A) The Venn diagram shows the number of retrogenes that are expressed, ancestral and mutated. (B) GO enrichment analysis. BP (biological process), CC (cellular component) and MF (molecular function). Enriched GO terms were selected by Fisher’s exact test and FDR < 0.05. To study the role of these ancestral, expressed retrogenes that had undergone insertion or deletion in the adaptive evolution of rice, the GO method was used to classify these genes functionally into three sets including biological process (BP), molecular function (MF), and cellular component (CC) based on GO analysis. These retrogenes were significantly enriched in 20 GO clusters (p < 0.05, FDR < 0.05, Fig. 5B). It is worth noting that the most significant GO clusters were mainly concentrated in the MF and CC categories. For the MF category, hydrogen symporter activity, sugar symporter activity and transmembrane transporter activity were the most important terms. For the CC category, cell, cell component and membrane were significantly enriched. The BP category was significantly enriched in only one term of carbohydrate transport. These results prove that RDV mainly occurs in expressed ancestral retrogenes that are primarily involved in transporter activity, cell membrane composition and transmembrane transporter activity.

Associations of the RDVs and phenotypes

The genotypes and phenotypes of rice RDVs of 29 expressed ancestral retrogenes were analyzed for statistical correlation, and RDVs with less than three samples corresponding to a single genotype were eliminated. A total of 14 RDVs were analyzed for their influence on the phenotypes. The results show that each RDV can significantly affect at least one trait of rice (p < 0.05), which strongly proves the effect of RDV on gene function or phenotype. All statistical results can be found in Table S4. For example, the results of RDVs in three different retrogenes (retro_osat_9, OsVPE4, annotated as vacuolar-processing enzyme precursor; retro_osat_16, OsMST5, annotated as transporter family protein; retro_osat_27, annotated as cupin domain-containing protein) are displayed in Table 2. Among the nine traits, RDVD_osat_9_1 can significantly affect seven traits, indicating that retro_osat_9 may have abundant gene functions and may participate in numerous physiological activities and that this deletion affects the various agronomic characteristics of rice, while RDVI_osat_16 and RDVI_osat_27_1 can significantly affect four traits. The insertion variant of RDVI_osat_16 increased the culm number and ligule length and reduced the grain width and thousand grain weight compared with the DD and II samples. The insertion variation in RDVI_osat_27_1 increased the culm number, ligule length and time to flowering and reduced the grain width.
Table 2

Effects of three RDVs in three retrogenes (retro_osat_9, retro_osat_16, retro_osat_27) on the agronomic traits and grain size of rice from the 3,000 Rice Genomes Project based on statistical analysis. The genotype that was homozygous for deletion was DD, the genotype that was homozygous for insertion was II, and the genotype that was heterozygous was DI. These genotypes apply below.

RDVD_osat_9_1
PhenotypeGenotypic (LSM ± SE)
IIDIDDP valuenII, nDI, nDD
Culm length (cm)109.50 ± 1.11B110.13 ± 4.53AB114.83 ± 0.91A0.0048718,32,1007
Culm number (count)15.15 ± 0.19B15.56 ± 0.84AB17.49 ± 0.16A3.4615e-19718,32,1008
Grain length (mm)8.61 ± 0.048.53 ± 0.268.62 ± 0.030.7766827,18,1163
Grain width (mm)3.16 ± 0.02A2.87 ± 0.08B2.92 ± 0.01B1.585e-34827,18,1163
Ligule length (mm)16.91 ± 0.22B18.25 ± 0.95AB19.31 ± 0.16A4.3148e-12712,32,992
Panicle length (cm)25.12 ± 0.1725.28 ± 0.7124.77 ± 0.100.0579716,32,1007
Seedling height (cm)36.59 ± 0.42B38.84 ± 2.17AB40.06 ± 0.37A4.8483e-8710,32,994
Thousand grain weight (g)25.70 ± 0.17a24.06 ± 0.53b24.72 ± 0.13b0.0000061842782,48,1007
Time to flowering (day)96.87 ± 0.72B96.49 ± 3.51AB103.00 ± 0.71A0.0000044165845,37,1213
RDVI_osat_16
PhenotypeGenotypic (LSM ± SE)
DDDIIIP valuenDD, nDI, nII
Culm length (cm)112.05 ± 0.80120.17 ± 15.26114.29 ± 1.450.3221368,6,383
Culm number (count)16.21 ± 0.14B19.17 ± 2.51AB17.50 ± 0.26A01368,6,384
Grain length (mm)8.61 ± 0.038.30 ± 0.658.65 ± 0.050.67471535,5,468
Grain width (mm)3.05 ± 0.01A2.90 ± 0.20AB2.92 ± 0.02B4.0563e-71535,5,468
Ligule length (mm)18.02 ± 0.15B19.00 ± 1.71AB19.35 ± 0.27A0.00011352,6,378
Panicle length (cm)24.89 ± 0.1125.13 ± 0.8725.02 ± 0.170.99741366,6,383
Seedling height (cm)38.34 ± 0.3239.83 ± 1.6639.60 ± 0.600.19551353,6,377
Thousand grain weight (g)25.29 ± 0.12A24.30 ± 1.41AB24.56 ± 0.19B0.00281408,12,417
Time to flowering (day)99.71 ± 0.5798.71 ± 9.11102.90 ± 1.140.07321624,7,464
RDVI_osat_27_1
PhenotypeGenotypic (LSM ± SE)
DDDIIIP valuenDD, nDI, nII
Culm length (cm)112.32 ± 0.83115.06 ± 6.62113.08 ± 1.340.85111236,16,505
Culm number (count)16.06 ± 0.14B16.81 ± 1.13AB17.56 ± 0.25A0.00000451541236,16,506
Grain length (mm)8.62 ± 0.038.51 ± 0.258.63 ± 0.040.74921351,12,645
Grain width (mm)3.09 ± 0.01a2.81 ± 0.12b2.87 ± 0.01b6.9863e-291351,12,645
Ligule length (mm)17.95 ± 0.16B19.25 ± 1.50AB19.16 ± 0.23A0.00011224,16,496
Panicle length (cm)24.91 ± 0.1225.11 ± 0.8424.94 ± 0.150.9521232,16,507
Seedling height (cm)38.62 ± 0.3440.19 ± 2.7938.58 ± 0.510.89191224,16,496
Thousand grain weight (g)25.27 ± 0.1323.84 ± 0.9324.81 ± 0.170.08641268,19,550
Time to flowering (day)97.36 ± 0.58B109.67 ± 5.67AB107.14 ± 1.00A9.1898e-161445,18,632
Effects of three RDVs in three retrogenes (retro_osat_9, retro_osat_16, retro_osat_27) on the agronomic traits and grain size of rice from the 3,000 Rice Genomes Project based on statistical analysis. The genotype that was homozygous for deletion was DD, the genotype that was homozygous for insertion was II, and the genotype that was heterozygous was DI. These genotypes apply below. An interesting phenomenon was the three detected RDVs located on retro_osat_34 (EXPB8, annotated as expansin precursor); although their deletion sites were different and their genotype distributions were also different in rice samples, they had affected seven identical rice phenotypes (Table 3). The culm length, culm number, ligule length and seedling height of the mutant rice samples were higher than those of the reference genome genotype, while the grain length, grain width and thousand grain weight were lower than those of the reference genome genotype. This result shows that these three RDVs are closely linked with these seven phenotypes and have an in-depth development potential for molecular breeding.
Table 3

Effects of three RDVs in the distinct loci of one retrogene (retro_osat_34) on the agronomic traits and grain size of rice from the 3,000 Rice Genomes Project based on statistical analysis.

RDVD_osat_34_2
PhenotypeGenotypic (LSM ± SE)
IIDIDDP valuenII, nDI, nDD
Culm length (cm)111.03 ± 0.90b113.89 ± 4.74ab115.85 ± 1.08a0.02271186,27,544
Culm number (count)15.87 ± 0.15B17.19 ± 0.80AB17.84 ± 0.22A1.1001e-131186,27,545
Grain length (mm)8.69 ± 0.03A8.55 ± 0.17AB8.47 ± 0.04B0.00000220521360,22,626
Grain width (mm)3.07 ± 0.01a2.87 ± 0.06b2.90 ± 0.01b1.324e-151360,22,626
Ligule length (mm)17.62 ± 0.16B20.67 ± 0.84A19.70 ± 0.23A2.8357e-121173,27,536
Panicle length (cm)25.00 ± 0.1224.62 ± 0.5824.78 ± 0.130.21941185,27,543
Seedling height (cm)37.55 ± 0.33B39.70 ± 2.25AB40.92 ± 0.53A0.0000011111173,27,536
Thousand grain weight (g)25.53 ± 0.13a23.4 ± 0.69b24.13 ± 0.16b4.0517e-101314,32,491
Time to flowering (day)99.93 ± 0.5998.41 ± 4.40101.57 ± 1.000.62541414,34,647
RDVD_osat_34_3
PhenotypeGenotypic (LSM ± SE)
IIDIDDP valuenII, nDI, nDD
Culm length (cm)111.14 ± 0.90b114.23 ± 4.35ab115.62 ± 1.09a0.04731189,31,537
Culm number (count)15.88 ± 0.15B17.10 ± 0.86AB17.83 ± 0.23A8.5279e-131189,31,528
Grain length (mm)8.69 ± 0.03A8.54 ± 0.20AB8.47 ± 0.04B0.00000377491364,23,621
Grain width (mm)3.07 ± 0.01A2.79 ± 0.06B2.91 ± 0.01B3.3993e-161364,23,621
Ligule length (mm)17.64 ± 0.16B19.68 ± 0.90AB19.72 ± 0.23A6.466e-121177,31,528
Panicle length (cm)24.97 ± 0.1225.06 ± 0.6324.80 ± 0.130.47361188,31,536
Seedling height (cm)37.47 ± 0.33B40.35 ± 2.28AB41.08 ± 0.52A1.0705e-71177,31,528
Thousand grain weight (g)25.57 ± 0.13a23.59 ± 0.61b23.99 ± 0.16b1.8174e-121321,36,480
Time to flowering (day)99.83 ± 0.59100.24 ± 4.22101.73 ± 1.010.72571421,38,636
RDVD_osat_34_4
PhenotypeGenotypic (LSM ± SE)
IIDIDDP valuenII, nDI, nDD
Culm length (cm)111.18 ± 0.78B119.50 ± 8.98AB120.96 ± 1.31A01507,6,244
Culm number (count)16.23 ± 0.13B17.67 ± 2.82AB18.12 ± 0.34A0.00000329521508,6,244
Grain length (mm)8.66 ± 0.03A9.30 ± 0.35AB8.37 ± 0.05B01727,4,277
Grain width (mm)3.04 ± 0.01A2.55 ± 0.17AB2.93 ± 0.02B0.00031727,4,277
Ligule length (mm)18.04 ± 0.15b24.00 ± 1.48a19.83 ± 0.32a1.5847e-71490,5,241
Panicle length (cm)24.93 ± 0.1024.33 ± 1.2324.91 ± 0.190.90531506,6,243
Seedling height (cm)38.11 ± 0.30B37.80 ± 5.67AB41.77 ± 0.75A01489,5,242
Thousand grain weight (g)25.29 ± 0.11A24.93 ± 3.49AB23.69 ± 0.25B0.00000378151637,4,196
Time to flowering (day)99.76 ± 0.53110.71 ± 10.16104.26 ± 1.590.04241801,7,287
Effects of three RDVs in the distinct loci of one retrogene (retro_osat_34) on the agronomic traits and grain size of rice from the 3,000 Rice Genomes Project based on statistical analysis.

Confirmation and evaluation of the expression of retrogenes

To validate and assess the expression level of retrogenes and their parental genes, Rice Expression Database-derived RNA-Seq data of 49 functional retrogenes were utilized, and the FPKM of all functional retrogenes and their parental genes can be found in Table S5. By analyzing the expression pattern of the above retrogenes (Fig. 6A), surprisingly, we found that although the RDVs of the above four typical retrogenes have been detected to significantly affect the grain size of rice and aleurone is usually considered to be related to grain size development, their expression level in the aleurone was very low or even not expressed. Simultaneously, it can be observed that they were highly expressed in reproductive organs such as anther and pistil, that retro_osat_24 and retro_osat_16 were highly expressed in anther, and that retro_osat_34 was highly expressed in panicle and pistil, indicating that the generation of RDVs might change their degree of expression in the aleurone or that these retrogenes affected the grain size traits by playing a certain role in the reproductive organs and panicles, but this explanation requires further experiments and research to verify the functions. Moreover, in anther, panicle and pistil, the expression level of retro_osat_16 in other tissues was very low, while the other three retrogenes had moderate expression in some vegetative organs, such as the root, leaf and shoot.
Fig. 6

Expression pattern of retrogenes with their parental genes (A) Heat map of the expression pattern of retrogenes. (B) Comparison of the expression patterns between the above four typical retrogenes and their parental genes.

Expression pattern of retrogenes with their parental genes (A) Heat map of the expression pattern of retrogenes. (B) Comparison of the expression patterns between the above four typical retrogenes and their parental genes. To compare the expression pattern between the retrogene and its parental gene, we calculated the Pearson product-moment correlation coefficient (R) between the retrogenes and parental genes using the FPKM values observed from nine tissues (Table S5). For above four retrogenes, except for retro_osat_16 (R < 0), the expression of the other three retrogenes and their parental genes in various tissues was positively correlated (R > 0) (Fig. 6B). The R value of retro_osat_34 > 0.90 showed a strong positive correlation. As a whole, the expression level of most retrogenes in a single tissue was lower than that of their parental genes, and some were no longer expressed, but the relative degree of expression in each tissue was similar. However, there were two exceptions. The parental gene of retro_osat_16 was almost not expressed in anther, but it had a high degree of expression on its own; retro_osat_27 and its parental genes were expressed in anther, but the expression level of retrogenes was obviously higher than that of the parental genes. This anomaly seems to imply a certain connection between retrogene and male reproductive organs. For retropseudogenes, their expression information cannot be obtained from the Rice Expression Database because the database only contains gene expression data, so we downloaded the original RNA-Seq data from NCBI SRA and calculated the normalized expression value of these retropseudogenes (Table S5). It shows that 92% retropseudogenes can be expressed, and 70% of them were originated from one parental gene, Os04g0473025, which encodes plastoquinone oxidoreductase and plays energy sensing and response to abiotic stress in rice photosynthesis [44]. They are mainly expressed in anther, pistil and root at a high level and may play a role in reproductive and nutritional processes, and participate in abiotic stress response in root like their parental gene. Moreover, we compared the locus of rice retropseudogenes with rice regulatory RNAs (including siRNA, miRNA, lncRNA, etc.) data from published research [45], database [46] and China Rice Data Center (http://www.ricedata.cn/gene/), no rice retropseudogene generated by retrocopy had been identified as regulatory RNAs previously.

Discussion

RDVs are defined as a type of genomic sequence polymorphism related to the insertion or deletion of retrocopies in individual genomes. Because retrocopy is a single exon structure in most cases, we speculated that indels in functional retrogenes may have a greater probability of affecting gene structure and species phenotype. However, so far, there are only a limited number of studies concerning the significance and prevalence of the above phenomenon. In this research, we used bioinformatics methods to comprehensively analyze rice retrogenes and RDVs. By calculating the Ka/Ks value of rice retrogenes and their parental genes, it was found that the geometric means of functional retrogenes and retropseudogenes were both <0.6, which tended to be negatively selected. The molecular clock and orthologous analysis showed that rice retrogenes were formed explosively nearly 20 million years ago, and retropseudogenes were either new genes or ancient genes. Seventy-three RDVs affecting 35 retrogenes were detected in the rice genomes in total, mainly in retrogenes that have the functions of transmembrane transport. Most RDVs were detected at a very low frequency in the populations. The distribution differences of some RDVs had obvious characteristics among geographic regions and subpopulations. Through the genotype-phenotype association analysis of 14 RDVs in ancestral retrogenes, it was found that they could all affect at least one of the nine traits of rice. The role of retrocopy polymorphisms as markers for human population history has been clearly established, but our findings suggest that they can also provide great insight into ongoing evolutionary processes of species. RDVs seem to affect the ancestral retrogenes and the new retrogenes indiscriminately but mainly affect the expressed functional retrogenes, with only one RDV was in the retropseudogene, meaning that the functions they expressed are subject to the selection of the external environment to produce RDV. For the RDVs in the new retrogenes, their formation time was relatively short, which might be a mutation produced by the combined action of natural selection and artificial domestication. The high frequency of a small portion of rice RDVs may indicate that large-scale mutations have occurred in the rice population and have been stably inherited. On the other hand, most low-frequency RDVs may indicate that these indels were deleterious deletions and were therefore subject to negative selection pressure or newer mutations. Retrogenes with protein coding functions participate in the physiological and biochemical activities of organisms, and the insertion or deletion of RDV can change the functions of these retrogenes, which can be visually expressed in phenotypes. For example, retro_osat_34 (EXPB8), which belongs to the expansin family, involves in the process of cell wall loosening [47], [48] and has been proven to be widely involved in various developmental processes of plant, including cellular turgor, pollen tube entry into stigma, fruit ripening and softening, organ shedding, leaf formation and response to stress stimuli [49], [50]. It was considered the mutations in expansin genes among rice cultivars might generate protein variants that differ in terms of efficiency [51]. Our study also finds that three RDV in EXPB8 could affect seven identical rice phenotypes, including the culm length, culm number, ligule length, seedling height, grain length, grain width and thousand grain weight. This shows that RDV in a multifunctional retrogene may causes multiple phenotypes. We performed expression analysis based on RNA-Seq data on nine tissues and found that the expression level of retrogenes in rice is generally lower than that of the parent gene but shows a certain degree of collinearity, which is consistent with the results of a previous study [21]. We detected two highly expressed retrogenes (retro_osat_16 and retro_osat_27) in the anther, and their expression level in the anther was also exceptionally higher than that of their parental gene. Among them, retro_osat_16 (OsMST5) belongs to the monosaccharide transporter (MST) gene family and has been proven that it was associated with pollen development in rice [52], [53]. Although the function of retro_osat_27 is unreported, its parental gene Os05g0116000 encodes globulin and has been shown to play a role in seed development [54], [55], but its role in anther is unclear. And its retrogene, retro_osat_27 might has gained new function in anther for its higher expression level. This supports results from previous studies, showing that many retrogenes have broad expression patterns [56]. Numerous studies have revealed a tendency for retrogenes to be expressed in the human testis [5], [16], [25], which implies a certain connection between the male reproductive organs and retrogenes, and some mechanisms have been confirmed [33], [57], [58]. However, the set of analyzed retrogenes is relatively small and may not be fully representative. Compared with the research results of Kabza et al. [25] on RDVs in humans, which was the only species that has been specifically studied for RDV in the past, some of our research results on RDV in rice were similar to those on human RDV. First, the frequency of most RDVs detected between populations was very low. Second, some RDVs were distributed regularly among populations, especially geographically. This shows that these RDVs spread with the migration of species. Third, the mutated, ancestral, and expressed retrogenes of both humans and rice have detected retrogenes that were highly expressed in the male reproductive organs. Human retrogenes were reflected in the testis, while rice retrogenes were reflected in anthers. However, there are still some differences between rice and human RDVs. First, the research of human RDV focuses on detecting the deletions of retrogenes relative to the reference genome, but since we cannot distinguish whether the reference genome or the genome being compared has insertion or deletion mutations [59], in our work, we scanned insertions and deletions of retrogenes, which should be more comprehensive. Second, in the study of Kabza et al. [25], human RDV is defined as a fragment deletion of at least 100 bp, but this threshold was too high for the rice genome. Previous study has shown that the length of indel varies greatly, and there are also different length distributions among different species, the reason for this phenomenon may be that the directed evolution and molecular characteristics are different between human and rice genomes [60], which implies that their retrotransposon activities are different, and it may also be related to their different selection and evolution histories [61], [62]. Therefore, we chose to retain insertions or deletions of 8 bp and above as RDVs in this research. Third, our study performed an association analysis between rice RDVs and phenotypes, which has not yet been involved in human RDV research and can more intuitively reflect the impacts of RDV on gene function and species phenotype. In recent years, molecular breeding technology has gradually become the mainstream of crop breeding worldwide. Because indel molecular markers have the characteristics of high density, high accuracy, good reproducibility, and easy operation, they have been widely used in genetic analysis and molecular-assisted breeding of rice [63], [64], [65], oilseed rate [66], wheat [67] and other crops. This is an ideal solution to improve the genetic yield potential of crops. RDV is a special kind of indel because the retrogene is almost always with a single exon structure, so RDV may have a greater probability of affecting crop traits and is more suitable for screening molecular breeding markers than normal indel markers. However, further research is needed to explore the biological function mechanism of RDV. Furthermore, the relationship between retrogenes and male organs is also worthy of in-depth study, which will help to explore the mechanism of retrogene formation.

Materials and methods

Ka/Ks calculation

Clustalw v2.1 [68] was used to align sequences between retrocopies and their parental genes. All sequence data were downloaded from RetrogeneDB [30]. Next, the files were converted into axt format. KaKs_Calculator v2.0 [69] was used to calculate the Ka and Ks substitution rates and Ka/Ks ratios using the MA method and 1-standard genetic code.

Retrocopy conservation analysis

We set a phylogenetic tree of all analyzed species based on NCBI Taxonomy [70], [71], which is represented in Fig. 3B. To analyze conserved retrocopies, we used the retrocopy data of 25 species from RetrogeneDB [30]. A retrocopy originating in a given lineage ancestor was identified using OrthoVenn2 [42]. We considered a retrocopy to be ancestral for a given lineage if it was observed in any two species from this lineage. All the ancestral clusters (E-value < 1 × 10−5) were then downloaded. Orthologs of Oryza sativa japonica retrocopies were identified using Diamond [72]. To be more specific, we used an Oryza sativa japonica retrocopy to conduct comparisons with all genes of ancestral clusters. The following parameters were used: identity > 30%; e-value < 1 × 10−10; score > 200; overlap > 60%. The main branch, which contains Oryza sativa japonica, was first detected from the top node to the end node. The accessory branch was then detected.

Detecting RDVs of known retrocopies

The presently constructed set of rice retrocopies was employed to identify RDVs, and we downloaded and searched the sequence variants detected in 3,010 accessions in the 3,000 Rice Genomes Project [27], [28], [29], aiming to detect indels that overlapped with retrogene loci using BEDtools [73]. We incorporated only indels causing the insertion or deletion of at least 8 bp of retrogene sequence for further analysis. The frequency of every indel was calculated in each population. All RDVs were marked on retrogenes by GSDS 2.0 [74] (Fig. 4A).

GO enrichment analysis

PlantGSEA [75] was used for GO enrichment analysis. Using the default parameters, the clustering results with p < 0.05 and FDR < 0.05 were selected as the research objects. Since retropseudogenes are unannotated and only one RDV were detected in them, they were not analyzed.

Obtaining rice trait data

All phenotypic data of rice samples were downloaded from the 3,000 Rice Genome Project. Among them, the culm length, culm number, grain length, grain width, ligule length, panicle length, seedling height and thousand grain weight traits were downloaded from RFGB v2.0 [27]. Time to flowering (from sowing) trait data were downloaded from the Rice SNP-seek database [76]. The rice phenotype data in the 3,000 Rice Genomes Project have not been fully published, so the total number of statistical samples for each trait was <3,010.

Statistical analysis of phenotype-genotype correlation

We assumed that a certain RDV locus was compared with a reference genome, the genotype that was homozygous for deletion was DD, the genotype that was homozygous for insertion was II, and the genotype that was heterozygous was DI. The genotypes of each RDV were counted in each rice sample of the 3,000 Rice Genome Project (Table S3), and RDVs with a single genotype of less than three samples were eliminated because they were not sufficient for analysis of variance. Then, we performed a complete random analysis of variance (three-group comparison) between the genotypes of all detected RDV and the phenotypic data of each rice sample to analyze the impact of RDV on the rice phenotype. All analyses were carried out in MATLAB (version R2019a) and R (version 3.5.3).

Expression analysis

We used RNA-Seq data derived from the Rice Expression Database [26] to analyze the expression patterns of retrogenes. Expression data of 49 functional retrogenes derived entirely from NGS RNA-Seq data of Nipponbare (Oryza sativa japonica) provide information on the gene expression profiles of normal tissues. For retropseudogenes, we downloaded the original RNA-Seq data from NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) and the Nipponbare reference genome Os-Nipponbare-Reference-IRGSP-1.0 [77], then performed the same procession and calculation of the FPKM normalized value of retropseudogenes by the method of Rice Expression Database [26]. We considered a retrogene to be expressed only with a normalized expression value of at least 1 FPKM. The experimental ID and developmental stages of each rice tissue are displayed in Table S5.

Author contributions

Conceived and designed the experiments: Y.W., H.-Y.Z., W.-Z.Y. Performed the experiments: H.-Y.Z., X.-Y.C. Analyzed the data: H.-Y.Z., H.B.-L., J.Z. Wrote the paper: H.-Y.Z., Y.W., X.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (31871330), the Scientific Research Starting Foundation of Southwest University (SWU118103), Chongqing Municipal Training Program of Innovation and Entrepreneurship for Undergraduates (S201910635003) and the National Undergraduate Training Program for Innovation and Entrepreneurship (202010635111).

CRediT authorship contribution statement

Haiyue Zeng: Methodology, Visualization, Writing - original draft, Writing - review & editing. Xingyu Chen: Data curation, Writing - original draft, Validation. Hongbo Li: Validation, Data curation. Jun Zhang: Data curation, Investigation. Zhaoyuan Wei: Methodology, Investigation. Yi Wang: Conceptualization, Methodology, Supervision, Writing - review & editing, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
  75 in total

1.  Retroposed new genes out of the X in Drosophila.

Authors:  Esther Betrán; Kevin Thornton; Manyuan Long
Journal:  Genome Res       Date:  2002-12       Impact factor: 9.043

Review 2.  Expansins and cell growth.

Authors:  Yi Li; Louise Jones; Simon McQueen-Mason
Journal:  Curr Opin Plant Biol       Date:  2003-12       Impact factor: 7.834

3.  Evolutionary fate of retroposed gene copies in the human genome.

Authors:  Nicolas Vinckenbosch; Isabelle Dupanloup; Henrik Kaessmann
Journal:  Proc Natl Acad Sci U S A       Date:  2006-02-21       Impact factor: 11.205

4.  An initial map of insertion and deletion (INDEL) variation in the human genome.

Authors:  Ryan E Mills; Christopher T Luttig; Christine E Larkins; Adam Beauchamp; Circe Tsui; W Stephen Pittard; Scott E Devine
Journal:  Genome Res       Date:  2006-08-10       Impact factor: 9.043

5.  Extensive gene traffic on the mammalian X chromosome.

Authors:  J J Emerson; Henrik Kaessmann; Esther Betrán; Manyuan Long
Journal:  Science       Date:  2004-01-23       Impact factor: 47.728

6.  Lack of Globulin Synthesis during Seed Development Alters Accumulation of Seed Storage Proteins in Rice.

Authors:  Hye-Jung Lee; Yeong-Min Jo; Jong-Yeol Lee; Sun-Hyung Lim; Young-Mi Kim
Journal:  Int J Mol Sci       Date:  2015-06-30       Impact factor: 5.923

7.  Inter-population Differences in Retrogene Loss and Expression in Humans.

Authors:  Michał Kabza; Magdalena Regina Kubiak; Agnieszka Danek; Wojciech Rosikiewicz; Sebastian Deorowicz; Andrzej Polański; Izabela Makałowska
Journal:  PLoS Genet       Date:  2015-10-16       Impact factor: 5.917

8.  KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies.

Authors:  Dapeng Wang; Yubin Zhang; Zhang Zhang; Jiang Zhu; Jun Yu
Journal:  Genomics Proteomics Bioinformatics       Date:  2010-03       Impact factor: 7.691

9.  InDel Marker Based Estimation of Multi-Gene Allele Contribution and Genetic Variations for Grain Size and Weight in Rice (Oryza sativa L.).

Authors:  Sadia Gull; Zulqarnain Haider; Houwen Gu; Rana Ahsan Raza Khan; Jun Miao; Tan Wenchen; Saleem Uddin; Irshad Ahmad; Guohua Liang
Journal:  Int J Mol Sci       Date:  2019-09-28       Impact factor: 5.923

10.  OrthoVenn2: a web server for whole-genome comparison and annotation of orthologous clusters across multiple species.

Authors:  Ling Xu; Zhaobin Dong; Lu Fang; Yongjiang Luo; Zhaoyuan Wei; Hailong Guo; Guoqing Zhang; Yong Q Gu; Devin Coleman-Derr; Qingyou Xia; Yi Wang
Journal:  Nucleic Acids Res       Date:  2019-07-02       Impact factor: 16.971

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.