Literature DB >> 30894495

Genome of Crucihimalaya himalaica, a close relative of Arabidopsis, shows ecological adaptation to high altitude.

Ticao Zhang^1,2, Qin Qiao³, Polina Yu Novikova^4,5, Qia Wang¹, Jipei Yue¹, Yanlong Guan¹, Shengping Ming⁶, Tianmeng Liu⁶, Ji De⁶, Yixuan Liu⁶, Ihsan A Al-Shehbaz⁷, Hang Sun², Marc Van Montagu^8,5, Jinling Huang^9,10,11, Yves Van de Peer^8,5,12.

Abstract

Crucihimalaya himalaica, a close relative of Arabidopsis and Capsella, grows on the Qinghai-Tibet Plateau (QTP) about 4,000 m above sea level and represents an attractive model system for studying speciation and ecological adaptation in extreme environments. We assembled a draft genome sequence of 234.72 Mb encoding 27,019 genes and investigated its origin and adaptive evolutionary mechanisms. Phylogenomic analyses based on 4,586 single-copy genes revealed that C. himalaica is most closely related to Capsella (estimated divergence 8.8 to 12.2 Mya), whereas both species form a sister clade to Arabidopsis thaliana and Arabidopsis lyrata, from which they diverged between 12.7 and 17.2 Mya. LTR retrotransposons in C. himalaica proliferated shortly after the dramatic uplift and climatic change of the Himalayas from the Late Pliocene to Pleistocene. Compared with closely related species, C. himalaica showed significant contraction and pseudogenization in gene families associated with disease resistance and also significant expansion in gene families associated with ubiquitin-mediated proteolysis and DNA repair. We identified hundreds of genes involved in DNA repair, ubiquitin-mediated proteolysis, and reproductive processes with signs of positive selection. Gene families showing dramatic changes in size and genes showing signs of positive selection are likely candidates for C. himalaica's adaptation to intense radiation, low temperature, and pathogen-depauperate environments in the QTP. Loss of function at the S-locus, the reason for the transition to self-fertilization of C. himalaica, might have enabled its QTP occupation. Overall, the genome sequence of C. himalaica provides insights into the mechanisms of plant adaptation to extreme environments.

Entities: Chemical Disease Gene Species

Keywords: Qinghai–Tibet Plateau; S-locus; adaptive evolution; extreme environment; natural selection

Mesh：

Substances：

Year: 2019 PMID： 30894495 PMCID： PMC6452661 DOI： 10.1073/pnas.1817580116

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 11.205

The Qinghai–Tibet Plateau (QTP), also known as Himalayan Plateau, is the highest (average elevation above 4,000 m) and largest (ca. 2.5 million km2) plateau in the world. The QTP mountains reached their current elevations mainly between the late Miocene and late Pliocene, according to recent syntheses of different studies (1). The uplift of the QTP resulted in profound climatic and environmental changes in both the plateau region and Asia at large (2). Conditions on the QTP are now characterized by low temperature, low oxygen, reduced pathogen incidence, and strong UV radiation, which together provide a unique environment to study adaptive evolution (3, 4). Previous genome-wide studies on adaptive evolution on the QTP have focused mainly on humans and vertebrates (reviewed in ref. 5). These studies have revealed that genes involved in hypoxia responses, energy metabolism, and skeletal development are under positive selection and rapid evolution. In contrast, hitherto, no study of adaptive evolution based on whole-genome analysis has been conducted on wild plants (excluding the domesticated Tibetan highland barley) (6) in this region. Arabidopsis thaliana (Cruciferae/Brassicaceae) is the best-known model organism in plant biology. Relatives of A. thaliana are important for studies on evolutionary ecology and comparative genomics of plants (7–9), especially those found in specific and challenging environments. Crucihimalaya himalaica (Edgew.) Al-Shehbaz, O’Kane & R.A.Price, a self-fertilizing, diploid (2n = 16) relative of A. thaliana, was previously recognized as A. himalaica (Edgew.) O.E.Schulz, and the names Arabis rupestris Edgew. and Arabis brevicaulis Jafri are also accepted as synonyms (10). C. himalaica mainly grows on rocky hillsides, sandy slopes, alpine meadows, and screes in the Himalayas and Hengduan Mountains. Previous estimates suggested that Crucihimalaya diverged from the closely related Pachycladon ∼10.94 Mya (11), while the species C. himalaica originated about 3.56 Mya (12). Given the postulated timing of the most recent rapid uplift of the QTP during the period of late Miocene to late Pliocene (1, 13), many of the genetic changes in C. himalaica might reflect its adaptation to the extreme environment of QTP (4). In a previous study (4), we identified 487 positively selected genes (PSGs) in the transcriptome of C. himalaica. Predicted functions of these PSGs indicate that they potentially contribute to miscellaneous traits of adaptive importance, such as response to UV radiation, DNA repair, and membrane stabilization, which presumably are important for the adaptation of C. himalaica to the specialized environment on the QTP. Therefore, we believe that this species represents a model system for the study of speciation and ecological adaptation in extreme environments. In the present study, we generated de novo whole-genome sequences of C. himalaica and applied comparative and evolutionary genomics approaches to clarify the origin of this species and to investigate signals of adaptation. Our goal is to gain a deeper understanding into the mechanisms by which C. himalaica has adapted to the complex extreme conditions on the QTP at the whole-genome level.

Results and Discussion

Genome Assembly and Annotation.

The genome size of C. himalaica was estimated to be 265.23 Mb with a heterozygosity of 0.70% on the basis of k-mer statistics (14) (). In total, we generated 50.68 Gb of paired-end (PE) and mate-pair (MP) reads with different insert sizes using an Illumina HiSeq platform combined with PacBio sequencing technology (). A total assembly of 234.72Mb, consisting of 583 scaffolds (scaffold N50 length, 2.09 Mb; longest scaffold, 8.34 Mb), was achieved by combination of the ALLPATHS–LG (15) and PBJelly2 (16) assembly strategies (Table 1). To assess assembly accuracy, we remapped raw sequencing reads of a small fragment library to the assembled genome. With a 96.69% mapping rate and 224.57× average sequence depth, the reads covered 96.43% of the genome, which implies that the current assembly covered almost all unique genomic regions. About 94.05% of the assembly was covered by at least 20× reads, which guaranteed a highly accurate assembly at the single-nucleotide level ().

Table 1.

Genome assembly of C. himalaica

Genome features	Contigs	Scaffolds
Total length, bp	230,905,116	234,722,603
Total no.	3,983	583
Longest length, bp	1,756,581	8,343,586
No. of length ≥2,000	3,586	429
Length of N50, bp	136,392	2,088,603
No. of N50	406	34
Length of N90, bp	32,421	470,087
No. of N90	1,711	129
GC content, %	36.38	—
No. of genes	—	27,019

Genome assembly of C. himalaica A total of 27,019 protein-coding genes were predicted (Table 1 and ). Among these, 26,806 genes (99.21%) were functionally annotated (). In addition to protein-coding genes, various noncoding RNA sequences were identified and annotated (), including 448 microRNAs, 577 transfer RNAs, 153 ribosomal RNAs, and 974 small nuclear RNAs. Gene region completeness was evaluated by RNA sequencing data (). Of the 29,420 transcripts assembled by Trinity, 99.74% were mapped to the genome assembly, and 92.37% were complete. The completeness of gene regions was further assessed using BUSCO (Benchmarking Universal Single Copy Orthologs) (17), which showed that 96% of the plant single-copy orthologs were complete (). Compared with other close relatives, the C. himalaica genome contains a similar number of transcription factors (1,711) and transcription regulators (413) (). The C. himalaica genome showed strong homology with the genome of A. thaliana, Arabidopsis lyrata, and Capsella rubella, except in centromeres and certain high-repeat regions (Fig. 1 and ). These results implied that the assembly was of high quality, thus ensuring the reliability of subsequent comparative genomic analyses in this study.

Fig. 1.

Comparative analyses of genomic features of C. himalaica vs. A. thaliana (A) and C. himalaica vs. C. rubella (B). Tracks from inside to outside are collinearity between both genomes, number of chromosomes/scaffolds, gene density, and TE density.

Repeat Elements.

Compared with the genome size of closely related species, such as C. rubella (which has an assembled genome size of 134.8 Mb, whereas the genome size estimated from flow cytometry was 219 Mb) (18) and A. thaliana [125 Mb (19) and the estimated genome size of the reference accession Col-0 from flow cytometry is 166 Mb (20)], the C. himalaica genome is considerably larger. It has been well documented that polyploidization (whole-genome duplication, WGD) events and transposable element (TE) amplification are major causes of genome expansion (21, 22). Analyses of age distributions built from transversion substitutions at fourfold degenerate sites (4DTv) indicated that, except for the α (4DTv distance = ∼0.3) and β (4DTv distance = ∼0.6) polyploidy events shared among members of the Brassicaceae (23), C. himalaica has not undergone an additional species-specific WGD event (Fig. 2). Next, we investigated the content and evolutionary history of TEs in C. himalaica. Using de novo prediction of TEs (), we identified and marked 46.91% of the assembly as repeat regions, among which TEs occupied 45.78% of the genome assembly length (), which is higher than that reported for A. lyrata and A. thaliana (29.7% and 23.7%, respectively) (24). This was also apparent in a CIRCOS genomic comparison plot, which showed that the density of TEs in C. himalaica was higher than that in A. thaliana or C. rubella, whereas the density of genes in C. himalaica was lower than that in A. thaliana or C. rubella (Fig. 1).

Fig. 2.

Evolutionary analyses of the C. himalaica genome. (A) Age distribution of 4DTv distance values between orthologs of C. himalaica and A. thaliana, C. himalaica and A. lyrata, and C. himalaica and C. rubella. (B) Insertion time distribution of LTR retrotransposons. (C) Estimation of divergence times of nine species in the Brassicaceae. (D) Venn diagram showing unique and shared gene families between genomes of C. himalaica and four close relatives. In particular, a high proportion of TEs in C. himalaica were LTR retrotransposons (30.37%), whereas other retrotransposons (short and long interspersed nuclear elements) only constituted 3.31% collectively (). The retrotransposon proliferation might be responsible for the genome-size expansion in C. himalaica. To investigate the evolutionary dynamics of LTR retrotransposons, we estimated their insertion dates in four closely related species (). Compared with A. thaliana, which shows a large number of microdeletions in noncoding DNA and transposons (24), A. lyrata has a comparatively high proportion of recent insertions (18) (Fig. 2), possibly contributing to its larger genome size (207 Mb). The proliferation of LTR retrotransposons in C. himalaica (Fig. 2 and ) peaks at ∼2.0 Mya, shortly after the dramatic elevational and climatic changes of the QTP between the late Miocene and late Pliocene (1, 13). The activity of TEs (including LTR retrotransposons) can lead to diverse genetic changes (e.g., chromosomal rearrangements and gene duplication, creation, and disruption) that may drive lineage-specific diversification and adaptation (25, 26). We therefore hypothesize that the proliferation of LTR retrotransposons likely contributed to the diversification, speciation, and adaptation of C. himalaica, similar to the proliferation of BARE-1 retrotransposons in wild barley (27) and DIRS1 retrotransposons in Antarctic teleost (28), which might have facilitated the adaptation of plants to drier slopes and subzero temperatures, respectively. Another example from apples involves a major burst of retrotransposons ∼21.0 Mya, which coincided with the uplift of the Tianshan mountains, the postulated center of origin of apples (29).

Phylogenetic Tree Construction and Estimation of Divergence Times.

Previous studies suggest that Crucihimalaya forms an endemic genus in the Himalayas and is most closely related to Arabidopsis (30, 31). However, our earlier phylogenetic analyses of transcriptome sequences found that C. himalaica formed a clade sister to C. rubella (4). Applying OrthoMCL (32) to nine published whole-genome sequences from Brassicaceae (), we identified a total of 22,670 orthogroups. Among these orthogroups, 4,586 contained putative single-copy gene families. To verify the phylogenetic position of C. himalaica, we generated a maximum likelihood phylogenetic tree with a trimmed and concatenated protein sequence alignment from 4,586 single-copy genes in nine species. The resulting phylogeny indicated that C. himalaica was most closely related to C. rubella, and that these two species in turn formed a clade with A. lyrata and A. thaliana (Fig. 2), confirming our previous results (4). The above-mentioned four genera (which are classified in the tribe Camelineae), together with the allied Cardamine hirsuta, were often recognized as Lineage I or Clade A in previous phylogenetic studies (33, 34). C. himalaica and C. rubella were estimated to have diverged ∼10.6 (8.8 to 12.2) Mya in our analyses using MCMCtree (35) with three calibration points ( and Fig. 2); the two species diverged from Arabidopsis ∼14.6 (12.7 to 17.2) Mya and from Brassica rapa ∼23.6 (20.9 to 27.8) Mya. These results are in agreement with previous estimates (11, 36). Given the absence of genomic information for the remaining species of Crucihimalaya, we are unable to date the exact origin of C. himalaica, which was estimated previously at ∼3.56 Mya (11, 12). However, C. himalaica must have evolved after the split with Capsella and almost certainly less than 10.6 Mya. Such estimation is in accordance with the timing of the most recent rapid uplift of the QTP from late Miocene to late Pliocene (1, 13). The genus Crucihimalaya is assigned to the tribe Crucihimalayeae, which includes two additional genera, Ladakiella and Transberingia (37, 38). The tribe Crucihimalayeae is mainly distributed in Central Asia, with one species (Transberingia bursifolia) extending into North America. Therefore, we speculate that the ancestor of C. himalaica dispersed from Central Asia into the Himalayas, and speciation of C. himalaica was likely triggered by the rapid uplift of the Himalayas, which resulted in the evolution of specialized phenotypic and physiological characters as adaptations to the extreme environment (discussed further below).

Gene Family Expansion and Contraction.

Significant expansion or contraction in the size of particular gene families is often associated with the adaptive divergence of closely related species (39, 40). Comparisons of the genomes of C. himalaica and four close relatives (Fig. 2) identified a total of 151 gene families that are significantly (P < 0.01) expanded in C. himalaica and 89 gene families that are significantly contracted (). Based on Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) annotations, expanded gene families were highly enriched in ubiquitin-conjugating enzyme (E2) activity (P = 7.17E-40) and DNA repair pathways (P = 1.45E-12) (Table 2). It is notable that the most significantly contracted gene families in the C. himalaica genome were found to be functionally related to disease and immune responses, such as the Toll-like receptor and NF-κB signaling pathway (Table 2). The majority of bacteria interact with Toll-like receptors on the surface of the host cell membrane, stimulating the NF-κB signaling pathway in the immune response (41). The Toll and interleukin-1 receptor (TIR) is an N-terminal component of the nucleotide-binding site (NBS) disease resistance protein family, which includes the TIR-NBS-LRR (TNL) and CC-NBS-LRR (CNL) subfamilies. In the C. himalaica genome, both TNL and CNL subfamilies underwent severe contractions compared with their close relatives (Fig. 3). In particular, disease resistance RPP8-like proteins (belonging to CNL) have undergone the most significant contraction in the C. himalaica genome (6 vs. 16∼29 members, P = 1.26E-10). These proteins are major players in plant defense against pathogens by triggering hypersensitive responses (42). The rapid evolutionary expansion or contraction of the NBS gene family may be a fundamental strategy for plants to adapt to the rapidly changing species-specific pathogen spectrum (43, 44). As fewer microorganisms exist on the QTP owing to the harsh environments characterized by cold temperatures, aridity, and high UV radiation (45), it is reasonable to presume that the reduction in number of NBS genes in the C. himalaica genome is due to a lighter load of pathogens present in the environment and therefore the scarcity of pathogen infection on the QTP. We performed additional searches for NBS pseudogenes, which were assumed to have frameshift mutations and/or premature stop codons. Only one pseudogene could be found in A. thaliana and A. lyrata, but 20, 15, and 7 NBS pseudogenes could be identified in the C. himalaica, C. rubella, and C. hirsuta genomes, respectively (Fig. 3). Therefore, pseudogenization of NBS disease resistance genes in C. himalaica is at least partly responsible for the contraction of this gene family. Similar observations were made for the genome of the ground tit (Parus humilis), native to the QTP, where MHC genes involved in the cellular immune defense against pathogens also show significant contractions (46). Although the mechanism of MHC gene evolution in ground tit remains unclear, it is worth noting that both ground tit and C. himalaica show a convergent evolution with respect to the contraction of gene families involved in defense against pathogens.

Table 2.

Functional annotation of the most significantly expansive and contract gene families in C. himalaica

Gene families	KEGG terms	Input no.	Background no.	P value
Expanded gene	Ubiquitin-mediated proteolysis	19	137	5.59E-29
families	Mismatch repair	10	80	9.44E-12
	Homologous recombination	10	105	9.12E-11
	DNA replication	10	103	9.12E-11
	Nucleotide excision repair	10	137	9.31E-10
	RNA polymerase	4	32	1.59E-06
	Pyrimidine metabolism	4	105	9.53E-05
	Purine metabolism	4	176	5.66E-04
Contracted gene	NF-κB signaling pathway	64	93	1.96E-106
families	Toll-like receptor signaling pathway	64	106	5.32E-104
	Retinol metabolism	21	65	3.60E-29
	Metabolism of xenobiotics by P450	18	73	2.89E-23
	Tryptophan metabolism	13	40	2.89E-18
	Steroid hormone biosynthesis	13	58	1.73E-16
	Metabolic pathways	31	1,243	1.03E-12
	Arachidonic acid metabolism	10	62	1.35E-11
	Linoleic acid metabolism	8	29	4.96E-11
	Lysosome	7	123	1.27E-05
	Apoptosis	4	33	9.75E-05

Fig. 3.

Size of the NBS gene family (A) and number of pseudo-NBS genes (B) in the C. himalaica genome compared with those of related species.

Functional annotation of the most significantly expansive and contract gene families in C. himalaica Size of the NBS gene family (A) and number of pseudo-NBS genes (B) in the C. himalaica genome compared with those of related species.

Positive Selection on Single-Copy Genes.

Orthologs that show signs of positive selection usually underwent adaptive divergence (47). Previously, we observed that the ratio of nonsynonymous to synonymous substitutions (dN/dS or ω) in C. himalaica is higher than those of closely related species, suggestive of accelerated evolution in C. himalaica after divergence from its ancestral lineage (4). In the present study, we conducted a positive selection analysis using the genomic sequences of C. himalaica and four close relatives. Among the 21,383 orthogroups, 11,085 contained single-copy orthologous genes. We used the branch-site model of the PAML 4 package (35) to identify genes with signs of positive selection. As a result, 844 genes possibly under positive selection were identified in the C. himalaica genome (ω > 1, P < 0.05). Of these genes, 610 showed highly significant (P < 0.01) positive selection (). A KEGG functional classification of the 610 significant PSGs in the C. himalaica genome () showed that several categories associated with DNA repair, the ubiquitin system, as well as plant hormone biosynthesis and signal transduction were enriched. Genes involved in DNA repair were also identified as being under positive selection pressure in a previous transcriptomic study of C. himalaica (4). It is notable that significantly expanded gene families and PSGs were both enriched in DNA repair and protein ubiquitination pathways. Signal transduction-related CheY-like genes were also found to have undergone both significant positive selection and expansion events in our previous study on the adaptive evolution of the cyanobacterium Trichormus sp. NMC-1 on the QTP (3), again providing evidence that expansion/duplication and (subsequent) positive selection of genes are an important mechanism for plant adaptive evolution. The extremely intense UV radiation on the QTP may influence plant growth and developmental processes such as photoperiodism and flowering, or cause DNA, RNA, and protein damage (48). Our results showed that 10 PSGs in the C. himalaica genome are involved in the DNA repair pathway (). Notably, several genes in the nucleotide excision repair pathway showed signatures of positive selection. One of them encodes the DNA excision repair protein ERCC-1 (or UV hypersensitive 7), which was also identified previously (4) and is reported to function as a component of a structure-specific endonuclease that cleaves on the 5ʹ side of UV photoproducts in DNA (49). In addition to ERCC-1, we identified PSGs that encode the AP endonuclease 2 protein (APEX2), DNA polymerase delta subunit 2 (POLD2), DNA mismatch repair protein MLH1, and DNA repair and recombination protein RAD54 (). These PSGs are involved in base excision repair, mismatch excision repair, and homologous recombination, suggesting that C. himalaica has evolved an integrated DNA-repair mechanism to adapt to the harsh habitats caused by the uplift of the QTP. Moreover, high UV-B radiation is a common stress that both animals and plants on the QTP must cope with. The DNA repair and radiation responses pathways similarly have played crucial roles in the highland adaptation of the Tibetan highland barley (6), Tibetan antelope (50), Tibetan chicken (51), and Tibetan hot-spring snake (52). Our analyses also indicated that ubiquitin system-related gene families underwent significant expansion and natural selection. Ubiquitin-mediated proteolysis impacts almost every aspect of plant growth and development, including plant hormone signal transduction, photomorphogenesis, reproduction, and abiotic stress responses (53). Such a multifunctional biological process is of even greater importance for plants in the complex harsh environment on the QTP. Among the seven PSGs in the ubiquitin system based on the KEGG annotation, for instance, one encodes an ortholog of the COP10 protein in A. thaliana, which can enhance the activity of ubiquitin-conjugating enzymes (E2s) (54) and participates in light signal transduction and photomorphogenesis (55). An additional PSG encodes the ortholog of the auxin transport protein BIG (E3 ubiquitin-protein ligase UBR4) in A. thaliana, which not only influences multiple auxin-mediated developmental processes (e.g., lateral root production and inflorescence architecture) but also plays a critical role in a multitude of light and phytohormone pathways (56). These PSGs enriched in the ubiquitin system might be important for Crucihimalaya to better survive the harsh environment in the QTP. Similarly, in three mangrove species, the ubiquitin-mediated proteolysis pathway includes 12 genes that have experienced convergent evolution at conservative sites, presumably important for adaptation of mangroves to the harsh coastal habitat (57). Moreover, many of the identified PSGs were associated with reproduction pathways (92 PSGs, P = 5.40E-06) based on GO annotation (). One of the PSGs encodes the phytochrome and flowering time regulatory protein 1 (PFT1), which represses the PhyB-mediated light signaling and regulates the expression of FLOWERING LOCUS T (FT) and CONSTANS (CO) (58, 59). In a low-red to far-red ratio environment, plants require CO and FT to fully accelerate flowering in long days (58, 59). PFT1 acts as a hub to integrate a variety of interdependent environmental stimuli, including light quality and jasmonic acid-dependent defense responses. An additional gene encodes TERMINAL FLOWER 1 (TFL1), which controls inflorescence meristem identity and regulates flowering time in the long-day flowering pathway. As the QTP experiences a long sunshine duration and a short vegetation growing season, flowering time is particularly critical and affects both the life cycles and reproductive success of many alpine plants (60). In the wild, C. himalaica initiates flowering relatively early (April). However, this species rarely blossoms in low-altitude areas outside the QTP. These observations suggest that C. himalaica has evolved specific reproductive strategies on photoperiodism and flowering-related pathways as an adaptation to the long sunshine duration and short growing season on the QTP.

S-Locus Structure and Self-Fertilization of C. himalaica.

Self-incompatibility in Brassicaceae is controlled by the S-locus recognition system, which prevents self-fertilization (61, 62). Whereas self-fertilization often leads to decreased fitness of homozygous offspring, it also ensures reproduction in absence of pollinators or suitable mates, and therefore can be advantageous for plants to expand their distribution edges and occupy new niches (63, 64). The QTP is characterized by harsh conditions including a short growing season and low pollinator activities. Therefore, it is not surprising that C. himalaica is a selfing plant (9). The Arabidopsis/Capsella-like S-locus contains the female specificity gene (S-receptor kinase, SRK) and the male specificity gene (S-locus cysteine-rich protein, SCR) (65). Self-compatibility is achieved when the SRK receptor fails to recognize the SCR ligand, usually through a loss of SCR function (66–69). We manually annotated the S-locus flanking genes (AKR3 and U-box) and the SCR gene on scaffold 29 of the C. himalaica genome assembly. Alignment of Crucihimalaya SCR coding sequences with homologs from miscellaneous Arabidopsis halleri, A. lyrata, A. thaliana, Capsella grandiflora, C. rubella, and C. hirsuta S-locus haplotypes showed that the closest haplotype to C. himalaica was the one from A. halleri S15 () (66, 68, 70, 71). Note that, due to strong balancing selection on the S-locus, highly diverged S-alleles are shared across species and SCR gene phylogeny does not follow species phylogeny () (72, 73). SCR protein sequence alignment showed that C. himalaica appears to have lost two out of eight conserved cysteines (Fig. 4) that are essential for structural and functional integrity of SCR ligand (66, 74). This alone may explain the transition to self-fertilization in C. himalaica. Interestingly, neither SRK from the S15 haplogroup nor any other SRK had a BLAST hit to the C. himalaica genome (Fig. 4). To investigate whether SRK is indeed lost from the genome, we mapped short reads (450-bp insert size library) of C. himalaica to the combination of the C. himalaica genome assembly and A. halleri S15 haplogroup of S-locus. The SRK gene region did not have any reads mapped, and therefore we conclude that SRK is not present in the C. himalaica genome and likely has been lost. Taken together, we reason that the transition to self-compatibility of C. himalaica was accomplished by both (i) loss of function in the male recognition gene SCR and (ii) loss of the female recognition gene SRK. It is often suggested that self-fertilization rates of alpine plants increase at higher altitudes (75), due to the reduction in pollinator abundance and seed production in alpine plants (76, 77). This suggestion is consistent with the fact that self-fertilizing populations also exist in Arabis alpina, a close relative of C. himalaica that grows at high altitudes (78, 79). Moreover, previous studies also found that the rate of self-fertilizing hermaphroditic plants is increasing on the QTP and autonomous self-fertilization provides substantial reproductive assurance in the pollinator scarcity condition of the QTP (80, 81). Therefore, the transition to a self-compatibility mating system in C. himalaica likely facilitated its occupation of the QTP.

Fig. 4.

Transitioning to self-fertilization in C. himalaica. (A) Protein sequence alignment of S-locus SCR genes from C. himalaica, A. halleri, and A. lyrata. Cysteine residues are highlighted in red, showing eight conserved sites important for structural and functional integrity of the protein (66, 74). The C. himalaica SCR protein appears to have lost two out of eight conserved cysteines and therefore is probably nonfunctional. (B) Structure of the S-locus part of scaffold 29 in C. himalaica genome assembly compared with A. halleri S15 haplogroup (70). Colors of the lines between S-loci correspond to BLAST scores from highest (blue) to lowest (gray).

Conclusion

Organisms that live on the QTP face a variety of abiotic stresses under the harsh environmental conditions and presumably have been subjected to a series of adaptive evolutionary changes. In the present study, we de novo-sequenced the genome of C. himalaica, a species exclusively found in the Himalayan region. Based on phylogenetic reconstruction and estimated divergence times, we propose that the speciation of C. himalaica was triggered by the uplift of the QTP, mediated by genomic evolution to adapt to the altered environment. The proliferation of LTR retrotransposons may at least partly be responsible for the increased genome size of C. himalaica. The significant contraction and pseudogenization of NBS gene families reflect the strong reduction in pathogen incidence on the QTP. Gene families that underwent significant expansion and genes that show signs of positive selection are enriched in DNA repair and protein ubiquitination pathways, which probably reflect to the adaptation of C. himalaica to high radiation, low temperature, and pathogen-depauperate environments on the QTP. Occupation of the QTP by C. himalaica was likely facilitated by self-compatibility, which, as we have shown here, involved both male and female components of the recognition system. Both similarities (e.g., DNA repair and disease-resistance pathways) and differences (e.g., reproduction and plant hormone-related pathways) in adaptive mechanisms have been identified among plants and animals that grow at high altitudes. Although further experimental verification is needed, our results provide insights into how plants adapt to harsh and extreme environments.

Materials and Methods

Plant Material, Genome Sequencing, and Assembly.

Seedlings of C. himalaica were sampled from Batang County (altitude 4,010 m, N 30.313°, E 99.358°) of QTP. Seedlings from the same individual were cultivated in the greenhouse at Kunming Institute of Botany. High-quality genomic DNA was extracted using the Qiagen DNeasy Plant Mini Kit. The C. himalaica genome assembly was performed using sequence data obtained from a combination of sequencing technologies: Illumina PE reads, Illumina MP reads, and PacBio RS II reads (). The whole step of library construction and sequencing was performed at Novogene Bioinformatics Technology Co., Ltd. First, six PE libraries were prepared to sequence the C. himalaica genome. These included two PE libraries with insert sizes of 250 and 450 bp and four MP libraries with insert sizes of 2, 5, 10 and 15 kb. Whole genomic sequence (44.49 Gb) data were generated solely using Illumina platforms (HiSEq. 2500) and assembled using ALLPATHS -LG (15). Next, the PacBio reads (6.19 Gb) were used to fill gaps using the PBJelly2 tool (16), followed by scaffold assembly with SSPACE (82) using long-insert-size PE reads. The accuracy and completeness of the assemblies were assessed by aligning the reads from short-insert-size libraries back to the scaffolds using BWA-mem (v. 0.7.17) (83).

Gene Prediction and Annotation.

Gene prediction was performed using a combination of homology, de novo, and transcriptome-based approaches. Gene models were integrated by EvidenceModeler (evidencemodeler.github.io/). Gene models were further updated by PASA (84) to generate UTRs and provide information on alternative splicing variants. The predicted genes were analyzed for functional domains and homologs using InterProScan and BLAST against the NCBI nonredundant protein sequence database, TrEMBL and SwissProt with an E-value cutoff 1E-15 and Blast2GO with default parameters. Completeness of the genome was also assessed by performing core gene annotation using the BUSCO (17) methods. Transcription factors were identified and classified into different families using the iTAK pipeline (bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi) (85).

Repetitive Elements Identification and Dynamic Analysis.

A combined strategy based on homologous sequence alignment and de novo searches was used to identify repeat elements in the C. himalaica genome. We de novo predicted TEs using RepeatModeler (www.repeatmasker.org/RepeatModeler), RepeatScout (www.repeatmasker.org), Piler (86) (www.drive5.com/piler/), and LTR-Finder (87) (tlife.fudan.edu.cn/tlife/ltr_finder) with default parameters. For alignment of homologous sequences to identify repeats in the assembled genome, we used RepeatProteinMask and RepeatMasker (www.repeatmasker.org) with the repbase library (88). TEs overlapping with the same type of repeats were integrated, while those with low scores were removed if they overlapped over 80% of their lengths and belonged to different types. Intact LTR retrotransposons were identified by searching the genomes of C. himalaica with LTRharvest (89) (-motif tgca -motifmis 1) and LTR_Finder (LTR length 100 to 5,000 nt; length between two LTRs: 1,000 to 2,0000 nt). We combined results from both analyses after integrating the overlap sites. Candidate sequences were filtered by a two-step procedure to reduce false positives. First, LTRdigest (90) was used to identify the primer binding site (PBS) motif, and only elements containing PBS were retained; then, protein domains (pol, gag, and env) in candidate LTR retrotransposons were identified by searching against HMM profiles considered as intact. Second, clustering of candidate LTR retrotransposons was performed using the SiLiX software package (lbbe.univ-lyon1.fr/SiLiX). Using a substitution rate (r) of 7*10−9 substitutions per site per year (91, 92), the insertion date (T) can be calculated for each LTR retrotransposons (T = K/2r, K: genetic distance). Tandemly repeated gene arrays were identified using BLASTP with a threshold E value < 10−8; fragmental alignments were conjoined for each gene pair. The tandem repeat gene clusters were identified using a cutoff of sequence identity >70% and distance <150 kb. Each cluster with more than two genes was retained.

Whole-Genome Alignment and Duplication Analysis.

We aligned the C. himalaica genome to those of A. thaliana, A. lyrata, and C. rubella using LASTZ (93) with the following parameter values: M = 254 K = 4500 L = 3000 Y = 15000 -seed = match 12 -step = 20 -identity = 85. To avoid the interference caused by repetitive sequences for sequence alignment, RepeatMasker and RepBase library were used to mask repetitive sequences of the above four genomes. The raw alignments were combined into larger blocks using the ChainNet algorithm. We identified orthologous genes among the C. himalaica, A. thaliana, A. lyrata, and C. rubella genomes and paralogous genes within C. himalaica and other relatives using BLASTP (E value < 1E-7). MCscanx (94) was used to identify syntenic blocks within the genome. For each gene pair in a syntenic block, the 4DTv distance was calculated; values of all gene pairs were plotted to identify putative whole-genome duplication events and divergence in two species.

Phylogenetic Tree Construction and Estimation of Species Divergence Times.

We selected genomes of C. himalaica and eight other species (A. thaliana, A. lyrata, C. rubella, C. hirsuta, Eutrema salsugineum, B. rapa, Schrenkiella parvula, and Aethionema arabicum) to identify orthologs. To remove redundancy caused by alternative splicing variations, we retained only gene models at each gene locus that encoded the longest protein sequence. To exclude putative fragmented genes, genes encoding protein sequences shorter than 50 aa were filtered out. All filtered protein sequences of these plants were compared with each other using BLASTP (E value < 1E-7) and clustered into orthologous groups using OrthoMCL (inflation parameter, 1.5) (33). Protein sequences from 4,586 single-copy gene families were used for phylogenetic tree construction. MUSCLE (95) was used to generate multiple sequence alignment for protein sequences in each single-copy family with default parameters. The alignments of each family were concatenated to a super alignment matrix, which was then used for phylogenetic tree reconstruction through the PROTCATJTT model in RAxML software. Divergence time between nine species was estimated using MCMCtree in PAML (35) with the options “independent rates” and “GTR” model. A Markov chain Monte Carlo analysis was run for 10,000 generations, using a burn-in of 1,000 iterations. Three calibration points were applied based on a recent study using 53 plastomes of Brassicales: A. arabicum and other crucifers divergence time (29.0 to 41.8 Mya), core Brassicaceae origination time (21.3 to 29.8 Mya), and core Arabidopsis origination time (4.8 to 9.7 Mya) (36). We selected four close relatives (A. thaliana, A. lyrata, C. rubella, and C. hirsuta) of C. himalaica to identify orthogroups (gene families). Expansions and contractions of orthologous gene families were determined using CAFÉ v. 4.1 (96). The program uses a birth and death process to model gene gain and loss over a phylogeny. For each significantly expanded and contracted gene family in C. himalaica, functional information was inferred based on its ortholog in A. thaliana. GO enrichment analyses of genes were conducted using web-based agriGO (systemsbiology.cau.edu.cn/agriGOv2) (97) with the singular enrichment analysis method and TAIR10 database. The KOBAS and BlastKOALA software (98) was also used to test the statistical enrichment of genes in KEGG pathways (99).

Identification and Classification of Putative NBS Resistance Genes.

A hidden Markov model search (hmmer.janelia.org) was used to identify NBS-encoding R genes in the C. himalaica genome and four close relatives (A. thaliana, A. lyrata, C. rubella, and C. hirsuta). The sequences were screened using HMMs to search for the Pfam NBS (NB-ARC) family PF00931 domain (pfam.xfam.org/). Pfam HMM searches were performed using a TIR model (PF01582) and several LRR models (PF00560, PF07723, PF07725, PF12799, PF13306, PF13516, PF13504, PF13855, and PF08263) to detect TIR domains and LRR motifs in the NBS-encoding amino acid sequences. CC motifs were detected using the COILS prediction program 2.2 (embnet.vital-it.ch/software/COILS_form.html) with a p-score cutoff of 0.9.

Gene Family-Based Positive Selection Tests.

For positive selection analyses we also selected the four most closely related species with assembled genomes (A. thaliana, A. lyrata, C. rubella, and C. hirsuta) and C. himalaica to identify orthologs based on our phylogenetic tree. The analysis procedure is similar to the phylogenetic tree analysis. First, we used OrthoMCL (32) to identify homologous gene clusters (orthogroups) among the five genomes. OrthoMCL was run with an E-value cutoff of 1E-15 and an inflation parameter of 1.5 due to the close genetic relationship between five relatives. Orthogroups with single-copy genes shared by all five genomes were retained for further analyses. Multiple sequence alignment was performed for each orthogroup using MUSCLE v. 3.8.31 (97) with default parameters. Poorly aligned regions were further trimmed using the trimAl v. 1.4 software (100) with the parameter “-gt 0.8 -st 0.001”. Maximum likelihood trees were generated using RAxML v7.0.4 (101) with the PROTCATJTT model. To calculate the nonsynonymous (Ka) and synonymous (Ks) substitution rates between pairs of orthogroups, we reverse-translated amino acid alignments to the corresponding codon-based nucleotide alignments using PAL2NAL (102). To increase the power of tests for positive selection, we applied the branch-site model (103) implemented in codeml program (35) to estimate the dN/dS substitution rates (ω value). We also deleted all gaps (clean data = 1) from the alignments to lower the effect of ambiguous bases on the inference of positive selection. A foreground branch was specified as the clade of C. himalaica. A likelihood ratio test was conducted to determine whether positive selection is operating in the foreground branch. In this study, PSGs were inferred if the P value was less than 0.01. The functional annotation of PSGs in C. himalaica was also conducted using the same approach and the same P-value cutoff with the gene family expansion and contraction analysis. Known S-haplogroups from Arabidopsis and Capsella were used as query to search the genome assembly of C. himalaica (70, 104) using BLASTP. We manually annotated both flanking genes of the S-locus on scaffold 29: the ARK3 gene and U-box gene. However, no homologs were detected in our search for the SRK gene. Using the standard BWA-mem (v. 0.7.17) (83) and Samtools (v. 1.6) (105) pipeline, we mapped short reads (450-bp insert size library) of C. himalaica to the combined reference of C. himalaica genome and A. halleri S15 haplogroup of S-locus (70) and then manually explored the coverage of the SRK region of the S15 haplogroup using IGV (v. 2.4.13) (106). The coding sequence of the predicted SCR gene (ID = AT4G22105.1_Ath-D2 on scaffold29:1034290-1034749), together with SCR sequences from A. thaliana, C. rubella, C. grandiflora, and C. hirsuta, was added to the curated alignment of the publicly available SCR genes kindly provided by Vincent Castric (Unité Evo-Eco-Paléo, CNRS/Université de Lille 1, Villeneuve d’Ascq, France) from A. halleri and A. lyrata using MUSCLE (v. 3.8.31) (97), and a maximum likelihood tree with 500 bootstrap replicates was constructed using MEGAX (107) and visualized using Jalview2 (108). A comparative structure plot of C. himalaica and A. halleri S15 S-loci (Fig. 4) was constructed using the R library genoPlotR (109).

91 in total

Review 1. Beyond the Arabidopsis genome: opportunities for comparative genomics.

Authors: Anne E Hall; Aretha Fiebig; Daphne Preuss
Journal: Plant Physiol Date: 2002-08 Impact factor: 8.340

Review 2. Analysis of alternative splicing in plants with bioinformatics tools.

Authors: B J Haas
Journal: Curr Top Microbiol Immunol Date: 2008 Impact factor: 4.291

3. Genome evolution of wild barley (Hordeum spontaneum) by BARE-1 retrotransposon dynamics in response to sharp microclimatic divergence.

Authors: R Kalendar; J Tanskanen; S Immonen; E Nevo; A H Schulman
Journal: Proc Natl Acad Sci U S A Date: 2000-06-06 Impact factor: 11.205

4. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms.

Authors: Sudhir Kumar; Glen Stecher; Michael Li; Christina Knyaz; Koichiro Tamura
Journal: Mol Biol Evol Date: 2018-06-01 Impact factor: 16.240

5. Uncovering the dynamic evolution of nucleotide-binding site-leucine-rich repeat (NBS-LRR) genes in Brassicaceae.

Authors: Yan-Mei Zhang; Zhu-Qing Shao; Qiang Wang; Yue-Yu Hang; Jia-Yu Xue; Bin Wang; Jian-Qun Chen
Journal: J Integr Plant Biol Date: 2015-07-24 Impact factor: 7.061

6. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors:
Journal: Nature Date: 2000-12-14 Impact factor: 49.962

7. Arabidopsis COP10 forms a complex with DDB1 and DET1 in vivo and enhances the activity of ubiquitin conjugating enzymes.

Authors: Yuki Yanagawa; James A Sullivan; Setsuko Komatsu; Giuliana Gusmaroli; Genki Suzuki; Jianning Yin; Toyotaka Ishibashi; Yusuke Saijo; Vicente Rubio; Seisuke Kimura; Jian Wang; Xing Wang Deng
Journal: Genes Dev Date: 2004-09-01 Impact factor: 11.361

8. Sporophytic self-incompatibility genes and mating system variation in Arabis alpina.

Authors: A Tedder; S W Ansell; X Lao; J C Vogel; B K Mable
Journal: Ann Bot Date: 2011-08-05 Impact factor: 4.357

9. Draft genome sequence of the Tibetan antelope.

Authors: Ri-Li Ge; Qingle Cai; Yong-Yi Shen; A San; Lan Ma; Yong Zhang; Xin Yi; Yan Chen; Lingfeng Yang; Ying Huang; Rongjun He; Yuanyuan Hui; Meirong Hao; Yue Li; Bo Wang; Xiaohua Ou; Jiaohui Xu; Yongfen Zhang; Kui Wu; Chunyu Geng; Weiping Zhou; Taicheng Zhou; David M Irwin; Yingzhong Yang; Liu Ying; Haihua Bao; Jaebum Kim; Denis M Larkin; Jian Ma; Harris A Lewin; Jinchuan Xing; Roy N Platt; David A Ray; Loretta Auvil; Boris Capitanu; Xiufeng Zhang; Guojie Zhang; Robert W Murphy; Jun Wang; Ya-Ping Zhang; Jian Wang
Journal: Nat Commun Date: 2013 Impact factor: 14.919

10. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons.

Authors: Zhao Xu; Hao Wang
Journal: Nucleic Acids Res Date: 2007-05-07 Impact factor: 16.971

22 in total

1. Insights into phylogeny, age and evolution of Allium (Amaryllidaceae) based on the whole plastome sequences.

Authors: Deng-Feng Xie; Jin-Bo Tan; Yan Yu; Lin-Jian Gui; Dan-Mei Su; Song-Dong Zhou; Xing-Jin He
Journal: Ann Bot Date: 2020-06-01 Impact factor: 4.357

2. Genomic basis of parallel adaptation varies with divergence in Arabidopsis and its relatives.

Authors: Magdalena Bohutínská; Jakub Vlček; Sivan Yair; Benjamin Laenen; Veronika Konečná; Marco Fracassetti; Tanja Slotte; Filip Kolář
Journal: Proc Natl Acad Sci U S A Date: 2021-05-25 Impact factor: 11.205

3. Population transcriptomic sequencing reveals allopatric divergence and local adaptation in Pseudotaxus chienii (Taxaceae).

Authors: Li Liu; Zhen Wang; Yingjuan Su; Ting Wang
Journal: BMC Genomics Date: 2021-05-26 Impact factor: 3.969

4. Chromosome-level reference genome of the soursop (Annona muricata): A new resource for Magnoliid research and tropical pomology.

Authors: Joeri S Strijk; Damien D Hinsinger; Mareike M Roeder; Lars W Chatrou; Thomas L P Couvreur; Roy H J Erkens; Hervé Sauquet; Michael D Pirie; Daniel C Thomas; Kunfang Cao
Journal: Mol Ecol Resour Date: 2021-03-10 Impact factor: 7.090

5. Plastome Evolution in Dolomiaea (Asteraceae, Cardueae) Using Phylogenomic and Comparative Analyses.

Authors: Jun Shen; Xu Zhang; Jacob B Landis; Huajie Zhang; Tao Deng; Hang Sun; Hengchang Wang
Journal: Front Plant Sci Date: 2020-04-15 Impact factor: 5.753

6. Transcriptomic Evidence of Adaptive Evolution of the Epiphytic Fern Asplenium nidus.

Authors: Jiao Zhang; Li Liu; Jiang-Ping Shu; Dong-Mei Jin; Hui Shen; Hong-Feng Chen; Rui Zhang; Yue-Hong Yan
Journal: Int J Genomics Date: 2019-12-01 Impact factor: 2.326

7. A Combined Morphological and Molecular Evolutionary Analysis of Karst-Environment Adaptation for the Genus Urophysa (Ranunculaceae).

Authors: Deng-Feng Xie; Rui-Yu Cheng; Xiao Fu; Xiang-Yi Zhang; Megan Price; Yan-Ling Lan; Chang-Bao Wang; Xing-Jin He
Journal: Front Plant Sci Date: 2021-06-10 Impact factor: 5.753

8. De novo sequencing of the transcriptome reveals regulators of the floral transition in Fargesia macclureana (Poaceae).

Authors: Ying Li; Chunxia Zhang; Kebin Yang; Jingjing Shi; Yulong Ding; Zhimin Gao
Journal: BMC Genomics Date: 2019-12-30 Impact factor: 3.969

9. Uncovering the role of a positive selection site of wax ester synthase/diacylglycerol acyltransferase in two closely related Stipa species in wax ester synthesis under drought stress.

Authors: Yunqiang Yang; Zhili Zhou; Yan Li; Yanqiu Lv; Danni Yang; Shihai Yang; Jianshuang Wu; Xiong Li; Zhijia Gu; Xudong Sun; Yongping Yang
Journal: J Exp Bot Date: 2020-07-06 Impact factor: 6.992

10. Genomic analysis of field pennycress (Thlaspi arvense) provides insights into mechanisms of adaptation to high elevation.

Authors: Yupeng Geng; Yabin Guan; Shugang Lu; Miao An; M James C Crabbe; Ji Qi; Fangqing Zhao; Qin Qiao; Ticao Zhang
Journal: BMC Biol Date: 2021-07-22 Impact factor: 7.431