Philipp E Bayer, Armin Scheben, Agnieszka A Golicz, Yuxuan Yuan, Sebastien Faure, HueyTyng Lee, Harmeet Singh Chawla, Robyn Anderson, Ian Bancroft, Harsh Raman, Yong Pyo Lim, Steven Robbens, Lixi Jiang, Shengyi Liu, Michael S Barker, M Eric Schranz, Xiaowu Wang, Graham J King, J Chris Pires, Boulos Chalhoub, Rod J Snowdon, Jacqueline Batley, David Edwards.
Abstract
Plant genomes demonstrate significant presence/absence variation (PAV) within a species; however, the factors that lead to this variation have not been studied systematically in Brassica across diploids and polyploids. Here, we developed pangenomes of polyploid Brassica napus and its two diploid progenitor genomes B. rapa and B. oleracea to infer how PAV may differ between diploids and polyploids. Modelling of gene loss suggests that loss propensity is primarily associated with transposable elements in the diploids while in B. napus, gene loss propensity is associated with homoeologous recombination. We use these results to gain insights into the different causes of gene loss, both in diploids and following polyploidization, and pave the way for the application of machine learning methods to understanding the underlying biological and physical causes of gene presence/absence.
Keywords: Brassica; XGBoost; gene loss propensity; machine learning; pangenome; transposable elements
Year: 2021 PMID: 34310022 PMCID: PMC8633514 DOI: 10.1111/pbi.13674
Source DB: PubMed Journal: Plant Biotechnol J ISSN: 1467-7644 Impact factor: 9.803
Assembly statistics for the newly assembled B. napus cv. Darmor‐bzh v9 compared with v4.1 (Chalhoub et al., 2014)
| Assembly | Assembly size (Mb) | Anchored chromosome (Mb) | TEs (%) | Number of annotated genes | Completeness (BUSCO) |
|---|---|---|---|---|---|
| v4.1 (Chalhoub et al., 2014) | 850.3 | 645.4 | 46.5 | 101 040 | 99.5% |
| v9 | 1043.4 | 933.3 | 64.5 | 108 580 | 99.5% |
Pangenome additional contigs assembly statistics
| Pangenome | Assembly size (Mbp) | Assembly N50 | Predicted genes |
|---|---|---|---|
| | 121.8 | 3848 | 6715 |
| | 180.5 | 2500 | 19 767 |
| | 87.2 | 2295 | 5060 |
Figure 1. Pangenome models based on the gene number modelling method of Golicz et al. (2016) for (a) B. oleracea, (b) B. rapa, (c) B. napus (including synthetic lines) and (d) B. napus (excluding synthetic lines). The upper curves show the total pangenome size for different combinations of individuals; the lower curves show the number of core genes shared by all individuals in each combination.
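The pan and core curves described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes PAV data are available as one set of present genes per individual, and it only computes the sampled points (mean pangenome and core sizes over random orderings of individuals). It does not reproduce the curve fitting of Golicz et al. (2016); the function name `pan_core_curve` and the toy gene sets are hypothetical.

```python
import random


def pan_core_curve(pav, n_orders=100, seed=0):
    """Mean pangenome and core sizes after adding 1..N individuals.

    pav: dict mapping an individual's name to the set of gene IDs
    present in that individual (hypothetical input layout).
    """
    rng = random.Random(seed)
    names = sorted(pav)
    n = len(names)
    pan_totals = [0] * n
    core_totals = [0] * n
    for _ in range(n_orders):
        order = names[:]
        rng.shuffle(order)  # one random order of adding individuals
        pan = set()
        core = None
        for i, name in enumerate(order):
            pan |= pav[name]  # union: genes seen in any individual so far
            # intersection: genes present in every individual so far
            core = set(pav[name]) if core is None else core & pav[name]
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_means = [t / n_orders for t in pan_totals]
    core_means = [t / n_orders for t in core_totals]
    return pan_means, core_means


# Toy data, not taken from the paper.
toy = {
    "ind1": {"g1", "g2", "g3"},
    "ind2": {"g1", "g2", "g4"},
    "ind3": {"g1", "g3", "g5"},
}
pan, core = pan_core_curve(toy)
print(pan, core)
```

As individuals are added, the pangenome size can only grow and the core size can only shrink, which gives the upper and lower curves their shape.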
Shared genes between the three pangenomes based on exon‐level read alignments. For B. rapa, FPSc (Fast Plants, self‐compatible) and non‐FPSc lines are compared. For B. napus, non‐synthetic and synthetic lines are compared.
| | B. oleracea | B. rapa | B. napus |
|---|---|---|---|
| Total genes | 58 315 | 59 864 | 108 580 |
| Dispensable genes within the same species | 12 354 (21%) | With FPScs: 19 912 (33%); without FPScs: 19 735 (33%) | With synthetics: 41 614 (38%); without synthetics: 27 930 (26%) |
| Core genes within the same species | 45 961 (79%) | With FPScs: 39 952 (67%); without FPScs: 40 129 (67%) | With synthetics: 66 966 (62%); without synthetics: 80 650 (74%) |
| Present in all three species in at least one individual each | 57 717 (99%) | 57 941 (97%) | 104 465 (96%) |
| Present only in B. oleracea and B. napus | 226 (0.4%) | 0 | 648 (0.6%) |
| Present only in B. rapa and B. napus | 0 | 1198 (2%) | 2512 (2.3%) |
| Present only in B. oleracea and B. rapa | 12 (0.02%) | 16 (0.02%) | 0 |
| Present only in B. napus | 0 | 0 | 955 (0.9%) |
| Present only in B. rapa | 0 | 711 (1.1%) | 0 |
| Present only in B. oleracea | 360 (0.6%) | 0 | 0 |
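As a quick consistency check on the table above, the following sketch verifies that dispensable and core gene counts sum to the pangenome total in every comparison and recomputes the rounded percentages. The numbers are copied from the table; the dictionary layout and variable names are illustrative assumptions.

```python
# (total genes, dispensable genes, core genes), copied from the table above.
counts = {
    "B. oleracea": (58315, 12354, 45961),
    "B. rapa, with FPScs": (59864, 19912, 39952),
    "B. rapa, without FPScs": (59864, 19735, 40129),
    "B. napus, with synthetics": (108580, 41614, 66966),
    "B. napus, without synthetics": (108580, 27930, 80650),
}

percentages = {}
for label, (total, dispensable, core) in counts.items():
    # Every gene is either dispensable or core, so the two must sum to the total.
    assert dispensable + core == total, label
    percentages[label] = (
        round(100 * dispensable / total),
        round(100 * core / total),
    )

for label, (disp_pct, core_pct) in percentages.items():
    print(f"{label}: {disp_pct}% dispensable, {core_pct}% core")
```

The recomputed percentages agree with those reported in the table, including the drop in dispensable B. napus genes from 38% to 26% when synthetic lines are excluded.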
Figure 2. Genes shared across B. oleracea, B. rapa and B. napus in the three assembled pangenomes. (a) B. oleracea pangenome (58 315 genes), (b) B. rapa pangenome (59 864 genes) and (c) B. napus pangenome (108 580 genes).
Figure 3. First two principal components based on PAV data of (a) A genome genes and (b) C genome genes. The PAV matrix of all B. napus genes was split into two subsets: (a) one containing only A‐genome genes and A‐genome species (B. rapa, fast‐cycling B. rapa FPSc, B. napus) and (b) one containing only C‐genome genes and C‐genome species (B. oleracea, B. napus). PCA was carried out using logistic singular value decomposition (SVD). In both cases, 31% of variance was explained by the model.
Figure 4. Impact on model output for the prediction of gene loss propensity, measured via SHAP values, for three XGBoost models trained on PAV data from B. oleracea (a), B. rapa (b) and B. napus (c). High feature values are displayed in red, low in blue. The twenty attributes with the strongest impact on the model are displayed. Binary variables are 1/0 encoded, so genes with a 1 for the variable C01 are located on chromosome C01. In this case, a high feature value (red) combined with a high SHAP value means that the presence of a gene on this chromosome is a stronger predictor of gene dispensability. The transposable element codes follow the nomenclature of Wicker et al. (2007): DNA/DTT = CACTA, DNA/DTM = Mutator, DNA/DTH = PIF‐Harbinger.
Figure 5. SHAP values as a measure of importance in predicting dispensable genes based on the genes’ position on the chromosomes in three XGBoost models trained for B. oleracea (a), B. rapa (b) and B. napus (c). The x‐axis represents the feature ‘Position on chromosome’ in Figure 4. Each line represents one chromosome. The y‐axis displays SHAP values: the higher the value, the more that gene’s position pushes the prediction towards a dispensable gene. Negative SHAP values imply that the gene’s position pushes the prediction towards a core gene. Only in B. napus do SHAP values exceed 1, and then only at the telomeres of almost all chromosomes. In the diploids, genes located at the telomeres have negative SHAP values, i.e. their telomeres are not linked with the prediction of gene loss propensity.
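To make the sign convention of SHAP values in the captions above concrete, here is a minimal sketch that computes exact Shapley values for a tiny toy model by enumerating all feature subsets. It is not the paper's XGBoost model or the SHAP library: the scoring function, the feature names (`near_telomere`, `te_nearby`, `on_C01`) and the use of a single all-zero baseline gene in place of a background dataset are all simplifying assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical 'dispensability score'; higher means more likely dispensable."""
    return (2 * x["near_telomere"] + x["te_nearby"]
            + 2 * x["near_telomere"] * x["te_nearby"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one instance.

    Features outside a coalition are set to their baseline value
    (a simplification of how SHAP averages over background data).
    """
    features = list(instance)
    n = len(features)

    def value(subset):
        x = {f: (instance[f] if f in subset else baseline[f]) for f in features}
        return model(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = Fraction(0)
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k: k!(n-k-1)!/n!
            weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# One-hot style binary features, as in the 1/0 encoding of chromosome membership.
gene = {"near_telomere": 1, "te_nearby": 1, "on_C01": 0}
baseline = {"near_telomere": 0, "te_nearby": 0, "on_C01": 0}
phi = shapley_values(toy_model, gene, baseline)
print(phi)
```

In this toy case the positive values for `near_telomere` and `te_nearby` push the score towards "dispensable", a feature that does not change the output receives zero, and the values sum to the difference between the gene's score and the baseline score.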