Bruno Contreras-Moreira, Carlos P Cantalapiedra, María J García-Pereira, Sean P Gordon, John P Vogel, Ernesto Igartua, Ana M Casas, Pablo Vinuesa.
Abstract
The pan-genome of a species is defined as the union of all the genes and non-coding sequences found in all its individuals. However, constructing a pan-genome for plants with large genomes is daunting both in sequencing cost and the scale of the required computational analysis. A more affordable alternative is to focus on the genic repertoire by using transcriptomic data. Here, the software GET_HOMOLOGUES-EST was benchmarked with genomic and RNA-seq data of 19 Arabidopsis thaliana ecotypes and then applied to the analysis of transcripts from 16 Hordeum vulgare genotypes. The goal was to sample their pan-genomes and classify sequences as core, if detected in all accessions, or accessory, when absent in some of them. The resulting sequence clusters were used to simulate pan-genome growth, and to compile Average Nucleotide Identity matrices that summarize intra-species variation. Although transcripts were found to under-estimate pan-genome size by at least 10%, we concluded that clusters of expressed sequences can recapitulate phylogeny and reproduce two properties observed in A. thaliana gene models: accessory loci show lower expression and higher non-synonymous substitution rates than core genes. Finally, accessory sequences were observed to preferentially encode transposon components in both species, plus disease resistance genes in cultivated barleys, and a variety of protein domains from other families that appear frequently associated with presence/absence variation in the literature. These results demonstrate that pan-genome analyses are useful to explore germplasm diversity.
Keywords: Arabidopsis thaliana; RNA-seq; accessory genome; barley; comparative genomics; core-genome; pan-genome
Year: 2017 PMID: 28261241 PMCID: PMC5306281 DOI: 10.3389/fpls.2017.00184
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
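The core/accessory distinction described in the abstract can be illustrated with a minimal Python sketch. It is not the GET_HOMOLOGUES-EST implementation: the function names `classify_clusters` and `filter_by_occupancy`, the cluster identifiers, and the toy data are all hypothetical, and a cluster is simply assumed to be a list of the accessions that contributed sequences to it.

```python
def classify_clusters(clusters, n_accessions):
    """Label a cluster 'core' if every accession is present, else 'accessory'.

    `clusters` maps a (hypothetical) cluster id to the list of accessions
    that contributed at least one sequence to that cluster.
    """
    labels = {}
    for cluster_id, accessions in clusters.items():
        occupancy = len(set(accessions))  # count each accession only once
        labels[cluster_id] = "core" if occupancy == n_accessions else "accessory"
    return labels


def filter_by_occupancy(clusters, min_occupancy):
    """Keep only clusters found in at least `min_occupancy` distinct accessions."""
    return {
        cluster_id: accessions
        for cluster_id, accessions in clusters.items()
        if len(set(accessions)) >= min_occupancy
    }


# Toy input: three accessions, three made-up clusters.
clusters = {
    "cluster_a": ["Bur_0", "Ct_1", "Edi_0"],  # present in all three accessions
    "cluster_b": ["Bur_0", "Edi_0"],          # missing from Ct_1
    "cluster_c": ["Ct_1", "Ct_1"],            # two sequences, one accession
}

labels = classify_clusters(clusters, n_accessions=3)
kept = filter_by_occupancy(clusters, min_occupancy=2)
print(labels)
print(sorted(kept))
```

Counting distinct accessions rather than sequences keeps clusters with several sequences from a single accession from being mistaken for widely shared ones.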
Arabidopsis thaliana CDS and transcripts analyzed in this work, with median length and N50 values.
| Ecotype | WGS CDS | Median length | cDNA | | Median length | N50 | Raw reads | Clean reads | Assembly reads |
|---|---|---|---|---|---|---|---|---|---|
| Can_0 | 39,739 | 984 | (Unavailable RNA-seq reads) | | | | | | |
| Col_0 | 40,553 | 1,008 | | | | | | | |
| Bur_0 | 39,941 | 990 | 26,469 | 67,259 | 614 | 1,349∗ | 89.1M | 87.3M | 21.5M |
| Ct_1 | 39,975 | 993 | 26,121 | 66,425 | 581 | 1,260 | 85.1M | 83.4M | 18.4M |
| Edi_0 | 39,971 | 990 | 26,383 | 69,374 | 577 | 1,246 | 80.0M | 78.2M | 18.2M |
| Hi_0 | 40,056 | 990 | 25,986 | 71,934 | 547 | 1,165 | 80.9M | 79.3M | 19.5M |
| Kn_0 | 39,915 | 987 | 25,832 | 75,550 | 529 | 1,114 | 82.9M | 81.0M | 19.3M |
| Ler_0 | 40,027 | 987 | 26,405 | 72,858 | 555 | 1,252 | 88.0M | 85.5M | 19.3M |
| Mt_0 | 39,914 | 990 | 25,933 | 74,723 | 554 | 1,182 | 80.1M | 78.3M | 18.2M |
| No_0 | 39,847 | 987 | 26,127 | 71,987 | 564 | 1,188 | 90.9M | 89.4M | 20.3M |
| Oy_0 | 39,875 | 990 | 26,475 | 72,095 | 552 | 1,239 | 85.6M | 83.7M | 19.5M |
| Po_0 | 40,028 | 993 | 26,564 | 67,404 | 586 | 1,219 | 87.1M | 85.5M | 19.3M |
| Rsch_4 | 39,847 | 990 | 26,188 | 79,719 | 505 | 1,175 | 84.0M | 82.5M | 20.3M |
| Sf_2 | 39,797 | 987 | 26,138 | 71,544 | 550 | 1,159 | 77.9M | 76.7M | 17.9M |
| Tsu_0 | 39,902 | 987 | 26,062 | 71,100 | 563 | 1,185 | 79.1M | 77.9M | 17.7M |
| Wil_2 | 39,807 | 987 | 25,888 | 62,552 | 580 | 1,223 | 67.8M | 66.9M | 16.8M |
| Ws_0 | 39,784 | 987 | 26,270 | 66,243 | 610 | 1,349 | 83.1M | 82.0M | 19.1M |
| Wu_0 | 39,934 | 990 | 26,237 | 66,214 | 586 | 1,253 | 80.0M | 78.9M | 18.1M |
| Zu_0 | 40,003 | 984 | 26,259 | 65,652 | 603 | 1,300 | 77.6M | 76.4M | 18.8M |
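For readers unfamiliar with the N50 column, the sketch below computes N50 under its usual definition (the length L such that sequences of length L or longer contain at least half of all assembled bases) and contrasts it with the median. The function name `n50` and the toy lengths are illustrative assumptions; the record does not say which tool produced the values in the table.

```python
from statistics import median


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; N50 is the length of
    the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("n50() requires at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy transcript lengths in bp (made up for illustration).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))     # 5
print(median(toy_lengths))  # 4
```

Because N50 weights each sequence by its length, it is normally larger than the median in assemblies with many short transcripts, which matches the pattern in the table.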
Clusters of de novo assembled transcripts of 17 A. thaliana ecotypes are compared to pan-genome clusters of genome-based annotated CDS.
| Minimum occupancy | | Length | % Genomic matches | WGS CDS matches | Clusters/CDS | Recall | Precision | Pan-size |
|---|---|---|---|---|---|---|---|---|
| 1 | 115,278 | 406 | 87.6 | 24,695 | 2.85 | 0.72 | 0.50 | 54,498 |
| 2 | 50,252 | 600 | 96.3 | 23,571 | 1.42 | 0.69 | 0.78 | 38,920 |
| 3 | 41,691 | 670 | 97.3 | 23,088 | 1.23 | 0.68 | 0.85 | 34,974 |
| 4 | 37,087 | 721 | 97.8 | 22,678 | 1.13 | 0.66 | 0.89 | 32,543 |
| 5 | 34,133 | 759 | 98.1 | 22,370 | 1.06 | 0.65 | 0.92 | 30,793 |
| 6 | 31,863 | 793 | 98.4 | 22,048 | 1.00 | 0.64 | 0.94 | 29,338 |
| TAIR10 | 29,066 | 1,588 | 31,525 | 0.90 | 0.84 | 0.97 |
Barley transcriptomes analyzed in this work.
| Accession | Assembled transcripts | Median length | N50 | Tissue/Reference | Sequence reads (SRA/ENA) |
|---|---|---|---|---|---|
| Alexis | 54,493 | 522 | 1,552 | Fully expanded leaf ( | SAMN02483509 |
| AmagiNijo | 50,782 | 498 | 1,435 | | SAMN02483508 |
| Beiqing5 | 51,855 | 503 | 1,466 | | SAMN02483504 |
| Esterel | 51,731 | 514 | 1,520 | | SAMN02483510 |
| Franka | 52,913 | 507 | 1,503 | | SAMN02483511 |
| Himala2 | 45,935 | 477 | 1,355 | | SAMN02483505 |
| ECI-2-0 (Hs) | 57,440 | 536 | 1,608 | | SAMN02483497 |
| Turkey-19-24 (Hs) | 65,005 | 507 | 1,542 | | SAMN02483500 |
| XZ2 (Hs) | 56,813 | 533 | 1,529 | | SAMN02483491 |
| Padanggamu | 50,254 | 493 | 1,379 | | SAMN02483503 |
| TX9425 | 46,965 | 470 | 1,324 | | SAMN02483507 |
| Yiwuerleng | 48,508 | 472 | 1,247 | | SAMN02483506 |
| SBCC073 | 76,362 | 513 | 1,416 | Fully expanded leaf (this work) | PRJEB12540 |
| Scarlett | 84,826 | 574 | 1,569 | | |
| Haruna Nijo (transcripts) | 51,249∗ | 1,426 | 1,951 | Seedling, root, leaf, shoot, spike ( | |
| Morex (HC and LC cDNAs) | 131,692∗ | 1,101 | 1,941 | Embryo, leaf, root, flower, internode, caryopsis ( | |
Accessory transcripts (barley) and genes (A. thaliana) not found in reference genomes (Morex ∪ Haruna Nijo and Col_0, respectively).
| Donor genotype | Total CDS clusters | Annotated in references | Novel clusters (sequences) | <length> (bp) |
|---|---|---|---|---|
| SBCC073 | 20,932 | 16,584 | 4,348 (4,818) | 354 |
| Scarlett | 21,956 | 17,558 | 4,398 (4,831) | 370 |
| Wild ecotypes (ECI-2-0, XZ2, Turkey-19-24) | 14,344 | 13,324 | 1,020 (3,595) | 498 |
| Bur_0 | 30,800 | 29,765 | 1,035 (1,456) | 853 |
| Can_0 | 30,698 | 29,433 | 1,265 (1,743) | 826 |
| German ecotypes (No_0, Po_0, Wu_0, Zu_0) | 28,632 | 28,431 | 202 (526) | 1,236 |