Dario I Ojeda, Tiina M Mattila, Tom Ruttink, Sonja T Kujala, Katri Kärkkäinen, Jukka-Pekka Verta, Tanja Pyhäjärvi.
Abstract
Compared to angiosperms, gymnosperms lag behind in the availability of assembled and annotated genomes. Most genomic analyses in gymnosperms, especially conifer tree species, rely on the use of de novo assembled transcriptomes. However, the level of allelic redundancy and transcript fragmentation in these assembled transcriptomes, and their effect on downstream applications, have not been fully investigated. Here, we assessed three assembly strategies for short-read data, including the utility of haploid megagametophyte tissue during de novo assembly as single-allele guides, for six individuals and five different tissues in Pinus sylvestris. We then contrasted haploid and diploid tissue genotype calls obtained from the assembled transcriptomes to evaluate the extent of paralog mapping. The use of the haploid tissue during assembly increased its completeness without reducing the number of assembled transcripts. Our results suggest that current strategies that rely on available genomic resources as guidance to minimize allelic redundancy are less effective than the application of strategies that cluster redundant assembled transcripts. The strategy yielding the lowest levels of allelic redundancy among the assembled transcriptomes assessed here was the generation of SuperTranscripts with Lace followed by CD-HIT clustering. However, we still observed some levels of heterozygosity (multiple gene fragments per transcript reflecting allelic redundancy) in this assembled transcriptome on the haploid tissue, indicating that further filtering is required before using these assemblies for downstream applications. We discuss the influence of allelic redundancy when these reference transcriptomes are used to select regions for probe design of exome capture baits and for estimation of population genetic diversity.
Keywords: Haploid tissue; Pinus sylvestris; RNA-Seq; allelic redundancy; megagametophyte; paralogy; short-reads
Year: 2019 PMID: 31427456 PMCID: PMC6778806 DOI: 10.1534/g3.119.400357
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Strategies used to generate and evaluate the reference transcriptome for P. sylvestris. A) Individual assemblies, B) combined assembly of all reads per sample, and C) assembly of all megagametophyte (ME) reads per sample, retaining only transcripts > 500 bp (1), followed by assembly of all tissues per sample combined, using the ME assembly as guidance sequences during the de novo assembly (2). The Trinity and CLCbio Workbench assemblers were used in all three strategies. The secondary clustering consisted of the Orthology Guided Approach (OGA) and the construction of Lace SuperTranscripts followed by CD-HIT reduction of allelic redundancy. Assemblies marked with an asterisk (*) were assessed for levels of paralog mapping.
Proteomes used as reference for Orthology Guided Assembly (OGA) of P. sylvestris transcriptomes. Number of input proteins used per reference species and the number of orthologs identified in P. sylvestris with the OGA approach
| Species | Dataset | No. of input proteins | No. of ORFs identified | No. of reference proteins with ORFs | Average length (bp) | N50 | N50-length | No. of orthologs in P. sylvestris | Reference name |
|---|---|---|---|---|---|---|---|---|---|
|  | ALL dataset ver. 1.01 | 84,525 | 19,123 | 7,008 | 676.90 | 3,367 | 1,158 | 27,241 | OGAPitaALL1.01 |
|  | HQ dataset ver. 1.01 | 8,997 | 40,282 | 7,899 | 910.19 | 8,261 | 1,422 | 7,847 | OGAPitaHQ-1.01 |
|  | ALL dataset ver. 2.01 | 36,732 | 44,548 | 13,274 | 742.62 | 8,174 | 1,266 | 13,131 | OGAPitaALL2.01 |
|  | ALL dataset | 85,053 | 9,739 | 3,805 | 807.02 | 1,833 | 1,317 | 22,807 | OGAPilaALL |
|  | HQ dataset | 13,396 | 44,971 | 12,228 | 892.21 | 9,096 | 1,398 | 12,136 | OGAPilaHQ |
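The N50 columns in the table above can be illustrated with a minimal Python sketch. It assumes, and this is an assumption not stated in the source, that "N50-length" is the conventional N50 (the length of the shortest sequence in the smallest set of longest sequences covering at least half of the total bases) and that "N50" is the number of sequences in that set. The function name `n50_stats` and the toy lengths are hypothetical.

```python
def n50_stats(lengths):
    """Return (count, n50_length) for a list of sequence lengths.

    count      : number of longest sequences needed to reach half of the
                 total bases (assumed to match the table's "N50" column).
    n50_length : length of the shortest sequence in that set (assumed to
                 match the table's "N50-length" column).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return count, length


if __name__ == "__main__":
    # Toy ORF lengths in bp, purely illustrative.
    print(n50_stats([100, 200, 300, 400, 500]))
```

With the toy input, the two longest sequences (500 + 400 = 900 bp) already cover half of the 1,500 bp total, so the count is 2 and the N50 length is 400.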
Summary of the assessment of paralog sequence collapse (PSC) levels, based on observed heterozygosity patterns in the seven selected reference transcriptomes. NA = not analyzed. * Only including genes with > 100 callable sites. (1) Rough estimate for total sites from Grivet, assuming that one-third of coding sites are non-synonymous and two-thirds are synonymous. (2) Assuming Hardy-Weinberg equilibrium.
| Assembly | Total assembled bases | No. of callable sites* (ME) | No. of callable sites* (EM) | No. of callable sites* (NE,PH,VB) | Expected heterozygosity (HE) per bp at callable sites × 1000 (ME) | HE × 1000 (EM) | HE × 1000 (NE,PH,VB) | Observed heterozygosity (HO) per bp at callable sites × 1000 (ME) | HO × 1000 (EM) | HO × 1000 (NE,PH,VB) | Ratio of HO (ME/EM) | Ratio of HO (ME/NE,PH,VB) | Ratio HO/HE (ME) | Ratio HO/HE (EM) | Ratio HO/HE (NE,PH,VB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRINITYguided | 667,499,116 | 14,501,929 | 19,472,366 | NA | 0.85 | 0.86 | NA | 0.29 | 0.86 | NA | 0.34 | NA | 0.34 | 1.00 | NA |
| CLCguided | 233,918,293 | 19,255,392 | 26,397,168 | 48,098,347 | 0.79 | 0.74 | 0.75 | 0.22 | 0.74 | 0.78 | 0.29 | 0.28 | 0.28 | 1.00 | 1.03 |
| CLCnotguided | 207,640,702 | 23,803,859 | 28,354,915 | 42,051,752 | 0.72 | 0.72 | 0.8 | 0.23 | 0.78 | 0.86 | 0.29 | 0.26 | 0.31 | 1.08 | 1.08 |
| TRINITYLace | 399,688,510 | 27,934,641 | 33,810,952 | 28,094,261 | 0.69 | 0.71 | 0.83 | 0.25 | 0.76 | 0.91 | 0.33 | 0.28 | 0.37 | 1.08 | 1.10 |
| TRINITYCD-HIT | 379,552,297 | 91,640,833 | 101,596,814 | 47,019,030 | 0.12 | 0.14 | 0.75 | 0.03 | 0.09 | 0.85 | 0.28 | 0.03 | 0.22 | 0.67 | 1.14 |
| OGAPilaALL | 16,643,820 | 6,748,147 | 7,852,153 | 11,128,428 | 0.54 | 0.52 | 0.59 | 0.26 | 0.64 | 0.74 | 0.40 | 0.35 | 0.47 | 1.24 | 1.26 |
| OGAPitaALL | 18,721,689 | 6,889,082 | 8,303,569 | 11,995,251 | 0.58 | 0.56 | 0.63 | 0.27 | 0.69 | 0.78 | 0.39 | 0.35 | 0.47 | 1.23 | 1.25 |
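The ratio columns in the table above can be recomputed from the heterozygosity columns. The Python sketch below does this for two assemblies, using the rounded ME and EM values printed in the table. The dictionary layout and the function names `ho_he_ratio` and `ho_tissue_ratio` are hypothetical and are not part of the authors' pipeline. The rationale follows the abstract: megagametophyte (ME) tissue is haploid and carries a single allele per locus, so heterozygous calls in ME point to paralog mapping or remaining redundancy rather than true allelic variation.

```python
# Heterozygosity per bp at callable sites (x 1000), copied from the table.
# Only the ME and EM columns of two assemblies are included here.
TABLE = {
    "CLCguided": {
        "HE": {"ME": 0.79, "EM": 0.74},
        "HO": {"ME": 0.22, "EM": 0.74},
    },
    "TRINITYCD-HIT": {
        "HE": {"ME": 0.12, "EM": 0.14},
        "HO": {"ME": 0.03, "EM": 0.09},
    },
}


def ho_he_ratio(assembly, tissue):
    """Observed over expected heterozygosity for one assembly and tissue."""
    row = TABLE[assembly]
    return row["HO"][tissue] / row["HE"][tissue]


def ho_tissue_ratio(assembly, numerator="ME", denominator="EM"):
    """Observed heterozygosity in one tissue relative to another."""
    row = TABLE[assembly]
    return row["HO"][numerator] / row["HO"][denominator]


if __name__ == "__main__":
    for name in TABLE:
        print(
            name,
            f"HO/HE (ME) = {ho_he_ratio(name, 'ME'):.2f}",
            f"HO/HE (EM) = {ho_he_ratio(name, 'EM'):.2f}",
            f"HO ME/EM = {ho_tissue_ratio(name):.2f}",
        )
```

Because the inputs are already rounded to two decimals, the recomputed ratios can differ from the printed ratio columns in the last digit.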
Figure 2. Percentage of completeness on the core set of genes in BUSCO of the assemblies obtained in this study in comparison with published transcriptomes.
Figure 3. Strategies to employ the availability of haploid tissue in conifers for de novo transcriptome assembly and their application to assess the amount of allelic redundancy and paralog sequence collapse (PSC). Green boxes indicate additional steps recommended when additional genomic resources are available from a related species.