Marko Petek, Maja Zagorščak, Živa Ramšak, Sheri Sanders, Špela Tomaž, Elizabeth Tseng, Mohamed Zouine, Anna Coll, Kristina Gruden.
Abstract
Although the reference genome of the Solanum tuberosum Group Phureja double-monoploid (DM) clone is available, the genetic diversity of the highly heterozygous tetraploid Group Tuberosum, which represents most cultivated varieties, remains largely unexplored. This lack of knowledge hinders further progress in potato research. In this study, we first merged and manually curated the two existing, partially overlapping DM genome-based gene models, creating a union of genes on the Phureja scaffold. Next, we compiled available and newly generated RNA-Seq datasets (ca. 1.5 billion reads) for three tetraploid potato genotypes (cultivar Désirée, cultivar Rywal, and breeding clone PW363) with diverse breeding pedigrees. Short-read transcriptomes were assembled using several de novo assemblers under different settings to identify the optimal outcome. For cultivar Rywal, PacBio Iso-Seq full-length transcriptome sequencing was also performed. The EvidentialGene redundancy-reducing pipeline, complemented with in-house scripts, was employed to produce accurate and complete cultivar-specific transcriptomes, as well as to attain the pan-transcriptome. The generated transcriptomes and pan-transcriptome represent a valuable resource for potato gene variability exploration, high-throughput omics analyses, and breeding programmes.
Year: 2020 PMID: 32709858 PMCID: PMC7382494 DOI: 10.1038/s41597-020-00581-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Merging DM Phureja PGSC and ITAG gene models. (a) Decision tree for merging the two sets of gene models, with 6 possible outcomes: singleton genes (‘keep gene as is’), manual curation (‘gene cluster to manual curation’) and programmatic solution (the remaining 4 options). Solid green lines represent a satisfied condition (Y: Yes); dashed red lines, an unsatisfied condition (N: No). (b) Example of manual curation in the merged DM genome GTF: visualisation of the region chr12:11405699..11418575 in the Spud DB (solanaceae.plantbiology.msu.edu) Genome Browser[13]. The ITAG-defined Sotub12g014200.1.1 spans three PGSC-defined coding sequences (PGSC0003DMT400005728, PGSC0003DMT400005745 and PGSC0003DMT400005726). The RNA-Seq tracks below the gene models show how these genes are expressed in various plant organs. In this case, Sotub12g014200.1.1 was preferred because the RNA-Seq evidence was in concordance with it and there was no evidence for PGSC0003DMT400005745.
Fig. 2 Bioinformatics pipeline for generation of potato transcriptomes. Software used in specific steps is given in bold. Input datasets (sequence reads) and output data (transcriptomes) are depicted as blue cylinders. Data upload steps to public repositories are shaded in orange. Abbreviations: SRA – NCBI Sequence Read Archive, PGSC – Potato Genome Sequencing Consortium, ITAG – International Tomato Annotation Group, CLC – CLC Genomics Workbench, PacBio – Pacific Biosciences Iso-Seq sequencing, Tr – transcriptome, StPanTr – potato pan-transcriptome, tr2aacds – “transcript to amino acid coding sequence” Perl script from EvidentialGene pipeline.
Parameters used for short read de novo assembly generation.
| Genotype | Assembly ID | Read type | Assembler | Assembler version | k-mer length (word size) | Bubble size |
|---|---|---|---|---|---|---|
| Désirée | CLCdnDe8 | SOLiD | CLC | 9.1 | 24 | 50 |
| Désirée | CLCdnDe1 | SOLiD | CLC - transcript discovery as reference | 10.0.1 | 24 | 50 |
| Désirée | VdnDe8, …, VdnDe10 | SOLiD | Velvet/Oases | 1.2.10 | 23, 33, 43 | Default |
| Désirée | CLCdnDe9, …, CLCdnDe14 | Illumina | CLC | 9.1 | 21, 23, 33, 43, 53, 63 | 85 |
| Désirée | CLCdnDe2, …, CLCdnDe7 | Illumina | CLC - transcript discovery as reference | 10.0.1 | 21, 23, 33, 43, 53, 63 | 85 |
| Désirée | TDe | Illumina | Trinity | r2013-02-25 | 25 | NA |
| Désirée | VdnDe1, …, VdnDe7 | Illumina | Velvet/Oases | 1.2.10 | 23, 33, 43, 53, 63, 73, 83 | Default |
| PW363 | CLCdnPW1 | SOLiD | CLC | 8.5.4 | 24 | 50 |
| PW363 | CLCdnPW2 | SOLiD | CLC - transcript discovery as reference | 9.1 | 24 | 50 |
| PW363 | VdnPW8, …, VdnPW10 | SOLiD | Velvet/Oases | 1.2.10 | 23, 33, 43 | Default |
| PW363 | CLCdnPW3, …, CLCdnPW44 | Illumina | CLC | 8.5.4 | 21, 23, 24, 25, 30, 33, 35, 40, 43, 45, 50, 53, 55, 63 | 50, 65, 85 |
| PW363 | CLCdnPW45, …, CLCdnPW50 | Illumina | CLC - transcript discovery as reference | 10.0.1 | 21, 23, 33, 43, 53, 63 | 85 |
| PW363 | SdnPW1 | Illumina | rnaSPAdes | 3.11.1 | 43 | Default |
| PW363 | TPW | Illumina | Trinity | r2013-02-25 | 25 | NA |
| PW363 | VdnPW1, …, VdnPW7 | Illumina | Velvet/Oases | 1.2.10 | 23, 33, 43, 53, 63, 73, 83 | Default |
| Rywal | PBdnRY1 | PacBio Isoseq | Iso-Seq 3, Cupcake ToFU | 2017 | NAp | NAp |
| Rywal | CLCdnRY1, …, CLCdnRY6 | Illumina | CLC | 9.1 | 21, 23, 33, 43, 53, 63 | 85 |
| Rywal | CLCdnRY7, …, CLCdnRY12 | Illumina | CLC - transcript discovery as reference | 10.1.1 | 21, 23, 33, 43, 53, 63 | 85 |
| Rywal | SdnRY1 | Illumina | rnaSPAdes | 3.11.1 | 43 | Default |
| Rywal | VdnRY1, …, VdnRY7 | Illumina | Velvet/Oases | 1.2.10 | 23, 33, 43, 53, 63, 73, 83 | Default |
NAp – not applicable.
NA – not available.
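The PW363 Illumina row for CLC 8.5.4 lists 14 k-mer lengths and 3 bubble sizes, and its assembly IDs run from CLCdnPW3 to CLCdnPW44. That is 42 assemblies, which is consistent with a full 14 × 3 parameter grid. The Python sketch below enumerates such a grid. The order of the combinations and the pairing of each ID with a particular (k-mer, bubble size) combination are assumptions made for illustration only; the table gives just the ID range and the two parameter lists.

```python
from itertools import product

# Values copied from the PW363 / Illumina / CLC 8.5.4 row of the table.
KMER_LENGTHS = [21, 23, 24, 25, 30, 33, 35, 40, 43, 45, 50, 53, 55, 63]
BUBBLE_SIZES = [50, 65, 85]
FIRST_ID = 3  # CLCdnPW3 is the first assembly ID in this row


def parameter_grid(kmers, bubbles, first_id, prefix="CLCdnPW"):
    """Enumerate every (k-mer, bubble size) combination with a running ID.

    The ID-to-combination pairing is hypothetical: it is not stated
    in the source table.
    """
    grid = {}
    for offset, (kmer, bubble) in enumerate(product(kmers, bubbles)):
        grid[f"{prefix}{first_id + offset}"] = {"kmer": kmer, "bubble": bubble}
    return grid


grid = parameter_grid(KMER_LENGTHS, BUBBLE_SIZES, FIRST_ID)
print(len(grid))       # number of assemblies in the sweep
print(list(grid)[0])   # first assembly ID
print(list(grid)[-1])  # last assembly ID
```

The same helper can be reused for the other multi-parameter rows by swapping in their k-mer lists, bubble sizes, ID prefix and first ID.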
Fig. 3 Structure of the potato pan-transcriptome. Stacked bar plot showing the overlap of paralogue groups in cultivar-specific transcriptomes and merged Phureja DM gene model. Only representative and alternative transcripts of the pan-transcriptome are counted (i.e. cultivar representative sequences) while disregarding additional cultivar alternative transcripts. For Phureja DM, the merged ITAG and PGSC DM gene models were counted. DM and at least one Group Tuberosum: sequences shared by Phureja DM and at least one tetraploid genotype, core: sequences shared among all genotypes in the pan-transcriptome.
Samples used to generate the de novo transcriptome assemblies.
| Genotype | Sample description (a) | Sequencing platform | Library type (b) | Number of reads (c) | SRA ID |
|---|---|---|---|---|---|
| Désirée | PVY inoculated leaves | Illumina | DSN-normalized PE90 unstranded | ~54 mio | SRR10070125 |
| Désirée | non-transformed and PVY-inoculated plants, non-infested and CPB infested leaves | Illumina | PE90 unstranded | ~195 mio | SRR1207287, …, SRR1207290 |
| Désirée | mock and PVY inoculated leaves and stem | SOLiD | SE50 unstranded | ~154 mio | SRR10065428, SRR10065429 |
| Désirée | leaves | Illumina | SE50 unstranded | ~172 mio | SRR3161991, SRR3161995, SRR3161999, SRR3162003, SRR3162007, SRR3162011, SRR3162015, SRR3162019, SRR3162023, SRR3162027, SRR3162031, SRR3162035 |
| Désirée | seedlings | Illumina | SE100 unstranded | ~80 mio | SRR4125238, …, SRR4125247 |
| Désirée | roots | Illumina | SE100 unstranded | ~31 mio | SRR4125248, …, SRR4125252 |
| Désirée | mock and | Illumina | PE90 unstranded | ~53 mio | ERR305632 |
| Rywal | mock and PVY inoculated leaves | PacBio | Iso-Seq, 0.7–2 Kb, 2–3.5 Kb, >3.5 Kb CCS | ~1.4 mio | SRR8281993, …, SRR8282008 |
| Rywal | mock and PVY inoculated leaves | Illumina | PE100 strand-specific | ~710 mio | SRX6801457, …, SRX6801468 |
| PW363 | PVY inoculated leaves | Illumina | DSN-normalized PE90 unstranded | ~104 mio | SRR10070123, SRR10070124 |
| PW363 | mock and PVY inoculated leaves | SOLiD | SE50 unstranded | ~180 mio | SRR10065430, …, SRR10065433 |
(a) PVY, Potato virus Y; CPB, Colorado potato beetle.
(b) PE, paired-end library (the number stands for read length in nt); SE, single-end library (the number stands for read length in nt); DSN-normalized, RNA-Seq library utilizing the crab duplex nuclease; CCS, circular consensus sequences.
(c) For paired-end libraries, pairs are counted as two reads.
Transcriptome quality control by RNA-seq reads remapping.
| Mapping statistics/genotype | Désirée’ | PW363’ | Rywal’ |
|---|---|---|---|
| Number of input reads | 177,149,132 | 52,171,015 | 342,767,035 |
| Average input read length | 178 | 179 | 199 |
| Uniquely mapped reads number | 64,507,790 | 18,416,487 | 206,003,021 |
| Average mapped length | 175 | 176 | 196 |
| Number of splices*: Total | 496,170 | 267,268 | 1,700,235 |
| Number of splices*: Annotated (sjdb) | 0 | 0 | 0 |
| Number of splices*: GT/AG | 258,208 | 105,551 | 1,162,885 |
| Number of splices*: GC/AG | 10,749 | 5,693 | 79,495 |
| Number of splices*: AT/AC | 1,486 | 2,192 | 1,840 |
| Number of splices*: Non-canonical | 225,727 | 153,832 | 456,015 |
| Mismatch rate per base % | 0.50% | 0.53% | 0.59% |
| Deletion rate per base | 0.03% | 0.03% | 0.03% |
| Deletion average length | 2.72 | 2.53 | 3.02 |
| Insertion rate per base | 0.02% | 0.02% | 0.03% |
| Insertion average length | 1.93 | 1.86 | 1.91 |
| Number of reads mapped to multiple loci | 98,694,222 | 29,366,122 | 108,669,657 |
| Number of reads mapped to too many loci | 4,652,918 | 1,555,704 | 1,541,238 |
| % of reads mapped to too many loci | 2.63% | 2.98% | 0.45% |
| % of reads unmapped: too many mismatches | 0% | 0% | 0% |
| % of reads unmapped: too short | 5.25% | 5.43% | 7.75% |
| % of reads unmapped: other | 0% | 0% | 0% |
| Number of chimeric reads | 0 | 0 | 0 |
| % of chimeric reads | 0% | 0% | 0% |
Illumina paired-end reads used for generating the assemblies were mapped back to the corresponding cultivar-specific transcriptomes using STAR.
*Number of reads crossing supposed splice sites.
’Initially constructed transcriptomes (prior to filtering steps).
#Relevant % of mapped reads: % of uniquely mapped reads + % of reads mapped to multiple loci.
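The "relevant % of mapped reads" defined in the last footnote can be recomputed from the read counts in the table. The Python sketch below does this for the three initial transcriptomes; the function and variable names are illustrative, and the counts are copied from the table rows "Number of input reads", "Uniquely mapped reads number" and "Number of reads mapped to multiple loci".

```python
# Read counts copied from the remapping table (initial transcriptomes).
STATS = {
    "Désirée": {"input_reads": 177_149_132, "unique": 64_507_790, "multi": 98_694_222},
    "PW363": {"input_reads": 52_171_015, "unique": 18_416_487, "multi": 29_366_122},
    "Rywal": {"input_reads": 342_767_035, "unique": 206_003_021, "multi": 108_669_657},
}


def relevant_mapped_percent(input_reads, unique, multi):
    """% of uniquely mapped reads + % of reads mapped to multiple loci."""
    return 100.0 * (unique + multi) / input_reads


mapped = {genotype: relevant_mapped_percent(**counts)
          for genotype, counts in STATS.items()}

for genotype, percent in mapped.items():
    print(f"{genotype}: {percent:.2f}% of input reads mapped")
```

Together with the tabulated percentages for "too many loci" and "too short", these values account for essentially all input reads in each genotype.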
Prior and post-filtering transcriptome summary statistics for potato cultivar-specific coding sequences generated by TransRate.
| TransRate metrics | Désirée, pre-filter (initial) | Désirée, post-filter | PW363, pre-filter (initial) | PW363, post-filter | Rywal, pre-filter (initial) | Rywal, post-filter |
|---|---|---|---|---|---|---|
| No. sequences | 350,271 | 197,839 | 273,216 | 159,278 | 134,755 | 79,095 |
| Sequence mean length | 504 | 792 | 516 | 775 | 459 | 707 |
| No. sequences under 200 nt | 125,465 | 25,330 | 88,230 | 17,370 | 52,653 | 13,198 |
| No. sequences over 1000 nt | 57,679 | 55,837 | 44,508 | 42,571 | 19,175 | 18,748 |
| No. sequences over 10000 nt | 23 | 23 | 3 | 3 | 1 | 1 |
| ’n90 | 369 | 444 | 366 | 429 | 351 | 390 |
| ’n50 | 1,194 | 1,209 | 1,110 | 1,131 | 1,227 | 1,218 |
| GC % | 41% | 42% | 42% | 42% | 42% | 42% |
| Ambiguous nucleotide (N) % | 0% | 0% | 0% | 0% | 0% | 0% |
| No. seq. with CRBB hits* | 160,295 | 138,131 | 138,443 | 116,834 | 66,258 | 55,239 |
| No. reference seq. with CRBB hits* | 29,858 | 27,642 | 25,739 | 23,839 | 23,549 | 22,163 |
| coverage50#* | 25,991 | 24,586 | 21,875 | 20,620 | 20,258 | 19,538 |
| coverage95#* | 19,329 | 18,246 | 15,664 | 14,727 | 14,967 | 14,470 |
| Reference coverage* | 65% | 63% | 56% | 54% | 53% | 52% |
’The largest contig size at which at least 90% or 50% of bases are contained in contigs at least this length.
*Reference-based summary statistics (merged Phureja DM coding sequences were used as reference).
#Number of reference proteins with at least N% of their bases covered by a Conditional Reciprocal Best Blast (CRBB) hit.
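The n90 and n50 footnote can be made concrete with a few lines of Python. This is a minimal sketch of the definition as worded in the footnote, not TransRate's own implementation; the function name `nx` and the toy sequence lengths are hypothetical.

```python
def nx(lengths, x):
    """Largest length L such that sequences of length >= L
    together contain at least x% of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = sum(lengths) * x / 100.0
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical toy assembly: 8 sequences, 32 bases in total.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(nx(toy_lengths, 50), nx(toy_lengths, 90))
```

In the toy example the two longest sequences already hold half of all bases, so n50 equals their length, while n90 is only reached once some of the shortest sequences are included.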
Summary statistics for potato cultivar-specific representative transcript sequences generated by TransRate.
| TransRate metrics | Désirée | PW363 | Rywal | PGSC+ |
|---|---|---|---|---|
| No. sequences | 57,943 | 43,883 | 36,336 | 39,031 |
| Sequence mean length | 922 | 926 | 1,028 | 1,283 |
| No. sequences under 200 nt | 875 | 1,377 | 1,310 | 87 |
| No. sequences over 1000 nt | 18,500 | 14,545 | 14,307 | 20,226 |
| No. sequences over 10000 nt | 13 | 6 | 2 | 0 |
| ’n90 | 369 | 387 | 440 | 645 |
| ’n50 | 1,566 | 1,535 | 1,673 | 1,726 |
| GC % | 40% | 41% | 41% | 40% |
| Ambiguous nucleotide (N) % | 0% | 0% | 0% | 0% |
| No. seq. with CRBB hits* | 38,034 | 30,826 | 28,389 | 38,600 |
| No. reference seq. with CRBB hits* | 25,094 | 21,751 | 21,299 | 37,534 |
| coverage50#* | 12,799 | 10,693 | 7,909 | 36,379 |
| coverage95#* | 8,053 | 6,430 | 5,053 | 30,187 |
| Reference coverage* | 33% | 28% | 20% | 75% |
’The largest contig size at which at least 90% or 50% of bases are contained in contigs at least this length.
*Reference-based summary statistics (merged Phureja DM coding sequences were used as reference).
#Number of reference proteins with at least N% of their bases covered by a Conditional Reciprocal Best Blast (CRBB) hit.
+PGSC_DM_v3.4_transcript-update_representative.fasta.zip file from Spud DB was used for Phureja-specific representative transcript sequences (PGSC).
Assessment of completeness of constructed transcriptomes.
| cv. Désirée | initial rep+alt | post 1st filtering rep+alt | final rep+alt |
|---|---|---|---|
| (S) | 37.8 | 37.8 | 37.4 |
| (D) | 59.4 | 59.2 | 58.4 |
| (C) | | | |
| (F) | 1.1 | 1.2 | 1.4 |
| (M) | 1.7 | 1.8 | 2.8 |
| (S) | 39.9 | 39.2 | 38.4 |
| (D) | 51.7 | 51.2 | 50.9 |
| (C) | | | |
| (F) | 2.9 | 3.4 | 3.5 |
| (M) | 5.6 | 6.2 | 7.2 |
| (S) | 55.8 | 55.8 | 55.1 |
| (D) | 35.2 | 34.8 | 34.7 |
| (C) | | | |
| (F) | 2.4 | 2.6 | 2.7 |
| (M) | 6.5 | 6.9 | 7.5 |
| (S) | 92.2 | 11.0 | 3.9 |
| (D) | 6.1 | 85.9 | 95.6 |
| (C) | | | |
| (F) | 1.4 | 1.3 | 0.3 |
| (M) | 0.3 | 1.7 | 0.3 |
Percentage of BUSCOs identified in each transcriptome assembly step.
(S): Complete and single-copy BUSCOs %;
(D): Complete and duplicated BUSCOs %
(C): Complete BUSCOs (S + D) %
(F): Fragmented BUSCOs %
(M): Missing BUSCOs %
rep: representative
alt: alternative
*Database size: 1440.
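Because the legend defines complete BUSCOs as (C) = (S) + (D), the complete percentage can be derived from the other rows, and the four disjoint categories (S), (D), (F) and (M) should add up to roughly 100%. The Python sketch below illustrates this with the first set of values in the "initial rep+alt" column; the derived (C) value is computed here and is not a figure reported in the table, and the function names are illustrative.

```python
# (S), (D), (F), (M) percentages from the first block of the table,
# "initial rep+alt" column.
initial = {"S": 37.8, "D": 59.4, "F": 1.1, "M": 1.7}


def complete_percent(scores):
    """Complete BUSCOs: single-copy plus duplicated, as in the legend."""
    return round(scores["S"] + scores["D"], 1)


def category_total(scores):
    """Sum of the four disjoint categories; expected to be close to 100."""
    return round(sum(scores[key] for key in ("S", "D", "F", "M")), 1)


print(complete_percent(initial))  # derived (C), not reported in the table
print(category_total(initial))    # sanity check on the percentages
```

Small deviations from 100 in the category total are expected for other rows, since the tabulated percentages are rounded to one decimal place.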
Fig. 4 Sanger sequencing validates the constructed cultivar-specific transcriptome. Multiple sequence alignment of NPR1-1 coding sequence obtained from eight E. coli colonies (NPR1-1 seq. 1–8) by the Sanger method, assembled short or long-read cv. Rywal transcripts and Phureja DM gene model (Sotub07g016890.1.1). Grey - sequence identity, black - SNPs. The alignment was prepared and visualised with Geneious Prime 2020.1.1[65].
Fig. 5 Transcript variants present in pan-transcriptome paralogue gene groups. (a) Part of the stPanTr_010101 alignment, with two PW363-specific SNPs marked by red dots. Such SNPs can be used to design cultivar- or allele-specific qPCR assays. (b) Part of the stPanTr_074336 alignment, showing an alternative splice variant in Désirée (VdnDe4_33782). Both multiple sequence alignments were made using ClustalOmega v 1.2.1[69] and visualized with MView v 1.66[60]. The remaining alignments can be found in Auxiliary file 2[61].
Mapping of an independent dataset to the newly assembled cultivar-specific reference transcriptome.
| Reference | Désirée | ITAG/PGSC | PGSC | Désirée | ITAG/PGSC | PGSC |
|---|---|---|---|---|---|---|
| Number of input reads | 14,953,659 | | | 14,610,172 | | |
| Average input read length | 252 | | | 252 | | |
| Average mapped length | 246 | 244 | 246 | 246 | 244 | 246 |
| Number of splices: Total | 123,490 | 104,658 | 159,842 | 127,414 | 110,423 | 165,595 |
| Number of splices: Non-canonical | 40,960 | 43,764 | 45,750 | 40,898 | 45,459 | 42,978 |
| Mismatch rate per base % | 0.75% | 0.94% | 1.00% | 0.74% | 0.93% | 1.00% |
| Deletion rate per base | 0.05% | 0.04% | 0.06% | 0.05% | 0.04% | 0.06% |
| Deletion average length | 3.42 | 3.46 | 2.88 | 3.41 | 3.47 | 2.89 |
| Insertion rate per base | 0.03% | 0.02% | 0.04% | 0.03% | 0.02% | 0.04% |
| Insertion average length | 2.21 | 2.99 | 2.52 | 2.24 | 3.00 | 2.53 |
| % of reads mapped to too many loci | 0% | 0% | 0% | 0% | 0% | 0% |
| % of reads unmapped: too many mismatches | 0% | 0% | 0% | 0% | 0% | 0% |
| % of reads unmapped: too short | 21% | 29% | 27% | 22% | 31% | 29% |
| % of reads unmapped: other | 0% | 0% | 0% | 0% | 0% | 0% |
| Number of input reads | 14,755,430 | | | 44,319,261 | | |
| Average input read length | 252 | | | 252 | | |
| Average mapped length | 245 | 243 | 246 | 246 | 244 | 246 |
| Number of splices: Total | 95,409 | 77,103 | 115,083 | 346,313 | 292,184 | 440,520 |
| Number of splices: Non-canonical | 33,269 | 31,224 | 31,065 | 115,127 | 120,447 | 119,793 |
| Mismatch rate per base % | 0.75% | 0.94% | 1.01% | 0.75% | 0.94% | 1.00% |
| Deletion rate per base | 0.06% | 0.05% | 0.07% | 0.05% | 0.05% | 0.06% |
| Deletion average length | 3.23 | 3.25 | 2.83 | 3.36 | 3.41 | 2.87 |
| Insertion rate per base | 0.03% | 0.03% | 0.04% | 0.03% | 0.02% | 0.04% |
| Insertion average length | 2.27 | 3.15 | 2.56 | 2.24 | 3.04 | 2.54 |
| % of reads mapped to too many loci | 0% | 0% | 0% | 0% | 0% | 0% |
| % of reads unmapped: too many mismatches | 0% | 0% | 0% | 0% | 0% | 0% |
| % of reads unmapped: too short | 47% | 55% | 52% | 30% | 38% | 36% |
| % of reads unmapped: other | 0% | 0% | 0% | 0% | 0% | 0% |
Mapping statistics are shown for Désirée leaf samples under drought stress, mapped to the Désirée, ITAG/PGSC merged and PGSC representative transcriptome sequences.
RNA-seq data from Désirée leaf samples under drought stress retrieved from the GEO Series GSE140083 – “Transcriptome profiles of contrasting potato (Solanum tuberosum L.) genotypes under water stress”. No chimeric reads detected.
#Relevant % of mapped reads: % of uniquely mapped reads + % of reads mapped to multiple loci.
| Measurement(s) | genome • RNA • sequence_assembly • transcriptome |
| Technology Type(s) | digital curation • RNA sequencing • sequence assembly process |
| Factor Type(s) | cultivar |
| Sample Characteristic - Organism | Solanum tuberosum |