Ashwil Klein, Lizex H H Husselmann, Achmat Williams, Liam Bell, Bret Cooper, Brent Ragar, David L Tabb.
Abstract
While proteomics has demonstrated its value for model organisms and for organisms with mature genome sequence annotations, proteomics has been of less value in nonmodel organisms that are unaccompanied by genome sequence annotations. This project sought to determine the value of RNA-Seq experiments as a basis for establishing a set of protein sequences to represent a nonmodel organism, in this case, the pseudocereal chia. Assembling four publicly available chia RNA-Seq datasets produced transcript sequence sets with a high BUSCO completeness, though the number of transcript sequences and Trinity "genes" varied considerably among them. After six-frame translation, ProteinOrtho detected substantial numbers of orthologs among other species within the taxonomic order Lamiales. These protein sequence databases demonstrated a good identification efficiency for three different LC-MS/MS proteomics experiments, though a seed proteome showed considerable variability in the identification of peptides based on seed protein sequence inclusion. If a proteomics experiment emphasizes a particular tissue, an RNA-Seq experiment incorporating that same tissue is more likely to support a database search identification of that proteome.
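The six-frame translation step mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the authors' pipeline, which may also have split translations at stop codons or filtered by length.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(transcript):
    """Translate three forward and three reverse-complement frames."""
    seq = transcript.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` yields `"MA*"`, where `*` marks a stop codon.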
Keywords: LC-MS/MS; RNA-Seq; Salvia columbariae; Salvia hispanica; bioinformatics; nonmodel organisms; proteogenomics; proteomics
Year: 2021 PMID: 33919777 PMCID: PMC8070742 DOI: 10.3390/plants10040765
Source DB: PubMed Journal: Plants (Basel) ISSN: 2223-7747
Trinity assembled each of four different chia RNA-Seq studies individually.
| | Sreedhar | Peláez | Wimberley | Gupta |
|---|---|---|---|---|
| Accession | PRJNA196477 | PRJNA448759 | PRJNA597830 | PRJEB19614 |
| Expts (pairs) | 5 | 10 * | 6 | 13 * |
| Tissues | Developing Seeds | Seeds | Leaves and Roots | Thirteen Tissues |
| Sequencer | Genome Analyzer IIx | HiSeq4000 | HiSeq4000 | HiSeq2500 |
| Total Reads | 178,652,566 | 314,945,194 | 224,284,401 | 138,921,028 |
| Read Length | 72 | 99 | 99 | 99 |
* This analysis included only cultivated strain data from Peláez; similarly, it included only the first of three FASTQ pairs from each tissue for Gupta.
Figure 1. Once sequencing reads have been assembled to a set of transcript sequences (represented by the orange pentagon), a great variety of analyses become possible, with each information product represented by a green ellipse.
The RNA-Seq assemblies ranged from 27,543 to 301,858 transcript sequences, with the trimming tool playing a large role in the number and length of transcript sequences. “Genes” is reported by TrinityStats, representing the number of sequence clusters from which transcript sequences were drawn.
| Assembly | “Genes” (TrimGalore!) | Transcript Sequences (TrimGalore!) | Median Length (TrimGalore!) | “Genes” (Trimmomatic) | Transcript Sequences (Trimmomatic) | Median Length (Trimmomatic) |
|---|---|---|---|---|---|---|
| Gupta | 72,525 | 145,679 | 660 | 61,217 | 156,748 | 845 |
| Sreedhar | 34,590 | 53,127 | 857 | 30,649 | 63,468 | 1129 |
| Wimberley | 71,869 | 142,899 | 837 | 84,087 | 301,858 | 1388 |
| Peláez | 60,319 | 108,164 | 756 | 56,761 | 186,842 | 1319 |
| Peláez | 41,802 | 71,324 | 775 | 39,705 | 109,835 | 1162 |
| Peláez | 20,106 | 27,543 | 357 | 20,908 | 30,909 | 377 |
| Peláez | 36,749 | 50,681 | 427 | 38,666 | 59,163 | 469 |
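The three quantities tabulated above ("genes", transcript sequences, and median length) can be recomputed from an assembly FASTA with a short Python sketch. It assumes Trinity-style identifiers in which the transcript ID ends in `_i<N>` and the remainder names the "gene" cluster; the function name is hypothetical and the sketch is not a substitute for TrinityStats.

```python
import re
from statistics import median


def summarize_assembly(fasta_text):
    """Count 'genes' and transcripts and report the median transcript length.

    Assumes headers such as '>TRINITY_DN1_c0_g1_i2 len=...', where stripping
    the trailing '_i<N>' isoform suffix gives the 'gene' identifier.
    """
    lengths = []
    genes = set()
    current_len = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_len is not None:
                lengths.append(current_len)
            transcript_id = line[1:].split()[0]
            genes.add(re.sub(r"_i\d+$", "", transcript_id))
            current_len = 0
        elif current_len is not None:
            # Sequence may wrap across several lines; sum them up.
            current_len += len(line)
    if current_len is not None:
        lengths.append(current_len)
    return {
        "genes": len(genes),
        "transcripts": len(lengths),
        "median_length": median(lengths) if lengths else 0,
    }
```

Running this on the TrimGalore! and Trimmomatic assemblies of the same reads would give the kind of side-by-side comparison shown in the table.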
Figure 2. BUSCO assesses whether or not single-copy orthologs are represented in an assembled transcript sequence set (“completeness”) and whether or not they are present in multiple copies (“redundancy”). The first five bars illustrate a high completeness (red plus grey; percentages shown below each bar) but also considerable redundancy (grey bar) within the TrimGalore assemblies; the last four bars illustrate the heterogeneity among the four cultivars in the Peláez set.
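The two percentages described in the Figure 2 caption can be expressed as simple arithmetic over BUSCO's four result categories. The sketch below assumes the usual category counts (complete single-copy, complete duplicated, fragmented, missing); the function name and the example counts are hypothetical, not values from the study.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return completeness and redundancy as percentages of all BUSCOs.

    Completeness counts every complete ortholog (single-copy plus
    duplicated); redundancy counts only the duplicated ones.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO must be evaluated")
    return {
        "completeness_pct": 100 * (single + duplicated) / total,
        "redundancy_pct": 100 * duplicated / total,
    }
```

With illustrative counts of 600 single-copy, 300 duplicated, 50 fragmented, and 50 missing, completeness is 90% and redundancy is 30%.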
Figure 3. UpSet plot visualizing orthology among assemblies. Each transcript sequence set is internally clustered; its size after clustering is represented by the bar size in the lower left. The sizes of intersections (sets of sequences found in multiple assemblies) are shown by bars in the main plot, sorted with the most common intersections first. The beads connected by lines below report which assemblies contain each set of homologous sequences. “Plants” represents the transcript sequence set published by Wimberley et al., while “TGWimberley” represents a new Trinity assembly of the Wimberley FASTQs, pruned by TrimGalore.
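The intersection bars of an UpSet plot such as Figure 3 can be derived from per-assembly cluster memberships. The following Python sketch assumes each assembly is represented as a set of cluster identifiers; the function name and input layout are hypothetical, and the plotting itself is left out.

```python
from collections import Counter


def upset_intersections(memberships):
    """Count clusters by the exact combination of assemblies containing them.

    `memberships` maps an assembly name to the set of cluster IDs it
    contains. The result lists (combination, count) pairs, largest first,
    matching the bar ordering described in the caption.
    """
    cluster_to_assemblies = {}
    for name, clusters in memberships.items():
        for cluster_id in clusters:
            cluster_to_assemblies.setdefault(cluster_id, set()).add(name)
    counts = Counter(
        frozenset(names) for names in cluster_to_assemblies.values()
    )
    # Sort by descending count, breaking ties by assembly names.
    return sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
```

Each cluster is counted once, under the full combination of assemblies that contain it, so the bars are mutually exclusive and sum to the total number of clusters.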
The completed genome sequences within the order Lamiales range widely in transcript sequence set size, from 17,685 transcript sequences for the carnivorous G. aurea to 67,009 transcript sequences for the European olive. [1] A. thaliana falls within the order Brassicales rather than the order Lamiales. [2] The transcript sequence set for S. asiatica contains redundant accessions that greatly outnumber the proteins listed for this species.
| Species | Transcript Sequences | Proteins |
|---|---|---|
| [1] | 53,827 | 48,265 |
| | 47,778 | 47,778 |
| | 34,410 | 31,861 |
| G. aurea | 17,685 | 17,685 |
| | 30,271 | 30,271 |
| European olive | 67,009 | 58,334 |
| | 30,299 | 30,330 |
| | 55,562 | 53,354 |
| | 38,621 | 35,410 |
| | [2] | 33,426 |
Figure 4. Like Figure 3, this UpSet plot shows the extent of the orthology between sequence databases, but in this case, the diagram visualizes the homology among protein sequences rather than transcript sequences, and the comparison ranged across Lamiales rather than between assembled S. hispanica transcript sequence sets. Clusters of protein sequences found in only one species were excluded. The TGGupta transcript sequence set, when translated to amino acids, showed a strong homology to proteins of other taxa within Lamiales. The number of orthologs found across the entire set of species was almost as large as the number of orthologs found uniquely with its genus-mate Salvia splendens.
Figure 5. Different sets of peptides are identified from Aguilar-Toalá 2019 in each of the four translated assemblies. The two assemblies that incorporated seeds (TGGupta and TGPeláezCelaya) identified the most peptides, but the assembly that added the most peptides uniquely was TGSreedhar.