John Archer, Gareth Whiteley, Nicholas R Casewell, Robert A Harrison, Simon C Wagstaff.
Abstract
BACKGROUND: Within many research areas, such as transcriptomics, the millions of short DNA fragments (reads) produced by current sequencing platforms need to be assembled into transcript sequences before they can be utilized. Despite recent advances in assembly software, creating such transcripts from read data harboring isoform variation remains challenging. This is because current approaches either fail to identify all variants present or create chimeric transcripts within which relationships between co-evolving sites and other evolutionary factors are disrupted. We present VTBuilder, a tool for constructing non-chimeric transcripts from read data that has been sequenced from sources containing isoform complexity.
Year: 2014 PMID: 25465054 PMCID: PMC4260244 DOI: 10.1186/s12859-014-0389-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. VTBuilder's graphical user interface (GUI). Green boxes indicate completed steps of the pipeline, while grey boxes indicate those yet to be performed. The yellow box shows the step that is currently running, and the yellow text gives a brief description of that step. The inset panel displays the setup area presented to the user when they first double-click the jar file. The red circle indicates the command required to run the software from the command line without the GUI.
Figure 2. Implementation. (A) Schematic diagram of the VTBuilder assembly pipeline. (B) For each scaffold-like alignment produced during mapping, a network is constructed. (i) Non-overlapping windows are positioned along the assembly. (ii) Reads spanning each window are extracted and truncated. (iii) These are then clustered to produce nodes. (iv) Edges are placed between clusters that share reads.
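The four steps in panel (B) can be illustrated with a minimal Python sketch. This is not VTBuilder's implementation: the function name `build_window_network`, the read representation (a start coordinate plus a gap-free aligned sequence), and the use of exact sequence identity as the clustering rule are all assumptions made for illustration, since the caption does not specify how reads are clustered.

```python
from collections import defaultdict
from itertools import combinations


def build_window_network(reads, assembly_length, window_size):
    """Build a window-based read network (illustrative sketch only).

    reads: dict mapping read_id -> (start, sequence), where start is the
           0-based alignment coordinate and the alignment is gap-free
           (an assumption of this sketch).
    Returns (nodes, edges): nodes maps (window_index, fragment) to the set
    of read ids in that cluster; edges is a set of node pairs.
    """
    nodes = {}
    # (i) Position non-overlapping windows along the assembly.
    window_starts = range(0, assembly_length - window_size + 1, window_size)
    for w, w_start in enumerate(window_starts):
        w_end = w_start + window_size
        clusters = defaultdict(set)
        for read_id, (start, seq) in reads.items():
            # (ii) Keep reads that span the whole window; truncate to it.
            if start <= w_start and start + len(seq) >= w_end:
                fragment = seq[w_start - start:w_end - start]
                # (iii) Cluster: here, identical fragments form one node
                # (assumed rule; the real clustering may tolerate errors).
                clusters[fragment].add(read_id)
        for fragment, members in clusters.items():
            nodes[(w, fragment)] = members

    # (iv) Place an edge between clusters in different windows
    # that share at least one read.
    edges = set()
    for a, b in combinations(nodes, 2):
        if a[0] != b[0] and nodes[a] & nodes[b]:
            edges.add((a, b))
    return nodes, edges


if __name__ == "__main__":
    toy_reads = {
        "r1": (0, "AAAACCCC"),
        "r2": (0, "AAAAGGGG"),
        "r3": (4, "CCCC"),
    }
    nodes, edges = build_window_network(toy_reads, assembly_length=8, window_size=4)
    for node, members in nodes.items():
        print(node, sorted(members))
    print(len(edges), "edges")
```

In the toy input, two variants share the first window but diverge in the second, so the network branches from one node into two. Paths through such a network keep variant-specific windows linked by the reads that support them.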
The 54 known SSTs used to seed the simulation of reads as described in case study 1
| Protein family | Number of sequences | Sequence length(s) |
|---|---|---|
| SVMP I | 1 | 1600 |
| SVMP II | 3 | 1600 - 2000 |
| SVMP III | 10 | 1600 - 2300 |
| Serine Protease | 9 | 700 - 1400 |
| Phospholipase A2 | 3 | 600 |
| CTL | 16 | 500 - 700 |
| NGF | 1 | 700 |
| CRISP | 1 | 850 |
| VEGF | 1 | 650 |
| LAAO | 1 | 1450 |
| Creatine Kinase | 1 | 790 |
| β-Actin | 1 | 630 |
| HSP90 Endoplasmin | 1 | 780 |
| ATPase6 | 1 | 720 |
| Cytochrome C Oxidase | 1 | 880 |
| Poly A Binding Protein | 1 | 680 |
| Cytochrome B | 1 | 800 |
| Protein Disulfide Isomerase | 1 | 1650 |
Column 2 contains the number of sequences representing each protein family. Column 3 displays the lengths of the sequences included.
Figure 3. Transcript reconstruction on simulated reads. (A) Lengths of all transcripts constructed by VTBuilder and Trinity compared to those of the SSTs. The top and bottom of the boxes represent the 75th and 25th percentiles respectively, while the top and bottom whiskers represent the third quartile + 1.5 times the interquartile range (IQR) and the first quartile - 1.5 times the IQR respectively. Outliers beyond these points are represented as black circles. (B) Lengths of transcripts constructed by VTBuilder and Trinity that had a sequence similarity of 90% or greater to the SSTs. (C) Network showing the relationship between the VTBuilder transcripts and the SSTs. Grey nodes represent the VTBuilder transcripts. Colored nodes represent the protein families to which the individual SSTs belong (see key). Node size is proportional to sequence length. Edges represent a sequence similarity of 90% or greater. (D) Same as (C) but using Trinity to construct the transcripts.
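The whisker positions described in the Figure 3 caption follow the usual 1.5 × IQR rule, which can be sketched as below. The function name `whisker_limits` and the toy lengths are hypothetical, and the quartile estimator (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; the caption does not say which estimator the authors' plotting software used.

```python
from statistics import quantiles


def whisker_limits(lengths):
    """Return (q1, q3, lower, upper, outliers) using the 1.5 x IQR rule.

    The quartile estimator is an assumption of this sketch; other
    estimators give slightly different quartiles on small samples.
    """
    q1, _median, q3 = quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # bottom whisker position
    upper = q3 + 1.5 * iqr  # top whisker position
    # Values beyond the whisker positions are drawn as individual points.
    outliers = [x for x in lengths if x < lower or x > upper]
    return q1, q3, lower, upper, outliers


if __name__ == "__main__":
    # Hypothetical transcript lengths, not data from the paper.
    toy_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
    print(whisker_limits(toy_lengths))
```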
Figure 4. Scaling up to real data. Reads from the venom gland of Bitis arietans were assembled using VTBuilder and annotated using BLAST2GO [59]. (A) Box and whisker plot depicting the length distribution of the constructed transcripts (see Figure 3 for details of whiskers). (B) Transcripts were categorized into four groups: (i) Toxins, (ii) Non-Toxins, (iii) No significant match, and (iv) Bacterial or Viral DNA. (C) The Toxin group in (A) was split into subcategories representing the different protein families present.
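The 101 unique toxin transcripts recovered by VTBuilder from reads sequenced from the venom gland of Bitis arietans, grouped by protein family (column 1), with the overall percentage of the toxin DNA that each family makes up within the transcriptome (column 2) and the number of transcripts per family (column 3)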
| Protein family | Toxin DNA (%) | Number of transcripts |
|---|---|---|
| CTL | 44.87 | 31 |
| SVMP + DIS | 22.99 | 26 |
| SP | 11.08 | 14 |
| VEGF | 8.13 | 5 |
| SPI | 6.18 | 9 |
| SVMP Inhibitor | 2.28 | 1 |
| LAO | 1.44 | 3 |
| CYS | 0.96 | 1 |
| PLA2 | 0.79 | 3 |
| 5NUC | 0.60 | 1 |
| NGF | 0.39 | 2 |
| AP | 0.14 | 1 |
| HYA | 0.06 | 1 |
| DPP | 0.06 | 2 |
| PDE | 0.04 | 1 |
Combined, these made up 33.71% of the expressed transcriptome (Figure 4A) but only 6.81% of the total number of unique sequences present.