Carol A Soderlund, William M Nelson, Stephen A Goff.
Abstract
Sequencing the transcriptome can answer various questions such as determining the transcripts expressed in a given species for a specific tissue or condition, evaluating differential expression, discovering variants, and evaluating allele-specific expression. Differential expression evaluates the expression differences between different strains, tissues, and conditions. Allele-specific expression evaluates expression differences between parental alleles. Both differential expression and allele-specific expression have been studied for heterosis (hybrid vigor), where the hybrid has improved performance over the parents for one or more traits. The Allele Workbench software was developed for a heterosis study that evaluated allele-specific expression for a mouse F1 hybrid using libraries from multiple tissues with biological replicates. This software has been made into a distributable package, which includes a pipeline, a Java interface to build the database, and a Java interface for query and display of the results. The required input is a reference genome, annotation file, and one or more RNA-Seq libraries with optional replicates. It evaluates allelic imbalance at the SNP and transcript level and flags transcripts with significant opposite directional allele-specific expression. The Java interface allows the user to view data from libraries, replicates, genes, transcripts, exons, and variants, including queries on allele imbalance for selected libraries. To determine the impact of allele-specific SNPs on protein folding, variants are annotated with their effect (e.g., missense), and the parental protein sequences may be exported for protein folding analysis. The Allele Workbench processing results in transcript files and read counts that can be used as input to the previously published Transcriptome Computational Workbench, which has a new algorithm for determining a trimmed set of gene ontology terms. 
The software with demo files is available from https://code.google.com/p/allele-workbench. Additionally, all software is ready for immediate use from an Atmosphere Virtual Machine Image available from the iPlant Collaborative (www.iplantcollaborative.org).
Year: 2014 PMID: 25541944 PMCID: PMC4277417 DOI: 10.1371/journal.pone.0115740
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Steps of the AW pipeline and Java processing.
| Program description | Input | Output | Tool used; Postprocessing |
| QC | Raw read files | HTML files with quality measures | FastQC |
| Trim | Raw read files | Trimmed read files | Trimmomatic |
| Align to GS | Trimmed read files, GS files, annotation file | Alignment files | Tophat2 |
| Variants | Alignment files | Variant files | Samtools, bcftools |
| Mask GS | GS, variant files | Masked GS | Bedtools |
| Align to masked GS | Trimmed read files, masked GS, annotation file | Alignment files | Tophat2, Samtools |
| SNP coverage | Alignment files, variant file | SNP coverage files | Samtools; Parse counts from mpileup output |
| Transcripts counts | Parental transcript files | Heterozygous count files | STAR |
| AW build database (runAW) | SNP coverage files, annotation file, variant file; Optional: GS files, heterozygous count files | AW database, parental protein files, parental transcript files | - |
| TCW build database | Transcript or protein file; total count files | TCW database | BLAST |
Raw read files (.fastq), GS (genome sequence, fasta), annotation file (.gtf), alignment file (.bam), variant file (.vcf), SNP coverage (.bed), transcript counts (.xprs).
Though not listed in their output column, all scripts output a .html summary file. The two Java build programs enter summary information into their database for display by their Java query program.
Pipeline scripts are Perl, except QC, which is shell. Each script executes one or more tools on all input files, renames the result files with their library abbreviations, puts them into the /Results directory, and writes the summary .html file.
These steps are only necessary if the variant file is not available.
runAW must be executed before the “Transcripts counts” step to produce the parental transcript files and again afterwards to update the database with the transcripts heterozygous count files. The optional AW build files are not needed for the initial build.
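For the “SNP coverage” step, the table lists parsing counts from mpileup output as postprocessing. The following is a minimal sketch of how ref and alt counts could be tallied from one line of samtools mpileup text output. It is not the AW Perl script; the function names `count_ref_alt` and `parse_mpileup_line` and the example line are hypothetical, and the sketch assumes the standard six-column mpileup layout (chromosome, position, reference base, depth, read bases, qualities).

```python
import re

def count_ref_alt(bases: str, alt: str) -> tuple[int, int]:
    """Count reference and alternate allele observations in an mpileup read-bases string."""
    ref_count = 0
    alt_count = 0
    alt = alt.upper()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # '^' marks a read start and is followed by one mapping-quality character.
            i += 2
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted or deleted bases.
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                i += 1
                continue
            i += 1 + len(m.group()) + int(m.group())
            continue
        if c in ".,":
            # '.' and ',' are matches to the reference on the forward/reverse strand.
            ref_count += 1
        elif c.upper() == alt:
            alt_count += 1
        # '$' (read end), '*' (deleted base), and other mismatches are not counted.
        i += 1
    return ref_count, alt_count

def parse_mpileup_line(line: str, alt: str) -> tuple[str, int, int, int]:
    """Return (chromosome, position, ref count, alt count) for one mpileup line."""
    chrom, pos, _ref, _depth, bases = line.rstrip("\n").split("\t")[:5]
    ref_count, alt_count = count_ref_alt(bases, alt)
    return chrom, int(pos), ref_count, alt_count

# Hypothetical example line for a SNP whose alternate allele is A.
example = "chr1\t1000\tG\t9\t..,,A^].a$+2AG.T\tIIIIIIIII"
print(parse_mpileup_line(example, "A"))
```

Per-library ref:alt pairs of this kind are what the SNP coverage files feed into the database build.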
Java graphical interface.
Figure 1. viewAW tables.
The blue circles represent tables that can be queried in viewAW. From each table, one or more rows may be selected to view the associated table of data, which is indicated by the pointed-to circles. The “LibList” is the library counts for a selected set of genes, transcripts or SNPs, which link to the associated replicate counts.
Figure 2. viewAW transcript table.
The columns are listed in the lower panel; when an adjoining box is checked, the corresponding column is shown in the table. Selecting “Hide” closes the column listing. The SpNYfKid and SpNYfLiv columns are the SNP coverage p-values. The RpNYfKid and RpNYfLiv columns are the read count p-values. #SNPCov is the number of SNPs with ≥20 reads for any library, #SNPAI is the number of SNPs that are AI (p-value <0.05) for any library, and #Mis is the number of missense SNPs. #SNPCov and #SNPAI take all four libraries into account; only two are shown, but the others can be viewed by selecting their respective column boxes next to “Tissue”.
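The caption gives two thresholds: a SNP is covered with ≥20 reads and is AI with a p-value <0.05. This section does not state which statistical test produces the p-values, so the sketch below assumes a two-sided exact binomial test against an expected 50:50 ref:alt ratio. That choice, and the names `binomial_two_sided_p` and `classify_snp`, are assumptions for illustration only and may differ from what Allele Workbench does.

```python
from math import comb

MIN_READS = 20   # coverage threshold from the caption
ALPHA = 0.05     # AI p-value cutoff from the caption

def binomial_two_sided_p(ref: int, alt: int) -> float:
    """Two-sided exact binomial p-value for ref vs. alt counts under a 50:50 null (assumed test)."""
    n = ref + alt
    if n == 0:
        return 1.0
    # Probability of each possible ref count k under Binomial(n, 0.5).
    probs = [comb(n, k) / (1 << n) for k in range(n + 1)]
    observed = probs[ref]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)

def classify_snp(ref: int, alt: int):
    """Return (covered, p_value, is_ai) for one SNP in one library."""
    covered = (ref + alt) >= MIN_READS
    p_value = binomial_two_sided_p(ref, alt) if covered else None
    is_ai = p_value is not None and p_value < ALPHA
    return covered, p_value, is_ai

for ref, alt in [(15, 5), (14, 6), (10, 10), (9, 2)]:
    print(ref, alt, classify_snp(ref, alt))
```

Under these assumptions, a 15:5 split over 20 reads falls just under the 0.05 cutoff, while 14:6 does not, and a SNP with only 11 reads is never tested.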
Figure 3. viewAW drawing of a gene with three transcripts and 11 variants.
The black exons are non-coding. Stacked coding exons drawn in different colors have different coordinates, e.g., the stack with two pink exons (the same coordinates) and one blue exon (different coordinates). The long vertical lines represent SNPs (black) and indels (red); if the number below the variant line is followed by an “*”, then the variant is AI (p-value <0.05) for at least one library, e.g., variant #2 is AI for libraries NYfBr and NYfLiv.
Figure 4. viewAW drilling down into the data.
(a) The table shows the variants for an AI transcript. The S:NYfMus column displays the ref:alt SNP coverage for library NYfMus, and the SpNYfMus column shows the corresponding p-values. There are three AI SNPs, where two are ref> alt and the other is alt
Figure 5. TCW trimmed GO set.
All 76 DE-enriched GOs are shown in the table, and the 24 green rows are the trimmed set.
Timing and memory of steps.
| Script | Time | Memory | CPUs | Output |
| QC | 2 h | 2 G | 4 | 7 M |
| Trim | 2 h | 4 G | 4 | 22 G |
| Align | 41 h | 4 G | 10 | 38 G |
| Variants | 7 h | 1 G | 4 | 50 M |
| GS Mask | 10 m | 600 M | 1 | 4 G |
| SNP coverage | 10 m | 1 G | 4 | 12 M |
| Read counts | 4 h | 3 G | 4 | 300 M |
| runAW | 30 m | 2 G | 1 | 400 M |
Gzip, singletons not saved.
Same as the original genome sequence.
Allele imbalance of major variant effect categories.
| Effect | Count | Covered | AI |
| 3_prime_UTR_variant | 32,628 | 32,288 | 10,896 (34%) |
| 5_prime_UTR_variant | 6,848 | 4,129 | 1,384 (34%) |
| Missense_variant | 10,732 | 9,647 | 2,818 (29%) |
| Synonymous_variant | 18,998 | 27,388 | 7,751 (28%) |
The number of libraries with SNP coverage ≥20 (since there are 4 libraries, the maximum would be 4 x count).
The number of libraries with allele imbalance (p-value <0.05); the percent is in relation to the number covered.
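As a worked check of the footnote above, the AI percentage is the AI count divided by the covered count, not by the SNP count. The short sketch below recomputes the percentages from the table values; the dictionary layout and the name `percent_ai` are illustrative only.

```python
# (covered, AI) pairs copied from the table above.
effects = {
    "3_prime_UTR_variant": (32288, 10896),
    "5_prime_UTR_variant": (4129, 1384),
    "Missense_variant": (9647, 2818),
    "Synonymous_variant": (27388, 7751),
}

def percent_ai(covered: int, ai: int) -> int:
    """Percent of covered library-SNP observations that show allele imbalance."""
    return round(100 * ai / covered)

for effect, (covered, ai) in effects.items():
    print(f"{effect}: {percent_ai(covered, ai)}%")
```

The recomputed values agree with the percentages printed in the table.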
AI and DE processing for three studies.
| Bell et al. | Pemrumba et al. | Zhia et al. |
| 454-contigs (invasive) | Chicken genome | Nipponbare rice genome |
| Parents, 2 hybrid pools | 2 lines, 2 types, 7 replicates | Parents and hybrid, 2 stages, 2 replicates |
| Custom script | FastQC | Custom script |
| MOSAIK | Tophat | RSEM |
| Samtools | Freebayes | Custom script |
| Assume custom script | HTSeq | |
| Binomial + FDR | ANOVA | Binomial |
| Binomial + FDR | DESeq | edgeR |
| Cis-, trans-acting | Union between types | Comparison between stages, substitutions |
| Additive, dominance | Union between types | Comparison between stages |
| GO (TAIR) | ANNOVAR | Comparison between AI and DE, WEGO |