Alan Hodgkinson, Jean-Christophe Grenier, Elias Gbeha, Philip Awadalla.
Abstract
BACKGROUND: Allele-specific expression (ASE) has become an important phenotype. It is used to detect cis-regulatory variation, nonsense-mediated decay and imprinting in the personal genome, and has been used both to identify disease loci and to assess the penetrance of damaging alleles. The detection of ASE using high-throughput technologies relies on aligning short-read sequencing data, a process that has inherent biases, and there is still a need to develop fast and accurate methods to detect ASE given the unprecedented growth of sequencing information in big data projects.
Keywords: Allele specific expression; Normalization; RNA sequencing
Year: 2016 PMID: 27618913 PMCID: PMC5020486 DOI: 10.1186/s12859-016-1238-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 (a) Schematic for the normalization procedure. For a given heterozygous SNV the underlying proportion of reference and alternative alleles is unknown. After mapping, the proportion of reference/alternative alleles is observed, but may contain biases. To correct for this, a null dataset is generated for this site containing a 50:50 ratio of the two alleles (see panel b), and these data, together with null data from all other heterozygous sites, are mapped using the same procedure as used for the original alignment. The observed proportion of mapped alleles from the null dataset is then used to correct the original data. (b) Generation of the null dataset. All reads and read pairs covering a heterozygous SNV are shown in the left-hand panel. From these data, read pairs are randomly selected and the second haplotype is generated from known SNV data for the individual. In the right-hand panel, three examples of this process are shown. At the top, the original read pair contains the reference allele at the SNV of interest (C/T), as well as the reference allele at a neighbouring SNV (G/A). The second haplotype is thus generated with the alternative alleles at both positions. In the middle, the original read pair contains two alternative alleles at the SNV sites, so an alternative read pair is generated with both reference alleles. At the bottom, the read pair contains the reference allele at the central SNV site, and what appears to be a sequencing error upstream at a site where no SNV has been identified. As such, a read pair is created with the sequencing error unchanged, and the alternative allele at the SNV position. This process is repeated for all read pairs to generate a null dataset with coverage of 4000X, and reads are converted into fastq format for remapping.
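The haplotype-swapping step described in panel b can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `make_second_haplotype`, the use of a single ungapped read rather than a read pair, and the dictionary layout for known SNVs are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def make_second_haplotype(
    read: str,
    read_start: int,
    snvs: Dict[int, Tuple[str, str]],
) -> str:
    """Return a copy of `read` with alleles swapped at known heterozygous SNVs.

    Hypothetical helper: `read_start` is the genomic position of the first
    base of an ungapped read, and `snvs` maps position -> (ref, alt).
    """
    bases = list(read)
    for pos, (ref, alt) in snvs.items():
        offset = pos - read_start
        if 0 <= offset < len(bases):
            if bases[offset] == ref:
                bases[offset] = alt  # reference allele -> alternative allele
            elif bases[offset] == alt:
                bases[offset] = ref  # alternative allele -> reference allele
            # Any other base at a known SNV is left untouched.
    # Bases at positions with no known SNV (e.g. sequencing errors) are unchanged.
    return "".join(bases)


# Known SNVs for the individual: G/A at position 102 and C/T at position 104.
known_snvs = {102: ("G", "A"), 104: ("C", "T")}

# Read carrying both reference alleles -> partner carries both alternative alleles.
partner = make_second_haplotype("ACGTC", 100, known_snvs)

# Read with an apparent sequencing error at its first base (no SNV known there):
# the error is kept, and only the known SNV positions are swapped.
partner_with_error = make_second_haplotype("TCGTC", 100, known_snvs)
```

Pairing every sampled read with its swapped partner gives a 50:50 allele ratio at each site by construction, so any departure from 50:50 after remapping can be attributed to the alignment step.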
Fig. 2 The proportion of reference alleles at heterozygous sites before and after normalization. Each plot shows the combined results from five simulated datasets, with the known reference proportion (ground truth) on the x-axis and the reference proportion obtained from aligning sequencing data (estimated) on the y-axis. (a) shows the results obtained from initial mapping under four different approaches, and (b) shows the results of the same approaches after normalization. The sum of squared errors (SSE) is calculated around the red line (x = y), whereas R² is obtained from analysing the correlation between the two variables.
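The two summary statistics in this caption measure different things, which a short Python sketch can make concrete. It assumes that R² is the squared Pearson correlation (the caption says only that it comes from the correlation), and the function names and toy values are illustrative rather than taken from the paper.

```python
from typing import Sequence


def sse_around_identity(truth: Sequence[float], estimated: Sequence[float]) -> float:
    """Sum of squared vertical distances from the line x = y."""
    return sum((e - t) ** 2 for t, e in zip(truth, estimated))


def r_squared(truth: Sequence[float], estimated: Sequence[float]) -> float:
    """Squared Pearson correlation between the two variables (assumed definition)."""
    n = len(truth)
    mean_x = sum(truth) / n
    mean_y = sum(estimated) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(truth, estimated))
    sxx = sum((x - mean_x) ** 2 for x in truth)
    syy = sum((y - mean_y) ** 2 for y in estimated)
    return sxy ** 2 / (sxx * syy)


# Toy data: every estimate is shifted towards the reference allele by 0.1.
truth = [0.2, 0.5, 0.8]
estimated = [0.3, 0.6, 0.9]

sse = sse_around_identity(truth, estimated)
r2 = r_squared(truth, estimated)
```

In this toy case the estimates are perfectly correlated with the truth, so R² is essentially 1, yet the SSE is non-zero because of the constant shift. A systematic reference bias can therefore be invisible to R² while still showing up in the SSE around x = y.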
A comparison of ASE call rates for original mapped data and after normalization, for four different alignment methods
| Method | True Positives | False Positives | True Negatives | False Negatives | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|---|---|
| Tophat2 | 108.2 | 452.4 | 11281.0 | 74.6 | 59.15 % | 96.14 % | 19.32 % |
| Tophat2 Normalized | 110.4 | 94.0 | 11639.4 | 72.4 | 60.33 % | 99.20 % | 53.90 % |
| STAR | 130.8 | 87.6 | 11666.8 | 52.4 | 71.40 % | 99.25 % | 58.87 % |
| STAR Normalized | 135.2 | 40.2 | 11714.2 | 48.0 | 73.75 % | 99.66 % | 77.03 % |
| TH2_5MM | 149.6 | 50.6 | 11702.8 | 33.4 | 81.78 % | 99.57 % | 74.72 % |
| TH2_5MM Normalized | 148.2 | 20.0 | 11733.4 | 34.8 | 80.97 % | 99.83 % | 88.03 % |
| STAR (No clip) | 151.6 | 51.4 | 11703.0 | 31.6 | 82.66 % | 99.56 % | 74.61 % |
| STAR (No clip) Normalized | 152.0 | 30.0 | 11724.4 | 31.2 | 82.91 % | 99.74 % | 83.50 % |
In all cases, the true number of significant ASE events averaged across five simulations is 183.2
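The rate columns in the table follow from the four count columns through the standard definitions. The Python sketch below applies them to the averaged counts of the STAR Normalized row; the function name is illustrative. Values computed this way from averaged counts can differ slightly from the tabulated percentages, which would be expected if those were averaged per simulation (an assumption, as the source does not say how they were derived).

```python
from typing import Dict


def call_rates(tp: float, fp: float, tn: float, fn: float) -> Dict[str, float]:
    """Standard rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true ASE sites that were called
        "specificity": tn / (tn + fp),  # share of non-ASE sites correctly not called
        "precision": tp / (tp + fp),    # share of ASE calls that are true
    }


# Averaged counts from the STAR Normalized row of the table above.
rates = call_rates(tp=135.2, fp=40.2, tn=11714.2, fn=48.0)

for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Because non-ASE sites vastly outnumber ASE sites here, specificity stays above 96 % for every method, while precision ranges from about 19 % to 88 %. Precision is therefore the more informative column for judging how much normalization reduces false positive calls.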
A comparison of ASE call rates for three different approaches: a filtering method, mapping to two parental genomes, and our normalization approach
| Mapping procedure | Control Method | True Positives | False Positives | True Negatives | False Negatives | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|---|---|---|
| TH2_5MM | Normalized | 142.28 | 19.80 | 11737.80 | 36.80 | 79.42 % | 99.83 % | 87.69 % |
| TH2_5MM | Two Genomes | 144.96 | 21.84 | 11737.68 | 34.44 | 80.84 % | 99.81 % | 86.85 % |
| TH2_5MM | Filtered | 117.44 | 46.12 | 11711.48 | 61.64 | 65.71 % | 99.61 % | 71.77 % |
| STAR (No clip) | Normalized | 147.40 | 29.00 | 11729.24 | 31.80 | 82.28 % | 99.75 % | 83.52 % |
| STAR (No clip) | Two Genomes | 152.24 | 22.28 | 11724.96 | 27.16 | 84.80 % | 99.81 % | 87.17 % |
| STAR (No clip) | Filtered | 122.68 | 43.72 | 11714.52 | 56.52 | 68.53 % | 99.63 % | 73.70 % |
Fig. 3 ASE calls per individual. The number of sites showing ASE per individual after resampling to depth 20X (a) and the relationship between the proportions of ASE events per site, per individual, and smoking status (b)