| Literature DB >> 25236449 |
Steven C Munger1, Narayanan Raghupathy1, Kwangbom Choi1, Allen K Simons1, Daniel M Gatti1, Douglas A Hinerfeld1, Karen L Svenson1, Mark P Keller2, Alan D Attie2, Matthew A Hibbs3, Joel H Graber1, Elissa J Chesler1, Gary A Churchill4.
Abstract
Massively parallel RNA sequencing (RNA-seq) has yielded a wealth of new insights into transcriptional regulation. A first step in the analysis of RNA-seq data is the alignment of short sequence reads to a common reference genome or transcriptome. Genetic variants that distinguish individual genomes from the reference sequence can cause reads to be misaligned, resulting in biased estimates of transcript abundance. Fine-tuning of read alignment algorithms does not correct this problem. We have developed Seqnature software to construct individualized diploid genomes and transcriptomes for multiparent populations and have implemented a complete analysis pipeline that incorporates other existing software tools. We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression. Moreover, when applied to expression QTL mapping we find that our individualized alignment strategy corrects false-positive linkage signals and unmasks hidden associations. We recommend the use of individualized diploid genomes over reference sequence alignment for all applications of high-throughput sequencing technology in genetically diverse populations.Entities:
Keywords: Diversity Outbred (DO); Diversity Outbred mice; MPP; Multiparent Advanced Generation Inter-Cross (MAGIC); Multiparental populations; QTL mapping; RNA-seq; expression QTL; haplotype reconstruction; high-density genotyping; mixed models
Mesh:
Year: 2014 PMID: 25236449 PMCID: PMC4174954 DOI: 10.1534/genetics.114.165886
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Annotated SNPs and indels segregating among the eight CC/DO founder strains
| Genome | Transcriptome | |||||
|---|---|---|---|---|---|---|
| Strain | SNPs | Insertions | Deletions | SNPs | Insertions | Deletions |
| A/J | 4,198,324 | 401,264 | 422,424 | 104,358 | 7,846 | 8,394 |
| 129S1/SvImJ | 4,458,004 | 428,081 | 458,055 | 109,598 | 8,154 | 8,875 |
| NOD/ShiLtJ | 4,323,530 | 389,285 | 407,801 | 108,881 | 7,599 | 8,168 |
| NZO/HILtJ | 4,492,372 | 396,393 | 410,118 | 108,026 | 7,551 | 7,905 |
| CAST/EiJ | 17,673,726 | 1,359,607 | 1,367,482 | 410,805 | 26,975 | 27,474 |
| PWK/PhJ | 17,202,436 | 1,247,627 | 1,388,258 | 411,647 | 25,226 | 27,842 |
| WSB/EiJ | 6,045,573 | 588,061 | 608,945 | 146,495 | 10,966 | 11,559 |
| All strains | 31,593,523 | 2,963,385 | 3,213,340 | 746,993 | 56,354 | 61,204 |
For each of the founder strains, the cumulative numbers of high-quality SNPs, insertions, and deletions that differ from the NCBIM37 genome and transcriptome are listed. Transcript boundaries are based on the Ensembl v67 annotation. The bottom row tabulates the total number of variants segregating among the founder strains that differ from NCBIM37.
Figure 1Flowchart showing the RNA-seq analysis pipeline and Seqnature tool. A Diversity Outbred mouse sample is shown as an example. Genomic DNA is genotyped at 7664 SNPs, which are then input into a hidden Markov model to impute 36-state founder strain genotypes. Seqnature (highlighted in blue) infers genotype transitions by calculating the smallest number of recombinations necessary to produce the observed 36-state patterns and outputs two 8-state genotype transition files. Seqnature constructs two haploid genomes by incorporating founder strain SNPs and indels into the reference genome according to the genotype transition files and creates two gene annotation files with adjusted coordinates (to offset insertions and deletions) and founder strain appended to feature identifiers. The two genomes and annotation files are merged, and then individualized diploid isoform sequences (individualized transcriptome) are constructed and indexed. Sample RNA-seq data are aligned with Bowtie to the individualized transcriptome, and allele-, isoform-, and gene-level abundances are estimated using an EM algorithm (RSEM) to resolve multimapped reads.
Summary of read alignment in the simulated CAST and DO data
| Aligned to CAST | |||||||
|---|---|---|---|---|---|---|---|
| CAST gene-level summary | Read class | Incorrect unique reads | Incorrect multireads | Unmapped reads | Correct multireads | Correct unique reads | Total |
| Aligned to NCBIM37 | Incorrect unique reads | 1,076 | 15,948 | ||||
| Incorrect multireads | 531 | 6,780 | |||||
| Unmapped reads | 1,709,356 | 2,123,597 | |||||
| Correct multireads | 976,821 | 999,795 | |||||
| Correct unique reads | 6,843,309 | 6,853,803 | |||||
| Total | 1,107 | 536 | 1,709,657 | 1,011,723 | 7,276,900 | 9,999,923 | |
The simulated reads were aligned to the NCBIM37 and individualized transcriptomes, and alignments were collapsed to the genomic location. Reads that improve by alignment to the individualized transcriptomes are in italics, with those that improve by two or more categories in underlined italics. Reads that improve by alignment to NCBIM37 are in boldface type, with those that improve by two or more categories in underlined boldface type. Reads on the diagonal align equivalently by both strategies.
Comparison of gene abundance estimates for simulated CAST and DO RNA-seq data after alignment to NCBIM37 reference and individualized transcriptomes
| No. genes with estimates | ||||||
|---|---|---|---|---|---|---|
| Aligned to | Mismatches allowed | Genes above threshold | <5% | <10% | >10% | >50% |
| CAST reads | ||||||
| NCBIM37 | 3 | 12,186 | 4,319 | 8,217 (67) | 3,969 (33) | 485 |
| CAST | 3 | 12,108 | 8,718 | 10,544 (87) | 1,542 (13) | 174 |
| NCBIM37 | 0 | 12,137 | 1,465 | 2,925 (24) | 9,212 (76) | 1,576 |
| CAST | 0 | 12,059 | 7,023 | 9,568 (79) | 2,491 (21) | 152 |
| DO reads | ||||||
| NCBIM37 | 3 | 11,899 | 7,260 | 9,805 (82) | 2,094 (18) | 230 |
| DO IRG | 3 | 11,863 | 8,569 | 10,471 (88) | 1,380 (12) | 161 |
| NCBIM37 | 0 | 11,879 | 2,309 | 4,810 (40) | 7,069 (60) | 530 |
| DO IRG | 0 | 11,857 | 7,110 | 9,575 (81) | 2,262 (19) | 164 |
Alignment of simulated CAST reads to the individualized CAST transcriptome results in twice as many gene estimates (N = +4399) that fall within 5% of ground-truth value and fewer than half as many gene estimates (N = −2427) that deviate >10% from the ground truth. Gene estimates in the simulated DO sample are also improved by read alignment to the individualized transcriptome, yielding 18% more estimates (N = +1309) within 5% of the ground-truth value and 34% fewer estimates (N = −714) that deviate >10% from the ground truth.
Figure 2Read alignment to an individualized diploid transcriptome yields accurate allelic abundance estimates. Estimated allele frequency (y-axis) is plotted against the ground-truth allele frequency (x-axis) for 5270 genes in the simulated data set of 10 million DO reads that were robustly expressed (sum of allele counts ≥100) and had at least 5 uniquely aligned reads that differentiated the two gene alleles. Allele-level gene abundances are strongly correlated to the ground-truth values (r = 0.82), with the estimated frequency of the lower-expressed allele differing on average by <7% (median = 4%) from the ground-truth value. Most genes have a ground truth and estimated allele frequency near 0.5 (red and orange regions), and some estimates show absolute allele-specific expression (i.e., 0 or 1) while the ground truth is somewhere in between (horizontal lines of dots at top and bottom).
Figure 3Gene-level abundance estimates in real data are improved by the individualized alignment strategy. (A) Gene-level abundance estimates are plotted for one CAST sample after alignment to the NCBIM37 (x-axis) and CAST transcriptomes (y-axis). Points are colored based on the difference between alignments and the results of the simulation study (n = 11,964 total genes). Gray circles denote genes with abundance estimates that differ by <10% between alignment strategies (n = 8980). Green denotes genes that differ in the real data by >10% between alignment strategies and for which the alignment to CAST improved the abundance estimate in the simulation study (n = 2242). Red denotes genes that differ by >10% in the real data and for which alignment to NCBIM37 improved the abundance estimate in the simulation study (n = 439). Black denotes genes that differ by >10% in the real data but for which the two alignment strategies yielded the same abundance estimates in the simulation study (n = 71). (B) The differences in gene-level abundance estimates between alignment strategies in the real CAST data are plotted as a stacked histogram. The percentage of difference between CAST and NCBIM37 alignments is plotted on the x-axis, and the total number of genes with that difference is plotted on the y-axis. The same coloring conventions are used as in A. White bars denote genes that differ by >10% in the real data but that were not expressed above threshold in the simulated data set (n = 232). Differences were scaled to a maximum value of 100%. (C) Gene-level abundance estimates are plotted for one DO sample after read alignment to the NCBIM37 (x-axis) and individualized transcriptomes (y-axis). A total of 714 genes in the real data differ by >10% between alignment strategies (n = 714/12,248), of which 432 gene estimates were improved by alignment to the individualized transcriptome in the simulation study (green circles), 124 were improved by alignment to NCBIM37 in the simulation (red circles), and 16 yielded the same gene estimate by both alignment strategies in the simulation study (black circles). (D) The difference in gene-level abundance estimates between alignment strategies in the real DO data are plotted as a stacked histogram. The percentage of difference between DO and NCBIM37 alignment is plotted on the x-axis, and the total number of genes with that difference is plotted on the y-axis. The same coloring conventions are used as in C. White bars denote genes that differ by >10% in the real data but that were not expressed above threshold in the simulation study (n = 142).
Comparison of gene expression QTL (eQTL) from simulated DO RNA-seq data aligned to individualized or GRCm38 reference transcriptome
| Assignment | Aligned to individualized | Aligned to GRCm38 |
|---|---|---|
| Correct assignment | ||
| Local eQTL ( | 6,349 | 5,973 |
| Distant eQTL ( | 540 | 438 |
| No eQTL ( | 7,889 | 6,839 |
| Total correct (%) | 14,778 (98.3) | 13,250 (88.2) |
| Incorrect assignment | ||
| False negative | 128 | 508 |
| False positive, local | 64 | 1,086 |
| False positive, distant | 57 | 183 |
| Total incorrect (%) | 249 (1.7) | 1,777 (11.8) |
Thirty million 100-bp RNA-seq reads were simulated from 277 DO genomes. eQTL mapping on the simulated gene expression values yields 7033 significant associations, including 6437 local and 596 distant eQTL, as well 7994 genes with no significant eQTL. Alignment of simulated DO reads to individualized transcriptomes improves the accuracy of eQTL mapping relative to alignment to GRCm38.
Figure 4Alignment of Diversity Outbred mice to individualized transcriptomes (DO IRGs) reveals significant local eQTL and reduces the number of spurious pseudogene eQTL. (A) An example of a local eQTL unmasked by alignment to individualized transcriptomes. Expression estimates for Hebp1 do not appear linked to local genotype when reads are aligned to the common reference (red line). Accounting for individual genetic variation in the alignment step uncovers a strong local eQTL with a peak centered at the gene (blue line; black arrow denotes gene location). (B) Venn diagram showing the overlap of local eQTL from the individualized or common reference alignment strategy. Local eQTL are identified for a majority of expressed genes by one or both alignment strategies. Alignment to individualized transcriptomes (DO IRGs) identifies 2900 novel local associations. Even in the case of the 6097 local eQTL that are identified as significant by both alignments (overlapping region), LOD significance scores are generally higher after alignment to individualized transcriptomes (y-axis in scatterplot) compared to NCBIM37 (x-axis). (C) Alignment to individualized transcriptomes reduces the number of spurious distant eQTL at pseudogenes. Accounting for segregating founder strain polymorphisms in the parent protein-coding gene Rps12 ablates the distant Chr 10 eQTL peak for the pseudogene Rps12-ps2 (compare blue to red lines) located on Chr 14.
Figure 5Liver expression patterns observed in the DO founder strains suggest that novel local eQTL are real. (A) Alignment to individualized transcriptomes (DO IRGs, blue line) reveals a strong local eQTL for the lincRNA Gm12976 on Chr 4. The eight founder strain coefficients inferred from the additive mapping model are plotted in the inset and show that DO animals that derive this region of Chr 4 from the 129S1/SvImJ strain have higher expression of Gm12976. (B) Allele-level abundance estimates in the DO population show that the 129S1 allele of Gm12976 is high expressing, confirming that the local eQTL is due to cis-acting variation. Founder strain origin is listed on the x-axis, and Gm12976 allelic abundance (upper quartile normalized, square-root transformed) is plotted on the y-axis. (C) This inferred DO strain pattern of Gm12976 expression is concordant with that observed in the eight founder strains. Strains are listed on the x-axis, and Gm12976 abundance (upper quartile normalized, square-root transformed) is plotted on the y-axis.