The uniqueome: a mappability resource for short-tag sequencing.

Ryan Koehler, Hadar Issac, Nicole Cloonan, Sean M Grimmond.

Abstract

SUMMARY: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here, we present the 'uniqueome', a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data are available for human, mouse, fly and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data. AVAILABILITY: Files, scripts and supplementary data are available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/.

Year: 2010 PMID: 21075741 PMCID: PMC3018812 DOI: 10.1093/bioinformatics/btq640

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Massively parallel short-tag (25–100 nt) sequencing technologies are enabling a large repertoire of genomic and genetic research due to the depth of coverage that can be achieved in a cost-effective manner. Although short tags are most informative if they can be aligned uniquely to a reference genome, repetitive elements are not randomly distributed throughout the genome (Campbell ); therefore, the proportion and location of uniquely mappable short sequences will also be non-randomly distributed. This presents a specific problem where quantitative comparison between two or more genomic regions is required (such as RNAseq or CNVseq). For any quantitative analysis, it is desirable to understand the boundaries of the unique genome (the uniqueome), so that the amount of uniquely mappable sequence can be used to normalize tag counts. Uniqueomes have been studied comprehensively for small genomes with both long (Chaisson ) and short (Whiteford ) sequencing tags. For mammalian genomes, where comprehensive studies can be computationally prohibitive, the problem has been tackled with simulation (Campbell ), region-specific computation (Robertson ) or computation without mismatches (Rozowsky ). Counterintuitively, considering only tags that align uniquely without mismatches does not resolve the problem of ambiguous mapping. In cases where the error rate of the sequencing platform exceeds the number of mismatches allowed during alignment, false positive uniquely aligning tags will occur (Supplementary Figure S1). It is therefore important to compute the uniqueome allowing for at least the number of errors likely to be present in the data. We have used the exhaustive alignment feature of ISAS (Imagenix, USA) to systematically generate uniqueome data for human (hg18 and hg19), mouse (mm9), worm (ce6) and fly (dm3) genomes in both color-space and nucleotide-space. 
Ungapped alignments were performed independently for tag lengths between 25 and 90 nt with varying numbers of mismatches, in both nucleotide-space and color-space (Supplementary Material). To visualize the results, non-unique genomic regions are formatted as bigBED and bigWig files, and these can be loaded directly into the UCSC genome browser (Kuhn ). The BED files are also compatible with large-scale genomic analysis using the Galaxy interface (Goecks ). Figure 1 illustrates the utility of uniqueome in identifying problematic alignment areas in an RNAseq dataset (Guttman ).
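To make the role of the mismatch allowance concrete, the following is a minimal brute-force sketch on a toy sequence. It is not the ISAS algorithm: the function names are hypothetical, only the forward strand is considered, and the all-against-all comparison is far too slow for a real genome. It only illustrates that a start site counted as unique at zero mismatches can stop being unique once mismatches are allowed.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def unique_start_sites(genome, tag_length, max_mismatches):
    """Return one boolean per start site.

    A start site is flagged True when the tag beginning there has no other
    placement in the sequence within `max_mismatches` substitutions.
    Assumption: forward strand only, ungapped comparison.
    """
    n_sites = len(genome) - tag_length + 1
    tags = [genome[i:i + tag_length] for i in range(n_sites)]
    flags = []
    for i, tag in enumerate(tags):
        # Quadratic scan over all other start sites (toy scale only).
        has_other_hit = any(
            j != i and hamming(tag, other) <= max_mismatches
            for j, other in enumerate(tags)
        )
        flags.append(not has_other_hit)
    return flags


def unique_start_fraction(genome, tag_length, max_mismatches):
    """Proportion of start sites that are unique at the given threshold."""
    flags = unique_start_sites(genome, tag_length, max_mismatches)
    return sum(flags) / len(flags)


if __name__ == "__main__":
    toy = "ACGTACGTTT"
    for mismatches in (0, 1, 2):
        print(mismatches, unique_start_fraction(toy, 4, mismatches))
```

On this toy sequence the unique fraction falls as the mismatch allowance grows, which is the effect that motivates computing the uniqueome at the error level expected in the data.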

Fig. 1.

Color-space (CS-50-5) and nucleotide-space (NS-50-2) uniqueome plots visualized alongside RNAseq data. The same 50mer RNAseq tags were aligned using several specialized short-read aligners in both nucleotide-space (red) and color-space (green). The yellow region highlights an area with no uniqueome coverage (confirmed by BLAT as a multimapping region), where tags have been falsely declared as ‘uniquely mapping’ by the various aligners. No repetitive elements were detected by RepeatMasker. See Supplementary Material for details.

Table 1 and Supplementary Tables S1–S4 describe the proportion of unique start sites and unique coverage for different genomes and different tag lengths in both nucleotide-space and color-space. Interestingly, increasing the length of the tag beyond 50 bp does little to overcome redundancy issues in mammalian genomes, suggesting that short-read technologies do not need to progress significantly beyond their current lengths to achieve optimum utility in fragment datasets.

Table 1.

Proportions of unique start sites for nucleotide-space short tag alignments

Species	25 (1) (%)	30 (1) (%)	35 (1) (%)	50 (2) (%)	60 (3) (%)	75 (4) (%)	90 (5) (%)
Homo sapiens^a	66.0	70.9	74.1	76.9	77.5	79.3	80.8
Mus musculus^b	69.9	74.4	77.1	79.1	79.4	80.7	81.7
Caenorhabditis elegans^c	85.3	87.7	89.0	89.8	89.9	90.6	91.1
Drosophila melanogaster^d	67.5	68.4	69.0	69.2	69.2	69.5	69.8

Columns shown are length of tag matched; numbers in parentheses represent the number of mismatches allowed.

^a Build hg19.

^b Build mm9.

^c Build ce6.

^d Build dm3.

To better understand the effect of mapping uniqueness on RNAseq quantification, we determined the proportion of uniquely mappable positions in the RefSeq set of genes (Pruitt ) for 50mers in both color-space and nucleotide-space. Figure 2 shows a wide distribution of off-diagonal points reflecting the variability in the uniqueome content of RefSeq genes. Both the color-space and nucleotide-space plots reveal a group of RefSeq transcripts >5000 nt long but with fewer than 10% uniquely mapping tags. This group of genes is highly enriched for large multicopy gene families, such as HLA. The uniqueness of RefSeq exon–exon junctions is described in Supplementary Tables S5 and S6.
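A per-gene unique proportion of this kind can be approximated from the non-unique regions distributed as BED intervals. The sketch below is an illustration under stated assumptions, not the authors' script: coordinates are 0-based half-open as in BED, all intervals lie on one chromosome, uniqueness is measured as base coverage rather than as tag start sites, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def unique_fraction(exons, non_unique):
    """Fraction of exonic bases NOT covered by non-unique intervals.

    exons      -- (start, end) exon intervals of one gene
    non_unique -- (start, end) non-unique intervals on the same chromosome
    """
    exons = merge_intervals(exons)
    masked_regions = merge_intervals(non_unique)
    gene_length = sum(end - start for start, end in exons)
    if gene_length == 0:
        raise ValueError("gene has no exonic bases")
    masked = 0
    for ex_start, ex_end in exons:
        for nu_start, nu_end in masked_regions:
            # Length of the overlap between this exon and this masked region.
            masked += max(0, min(ex_end, nu_end) - max(ex_start, nu_start))
    return (gene_length - masked) / gene_length


if __name__ == "__main__":
    exons = [(100, 200), (300, 400)]
    non_unique = [(150, 180), (170, 210), (390, 450)]
    print(unique_fraction(exons, non_unique))
```

Merging both interval sets first prevents overlapping BED records from being counted twice. For genome-scale use, a sorted sweep or an interval-tree library would replace the nested loop.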

Fig. 2.

A mirror image plot showing the relationship between the length of a gene and the unique length of a gene for color-space (red) and nucleotide-space (blue). The uniqueomes of human RefSeq genes (release 39) using hg19 coordinates were investigated for 50mer tags using two mismatches in nucleotide-space and five mismatches in color-space.

Overall, the effect of non-unique short sequences in genes can be significant. More than 25% of RefSeq genes contain at least 10% of non-unique sequence when mapped as 50mers. Given that almost 40% of genes in mammalian genomes have arisen due to gene duplication (Zhang, 2003), this is not a surprising result. However, unless this is specifically normalized for in RNAseq experiments, this could bias both differential expression and gene set enrichment analyses. We have examined the utility of normalization using the uniqueome and compared it to both raw tag counts and non-unique tag rescue, using previously published sequencing and microarray data from the same samples (Cloonan ). Table 2 shows an improvement in the correlation of RNAseq to array data when using tag counts normalized to the proportion of unique sequence in each gene. Although the correlation improvements are lower than using a rescue approach, there is no additional computational time required to achieve this improvement, whereas significant CPU time is required for rescue (6 CPU hours using RNA-MATEv1.1; see Supplementary Material).

Table 2.

Strategies to deal with multimapping tags and their correlation to microarray data from the same RNA sample

Method	Pearson	95% confidence interval
Raw tag counts (RPKM)	0.38	0.35–0.41
Non-unique tag rescue counts (RPKM)	0.46	0.43–0.49
Uniqueome normalized tag counts (RPKM)	0.50	0.47–0.52
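The uniqueome normalization listed in Table 2 can be sketched as a change to the length term of RPKM. This is one plausible reading of "normalized to the proportion of unique sequence in each gene", not a formula taken from the paper: the gene length is scaled by its unique fraction, and the function and parameter names are hypothetical.

```python
def rpkm(tag_count, length_nt, total_mapped_tags):
    """Reads per kilobase of transcript per million mapped tags."""
    return 1e9 * tag_count / (length_nt * total_mapped_tags)


def uniqueome_rpkm(tag_count, length_nt, unique_fraction, total_mapped_tags):
    """RPKM computed against the uniquely mappable length only.

    Assumption: effective length = length_nt * unique_fraction.
    Returns None for genes with no uniquely mappable sequence, since their
    expression cannot be estimated from uniquely mapping tags.
    """
    if not 0.0 <= unique_fraction <= 1.0:
        raise ValueError("unique_fraction must lie in [0, 1]")
    if unique_fraction == 0.0:
        return None
    return rpkm(tag_count, length_nt * unique_fraction, total_mapped_tags)


if __name__ == "__main__":
    # A 2000 nt gene with 500 unique tags out of 10 million mapped tags.
    print(rpkm(500, 2000, 10_000_000))
    # The same gene if only half of its sequence is uniquely mappable.
    print(uniqueome_rpkm(500, 2000, 0.5, 10_000_000))
```

Because the adjustment is a single division per gene, it adds essentially no compute on top of counting tags.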

Finally, the uniqueome allows higher confidence in mutation detection (e.g. cancer resequencing), where mis-mapping can confound SNP calling algorithms. This is a particular problem faced by users of paired-end or mate-pair data, where the mapping position of a multimapping tag is rescued based on its pair which uniquely maps. It is important to note that while this rescue can lead to improved levels of coverage (Bainbridge ), it does not increase the uniquely mapping proportion of the genome, and can lead to the misplacement of tags and false positive variant calls (Supplementary Figure S2). The uniqueome can be used to identify these regions of low confidence, independently of the aligner used to generate the data, as illustrated in Figure 1. Although described here as a resource for short-tag sequencing applications, the utility of this resource extends beyond this theme. Primer design, comparative genomics and microarray probe design would also derive benefit from this resource. A PDF tutorial on using the uniqueome with Galaxy is provided (Supplementary Material). The ISAS uniqueome aligner is freely available, and a PDF tutorial on its use is provided (Supplementary Material).

11 in total

1. Fragment assembly with short reads.

Authors: Mark Chaisson; Pavel Pevzner; Haixu Tang
Journal: Bioinformatics Date: 2004-04-01 Impact factor: 6.937

2. Stem cell transcriptome profiling via massive-scale mRNA sequencing.

Authors: Nicole Cloonan; Alistair R R Forrest; Gabriel Kolle; Brooke B A Gardiner; Geoffrey J Faulkner; Mellissa K Brown; Darrin F Taylor; Anita L Steptoe; Shivangi Wani; Graeme Bethel; Alan J Robertson; Andrew C Perkins; Stephen J Bruce; Clarence C Lee; Swati S Ranade; Heather E Peckham; Jonathan M Manning; Kevin J McKernan; Sean M Grimmond
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

3. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs.

Authors: Mitchell Guttman; Manuel Garber; Joshua Z Levin; Julie Donaghey; James Robinson; Xian Adiconis; Lin Fan; Magdalena J Koziol; Andreas Gnirke; Chad Nusbaum; John L Rinn; Eric S Lander; Aviv Regev
Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908

4. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing.

Authors: Peter J Campbell; Philip J Stephens; Erin D Pleasance; Sarah O'Meara; Heng Li; Thomas Santarius; Lucy A Stebbings; Catherine Leroy; Sarah Edkins; Claire Hardy; Jon W Teague; Andrew Menzies; Ian Goodhead; Daniel J Turner; Christopher M Clee; Michael A Quail; Antony Cox; Clive Brown; Richard Durbin; Matthew E Hurles; Paul A W Edwards; Graham R Bignell; Michael R Stratton; P Andrew Futreal
Journal: Nat Genet Date: 2008-04-27 Impact factor: 38.330

5. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls.

Authors: Joel Rozowsky; Ghia Euskirchen; Raymond K Auerbach; Zhengdong D Zhang; Theodore Gibson; Robert Bjornson; Nicholas Carriero; Michael Snyder; Mark B Gerstein
Journal: Nat Biotechnol Date: 2009-01-04 Impact factor: 54.908

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

7. Whole exome capture in solution with 3 Gbp of data.

Authors: Matthew N Bainbridge; Min Wang; Daniel L Burgess; Christie Kovar; Matthew J Rodesch; Mark D'Ascenzo; Jacob Kitzman; Yuan-Qing Wu; Irene Newsham; Todd A Richmond; Jeffrey A Jeddeloh; Donna Muzny; Thomas J Albert; Richard A Gibbs
Journal: Genome Biol Date: 2010-06-17 Impact factor: 13.583

8. An analysis of the feasibility of short read sequencing.

Authors: Nava Whiteford; Niall Haslam; Gerald Weber; Adam Prügel-Bennett; Jonathan W Essex; Peter L Roach; Mark Bradley; Cameron Neylon
Journal: Nucleic Acids Res Date: 2005-11-07 Impact factor: 16.971

9. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

10. The UCSC Genome Browser Database: update 2009.

Authors: R M Kuhn; D Karolchik; A S Zweig; T Wang; K E Smith; K R Rosenbloom; B Rhead; B J Raney; A Pohl; M Pheasant; L Meyer; F Hsu; A S Hinrichs; R A Harte; B Giardine; P Fujita; M Diekhans; T Dreszer; H Clawson; G P Barber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2008-11-07 Impact factor: 16.971
