René L Warren, Robert A Holt.
Abstract
As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
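The read recruitment step described in the abstract can be illustrated with a minimal Python sketch. It is not TASR's code: the function names are hypothetical, the word length of 15 is borrowed from the "15-mer read recruitment" mentioned in the Figure 2 caption, and the stringent assembly of the recruited reads is not shown.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def target_words(target, k=15):
    """Collect every k-mer (short word) of the target on both strands."""
    words = set()
    for strand in (target.upper(), reverse_complement(target)):
        for i in range(len(strand) - k + 1):
            words.add(strand[i:i + k])
    return words


def recruit_reads(reads, target, k=15):
    """Keep reads that share at least one exact k-mer with the target.

    Assumption: a single shared word is enough to recruit a read.
    """
    words = target_words(target, k)
    recruited = []
    for read in reads:
        read = read.upper()
        if any(read[i:i + k] in words for i in range(len(read) - k + 1)):
            recruited.append(read)
    return recruited
```

Because only reads sharing a word with the target are kept, the expensive assembly step sees a small fraction of the data set, which is the source of the speed advantage claimed for localized assembly.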
Year: 2011 PMID: 21589938 PMCID: PMC3092772 DOI: 10.1371/journal.pone.0019816
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Detection of true positive versus false positive SNVs in lobular breast cancer.
TASR was run incrementally on up to 2 billion 51 nt and 76 nt NGS whole-genome shotgun reads from a lobular breast cancer (LBC) sample, providing 5- to 36-fold coverage of the 3 Gbp human genome. As targets, we used 51 nt sequences: 31 containing an SNV detected by NGS read alignment and confirmed by Sanger sequencing (true positive), 31 matching sequences containing the reference base instead (reference), and 31 containing an SNV detected by NGS read alignment but not confirmed by Sanger sequencing (false positive). Although close to twice as much WGSS data had been generated from the LBC sample, a fraction of it (∼19-fold coverage) is sufficient to confirm most (68%) of the true positive SNVs.
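The fold-coverage figures quoted in this caption follow the usual definition of coverage: total sequenced bases divided by genome size. A minimal Python sketch of that arithmetic is given below. The function name and the example batch are hypothetical, and since the caption does not give the split between 51 nt and 76 nt reads, the example does not attempt to reproduce the 5- to 36-fold range.

```python
def fold_coverage(read_batches, genome_size):
    """Expected fold coverage: total sequenced bases divided by genome size.

    read_batches is an iterable of (number_of_reads, read_length_nt) pairs,
    which allows mixing read lengths such as 51 nt and 76 nt.
    """
    total_bases = sum(n_reads * read_len for n_reads, read_len in read_batches)
    return total_bases / genome_size


HUMAN_GENOME_BP = 3_000_000_000  # 3 Gbp, as stated in the caption

# Hypothetical batch: one billion 51 nt reads (not a figure from the paper).
print(fold_coverage([(1_000_000_000, 51)], HUMAN_GENOME_BP))  # 17.0
```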
Figure 2. De novo assembly of prostate carcinoma RNA-seq data.
Using a TMPRSS2:ERG target sequence that differs from a TMPRSS2 target by a single base (underlined), TASR generated a contig that captures 18 ERG-specific bases fused to exon 1 of TMPRSS2 in a prostate adenocarcinoma sample (SRA accession SRX027125). These bases were not specified in the target sequence and were thus unknown from the original hypothesis. A total of 121 reads span the TMPRSS2:ERG fusion coordinate (underlined base). Higher base coverage is expected in the middle of the contig, where 15-mer read recruitment reaches a maximum for both strands and is unaffected by the limiting effect of the minimum overlap (-m) option at the edges of the sequence target. This highlights the importance of using a sequence target that is sufficiently long, at least as long as the input reads. From this result, it is very likely that the prostate adenocarcinoma sample contains an admixture of TMPRSS2 transcripts, including the TMPRSS2{NM_005656.2}:r.1_71_ERG{NM_004449.3}:r.226_3097 fusion, and that these vary in abundance, as reflected by the high depth of coverage.
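The expectation that coverage peaks in the middle of the contig can be illustrated with an idealized model: count, for each position, how many error-free read placements overlap the target by at least one full 15-mer and would therefore be recruited. The Python sketch below is an assumption-laden illustration, not TASR's logic. The function name is hypothetical, reads are assumed to tile every possible start position once, and the effect of the -m option during assembly is not modeled.

```python
def recruitable_depth(target_len, read_len, k=15):
    """Count, per position, the read placements sharing >= k bases with the target.

    Positions are relative to the target start (0 .. target_len - 1);
    negative positions and positions >= target_len are flanking sequence.
    Assumes target_len >= k and read_len >= k.
    """
    depth = {}
    # A read starting at `start` overlaps the target by at least k bases
    # only if -(read_len - k) <= start <= target_len - k.
    for start in range(-(read_len - k), target_len - k + 1):
        for pos in range(start, start + read_len):
            depth[pos] = depth.get(pos, 0) + 1
    return depth


depth = recruitable_depth(target_len=51, read_len=51, k=15)
print(depth[25], depth[0], depth[-36])  # 51 37 1
```

In this model the depth falls off linearly from the target center into the flanks, and a target much shorter than the reads admits fewer recruitable placements overall, which is in line with the caption's advice on target length.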
Re-identification of human genome variations from the 1000 Genomes pilot project.
| Genome variations | | | Read count(s) over SNP or junction | | |
| Type | Identifier (dbSNP or 1000 Genomes ID) | Variation | Mother | Father | Daughter |
| SNP | rs1736565 | C/T | 0/7 (T/T) | 11/9 (C/T) | 0/15 (T/T) |
| SNP | rs6443930 | C/G | 7/5 (C/G) | 12/0 (C/C) | 14/0 (C/C) |
| SNP | rs2645341 | C/T | 0/5 (T/T) | 0/17 (T/T) | 0/9 (T/T) |
| SNP | rs13191323 | C/T | 0/89 (T/T) | 0/11 (T/T) | 0/15 (T/T) |
| SNP | rs1965370 | C/G | 0/11 (G/G) | 0/14 (G/G) | 0/9 (G/G) |
| SNP | rs10862125 | C/T | 0/12 (T/T) | 0/11 (T/T) | 0/14 (T/T) |
| SNP | rs6511602 | C/T | | 0/4 (T/T) | 0/7 (T/T) |
| SNP | rs2245425 | G/A | 2/2 (G/A) | 9/2 (G/A) | 12/7 (G/A) |
| SNP | rs4509745 | C/T | 5/6 (C/T) | 8/10 (C/T) | 0/21 (T/T) |
| SNP | rs7004273 | G/A | 0/10 (A/A) | 0/13 (A/A) | 0/15 (A/A) |
| Indel | rs58432514 | −/G | 0/1 (G/G) | 0/6 (G/G) | 0/15 (G/G) |
| Indel | rs11450450 | −/C | 0/2 (C/C) | 0/10 (C/C) | 0/19 (C/C) |
| Indel | rs35933224 | −/TTTG | | 17/46 (−/TTTG) | 10/0 (−/−) |
| Indel | rs140511 | −/C | 0/1 (C/C) | 0/4 (C/C) | 0/8 (C/C) |
| Indel | rs11382443 | −/A | 9/0 (−/−) | 6/8 (−/A) | 5/3 (−/A) |
| Indel | rs57304020 | −/G | 0/3 (G/G) | 0/2 (G/G) | 0/5 (G/G) |
| Indel | rs3078330 | −/TA | 0/2 (TA/TA) | 0/11 (TA/TA) | 0/10 (TA/TA) |
| Indel | rs11303415 | −/C | 1/0 (−/−) | | 2/3 (−/C) |
| Indel | rs59393160 | −/GT | | 0/14 (GT/GT) | 0/15 (GT/GT) |
| Indel | rs35117663 | −/AG | 5/2 (−/AG) | 23/0 (−/−) | 11/0 (−/−) |
| SV | P2_M_061510_1_103 | −/Δ175G | 4/0 | 4/8 | 12/1 |
| SV | P2_M_061510_1_308 | −/Δ58A | 0/3 | 8/0 | 0/10 |
| SV | P2_M_061510_1_533 | −/Δ67G | 11/0 | 7/11 | 10/0 |
| SV | P2_M_061510_1_198 | −/Δ66T | 0/3 | 5/5 | 11/6 |
| SV | P2_M_061510_2_234 | −/Δ103 | 0/0 | 7/3 | 11/6 |
| SV | P2_M_061510_2_875 | −/Δ60C | 0/0 | 0/0 | 0/0 |
| SV | P2_M_061510_2_606 | −/Δ76C | 0/12 | 15/4 | 10/14 |
| SV | P2_M_061510_2_578 | −/Δ63C | 5/0 | 9/0 | 18/0 |
| SV | P2_M_061510_2_858 | −/Δ63ATCATA | 0/4 | 1/0 | 9/8 |
| SV | P2_M_061510_2_210 | −/Δ61CTCAT | 7/10 | 7/0 | 14/0 |
SNP: single nucleotide polymorphism; SV: structural variation.
Re-identified by TASR. Only reads spanning variation within 5 bases of read start/end were counted.
Genotypes, shown in parentheses after the read counts, were determined by the 1000 Genomes Project. This information is not available for SVs.
A/B: A = coverage of reads over the non-deleted portion; B = coverage over the deletion breakpoint.
x = ACTAGTGCATTTCAATAATCATG.
Underlined are discrepancies between TASR and the genotype calls, all of which are due to insufficient read coverage.
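To make the comparison between read counts and genotype calls concrete, the Python sketch below turns a pair of read counts into a naive presence/absence genotype. This is an illustration only, not the procedure used by TASR or the 1000 Genomes Project. The function name and the `min_reads` threshold are hypothetical, and it assumes the two counts are ordered like the two alleles in the variation column.

```python
def alleles_observed(first_reads, second_reads, first="C", second="T", min_reads=1):
    """Return a naive genotype string from the reads supporting each allele.

    An allele counts as present when at least `min_reads` reads support it.
    Returns None when neither allele reaches the threshold (no call).
    """
    has_first = first_reads >= min_reads
    has_second = second_reads >= min_reads
    if has_first and has_second:
        return f"{first}/{second}"
    if has_first:
        return f"{first}/{first}"
    if has_second:
        return f"{second}/{second}"
    return None  # insufficient read coverage to call either allele


# rs1736565 (C/T): Father 11/9 and Mother 0/7 in the table above.
print(alleles_observed(11, 9, "C", "T"))  # C/T
print(alleles_observed(0, 7, "C", "T"))   # T/T
```

With low counts such a call is fragile: a heterozygous site covered by only a few reads can easily show one allele alone, which is the kind of coverage-driven discrepancy the last footnote refers to.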