Cindy G Santander, Philippe Gambron, Emanuele Marchi, Timokratis Karamitros, Aris Katzourakis, Gkikas Magiorkinis.
Abstract
Advances in high-throughput genomics have revealed much about the human genome, highlighting the importance of variation between individuals and its contribution to disease. Although numerous software tools have been developed to make sense of large genomics datasets, a major shortcoming has been their inability to cope with repetitive regions, specifically to validate structural variants and accordingly assess their role in disease. Here we describe our program STEAK, a massively parallel software designed to detect chimeric reads in high-throughput sequencing data for a broad range of applications, such as identifying presence/absence and discovery of transposable elements (TEs) and retroviral integrations. We highlight the capabilities of STEAK by comparing its efficacy in locating HERV-K HML-2 in clinical whole genome projects, target enrichment sequences, and the 1000 Genomes CEU Trio with the performance of other TE and virus detecting tools. We show that STEAK outperforms other software in terms of computational efficiency, sensitivity, and specificity. We demonstrate that STEAK is a robust tool, which allows analysts to flexibly detect and evaluate TE and retroviral integrations in a diverse range of sequencing projects for both research and clinical purposes.
Keywords: HTS; endogenous retroviruses; evolution; mobile element; transposons; virus integration
Year: 2017 PMID: 28948042 PMCID: PMC5597868 DOI: 10.1093/ve/vex023
Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577
Table 1. Software for detecting TEs and viruses in WGS data.
| Software | Detection target | Detection method | Detects in reference? | Requires specific aligning? | Third party tools | Parallelised? | Implementation |
|---|---|---|---|---|---|---|---|
| RetroSeq | Transposable elements | Discordant reads, then split reads | No | No, but must be in BAM | SAMtools (v0.9), bcftools, exonerate, BEDtools | No | Perl |
| Tangram | Transposable elements | Split reads and discordant reads simultaneously | Yes | Yes, MOSAIK | MOSAIK (2.0), zlib, pthread lib | Yes | C, C++ |
| VirusSeq | Viruses | Unmapped reads for general detection; discordant and split reads for integration site detection | No | Yes, MOSAIK | MOSAIK (0.9.0891) | Yes | Perl, C, C++ |
| MELT | Transposable elements | Discordant reads, then split reads | Detects deletions | No, but must be in BAM | Bowtie2 | No | Java |
| VirusFusionSeq (VFS) | Viruses | Unmapped reads for general detection; discordant and split reads for integration site detection | Yes | Yes | BWA, SAMtools, BLAST, CAP3, SSAKE | Partially (BWA portion) | Pipeline (Perl) |
| Tlex2 | Transposable elements | Looks at host and annotated TE flanks. Searches for split reads | Only detects in reference | No | MAQ, SHRIMP, BLAT, RepeatMasker, Phrap | Partially (MAQ) | Pipeline (Perl) |
| STEAK | Transposable elements and viruses | Split reads, retrieves mate for PE data | Yes | No | Aligner of choice, BEDtools | Yes | C, C++ |
Can detect in reference but was not designed to mark presence-absence of reference insertions.
Figure 1. Workflow of STEAK. Processing data: All reads are locally aligned against a TE reference (5′- and 3′-ends and their respective reverse complements) using the Smith-Waterman algorithm, allowing mismatches. Reads that match the TE are trimmed of the matching portion. Information on the trimmed reads and their mates, such as the original mapping positions, MAPQ, and sequence qualities, is kept in STEAK outputs. Detection module: Trimmed reads can be remapped to the human reference either as single-end reads (trimmed read detection) or as paired-end reads (guided detection).
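The trimming step described in Figure 1 can be illustrated with a minimal Python sketch. This is not STEAK's implementation (STEAK is written in C/C++). The scoring values, the `min_score` threshold, the function names, and the rule of keeping the longer unmatched flank are all assumptions made for illustration.

```python
# Illustrative sketch only: scoring, threshold and names are assumptions,
# not STEAK's actual parameters or code.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Best local alignment of read against ref.

    Returns (score, read_start, read_end), with read_end exclusive.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            cell = max(0, score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # Trace back from the best cell until the local alignment starts.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end


def trim_chimeric(read, te_ends, min_score=20):
    """Trim the TE-matching portion from a read, or return None if no match.

    te_ends holds the 5' and 3' TE end sequences; their reverse
    complements are searched as well. The longer unmatched flank is kept
    (an assumption of this sketch).
    """
    targets = list(te_ends) + [reverse_complement(t) for t in te_ends]
    best = None
    for target in targets:
        hit = smith_waterman(read, target)
        if hit[0] >= min_score and (best is None or hit[0] > best[0]):
            best = hit
    if best is None:
        return None
    _, start, end = best
    left, right = read[:start], read[end:]
    return left if len(left) >= len(right) else right
```

In this sketch a read that spans a host/TE junction is reduced to its host-derived flank, which could then be remapped to the human reference; a read with no sufficiently strong TE match yields `None` and would be left untouched.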
Figure 2. Parallelisation of STEAK processing. A 50× coverage simulation of chromosome 1 was processed using our MPI-based software. The speedup as a function of the number of cores shows that the program scales well because few concurrency issues affect it.
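For readers unfamiliar with the metric plotted in Figure 2, speedup is conventionally the single-core run time divided by the run time on a given number of cores, and parallel efficiency is the speedup divided by the core count. The helper names below are hypothetical and no timings from the paper are reproduced.

```python
# Conventional definitions of speedup and parallel efficiency.
# Function names are illustrative; no measured STEAK timings are used.

def speedup(t_serial, t_parallel):
    """Speedup S = T(1) / T(p) for run times given in the same unit."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency E = S / p; 1.0 means ideal linear scaling."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return speedup(t_serial, t_parallel) / n_cores
```

A program that "scales well" in this sense keeps its efficiency close to 1.0 as the number of cores grows.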
Figure 3. Processing in STEAK. (A) The software runs as several processes that read different parts of the input SAM or FASTQ file. (B) In the case of a BAM/CRAM file, because the data is read from a single point, a single process is executed, but it spawns several threads: one that accumulates the data in a buffer and others that process it. TEs are specified in a FASTA file. The trimmed sequence is written to the first FASTQ file. The original sequence of the other read (or the trimmed sequence, if it also matches) is placed in the second FASTQ file.
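One common way to let several processes read different parts of a line-oriented file, as in panel (A), is to assign each worker a byte range and have it start at the first complete line inside that range. The sketch below shows this for SAM input only (one record per line); it is an assumption about how such partitioning could be done, not STEAK's code, and the function names are hypothetical. FASTQ would need extra care because each record spans four lines.

```python
# Illustrative byte-range partitioning of a SAM file among workers.
# Not STEAK's implementation; names and strategy are assumptions.
import os


def byte_ranges(path, n_workers):
    """Split a file into at most n_workers contiguous (start, end) ranges."""
    size = os.path.getsize(path)
    if size == 0 or n_workers < 1:
        return []
    step = -(-size // n_workers)  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def sam_records_in_range(path, start, end):
    """Yield alignment lines whose first byte lies in [start, end)."""
    with open(path, "rb") as handle:
        if start > 0:
            # Step back one byte and read to the next newline. If the range
            # begins exactly at a line start this consumes only the newline;
            # otherwise it skips a line that belongs to the previous range.
            handle.seek(start - 1)
            handle.readline()
        while handle.tell() < end:
            line = handle.readline()
            if not line:
                break
            # Skip blank lines and SAM header lines (which start with "@").
            if line.strip() and not line.startswith(b"@"):
                yield line.rstrip(b"\r\n").decode("ascii")
```

Each `(start, end)` pair could be handed to a separate process, for example one per MPI rank. Because a line is owned by the range containing its first byte, every record is read exactly once regardless of where the boundaries fall.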
Figure 4. Distributions for cluster read depth of known HK2 integrations. Top: Cluster read depth for all known integrations. Middle: Cluster read depth for known integrations within repetitive elements. Bottom: Cluster read depth for known integrations that are not within repetitive elements.
Figure 5. Distribution of twenty simulated HIV integrations within chromosome 1 of the human reference (hg19). The respective genomic coordinates can be found in Table 2.
Table 2. Twenty simulated HIV integrations into human chromosome 1.
| Chromosome (hg19) | Position (hg19) | Supporting reads (original mapping) | Guided detection (post-trimming) | Trimmed read detection (post-trimming) |
|---|---|---|---|---|
| 1 | 4179520 | 35 | 100 | 64 |
| 1 | 10331435 | 34 | 113 | 70 |
| 1 | 16830086 | 30 | 96 | 57 |
| 1 | 18869777 | 31 | 112 | 67 |
| 1 | 20389049 | 36 | 76 | 45 |
| 1 | 54327146 | 34 | 93 | 63 |
| 1 | 57730318 | 34 | 101 | 70 |
| 1 | 99180019 | 24 | 92 | 53 |
| 1 | 116586277 | 31 | 85 | 54 |
| 1 | 144993094 | 16 | 18 | 0 |
| 1 | 149062299 | 36 | 99 | 63 |
| 1 | 165127302 | 43 | 109 | 71 |
| 1 | 170764099 | 15 | 69 | 44 |
| 1 | 188462855 | 43 | 121 | 85 |
| 1 | 191791184 | 25 | 80 | 53 |
| 1 | 197518001 | 34 | 88 | 55 |
| 1 | 213559631 | 26 | 90 | 52 |
| 1 | 219498833 | 37 | 97 | 63 |
| 1 | 223662699 | 38 | 95 | 61 |
| 1 | 231971371 | 26 | 78 | 44 |
At position 144993094 (0 reads in trimmed read detection), trimmed reads alone could not detect the integration.
Table 3. The sequencing samples analysed in benchmarking.
| Sample | Dataset | Coverage |
|---|---|---|
| TCGA-A6-2681-10A-01D-2188-10 (COAD) | TCGA: Colon adenocarcinoma | 50× |
| TCGA-HC-7233-10A-01D-2115-08 (PRAD) | TCGA: Prostate adenocarcinoma | 50× |
| TCGA-NJ-A4YQ-10A-01D-A46J-10 (LUAD) | TCGA: Lung adenocarcinoma | 50× |
| TCGA-BW-A5NQ-10A-01D-A27I-10 (LIHC) | TCGA: Liver hepatocellular carcinoma | 45× |
| NA12878 | 1000 Genomes: CEU pedigree (Offspring) | 50× |
| NA12891 | 1000 Genomes: CEU pedigree (Father) | 50× |
| NA12892 | 1000 Genomes: CEU pedigree (Mother) | 50× |
| HK2_Enrich01 | NA | 500× |
| HK2_Enrich02 | NA | 500× |
All TCGA samples were DNA derived from peripheral blood and were sequenced on an Illumina platform for whole genome sequencing. CEU pedigree samples were derived from immortalised cell lines maintained by the 1000 Genomes Project and were sequenced on an Illumina platform for whole genome sequencing. Target enrichment samples were germline derived and sequenced as described in the methods section.
Figure 6. Comparative performance of HK2 detection in whole genome sequencing. (A) Bar graph displaying known non-reference integrations detected in WGS projects by each tool. (B) Pyramid plot depicting the number of polymorphic integrations detected per genome: the number of present non-reference integrations and the number of absent reference integrations.
Figure 7. Detection performance in target enrichment data. The left facet depicts the presence of non-reference integrations detected by each tool. The right facet depicts the marking of presence of reference integrations. MELT detected 5 reference presences and 137 absences.