Erika M Kvikstad, Paolo Piazza, Jenny C Taylor, Gerton Lunter.
Abstract
BACKGROUND: Transposable elements (TEs) are mobile genetic sequences that randomly propagate within their host's genome. This mobility has the potential to affect gene transcription and cause disease. However, TEs are technically challenging to identify, which complicates efforts to assess the impact of TE insertions on disease. Here we present a targeted sequencing protocol and computational pipeline to identify polymorphic and novel TE insertions using next-generation sequencing: TE-NGS. The method simultaneously targets the three subfamilies that are responsible for the majority of recent TE activity (L1HS, AluYa5/8, and AluYb8/9), thereby obviating the need for multiple experiments and reducing the amount of input material required.
Keywords: Alu; Bioinformatics; LINE1; Next generation sequencing; Polymorphism; Transposable elements
Year: 2018 PMID: 29390960 PMCID: PMC5796560 DOI: 10.1186/s12864-018-4485-4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 TE-NGS sequencing workflow. Enrichment for genomic fragments spanning active TEs and their unique flanking sequence is achieved by several enzymatic steps as described in the main text. First, genomic DNA is sheared, and adapters for sequencing are ligated to the genomic fragments following standard library preparation protocols. Next, a small aliquot (10 ng) of library is used as template for targeted amplification with primers complementary to TE subfamily-specific sequences and to the Illumina Universal PCR (P5) primer. Remaining genomic background fragments and inverted TEs in head-to-head orientation are removed by ssDNA exonuclease digestion after linear PCR amplification with TE-target primers or Illumina Universal primer, respectively. Last, amplification with nested primers that target TE diagnostic bases and contain Illumina i7 index and P7 primer sequences generates full double-stranded dual-adapter libraries containing unique indices for each sample and each TE subfamily, allowing for downstream pooling and multiplexing of many samples simultaneously. High-throughput sequencing followed by alignment to the reference genome demarcates the TE insertion site by its 3′ end (read 2) and unique flanking sequence (read 1). TE insertions present in the reference genome can be identified by clustering of read pairs, whereas read 2 generated from polymorphic or novel TE insertions absent from the reference will map with lower quality or not at all; these TEs can be identified by clusters of read 1 alone (see Methods and Supplemental Material for detailed procedures).
TE loci observed in NA12878 NGS libraries
| TE library | Reference TP (a) | Reference FN (b) | NA12878 TP (c) | NA12878 FN (d) | FP (e) | Validated Novel (f) |
|---|---|---|---|---|---|---|
| L1HS | 589 (84642) | 35 | 54 (1493) | 22 | 19 (74) | 10 (38) |
| | 2335 (51529) | 404 | 143 (874) | 91 | 9 (44) | 6 (32) |
| | 1664 (61099) | 183 | 119 (953) | 29 | 3 (12) | 1 (4) |
(a) Reference TP: observed TE insertions (reads) in the reference truth set that have a TE cluster within a 600 bp window of the 3′ terminal position and match the predicted TE subfamily. Clusters contain filtered reads, with a minimum of 2 Illumina read 1 sequences derived from the unique flanking sequence. See text for details.
(b) Reference FN: false negatives, computed as reference TE subfamily members lacking a cluster within a 600 bp window of the TE 3′ terminal position.
(c) NA12878 TP: observed 1000 Genomes Phase 3 MEI calls in NA12878 that have an identified TE cluster within a 600 bp window of the 3′ terminal position and match the predicted TE class (Alu, LINE1).
(d) NA12878 FN: MEI calls with TE subfamily classification that lack an observed cluster within a 600 bp window of the TE 3′ terminal position.
(e) FP: false positive clusters lacking previous evidence of a TE insertion within a 600 bp window of the cluster position, before validation with GiaB and ONT long-read data.
(f) Validated Novel: FP clusters supported by evidence from GiaB and ONT long-read data.
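The window-based definitions in the table notes can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(chrom, pos)` input format, and the reading of "within 600 bp window" as plus or minus 600 bp around each position are assumptions made for illustration, and the subfamily-match and read-filtering criteria are omitted.

```python
from bisect import bisect_left
from collections import defaultdict

WINDOW_BP = 600  # window size from the table notes; applied here as +/- 600 bp (assumption)


def index_by_chrom(sites):
    """Group (chrom, pos) tuples into sorted position lists per chromosome."""
    index = defaultdict(list)
    for chrom, pos in sites:
        index[chrom].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def has_neighbor(index, chrom, pos, window=WINDOW_BP):
    """Return True if any indexed position on `chrom` lies within `window` bp of `pos`."""
    positions = index.get(chrom, [])
    i = bisect_left(positions, pos - window)
    return i < len(positions) and positions[i] <= pos + window


def classify(truth_sites, clusters, window=WINDOW_BP):
    """Split truth sites into TP/FN and report clusters with no truth site nearby (FP candidates)."""
    cluster_index = index_by_chrom(clusters)
    truth_index = index_by_chrom(truth_sites)
    tp = [s for s in truth_sites if has_neighbor(cluster_index, s[0], s[1], window)]
    fn = [s for s in truth_sites if not has_neighbor(cluster_index, s[0], s[1], window)]
    fp = [c for c in clusters if not has_neighbor(truth_index, c[0], c[1], window)]
    return tp, fn, fp


# Hypothetical coordinates, for illustration only
truth = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
clusters = [("chr1", 1450), ("chr2", 2000)]
tp, fn, fp = classify(truth, clusters)
recall = len(tp) / (len(tp) + len(fn))
```

Under the standard definition recall = TP / (TP + FN), the L1HS reference counts in the table would give 589 / (589 + 35) ≈ 0.94.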
Fig. 2 Precision and recall as a function of cluster read depth. Performance of TE insertion detection was assessed by computing precision and recall separately for reference and polymorphic TEs present in NA12878. Reference (NA12878) true positives (TP) and false negatives (FN) were determined by comparison of clusters to reference (hg19) RepeatMasker annotations (NA12878-specific 1000 Genomes Phase 3 MEI calls), respectively. False positives (FP) were defined conservatively as TE candidate clusters failing to intersect any previously identified NA12878 non-reference events (see main text and Methods for details).
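A curve of this kind could be computed along the following lines. This is a hedged sketch rather than the published analysis: the function name, the input format, and the simplification that each known insertion is matched by at most one cluster are assumptions, and the example values are hypothetical.

```python
def precision_recall_by_depth(clusters, n_truth, thresholds):
    """Compute (min_depth, precision, recall) for each minimum read-depth threshold.

    clusters: iterable of (read_depth, matches_known_insertion) pairs (hypothetical format)
    n_truth:  total number of known insertions in the truth set
    """
    clusters = list(clusters)
    rows = []
    for min_depth in thresholds:
        # Keep only clusters supported by at least `min_depth` reads
        kept = [hit for depth, hit in clusters if depth >= min_depth]
        tp = sum(kept)          # retained clusters matching a known insertion
        fp = len(kept) - tp     # retained clusters with no known insertion nearby
        precision = tp / (tp + fp) if kept else float("nan")
        recall = tp / n_truth if n_truth else float("nan")
        rows.append((min_depth, precision, recall))
    return rows


# Hypothetical clusters: (read depth, matches a known insertion?)
example = [(2, False), (3, True), (10, True), (25, True), (4, False)]
rows = precision_recall_by_depth(example, n_truth=4, thresholds=(2, 5, 20))
```

Raising the depth threshold in this toy example removes the unsupported clusters first, so precision rises while recall falls, which is the trade-off such a plot is meant to show.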