Benjamin C Reiner1, Glenn A Doyle2, Andrew E Weller2, Rachel N Levinson2, Esin Namoglu2, Alicia Pigeon2, Emilie Dávila Perea2, Cynthia Shannon Weickert3,4,5, Gustavo Turecki6, Deborah C Mash7, Richard C Crist2, Wade H Berrettini2.
Abstract
Long interspersed element-1 retrotransposons (LINE-1 or L1) are ∼6 kb mobile DNA elements implicated in the origins of many Mendelian and complex diseases. The actively retrotransposing L1s are mostly limited to the L1 human specific (L1Hs) transcriptional active (Ta) subfamily. In this manuscript, we present REBELseq as a method for the construction of Ta subfamily L1Hs-enriched next-generation sequencing libraries and bioinformatic identification. REBELseq was performed on DNA isolated from NeuN+ neuronal nuclei from postmortem brain samples of 177 individuals and empirically-driven bioinformatic and experimental cutoffs were established. Putative L1Hs insertions passing bioinformatics cutoffs were experimentally validated. REBELseq reliably identified both known and novel Ta subfamily L1Hs insertions distributed throughout the genome. Differences in the proportion of individuals possessing a given reference or non-reference retrotransposon insertion were identified. We conclude that REBELseq is an unbiased, whole genome approach to the amplification and detection of Ta subfamily L1Hs retrotransposons.
Keywords: L1; L1Hs; REBELseq; Retrotransposons
Year: 2020 PMID: 32132168 PMCID: PMC7202019 DOI: 10.1534/g3.119.400613
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542
Figure 1. Fluorescence-activated cell sorting (FACS) of NeuN-stained nuclei. Representative image showing FACS separation of NeuN+ and NeuN- DAPI-stained nuclei. Boxes indicate gating for the populations of nuclei that were collected for gDNA isolation. Only NeuN+ gDNA was utilized for this work.
Figure 2. Schematic of the construction of Ta subfamily enriched L1Hs sequencing libraries. gDNA isolated from NeuN+ nuclei was enzymatically digested with HaeIII in the presence of shrimp alkaline phosphatase (rSAP) to fragment the genome and remove 5′ phosphates from cleavage products. A single primer extension using the Ta subfamily specific L1HsACA primer extends the 3′ end of the L1 sequence into the downstream gDNA. The 3′ ‘A’ overhang from the single primer extension is ligated to a custom T-linker, and primary PCR amplifies the construct using L1HsACA and T-linker specific primers. Hemi-nested secondary PCR using the L1Hs specific L1HsG primer and T-linker primer reduces the length of the L1 sequence carried forward and adds a sequencing adapter to the L1 end. Tertiary PCR uses primers complementary to the 5′ end of library amplicons to add a barcode to the L1 end and Illumina flow cell adapters to both ends of the amplicons.
Figure 3. Detection levels of Ta subfamily L1Hs. (A) The number of Ta subfamily ref L1Hs detected per chromosome vs. the number of Ta subfamily ref L1Hs annotated in hg19 RepeatMasker. We detected ∼99% of the Ta subfamily L1Hs annotated in hg19 RepeatMasker. (B) Average number of sequencing reads per person for a given L1 insertion. Data were binned to show the contrast between the average number of sequencing reads per person seen for known and putative novel L1Hs insertions. An average of ≥ 100 sequencing reads per person was used as a bioinformatic cutoff.
Figure 4. Example image of a successful and an unsuccessful confirmatory PCR. PCR experiments were conducted to determine the proportion of putative novel non-ref L1Hs insertions detected by REBELseq that could be independently validated. For each insertion, gDNA from a person predicted to have the L1Hs insertion (+) and gDNA from a person not predicted to have the insertion (-) were used to amplify the genomic region purported to contain the L1Hs insertion (Filled site, F) and the same genomic region if it did not contain the insertion (Empty site, E). Insertion #1 is a positive confirmation of the results predicted by REBELseq, while Insertion #2 is a negative confirmation.
Table describing the rates of L1Hs insertion validation by binned average number of sequencing reads. The percentage of successfully validated novel non-ref L1Hs per read bin was used to calculate the number of predicted true positive novel non-ref L1Hs per bin. The calculated total true positive per read bin is the sum of the predicted true positive novel non-ref L1Hs and the total number of detected known non-ref and ref L1Hs per bin. The calculated percent true positive per bin is the total true positive for that bin divided by the sum of the number of detected novel non-ref L1Hs, the number of detected known non-ref L1Hs and the number of detected ref L1Hs. The cumulative percent true positive is the percentage of true positive L1Hs insertions among all insertions having at least the average number of sequencing reads for that bin.
| Avg. # of reads | Novel non-ref L1Hs: successfully validated | Novel non-ref L1Hs: validations attempted | Novel non-ref L1Hs: % validated | Novel non-ref L1Hs: total # detected per bin | Novel non-ref L1Hs: predicted true positive | Known non-ref L1Hs: total # detected per bin | Ref L1Hs: total # detected per bin | Calculated total true positive per bin | Calculated % true positive per bin | Cumulative % true positive (per bin and above) |
|---|---|---|---|---|---|---|---|---|---|---|
| ≥1,000 | 32 | 44 | 72.7 | 460 | 334 | 227 | 465 | 1026 | 89.1 | 89.1 |
| 500-999 | 16 | 41 | 39.0 | 186 | 73 | 43 | 90 | 206 | 64.6 | 83.8 |
| 250-499 | 13 | 40 | 32.5 | 320 | 104 | 28 | 67 | 199 | 48.0 | 75.9 |
| 100-249 | 6 | 36 | 16.7 | 876 | 146 | 34 | 77 | 257 | 26.0 | 58.8 |
Table 1. Validation of putative Ta subfamily L1Hs insertions.
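The derived columns of Table 1 can be recomputed from its raw counts. The following is a minimal sketch in Python, not the authors' code; the function name `summarize` and the dictionary keys are illustrative. It assumes that the validation percentage is rounded to one decimal place before being multiplied by the number of detected novel insertions, an inference made because that rounding reproduces the tabulated predicted true positives.

```python
# Counts transcribed from Table 1:
# (read bin, validated, attempted, novel detected, known non-ref detected, ref detected)
ROWS = [
    ("≥1,000", 32, 44, 460, 227, 465),
    ("500-999", 16, 41, 186, 43, 90),
    ("250-499", 13, 40, 320, 28, 67),
    ("100-249", 6, 36, 876, 34, 77),
]


def summarize(rows):
    """Recompute the derived columns, from the highest read bin downward."""
    summary = []
    cum_true = 0      # running total of calculated true positives
    cum_detected = 0  # running total of all detected insertions
    for label, validated, attempted, novel, known, ref in rows:
        pct_validated = round(100 * validated / attempted, 1)
        # Assumption: the rounded percentage is what gets applied to the bin.
        predicted_novel = round(novel * pct_validated / 100)
        # Known non-ref and ref insertions are all counted as true positives.
        total_true = predicted_novel + known + ref
        detected = novel + known + ref
        cum_true += total_true
        cum_detected += detected
        summary.append({
            "bin": label,
            "pct_validated": pct_validated,
            "predicted_novel": predicted_novel,
            "total_true": total_true,
            "pct_true": round(100 * total_true / detected, 1),
            "cum_pct_true": round(100 * cum_true / cum_detected, 1),
        })
    return summary


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

Because the cumulative column pools every bin at or above the current one, the last row describes the whole set of insertions passing the ≥ 100 read cutoff.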
Figure 5. Distributions of high confidence L1Hs insertions. (A) The genomic distribution of known L1Hs (above the chromosome numbers) and novel L1Hs (below the chromosome numbers) per 10 Mb window of each chromosome. Alternating color pattern and labeled central blocks represent the different chromosomes. Known and novel L1Hs are distributed throughout the genome, suggesting REBELseq is an unbiased whole genome approach. (B) Distribution of the number of unique known L1Hs insertions per given number of individuals. The number of known L1Hs insertions shared by different numbers of individuals shows a trimodal distribution. While some known L1Hs occur in a proportion of individuals approaching 1.0, the other local maxima are centered on a few individuals or on fewer than half of the surveyed individuals. This demonstrates that known L1Hs, such as those annotated in RepeatMasker for the human draft genome, should be considered polymorphic in nature, rather than ubiquitous in the human genome. (C) Distribution of the number of unique novel L1Hs insertions per given number of individuals. The number of novel L1Hs insertions shared by different numbers of individuals shows a right-skewed distribution. While some novel L1Hs were detected in the majority of individuals, most novel L1Hs were detected in one or a few individuals.
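The per-window counts plotted in panel (A) amount to bucketing each insertion by chromosome and 10 Mb window. The sketch below illustrates that idea only; the function `count_per_window`, the 0-based coordinate convention, and the example positions are assumptions for illustration and are not data or code from the study.

```python
from collections import Counter

WINDOW_SIZE = 10_000_000  # 10 Mb, the window size named in the caption


def count_per_window(insertions, window_size=WINDOW_SIZE):
    """Count insertions per (chromosome, window index).

    `insertions` is an iterable of (chromosome, position) pairs, with
    positions assumed to be 0-based genomic coordinates.
    """
    counts = Counter()
    for chrom, pos in insertions:
        counts[(chrom, pos // window_size)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical positions, made up for this example.
    example = [("chr1", 1_500_000), ("chr1", 9_999_999),
               ("chr1", 10_000_000), ("chr2", 42_000_000)]
    for (chrom, window), n in sorted(count_per_window(example).items()):
        start_mb = window * WINDOW_SIZE // 1_000_000
        print(f"{chrom} {start_mb}-{start_mb + 10} Mb: {n}")
```

Running the same function separately on the known and the novel insertion sets would give the two series drawn above and below the chromosome labels.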