Jinfeng Chen, Travis R Wrightsman, Susan R Wessler, Jason E Stajich.
Abstract
BACKGROUND: Transposable element (TE) polymorphisms are important components of population genetic variation. The functional impacts of TEs on gene regulation and in generating genetic diversity have been observed in multiple species, but the frequency and magnitude of TE variation are underappreciated. Inexpensive and deep sequencing technology has made it affordable to apply population genetic methods to whole genomes with methods that identify single nucleotide and insertion/deletion polymorphisms. However, identifying TE polymorphisms, particularly transposition events or non-reference insertion sites, can be challenging due to the repetitive nature of these sequences, which hampers both the sensitivity and specificity of analysis tools.
Keywords: Annotation; Bioinformatics; Diversity; Parallel processing; Population genomics; Resequencing; Rice; Short read; Transposons
Year: 2017 PMID: 28149701 PMCID: PMC5274521 DOI: 10.7717/peerj.2942
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Workflow for identification of transposable element insertions in population resequencing data using Illumina paired-end reads.
Figure 2. Performance of RelocaTE2, RelocaTE, TEMP and ITIS on simulated rice data.
Comparison of tool performance on rice chromosome 3 (OsChr3) for Sensitivity (A), Specificity (B), and Recall rate of Target Site Duplication (TSD) (C), and on rice chromosome 4 (OsChr4) for Sensitivity (D), Specificity (E), and Recall rate of TSD (F). Three replicate simulations of 200 random transposable element (TE) insertions were generated for each of OsChr3 and OsChr4. A series of datasets was constructed by sampling at varying sequence depths (from 1 to 40) from each simulated TE dataset. Sensitivity (SN), Specificity (SP) and TSD recall of each tool were estimated on each simulated dataset across multiple sequence depths. The error bars show the standard deviation among the three replicates, each of which had a different set of 200 random TE insertions. SN was defined as the percentage of the 200 simulated TE insertions that were recalled within 100 base pairs of the simulated insertion site. SP was defined as the percentage of all calls that were within 100 base pairs of one of the 200 simulated TE insertions. Recall rate of TSD was defined as the percentage of true positive calls that correctly matched the simulated TSD of the TE insertion.
Figure 3. Performance of RelocaTE2 and TEMP on biological datasets: the HuRef genome, the IR64 genome, and 50 rice and wild rice strains.
(A) Venn diagram of the overlap in non-reference TE insertions identified in the HuRef genome and the rice IR64 genome using RelocaTE2 and TEMP. Sensitivity (SN) and Specificity (SP) were assessed by comparing the assembled HuRef genome to the GRCh36 reference genome and the assembled IR64 genome to the MSU7 reference genome. SN was defined as the percentage of validated calls out of all validated calls by either RelocaTE2 or TEMP. SP was defined as the percentage of validated calls out of all calls by each tool. (B) Comparison of the number of non-reference TE insertions of 14 TE families in 50 rice and wild rice strains identified by RelocaTE2 and TEMP. Strains are color-coded based on subpopulation classification.
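The SN and SP definitions in panel (A) differ from those used for simulated data because there is no complete truth set: the denominator for SN is the union of validated calls from both tools. A minimal Python sketch of that arithmetic follows. The function name `sn_sp_two_tools` and the representation of each insertion call as a hashable site key that is comparable across tools are assumptions for this example; how calls are validated against the assembled genomes is outside its scope.

```python
def sn_sp_two_tools(calls_a, validated_a, calls_b, validated_b):
    """Return {"A": (SN, SP), "B": (SN, SP)} as percentages.

    calls_*:     set of all insertion sites called by a tool
    validated_*: subset of those calls confirmed in the assembled genome
    Site keys are assumed to be normalised so that the same insertion
    found by both tools maps to the same key.
    """
    # Stand-in for the truth set: every call validated by either tool.
    validated_union = validated_a | validated_b

    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "A": (pct(len(validated_a), len(validated_union)),
              pct(len(validated_a), len(calls_a))),
        "B": (pct(len(validated_b), len(validated_union)),
              pct(len(validated_b), len(calls_b))),
    }
```

Because the denominator for SN contains only insertions that at least one tool found and validated, SN computed this way is relative to the two tools being compared and would change if a third tool were added.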