
Population and clinical genetics of human transposable elements in the (post) genomic era.

Lavanya Rishishwar¹, Lu Wang², Evan A Clayton³, Leonardo Mariño-Ramírez⁴, John F McDonald³, I King Jordan¹.

Abstract

Recent technological developments in genomics, bioinformatics and high-throughput experimental techniques are providing opportunities to study ongoing human transposable element (TE) activity at an unprecedented level of detail. It is now possible to characterize genome-wide collections of TE insertion sites for multiple human individuals, within and between populations, and for a variety of tissue types. Comparison of TE insertion site profiles between individuals captures the germline activity of TEs and reveals insertion site variants that segregate as polymorphisms among human populations, whereas comparison among tissue types ascertains somatic TE activity that generates cellular heterogeneity. In this review, we provide an overview of these new technologies and explore their implications for population and clinical genetic studies of human TEs. We cover both recently published results on human TE insertion activity and the prospects for future TE studies related to human evolution and health.


Keywords: bioinformatics; disease; genetics; genomics; health; human; natural selection; polymorphisms; transposable elements; transposition

Year: 2017 PMID: 28228978 PMCID: PMC5305044 DOI: 10.1080/2159256X.2017.1280116

Source DB: PubMed Journal: Mob Genet Elements ISSN: 2159-2543

Human transposable element research in the (post) genomic era

Technology driven research and discovery on human transposable elements

A convergence of new technologies in three key areas (genomics, bioinformatics and high-throughput experimental techniques) is providing unprecedented opportunities for research and discovery on population and clinical genetic aspects of human transposable elements (TEs). In this review, we briefly cover these exciting technological developments and explore their implications for understanding how the activity of human TEs impacts the evolution and health of the global population. We would like to emphasize that our treatment is by no means intended as an exhaustive review of the subject; rather, we are simply attempting to call the readers' attention to what we perceive to be some of the most relevant developments in this area, along with the potential for future studies that these advances entail. It should also be noted that the review is focused primarily on the new bioinformatics tools that can be used to detect polymorphic TE insertions from next-generation sequence data, rather than the high-throughput experimental techniques, since we are most familiar with the computational approaches. Developments in genomics technology, and next-generation sequencing in particular, have taken us from the analysis of a single human genome, which alone has provided profound insight into the biology of human TEs, to the population genomics era where whole genome sequences from thousands of human individuals can be compared. Concomitant developments of bioinformatics tools for genome sequence analysis have allowed for the discovery and characterization of the genetic variants that are generated via recent TE activity, i.e. human TE polymorphisms, via the comparative analysis of next-generation re-sequencing data from multiple human genomes. 
Finally, a suite of novel high-throughput experimental techniques, which also leverage next-generation sequencing data, has been developed and applied for the characterization of human polymorphic TE insertions at the scale of whole genomes across numerous samples.

The initial analysis of the first draft of the human genome sequence was, in some sense, a watershed event for TE research. One of the most significant findings of this research was the large fraction of the human genome that was shown to be derived from TE sequences; 47% of the genome sequence was reported to be TE-derived with a single family of elements, LINE-1 (L1), making up ∼17% of the genome and another family, Alu, contributing more than 1 million individual copies. These remarkable results were generated using homology-based sequence analysis with the program RepeatMasker. Subsequent analysis of the human genome sequence, using a more sensitive ab initio algorithmic approach, has revised the estimate upwards to more than two-thirds of the genome being characterized as TE-derived. The abundance of TE sequences found in the human genome almost surely did not come as a surprise to members of the TE research community, but this finding certainly did underscore the potentially far reaching impact of these often underappreciated genetic elements on the human condition.

The 1000 Genomes Project (1KGP) can be considered as the successor to the initial human genome project as well as the initiative that ushered human genomic research into the so-called post genomics era. As its name implies, the 1KGP entailed the characterization of whole genome sequences from numerous human individuals, and it did so with an eye toward capturing a broad swath of world-wide human genome sequence diversity. The 1KGP resulted in the characterization of whole genome sequences for 2,504 individual donors sampled from 26 global populations, which can be organized into 5 major continental population groups. 
The project was executed in three phases, each of which included a substantial focus on technology development, not only with respect to sequencing methods but also for the computational techniques that are needed to call sequence variants from next-generation re-sequencing data. This focus on technology development ultimately led to the characterization of genome-wide collections of human polymorphic TE (polyTE) insertion genotypes for all individuals in the project. Importantly, these data have been released into the public domain, thereby facilitating population and clinical genetic studies of human TE polymorphisms. Advances in next-generation sequencing technology have also facilitated the development of high-throughput experimental techniques that can be used to detect de novo TE insertions, genome-wide across multiple samples. These high-throughput experimental techniques couple enrichment for sequences that are unique to active families of human TEs with subsequent next-generation sequencing and mapping techniques in order to discover the locations of novel TE insertions. Notably, these innovative experimental approaches have been successfully applied toward the characterization of somatic human TE activity in a variety of tissues, along with its potential role in cancer, as is discussed later in this review.

Active families of human TEs

As described above, a large fraction of the human genome sequence has been derived from millions of individual TE insertions. The process of TE insertion and accumulation in the genome has taken place over many millions of years along the evolutionary lineage that led to modern humans, and it turns out that the vast majority of human TE-derived sequences were generated via relatively ancient insertion events. Most ancient TE insertions have accumulated numerous mutations since the time that they inserted in the genome, and as a consequence they are no longer capable of transposition. The vast majority of TE-derived sequences in the human genome (>99%) correspond to such formerly mobile elements. The most salient aspect of these inert human TEs, with respect to population and clinical genomics, is that their insertion locations are fixed in the human genome. In other words, each individual TE sequence insertion of this kind is found at the exact same genomic location in all human individuals and for all human populations. Thus by definition, these ancient and fixed TE sequences do not contribute to human genetic variation via insertion polymorphisms. There are, however, several families of TEs that are still active in the human genome. Elements of the HERV-K, L1, Alu and SVA families remain capable of transposition and can thereby generate insertion polymorphisms among individual human genomes. The resulting TE insertion polymorphisms have important implications for human evolution and health (disease) as detailed later in this review. HERV-K and L1 are autonomous TEs that encode all of the enzymatic machinery needed to catalyze their own transposition, whereas Alu and SVA are non-autonomous elements that are transposed in trans by L1 encoded proteins. All four active families of human TEs correspond to retrotransposons that transpose via the reverse transcription of an RNA intermediate. 
Members of the HERV-K family of active human TEs are human endogenous retroviruses, which are thought to have evolved from ancient retroviral infections that made their way into the germline and eventually lost the capacity for inter-cellular infectivity via loss of coding capacity for the envelope protein. As such, HERV-K elements have genomic structures that are very similar to retroviruses, including long-terminal repeat (LTR) sequences that flank the gag and pol open reading frames, which encode structural and enzymatic (integrase and reverse transcriptase) element proteins. L1 elements are long interspersed nuclear elements (LINEs) that are classified as non-LTR containing retrotransposons. Alu and SVA elements are both classified as short interspersed nuclear elements (SINEs). Alu elements are derived from 7SL RNA and are ∼300 bp in length. SVAs are hybrid elements that are made up of SINE, VNTR (variable number tandem repeat) and Alu sequences and can vary from 100–1,500 bp in length.

Genome-scale characterization of TE insertions

Human genome sequencing initiatives

The initial draft of the human genome sequence took more than 10 years to complete at a cost of ∼2.7 billion dollars. Characterization of the human genome sequence was done with Sanger sequencing technology, using essentially the same chain termination biochemistry that was invented in the mid-1970s, albeit with refinements in automation. In the mid-2000s, starting with the Roche 454 pyrosequencing method, there was an explosion of novel biochemical methods for DNA sequencing. These so-called next-generation sequencing technologies enabled far higher throughput sequencing, at much lower cost, than the Sanger sequencing method used for the original human genome project. It is now possible to sequence an entire human genome in a single day at a cost of ∼1,000 dollars using Illumina's patented sequencing by synthesis (SBS) technology. This hyper-exponential increase in sequencing capacity, and simultaneous decrease in its cost, is powering a series of human genome sequencing initiatives that have profound implications for the study of human TE genetic variation (Table 1).

Table 1.

Large scale genome sequencing initiatives. Projects are sorted in descending order by the number of participants.

Project Name	PMID	# Participants	Description
Million Veteran Program (MVP)	26441289	1,000,000	Planned sequencing of 1 million U.S. veterans (genotyping, whole genome and exome); current enrollment at 500k
SHGP	26583887	100,000	Catalog of whole genome sequences of 100k Saudis
TOPMed	N/A	62,000	Sequencing of 62k individual genomes along with a variety of data for precision medicine initiative
UK10K	26367797	10,000	Sequencing of ∼10k individuals from UK to inspect the effect of rare and low-frequency variants to human traits
Human Longevity	Awaiting	10,000	Deep sequencing of 10k human genomes; Data donated to Precision FDA
Iceland Genome Project	25807286	2,636	Catalog of whole genome sequences of 2,636 Icelanders
1000 Genomes Project	26432245	2,504	International whole genome project that sampled 2,504 healthy individuals from 26 populations
EGDP	27654910	483	Catalog of whole genome sequences of 483 genomes from 148 diverse populations
SGDP	27654912	300	Catalog of whole genome sequences of 300 genomes from 142 diverse populations
GoNL	24974849	250	Catalog of whole genome sequences of 250 Dutch parent-offspring families
Australian Aboriginals	27654914	108	Catalog of whole genome sequences of 108 Aboriginal Australians

The previously discussed 1KGP is the emblematic initiative for the characterization of whole human genome sequences at the population level; as such, it is difficult to overstate the impact that this project has had, and continues to have, on human population and clinical genomics. The 1KGP had the critical effect of stimulating experimental methods related to sequencing as well as numerous bioinformatics methods that are used for the analysis of genome sequence data, particularly as they relate to characterizing genetic variants. A major part of this effort was the development and refinement of methods for calling structural variants, including but not limited to TE insertion polymorphisms. Nevertheless, the 1KGP, which entailed the characterization of just over 2,500 whole genome sequences, has been dwarfed in scale by a number of subsequent initiatives that are currently underway (Table 1).

Several of the most ambitious human genome sequencing initiatives involve the characterization of cancer genome sequences. For example, the International Cancer Genome Consortium (ICGC) is collaborating with the US National Cancer Institute's The Cancer Genome Atlas (TCGA) to sequence genomes for 500 pairs of matched normal and tumor samples for 50 different tumor types, for an expected yield of 50,000 whole genome sequences. The US National Heart, Lung, and Blood Institute's (NHLBI) TOPMed precision medicine initiative is another health-related project that aims to sequence the genomes of 62,000 individuals.

There are a number of other large-scale human genome sequencing initiatives that are aimed at the populations of specific countries or global sets of populations. 
For example, the Wellcome Trust is sponsoring the UK10K initiative to sequence the genomes of 10,000 citizens of the United Kingdom, and Saudi Arabia intends to sequence 100,000 Saudi individuals for their own project. The Simons Genome Diversity Project recently completed sequencing of 300 human genomes from 142 diverse populations, and the Estonian Genome Diversity Project sequenced 483 genomes from 148 populations. Together, these projects, along with others like them, will provide a wealth of raw sequence data that can be mined for TE insertion polymorphisms using the computational and experimental approaches described in the sections that follow.

High-throughput techniques for TE insertion detection

Bioinformatics approaches

The characterization of single nucleotide variants (SNVs) from computational analysis of next-generation re-sequencing data has proven to be relatively straightforward: sequence reads are mapped to a reference genome sequence, allowing for mismatches, and sites where the mapped reads differ in sequence from the reference are used to call variants. The characterization of structural variants from next-generation sequence data has proven to be a far more challenging, but by no means intractable, problem. Early methods for calling structural variants operated in a manner that was agnostic with respect to the particular class of variant that was being characterized, whereas subsequent efforts have resulted in refined methods that are specifically tailored to individual structural variant classes. The most widely used and reliable methods for the computational detection of human TE insertion polymorphisms fall into the latter class of more specific methods. We want to be clear that the novel computational methods that we are describing are aimed at the detection of TE insertion polymorphisms, which will differ from the reference genome sequence, rather than the more mature bioinformatics methods (e.g. RepeatMasker) that are used to characterize the identity of the more ancient, fixed TE sequences that are included as part of a reference genome sequence.

There exist numerous computational tools that allow for the detection of TE insertions from next-generation whole genome sequence data (Table 2). While these programs may differ substantially in their details, they all tend to rely on the same two fundamental principles: discordant read pair mapping and split (or clipped) reads (Fig. 1A). Discordant read pair mapping occurs when one member of a read pair maps uniquely to the reference genome sequence and the second member of the pair maps to a repetitive TE sequence that is not found in the adjacent genomic region in the reference sequence. 
In some cases, the second member of the pair may map partially to unique reference genome sequence and partially to the TE sequence. The presence of multiple read pairs that show this pattern, from within the same genomic interval, is taken as evidence of a TE insertion, with the specific identity of the inserted element determined by the mapping of the second member of the read pair. Typically, a TE reference library is provided to facilitate these mappings and the corresponding characterizations of insertion identities. The discordant read pair mapping technique is ideal for short read, paired-end sequencing technology, such as the Illumina SBS method. The split read technique for computational detection of polymorphic TE insertions, which is somewhat less commonly used at this time, relies on longer sequence reads that map partially to unique reference genome sequence and partially to a repetitive TE sequence. This can include reads with one end in unique genome sequence and the other end in a TE sequence or reads that span an entire TE insertion (i.e., have a TE sequence in the middle of the read). As longer sequence read technologies, such as the Pacific Biosciences single molecule real time sequencing method (PacBio SMRT), become more widely used for human genome sequencing, the split read approach should become increasingly useful. Alternatively, long reads may eventually come to be used for ab initio assembly of complex eukaryotic genomes, such as the human genome, thereby obviating the need for computational TE insertion detection methods altogether.
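The discordant read pair principle described above can be illustrated with a minimal Python sketch. It is not the algorithm of any tool listed in Table 2. The `ReadPair` record, the 500 bp clustering window and the minimum support of three pairs are all assumptions made for illustration; a real caller would read alignments from a BAM file and would also use split reads.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one mapped read pair."""
    chrom: str     # chromosome of the uniquely mapped (anchor) read
    pos: int       # reference position of the anchor read
    mate_hit: str  # TE family hit by the mate in the TE library, "" if concordant


def call_insertions(pairs, window=500, min_support=3):
    """Cluster discordant anchors and report candidate TE insertions.

    Returns tuples of (chrom, first_anchor, last_anchor, family, n_pairs).
    The window and min_support defaults are illustrative assumptions.
    """
    anchors = defaultdict(list)
    for pair in pairs:
        if pair.mate_hit:  # keep only pairs whose mate maps to a TE
            anchors[(pair.chrom, pair.mate_hit)].append(pair.pos)

    calls = []
    for (chrom, family), positions in anchors.items():
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:] + [None]:  # None flushes the last cluster
            if pos is not None and pos - cluster[-1] <= window:
                cluster.append(pos)
                continue
            if len(cluster) >= min_support:
                calls.append((chrom, cluster[0], cluster[-1], family, len(cluster)))
            if pos is not None:
                cluster = [pos]
    return sorted(calls)


example_pairs = [
    ReadPair("chr1", 1000, "Alu"),
    ReadPair("chr1", 1100, "Alu"),
    ReadPair("chr1", 1250, "Alu"),
    ReadPair("chr1", 1300, ""),       # concordant pair, ignored
    ReadPair("chr1", 90000, "Alu"),   # isolated pair, below min_support
    ReadPair("chr2", 500, "L1"),      # isolated pair, below min_support
]
print(call_insertions(example_pairs))
```

Requiring several supporting pairs within one interval mirrors the idea that multiple read pairs showing the same pattern are taken as evidence of an insertion, with the mate's TE library hit providing the family assignment.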

Table 2.

Computational approaches for genome-wide detection of TE insertions. Methods are sorted in order by their year of publication.

Tool Name	PMID	Year	Comments
VariationHunter	19447966	2009	Originally developed for SV detection, later refined for TE calling
HYDRA-SV	20308636	2010	General purpose SV tool; reported on mouse genome
TE-Locate	24832231	2012	Reported on 1001 Arabidopsis genomes project
Tea	22745252	2012	Specialized TE caller for cancer WGS data
ngs_te_mapper	22347367	2012	Requires TSDs; reported for Drosophila melanogaster
RetroSeq	23233656	2013	Tested on 1KGP and mouse strains
RelocaTE	23576519	2013	Requires TSDs; designed for rice genomes
Mobster	25348035	2014	Tested on 1KGP; reliable predictor for human genome
Tangram	25228379	2014	Used in Phase II of 1KGP; no longer maintained
TEMP	24753423	2014	Reported on 1KGP and Drosophila genomes
T-lex2	25510498	2014	Reported on 1KGP and Drosophila genomes
TE-Tracker	25408240	2014	Reported on Arabidopsis genome and simulated human genome
TIGRA	24307552	2014	A breakpoint assembler and not a structural variant caller
TranspoSeq	24823667	2014	Specialized TE caller for cancer WGS data
TraFiC	25082706	2014	Specialized TE caller for cancer WGS data
MELT	26432246	2015	Used in Phase III of 1KGP; reported to work on human, chimpanzee and dog
ITIS	25887332	2015	Reported on Medicago truncatula; not optimized for human genome
Jitterbug	26459856	2015	Reported on 1KGP and Arabidopsis genome
MetaSV	25861968	2015	General purpose SV tool; reported on simulated genome
DD_DETECTION	26508759	2016	Database free dispersed duplication detection approach
GRIPper	—	—	Detects non-reference gene copy insertion

Figure 1.

Schematic of the high-throughput bioinformatics (A) and experimental (B) approaches to human TE insertion discovery.

Two of the earliest computational methods developed specifically for the detection of TE insertions from next-generation sequence data are VariationHunter and the program Spanner, which was used for calling TE insertions in the first phase of the 1KGP. Subsequent phases of the 1KGP included additional refinement of next-generation sequence based TE insertion calling methods resulting in the Tangram and MELT programs, for the second and third phases of the project, respectively. RetroSeq and Mobster are two of the other most widely used programs for sequence based TE insertion detection. RetroSeq was implemented primarily for the detection of endogenous retrovirus insertions in the mouse genome, whereas Mobster was tested mainly on human L1 and Alu elements.

Until very recently, all of these individual programs had only been benchmarked and validated individually by the same groups that developed each one. In other words, there was no independent and controlled comparison of the accuracy, runtime performance and usability of these tools. We recently performed just such a benchmarking and validation comparison of 21 different programs for sequence based TE insertion detection in an effort to provide researchers with an unbiased assessment of their utility. Our benchmarking study was focused solely on human TE detection, owing both to the importance of human TE detection for population and clinical genetic studies and to the availability of an experimentally validated set of TE insertions for an entire human genome. The first phase of our benchmarking study entailed an effort to select tools that would be of the most potential use to the human TE research community. 
This included eliminating all programs from consideration that: 1) performed general structural variant detection (these typically have worse performance for TE insertion detection), 2) were specifically designed for cancer and required matched normal-cancer genome pairs, or 3) performed breakpoint assembly for TE insertion identification and were not able to detect insertion site locations without prior information. In this phase, we also eliminated all programs that were no longer supported and/or could not be used due to non-user generated errors such as previously reported bugs. This process resulted in reducing the original set of 21 programs down to 7 programs, which we then evaluated using simulated and actual human genome sequences. The final 7 programs that we evaluated were: MELT, Mobster, RetroSeq, Tangram, TEMP, ITIS and T-lex2 (Table 2). For each of these programs, we provided a detailed set of notes in support of their installation and use, including the exact commands and parameters that are required for their optimal performance. We compared all of the programs with respect to a set of qualitative and quantitative benchmarks. The qualitative benchmarks were ease of installation, ease of use, level of detail in the user manual and source code availability (i.e., open or closed source). The quantitative benchmarks were precision and recall accuracy measures along with the runtime parameters: CPU time, wall time, RAM and the number of CPUs used. The simulated data that we used consisted of artificial genome sequences with randomly generated TE insertions and sequence read pairs simulated based on the Illumina sequencing profile. The empirical data were taken from a single individual from the 1KGP whose genome was extensively characterized, including with PacBio long read sequence technology, resulting in an experimentally validated set of 893 TE insertions genome-wide. 
When all of these factors were taken into consideration, the program MELT showed the best overall performance followed by the programs Mobster and RetroSeq. The superior performance of MELT on these particular data should be taken with some caution given the fact that it was developed and refined on the exact same human data set. Indeed, the programs that were designed to perform more broadly, such as TEMP, or for different species, such as ITIS and T-lex2, did not perform as well, consistent with the possibility that they were at an inherent disadvantage when benchmarked on human genome sequence data from the 1KGP. Nevertheless, our benchmarking analysis clearly supports the use of MELT, and to a lesser extent Mobster and RetroSeq, for the computational detection of human TE insertions from next-generation sequence data. There remain a number of caveats and open issues that should be considered when using these kinds of programs to predict TE insertions from whole genome sequence data. The first thing to consider is that no single method can produce optimal overall results. The best strategy is to use two or more of the top 3 methods (MELT, Mobster and RetroSeq) and then to combine the methods by looking for consensus TE calls that are supported by multiple methods. This approach has the potential effect of increasing precision at only a minor cost to recall, i.e., requiring support from multiple methods is conservative, while drawing on several methods can still recover more total TE calls than any single method alone. Of course, a combined approach of this kind can be quite user intensive and could exceed the ability of some labs to readily implement. Perhaps the most pressing open issue regarding computational methods for TE insertion detection relates to the level of resolution at which the insertion sites can be located in the genome sequence. In our experience, TE insertions can only be accurately localized within approximately ±100 bp. 
This lack of resolution makes it particularly difficult to combine results from multiple methods, as suggested above, since the same predictions will most often not be located at exactly the same genomic location. This limitation can be overcome by considering TE insertions detected within ± 100 bp windows to represent the same calls. Nevertheless, further algorithm development aimed at more precise TE insertion location should prove to be an important future development in the field.
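As a rough illustration of the consensus strategy and the ±100 bp matching window discussed above, the following Python sketch merges calls from several tools and scores them against a validated set using precision and recall. The tool names, call tuples and truth set are hypothetical, and the greedy merge against the first call of each cluster is a simplifying assumption rather than the procedure used in the benchmarking study.

```python
def same_call(a, b, window=100):
    """Calls are (chrom, pos, family); they match if within `window` bp."""
    return a[0] == b[0] and a[2] == b[2] and abs(a[1] - b[1]) <= window


def consensus(calls_by_tool, min_tools=2, window=100):
    """Return one representative call per site supported by >= min_tools tools."""
    tagged = [(call, tool) for tool, calls in calls_by_tool.items() for call in calls]
    # Sort by chromosome, family, then position so nearby calls are adjacent.
    tagged.sort(key=lambda item: (item[0][0], item[0][2], item[0][1]))
    clusters = []  # each entry: [representative_call, set_of_supporting_tools]
    for call, tool in tagged:
        if clusters and same_call(clusters[-1][0], call, window):
            clusters[-1][1].add(tool)
        else:
            clusters.append([call, {tool}])
    return [call for call, tools in clusters if len(tools) >= min_tools]


def precision_recall(predicted, truth, window=100):
    """Precision and recall, counting a match when calls fall within the window."""
    hit_pred = sum(any(same_call(p, t, window) for t in truth) for p in predicted)
    hit_truth = sum(any(same_call(t, p, window) for p in predicted) for t in truth)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_truth / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical calls from three unnamed tools and a hypothetical validated set.
calls_by_tool = {
    "tool_a": [("chr1", 10_000, "Alu"), ("chr1", 50_000, "L1"), ("chr2", 7_000, "SVA")],
    "tool_b": [("chr1", 10_040, "Alu"), ("chr1", 50_300, "L1")],
    "tool_c": [("chr1", 9_950, "Alu"), ("chr2", 7_040, "SVA")],
}
truth = [("chr1", 10_010, "Alu"), ("chr2", 7_020, "SVA"), ("chr3", 100, "L1")]

consensus_calls = consensus(calls_by_tool)
print(consensus_calls)
print(precision_recall(consensus_calls, truth))
```

In this toy example the two L1 calls sit 300 bp apart, so they are not merged and neither reaches the two-tool threshold, which shows how the window size directly affects which calls count as consensus.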

High-throughput experimental approaches

In addition to driving the bioinformatics based efforts at TE insertion detection, next-generation sequencing techniques have also enabled a number of high-throughput experimental approaches for the detection of novel TE insertions (Table 3). Much like the computational approaches for TE detection, these high-throughput experimental techniques also share a core set of design principles (Fig. 1B). The first phase of these experiments consists of fragmentation of genomic DNA followed by enrichment for sequence elements that uniquely correspond to active human TE subfamilies, mainly Alu and L1. Different methods are distinguished by the approaches that they take to genomic fragmentation as well as whether they use PCR or hybridization for the enrichment step. Enrichment of sequence fragments from active TE subfamilies is followed by next-generation sequencing, for the most recently developed methods, or hybridization to tiling arrays for some of the older methods.

Table 3.

Next-generation sequence based			Tiling arrays/Sanger based
Method	PMID	Year	Method	PMID	Year
L1-Seq	20488934	2010	TIP-Chip	20602999	2010
Transposon-Seq	20603005	2010	Fosmid-based	20602998	2010
ME-Scan	20591181	2010	AIP	22495107	2012
RC-Seq	22037309	2011

High-throughput experimental approaches for TE insertion detection. Next-generation sequence based methods are presented separately from methods that used tiling arrays or Sanger sequencing. Methods are sorted in ascending order by their year of publication.

The first attempt at the systematic and unbiased characterization of novel human TE insertions was based on tiling array technology and was relatively low throughput. A number of next-generation sequence based techniques for TE insertion detection, which allowed for a substantial increase in the numbers of TE insertions that could be detected, were independently developed right around the same time in 2010 and 2011. Three such methods were published in 2010: ME-Scan, L1-Seq and Transposon-Seq. ME-Scan was used to characterize polymorphic Alu insertions, L1-Seq was applied to L1 insertions, and Transposon-Seq was used for TE insertion discovery with both families of elements. A fourth early sequence based method for TE insertion detection employed a lower-throughput approach that utilized fosmid sequences, characterized via Sanger sequencing, to characterize L1 insertions. The RC-Seq method was developed in 2011 and is the only method of its kind to be applied to all three families of active human TEs: Alu, L1 and SVA. RC-Seq combines tiling array based hybridization with next-generation sequencing for TE insertion discovery.

Evolutionary genetics of active human TEs

The high-throughput approaches to TE insertion detection described in the previous section, particularly the computational genome sequence based methods, have the potential to yield genome-wide catalogs of human TE insertion polymorphisms across numerous individuals from multiple populations. The realization of this possibility is exemplified by the 1KGP, phase 3 of which includes the public release of 16,192 TE insertion genotype calls for 2,504 individuals from 26 global populations. Data sets of this kind have the potential to yield unprecedented insight into the nature of the evolutionary forces that act on TE polymorphisms.

Human genetic variation from TE activity

The first step in any genome-scale evolutionary analysis of human TE insertional polymorphisms involves a basic description of the nature of the genetic variation that is generated by TE activity. This includes descriptive statistics regarding the levels of TE insertion variation within and between populations along with a sense of how polymorphic TEs are distributed across the genome, particularly with respect to the location of functionally relevant genomic features such as genes and regulatory elements.

Levels and patterns of TE genetic variation

TE insertion detection programs yield presence/absence genotype calls for individual loci, coded as homozygous absent (0), heterozygous (1) and homozygous present (2), across the entire genome when applied to whole genome next-generation sequence data. For large scale human genome sequence initiatives, such as the 1KGP, this yields the kind of data that can be used to calculate polyTE insertion allele frequencies within and between populations. PolyTE allele frequencies ($p_{TE}$) can be calculated from site-specific genotype data as the total number of TE insertions observed at any given genomic site ($n_{TE}$) normalized by the total number of chromosomes in the population under consideration ($2N$): $p_{TE} = n_{TE} / 2N$. This can be done for individual populations or for groups of related populations, such as the 5 major continental population groups characterized as part of the 1KGP. Population level polyTE allele frequencies can in turn be used to calculate a variety of population genetic parameters that measure how genetic variation is apportioned among populations, such as heterozygosity ($H$) and related fixation index ($F_{ST}$) statistics:

$$H = 2\,p_{TE}\,(1 - p_{TE})$$

$$F_{ST} = \frac{H_T - H_S}{H_T}$$

where $H_S$ is the sample (within population) polyTE heterozygosity and $H_T$ is the total (between population) polyTE heterozygosity. These kinds of statistics are ideal for measuring the effects of natural selection on TE insertion polymorphisms as described later in this review.
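The calculations described in this section can be sketched in a few lines of Python for a single polyTE locus. This is a minimal illustration under stated assumptions: genotypes use the 0/1/2 coding given above, heterozygosity is the standard expected value for a biallelic site, 2p(1 - p), and the within-population term of the fixation index, (H_T - H_S)/H_T, is weighted by sample size. The population labels and genotype vectors are made up.

```python
def allele_frequency(genotypes):
    """Insertion allele frequency: total insertions / total chromosomes (2N)."""
    return sum(genotypes) / (2 * len(genotypes))


def heterozygosity(p):
    """Expected heterozygosity for a biallelic presence/absence site."""
    return 2 * p * (1 - p)


def fst(genotypes_by_pop):
    """Fixation index (H_T - H_S) / H_T for one locus.

    H_S is the sample-size-weighted mean within-population heterozygosity
    (an assumption of this sketch); H_T uses the pooled allele frequency.
    """
    sizes = {pop: len(g) for pop, g in genotypes_by_pop.items()}
    freqs = {pop: allele_frequency(g) for pop, g in genotypes_by_pop.items()}
    total = sum(sizes.values())
    h_s = sum(sizes[pop] * heterozygosity(freqs[pop]) for pop in freqs) / total
    p_total = sum(sizes[pop] * freqs[pop] for pop in freqs) / total
    h_t = heterozygosity(p_total)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical genotypes (0, 1 or 2 insertion copies) for four individuals each.
genotypes_by_pop = {
    "pop_a": [2, 2, 1, 2],  # insertion common: 7 of 8 chromosomes
    "pop_b": [0, 0, 1, 0],  # insertion rare: 1 of 8 chromosomes
}

for pop, genotypes in genotypes_by_pop.items():
    print(pop, allele_frequency(genotypes))
print("F_ST:", fst(genotypes_by_pop))
```

A locus like this one, common in one population and rare in the other, gives a high fixation index, whereas a locus with equal frequencies in both populations gives a value of zero.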

Genomic landscape of TE insertions

Genome-wide catalogs of polyTE genotypes can also be used to systematically evaluate the landscape of TE insertions and to compare their locations to the locations of functionally important genomic features such as genes, regulatory elements and epigenetic chromatin marks. The overall human TE genomic landscape is already very well defined, dating to the initial analysis of the draft human genome sequence and even earlier, but the extent to which polyTE distributions resemble those of the more ancient, fixed TEs that predominate in the human genome remains an open question. When all TE-derived sequences are considered, there are a number of anomalous genomic regions that are particularly enriched or depleted for human TE sequences, and these are thought to be related to gene density and tight regulatory requirements. The rate of recombination is also an important factor influencing the genomic distribution of human TEs. It is thought that TE density should be reduced in regions of high recombination owing to the deleterious effects of ectopic recombination among dispersed element insertions. However, the situation is more complicated in the human genome where marked differences in TE genomic distributions can be seen for different families of TEs and even for different age classes within the same TE family. Across the entire genome, LINE elements (L1) tend to be enriched in AT-rich DNA and are primarily found in low recombining intergenic regions, whereas SINE elements (Alu) are enriched in GC-rich DNA regions in and around gene sequences. These TE distribution patterns correspond very well to previously defined isochores, which are large regions of DNA with uniform GC-content patterns. One particularly interesting finding from the initial analysis of the human genome sequence was that the distribution patterns of Alus change drastically for different age classes. 
Older subfamilies of Alus, i.e., those that inserted in the genome long ago, show the most skewed genomic distributions and the highest enrichment in GC-rich DNA. As the Alu subfamilies under consideration become progressively younger, they are progressively less enriched in GC-rich DNA; in fact, the very youngest AluY subfamily shows a preference for AT-rich DNA. These results were taken to indicate that Alus are preferentially retained in GC-rich DNA, and conversely more frequently lost from AT-rich DNA, since Alus are known to insert into the AT-rich target sequences favored by L1 encoded endonucleases. This was initially thought to be due to some positive selective force acting on Alus in GC-rich DNA, but was later shown to be more likely related to the relative ease with which Alu deletions were tolerated in gene poor AT-rich regions, compared to gene rich GC-regions where Alu deletions via ectopic recombination between nearby insertions would be far more deleterious. This issue has received substantial attention in the ensuing years and remains controversial. Now that there is a complete catalog of very recent Alu insertions, it will be very interest to see if this same patterns holds up.

Polymorphic TE insertions as ancestry informative markers

Ancestry informative markers (AIMs) are genetic variants that distinguish evolutionary lineages, different species or distinct populations within the same species, and can thereby be used to reconstruct evolutionary histories. For a number of reasons, TE insertions have proven to be extremely useful as AIMs, both within and between species. Most critically, locus-specific TE insertions nearly always represent synapomorphies, i.e., shared derived character states that are free from homoplasies where identical states do not result from shared ancestry. TE insertions also have the advantage that the ancestral state can be assumed to be absence of the insertion, and TE insertions are ideal AIMs for the very practical reason that they can be rapidly and accurately typed via PCR based assays. A number of studies from the pre-genomic era used polyTE insertions to study human evolution and ancestry. Most of these studies have focused on Alu elements, owing both to their relative abundance and the ease with which their shorter sequences can be PCR amplified. Far fewer studies have used L1s as AIMs, and to our knowledge, SVAs have yet to be used as markers in human evolutionary studies. Our own lab recently published the first evolutionary analysis of human populations using the genome-wide collection of human polyTE insertions characterized as part of the 1KGP. These data confirmed that human polyTE insertions are substantially geographically differentiated with many population-specific insertions. Furthermore, the patterns of polyTE insertion divergence within and between populations recapitulate known patterns of human evolution. African populations show both the highest numbers of polyTE insertions and the highest levels of polyTE sequence diversity, consistent with their ancestral status. 
Evolutionary relationships among human populations computed from the analysis of polyTE genotypes were entirely consistent with those that have been derived from single nucleotide polymorphisms (SNPs). In addition, when select subsets of population differentiated polyTEs were used as AIMs, they were able to accurately predict patterns of human ancestry and admixture. It is becoming increasingly apparent that patterns of human genetic ancestry and admixture are relevant to the study of human health and disease. In particular, there are numerous health disparities between human populations, and many of these are likely to be genetically based. Thus, the utility of polyTEs as AIMs could prove to be of clinical relevance, in applications such as admixture mapping for instance, in addition to their applications to population genetic studies.
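One simple way to pick population differentiated polyTEs as candidate AIMs is to rank sites by the largest difference in insertion allele frequency between any two populations. The Python sketch below illustrates that idea only; it is not the selection procedure used in the study described above, and the site identifiers, frequencies and the 0.5 threshold are hypothetical. The population labels follow the 1KGP continental group codes.

```python
from typing import Dict, List, Tuple


def max_frequency_difference(freqs: Dict[str, float]) -> float:
    """Largest pairwise difference in insertion allele frequency across populations."""
    values = list(freqs.values())
    return max(values) - min(values)


def select_aims(
    site_freqs: Dict[str, Dict[str, float]], min_delta: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (site, delta) pairs with delta >= min_delta, most differentiated first."""
    scored = [(site, max_frequency_difference(f)) for site, f in site_freqs.items()]
    kept = [(site, delta) for site, delta in scored if delta >= min_delta]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy per-population insertion allele frequencies (hypothetical values).
toy_freqs = {
    "Alu_site_1": {"AFR": 0.75, "EUR": 0.125, "EAS": 0.25},
    "L1_site_2": {"AFR": 0.5, "EUR": 0.5, "EAS": 0.25},
    "SVA_site_3": {"AFR": 0.0, "EUR": 0.5, "EAS": 0.875},
}
print(select_aims(toy_freqs))
```

A real marker panel would also need to account for genotyping accuracy and for linkage between nearby markers, which this sketch ignores.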

Effects of natural selection on polymorphic TE insertions

The ability to calculate polyTE allele frequencies genome-wide, as detailed in the previous section of this review, should prove to be critical for measuring the effects of natural selection on TE insertions. One aspect of natural selection on polyTE insertions is already abundantly clear: the role that negative (purifying) selection plays in eliminating deleterious insertions from the population. The fact that TE insertions are deleterious is underscored by the numerous studies that have linked TE insertions to human disease. We describe a number of such clinically related human TE studies in subsequent sections of this review. The deleterious nature of mutations generated by TE activity is not at all surprising when you consider that TE insertions can be hundreds to thousands of base pairs long. Such large-scale mutations are clearly far more substantial mutational changes than the more commonly considered SNPs. In addition, the simple fact that TE mutations are insertions of DNA sequence, rather than duplications or other re-arrangements, also attests to their potentially disruptive nature. Our own previous genome-wide study of polyTE insertions turned up several lines of evidence consistent with the action of negative selection on human TEs. First of all, human polyTE insertions tend to be found at very low allele frequencies within and between human populations. Indeed, the allele frequency spectrum of polyTE insertions is highly skewed toward the lower end, and even more so than seen for SNPs, consistent with purifying selection. In addition, polyTE insertions were found to be severely under-represented in functional genomic regions including genes and exons. Despite the documented deleterious effects of TE insertions, the majority of TE insertion events are likely to be neutral or nearly so. 
This can be attributed in part to humans' relatively low effective population size, which renders natural selection less able to eliminate individual TE insertions that have only moderate fitness effects. Indeed, most recently integrated Alu elements have been shown to evolve as neutral alleles, and relatively short L1 insertions, which result from 5′ truncations that occur frequently during L1 retrotransposition, have also been shown to evolve neutrally in the human genome. The findings on the neutrality of short L1 elements underscore the importance of distinguishing among different types of TE insertions, short versus long insertions in particular, when considering the potential effects of selection on TE polymorphisms. As alluded to previously, the results demonstrating the action of negative selection on human polyTE insertions are not surprising considering the disruptive nature of genic TE insertions and their known link to diseases. In addition to the deleterious effects of TE insertions on gene function, TEs can also affect fitness post-insertionally by mediating ectopic recombination and by causing cellular toxicity via DNA damage. Selection against longer TE insertions, along with their relative paucity in high recombining regions, is consistent with deleterious effects of ectopic recombination among dispersed TE copies. It has been suggested that TEs represent such a potent mutational threat that host genomes were forced to evolve global regulatory mechanisms to repress their activity. For example, a number of epigenetic regulatory systems may have originally evolved to defend against TE activity and were only subsequently coopted to serve as host gene regulators. Nevertheless, for us it is also particularly interesting to speculate as to a possible role for positive (adaptive) selection in sweeping polyTE insertions to (relatively) high frequencies along specific population lineages. 
If positive selection on polyTE insertions were to be detected, it would suggest that such sequences can somehow encode functional utility for the human genome. The possibility that TE sequences can provide functional utility for their host genomes is well supported by numerous studies on the phenomenon of exaptation, or molecular domestication, whereby formerly selfish TE sequences come to encode essential cellular functions. This has been seen most often in the context of regulatory sequences. Human TE sequences have been shown to provide a wide variety of gene regulatory sequences including promoters, enhancers, transcription terminators and several classes of small RNAs. Human TE sequences can also affect host gene regulation via changes in the local chromatin environment. However, all of the human TE-derived regulatory sequences studied to date correspond to relatively ancient TE insertions that are no longer capable of transposition and are consequently fixed with respect to their genomic locations. Accordingly, it is not known whether exaptation of TE sequences can occur on the far shorter time scale that would be needed in order for polyTE insertions to show evidence of evolving by positive selection. At this time, there are some tentative lines of evidence that are consistent with a role for positive selection in shaping the evolution of human polyTE insertions. Closer inspection of the polyTE insertion allele frequency spectrum mentioned above revealed a shift at the higher end of the spectrum, suggesting that some TE insertions may have increased in frequency owing to the effects of positive selection. This pattern was seen for Asian and European populations but not for African populations. Thus, it is also possible that this shift reflects genetic drift, and accordingly less efficacious selection, in human populations that have historically lower effective population sizes. Additional work is needed to distinguish between these two possibilities, positive selection and drift. 
There are also data from a more narrowly focused study on polymorphic L1 insertions showing patterns of linkage disequilibrium and extended haplotypes that are consistent with positive selection on human polyTE insertions. More detailed studies on human TE genetic variation will be needed to fully assess the role that positive selection has played in the evolution of polyTEs. The flood of whole genome sequence data coming from human genome initiatives around the world, coupled with the maturing computational techniques for characterizing polyTE insertions from those data, should provide ample opportunities for studies of this kind. In addition, the analytical framework for detecting positive selection at the genomic level is already well established and should be readily portable to genome-wide studies of TE genetic variation.
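The allele frequency spectrum referred to in this section can be tabulated directly from a matrix of 0/1/2 genotype calls. The following Python sketch is a minimal illustration, assuming rows are insertion sites and columns are individuals; the function names and the toy matrix are hypothetical. Sites that are monomorphic in the sample (no insertions, or insertions on every chromosome) are excluded.

```python
from collections import Counter
from typing import Dict, List


def allele_count_spectrum(sites: List[List[int]]) -> Dict[int, int]:
    """Map insertion allele count -> number of polymorphic sites with that count."""
    spectrum: Counter = Counter()
    for genotypes in sites:
        count = sum(genotypes)
        n_chromosomes = 2 * len(genotypes)
        # Keep only sites that are polymorphic within this sample.
        if 0 < count < n_chromosomes:
            spectrum[count] += 1
    return dict(sorted(spectrum.items()))


def fraction_at_or_below(spectrum: Dict[int, int], max_count: int) -> float:
    """Fraction of polymorphic sites whose allele count is at most max_count."""
    total = sum(spectrum.values())
    if total == 0:
        return 0.0
    low = sum(n for count, n in spectrum.items() if count <= max_count)
    return low / total


# Toy matrix: six sites genotyped in four individuals (hypothetical values).
toy_sites = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [2, 1, 2, 1],
    [0, 0, 0, 0],  # no insertion observed, excluded
    [2, 2, 2, 2],  # fixed in the sample, excluded
]
toy_spectrum = allele_count_spectrum(toy_sites)
print(toy_spectrum, fraction_at_or_below(toy_spectrum, 2))
```

A spectrum skewed toward low counts is the pattern expected under purifying selection, whereas an excess of sites at high counts is the kind of shift that would prompt a closer look for positive selection or drift.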

Clinical genetics of polymorphic TE insertions

TE insertions in Mendelian disease

Human TE insertions are relatively large scale mutations that are considered to be both rare and deleterious, particularly if they occur in genes or other functionally important genomic elements. In other words, TE insertions often correspond to highly penetrant mutations, and accordingly they have been linked to many Mendelian diseases. Indeed, the ability of L1 sequences to transpose was first confirmed by a study showing that a novel L1 insertion into the F8 (Coagulation Factor VIII) gene causes hemophilia A. Subsequent studies have implicated Alu insertions in a number of Mendelian diseases, including hemophilia B, cystic fibrosis and Apert syndrome. An SVA insertion in the BTK (Bruton Tyrosine Kinase) gene causes X-linked agammaglobulinaemia. Despite their known disease causing properties, TE insertion mutations are often not considered in screens for disease causing variants. For example, widely used exome based methods for disease variant discovery will necessarily overlook the contribution of TE insertions to human disease. Computational and experimental approaches to TE insertion discovery provide a number of potential advantages with respect to the discovery of TE mutations that can cause Mendelian diseases. As we have discussed previously, these kinds of approaches allow for the systematic and unbiased characterization of deleterious TE insertions genome-wide, a critical dimension of genomic approaches to the diagnosis of disease. In addition, characterization of the genomic landscape of TE insertions for large scale population based genome initiatives (Table 1) will provide an important reference panel of TE mutations that are found in healthy individuals for the purpose of screening for rare potential disease causing variants.
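The screening idea in the last sentence can be sketched as a simple filter: insertions found in a patient are kept as candidates only if they are absent from, or rare in, a reference panel of healthy individuals. This Python example is a minimal sketch under stated assumptions; the site keys, panel frequencies and the 1% threshold are hypothetical, and a real screen would also need to match insertion coordinates that differ by a few base pairs between callers.

```python
from typing import Dict, List


def rare_candidates(
    patient_sites: List[str],
    panel_freq: Dict[str, float],
    max_freq: float = 0.01,
) -> List[str]:
    """Keep patient insertions that are absent from, or rare in, the reference panel."""
    # Sites missing from the panel are treated as having frequency 0.
    return [site for site in patient_sites if panel_freq.get(site, 0.0) <= max_freq]


# Hypothetical panel of insertion allele frequencies in healthy individuals.
toy_panel = {
    "chr1:1000:Alu": 0.35,
    "chr7:5000:L1": 0.004,
}
toy_patient = ["chr1:1000:Alu", "chr7:5000:L1", "chrX:9000:SVA"]
print(rare_candidates(toy_patient, toy_panel))
```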

TE activity and cancer

There are a number of lines of evidence that indicate a relationship between the activity of human TEs and the etiology of cancer, particularly for the active subfamily of L1 elements. The initial studies that uncovered a potential connection between TEs and cancer focused on expression, both transcript and protein, of L1 elements in tumor tissue samples. While it was previously thought that L1 expression was largely repressed in somatic tissue, it has been shown that numerous L1 elements are also expressed in a wide variety of tumor types including testicular cancer, germ cell tumors and breast cancer. More recently, nearly half of all human cancers were found to be exclusively immunoreactive to L1 ORF1 encoded proteins compared to matched normal tissue samples, suggesting that the ORF1 proteins could serve as cancer diagnostic biomarkers. In addition to the aforementioned L1 expression analysis, numerous studies have employed next-generation sequence analysis based techniques, followed by validation with PCR and Sanger sequencing, in order to characterize the TE insertion landscape of human cancers. Tumor genome sequences from a wide variety of cancer types have been found to be enriched for L1 insertions; these include colorectal tumors, esophageal carcinoma and gastrointestinal tumors. In one particularly broad survey, 53% of 244 cancer genomes were found to have L1 insertions, many of which included 3′ transduced sequences that are introduced as copying errors from run-on transcripts during the reverse transcription process. As was the case for TE cancer expression research, these surveys of TE insertion in cancer genomes were suggestive and interesting but did not necessarily establish a causal relationship for TE activity in the etiology of cancer (i.e., tumorigenesis). A smaller number of studies have shown even more direct evidence that specific TE insertions play a causal role in the etiology of cancer. 
The application of the RC-Seq technique to 19 hepatocellular carcinoma genome sequences uncovered two different L1 insertions, each of which initiated tumorigenesis via a different oncogenic pathway. Independent L1 insertions were found in the MCC (Mutated in Colorectal Cancers) and ST18 (Suppression of Tumorigenicity) tumor suppressor genes in this study. Perhaps the strongest evidence for an L1 insertion that is an actual driver mutation for tumorigenesis was recently reported for colorectal cancer. Investigators in this study found a somatic L1 insertion in one allele of the APC (Adenomatous Polyposis Coli) tumor suppressor gene, and they showed that this L1 insertion coupled with a point mutation in the second allele of the same gene to initiate tumorigenesis via the so-called two hit colorectal cancer pathway.

TE activity and aging

A number of recent studies have also suggested a link between human TE activity and aging. The connection between TE activity and aging is tangentially supported by TEs' involvement in a number of age related diseases including cancer, as described in the previous section, and intriguingly the rate of L1 transposition in cancer does appear to increase with age. More direct evidence for a link between TE activity and aging comes from several studies that have uncovered evidence that the epigenetic silencing of human TEs declines with age. For example, the methylation levels of Alu were shown to decline with age as was heterochromatin based silencing of L1 elements. In senescent fibroblast cells, the chromatin environment for relatively young members of the Alu, L1 and SVA families becomes progressively more open leading to an increase in both their levels of transcription and rates of transposition. Finally, transcriptional de-repression of Alu elements has been implicated in aging via an indirect route linked to nuclear cytotoxicity in senescent stem cells that is caused by DNA damage. Upregulated Alu transcription in senescent stem cells inhibits their ability to efficiently repair DNA damage, and suppression of Alu transcription was shown to reverse this effect. Research on the connection between TE activity and aging is a particularly new and promising area of investigation.

Polymorphic TE insertion associations with common diseases

The association of TE insertions with both Mendelian disease and cancer, discussed in previous sections, rests on the assumptions that TE mutations are rare, deleterious and penetrant. However, recent results from analysis of the 1KGP sequences indicate that numerous TE insertions can be found in the genomes of healthy individuals. Population genetic analysis of these data has shown that TE polymorphisms segregate within and between human populations and can, albeit relatively rarely, increase to high allele frequencies. In other words, TE polymorphisms can in some cases come to represent common genetic variants. Common genetic variants of this kind, also referred to as common mutations, have been widely used over the last decade or so in association studies that aim to characterize the genetic architecture of common human diseases or conditions. Genomic characterization of TE insertion genotypes, for hundreds of thousands of individuals among various human populations, can provide an ideal source of data for genome wide association studies (GWAS), which to date have almost exclusively been conducted using SNPs. GWAS require hundreds or thousands of cases and controls in order to have sufficient statistical power to detect associations between common genetic variants and disease. Despite the drastic decreases in the cost of whole genome sequencing over the last several years, it is still not practical to use this approach for most GWAS. Accordingly, these studies rely on the use of array technology to characterize variant alleles for hundreds of thousands of known SNPs genome-wide. This approach yields disease associations with SNP alleles that do not necessarily represent causal mutations. In other words, an associated SNP may simply tag a genomic region that contains a nearby disease causing variant that is in linkage disequilibrium (LD) with the associated SNP. 
The existence of LD structure provides an important opportunity for the association of TE insertion polymorphisms with common diseases. As more and more whole genome sequences accumulate from the various genome sequencing initiatives around the world, the genomic landscape of TE insertions should become increasingly well characterized, assuming computational methods for TE insertion detection are accurately applied to these data. The accumulation of thousands of whole genome sequences, from diverse human populations, that include genome-wide catalogs of TE insertion genotypes provides the opportunity for imputation of TE insertion genotypes via comparison with SNP array data. In this way, TE insertion polymorphisms could be associated with disease via thousands of existing GWAS studies along with untold numbers of future GWAS. The potential of this approach to TE GWAS is supported by a recent genome-wide survey of human L1 insertions that found abundant evidence of LD between these TE polymorphisms and nearby SNPs.
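Whether a TE insertion is tagged by a nearby SNP can be assessed from unphased genotypes by taking the squared Pearson correlation between the two sets of 0/1/2 calls, a commonly used approximation to the LD statistic r². The Python sketch below assumes that approximation and uses hypothetical toy genotypes; it is not the method of the L1 survey cited above.

```python
from typing import List


def genotype_r2(te_calls: List[int], snp_calls: List[int]) -> float:
    """Squared Pearson correlation between two vectors of 0/1/2 genotype calls."""
    n = len(te_calls)
    if n == 0 or n != len(snp_calls):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_te = sum(te_calls) / n
    mean_snp = sum(snp_calls) / n
    cov = sum((a - mean_te) * (b - mean_snp) for a, b in zip(te_calls, snp_calls))
    var_te = sum((a - mean_te) ** 2 for a in te_calls)
    var_snp = sum((b - mean_snp) ** 2 for b in snp_calls)
    # A monomorphic variant carries no information about LD.
    if var_te == 0 or var_snp == 0:
        return 0.0
    return (cov * cov) / (var_te * var_snp)


# Toy genotypes for one TE insertion and two nearby SNPs in the same individuals.
toy_te = [0, 2, 0, 2]
toy_snp_tagging = [0, 2, 0, 2]
toy_snp_unlinked = [0, 0, 2, 2]
print(genotype_r2(toy_te, toy_snp_tagging), genotype_r2(toy_te, toy_snp_unlinked))
```

A SNP with a high value is a good proxy for the insertion, which is the property that genotype imputation from SNP array data relies on.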

TE insertion associations with quantitative traits

The same logic that applies to the association of TE insertion polymorphisms with common diseases via GWAS can be used to associate TE polymorphisms with a number of different quantitative traits. These include anthropometric phenotypes, measures of human performance and a wide variety of so-called endophenotypes, which are considered as intermediate physiological traits that underlie higher order, observable phenotypes. Gene expression levels are perhaps the most widely studied class of endophenotype. Expression quantitative trait loci (eQTL) analysis correlates levels of gene expression with genetic variant genotypes in order to characterize the influence of genetic variants on gene regulation. As is the case with GWAS, the vast majority of eQTL studies compare SNP genotypes with gene expression levels. However, more recent studies have begun to analyze different classes of genetic variants using the eQTL framework. For example, copy number genotypes for short tandem repeat sequences at >2,000 loci were recently shown to be associated with the expression of numerous human genes using an eQTL approach. In addition, the Structural Variation Group of the 1KGP used the eQTL approach to quantify the influence of structural variants on human gene expression using RNA-seq data characterized for 1KGP samples from European and African populations by the GEUVADIS project. Many of the large scale genome initiatives listed in Table 1 will include abundant donor meta-data along with their genome sequences. For example, the NHLBI TOPMed precision medicine initiative will collect molecular, behavioral, imaging, environmental, and patient clinical data along with a variety of omics data sources, including DNA methylation, metabolite and RNA expression profiles. 
These kinds of quantitative data can all be compared to the genetic variants that will be characterized by whole genome sequencing, including TE insertion polymorphisms, in order to characterize the genetic architecture of a variety of quantitative human traits. Imputation of TE genotypes onto SNP array data, as described previously in the context of GWAS, could also provide abundant opportunities to characterize TE-eQTLs in particular. The GTEx eQTL project, for instance, has compared genome-wide SNP genotypes from hundreds of individuals to their RNA-seq gene expression data for 53 human tissue types. Imputation of TE insertion genotypes onto the SNP arrays used for this study could lead to the discovery of TE influences on human gene expression related to a wide variety of phenotypes.
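At its simplest, a TE-eQTL test regresses a gene's expression level on the insertion genotype (0, 1 or 2 copies) across individuals. The Python sketch below fits that single-variant least squares line using only the standard library; the function name and toy values are hypothetical, and real eQTL analyses add covariates, expression normalization and multiple testing correction, none of which are shown here.

```python
from typing import List, Tuple


def eqtl_fit(genotypes: List[int], expression: List[float]) -> Tuple[float, float]:
    """Least squares slope and intercept of expression on insertion genotype."""
    n = len(genotypes)
    if n == 0 or n != len(expression):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("insertion is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx  # change in expression per additional insertion copy
    intercept = mean_e - slope * mean_g
    return slope, intercept


# Toy data: expression rises by two units with each insertion copy.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_expression = [5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
print(eqtl_fit(toy_genotypes, toy_expression))
```

The slope is the per-copy effect of the insertion on expression; in practice its significance would be assessed against a null distribution before calling the site a TE-eQTL.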

Conclusions and prospects

Human TE research has been profoundly influenced by the ongoing revolution in genomic technology. There are a number of new computational and experimental approaches that allow for the genome-wide characterization of TE insertions across numerous samples. These kinds of techniques are continually being refined and improved, and this process often goes hand-in-hand with large scale genome sequencing initiatives, such as was the case for the 1KGP. These new approaches are making it possible to study the population and clinical genetics of human TEs at the genome-scale for the first time. The explosion of genome sequencing initiatives, which are often explicitly motivated by evolutionary or clinical considerations, will provide abundant opportunities for the application of these novel genomic techniques for TE discovery and research. Nevertheless, the sheer abundance of the data that is being generated by such initiatives will provide substantial challenges to the research community. The temptation could exist to focus on the most easily accessible sequence variants, i.e., SNPs, and disregard the more difficult to characterize structural variants. We feel that this would be a mistake, as it is simply not possible to appreciate the full scope of human genetic variation without considering TE insertion polymorphisms. Hopefully the new genomic technologies for TE discovery and characterization will come to be even more widely used and applied for future genome powered studies of human genetics.
