Juan O Lopez, Jaime Seguel, Andres Chamorro, Kenneth S Ramos.
Abstract
BACKGROUND: Long interspersed element 1 (LINE-1 or L1) retrotransposons are mobile elements that constitute 17-20% of the human genome. Strong correlations between abnormal L1 expression and several human diseases have been reported, which has motivated increasing interest in accurately quantifying the number of L1 copies present in a given biological specimen. A main obstacle is that L1s are relatively long DNA segments with regions of high variability and are largely present in the human genome as truncated fragments. These particularities render traditional alignment strategies, such as seed-and-extend, inefficient, because the number of segments that are similar to L1s grows exponentially. This study uses a pattern matching methodology for more accurate identification of L1s. We experimentally validate the superiority of pattern matching for L1 detection over alternative methods and discuss some of its potential applications.
Keywords: GFF; K-mer; LINE-1; Probe
Year: 2022 PMID: 36100885 PMCID: PMC9472350 DOI: 10.1186/s12859-022-04907-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Pattern matching overview
Fig. 2 Relationship between random LINE-1 insertion/deletion and pattern count in Chrm 2
Fig. 3 Relationship between random LINE-1 insertion/deletion and pattern count in Chrm 16
F1 scores of L1PD outputs for different values of parameters m, t, and δ
| δ |  |  | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| 5 | 650 | 7 | 0.74503 | 0.55979 | 0.63926 |
| 5 | 675 | 7 | 0.74481 | 0.56001 | 0.63932 |
| 5 | 700 | 7 | 0.74465 | 0.56016 | |
| 5 | 725 | 7 | 0.74445 | 0.56023 | 0.63932 |
| 5 | 750 | 7 | 0.74414 | 0.56038 | 0.63931 |
| 10 | 625 | 9 | 0.79454 | 0.58554 | 0.6742 |
| 10 | 650 | 9 | 0.79409 | 0.58591 | 0.67429 |
| 10 | 675 | 9 | 0.7937 | 0.5862 | |
| 10 | 700 | 9 | 0.79321 | 0.58642 | 0.67431 |
| 10 | 725 | 9 | 0.7925 | 0.58642 | 0.67405 |
| 15 | 650 | 9 | 0.79167 | 0.58956 | 0.67582 |
| 15 | 675 | 9 | 0.7912 | 0.58986 | |
| 15 | 700 | 9 | 0.7908 | 0.59008 | |
| 15 | 725 | 9 | 0.78995 | 0.59008 | 0.67553 |
| 15 | 750 | 9 | 0.78937 | 0.59022 | 0.67541 |
| 20 | 650 | 9 | 0.7894 | 0.59417 | 0.678 |
| 20 | 675 | 9 | 0.78893 | 0.59468 | 0.67816 |
| 20 | 700 | 9 | 0.78856 | 0.59498 | |
| 20 | 725 | 9 | 0.78774 | 0.59505 | 0.67796 |
| 20 | 750 | 9 | 0.78717 | 0.5952 | 0.67785 |
| 25 | 650 | 9 | 0.78842 | 0.59395 | 0.6775 |
| 25 | 675 | 9 | 0.78788 | 0.59447 | 0.67764 |
| 25 | 700 | 9 | 0.78745 | 0.59483 | |
| 25 | 725 | 9 | 0.78663 | 0.5949 | 0.67745 |
| 25 | 750 | 9 | 0.78606 | 0.59505 | 0.67734 |
| 30 | 650 | 9 | 0.78819 | 0.59366 | 0.67722 |
| 30 | 675 | 9 | 0.78764 | 0.59417 | 0.67735 |
| 30 | 700 | 9 | 0.78721 | 0.59454 | |
| 30 | 725 | 9 | 0.78639 | 0.59461 | 0.67718 |
| 30 | 750 | 9 | 0.78587 | 0.5949 | 0.67717 |
Highest F1 Score for each value of δ is in bold
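
For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for the first row of the table from that row's precision and recall. The function name `f1_score` is ours and is not part of L1PD. Because the tabulated precision and recall are already rounded, recomputed values may differ from the published ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First row of the table above: precision 0.74503, recall 0.55979.
first_row_f1 = f1_score(0.74503, 0.55979)
print(round(first_row_f1, 5))
```

The same function can be applied to any other row to sanity-check its reported F1 score.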
Fig. 4 Size vs. time comparison. The execution time of our pipeline is linearly dependent on the size of the input FASTQ files
Pipeline running time
| Sample | FASTQ size (GB) | Pre-processing (min.) | L1PD (min.) |
|---|---|---|---|
| HG02153 | 3.05 | 93 | 29 |
| HG00119 | 3.51 | 129 | 28 |
| HG00114 | 8.03 | 198 | 27 |
| HG01383 | 8.98 | 235 | 28 |
| HG00304 | 9.23 | 227 | 28 |
| HG01612 | 9.70 | 250 | 28 |
| HG01883 | 15.93 | 354 | 28 |
| HG01275 | 19.05 | 399 | 28 |
| HG01840 | 29.33 | 642 | 29 |
| HG00551 | 30.98 | 686 | 28 |
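
To make the linear relationship between input size and running time concrete, the following sketch fits an ordinary least-squares line to the pre-processing times in the table. It is only an illustration: the helper `least_squares_line` is a hypothetical name, and the fit is our own check on the tabulated numbers rather than an analysis reported by the authors.

```python
# (FASTQ size in GB, pre-processing time in minutes), copied from the table above.
runs = [(3.05, 93), (3.51, 129), (8.03, 198), (8.98, 235), (9.23, 227),
        (9.70, 250), (15.93, 354), (19.05, 399), (29.33, 642), (30.98, 686)]


def least_squares_line(points):
    """Return (slope, intercept, r_squared) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = least_squares_line(runs)
print(f"pre-processing ~ {slope:.1f} min/GB * size + {intercept:.1f} min "
      f"(R^2 = {r_squared:.3f})")
```

Note that only the pre-processing column grows with input size; the L1PD column stays between 27 and 29 minutes for every sample in the table.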
L1PD vs. BLASR vs. MUMmer4
| | L1PD | BLASR | MUMmer4 |
|---|---|---|---|
| Finding L1s in general | Successful with no additional input | Successful with all L1s provided as input | Too many results with all L1s provided as input |
| Finding L1s with L1PD probes | Successful | Very few results | No results |
Fig. 5 Sample GFF3 output generated by L1PD
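
GFF3 is a tab-separated format with nine columns per feature (seqid, source, type, start, end, score, strand, phase, attributes), where `.` marks an undefined value. As a rough illustration of what such a record looks like, the sketch below formats a single feature line. The coordinates, the `mobile_genetic_element` type, and the `ID` value are placeholders chosen for illustration and are not taken from L1PD's actual output; the helper `gff3_line` is likewise hypothetical and omits the escaping of reserved characters that GFF3 requires.

```python
def gff3_line(seqid, source, feature_type, start, end,
              score=".", strand=".", phase=".", attributes=None):
    """Format one GFF3 feature line: nine tab-separated columns,
    with 1-based inclusive start/end coordinates."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    # Attributes are key=value pairs joined by ';' ('.' when there are none).
    attrs = ";".join(f"{k}={v}" for k, v in (attributes or {}).items()) or "."
    fields = (seqid, source, feature_type, start, end, score, strand, phase, attrs)
    return "\t".join(str(field) for field in fields)


# Placeholder record; none of these values come from real L1PD output.
print("##gff-version 3")
print(gff3_line("chr2", "L1PD", "mobile_genetic_element", 1000, 7000,
                strand="+", attributes={"ID": "example_L1_1"}))
```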
Sample of CNVPG values
| Sample | Pattern count | CNV value |
|---|---|---|
| HG02153 | 8,137 | −0.02457 |
| HG00119 | 8,139 | 0 |
| HG00114 | 8,140 | 0.01229 |
| HG01383 | 8,138 | −0.01229 |
| HG00304 | 8,135 | −0.04915 |
| HG01612 | 8,128 | −0.13515 |
| HG01883 | 8,136 | −0.03686 |
| HG01275 | 8,140 | 0.01229 |
| HG01840 | 8,126 | −0.15973 |
| HG00551 | 8,131 | −0.09829 |
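
This section does not state how the CNV value is derived from the pattern count. The tabulated values are, however, consistent to within rounding with a percent difference relative to a reference count of 8,139 (the count of the sample whose CNV value is 0). The sketch below encodes that reading. Both the formula and the name `cnv_percent` are assumptions inferred from the numbers, not the authors' stated definition.

```python
REFERENCE_COUNT = 8139  # assumption: the pattern count that maps to a CNV value of 0


def cnv_percent(pattern_count: int, reference: int = REFERENCE_COUNT) -> float:
    """Percent difference of a pattern count relative to the reference count
    (assumed formula, inferred from the table above)."""
    return (pattern_count - reference) / reference * 100


for sample, count in [("HG02153", 8137), ("HG00114", 8140), ("HG01612", 8128)]:
    print(sample, round(cnv_percent(count), 5))
```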
Fig. 6 Histogram of LINE-1s per chromosome
Fig. 7 Effect on LINE-1 distribution after random LINE-1 insertion/deletion in Chrm X
Fig. 8 L1PD mode flowchart