Akihiro Fujimoto1,2, Jing Hao Wong3,4, Yukiko Yoshii4, Shintaro Akiyama5,6, Azusa Tanaka3,4, Hitomi Yagi4, Daichi Shigemizu5,6, Hidewaki Nakagawa5,6, Masashi Mizokami7, Mihoko Shimada3.
Abstract
BACKGROUND: Identification of germline variation and somatic mutations is a major issue in human genetics. However, due to the limitations of DNA sequencing technologies and computational algorithms, our understanding of genetic variation and somatic mutations is far from complete.
Keywords: Germline SVs; Long reads; Origin of structural variations (SVs); Somatic SVs
Year: 2021 PMID: 33910608 PMCID: PMC8082928 DOI: 10.1186/s13073-021-00883-1
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Result of benchmarking using NA19240. Indels (≥ 100 bp) were identified with several software tools from the fastq file of NA19240 and compared with a gold standard SV callset, as performed previously [13]. We classified SVs as “common” (detected in both the gold standard SV calls and each caller), “gold standard SV call only” (detected only in the gold standard SV calls), or “caller only” (detected only by each caller). SVs in “gold standard SV call only” were considered false negatives and those in “caller only” false positives, and F-measures were calculated with sensitivity and specificity. SVs were classified based on the repeat information of their regions: SVs in tandem repeat regions, self-chain regions, regions in both repeats, and non-repeat regions were evaluated separately. “CAMPHOR” is the result of the parameter setting used in this study, and “CAMPHOR (appropriate parameter set for 20× data)” is the result of the released version
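The three categories in the Fig. 1 caption map onto true positives, false negatives, and false positives, so an F-measure can be sketched from them. The caption mentions sensitivity and specificity; because true negatives are not defined by these categories, the sketch below assumes the common definition of the F-measure as the harmonic mean of precision and recall (sensitivity). The function name and the counts in the usage note are hypothetical and are not taken from the paper.

```python
def f_measure(common: int, gold_only: int, caller_only: int) -> float:
    """F-measure from the three caption categories (assumed definition).

    common      -> true positives  (in gold standard and caller)
    gold_only   -> false negatives (in gold standard only)
    caller_only -> false positives (in caller only)
    """
    tp, fn, fp = common, gold_only, caller_only
    if tp == 0:
        # No shared calls: precision and recall are both zero (or undefined).
        return 0.0
    recall = tp / (tp + fn)      # sensitivity
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)
```

For example, `f_measure(80, 20, 20)` uses illustrative counts in which recall and precision are both 0.8, so the F-measure is also 0.8.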
Fig. 2 Germline insertions and deletions. a Length distribution of insertion and deletion events. Insertions and deletions with length ≥ 100 bp were identified and classified as insertion or deletion events, and their lengths were compared. b Classification of the repeat(s) in insertion events. c Classification of the repeat(s) in deletion events. We considered sequences as covered by repeat(s) if repeat(s) occupied ≥ 80% of the sequences. d Classification of SINE families in insertion and deletion events. Families of SINEs that covered indel regions were analyzed. Families with count ≤ 5 were classified as others. e Classification of LINE families in insertion and deletion events. Families of LINEs that covered the indel regions were analyzed. Families with count ≤ 5 were classified as others. f Composition of multi-repeats covering indels. Families with count ≤ 5 were classified as others. g Combination of LINE subfamilies of two LINEs covering insertion events. h Strands of the combination of two LINEs found in insertion events. Combinations of different strands were significantly more frequent than those of the same strand in L1HS-L1HS and L1HS-L1P1 (Fisher’s exact test)
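The ≥ 80% rule in the Fig. 2 caption can be illustrated with a small interval computation. This is a minimal sketch, assuming that repeat annotations are given as half-open `(start, end)` intervals in coordinates of the inserted or deleted sequence and that coverage is measured on the union of all repeats; the caption does not specify either detail, and the function names are hypothetical.

```python
def repeat_coverage(seq_len, intervals):
    """Fraction of [0, seq_len) covered by the union of half-open intervals."""
    if seq_len <= 0:
        return 0.0
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        # Clip each annotation to the sequence boundaries.
        start, end = max(0, start), min(seq_len, end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: close it and start anew.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / seq_len


def is_repeat_covered(seq_len, intervals, threshold=0.8):
    """True if repeats occupy at least `threshold` of the sequence."""
    return repeat_coverage(seq_len, intervals) >= threshold
```

Merging overlapping annotations before summing avoids counting the same bases twice when two repeats overlap.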
Fig. 3 Polymorphic insertion of processed pseudogenes. Analysis of long reads revealed the entire structures of insertions of processed pseudogenes. a Examples of processed pseudogenes. The inserted sequences of a 3017-bp insertion at chr12:125316602_125316603 and a 1363-bp insertion at chr4:7947946_7947947 were mapped to exons of the TDG and MOSMO genes. b An example of a non-reference poly(A) sequence in a polymorphic insertion of a processed pseudogene. An output of web-BLAT is shown. Blue and black characters indicate aligned and unaligned bases, respectively. Black characters in the sequences show the nucleotides that are not found in the reference genome sequence. c Average expression level of candidate origin genes of processed pseudogenes. Expression data were obtained from GTEx [33], and the average expression level was calculated. The expression levels were compared between the candidate origins and other genes (Wilcoxon signed-rank test)
Fig. 4 Analysis of the mechanism of germline insertion and deletion events. a Possible mechanism of germline deletion events. The mechanism was estimated from the flow in Fig. S12 (Additional file 1). b Analysis of repeats at the breakpoints for each deletion mechanism. Combinations of repeats at both breakpoints are shown. NR, non-repeat. “Within” indicates that both breakpoints were included in the same repeat or non-repeat. c Replication timing and insertion events. d Replication timing and deletion events. Replication timing is represented by four groups from left to right, earliest to latest. The proportions of each category are represented by colored bars. The black line is at 0.25, the proportion expected in each group if the categories were divided equally among them. NAHRs were biased toward early replicating regions
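The comparison against the 0.25 line in Fig. 4c and 4d amounts to computing, for one category of events, the share that falls into each of the four replication-timing groups. The sketch below assumes each event has already been assigned a group index from 0 (earliest) to 3 (latest); how the groups were derived is not described in the caption, and the function name is hypothetical.

```python
from collections import Counter


def timing_group_proportions(groups, n_groups=4):
    """Proportion of events in each replication-timing group.

    `groups` is an iterable of group indices (0 = earliest, n_groups - 1 = latest).
    With equal division, each group would be expected to hold 1 / n_groups.
    """
    counts = Counter(groups)
    total = sum(counts[g] for g in range(n_groups))
    if total == 0:
        return [0.0] * n_groups
    return [counts[g] / total for g in range(n_groups)]
```

A category whose first value is well above 0.25 is enriched in the earliest-replicating group, which is the kind of bias the caption reports for NAHRs.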
Fig. 5 Somatic SVs in liver cancers. a Comparison of somatic SVs between the short-read study and the current study. Somatic SVs detected by this study were compared with a previous short-read study [2]. b Sensitivity of SV calling in the current study. We calculated the sensitivity for SVs identified by short reads. Variant allele frequencies (VAFs) were estimated from short reads [2]. c Possible mechanism of somatic SVs. The mechanism was estimated as done for germline deletions. The proportions were significantly different between germline deletions and somatic SVs (p value = 9.9 × 10^−77, chi-square test). d Comparison of HBV integrations between the short-read study and this study. HBV integrations identified by this study were compared with a previous short-read study [2]