Jing Sun, Yanfang Zhang, Minhui Wang, Qian Guan, Xiujia Yang, Jin Xia Ou, Mingchen Yan, Chengrui Wang, Yan Zhang, Zhi-Hao Li, Chunhong Lan, Chen Mao, Hong-Wei Zhou, Bingtao Hao, Zhenhai Zhang.
Abstract
Identification of genetic variants via high-throughput sequencing (HTS) technologies has been essential for both fundamental and clinical studies. However, to what extent the genome sequence composition affects variant calling remains unclear. In this study, we identified 63,897 multi-copy sequences (MCSs) with a minimum length of 300 bp, each of which occurs at least twice in the human genome. The 151,749 genomic loci (multi-copy regions, or MCRs) harboring these MCSs account for 1.98% of the genome and are distributed unevenly across chromosomes. MCRs containing the same MCS tend to be located on the same chromosome. Gene Ontology (GO) analyses revealed that 3800 genes whose UTRs or exons overlap with MCRs are enriched for Golgi-related cellular component terms and various enzymatic activities in the GO biological function category. MCRs are also enriched for loci that are sensitive to neocarzinostatin-induced double-strand breaks. Moreover, genetic variants discovered by genome-wide association studies and recorded in dbSNP are significantly underrepresented in MCRs. Using simulated HTS datasets, we show that false variant discovery rates are significantly higher in MCRs than in other genomic regions. These results suggest that extra caution must be taken when identifying genetic variants in the MCRs via HTS technologies.
Keywords: Genetic study; High-throughput sequencing; Multi-copy region; Multi-copy sequence; Variant discovery
Year: 2020; PMID: 32827758; PMCID: PMC8377240; DOI: 10.1016/j.gpb.2019.05.004
Source DB: PubMed; Journal: Genomics Proteomics Bioinformatics; ISSN: 1672-0229; Impact factor: 7.691
Figure 1. Identification and chromosomal distribution of the MCRs
A. Tiling seed sequences of 300 bp in length with a 1 bp interval (top panel) from ChrN1 are mapped to the reference genome hg19. A set of consecutive seed sequences perfectly mapped to both their origin locus on ChrN1 (blue bar on the bottom left) and another locus on ChrN2 (red bar on the bottom right). These two sequence regions with a length of at least 300 bp are thus defined as MCRs. B. Distribution of MCR seeds over different length spans. C. Distribution of MCR groups with different members. MCR, multi-copy region.
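The seed-tiling idea in panel A can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the caption says seeds were mapped to hg19, whereas this toy version uses an in-memory dictionary of exact k-mers, considers the forward strand only, and would not scale to a whole genome. The function name `find_multicopy_regions`, the toy sequences, and the choice of `k=5` for the example are all assumptions made for illustration.

```python
from collections import defaultdict


def find_multicopy_regions(chroms, k=300):
    """Toy exact-match version of the seed-tiling idea (forward strand only).

    chroms: dict mapping a chromosome name to its sequence string.
    Returns (name, start, end) tuples, 0-based and half-open, covering runs
    of consecutive k-bp seeds that occur at two or more loci.
    """
    # Tile seeds with a 1 bp step and record every locus of each seed.
    seed_loci = defaultdict(list)
    for name, seq in chroms.items():
        for start in range(len(seq) - k + 1):
            seed_loci[seq[start:start + k]].append((name, start))

    # Keep only seed loci whose sequence is found at more than one locus.
    multi = sorted(
        locus for loci in seed_loci.values() if len(loci) >= 2 for locus in loci
    )

    # Merge consecutive seed starts on the same chromosome into one region.
    runs = []
    for name, start in multi:
        if runs and runs[-1][0] == name and start == runs[-1][2] + 1:
            runs[-1][2] = start
        else:
            runs.append([name, start, start])
    return [(name, first, last + k) for name, first, last in runs]


# Hypothetical toy genome: "GATTACAG" is present on both chromosomes.
toy = {"chrA": "CCCCCGATTACAGTTTTT", "chrB": "AAGATTACAGAA"}
regions = find_multicopy_regions(toy, k=5)
print(regions)
```

Each reported region is at least `k` bp long by construction, which mirrors the 300 bp minimum length stated in the caption.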
Figure 2. Biological and clinical significance of MCRs
A. GO enrichment analysis of MCR-overlapping genes (adjusted P value < 0.01). B. Overlap of MCRs with DSBs identified by Crosetto et al. [37] for cells treated with aphidicolin (left) and neocarzinostatin (right). The number in the intersection indicates the number of MCRs overlapping with DSBs. The bar graphs show the enrichment test results for aphidicolin or neocarzinostatin treatment. The real-world dataset contains DSBs overlapping with MCRs, and the simulated dataset contains DSBs overlapping with regions randomly chosen from the genome. Data are presented as mean ± SD (n = 1000). The chi-squared test was used for statistical analysis (***, P < 0.001). C. Clinical significance classification of the ClinVar records in the MCRs. GO, gene ontology; DSB, double-strand break.
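The real-versus-random comparison in panel B can be sketched in Python as follows. This is only an illustration under simplifying assumptions: a single linear coordinate system stands in for the genome, DSBs are treated as point positions, and random regions keep the lengths of the original regions. It reproduces the mean ± SD summary over repeated random draws, not the chi-squared test reported in the caption. All names and numbers below are hypothetical.

```python
import random
import statistics


def count_overlapping(regions, dsb_sites):
    """Count regions (start, end), half-open, that contain at least one DSB site."""
    return sum(
        any(start <= pos < end for pos in dsb_sites) for start, end in regions
    )


def simulate_overlaps(regions, dsb_sites, genome_length, n_sim=1000, seed=0):
    """Re-place each region at a random position n_sim times.

    Returns the mean and sample standard deviation of the overlap counts.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sim):
        placed = []
        for start, end in regions:
            length = end - start
            new_start = rng.randrange(genome_length - length + 1)
            placed.append((new_start, new_start + length))
        counts.append(count_overlapping(placed, dsb_sites))
    return statistics.mean(counts), statistics.stdev(counts)


# Hypothetical toy data, not taken from the study.
mcrs = [(100, 400), (1000, 1300), (5000, 5300)]
dsbs = [150, 1200, 5100, 9000]
observed = count_overlapping(mcrs, dsbs)
sim_mean, sim_sd = simulate_overlaps(mcrs, dsbs, genome_length=10_000)
print(observed, sim_mean, sim_sd)
```

An observed count well above the simulated mean would point to enrichment; a formal test would still be needed to attach a P value.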
Figure 3. Variant detection in simulated datasets
A. Accuracy and false discovery rates in MCRs and their flanking regions using different sequencing strategies. Data are presented as mean ± SD, with error bars shown in red. Left panel, accuracy rate; center panel, false positive rate; right panel, false negative rate. Means and standard deviations are shown in Table S8. B. Statistical differences in variant detection accuracies among different sequencing strategies. NS, not significant; PE, paired-end sequencing. The independent samples t-test was used for statistical analysis (****, P < 0.0001).
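Because the variants in a simulated dataset are known in advance, called variants can be scored against that truth set. The Python sketch below shows one way to do so. The exact definitions of the accuracy, false positive, and false negative rates used in the figure are not given here, so the two fractions computed below are assumptions chosen for illustration, as are the function name `compare_calls` and the toy variants.

```python
def compare_calls(truth, called):
    """Score called variants against a simulated truth set.

    Variants are (chrom, pos, ref, alt) tuples. The two fractions are
    illustrative definitions, not necessarily those used in the study.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # simulated variants that were called
    fp = len(called - truth)   # calls with no matching simulated variant
    fn = len(truth - called)   # simulated variants that were missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "fp_fraction_of_calls": fp / len(called) if called else 0.0,
        "fn_fraction_of_truth": fn / len(truth) if truth else 0.0,
    }


# Hypothetical toy variants, not taken from the study.
truth = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 900, "G", "A"),
    ("chr2", 50, "T", "C"),
]
called = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 400, "T", "C"),
]
summary = compare_calls(truth, called)
print(summary)
```

Running the comparison separately on variants inside MCRs and on variants in their flanking regions would give the kind of per-region contrast the figure describes.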