Zhongqu Duan1,2, Yuyang Qiao1, Jinyuan Lu1, Huimin Lu1, Wenmin Zhang1, Fazhe Yan1, Chen Sun1, Zhiqiang Hu1, Zhen Zhang3, Guichao Li3, Hongzhuan Chen4, Zhen Xiang5, Zhenggang Zhu5, Hongyu Zhao2,6, Yingyan Yu7, Chaochun Wei8,9,10.
Abstract
The human reference genome is still incomplete, especially for population-specific or individual-specific regions, which may have important functions. Here, we developed a HUman Pan-genome ANalysis (HUPAN) system to build the human pan-genome. We applied it to 185 deeply sequenced and 90 assembled Han Chinese genomes and detected 29.5 Mb of novel genomic sequences and at least 188 novel protein-coding genes missing from the human reference genome (GRCh38). It can be an important resource for human genome-related biomedical studies, such as cancer genome analysis. HUPAN is freely available at http://cgm.sjtu.edu.cn/hupan/ and https://github.com/SJTU-CGM/HUPAN.
Keywords: Core genome; Genome assembly; Pan-genome; Population-specific variation; Presence-absence variation (PAV)
Year: 2019 PMID: 31366358 PMCID: PMC6670167 DOI: 10.1186/s13059-019-1751-y
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 System diagram of the pan-genome construction subsystem in HUPAN. The seven processes are as follows: ① de novo assembly of all reads into contigs, ② removing contigs similar to the human reference genome, ③ extracting unaligned sequences (including fully unaligned sequences and partially unaligned sequences), ④ merging unaligned sequences from multiple individuals, ⑤ removing redundant sequences, ⑥ removing potential contamination, and ⑦ constructing the pan-genome by combining the human reference genome and the novel sequences
Comparison of HUPAN and EUPAN in extracting non-reference sequences from an individual genome
| | HUPAN | EUPAN |
|---|---|---|
| # raw contigs (> 500 bp) | 610,537 | 610,537 |
| raw contigs length (bp) | 2,709,735,693 | 2,709,735,693 |
| # contigs after filtering | 24,150 | – |
| contigs length after filtering (bp) | 76,168,613 | – |
| # misassemblies | 1037 | 1050 |
| Misassembled contigs length (bp) | 5,483,408 | 5,657,999 |
| # Fully unaligned contigs | 5371 | 5394 |
| Fully unaligned contigs length (bp) | 5,000,779 | 5,014,971 |
| # Partially unaligned contigs | 1187 | 1197 |
| Partially unaligned contigs length (bp) | 5,435,999 | 5,628,509 |
| CPU time (hours) | 42 | 275 |
| Maximum memory (GB) | 92 | 250 |
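The resource figures in the table above can be turned into ratios with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table, while the variable names are made up for this example and are not part of HUPAN or EUPAN.

```python
# Values copied from the comparison table above; names are illustrative only.
hupan = {"cpu_hours": 42, "max_memory": 92}
eupan = {"cpu_hours": 275, "max_memory": 250}

# How many times more CPU time and peak memory EUPAN needed than HUPAN.
cpu_ratio = eupan["cpu_hours"] / hupan["cpu_hours"]
memory_ratio = eupan["max_memory"] / hupan["max_memory"]

# Share of raw contigs (by count and by length) that HUPAN kept after filtering.
raw_contigs, raw_length = 610_537, 2_709_735_693
kept_contigs, kept_length = 24_150, 76_168_613
kept_count_pct = 100 * kept_contigs / raw_contigs
kept_length_pct = 100 * kept_length / raw_length

print(f"CPU time ratio (EUPAN / HUPAN): {cpu_ratio:.1f}x")
print(f"Peak memory ratio (EUPAN / HUPAN): {memory_ratio:.1f}x")
print(f"Contigs kept after filtering: {kept_count_pct:.2f}% by count, "
      f"{kept_length_pct:.2f}% by length")
```

Read this way, the filtering step leaves only a few percent of the raw contigs for the downstream steps, which is consistent with the lower CPU time and memory reported for HUPAN.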
Fig. 2 Summary of non-reference sequences for individual genomes. a The total length (Mb) and b the GC content (%) of unaligned contigs (including fully unaligned sequences and partially unaligned sequences) obtained for each individual after removing potential contamination. In b, the solid black line represents the GC content of the primary sequence in GRCh38 (40.87%); the dotted lines represent the GC content of the novel sequences of the YH genome [26] (red, 44.11%), the 5.8 Mb of novel contigs from SGDP [9] (green, 43.43%), and the novel sequences of the NA18507 genome (orange, 42.87%). The width of each plot indicates the frequency of samples with a given length or GC content
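As an illustration of the GC content statistic plotted in panel b, the following minimal sketch computes the GC percentage of a sequence and of a set of contigs pooled together. It assumes that ambiguous bases such as N are excluded from the denominator and that pooling is length-weighted; the source does not state either choice, and the function names are hypothetical.

```python
from typing import Iterable


def gc_counts(seq: str) -> tuple[int, int]:
    """Return (number of G/C bases, number of A/C/G/T bases) in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_content(seq: str) -> float:
    """GC percentage of one sequence, ignoring N and other ambiguous bases (assumption)."""
    gc, total = gc_counts(seq)
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def pooled_gc_content(contigs: Iterable[str]) -> float:
    """Length-weighted GC percentage over all contigs of one individual (assumption)."""
    gc_sum = total_sum = 0
    for contig in contigs:
        gc, total = gc_counts(contig)
        gc_sum += gc
        total_sum += total
    if total_sum == 0:
        raise ValueError("contigs contain no unambiguous bases")
    return 100.0 * gc_sum / total_sum


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(pooled_gc_content(["GC", "AAAAAT"]))
```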
Fig. 3 Characterization of sequences fully unaligned to GRCh38 primary assembly sequences in 185 deeply sequenced Han Chinese genomes. a Length distribution of fully unaligned sequences. b The total length of fully unaligned sequences (Mb) obtained by using lower identity (80–90%) to remove redundant sequences. c The sequence count and sequence size when aligning the sequences to GRCh38 primary assembly sequences with lower sequence identity (80–90%). d Simulation of the total fully unaligned sequences using different numbers of individuals. e The percentage of repeat elements identified by RepeatMasker; “hs38d1” is the 5.8 Mb of novel sequences from SGDP, and “GRCh38” is the primary assembly sequences of the human reference genome GRCh38. The RepeatMasker masked result of GRCh38 was downloaded from http://www.repeatmasker.org/species/hg.html. f Validation of fully unaligned sequences by aligning to other available human sequences (≥ 90% identity). “Aligned” defines the sequences that could be aligned to the target sequences, “Partially aligned” defines the sequences that could be partially aligned to the target sequences, “Aligned to other” defines the sequences that could not be aligned to the target sequences but could be aligned to the other six available human sequences, and “No alignment” defines the sequences that could not be aligned to any one of the seven data sets
Validation of fully unaligned sequences by aligning to other existing human sequences (≥ 90% identity). The last row shows the length of sequences that could not be aligned to any of the existing genomes
| Assembled genomes | Alignment (bp) | Partially unaligned (bp) | Fully unaligned (bp) |
|---|---|---|---|
| ALT | 9,383,032 | 2,641,297 | 18,693,461 |
| HuRef | 17,031,675 | 5,138,981 | 8,547,134 |
| WGSA | 13,261,237 | 5,836,552 | 11,620,001 |
| YH | 24,099,632 | 2,330,394 | 4,287,764 |
| KOREF | 24,374,934 | 2,012,184 | 4,330,672 |
| HX1 | 14,797,305 | 2,748,577 | 13,171,908 |
| GM12878 | 8,635,247 | 4,473,988 | 17,608,555 |
| No alignment | 646,233 | | |
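A quick consistency check on the validation table above: for each target genome, the three length columns should add up to the same total length of fully unaligned sequence. The sketch below copies the values from the table; the variable names are illustrative only.

```python
# (aligned, partially unaligned, fully unaligned) lengths in bp, copied from the table.
rows = {
    "ALT": (9_383_032, 2_641_297, 18_693_461),
    "HuRef": (17_031_675, 5_138_981, 8_547_134),
    "WGSA": (13_261_237, 5_836_552, 11_620_001),
    "YH": (24_099_632, 2_330_394, 4_287_764),
    "KOREF": (24_374_934, 2_012_184, 4_330_672),
    "HX1": (14_797_305, 2_748_577, 13_171_908),
    "GM12878": (8_635_247, 4_473_988, 17_608_555),
}
no_alignment_bp = 646_233  # last row: unaligned to any of the existing genomes

# Every row should sum to the same total if the table is internally consistent.
totals = {name: sum(lengths) for name, lengths in rows.items()}
distinct_totals = set(totals.values())
total_bp = next(iter(distinct_totals))

# Fraction of the total that has no alignment to any of the seven data sets.
no_alignment_pct = 100 * no_alignment_bp / total_bp

print(f"Distinct row totals: {sorted(distinct_totals)}")
print(f"Sequences with no alignment: {no_alignment_pct:.2f}% of {total_bp:,} bp")
```

All seven rows give the same total, so the three columns partition one fixed set of sequences for each target genome.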
Fig. 4 PAV profile analysis of 185 deeply sequenced Han Chinese genomes. a The number of genes present in an individual using different CDS coverage thresholds (80%, 85%, 90%, 95%, and 100%) versus the sequencing depth. b The gene PAV distribution across 185 individuals with a CDS coverage threshold of 0.95. c The number of core genes and the total number of genes in the pan-genome determined with different numbers of individuals. Each time, we randomly added one individual and calculated the number of core genes and the total number of genes. This process was repeated 100 times. d PAV profile of 606 distributed genes. Red indicates gene presence and blue indicates gene absence
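The resampling procedure described for panel c can be sketched as follows: individuals are added one at a time in random order, the core set (genes present in everyone so far) and the pan set (genes present in anyone so far) are tracked, and the counts are averaged over repeated shuffles. This is a minimal illustration under assumptions, not the HUPAN implementation: the function name, the dictionary-of-sets input format, and the toy data are all hypothetical.

```python
import random


def accumulation_curves(pav, n_repeats=100, seed=0):
    """Average core and pan gene counts as individuals are added in random order.

    pav maps each individual to the set of genes present in that individual.
    Returns two lists; position i holds the mean count for i + 1 individuals.
    """
    rng = random.Random(seed)
    individuals = list(pav)
    n = len(individuals)
    core_sums = [0] * n
    pan_sums = [0] * n
    for _ in range(n_repeats):
        order = individuals[:]
        rng.shuffle(order)
        core = set(pav[order[0]])  # genes present in every individual so far
        pan = set()                # genes present in at least one individual so far
        for i, individual in enumerate(order):
            core &= pav[individual]
            pan |= pav[individual]
            core_sums[i] += len(core)
            pan_sums[i] += len(pan)
    core_mean = [s / n_repeats for s in core_sums]
    pan_mean = [s / n_repeats for s in pan_sums]
    return core_mean, pan_mean


# Toy presence-absence data, invented for this example.
toy_pav = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2"},
    "s3": {"g1", "g3", "g4"},
}
core_curve, pan_curve = accumulation_curves(toy_pav, n_repeats=100, seed=1)
print("core:", core_curve)
print("pan: ", pan_curve)
```

With all individuals included, the curves end at the size of the intersection and the union of the gene sets, regardless of the random order.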
Fig. 5 Comparison of the 33.58 Mb of novel sequences from 275 Han Chinese genomes with other human genome data sets. The other human genome data sets include the patch sequences and alternative loci of GRCh38, the novel sequences of the African pan-genome (APG), and the novel sequences of hs38d1