Masao Nagasaki1,2,3, Yoko Kuroki1,2,4, Tomoko F Shibata1,2, Fumiki Katsuoka1,2, Takahiro Mimori1,2, Yosuke Kawai1,2,3, Naoko Minegishi1,2, Atsushi Hozawa1,2, Shinichi Kuriyama1,2,5, Yoichi Suzuki1,2, Hiroshi Kawame1,2, Fuji Nagami1, Takako Takai-Igarashi1, Soichi Ogishima1, Kaname Kojima1,2,3, Kazuharu Misawa1,2, Osamu Tanabe1,2, Nobuo Fuse1,6, Hiroshi Tanaka1, Nobuo Yaegashi1,2,6, Kengo Kinoshita1,3, Shigeo Kure1,2,6, Jun Yasuda1,2, Masayuki Yamamoto1,2.
Abstract
In recent genome analyses, population-specific reference panels have proven important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions, so the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized the insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bp to ~10,000 bp in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short reads from 1070 Japanese individuals and 728 individuals from eight other populations to GRCh38 with the insertions integrated. Based on this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating into GRCh38 the 903 verified insertions, totaling 1,086,173 bases, that were shared by at least two Japanese individuals. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, that were shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.
Keywords: DNA sequencing; Genomics
Year: 2019 PMID: 31231536 PMCID: PMC6555796 DOI: 10.1038/s41439-019-0057-7
Source DB: PubMed Journal: Hum Genome Var ISSN: 2054-345X
Fig. 1 Features of the 3691 long insertions (TMMINSs).
a Distribution of the 3691 insertions in the chromosomes. The red lines on the chromosomes indicate the locations of the insertions on each chromosome. The gray bands in each chromosome indicate its cytobands. b Distribution of the lengths of TMMINSs. The two prominent peaks correspond to Alus and LINEs. Left inner box: distribution of GC ratios accompanied by entropy information. Right inner box: TMMINSs with high entropy tended to show medium GC ratios of ~0.5. The distribution of entropy is accompanied by the GC ratio information. The peaks in the high-entropy region indicated that many TMMINSs had high complexity. c Repeat motif enrichment analysis of TMMINSs. The boxplot indicates the background distribution of the total number of motif classes. Each box represents the 25th and 75th percentiles of the total number of each motif. The notches represent the 1.5 × interquartile range. The red dots outside the notches indicate the enriched motif classes in TMMINSs. The other black dots show outliers. d The relative frequencies of nonreference alleles of TMMINSs, SNVs, and short indels are indicated as green, red, and blue lines, respectively. The nonreference allele frequencies of each variant were calculated from the genotypes of 1070 individuals, and only variants found in JPN00001 were used. e Repeat categories and allele frequencies of TMMINSs in 1KJPN. The horizontal axis shows the allele frequencies of TMMINS in 1KJPN, and the vertical axis shows the occupancy ratio of repeat motifs in TMMINS
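The GC ratio and entropy shown in panel b can be illustrated with a minimal sketch. It assumes that "entropy" means the Shannon entropy of the single-base composition, which the caption does not confirm. The function names `gc_ratio` and `shannon_entropy` are hypothetical and are not taken from the authors' pipeline.

```python
import math
from collections import Counter


def gc_ratio(seq: str) -> float:
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the base composition of seq.

    Assumption: single-base entropy; the paper may use a different
    complexity measure.
    """
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A balanced sequence has GC ratio 0.5 and the maximum entropy of 2 bits;
# a homopolymer has zero entropy.
print(gc_ratio("ACGT"), shannon_entropy("ACGT"))
print(gc_ratio("AAAA"), shannon_entropy("AAAA"))
```

Under this definition, a sequence using all four bases evenly necessarily has a GC ratio near 0.5, in line with the trend the caption describes for high-entropy TMMINSs.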
Length and ratio of genome repeat class to TMMINSs. Features of TMMINSs
| Class | Subclass | Total | Mean length | Total number of normalized bases | Ratio of normalized length to total length | Active Mobile Element |
|---|---|---|---|---|---|---|
| No. of repeats | | 2922 | 811 | 2,370,250 | 0.2686 | |
| SINEs | AluJ | 171 | 1978 | 338,180 | 0.0383 | |
| | AluS | 382 | 1525 | 582,731 | 0.066 | |
| | AluY | 846 | 586 | 495,440 | 0.0561 | Yes |
| | MIR | 142 | 2103 | 298,660 | 0.0338 | |
| | Other Alu | 54 | 2245 | 121,255 | 0.0137 | |
| | Total | 1595 | 1151 | 1,836,266 | 0.2079 | |
| LINEs | L1 | 804 | 1440 | 1,157,769 | 0.1312 | Yes |
| | L2 | 297 | 1554 | 461,442 | 0.0523 | |
| | Other | 60 | 1810 | 108,581 | 0.0123 | |
| | Total | 1161 | 1488 | 1,727,792 | 0.1958 | |
| LTR | ERVL | 222 | 1838 | 408,002 | 0.0462 | |
| | Other | 179 | 1872 | 335,097 | 0.038 | |
| | Total | 401 | 1853 | 743,099 | 0.0842 | |
| SVA | SVA_A | 8 | 505 | 4,041 | 0.0005 | Yes |
| | SVA_B | 6 | 1837 | 11,024 | 0.0012 | Yes |
| | SVA_C | 3 | 1523 | 4,569 | 0.0005 | Yes |
| | SVA_D | 56 | 606 | 33,961 | 0.0038 | Yes |
| | SVA_E | 33 | 1899 | 62,654 | 0.0071 | Yes |
| | SVA_F | 74 | 1271 | 94,026 | 0.0107 | Yes |
| | Total | 180 | 1168 | 210,275 | 0.0238 | |
| DNA | hAT-Charlie | 82 | 2224 | 182,389 | 0.0207 | |
| | TcMar-Tigger | 72 | 1678 | 120,841 | 0.0137 | |
| | Other | 48 | 2002 | 96,113 | 0.0109 | |
| | Total | 202 | 1977 | 399,343 | 0.0453 | |
| RNA | Total | 4 | 2341 | 9,364 | 0.0011 | |
| Satellite | Low complexity | 160 | 1090 | 174,477 | 0.0198 | |
| | Simple repeats | 1119 | 1139 | 1,274,131 | 0.1444 | |
| | Other | 93 | 817 | 75,985 | 0.0086 | |
| | Total | 1372 | 1111 | 1,524,593 | 0.1728 | |
| Unknown | Other | 3 | 1363 | 4,089 | 0.0005 | |
| All | Total | 7840 | 1126 | 8,825,071 | 1 | |
Statistics of TMMINSs
| Chr | Number of TMMINSs | Sum of TMMINSs length | GRCh38 original length | JRGv1 length | Increased length |
|---|---|---|---|---|---|
| 1 | 309 | 239,427 | 248,956,422 | 249,198,570 | 242,148 |
| 2 | 243 | 187,725 | 242,193,529 | 242,386,836 | 193,307 |
| 3 | 216 | 202,994 | 198,295,559 | 198,505,658 | 210,099 |
| 4 | 195 | 99,669 | 190,214,555 | 190,315,561 | 101,006 |
| 5 | 168 | 125,425 | 181,538,259 | 181,665,793 | 127,534 |
| 6 | 209 | 131,123 | 170,805,979 | 170,939,170 | 133,191 |
| 7 | 218 | 141,534 | 159,345,973 | 159,491,300 | 145,327 |
| 8 | 152 | 82,336 | 145,138,636 | 145,221,493 | 82,857 |
| 9 | 189 | 121,060 | 138,394,717 | 138,517,748 | 123,031 |
| 10 | 173 | 140,919 | 133,797,422 | 133,954,928 | 157,506 |
| 11 | 208 | 134,875 | 135,086,622 | 135,226,433 | 139,811 |
| 12 | 178 | 144,826 | 133,275,309 | 133,423,817 | 148,508 |
| 13 | 172 | 100,189 | 114,364,328 | 114,466,305 | 101,977 |
| 14 | 97 | 93,675 | 107,043,718 | 107,139,526 | 95,808 |
| 15 | 87 | 55,846 | 101,991,189 | 102,048,214 | 57,025 |
| 16 | 96 | 50,737 | 90,338,345 | 90,389,652 | 51,307 |
| 17 | 130 | 89,379 | 83,257,441 | 83,348,358 | 90,917 |
| 18 | 93 | 43,078 | 80,373,285 | 80,416,794 | 43,509 |
| 19 | 118 | 80,500 | 58,617,616 | 58,701,084 | 83,468 |
| 20 | 136 | 78,303 | 64,444,167 | 64,526,175 | 82,008 |
| 21 | 83 | 68,287 | 46,709,983 | 46,780,291 | 70,308 |
| 22 | 83 | 35,720 | 50,818,468 | 50,855,005 | 36,537 |
| X | 112 | 88,583 | 156,040,895 | 156,131,050 | 90,155 |
| Y | 26 | 46,055 | 57,227,415 | 57,280,509 | 53,094 |
| Total | 3691 | 2,582,265 | 3,088,269,832 | 3,090,930,270 | 2,660,438 |
The sum of TMMINSs lengths for each chromosome is not consistent with the increased length because some of the original sequences in GRCh38 were removed when the TMMINSs were inserted into GRCh38 (see Supplementary Fig. "Integration of detected INSs to GRCh38").
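The relationship between the length columns can be checked with a short sketch, assuming that "Increased length" is simply the JRGv1 chromosome length minus the GRCh38 length. The values are copied from a few rows of the table; the variable and function names are illustrative only.

```python
# (GRCh38 length, JRGv1 length, reported increased length), taken from the table.
rows = {
    "1": (248_956_422, 249_198_570, 242_148),
    "2": (242_193_529, 242_386_836, 193_307),
    "Y": (57_227_415, 57_280_509, 53_094),
    "Total": (3_088_269_832, 3_090_930_270, 2_660_438),
}


def increased_length(grch38_len: int, jrg_len: int) -> int:
    """Net length change of a chromosome after integrating insertions."""
    return jrg_len - grch38_len


# Collect any row whose reported increase differs from the computed one.
mismatches = [
    chrom
    for chrom, (ref_len, jrg_len, reported) in rows.items()
    if increased_length(ref_len, jrg_len) != reported
]
print(mismatches)
```

For the rows included here the list is empty, so the reported increase matches the difference between the two length columns.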
Summary of genetic annotations to TMMINSs (a)
| Class | Count |
|---|---|
| Intergenic | 1792 |
| Motif | 31 |
| Transcript | 1660 |
| Gene | 2245 |
| Exon | 59 |
| Intron | 1112 |
| Upstream (5kb) | 341 |
| Downstream (5kb) | 402 |
The software can assign multiple classes to a single insertion, so the counts do not sum to 3691.
(a) In total, 3691 TMMINSs were annotated using SnpEff ver. 4.3b.
Summary of annotation with GWAS Catalog (a)
| Distance (bases) | Number of TMMINSs |
|---|---|
| <100 | 1 |
| <1K | 43 |
| <10K | 363 |
| <100K | 1751 |
| <1M | 1372 |
| <10M | 135 |
| NA | 26 |
| Total | 3691 |
(a) The version downloaded on 29 Jan 2016. NA denotes TMMINSs on chrY.
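The distance table can be read as a set of non-overlapping bins, since its rows sum to 3691. A minimal sketch of such binning follows. It assumes each insertion is counted only in the smallest bin that contains its distance to the nearest GWAS Catalog entry, and that a missing distance (`None`) maps to NA. The example distances are invented for illustration.

```python
from collections import Counter
from typing import Optional

# Upper bounds (exclusive) and labels, in increasing order.
BINS = [
    (100, "<100"),
    (1_000, "<1K"),
    (10_000, "<10K"),
    (100_000, "<100K"),
    (1_000_000, "<1M"),
    (10_000_000, "<10M"),
]


def distance_bin(distance: Optional[int]) -> str:
    """Return the label of the smallest bin containing distance."""
    if distance is None:
        return "NA"
    for upper, label in BINS:
        if distance < upper:
            return label
    return ">=10M"  # not present in the table; kept for completeness


# Hypothetical distances in bases; None marks an insertion without a distance.
example = [42, 950, 7_300, 55_000, 55_001, 2_000_000, None]
summary = Counter(distance_bin(d) for d in example)
print(summary)
```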
Fig. 2 Typical patterns of read coverage distributions of TMMINSs in 1KJPN and other populations.
The left side of each panel shows the normalized coverage distribution of a TMMINS in the 1KJPN (top) and i1000g (bottom) populations, and the right side shows the genotype frequencies in eight populations. Blue: null, green: hetero, red: homo. a The allele frequencies were rare, common, and abundant in the African, East Asian and CEU, and BEB and CLM populations, respectively. b A monomorphic sequence in modern humans. c Correlation of the allele frequencies of 871 biallelic TMMINSs in 1KJPN (x-axis) and the shared ratio between JPN00001 and AK1 (y-axis)
Fig. 3 Insertion frequency heat map.
a Phylogenetic relationship among chimpanzees, Denisovans, Neanderthals, and modern humans. b Heat map of the allele frequencies of 194 selected TMMINSs. The right panel shows the frequency in 1KJPN and other population data from HapMap projects[42,43] (i1000g). The vertical axis is ordered by the allele frequencies in i1000g. The populations were clustered into three groups: Africa, East Asia, and others. The middle color bar indicates the frequency in all i1000g populations. The left panel shows the existence of each allele in the genomes of a chimpanzee, a Denisovan, and a Neanderthal
Alignment performance of JRGv0 and GRCh38 + decoyJRGv0
(a) Comparison of the alignment ratio with GRCh38, GRCh38 + decoyJRGv1, and JRGv1

| | Mean | S.D. | Improvement (compared to total reads) | Improvement (compared to unmapped reads) |
|---|---|---|---|---|
| Alignment ratio with GRCh38 | 96.92% | 0.51% | - | - |
| Alignment ratio with GRCh38 + decoyJRGv1 | 97.35% | 0.50% | 0.43% | 16.22% |
| Alignment ratio with JRGv1 | 97.36% | 0.50% | 0.44% | 16.47% |
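The two improvement columns can be sketched as follows. This assumes that the improvement relative to total reads is the difference in alignment ratio in percentage points, and that the improvement relative to unmapped reads divides that gain by the fraction of reads left unmapped against GRCh38. The function name is hypothetical, and the source does not state how the per-unmapped figures were averaged, so applying the formula to the mean ratios need not reproduce the table's second column.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (gain vs. total reads, gain vs. unmapped reads), both in percent."""
    gain_total = new_pct - baseline_pct
    unmapped = 100.0 - baseline_pct  # reads not aligned to the baseline reference
    gain_unmapped = 100.0 * gain_total / unmapped
    return gain_total, gain_unmapped


# Mean alignment ratios from the table.
decoy_gain, _ = improvement(96.92, 97.35)
jrg_gain, _ = improvement(96.92, 97.36)
print(round(decoy_gain, 2), round(jrg_gain, 2))
```

The first component reproduces the 0.43% and 0.44% entries in the "compared to total reads" column.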
Fig. 4 Insertions in the ZNF676 region and functional single-nucleotide variants in 1KJPN.
a GRCh38 + decoyJRGv0 improved the alignment around the ALG1L2 gene region. Some of the sequence reads that mapped to ALG1L2 when the reference was GRCh38 were mapped to the decoy sequence when decoyJRGv0 was added to the reference. b Multiple alignment of a portion of the ZNF676 protein from a chimpanzee, GRCh38, and JRGv0. c Suggested functional variants of a novel insertion, TMMINS2292, in the ZNF676 gene coding region