Xin Jin, Mingze He, Betsy Ferguson, Yuhuan Meng, Limei Ouyang, Jingjing Ren, Thomas Mailund, Fei Sun, Liangdan Sun, Juan Shen, Min Zhuo, Li Song, Jufang Wang, Fei Ling, Yuqi Zhu, Christina Hvilsom, Hans Siegismund, Xiaoming Liu, Zhuolin Gong, Fang Ji, Xinzhong Wang, Boqing Liu, Yu Zhang, Jianguo Hou, Jing Wang, Hua Zhao, Yanyi Wang, Xiaodong Fang, Guojie Zhang, Jian Wang, Xuejun Zhang, Mikkel H Schierup, Hongli Du, Jun Wang, Xiaoning Wang.
Abstract
Non-human primates have emerged as an important resource for the study of human disease and evolution. The characterization of genomic variation between and within non-human primate species could advance the development of genetically defined non-human primate disease models. However, non-human primate specific reagents that would expedite such research, such as exon-capture tools, are lacking. We evaluated the efficiency of using a human exome capture design for the selective enrichment of exonic regions of non-human primates. We compared the exon sequence recovery in nine chimpanzees, two crab-eating macaques and eight Japanese macaques. Over 91% of the target regions were captured in the non-human primate samples, although the specificity of the capture decreased as evolutionary divergence from humans increased. Both intra-specific and inter-specific DNA variants were identified; Sanger-based resequencing validated 85.4% of 41 randomly selected SNPs. Among the short indels identified, a majority (54.6%-77.3%) of the variants resulted in a change of 3 base pairs, consistent with expectations for a selection against frameshift mutations. Taken together, these findings indicate that use of a human design exon-capture array can provide efficient enrichment of non-human primate gene regions. Accordingly, use of the human exon-capture methods provides an attractive, cost-effective approach for the comparative analysis of non-human primate genomes, including gene-based DNA variant discovery.
Year: 2012 PMID: 22848389 PMCID: PMC3407233 DOI: 10.1371/journal.pone.0040637
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Data production.
| | CE1 | CE2 | JP (mean ± s.d.) | CM (mean ± s.d.) | HM1 | HM2 |
| Target region (Mb) | 37.63 | 37.63 | 45.88 | 37.63 | 37.63 | 37.63 |
| # of clean reads (Mb) | 31.71 | 51.12 | 98.43±8.24 | 27.55±2.75 | 25.56 | 25.56 |
| Reads mapped to human genome (Mb) | 18.12 | 28.05 | 55.68±6.07 | 23.54±2.30 | 23.58 | 23.35 |
| (fraction of clean reads, %) | 57.13 | 54.86 | 56.48±2.06 | 85.46±1.06 | 92.26 | 91.37 |
| Reads mapped to human genome after duplicate filtering (Mb) | 17.66 | 27.30 | 48.44±4.93 | 22.22±2.23 | 22.74 | 22.42 |
| (fraction of clean reads, %) | 55.69 | 53.40 | 49.19±2.56 | 80.67±1.68 | 88.97 | 87.72 |
| Reads mapped to own genome (Mb) | 27.02 | 45.92 | 88.22±8.09 | 24.22±2.15 | 23.58 | 23.35 |
| (fraction of clean reads, %) | 85.20 | 89.83 | 89.58±1.19 | 88.01±1.79 | 92.26 | 91.37 |
| Reads mapped to target region (Mb) | 14.26 | 21.04 | 88.22±8.09 | 16.67±1.71 | 17.08 | 17.35 |
| (fraction of clean reads, %) | 44.96 | 41.17 | 40.66±2.95 | 60.51±1.60 | 66.82 | 67.89 |
| Mean depth of target region | 30.72 | 45.77 | 70±8.52 | 35.03±3.40 | 35.41 | 36.14 |
| Coverage of target region (%) | 91.06 | 92.70 | 93.29±0.14 | 97.16±0.18 | 97.08 | 97.73 |
| Capture specificity | 52.77 | 45.83 | 45.36±2.84 | 68.78±2.21 | 72.43 | 74.30 |
| Fraction of target covered ≥4X (%) | 81.14 | 83.67 | 87.03±0.40 | 91.06±0.83 | 91.30 | 92.27 |
| Fraction of target covered ≥10X (%) | 67.59 | 72.32 | 78.80±0.86 | 78.41±2.38 | 79.09 | 79.91 |
| Rate of nucleotide mismatch (%) | 2.90 | 2.76 | 2.76±0.07 | 1.03±0.05 | 0.44 | 0.48 |
Summary of captured target sequence coverage for each non-human primate exome and two human exomes. The total size of the captured target is 37,627,322 bp for CE and 45,880,359 bp for JP. Each exome was compared to the human reference genome. Listed for each exome are the details of the captured data, including alignment, depth and coverage of the target region. The exomes of all twenty-one individuals were sequenced to a mean depth of ≥28-fold across the array-designed target region (TR). Coverage of the target region ranged from 91.06% to 97.73%, and 67.59% to 81.33% of sites in the target region were covered by more than 10 reads. Non-human primate reads were also aligned to their own or closest reference genomes (rheMac2 was used as the reference genome for the crab-eating (cynomolgus) and Japanese macaques) to evaluate the capture efficiency.
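The depth and coverage rows of this table can be illustrated with a minimal Python sketch. It assumes a per-base list of read depths over the target region is already available; the function name `coverage_summary` and the toy depths are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int]) -> Dict[str, float]:
    """Summarise per-base read depth over a target region.

    `depths` holds one read-depth value per target base. This input layout
    is an assumption; the source does not say how depth was tabulated.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must not be empty")
    return {
        # Mean number of reads covering each target base.
        "mean_depth": sum(depths) / n,
        # Percentage of target bases covered by at least one read.
        "coverage_pct": 100.0 * sum(d >= 1 for d in depths) / n,
        # Percentage of target bases covered by at least 4 and 10 reads.
        "ge4x_pct": 100.0 * sum(d >= 4 for d in depths) / n,
        "ge10x_pct": 100.0 * sum(d >= 10 for d in depths) / n,
    }


# Toy example: ten target bases with made-up depths.
toy_depths = [0, 2, 4, 5, 10, 12, 30, 0, 9, 8]
summary = coverage_summary(toy_depths)
print(summary)
```

In practice the per-base depths would come from an alignment tool's depth report rather than a hand-written list.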
Comparison of different scanning regions and theoretical coverage of genes.
| | Target Region | Target Orthologous Region | Target Orthologous Region & depth≥10 |
| Length(bp) | 37,627,322 | 35,163,761 | 17,067,287 |
| Fraction of raw Target Region | 100.00% | 93.45% | 45.36% |
| Coverage of CCDS genes | 86.21% | 80.63% | 44.93% |
Summary of the theoretical analysis for the target region, the target orthologous region, and the target orthologous region & depth≥10. The target orthologous region was used to assess capture efficiency, and the target orthologous region & depth≥10 was used for SNP detection. A total of 18,594 CCDS genes were used to analyze the theoretical coverage.
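The "Fraction of raw Target Region" row follows directly from the region lengths in the table. The short Python check below reproduces it; the constant and function names are illustrative only.

```python
# Region lengths in base pairs, copied from the table above.
TARGET_BP = 37_627_322            # Target Region
ORTHOLOGOUS_BP = 35_163_761       # Target Orthologous Region
ORTHOLOGOUS_D10_BP = 17_067_287   # Target Orthologous Region & depth>=10


def pct_of_target(length_bp: int, target_bp: int = TARGET_BP) -> float:
    """Return a region length as a percentage of the raw target region."""
    return 100.0 * length_bp / target_bp


orthologous_pct = pct_of_target(ORTHOLOGOUS_BP)
orthologous_d10_pct = pct_of_target(ORTHOLOGOUS_D10_BP)
print(f"{orthologous_pct:.2f}% {orthologous_d10_pct:.2f}%")
```

Rounded to two decimals, the results agree with the 93.45% and 45.36% reported in the table.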
Figure 1. The profile of gene coverage.
The distribution of coverage of 18,594 gene coding regions of human CCDS by the theoretical target region, the target orthologous region and sequencing data from 3 species is shown. The sequencing coverage of each gene was inferred from one individual in each species. Sequencing data showed that 80.40%–80.45% of genes were covered ≥90% for both humans and chimpanzees, and 77.76% for macaques. With respect to the target region, the coverage of human and chimpanzee genes was similar, whereas the coverage of macaque CCDSs was higher than that of the target orthologous region.
Distribution of orthologous score (OS) for macaques.
| | OS = 0 | OS = 1 | OS = 2 |
| Length | 1.78 M | 35.16 M | 0.68 M |
| Percent of target region | 4.73% | 93.45% | 1.82% |
Fragments from the target region of the human reference genome were aligned to the Indian macaque reference genome. The OS was calculated for every site and classified into three possible levels: OS = 0, no alignment hit, so the site theoretically cannot be captured; OS = 1, exactly one hit, indicating high-confidence 1:1 orthology and good coverage; OS = 2, multiple hits at different locations in the macaque reference, which would give rise to misalignment and potentially false polymorphisms.
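The three-level classification described above can be sketched in Python as a function of the number of alignment hits per site. The function names and the toy hit counts are hypothetical; how hits were actually counted is not specified in the source.

```python
from collections import Counter
from typing import Dict, Iterable


def orthologous_score(n_hits: int) -> int:
    """Map the number of alignment hits for a site to its OS level.

    0 hits -> OS = 0, exactly 1 hit -> OS = 1, 2 or more hits -> OS = 2.
    """
    if n_hits < 0:
        raise ValueError("n_hits must be non-negative")
    return min(n_hits, 2)


def os_percentages(hit_counts: Iterable[int]) -> Dict[int, float]:
    """Percentage of sites falling into each OS level."""
    scores = Counter(orthologous_score(h) for h in hit_counts)
    total = sum(scores.values())
    return {level: 100.0 * scores.get(level, 0) / total for level in (0, 1, 2)}


# Toy example: made-up hit counts for ten sites.
toy_hits = [1, 1, 0, 1, 3, 1, 1, 1, 2, 1]
print(os_percentages(toy_hits))
```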
Genomic features of captured and not captured targets.
| (A) Genomic features of captured targets | |||||
| | # of targets | # of bases | mismatch | Indel bases | GC |
| CE1 | 155,610 | 35,682,723 | 3.07% | 0.55% | 47.13% |
| CE2 | 157,062 | 36,002,143 | 3.10% | 0.56% | 47.16% |
| JP1 | 185,636 | 44,284,360 | 3.28% | 0.63% | 47.26% |
| JP2 | 186,122 | 44,395,960 | 3.29% | 0.64% | 47.30% |
| JP3 | 185,628 | 44,295,992 | 3.28% | 0.63% | 47.26% |
| JP4 | 185,584 | 44,280,953 | 3.28% | 0.63% | 47.24% |
| JP5 | 185,481 | 44,251,493 | 3.28% | 0.63% | 47.24% |
| JP6 | 185,540 | 44,256,348 | 3.28% | 0.63% | 47.23% |
| JP7 | 186,066 | 44,364,842 | 3.29% | 0.64% | 47.28% |
| JP8 | 185,812 | 44,305,013 | 3.28% | 0.64% | 47.26% |
| (B) Genomic features of not captured targets | |||||
| | # of targets | # of bases | mismatch | Indel bases | GC |
| CE1 | 9,961 | 1,944,599 | 5.85% | 1.47% | 56.29% |
| CE2 | 8,509 | 1,625,179 | 5.90% | 1.47% | 57.15% |
| JP1 | 8,381 | 1,595,999 | 6.85% | 1.87% | 53.84% |
| JP2 | 7,895 | 1,484,399 | 6.86% | 1.89% | 53.27% |
| JP3 | 8,389 | 1,584,367 | 6.83% | 1.88% | 53.84% |
| JP4 | 8,433 | 1,599,406 | 6.82% | 1.88% | 54.26% |
| JP5 | 8,536 | 1,628,866 | 6.80% | 1.85% | 54.13% |
| JP6 | 8,477 | 1,624,011 | 6.86% | 1.87% | 54.27% |
| JP7 | 7,951 | 1,515,517 | 6.95% | 1.93% | 53.70% |
| JP8 | 8,205 | 1,575,346 | 6.92% | 1.88% | 53.99% |
We examined the following genomic features for the 165,571 (for CE) and 194,017 (for JP) human targets with best reciprocal orthologs in the Indian macaque genome: the number of nucleotide differences between human and macaque, the number of indel bases between human and macaque and the GC content. (A) A target was considered “captured” if more than half of the human targeted bases were covered by at least one sequence read. (B) Otherwise, the target was not captured.
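The capture rule in this caption (more than half of the targeted bases covered by at least one read) can be written as a small Python predicate. This is a sketch under the assumption that per-base depths are available for each target; `is_captured` is an illustrative name.

```python
from typing import Sequence


def is_captured(depths: Sequence[int]) -> bool:
    """Return True if more than half of a target's bases have depth >= 1.

    `depths` is the per-base read depth across one target (assumed input).
    A target with exactly half of its bases covered counts as not captured.
    """
    covered = sum(1 for d in depths if d >= 1)
    return covered > len(depths) / 2


# Toy targets with made-up depths.
print(is_captured([3, 5, 0, 1]))  # three of four bases covered
print(is_captured([2, 2, 0, 0]))  # exactly half covered
```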
Figure 2. GC distribution of captured and uncaptured targets for CE1.
The sequenced CE1 exome was compared with its closest reference genome (rheMac2) to examine the influence of GC content on coverage. (A) Captured targets were defined as targets with more than 50% of bases covered by at least one read. The captured targets were mainly of moderate GC content. (B) Uncaptured targets were those with less than half of their bases covered by at least one read. The uncaptured targets were dominated by high GC content, except for a small portion with low GC content.
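For reference, the GC content of a target is the share of G and C bases among its called bases. A minimal Python sketch follows; ignoring ambiguous bases such as N is an assumption made here, not something stated in the source.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among the A, C, G, T bases of a sequence.

    Ambiguous bases (for example N) are ignored; this handling is an
    assumption for illustration.
    """
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no A/C/G/T bases")
    return sum(base in "GC" for base in called) / len(called)


print(gc_content("GGCCAATT"))
```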
Figure 3. Average sequencing depth of captured targets versus mismatch between the human and rhesus macaque references.
The mismatch rate of each target region was calculated by comparing the human and rhesus macaque reference sequences. (A) The X-axis shows the mismatch rate range, and the Y-axis shows the mean depth of the target region, as calculated from CE2. As the mismatch rate increases, the mean depth decreases. (B) Average sequencing depth of captured targets versus the GC content of each target. Targets with extreme GC content were poorly captured; their mean depth was lower than that of targets with moderate GC content.
(A) Comparison of sequencing based point variants and variants between references. (B) Coding SNP summary; piN refers to X and piS refers to Y. dN indicates X and dS indicates Y.
| (A) Comparison of sequencing based point variants and variants between references. | ||||||||
| | Total SNPs | consistent hom | consistent het | inconsistent hom | inconsistent het | non-overlap | candidate intra-species polymorphism | consistent rate |
| JP1 | 311,252 | 287,901 | 4,067 | 532 | 37 | 18,715 | 23,351 | 99.81% |
| JP2 | 310,564 | 287,724 | 3,935 | 536 | 42 | 18,327 | 22,840 | 99.80% |
| JP3 | 310,948 | 288,188 | 3,927 | 524 | 45 | 18,264 | 22,760 | 99.81% |
| JP4 | 310,935 | 287,036 | 4,115 | 521 | 47 | 19,216 | 23,899 | 99.81% |
| JP5 | 311,477 | 287,427 | 4,195 | 519 | 40 | 19,296 | 24,050 | 99.81% |
| JP6 | 310,847 | 287,096 | 4,131 | 525 | 36 | 19,059 | 23,751 | 99.81% |
| JP7 | 310,132 | 285,929 | 4,181 | 528 | 45 | 19,449 | 24,203 | 99.80% |
| JP8 | 311,362 | 287,570 | 4,131 | 513 | 40 | 19,108 | 23,792 | 99.81% |
| CE1 | 306,534 | 279,641 | 5,929 | 543 | 49 | 20,372 | 26,893 | 99.79% |
| CE2 | 303,992 | 276,601 | 5,910 | 517 | 44 | 20,920 | 27,391 | 99.80% |
| CM1 | 80,882 | 64,328 | 2,398 | 67 | 18 | 14,071 | 16,554 | 99.87% |
| CM2 | 80,571 | 64,042 | 2,446 | 71 | 21 | 13,991 | 16,529 | 99.86% |
| CM3 | 80,721 | 64,433 | 2,379 | 68 | 19 | 13,822 | 16,288 | 99.87% |
| CM4 | 80,061 | 63,718 | 2,392 | 70 | 15 | 13,866 | 16,343 | 99.87% |
| CM5 | 80,895 | 64,498 | 2,271 | 74 | 18 | 14,034 | 16,397 | 99.86% |
| CM6 | 80,737 | 64,230 | 2,391 | 68 | 16 | 14,032 | 16,507 | 99.87% |
| CM7 | 80,381 | 64,034 | 2,407 | 77 | 16 | 13,847 | 16,347 | 99.86% |
| CM8 | 79,704 | 64,691 | 2,047 | 71 | 16 | 12,879 | 15,013 | 99.87% |
| CM9 | 80,500 | 64,228 | 2,346 | 70 | 17 | 13,839 | 16,272 | 99.87% |
| (B) Coding SNP summary. | ||||||||
| JP1 | 8,172 | 10,757 | 13,096 | 0.7597 | 60,162 | 155,835 | 0.3861 | 1.9678 |
| JP2 | 7,948 | 10,503 | 12,651 | 0.7567 | 60,089 | 155,702 | 0.3859 | 1.9608 |
| JP3 | 7,896 | 10,513 | 12,638 | 0.7511 | 60,224 | 156,046 | 0.3859 | 1.9461 |
| JP4 | 8,429 | 10,922 | 13,662 | 0.7717 | 59,936 | 155,274 | 0.386 | 1.9993 |
| JP5 | 8,502 | 10,991 | 13,827 | 0.7735 | 59,995 | 155,564 | 0.3857 | 2.0058 |
| JP6 | 8,382 | 10,876 | 13,563 | 0.7707 | 60,022 | 155,398 | 0.3862 | 1.9953 |
| JP7 | 8,603 | 11,053 | 14,014 | 0.7783 | 59,738 | 154,756 | 0.386 | 2.0164 |
| JP8 | 8,348 | 10,913 | 13,643 | 0.765 | 60,104 | 155,724 | 0.386 | 1.9819 |
| CE1 | 7,477 | 13,753 | 16,904 | 0.5437 | 58,382 | 151,782 | 0.3846 | 1.4134 |
| CE2 | 7,850 | 13,719 | 17,330 | 0.5722 | 57,596 | 149,784 | 0.3845 | 1.4881 |
| CM1 | 5,207 | 7,449 | 9,975 | 0.699 | 16,421 | 31,766 | 0.5169 | 1.3522 |
| CM2 | 5,292 | 7,492 | 10,240 | 0.7064 | 16,301 | 31,622 | 0.5155 | 1.3702 |
| CM3 | 5,119 | 7,305 | 9,714 | 0.7008 | 16,399 | 31,852 | 0.5148 | 1.3611 |
| CM4 | 5,191 | 7,477 | 10,044 | 0.6943 | 16,246 | 31,466 | 0.5163 | 1.3447 |
| CM5 | 5,211 | 7,325 | 9,698 | 0.7114 | 16,461 | 31,879 | 0.5164 | 1.3777 |
| CM6 | 5,179 | 7,389 | 9,885 | 0.7009 | 16,359 | 31,759 | 0.5151 | 1.3607 |
| CM7 | 5,176 | 7,417 | 9,833 | 0.6979 | 16,296 | 31,670 | 0.5146 | 1.3562 |
| CM8 | 4,763 | 6,789 | 8,287 | 0.7016 | 16,511 | 31,975 | 0.5164 | 1.3587 |
| CM9 | 5,272 | 7,329 | 9,934 | 0.7193 | 16,342 | 31,719 | 0.5152 | 1.3962 |
| HM1 | 2,944 | 4,024 | 4,282 | 0.7316 | ||||
| HM2 | 2,957 | 4,092 | 4,310 | 0.7226 | ||||
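The derived columns of panel (A) can be reproduced from the count columns. The relationships used in the Python sketch below are inferred from the row arithmetic (they hold for the JP1 row) and are not stated explicitly in the source, so they should be read as assumptions: the total is the sum of all five count columns, the candidate intra-species polymorphisms are all calls other than consistent homozygous ones, and the consistent rate is the consistent share of the calls that overlap reference differences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCounts:
    """Per-sample SNP counts, in the column order of panel (A)."""
    consistent_hom: int
    consistent_het: int
    inconsistent_hom: int
    inconsistent_het: int
    non_overlap: int

    @property
    def total(self) -> int:
        return (self.consistent_hom + self.consistent_het
                + self.inconsistent_hom + self.inconsistent_het
                + self.non_overlap)

    @property
    def candidate_intra_species(self) -> int:
        # Assumed rule: every call except consistent homozygous ones.
        return self.total - self.consistent_hom

    @property
    def consistent_rate_pct(self) -> float:
        # Assumed rule: consistent calls over all overlapping calls.
        consistent = self.consistent_hom + self.consistent_het
        overlapping = consistent + self.inconsistent_hom + self.inconsistent_het
        return 100.0 * consistent / overlapping


# JP1 row from panel (A).
jp1 = SnpCounts(287_901, 4_067, 532, 37, 18_715)
print(jp1.total, jp1.candidate_intra_species, round(jp1.consistent_rate_pct, 2))
```

For JP1 this gives the tabulated total of 311,252 SNPs, 23,351 candidate intra-species polymorphisms and a consistent rate of 99.81%.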
Figure 4. Phylogenetic tree.
Evolutionary distance in human, chimpanzee and macaque lineages, showing the inferred evolutionary relationships among 21 samples based upon similarities and differences in their genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor.
Figure 5. Principal component analysis.
Principal component analysis plot of 9 chimpanzees, 2 crab-eating macaques, 8 Japanese macaques and 2 human samples. Chimpanzees and humans cluster well within their own groups, while crab-eating and Japanese macaques are more spread out yet remain in relatively close proximity to each other, indicating recent genetic divergence between them.
Indel summary.
| | CE1 | CE2 | JP (mean ± s.d.) | CM (mean ± s.d.) | HM1 | HM2 |
| Total number of indels | 13,890 | 17,417 | 17,482.8±475.4 | 6,693.7±334.3 | 773 | 697 |
| Ins-coding | 698 | 910 | 1,072.4±40.3 | 427.6±17.6 | 93 | 84 |
| Del-coding | 607 | 698 | 696.5±17.6 | 360.3±8.4 | 59 | 56 |
| Splice site | 885 | 1,078 | 833.1±17.4 | 303.7±49.5 | 45 | 33 |
| Intron | 10,813 | 13,615 | 13,771.6±398.6 | 5,079.8±267.4 | 518 | 465 |
| 5′ UTRs | 256 | 341 | 344.5±8.2 | 190.3±6.3 | 22 | 29 |
| 3′ UTRs | 579 | 709 | 701.4±8.3 | 280.9±14.8 | 30 | 25 |
| Intergenic | 52 | 66 | 63.3±2.3 | 51.1±2.3 | 6 | 5 |
| Total insertion | 7,460 | 9,589 | 11,127.4±321.8 | 3,673.7±177.5 | 437 | 395 |
| Total deletion | 6,430 | 7,828 | 6,355.4±155.8 | 3,020±177.6 | 336 | 302 |
| Heterozygous indels | 378 | 536 | 295.0±19.3 | 452.4±60.2 | 280 | 240 |
| Homozygous indels | 13,512 | 16,881 | 17,187.8±457.9 | 6,241.2±277.5 | 493 | 457 |
Each exome from 21 individuals was aligned to the human reference genome for indel identification.
Figure 6. The length distribution of coding indels for each individual.
Between 140 and 1,883 coding indels were called in each individual, and 54.61%–77.32% of them were 3 bp in length. All indels were called relative to the human reference genome.
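The share of 3 bp indels reported here can be computed from a list of coding indel lengths, as in the Python sketch below. The function names and the toy lengths are hypothetical. The in-frame helper (length divisible by 3) is an extra illustration of why such indels avoid frameshifts and is not a statistic reported in the source.

```python
from collections import Counter
from typing import Sequence


def length_distribution(indel_lengths: Sequence[int]) -> Counter:
    """Count coding indels by absolute length in base pairs."""
    return Counter(abs(length) for length in indel_lengths)


def fraction_with_length(indel_lengths: Sequence[int], length: int = 3) -> float:
    """Fraction of indels whose absolute length equals `length`."""
    if not indel_lengths:
        raise ValueError("indel_lengths must not be empty")
    return length_distribution(indel_lengths)[length] / len(indel_lengths)


def fraction_in_frame(indel_lengths: Sequence[int]) -> float:
    """Fraction of indels whose length is a multiple of 3 (no frameshift)."""
    if not indel_lengths:
        raise ValueError("indel_lengths must not be empty")
    return sum(abs(n) % 3 == 0 for n in indel_lengths) / len(indel_lengths)


# Toy example: made-up lengths; negative values stand for deletions.
toy_lengths = [1, 3, -3, 3, 2, 6, -3, 1]
print(fraction_with_length(toy_lengths), fraction_in_frame(toy_lengths))
```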