Leonardo Arduino Marano, Letícia Marcorin, Erick da Cruz Castelli, Celso Teixeira Mendes-Junior.
Abstract
The advent of next-generation sequencing allows simultaneous processing of several genomic regions and individuals, increasing the availability and accuracy of whole-genome data. However, these new approaches may introduce errors and biases due to alignment, genotype calling, and imputation methods. Despite these flaws, data obtained by next-generation sequencing can be valuable for population and evolutionary studies of specific genes, such as genes related to how pigmentation evolved among populations, one of the main topics in human evolutionary biology. Melanocortin-1 receptor (MC1R) is one of the most studied genes involved in pigmentation variation. As MC1R has already been suggested to affect melanogenesis and increase the risk of developing melanoma, it constitutes one of the best models to understand how natural selection acts on pigmentation. Here we employed a locally developed pipeline to obtain genotype and haplotype data for MC1R from the raw sequencing data provided by the 1000 Genomes FTP site. We also compared these genotype data with the Phase 3 VCF to evaluate its quality and discover any polymorphic sites that may have been overlooked. In conclusion, either the VCF file or one of the presently described pipelines could be used to obtain reliable and accurate genotype calling from the 1000 Genomes Phase 3 data.
Year: 2017 PMID: 28486572 PMCID: PMC5488459 DOI: 10.1590/1678-4685-GMB-2016-0180
Source DB: PubMed Journal: Genet Mol Biol ISSN: 1415-4757 Impact factor: 1.771
Figure 1. Percentage of mismatches observed for data generated by our pipelines (UnifiedGenotyper in gray and HaplotypeCaller in black) as compared to data obtained directly from the VCF concerning 2,504 individuals analyzed by the 1000 Genomes Project.
Figure 2. Percentage of mismatches observed for data generated by our pipelines (UnifiedGenotyper in gray and HaplotypeCaller in black) as compared to data obtained directly from the VCF concerning 178 (UnifiedGenotyper) or 150 (HaplotypeCaller) loci analyzed by the 1000 Genomes Project.
Comparison of the 1000 Genomes Project Phase 3 VCF with data processed by the pipelines that include either UnifiedGenotyper or HaplotypeCaller. For this purpose, 40 genotype calls were retrieved from the locus that presented the highest level of inconsistencies (rs885479 for UnifiedGenotyper and chr16:89985177 for HaplotypeCaller) and 60 were randomly chosen among the remaining loci.
| Mismatches | UnifiedGenotyper approach: rs885479 | UnifiedGenotyper approach: Random sites | HaplotypeCaller approach: chr16:89985177 | HaplotypeCaller approach: Random sites |
|---|---|---|---|---|
| | 39 | 33 | 38 | 56 |
| | 0 | 5 | 0 | 1 |
| | 1 | 22 | 2 | 3 |
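The mismatch percentages reported in the figures and in the comparison above amount to a genotype concordance check between two call sets. The following is a minimal sketch of such a check, not the authors' pipeline: the function names, the `(sample, site)` dictionary layout, and the example sample labels are all hypothetical, and genotypes are assumed to be VCF-style `GT` strings compared without regard to phase.

```python
from typing import Dict, Tuple

# Hypothetical layout: (sample ID, site ID) -> VCF-style GT string.
CallSet = Dict[Tuple[str, str], str]


def normalize(gt: str) -> Tuple[str, ...]:
    """Turn a GT string such as '0|1' or '1/0' into an unordered allele tuple."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def mismatch_percentage(pipeline: CallSet, reference: CallSet) -> float:
    """Percentage of shared (sample, site) calls whose genotypes differ.

    Phase is ignored, so '0|1' and '1/0' count as the same genotype.
    """
    shared = pipeline.keys() & reference.keys()
    if not shared:
        raise ValueError("the two call sets share no (sample, site) pairs")
    mismatches = sum(
        1 for key in shared if normalize(pipeline[key]) != normalize(reference[key])
    )
    return 100.0 * mismatches / len(shared)


# Made-up example data (not taken from the study).
pipeline_calls = {
    ("S1", "siteA"): "0/1",
    ("S2", "siteA"): "1/1",
    ("S3", "siteA"): "0/0",
    ("S4", "siteA"): "0/1",
}
vcf_calls = {
    ("S1", "siteA"): "1|0",
    ("S2", "siteA"): "0|1",
    ("S3", "siteA"): "0|0",
    ("S4", "siteA"): "0|1",
}

print(mismatch_percentage(pipeline_calls, vcf_calls))  # 25.0
```

Grouping the keys by sample or by site before calling `mismatch_percentage` would give per-individual or per-locus percentages of the kind plotted in the two figures.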
Observed (H_O) and expected (H_E) heterozygosities, Hardy-Weinberg Equilibrium (HWE) probability values and Ewens-Watterson neutrality test results regarding the promoter and coding regions of MC1R in four population groups composed of autochthonous populations from the 1000 Genomes project. Significant p-values are marked in boldface.
| Population | 2n | H_O | H_E | HWE | Ewens-Watterson neutrality test | | |
|---|---|---|---|---|---|---|---|
| Promoter | | | | | | | |
| AFR | 1008 | 0.881 | 0.883 | 0.537 | | | |
| EAS | 1008 | 0.639 | 0.625 | 0.617 | | | |
| EUR | 1006 | 0.692 | 0.689 | 0.499 | | | |
| SAS | 978 | 0.847 | 0.839 | 0.613 | | | |
| Coding | | | | | | | |
| AFR | 1008 | 0.669 | 0.666 | 0.494 | | | |
| EAS | 1008 | 0.587 | 0.565 | 0.546 | | | |
| EUR | 1006 | 0.638 | 0.668 | 0.584 | | | |
| SAS | 978 | 0.481 | 0.491 | 0.425 | | | |
A p-value was computed by comparing the estimated statistic to a distribution of estimates computed for 99,999 random samples with the same number of alleles and sample size as the observed data, and represents the proportion of samples having a probability smaller than or equal to that of the observed sample. Due to the nature of the test, large p-values (i.e., p > 0.95) are still significant.
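The quantities behind this table can be illustrated with a short sketch. It assumes haplotype (or allele) counts as input, uses the standard sample-size-corrected estimator for expected heterozygosity, and takes the simulated null distribution as given, since generating neutral samples is not shown here. The function names are hypothetical, and the lower significance bound of 0.05 is an assumption; only the upper bound (p > 0.95) is stated in the footnote.

```python
from typing import Sequence


def observed_homozygosity(counts: Sequence[int]) -> float:
    """Sum of squared allele frequencies (the homozygosity statistic F)."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


def expected_heterozygosity(counts: Sequence[int]) -> float:
    """Unbiased expected heterozygosity: n / (n - 1) * (1 - sum of p_i^2)."""
    n = sum(counts)
    return n / (n - 1) * (1.0 - observed_homozygosity(counts))


def empirical_p_value(observed: float, simulated: Sequence[float]) -> float:
    """Proportion of simulated statistics less than or equal to the observed one."""
    return sum(1 for s in simulated if s <= observed) / len(simulated)


def is_significant(p: float, lower: float = 0.05, upper: float = 0.95) -> bool:
    """Flag p-values in either tail; the lower bound is an assumption."""
    return p < lower or p > upper


# Made-up allele counts for a sample of 2n = 10 chromosomes.
counts = [5, 3, 2]
f_obs = observed_homozygosity(counts)
print(round(f_obs, 3), round(expected_heterozygosity(counts), 3))
```

A large p-value here means the observed statistic sits in the upper tail of the simulated distribution, which is why values above 0.95 are treated as significant as well.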
Summary of sequence variation in the promoter and coding regions of the MC1R gene in four population groups composed of autochthonous populations from the 1000 Genomes project and results of three neutrality tests based on sequence data: Tajima’s D test, Fu’s F test, and synonymous and non-synonymous nucleotide substitution test (dN - dS) of positive and purifying selection for analysis averaging MC1R coding haplotypes. Significant p-values are marked in boldface.
| Population | 2n | Number of nucleotide sites | Number of different haplotypes | Number of segregating sites | θ | π ± SD (%) | Tajima’s D | p-value | Fu’s F | p-value | Number of codons | dN - dS (HA = Positive Selection) | dN - dS (HA = Purifying Selection) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Promoter | | | | | | | | | | | | | |
| AFR | 1008 | 3001 | 78 | 55 | 6.140 ± 1.367 | 0.224 ± 0.117 | -0.991 | 0.152 | | | - | - | - |
| EAS | 1008 | 3001 | 42 | 45 | 4.805 ± 1.131 | 0.188 ± 0.100 | -1.024 | 0.136 | -4.521 | 0.210 | - | - | - |
| EUR | 1006 | 3001 | 36 | 42 | 4.272 ± 1.036 | 0.164 ± 0.089 | -0.808 | 0.227 | -3.360 | 0.291 | - | - | - |
| SAS | 978 | 3001 | 64 | 56 | 6.298 ± 1.399 | 0.213 ± 0.112 | -1.030 | 0.157 | | | - | - | - |
| Coding | | | | | | | | | | | | | |
| AFR | 1008 | 954 | 33 | 31 | 4.138 ± 1.012 | 0.093 ± 0.072 | | | | | 316 | -1.426; p = 1.000 | 1.453; p = 0.074 |
| EAS | 1008 | 954 | 26 | 20 | 2.669 ± 0.741 | 0.150 ± 0.102 | -1.078 | 0.128 | | | 317 | -0.142; p = 1.000 | 0.141; p = 0.444 |
| EUR | 1006 | 954 | 22 | 22 | 2.937 ± 0.792 | 0.103 ± 0.077 | | | | | 317 | 0.444; p = 0.329 | -0.439; p = 1.000 |
| SAS | 978 | 954 | 24 | 23 | 3.082 ± 0.821 | 0.062 ± 0.055 | | | | | 317 | -1.034; p = 1.000 | 1.055; p = 0.147 |
2n: number of chromosomes analyzed
Number of different haplotypes
Number of segregating sites
π ± SD: average nucleotide diversity and its standard deviation
A p-value was computed by comparing the estimated statistic to a distribution of estimates computed for 99,999 random samples with the same sample size and level of polymorphism as the observed data, and represents the proportion of simulated D statistics less than or equal to the observed value.
A p-value was computed by comparing the estimated statistic to a distribution of estimates computed for 99,999 random samples with the same sample size and level of polymorphism as the observed data, and represents the proportion of simulated F statistics less than or equal to the observed value. The 2nd percentile of the distribution corresponded to the 5% cutoff value. Therefore, an F statistic should be considered significant at the 5% level only if its p-value is below 0.02, rather than below 0.05.
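For readers who want to see how a Tajima's D value relates to the number of segregating sites and the mean number of pairwise differences, the sketch below implements the standard formula from Tajima (1989). It is an illustration only, not the software used in the study. The function names are hypothetical, and `mean_pairwise_diff` is assumed to be expressed per sequence rather than as a per-site percentage. The second helper encodes the 0.02 cutoff for Fu's statistic described in the footnote above.

```python
import math


def tajimas_d(n: int, segregating_sites: int, mean_pairwise_diff: float) -> float:
    """Tajima's D from sample size n, segregating sites S and mean pairwise differences."""
    if n < 4 or segregating_sites < 1:
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = segregating_sites
    # Difference between the two estimators of theta, scaled by its standard deviation.
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)


def fu_f_significant(p_value: float, cutoff: float = 0.02) -> bool:
    """Significance at the 5% level for Fu's statistic, using the 0.02 cutoff."""
    return p_value < cutoff


# Made-up example: 10 sequences, 16 segregating sites, 3.888889 mean pairwise differences.
print(round(tajimas_d(10, 16, 3.888889), 3))
```

Negative values arise when the mean number of pairwise differences is lower than expected from the number of segregating sites, which is the direction of every D value that survives in the table.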