Oscar A Nyangiri, Harry Noyes, Julius Mulindwa, Hamidou Ilboudo, Justin Windingoudi Kabore, Bernardin Ahouty, Mathurin Koffi, Olivier Fataki Asina, Dieudonne Mumba, Elvis Ofon, Gustave Simo, Magambo Phillip Kimuda, John Enyaru, Vincent Pius Alibu, Kelita Kamoto, John Chisi, Martin Simuunza, Mamadou Camara, Issa Sidibe, Annette MacLeod, Bruno Bucheton, Neil Hall, Christiane Hertz-Fowler, Enock Matovu.
Abstract
BACKGROUND: Copy number variation is an important class of genomic variation that has been reported in 75% of the human genome. However, it is underreported in African populations. Copy number variants (CNVs) could have important impacts on disease susceptibility and environmental adaptation. To describe CNVs and their possible impacts in Africans, we sequenced genomes of 232 individuals from three major African ethno-linguistic groups: (1) Niger Congo A from Guinea and Côte d'Ivoire, (2) Niger Congo B from Uganda and the Democratic Republic of Congo and (3) Nilo-Saharans from Uganda. We used GenomeSTRiP and cn.MOPS to identify copy number variant regions (CNVRs).
Keywords: Adaptation; CNV; Niger Congo A; Niger Congo B; Nilo-Saharan; Signatures of selection; Structural variation; Tag haplotypes
Year: 2020 PMID: 32272904 PMCID: PMC7147055 DOI: 10.1186/s12864-020-6669-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Ethnicity and origin of individuals analysed for CNV
| Pop | Country | District | Ethno-linguistic group (ethnologue code, n) |
|---|---|---|---|
| UNL | Uganda | Maracha | Lugbara (IGG, 50) |
| UBB | Uganda | Iganga | Basoga (XOG, 33) |
| DRC | Democratic Republic of Congo | Bandundu | Kingongo (NOQ, 30); Kimbala (MDP, 20) |
| GAS | Guinea | Forecariah, Boffa, Dubreka | Soussou (SUS, 49) |
| CIV | Côte d’Ivoire | Bonon, Sinfra | Baoule (BCI, 11); Gouro (GOA, 21); Moore (MOS, 12); Senoufo (SEF, 4); Malinke (LOI, 1); Koyaka (KGA, 1) |
Ethnologue codes are derived from the ethnic languages of the world resource [13]
Fig. 1 Selection of high confidence CNV and analysis strategy. GenomeSTRiP CNVR overlapping cn.MOPS CNVR were selected and singletons assessed for removal. The resulting consensus dataset was annotated to identify novel CNVs, show population structure deduced from CNV calls and tag SNP analysis
CNV statistics using GenomeSTRiP and cn.MOPS algorithms
| Parameter | GenomeSTRiP | cn.MOPS | GenomeSTRiP that overlap cn.MOPS |
|---|---|---|---|
| Raw CNV regions (CNVR) | 16,149 | 9213 | |
| CNVR after QC | 11,275 | 2115 | 7608 |
| Total CNV scored | 127,699 | 37,679 | 106,922 |
| Deletion CNV | 65,588 | 26,008 | 61,025 |
| Gain CNV | 62,111 | 11,671 | 45,897 |
| Mean CNV count per CNVR | 11.3 | 17.8 | 14.0 |
| Mean CNVR per individual | 654 | 193 | 548 |
| Count of overlapping CNVRs a | 7608 | 1691 | 7608 |
| Mean Length of CNVR (kb) | 9.5 | 541.7 | 10.7 |
| SD length of CNVR (kb) | 13.2 | 1287.6 | 14.1 |
| Median Length of CNVR (kb) | 5.3 | 32.4 | 6 |
| Total Length of CNVR (Mb) | 108.1 | 1145.8 | 81.2 |
| Observed Length CNV present in both methods (Mb) (Simulated ± SD)b | 81.2 (43.4 ± 1.0) | | |
Descriptive statistics of CNVR found using GenomeSTRiP and cn.MOPS. Note that: GenomeSTRiP has about 5.3 times the number of CNVRs compared with cn.MOPS (11,275 cf. 2115); GenomeSTRiP CNVRs were shorter (median length 5.3 kb) than cn.MOPS CNVRs (median length 32.4 kb); total length of cn.MOPS CNVRs was about 10.6 times greater (1146 Mb cf. 108 Mb) than that of GenomeSTRiP CNVRs. CNVR = CNV region, a genomic location with chromosome, start and end base pair positions that has overlapping CNVs; CNVR after QC = the CNVRs left after some CNVRs were dropped because they were only found in samples that were outliers in principal component analysis (PCA) plots of raw data; Mean CNV count per CNVR = number of samples with a CNV at each CNV region = total CNV count / total CNVRs; Mean CNVR per individual = count of CNV divided by number of samples; Mean, Standard deviation, Median, Total length, Observed length: calculated per CNV not CNVR
aCount of any overlap (minimum 1 bp) between GenomeSTRiP and cn.MOPS CNVR
bThe expected length of CNVs that would be found by both methods was obtained by 100 simulations using all the observed lengths of CNVs allocated to random places in the genome
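The random-placement idea in footnote b can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a single linear genome of hypothetical length `genome_len`, half-open `(start, end)` intervals, and uniform random start positions; the function names are made up for the example.

```python
import random

def merge(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def shared_length(a, b):
    """Total number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def random_placement(lengths, genome_len, rng):
    """Give each observed length a uniformly random start (assumption)."""
    placed = []
    for n in lengths:
        start = rng.randrange(0, genome_len - n + 1)
        placed.append((start, start + n))
    return placed

def simulate_expected_overlap(len_a, len_b, genome_len, n_sim=100, seed=0):
    """Mean and sample SD of the shared length over n_sim random placements."""
    rng = random.Random(seed)
    sims = [
        shared_length(random_placement(len_a, genome_len, rng),
                      random_placement(len_b, genome_len, rng))
        for _ in range(n_sim)
    ]
    mean = sum(sims) / n_sim
    sd = (sum((x - mean) ** 2 for x in sims) / (n_sim - 1)) ** 0.5
    return mean, sd
```

Comparing the observed shared length with the simulated mean ± SD indicates whether the two callers agree more often than random placement would predict, which is the comparison reported in the last row of the table (81.2 Mb observed cf. 43.4 ± 1.0 Mb simulated).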
Fig. 2 Venn diagram showing counts of CNVR shared between populations. a All CNVR from Niger Congo A (NCA), Niger Congo B (NCB) and Nilo-Saharan (NS) ethnic groups. CNVR overlapping 5 kb genomic regions were plotted for each population. A majority of the CNVR are shared between populations, but Nilo-Saharans appear to have the fewest CNVR, with most of them shared with Niger Congo A and Niger Congo B. b Sharing of novel CNV regions between populations. Most novel CNVR are unique to individual populations studied, whereas others are shared. To enable comparison, the genome was divided into 5 kb regions, and the regions containing novel CNVR in each population were compared for overlaps
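The 5 kb binning used to compare populations can be sketched as below. This is an illustrative assumption about how such a comparison might be coded, with hypothetical function names and half-open `(chrom, start, end)` coordinates; it is not taken from the paper's scripts.

```python
BIN = 5_000  # bin size in bp, matching the 5 kb regions in the caption

def bins_covered(cnvrs, bin_size=BIN):
    """Set of (chrom, bin_index) touched by half-open (chrom, start, end) CNVRs."""
    covered = set()
    for chrom, start, end in cnvrs:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            covered.add((chrom, b))
    return covered

def venn_counts(pops):
    """Count bins by the exact set of populations that contain them.

    pops maps a population name to its set of covered bins; the result
    maps frozenset(names) to the number of bins seen in exactly those names.
    """
    counts = {}
    all_bins = set().union(*pops.values())
    for b in all_bins:
        key = frozenset(name for name, bins in pops.items() if b in bins)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Toy coordinates, invented for illustration only.
pops = {
    "NCA": bins_covered([("chr1", 0, 6_000)]),
    "NCB": bins_covered([("chr1", 4_000, 5_000)]),
    "NS": bins_covered([("chr1", 12_000, 13_000)]),
}
counts = venn_counts(pops)
```

Each key of `counts` corresponds to one region of the Venn diagram, so the values can be passed directly to a plotting routine.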
Fig. 3 CNV density comparison between TrypanoGEN and the 1000 Genomes project. Counts of loci per Mb and counts of CNV per Mb for each chromosome in TrypanoGEN and 1000 Genomes project data. a Counts of CNVR per Mb in TrypanoGEN. b CNV loci counts per Mb in TrypanoGEN. c Counts of CNVR per Mb in 1000 Genomes. d CNV loci counts per Mb in 1000 Genomes. Both sets show similar patterns of CNV per chromosome, with 1000 Genomes data having tighter interquartile ranges
Fig. 4 Heat map showing the Pearson correlation coefficient between the count of CNV in 10 Mb windows in each population across the genomes of TrypanoGEN and 1000 Genomes samples. The histogram in the legend indicates the number of correlations with each value of Pearson’s r; there are large numbers of correlations between 0.5 and 0.6 and also between 0.9 and 1. Correlation coefficients are high (> 0.9) between populations from the same dataset but lower (0.5–0.6) between populations from different datasets
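The quantity plotted in the heat map can be sketched as follows: count CNV per 10 Mb window for each population, then correlate the window counts between populations. This is a minimal sketch that assumes each CNV is represented by a single position on one chromosome; the function names are hypothetical.

```python
import math

WINDOW = 10_000_000  # 10 Mb windows, as in the caption

def window_counts(cnv_positions, chrom_len, window=WINDOW):
    """Number of CNV positions falling in each fixed-size window."""
    counts = [0] * math.ceil(chrom_len / window)
    for pos in cnv_positions:
        counts[pos // window] += 1
    return counts

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Computing `pearson_r` for every pair of populations yields the symmetric matrix that a heat map like this one displays.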
Fig. 5 Genomic distribution of CNVR and their frequency in our samples. a Known (black) and novel (red) CNVR are distributed throughout the genome, and novel CNVR have lower frequencies than known CNVR. The centre of the circle represents the lowest frequency (< 1%), whereas the outermost bounds represent frequencies of up to 100%. A few known CNVRs show high frequencies. b Comparison of frequencies in the various populations. No major differences in CNVR frequencies were found between populations. All populations are represented in the plot with different colours. The centre of the plot represents a frequency of 0%, whereas the outermost bounds represent higher CNVR frequencies. The frequencies of CNVRs with CNV frequencies < 20% are set to 0% to enhance visibility. Cyan shows the CNV frequency of those common to GAS and all populations, UBB are in black, DRC are in green, CIV are in dark blue and UGN are in red
Counts of SNPs inside and outside CNVRs with significant (−log10 p > 3) and non-significant p values
| UNL CNV + 5 kb flanks | −log10 p > 3 | −log10 p ≤ 3 |
|---|---|---|
| SNP in CNVR | 1805 | 493,241 |
| SNP not in CNVR | 10,473 | 8,114,213 |
CNVRs were defined as the boundaries identified by GenomeSTRiP plus 5 kb upstream and downstream flanks to maintain consistency with the Tag SNP analysis
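The enrichment implied by this 2 × 2 table can be summarised with an odds ratio. The sketch below assumes that the first numeric column holds the significant SNPs (−log10 p > 3), since that column has far fewer SNPs; it is an illustration and not the test reported by the authors.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Counts copied from the table; the column assignment is an assumption.
in_sig, in_nonsig = 1805, 493_241        # SNP in CNVR
out_sig, out_nonsig = 10_473, 8_114_213  # SNP not in CNVR

ratio = odds_ratio(in_sig, in_nonsig, out_sig, out_nonsig)
frac_in = in_sig / (in_sig + in_nonsig)
frac_out = out_sig / (out_sig + out_nonsig)

print(f"odds ratio: {ratio:.2f}")
print(f"significant fraction inside CNVR: {frac_in:.4%}")
print(f"significant fraction outside CNVR: {frac_out:.4%}")
```

Under that assumption the odds ratio is about 2.8, meaning SNPs inside CNVRs (with flanks) are more often significant than SNPs outside them.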
Classification of Genes in CNVR with evidence of selection
| Type | Observed Count | Count in Ensembl | Ratio Observed: Expected |
|---|---|---|---|
| pseudogene | 259 | 14,975 | 1.5 |
| protein_coding | 184 | 21,817 | 0.7 |
| lincRNA | 89 | 7177 | 1.1 |
| IG_V_gene | 25 | 138 | 16.1 |
| IG_V_pseudogene | 22 | 187 | 10.5 |
| antisense | 20 | 5339 | 0.3 |
| miRNA | 19 | 3243 | 0.5 |
| snRNA | 11 | 2001 | 0.5 |
| processed_transcript | 9 | 799 | 1.0 |
| IG_C_gene | 9 | 14 | 57.2 |
| misc_RNA | 8 | 2127 | 0.3 |
SNP with evidence of selection were annotated with a gene name if they were within 5 kb of the gene start or end. Counts of gene types were based on Ensembl annotation and the Count in Ensembl was the total number of each type recorded in Ensembl Biomart
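One way to arrive at an observed-to-expected ratio like the one tabulated is to scale each gene type's share of Ensembl by the total observed count. The sketch below makes that assumption and uses only the rows listed in the table as totals, so its output approximates the published ratios rather than reproducing them exactly.

```python
# (observed count, count in Ensembl) per gene type, copied from the table.
rows = {
    "pseudogene": (259, 14_975),
    "protein_coding": (184, 21_817),
    "lincRNA": (89, 7_177),
    "IG_V_gene": (25, 138),
    "IG_V_pseudogene": (22, 187),
    "antisense": (20, 5_339),
    "miRNA": (19, 3_243),
    "snRNA": (11, 2_001),
    "processed_transcript": (9, 799),
    "IG_C_gene": (9, 14),
    "misc_RNA": (8, 2_127),
}

# Assumption: totals are the sums of the listed rows only.
total_observed = sum(obs for obs, _ in rows.values())
total_ensembl = sum(ens for _, ens in rows.values())

ratios = {}
for gene_type, (obs, ens) in rows.items():
    expected = total_observed * ens / total_ensembl
    ratios[gene_type] = obs / expected

for gene_type, ratio in ratios.items():
    print(f"{gene_type:22s} {ratio:6.1f}")
```

With these inputs the immunoglobulin gene types stand out with ratios well above 10, while protein-coding genes fall below 1, in line with the pattern in the table.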
Counts of CNV at CNVR with and without SNP under selection
| | Deletions | Wild Type | Insertions |
|---|---|---|---|
| CNVR with Selected SNP | 2779 | 39,811 | 6534 |
| CNVR without Selected SNP | 83,003 | 1,566,194 | 63,553 |
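The counts above are easier to compare as proportions per row. The sketch below computes those proportions and a Pearson chi-square statistic for the 2 × 3 table; it is an illustrative calculation and is not claimed to be the test used by the authors.

```python
labels = ["Deletions", "Wild Type", "Insertions"]
table = {
    "with_selected_snp": [2779, 39_811, 6534],
    "without_selected_snp": [83_003, 1_566_194, 63_553],
}

# Proportion of each CNV state within each row.
fractions = {
    row: dict(zip(labels, (n / sum(counts) for n in counts)))
    for row, counts in table.items()
}

# Pearson chi-square statistic with (2 - 1) * (3 - 1) = 2 degrees of freedom.
row_totals = {row: sum(counts) for row, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(row_totals.values())

chi2 = 0.0
for row, counts in table.items():
    for observed, col_total in zip(counts, col_totals):
        expected = row_totals[row] * col_total / grand_total
        chi2 += (observed - expected) ** 2 / expected

for row, frac in fractions.items():
    print(row, {k: f"{v:.2%}" for k, v in frac.items()})
print(f"chi-square: {chi2:.1f}")
```

Insertions make up roughly 13% of calls at CNVR with a selected SNP compared with under 4% elsewhere, which is the main contrast in the table.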
Fig. 6 a PCA plot showing CNV population structure in our data compared to 1000 Genomes. The PCA distinguishes major continental populations from each other, but is not able to resolve specific populations within the continental populations. Africans in the 1000 Genomes (AFR) are closer to our data (TGN). Conventions for major continental populations are described by the 1000 Genomes project [8, 23]. b PCA plot showing population structure for bi-allelic deletion CNV. Phase information is non-ambiguous for bi-allelic deletions. The Africans in the 1000 Genomes overlay the TrypanoGEN African samples, indicating similar CNV in the datasets. c PCA plot showing population structure due to bi-allelic insertion CNV. There was no specific pattern observed as fewer bi-allelic insertions were available in the data
FST for CNVs computed from numbers of deletions per locus
| | UNL | DRC | GAS | UBB | CIV |
|---|---|---|---|---|---|
| UNL | 0 | | | | |
| DRC | 0.004 | 0 | | | |
| GAS | 0.008 | 0.004 | 0 | | |
| UBB | 0.004 | 0.003 | 0.004 | 0 | |
| CIV | 0.008 | 0.004 | 0.001 | 0.004 | 0 |
FST values were calculated in PLINK using only bi-allelic deletions, since the phase of these is known
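To give a feel for the scale of the values in the table, the sketch below computes Wright's FST for a single bi-allelic locus from the deletion allele frequencies in two populations. This textbook formula is used here only for illustration; it is not the estimator implemented in PLINK, and the frequencies in the example are invented.

```python
def wright_fst(p1, p2):
    """Wright's FST for one bi-allelic locus and two equally weighted populations.

    p1, p2 are the deletion allele frequencies in each population.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)      # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                          # locus is monomorphic overall
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

# Hypothetical frequencies, not taken from the study.
example = wright_fst(0.10, 0.20)
print(f"FST = {example:.4f}")
```

Even a twofold difference in allele frequency at one locus gives an FST of only about 0.02, so genome-wide values of 0.001 to 0.008 indicate very similar deletion frequencies across the five populations.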