Massimiliano Cocca, Caterina Barbieri, Maria Pina Concas, Antonietta Robino, Marco Brumat, Ilaria Gandin, Matteo Trudu, Cinzia Felicita Sala, Dragana Vuckovic, Giorgia Girotto, Giuseppe Matullo, Ozren Polasek, Ivana Kolčić, Paolo Gasparini, Nicole Soranzo, Daniela Toniolo, Massimo Mezzavilla.
Abstract
The genomic variation of the populations of the Italian peninsula is currently undercharacterised: the only Italian whole-genome reference is the Tuscan sample from the 1000 Genomes Project. To address this issue, we sequenced a total of 947 Italian samples from three different geographical areas. First, we defined a new Italian Genome Reference Panel (IGRP1.0) for imputation, which improved imputation accuracy, especially for rare variants, and we tested it by GWAS analysis of red blood cell traits. Furthermore, we extended the catalogue of genetic variation by investigating the level of population structure, the pattern of natural selection, the distribution of deleterious variants and the occurrence of human knockouts (HKOs). Overall, the results demonstrate a high level of genomic differentiation between cohorts, different signatures of natural selection and a distinctive distribution of deleterious variants and HKOs, confirming the necessity of distinct genome references for the Italian population.
Year: 2019 PMID: 31784700 PMCID: PMC7080768 DOI: 10.1038/s41431-019-0551-x
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
Final data release of WGS data for all the INGI cohorts
| INGI, all samples | CAR | FVG | VBI | INGI |
|---|---|---|---|---|
| Samples | 124 | 378 | 424 | 926 |
| Females | 66 | 220 | 249 | 535 |
| Males | 58 | 158 | 175 | 391 |
| Average coverage | 6.31 | 7.23 | 6.12 | 6.55 |
| Sites | 13,370,262 | 17,002,010 | 19,361,094 | 26,619,091 |
| Multiallelic sites | 248,638 | 356,599 | 393,328 | 560,918 |
| SNPs | 12,208,629 | 15,521,313 | 17,830,208 | 24,557,366 |
| INDELs | 1,161,633 | 1,480,697 | 1,530,886 | 2,061,725 |
| Sites MAF ≤ 1% | 3,627,622 | 7,283,720 | 9,416,028 | 16,685,951 |
| Sites 1% < MAF ≤ 5% | 3,007,162 | 3,069,534 | 3,121,545 | 3,125,971 |
| Sites MAF > 5% | 6,735,478 | 6,648,756 | 6,823,521 | 7,123,064 |
| Singleton SNPs | 2,061,824 | 2,784,746 | 3,554,744 | 6,193,486 |
| Singleton INDELs | 92,372 | 131,275 | 133,156 | 273,679 |
| Average heterozygosity rate per sample | 17.57% | 13.27% | 12.16% | 13.34% |
| Average derived allele count per sample | 4,703,290 | 4,741,910 | 4,844,980 | 4,763,393 |
| Average variations per sample | 3,518,020 | 3,421,910 | 3,541,760 | 3,493,897 |
| Average INDELs per sample | 531,151 | 586,740 | 590,109 | 569,333 |
| Average singletons per sample | 17,285 | 7,671 | 8,646 | 6,925 |
The table shows information about the final data release for each INGI cohort separately, as well as for the pooled dataset (INGI column). Sequence data were aligned to the human genome reference build 37 (GRCh37).
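The site counts in the table can be cross-checked with simple arithmetic: in each column, SNPs plus INDELs should equal the number of sites, and the three MAF bins should partition the same total. The following is a minimal sketch of that check, using the counts as transcribed above; the names `TABLE`, `check_column` and `check_table` are hypothetical and not part of the original study.

```python
# Counts transcribed from the table above (three cohorts plus the pooled INGI column).
# maf_bins order: MAF <= 1%, 1% < MAF <= 5%, MAF > 5%.
TABLE = {
    "CAR": {"sites": 13_370_262, "snps": 12_208_629, "indels": 1_161_633,
            "maf_bins": (3_627_622, 3_007_162, 6_735_478)},
    "FVG": {"sites": 17_002_010, "snps": 15_521_313, "indels": 1_480_697,
            "maf_bins": (7_283_720, 3_069_534, 6_648_756)},
    "VBI": {"sites": 19_361_094, "snps": 17_830_208, "indels": 1_530_886,
            "maf_bins": (9_416_028, 3_121_545, 6_823_521)},
    "INGI": {"sites": 26_619_091, "snps": 24_557_366, "indels": 2_061_725,
             "maf_bins": (16_685_951, 3_125_971, 7_123_064)},
}


def check_column(col):
    """Return (SNPs + INDELs - sites, sum of MAF bins - sites) for one column.

    A value of 0 means the corresponding identity holds exactly.
    """
    by_type = col["snps"] + col["indels"] - col["sites"]
    by_maf = sum(col["maf_bins"]) - col["sites"]
    return by_type, by_maf


def check_table(table):
    """Apply check_column to every column of the table."""
    return {name: check_column(col) for name, col in table.items()}


if __name__ == "__main__":
    for name, (by_type, by_maf) in check_table(TABLE).items():
        print(f"{name}: SNPs+INDELs-sites = {by_type}, MAF bins-sites = {by_maf}")
```

With the values as transcribed here, the SNP plus INDEL identity holds in every column and the MAF bins sum to the site count for CAR, FVG and VBI. For the pooled INGI column the MAF bins sum to 26,934,986 rather than 26,619,091; the sketch only reports this difference and does not explain it.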
Fig. 1 Dataset description: (a) Geographical localisation of the three study cohorts. (b) The minor allele frequency spectrum of the final INGI dataset. For comparison, the minor allele frequency spectrum of the TSI cohort from 1000G Phase 3 data has been added. (c) The stacked bar plot represents the number of novel sites identified in the whole INGI dataset, compared with the available resources. The majority of the private INGI sites are rare variants (MAF ≤ 1%, cross-pattern). Singleton sites (AC = 1) are included.
Fig. 2 Imputation accuracy: mean values of r² (right y-axis) stratified by minor allele frequency (coloured lines), and number of imputed sites (left y-axis) stratified by info score and minor allele frequency (bar plot), for the Italian cohorts. An outbred cohort from northern Italy (NW-ITA) was included for comparison.
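The caption does not state how r² was computed. As a minimal sketch, assuming r² is the squared Pearson correlation between true genotypes and imputed dosages at each variant, and borrowing the MAF bin boundaries from the table (1% and 5%), the stratified mean could be obtained as below. All function names and the input layout are hypothetical.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation; None if either vector has no variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def minor_allele_frequency(genotypes):
    """MAF from diploid alternate-allele counts (0, 1 or 2 per sample)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def maf_bin(maf):
    """Assumed bins, matching the MAF classes used in the table."""
    if maf <= 0.01:
        return "MAF <= 1%"
    if maf <= 0.05:
        return "1% < MAF <= 5%"
    return "MAF > 5%"


def mean_r2_by_maf_bin(true_genotypes, imputed_dosages):
    """Average per-variant r2 within each MAF bin.

    Both arguments are lists of per-variant lists, one value per sample.
    Variants with no variance in either vector are skipped.
    """
    per_bin = {}
    for genotypes, dosages in zip(true_genotypes, imputed_dosages):
        r2 = pearson_r2(genotypes, dosages)
        if r2 is None:
            continue
        per_bin.setdefault(maf_bin(minor_allele_frequency(genotypes)), []).append(r2)
    return {label: mean(values) for label, values in per_bin.items()}
```

Binning on the MAF of the true genotypes, rather than of the imputed dosages, is a design choice of this sketch; a real pipeline might instead use reference-panel frequencies.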
Fig. 3 GWAS analyses: (a) Manhattan plots of the GWAS meta-analysis on the mean corpuscular haemoglobin (MCH) phenotype: the bottom panel shows results from IGRP1.0-imputed data, and the top panel shows results obtained using the 1000G reference panel for imputation. (b) Manhattan plots of the GWAS meta-analysis on the red blood cell count (RBC) phenotype: the bottom panel shows results from IGRP1.0-imputed data, and the top panel shows results obtained using the 1000G reference panel for imputation.
Fig. 4 Population genetic analyses: (a) PCA of Italian samples and European 1000G populations using a subset of 46 individuals from each population. The variance explained by each axis is reported. Each population from the FVG cohort is shown: Erto (ERT), Illegio (ILG), Resia (RSI), Sauris (SAU), San Martino del Carso (SMC) and Clauzetto (CLZ). The first axis separates ILG from all other Italian populations; the second axis separates SAU from RSI; Val Borbera (VBI) and Carlantino (CAR) cluster with Toscani in Italia (TSI), Finnish in Finland (FIN), British in England and Scotland (GBR) and Iberian Population in Spain (IBS). (b) Treemix graph analysis with 3 migration edges: a link between North European populations and isolates such as RSI and SAU is shown. (c) Bean plots of the inbreeding coefficient of 1000G European populations and Italian populations. All FVG populations have a higher inbreeding coefficient than the other Italian and European populations, except for FIN. The plot shows that in the INGI populations the inbreeding coefficient values are more widely dispersed than in TSI, the current Italian reference population from 1000G; each horizontal black bar represents one observation from the dataset.
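The caption does not say how the per-individual inbreeding coefficient in panel (c) was estimated. As a minimal sketch, assuming a method-of-moments estimator based on excess homozygosity (a common choice, not confirmed for this study), F = (O_hom − E_hom) / (N − E_hom), where O_hom is the observed number of homozygous genotypes, E_hom the number expected under Hardy-Weinberg proportions and N the number of sites. The function name and inputs are hypothetical, and no small-sample correction is applied.

```python
def inbreeding_coefficient(genotypes, alt_freqs):
    """Method-of-moments F for one individual (assumed estimator).

    genotypes: alternate-allele counts (0, 1 or 2) at N sites.
    alt_freqs: population alternate-allele frequency at the same N sites.
    """
    if len(genotypes) != len(alt_freqs):
        raise ValueError("genotypes and alt_freqs must have the same length")
    n_sites = len(genotypes)
    # Observed homozygous genotypes: anything that is not heterozygous.
    observed_hom = sum(1 for g in genotypes if g != 1)
    # Expected homozygosity per site under Hardy-Weinberg: 1 - 2pq.
    expected_hom = sum(1 - 2 * p * (1 - p) for p in alt_freqs)
    if n_sites == expected_hom:
        raise ValueError("no polymorphic sites: F is undefined")
    return (observed_hom - expected_hom) / (n_sites - expected_hom)
```

Under this definition, F is 0 when an individual's homozygosity matches the Hardy-Weinberg expectation, positive when homozygosity is in excess (as expected in isolates) and negative when heterozygosity is in excess.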