Tommaso Alfonsi1, Anna Bernasconi2, Arif Canakoglu2,3, Marco Masseroli2.
Abstract
BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.
Keywords: 1000 Genomes; Data integration; Data warehousing; Data wrangling; Human genetic variation; Population variant analysis
Year: 2022 PMID: 36175857 PMCID: PMC9520931 DOI: 10.1186/s12859-022-04927-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Overview of the proposed framework. Our approach involves developing two source-specific data integration modules for the download and transform stages of our pre-existing data integration framework. Integrating 1KGP into our genomic data repository extends the set of data sources supported by GenoSurf and GMQL, two exploratory and data querying software applications. Then, through its API, VarSum provides easy summarization of extensive genomic variation data for user-defined populations; the API queries a novel relational database derived from the genomic data repository for improved efficiency. The novel contributions of this work are shown in red.
Comparison of the features of bioinformatics tools providing aggregated statistics over the variants of a population
| | gnomAD | Ensembl | PGG.SNV | EVS | IGIB group |
|---|---|---|---|---|---|
| (i) Filters on population by metadata | Limited (a) | – | – | – | – |
| (ii) Filters on population by region data | – | – | – | – | – |
| (iii) Gene annotations | | | | | |
| (iv) Quality metrics | – | – | | | |
| (v) Statistics grouped on metadata | Limited (b) | Limited (c) | Limited (c) | Limited (c) | Limited (c) |
| (vi) API | – | – | – | – |
The evaluation concerns (i) the availability of filters on metadata attributes to select a subset of the donors from the dataset, i.e., the population of interest; (ii) the availability of filters to consider only the donors showing particular genomic features (either precise variants or mutated genome regions) distinct from the genomic variant studied in the population; (iii) the possibility to look at the gene annotations and (iv) sequencing quality metrics information concerning a genomic region or variant; (v) the possibility to group the result statistics on the metadata; (vi) the availability of an API
(a) Analysis results are available considering all the donors or one of the seven predefined partitions. On top of the chosen partition, it is also possible to further limit the population to an ethnic group, country (when available), or gender type.
(b) The statistics about the variant frequency can be grouped by ethnicity, gender, or geographical origin, if the population has not already been filtered on these attributes. In addition, it is possible to know the age distribution of the donors included in the population and of the variant carriers.
(c) Only for the groups identified by the original sequencing project and/or for the geographical origin of the donors.
Fig. 2 The META-BASE architecture and workflow. Datasets are downloaded from the original source and transformed into a GDM-compliant format. Metadata are cleaned, mapped into the GCM relational integrated data structure, normalized and enriched with related ontological concepts. The homogenized information is checked for correctness, flattened to a file-based data structure and loaded within the META-BASE repository.
Fig. 3 Transformation process for 1000 Genomes Project files. The output sample GDM file-pairs are obtained by: (i) processing the large per-chromosome 1KGP VCF files to extract GDM genomic region data files for single samples/individuals and (ii) distributing the 1KGP metadata information into single-sample GDM metadata files.
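Step (i) of the transformation in Fig. 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_vcf_by_sample`, the output tuple layout, and the 0-based start coordinate are assumptions made for illustration, and the input is a tiny hand-written VCF rather than a real 1KGP file.

```python
from collections import defaultdict

# Tiny hand-written multi-sample VCF (hypothetical records, tab-separated).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097",
    "1\t100\trs1\tA\tG\t100\tPASS\t.\tGT\t0|1\t0|0",
    "1\t200\trs2\tC\tT\t100\tPASS\t.\tGT\t1|1\t1|0",
]


def split_vcf_by_sample(vcf_lines):
    """Return {sample: [region, ...]} with one region per variant the sample carries.

    Each region is (chrom, start, stop, ref, alt, genotype); this layout is an
    assumption for the sketch, not the actual GDM region schema.
    """
    samples = []
    regions = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index]
            alleles = genotype.replace("/", "|").split("|")
            # Keep the record only if the sample carries a non-reference allele.
            if any(allele not in ("0", ".") for allele in alleles):
                start = int(pos) - 1  # assumed 0-based start coordinate
                regions[sample].append(
                    (chrom, start, start + len(ref), ref, alt, genotype)
                )
    return dict(regions)


per_sample = split_vcf_by_sample(EXAMPLE_VCF)
```

In this sketch each sample ends up with its own list of regions, mirroring the idea of one region file per individual; step (ii), the distribution of metadata into per-sample files, is not covered.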
Mapping 1KGP population values to ethnicity values
| Ethnicity | Pop. Code | Population | In diaspora | Super Population |
|---|---|---|---|---|
| White | CEU | Utah Residents (CEPH) with Northern/Western Eur. Ancestry | Yes | EUR |
| White | TSI | Toscani in Italia | No | EUR |
| White | FIN | Finnish in Finland | No | EUR |
| White | GBR | British in England and Scotland | No | EUR |
| White | IBS | Iberian Population in Spain | No | EUR |
| Black or african american | YRI | Yoruba in Ibadan, Nigeria | No | AFR |
| Black or african american | LWK | Luhya in Webuye, Kenya | No | AFR |
| Black or african american | GWD | Gambian in Western Divisions in the Gambia | No | AFR |
| Black or african american | MSL | Mende in Sierra Leone | No | AFR |
| Black or african american | ESN | Esan in Nigeria | No | AFR |
| Black or african american | ASW | Americans of African Ancestry in SW USA | No | AFR |
| Black or african american | ACB | African Caribbeans in Barbados | No | AFR |
| Latin american | MXL | Mexican Ancestry from Los Angeles USA | No | AMR |
| Latin american | PUR | Puerto Ricans from Puerto Rico | No | AMR |
| Latin american | CLM | Colombians from Medellin, Colombia | No | AMR |
| Latin american | PEL | Peruvians from Lima, Peru | No | AMR |
| Asian | GIH | Gujarati Indian from Houston, Texas | Yes | SAS |
| Asian | PJL | Punjabi from Lahore, Pakistan | No | SAS |
| Asian | BEB | Bengali from Bangladesh | No | SAS |
| Asian | STU | Sri Lankan Tamil from the UK | Yes | SAS |
| Asian | ITU | Indian Telugu from the UK | Yes | SAS |
| Asian | CHB | Han Chinese in Beijing, China | No | EAS |
| Asian | JPT | Japanese in Tokyo, Japan | No | EAS |
| Asian | CHS | Southern Han Chinese | No | EAS |
| Asian | CDX | Chinese Dai in Xishuangbanna, China | No | EAS |
| Asian | KHV | Kinh in Ho Chi Minh City, Vietnam | No | EAS |
The ethnicity value, not specified in the original metadata of 1KGP samples, is assigned based on the available population value, thus enabling interoperability with other data sources.
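The population-to-ethnicity assignment described above amounts to a two-step lookup, sketched here under stated assumptions. The codes and values are taken from the mapping table; the dictionary names and the helper `ethnicity_for` are hypothetical and do not reflect the actual integration code.

```python
# Super population of each 1KGP population code, as listed in the mapping table.
POPULATION_TO_SUPER = {
    **dict.fromkeys(["CEU", "TSI", "FIN", "GBR", "IBS"], "EUR"),
    **dict.fromkeys(["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"], "AFR"),
    **dict.fromkeys(["MXL", "PUR", "CLM", "PEL"], "AMR"),
    **dict.fromkeys(["GIH", "PJL", "BEB", "STU", "ITU"], "SAS"),
    **dict.fromkeys(["CHB", "JPT", "CHS", "CDX", "KHV"], "EAS"),
}

# Ethnicity value assigned to each super population (SAS and EAS share "Asian").
SUPER_TO_ETHNICITY = {
    "EUR": "White",
    "AFR": "Black or african american",
    "AMR": "Latin american",
    "SAS": "Asian",
    "EAS": "Asian",
}


def ethnicity_for(population_code):
    """Return the ethnicity value for a 1KGP population code (hypothetical helper)."""
    super_population = POPULATION_TO_SUPER[population_code.upper()]
    return SUPER_TO_ETHNICITY[super_population]
```

Going through the super population keeps the mapping compact; an unknown code raises `KeyError` rather than silently receiving a default ethnicity.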
Fig. 4 Relational data model for VarSum. It is composed of a table for the region data of each data source (e.g., 1000 Genomes Project - KGENOMES, TCGA, GENCODE) and one materialized view (METADATA) providing fast access to the metadata of all sources. The materialized view is obtained as a selection of the most important attributes for VarSum from those comprehensively available in the tables of the database [21] based on the Genomic Conceptual Model [35], here represented only through their names
Fig. 5 Overview of the server software architecture of VarSum
Statistics of the 1KGP input and output datasets
| Dataset | | Region files | Metadata files | Size (.gz compressed) | Size (uncompressed) |
|---|---|---|---|---|---|
| Input | hg19 | 25 | 4 | 17 GB | 796 GB |
| Input | GRCh38 | 23 | 4 | 13 GB | 752 GB |
Filtering capability for TCGA and 1KGP
| Filter samples on | | Availability in TCGA | Availability in 1KGP |
|---|---|---|---|
| Metadata | Gender | x | x |
| Metadata | Ethnicity | x | x |
| Metadata | Population | | x |
| Metadata | Super population | | x |
| Metadata | Health status | x | x |
| Metadata | Disease | x | x |
| Metadata | Assembly | x | x |
| Metadata | DNA source (i.e., LCL/blood) | | x |
| Region data | Presence of a specific variant/multiple variants | x | x |
| Region data | Absence of a specific variant/multiple variants | x | x |
| Region data | Presence of any variant inside a specific genomic region | x | x |
| Region data | Presence of two or more specific variants on the same chrom. copy | | x |
| Region data | Presence of two specific variants on opposite chrom. copies | | x |
| Region data | Having any germline variant | | x |
| Region data | Having any somatic variant | x | |
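The two filters on chromosome copies in the table above require phased genotypes. The sketch below illustrates the underlying check on VCF-style phased genotype strings such as `1|0`; the function names are hypothetical, and this is only an illustration of the logic, not how VarSum evaluates these filters against its relational database.

```python
def _alt_copies(phased_gt):
    """Return the indexes of the chromosome copies carrying a non-reference allele."""
    if "|" not in phased_gt:
        raise ValueError(f"genotype {phased_gt!r} is not phased")
    return {
        copy_index
        for copy_index, allele in enumerate(phased_gt.split("|"))
        if allele not in ("0", ".")
    }


def on_same_copy(first_gt, second_gt, *other_gts):
    """True if all the given variants occur together on at least one chromosome copy."""
    copies = [_alt_copies(gt) for gt in (first_gt, second_gt, *other_gts)]
    return bool(set.intersection(*copies))


def on_opposite_copies(first_gt, second_gt):
    """True if the two variants can be found on different chromosome copies."""
    first, second = _alt_copies(first_gt), _alt_copies(second_gt)
    return any(i != j for i in first for j in second)
```

For a donor with genotypes `1|0` and `0|1` at two variant sites, `on_same_copy` is false and `on_opposite_copies` is true. Unphased genotypes such as `0/1` are rejected because the copy assignment cannot be determined from them.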
Endpoints available in the VarSum API
| HTTP method | Function | Endpoint |
|---|---|---|
| POST | Measure-based | |
| POST | Measure-based | |
| POST | Measure-based | |
| POST | Measure-based | |
| GET | Exploratory | |
| POST | Exploratory | |
| POST | Exploratory | |
| POST | Exploratory |
Fig. 6 Comparison of the pathologies of donors with only TM1, or only TM2, or both TM1 and TM2 mutations. TM1 is found in 6 groups of patients corresponding to widely different pathologies, including brain lower grade glioma, bladder urothelial carcinoma, breast invasive carcinoma, glioblastoma multiforme, acute myeloid leukemia, and prostate adenocarcinoma. The co-presence of both TM1 and TM2 mutations reduces the number and types of associated pathologies to only 2 (both brain tumors), and increases by 2.1% the likelihood of correctly detecting the brain lower grade glioma
Fig. 7 Candidate genes and their scores. On the left the scores for the tumor cohort; on the right the ones for the healthy cohort
Fig. 8 Candidate genes and their scores. On the left the scores for the tumor cohort; on the right the ones for the healthy cohort