Luis G Leal, Alessia David, Marjo-Riita Jarvelin, Sylvain Sebert, Minna Männikkö, Ville Karhunen, Eleanor Seaby, Clive Hoggart, Michael J E Sternberg.
Abstract
MOTIVATION: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci.
Year: 2019 PMID: 31070705 PMCID: PMC6954643 DOI: 10.1093/bioinformatics/btz310
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Main steps in cNMTF. Step 1: The algorithm takes four sources of data as input: a relationship matrix for the genotyping data, the Laplacian matrix of an SNV–SNV network, a phenotype matrix, and a kernel similarity matrix encoding the population origin of the subjects. Step 2: The relationship matrix is approximated as the product of three low-dimensional matrices, with penalty terms that constrain the factorization and guide the solutions of two of the factors. Step 3: The dimensionality reduction provides information for clustering tasks, so two of the factor matrices are taken as cluster indicator matrices for SNVs and subjects, respectively. Simultaneously, the product of two of the factors is computed to generate a score matrix. This matrix summarizes the effect of single SNVs on clusters of subjects with specific phenotypes and can be used to prioritize the SNVs. When SNV scores are compared between clusters, we observe the relative importance of each SNV on a trait; therefore, SNVs with a high delta score can be prioritized for further analysis.
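The factorization and scoring steps in Fig. 1 can be illustrated with a minimal sketch. It assumes a plain non-negative matrix tri-factorization fitted with standard multiplicative updates, and it leaves out the network and phenotype penalties and the population correction that cNMTF adds. The names `R`, `F`, `S`, `G`, `nmtf` and `delta_scores`, the toy data, and the use of a simple column difference as the delta score are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nmtf(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix R (SNVs x subjects) as F @ S @ G.T.

    Plain tri-factorization only: no penalties, no population correction.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    F = rng.random((n, k1))   # soft cluster indicators for SNVs
    S = rng.random((k1, k2))  # links SNV clusters to subject clusters
    G = rng.random((m, k2))   # soft cluster indicators for subjects
    errors = [np.linalg.norm(R - F @ S @ G.T)]  # error of the random start
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        F *= (R @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (R.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ R @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(R - F @ S @ G.T))
    return F, S, G, errors

def delta_scores(F, S, cluster_a=0, cluster_b=1):
    """Difference in SNV scores between two subject clusters (assumed form)."""
    omega = F @ S  # score matrix: SNVs x subject clusters
    return omega[:, cluster_a] - omega[:, cluster_b]

# Toy input: 30 SNVs x 20 subjects, random non-negative values (hypothetical).
rng = np.random.default_rng(1)
R = rng.random((30, 20))
F, S, G, errors = nmtf(R, k1=3, k2=2)
delta = delta_scores(F, S)
top = np.argsort(-np.abs(delta))[:5]  # indices of the five largest |delta|
```

Ranking by absolute delta score is one simple way to shortlist SNVs; any standardization of the scores or significance threshold would have to follow the paper's own definition.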
Results of cNMTF applied on serum lipid levels
| Procedure | Variable | Finnish LDL-C | Finnish HDL-C | Finnish TG | White American LDL-C | White American HDL-C | White American TG |
|---|---|---|---|---|---|---|---|
| Pre-processing phenotype data | Cut-off level for controls (mg/dl) | <100 | >60 | <150 | <100 | >60 | <150 |
| | Cut-off level for cases (mg/dl) | | <40 | | | <40 | |
| | Number of subjects in the input | 1711 | 1920 | 3780 | 446 | 605 | 1300 |
| | Number of controls | 1344 | 1775 | 3635 | 308 | 202 | 1214 |
| | Number of cases | 367 | 145 | 145 | 138 | 403 | 86 |
| Pre-processing genetic data | Number of SNVs in the input | 6945 | 7158 | 7620 | 9888 | 12 476 | 8662 |
| | Candidate variants | 6703 | 6910 | 7407 | 9626 | 12 179 | 8446 |
| | Damaging variants | 242 | 248 | 213 | 262 | 297 | 216 |
| | Number of genes in the input | 510 | 724 | 389 | 536 | 773 | 441 |
| | Seed genes | 136 | 180 | 123 | 142 | 193 | 139 |
| | Candidate genes in the PPIN | 374 | 544 | 266 | 394 | 580 | 302 |
| Results | Number of SNVs prioritized | 87 | 80 | 93 | 110 | 117 | 71 |
| | Number of genes prioritized | 40 | 41 | 25 | 54 | 65 | 40 |
| | Prioritized candidate genes | 21 | 14 | 6 | 36 | 33 | 26 |
| Top candidate gene prioritized | Gene name | | | | | | |
| | SNV | | | | | | |
| | Alleles | C, T | A, G | G, T | G, A | G, A | T, C |
| | P | | | | | | |
| | Delta score | 3.7 | −3.8 | −5.3 | 4.4 | 4.0 | 4.0 |
Refers to associations reported in GWAS catalogue under the genome-wide significance threshold.
This section shows the most significant novel gene not reported in GWAS catalogue. It lists the SNV with the lowest P-value within each gene (P) and its delta score. The complete list of prioritized genes and SNVs is provided in Supplementary Files S2 and S3.
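The phenotype pre-processing rows of the table can be read as a simple thresholding rule. The sketch below covers HDL-C only, reading the table's entries as controls above 60 mg/dl and cases below 40 mg/dl. The function name `label_hdl` and the choice to leave subjects between the two cut-offs unlabeled are assumptions made for illustration.

```python
def label_hdl(hdl_mg_dl, control_above=60.0, case_below=40.0):
    """Label a subject from an HDL-C measurement in mg/dl.

    Returns "control", "case", or None when the value falls between
    the two cut-offs (assumed here to be left out of the analysis).
    """
    if hdl_mg_dl > control_above:
        return "control"
    if hdl_mg_dl < case_below:
        return "case"
    return None

# Hypothetical measurements, one per subject.
measurements = [72.0, 35.5, 50.0, 61.2, 39.9]
labels = [label_hdl(value) for value in measurements]
n_controls = labels.count("control")
n_cases = labels.count("case")
```

Under this reading, the number of subjects in the input equals the number of controls plus the number of cases, which matches the counts reported in the table.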
Fig. 2. Defining a subset of SNVs to analyze with cNMTF. (1) Reported SNV-trait associations are queried from GWAS catalogue and (2) mapped to genes in a PPIN. (3) The list of genes is expanded using their interactions in the first neighbourhood of the PPIN. (4) All variants located in the expanded list of genes are selected to form the subset of SNVs, and are later included in the SNV–SNV network.
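The four steps in Fig. 2 can be sketched with plain Python sets. The gene names, SNV identifiers, edge list and function names below are hypothetical placeholders; a real analysis would load the associations from GWAS catalogue and the interactions from a PPIN resource.

```python
def expand_first_neighbours(seed_genes, ppin_edges):
    """Keep seed genes found in the PPIN and add their direct partners."""
    in_network = {gene for edge in ppin_edges for gene in edge}
    seeds = set(seed_genes) & in_network  # step 2: map seeds to the PPIN
    expanded = set(seeds)
    for a, b in ppin_edges:               # step 3: first-neighbourhood expansion
        if a in seeds:
            expanded.add(b)
        if b in seeds:
            expanded.add(a)
    return expanded

def select_snvs(snv_to_gene, genes):
    """Step 4: keep the variants located in the expanded gene list."""
    return sorted(snv for snv, gene in snv_to_gene.items() if gene in genes)

# Step 1 stand-in: genes with reported associations (hypothetical names).
gwas_genes = ["GENE_A", "GENE_X"]  # GENE_X is absent from the toy network
ppin_edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_D", "GENE_A")]
snv_to_gene = {"snv1": "GENE_A", "snv2": "GENE_B", "snv3": "GENE_C", "snv4": "GENE_D"}

expanded_genes = expand_first_neighbours(gwas_genes, ppin_edges)
snv_subset = select_snvs(snv_to_gene, expanded_genes)
```

In this toy network `GENE_C` is two steps away from the only mapped seed, so it and its variant are left out of the subset.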
Fig. 3. Enhancing gene discovery with cNMTF. Prioritized genes across trait-cohorts are totalled and intersected with the results of LRM. Only genes benchmarked against GWAS catalogue are counted. In Supplementary Figure S15, we present Venn diagrams for specific trait-cohorts.
Fig. 4. Benchmarking genes prioritized by cNMTF. Percentage of genes with known functional implications in lipid metabolism. The search for evidence includes GWAS catalogue (the strongest benchmark for associations), KEGG, Reactome and GO, and ends with PubMed.
Fig. 5. Prioritized PPI in LDL-C. This PPIN shows only interactions between genes prioritized by cNMTF. A novel finding refers to a gene either not reported or not significant in GWAS catalogue.