Mark Yandell, Barry Moore, Fidel Salas, Chris Mungall, Andrew MacBride, Charles White, Martin G Reese.
Abstract
The millions of mutations and polymorphisms that occur in human populations are potential predictors of disease, of our reactions to drugs, of predisposition to microbial infections, and of age-related conditions such as impaired brain and cardiovascular functions. However, predicting the phenotypic consequences and eventual clinical significance of a sequence variant is not an easy task. Computational approaches have found perturbation of conserved amino acids to be a useful criterion for identifying variants likely to have phenotypic consequences. To our knowledge, however, no study to date has explored the potential of variants that occur at homologous positions within paralogous human proteins as a means of identifying polymorphisms with likely phenotypic consequences. In order to investigate the potential of this approach, we have assembled a unique collection of known disease-causing variants from OMIM and the Human Genome Mutation Database (HGMD) and used them to identify and characterize pairs of sequence variants that occur at homologous positions within paralogous human proteins. Our analyses demonstrate that the locations of variants are correlated in paralogous proteins. Moreover, if one member of a variant-pair is disease-causing, its partner is likely to be disease-causing as well. Thus, information about variant-pairs can be used to identify potentially disease-causing variants, extend existing procedures for polymorphism prioritization, and provide a suite of candidates for further diagnostic and therapeutic purposes.
Year: 2008 PMID: 18989397 PMCID: PMC2565504 DOI: 10.1371/journal.pcbi.1000218
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Using sequence homology to identify variant pairs.
The protein encoded by a candidate disease gene (the subject in the alignment) is aligned to a paralogous protein encoded by a locus with known disease-causing alleles (the query in the above alignment). Shown in red is a paralogous variant pair. Variants in the candidate that occur in the same positions in the alignment as a known disease-causing variant in the other protein are prioritized for use in subsequent association studies.
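The pairing idea in this caption can be illustrated with a minimal sketch. It assumes the two paralogous proteins have already been aligned and are given as gapped strings of equal length, and that variants are given as 1-based residue positions. The function names (`column_of_residue`, `find_variant_pairs`) and the toy sequences are hypothetical and are not taken from the paper.

```python
def column_of_residue(aligned_seq, gap="-"):
    """Map each 1-based residue position to its 0-based alignment column."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_seq):
        if char != gap:
            position += 1
            mapping[position] = column
    return mapping


def find_variant_pairs(aligned_query, aligned_subject,
                       query_variants, subject_variants):
    """Return (query_pos, subject_pos) pairs that share an alignment column.

    Assumption: both aligned strings come from the same pairwise alignment
    and therefore have the same length.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    query_cols = column_of_residue(aligned_query)
    subject_cols = column_of_residue(aligned_subject)
    # Index subject variants by the alignment column they fall in.
    subject_by_col = {subject_cols[p]: p
                      for p in subject_variants if p in subject_cols}
    pairs = []
    for pos in sorted(query_variants):
        col = query_cols.get(pos)
        if col in subject_by_col:
            pairs.append((pos, subject_by_col[col]))
    return pairs


# Toy example: query has known disease variants at residues 2 and 5,
# the candidate (subject) has variants at residues 2, 4 and 6.
pairs = find_variant_pairs("MKT-LLV", "MRTALL-", {2, 5}, {2, 4, 6})
print(pairs)
```

In this toy case, subject residue 4 falls opposite a gap in the query and so cannot form a pair; only variants occupying the same alignment column are reported.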
ODDs scores associated with different types of variant pairs.
| Dataset | Genes | % Similarity | Syn. | Non-syn. | Non-con. | Con. | Frame-shift |
| Reciprocal Best-hits | 7,368 | 74.5 | 33.7 | 31.7 | 95.3 | 64.3 | 200.4 |
| Best hits | 17,111 | 69.1 | 32.2 | 31.3 | 89.3 | 62.1 | 218.2 |
Genes: number of genes in the dataset. % Similarity: average value for the dataset's aligned proteins. Syn.: synonymous variants. Non-syn.: non-synonymous variants (pooled variants from the other classes of variant, including nonsense variants). Non-con.: non-conservative substitutions. Con.: conservative substitutions. Frame-shift: frameshift-inducing indels. Values in the table are ODDs scores (observed number of variant pairs/expected number of variant pairs).
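The observed/expected ratio defined in this note can be sketched as follows. The page does not state how the expected number of variant pairs was derived, so the null model here is an assumption made purely for illustration: variants are placed independently and uniformly across the aligned columns, giving an expected count of n_a × n_b / L. The function names and the example numbers are hypothetical.

```python
def expected_pairs_uniform(n_variants_a, n_variants_b, aligned_columns):
    """Expected number of shared columns under an ASSUMED uniform null model.

    Each variant in protein A coincides with a given variant in protein B
    with probability 1 / aligned_columns, so the expectation is
    n_a * n_b / aligned_columns. This is not necessarily the paper's model.
    """
    if aligned_columns <= 0:
        raise ValueError("aligned_columns must be positive")
    return n_variants_a * n_variants_b / aligned_columns


def odds_score(observed_pairs, expected_pairs):
    """Observed number of variant pairs divided by the expected number."""
    if expected_pairs <= 0:
        raise ValueError("expected_pairs must be positive")
    return observed_pairs / expected_pairs


# Hypothetical numbers: 10 and 20 variants over 400 aligned columns,
# with 3 variant pairs actually observed.
expected = expected_pairs_uniform(10, 20, 400)
print(odds_score(3, expected))
```

A score above 1 means variant pairs occur more often than the chosen null model predicts; the magnitude depends on how the expectation is defined.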
ODDs ratios for disease-gene variant pairs.
| DATABASE | MIS-SENSE | SYNONYMOUS |
| dbSNP vs. dbSNP | 9.5 | 6.1 |
| all disease vs. all disease | 8.8 | N/A |
| all disease vs. dbSNP | 2.2 | N/A |
Column 1 lists the database of origin for each member of the variant pair. “all disease” means known disease-causing variants from OMIM and HGMD. Columns 2 and 3 give the odds ratios (observed/expected) for screening every gene from the Omicia disease gene set for paired variants using pooled non-conservative and conservative substitutions (here termed ‘MIS-SENSE’) and synonymous variants from the respective databases. P≪1e−4 for all values.
Figure 2. Classification system for variant pairs.
Selected Class 1 SNP pairs.
| Gene A | ID of SNP in Gene A | Disease assoc. with Gene A | Gene B | ID of SNP in Gene B | Disease assoc. with Gene B |
| FGFR2 | HGMD:CX972741 | Pfeiffer syndrome | FGFR3 | HGMD:CM950470 | Thanatophoric dysplasia |
| JAG1 | HGMD:CD993777 | Alagille syndrome | FBN1 | HGMD:CM972811 | Marfan syndrome |
| ATP7A | HGMD:CM940140 | Menkes syndrome | ATP7B | HGMD:CM970138 | Wilson disease |
| ABCA1 | HGMD:CM993803 | Tangier disease | ABCA4 | HGMD:CM990025 | Stargardt disease |
| CFTR | HGMD:CM940275 | Cystic fibrosis | ABCC8 | HGMD:CM981883 | Hyperinsulinism |
Columns 1 & 4 give the gene symbols for two paralogous disease-causing genes. Columns 2 & 5 give the IDs of the two variants that comprise the Class 1 pair. Columns 3 & 6 list the diseases most commonly associated with the two paralogous variants.