Matthew Norris, Simon Lovell, Daniela Delneri.
Abstract
Variation in gene copy number can significantly affect organism fitness. When one allele is missing in a diploid, the phenotype can be compromised because of haploinsufficiency. In this work, we identified associations between Saccharomyces cerevisiae gene properties and genome-scale haploinsufficiency phenotypes from previous work. We compared the haploinsufficiency profiles against 23 gene properties and found that genes with a higher level of connectivity (degree) in a protein-protein interaction network, higher genetic interaction degree, greater gene sequence conservation, and higher protein expression were significantly more likely to be haploinsufficient. Additionally, haploinsufficiency showed negative relationships with cell cycle regulation and promoter sequence conservation.
Keywords: correlation; genome; haploinsufficiency; machine-learning; prediction
Year: 2013 PMID: 24048642 PMCID: PMC3815059 DOI: 10.1534/g3.113.008144
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
List of the 23 gene properties that were considered for LDA model building
| Gene Property | Description |
|---|---|
| Protein–protein interaction degree, ≥1× reported | Generated by calculating degree and betweenness for each gene according to physical interactions in the BioGRID ( |
| Protein–protein interaction degree, ≥2× reported | |
| Protein–protein interaction degree, ≥3× reported | |
| Protein–protein interaction betweenness, ≥1× reported | |
| Protein–protein interaction betweenness, ≥2× reported | |
| Protein–protein interaction betweenness, ≥3× reported | |
| Genetic interaction degree, lenient cut-off | Generated from genetic interaction data in the DRYGIN ( |
| Genetic interaction degree, intermediate cut-off | |
| Genetic interaction degree, stringent cut-off | |
| Genetic interaction betweenness, intermediate cut-off | |
| Genetic interaction betweenness, stringent cut-off | |
| ORF protein sequence identity Sc ↔ Sp | Calculated by examining protein sequence conservation between |
| ORF DNA sequence identity Sc ↔ Sp | Gene DNA sequence identity between |
| ORF DNA dN/dS Sc ↔ Sp | dN/dS calculated by comparing ORF sequence between |
| ORF protein sequence identity Sc ↔ Sk | Calculated by examining protein sequence conservation between |
| ORF DNA sequence identity Sc ↔ Sk | Gene DNA sequence identity between |
| ORF DNA dN/dS Sc ↔ Sk | dN/dS calculated by comparing ORF sequence between |
| ORF protein sequence identity Sc ↔ Sb | Calculated by examining protein sequence conservation between |
| ORF DNA sequence identity Sc ↔ Sb | Gene DNA sequence identity between |
| ORF DNA dN/dS Sc ↔ Sb | dN/dS calculated by comparing ORF sequence between |
| Promoter DNA sequence identity Sc ↔ Sb | Calculated by comparing the noncoding region upstream of the ORF between |
| Cell-cycle mRNA expression variation | mRNA expression variation scores were obtained from a previous study ( |
| Proteomics summed intensity | This value represents the level of protein expression as the combined sum of haploid and diploid protein abundance from a previous study ( |
The first column gives the gene property name, and the second column describes the source of the gene property data.
Figure 1. Distributions of gene property values with respect to HI phenotypes in six different media. To find relationships between HI and gene properties, the gene property data were divided into HI and non-HI groups. The significance of the difference between HI and non-HI property values was then estimated with the Mann-Whitney U-test (significant differences are indicated with white panels and p-values). The X axis indicates fitness loss values relative to wild-type (WT) in six nutrient environments, whereas the Y axis describes gene property values. The raw HI fitness values are shown as a scatter plot, with each dot representing an individual gene. The overlaid box plots represent non-HI (blue) and HI (red) gene property distributions. The box represents the upper and lower quartiles, and the central line represents the median. Whiskers represent the lowest point within 1.5 interquartile range (IQR) of the lower quartile and the highest point within 1.5 IQR of the upper quartile.
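The HI versus non-HI comparison described in this caption can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes each gene property is a plain list of numbers already split into HI and non-HI groups, and it uses the normal approximation to the Mann-Whitney U statistic with a tie correction and no continuity correction. The function name `mann_whitney_u` and the toy values are hypothetical.

```python
import math

def mann_whitney_u(hi_values, non_hi_values):
    """Two-sided Mann-Whitney U-test (normal approximation, tie-corrected).

    Returns (U for the HI group, approximate two-sided p-value).
    """
    n1, n2 = len(hi_values), len(non_hi_values)
    n = n1 + n2
    # Pool both groups, remembering group membership (0 = HI, 1 = non-HI).
    pooled = sorted([(v, 0) for v in hi_values] + [(v, 1) for v in non_hi_values])

    rank_sum_hi = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average rank.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        for k in range(i, j):
            if pooled[k][1] == 0:
                rank_sum_hi += midrank
        i = j

    u_hi = rank_sum_hi - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_hi, 1.0
    z = (u_hi - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_hi, p_value

# Toy example: a property whose values are higher in every HI gene.
u, p = mann_whitney_u([5, 6, 7, 8], [1, 2, 3, 4])
```

For small groups an exact test is preferable to the normal approximation; the sketch is meant only to show what is being compared.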
Figure 2. Relationships between HI and non-HI gene properties in rich medium. (A) The p-values testing the difference between HI and non-HI gene property value distributions, shown on a log10 scale and estimated by the Mann-Whitney U-test. The vertical line shows a p-value of 0.05. (B) Mean z-scores of HI (red) and non-HI (blue) gene properties. Error bars represent the SEM. (C) The receiver-operating characteristic (ROC) area under curve (AUC) distributions. These were generated using cross-validation (see Materials and Methods). Whiskers represent the lowest point within 1.5 interquartile range (IQR) of the lower quartile and the highest point within 1.5 IQR of the upper quartile. Dots represent outliers of the aforementioned ranges. The vertical line in the center of the chart represents the random expectation for the ROC plot.
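Panel B reports mean z-scores with SEM error bars for each group. One plausible way to obtain such values is sketched below. It assumes that z-scores are computed across all genes for a given property before the genes are split into HI and non-HI groups; the caption does not state this, so treat it as an assumption. The names `group_z_summary` and `mean_and_sem` are hypothetical.

```python
import math
import statistics

def mean_and_sem(xs):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    return statistics.mean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

def group_z_summary(values, is_hi):
    """Standardize one gene property, then summarize HI and non-HI genes."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    z = [(v - mu) / sd for v in values]
    hi = [s for s, flag in zip(z, is_hi) if flag]
    non_hi = [s for s, flag in zip(z, is_hi) if not flag]
    return {"HI": mean_and_sem(hi), "non-HI": mean_and_sem(non_hi)}

# Toy example: six genes, the three with the largest values are HI.
summary = group_z_summary(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [False, False, False, True, True, True],
)
```

Standardizing first puts properties measured on very different scales (for example interaction degree and percent identity) on a common axis.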
Figure 3. Relationships between five gene properties and HI, stratified according to gene essentiality. The p-values and z-scores represent the differences between the distributions of HI and non-HI gene property values within three gene sets: nonessential and essential genes combined, essential genes only, and nonessential genes only. Error bars represent the SEM. Gene properties include (A) PPI network degree, (B) ORF sequence identity between S. cerevisiae and S. kudriavzevii, (C) promoter sequence identity between S. cerevisiae and S. bayanus, (D) mRNA expression variation through the cell cycle, and (E) protein expression level.
List of the 6 gene properties used in the 6GP model showing proportion of gene property data missing across the yeast genome
| Gene Property | Proportion of Genes with No Data (%) |
|---|---|
| PPI network degree | 2.02 |
| GI network degree | 34.31 |
| % ORF sequence identity | 10.48 |
| % Promoter sequence identity | 5.00 |
| Cell-cycle mRNA expression variation | 0.50 |
| Protein expression magnitude | 17.65 |
PPI, protein–protein interaction; GI, genetic interaction.
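Every property in this table lacks data for some genes, and the Figure 4 caption mentions median imputation as the way such gaps were handled. A minimal sketch of that idea follows, assuming missing values are represented as `None` in a per-property list; the function names are hypothetical and the numbers are toy values, not data from the study.

```python
import statistics

def missing_percent(column):
    """Percentage of genes with no value for this property."""
    return 100.0 * sum(v is None for v in column) / len(column)

def impute_median(column):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]

# Toy property column for four genes, one of which has no data.
degree = [1.0, None, 3.0, 10.0]
filled = impute_median(degree)
```

Median imputation keeps every gene in the model at the cost of pulling imputed genes toward the center of the distribution, which matters more for sparsely covered properties.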
Figure 4. False-positive rate (FPR) ≤ 0.1 area under curve (AUC) distributions across all combinations of gene properties, using median imputation. Model performance tends to increase as more gene properties are added. Our candidate six gene property (6GP) model is highlighted with an arrow. The three-letter codes identify gene properties and are described in the legend. Distributions are for 100 receiver-operating characteristic (ROC) curves generated during cross-validation (see Materials and Methods). Whiskers represent the lowest point within 1.5 interquartile range (IQR) of the lower quartile and the highest point within 1.5 IQR of the upper quartile. Dots represent outliers of the aforementioned ranges. The black horizontal line represents the random expectation from the ROC plot.
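The metric in this figure is the area under the ROC curve restricted to FPR ≤ 0.1. The sketch below shows one way to compute such a partial area with the trapezoid rule. It assumes binary labels (1 = HI, 0 = non-HI) and a model score per gene, and it returns the raw, unnormalized area; whether the authors rescaled the area is not stated in the caption. Under that assumption a random classifier scores 0.1 × 0.1 / 2 = 0.005. The function names are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), starting at (0, 0), as the threshold is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Genes with identical scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def partial_auc(points, max_fpr=0.1):
    """Trapezoid area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
perfect_area = partial_auc(perfect)
random_area = partial_auc([(0.0, 0.0), (1.0, 1.0)])
```

Restricting the area to low FPR focuses the comparison on the region where a prediction list would be short enough to test experimentally.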
Figure 5. Performance of the six gene property (6GP) candidate model. Receiver-operating characteristic (ROC) curve of the best model (6GP), which combines the six gene properties described in the text. The dark line shows the average of 100 ROC curves, with error bars indicating 1 SD. Gray lines represent the 100 ROC curves produced during cross-validation, superimposed.
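An averaged ROC curve with SD error bars, as in this figure, is commonly built by "vertical averaging": each cross-validation curve is interpolated at a shared grid of FPR values, and the mean and SD of the TPR are taken at each grid point. The caption does not say which averaging scheme was used, so the following is a sketch under that assumption, with hypothetical function names and two toy curves instead of 100.

```python
import statistics

def tpr_at(points, fpr):
    """TPR of one ROC curve at a given FPR (linear interpolation).

    At a vertical step the highest TPR reached at that FPR is used.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:
                y = y1
            else:
                y = y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, y)
    return best

def average_roc(curves, grid):
    """Mean and sample SD of TPR across curves at each FPR in the grid."""
    result = []
    for fpr in grid:
        tprs = [tpr_at(curve, fpr) for curve in curves]
        result.append((fpr, statistics.mean(tprs), statistics.stdev(tprs)))
    return result

# Two toy curves: a random classifier and a perfect classifier.
curves = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
]
averaged = average_roc(curves, [0.0, 0.5, 1.0])
```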
Summary of phenotypes for the 23 candidate genes tested
Values are AUGC Mutant/AUGC WT (p-value).
| Gene | Rich Media | F1 Nitrogen-Limited | F1 Carbon-Limited |
|---|---|---|---|
| | 1.011 (0.315) | 0.987 (0.462) | 0.972 (0.544) |
| | 1.017 (0.627) | 1.008 (0.445) | 0.972 (0.242) |
| | 1.007 (0.590) | 0.999 (0.962) | 1.002 (0.996) |
| | 0.999 (0.952) | 0.995 (0.445) | 1.007 (0.976) |
| | 0.967 (5.33 × 10⁻²) | 0.996 (0.682) | 1.015 (0.841) |
| | 1.009 (0.698) | 1.008 (0.431) | 1.057 (6.82 × 10⁻²) |
| | 1.003 (0.899) | 1.009 (0.642) | 1.003 (0.976) |
| | 1.004 (0.794) | 1.019 (0.104) | 1.027 (0.544) |
| | 0.908 (8.31 × 10⁻⁴) | 0.965 (1.35 × 10⁻⁴) | 1.000 (0.996) |
| | 0.903 (3.87 × 10⁻⁵) | 0.925 (5.39 × 10⁻⁵) | 0.874 (3.53 × 10⁻⁴) |
| | 0.947 (1.48 × 10⁻²) | 0.956 (1.00 × 10⁻³) | 0.963 (0.107) |
| | 0.998 (0.899) | 0.991 (0.404) | 0.959 (0.159) |
| | 0.996 (0.821) | 0.973 (0.158) | 0.978 (0.611) |
| | 0.998 (0.922) | 0.993 (0.445) | 0.998 (0.976) |
| | 0.966 (3.47 × 10⁻²) | 0.982 (0.445) | 0.982 (0.752) |
| | 0.990 (0.627) | 1.003 (0.720) | 1.016 (0.840) |
| | 0.994 (0.698) | 0.948 (3.33 × 10⁻⁴) | 0.990 (0.824) |
| | 0.994 (0.718) | 1.001 (0.992) | 1.000 (0.996) |
| | 1.035 (5.10 × 10⁻²) | 1.003 (0.791) | 0.995 (0.958) |
| | 0.994 (0.846) | 0.970 (3.47 × 10⁻²) | 0.991 (0.841) |
| | 1.036 (0.118) | 0.986 (0.158) | 0.989 (0.824) |
| | 0.942 (1.48 × 10⁻²) | 0.986 (0.104) | 0.990 (0.841) |
| | 0.961 (1.48 × 10⁻²) | 0.994 (0.431) | 1.002 (0.996) |
The first column shows the gene names and the remaining columns describe phenotypes in rich medium (YPD), F1 medium with nitrogen limitation, and F1 medium with carbon limitation. Phenotypes are described according to the average area under growth curve (AUGC) relative to the average wild-type (WT) AUGC. The number in brackets is a p-value representing the significance of the difference between mutant and WT AUGCs, calculated as described in Materials and Methods.
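The ratio reported in the table can be illustrated with a short calculation. The sketch below assumes growth curves are optical-density readings at shared time points, that the area under each curve is taken with the trapezoid rule, and that replicate areas are averaged before the mutant/WT ratio is formed. These details, the function names, and the toy readings are assumptions for illustration; the p-value calculation from Materials and Methods is not reproduced here.

```python
import statistics

def area_under_growth_curve(times, readings):
    """Trapezoid-rule area under one growth curve."""
    pairs = list(zip(times, readings))
    return sum(
        (t1 - t0) * (y0 + y1) / 2.0
        for (t0, y0), (t1, y1) in zip(pairs, pairs[1:])
    )

def relative_augc(times, mutant_curves, wt_curves):
    """Average mutant AUGC divided by average wild-type AUGC."""
    mutant = statistics.mean(area_under_growth_curve(times, c) for c in mutant_curves)
    wt = statistics.mean(area_under_growth_curve(times, c) for c in wt_curves)
    return mutant / wt

# Toy data: one replicate each, the mutant growing to 90% of WT density.
times = [0.0, 1.0, 2.0]
ratio = relative_augc(times, [[0.0, 0.9, 1.8]], [[0.0, 1.0, 2.0]])
```

A ratio below 1 indicates that the heterozygous mutant accumulated less biomass over the time course than the wild type, which is the direction expected for a haploinsufficient gene.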
Significantly HI phenotypes, i.e., those with p-value < 0.05.
Figure 6. Relationships between HI and non-HI gene properties for Sz. pombe in rich medium. (A) The p-values testing the difference between HI and non-HI gene property value distributions, shown on a log10 scale and estimated by the Mann-Whitney U-test. The vertical line shows a p-value of 0.05. (B) Mean z-scores of HI (red) and non-HI (blue) gene properties. Error bars represent the SEM. (C) The receiver-operating characteristic (ROC) area under curve (AUC) distributions. These were generated using cross-validation (see Materials and Methods). Whiskers represent the lowest point within 1.5 interquartile range (IQR) of the lower quartile and the highest point within 1.5 IQR of the upper quartile. Dots represent outliers of the aforementioned ranges. The vertical line in the center of the chart represents the random expectation for the ROC plot.
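The whisker and outlier convention used in these box plots (points within 1.5 IQR of the quartiles, with anything beyond drawn as a dot) can be made concrete as follows. This sketch uses the "inclusive" quartile method of Python's `statistics.quantiles`; plotting packages differ slightly in how they compute quartiles, so the exact values in the published figures may have been obtained differently. The function name and data are hypothetical.

```python
import statistics

def tukey_whiskers(values):
    """Return (lower whisker, upper whisker, outliers) for a box plot."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return min(inside), max(inside), outliers

low, high, outliers = tukey_whiskers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```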