Subtle metabolic changes precede and accompany chronic vascular complications, which are the primary causes of premature death in diabetes. To obtain a multimetabolite characterization of these high-risk individuals, we measured proton nuclear magnetic resonance (1H NMR) data from the serum of 613 patients with type I diabetes and a diverse spread of complications. We developed a new metabonomics framework to visualize and interpret the data and to link the metabolic profiles to the underlying diagnostic and biochemical variables. Our results indicate complex interactions between diabetic kidney disease, insulin resistance and the metabolic syndrome. We illustrate how a single 1H NMR protocol is able to identify the polydiagnostic metabolite manifold of type I diabetes and how its alterations translate to clinical phenotypes, clustering of micro- and macrovascular complications, and mortality during several years of follow-up. This work demonstrates the diffuse nature of complex vascular diseases and the limitations of single diagnostic biomarkers. However, it also promises cost-effective solutions through high-throughput analytics and advanced computational methods, as applied here in a case that is representative of the real clinical situation.
Subtle metabolic changes precede and accompany chronic vascular complications, which are the primary causes of premature death in diabetes. To obtain a multimetabolite characterization of these high-risk individuals, we measured proton nuclear magnetic resonance (1H NMR) data from the serum of 613 patients with type I diabetes and a diverse spread of complications. We developed a new metabonomics framework to visualize and interpret the data and to link the metabolic profiles to the underlying diagnostic and biochemical variables. Our results indicate complex interactions between diabetic kidney disease, insulin resistance and the metabolic syndrome. We illustrate how a single 1H NMR protocol is able to identify the polydiagnostic metabolite manifold of type I diabetes and how its alterations translate to clinical phenotypes, clustering of micro- and macrovascular complications, and mortality during several years of follow-up. This work demonstrates the diffuse nature of complex vascular diseases and the limitations of single diagnostic biomarkers. However, it also promises cost-effective solutions through high-throughput analytics and advanced computational methods, as applied here in a case that is representative of the real clinical situation.
Type I diabetes is caused by an autoimmune reaction against the insulin-producing pancreatic β-cells and subsequent disturbance of normal blood glucose metabolism. Insulin replacement therapy cures the acute symptoms, but is not able to match the natural response to rising or falling glucose levels. This persistent metabolic imbalance is linked to high incidence of vascular complications such as diabetic kidney disease (DKD) (Finne ), diabetic retinal disease (DRD) (Roy ) and macrovascular diseases (MVDs) (Libby ), all of which are co-occurring in vulnerable patients (Groop ; Thorn ; Ala-Korpela, 2007). The diagnosis, risk assessment and treatment of these conditions are currently determined by a number of biochemical and clinical variables, although none of these are conclusive on its own (Soedamah-Muthu ; Stadler ). Furthermore, the simultaneous clustering of complications and metabolic risk factors has not been studied by high-throughput analytical techniques that could reveal the multidimensional metabolic state of an individual more effectively.The standard differential diagnostics in medicine may not be sufficient in detecting complex perturbations of biological systems (Zenker ). Conditions such as insulin resistance and atherosclerosis stem from nonlinear interactive pathways between the genes (Hakonarson ), gene expression (Sieberts and Schadt, 2007), metabolic environment (Goodacre, 2007) and the symbiotic microflora (Martin ). To pinpoint the nodes and their roles in the disease networks requires a large number of samples with multidimensional quantitative data—a direct consequence of the curse of dimensionality. The genome-wide association studies have shown that this can be achieved at the DNA level (Frayling, 2007; Wellcome Trust Case Control Consortium, 2007). However, for personalized risk assessment and treatment the genetic approach is limited, as it does not take into account the dynamic environment, unlike the metabonomics approach (Nicholson and Wilson, 2003; Clayton ), which has gained popularity as analytical technologies are evolving (Nicholson, 2006; Ala-Korpela, 2007; Salek ).Diabetic complications pose a difficult challenge to public health care, as populations grow older and life style becomes more sedentary and energy-rich (Reunanen ). For this reason, we are aiming at new screening methods and metabolic characterization tools to find the vulnerable patients at an early stage when preventive treatment is still effective (Tenenbaum ; Gross ). Mass spectrometry and proton (1H) nuclear magnetic resonance (NMR) spectroscopy are the two key experimental methods in the area of ‘global biochemistry' (Fernie ). 1H NMR, in particular, is advantageous for screening, as it can efficiently extract detailed molecular information on a large number of metabolites in various biofluids (Tang ; Beckonert ; Ala-Korpela, 2008). The earliest experiments with plasma have already demonstrated this in type II diabetes (Nicholson ). Recently, we have shown that DKD can be detected by 1H NMR of serum (Mäkinen ) and that the metabolic syndrome (MetS) can be distinguished by multivariate methods and 1H NMR spectroscopy (Suna ). A similar approach has been applied to cardiovascular disease, but with limited success (Brindle ; Kirschenlohr ). The metabolic changes in type II diabetes have also been studied by chromatographic methods (Yang ; Wang ). Animal models have provided encouraging results and further justification for the metabonomic NMR approach (Williams ; Clayton ; Salek ), but more experience from human populations is needed (Griffin and Nicholls, 2006; Ala-Korpela, 2007).In this study, the emphasis is on the metabolic continuum that underlies the slow and often elusive development of chronic complications. We focus first on DKD due to its high significance in the treatment and prognosis of diabeticpatients (Gross ). Our main goal, however, is to extract a metabolite manifold that highlights not only DKD, but also other important clinical and biochemical characteristics and their complex relationships (Fernie ). The combination of 1H NMR of serum and metabonomic mapping provides the necessary insight: neural network analysis and statistically verified visualizations of both the spectroscopic and clinical data will not only help decision making in clinical environment, but will also increase the knowledge of multifactorial disease states that are difficult to pinpoint by reductionist approaches (Sams-Dodd, 2005; Weckwerth and Morgenthal, 2005; Loscalzo ). The new source of information can then be used in personalized risk assessment as a cost-effective high-throughput alternative to a collection of specific biomarker assays (Lindon ; Ala-Korpela, 2008).
Results
Molecular windows to metabolism
We obtained serum samples from the FinnDiane study to measure two molecular windows for 613 patients with type I diabetes. A typical 1H NMR spectrum of human serum is characterized by broad resonances from the lipid molecules of lipoprotein particles, such as the −CH3 group of triglycerides, cholesterol compounds and phospholipids (Figure 1C and D). This so-called lipoprotein lipids (LIPO) window is a complex mixture of the aforementioned lipid signals, serum albumin and albumin-bound fatty acid resonances across the aliphatic region, and the less intense signals from smaller molecules such as creatinine, lactate and glucose (Ala-Korpela, 1995, 2008).
Figure 1
1H NMR spectral profile of diabetic kidney disease. (A) The SOM of 613 × 2 1H NMR spectra of serum, colored according to the percentage estimate of DKD within a given map region. Each hexagonal map unit defines a specific model spectrum and a corresponding subset of patients, the spectra of which best match the aforementioned model. (B) The low molecular weight metabolites (LMWM) model spectrum and (C) the lipoprotein lipid and albumin (LIPO) model spectrum for a patient subset within the map unit with the lowest percentage of DKD. The colored curve segments indicate the current model, whereas the solid black curve indicates the mean spectrum over all data, thus serving as a constant reference. The colored areas below the model spectra represent the proportional differences of the unit-specific model and the mean model. (D) The LIPO model and (E) the LMWM model spectrum for patients within a map unit of the highest DKD percentage. An interactive presentation of the model spectra is available in Supplementary data 3.
To reveal the resonances from smaller molecules, the spectrometer settings can be altered to suppress most of the broad resonances while still enabling the detection of the more mobile low molecular weight molecules (LMWM). The LMWM window is dominated by the numerous glucose resonances between 3.1 and 3.9 p.p.m., although some lipid signals still remain (Figure 1B and E). The spectral shapes from both windows share a common axis of chemical shift and are superimposable, except for the intensity scaling constant. For example, lactate creates a strong doublet signal around 1.28 p.p.m. in the LMWM window, but only small shapes on top of the wider lipid and albumin resonances in the LIPO window. On the other hand, most of the molecules with the −CH3 group contribute to the prominent signal around 0.8 p.p.m. in the LIPO window, but only the more mobile species can be detected in the LMWM window.
Spectral profile of type I diabetes
A self-organizing map (SOM) (Kohonen, 2000) was constructed from the 1H NMR data (Figure 1A). The SOM was the result of reducing the 613 experimental spectra into 9 × 9=81 representative spectral models, each of which was assigned to a unique hexagonal unit on the map grid. Subsequently, a best-matching model was determined for each experiment, thus each patient had a best-matching unit or a ‘place of residence' on the map. Localized similarity is the fundamental idea behind the SOM, that is, neighboring units or the patients therein are more similar to each other than those from the opposite sides of the map. In this case, similarity was defined by the arithmetic multidimensional difference between the two spectral models; thus, any two neighbors shared more metabolic characteristics (their spectra looked the same) than two randomly picked patients, on average.As the SOM is analogous to a geographic map in all but the way the patients' coordinates are assigned, it is possible to use ordinary demographic methods to visualize the properties of patients in different metabolic neighborhoods (Supplementary data 1). Here, we started by coloring the units based on the percentage of DKD patients within a local population (Figure 1A). The highest value of 70% can be seen on the western edge of the map, on the unit at row 5 and column 1 (5,1). The unit at (2,9) near the northeast corner has the lowest percentage of DKD (16%), and is located far from (5,1). In fact, when looking at the overall coloring, the DKD patients are clustered on the western side, whereas the patients with fewer complications are concentrated on the northeast corner of the map.The typical spectral profile of DKD was examined by comparing the spectral models at (5,1) and (2,9) to see if any of the metabolite resonances differed. Each of the spectral plots, such as Figure 1C, consists of three components. First, the mean model over all data is depicted as a solid black line to serve as a constant reference to which the spectral models can be compared. The second curve alongside the reference is the unit-specific spectral model, which was split into orange or blue segments depending on where the model exceeded or was less intense than the reference. As the first two curves are close to each other in terms of absolute intensity, a third curve that depicts the proportional differences is helpful in revealing any significant changes in intensity. In Figure 1C for instance, the third curve was drawn below the two absolute intensity curves and painted similarly to the unit-specific model. The lipoprotein lipid resonances at around 0.81, 1.23 and 1.95 p.p.m. show reduced values from the mean, whereas the albumin background is increased. In Figure 1E, the two creatinine peaks at 2.98 and 3.99 p.p.m. are higher for the DKD region (5,1). However, looking at just two map locations is not enough for accurate interpretation; a more global perspective is required.
Diabetic kidney disease, the metabolic syndrome and mortality
A majority of the patients in this study had either micro- (22%) or macroalbuminuria (37%), the spatial distributions of which are revealed by class-specific colorings in Figure 2A and B. The macroalbuminuric group (clinically diagnosed with DKD) is concentrated on the western side of the map (P=1.2 × 10−8), whereas the diagnostically intermediate microalbuminuric group does not form any statistically significant pattern (P=0.077). While the DKD status was determined by urine albumin excretion, the map was constructed solely based on the 1H NMR spectra of serum, thus illustrating the systemic biological connection between the two biofluids.
Figure 2
Statistical colorings of albuminuria, the MetS, MVD and mortality. (A–C) Demographic properties of patients on the SOM that was constructed from 613 × 2 1H NMR spectra of serum. The three upper plots depict the clustering of (A) microalbuminuria, (B) macroalbuminuria and (C) 10-year mortality in patients with type I diabetes. The color of each hexagonal map unit indicates the estimated proportion of cases with respect to the total number of patients who reside on the unit in question. For mortality, the estimates were normalized by follow-up time. (D–H) Five grades of the MetS according to the NCEP ATP III recommendations and (I) the distribution of patients with a history of macrovascular events. Empirical P-values for each plot as a whole are shown below the colorings.
All-cause mortality in Figure 2C (P=0.00057) was estimated based on 8.2±0.6 years of follow-up and scaled to the percentage of deaths in a decade (number of deaths per 1000 patient years). As expected, there is a clear connection between DKD and increased mortality, and the highest value of 25% is observed at (7,1) close to the highest DKD percentage at (5,1). Additional details are available in Figure 1 in Supplementary data 2.The MetS (NCEP, 2002; Eckel ) represents a binary classification according to a clinical scoring system (a score of 3 or more is considered positive) that combines several components of insulin resistance and obesity (Figure 2D–H). Patients with the lowest score 1 reside on the northern part of the SOM with a 42% (P=1.1 × 10−5) occupancy at (1,8), and those with a score above 3 are tightly concentrated on the southwestern corner, with hardly any overlap with the first group (P=1.8 × 10−8 for score 4, P=1.3 × 10−7 for score 5).The SOM colorings indicate strong associations between DKD, mortality and the MetS, but with subtle differences. For instance, the first two MetS categories split the normoalbuminuric northeastern side, rather than spread evenly to mirror the DKD group (Figure 2A, B, D and E). The highest percentage of DKD at (5,1) does not coincide with the highest MetS scores at (9,1). Interestingly, a history of macrovascular complications in these patients seem to be related more to the MetS than to DKD (Figure 2I), although the numbers are too small for statistical significance (P=0.035 for the MVD pattern). Finally, the highest 10-year mortality of 25% is observed at (7,1) in Figure 2C, where the MetS and DKD overlap the most.
Confounding factors and treatments
The colorings for age (mean±s.d. 40±11 years), type I diabetes duration (27±10 years) and gender (311 males, 302 females) show only minor spatial clustering and weak statistical significance (Figure 3A–C). Furthermore, the observed patterns show little similarity to DKD, the MetS or MVD in Figure 2, suggesting that the major chronological and physiological determinants of risk do not confound the biochemical characteristics.
Figure 3
Statistical colorings of confounding factors and treatments. (A–F) Clinical characteristics of patients on the SOM of 1H NMR spectra. The colors of the map units indicate the estimates for the average (A) age and (B) diabetes duration within the patient subset on a particular map region. (C) The gender distributions on the map units. The color of each hexagonal map unit indicates the percentage of male gender with respect to the total number of patients that reside on the unit in question. (D, E) Blood pressure and (F) waist circumference for the patient subsets within each map unit. The percentages of (G) antihypertensive treatment, (H) DRD and (I) lipid-lowering treatment were obtained as described above.
Most of the patients were on medication or had undergone laser treatment for DRD. Antihypertensive treatment (P=2.7 × 10−5) is most common (up to 75%) in those areas on the western side of the map (Figure 3G), which have a high percentage of DKD in Figure 2B and elevated blood pressure in Figure 3D and E, as expected. Furthermore, the same areas have a high proportion of DRD (P=0.00027), although the pattern is more widely dispersed on the southwestern half of the SOM (Figure 3H). Lastly, patients with the highest MetS scores have also the widest waist (94 cm) in Figure 3F and the highest percentage (30%) of lipid-lowering treatment (Figure 3I).
Biochemical backdrop of diabetic complications
To create a more comprehensive metabolic picture than that in Figure 1, we colored the map according to estimates from regression models and spectral features that quantify key biochemical variables directly from the 1H NMR spectra (Mäkinen ). The previously used null hypothesis of no dependence between the map and a target variable could not be used here, as both the map and the coloring were derived from the spectra. The dynamic range of statistical fluctuations was nevertheless estimated to determine a suitable color scale (Supplementary data 1).Triglyceride concentration is a part of the MetS definition, and the highest unit-specific value (3.6 mmol/l) can therefore be seen at the southwest corner of the map, where also the MetS is most severe (Figure 4A). Total serum cholesterol is only partially linked to triglycerides, as it produces an ascending north-south pattern on the SOM (Figure 4B). Nevertheless, the highest value (6.0 mmol/l) coincides with that of triglycerides at (1,9). HDL2cholesterol exhibits a more complicated pattern (Figure 4C), with the highest value (0.64 mmol/l) located near the southeast corner, and the lower values (0.41–0.47 mmol/l) located on the western side.
Figure 4
Statistical colorings of 1H NMR estimates of biochemical variables. (A–I) Quantitative estimates of biochemical variables based on the statistical modeling of the 1H NMR spectra, and visualized on the SOM that was obtained previously from the same data. Regression model estimates for (A) serum triglyceride concentration, (B) cholesterol level, (C) HDL2 cholesterol and (D) serum creatinine. The colors of the map units indicate the averaged estimates for patients who reside in a given region. (E) Concentrations of serum urea were obtained by direct peak integration around 5.68 p.p.m. and (E) albumin concentration was estimated by parametric line fitting in the aliphatic region of the LIPO window. (G) Lactate signal at 4.05 p.p.m. (H) acetate at 1.86 p.p.m. and (I) glucose at 3.44 p.p.m. were quantified by peak integration.
Creatinine singlets at 2.98 and 3.99 p.p.m. and urea around 5.68 p.p.m. have a strong association with DKD (Figure 2B) and produce similar colorings of approximately doubled values on the western side of the SOM as compared with the eastern side (Figure 4D and E). Furthermore, lactate at 1.28 and 4.05 p.p.m. and acetate at 1.86 p.p.m. follow the same trend (Figure 4G and H), but with patterns closer to the MetS. Serum albumin is higher on the eastern half of the map (Figure 4F), which was already evident in the typical spectral model of DKD in Figure 1C and D.The effect of the glucose resonances between 3.1 and 3.9 p.p.m. was suppressed when constructing the SOM, but nevertheless the remaining doublet from α-glucose around 5.19 p.p.m. was enough to separate patients who had high blood glucose at the time of sample collection (Figure 4I). Although the daily variations are large in type I diabeticpatients, high glucose values do partially overlap with the MetS and other complications on the southwestern half.
Validation by standard biochemistry
Standard non-NMR measurements of a number of metabolites were also available for validating the NMR-derived results with methodologically independent data (Table 1 in Supplementary data 2). In addition, the metabolite manifold from the spectra was supplemented with biomolecular data that could not be detected by NMR to confirm biologically relevant observations.Figure 5A–C depicts the three most important lipid variables (see also the NMR-derived colorings in Figure 4A–C), with highly statistically significant patterns for triglycerides (P=3.1 × 10−19), total cholesterol (P=2.1 × 10−13) and HDL2C (P=2.5 × 10−6). HDL3C (P=4.3 × 10−6) has a pattern similar to HDL2C in Figure 5F. The highest concentration of ApoB (117 mg/dl) coincides with the highest triglycerides and cholesterol in the southwest corner at (9,1) and appears to be elevated if either of the two is higher than average (P=1.1 × 10−13 for ApoB).
Figure 5
Validation of the biochemical accuracy of the 1H NMR data and neural network analysis. (A–I) Measurements of clinical markers by standard (non-NMR) biochemistry, visualized on the SOM of 1H NMR spectra. (E) Apolipoprotein B-100, (F) HDL3 cholesterol, (G) insulin dose, (H) glycemic control and (I) 24 h-urine albumin were included instead of those metabolites from Figure 4 that were estimated by 1H NMR.
Creatinine is easily detectable by 1H NMR as two singlets at 2.98 and 3.99 p.p.m accordingly, the non-NMR measurement in Figure 5D closely matches the NMR-derived creatinine (and biologically correlated urea) in Figure 4D and E (P=9.2 × 10−7). The 24 h-urine albumin in Figure 5I is indicative of persistent albuminuria by definition (P=4.6 × 10−10), and albuminuria is tightly linked to the metabolite manifold from NMR, as was evident in Figure 2A and B.Patients in the southwest corner have significantly higher weight-adjusted insulin doses (P=0.00053), but, despite that, the worst glycemic control, which is indicated by HbA1c of 9.7% (P=2.9 × 10−8). Corresponding patterns were previously observed for the MetS (Figure 2F–H) and NMR-derived lactate, acetate and glucose (Figure 4G–I), which can be seen as an indication of insulin resistance.
Summary of clinical and metabolic characteristics
In the final stage, we merged map units into larger districts and collected the regional characteristics into tabular format to create a summary of the metabolic characteristics (Figure 6; Table 3 in Supplementary data 2). For instance, the southwest district (Figure 6A) is populated by patients with high relative risk due to DKD and the MetS (7.8 versus 2.0–2.1), compared with the districts in the north and northeast (Figure 6E and F). Biochemically, these groups differ significantly: triglycerides (2.8 versus 0.84 mmol/l), cholesterol (5.8 versus 4.5–4.9 mmol/l), serum creatinine (116 versus 87–88 μmol/l) and 24 h-urine albumin (356 versus 13–16 mg) are high, whereas HDL-subfractions are low in the MetS district. Patients in the MetS corner have also poor glycemic control (HbA1c 9.2 versus 8.2%) and larger waist circumference (94 versus 81–83 cm).
Figure 6
Summary of clinical and metabolic characteristics. (A–F) Statistics for a selection of non-NMR variables for patient groups defined by six districts on the SOM. The map was constructed based on the 1H NMR spectra for 613 type I diabetic patients. The percentages of cases with respect to the total number of patients in a given district and for the whole population (ALL) are listed for 10-year mortality (normalized by follow-up time), DKD, the MetS, MVD, DRD and male gender. Relative risk of death (RR) was defined as the ratio of the observed mortality in type I diabetic patients against the entire Finnish population. The MetS was defined as present if the score was three or more. Median values are listed for the continuous variables, with the full statistics available for the non-NMR data in Table 3 in Supplementary data 2.
The northwest corner is characterized by high susceptibility to microvascular complications, as DKD (58%) and DRD (58%) are common but MetS (42%), MVD (10%) and 10-year mortality (17%) are closer to average. Furthermore, there is a notable difference in HDL2C when comparing the northwest corner and the southeast corner of the SOM (0.45 versus 0.60 mmol/l), which is also visible in Figure 4C. On the other hand, total cholesterol is higher in the southeast (4.9 versus 5.7 mmol/l), so the difference in the ratio of HDL2 to total cholesterol is less pronounced.The two districts with favorable phenotypes are similar if complications alone are considered, but some minor biochemical differences can be observed (Figure 6E and F). Although both districts share relatively high HDL2C concentrations, ApoA1, ApoB, HDL3C and total cholesterol are lower in the northern district. The districts are also different with respect to serum albumin (Figure 4F). Triglycerides and glycemic control (HbA1c 8.2%) exhibit no clear differences, as was seen also in the spectral estimates of triglycerides in Figure 4A.
Discussion
Diabetes is associated with increased incidence of vascular complications and premature aging (Morrish ; Rönnback ; Pambianco ). Here, we characterized the metabolic background of adverse clinical phenotypes within this high-risk population by 1H NMR spectroscopy of serum. Specifically, we were able to identify differences and similarities in the biochemical patterns of DKD, insulin resistance, DRD, macrovascular complications and all-cause mortality in 613 type I diabeticpatients.The 1H NMR analyses were targeted at two molecular windows simultaneously. The LIPO window carries information particularly on lipoprotein lipids and albumin, whereas the LMWM window contains signals from smaller metabolites such as creatinine and glucose (Mäkinen ; Ala-Korpela, 2008). Lipoprotein levels are altered in type I diabetes in the presence and during the development of complications (Chaturvedi ; Lyons ; Thomas ), but they alone cannot capture all aspects of the diabetic condition. It is therefore plausible that the inclusion of the LMWM window as an integral part of the biochemical analysis is crucial for polydiagnostic applications that are targeted at several concurrent disease mechanisms (Kell, 2006).The raw metabolite manifold obtained from the two molecular windows was not usable as such, but multivariate computational analysis was required to transform the spectral data into an accessible form of information. We chose the SOM (Kohonen, 2000; Suna ) instead of linear decomposition methods such as principal component analysis (PCA), since the SOM algorithm preserves the spectral shapes and summarizes the data via a small number of spectral models that are directly relatable to individual samples (Figure 1). The decompositions, on the other hand, are focused on the variables and on the reduction of the data into abstract multidimensional linear spaces that may not have a direct connection to the observed NMR resonances.PCA and partial least squares methods have been established as the standard pattern recognition methods in the field of metabonomics (Dieterle ; Trygg ). However, the SOM analysis has been applied, for instance, in spectroscopic classification problems (Beckonert ; Lavine ; Suna ), in molecular conformation analysis (Hyvönen ) and in studies of gene–metabolite interactions (Hirai ). The last case is an example of complementary application of both PCA and SOM. Some studies have also compared SOM with other methods (Mangiameli ; Giraudel and Lek, 2001; Astel ) and, in most cases, the use of SOM produced additional insight compared with PCA (see also Supplementary data 1).The ambiguity in model selection and the danger of overfitting are the main drawbacks of neural network algorithms (Lampinen and Kostiainen, 1999). Here, the statistical significance of the observed patterns was addressed from a practical point of view, with emphasis on the ability of 1H NMR to extract relevant metabolic information from a serum sample. The question of overtraining was not relevant in this context, as we were not using the SOM as a predictive tool but included only the spectra as inputs. Thus, we were able to integrate different sources of data in a user-friendly fashion by stochastically normalized map colorings, without giving up statistical reliability.We chose not to perform supervised feature extraction or selection before the SOM analysis. This may have led to suboptimal predictive performance for a particular clinical endpoint but, on the other hand, the results reflect the intrinsic metabolic information content of the spectra as such, and the complex dependencies between the various bodies of data. For instance, we have previously shown that a linear regression model with nonlinear feature extraction can indicate the albumin excretion categories from the 1H NMR spectra of serum (Mäkinen ), but this model has less utility in nondiabetic populations. The SOM presented here, however, is not methodologically dependent on albuminuria and thus easier to relate to the general population.Omitting explicit feature selection does not guarantee optimal results, but other options would require detailed methodological treatment that is not yet available, and might complicate the interpretations of the original spectral shapes. DKD, for instance, is a discrete classification based on one biomarker (urine albumin), and while it makes a perfect test case for analytical techniques (Mäkinen ), a systems biology approach to diabetic complications should not aim to predict albuminuria, but to distinguish the multifactorial disease states, even if not all of them are associated with overt albuminuria (Caramori ). Put differently, a predictive model will only work in the clinical practice if the phenotype to be predicted is accurate (Loscalzo ).Despite the shortcomings, albumin excretion is an excellent indicator of progressing kidney damage after the early stages of the disease. Albuminuria starts to present itself within the first 15 years of diabetes (Gross ; Caramori ); patients included in this work had a long diabetes duration (26 years on average), so the normoalbuminuric subgroup has a low risk of ever developing DKD. Consequently, they represent a generally healthier subset of diabeticpatients, especially when considering that low-grade albuminuria is associated with macrovascular complications, even outside type I diabetes (Gerstein ). Still, normoalbuminuria does not preclude the cardiovascular risk factors: a significant portion of patients in the map region with dyslipidemia and insulin resistance (Figure 6A) had normal (17%) or intermediate (20%) albumin excretion as opposed to macroalbuminuria (63%).The metabolic differences between most normo- and macroalbuminuric (DKD) patients were evident in the 1H NMR spectra of serum. DKD was associated with elevated triglycerides, lower HDL cholesterol and decreased albumin in the LIPO window. Other studies have also reported the connection between albuminuria and triglycerides (Chaturvedi ; Jenkins ; Thomas ), but the exact role of HDL metabolism remains unclear. Serum creatinine and urea are two waste products that are normally excreted by the kidneys and, accordingly, the LMWM window revealed elevated values for the macroalbuminuric group, although none of the patients had end-stage renal disease. Crucially, the microalbuminuric patients did not clearly identify with either the low-risk or high-risk regions of the SOM, again highlighting the limited usefulness of any single biomarker in complex disease environments (Caramori ).The MetS and DKD are strongly associated in the Finnish type I diabetic population (Thorn ). Our results from the metabonomic analysis were similar: the SOM regions with patients that have a detectable loss in kidney function (i.e., elevated creatinine and urea, decreased serum albumin) overlapped with insulin resistance and related problems in glucose metabolism (dyslipidemia, high insulin dose, high HbA1c, elevated lactate and acetate and high fasting glucose) (Krentz ; Lovejoy ; Avogaro ; Choi ). Interestingly, the neighborhood with the most severe insulin resistance did not coincide with the highest values of creatinine and urea, which suggests that there is a subtle systematic difference between the two clinical conditions. Furthermore, the highest mortality was observed on the intersection of the two defects. Unfortunately, the data set size in this study was insufficient for any definitive conclusions, but we have made similar observations with the non-NMR data for the full FinnDiane population (Mäkinen et al, in preparation), so these results nevertheless support a multifactorial approach to the study of pathophysiology (Loscalzo ).The discrepancy between the MetS and DKD could not be explained by nonbiochemical factors. For instance, although lipid treatment was most common in the MetS neighborhood, this group of patients still had the highest triglyceride concentration. This suggests that DKD with lower triglycerides is not a product of lipid-lowering treatment alone. In contrast, the highest percentage of antihypertensive treatment coincided with the highest blood pressure and DKD, but not with the MetS, which in turn suggests that the MetS with lower albumin excretion is not just a product of decreased albuminuria due to blood pressure medication (Thomas and Atkins, 2006). Furthermore, there was little regional gender difference in the MetS-DKD half of the SOM, so the wide waist circumference in the MetS neighborhood did not represent a gender bias. Hence, our results did not match some of the previous findings on statistical gender groupings in 1H NMR experiments of plasma (Kirschenlohr ).Macrovascular events are the primary targets of intervention, as they are the most common cause of premature death in type I diabeticpatients (Libby ; Stadler ). In this respect, the difference between insulin resistance and kidney function becomes significant; the SOM neighborhood with the highest percentage of MVD was also the one with the strongest insulin resistance and the highest MetS scores, which agrees well with previous studies (Sierra-Johnson ). The result is only suggestive, as the numbers were low and we had only cross-sectional data on MVD available, but this finding nevertheless demonstrates the sensitivity of the metabonomics approach to MVD vulnerability, as opposed to the albuminuria classification alone. Undoubtedly, triglyceride concentration is the most significant serum biomarker (Davis ; Pambianco ), and its role may be even further emphasized here, as triglycerides produce pronounced 1H NMR resonances.DRD was the most common (51%) complication among the study population. Map regions with DKD or insulin resistance hallmarks had the highest percentages, but DRD did not specifically associate with either of the two. The connection to mortality was less obvious due to the large number of DRD cases, which agrees with previous observations (Torffvit ).Low concentration of HDL2cholesterol was the common, albeit not exclusive, denominator for all complications, which suggests that HDL2 has a protective effect, or is a marker of a favorable phenotype with respect to both micro- and macrovascular complications (Cutri ). This is in contrast to many observations regarding type I diabetes (Chaturvedi ; Jenkins ; Thomas ), but the disconnect may be due to the statistical models that were used in these studies. There is a negative correlation between HDL cholesterol and serum triglycerides, so any linear model that includes both the two is likely to emphasize one at the expense of the other. Furthermore, men and women have different concentrations, so gender bias is likely to influence the results. To resolve the controversy, studies that use modern multivariate pattern recognition techniques in addition to the classical medical statistics should be performed on large population-based cohorts.The biological heterogeneity of diabetic complications make the borderline between health and disease ambiguous, as is the case with many other slow pathophysiological processes. Our work illustrates this fundamental diagnostic challenge: DKD, DRD, the MetS and MVD shared much of the same biochemical basis, but nevertheless did not conclusively define each other. Even though the patients in this study were carefully selected to represent clinically relevant phenotypes, the metabolic landscape remained diffuse.This work is, to our knowledge, the first metabonomics study on premature death and vascular complications in a large human cohort. We used only serum to characterize the patients, and yet the high-risk metabolic features were easily observable. This is an encouraging result with respect to general applicability as, unlike type I diabetes, urine albumin (or any other single biomarker) does not have an equally critical role in type II diabetes, let alone in the nondiabetic population. Furthermore, our application of 1H NMR metabonomics and statistical visualizations may improve the tracking of patients' progress in the diabetic disease continuum in a way not attainable by traditional approaches. Hence, it may become possible to re-route the multimetabolite path of a vulnerable patient away from adverse clinical endpoints and towards a more favorable phenotype before it is too late (Ala-Korpela ).
Materials and methods
Study population
Patients with type I diabetes were recruited by the Finnish Diabetic Nephropathy Study (FinnDiane), which is a nationwide multicenter effort to identify genetic and clinical risk factors for DKD. The study protocol was in accordance with the Declaration of Helsinki and approved by the local ethics committee in each of the participating health care centers. Diagnostic criteria for type I diabetes included age of onset below 35, the transition to permanent insulin treatment within a year from onset and C-peptide negativity.Data on medication, cardiovascular status and diabetic complications were registered by a standardized questionnaire, which was completed by the patient's attending physician according to the medical file. Death certificates were obtained from the national registry maintained by the Population Register Centre of Finland. The average follow-up time was 8.2±0.6 years (4972 patient years in total).The classification of renal status was made centrally according to urinary albumin excretion rate (AER) in at least two out of three successive overnight urine samples. Absence of kidney disease was defined as normoalbuminuria (AER<20 μg/min), while the presence of overt kidney disease was defined as macroalbuminuria (AER⩾200 μg/min). The intermediary range is referred simply as microalbuminuria (20 μg/min⩽AER<200 μg/min). Albumin from 24 h-urine samples was also available, and it was used as a biochemical variable in parallel with the longitudinal records of albuminuria.The MetS scores were calculated according to the NCEP ATP III recommendations (NCEP, 2002), with every type I diabeticpatient having a base score of 1 for hyperglycemia (Table 2 in Supplementary data 2). DRD was defined present if a patient had undergone laser eye treatment. MVD was defined as a pooled composite of coronary heart disease, acute myocardial infarction, stroke and peripheral vascular disease. The events were pooled to have more cases for the statistical analyses. Of the 54 cases, 40 had coronary heart disease, 24 had acute myocardial infarction, 12 had undergone coronary bypass, 15 had cerebral stroke and 17 had undergone peripheral vascular bypass.Patients for the 1H NMR experiments were chosen based on the renal status. First, a random subset of macroalbuminuric patients was selected from the FinnDiane clinical database with preference for those individuals with genetic and clinical information already available. Next, sex- and age-matched peers were chosen from the normo- and microalbuminuric group, with preference for normoalbuminuria and availability of information. In total, 613 patients were included, of which 251 (41%) were normoalbuminuric, 137 (22%) had microalbuminuria and 225 (37%) had macroalbuminuria. Patients with end-stage renal disease or kidney transplant were excluded. The study set does not reflect the population-based cross-section, except within the macroalbuminuric group.
1H NMR spectroscopy
1H NMR data from serum samples at 37°C were recorded on a Bruker AVANCE spectrometer with a field strength of 500.13 MHz. The reference substance (sodium 3-trimethylsilyl[2,2,3,3-d4]propionate (TSP) 40 mmol/l, MnSO4 0.6 mmol/l in 99.8% D2O) was placed coaxially into the NMR sample container (o.d. 5 mm, contains 430 μl of serum) in a separate tube (o.d. 1.7 mm, supported by a Teflon adapter). This double-tube system was chosen to avoid mixing of the sample fluid and reference substance, which would make the absolute metabolite quantification less reliable (Ala-Korpela, 1995, 2008; Mäkinen ).The 1H NMR experiments were targeted at two different molecular windows: LIPO and LMWM. For the LIPO spectra, 128 transients were collected with a 90° flip angle, a 6.2 s acquisition time and a 0.1 s relaxation delay. The LMWM data were collected with a standard one-dimensional Carr-Purcell-Meiboom-Gill (CPMG) pulse sequence with a 325 ms T2-filter and a fixed 400 μs echo delay to eliminate diffusion and J-modulation effects. Forty-eight transients were collected after 16 dummy scans with a 6.2 s acquisition time and an 8.7 s relaxation delay. Water suppression was not used. The free induction decays (FID) with 65 536 data points were zero-filled and multiplied by an exponential window function with a 1.0 Hz line-broadening for the LIPO spectra and 0.5 Hz line-broadening for the LMWM spectra. The preprocessing of the FIDs and subsequent Fourier transformations were performed on the PERCH software platform (PERCH Solutions Ltd, Kuopio, Finland).Metabolite intensities in each spectrum were scaled according to the area of the respective TSP reference signal at 0 p.p.m.. The TSP area was obtained by first subtracting the linear baseline between −0.1 and 0.1 p.p.m., and then integrating over the remainder. In the LMWM window, the effects of the water peak around 4.6 p.p.m. were removed by fitting a Lorenzian tail on the aliphatic side and a piece-wise polynomial curve on the aromatic side, and then subtracting the tails. Furthermore, a minor piece-wise linear correction was applied in both windows, mainly affecting the region between 3 and 5 p.p.m.. Peaks in the LMWM spectra were aligned by first estimating the shift offsets at selected locations, then linearly interpolating the offsets where direct peak detection was difficult and finally re-mapping the frequency axis according to the estimated shift offsets. The same correction was performed also for the LIPO spectra, although the effects were small due to wider line shapes. All preprocessing was performed in the Matlab programming environment (The MathWorks Inc., Natick, MA, USA) by using the statistical toolbox and in-house scripts.
Self-organizing map and statistical analysis
Before constructing the SOM, the spectra were truncated to 0.3–3.3 p.p.m. (LIPO) and 0.7–5.8 p.p.m. (LMWM), and the chemical shift resolution was reduced to 0.003 p.p.m. (LIPO) and 0.001 p.p.m. (LMWM). Data points between 4.2 and 5.0 p.p.m. were omitted and the intensities between 3.22 and 3.88 p.p.m. were given only 0.1% weight in the SOM algorithm to attenuate the effect of glucose. The respective intensity units for the two molecular windows were adjusted such that the total variance of the data points within the LIPO window was equal to that in the LMWM window. This was done to ensure that both windows would have comparable effect to the SOM structure while preserving the relative intensities between individual experiments.We chose a 9 × 9 hexagonal sheet of map units with a Gaussian neighborhood function for the analysis (7.6 samples per unit). The SOM was initialized based on the two first principal components of the input data and finished by the batch-training algorithm, with a 3.4% topological error.After the positions of the study subjects on the SOM were computed, the map was colored according to the demographic properties (i.e., clinical and biochemical variables) within different parts of the SOM (Supplementary data 1). To verify statistical significance, regional variability for the colorings was estimated by first computing the squared deviations from the global average for each map unit, and then adding these unit-specific values to obtain a single descriptive statistic. Put differently, the ‘bumpiness' of a coloring was expressed as one numerical value. Some bumps will occur by change alone, and to estimate the expected null distribution, the target variable (DKD status for instance) was shuffled randomly several times, and at each round, the descriptive statistic was recomputed. The probability of getting the observed statistic (or a more extreme value) from the random distribution determined the statistical significance P. The null distributions from the permutation analysis were also the basis of the color scale in each figure so that categorical and continuous variables, possibly with some data missing, can be compared visually while maintaining the statistical interpretation.Mortality in Figure 2C was estimated by first computing the coloring for the percentage of deaths observed on each map unit, and then normalizing these values by the average unit-specific follow-up time in decades. Note that this is equivalent to the number of deaths in 1000 patient years. Only patients that had not died (91%) were used for the follow-up time estimation to avoid bias due to deaths. The unadjusted event percentages and follow-up times are available in Figure 1 in Supplementary data 2. The discrepancy between Figures 2C and 6A (max 25 versus 26%) stems from different formulations: the former was obtained from the smoothed map estimates for the number of deaths and follow-up period, whereas the latter was obtained from point estimates within the resident patient subgroup. Relative risk for early death in Figure 6 was obtained as the ratio of observed deaths and the mortality in the corresponding age segments of the entire Finnish population within the follow-up period (data from Statistics Finland).In addition to clinical variables, numerous quantities were estimated computationally from the spectral data. Easily distinguishable peaks in the LMWM window such as urea around 5.68 p.p.m. and acetate at 1.86 p.p.m. were quantified by direct peak integration. The wide baseline signal from serum albumin was estimated by fitting a polynomial curve at selected locations in the LIPO window and computing the respective curve area. Lastly, semilinear regression modeling was applied to quantities that were also available via non-NMR methods (Mäkinen ). All analyses were performed using the SOM Toolbox 2.0 for the Matlab environment (URL: http://www.cis.hut.fi/projects/somtoolbox/), and in-house scripts for subsequent quantification, map coloring and permutation analyses.Supplementary data 1Supplementary data 2Supplementary data 3Supplementary data 4Supplementary data 5
Authors: Joanne T Brindle; Henrik Antti; Elaine Holmes; George Tranter; Jeremy K Nicholson; Hugh W L Bethell; Sarah Clarke; Peter M Schofield; Elaine McKilligin; David E Mosedale; David J Grainger Journal: Nat Med Date: 2002-11-25 Impact factor: 53.440
Authors: Ville-Petteri Mäkinen; Pasi Soininen; Carol Forsblom; Maija Parkkonen; Petri Ingman; Kimmo Kaski; Per-Henrik Groop; Mika Ala-Korpela Journal: MAGMA Date: 2006-12-15 Impact factor: 2.310
Authors: H C Gerstein; J F Mann; Q Yi; B Zinman; S F Dinneen; B Hoogwerf; J P Hallé; J Young; A Rashkow; C Joyce; S Nawaz; S Yusuf Journal: JAMA Date: 2001-07-25 Impact factor: 56.272
Authors: M Stadler; M Auinger; C Anderwald; T Kästenbauer; R Kramar; C Feinböck; K Irsigler; F Kronenberg; R Prager Journal: J Clin Endocrinol Metab Date: 2006-08-01 Impact factor: 5.958
Authors: M T Hyvönen; Y Hiltunen; W El-Deredy; T Ojala; J Vaara; P T Kovanen; M Ala-Korpela Journal: J Am Chem Soc Date: 2001-02-07 Impact factor: 15.419
Authors: Linda S Kumpula; Sanna M Mäkelä; Ville-Petteri Mäkinen; Anna Karjalainen; Johanna M Liinamaa; Kimmo Kaski; Markku J Savolainen; Minna L Hannuksela; Mika Ala-Korpela Journal: J Lipid Res Date: 2009-09-05 Impact factor: 5.922
Authors: Shucha Zhang; Cheng Zheng; Ian R Lanza; K Sreekumaran Nair; Daniel Raftery; Olga Vitek Journal: Anal Chem Date: 2009-08-01 Impact factor: 6.986
Authors: Ville-Petteri Mäkinen; Carol Forsblom; Lena M Thorn; Johan Wadén; Kimmo Kaski; Mika Ala-Korpela; Per-Henrik Groop Journal: Cardiovasc Diabetol Date: 2009-10-06 Impact factor: 9.951
Authors: James R Bain; Robert D Stevens; Brett R Wenner; Olga Ilkayeva; Deborah M Muoio; Christopher B Newgard Journal: Diabetes Date: 2009-11 Impact factor: 9.461