
Generalized monotone incremental forward stagewise method for modeling count data: application predicting micronuclei frequency.

Abstract

The cytokinesis-block micronucleus (CBMN) assay can be used to quantify micronucleus (MN) formation, the outcome measured being MN frequency. MN frequency has been shown to be both an accurate measure of chromosomal instability/DNA damage and a risk factor for cancer. Similarly, the Agilent 4×44k human oligonucleotide microarray can be used to quantify gene expression changes. Despite the existence of accepted methodologies to quantify both MN frequency and gene expression, very little is known about the association between the two. In modeling our count outcome (MN frequency) using gene expression levels from the high-throughput assay as our predictor variables, there are many more variables than observations. Hence, we extended the generalized monotone incremental forward stagewise method for predicting a count outcome for high-dimensional feature settings.


Keywords: Poisson regression; gene expression; high-throughput; micronuclei; penalized model

Year: 2015 PMID: 25983544 PMCID: PMC4415688 DOI: 10.4137/CIN.S17278

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Micronuclei (MNs) are small nuclear bodies that are formed in dividing cells but are not part of the nucleus. Therefore, MNs can only be found in cells that have undergone nuclear division at least once and appear as small extranuclear bodies. When two daughter nuclei are formed during cell division, these bodies are placed into a smaller nucleus that is not part of the main nuclei, hence the term “micronuclei.”1 Once the MNs are formed, the cell has several different response options. MNs can remain within the cell, if they have functional DNA, as separate entities or be reabsorbed into the main nucleus. If the DNA is nonfunctional, the MNs may be expelled from the cell or the whole cell may be destroyed through apoptosis. Because MNs can be expelled from the cell, they can be used as a mechanism to remove extra chromosomes from the cell.1 MNs can form spontaneously or they can be induced by mutagens. Some spontaneous MNs are actually beneficial to the organism. An example is in the mouse cerebral cortex, wherein MN formation adds diversity to the nervous system.1 However, a large majority of MNs are caused by mutagens and may play a role in carcinogenesis. Depending on the fate of the MN, the result could be a variety of different DNA and chromosome cell contents. This variety could result in an accumulation of DNA changes and instability that could result in cancer.1 Several studies have shown that higher MN counts result in a higher risk of cancer in the future.1 Thus, using the cytokinesis-block micronucleus (CBMN) assay as a risk assessment tool for cancer has potential clinical benefits. Further, combining CBMN with other high-throughput technologies such as gene expression and methylation analyses may help identify factors related to micronucleation. Quantifying MNs in patient samples has been shown to be a good measure of genetic damage. 
MN scoring, ie, counting the number of MNs present in a sample, is a popular tool for testing genotoxicity mostly because of its simplicity, accuracy, applicability to different cell types, and ease of automation. Cancer cells show a loss of genetic control, which can be caused by DNA damage; so, they are good candidates for MN testing. The CBMN assay has successfully been used and validated to score MNs. The CBMN assay uses cytochalasin-B, which stops cells from performing cytokinesis but does not stop nuclear division, giving rise to cells that are binucleated.2,3 Furthermore, the Organisation for Economic Co-operation and Development has developed a set of guidelines for running the CBMN assay to obtain the most consistent and reliable results.1 Guidelines for the process of scoring MNs have been presented by the HUman MicroNucleus (HUMN) project. This is an international collaborative project aimed at improving the application of the CBMN assay. One of the main goals of the HUMN project is to identify methodological variables in the scoring of the assay to minimize confounding effects.4 The HUMN project compiled a list of 6583 subjects from 25 laboratories in 16 countries and looked at background MN frequency using the CBMN assay. The goal of the study was to identify variables that affect the background MN frequency. Scoring criteria were found to account for 47% of the observed variability; thus, standardized scoring criteria were developed and described by Fenech et al.4 The guideline includes scoring 2000 cells to accurately estimate MN frequency. Because these guidelines were developed for assay performance, they do not address how to statistically analyze the data generated by the assay. This has led to the application of various statistical methods that may render different interpretations and conclusions. 
In a review article examining analytical methods, Ceppi et al.5 reviewed 63 studies that statistically analyzed MN data and developed recommendations for selecting an appropriate analytical method. The review included studies that applied both parametric and nonparametric tests. The nonparametric tests included Kruskal–Wallis, Friedman, Wilcoxon, and Mann–Whitney U-tests. Although these tests do not require an underlying distributional assumption, they are unable to adjust for confounding factors. There were a variety of parametric tests performed that assume normality, such as analysis of variance, analysis of covariance, and multivariable linear models, which can adjust for confounding factors. Other methods such as correlations and Student’s t-test were also used. However, applying these methods to MN data, which are rarely normally distributed, could result in inappropriate inferences. Although the data could be transformed to better adhere to a Gaussian distribution before applying such parametric tests, few studies applied any type of transformation. Further, Student’s t-tests and Pearson’s correlation cannot adjust for confounding variables. The common non-Gaussian models used were log-linear, Poisson, negative binomial, and logistic regressions. The logistic and log-linear models account for categories, whereas Poisson and negative binomial models directly model count data. For this reason, Ceppi et al.5 recommend using negative binomial or Poisson models for MN data analysis. Another advantage of these count models is that they can adjust for confounding variables such as age, gender, and smoking status. Finally, Ceppi et al.5 recommended that 2000 or more cells be scored for best model performance. If <2000 cells are scored, a zero-inflated Poisson model is recommended.5 When trying to identify molecular features related to MN frequency, high-throughput genomic assays can be used. 
However, the previously described methods cannot be applied in settings where there are more predictor variables than samples. Therefore, in this study, we extended the generalized monotone incremental forward stagewise (GMIFS) method to the Poisson regression setting and applied it to a cord blood study in which we were interested in predicting MN frequency using features from the Agilent 4×44k human oligonucleotide microarray.

Methods

Data

The cord blood data were collected as part of the Norwegian Mother and Child Cohort Study (MoBa).6 The target population of MoBa comprised all women who gave birth in Norway. The overall goal of this study was to collect data on pregnant women and their children to estimate the association between exposures and diseases. Specifically, the data are taken from a subcohort called BraMat, which translates to “good food” in English. This subcohort concentrates on the effect that a pregnant woman’s diet has on her child. Umbilical cord blood samples were collected immediately after birth from 200 babies. After quality control and other exclusions, 111 samples were hybridized to Agilent 4×44k human oligonucleotide microarrays to measure gene expression. Of the 111 subjects, 29 also had MN data collected. The MNs were scored using the procedure described by Decordier et al.7 Further, demographics such as gender were collected for all subjects. Data were downloaded from Gene Expression Omnibus (GSE31836). Sample processing, image analysis, normalization, background correction, and filtering are described in the study by Hochstenbach et al.8 For this analysis, the data were further filtered to include only genes that had no missing values, leaving 8497 genes for statistical analysis.

Statistical methods

There are many available methods that can model count data. However, these methods require independence of the explanatory variables and that the number of explanatory variables (p) does not exceed the number of samples (n). The incremental forward stagewise regression method for linear regression and the GMIFS for a logistic regression model have been previously described.9 The GMIFS method for modeling ordinal response data has also been described.10 To assist in our extension to the Poisson regression setting, we first review Poisson regression. We subsequently describe our GMIFS method for fitting Poisson regression models when n < p.

Poisson regression

Poisson regression is commonly used to model count data. Let $i = 1, \ldots, n$ index the observations and let $y_i$ represent a Poisson-distributed random variable. Let the expected value of $y_i$ be written as

$$E(y_i) = \lambda.$$

Then, the conditional probability is given by

$$P(Y_i = y_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}$$

for each observation $i$. The likelihood is represented by

$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}.$$

Mathematically, it is easier to maximize the log-likelihood, which is given by

$$\ell(\lambda) = \sum_{i=1}^{n} \left[ y_i \log(\lambda) - \lambda - \log(y_i!) \right].$$

Thus, we are looking for the value of $\lambda$ that maximizes the log-likelihood above. Further, an offset is used if the response variable can be considered a rate. For example, MN frequency is scored from a larger number of total cells. Therefore, if the total number of cells examined varies by subject, an offset is appropriate. In this case, the expected value is

$$E(y_i) = t_i \lambda,$$

where $t_i$ is the offset value. The conditional probability is then given by

$$P(Y_i = y_i \mid \lambda, t_i) = \frac{e^{-t_i\lambda}(t_i\lambda)^{y_i}}{y_i!}$$

for each observation $i$. The likelihood is represented by

$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-t_i\lambda}(t_i\lambda)^{y_i}}{y_i!}.$$

Again, mathematically, it is easier to maximize the log-likelihood, which is given by

$$\ell(\lambda) = \sum_{i=1}^{n} \left[ y_i \log(t_i\lambda) - t_i\lambda - \log(y_i!) \right].$$

Once again, we are looking for the $\lambda$ value that maximizes the log-likelihood. These log-likelihoods are used to model predictor variables. In Poisson regression, the model assumes that the natural log of the expected value can be modeled by a linear combination of predictors. In this case, the natural log of $t_i$ is entered as an offset in the model estimation. The natural log of the expected value is

$$\log\left(E(y_i)\right) = \log(t_i) + \mathbf{x}_i^{T}\boldsymbol{\beta},$$

where $\mathbf{x}_i$ is a vector of predictor variables and $\boldsymbol{\beta}$ is a vector of coefficients. The estimated coefficients can be exponentiated to determine how the response changes with the predictor. By taking the estimated linear combination of predictors and coefficient estimates and exponentiating it, we can calculate the estimated response for a particular subject.
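As a small numerical illustration of the offset log-likelihood described above, the following sketch evaluates it for a common rate and checks that the closed-form maximizer, the total count divided by the total offset, gives a larger value than nearby rates. The counts and cell totals are made-up illustrative numbers, and the function name `poisson_loglik` is hypothetical rather than taken from the authors' software.

```python
import math

def poisson_loglik(rate, y, t):
    """Offset Poisson log-likelihood: sum of y*log(t*rate) - t*rate - log(y!)."""
    return sum(
        yi * math.log(ti * rate) - ti * rate - math.lgamma(yi + 1)
        for yi, ti in zip(y, t)
    )

# Hypothetical MN counts and numbers of cells scored (the offsets).
y = [3, 1, 4]
t = [2000, 1900, 2100]

# Setting the derivative sum(y)/rate - sum(t) to zero gives the maximizer.
rate_hat = sum(y) / sum(t)

ll_hat = poisson_loglik(rate_hat, y, t)
ll_low = poisson_loglik(0.9 * rate_hat, y, t)
ll_high = poisson_loglik(1.1 * rate_hat, y, t)
```

Here `math.lgamma(yi + 1)` is used for log(y!) so that larger counts do not overflow a factorial.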

GMIFS Poisson model

The GMIFS method was previously described for the logistic regression scenario by Hastie et al.9 but can be adapted to a Poisson regression model. For the proposed method, we consider three types of parameters, into which the coefficient vector of the section “Poisson regression” can be separated, $(\alpha, \boldsymbol{\gamma}, \boldsymbol{\beta})$, along with an offset ($t$). The parameters are the intercept ($\alpha$), those corresponding to an unpenalized subset of predictors ($\boldsymbol{\gamma}$), and those corresponding to a set of penalized predictors ($\boldsymbol{\beta}$). The design matrix, $\mathbf{x}$, consists of two parts, $\mathbf{x}_j$ and $\mathbf{x}_k$, where $j = 1, \ldots, J$ indexes the unpenalized predictors, $k = 1, \ldots, K$ indexes the penalized predictors, and $J + K = P$ is the total number of predictors. The unpenalized predictors are those that we wish to force into the model, such as gender, age, and smoking status, which researchers consider important predictors of MN frequency5; their values $x_{ij}$ are in the $\mathbf{x}_j$ design matrix for subject $i$. The penalized variables (thousands of features from a high-throughput genomic experiment) are those that the model will choose for us and are considered to be the investigative predictors; their values $x_{ik}$ are in the $\mathbf{x}_k$ design matrix for subject $i$. The algorithm proceeds in an iterative fashion and updates one of the penalized covariates by a small incremental amount at each step. To determine which penalized covariate is to be updated next, the largest negative gradient is used. Thus, we need to calculate the first derivative of the log-likelihood corresponding to each penalized predictor. The log-likelihood written in terms of $\alpha$, $\boldsymbol{\gamma}$, and $\boldsymbol{\beta}$ is

$$\ell(\alpha, \boldsymbol{\gamma}, \boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \left( \log(t_i) + \alpha + \sum_{j=1}^{J} \gamma_j x_{ij} + \sum_{k=1}^{K} \beta_k x_{ik} \right) - t_i \exp\left( \alpha + \sum_{j=1}^{J} \gamma_j x_{ij} + \sum_{k=1}^{K} \beta_k x_{ik} \right) - \log(y_i!) \right]$$

and the first derivative written in terms of $\alpha$, $\boldsymbol{\gamma}$, and $\boldsymbol{\beta}$ in matrix notation is

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \mathbf{x}_k^{T} \left[ \mathbf{y} - \mathbf{t} \circ \exp\left( \alpha + \mathbf{x}_j \boldsymbol{\gamma} + \mathbf{x}_k \boldsymbol{\beta} \right) \right],$$

where $\circ$ denotes the elementwise product. Once we know which covariate to update, we need to determine in what direction to update the covariate. To know the direction of the update, the second derivative would need to be calculated, which is a cumbersome process. Hastie et al.9 showed that to avoid having to calculate the second derivative, an expanded covariate space can be used.

For example, let $\beta_1, \ldots, \beta_K$ be the positive coefficient estimates and $\beta_{K+1}, \ldots, \beta_{2K}$ be the negative coefficient estimates. Then, the original estimates are calculated by subtracting the pairs, $\beta_1 - \beta_{K+1}, \ldots, \beta_K - \beta_{2K}$. Thus, using the notation mentioned previously, where $\mathbf{x}_j$ are the unpenalized variables and $\mathbf{x}_k$ are the variables in the penalized subset, the expanded covariate space is $[\mathbf{x}_j : \mathbf{x}_k : -\mathbf{x}_k]$. The proposed GMIFS algorithm using the expanded covariate set is as follows:

1. Initialize the components of $\hat{\boldsymbol{\beta}}^{(s)} = \mathbf{0}$ at step $s = 0$.
2. Initialize the intercept $\alpha$ and the unpenalized coefficients $\gamma_j$, where $j = 1, \ldots, J$, using a maximization algorithm of the log-likelihood.
3. Considering $\alpha$ and $\boldsymbol{\gamma}$ fixed, find the predictor $x_m$ where $m = \arg\min_k \left( -\partial \ell / \partial \beta_k \right)$ at the current estimate $\hat{\boldsymbol{\beta}}^{(s)}$.
4. Update the corresponding coefficient, $\hat{\beta}_m^{(s+1)} = \hat{\beta}_m^{(s)} + \varepsilon$, to yield a new vector of parameter estimates.
5. Update $\alpha$ and the unpenalized coefficients, $\boldsymbol{\gamma}$, by maximum likelihood considering the $\hat{\boldsymbol{\beta}}^{(s+1)}$ from step 4 as fixed.
6. Repeat steps 3–5 until the difference between successive log-likelihoods is less than a prespecified tolerance, $\tau$.

The defaults for the GMIFS algorithm are $\varepsilon = 0.001$ and $\tau = 0.00001$.
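The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under a simplifying assumption, not the authors' implementation: the unpenalized part is restricted to the intercept, so that the maximum likelihood update of α has the closed form log(Σ y_i / Σ t_i exp(x_i β)) instead of requiring an iterative fit for γ. The name `gmifs_poisson` and its arguments are hypothetical.

```python
import math

def loglik(y, t, eta):
    """Poisson log-likelihood with offset t and linear predictor eta."""
    return sum(yi * (math.log(ti) + ei) - ti * math.exp(ei) - math.lgamma(yi + 1)
               for yi, ti, ei in zip(y, t, eta))

def gmifs_poisson(x, y, t, eps=0.001, tol=1e-5, max_steps=100000):
    """Intercept-only GMIFS sketch; x is a list of rows, sum(y) must be > 0."""
    K = len(x[0])
    # Expanded covariate space [x : -x]; coefficients only ever increase.
    xe = [list(row) + [-v for v in row] for row in x]
    beta = [0.0] * (2 * K)                      # step 1
    total_y = sum(y)

    def linear_part():
        return [sum(b * v for b, v in zip(beta, row)) for row in xe]

    def fit_alpha(lin):
        # Closed-form ML intercept given fixed penalized coefficients.
        return math.log(total_y / sum(ti * math.exp(li) for ti, li in zip(t, lin)))

    lin = linear_part()
    alpha = fit_alpha(lin)                      # step 2
    old = loglik(y, t, [alpha + li for li in lin])
    for _ in range(max_steps):
        # Step 3: gradient of the log-likelihood for each expanded predictor.
        resid = [yi - ti * math.exp(alpha + li) for yi, ti, li in zip(y, t, lin)]
        grad = [sum(row[k] * r for row, r in zip(xe, resid)) for k in range(2 * K)]
        m = max(range(2 * K), key=grad.__getitem__)  # smallest negative gradient
        beta[m] += eps                          # step 4
        lin = linear_part()
        alpha = fit_alpha(lin)                  # step 5
        new = loglik(y, t, [alpha + li for li in lin])
        if abs(new - old) < tol:                # step 6
            break
        old = new
    # Collapse the expanded coefficients back to the original scale.
    return alpha, [beta[k] - beta[K + k] for k in range(K)]

# Toy data: counts of 1 when x = 0 and 3 when x = 1, unit offsets.
alpha_hat, beta_hat = gmifs_poisson([[0.0], [1.0], [0.0], [1.0]], [1, 3, 1, 3], [1.0] * 4)
```

On this toy data the unpenalized maximum likelihood solution is α = 0 and β = log 3, and the stagewise path stops just short of it because the log-likelihood gain per step falls below the tolerance. With unpenalized covariates such as gender, step 5 would instead need an iterative Poisson fit with the penalized part held fixed.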

Comparative method: penalized linear regression

A penalized linear regression model can be fit by adding a penalty term to the sums of squares. Specifically, the glmpath algorithm uses a linear combination of the L1 and L2 norm penalizations. The generalized linear model path (glmpath) algorithm is based on a previous algorithm called least absolute shrinkage and selection operator (LASSO). LASSO minimizes the typical sum of squares with an added constraint. Specifically, for linear regression, LASSO minimizes11

$$\sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s,$$

where $x_{ij}$ are the standardized predictors, $y_i$ is the set of centered responses for $i = 1, \ldots, N$ and $j = 1, \ldots, p$, and $s \ge 0$ is the bound that controls the amount of shrinkage. Because of the form of the constraint, LASSO does both variable selection and shrinkage. The glmpath algorithm modifies this slightly by first considering the typical generalized linear model formula

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ -\log L(\mathbf{y}; \boldsymbol{\beta}) \right\},$$

where $L$ denotes the appropriate likelihood function. The glmpath algorithm then adds an analogous LASSO penalty term to help with variable selection when $p > n$:

$$\hat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \left\{ -\log L(\mathbf{y}; \boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert_1 \right\},$$

where $\lambda > 0$ is the regularization parameter. The glmpath algorithm computes coefficient estimates as $\lambda$ varies. The algorithm starts with the largest $\lambda$ that makes $\hat{\boldsymbol{\beta}}(\lambda)$ nonzero, with each step using a smaller $\lambda$. Each optimization consists of three parts: determining the step size in $\lambda$, predicting the corresponding change in the coefficients, and correcting the error in the previous prediction.12 The algorithm continues finding the next largest $\lambda$ that will change the coefficient estimates until no further predictors can be found. However, when the predictors are strongly correlated, the coefficient estimates become highly unstable using the L1 norm penalization.9 Thus, the glmpath algorithm adds a quadratic penalty term and computes the solution to

$$\hat{\boldsymbol{\beta}}(\lambda_1) = \arg\min_{\boldsymbol{\beta}} \left\{ -\log L(\mathbf{y}; \boldsymbol{\beta}) + \lambda_1 \lVert \boldsymbol{\beta} \rVert_1 + \frac{\lambda_2}{2} \lVert \boldsymbol{\beta} \rVert_2^2 \right\},$$

where $\lambda_1 \in (0, \infty)$ and $\lambda_2$ is a fixed, small, positive constant. With this quadratic penalty added, strong correlations among the predictors no longer destabilize the fit.
Further, when the correlations are not strong, the effects of the quadratic penalty are negligible.9 Thus, the glmpath algorithm uses both the L1 and L2 penalties as its default method. The glmpath algorithm uses a default binomial distribution with a logit link and λ2 = 0.00001. The algorithm also allows for a Poisson distribution with a log link and Gaussian distribution with an identity link. The algorithm then computes the regularization path for generalized linear models with L1 penalty.
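To make the penalized criterion concrete, the sketch below evaluates a Poisson negative log-likelihood plus the L1 and quadratic penalty terms for a given coefficient vector. It only computes the objective value and is not the path-following algorithm of glmpath; the function names, the unpenalized intercept, and the toy data are assumptions made for illustration.

```python
import math

def neg_loglik_poisson(alpha, beta, x, y):
    """Negative Poisson log-likelihood with a log link and no offset."""
    total = 0.0
    for row, yi in zip(x, y):
        eta = alpha + sum(b * v for b, v in zip(beta, row))
        total -= yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

def penalized_objective(alpha, beta, x, y, lam1, lam2=1e-5):
    """-log L + lam1 * ||beta||_1 + (lam2 / 2) * ||beta||_2^2 (intercept unpenalized)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return neg_loglik_poisson(alpha, beta, x, y) + lam1 * l1 + 0.5 * lam2 * l2_squared

# Hypothetical toy data: three subjects, two predictors.
x = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.4]]
y = [1, 0, 2]
beta = [0.5, -0.5]

base = penalized_objective(0.0, beta, x, y, lam1=0.0, lam2=0.0)
lasso = penalized_objective(0.0, beta, x, y, lam1=2.0, lam2=0.0)
```

With `lam2=0.0` the criterion reduces to the LASSO form, and the difference between `lasso` and `base` is exactly `lam1` times the L1 norm of `beta`.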

Simulations

Simulations are a useful technique to test how well a new methodology performs. In this case, we wished to quantify how accurately the GMIFS method estimated true nonzero coefficients and predicted count data. Furthermore, we wished to determine how the GMIFS method compared relative to the glmpath method in predicting the count outcome, and simulations provide a good platform to accomplish this comparison. Several general steps must be considered in the simulation process: how to simulate the response, how to simulate the predictors associated with the response, and how to simulate the predictors not associated with the response. Furthermore, we wished to examine how the methods perform under ideal situations and nonideal situations, such as when distributional assumptions are met and are not met, respectively. Note that all simulations were performed using the R programming environment (version 3.1.1).13 First, we considered the situation where the response is Poisson distributed and the user fits a Poisson regression model. Then, we generated the response to follow a Poisson distribution where an offset was either used or not used. The uniform distribution was used to generate the predictors. The steps involved in simulating the data under these conditions were as follows:

1. Randomly generate P variables, $x_{i1}, x_{i2}, \ldots, x_{iP}$, where $i = 1, \ldots, n$, using the uniform distribution on the [0,1] interval.
2. Choose P1 of the P variables to be associated with the response.
3. Assign the P1 β values associated with the response and the intercept value, α. If the offset is to vary, then a uniform distribution was used with maximum 2200 and minimum 1800 and subsequently rounded to the nearest integer. This range was selected because it is recommended to score MNs using 2000 cells.
4. Generate the λ values for the Poisson distribution using the following formula: $\lambda_i = t_i \exp\left( \alpha + \sum_{p=1}^{P_1} \beta_p x_{ip} \right)$, where $t_i$ is the offset, when one is used.
5. Randomly generate $Y_i \sim \text{Poisson}(\lambda_i)$.
6. Fit a Poisson GMIFS model and fit a glmpath model.
7. Repeat steps 1–6 r times.
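One replicate of the data-generating steps for the varying-offset scenario might look like the following sketch. The original simulations were run in R; this Python version is only an illustration. It assumes that the first five predictors are the ones associated with the response and uses the intercept and coefficient values reported for the offset scenario in this section. The helper names `rpois` and `simulate_with_offset` are hypothetical, and the Poisson sampler is Knuth's multiplication method, which is adequate for the small means involved.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_with_offset(n, P=100, alpha=-7.0,
                         betas=(0.3, 0.2, -0.7, 0.5, 0.1), seed=1):
    rng = random.Random(seed)
    # Step 1: P uniform[0, 1] predictors for each of n subjects.
    x = [[rng.random() for _ in range(P)] for _ in range(n)]
    # Steps 2-3: the first len(betas) columns carry signal (an assumption);
    # the offset is uniform on [1800, 2200], rounded to an integer.
    t = [round(rng.uniform(1800, 2200)) for _ in range(n)]
    # Step 4: Poisson means from the log-linear model with offset.
    lam = [ti * math.exp(alpha + sum(b * v for b, v in zip(betas, row)))
           for ti, row in zip(t, x)]
    # Step 5: Poisson responses.
    y = [rpois(li, rng) for li in lam]
    return x, t, y

x, t, y = simulate_with_offset(30)
```

Fitting the models and repeating over r replicates (steps 6 and 7) would wrap this function in a loop with a different seed for each replicate.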
This simulation scheme has several adjustable settings. In this case, we chose n = 30 and n = 80. We studied the models letting P = 100 predictor variables and P1 = 5 predictor variables associated with the response; r = 100 simulations were used. The intercept (α) and the coefficients of the five predictor variables associated with the response (β1, β2, β3, β4, and β5) were set to −5, 0.3, 0.2, −0.7, 0.5, and 0.1, respectively, for data simulated using no offset. For data simulated using an offset, α was set to −7. This was done to keep λ values low so that the Gaussian approximation for the Poisson distribution is not appropriate. To compare the statistical models, the following three outcomes were examined:

1. The number of true predictors that have a nonzero coefficient.
2. The number of false predictors that have a nonzero coefficient.
3. The accuracy of count predictions from the model (sum of squared residuals) when applied to an independent test set.

The methods were compared with and without the use of an offset during the simulation process. Furthermore, the glmpath method allows for the use of Gaussian and Poisson distributions, so both options were used to see what effects user error had on the results. In total, three models were compared when the true distribution was Poisson:

1. The Poisson GMIFS model.
2. glmpath using the “poisson” family option and λ2 = 0, which fits a LASSO model.
3. glmpath using the “gaussian” family option and λ2 = 0, which fits a LASSO model.
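The three comparison outcomes are simple to compute once a model has been fit. The sketch below uses made-up coefficient estimates and predictions purely to show the bookkeeping; the function names are hypothetical.

```python
def selection_counts(beta_hat, true_idx):
    """Return (true predictors with nonzero coefficient, false predictors with nonzero coefficient)."""
    nonzero = {k for k, b in enumerate(beta_hat) if b != 0.0}
    true_set = set(true_idx)
    return len(nonzero & true_set), len(nonzero - true_set)

def sum_squared_residuals(y_obs, y_pred):
    """Prediction accuracy on a test set: sum of (observed - predicted)^2."""
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))

# Hypothetical fit: predictors 0-2 are truly associated with the response.
beta_hat = [0.2, 0.0, -0.1, 0.0, 0.05, 0.3]
n_true, n_false = selection_counts(beta_hat, true_idx=[0, 1, 2])

# Hypothetical test-set counts and model predictions.
ssr = sum_squared_residuals([1, 2, 3], [1.5, 2.0, 2.0])
```

In this toy case two of the three true predictors and two of the three null predictors receive nonzero coefficients.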

Results

Simulations were performed as described in “Simulations” of the Methods section, and Figures 1–3 show the results of the simulations. Figure 1 shows the distribution of the number of predictors correctly identified as nonzero over 100 simulations and the types of models used. The data were generated using both n = 30 and n = 80 observations. The median number of correctly identified nonzero coefficients with no offset using GMIFS is 1 (range = 0, 3) for n = 30 and 2 (range = 0, 4) for n = 80. Similarly, the median number of correctly identified nonzero coefficients with no offset using glmpath with Poisson family is 1 (range = 0, 5) for n = 30 and 2 (range = 0, 4) for n = 80. This number increases slightly when using the glmpath with Gaussian family to a median of 2 (range = 0, 5) for n = 30 and 4 (range = 2, 5) for n = 80. All the numbers are similar when an offset is used to generate the data. The median number of correctly identified non-zero coefficients using GMIFS is 0 (range = 0, 3) for n = 30 and 1 (range = 0, 4) for n = 80. The median number of correctly identified non-zero coefficients using glmpath with Poisson family is 0 (range = 0, 3) for n = 30 and 1 (range = 0, 4) for n = 80. Once again the medians increase when using the glmpath with Gaussian family to 2 (range = 0, 5) for n = 30 and 4 (range = 2, 5) for n = 80.

Figure 1

Number of predictors correctly identified as nonzero. This figure shows the distribution of the number of predictors correctly identified as nonzero over 100 simulations. There were five predictors that were set as nonzero. Boxplots are separated by the type of distribution used to generate the data and the number of observations.

Figure 3

Accuracy of count predictions. This figure shows the distribution of the sum of residuals squared over 100 simulations using a learning data set and an independent test data set. Boxplots are separated by the type of model fit to the data and the number of observations. The results for glmpath with Gaussian family using an offset are not displayed because both values are above 50000.

Figure 2 shows the distribution of the number of predictors incorrectly identified as nonzero over 100 simulations and the types of models used. The data were generated using both n = 30 and n = 80 observations. The median number of incorrectly identified nonzero coefficients with no offset using GMIFS is 3 (range = 0, 15) for n = 30 and 7 (range = 0, 28) for n = 80. Similarly, the median number of incorrectly identified nonzero coefficients with no offset using glmpath with Poisson family is 3 (range = 0, 17) for n = 30 and 7 (range = 0, 41) for n = 80. This number increases when using the glmpath with Gaussian family to a median of 26 (range = 23, 28) for n = 30 and 74 (range = 73, 76) for n = 80. All results are similar when an offset is used to generate the data. The median number of incorrectly identified non-zero coefficients using GMIFS is 2 (range = 0, 14) for n = 30 and 5 (range = 0, 26) for n = 80. The median number of incorrectly identified non-zero coefficients using glmpath with Poisson family is 2 (range = 0,24) for n = 30 and 4.5 (range = 0, 31) for n = 80. Once again the medians increase when using the glmpath with Gaussian family to 26 (range = 23, 28) for n = 30 and 74 (range = 72, 76) for n = 80.

Figure 2

Number of predictors incorrectly identified as nonzero. This figure shows the distribution of the number of predictors incorrectly identified as nonzero over 100 simulations. There were 95 predictors for which their coefficients were set to zero. Boxplots are separated by the type of distribution used to generate the data and the number of observations.

Figure 3 shows the distribution of the sum of residuals squared as a measure of the model prediction accuracy. The data were generated using both n = 30 and n = 80 observations. For both sample sizes, a learning data set was used to estimate coefficients and then the model was applied to an independent test data set. The median accuracy with no offset using GMIFS is 133 (range = 68, 240) for n = 30 and 325 (range = 188, 699) for n = 80. Similarly, the median accuracy with no offset using glmpath with Poisson family is 142 (range = 55, 254) for n = 30 and 333 (range = 185, 1666) for n = 80. The median accuracy with no offset using glmpath with Gaussian family is 206 (range = 90, 383) for n = 30 and 1503 (range = 535, 3772) for n = 80. The numbers are different when an offset is used to generate the data. The median accuracy using GMIFS is 80 (range = 30, 185) for n = 30 and 205 (range = 137, 367) for n = 80. The median accuracy using glmpath with Poisson family is 80 (range = 33, 805) for n = 30 and 206 (range = 126, 339) for n = 80. The median accuracy with an offset using glmpath with Gaussian family for both sample sizes is above 50000.

Gene expression analysis

Both GMIFS and glmpath models were applied to the cord blood gene expression data set described under “Data” of Methods section. For glmpath, the Poisson family option was used and the lambda2 option was set to zero. For GMIFS, the default options were chosen. The response in the model was MN counts, and the predictors were the gene expression intensities. Gender was included in the model as part of the unpenalized subset. Based on Figure 4, a Poisson distribution was assumed for both models because the data appear skewed. The final model parameters were chosen using the minimum Akaike information criterion. The GMIFS model identified 17 nonzero gene expression coefficients as associated with MN count and the glmpath with Poisson family identified 23. Out of the genes that were identified, 10 were common to both models. Figures 5 (sum of squared residuals = 101.7) and 6 (sum of squared residuals = 1.8) show that both models seem to predict MNs relatively well. Table 1 shows the genes that both models identified as being associated with MN count and the types of cancer with which they are linked. Nine out of the 10 genes in common between both models are linked to some type of cancer.
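Choosing the final model by minimum Akaike information criterion amounts to scoring every step along the coefficient path and keeping the step with the smallest score. The sketch below assumes that the parameter count at each step is the number of nonzero coefficients, including the intercept and unpenalized terms; that counting rule and the path values are illustrative assumptions, not details reported in the text.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik

def select_min_aic(path):
    """path: list of (loglik, n_params) tuples, one per step of the fit."""
    scores = [aic(ll, k) for ll, k in path]
    best = min(range(len(path)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical path: the log-likelihood improves as more coefficients enter.
path = [(-60.0, 1), (-50.0, 3), (-49.5, 6)]
best_step, best_aic = select_min_aic(path)
```

The third step has the highest log-likelihood, but its extra parameters are not worth the small gain, so the second step is selected.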

Figure 4

Histogram of MN counts.

Figure 5

Plot of actual MN counts versus predicted MN counts using GMIFS.

Table 1

Genes identified as associated with MN count by both GMIFS and glmpath.

PROBE ID	GENE SYMBOL	GENE NAME	ASSOCIATED WITH CANCER	GMIFS	GLMPATH
A-23-P100196	USP10	ubiquitin specific peptidase 10	Glioblastoma multiforme14	X	X
A-23-P138967	SDHD	succinate dehydrogenase complex	Tumor Suppressor15	X	X
A-23-P42331	HMGA1	high mobility group AT-hook 1	Pancreatic Adenocarcinoma16	X	X
A-23-P9293	TJP2	tight junction protein 2	Breast17	X	X
A-24-P19410	CBX7	chromobox homolog 7	Carcinomas18	X	X
A-24-P214858	TREML2	triggering receptor expressed on myeloid cells-like 2	Pancreatic19	X	X
A-24-P2463	WHSC1	Wolf-Hirschhorn syndrome candidate 1	Carcinogenesis20	X	X
A-24-P397584	TBCC	tubulin folding cofactor C	None Found	X	X
A-24-P398064	KIAA0258	KIAA0258	Colorectal21	X	X
A-32-P18547	C21ORF57	chromosome 21 open reading frame 57	Breast22	X	X
A-23-P103824	FAU	Finkel-Biskis-Reilly murine sarcoma virus (FBR-MuSV) ubiquitously expressed	None Found	X
A-23-P209394	CFLAR	CASP8 and FADD-like apoptosis regulator	Human cancers23	X
A-23-P79911	PSMF1	proteasome (prosome, macropain) inhibitor subunit 1 (PI31)	Breast24	X
A-24-P202567	ITPKC	inositol 1,4,5-trisphosphate 3-kinase C	Cervical25	X
A-24-P31235	EIF5A	eukaryotic translation initiation factor 5A	Chronic myeloid leukemia26	X
A-24-P405054	C1ORF144	chromosome 1 open reading frame 144	Mantle cell lymphoma27	X
A-32-P156549	C1ORF144			X
A-23-P118313	GABARAPL2	GABA(A) receptor-associated protein-like 2	Lung28		X
A-23-P143817	MYLK	myosin, light polypeptide kinase	Gastric29		X
A-23-P156809	LOC642880	similar to FKSG62	None Found		X
A-23-P394304	PDZK1IP1	PDZK1 interacting protein 1	Thyroid30		X
A-23-P39665	SLC11A1	solute carrier family 11, member 1	Esophageal31		X
A-23-P67529	KCNN4	potassium intermediate/small conductance calcium-activated channel, subfamily N, member 4	Colorectal32		X
A-24-P594683	LOC645592	similar to peptidylprolyl isomerase A isoform 1			X
A-24-P708161					X
A-24-P98086	GNA12	guanine nucleotide binding protein (G protein) alpha 12	Oral33		X
A-32-P10067					X
A-32-P137849					X
A-32-P169754	LOC145221	EST			X
A-32-P208078	MTHFR	5,10-methylenetetrahydrofolate reductase (NADPH)	Breast34		X

Discussion

We have described the GMIFS method for modeling a count response when we want to (1) coerce some variables into the model and (2) perform automatic variable selection and model estimation by penalizing predictors. High-throughput data contain more predictors than there are samples, so traditional methods are not appropriate in this setting. The GMIFS method was compared to glmpath, a popular penalization algorithm. Simulations showed that both methods performed similarly when identifying predictors known to be nonzero. GMIFS appeared to slightly outperform glmpath in the sense that GMIFS included fewer predictors that are truly unimportant in the model. Similarly, when applied to an independent data set, GMIFS appeared to have higher predictive accuracy. Thus, it appears that GMIFS is more generalizable than glmpath to independent data sets. Finally, both methods were applied to a cord blood gene expression data set. Gene expression profiles were used to predict MN frequency. Both models identified a similar number of genes as related to MN frequency. Further, 10 of those genes were common to both models. Nine out of the 10 genes have been shown to be associated with different types of cancers. Because MN count is a measure of DNA damage, genes associated with MN frequency would be expected to be linked to cancer. Both models appear to identify genes linked to cancer. As in the simulations, glmpath identified more genes as nonzero compared to GMIFS. In the simulations, this was because glmpath was including more predictors incorrectly. However, there is no way to know whether this is also the case in the cord blood data set, given that these data are observational and no further confirmatory studies can be performed on the samples.

29 in total

1. Cohort profile: the Norwegian Mother and Child Cohort Study (MoBa).

Authors: Per Magnus; Lorentz M Irgens; Kjell Haug; Wenche Nystad; Rolv Skjaerven; Camilla Stoltenberg
Journal: Int J Epidemiol Date: 2006-08-22 Impact factor: 7.196

2. Increased expression of thymidylate synthetase (TS), ubiquitin specific protease 10 (USP10) and survivin is associated with poor survival in glioblastoma multiforme (GBM).

Authors: Jessica M Grunda; L Burton Nabors; Cheryl A Palmer; David C Chhieng; Adam Steg; Tom Mikkelsen; Robert B Diasio; Kui Zhang; David Allison; William E Grizzle; Wenquan Wang; G Yancey Gillespie; Martin R Johnson
Journal: J Neurooncol Date: 2006-06-14 Impact factor: 4.130

3. Association of functional polymorphisms of SLC11A1 with risk of esophageal cancer in the South African Colored population.

Authors: Monique G Zaahl; Louise Warnich; Tommy C Victor; Maritha J Kotze
Journal: Cancer Genet Cytogenet Date: 2005-05

4. Histone lysine methyltransferase Wolf-Hirschhorn syndrome candidate 1 is involved in human carcinogenesis through regulation of the Wnt pathway.

Authors: Gouji Toyokawa; Hyun-Soo Cho; Ken Masuda; Yuka Yamane; Masanori Yoshimatsu; Shinya Hayami; Masashi Takawa; Yukiko Iwai; Yataro Daigo; Eiju Tsuchiya; Tatsuhiko Tsunoda; Helen I Field; John D Kelly; David E Neal; Yoshihiko Maehara; Bruce Aj Ponder; Yusuke Nakamura; Ryuji Hamamoto
Journal: Neoplasia Date: 2011-10 Impact factor: 5.715

Review 5. The HUman MicroNucleus Project--An international collaborative study on the use of the micronucleus technique for measuring DNA damage in humans.

Authors: M Fenech; N Holland; W P Chang; E Zeiger; S Bonassi
Journal: Mutat Res Date: 1999-07-16 Impact factor: 2.433

6. Loss of tight junction plaque molecules in breast cancer tissues is associated with a poor prognosis in patients with breast cancer.

Authors: Tracey A Martin; Gareth Watkins; Robert E Mansel; Wen G Jiang
Journal: Eur J Cancer Date: 2004-12 Impact factor: 9.162

7. HUMN project: detailed description of the scoring criteria for the cytokinesis-block micronucleus assay using isolated human lymphocyte cultures.

Authors: M Fenech; W P Chang; M Kirsch-Volders; N Holland; S Bonassi; E Zeiger
Journal: Mutat Res Date: 2003-01-10 Impact factor: 2.433

8. Hypusination of eukaryotic initiation factor 5A (eIF5A): a novel therapeutic target in BCR-ABL-positive leukemias identified by a proteomics approach.

Authors: Stefan Balabanov; Artur Gontarewicz; Patrick Ziegler; Ulrike Hartmann; Winfried Kammer; Mhairi Copland; Ute Brassat; Martin Priemer; Ilona Hauber; Thomas Wilhelm; Gerold Schwarz; Lothar Kanz; Carsten Bokemeyer; Joachim Hauber; Tessa L Holyoake; Alfred Nordheim; Tim H Brümmendorf
Journal: Blood Date: 2006-09-28 Impact factor: 22.113
