Matti Nykter, Tommi Aho, Miika Ahdesmäki, Pekka Ruusuvuori, Antti Lehmussola, Olli Yli-Harja.
Abstract
BACKGROUND: Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed.
Year: 2006 PMID: 16848902 PMCID: PMC1574357 DOI: 10.1186/1471-2105-7-349
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Microarray simulation model. Block diagram of the microarray simulation model.
List of noise parameters. Noise parameters available in the microarray simulation model.
| kernel | Kernel used to model the population effect. |
| copies | Number of times the population effect is applied. |
| errormodel | Error model to be used; each error model has its own parameters, see Table 5. |
List of slide parameters. Overview of the slide simulation parameters. More detailed documentation of the parameters is available on the companion web page [19].
| Type of the slide (single or two channel). | |
| Model used for the spot: circle, Gaussian, hyperbolic. | |
| Maximum width/height of the area for the spot in pixels. | |
| Probability for a spot to drift (move) from designated location. This parameter models random movement. See parameter | |
| Maximum allowed movement bias from designated location, movement in x-axis | |
| Mean radius of the simulated spot. Spot radius is drawn from | |
| Allowed variation (variance) of the spot size. | |
| If set, print tip leaves a mark to the spot. | |
| Probability for print tip mark to be visible in a spot. | |
| Maximum height of the print tip mark, print tip height is drawn from | |
| Maximum width of the print tip mark, print tip width is drawn from | |
| Maximum of how much print tip mark is allowed to drift from spot center. Movement in x-axis | |
| Probability for a spot to suffer from a chord cut. | |
| Maximum number of chord cuts from a spot. | |
| Maximum depth of the chord cut, cut depth is drawn from | |
| Number of slides to be generated. | |
| Time points when slides are made. This is relevant only for time series data. | |
| Number of channels (different dyes) on the slide. | |
| Total number of spots on the slide. | |
| Number of rows of spots on the slide. | |
| Number of columns of spots on the slide. | |
| Subarray layout on the slide i.e. number of (subarray)rows and (subarray)columns. | |
| Space between individual subarrays on the slide. | |
| Parameter used to control the subarray curving (i.e. systematic drift in spot printing). | |
| Maximum distance the bin is allowed to curve, curvature parameter is drawn from | |
| Number of spots in each subarray. | |
| Number of rows in subarrays. | |
| Number of columns in subarrays. |
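
The spot models named in the table above (circle and Gaussian) can be illustrated with a small sketch. This is not the simulator's implementation: the function name `render_spot`, its arguments, and the choice to treat the radius as the standard deviation of the Gaussian profile are assumptions made for illustration, and the hyperbolic model is omitted because the source does not define it.

```python
import math


def render_spot(size, radius, intensity, model="circle"):
    """Render one spot on a size x size pixel grid (hypothetical helper).

    'circle' fills a disc of the given radius with a constant intensity.
    'gaussian' uses a Gaussian profile; using the radius as the standard
    deviation is an assumption, not taken from the paper.
    """
    center = (size - 1) / 2.0
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            d2 = (x - center) ** 2 + (y - center) ** 2
            if model == "circle":
                value = intensity if d2 <= radius ** 2 else 0.0
            elif model == "gaussian":
                value = intensity * math.exp(-d2 / (2.0 * radius ** 2))
            else:
                raise ValueError("unsupported spot model: " + model)
            row.append(value)
        image.append(row)
    return image


circle = render_spot(5, 1.0, 1.0, "circle")
gauss = render_spot(5, 1.0, 1.0, "gaussian")
```

The circle model has a hard edge, while the Gaussian profile decays smoothly with distance from the spot center.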
List of hybridization parameters. Overview of the hybridization effect parameters.
| Multiplicative Gaussian hybridization noise variance. Hybridization noise is drawn from | |
| If set, hybridization errors are included in simulation. | |
| Percent of the intensity values covered by the background noise. | |
| Background noise variance, relative to background noise mean determined using | |
| Gradient (noise pattern) for background noise. | |
| Number of scratches on the slide. | |
| Maximum length of the scratch, scratch length is drawn from | |
| Width of the scratch. | |
| Number of air bubbles visible on the slide. | |
| Mean for the air bubble radius, drawn from | |
| Allowed variation (variance) for air bubble size radius. | |
| Percent of spots having dye outside spot area (bleeding). | |
| Size of the spot bleed (how many times the spot size). | |
| How far from the origin the bleeding goes. |
List of scanner parameters. Overview of the scanner effect parameters.
| Scanner power is used for histogram equalization, more power yields brighter image. | |
| The dynamic range of the scanner. Intensity values are quantized to | |
| If set, histogram equalization is applied. | |
| Threshold parameter for quantization, values over the threshold are saturated. | |
| Number of channel that is considered as red dye. | |
| Number of channel that is considered as green dye. | |
| If set, scanner errors are applied. | |
| Angle at which the slide is scanned. | |
| Misalignment between red and green channel. |
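
The quantization and saturation behaviour described by the scanner parameters can be sketched as follows. The function name `quantize`, the `bits` argument, and the mapping of the threshold onto the maximum level are assumptions for illustration; the source does not state the exact quantization rule.

```python
def quantize(intensities, bits=16, threshold=1.0):
    """Quantize intensities to integer levels with saturation (sketch).

    Values at or above the threshold saturate to the maximum level;
    negative values are clipped to zero. The number of levels, 2**bits,
    is an assumed stand-in for the scanner's dynamic range.
    """
    max_level = 2 ** bits - 1
    levels = []
    for value in intensities:
        if value >= threshold:
            levels.append(max_level)
        else:
            scaled = max(value, 0.0) / threshold
            levels.append(int(round(scaled * max_level)))
    return levels


levels = quantize([-0.1, 0.0, 0.2, 1.0, 2.0], bits=8, threshold=1.0)
```

Lowering the threshold saturates more of the bright spots, which mimics a scanner whose gain is set too high.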
List of error models. Error models (EM) and the parameters of each implemented error model. Noise-free input data is denoted by x and the noisy output data by y. Index i refers to gene, j to array (chip), and k to biological sample specific noise. Index p refers to a specific probe within a probe set.
| Model | Additive Gaussian noise is added to the data. |
| Mean of the additive Gaussian noise. Noise is drawn from | |
| Variance of the additive Gaussian noise. | |
| Model | Additive Gaussian noise is added to the data with given signal-to-noise ratio. |
| Mean of the additive Gaussian noise. | |
| SNR | Signal-to-noise ratio after the noise is added. |
| Model | |
| Binding efficiency of each probe | |
| Gene specific bias | |
| Gene and chip specific error | |
| Multiplicative gene and chip specific noise | |
| Model | In log scale |
| Chip specific bias | |
| Gene and chip specific error | |
| Model | In log scale |
| Independent random noise | |
| Gene specific noise | |
| Chip specific noise | |
| Gene and chip specific noise | |
| Gene, chip and biological sample specific noise | |
| Model | |
| Multiplicative noise | |
| Additive independent noise | |
| Background noise (bias) | |
| Model | |
| True expression signal log( | |
| Hybridization error term log( | |
| Variance | |
| Fractional binding |
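
As an illustration of the second error model in the table above (additive Gaussian noise with a given signal-to-noise ratio), the following sketch scales the noise variance from the mean signal power. It assumes the SNR is expressed in decibels and defined as mean signal power over noise variance; the source does not state either convention, and the function names are hypothetical.

```python
import math
import random


def add_gaussian_noise_snr(signal, snr_db, mean=0.0, seed=None):
    """Add Gaussian noise so that power/variance matches snr_db (assumed dB)."""
    rng = random.Random(seed)
    power = sum(x * x for x in signal) / len(signal)
    noise_variance = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_variance)
    return [x + rng.gauss(mean, sigma) for x in signal]


def empirical_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(power / noise_power)


clean = [math.sin(0.01 * i) + 2.0 for i in range(20000)]
noisy = add_gaussian_noise_snr(clean, snr_db=10.0, seed=1)
```

For a long enough signal, `empirical_snr_db(clean, noisy)` should land close to the requested value, which gives a quick sanity check on the chosen convention.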
Input data requirements. Requirements for the simulator input data used in microarray simulation.
| data | Expression values or ratios measured for probes (genes). One value for each time instant per probe is required. |
| time | Time instants when the expression values are obtained. |
| genes | Names of the probes. |
| spot | Location of each probe on the slide (x and y coordinate). |
| name | Name of the dataset. |
| type | Type of the input data i.e. cDNA or oligonucleotide expression or ratios. |
| scale | Scale of the input data, i.e. log or linear scale. |
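
The input requirements listed above can be summarized as a small data container. The field names follow the table; the class name `SimulatorInput`, the Python types, and the consistency checks are assumptions for illustration and do not describe the simulator's actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SimulatorInput:
    """Hypothetical container mirroring the input data table."""
    data: List[List[float]]      # one row per probe, one value per time instant
    time: List[float]            # time instants of the measurements
    genes: List[str]             # probe names
    spot: List[Tuple[int, int]]  # (x, y) location of each probe on the slide
    name: str                    # dataset name
    type: str                    # e.g. cDNA or oligonucleotide, expression or ratios
    scale: str                   # "log" or "linear"

    def __post_init__(self):
        n_probes = len(self.genes)
        if len(self.data) != n_probes or len(self.spot) != n_probes:
            raise ValueError("data, genes and spot must have one entry per probe")
        for row in self.data:
            if len(row) != len(self.time):
                raise ValueError("each probe needs one value per time instant")
        if self.scale not in ("log", "linear"):
            raise ValueError("scale must be 'log' or 'linear'")


example = SimulatorInput(
    data=[[1.0, 2.0], [0.5, 0.7]],
    time=[10.0, 200.0],
    genes=["g1", "g2"],
    spot=[(0, 0), (0, 1)],
    name="demo",
    type="cDNA",
    scale="log",
)
```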
Figure 2. Slide generation errors. Errors in slide image generation are demonstrated. The spot size varies widely, and many spots have non-ideal shapes.
Figure 3. Drift in spot alignment. Systematic drift due to non-ideal printing of the microarray slide can be introduced with user-adjustable parameters.
Figure 4. Simulated spots. Shape of (a) simulated noise-free cDNA spot, (b) noisy cDNA spot, and (c) noisy single-channel oligonucleotide array spot. Intensity of the spot is determined by the corresponding expression value.
Figure 5. Hybridization errors. Hybridization errors are demonstrated. Spot bleeding, scratches, air bubbles and background noise are clearly visible.
Figure 6. Simulated ground truth signals. Gene expression profiles of the selected genes. The effect of the gene knockout on the expression profiles is clearly observable. The reference signal is shown with a solid line and the test signal with a dashed line.
Figure 7. Simulated ground truth signals with noise added. Gene expression profiles of the same genes as in Figure 6 with measurement noise added. Small trends in the signals (for example, in the lowest subfigure) are masked by noise. The reference signal is shown with a solid line and the test signal with a dashed line.
Figure 8. Simulated spotted microarray slide images. Simulated slide images at time instants (a) 10 and (b) 200 minutes. Several error sources, such as spot size and shape variation and bleeding, are included in the simulation. The slide image on the right (b) additionally includes non-ideal subarray alignment and scratches.
Figure 9. Scatter plots of the simulated data. An MA-plot is shown for (a) noise-free simulated data, (b) noise-free data extracted from the slide, and (c) realistic simulated data. A lowess fit is shown over each scatter plot to illustrate the trends in the data. All data points extracted from the slide are scaled to the [0,1] interval, so the scale of the x-axis differs from that of the simulated noise-free data.
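
For readers unfamiliar with MA-plots, the sketch below computes the plotted quantities using the standard definition for two-channel data: M is the log ratio and A the mean log intensity of the two channels. The function name and the choice to skip non-positive intensities are assumptions; the lowess fit is not reproduced here.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green channel intensities.

    M = log2(R) - log2(G), A = 0.5 * (log2(R) + log2(G)).
    Pairs with a non-positive intensity are skipped because the
    logarithm is undefined for them (a simplifying assumption).
    """
    m_values, a_values = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)
        a_values.append(0.5 * (log_r + log_g))
    return m_values, a_values


m, a = ma_values([8.0, 4.0, 0.0], [2.0, 4.0, 1.0])
```

Plotting M against A puts intensity-dependent trends on display: unbiased data scatter around M = 0 across the whole range of A.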
Figure 10. Example of an Affymetrix microarray simulation. (a) Example of a simulated single-channel oligonucleotide microarray slide image (crop from the top left corner). An Affymetrix .cel file was used as the ground truth data, so the text indicating the slide type is visible. (b) A real Affymetrix slide image is shown for comparison.
Figure 11. Slide image segmentation examples. One subarray from each of the images used to test the segmentation algorithms is shown. From left to right: (a) high quality slide, (b) noisy slide with artifacts, and (c) disturbing noise and artifacts over the slide. The increase in noise and the degradation of spot quality are clearly observable.
Figure 12. Results of segmentation example. The spot intensities estimated from the simulated images with the fixed circle (first row), the histogram segmentation (second row), and the seeded region growing (third row) segmentation algorithms are plotted against the input data (reference). The plots are from the first channel of the test images: (a) intensities for the high quality image given by the three segmentation algorithms, (b) intensity plots for the image with noise and errors, (c) plots for the image with disturbing noise and artifacts.
Segmentation results. Correlation coefficients between the estimated spot intensities and the input data for the fixed circle (FC), histogram (HST), and seeded region growing (SRG) segmentation algorithms. Histogram segmentation gives the highest correlation with the reference data. All methods give poorer correlations as the image quality is degraded.
| Algorithm | Results for image 1 | Results for image 2 | Results for image 3 |
| FC | 0.9952 | 0.9112 | 0.8452 |
| HST | 0.9962 | 0.9860 | 0.9432 |
| SRG | 0.9876 | 0.9602 | 0.8680 |
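
A correlation coefficient of the kind reported above can be computed as in the following sketch. It assumes the Pearson correlation, which the source does not state explicitly, and uses made-up example values rather than data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only: reference intensities and a segmentation estimate.
reference = [1.0, 2.0, 3.0, 4.0]
estimated = [2.0, 4.0, 6.0, 8.0]
r = pearson(reference, estimated)
```

Because the simulator supplies the reference intensities, this comparison can be repeated for any segmentation algorithm and any noise setting.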