Angela P Presson, Eric M Sobel, Paivi Pajukanta, Christopher Plaisier, Daniel E Weeks, Karolina Aberg, Jeanette C Papp.
Abstract
BACKGROUND: Correctly merged data sets that have been independently genotyped can increase statistical power in linkage and association studies. However, alleles from microsatellite data sets genotyped with different experimental protocols or platforms cannot be accurately matched using base-pair size information alone. In a previous publication we introduced a statistical model for merging microsatellite data by matching allele frequencies between data sets. These methods are implemented in our software MicroMerge version 1 (v1). While MicroMerge v1 output can be analyzed by some genetic analysis programs, many programs cannot analyze alignments that do not match alleles one-to-one between data sets. A consequence of such alignments is that codominant genotypes must often be analyzed as phenotypes. In this paper we describe several extensions that are implemented in MicroMerge version 2 (v2).
Year: 2008 PMID: 18644149 PMCID: PMC2515855 DOI: 10.1186/1471-2105-9-317
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Allele frequency histograms for the same marker X typed at different laboratories on two similar sets of samples (Data set 1 and Data set 2).
Figure 2. Example of a MicroMerge alignment for two data sets. Data set 1 has four bins, data set 2 has three bins, and the data sets are aligned to four theoretical alleles (TA).
Figure 3. Example of a one-to-one MicroMerge v2 alignment for the data sets in Figure 2. In this case C' in data set 2 corresponds only with bin C in data set 1. TA is an abbreviation for "theoretical alleles".
Overview of the new features incorporated into MicroMerge v2 and guidelines for when to use them
| Description | Guidelines |
| Creates flexible output files and provides a more suitable alignment format for most data sets. Both one-to-one and lumped alignment formats are available in v2. | The lumped format may be preferred for data sets that will be analyzed with Mendel and have 1) few rare bins (<5–10% of bins with <6 instances), 2) discrepant numbers of unique bins between data sets (most markers differ by 2–3 bins), 3) genotyping that was done on platforms with different resolving power, or 4) other situations where bin frequencies don't match well. Otherwise, use the default one-to-one format. |
| Allows more than one one-to-one translation from each lumped alignment. | This feature was not useful for our test data sets but can potentially increase alignment posterior probability for markers that have many one-to-one translations with competitive likelihoods. The default value is one one-to-one alignment translation per lumped alignment. |
| Improves alignment of markers that have low posterior probabilities and rare bins by zeroing these bins and re-merging the data. Results in a second set of merged data files. | Three parameters control marker selection for re-merging: 1) a low alignment posterior probability (< 0.425) and bin pair(s) that have 2) low overlap (< 0.85) and 3) low theoretical allele frequencies (< 0.015). The user can control the frequency of re-merging by adjusting these three parameters (from the above default values) in the control file. |
| Controls emphasis on alignments that have fewer theoretical alleles. | Allows |
| Improves alignment of data sets with unreliable allele frequencies. | Enables alignment of small data sets and data sets from different ethnic groups. If reliable population allele frequencies are available, then this feature should be used to improve alignments. |
| Allows simultaneous alignment of more than two data sets. | All data sets should be merged simultaneously. |
| Provides another measure of alignment quality that is more general than the posterior probability. | Applicable to lumped alignments only; more than 90% of lumped alignments should reject the LRT null hypothesis (LRT = 1). |
Figure 4. Example of a lumped MicroMerge alignment and its corresponding one-to-one alignments, where TA = theoretical alleles, DS1 = data set 1 bins, and DS2 = data set 2 bins.
Example of a top alignment for marker D4S291 with a low posterior probability (0.235) due to four consecutive low-frequency alleles (theoretical alleles 5–8).
| Data 1 | Data 2 | TA | TF | Overlap Probability |
| A | A' | 1 | 0.5442 | 1.000 |
| B | B' | 2 | 0.0281 | 0.997 |
| C | C' | 3 | 0.0453 | 1.000 |
| D | D' | 4 | 0.2859 | 1.000 |
| E | E' | 5 | 0.0104 | 0.552 |
| E | F' | 6 | 0.0075 | 0.806 |
| E | G' | 7 | 0.0198 | 0.798 |
| F | H' | 8 | 0.0114 | 0.698 |
| F | I' | 9 | 0.0473 | 1.000 |
TA is an abbreviation for "theoretical alleles" and TF is an abbreviation for "theoretical frequencies".
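The three default re-merging thresholds listed in the feature overview (posterior probability < 0.425, overlap < 0.85, theoretical allele frequency < 0.015) can be applied to the D4S291 alignment above. The following is a minimal sketch only: the class `AlignedPair` and the function `pairs_to_zero` are hypothetical names introduced for illustration, and the sketch assumes that a bin pair is flagged when it falls below both the overlap and frequency thresholds. It does not reproduce how MicroMerge v2 itself zeroes bins or re-merges the data.

```python
from dataclasses import dataclass

# Default thresholds quoted in the feature overview (assumed strict "<").
MAX_POSTERIOR = 0.425  # marker qualifies only if its posterior is below this
MAX_OVERLAP = 0.85     # bin pair must have overlap probability below this
MAX_FREQ = 0.015       # and theoretical allele frequency below this


@dataclass
class AlignedPair:
    """One row of an alignment table (hypothetical container)."""
    bin1: str       # bin label in data set 1
    bin2: str       # bin label in data set 2
    ta: int         # theoretical allele index
    tf: float       # theoretical allele frequency
    overlap: float  # overlap probability of the bin pair


def pairs_to_zero(posterior, pairs,
                  max_posterior=MAX_POSTERIOR,
                  max_overlap=MAX_OVERLAP,
                  max_freq=MAX_FREQ):
    """Return the bin pairs that meet all three re-merging criteria."""
    if posterior >= max_posterior:
        return []  # the marker aligned well enough; nothing to flag
    return [p for p in pairs
            if p.overlap < max_overlap and p.tf < max_freq]


# Rows copied from the D4S291 table above (posterior probability 0.235).
D4S291 = [
    AlignedPair("A", "A'", 1, 0.5442, 1.000),
    AlignedPair("B", "B'", 2, 0.0281, 0.997),
    AlignedPair("C", "C'", 3, 0.0453, 1.000),
    AlignedPair("D", "D'", 4, 0.2859, 1.000),
    AlignedPair("E", "E'", 5, 0.0104, 0.552),
    AlignedPair("E", "F'", 6, 0.0075, 0.806),
    AlignedPair("E", "G'", 7, 0.0198, 0.798),
    AlignedPair("F", "H'", 8, 0.0114, 0.698),
    AlignedPair("F", "I'", 9, 0.0473, 1.000),
]

flagged = pairs_to_zero(0.235, D4S291)
print([p.ta for p in flagged])  # [5, 6, 8]
```

Under these assumptions, theoretical alleles 5, 6 and 8 are flagged; allele 7 has low overlap (0.798) but its frequency (0.0198) is above the 0.015 default, so it is not.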
Three simulated data sets each consisting of N genotype samples from the given distribution of population allele frequencies.
| Population Alleles | Allele Frequencies | Data | Data | Data |
| 1 | 0.005 | 0 | ||
| 2 | 0.010 | 0 | ||
| 3 | 0.185 | |||
| 4 | 0.800 |
These data sets were used to test the success of aligning multiple data sets simultaneously. Data set B was aligned with data set C, and then the combined data set was aligned with data set A. This alignment was then compared to the simultaneous alignment of data sets A, B, and C.
Figure 5. Accuracy of MicroMerge v2 alignments for the 48 real data project markers, as compared to a manually obtained reference alignment, for the lumped alignment format (a) and the one-to-one format (b). The manual alignment was obtained by eye (without the aid of a merging program) by using bin frequencies, relative base-pair size differences, and eight samples common to both data sets. Each plot shows the posterior probability of each marker plotted against its number of unique bins (summed from both data sets), where a black circle indicates agreement between the MicroMerge v2 alignment and its corresponding reference alignment and a white circle indicates disagreement. The horizontal line at 0.425 represents the same acceptance threshold for comparison between the methods. For the real data project, the one-to-one alignment format (b) is more accurate than the lumped alignment format (a) because all correct MicroMerge v2 alignments had posterior probabilities above the threshold and only three markers were misaligned. In comparison, the lumped format had four correctly aligned markers with posterior probabilities that fell below the threshold, and eight markers in total were misaligned.
MicroMerge v2 lumped alignment results for nine poorly aligned markers before and after re-merging.
| Marker | # Bins Lab1 | # Bins Lab2 | Original Pr(A) | Original Dist. (A vs. M) | Re-merged Pr(A) | Re-merged Dist. (A vs. M) | # Zeroed |
| D4S2961 | 6 | 9 | 0.248 | 0 | 0.958 | 0 | 3 |
| D4S250 | 9 | 10 | 0.413 | 0 | 0.957 | 0 | 1 |
| D4S2460 | 6 | 8 | 0.282 | 4 | 0.606 | 2 | 1 |
| D4S1637 | 7 | 8 | 0.34 | 5 | 0.539 | 3 | 1 |
| D4S3043 | 8 | 9 | 0.209 | 2 | 0.529 | 0 | 1 |
| AFMA082Y | 8 | 11 | 0.32 | 0 | 0.509 | 0 | 1 |
| D4S2380 | 11 | 16 | 0.086 | 6 | 0.509 | 2 | 3 |
| Average | 8.92 | 0.25 | 2.83 | 0.61 | 1.17 | 1.67 | |
| Total | 107 | 17 | 7 | 10 | |||
In the Original columns, A is the original MicroMerge alignment and M is its corresponding manually obtained alignment; in the Re-merged columns, A is the re-merged MicroMerge v2 alignment and M is its corresponding manually obtained alignment.
*Posterior probability remained low after dropping problematic alleles and thus these markers' values were excluded from the 'Average' and 'Total' summaries.
Results for ten markers aligned with the one-to-one format for three different priors on the number of theoretical alleles.
| Group | Locus | # Bins Ds1 | # Bins Ds2 | Default Pr(A1) | Default Dist. | Min. Pr(A1) | Min. Dist. | Max. Pr(A1) | Max. Dist. |
| a.) | D4S1531 | 8 | 8 | 0.962 | 0 | 0.987 | 0 | 0.975 | 0 |
| | D4S1578 | 8 | 8 | 0.980 | 0 | 0.975 | 0 | 0.983 | 0 |
| | D4S1591 | 6 | 6 | 0.997 | 0 | 0.989 | 0 | 0.993 | 0 |
| b.) | D4S1534 | 10 | 11 | 0.601 | 0 | 0.849 | 0 | 0.491 | 0 |
| | D4S2909 | 4 | 5 | 0.649 | 0 | 0.953 | 0 | 0.543 | 0 |
| | D4S2966 | 6 | 7 | 0.665 | 0 | 0.884 | 0 | 0.579 | 0 |
| c.) | D4S1637 | 7 | 8 | 0.353 | 3 | 0.431 | 0 | 0.395 | 3 |
| | D4S2380 | 11 | 16 | 0.149 | 7 | 0.216 | 11 | 0.112 | 11 |
| | D4S3042 | 13 | 16 | 0.304 | 10 | 0.326 | 12 | 0.220 | 8 |
| d.) | D4S1517 | 9 | 11 | 0.593 | 0 | 0.926 | 12 | 0.557 | 0 |
Group a.) consists of markers where the top alignment contained n + 1 theoretical alleles; group b.) consists of markers whose second-best alignment had n + 1 theoretical alleles; group c.) consists of markers with incorrect alignments; and group d.) is an additional marker found to vary in alignment accuracy for different priors.
Figure 6. The multiple data set alignment feature was tested by comparing a) the simultaneous alignment of the three simulated data sets in Table 3 to b) the result from merging data set B with data set C and then merging the combined data set with data set A. The theoretical allele frequencies ("Freq.") and overlap probabilities ("Overlap") are provided for each alignment, where a bin i from lab 1 and a bin j from lab 2 overlap if they both align with one or more of the same theoretical alleles. Their overlap probability o can be estimated by the fraction of the sampled alignments in which overlap occurs. This figure illustrates the importance of merging all data sets simultaneously rather than conducting a series of pair-wise merges. (a) Simultaneous alignment of all three data sets gave the correct alignment with posterior probability 0.55. This posterior probability is lower than that of the alignment presented in part (b) because the posterior probability of the alignment of D with D was low (0.509). (b) The alignment of data set B with data set C was incorrect, but MicroMerge finds a high posterior probability for their alignment (0.997) because the bin frequencies match well. Since this alignment was not accurate, the subsequent alignment of the combined data set with data set A was also inaccurate (posterior probability 0.697).
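The overlap probability described in the Figure 6 caption is the fraction of sampled alignments in which two bins share at least one theoretical allele. A minimal sketch of that estimate follows. The representation of a sampled alignment as a pair of dictionaries (bin label to set of theoretical alleles) and the function name `overlap_probability` are assumptions made for illustration, not MicroMerge's internal data structures, and the four samples are toy values.

```python
def overlap_probability(samples, bin_i, bin_j):
    """Estimate the overlap probability of bin_i (lab 1) and bin_j (lab 2).

    `samples` is a list of sampled alignments. Each alignment is a pair
    (lab1_map, lab2_map), where each map sends a bin label to the set of
    theoretical alleles that bin is aligned with (assumed representation).
    """
    if not samples:
        raise ValueError("at least one sampled alignment is required")
    hits = 0
    for lab1_map, lab2_map in samples:
        # Two bins overlap if they share one or more theoretical alleles.
        if lab1_map.get(bin_i, set()) & lab2_map.get(bin_j, set()):
            hits += 1
    return hits / len(samples)


# Four toy sampled alignments for one lab 1 bin "E" and one lab 2 bin "E'".
toy_samples = [
    ({"E": {5, 6}}, {"E'": {5}}),  # share allele 5: overlap
    ({"E": {5, 6}}, {"E'": {5}}),  # share allele 5: overlap
    ({"E": {6}},    {"E'": {5}}),  # no shared allele: no overlap
    ({"E": {5}},    {"E'": {5}}),  # share allele 5: overlap
]

print(overlap_probability(toy_samples, "E", "E'"))  # 0.75
```

Bins absent from a sampled alignment are treated as aligned to no theoretical allele, so they never count as overlapping in that sample.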
Mendel's gamete competition analysis P-values for three data sets (Dutch, Finn1 and Finn2) analyzed separately and combined using the a) one-to-one and b) lumped alignment formats.
| a) | Gamete Competition (GC) P-value | | | 1:1 Format Results: GC P-value (posterior probability of top alignment) | | | |
| Locus | | | | i) D-F1-F2 | ii) F1-F2 | iii) D-F1 | iv) D-F2 |
| D11S1984 | 0.118 | 0.191 | 0.717 | 0.314 (0.59) | 0.689 (0.89) | 0.188 (0.65) | 0.487 (0.99) |
| D11S2362 | 0.318 | 0.442 | 0.169 | 0.127 (0.77) | 0.181 (0.57) | 0.914 (0.55) | 0.057 (0.89) |
| D11S1999 | 0.665 | 0.760 | 0.965 | 0.790 (0.90) | 0.925 (0.82) | 0.636 (0.61) | 0.900 (0.64) |
| D11S1981 | 0.235 | 0.477 | 0.675 | 0.781 (0.40) | 0.801 (0.54) | 0.548 (0.44) | 0.796 (0.61) |
| D11S1392 | 0.614 | 0.599 | 0.102 | 0.068 (1.00) | 0.140 (1.00) | 0.827 (1.00) | 0.035 (1.00) |
| D11S2000 | 0.843 | 0.373 | 0.132 | 0.694 (0.22) | 0.409 (0.49) | 0.889 (0.65) | 0.414 (0.52) |
| D11S1998 | 0.090 | 0.016 | 0.859 | 0.044 (0.65) | 0.273 (0.94) | 0.010 (0.94) | 0.148 (0.58) |
| D11S4464 | 0.467 | 0.132 | 0.302 | 0.102 (0.30) | 0.142 (0.86) | 0.080 (0.27) | 0.289 (0.88) |
| D11S912 | 0.567 | 0.099 | 0.559 | 0.366 (0.69) | 0.286 (0.42) | 0.572 (0.96) | 0.557 (0.42) |
| b) | Gamete Competition (GC) P-value | | | Lumped Format Results: GC P-value (posterior probability of top alignment) | | | |
| Locus | | | | i) D-F1-F2 | ii) F1-F2 | iii) D-F1 | iv) D-F2 |
| D11S1984 | 0.118 | 0.191 | 0.717 | 0.388 (0.97) | 0.508 (0.73) | 0.255 (0.62) | 0.511 (0.96) |
| D11S2362 | 0.318 | 0.442 | 0.169 | 0.137 (0.32) | 0.248 (0.32) | 0.870 (0.29) | 0.050 (0.43) |
| D11S1999 | 0.665 | 0.760 | 0.965 | 0.620 (0.64) | 0.925 (0.87) | 0.542 (0.32) | 0.755 (0.71) |
| D11S1981 | 0.235 | 0.477 | 0.675 | 0.790 (0.17) | 0.438 (0.48) | 0.565 (0.27) | 0.848 (0.65) |
| D11S1392 | 0.614 | 0.599 | 0.102 | 0.068 (0.97) | 0.140 (0.93) | 0.827 (0.98) | 0.035 (0.97) |
| D11S2000 | 0.843 | 0.373 | 0.132 | 0.459 (0.07) | 0.496 (0.38) | 0.861 (0.55) | 0.478 (0.16) |
| D11S1998 | 0.090 | 0.016 | 0.859 | 0.044 (0.52) | 0.273 (0.93) | 0.010 (0.87) | 0.173 (0.50) |
| D11S4464 | 0.467 | 0.132 | 0.302 | 0.101 (0.40) | 0.230 (0.45) | 0.082 (0.25) | 0.246 (0.84) |
| D11S912 | 0.567 | 0.099 | 0.559 | 0.498 (0.57) | 0.415 (0.42) | 0.572 (0.67) | 0.606 (0.19) |
The posterior probability of the top alignment for each marker is included in parentheses. Marker D11S2002 is emphasized because its association significance improved in the one-to-one merge of the two Finnish data sets. This marker was recently discovered to have the strongest linkage among these markers to familial dyslipidemia in a larger fine-mapping analysis.