
Detection of genome-wide polymorphisms in the AT-rich Plasmodium falciparum genome using a high-density microarray.

Hongying Jiang, Ming Yi, Jianbing Mu, Louie Zhang, Al Ivens, Leszek J Klimczak, Yentram Huyen, Robert M Stephens, Xin-Zhuan Su.

Abstract

BACKGROUND: Genetic mapping is a powerful method to identify mutations that cause drug resistance and other phenotypic changes in the human malaria parasite Plasmodium falciparum. For efficient mapping of a target gene, it is often necessary to genotype a large number of polymorphic markers. Currently, a community effort is underway to collect single nucleotide polymorphisms (SNP) from the parasite genome. Here we evaluate polymorphism detection accuracy of a high-density 'tiling' microarray with 2.56 million probes by comparing single feature polymorphisms (SFP) calls from the microarray with known SNP among parasite isolates.
RESULTS: We found that probe GC content, SNP position in a probe, probe coverage, and signal ratio cutoff values were important factors for accurate detection of SFP in the parasite genome. We established a set of SFP calling parameters that could predict mSFP (SFP called by multiple overlapping probes) with high accuracy (≥ 94%) and identified 121,087 mSFP genome-wide from five parasite isolates including 40,354 unique mSFP (excluding those from multi-gene families) and approximately 18,000 new mSFP, producing a genetic map with an average of one unique mSFP per 570 bp. Genomic copy number variation (CNV) among the parasites was also cataloged and compared.
CONCLUSION: A large number of mSFP were discovered from the P. falciparum genome using a high-density microarray, most of which were in clusters of highly polymorphic genes at chromosome ends. Our method for accurate mSFP detection and the mSFP identified will greatly facilitate large-scale studies of genome variation in the P. falciparum parasite and provide useful resources for mapping important parasite traits.


Year:  2008        PMID: 18724869      PMCID: PMC2543026          DOI: 10.1186/1471-2164-9-398

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Malaria parasites, particularly Plasmodium falciparum, impose heavy economic and health burdens on human populations worldwide [1]. Hundreds of millions of people are infected by the parasite each year, leading to 1–2 million deaths annually. Lack of effective vaccines and emergence of drug-resistant parasites and insecticide-resistant mosquito vectors are the main reasons for the failure to control the parasites and the associated disease. A better understanding of the molecular mechanisms of drug resistance, the molecular basis of the host immune response, and the strategies the parasite employs to evade host immunity is critical for vaccine and drug development. Genetic variation in parasites can contribute to drug resistance, immune evasion, and disease manifestation.

Genetic mapping is a powerful approach for the identification of mutations that cause drug resistance and changes in other phenotypes [2]. For efficient mapping of a target gene, it is often necessary to genotype a large number of polymorphic markers. In addition to length polymorphisms such as microsatellites and minisatellites and large-scale sequencing, genome-wide single nucleotide polymorphisms (SNP) have been identified from many organisms, including P. falciparum, for genotyping and mapping genes associated with different phenotypes [3-5]. High-throughput SNP typing methods have also been developed [6-11], leading to recent successful identification of candidate genes (loci) associated with various human diseases [12-20]. One of the high-throughput typing methods is array-based hybridization. In this method, labeled genomic DNA is hybridized to microarrays comprising high-density short oligonucleotides designed based on known SNP or systematically tiled along all chromosomes to detect potential polymorphisms. High-density arrays have been successfully used to detect variation in copy number [21-23] and SNP [24,25].

The human malaria parasite P. falciparum has a genome with extremely high AT content (> 80%) as well as numerous repetitive sequences [26], making array design and data analysis challenging. Hybridizations of P. falciparum genomic DNA to both Affymetrix GeneChips® and slides printed with 70 mer oligonucleotides have been reported previously [27-29]. Kidgell et al. recently used an array with 327,782 probes to identify 23,653 single feature polymorphisms (SFP) among 14 isolates. The results from this study suggest that high-density arrays could be a promising tool for high-throughput detection of genome variations including SNP and copy number variations (CNV). However, calling SNP based on hybridization signals is a complex process, and many factors can affect SNP calling, including array design, GC content of a probe, the position of the SNP in a probe, hybridization conditions, and algorithms used to analyze array signals. Additionally, methods were developed to call SFP in many previous studies, but the accuracy of SFP calls was not verified with known SNP or through DNA sequencing.

To investigate the influences of these factors on calling SFP in a highly AT-rich genome and to develop a reliable method for calling SFP from the P. falciparum genome using commercially available array platforms, we have analyzed data from a high-density 'tiling' array with ~2.5 million 25 mer probes designed at The Sanger Institute (PFSANGER GeneChips®) to detect genomic variations in five P. falciparum field isolates. Genomic DNA samples from the five parasite isolates were hybridized to the array, and signals from the parasites were compared with known SNP [4] to evaluate SNP calling accuracy under different conditions. Based on the comparison, we identified factors that could affect probe/DNA hybridization dynamics and established a set of conditions that allowed us to call SFP/SNP with ≥ 94% accuracy. We also sequenced 52 SFP calls that did not agree with known SNP and found that ~64% of the 'wrong' calls were actually due to errors in the genome sequences. Parameters that provided the best SNP calling accuracy were used to identify 121,087 potential SNP, including ~18,000 new SFP that have not been reported previously.

Results

Basic probe statistics and quality control

The array has 2.56 million perfect-matched probes (25 mer) with 2,206,371 P. falciparum-specific probes (the rest of the probes were for rodent malaria parasites). Of the P. falciparum probes, 2,107,319 mapped uniquely to the genome and 99,052 mapped to more than one location or were not assigned to any chromosomes. Among the unique probes, 1,446,824 were in the predicted coding regions (CDS); 1,304,180 probes were within exons; 727,200 probes were intergenic; 84,622 were within introns; 58,022 probes spanned exon/intron junctions, and 32,347 probes spanned the predicted translation start sites or stop codons. Genomic DNA from five different parasites (Additional file 1) was labeled and hybridized (2–4 replicates) to the PFSANGER GeneChip®. After normalization of the hybridization signals across all array chips, an average signal intensity for each probe was calculated from replicates of each parasite. The quality of the hybridizations was evaluated using several methods, including MA plots, scatter plots (data not shown), and coefficient of variation (CV) tests (Additional file 1). Good reproducibility was obtained among replicates, with the majority of the probes (> 90%) having a CV of less than 25% (Additional file 1). Histograms of signal ratios relative to 3D7, the reference genome, showed similar data distribution among different parasite samples (Additional file 2).
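The replicate reproducibility check described above can be illustrated with a minimal sketch. It assumes the CV is the sample standard deviation divided by the mean, expressed as a percentage; the text does not say which standard deviation estimator was used. The probe names and signal values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """Return CV (%) = 100 * sample SD / mean for one probe's replicate signals."""
    m = mean(replicates)
    if m == 0:
        raise ValueError("mean signal is zero; CV is undefined")
    return 100.0 * stdev(replicates) / m


def fraction_below(cv_values, threshold=25.0):
    """Fraction of probes whose CV is below the threshold (25% in the text)."""
    cv_values = list(cv_values)
    return sum(cv < threshold for cv in cv_values) / len(cv_values)


# Hypothetical normalized signals: probe name -> replicate intensities.
signals = {
    "probe_a": [1000.0, 1100.0, 900.0],
    "probe_b": [200.0, 400.0, 600.0],
}
cvs = {name: coefficient_of_variation(vals) for name, vals in signals.items()}
```

Here `probe_a` has a CV of 10% and would count as reproducible, whereas `probe_b` has a CV of 50% and would not.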

Probe coverage of known SNP

Accurate SNP calling and detection of insertions/deletions require optimization of calling parameters. Here we evaluated potential factors that might affect SFP calling accuracy by comparing hybridization signal ratios with known SNP between 3D7 and four other parasites (Dd2, HB3, 7G8, and FCR3) identified in our previous study (i.e., NIAID SNP) [4]. Among the 3,836 NIAID SNP (excluding 82 that were mapped to multiple sites) identified previously, 2,651 (69%) were covered by 10,841 probes, including 1,787 covered by 5,600 probes in the predicted exons. The majority of the SNP were covered by 1–5 probes (average 4.4 probes/SNP), with a maximum coverage of 45 probes/SNP (Additional file 3). Overall, the SNP were distributed evenly across the 25 mer positions in the probe, with ~94% of probes having one SNP (Additional file 4).

Probe GC content and hybridization intensity

Because GC content in a probe is known to affect probe/DNA hybridization dynamics, we investigated the influence of probe GC content on hybridization signal intensity. The GC effect is likely exaggerated even more for the AT-rich genome of P. falciparum. The majority of the probes in the array have GC contents of 15% to 40% (Figure 1A). Signal intensity was similarly low for probes with GC content <16%, but for probes with GC content of 16% or higher, signal intensity increased with GC content until ~40%, when signal intensity began to plateau (Figure 1B). Signal intensity did not change much from 40% to 80% GC in 3D7; however, the intensity began to decrease and fluctuate dramatically after reaching 50% GC content in non-3D7 parasites (Figure 1C). Reduction in signal intensity in non-3D7 parasites suggested high levels of polymorphism in these probes. In the parasite genome, the first exons of the var gene family have a relatively high GC content and are highly variable in DNA sequence. These high-GC-content probes are therefore likely from the var genes. Comparison of the high-GC probes with var gene sequences showed that ~44% of the 5,491 probes with 50% or higher GC content were from the var genes. These probes likely contributed to the dramatic variation in signal ratio between parasites (Figure 1D). These results suggest that probes with GC content <16% and the var probes with GC content >50% might not be reliable for the detection of SFP for genetic mapping of P. falciparum traits.
Figure 1

Distribution of probes with different GC contents and the influence of GC content on signal intensity. A. Number of probes with different GC contents. B. Hybridization signals from probes with different GC contents using 3D7 DNA. C. Hybridization signals from probes with different GC contents using DNA from 7G8. D. Signal ratios of 3D7 over 7G8 from probes with different GC contents. The box plots (B-D) showed the lowest intensity, lower quartile, median, upper quartile, and the highest intensity. Note large variations in probes with GC contents higher than 50%.
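The GC filter suggested by these results (discarding probes with GC content below 16% or above 50%) can be sketched as follows. The function names are illustrative and are not taken from the authors' pipeline; the thresholds are those stated in the text.

```python
def gc_percent(probe_seq):
    """Percentage of G and C bases in a probe sequence."""
    seq = probe_seq.upper()
    if not seq:
        raise ValueError("empty probe sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(probe_seq, low=16.0, high=50.0):
    """Keep probes with low <= GC% <= high (thresholds from the text)."""
    return low <= gc_percent(probe_seq) <= high


# Hypothetical 25 mer probes with 4, 3, and 13 G/C bases respectively.
probe_16 = "A" * 21 + "GCGC"
probe_12 = "A" * 22 + "GCG"
probe_52 = "GC" * 6 + "G" + "A" * 12
```

For a 25 mer, the 16% boundary corresponds to exactly four G/C bases, so `probe_16` is kept while `probe_12` (12%) and `probe_52` (52%) are removed.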


Substitution positions in a probe and hybridization dynamics

The position of a nucleotide substitution in a probe can also influence probe hybridization intensity. A substitution in the middle of a probe is expected to affect hybridization stability more dramatically than a change at the end positions of a probe. Comparison of average signal ratios between 3D7 and the other four parasites with SNP at known probe positions showed that substitutions at the two end positions (1 and 25) of a probe did not affect probe-target hybridization, and substitutions at positions 2 and 24 had minimal effect on signal intensity (Figure 2). Signal ratios (3D7/7G8) of probes with SNP increased from each end of the probe toward the middle over positions 3 to 7, averaging more than 10 times the ratio of probes without polymorphisms. For all positions in a probe, the average signal ratios were approximately the same (< 1.5) if there was no known polymorphism in a probe. For probes that had known SNP, the signal ratio was generally 5 or higher if two positions at each end of a probe were excluded (Figure 2). Our data showed that substitutions located at probe positions 3–23 (25 mer probes) had a strong effect on hybridization intensity and should be considered for SFP detection (Figure 2).
Figure 2

Relationship between probe signal ratios and SNP positions. 7G8-same indicates signals from probes with no known NIAID SNP within the probes between 3D7 and 7G8 parasites (3D7/7G8); 7G8-diff indicates probes with known differences between 3D7 and 7G8 parasites. The definitions for the rest of the parasites (FCR3, Dd2, and HB3) are the same as those for 7G8. The dashed line indicates signal cutoff ratio value of 5.0.
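A per-probe call based on the observations above might look like the following sketch. It assumes 1-based genome coordinates, a probe on the forward strand, and a call whenever the reference/test signal ratio is at or above the cutoff; whether the boundary is inclusive is not stated in the text. All names and signal values are hypothetical.

```python
def snp_position_in_probe(probe_start, snp_pos):
    """1-based position of a SNP within a probe, given 1-based genome coordinates."""
    return snp_pos - probe_start + 1


def position_is_informative(pos, probe_len=25, end_margin=2):
    """True for positions 3-23 of a 25 mer (two positions at each end excluded)."""
    return end_margin < pos <= probe_len - end_margin


def probe_calls_sfp(ref_signal, test_signal, cutoff=5.0):
    """Call an SFP when the 3D7/test signal ratio reaches the cutoff."""
    if test_signal <= 0:
        raise ValueError("test signal must be positive")
    return ref_signal / test_signal >= cutoff


# Hypothetical probe starting at genome position 100 with a SNP at position 112.
pos = snp_position_in_probe(100, 112)
```

With these inputs the SNP sits at probe position 13, which is informative; a 3D7 signal of 5000 against a test signal of 800 (ratio 6.25) would be called an SFP at the 5.0 cutoff.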


Estimates of correct SFP call rates

We next evaluated different signal cutoff ratios to obtain a value that produced the best SFP calling accuracy realizing that this ratio would balance false positive and false negative calling rates. We found that a signal cutoff ratio of 1.5 produced the highest overall correct call rates (≥ 90%) for Dd2, HB3, and 7G8 (Table 1). Correct call rates increased slightly after removing probes with high and low GC contents and increased further after excluding calls from single probes and calls with probe vote ratio < 75%. In contrast, correct call rates decreased with the increase of signal ratio cutoff values, likely because of the exclusion of some real SFP with relatively lower signal ratios. Even using a signal cutoff ratio of 5.0, we obtained correct call rates ≥ 85%. After correcting for wrong calls due to sequence errors (see below), we obtained correct call rates ≥ 94% (Table 1). The call rate for FCR3 could not be estimated accurately without known SNP information.
Table 1

Comparison of correct mSFP calling rates using different cut off values

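| Cutoff value | Overall rate: 7G8 | Overall rate: Dd2 | Overall rate: HB3 | GC filtered: 7G8 | GC filtered: Dd2 | GC filtered: HB3 | Probe filtered: 7G8 | Probe filtered: Dd2 | Probe filtered: HB3 | Corrected rate: 7G8 | Corrected rate: Dd2 | Corrected rate: HB3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.5 | 92.5 | 92.4 | 90.0 | 92.6 | 92.5 | 90.2 | 93.7 | 93.4 | 91.2 | 97.7 | 97.6 | 96.7 |
| 2.0 | 91.5 | 90.4 | 89.2 | 91.3 | 90.4 | 89.4 | 92.8 | 92.6 | 90.7 | 97.4 | 97.3 | 96.6 |
| 5.0 | 82.9 | 82.0 | 82.5 | 82.9 | 82.1 | 82.8 | 86.4 | 85.9 | 84.5 | 95.0 | 94.8 | 94.3 |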

To obtain the best correct call rates, we compared mSFP calls using three cutoff values (1.5, 2.0, and 5.0). First we called mSFP using unique probes and probe position 3–23 (Overall rate). We repeated the calls after removing probes with GC contents < 16% and > 50% (GC filtered). We then obtained call rates after removing probes with GC content < 16% and > 50% and excluding calls with single probes and multiple probe calls with less than 75% probe votes (Probe filtered). Corrected rates were obtained after adjusting for 63.5% error rate in the wrong calls due to sequence errors, which were calculated using formula

[(100-probe filtered rate) × 0.635 + probe filtered rate].

The correct call rate was defined as the number of correct calls divided by the sum of correct, wrong, and tie calls.
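As a worked check of the correction formula in the Table 1 legend, the sketch below recomputes two corrected rates from the probe-filtered rates. The function names are illustrative; the 0.635 factor is the fraction of wrong calls attributed to sequence errors in the text.

```python
def correct_call_rate(correct, wrong, tie):
    """Correct calls as a percentage of correct + wrong + tie calls."""
    return 100.0 * correct / (correct + wrong + tie)


def corrected_rate(probe_filtered_rate, sequence_error_fraction=0.635):
    """(100 - probe filtered rate) x 0.635 + probe filtered rate."""
    return (100.0 - probe_filtered_rate) * sequence_error_fraction + probe_filtered_rate


# Probe-filtered rates for 7G8 taken from Table 1.
rate_7g8_cutoff_1_5 = corrected_rate(93.7)  # (6.3 x 0.635) + 93.7
rate_7g8_cutoff_5_0 = corrected_rate(86.4)  # (13.6 x 0.635) + 86.4
```

Rounded to one decimal place, these give 97.7 and 95.0, matching the corrected rates listed for 7G8 at cutoffs of 1.5 and 5.0.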


Sequencing verification of SFP calls

Both false positive (Fp) and false negative (Fn) calls could be caused by SFP calling errors, sequencing mistakes, or problems in sequence alignment in the databases. To investigate whether the discrepancies between our SFP calls and the known SNP were from array SFP calling or sequencing/alignment errors, we sequenced 52 Fp or Fn SFP calls (positions 3–23, 1.5 cutoff ratio between 3D7 and 7G8) with different probe coverage and probe vote ratios to verify the calls. Our results showed that 33 of the 52 (63.5%) initial wrong calls were due to sequence errors in the databases, including four Fp calls that did not have polymorphism at the expected sites but had new polymorphic sites nearby, leading to the incorrect Fp calls (MAL14.5217, MAL12.3146, MAL11.3013, and PFC0210c in Table 2). Among the 19 true wrong-calls verified by sequencing, 9 were called by a single probe, 6 had mixed probes calls, 3 had two one-sided probe calls, and 1 had three one-sided probe calls. If we excluded calls from single probes and mixed probe calls having a probe vote ratio <75% (for example, one probe suggested a SFP, but three others suggested no SFP), we would have had only four calls that were incorrect (7.7% of the 52). In other words, 92% (48/52) of the calls would have been correct if we had excluded single probe calls and calls with a probe call vote ratio of <75%. If we apply these corrections, we obtain a corrected overall SFP call rate of ≥ 94% even using a conservative cutoff value of 5.0 (Table 1).
Table 2

DNA sequencing verification of false negative (Fn) and false positive (Fp) calls

Gene IDChr positionMism alleSFPn3D77G8Forward (5'-3')Reversed (5'-3')
MAL2.808chr2: 306218T/AFn(0/5)AAtcagtagtatcttttgtttcatgtaaaactaccatcaaatg
PFC0210cchr3: 218122G/GFp(4/0)CCagatgtgttctttatctaattaaccaagtgataagcacata
PFC0235wchr3: 248155A/GFn(0/2)AAggaaatgtatttgagaaaaaccaatgtttactatccgaatt
PFC0770cchr3: 718081T/TFp(4/0)TAatggggagcaaagaatttctattccatgatgtattatgat
PFC1065wchr3: 995530C/GFn(1/8)GGggaaaaagaagaagatttaaaatatatcttccgaatcatc
PFC1065wchr3: 995640A/GFn(0/8)AAatagatgtatcgtgtgataaattattacttctgtctctag
PFE1390wchr5: 1154254A/AFp(3/0)TTcgaaaaagagaagaaaaacttgtgttggcttcttaatatt
MAL6P1.232chr6: 817214T/AFn(0/4)TTtccaaatcttctcaaagctggtttattcaaaacattagg
MAL7.743chr7: 181822C/CFp(5/0)CGtttaatgcttccctttgcttataattgtgatgaagtgatg
MAL7P1.30chr7: 512599T/TFp(1/0)AAatggtagaataattcatatgtttatcacacatggtttcaac
MAL7P1.65chr7: 519234T/CFn(0/2)TCaaaacaaccgtctgatataataaacaataaatccaactgt
MAL7.2803chr7: 621749G/AFn(0/6)GGttttcgctcggattattaaagcaacatgatttttttttttc
MAL7P1.67chr7: 677205C/AFn(0/2)GTatttaacttactggattggtaatggacaaccaggttaaaa
MAL7P1.82chr7: 794419A/CFn(0/7)AAgtgtacttcattttgtagttaatatctacaaaaggggaatt
MAL7P1.82chr7: 794421C/AFn(0/7)AAccatgtgctttcatatatatccatgtaccagctcatac
PF07_0102chr7: 922368C/AFn(0/4)CCaagagtattaataattccgtcgaacagaggatgaattattt
MAL8P1.42chr8: 1017925T/AFn(0/1)TAtccatgatatattcccaagtattcctcatttcagggtat
MAL8.3159chr8: 1057901C/AFn(0/3)CCgtacagctagttgtagtggagctttcttactaaagtat
PF08_0017chr8: 1179041C/TFn(0/1)CTcggtgataataataaatacggaatttatagaactttccgc
PF08_0017chr8: 169329T/CFn(0/1)TTccgtctacacaataattctttgggtagtaaatatgaggaaa
MAL8.2086chr8: 582828T/CFn(0/4)TTtgggataaacctatgtataatcattcaaatttacaggtcg
PFI1300cchr9: 1080645T/TFp(1/0)AAtatgatgacaatcatattccccttctatgaatagagatac
PFI1300cchr9: 1080729G/AFn(0/2)TCtacccatatcttgatttacgctttggagatttgtttagat
PFI0495wchr9: 464714G/GFp(5/0)GAattctcccaaaactgaaataatatcttcgttagttatgtg
MAL9.1104chr9: 548591A/GFn(0/1)AGtcttcttttcctttctacatttaaggttccttctgaatta
PFI0690cchr9: 603205T/AFn(0/2)TTcgaaaaaatcctttaccttaaagatttccccctactaaa
MAL5.878chr9: 926274G/AFn(0/1)CCgttcgtcttttttttcatatgGaatataagacagatgttcc
PF10_0314chr10: 1294935C/GFn(3/17)TTcaatgtgaggaatatttatagggcctcattgtggttatta
MAL10.3336chr10: 1334877A/AFp(1/0)AAtttaaacacccctcaaaaaaaaatatcaaaaccggaaatg
MAL10.4084chr10: 1433239G/AFn(0/1)CCaagaaataattggttgggctttctgtccaccatttttttg
PF10_0377chr10: 1554669T/AFn(9/11)TAtaaaacctgtataaccaaatatatacaaactttacaaaactc
PF10_0094chr10: 389999A/CFn(0/2)TTaaggtataccaatagatttggtaaatcattcaccctcat
PF10_0138chr10: 556132C/CFp(2/1)CCtaatgtgtatgtatcagctaggattgtaataagtatatgg
MAL10.1222chr10: 564556T/TFp(1/0)TTgttttatgcttaggcttatatgggaaaatataaatgaagg
PF11_0338chr11: 1272493A/AFp(1/0)AAgaatgttaacatacaaatgtacttcagggagaatatttattc
MAL11.3013chr11: 1294419T/TFp(3/0)TTtcatggttcaggtataagaccattattttcttgagctgc
PF11_0353chr11: 1327608G/AFn(0/5)GGttataccatatgtgtacaaaggaaatatcaaaatttcctaac
PF11_0360chr11: 1369690A/GFn(0/3)AAcctattctattcaatactgtctgtatacatttgtttggat
PF11_0046chr11: 151916A/GFn(0/2)AAacaagcatagatatcatagcataacatgtcctaaaggtga
PF11_0441chr11: 1717528T/AFn(5/15)ATcagttatatacctttatcagataagaaaaaatatccacac
MAL12.4052chr12: 1192527T/GFn(1/2)TGggatattcacaatggattttcatgtgtatcatttatacatg
MAL12.2128chr12: 577914T/TFp(1/0)TTctgatgaaagaatacatattgtgaacaatatattcggaaac
MAL12.3146chr12: 817466T/TFp(1/0)TTATaatctaaaaaatccaagtatgcataatgattgtatatccttt
PF13_0184chr13: 1376386T/CFn(1/9)TTtattcttgaattttcgctactatattttatggatcatctc
MAL13.4760chr13: 2159993C/TFn(0/2)CCcacaaaagtatacgtctatttaacagtttaggacacata
MAL13.670chr13: 304167C/AFn(0/1)AAattaaataattcttcttccagcatgtcttgtatttcgtttt
MAL13P1.67chr13: 557320A/TFn(0/3)AAgttcttctaacacaaataaatctacaggtaatatgttatc
PF13_0088chr13: 650502T/CFn(0/4)GGcggcatgctcctgaagtaaattatgttagagatgggtata
PF13_0125chr13: 912350T/GFn(2/3)AGcatagtactatcacctgaactatggttataaccaagaaat
MAL13P1.127chr13: 958583A/AFp(2/1)TCgatgaatttgttgtaacgtttacgttaataacaatcatgtga
MAL14.5217chr14: 2364467A/AFp(3/0)AAggtatatcctttctacatataattcttttcatagggagtt
PF14_0565chr14: 2428920A/TFn(16/23)TAatcgtcaataccttcctcgtaaacaaaatatgagcactg

Gene ID, gene ID or SNP ID in PlasmoDB; Chr position, chromosomal position of the polymorphic site; Mism Alle, mismatched alleles of our array calls and known NIAID SNP between 3D7 and 7G8; SFPn, calls not matching known SNP, either false positive (Fp) or false negative (Fn). The numbers in the parentheses are numbers of probes calling for SFP or no SFP. For example, Fp(3/0) indicates three probes called for a SFP and no probe called for no SFP, but there was no known SNP in the databases; and Fn(0/3) indicates three probes called for no SFP, but a known SNP existed (false negative); 3D7, alleles obtained from sequencing 3D7 DNA; and 7G8, alleles obtained from sequencing 7G8 DNA sequences. The gene ID in italic indicates SNP not confirmed by sequencing (true wrong calls) using 1.5 cutoff ratio and 3–23 positions in a probe; and those in bold had additional polymorphisms supporting the array calls. TAT in MAL12.3146 is a trinucleotide missing in 7G8. Forward and reverse are primers used in amplification and sequencing of the PCR products.
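The probe-vote notation in the SFPn column, together with the filter that excludes single-probe calls and calls with a vote ratio below 75%, can be sketched as follows. This is an illustration under stated assumptions: the vote ratio is taken here as the share of probes agreeing with the majority, and a ratio of exactly 75% is kept. The text does not define either point precisely.

```python
import re

# Matches labels such as "Fp(3/0)" or "Fn(0/3)" as used in Table 2.
_SFPN = re.compile(r"^(Fp|Fn)\((\d+)/(\d+)\)$")


def parse_sfpn(label):
    """Split an SFPn label into (kind, probes calling SFP, probes calling no SFP)."""
    match = _SFPN.match(label)
    if match is None:
        raise ValueError(f"unrecognized SFPn label: {label!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def passes_vote_filter(n_sfp, n_no_sfp, min_ratio=0.75):
    """Drop single-probe calls and calls whose majority share is below min_ratio.

    Assumption: the vote ratio is the majority share of all voting probes.
    """
    total = n_sfp + n_no_sfp
    if total < 2:
        return False
    return max(n_sfp, n_no_sfp) / total >= min_ratio


kind, n_sfp, n_no_sfp = parse_sfpn("Fp(2/1)")
```

Under these assumptions, `Fn(0/1)` is dropped as a single-probe call, `Fp(2/1)` is dropped because only two of three probes agree (67%), and `Fn(0/3)` is retained.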


Use of receiver operating characteristic (ROC) curves to estimate call rates

To further test the reliability of our method in calling SFP, we also used a ROC curve to evaluate SFP calling accuracy and applied local pooled error (LPE) analysis to obtain Z-scores for calling SFP [30]. LPE generates corrected Z-scores that reduce Fp, which might result when sample variance happens to be low, by using a 'pooled' variance for all the probes that show similar intensities. The ROC curve is a graphic plot of sensitivity vs. (1-specificity), or the fraction of true positives vs. the fraction of Fp [31]. As shown in Figure 3, if we allowed a Fp rate of approximately 2% (1-specificity), at a Z-score of ~1.5, we could obtain a sensitivity (call rate) of ~81% genome-wide for data from 7G8, Dd2, and HB3.
Figure 3

Relationship of receiver operating characteristic (ROC) curve and Z-score values and estimates of SFP call rates. The black line is the ROC curve, and the red line is the Z-score curve. The vertical dashed line indicates a false positive rate (1-specificity) of 5%, and the horizontal lines point to a Z-score value of 1.5 and a sensitivity level (call rate) of approximately 81%, respectively. The curves were generated using data from all replicates of hybridization. SFP calls were compared with known NIAID SNP described previously (see text).
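How a point on such a ROC curve is obtained from a Z-score cutoff can be illustrated with a minimal sketch. It does not implement the LPE method; it simply takes per-site Z-scores as given and compares calls with known SNP status. The scores and labels below are hypothetical.

```python
def roc_point(z_scores, is_known_snp, threshold):
    """Return (false positive rate, sensitivity) for calls with Z >= threshold."""
    tp = fp = fn = tn = 0
    for z, truth in zip(z_scores, is_known_snp):
        called = z >= threshold
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one known SNP and one non-SNP site")
    return fp / (fp + tn), tp / (tp + fn)


def roc_curve(z_scores, is_known_snp, thresholds):
    """ROC points for a series of Z-score cutoffs."""
    return [roc_point(z_scores, is_known_snp, t) for t in thresholds]


# Hypothetical Z-scores and known-SNP labels for eight sites.
z = [0.2, 1.0, 1.6, 2.5, 3.1, 0.5, 1.8, 4.0]
truth = [False, False, False, True, True, False, True, True]
points = roc_curve(z, truth, [1.5, 3.0])
```

In this toy data set, raising the cutoff from 1.5 to 3.0 lowers the false positive rate from 0.25 to 0 but halves the sensitivity, which is the trade-off the curve in Figure 3 summarizes.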

SFP were called using Z-scores of 1.5, 2.0, 3.0 and 4.0 and compared with SFP called using signal ratio cutoffs of 1.5, 2.0, 3.0, and 5.0. Results from cutoffs of Z-score of 3.0 and signal ratio of 3.0 had the best overall matches (~99%) and the best positive SFP call matches (~82%) for all 14 chromosomes. To minimize Fp calls (low Fp rate is important for genetic mapping) from unknown parasites that might have higher background, however, we decided to use a conservative signal ratio cutoff value of 5.0. Using this cutoff value, almost all (~98%) of the positive calls matched a positive call from a Z-score cutoff 3.0.

Detection of genome-wide substitutions among field isolates

We used a conservative signal cutoff ratio of 5.0 and all the parameters discussed above (Additional file 5) to call SFP and obtained 121,087 mSFP genome-wide among the five parasites, including 41,700 unique mSFP from 3D7, 8,856 from 7G8, 10,068 from Dd2, 10,449 from HB3, and 5,121 from FCR3 (Table 3). Inspection of the calls revealed that the large number of 3D7 unique calls was largely from multigene families such as var, rif, and stevor. We therefore flagged mSFP from multigene families (PFB0935w, PFD0090c, MAL7P1.6, MAL7P1.58, PFI1780w, PFA0655w, PFB0105c, MAL7P1.7, MAL7P1.59, PF10_0380, PFE1600w, PF10_0012, PF10_0005) and their paralogs. Excluding mSFP from these genes removed approximately 67% of the SFP and reduced the total number of mSFP to 40,354, including 6,618 unique mSFP for 3D7, 6,855 for HB3, 2,854 for FCR3, 7,173 for Dd2, and 6,342 for 7G8 (Additional file 6). A list of SFP and mSFP in each predicted gene and genes that are highly polymorphic (genes encoding potential antigens) can be found in Additional file 7.
Table 3

Summary of mSFP calls for the 14 chromosomes among five parasite isolates

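| Isolate | Ch1 | Ch2 | Ch3 | Ch4 | Ch5 | Ch6 | Ch7 | Ch8 | Ch9 | Ch10 | Ch11 | Ch12 | Ch13 | Ch14 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0000* | 35727 | 54790 | 67238 | 67463 | 90602 | 85998 | 85713 | 86826 | 98064 | 190050 | 223105 | 212037 | 340227 | 231400 | 1869240 |
| 0001 | 247 | 615 | 656 | 896 | 429 | 506 | 655 | 524 | 635 | 765 | 606 | 1092 | 1486 | 1337 | 10449 |
| 0010 | 209 | 116 | 130 | 331 | 304 | 211 | 252 | 313 | 1529 | 54 | 390 | 245 | 393 | 244 | 5121 |
| 0011 | 180 | 41 | 49 | 244 | 55 | 137 | 143 | 269 | 121 | 173 | 98 | 165 | 156 | 76 | 1907 |
| 0100 | 349 | 1636 | 348 | 688 | 537 | 344 | 683 | 527 | 497 | 761 | 829 | 1224 | 890 | 755 | 10068 |
| 0101 | 248 | 556 | 73 | 479 | 279 | 203 | 336 | 268 | 193 | 191 | 178 | 315 | 164 | 137 | 3620 |
| 0110 | 175 | 165 | 193 | 491 | 173 | 267 | 384 | 199 | 227 | 506 | 223 | 488 | 478 | 222 | 4191 |
| 0111 | 296 | 213 | 235 | 551 | 941 | 501 | 406 | 513 | 389 | 543 | 378 | 866 | 597 | 397 | 6826 |
| 1000 | 300 | 278 | 359 | 563 | 345 | 467 | 580 | 598 | 511 | 1125 | 687 | 762 | 1287 | 994 | 8856 |
| 1001 | 138 | 241 | 123 | 427 | 138 | 401 | 625 | 315 | 315 | 621 | 484 | 342 | 490 | 327 | 4987 |
| 1010 | 179 | 75 | 70 | 219 | 55 | 81 | 117 | 136 | 49 | 182 | 275 | 106 | 209 | 162 | 1915 |
| 1011 | 243 | 288 | 144 | 533 | 47 | 593 | 447 | 438 | 369 | 363 | 282 | 348 | 684 | 316 | 5095 |
| 1100 | 123 | 179 | 166 | 309 | 87 | 141 | 343 | 143 | 78 | 204 | 177 | 272 | 424 | 142 | 2788 |
| 1101 | 477 | 271 | 356 | 860 | 240 | 1180 | 134 | 495 | 356 | 496 | 476 | 880 | 547 | 207 | 7975 |
| 1110 | 363 | 197 | 340 | 731 | 222 | 306 | 628 | 433 | 133 | 408 | 402 | 560 | 644 | 222 | 5589 |
| 1111 | 2507 | 1924 | 1607 | 3829 | 968 | 3720 | 4215 | 3149 | 2816 | 3984 | 3072 | 3948 | 3809 | 2152 | 41700 |
| Total | 6034 | 6795 | 4849 | 11151 | 4820 | 9058 | 10948 | 8320 | 8218 | 10776 | 8557 | 11613 | 12258 | 7690 | 121087 |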

*Parasite isolate order is 7G8, Dd2, FCR3, and HB3. For example, '1000' indicates the numbers of unique alleles for 7G8. A '0' indicates that a parasite has the same allele as that of 3D7 (0), and a '1' indicates a different allele (a mSFP). The numbers in the first row are positions covered by probes at which no SFP was called (no polymorphism). These numbers were not counted in the total calculation. The counts were based on a signal cutoff value of 5.0. Note that these calls were mSFP and differ from SFP as defined previously, where each probe was treated as an independent SFP [28].
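The four-digit allele patterns used in Table 3 can be decoded with a short sketch such as the one below. The isolate order is the one given in the table footnote; the function names are illustrative.

```python
# Isolate order stated in the Table 3 footnote; 3D7 is the reference (always 0).
ISOLATE_ORDER = ("7G8", "Dd2", "FCR3", "HB3")


def decode_pattern(pattern):
    """Return the isolates whose allele differs from 3D7 for a pattern like '0101'."""
    if len(pattern) != len(ISOLATE_ORDER) or set(pattern) - {"0", "1"}:
        raise ValueError(f"expected four characters of 0/1, got {pattern!r}")
    return [iso for iso, bit in zip(ISOLATE_ORDER, pattern) if bit == "1"]


def encode_pattern(differing_isolates):
    """Build the pattern string from the set of isolates that differ from 3D7."""
    differing = set(differing_isolates)
    return "".join("1" if iso in differing else "0" for iso in ISOLATE_ORDER)


unique_to_7g8 = decode_pattern("1000")
shared_dd2_hb3 = decode_pattern("0101")
```

A pattern of '1111' means all four isolates differ from 3D7, which is why that row corresponds to the alleles unique to 3D7.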

Some chromosomes appeared to have unusually large numbers of mSFP calls from some parasites. For example, Dd2 had 1636 unique mSFP from chromosome 2, whereas the other four parasites had fewer than 400 mSFP (Table 3). Close inspection of the calls revealed that the majority of the extra mSFP were from a deletion at one end of chromosome 2 in Dd2 (Additional files 8 and 9). Similarly, the higher numbers of mSFP from chromosomes 12, 13, and 14 of HB3 were from specific regions either deleted or having highly polymorphic genes in a specific parasite (Additional files 8 and 9).

Genome-wide mSFP distribution

SFP and mSFP were uploaded into the GBrowse genome browser at the ABCC website [32] for genome-wide display of the polymorphic sites. Probe sequences and locations in predicted exons, introns, and intergenic regions were mapped to chromosomes. SNP in the PlasmoDB and our SFP/mSFP calls were also displayed in the browser with allele information from each parasite. As shown in the browser, the majority of our mSFP (89%) matched well with the PlasmoDB SNP (estimated for 7G8 only), including SNP in the pfcrt gene (Figure 4A). This comparison identified ~18,000 new unique mSFP (excluding those from multi-gene families) from the five parasite genomes.
Figure 4

Genome browser displays (drawn in Canvas) showing SFP, mSFP and SNP from two genomic loci on chromosome 7. A. A genome browser window (~3 kb) showing an expanded chromosome region covering the pfcrt gene (top line) and predicted exons/introns of the pfcrt gene, SNP in PlasmoDB (blue circles), NIAID SNP (red diamonds), SFP from individual probes (light blue squares), mSFP (black squares) and all genomic probes covering the pfcrt gene. Color codes for the genomic probes are: green, probes in coding regions; purple, probes in noncoding regions; and yellow, probes spanning protein coding and noncoding regions. Note that the mSFP matched well with the known SNP. B. An expanded region (500-bp window) from PF07_0028 showing distributions of PlasmoDB SNP and array probe locations. Five of the seven PlasmoDB SNP (blue circles) in the intron were not covered by any probes. One SNP matched a mSFP call (black bars in multiple parasites), and another was covered by one probe but was not called (filtered out because of single probe). The color codes for the genomic probes are the same as those in A; the labels are either SNP ID (blue circles) or probe ID (black and light blue bars).

We noticed that many of the PlasmoDB SNP (51.1%) were located on chromosomal regions that did not have probe coverage (Figure 4). Because the majority of the regions without probe coverage were likely in areas of AT-rich repetitive and/or noncoding sequences, the observation suggested that relatively larger numbers of SNP in the PlasmoDB could be from repetitive sequences. We next counted mSFP in a window of 10-kb segments and plotted mSFP from each segment along the chromosomes to investigate mSFP distribution on the chromosomes from each parasite (Additional file 8). Again, these plots showed clusters of some highly polymorphic regions, mostly at chromosome ends, corresponding to var/rif/stevor clusters. 
The plots also identified some peaks unique to individual parasites, for example, a unique peak on chromosome 2 for Dd2 and another for HB3. These unique peaks were likely due to deleted DNA segments or reflected the unique selection and evolutionary histories of an individual parasite (Additional file 8).
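The 10-kb window counts described above can be sketched as a simple binning step. This is a minimal illustration rather than the authors' code: the function name is hypothetical, positions are assumed to be 0-based base-pair coordinates on one chromosome, and the example coordinates are invented.

```python
from collections import Counter

def count_per_bin(positions, bin_size=10_000):
    """Count mSFP positions falling in each fixed-width bin of one chromosome.

    Returns a dict mapping bin index to count; bin i covers
    [i * bin_size, (i + 1) * bin_size). Empty bins are omitted.
    """
    counts = Counter(pos // bin_size for pos in positions)
    return dict(sorted(counts.items()))

# Hypothetical mSFP coordinates (bp) for one isolate on one chromosome.
example_positions = [1_200, 4_800, 9_999, 10_000, 25_300, 25_900, 27_000]
bins = count_per_bin(example_positions)

for bin_index, n in bins.items():
    start = bin_index * 10_000
    print(f"{start}-{start + 9_999}: {n} mSFP")
```

Plotting these counts along each chromosome, one track per isolate, gives the kind of profile in which clusters at chromosome ends and isolate-specific peaks stand out.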

Genome-wide CNV

Genome-wide segmentation analyses showed that there were relatively few large-scale amplifications or deletions among the parasites (Figure 5). The five largest amplified regions were a ~28-kb region on chromosome 4 of FCR3, a ~80–96-kb region on chromosome 5 of Dd2 and FCR3, a ~30-kb region on chromosome 9 of FCR3, a ~82.5-kb region on chromosome 11 of HB3, and regions of various sizes (~3–180 kb) in the middle of chromosome 12 in different parasites. The chromosome 5 amplified region contained a total of 20 unique genes, including 19 genes (PFE1065w-PFE1155c) amplified ~2–3 copies in FCR3 and 14 genes (PFE1095w-PFE1160w) amplified ~4–5 copies in Dd2 (Additional file 9), with a total of 13 genes shared by the two parasites. Eight of the shared genes were predicted to encode proteins related to ribosomal subunits, ATP-dependent helicase, nucleotide binding, S-adenosylmethionine-dependent methyltransferase, mitochondrial processing peptidase, G10, and the multidrug resistance homolog protein, PfPgh-1. Similarly, segments of different sizes located in the middle of chromosome 12 were amplified ~7–8 copies in 7G8 (PFL1085w, PFL1125c-PFL1160c, ~67 kb), ~5 copies in Dd2 (PFL1085w, PFL1145w-PFL1150c, ~3 kb), ~3–4 copies in FCR3 (PFL1135c-PFL1160c, ~20 kb), and ~2–3 copies in HB3 (PFL1085w, PFL1125w-PFL1310c, ~184 kb). Only two genes (PFL1145w and PFL1150c) were amplified in all four parasites, one of which encodes a putative ribosomal protein L24. A large region on chromosome 11 of HB3 containing 26 genes (PF11_0489 to PF11_0513) was amplified 2–3-fold; four of the genes were predicted to encode ring-infected erythrocyte surface antigen, antigen 332, and Ser/Thr protein kinase. The amplified region on chromosome 4 of FCR3 (~25 kb) contained genes encoding a putative reticulocyte-binding protein 1 and four hypothetical proteins (PFD0095c-PFD0115c) and was amplified at least five times. 
This amplified segment may play a role in the higher growth rate for this parasite, because the reticulocyte-binding protein may facilitate parasite invasion.
Figure 5

Copy number/segmentation analyses showing amplified and highly variable or deleted regions on 14 chromosomes. Amplified/deleted regions were displayed as a signal heat map (red, amplified; blue, deleted or highly polymorphic) from each parasite. The 14 chromosome diagrams showed amplified (red, > 1.5) or deleted/highly variable regions (blue, < 0.67) after filtering for regions 0.3 kb or larger. The dashed lines separate the four parasites in each chromosome in the order of 7G8, Dd2, FCR3, and HB3. The arrow indicates the chromosome 5 regions amplified in Dd2 and FCR3.

The majority of the regions with reduced signals (blue) were located on chromosome ends or in regions containing the var/rif/stevor gene clusters, reflecting the highly variable nature of these DNA regions (Figure 5). Although it is difficult to distinguish highly polymorphic regions from deletions in this haploid genome, we applied several additional restrictions to exclude potential polymorphic loci. A segment was considered not truly deleted if it contained known highly polymorphic genes such as var/rif/stevor [29] or if it had reduced signals in all four parasites (suggesting highly polymorphic genes such as genes encoding surface proteins). Segments with reduced signal ratios occurring in only one or two parasites were more likely to be true deletions, which could also be detected in mSFP distribution plots (Additional file 8). For example, a deletion of a ~42-kb segment (PFB0070w-PFB0100c) on chromosome 2 of Dd2 and FCR3 was found to contain a gene encoding knob-associated histidine-rich protein (KAHRP). Deletion of KAHRP in Dd2 was reported previously [28,29,33]. Another likely deleted segment was a ~98-kb region on chromosome 9 of HB3 containing 19 genes (PFI1710w-PFI1800w), including the genes encoding cytoadherence linked asexual protein (CLAG) and lysophospholipase. Again, deletion of this region had been reported [34]. 
A list of chromosome segments and mapped genes potentially amplified or deleted/highly polymorphic, including those reported previously, can be found in Additional file 9.

Discussion

The PFSANGER array, despite having ~2.2 million P. falciparum probes, was not designed specifically for SNP detection, and whether it was suitable for SNP detection was not certain. This study was initiated to investigate the possibility of using the PFSANGER array for genetic mapping and population studies. The large number of probes on the chip and their high AT content (some > 80%) require critical evaluation of factors that may affect hybridization dynamics before SFP can be reliably called. Based on comparison of mSFP calls with known SNP identified previously [4], we showed that the last two end positions in a probe had limited influence on hybridization signal and that probes with GC contents lower than 16% should be excluded for SFP calling in this genome. We also found, by resequencing, that calls based on a single probe were not reliable. For a potential mSFP call, a conservative signal cutoff ratio of 3.0–5.0 should be applied, together with a vote among several adjacent probes (within 25 bp) in which a majority of the probes (at least 75%) support the call. We demonstrated that this particular microarray could be successfully employed to detect mSFP with high calling accuracy (≥ 94%). This work provides important information for calling mSFP in the P. falciparum genome using microarrays. We used a cutoff ratio of 5.0 in calling SFP because, for genetic mapping, a high Fp rate may lead to misleading results and should be minimized. A higher cutoff value may also result in a higher Fn rate, that is, some missed calls. Missing some calls is not a major concern because the array can detect a large number of SFP. The 5.0 cutoff therefore represents a conservative value for minimizing Fp calls, considering the potentially higher backgrounds that may exist in some field isolates such as FCR3 in this study. 
The higher background in FCR3 requires further investigation, although signal intensity and distribution from this parasite appeared to be similar to those from the other parasites (Additional files 1 and 2). A sample mixed with a small percentage of DNA from a different genotype (strain) may increase the hybridization background signal. Indeed, typing DNA from the FCR3 parasite with microsatellites showed that the DNA sample appeared to contain a secondary peak in some markers (data not shown). If this is true, a sample with high background may have to be discarded. Using an array with a much higher density of probes than those published previously [27-29], we identified 121,087 mSFP from five isolates, including ~18,000 new mSFP after excluding mSFP from multigene families. Among the 121,087 mSFP, ~67% were in clusters of highly polymorphic genes such as var/rif/stevor. Approximately 89% of our mSFP calls that also had probes spanning known SNP in PlasmoDB matched the SNP, reflecting the relatively high accuracy of our mSFP calls, although our stringent cutoff values may lead to higher Fn rates or "no-calls" (such as excluding single probe calls). Our mSFP also provided additional evidence confirming the SNP reported previously, which is important because the majority of SNP in PlasmoDB were generated from shotgun sequences, and the sequence alignments have not been visually inspected or adjusted. For a genome with a large number of repetitive sequences, alignment errors can arise if sequence alignment relies entirely on computer software [4]. Distributions of mSFP across the chromosomes among the parasites were very similar except for a few unique peaks that may reflect deletion or amplification in each individual parasite. 
Excluding the mSFP from the multigene families, we obtained 40,354 mSFP, or approximately one mSFP per 570 bp in the genome, a frequency that is within the range (519–976 bp per SNP) of our previous estimates [35] and similar to an estimate of 446 bp per SNP by another group [5]. If we consider 45% of the 40,354 mSFP from five isolates as common mSFP, as estimated previously [4], we can expect ~18,000 common mSFP in the five parasite genomes that will be useful for genetic mapping. The highly AT-rich P. falciparum genome has a large number of repetitive sequences and low complexity regions in protein coding sequences [35-37]. The non-coding regions constitute more than 40% of the genome and generally have AT content >90%, with large numbers of polymorphic AT repeats and polyA/T tracts [26,38]. These high-AT regions not only present a problem for genome sequencing and DNA sequence alignment but also make it difficult to design sequence-specific probes with reliable hybridization dynamics. SNP in these regions may not be very useful for mapping purposes because of the difficulty in designing oligonucleotide probes or PCR primers for genotyping. Indeed, analyses of signal intensity from probes with different GC contents showed that probes with GC contents <16% produced similarly low signals, suggesting that these probes might not be practical for calling mSFP. Of interest, probes with GC content >50% also produced highly variable signals. The fact that the majority of high-GC probes were from the variable var genes can partly explain this variation. We excluded probes with GC content >50% for several reasons: 1) approximately 44% of the probes with GC content >50% were var probes that should be discarded; 2) probes with high GC content would have higher 'affinity' than those with lower GC content during hybridization. 
A substitution in a probe with high GC content may not reduce the hybridization signal as much as one in a probe with low GC content; 3) there were only ~3000 probes with GC contents >50%. Exclusion of these probes should not have a significant impact on our SFP calls. The P. falciparum chromosomes have been shown to be highly variable in size in pulsed-field gel electrophoresis (PFG) [39]. Genomic segmentation analysis to detect chromosome deletion and amplification showed relatively few amplification/deletion events with segment size > 0.3 kb. The variation in chromosome sizes seen in PFG gels could be mainly due to chromosome translocation, which is difficult if not impossible to detect using microarrays. One of the amplified regions was a segment on chromosome 5 containing the pfmdr1 gene in the Dd2 and FCR3 parasites. Amplification of the pfmdr1 locus has been reported [28,29,33] and could be due to drug selection pressure [40]. Similarly, there were few deletions larger than 10 kb; many of the deleted/amplified regions detected in our study matched well with those reported previously [28,29]. Two well-known deleted regions, on chromosomes 2 and 9, were detected in our analyses [34,41]. Detection of previously reported deletions suggested that our methods for detecting deletion/amplification were working properly. However, using an array with higher probe density than previous studies, we also discovered many deletions/amplifications that have not been described previously (Additional file 9). We identified 181 amplified and 536 highly variable or deleted genes or fragments, 74 (40.9%) and 30 (5.6%) of which, respectively, were reported previously [28,29,33]. Some of the discrepancies were likely due to the different filtering criteria used (e.g. cutoff ratios, minimum number of probes, length cutoff of segment). 
Because of our small parasite sample size, it is difficult to make any functional inferences from the amplifications and deletions found in this study, although amplification at the pfmdr1 locus may be associated with responses to some anti-malarial drugs [40,42], and amplification of chromosome 4 in FCR3 may contribute to its adaptation to higher growth rates.

Conclusion

This study developed methods for accurate detection of mSFP and CNV in the P. falciparum genome after evaluating factors that can influence DNA hybridization dynamics. More than 120,000 mSFP, including ~18,000 new and unique mSFP, and various chromosomal amplification/deletions were identified from the P. falciparum genome. Nearly 70% of the polymorphic sites are in clusters of var/rif/stevor gene families. Use of this array to analyze DNA samples from large numbers of parasites will facilitate our understanding of parasite diversity and evolution and genetic mapping of important parasite traits.

Methods

Parasites and parasite culture

P. falciparum parasite isolates used in this study have been described [4,43]. The parasites were cultured in vitro according to the methods of Trager and Jensen [44]. Briefly, parasites were maintained in RPMI 1640 medium containing 5% human O+ erythrocytes (5% hematocrit), 0.5% Albumax (GIBCO, Life Technologies, Grand Island, NY), 24 mM sodium bicarbonate, and 10 μg/ml gentamycin at 37°C with 5% CO2, 5% O2, and 90% N2.

DNA extraction and probe labeling

Parasites were cultured to a parasitemia of 5% or higher, and the cultures were centrifuged at 5000g to collect red blood cells, which were lysed by addition of 10 vol of 0.1% saponin in PBS. The parasites were centrifuged again, and genomic DNA was extracted from the parasite pellet using the Wizard Genomic DNA Purification kit (Promega, Madison, WI). Genomic DNA (10 μg) from each parasite was used as probes in the hybridizations. Briefly, genomic DNA was fragmented to an average size of 50–150 bp with DNase I, and the quality of the digested DNA was evaluated in 2% agarose gels. Subsequently, fragmented DNA was end-labeled using terminal deoxynucleotidyl transferase and a biotin labeling kit (Affymetrix mapping 250 K reagent kit; Affymetrix, Inc., Santa Clara, CA).

Microarray hybridization

The PFSANGER Genechip® was purchased from Affymetrix, Inc. Array hybridization was performed at the microarray facility of the Laboratory of Immunopathogenesis and Bioinformatics, SAIC-Frederick, Inc (Frederick, MD). Briefly, biotin-labeled DNA was hybridized to array chips at 45°C for 16 h with constant rotation at 60 rpm. Affymetrix 20× hybridization control was used to make the hybridization cocktail. Hybridized chips were washed and stained following the company's EukGE-WS2v5 protocol. The chips were then scanned at 570 nm emission wavelength using an Affymetrix scanner 3000. All the parasites had two or more biological replicates (Additional file 1).

Microarray chip design and data analysis

The probes were designed based on the P. falciparum genome (3D7) sequence v2.1.1 [45], covering genomic regions where unique probes with a reasonably broad 'thermal' range could be designed. A brief description of the array design has been reported recently [46]. Because of recent updates of genome databases, all probe sequences were reassigned new coordinates along each chromosome and their relative positions in a predicted gene (exon, intron, across exon and intron, and intergenic regions) according to the 3D7 genome sequence in PlasmoDB V5.2. The scanned image CEL files were processed and analyzed using the R/Bioconductor package and the robust multichip analysis method [47]. Basically, the programs retrieved probe information (perfect match only), performed background subtraction, quantile-normalized signals from the chips, and transformed the data into a final normalized data matrix of log2 values. Partek Genomics Suite 6.3 (Partek Inc., St. Louis, MO) and in-house programs were also used in SFP calling and copy number analyses.
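The quantile normalization and log2 transformation steps mentioned above can be illustrated with a small pure-Python sketch. It is not the R/Bioconductor implementation used in the study: background subtraction is omitted, ties are broken by original order rather than averaged, and the function names and toy intensities are hypothetical.

```python
import math

def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one column per chip).

    Each value is replaced by the mean, across chips, of the values that
    share its rank, so every chip ends up with the same distribution.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        # Probe indices ordered by intensity; the position in this list is the rank.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

def to_log2(columns):
    """Transform normalized intensities to log2 values."""
    return [[math.log2(v) for v in col] for col in columns]

# Two toy chips with three probes each (hypothetical intensities).
chip_a = [2.0, 8.0, 4.0]
chip_b = [4.0, 16.0, 32.0]
normalized = quantile_normalize([chip_a, chip_b])
log2_matrix = to_log2(normalized)
print(normalized)
```

After this step both chips share the same set of values, so differences between an isolate and 3D7 at a given probe are less likely to reflect chip-wide intensity shifts.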

Mapping known SNP to array probes

After determining the correct genomic coordinates for each SNP and each array probe, known SNP from our previous study [4] and those in PlasmoDB [3,5,28,45] were mapped to probes that covered known SNP positions. Ambiguous SNP (mapped to multiple positions) were removed, and the remaining SNP were uploaded to a genome browser [32] with allele information from different parasites.

SFP calling

Because the signals from the probes do not allow for accurate mapping of the position of a SNP within a probe at the given probe density, we can only assert that somewhere within a probe there is likely a polymorphism. Therefore, we simply assigned the polymorphism to a feature (probe) and called it a single feature polymorphism (SFP) as described [28]. Because a polymorphic site was often covered by multiple probes (average ~4 probes), we treated calls from probes within 25 bp as one SFP (called mSFP). To establish optimal parameters for SFP calling, we investigated SFP calling rates and calling accuracies using various conditions. We first identified all of the probes that covered each SNP identified in our previous study [4]. Then we extracted their hybridization signals from a normalized data file. The average probe intensity (average of antilogs of the raw data) from the normalized data for all replicates of each parasite isolate was calculated. This value was compared with the average signal for 3D7 obtained in the same way, giving a signal ratio for each probe. We evaluated the influences of SNP position in a probe, GC content of a probe, cutoff ratios of hybridization signal, and numbers of probes on SFP calling accuracy. Probes with GC content < 16% or > 50% and probes with multiple hits in the genome were excluded from the analyses. The last two nucleotides at each end of a probe were also discarded, because substitutions at these positions had minimal influence on hybridization signals. Once optimal parameters were identified for calling SNP using the NIAID SNP as an input set to test the method, we applied a similar procedure to a whole genome scan for probe-based SFP and mSFP (Additional file 9). Probe ratios were computed for each parasite for each probe, and raw alleles were generated by applying the cutoff ratio of 5.0: a probe was called an SFP if its ratio was above the cutoff value and was not called otherwise. 
Next, going through one parasite at a time, all probes were considered where there was more than one positive probe in a row within 25 bp of one another. Once this filtered set of probes was extracted from the full set, the ratios of intensity for each of the isolates compared with 3D7 were computed and tabulated. From this table, a vector was constructed for each parasite isolate where either a '1' or a '0' was added to each position as determined by the value of the ratio. This vector was then scanned for stretches of '1's where the distance between the probes was less than 25 bp. In cases where longer stretches were identified, they were output as an additional feature type called long multiprobe polymorphism. Because some probes represent different strands of the exact same sequence region, we also discarded those stretches of '1's where the probes on either strand had a distance of 0 bp from the neighboring probe but did not exceed the threshold ratio value. All of the multiprobe polymorphisms corresponding to the mSFP were then output, and both classes of polymorphisms (single probe SFP and multi-probe mSFP) were then loaded into the genome browser. The procedure also tracked the 'alleles' by parasite isolate to determine the counts of mSFP shared by each possible combination of parasite isolates. Additional parameters that added confidence to a particular mSFP call, such as multiple parasite isolates having the same SFP and matches to known SNP in PlasmoDB, were also indicated.
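The core of the calling procedure described above (ratio against 3D7, a cutoff of 5.0, then runs of adjacent positive probes less than 25 bp apart) can be sketched as follows. This is a simplified illustration, not the in-house program: the function names and example probes are hypothetical, the ratio is assumed to be the 3D7 signal divided by the isolate signal, probe distance is approximated by the difference between probe start coordinates, and the strand-duplicate and GC-content filters are omitted.

```python
def signal_ratio(ref_log2_replicates, isolate_log2_replicates):
    """Ratio of mean linear-scale intensities (average of antilogs).

    Assumption: reference (3D7) over isolate, so a loss of hybridization
    in the isolate gives a ratio above 1.
    """
    ref = sum(2 ** x for x in ref_log2_replicates) / len(ref_log2_replicates)
    iso = sum(2 ** x for x in isolate_log2_replicates) / len(isolate_log2_replicates)
    return ref / iso

def call_msfp(probes, cutoff=5.0, max_gap=25):
    """Group runs of consecutive positive probes into mSFP calls.

    probes: iterable of (position, ratio) for one isolate on one chromosome.
    Returns a list of (first_position, last_position, n_probes) for runs of
    two or more positive probes whose neighbors are less than max_gap apart.
    """
    calls, run = [], []
    for pos, ratio in sorted(probes):
        positive = ratio > cutoff
        if positive and run and pos - run[-1] < max_gap:
            run.append(pos)          # extend the current stretch of '1's
            continue
        if len(run) > 1:             # close the previous stretch if multi-probe
            calls.append((run[0], run[-1], len(run)))
        run = [pos] if positive else []
    if len(run) > 1:
        calls.append((run[0], run[-1], len(run)))
    return calls

# Hypothetical probes: (start coordinate, 3D7/isolate signal ratio).
example_probes = [(100, 6.2), (110, 7.5), (118, 5.4), (130, 1.1),
                  (300, 9.0), (500, 6.0), (512, 8.1)]
print(call_msfp(example_probes))
```

In this example the isolated positive probe at position 300 is not reported, which mirrors the exclusion of single-probe calls.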

Estimating SFP calling rates using ROC curve and Z-score

Hybridization measurements from Affymetrix CEL files were pre-processed in the R programming environment [48] using the read.affybatch function from the affy BioConductor package [49]. Background adjustment was performed using the method developed for the RMA algorithm, and normalization was done using the quantile method. Differential hybridization between parasite isolates was expressed as Z-scores calculated by the LPE package [30,50].
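How a ROC curve relates a score cutoff to calling rates can be shown with a minimal sketch. It uses plain thresholds on a generic per-probe score and does not reproduce the LPE Z-score calculation; the function name, scores, and labels are hypothetical, with a label of 1 assumed to mean that a known SNP distinguishes the isolate from 3D7 at that probe.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, true positive rate, false positive rate) per cutoff.

    A probe is called positive when its score exceeds the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
        points.append((cutoff, tp / positives, fp / negatives))
    return points

# Hypothetical scores for eight probes covering known SNP positions.
scores = [8.0, 6.5, 5.5, 3.2, 2.0, 1.1, 0.9, 4.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
for cutoff, tpr, fpr in roc_points(scores, labels, cutoffs=[1.0, 3.0, 5.0]):
    print(f"cutoff {cutoff}: TPR {tpr:.2f}, FPR {fpr:.2f}")
```

Raising the cutoff lowers the false positive rate at the cost of missed calls, which is the trade-off weighed when a conservative cutoff is chosen.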

DNA sequencing

To verify selected mSFP (Table 2) that might have been called incorrectly or calls that had contradictory signals, we amplified DNA fragments of 200–500 bp containing the probes and sequenced the PCR products directly according to methods described [43]. Primer sequences used in PCR and DNA sequencing are listed in Table 2.

Detection of CNV

To detect CNV, we imported the filtered probe data into Partek Genomics Suite v6.3 and normalized individual probe signals from the 3D7 reference genome to 1.0 (haploid genome). Basically, the genomic segmentation algorithm finds a segment according to three criteria: 1) neighboring regions have statistically significantly different average intensities (P ≥ 0.00001); 2) breakpoints (region boundaries) are chosen to give optimal statistical significance (smallest P-value); and 3) detected regions must contain a minimum of 15 probes. After determining the segments that had average signals at least 1.5-fold higher or lower than those of the 3D7 reference, we filtered out regions that were less than 300 bp long. Detected segments, representing potential deletions or highly polymorphic regions, were plotted along chromosomes to produce a CN genome view (Figure 5), and the segments were mapped to predicted genes in PlasmoDB to generate Additional file 9. To screen for highly polymorphic genes among potentially deleted segments, we flagged segments containing var/rif/stevor and other multigene families.
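The post-segmentation filtering described above can be sketched as a small classification step. This does not implement the Partek segmentation algorithm itself; it only applies the stated thresholds to segments that are assumed to have been found already. The Segment class, its field names, and the example segments are hypothetical, and the segment length is taken as end minus start.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    chrom: str
    start: int         # bp
    end: int           # bp
    n_probes: int
    mean_ratio: float  # isolate signal relative to 3D7 (3D7 normalized to 1.0)

def classify_segments(segments, fold=1.5, min_probes=15, min_length=300):
    """Keep segments passing the probe-count and length filters and label them.

    Segments with mean_ratio above `fold` are labeled amplified; those below
    1 / fold (about 0.67 for fold = 1.5) are labeled deleted or highly polymorphic.
    """
    kept = []
    for seg in segments:
        if seg.n_probes < min_probes or seg.end - seg.start < min_length:
            continue
        if seg.mean_ratio > fold:
            kept.append((seg, "amplified"))
        elif seg.mean_ratio < 1 / fold:
            kept.append((seg, "deleted_or_highly_polymorphic"))
    return kept

# Hypothetical segments, not results from the study.
example_segments = [
    Segment("chr5", 1_000, 90_000, 400, 3.2),  # kept: amplified
    Segment("chr2", 500, 42_500, 250, 0.2),    # kept: reduced signal
    Segment("chr7", 100, 350, 20, 2.0),        # dropped: shorter than 300 bp
    Segment("chr1", 0, 5_000, 10, 0.1),        # dropped: fewer than 15 probes
    Segment("chr3", 0, 8_000, 60, 1.1),        # dropped: within 1.5-fold of 3D7
]
for seg, label in classify_segments(example_segments):
    print(seg.chrom, seg.start, seg.end, label)
```

A further step, not shown here, would flag reduced-signal segments that overlap var/rif/stevor or other multigene families so that they are not interpreted as deletions.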

Abbreviations

CNV: copy number variation; CV: coefficient of variance; Fn: false negative; Fp: false positive; LPE: local pooled error; MS: microsatellites; ROC: receiver operating characteristic; SFP: single feature polymorphism; mSFP: SFP called by two or more overlapping probes; SNP: single nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HJ prepared parasite culture, performed DNA extraction, labeling, DNA sequencing, array hybridization, and data analysis, and took an active part in writing the manuscript; MY performed data analysis; JM and LZ performed parasite culture and MS typing as well as DNA extraction; AI and YH performed data analysis; LJK performed ROC and Z-score analyses; RMS performed data analysis and took an active part in writing the manuscript; X-zS designed the project, performed data analysis, and took an active part in writing the manuscript. All authors read and approved the manuscript.

Additional file 1

Parasite sample replicates and basic hybridization statistics after normalization.

Additional file 2

Plots of normalized signal ratios averaged from parasite replicates, showing distribution of probe signal ratios from each parasite (over 3D7).

Additional file 3

Number of NIAID SNP that are covered by different numbers of probes.

Additional file 4

The numbers of probes with NIAID SNP at positions 1–25. Probes with a single SNP are in light blue, two SNP are in red, and more than two SNP are in dark blue.

Additional file 5

Summary of procedures for calling genome-wide SFP and copy number variation.

Additional file 6

mSFP calls for the 14 chromosomes among five parasite isolates after excluding calls from multigene families.

Additional file 7

Numbers of SFP, mSFP, and known SNP in predicted P. falciparum genes.

Additional file 8

SFP counts per 10-kb bins across the 14 chromosomes from 7G8, Dd2, FCR3, and HB3.

Additional file 9

Amplified and deleted chromosomal segments or genes 300 bp or larger.
  46 in total

1.  Large-scale genotyping of complex DNA.

Authors:  Giulia C Kennedy; Hajime Matsuzaki; Shoulian Dong; Wei-min Liu; Jing Huang; Guoying Liu; Xing Su; Manqiu Cao; Wenwei Chen; Jane Zhang; Weiwei Liu; Geoffrey Yang; Xiaojun Di; Thomas Ryder; Zhijun He; Urvashi Surti; Michael S Phillips; Michael T Boyce-Jacino; Stephen P A Fodor; Keith W Jones
Journal:  Nat Biotechnol       Date:  2003-09-07       Impact factor: 54.908

2.  Development of an automated SNP analysis method using a paramagnetic beads handling robot.

Authors:  Hiroko Hagiwara; Kazumi Sawakami-Kobayashi; Midori Yamamoto; Shoji Iwasaki; Mika Sugiura; Hatsumi Abe; Sumiko Kunihiro-Ohashi; Kumiko Takase; Noriko Yamane; Kaoru Kato; Renkon Son; Michihiro Nakamura; Osamu Segawa; Mamiko Yoshida; Masafumi Yohda; Hideji Tajima; Masato Kobori; Yousuke Takahama; Mitsuo Itakura; Masayuki Machida
Journal:  Biotechnol Bioeng       Date:  2007-10-01       Impact factor: 4.530

3.  Genome wide gene amplifications and deletions in Plasmodium falciparum.

Authors:  Ulf Ribacke; Bobo W Mok; Valtteri Wirta; Johan Normark; Joakim Lundeberg; Fred Kironde; Thomas G Egwang; Peter Nilsson; Mats Wahlgren
Journal:  Mol Biochem Parasitol       Date:  2007-05-18       Impact factor: 1.759

4.  A genome-wide association study identifies KIAA0350 as a type 1 diabetes gene.

Authors:  Hakon Hakonarson; Struan F A Grant; Jonathan P Bradfield; Luc Marchand; Cecilia E Kim; Joseph T Glessner; Rosemarie Grabs; Tracy Casalunovo; Shayne P Taback; Edward C Frackelton; Margaret L Lawson; Luke J Robinson; Robert Skraban; Yang Lu; Rosetta M Chiavacci; Charles A Stanley; Susan E Kirsch; Eric F Rappaport; Jordan S Orange; Dimitri S Monos; Marcella Devoto; Hui-Qi Qu; Constantin Polychronakos
Journal:  Nature       Date:  2007-07-15       Impact factor: 49.962

5.  A genome-wide association scan identifies the hepatic cholesterol transporter ABCG8 as a susceptibility factor for human gallstone disease.

Authors:  Stephan Buch; Clemens Schafmayer; Henry Völzke; Christian Becker; Andre Franke; Huberta von Eller-Eberstein; Christian Kluck; Ingelore Bässmann; Mario Brosch; Frank Lammert; Juan Francisco Miquel; Flavio Nervi; Michael Wittig; Dieter Rosskopf; Birgit Timm; Christine Höll; Marcus Seeger; Abdou ElSharawy; Tim Lu; Jan Egberts; Fred Fändrich; Ulrich R Fölsch; Michael Krawczak; Stefan Schreiber; Peter Nürnberg; Jürgen Tepel; Jochen Hampe
Journal:  Nat Genet       Date:  2007-07-15       Impact factor: 38.330

6.  Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse.

Authors:  K Lindblad-Toh; E Winchester; M J Daly; D G Wang; J N Hirschhorn; J P Laviolette; K Ardlie; D E Reich; E Robinson; P Sklar; N Shah; D Thomas; J B Fan; T Gingeras; J Warrington; N Patil; T J Hudson; E S Lander
Journal:  Nat Genet       Date:  2000-04       Impact factor: 38.330

7.  Genome-wide discovery and verification of novel structured RNAs in Plasmodium falciparum.

Authors:  Tobias Mourier; Celine Carret; Sue Kyes; Zoe Christodoulou; Paul P Gardner; Daniel C Jeffares; Robert Pinches; Bart Barrell; Matt Berriman; Sam Griffiths-Jones; Alasdair Ivens; Chris Newbold; Arnab Pain
Journal:  Genome Res       Date:  2007-12-20       Impact factor: 9.043

8.  Genome-wide association study of restless legs syndrome identifies common variants in three genomic regions.

Authors:  Juliane Winkelmann; Barbara Schormair; Peter Lichtner; Stephan Ripke; Lan Xiong; Shapour Jalilzadeh; Stephany Fulda; Benno Pütz; Gertrud Eckstein; Stephanie Hauk; Claudia Trenkwalder; Alexander Zimprich; Karin Stiasny-Kolster; Wolfgang Oertel; Cornelius G Bachmann; Walter Paulus; Ines Peglau; Ilonka Eisensehr; Jacques Montplaisir; Gustavo Turecki; Guy Rouleau; Christian Gieger; Thomas Illig; H-Erich Wichmann; Florian Holsboer; Bertram Müller-Myhsok; Thomas Meitinger
Journal:  Nat Genet       Date:  2007-07-18       Impact factor: 38.330

9.  Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24.

Authors:  Brent W Zanke; Celia M T Greenwood; Jagadish Rangrej; Rafal Kustra; Albert Tenesa; Susan M Farrington; James Prendergast; Sylviane Olschwang; Theodore Chiang; Edgar Crowdy; Vincent Ferretti; Philippe Laflamme; Saravanan Sundararajan; Stéphanie Roumy; Jean-François Olivier; Frédérick Robidoux; Robert Sladek; Alexandre Montpetit; Peter Campbell; Stephane Bezieau; Anne Marie O'Shea; George Zogopoulos; Michelle Cotterchio; Polly Newcomb; John McLaughlin; Ban Younghusband; Roger Green; Jane Green; Mary E M Porteous; Harry Campbell; Helene Blanche; Mourad Sahbatou; Emmanuel Tubacher; Catherine Bonaiti-Pellié; Bruno Buecher; Elio Riboli; Sebastien Kury; Stephen J Chanock; John Potter; Gilles Thomas; Steven Gallinger; Thomas J Hudson; Malcolm G Dunlop
Journal:  Nat Genet       Date:  2007-07-08       Impact factor: 38.330

10.  A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21.

Authors:  Ian Tomlinson; Emily Webb; Luis Carvajal-Carmona; Peter Broderick; Zoe Kemp; Sarah Spain; Steven Penegar; Ian Chandler; Maggie Gorman; Wendy Wood; Ella Barclay; Steven Lubbe; Lynn Martin; Gabrielle Sellick; Emma Jaeger; Richard Hubner; Ruth Wild; Andrew Rowan; Sarah Fielding; Kimberley Howarth; Andrew Silver; Wendy Atkin; Kenneth Muir; Richard Logan; David Kerr; Elaine Johnstone; Oliver Sieber; Richard Gray; Huw Thomas; Julian Peto; Jean-Baptiste Cazier; Richard Houlston
Journal:  Nat Genet       Date:  2007-07-08       Impact factor: 38.330

View more
  39 in total

Review 1.  How can we identify parasite genes that underlie antimalarial drug resistance?

Authors:  Tim Anderson; Standwell Nkhoma; Andrea Ecker; David Fidock
Journal:  Pharmacogenomics       Date:  2011-01       Impact factor: 2.533

2.  Genome-wide polymorphisms and development of a microarray platform to detect genetic variations in Plasmodium yoelii.

Authors:  Sethu C Nair; Sittiporn Pattaradilokrat; Martine M Zilversmit; Jennifer Dommer; Vijayaraj Nagarajan; Melissa T Stephens; Wenming Xiao; John C Tan; Xin-Zhuan Su
Journal:  Mol Biochem Parasitol       Date:  2014-03-29       Impact factor: 1.759

Review 3.  Drug resistance and genetic mapping in Plasmodium falciparum.

Authors:  Karen Hayton; Xin-Zhuan Su
Journal:  Curr Genet       Date:  2008-09-18       Impact factor: 3.886

Review 4.  Chemical genomics for studying parasite gene function and interaction.

Authors:  Jian Li; Jing Yuan; Ken Chih-Chien Cheng; James Inglese; Xin-Zhuan Su
Journal:  Trends Parasitol       Date:  2013-11-09

5.  A large proportion of P. falciparum isolates in the Amazon region of Peru lack pfhrp2 and pfhrp3: implications for malaria rapid diagnostic tests.

Authors:  Dionicia Gamboa; Mei-Fong Ho; Jorge Bendezu; Katherine Torres; Peter L Chiodini; John W Barnwell; Sandra Incardona; Mark Perkins; David Bell; James McCarthy; Qin Cheng
Journal:  PLoS One       Date:  2010-01-25       Impact factor: 3.240

6.  Optimizing comparative genomic hybridization probes for genotyping and SNP detection in Plasmodium falciparum.

Authors:  John C Tan; Jigar J Patel; Asako Tan; J Craig Blain; Tom J Albert; Neil F Lobo; Michael T Ferdig
Journal:  Genomics       Date:  2009-03-11       Impact factor: 5.736

7.  Comparative Genomics and Systems Biology of Malaria Parasites Plasmodium.

Authors:  Hong Cai; Zhan Zhou; Jianying Gu; Yufeng Wang
Journal:  Curr Bioinform       Date:  2012-12-01       Impact factor: 3.543

8.  Single-feature polymorphism discovery by computing probe affinity shape powers.

Authors:  Wayne Wenzhong Xu; Seungho Cho; S Samuel Yang; Yung-Tsi Bolon; Hatice Bilgic; Haiyan Jia; Yanwen Xiong; Gary J Muehlbauer
Journal:  BMC Genet       Date:  2009-08-26       Impact factor: 2.797

9.  Gene copy number variation throughout the Plasmodium falciparum genome.

Authors:  Ian H Cheeseman; Natalia Gomez-Escobar; Celine K Carret; Alasdair Ivens; Lindsay B Stewart; Kevin K A Tetteh; David J Conway
Journal:  BMC Genomics       Date:  2009-08-04       Impact factor: 3.969

10.  Advances in parasite genomics: from sequences to regulatory networks.

Authors:  Elizabeth A Winzeler
Journal:  PLoS Pathog       Date:  2009-10-30       Impact factor: 6.823
