Amit Bahl, Paul H Davis, Michael Behnke, Florence Dzierszinski, Manjunatha Jagalur, Feng Chen, Dhanasekaran Shanmugam, Michael W White, David Kulp, David S Roos.
Abstract
BACKGROUND: Microarrays are invaluable tools for genome interrogation, SNP detection, and expression analysis, among other applications. Such broad capabilities would be of value to many pathogen research communities, although the development and use of genome-scale microarrays is often a costly undertaking. Therefore, effective methods for reducing unnecessary probes while maintaining or expanding functionality would be relevant to many investigators.
Year: 2010 PMID: 20974003 PMCID: PMC3017859 DOI: 10.1186/1471-2164-11-603
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Microarray Design (note 1).
| Application | # of features | probes/feature | Tiling density (nt) | total # probes | % of chip |
|---|---|---|---|---|---|
| nuclear coding genes (3' biased) (note 2) | 8,058 | 11 | | 88,638 | 39.12% |
| nuclear non-coding genes | 22 | 20 | | 440 | 0.19% |
| apicoplast organellar genome (nt) | 34,997 | | 25 | 1,400 | 0.62% |
| mitochondrial organellar genome (nt) | 6,071 | | 25 | 243 | 0.11% |
| all exons (chr Ib only) (note 3) | 1,080 | 6 | | 6,480 | 2.86% |
| all introns (chr Ib only) | 1,080 | 5 | | 5,400 | 2.38% |
| antisense probes (opposite CDS; chr Ib only) | 227 | 20 | | 4,540 | 2.00% |
| ESTs without predicted gene models (nt) | 830,867 | | 35 | 23,739 | 10.48% |
| ORFs with BLASTX or TBLASTN hits (nt) | 1,263,357 | | 35 | 36,096 | 15.93% |
| human (immune response & housekeeping) (note 4) | 301 | 11 | | 3,311 | 1.46% |
| mouse (immune response & housekeeping) (note 4) | 291 | 11 | | 3,201 | 1.41% |
| cat (housekeeping genes) | 12 | 30 | | 360 | 0.16% |
| | 228 | 40 | | 9,120 | 4.02% |
| SNPs inferred from | 3,490 | 4 | | 13,960 | 6.16% |
| | 1,985 | 4 | | 7,940 | 3.50% |
| SFP discovery on 24 selected genes (note 5) | 23,110 | | 2 | 11,555 | 5.10% |
| promoters (for ChIP) on 12 selected genes (note 6) | 12,000 | | 10 | 1,200 | 0.53% |
| commonly used transgene reporters (note 7) | 39 | 11 | | 429 | 0.19% |
| human & mouse normalization probes | | | | 2,200 | 0.97% |
| yeast (housekeeping & spike-in probes) | | | | 839 | 0.37% |
| mismatch probes (genes on chr Ib) | 227 | 11 | | 2,497 | 1.10% |
| surrogate mismatch (background) probes | | | | 3,000 | 1.32% |
| Total | | | | 226,588 | 100.00% |
1 See http://ancillary.toxodb.org/docs/Array-Tutorial.html for a detailed description, including probe sequences.
2 A small minority of the 7,793 genes are represented by more than one probeset, differing in the degree to which they cross-hybridize, while even fewer do not have named probesets of their own, as they are interrogated by probesets for other genes.
3 Non-terminal exons only (terminal exons are interrogated as part of 3'-biased profiling).
4 See http://ancillary.toxodb.org/docs/HostResponse.htm for details.
5 CDS for AMA1, B1, BSR4/R, GRA3/6/7, MIC2, ROP1/16, SAG1/2/3/4, SRS1/2/9; introns from ATUB, BTUB, BAG1, UPRT. See http://ancillary.toxodb.org/docs/SNPDiscovery.htm for details.
6 BAG1, BTUB, LDH1, LDH2, SAG1, SAG2, SAG2C, DHFR-TS, MIC2, GRA1, OWP1, OWP2; see http://ancillary.toxodb.org/docs/ChIP.htm.
7 For selectable drug-resistance markers, enzyme and fluorescent protein reporters, etc; see http://ancillary.toxodb.org/docs/TransgeneReporters.seq.
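The probe totals in the table above appear to follow two simple rules: gene-style content contributes (# of features × probes/feature), while tiled content contributes (length in nt ÷ tiling density). The sketch below checks a few rows against those rules. The function names are illustrative only, and rounding to the nearest whole probe is an assumption that happens to reproduce the listed values.

```python
def probes_from_features(n_features: int, probes_per_feature: int) -> int:
    """Probe count for gene-style content: a fixed number of probes per feature."""
    return n_features * probes_per_feature


def probes_from_tiling(length_nt: int, tiling_density_nt: int) -> int:
    """Probe count for tiled content: roughly one probe every `tiling_density_nt` nt.

    Rounding to the nearest integer is an assumption, not a documented rule.
    """
    return round(length_nt / tiling_density_nt)


TOTAL_PROBES = 226_588  # last row of the table

# Rows copied from the table above
coding = probes_from_features(8_058, 11)       # nuclear coding genes
apicoplast = probes_from_tiling(34_997, 25)    # apicoplast genome
est = probes_from_tiling(830_867, 35)          # ESTs without gene models

for label, n in [("coding", coding), ("apicoplast", apicoplast), ("EST", est)]:
    print(f"{label}: {n} probes, {100 * n / TOTAL_PROBES:.2f}% of chip")
```

The same two functions can be applied to the remaining rows to spot transcription errors in the table.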
Gene Expression in Toxoplasma gondii.
| Genes (total = 7793) | RH (type I) | Pru (type II) | VEG (type III) | No expression | |||
|---|---|---|---|---|---|---|---|
| 2,336 | 2,185 (2,073) | NA | NA | NA | 5,608 (5,270) | ||
| 1,229 | 901 (488) | NA | NA | NA | 6,892 (7,305) | ||
| 501 | 204 (106) | NA | NA | NA | 2,245 (2,343) | ||
| 7,793 | 3,986 (3,270) | 3,395 (1,692) | 3,065 (1,472) | 3,185 (1,623) | 4,072 (2,154) | ||
1 Number of Toxoplasma ESTs in dbEST, unique SAGE tags in TgSAGEDB, and probesets on the microarrays.
2 Values in parentheses reflect more stringent criteria for evidence of expression: ≥ 3 SAGE tags or ESTs vs. ≥ 1; 150% above background vs. > 0% above background; 5% FDR vs. 10% FDR for Affymetrix arrays.
3 Some genes have more than one associated probeset, differing in degree of potential cross-hybridization.
Figure 1. Differential expression between clonal lineages. MA plots (intensity ratio versus average intensity) for hybridizations with representatives of the three major clonal lineages of Toxoplasma show a very high degree of reproducibility among biological replicates (comparisons shown along the diagonal), and a significant number of differentially expressed genes between lineages (blue dots). Tables list genes exhibiting the greatest differences in hybridization intensity for each pairwise comparison, ranked by estimated fold-change (asterisks indicate genes where at least four probes are polymorphic). Gene presence was determined using a 10% false discovery rate (see Materials & Methods for details), resulting in 42% of genes called present in RH-, 38% in Prugniaud-, and 40% in VEG-strain parasites, with an aggregate total of 49% of genes called present in any strain during the tachyzoite (lytic) life stage. A total of 5,307 genes exhibit between-strain differences in expression level at a P-value of 1 × 10⁻³ (corrected for multiple testing); eliminating genes for which four or more probes are polymorphic, the fold change is under 2-fold, or the gene is called absent (at a 10% FDR) leaves 2,078 genes with clear evidence of strain-specific differential expression. The pie chart indicates the distribution of differentially regulated genes by strain (+ and - indicate up- and down-regulation, respectively).
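For readers unfamiliar with MA plots, the two axes are conventionally M (the log2 intensity ratio between two samples) and A (the mean log2 intensity). The sketch below computes both for a few genes and flags those meeting the 2-fold criterion mentioned in the caption. The gene names and intensities are hypothetical, and the normalization and significance testing used in the study are not reproduced.

```python
import math


def ma_values(intensity_a: float, intensity_b: float) -> tuple[float, float]:
    """Return (M, A): the log2 ratio and the mean log2 intensity for one gene."""
    log_a, log_b = math.log2(intensity_a), math.log2(intensity_b)
    return log_a - log_b, (log_a + log_b) / 2


# Hypothetical normalized intensities: (strain 1, strain 2)
genes = {
    "gene_1": (1024.0, 256.0),
    "gene_2": (500.0, 480.0),
    "gene_3": (64.0, 256.0),
}

for name, (a, b) in genes.items():
    m, avg = ma_values(a, b)
    at_least_2_fold = abs(m) >= 1.0  # |log2 ratio| >= 1 corresponds to >= 2-fold
    print(f"{name}: M={m:+.2f}  A={avg:.2f}  >=2-fold: {at_least_2_fold}")
```

Replicate hybridizations should cluster tightly around M = 0, while genuine between-strain differences appear as points far from that line.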
Strain-specific differential expression in Toxoplasma gondii.
| Number of genes | Significant up- or down-regulation in: | ||||||
|---|---|---|---|---|---|---|---|
| 7,793 | 3,986 | 2,078 | 477 | 570 | 193 | 838 | |
| 2,074 | 1,360 | 677 | 143 | 218 | 60 | 256 | |
| 2,764 | 1,805 | 885 | 193 | 272 | 80 | 340 | |
| 140 | 62 | 31 | 8 | 7 | 4 | 12 | |
| 250 | 177 0 | 66 0 | 16 | 19 | 4 | 27 |
| 806 | 574 | 288 | 61 | 105 | 21 | 101 | |
| 528 | 331 | 179 | 31 | 58 | 14 | 76 0 | |
| 90 | 56 | 32 | 10 | 12 | 0 | 10 | |
| 456 | 297 | 153 | 38 | 36 | 17 | 62 | |
| 85 | 54 | 29 | 7 | 8 | 3 | 11 | |
| 52 | 28 | 13 | 2 | 5 | 4 | 2 | |
| 238 | 151 | 70 | 16 | 14 | 10 | 30 | |
| 17 | 8 | 2 | 1 | 0 | 1 | 0 | |
| 102 | 65 | 22 | 3 | 8 | 2 | 9 | |
| 5,719 | 2,626 | 1,401 | 333 | 352 | 133 | 582 | |
1 Multiple GO annotations are available for some genes, and some GO annotations map to multiple GO slim categories (although multiple annotations may also map to the same category). 2,074 genes include 2,979 GO annotations, corresponding to 3,503 GO Slims, for a total of 2,764 gene-GO Slim mappings.
2 Assessed using a 10% false discovery rate (see text for details).
3 Differential expression determined as described in the text. P-values represent enrichment for GO Slim terms, as determined by the hypergeometric distribution (only values less than 0.05 are shown). Green shading implies under-representation, while yellow shading implies over-representation.
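An enrichment P-value from the hypergeometric distribution, as mentioned in note 3, can be sketched with the standard library alone. The function below computes the upper tail (over-representation). The choice of background population is an assumption made here for illustration: the example treats the 2,764 gene-GO Slim mappings as the population and the 885 differentially expressed mappings as the sample, which may differ from what was actually done in the study.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: population size, K: population members in the category,
    n: sample size, k: sample members observed in the category.
    """
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Illustrative numbers taken from one row of the table (assumed background, see above)
N, n = 2_764, 885   # all mappings, differentially expressed mappings
K, k = 806, 288     # mappings in the category, of which differentially expressed

expected = n * K / N
p_over = hypergeom_upper_tail(k, N, K, n)
print(f"expected {expected:.1f}, observed {k}, P(X >= {k}) = {p_over:.3g}")
```

Under-representation would be tested with the corresponding lower tail, P(X <= k), by summing from 0 to k instead.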
Figure 2. Genotyping design. The chromosome map illustrates three tiers of genotyping content present on the Toxoplasma microarray. Triangles represent the published RFLP markers, i.e. the genotyping capabilities available prior to this work; filled triangles represent those markers for which we have provided probesets that passed a rigorous screening process. The top half of each chromosome bar represents the EST-based SNPs, and the bottom half shows the SFPs that have passed screening. The table lists the exact numbers of SNPs represented on the array, those that passed screening, and the probe content for each. An expanded view of chromosome Ib indicates the SNP frequency derived from comparative sequence analysis for the three archetypal strains (see Additional File 3), and indicates the location of probesets designed for SNP detection. A magnified view of chromosome Ib demonstrates the overlap of SNPs with probes primarily designed for transcriptional profiling. Pink triangles indicate those probes which overlap SNP locations, and can be used to detect SFPs (see text).
Figure 3. SNP detection performance. A, Performance of a single sense-strand probe: the ability of a single sense-strand probe overlapping a SNP to call the correct allele is shown as a function of the distance from the center of the probe to the SNP. At a stringent P-value threshold of 10⁻⁴, approximately 65% of SNPs are called correctly using a probe centered exactly on the SNP (see haploid genotyping simulation section in Materials & Methods for a description of P-value calculations). B, Performance of an alternate probe when the centered sense probe fails: when the centered probe fails to call the correct allele at a chosen threshold, the ability of one additional probe to rescue the call is shown as a function of strand and distance of the probe relative to the SNP. Probes on the sense strand at close distances to the SNP contribute little, presumably due to the same local constraints that caused the centered sense probe to fail, whereas the centered opposite-strand probe recovers 60% of missed calls at a threshold of 10⁻⁴. Therefore, at a threshold of 10⁻⁴, combining the two centered probes achieves an 86% success rate (65% plus 60% of the remaining 35%).
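The 86% figure in the caption can be reproduced from the two panel values: roughly 65% of SNPs are called by the centered sense probe, and the centered opposite-strand probe rescues about 60% of the rest. A minimal sketch of that arithmetic follows; the function name is illustrative.

```python
def combined_success(primary_rate: float, rescue_rate: float) -> float:
    """Fraction of SNPs called correctly by a primary probe or, failing that, a rescue probe."""
    return primary_rate + (1.0 - primary_rate) * rescue_rate


rate = combined_success(0.65, 0.60)  # 0.65 + 0.35 * 0.60
print(f"{rate:.0%}")
```

The same function shows how quickly returns diminish: a third probe would only act on the remaining 14% of SNPs.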
Figure 4. Detecting crossovers. A chromosomal SNP map for a recombinant progeny (clone A6AF) of a GT1 (type I) X CTG (type III) cross is represented, along with the published (triangles) and array-based (lines) genotyping calls for this clone. There is almost total agreement between markers called by both methods (>98.5%). The inset table summarizes the benefits of mapping crossovers using the array across 5 randomly selected progeny, showing that on average more breakpoints are discovered and that they are localized to regions approximately 11-fold smaller. The numbers in parentheses in the breakpoint columns represent previous results using RFLP analysis.
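The reason denser markers narrow breakpoint regions can be illustrated with a small sketch: a crossover is inferred wherever two adjacent markers on a chromosome carry different parental alleles, and the breakpoint can only be localized to the interval between them. The marker positions and calls below are hypothetical, and this is not the genotyping procedure used in the study.

```python
def find_crossovers(markers):
    """Return breakpoint intervals from (position_bp, parental_allele) calls sorted by position.

    Each interval (left_pos, right_pos) is bounded by adjacent markers whose
    parental origin differs; markers with no call (None) are skipped.
    """
    calls = [(pos, allele) for pos, allele in markers if allele is not None]
    return [
        (left_pos, right_pos)
        for (left_pos, left_allele), (right_pos, right_allele) in zip(calls, calls[1:])
        if left_allele != right_allele
    ]


# Hypothetical calls on one chromosome: "I" = GT1-like, "III" = CTG-like
sparse = [(100_000, "I"), (900_000, "III")]
dense = [(100_000, "I"), (400_000, "I"), (450_000, "III"), (900_000, "III")]

for label, markers in [("sparse", sparse), ("dense", dense)]:
    for left, right in find_crossovers(markers):
        print(f"{label}: breakpoint between {left} and {right} ({right - left} bp)")
```

With the sparse marker set the breakpoint is confined to an 800 kb interval, whereas the denser set narrows the same crossover to 50 kb.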
Figure 5. SNP discovery and gene expression profiling in the apicoplast. The T. gondii plastid (apicoplast; RH strain sequence) was tiled at 25 nt resolution on alternating strands, allowing probe-level expression profiling across the entire organelle. Expression patterns (inner circle; red and blue bars represent opposite strands and high absolute expression; grey bars represent low expression levels) are consistent with operon transcription, with two major origins of transcription evident at the LSU rRNA genes, running in opposite directions (as indicated by the arrows). SFPs were also uncovered using DNA hybridization differences between GT1 (type I), Pru (type II), and CTG (type III), revealing 43 type II SNPs (green diamonds), 12 type III SNPs (blue diamonds), and no type I SNPs (red diamonds).
Figure 6. Top panel shows a 50 kb region of chromosome Ib, illustrating, in addition to the 3'-biased expression profiling probes that are available genome-wide, the high density of probes available for this chromosome, including intron, exon, and antisense probes for each annotated gene (blue genes run from left to right; red from right to left). Transcript discovery probes interrogate unannotated EST clusters (≥3 ESTs) and ORFs (≥150 nt) that intersect with BLAST hits (bitscore ≥100). A barplot provides normalized probe-level expression data (union of sense probe intensities from antisense target kits, antisense probe intensities from sense kit), indicating probable expression of unannotated EST clusters and BLAST hits. See text and Table 1 for further details. Bottom panel displays a 6 kb span at higher resolution, illustrating the validation of gene structure, and comparable transcription levels in upstream ESTs that may correspond to non-coding exons.
Figure 7. Validation of gene models. At a false discovery rate of 10%, ~64% of all exons on chromosome Ib are called "Present" (see Materials & Methods for details). These P/A calls are highly non-random in their distribution, with multi-exon genes showing good consistency among their individual exon calls (i.e. the majority of genes show expression of most annotated exons, or no expression at all). Among genes with inconsistent P/A calls, expression patterns are often clustered, suggesting alternative gene models. For example, the expression patterns associated with the kinesin motor domain-containing protein (25.m01768) suggest a coding start site that begins with the eighth exon.