Saneyoshi Ueno, Yoshinari Moriguchi, Kentaro Uchiyama, Tokuko Ujino-Ihara, Norihiro Futamura, Tetsuya Sakurai, Kenji Shinohara, Yoshihiko Tsumura.
Abstract
BACKGROUND: Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second-generation sequencing machines has led to new strategies for developing EST-SSR markers, necessitating the development of a bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica).
Year: 2012 PMID: 22507374 PMCID: PMC3424129 DOI: 10.1186/1471-2164-13-136
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Sequencing, masking and trimming statistics for unigene assembly
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Number of reads (1) | 39936 | N/A | N/A | 3456 | 768 | 5184 | 129195 | 1333444 |
| Number of 3' ESTs in (1) | 19968 | N/A | N/A | 2880 | 0 | 0 | 62776 | - |
| Number of base calls (amount of sequence in Mbp) for (1) | 30.12 | N/A | N/A | 2.97 | 0.53 | 4.95 | 141.66 | 393.8 |
| Amount of bases (Mbp) masked after cross_match | 4.07 | N/A | N/A | 0.16 | 0.16 | 0.59 | 14.04 | 0.09 |
| Number of reads after SeqClean (reads passed to assembly) (2) | 36722 | N/A | N/A | 3175 | 485 | 4435 | 118319 | 1201150 |
| Amount of reads (Mbp) after SeqClean (reads passed to assembly) | 19.95 | N/A | N/A | 1.96 | 0.23 | 2.26 | 78.59 | 354.5 |
| Average read length (bp) in (2) with phred QV ≥ 20 | 535.8 | N/A | N/A | 609.3 | 460.2 | 500.5 | 654.2 | 282.7 |
| Tissue/developmental stage | Male bud | Female bud | Leaf | Male flower | Female flower | Inner bark | – | Seedling |
| References | [ | Futamura et al. in prep. | Futamura et al. in prep. | [ | [ | [ | – | This study |
N/A: See Futamura et al. in preparation for details.
Figure 1. Schematic representation of the bioinformatic analysis. Different colours correspond to different kinds of analysis (cleaning in yellow, assembly in red, comparative analysis in purple, location (UTR or coding) analysis based on peptide prediction in orange, gene ontology based analysis in blue and EST-SSR primer design in green). The † mark indicates a logical link between the estimated SSR location and comparative analysis.
SSR motifs and their frequency in 3′ UTR, 5′ UTR and coding regions
| di | AT | 154 | 70 | 73 | 6 | 303 |
| | AG | 42 | 53 | 155 | 6 | 256 |
| | AC | 27 | 18 | 46 | 3 | 94 |
| | CG | 0 | 0 | 2 | 0 | 2 |
| | sub-total | 223 | 141 | 276 | 15 | 655 |
| tri | AAG | 19 | 36 | 285 | 2 | 342 |
| | ATG | 19 | 32 | 178 | 1 | 230 |
| | AGG | 7 | 49 | 170 | 0 | 226 |
| | AGC | 18 | 34 | 129 | 2 | 183 |
| | AAT | 55 | 22 | 49 | 2 | 128 |
| | ACC | 6 | 7 | 64 | 0 | 77 |
| | GGC | 9 | 15 | 38 | 0 | 62 |
| | AAC | 11 | 9 | 32 | 0 | 52 |
| | ACG | 0 | 1 | 11 | 0 | 12 |
| | AGT | 2 | 1 | 4 | 0 | 7 |
| | sub-total | 146 | 206 | 960 | 7 | 1319 |
| tetra | | 61 | 65 | 60 | 8 | 194 |
| penta | | 180 | 193 | 349 | 19 | 741 |
| hexa | | 154 | 225 | 703 | 12 | 1094 |
| compound | | 15 | 5 | 35 | 1 | 56 |
| Total | | 779 | 835 | 2383 | 62 | 4059 |
SSR location was estimated by inferring coding regions using the prot4EST pipeline [44] and the fasty35 module of the FASTA package [49]. Some locations could not be determined because the corresponding SSRs extended over both coding and non-coding regions.
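The location rule in this footnote can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_ssr_location`, the 0-based half-open coordinates and the requirement that the contig is already oriented 5′ to 3′ are all hypothetical. In the study the coding region itself came from the prot4EST and fasty35 predictions, which this sketch takes as given.

```python
from typing import Tuple


def classify_ssr_location(ssr: Tuple[int, int], cds: Tuple[int, int]) -> str:
    """Classify an SSR relative to a predicted coding region.

    Both arguments are (start, end) pairs in 0-based, half-open coordinates
    on a contig assumed to be oriented 5' to 3' (an assumption of this sketch).
    """
    ssr_start, ssr_end = ssr
    cds_start, cds_end = cds
    if ssr_end <= cds_start:
        # The SSR ends before the coding region begins.
        return "5' UTR"
    if ssr_start >= cds_end:
        # The SSR begins after the coding region ends.
        return "3' UTR"
    if ssr_start >= cds_start and ssr_end <= cds_end:
        # The SSR lies entirely inside the coding region.
        return "coding"
    # The SSR straddles a boundary between coding and non-coding sequence.
    return "undetermined"


if __name__ == "__main__":
    cds = (50, 500)  # hypothetical predicted coding region
    for ssr in [(10, 30), (100, 130), (490, 520), (510, 540)]:
        print(ssr, classify_ssr_location(ssr, cds))
```

An SSR that overlaps either end of the coding region falls through to the last branch, which corresponds to the small undetermined class in the table.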
Figure 2. Frequency distribution of SSRs by motif and repeat length in CjCon1.
Figure 3. SSR frequency and density (/kbp) within each library. SSR frequency was defined as the percentage of SSR-containing sequences within contigs, while SSR density was calculated as the number of SSRs in 10 kbp of contigs.
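The two quantities defined in this caption can be computed as in the following Python sketch. It is an illustration only: the function name `ssr_frequency_and_density`, the input layout (one `(contig_length_bp, n_ssrs)` pair per contig) and the `per_bp` parameter are assumptions. The default of 10,000 bp follows the definition sentence of the caption.

```python
from typing import Iterable, Tuple


def ssr_frequency_and_density(
    contigs: Iterable[Tuple[int, int]], per_bp: int = 10_000
) -> Tuple[float, float]:
    """Return (frequency in %, density per `per_bp` bases).

    `contigs` holds one (contig_length_bp, n_ssrs) pair per contig;
    this input layout is an assumption of the sketch.
    """
    contigs = list(contigs)
    total_bp = sum(length for length, _ in contigs)
    if not contigs or total_bp == 0:
        raise ValueError("at least one contig with non-zero length is required")
    # Frequency: share of contigs that contain at least one SSR.
    n_with_ssr = sum(1 for _, n_ssrs in contigs if n_ssrs > 0)
    frequency = 100.0 * n_with_ssr / len(contigs)
    # Density: total SSR count scaled to `per_bp` bases of contig sequence.
    total_ssrs = sum(n_ssrs for _, n_ssrs in contigs)
    density = total_ssrs * per_bp / total_bp
    return frequency, density


if __name__ == "__main__":
    toy_library = [(1000, 1), (2000, 0), (3000, 2), (4000, 0)]  # made-up values
    print(ssr_frequency_and_density(toy_library))
```

Frequency counts contigs, whereas density counts SSRs, so a library with a few SSR-rich contigs can show a low frequency and a high density at the same time.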
Figure 4. Relationship between genome size and SSR frequency. SSR frequencies were plotted against genome size (Mbp) on a log scale. Gene indices are abbreviated as follows: AGI, Arabidopsis thaliana; HAGI, Helianthus annuus; NTGI, Nicotiana tabacum; OGI, oak; OSGI, Oryza sativa; PGI, Pinus; and SGI, Picea. The genome sizes of Pinus taeda and Picea abies were used for PGI and SGI, respectively.
Figure 5. SSR frequency according to estimated location (coding, 3′ UTR or 5′ UTR).
Factors affecting (a) PCR success and (b) levels of polymorphism, analyzed using generalized linear models
| Factor | Estimate | Standard error | z value | P value |
|---|---|---|---|---|
| a) | | | | |
| Pipeline | | | −0.348 | 0.728 |
| CMiB | 0 | 0 | | |
| read2Marker | −0.1074 | 0.3090 | | |
| Primer location | | | −0.513 | 0.6081 |
| coding | 0 | 0 | | |
| others | −0.1590 | 0.3101 | | |
| Sum of primer melting temperature | 0.1420 | 0.1191 | 1.192 | 0.2333 |
| Expected PCR product size | −0.0033 | 0.0015 | −2.252 | 0.0244 |
| b) | | | | |
| Pipeline | | | 0.473 | 0.636 |
| CMiB | 0 | 0 | | |
| read2Marker | 0.0727 | 0.1536 | | |
| SSR location | | | −1.486 | 0.137 |
| coding | 0 | 0 | | |
| others | −0.2824 | 0.1901 | | |
| Maximum No. of SSR repeats | 0.1344 | 0.0233 | 5.782 | 7.39E-09 |
| SSR motif (number of SSR repeat unit) | −0.0966 | 0.1534 | −0.63 | 0.529 |
PCR success was coded as a binary variable that took a value of 1 for success and 0 for failure. The primer melting temperature (Tm) was summed over both primers in a pair. The R function call used to model PCR success was: glm(formula = PCR.success ~ Pipeline + Primer.location + Sum.of.primer.Tm + Expected.PCR.product.size, family = binomial). The level of polymorphism was expressed as the number of alleles per locus (Na) and was analyzed using the following function call in R: glm(formula = Na ~ Pipeline + SSR.location + Maximum.No..of.SSR.repeats + SSR.motif, family = poisson). SSR motif corresponded to the number of bases in the SSR repeat unit; di-, tri-, tetra-, penta- and hexa-SSRs were coded as 2, 3, 4, 5 and 6, respectively.
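For binomial and Poisson families, R's glm summary reports a Wald z statistic (the estimate divided by its standard error) and a two-sided P value from the standard normal distribution. The Python sketch below reproduces that arithmetic for one row of the table. It is a check on the tabulated numbers, not a refit of the models, and the function names `wald_z` and `two_sided_p` are hypothetical.

```python
import math


def wald_z(estimate: float, std_error: float) -> float:
    """Wald statistic: coefficient estimate divided by its standard error."""
    return estimate / std_error


def two_sided_p(z: float) -> float:
    """Two-sided P value of z under a standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Row "Maximum No. of SSR repeats" in panel (b): estimate and standard error.
    z = wald_z(0.1344, 0.0233)
    p = two_sided_p(z)
    print(f"z = {z:.3f}, P = {p:.2e}")
```

Because the table lists estimates and standard errors rounded to four decimal places, a recomputed z can differ slightly from the tabulated value.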