Nathalie Pavy, Charles Paule, Lee Parsons, John A Crow, Marie-Josee Morency, Janice Cooke, James E Johnson, Etienne Noumen, Carine Guillet-Claude, Yaron Butterfield, Sarah Barber, George Yang, Jerry Liu, Jeff Stott, Robert Kirkpatrick, Asim Siddiqui, Robert Holt, Marco Marra, Armand Seguin, Ernest Retzel, Jean Bousquet, John MacKay.
Abstract
BACKGROUND: The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).
Year: 2005 PMID: 16236172 PMCID: PMC1277824 DOI: 10.1186/1471-2164-6-144
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Sequencing and quality parameters of white spruce cDNA libraries. Quality reads had a Phred score above 20 over at least 100 bp after vector trimming.
| Libraries, treatments and tissues | Number of 3' reads | Number of 5' reads | Library quality: % empty | Library quality: % >1.6 kb | Sequence quality: number of quality reads | Sequence quality: % quality reads | Sequence quality: average length of quality reads (nt) |
| Male strobili development sequence | 1,536 | 1,536 | 4 | 19 | 2,589 | 84 | 527 |
| Female cones development sequence | 1,536 | 1,536 | 15 | 9 | 2,324 | 76 | 500 |
| Vegetative buds development sequence | 1,536 | 0 | 5 | 15 | 1,062 | 69 | 560 |
| Secondary xylem – mature trees | 4,608 | 4,608 | 10 | 27 | 7,735 | 84 | 600 |
| Cambium, phloem – mature trees | 4,608 | 3,072 | 2 | 8 | 6,705 | 87 | 635 |
| Secondary xylem – girdled seedlings | 3,072 | 0 | 9 | 24 | 1,053 | 69 | 556 |
| Cambium to bark – girdled seedlings | 1,536 | 1,536 | NA | NA | 937 | 31 | 577 |
| Elongating root tips – saplings | 1,536 | 1,536 | 6 | 19 | 1,053 | 69 | 395 |
| Primary, secondary shoots-N treatments | 3,072 | 1,536 | 16 | 50 | 3,031 | 66 | 736 |
| Immature somatic embryos | 3,072 | 0 | 4 | 44 | 2,220 | 72 | 692 |
| Clean roots systems – N treatments | 1,536 | 0 | 7 | 37 | 858 | 56 | 659 |
| Clean roots systems – P treatments | 3,072 | 1,536 | 15 | 19 | 3,776 | 82 | 705 |
| Clean roots systems – Diurnal cycle | 6,144 | 4,608 | 16 | 33 | 8,601 | 80 | 757 |
| Root secondary xylem – mature trees | 3,072 | 0 | 7 | 8 | 1,532 | 50 | 598 |
| Annual flush shoots diurnal cycle – trees | 4,608 | 3,072 | 11 | 10 | 5,164 | 67 | 658 |
| Needles – N fertilization treatments | 1,536 | 0 | 15 | 20 | 461 | 30 | 686 |
| Total | 46,848 | 24,576 | | | 49,101 | | |
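The quality criterion in the table caption (a Phred score above 20 over at least 100 bp after vector trimming) can be illustrated with a short sketch. It assumes that "over at least 100 bp" means one contiguous run of bases, which the caption does not state explicitly; the function names are hypothetical and this is not the pipeline the authors used.

```python
def longest_high_quality_run(phred_scores, min_score=20):
    """Length of the longest contiguous run of bases scoring strictly above min_score."""
    best = current = 0
    for score in phred_scores:
        if score > min_score:
            current += 1
            best = max(best, current)
        else:
            current = 0  # a low-quality base breaks the run
    return best


def is_quality_read(phred_scores, min_score=20, min_length=100):
    """Assumed reading of the caption: one run of >= min_length bases above min_score."""
    return longest_high_quality_run(phred_scores, min_score) >= min_length
```

Under this reading, a vector-trimmed read with 100 consecutive bases at Phred 30 passes, while a read whose high-scoring stretch is interrupted after 99 bases does not.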
Figure 1. Composition of white spruce consensus sequences (contigs and singletons) according to the orientation of the reads (3' or 5') and according to their redundancy in the database (number of clones).
Figure 2. Sequence sizes. Size distribution of the consensus sequences derived from the pine (PGI5.0) and white spruce (ArboreaSet) assemblies.
Contig groups according to several levels of sequence identity based on 100 nt of overlap
| Number of contigs per group | 90% | 96% | 98% | 99% |
| 1 | 10,036 | 10,997 | 11,767 | 13,295 |
| 2 | 1,576 | 1,422 | 1,377 | 1,083 |
| 3 | 443 | 386 | 341 | 210 |
| 4 | 175 | 153 | 103 | 61 |
| 5 | 93 | 72 | 48 | 22 |
| 6 | 52 | 40 | 26 | 6 |
| 7 | 15 | 17 | 10 | 2 |
| 8 | 13 | 8 | 7 | 1 |
| 9 | 10 | 1 | 3 | 1 |
| ≥10 | 21 | 12 | 3 | 3 |
| Total number of groups | 12,435 | 13,109 | 13,686 | 14,685 |
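One way to obtain group counts like those in the table above is to link contigs whose pairwise alignments meet the identity threshold over at least 100 nt, then count the resulting groups by size. The sketch below assumes single-linkage grouping (one qualifying pair is enough to merge two groups), which the source does not specify; the input format and function names are hypothetical.

```python
from collections import Counter


def group_contigs(contig_ids, pairwise_hits, min_identity, min_overlap=100):
    """Single-linkage grouping with union-find.

    pairwise_hits: iterable of (contig_a, contig_b, percent_identity, overlap_nt).
    """
    parent = {c: c for c in contig_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, identity, overlap in pairwise_hits:
        if identity >= min_identity and overlap >= min_overlap:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for c in contig_ids:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())


def group_size_histogram(groups):
    """Map group size -> number of groups of that size."""
    return Counter(len(g) for g in groups)
```

Raising `min_identity` splits groups apart, which matches the trend in the table: more single-contig groups and a larger total number of groups at 99% than at 90%.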
Figure 3. Sequence similarities. Number of white spruce transcript sequences similar to Uniref100 proteins, Arabidopsis, pine and Cycas according to the BLAST e-value cutoff.
Figure 4. Hierarchical presentation of the number of spruce transcripts with or without similarities with pine, . The numbers were derived by filtering tblastx searches with an e-value < 1e-10.
Consensus sequences correlated to terms belonging to the "molecular function" categories of the Gene Ontology
| Molecular functions | Including electronic annotations: number of consensus sequences | Including electronic annotations: % of the number of annotated consensus sequences | Including electronic annotations: % of the total number of consensus sequences | Excluding electronic annotations: number of consensus sequences | Excluding electronic annotations: % of the number of annotated consensus sequences | Excluding electronic annotations: % of the total number of consensus sequences |
| Triplet codon-amino acid adaptor activity | 0 | 0 | 0 | 0 | 0 | 0 |
| Chaperone regulator activity | 0 | 0 | 0 | 0 | 0 | 0 |
| Motor activity | 23 | 0.35 | 0.14 | 3 | 0.06 | 0.02 |
| Enzyme regulator activity | 47 | 0.71 | 0.28 | 27 | 0.53 | 0.16 |
| Nutrient reservoir activity | 50 | 0.76 | 0.30 | 4 | 0.08 | 0.02 |
| Translation regulator activity | 70 | 1.06 | 0.42 | 59 | 1.16 | 0.36 |
| Antioxidant activity | 73 | 1.10 | 0.44 | 52 | 1.02 | 0.31 |
| Signal transducer activity | 77 | 1.16 | 0.46 | 33 | 0.65 | 0.2 |
| Obsolete molecular function | 113 | 1.71 | 0.68 | 76 | 1.5 | 0.46 |
| Transcription regulator activity | 118 | 1.78 | 0.71 | 73 | 1.44 | 0.44 |
| Chaperone activity | 166 | 2.51 | 1 | 142 | 2.79 | 0.86 |
| Structural molecule activity | 283 | 4.28 | 1.70 | 240 | 4.72 | 1.45 |
| Transporter activity | 503 | 7.60 | 3.03 | 335 | 6.59 | 2.02 |
| Binding | 1,248 | 18.87 | 7.52 | 741 | 14.6 | 4.46 |
| Molecular function unknown | 1,340 | 20.26 | 8.07 | 1,340 | 26.4 | 8.07 |
| Catalytic activity | 2,504 | 37.85 | 15.08 | 1,956 | 38.5 | 11.8 |
| Total | 6,615 | 100 | 39.84 | 5,081 | 100 | 30.6 |
Figure 5Protein families. Occurrence of the 30 most abundant protein families in the white spruce dataset identified by HMM searches with an e-value < 1e-10 against the PFAM database.
Identification of transcripts encoding putative regulatory proteins. Sequences were identified based on HMM searches supported by a p-score < 1e-10 with PFAM profiles available for families of regulatory proteins. PFAM accessions for which no homology was found in SpruceDB through HMM searches are not reported.
| Protein family | PFAM accession | Number of spruce transcripts |
| Zinc finger, C3HC4 type (RING finger) | PF00097 | 66 |
| WD, G-beta repeat | PF00400 | 44 |
| AP2 domain-B3 DNA binding domain | PF00847 | 19 |
| HMG (high mobility group) box | PF00505 | 16 |
| MADS Family – SRF-type transcription factor – K-box region | PF00319 | 14 |
| MYB DNA-binding | PF00249 | 13 |
| AUX/IAA | PF02309 | 12 |
| Histone-like transcription factor (CBF/NF-Y) and archaeal histone | PF00808 | 11 |
| PHD finger – CW-type Zinc Finger | PF00628 | 10 |
| No apical meristem (NAM) protein | PF02365 | 10 |
| GRAS Family | PF03514 | 10 |
| WRKY DNA-binding domain | PF03106 | 9 |
| NAC domain | PF01849 | 9 |
| Homeobox domain | PF00046 | 8 |
| bZIP transcription factor – bZIP Maf transcription factor-G-box binding protein MFMR | PF00170 | 8 |
| B-box zinc finger | PF00643 | 6 |
| TUB Family | PF01167 | 6 |
| Helix-loop-helix DNA-binding domain – Myc amino-terminal region | PF00010 | 5 |
| KNOX2 domain | PF03791 | 3 |
| LIM domain family – PET Domain | PF00412 | 5 |
| Dof domain, zinc finger | PF02701 | 4 |
| GATA zinc finger | PF00320 | 3 |
| TCP family transcription factor | PF03634 | 2 |
| CCAAT-HAP2 Family CCAAT-binding transcription factor (CBF-B/NF-YA) subunit B | PF02045 | 2 |
| SBP (Squamosa-promoter binding protein) floral development | PF03110 | 1 |
| HSF Family (Heat shock protein promoter binding) | PF00447 | 1 |
| EIL Family ethylene insensitive 3 | PF04873 | 1 |
| B3 DNA binding domain | PF02362 | 1 |
| ARID/BRIGHT DNA binding domain – ELM2 domain | PF01388 | 1 |
Figure 6. Number of spruce consensus sequences (identified by HMM searches against PFAM) relative to the size of the gene families in Arabidopsis and rice. Each point represents a protein family detected by the HMM searches with p-score < 1e-10. Point coordinates are the number of genes found in the analysed Angiosperm genome (x axis) and the number of contigs found in the spruce database (y axis), after a log transformation. The red, blue and green lines represent the ratios 1:1, 1:2, and 1:4, respectively. Red points represent sequences found 4 times more in white spruce than in Arabidopsis: 1. AWPM-19-like family [PF05512], 2. Chalcone and stilbene synthases, C-terminal domain [PF02797], 3. Phosphoenolpyruvate carboxykinase [PF01293]. Blue points represent sequences found 4 times more in spruce than in rice: 4. Ribosomal protein S28e [PF01200], 5. Cyclin-dependent kinase regulatory subunit [PF01111], 6. TIR domain [PF01582], 7. Splicing factor 3B subunit 10 [PF07189], 8. Ribosomal Proteins L2, C-terminal domain [PF03947]. Green points represent sequences found 4 times more in spruce than in both Arabidopsis and rice: 9. Translationally controlled tumour protein [PF00838], 10. S-adenosyl-L-homocysteine hydrolase [PF05221], 11. S-adenosylmethionine synthetase, C-terminal domain [PF02773].
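The colour coding described in this caption amounts to comparing, for each protein family, the number of spruce contigs with the number of genes in each angiosperm genome. A minimal sketch follows; it assumes that "4 times more" means the spruce count is at least four times the angiosperm count, and that families with no spruce contig are left unclassified. Both are assumptions, and the function name and the counts in the usage note are hypothetical.

```python
def classify_family(spruce, arabidopsis, rice, fold=4):
    """Return which angiosperm genome(s) a family is over-represented against.

    Counts are numbers of spruce contigs and angiosperm genes for one PFAM family.
    The >= fold * count rule is an assumed reading of the caption.
    """
    if spruce <= 0:
        return "none"
    over_arabidopsis = spruce >= fold * arabidopsis
    over_rice = spruce >= fold * rice
    if over_arabidopsis and over_rice:
        return "both"          # green points in the caption
    if over_arabidopsis:
        return "arabidopsis"   # red points
    if over_rice:
        return "rice"          # blue points
    return "none"
```

For example, with invented counts, `classify_family(12, 3, 10)` returns `"arabidopsis"`, because 12 is at least 4 × 3 but less than 4 × 10.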
Pairwise comparison of white spruce consensus sequences related to the translationally controlled tumour proteins (TCTP). Nucleic acid identities were determined using the Smith-Waterman algorithm (water) available in the EMBOSS suite [71], in a 138 bp region of the 5' UTR immediately upstream of the first codon (ATG) (above the diagonal) and along the complete consensus sequences (below the diagonal). The diagonal shows the contig length.
| | Sequence10076 | Sequence10707 | Sequence9531 | Sequence7749 | Sequence1882 |
| Sequence10076 | 805 | 88/162 (54.3%) | 54/84 (64.3%) | 70/159 (44.0%) | 83/144 (57.6%) |
| Sequence10707 | 761/890 (85.5%) | 977 | 111/157 (70.7%) | 71/147 (48.3%) | 99/154 (64.3%) |
| Sequence9531 | 759/889 (85.4%) | 925/1034 (89.5%) | 1124 | 65/133 (48.9%) | 101/159 (63.5%) |
| Sequence7749 | 515/659 (78.1%) | 548/736 (74.5%) | 596/938 (63.5%) | 945 | 73/147 (49.7%) |
| Sequence1882 | 719/815 (88.2%) | 742/823 (90.2%) | 750/906 (82.8%) | 523/687 (76.1%) | 796 |
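Each cell in the table above reports the number of identical positions over the length of a local alignment. The sketch below shows how such a pair of numbers arises from a Smith-Waterman alignment. It is a simplified illustration, not the EMBOSS `water` program: it uses a linear gap penalty, and the scoring values are assumptions chosen for the example, so its results will not necessarily match those in the table.

```python
def smith_waterman_identity(seq_a, seq_b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty.

    Returns (identical_positions, alignment_length) for one best-scoring
    local alignment. Scoring parameters are illustrative assumptions.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix, flooring every cell at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(0,
                       score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    identities = length = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identities += int(seq_a[i - 1] == seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in seq_b
        else:
            j -= 1  # gap in seq_a
        length += 1
    return identities, length
```

The percentage in each cell is then `100 * identities / length`; for instance, 8 identities over an alignment of 9 columns gives 88.9%.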
Pairwise comparison of white spruce consensus sequences related to the S-adenosylmethionine synthetase (SAMS). Nucleic acid identities were determined using the Smith-Waterman algorithm (water) available in the EMBOSS suite [71], in a 99 bp region of the 3' UTR immediately downstream of the stop codon (above the diagonal) and along the complete consensus sequences (below the diagonal). The diagonal shows the contig length.
| | Sequence10446 | Sequence10482 | Sequence10630 | Sequence10683 | Sequence10828 | Sequence8600 | Sequence9676 |
| Sequence10446 | 1677 | 46/97 (47.4%) | 48/113 (42.5%) | 51/117 (43.6%) | 45/85 (52.9%) | 85/106 (80.2%) | 44/98 (44.9%) |
| Sequence10482 | 1096/1607 (68.2%) | 1467 | 54/78 (69.2%) | 65/92 (70.7%) | 50/84 (59.5%) | 49/114 (43%) | 45/96 (46.9%) |
| Sequence10630 | 1126/1641 (68.6%) | 1343/1557 (86.3%) | 1540 | 69/113 (61.1%) | 48/78 (61.5%) | 47/95 (49.5%) | 55/111 (49.5%) |
| Sequence10683 | 1143/1711 (66.8%) | 1342/1521 (88.2%) | 1357/1582 (85.8%) | 1531 | 49/103 (47.6%) | 58/116 (50%) | 46/117 (39.3%) |
| Sequence10828 | 1202/1814 (66.3%) | 1262/1534 (82.3%) | 1343/1714 (78.4%) | 1306/1604 (81.4%) | 1679 | 49/95 (51.6%) | 49/109 (45%) |
| Sequence8600 | 1349/1691 (79.8%) | 1058/1536 (68.9%) | 1089/1532 (71.1%) | 1092/1583 (69%) | 1120/1656 (67.6%) | 1476 | 41/71 (57.7%) |
| Sequence9676 | 1025/1418 (72.3%) | 1314/1459 (90.1%) | 1276/1397 (91.3%) | 1261/1381 (91.3%) | 1179/1369 (86.1%) | 1026/1462 (70.2%) | 1356 |
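The cells of the two pairwise tables follow the pattern `identities/length (percent%)`, so the reported percentage can be recomputed from the fraction as a consistency check. The helper below is a small sketch with hypothetical function names; the rounding tolerance is an assumption that allows for percentages printed with zero or one decimal place.

```python
import re

CELL_PATTERN = re.compile(r"(\d+)/(\d+)\s*\((\d+(?:\.\d+)?)%\)")


def parse_identity(cell):
    """Parse 'identities/length (percent%)' into (int, int, float)."""
    found = CELL_PATTERN.search(cell)
    if found is None:
        raise ValueError(f"not an identity cell: {cell!r}")
    return int(found.group(1)), int(found.group(2)), float(found.group(3))


def is_consistent(cell, tolerance=0.051):
    """True when the printed percentage matches the fraction within rounding."""
    identities, length, reported = parse_identity(cell)
    return abs(100.0 * identities / length - reported) <= tolerance
```

For example, `is_consistent("1096/1607 (68.2%)")` holds because 1096/1607 is about 68.20%.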
Figure 7. SpruceDB core tables and data sources. Data from flat files on ESTs, assemblies and BLAST hits are loaded into the core tables Read, Contig, Contig_Element and Blast_Hsp. Additional information on taxonomy identifiers and Uniref100 peptides is obtained from shared databases.
Figure 8. Examples of the SpruceDB database interface. A) Use of Query 1 to search for contigs matching "cinnamoyl alcohol dehydrogenase" among the blastx results loaded in the database. B) Display of the results, indicating alignment parameters (alignment length, similarity and identity levels). C) BioDATA page reached by clicking on MNC5693153 in the Query 1 results. The upper figure illustrates the alignment of the members of the contigs in a color-coded manner. Read names written in blue and white refer to 5' and 3' reads, respectively. D) Query 8, which retrieves sequence aliases and library names for specified MN_Ids. E) Query 8 results showing libraries GQ004 and GQ006.
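The relationship between the four core tables and a keyword search like Query 1 can be sketched with an in-memory SQLite database. Only the table names (Read, Contig, Contig_Element, Blast_Hsp) come from the captions; every column name, the choice of SQLite, and the function names are assumptions made for illustration and do not describe the actual SpruceDB schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical columns for the core tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "Read" (read_id TEXT PRIMARY KEY, library TEXT, direction TEXT);
        CREATE TABLE "Contig" (contig_id TEXT PRIMARY KEY, length INTEGER);
        CREATE TABLE "Contig_Element" (
            contig_id TEXT REFERENCES "Contig"(contig_id),
            read_id   TEXT REFERENCES "Read"(read_id)
        );
        CREATE TABLE "Blast_Hsp" (
            contig_id       TEXT REFERENCES "Contig"(contig_id),
            hit_description TEXT,
            align_length    INTEGER,
            pct_identity    REAL,
            evalue          REAL
        );
    """)
    return conn


def contigs_matching(conn, keyword):
    """Keyword search over BLAST hit descriptions, in the spirit of Query 1."""
    rows = conn.execute(
        'SELECT contig_id, hit_description, align_length, pct_identity '
        'FROM "Blast_Hsp" WHERE hit_description LIKE ? ORDER BY contig_id',
        (f"%{keyword}%",),
    )
    return rows.fetchall()
```

After loading a few rows, `contigs_matching(conn, "cinnamoyl alcohol dehydrogenase")` returns the matching contigs with their alignment length and identity, which corresponds to the kind of result listing shown in panel B.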