Lawrence Hene, Vattipally B Sreenu, Mai T Vuong, S Hussain I Abidi, Julian K Sutton, Sarah L Rowland-Jones, Simon J Davis, Edward J Evans.
Abstract
BACKGROUND: Deep transcriptome analysis will underpin a large fraction of post-genomic biology. 'Closed' technologies, such as microarray analysis, only detect the set of transcripts chosen for analysis, whereas 'open' technologies, e.g. tag-based methods, are capable of identifying all possible transcripts, including those that were previously uncharacterized. Although new technologies are now emerging, at present the major resources for open-type analysis are the many publicly available SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing) libraries. These technologies have never been compared for their utility in the context of deep transcriptome mining.
Year: 2007 PMID: 17892551 PMCID: PMC2104538 DOI: 10.1186/1471-2164-8-333
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Effect of total number of tags sequenced on number of distinct tag sequences identified. LongSAGE and MPSS libraries produced from an activated CD4+ T-cell clone were sampled at various sizes to examine the effect of library size on the number of distinct tag sequences identified. If the library is large enough to sample all available tags, then increasing the library size will not increase the number of sequences detected. Closed diamonds represent all tags in the library. Open circles represent only those tags that exactly match either the genome or the transcriptome (i.e. excluding possible sequencing errors but also polymorphisms and some tags crossing splice junctions).
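The sub-sampling described in the Figure 1 legend can be sketched as a simple rarefaction calculation. This is a minimal illustration, not the authors' code: the function name `rarefaction_curve`, the toy tag counts and the choice of sampling without replacement are all assumptions made here.

```python
import random


def rarefaction_curve(tag_counts, sample_sizes, seed=0):
    """Return [(sample_size, distinct_tags)] for random subsamples of a library.

    tag_counts maps each tag sequence to the number of times it was sequenced.
    Sampling without replacement is an assumption of this sketch.
    """
    # Expand the count table into one entry per sequenced tag.
    pool = [tag for tag, n in tag_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    curve = []
    for size in sample_sizes:
        sample = rng.sample(pool, min(size, len(pool)))
        curve.append((size, len(set(sample))))
    return curve


# Toy library with made-up counts (not data from the study).
toy_library = {"AAAA": 50, "CCCC": 30, "GGGG": 15, "TTTT": 4, "ACGT": 1}
curve = rarefaction_curve(toy_library, [10, 50, 100])
```

If the distinct-tag count stops rising as the sample size grows, the library has probably sampled most of the tags available to it, which is the saturation behaviour the legend refers to.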
Figure 2. Number of transcripts in the UTBS dataset identified by LongSAGE and MPSS. The UTBS dataset consists of transcripts containing both NlaIII and DpnII restriction sites and for which all extracted tags are unique in both the transcriptome and the genome. The LongSAGE and MPSS libraries were sampled at various sizes and the numbers of transcripts from the UTBS dataset for which tags were identified were calculated.
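A UTBS-style filter might look roughly like the sketch below. It is a deliberately simplified illustration and not the authors' pipeline: it assumes one tag per enzyme taken at the 3'-most recognition site, checks uniqueness only among the supplied transcripts (the genome check is omitted), and uses a hypothetical `tag_length` parameter and made-up sequences. The recognition sequences CATG (NlaIII) and GATC (DpnII) are the standard ones for these enzymes.

```python
from collections import Counter

NLAIII_SITE = "CATG"  # NlaIII recognition sequence
DPNII_SITE = "GATC"   # DpnII recognition sequence


def three_prime_tag(seq, site, tag_length):
    """Return the tag starting at the 3'-most occurrence of site, or None."""
    pos = seq.rfind(site)
    if pos == -1 or pos + tag_length > len(seq):
        return None
    return seq[pos:pos + tag_length]


def utbs_like_subset(transcripts, tag_length=14):
    """Names of transcripts with both sites and tags unique within the set."""
    tags = {
        name: (three_prime_tag(seq, NLAIII_SITE, tag_length),
               three_prime_tag(seq, DPNII_SITE, tag_length))
        for name, seq in transcripts.items()
    }
    # Count how many transcripts give rise to each tag sequence.
    usage = Counter(t for pair in tags.values() for t in pair if t)
    return sorted(
        name for name, (sage, mpss) in tags.items()
        if sage and mpss and usage[sage] == 1 and usage[mpss] == 1
    )


# Made-up transcripts for illustration only.
toy = {
    "t1": "AAA" + "CATG" + "AAACCCGGGTTT" + "GATC" + "AACCGGTTAACC",
    "t2": "TTT" + "CATG" + "AAACCCGGGAAA" + "GATC" + "TTTTTTTTTT",  # same NlaIII tag as t1
    "t3": "GGG" + "CATG" + "TTTTTTGGG" + "GATC" + "CCCCCCCCC",
    "t4": "AAA" + "CATG" + "CCCCCCCCCC",                            # no DpnII site
}
selected = utbs_like_subset(toy, tag_length=10)
```

Here only `t3` survives: `t1` and `t2` share an NlaIII-derived tag, and `t4` lacks a DpnII site.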
Figure 3. Comparison of tag sequences in three MPSS libraries produced from the same RNA sample. A. The three libraries were sampled to various sizes in a step-wise fashion to examine the effect of library size on the number of distinct tag sequences identified (as done for single SAGE and MPSS libraries in Fig. 1). Closed diamonds represent random sampling of tags from all three libraries combined. Open diamonds represent sampling of each library in turn. Clearly, although the number of distinct species identified by each library (with the possible exception of the third) appears to approach saturation, each library is sampling a different subset of sequences from the initial RNA pool. B. Venn diagrams showing the distribution of tag sequences between the three MPSS libraries. The library represented by the blue circle is the one used in most of the analyses presented in this study. Diagram (i) represents all the different tag sequences in the libraries. Diagram (ii) represents only those tags that match the genome; this reduces the influence of sequencing errors. In both comparisons, the majority of distinct sequences are found in only one library. Diagram (iii) represents known transcripts in the UTBS dataset found expressed in the sense direction. Here the pattern is less marked, but still only half the transcripts were observed in all three libraries (1,312/2,646). The improvement in the correlation of the libraries for known transcripts (i.e. those in the UTBS) was expected because more highly expressed transcripts are more likely to have been previously identified, and therefore known transcripts tend to be more abundant and have a greater chance of being sampled.
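The region counts behind a three-way Venn diagram such as those in Figure 3B can be computed with plain set operations. The following is a minimal sketch with made-up tag identifiers; the function name `venn3_counts` and the toy libraries are assumptions, and only the ratio 1,312/2,646 comes from the legend.

```python
def venn3_counts(a, b, c):
    """Count members of each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy libraries (hypothetical tag identifiers, not data from the study).
lib1 = {"t1", "t2", "t3", "t4"}
lib2 = {"t2", "t3", "t5"}
lib3 = {"t3", "t4", "t6"}
counts = venn3_counts(lib1, lib2, lib3)

# Fraction of UTBS transcripts seen in all three libraries, from the legend.
shared_fraction = 1312 / 2646  # roughly 0.496, i.e. about half
```

The seven region counts always sum to the size of the union of the three sets, which is a convenient sanity check when building such diagrams.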
Effect of tag length on MPSS library complexity
| Tag length sequenced (bp) | Length of tags analysed (bp) | Number of unique tags | Tags matching genome sequence (% of unique tags) |
| 20 | 20 | 14,894 | 11,489 (77%) |
| 20 | 17 | 13,576 | 11,934 (88%) |
| 20 | 14 | 12,509 | 12,372 (99%) |
| 17 | 17 | 18,084 | 14,307 (79%) |
| 17 | 14 | 15,190 | 14,944 (98%) |
| 14 | 14 | 19,931 | 19,402 (97%) |
MPSS tags can be extracted from the same initial dataset to produce tags of different lengths; in this case 14, 17 and 20 bp tags were extracted. After extraction, tags can be computationally shortened to see whether complexity differs between the different tag extractions. Decreasing the tag length sequenced was, unexpectedly, found to increase the complexity of the library. For example, 14,894 different 20 base tags were produced, which contained 13,576 different 17 base sequences if the last 3 bases were ignored. However, if the tags were initially extracted at 17 bases (i.e. ignoring the last annealing step in sequencing) then a library of 18,084 different tag sequences was produced; 4,508 distinct species (18,084 - 13,576) are therefore lost in this last sequencing step. The last column shows how many of the distinct tag species have perfect matches in the human genome, and this is also expressed as a percentage of the species identified (in brackets).
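The computational shortening and the arithmetic in this legend can be checked with a few lines of Python. This is an illustrative sketch only: `distinct_after_truncation` and the toy tags are hypothetical, while the figures in `table_rows` are copied from the table above.

```python
def distinct_after_truncation(tags, length):
    """Count distinct sequences after keeping only the first `length` bases."""
    return len({tag[:length] for tag in tags})


# Toy 20-base tags (made up for illustration; not data from the study).
toy_tags = [
    "GATC" + "A" * 16,
    "GATC" + "A" * 15 + "C",            # differs from the first only at base 20
    "GATC" + "A" * 12 + "C" + "A" * 3,  # differs from the first at base 17
    "GATC" + "C" * 16,
]
toy_counts = {n: distinct_after_truncation(toy_tags, n) for n in (20, 17, 14)}

# Rows of the table: (sequenced bp, analysed bp, unique tags, genome matches).
table_rows = [
    (20, 20, 14894, 11489),
    (20, 17, 13576, 11934),
    (20, 14, 12509, 12372),
    (17, 17, 18084, 14307),
    (17, 14, 15190, 14944),
    (14, 14, 19931, 19402),
]
percent_matching = [round(100 * m / u) for _, _, u, m in table_rows]

# Distinct 17-base species lost when sequencing continues to 20 bases.
lost_in_last_step = 18084 - 13576
```

Shortening a fixed set of tags can only merge sequences, so the distinct count never rises. That is why the larger count obtained by extracting at 17 bases directly (18,084 versus 13,576) points to species being lost during the final sequencing step rather than to an effect of truncation.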