Nicolas Philippe, Anthony Boureux, Laurent Bréhélin, Jorma Tarhio, Thérèse Commes, Eric Rivals.
Abstract
Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts.
Year: 2009 PMID: 19531739 PMCID: PMC2731892 DOI: 10.1093/nar/gkp492
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Background distribution and influence of length on the prediction capacity. (A) Theoretical probabilities of a transcriptomic SAGE tag being located (1 − A′(t)), and being located once (B′(t)) in a random Bernoulli sequence. The former starts at approximately 1 at 14 bp and falls below 0.01 at 19 bp, while the latter reaches its maximum at 18 bp and then decreases towards low values. (B) Influence of the tag length on the prediction capacity shown with Precision–Recall-like curves. The recall (percentage of located tags over all tags) is plotted versus the precision (percentage of uniquely mapped tags over all mapped tags) for each tag length. The blue curve of a human Solexa tag set departs from those of random tags, either theoretically computed (yellow) or empirically estimated (brown curve). The overlapping yellow and brown curves show the validity of the theoretical model. The blue curve remains linear until 19 bp, then bends down.
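As a rough illustration of the kind of background curves described in panel (A), the sketch below computes the chance that a fixed tag of length t occurs at least once, or exactly once, among n positions of a uniform random sequence. It is not the paper's A′(t) and B′(t): the uniform base frequencies, the independence between positions, and the figure of 3 × 10^9 mapping positions are all assumptions made here, so the values need not match the plotted curves.

```python
import math

# Assumed number of mapping positions, chosen only for illustration.
GENOME_POSITIONS = 3_000_000_000


def p_located(t: int, n: int = GENOME_POSITIONS) -> float:
    """Probability that a fixed tag of length t matches at least one of n positions."""
    q = 0.25 ** t  # match probability at a single position (uniform bases)
    # 1 - (1 - q)**n, written with log1p/expm1 for numerical stability
    return -math.expm1(n * math.log1p(-q))


def p_located_once(t: int, n: int = GENOME_POSITIONS) -> float:
    """Probability that the tag matches exactly one of n positions (binomial term)."""
    q = 0.25 ** t
    return n * q * math.exp((n - 1) * math.log1p(-q))


for t in range(14, 22):
    print(t, round(p_located(t), 4), round(p_located_once(t), 4))
```

Under these assumptions the probability of being located is close to 1 at 14 bp and drops to the order of 0.01 at 19 bp; the position of the "located once" maximum depends on the assumed n and base composition.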
Figure 2. Influence of selecting tags with # occ. above a threshold on the percentage of analysed tags (grey), analysed occurrences (pink), erroneous tags (black), erroneous occurrences (magenta) and located tags (blue). A point at abscissa x gives the corresponding value when only tags with # occ. > x are kept. Analysed tags represent 25% of the original tag set, but still 85% of the occurrences. With # occ. > 1, the percentage of erroneous tags becomes low and that of erroneous occurrences very low. The percentage of located tags stabilizes after # occ. > 10; the blue curve serves as a graphical method to set the threshold to select biologically valid tags in our estimation procedure.
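The thresholding described in this caption can be sketched as follows: keep only tags whose occurrence count exceeds x, then report which share of distinct tags and of total occurrences survives. The function name `kept_percentages` and the toy counts are hypothetical and serve only to show the computation; they are not data from the study.

```python
from collections import Counter


def kept_percentages(occ: Counter, threshold: int) -> tuple[float, float]:
    """Return (% of distinct tags kept, % of occurrences kept) for # occ. > threshold."""
    kept = {tag: n for tag, n in occ.items() if n > threshold}
    pct_tags = 100.0 * len(kept) / len(occ)
    pct_occ = 100.0 * sum(kept.values()) / sum(occ.values())
    return pct_tags, pct_occ


# Toy reads, made up for illustration.
reads = ["ACGT"] * 6 + ["TTGA"] * 3 + ["GGCC", "CATG"]
occ = Counter(reads)
for x in (0, 1, 5):
    print(x, kept_percentages(occ, x))
```

In this toy set, raising the threshold from 0 to 1 discards half of the distinct tags but fewer than a fifth of the occurrences, which is the same qualitative effect the caption reports.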
Percentages of erroneous occurrences [𝒮(t)] and the probability of an erroneous nucleotide (P) for SAGE and ChIP-Seq assays at different tag lengths (t)
| t | SAGE-Sanger (6 527 650 occ): 𝒮(t) | SAGE-Sanger: P | SAGE-Solexa (2 222 344 occ): 𝒮(t) | SAGE-Solexa: P | ChIP-Seq-Solexa (1 339 671 occ): 𝒮(t) | ChIP-Seq-Solexa: P |
|---|---|---|---|---|---|---|
| 14 | 6.02 ± 1.64 | 0.44 | 4.22 ± 2.77 | 0.31 | − | − |
| 15 | 6.25 ± 0.88 | 0.43 | 5.31 ± 1.26 | 0.36 | − | − |
| 16 | 6.10 ± 0.67 | 0.39 | 4.85 ± 0.96 | 0.31 | 6.89 ± 1.59 | 0.44 |
| 17 | 7.37 ± 0.46 | 0.45 | 5.24 ± 0.71 | 0.32 | − | − |
| 18 | 8.32 ± 0.38 | 0.48 | 6.65 ± 0.65 | 0.38 | 7.53 ± 0.99 | |
| 19 | | 0.53 | | 0.44 | − | − |
| 20 | 10.79 ± 0.33 | | | | | |
| 21 | 12.49 ± 0.32 | 0.63 | 10.57 ± 0.60 | 0.53 | − | − |
| 22 | − | − | − | − | 10.39 ± 0.09 | 0.50 |
| 24 | − | − | − | − | 11.99 ± 0.09 | 0.53 |
| 26 | − | − | − | − | 13.51 ± 0.09 | 0.56 |
| 28 | − | − | − | − | 15.22 ± 0.09 | 0.59 |
| 30 | − | − | − | − | 16.83 ± 0.09 | 0.61 |
The tag length t ranges from 14 to 21 bp for SAGE-{Sanger,Solexa}, and from 16 to 30 bp for ChIP-Seq-Solexa; ± α(t) is the standard error of 𝒮(t). The percentage of erroneous occurrences logically increases with length. However, the probability of an erroneous nucleotide also increases with its position in the sequence (until the 30th bp), showing that, even with the Solexa technique, errors occur more frequently at the 3′ end (in bold: values cited in text).
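One way to relate the two tabulated quantities is to assume that errors strike each nucleotide independently with the same probability p, so that 𝒮(t) = 1 − (1 − p)^t and hence p = 1 − (1 − 𝒮(t))^(1/t). This is an assumption made here for illustration, not a formula stated in the text; under it, P reads as an average per-nucleotide error over the first t positions, which is how a rising P can reflect errors accumulating towards the 3′ end. The helper name `per_nucleotide_error` is hypothetical.

```python
def per_nucleotide_error(s_percent: float, t: int) -> float:
    """Average per-base error (in %) implied by s_percent erroneous tags of length t,
    assuming independent and equally likely errors at every position."""
    s = s_percent / 100.0
    return 100.0 * (1.0 - (1.0 - s) ** (1.0 / t))


print(round(per_nucleotide_error(6.02, 14), 2))   # SAGE-Sanger row, t = 14
print(round(per_nucleotide_error(16.83, 30), 2))  # ChIP-Seq-Solexa row, t = 30
```

For these two rows the result rounds to the tabulated P values (0.44 and 0.61), which suggests, without proving, that the P column was derived in a similar way.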
Errors and localization for the SAGE-Solexa library without and with filtration
SAGE-Solexa private library: 440 445 tags at # occ. > 0; 114 721 tags at # occ. > 1.

| t | ℛ(t) | ℳ(t) | 𝒳(t), # occ. > 0 | 𝒮(t), # occ. > 0 | 𝒱(t), # occ. > 0 | 𝒳(t), # occ. > 1 | 𝒮(t), # occ. > 1 | 𝒱(t), # occ. > 1 |
|---|---|---|---|---|---|---|---|---|
| 14 | 0.23 | 97.36 | 1.01 | 32.46 ± 2.80 | 31.92 | 0.59 | 14.77 ± 2.82 | 14.47 |
| 15 | 0.96 | 88.42 | 4.16 | 30.18 ± 1.74 | 27.84 | 2.25 | 12.22 ± 1.75 | 11.06 |
| 16 | 2.34 | 63.35 | 10.86 | 24.84 ± 1.31 | 17.65 | 5.86 | 10.27 ± 1.32 | 6.91 |
| 17 | 6.84 | 33.65 | 22.18 | 25.77 ± 0.77 | 11.14 | 13.50 | 11.19 ± 0.78 | 4.35 |
| 18 | 10.27 | 10.83 | 34.06 | 30.15 ± 0.77 | 4.95 | 20.09 | 12.45 ± 0.77 | 1.69 |
| 19 | 12.38 | 6.49 | 41.04 | 35.33 ± 0.74 | 3.89 | 23.61 | 13.85 ± 0.74 | 1.18 |
| 20 | 13.89 | 3.09 | 45.61 | 38.20 ± 0.79 | | 25.52 | 14.01 ± 0.79 | |
| 21 | 15.56 | 2.42 | 50.00 | 41.98 ± 0.99 | 2.04 | 28.23 | 15.48 ± 0.99 | 0.57 |
For each length t, one reads the percentages over all tags of valid mapped tags [ℛ(t)], erroneous mapped tags [ℳ(t)], unmapped tags [𝒳(t)], erroneous tags [𝒮(t)] with its standard error [α(t)] and false positive locations [i.e. locations mapped by erroneous tags, 𝒱(t)]. The last three statistics are given for the unfiltered (# occ. > 0) and filtered (# occ. > 1) tag sets (in bold: values cited in text).
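The numbers in this table appear to be tied together by a law-of-total-probability argument, sketched below. The reading used here is an assumption that differs slightly from the caption's wording: ℳ(t) is taken as the share of erroneous tags that still map, and ℛ(t) as the share of non-erroneous tags that stay unmapped, so that 𝒳 = 𝒮(1 − ℳ) + (1 − 𝒮)ℛ. Solving for 𝒮 and then taking the erroneous share of mapped tags gives 𝒱 = 𝒮ℳ / (1 − 𝒳). Neither formula is stated in the text shown here, and the function names are hypothetical.

```python
def erroneous_share(x: float, r: float, m: float) -> float:
    """Solve x = s*(1 - m) + (1 - s)*r for s (all values as fractions)."""
    return (x - r) / (1.0 - m - r)


def false_positive_share(s: float, m: float, x: float) -> float:
    """Share of mapped tags that are erroneous: s*m / (1 - x)."""
    return s * m / (1.0 - x)


# Row t = 21, unfiltered columns, converted from percentages to fractions.
r, m, x = 0.1556, 0.0242, 0.5000
s = erroneous_share(x, r, m)
v = false_positive_share(s, m, x)
print(round(100 * s, 2), round(100 * v, 2))
```

For the t = 21 unfiltered row, the computed values land within rounding distance of the tabulated 𝒮(t) = 41.98 and 𝒱(t) = 2.04; not every row agrees this closely, so the relation should be treated as a plausible reading rather than the documented procedure.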
Figure 3. Variation of the prediction capacity when mapping four transcriptomic or ChIP-Seq tag sets on the human genome. For each assay, the histogram gives, for each tag length, the percentages over all tags of located tags (light blue bar), uniquely mapped tags (dark blue bar), erroneous tags with the standard errors (black curve) and the background distribution (brown curve). Histograms obtained when mapping (A) the SAGE-Sanger set, (B) the CAGE-Sanger set, (C) the SAGE-Solexa set and (D) the ChIP-Seq-Solexa set on the genome sequence (hg18). For concision, only even lengths are plotted in the ChIP-Seq histogram (D). The ratio of tags located once on the genome reaches its maximum at 19 bp, when the background probability of being located is already low, while the ratio of erroneous tags keeps on increasing after that.
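The bar heights in these histograms, and the recall and precision defined for Figure 1(B), can be derived from a per-tag count of genomic hits. The sketch below assumes such counts are already available from a mapper; the function name `mapping_summary` and the toy dictionary are hypothetical.

```python
def mapping_summary(hits_per_tag: dict[str, int]) -> tuple[float, float, float]:
    """Return (% located over all tags, % uniquely mapped over all tags,
    % uniquely mapped over located tags)."""
    total = len(hits_per_tag)
    located = sum(1 for h in hits_per_tag.values() if h >= 1)
    unique = sum(1 for h in hits_per_tag.values() if h == 1)
    pct_located = 100.0 * located / total
    pct_unique = 100.0 * unique / total
    precision = 100.0 * unique / located if located else 0.0
    return pct_located, pct_unique, precision


# Toy hit counts, made up for illustration.
toy_hits = {"tagA": 0, "tagB": 1, "tagC": 3, "tagD": 1}
print(mapping_summary(toy_hits))
```

Running the same summary on the tag set truncated to each length t would give one bar pair per length, which is how the length-dependent trade-off in the caption could be reproduced.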
Classification of TARs according to Ensembl annotations
| Tags | Result | Total | Exonic S (1) | Exonic AS (4) | Inxonic S (2) | Inxonic AS (5) | Intronic S (3) | Intronic AS (6) | Intergenic EST (7) | Intergenic Other (8) |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 bp | % | 100 | 34.7 | 7.8 | 1.0 | 0.4 | 15.1 | 9.2 | 5.5 | 26.3 |
| 16 bp | Number | 16 328 | 5659 | 1279 | 156 | 73 | 2467 | 1501 | 898 | 4295 |
| 20 bp | % | 100 | 38.5 | 8.8 | 1.2 | 0.3 | 15.6 | 6.6 | 5.5 | 23.5 |
| 20 bp | Number | 56 006 | 21 600 | 4947 | 691 | 192 | 8760 | 3694 | 3054 | 13 068 |
| 21 bp | % | 100 | 38.5 | 8.8 | 1.2 | 0.3 | 15.6 | 6.6 | 5.5 | 23.5 |
| 21 bp | Number | 56 441 | 21 706 | 4970 | 687 | 192 | 8808 | 3743 | 3100 | 13 235 |
| Tiling | % | 100 | 35.6 | − | − | − | 34.9 | − | 10.8 | 18.7 |
The number and percentage of all TARs obtained using digital transcriptomic tags of length 16, 20 or 21 bp, and using a tiling array [ENCODE project (11)], are shown for each annotation category (cf. ‘Classification of transcriptomic tags’ section). A TAR inside a gene can be located in an exon, in an intron, or overlap one of each; in the last case we term it ‘inxonic’. These three categories are further subdivided into sense (S) and antisense (AS), depending on which strand the tag was located on compared with the gene. Category (7) concerns ESTs that do not overlap any annotated exon.
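The eight categories can be sketched as a small interval classifier. This is a minimal illustration under stated assumptions, not the pipeline used in the study: coordinates are half-open intervals, a gene is a plain dictionary with hypothetical keys `start`, `end`, `strand` and `exons`, a TAR is compared with a single candidate gene, and a TAR that overlaps the gene at all is treated as lying inside it.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


# Category numbers as given in the table header.
CATEGORY = {
    ("exonic", True): 1, ("inxonic", True): 2, ("intronic", True): 3,
    ("exonic", False): 4, ("inxonic", False): 5, ("intronic", False): 6,
}


def classify_tar(start, end, strand, gene=None, ests=()):
    """Return the category number (1 to 8) of a TAR."""
    if gene is None or not overlaps(start, end, gene["start"], gene["end"]):
        # Intergenic: EST-supported (7) or other (8).
        in_est = any(overlaps(start, end, s, e) for s, e in ests)
        return 7 if in_est else 8
    touched = [(s, e) for s, e in gene["exons"] if overlaps(start, end, s, e)]
    if not touched:
        region = "intronic"
    elif any(s <= start and end <= e for s, e in touched):
        region = "exonic"      # fully contained in one exon
    else:
        region = "inxonic"     # straddles an exon boundary
    return CATEGORY[(region, strand == gene["strand"])]


gene = {"start": 100, "end": 1000, "strand": "+", "exons": [(100, 200), (500, 600)]}
print(classify_tar(120, 140, "+", gene), classify_tar(190, 210, "-", gene))
```

A production version would need to handle overlapping genes and alternative transcripts, which this sketch deliberately ignores.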
| 𝒮(t) | : | the probability that a sequence of length t |
| 𝒳(t) | : | the prior probability that a sequence of length t |
| ℳ(t) | : | the probability that an erroneous sequence of length t |
| ℛ(t) | : | the probability that a non-erroneous sequence of length t |