Jeppe Emmersen, Anna M Heidenblut, Annabeth Laursen Høgh, Stephan A Hahn, Karen G Welinder, Kåre L Nielsen.
Abstract
BACKGROUND: During gene expression analysis by Serial Analysis of Gene Expression (SAGE), duplicate ditags are routinely removed from the data analysis because they are suspected to stem from artifacts during SAGE library construction. As a consequence, naturally occurring duplicate ditags are also removed from the analysis, leading to an error of measurement.
Year: 2007 PMID: 17359537 PMCID: PMC1839111 DOI: 10.1186/1471-2105-8-92
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Estimating occurrence of duplicate ditags in SAGE based on an even distribution of compatible overlapping tags. (A) The number of expected duplicate ditags (equation 1). (B) The accuracy of tag count when duplicate ditags were removed from the analysis. SAGE (hatched line) and LongSAGE (solid line).
Abundant LongSAGE tags observed in pancreatic acinar cells.
| Tag sequence | Tag count, duplicate ditags included (a) | Tag count, duplicate ditags removed (b) | Fold change (c) | THC (d) | Gene name |
| CATGTCAGGGTGATTCTGGTG | 3315 | 1086 | 0.33 | 2531342 | Trypsin I |
| CATGGCGTGACCAGCTTTGTT | 2609 | 1161 | 0.44 | 2498325 | Elastase IIIB |
| CATGAATTGAAGAAACTGACC | 2359 | 713 | 0.30 | 2510696 | Unknown |
| CATGGAGCACACCCTGAATCA | 1145 | 657 | 0.57 | 2613307 | Carboxypeptidase A1 |
| CATGGAACACAAAAAAAAAAA | 1094 | 535 | 0.49 | Unknown | |
| CATGTGCGAGACCACCCCTAT | 891 | 461 | 0.52 | 2683646 | Carboxypeptidase A2 |
| CATGTCCTCAAAACAAAAAAA | 753 | 377 | 0.50 | Unknown | |
| CATGAGCCTTGGTATCAAGAG | 645 | 353 | 0.55 | 2462969 | Cholesterol esterase |
| CATGTTCATACACCTATCCCC | 531 | 177 | 0.33 | 2398611 | NADH dehydrogenase |
| CATGCTGAATCTAAATTATAA | 526 | 257 | 0.49 | 2590573 | Alpha-amylase 2B |
| CATGTCCTCAAAACAATAAAA | 465 | 252 | 0.54 | Unknown | |
| CATGTCCTCAAAAAAAAAAAA | 431 | 211 | 0.49 | Unknown | |
a Total number of tags = 44,276
b Total number of tags = 31,868
c Tag count excluding duplicate ditags / tag count including duplicate ditags.
d Tentative Human Contig number from The Institute for Genomic Research
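The fold-change column can be reproduced from the two tag-count columns. The following is a minimal sketch, not the authors' script; the function name `retained_fraction` is hypothetical, and the three rows are copied from the table above.

```python
# Rows copied from the table above:
# (tag sequence, count with duplicate ditags included, count with them removed)
rows = [
    ("CATGTCAGGGTGATTCTGGTG", 3315, 1086),  # Trypsin I
    ("CATGGCGTGACCAGCTTTGTT", 2609, 1161),  # Elastase IIIB
    ("CATGGAGCACACCCTGAATCA", 1145, 657),   # Carboxypeptidase A1
]


def retained_fraction(included: int, removed: int) -> float:
    """Fraction of a tag's count that survives duplicate-ditag removal.

    Hypothetical helper: the ratio removed / included is what matches
    the fold-change values printed in the table.
    """
    if included <= 0:
        raise ValueError("included count must be positive")
    return removed / included


for tag, included, removed in rows:
    print(tag, round(retained_fraction(included, removed), 2))

# The same ratio for the whole library, using the totals in footnotes a and b.
library_fraction = retained_fraction(44276, 31868)
print("library", round(library_fraction, 2))
```

On these numbers the abundant tags retain between roughly a third and a half of their count, whereas the library as a whole retains about 72% of its tags.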
The relationship of tag abundance and degree of change introduced by removal of duplicate ditags.
| Tag count | # of unique tags | Change >2-fold (a) | Change 1.5–2-fold (a) | Change 1–1.5-fold (a) | Unchanged (a) |
| >200 | 19 | 7 (37) | 8 (42) | 4 (21) | 0 (0) |
| >100–199 | 13 | 3 (23) | 4 (31) | 6 (46) | 0 (0) |
| >50–99 | 42 | 2 (5) | 13 (31) | 27 (64) | 0 (0) |
| >20–49 | 91 | 6 (7) | 11 (12) | 72 (79) | 2 (2) |
| >10–19 | 179 | 3 (2) | 16 (9) | 104 (58) | 56 (31) |
| >5–9 | 383 | 1 (0.3) | 19 (5) | 139 (36) | 224 (58) |
| >2–5 | 2157 | 8 (0.4) | 240 (11) | 120 (6) | 1789 (83) |
a Fold change is calculated by dividing the total tag count with the tag count obtained after removal of duplicate ditags. Percentage of total number of different tags in the indicated intervals is given in parentheses.
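The binning used in this table can be sketched as follows. This is an illustrative sketch only: the function name `change_class` is hypothetical, and how the boundary values (exactly 1.5 or exactly 2) are assigned is an assumption, since the source does not state it.

```python
def change_class(total: int, after_removal: int) -> str:
    """Classify a tag by fold change = total count / count after removal.

    Assumption: a fold change of exactly 2 falls in the 1.5-2 bin and a
    fold change of exactly 1.5 falls in the 1-1.5 bin.
    """
    if after_removal <= 0:
        raise ValueError("count after removal must be positive")
    fold = total / after_removal
    if fold > 2:
        return ">2 fold"
    if fold > 1.5:
        return "1.5-2 fold"
    if fold > 1:
        return "1-1.5 fold"
    return "unchanged"


def percent_of_bin(n_tags: int, n_unique: int) -> int:
    """Percentage of the unique tags in an abundance interval, rounded."""
    return round(100 * n_tags / n_unique)


# Two abundant tags from the LongSAGE table of pancreatic acinar cells.
print(change_class(3315, 1086))   # Trypsin I
print(change_class(1145, 657))    # Carboxypeptidase A1
# First row of this table: 7 of 19 unique tags with count >200.
print(percent_of_bin(7, 19))
```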
Summary of ditag statistics.
| | Pancreatic acinar cells | RefSeq v.16 (c) |
| Ditag length (a) | Number | |
| 40 | 3339 | N.A. |
| 41 | 11329 | N.A. |
| 42 | 7325 | N.A. |
| 43 | 240 | N.A. |
| 44 | 61 | N.A. |
| Overlap class | Number | Number |
| AT (b) | 572 | 5172 |
| CG (b) | 17 | 1371 |
| GC (b) | 344 | 3635 |
| TA (b) | 161 | 5076 |
| AA or TT | 3173 | 18853 |
| AC or GT | 1784 | 8120 |
| AG or CT | 813 | 11069 |
| CA or TG | 437 | 10467 |
| CC or GG | 2868 | 8561 |
| GA or TC | 495 | 9315 |
a Ditags shorter than 40 nucleotides were not extracted from sequences.
b Palindromic sequences. The number of sequences compares to half the number in non-palindromic sequences.
c LongSAGE tags generated in silico from RefSeq, using 17 + CATG nt tags only.
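The ten overlap classes in this table are consistent with pooling each dinucleotide with its reverse complement. That pooling rule is an assumption inferred from the row labels, not something the source states. A minimal sketch, with hypothetical helper names:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_class(dinuc: str) -> str:
    """Label a dinucleotide by its class (assumed reverse-complement pooling).

    Palindromic dinucleotides equal their own reverse complement and form
    a class of one; all others are paired with their reverse complement.
    """
    rc = revcomp(dinuc)
    if rc == dinuc:
        return dinuc
    return " or ".join(sorted((dinuc, rc)))


# Group all 16 dinucleotides into classes.
classes: dict[str, list[str]] = {}
for a, b in product("ACGT", repeat=2):
    classes.setdefault(overlap_class(a + b), []).append(a + b)

for label, members in sorted(classes.items()):
    print(label, members)
```

Under this grouping the four palindromic classes contain one dinucleotide each and the six others contain two, which fits footnote b's remark that palindromic counts compare to half those of the non-palindromic classes.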
Figure 2. Overview of the LongSAGE_bias.pl Perl script used for the data analyses. The quality threshold of sequence files can be set at any level desired. A high quality threshold may lead to under-representation of difficult-to-sequence tags. If the threshold is set to zero, all tags are included and the number of tags observed once or twice increases.
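The effect of duplicate-ditag removal on tag counts can be illustrated with a short Python sketch. It is not a port of LongSAGE_bias.pl. Several points are assumptions made for illustration: all function names are hypothetical, a ditag and its reverse complement are treated as the same ditag, and each tag is taken as the 21 nt (CATG + 17 nt) at either end of the ditag.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
TAG_LEN = 21  # assumed LongSAGE tag length: CATG + 17 nt


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(ditag: str) -> str:
    """Orientation-independent key: a ditag and its reverse complement match."""
    return min(ditag, revcomp(ditag))


def count_tags(ditags: list[str], remove_duplicates: bool) -> Counter:
    """Count tags, optionally keeping only one copy of each distinct ditag."""
    if remove_duplicates:
        ditags = sorted({canonical(d) for d in ditags})
    counts: Counter = Counter()
    for d in ditags:
        # Left tag as read; right tag read from the opposite strand.
        counts.update((d[:TAG_LEN], revcomp(d[-TAG_LEN:])))
    return counts


# Toy library built from two ditag sequences listed in the outlier table.
d1 = "CATGGAGCACACCCTGAATCACACCAGAATCACCCTGACATG"
d2 = "CATGGTGTGTGCTGGAGGGTACACCAGAATCACCCTGACATG"
library = [d1, d1, revcomp(d1), d2]

with_dups = count_tags(library, remove_duplicates=False)
without_dups = count_tags(library, remove_duplicates=True)
trypsin = "CATGTCAGGGTGATTCTGGTG"
print(with_dups[trypsin], without_dups[trypsin])
```

In this toy library the Trypsin I tag is seen four times, but removing duplicate ditags halves its count even though every copy may be a genuine transcript.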
Figure 3. Predicted versus observed duplicate ditags in a LongSAGE study of pancreatic acinar cells [4]. Predicted ditag counts, two for each observed ditag, were calculated according to equation 3 (see Methods for details). (A) All observed ditags are included. (C) Outliers according to Table 4 were removed. Standardized residuals were calculated according to equation 4. The confidence interval at three standard deviations is shown as lines. (B) All observed ditags are included. (D) Outliers according to Table 4 were removed. The recalculated confidence interval at three standard deviations is shown as lines.
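Equations 3 and 4 are not reproduced in this record, so the following Python sketch uses a generic definition of a standardized residual (observed minus predicted, divided by the sample standard deviation of all residuals). That definition, the function names, and the toy numbers are assumptions for illustration and may differ from the authors' formulas.

```python
from statistics import stdev


def standardized_residuals(observed: list[float], predicted: list[float]) -> list[float]:
    """Residuals scaled by their sample standard deviation (assumed definition)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    residuals = [o - p for o, p in zip(observed, predicted)]
    sd = stdev(residuals)
    if sd == 0:
        raise ValueError("all residuals are identical; cannot standardize")
    return [r / sd for r in residuals]


def flag_outliers(observed: list[float], predicted: list[float], limit: float = 3.0) -> list[int]:
    """Indices of points whose standardized residual exceeds the limit."""
    z = standardized_residuals(observed, predicted)
    return [i for i, value in enumerate(z) if abs(value) > limit]


# Toy data: 19 ditags observed as predicted, one observed far more often.
observed = [10.0] * 19 + [100.0]
predicted = [10.0] * 20
outliers = flag_outliers(observed, predicted)
print(outliers)
```

After flagged points are removed, the standard deviation would be recomputed on the remaining residuals, which corresponds to the recalculated interval mentioned in the caption.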
Comparison of outlier ditags with the matched database sequences.
| 34 | 970 | CATGGAGCACACCCTGAATCACACCAGAATCACCCTGACATG | ||
| 2401106 | Carboxypeptidase | |||
| 2434341 | Trypsin I | |||
| 17 | 289 | CATGGTGTGTGCTGGAGGGTACACCAGAATCACCCTGACATG | ||
| 2431718 | Elastase IIIA | |||
| 2434341 | Trypsin I | |||
| 14 | 913 | CATGTCAGGGTGATTCTGGTGAGGAAGCCCACACAGAACATG | ||
| 2434341 | Trypsin I | |||
| 2434342 | Trypsin I | |||
| 13 | 641 | CATGACGCTGGACGCTCCAAGCACCAGAATCACCCTGACATG | ||
| 2407612 | Colipase | |||
| 2434341 | Trypsin I | |||
| 9 | 101 | CATGTCAGGGTGATTCTGGTGTGATTGCCGAGCCAGAGCATG | ||
| 2434341 | Trypsin I | |||
| 2237360 | Phospholipase A2 (b) | |||
| 4 | 127 | CATGTCAGGGTGATTCTGGTGCTGGCGCTTCTGACCATCATG | ||
| 2434341 | Trypsin I | |||
| 2401106 | Carboxypeptidase (c) | |||
| 4 | 85 | CATGACGCTGGACGCTCCAAGTGATTCAGGGTGTGCTCCATG | ||
| 2407612 | Colipase | |||
| 2401106 | Carboxypeptidase | |||
| 2 | 267 | CATGGAGCACACCCTGAATCAAACAAAGCTGGTCACGCCATG | ||
| 2401106 | Carboxypeptidase | |||
| 2254617 | Elastase IIIB | |||
| 86 | 8 | CATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATG | ||
| β-lactamase | ||||
| Inv. β-lactamase |
a Tentative Human Contig number from The Institute for Genomic Research (TIGR).
b The match to Phospholipase A2 is not perfect (GTGATTGCCGAGCCAGAGCACG).
c The tag matches an inverted sequence from carboxypeptidase.
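The gene assignments in this table can be checked by splitting a ditag into its two tags and looking them up among the abundant tags listed earlier. This is a hedged sketch: the helper names and the small `KNOWN_TAGS` lookup are hypothetical, and it assumes 42-nt ditags made of two 21-nt tags, the second read from the opposite strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
TAG_LEN = 21  # assumed LongSAGE tag length: CATG + 17 nt

# Subset of the abundant LongSAGE tags from the pancreatic acinar cell table.
KNOWN_TAGS = {
    "CATGTCAGGGTGATTCTGGTG": "Trypsin I",
    "CATGGCGTGACCAGCTTTGTT": "Elastase IIIB",
    "CATGGAGCACACCCTGAATCA": "Carboxypeptidase A1",
}


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_ditag(ditag: str) -> tuple[str, str]:
    """Return (left tag, right tag), with the right tag reverse-complemented."""
    return ditag[:TAG_LEN], revcomp(ditag[-TAG_LEN:])


def annotate(ditag: str) -> tuple[str, str]:
    """Gene names for both tags of a ditag, or 'unknown' if not in the lookup."""
    left, right = split_ditag(ditag)
    return KNOWN_TAGS.get(left, "unknown"), KNOWN_TAGS.get(right, "unknown")


# Two ditag sequences copied from the table above.
print(annotate("CATGGAGCACACCCTGAATCACACCAGAATCACCCTGACATG"))
print(annotate("CATGGAGCACACCCTGAATCAAACAAAGCTGGTCACGCCATG"))
```

Both results agree with the gene pairs listed for these ditags in the table (carboxypeptidase with Trypsin I, and carboxypeptidase with Elastase IIIB).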
Additional transcript changes detected between pancreatic acinar and ductal cells by including duplicate ditags.
| CATGGGCGACTCTGGCGGCCC | 40 | 0 | 6.72E-13 | THC2268952 | Chymotrypsinogen B |
| CATGGAGCACACCCTGAATCC | 39 | 0 | 1.4E-12 | unknown | |
| CATGCCTGTAATCCCAGCTAC | 20 | 95 | 1.23E-11 | W85818 | |
| CATGCCTAGCTGGATTGCAGA | 26 | 104 | 5.89E-11 | BU542624 | |
| CATGAAAGTCTAGAAATAAAA | 3 | 41 | 3.62E-09 | THC2400275 | full-length cDNA clone CS0DC017YH08 |
| CATGCACAAACGGTAGTTTTG | 187 | 110 | 3.78E-09 | AV744668 | |
| CATGTGTGCTAAATGTGTTCG | 69 | 22 | 5.56E-09 | BF089871 | |
| CATGTTCTGTGTGGGCTTCCC | 27 | 0 | 8.83E-09 | unknown | |
| CATGTGCATCTGGTGTAGGAA | 33 | 103 | 1.49E-08 | BU626127 | |
| CATGGGGTTGGCTTGAAACCA | 2 | 35 | 1.72E-08 | BG756271 | |
| CATGCACCTCCCACCGGCCGT | 26 | 0 | 1.82E-08 | THC2457279 | Elastase 2B |
| CATGCTAAGACTTCACCAGTC | 58 | 145 | 2.1E-08 | BU674671 | |
| CATGGTAAGTGTACTGGAAAG | 33 | 4 | 3.3E-08 | THC2400569 | Human mitochondrial genes |
| CATGAATCCTTGCCTCCCTCA | 25 | 1 | 3.74E-08 | BI791939 | |
| CATGGGAACAAACAGATCGAA | 6 | 44 | 6.7E-08 | NP922813 | CD24 protein |
| CATGGTAATTTAAACAATGAA | 0 | 29 | 7.31E-08 | THC2336784 | Integrin beta-6 precursor |
| CATGTCCCCGTGGCTGTGGGG | 1 | 29 | 7.31E-08 | AV700058 | |
| CATGTGCCCTCAGGAAAAAAA | 0 | 29 | 7.31E-08 | THC2244374 | Neutrophil gelatinase-associated lipocalin |
| CATGGAACACAAAAAAAAAGA | 24 | 0 | 7.68E-08 | unknown | |
| CATGTGGCTTCAAGCCACCAG | 28 | 89 | 8.5E-08 | BF987687 | |
| CATGCCAAACGTGTAACAATT | 7 | 46 | 8.61E-08 | CV350470 | |
| CATGACAGTAAGAGAATTATG | 87 | 39 | 1.11E-07 | unknown | |
| CATGCTGTACAGACACCACCA | 0 | 28 | 1.33E-07 | BG151226 | |
| CATGGTAAATTTAAAAAAAAA | 1 | 28 | 1.33E-07 | unknown | |
| CATGAGTTGAAGAAACTGACC | 23 | 0 | 1.57E-07 | unknown | |
| CATGGTTATGGCAGCACTGCA | 86 | 39 | 1.64E-07 | unknown | |
| CATGGGTGGTGTCTGAGAGGC | 0 | 27 | 2.41E-07 | THC2256155 | gastrointestinal glutathione peroxidase 2 |
| CATGTTCATTATAATCTCAAA | 8 | 46 | 2.72E-07 | BG025220 | |
| CATGCATCTTCACCAGCAGCT | 4 | 36 | 2.75E-07 | CD240368 | |
| CATGCTGCTTGGTGAACAATC | 4 | 36 | 2.75E-07 | THC2247807 | Neutral and basic amino acid transport protein |
| CATGTATGACTTAATAAATCC | 2 | 30 | 3.01E-07 | AA506911 | |
| CATGCTTGTGAACTGCACAAC | 0 | 26 | 4.38E-07 | AA343639 | |
| CATGGAAATTTAAAGCAGGTT | 2 | 29 | 5.31E-07 | THC2272041 | |
| CATGCCAGAACAGACTGGTGA | 19 | 67 | 5.89E-07 | CD240292 | |
| CATGCCAGGGTGATTCTGGTG | 21 | 0 | 6.58E-07 | THC2434375 | Trypsin II |
| CATGGTGTGCGCTGGGGGCGT | 21 | 0 | 6.58E-07 | unknown | |
| CATGGATTGAAGAAACTGACC | 21 | 0 | 6.58E-07 | unknown | |
| CATGTGTCCACCATCTCTCTG | 21 | 0 | 6.58E-07 | THC2434352 | Trypsinogen C |
| CATGGCGTGACCAGCTTTGTG | 21 | 1 | 6.58E-07 | unknown | |
| CATGAGCCACTGCGCCCAGCC | 26 | 3 | 6.96E-07 | H75720 | |
| CATGCTTCTGATCTCAGCAGT | 0 | 25 | 7.92E-07 | THC2315603 | Heparan sulfate 3-O-sulfotransferase-1 |
| CATGCACAGGCAAAATGTATT | 1 | 25 | 7.92E-07 | CA314838 | |
| CATGTGAAGTTATACTGTGGC | 2 | 28 | 9.33E-07 | AW970111 | |
| CATGGGATATGTGGTGTATAT | 7 | 41 | 9.67E-07 | AV656761 | |
| CATGCATATCATTAAACAAAT | 5 | 36 | 1.06E-06 | NP924865 | Insulin-like growth factor binding protein 7 |
| CATGTATTTTCCAGCTGCCTC | 20 | 1 | 1.34E-06 | AA514440 | |
| CATGTCAGGGTGGTTCTGGTG | 20 | 1 | 1.34E-06 | unknown | |
| CATGTCAGGGTGATCCTGGTG | 20 | 0 | 1.34E-06 | unknown | |
| CATGTCAGGGCGATTCTGGTG | 20 | 0 | 1.34E-06 | unknown | |
| CATGAAAAGCAGAAATCGGTT | 0 | 24 | 1.43E-06 | THC2244965 | Krueppel-like factor 5 |
| CATGTTTGCACCTTTCTAGTT | 0 | 24 | 1.43E-06 | NP119453 | Connective tissue growth factor |
| CATGATACTTTAATCAGAAGC | 1 | 24 | 1.43E-06 | NP1194136 | full-length cDNA clone CS0DF028YA19 |
| CATGGCGAAACCCTGTCTCTA | 3 | 30 | 1.55E-06 | W03579 | |
| CATGCTTATGGTTGATCAGTT | 2 | 27 | 1.64E-06 | CB243786 | |
| CATGCACCTAATTGGAAGCGC | 56 | 22 | 2.16E-06 | CF129138 | |
| CATGGGAATGTACGTTATTTC | 13 | 52 | 2.18E-06 | NP1215187 | mitochondrial ATP synthase |
a The percentage of tags identified as differentially expressed that cannot be matched to database sequences is similar whether duplicate ditags are included (27%) or excluded (30%), and is similar in this list (24%).