Coral del Val, Vladimir Yurjevich Kuryshev, Karl-Heinz Glatting, Peter Ernst, Agnes Hotz-Wagenblatt, Annemarie Poustka, Sandor Suhai, Stefan Wiemann.
Abstract
BACKGROUND: The German cDNA Consortium has been cloning full-length cDNAs and exploiting them in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires strategies capable of rapidly separating truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics.
Year: 2006 PMID: 17064411 PMCID: PMC1636072 DOI: 10.1186/1471-2105-7-473
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. CAFTAN pipeline, data and program flow.
cDNA features flow, generation of new features and type of test performed by CAFTAN
| cDNA coverage | |||||
| 3' UTR mapping | |||||
| 5' UTR mapping | Filtered BLAT | ||||
| Internal mapping | Exons | Delta5' region | |||
| Total mapping | Exons length | Delta int | |||
| Single exon cDNA | BLAT Output | Filtered BLAT | |||
| Repeats number | Genome assembly query | Simple R | |||
| Number of Ss | Genome assembly query | Canonical Ss "GT-AG" | |||
| cDNA Ss type | Canonical Ss "GT-AG" | ||||
| Non_canonical Ss "GC-AG" | |||||
| Genome assembly query | U12 Ss "AT-AC" | ||||
| Antisense Ss "CT-GC" | |||||
| Ss-score | |||||
| Unknown Ss (others) | |||||
| cDNA signal type | Polyasignal Output | ||||
| Contamination | Polyasignal Output | Genome assembly query | Genomic polyA tail | ||
Decision rules for cDNA sequence quality evaluation in CAFTAN.
| False | |||||||||
| True | Unmapped | 1 | |||||||
| True | Unmapped | > 1 | |||||||
| True | Partial mapped | 1 | |||||||
| True | Partial mapped | > 1 | < 100 | ||||||
| True | Partial mapped | > 1 | 100 | No contamination | |||||
| True | Partial mapped | > 1 | 100 | Any contamination | |||||
| True | Mapped | 1 | No contamination | + | + | ||||
| True | Mapped | 1 | No contamination | - | - | ||||
| True | Mapped | 1 | Genomic polyA | + | + | Perfect | |||
| True | Mapped | 1 | Genomic polyA | + | + | No Perfect | |||
| True | Mapped | 1 | Genomic polyA | + | - | ||||
| True | Mapped | 1 | No contamination | + | - | Perfect | |||
| True | Mapped | 1 | No contamination | + | - | No Perfect | |||
| True | Mapped | 1 | Repeats | - | - | ||||
| True | Mapped | 1 | No contamination | - | + | ||||
| True | Mapped | 1 | Repeats | + | - | Perfect | |||
| True | Mapped | 1 | Repeats | + | - | No Perfect | |||
| True | Mapped | 1 | Genomic polyA | - | - | ||||
| True | Mapped | > 1 | Ssc <= 60 | ||||||
| True | Mapped | > 1 | 60 < Ssc <= 100 | Good SS* | Mixed contamination | ||||
| True | Mapped | > 1 | 60 < Ssc < 80 | Good SS* | Genomic polyA | ||||
| True | Mapped | > 1 | Ssc >= 90 | Bad SS + | Mixed | ||||
| True | Mapped | > 1 | 60 < Ssc < 90 | Bad SS + | |||||
| True | Mapped | > 1 | 60 < Ssc <= 100 | Good SS* | No contamination | ||||
| True | Mapped | > 1 | 60 < Ssc <= 100 | Good SS* | No contamination | + | + | ||
| True | Mapped | > 1 | 60 < Ssc <= 100 | Good SS* | No contamination | + | - | ||
| True | Mapped | > 1 | Ssc >= 80 | Good SS* | No contamination | - | - | ||
| True | Mapped | > 1 | Ssc >= 80 | Good SS* | No contamination | - | + | ||
| True | Mapped | > 1 | Ssc >= 80 | Good SS* | Repeats | ||||
| True | Mapped | > 1 | Ssc >= 90 | Bad SS + | No contamination | ||||
| True | Mapped | > 1 | Ssc >= 90 | Bad SS + | Repeats | ||||
| True | Mapped | > 1 | 60 < Ssc < 80 | Good SS* | No contamination | - | - | ||
| True | Mapped | > 1 | 60 < Ssc < 80 | Good SS* | No contamination | - | + | ||
| True | Mapped | > 1 | Ssc >= 80 | Good SS* | Genomic polyA | ||||
| True | Mapped | > 1 | 60 < Ssc < 80 | Good SS* | Repeats | ||||
| True | Mapped | > 1 | Ssc >= 90 | Bad SS + | Genomic polyA |
The splice site types (Ss type) are defined in Table 1. * Good SS are canonical, non-canonical, and U12 splice sites. + Bad SS are unknown and antisense combinations of donors and acceptors (Table 1).
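The splice site grouping used by the decision rules can be illustrated with a minimal sketch. It assumes that each intron is described by its donor and acceptor dinucleotides and copies the pairs exactly as they are printed in Table 1; the function and constant names are hypothetical and are not taken from CAFTAN itself. How CAFTAN combines per-intron calls into a single Good SS or Bad SS label for a whole cDNA is not described here, so the sketch stops at the individual splice site.

```python
# Donor-acceptor dinucleotide pairs, copied as printed in Table 1.
SPLICE_SITE_TYPES = {
    ("GT", "AG"): "canonical",
    ("GC", "AG"): "non_canonical",
    ("AT", "AC"): "u12",
    ("CT", "GC"): "antisense",
}

# Grouping from the Table 2 footnote: Good SS vs. Bad SS.
GOOD_SS = {"canonical", "non_canonical", "u12"}


def classify_splice_site(donor: str, acceptor: str) -> str:
    """Return the Table 1 splice site type, or 'unknown' for any other pair."""
    key = (donor.upper(), acceptor.upper())
    return SPLICE_SITE_TYPES.get(key, "unknown")


def is_good_ss(ss_type: str) -> bool:
    """True for canonical, non-canonical and U12 sites; False otherwise."""
    return ss_type in GOOD_SS
```

For example, `classify_splice_site("gt", "ag")` yields `"canonical"`, which `is_good_ss` accepts, while any pair outside Table 1 falls into `"unknown"` and counts as a Bad SS.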
CAFTAN results for the annotated VEGA cDNAs and for the 3000 randomly generated cDNAs.
| 154 | 2.58 % | 46 | 1.53 % | |
| 4911 | 82.53 % | 1 | 0.03 % | |
| 0 | 0.00 % | 0 | 0 % | |
| 57 | 0.96 % | 0 | 0 % | |
| 37 | 0.62 % | 115 | 3.83 % | |
| 791 | 13.29 % | 2666 | 88.86 % | |
| 0 | 0.00 % | 172 | 5.73 % | |
| 5950 | 3000 | |||
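As a quick consistency check, the percentages in the table above can be recomputed from the raw counts. The sketch below assumes the first count column refers to the 5950 VEGA cDNAs and the second to the 3000 random cDNAs, in the row order shown; the variable and function names are hypothetical, and the class labels are omitted because they are not present in the table as reproduced here.

```python
# Counts per class, in the row order of the table (class labels not shown).
vega_counts = [154, 4911, 0, 57, 37, 791, 0]
random_counts = [46, 1, 0, 0, 115, 2666, 172]

# Percentages as printed in the table.
vega_printed = [2.58, 82.53, 0.00, 0.96, 0.62, 13.29, 0.00]
random_printed = [1.53, 0.03, 0.00, 0.00, 3.83, 88.86, 5.73]


def percentages(counts):
    """Return (total, list of percentages of the total) for a count column."""
    total = sum(counts)
    return total, [100.0 * c / total for c in counts]


vega_total, vega_pct = percentages(vega_counts)
random_total, random_pct = percentages(random_counts)


def max_deviation(computed, printed):
    """Largest absolute difference between computed and printed percentages."""
    return max(abs(c - p) for c, p in zip(computed, printed))
```

Both columns sum to the stated totals, and every recomputed percentage agrees with the printed value to within 0.01 percentage points.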
Selection of cDNAs classified as "bad_multiple_exon_cdna" by CAFTAN.
| 1 | No signal | perfect | TNFRSF14-001 | The sequence ends with the polyA signal AAUAUA | |
| 2 | No signal | perfect | SLC35E2-001 | Partial sequence at the 3' UTR end. The real sequence is longer, so the polyA signal cannot be found | |
| 3 | No signal | perfect | PARK7-005 | Partial sequence at the 3' UTR end. The real sequence is longer, so the polyA signal cannot be found | |
| 4 | No signal | perfect | PARK7-006 | RPL20 gene; possibly an (A)ATGAA(A) signal | |
| 5 | No signal | perfect | C1orf86-001 | Not a perfect cDNA; the last exon is missing, so no polyA signal can be found | |
| 6 | No signal | perfect | PHF13-001 | False 3' end (artifact); the real 3' end lies upstream of this point and is supported by many ESTs and a canonical polyA signal | |
| 7 | No signal | perfect | CTNNBIP1-002 | Good cDNA; it has none of the canonical or typical non-canonical polyA signals checked in CAFTAN. Putative polyA signal: ATGTAAATAT | |
| 8 | No signal | perfect | CTNNBIP1-003 | The real 3' UTR is slightly longer and contains a canonical polyA signal 20 bp upstream of the polyA tail. | |
| 9 | No signal | perfect | CTNNBIP1-004 | Partial sequence at the 3' UTR end; the terminal bases are missing, which would form a perfect canonical polyA signal (AAUAAA) 15 bp upstream of the polyA tail | |
| 10 | No signal | perfect | UBE4B-002 | Partial sequence at the 3' UTR end; many terminal bases are missing, which would form a perfect canonical polyA signal (AAUAAA) | |
| 11 | No signal | perfect | SDF4-002 | Partial sequence at the 3' UTR end; almost all of the 3' UTR is missing | |
| 12 | No signal | perfect | KIF1B-002 | Partial sequence at the 3' UTR end; almost all of the 3' UTR is missing | |
| 13 | No signal | perfect | KIF1B-004 | Partial sequence at the 3' UTR end; almost all of the 3' UTR is missing | |
| 14 | No signal | perfect | UBE2J2-001 | Partial sequence at the 3' UTR end; almost all of the 3' UTR is missing |
All of them lack a polyA signal but, as expected, match VEGA transcripts perfectly. Manual inspection of these VEGA cDNAs by an expert curator showed that all but two (No. 4 and No. 8) were incomplete, missing part of the 3' UTR, or had an artifactual 3' end, as in No. 6.
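Why a truncated 3' UTR produces a "No signal" call can be shown with a minimal sketch of a polyA signal scan. It is not CAFTAN's implementation: the function name, the 50-base search window, and the restriction to the canonical hexamer AATAAA (AAUAAA in RNA) are assumptions made for illustration. CAFTAN also checks typical non-canonical signals, which are not listed here.

```python
def find_polya_signal(cdna, window=50, signals=("AATAAA",)):
    """Search the last `window` bases of a cDNA for a polyA signal.

    Returns (position, signal) for the hit closest to the 3' end,
    or None when no signal lies inside the window.
    The window size and the signal list are illustrative assumptions.
    """
    seq = cdna.upper().replace("U", "T")  # accept RNA or DNA notation
    tail = seq[-window:]
    offset = len(seq) - len(tail)
    best = None
    for signal in signals:
        pos = tail.rfind(signal)  # rightmost occurrence inside the window
        if pos != -1 and (best is None or offset + pos > best[0]):
            best = (offset + pos, signal)
    return best


# A toy sequence with the canonical signal 15 bases before the 3' end.
complete = "C" * 60 + "AATAAA" + "G" * 15
# The same sequence cut short before the signal, as in a partial 3' UTR.
truncated = complete[:55]
```

With these toy inputs the complete sequence yields a hit at position 60, whereas the truncated one yields `None`, mirroring the partial 3' UTR entries in the table above.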
Selection of cDNAs classified as "bad_single_exon_cdna" by CAFTAN.
| histone 2, H2ab | HIST2H2AB-001 | Histone mRNAs are not polyadenylated. | ||
| Small proline-rich protein 1A | SPRR1A-001 | Missing the tail of the 3' end of the cDNA | ||
| histone 1, H3c | HIST1H3C-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2bb | HIST1H2BB-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H3f | HIST1H3F-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2ad | HIST1H2AD-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2bf | HIST1H2BF-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2bg | HIST1H2BG-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2bh | HIST1H2BH-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2bi | HIST1H2BI-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H4h | HIST1H4H-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2 | HIST1H2BJ-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2aj | HIST1H2AJ-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2al | HIST1H2AL-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H2am | HIST1H2AM-001 | Histone mRNAs are not polyadenylated. | ||
| RP11-295F4.3-001 | Missing the tail of the 3' end of the cDNA | |||
| TAAR8-001 | Signal 70 bp upstream. EST evidence from a shorter gene | |||
| histone 1, H1c | HIST1H1C-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H3i | HIST1H3I-001 | Histone mRNAs are not polyadenylated. | ||
| histone 1, H3j | HIST1H3J-001 | Histone mRNAs are not polyadenylated. | ||
| histone H1 | HIST1H1A-002 | Histone mRNAs are not polyadenylated. | ||
| cysteinyl leukotriene receptor 2 | CYSLTR2-001 | Partial 3'UTR Sequence | ||
| G protein-coupled receptor 10 | GPR10-001 | Partial 3'UTR Sequence | ||
| Interferon, alpha 4 | IFNA4-001 | Internally primed | ||
| Interferon, alpha 6 | IFNA6-001 | Internally primed | ||
| T-cell acute lymphocytic leukemia 2 | TAL2 | Internally primed | ||
| G protein-coupled receptor 174 | GPR174 | Partial 3'UTR Sequence | ||
| Insulin receptor substrate 4 | IRS4 | Ends in the signal | ||
| Potassium voltage-gated channel | KCNA10-001 | Partial 3'UTR Sequence | ||
| Taste receptor, type 2, member 4 | TAS2R4 | Partial 3'UTR Sequence | ||
| A disintegrin | ADAM21-001 | Artifact, much longer than the real gene | ||
| MAS1 oncogene-like | MAS1L-001 | Partial 3'UTR Sequence | ||
| histone 2, H2ac | HIST2H2AC-001 | Histone mRNAs are not polyadenylated. | ||
| HsG1428-001 | Partial 3'UTR Sequence | |||
| HsG647-001 | Partial 3'UTR Sequence | |||
| HsG684-001 | Presence of a repeat in the 3' UTR | |||
| HsG2001-001 | Partial 3'UTR Sequence |
All of them lack a polyA signal but, as expected, match VEGA transcripts perfectly. Manual inspection of these VEGA cDNAs by an expert curator showed that 19 of them are histone-coding cDNAs, whose mRNAs are not polyadenylated. The rest either lacked a complete 3' UTR or had an artifactual 3' end.