Yi-An Chen, Chang-Chun Lin, Chin-Di Wang, Huan-Bin Wu, Pei-Ing Hwang.
Abstract
BACKGROUND: The enormous amount of sequence data available in public-domain databases has been a gold mine for researchers exploring various themes in the life sciences, so the quality of such data is of serious concern. Removing vector contamination is one of the most important steps in obtaining accurate sequence data that contain only the cDNA insert from the basecalls output by an automated DNA sequencer. Popular bioinformatics programs for vector trimming include LUCY, cross_match and SeqClean.
Year: 2007 PMID: 17997864 PMCID: PMC2194723 DOI: 10.1186/1471-2164-8-416
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Illustration of some trimming details. The shaded area highlights the range covering 30% of the EST length from either end. According to the original SeqClean design, a vector contaminant is recognized only if some or all of the vector-similar sequence is identified within this range. The blue boxes indicate vector-derived sequence. The yellow open boxes represent cDNA inserts, and the green bars show low-quality regions. The small stars indicate where base number 1 is located in CVS coordinates. The red boxes mark the product of SeqClean trimming. Comments on each of the three trimming situations appear to their right. Condition A indicates ESTs that were mistakenly trashed, condition B shows incomplete trimming, and condition C is an example of correct trimming. Example ESTs corresponding to each of the three conditions are shown in the table below, where position numbering follows the coordinates of the untrimmed EST sequences.
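The end-range rule described in the Figure 1 caption can be sketched as a small check. This is only an illustration of the rule as the caption states it, not SeqClean's actual code; the function name `hit_in_end_window`, its parameters, and the 1-based inclusive coordinates are assumptions made for this example.

```python
def hit_in_end_window(hit_start: int, hit_end: int, est_len: int,
                      end_percent: int = 30) -> bool:
    """Return True if a vector hit overlaps the first or last `end_percent`
    of an EST. Coordinates are 1-based and inclusive (an assumption)."""
    if not (1 <= hit_start <= hit_end <= est_len):
        raise ValueError("hit must lie inside the EST")
    # Integer arithmetic avoids floating-point rounding at the window edge.
    window = est_len * end_percent // 100
    left_window_end = window                   # last base of the 5' window
    right_window_start = est_len - window + 1  # first base of the 3' window
    return hit_start <= left_window_end or hit_end >= right_window_start


# Hypothetical hits on a 1000-base EST (windows: bases 1-300 and 701-1000).
for start, end in [(1, 50), (400, 600), (650, 720)]:
    print(start, end, hit_in_end_window(start, end, 1000))
```

Under this rule a vector-derived stretch lying entirely in the middle 40% of the read would go unrecognized, which is consistent with the incomplete-trimming situation the caption describes.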
Figure 2. Re-linearization of vector pGEM-T at its cloning site. A. Simplified map of vector pGEM-T. The insert DNA of interest was cloned between bases 60 and 61. The primers were introduced with the DNA insert during cDNA preparation in the wet lab. B. Vector sequence of pGEM-T before and after re-linearization. Bases 1–198 and 2899–3015 are shown; the omitted nucleotides are represented by dotted lines. Additional nucleotides TA (colored in blue) were appended to the vector at position 60 during a wet-lab experimental procedure. The letters in pink boxes (bases 1–60 plus the appended T) were moved electronically to the end of the sequence for vector re-linearization.
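The re-linearization described in the Figure 2 caption amounts to rotating the vector sequence at the cloning site. The sketch below is a minimal illustration under stated assumptions: the function name `relinearize` and its parameters are hypothetical, the toy sequence is not the real pGEM-T sequence, and the placement of the appended T (moved with the head) and A (left at the new start) is one reading of the caption.

```python
def relinearize(vector_seq: str, cut_after: int,
                head_extra: str = "", tail_extra: str = "") -> str:
    """Rotate a vector sequence so that it starts right after the cloning site.

    cut_after  -- 1-based position of the last base before the insert
                  (60 for the pGEM-T example in the caption)
    head_extra -- nucleotides appended after base `cut_after`; they move
                  to the end together with the head (assumed: "T")
    tail_extra -- nucleotides placed before base `cut_after + 1`; they
                  stay at the new start (assumed: "A")
    """
    if not (0 < cut_after < len(vector_seq)):
        raise ValueError("cut_after must fall inside the sequence")
    head = vector_seq[:cut_after] + head_extra   # bases 1..cut_after (+ extra)
    tail = tail_extra + vector_seq[cut_after:]   # (extra +) remaining bases
    return tail + head


# Toy 9-base "vector" cut after base 3; not real vector data.
toy = "AAACCCGGG"
print(relinearize(toy, 3, head_extra="T", tail_extra="A"))
```

After rotation, the sequence flanking the insert on either side sits at the two ends of the reference rather than in its interior, which is the property the caption's re-linearized form (RVS) is built around.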
Effect of SeqClean on phred-treated ESTs trimmed against the CVS or RVS form of the vector sequence
| SeqClean | CVS | RVS |
| Number of ESTs to trim | 6035 | 6035 |
| Number trashed | 239 | 143 |
| Number of incompletely trimmed ESTs | 1118 | 2 |
| % incomplete trimmingᵃ | 18.5% | 0.0% |
a % of incomplete trimming was obtained by dividing the number of incompletely trimmed ESTs by the number of ESTs to trim. For example, the percentage of incompletely trimmed ESTs in the CVS column was calculated as: 1118/6035 × 100%.
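The footnote's calculation can be reproduced directly from the counts in the table. The helper name `percent_incomplete` is introduced here only for illustration.

```python
def percent_incomplete(n_incomplete: int, n_to_trim: int) -> float:
    """Percentage of ESTs left incompletely trimmed."""
    if n_to_trim <= 0:
        raise ValueError("n_to_trim must be positive")
    return n_incomplete / n_to_trim * 100


# Counts taken from the table above.
cvs = percent_incomplete(1118, 6035)
rvs = percent_incomplete(2, 6035)
print(f"CVS: {cvs:.1f}%  RVS: {rvs:.1f}%")  # CVS: 18.5%  RVS: 0.0%
```

The RVS value is not exactly zero (2 of 6035 ESTs); it only rounds to 0.0% at one decimal place.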
Cloning vectors, primers and adaptors used for ESTs in this study
| Vector (cutting site) | Primer/adaptor | No. of sequences | Source of EST |
| pGEM-Teasy (60–61) | primer1, primer2R, adapter1, adapter2R | 2882 | ABRC laboratories |
| pTriplEx2 (571–581) | ToTom | 2318 | ABRC laboratories |
| pCR21TOPO (294–295) | primer1, primer2R, adapter1, adapter2R | 5 | ABRC laboratories |
| pZL1 (239–253) | | 700 | ABRC laboratories |
| pT7Blue (95–96)ᵃ | | 130 | ABRC laboratories |
| pCMV-SPORT6.1 (774–1222) | | 3000 | trace archive at NCBI |
| pT7T3-PAC (200–240) | | 3000 | trace archive at NCBI |
| pDNR-LIB (77–298) | Sfi-(dT)30; AAGCAGTGGTATCAACGCAGAGTGGCC | 3000 | trace archive at NCBI |
a Only 14 of the 130 ESTs cloned into pT7Blue in this test used the indicated adapter while the remaining 116 used no adapters.
Effect of vector forms in different trimming programs
| CVS | # resulting ESTsᵃ | 14729 | 14870 | 14817 | 14808 |
| | # removed ESTs | 306 | 165 | 218 | 227 |
| | # incompletely trimmed ESTs | 1128 | 586 | 155 | 156 |
| | % contamination | 7.50% | 3.90% | 1.03% | 1.04% |
| | average length of trimmed ESTs | 551.4 ± 163.4 | 560.7 ± 175.4 | 546.3 ± 176.7 | 539.7 ± 175.6 |
| RVS | # resulting ESTsᵃ | 14825 | 14870 | 14793 | 14785 |
| | # removed ESTs | 210 | 165 | 242 | 250 |
| | # incompletely trimmed ESTs | 12 | 637 | 190 | 193 |
| | % contamination | 0.08% | 4.14% | 1.26% | 1.28% |
| | average length of trimmed ESTs | 543.0 ± 172.1 | 563.5 ± 171.8 | 546.9 ± 176.7 | 540.1 ± 175.6 |
a Number of ESTs that are at least 100 bases long after processing by the indicated vector trimming program.
Effect of vectors on vector trimming performance by three programs
| Vector | # ESTs tested | | | | | | | | |
| | | CVS | RVS | CVS | RVS | CVS | RVS | CVS | RVS |
| pCMV-SPORT6.1 | 3000 | 0.03% | 0.03% | 3.17% | 3.17% | 0.30% | 0.27% | 0.17% | 0.17% |
| pT7T3-PAC | 3000 | 0.00% | 0.00% | 0.23% | 0.77% | 0.10% | 0.07% | 0.03% | 0.03% |
| pDNR-LIB | 3000 | 0.30% | 0.30% | 1.17% | 1.17% | 0.00% | 0.00% | 0.03% | 0.03% |
| pGEM-Teasy | 2882 | 38.38% | 0.03% | 6.45% | 7.15% | 0.52% | 0.52% | 0.31% | 0.31% |
| pTriplEx2 | 2318 | 0.04% | 0.04% | 9.66% | 9.66% | 4.66% | 4.87% | 5.31% | 5.52% |
| pCR21TOPO | 5 | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| pZL1 | 700 | 1.43% | 0.00% | 3.86% | 3.86% | 2.71% | 2.71% | 2.29% | 2.29% |
| pT7Blue | 130 | 0.77% | 0.00% | 9.23% | 9.23% | 0.77% | 25.38% | 0.77% | 25.38% |
| Total | 15035 | 7.50% | 0.08% | 3.90% | 4.14% | 1.03% | 1.26% | 1.04% | 1.28% |
Contamination rates of ESTs after vector trimming under the indicated conditions, grouped by cloning vector. Each value is the percentage of ESTs cloned in a given vector that remained incompletely cleaned.
Vector contamination in ESTs sampled from dbEST
| Vector | # Contaminated ESTsᵃ | # ESTsᵇ | Ratioᶜ |
| pT7T3D-PacI | 20 | 3883 | 0.52% |
| pSPORT1 | 99 | 3784 | 2.62% |
| pCMVSPORT6.0 | 17 | 4784 | 0.36% |
| pME18S-FL | 13 | 3382 | 0.38% |
| pUC18 | 17 | 1618 | 1.05% |
| pCS107 | 3 | 1198 | 0.25% |
| PBRcDNASfiIAB | 5 | 1177 | 0.42% |
| pBluescriptSK(-) | 20 | 2494 | 0.80% |
| pDNR-LIB | 24 | 1817 | 1.32% |
| pOTB7 | 6 | 891 | 0.67% |
| pBluescriptIISK(+) | 27 | 2516 | 1.07% |
| pBK-CMV | 11 | 783 | 1.40% |
| pBluescriptSK(+) | 11 | 713 | 1.54% |
| pExpress-1 | 19 | 1507 | 1.26% |
| pBluescriptIIKS(+) | 7 | 689 | 1.02% |
| pcDNA3.1(+/-) | 8 | 469 | 1.71% |
| pCMV-SPORT6.1 | 186 | 1676 | 11.10% |
| pTriplEx2 | 60 | 545 | 11.01% |
| pDONR222 | 20 | 516 | 3.88% |
| pCS108 | 1 | 362 | 0.28% |
| pT7T3D | 1 | 559 | 0.18% |
| Total | 575 | 35363 | 1.63%ᵈ |
a Number of ESTs cloned into the indicated vector that remained contaminated with vector sequence.
b Number of ESTs cloned into the indicated vector that were subjected to BLAST analysis to identify vector contamination.
c Percentage of vector contamination among ESTs using the same cloning vector.
d Percentage obtained as 575/35363 × 100% = 1.63%.
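The total ratio is pooled over all sampled ESTs rather than averaged over the per-vector ratios. The short sketch below recomputes it from the counts in the table; the variable names are chosen only for this example.

```python
# (contaminated ESTs, ESTs analysed) per cloning vector, copied from the table.
counts = {
    "pT7T3D-PacI": (20, 3883), "pSPORT1": (99, 3784),
    "pCMVSPORT6.0": (17, 4784), "pME18S-FL": (13, 3382),
    "pUC18": (17, 1618), "pCS107": (3, 1198),
    "PBRcDNASfiIAB": (5, 1177), "pBluescriptSK(-)": (20, 2494),
    "pDNR-LIB": (24, 1817), "pOTB7": (6, 891),
    "pBluescriptIISK(+)": (27, 2516), "pBK-CMV": (11, 783),
    "pBluescriptSK(+)": (11, 713), "pExpress-1": (19, 1507),
    "pBluescriptIIKS(+)": (7, 689), "pcDNA3.1(+/-)": (8, 469),
    "pCMV-SPORT6.1": (186, 1676), "pTriplEx2": (60, 545),
    "pDONR222": (20, 516), "pCS108": (1, 362), "pT7T3D": (1, 559),
}

total_contaminated = sum(c for c, _ in counts.values())
total_ests = sum(n for _, n in counts.values())
pooled = total_contaminated / total_ests * 100

# Per-vector ratios, to see which vectors dominate the contamination.
per_vector = {v: c / n * 100 for v, (c, n) in counts.items()}
worst = max(per_vector, key=per_vector.get)

print(f"{total_contaminated}/{total_ests} = {pooled:.2f}%")
print(f"highest per-vector ratio: {worst} ({per_vector[worst]:.2f}%)")
```

Because the vectors are sampled in very different numbers, the pooled figure weights each vector by its EST count, so heavily sampled, low-contamination vectors pull the total well below the two highest per-vector ratios.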