Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences.

Zhi-Yong Tao, Xu Sui, Cao Jun, Richard Culleton, Qiang Fang, Hui Xia, Qi Gao.

Abstract

We found a 47 aa protein sequence that occurs 17 times in the Plasmodium vivax nucleotide database published on PlasmoDB. Coding sequence analysis showed multiple restriction enzyme sites within the 141 bp nucleotide sequence, and a His6 tag attached to the 3' end, suggesting cloning vector origins. Sequences with vector contamination were submitted to NCBI, and BLASTN was used to cross-examine whole-genome shotgun contigs (WGS) from four recently deposited P. vivax whole genome sequencing projects. There are at least 26 genes listed in the PlasmoDB database that incorporate this cloning vector sequence into their predicted provisional protein products.

Year:  2015        PMID: 26062606      PMCID: PMC4464627          DOI: 10.1186/s13071-015-0927-x

Source DB:  PubMed          Journal:  Parasit Vectors        ISSN: 1756-3305            Impact factor:   3.876


Findings

Genome databases are of great value for biomedical research, and have significantly advanced our understanding of the biology of multiple parasite species, including Plasmodium falciparum and Plasmodium vivax, the two most common malaria parasites [1, 2]. The P. vivax genome sequence was produced by Carlton et al. at TIGR in 2008 by shotgun sequencing at five-fold coverage, and is deposited in GenBank and PlasmoDB [3]. Assembly errors are inevitable when constructing genomes, and, in the case of intracellular parasites, contamination with host DNA sequence also poses a problem. Indeed, recent research has shown that many published genomes, including mammalian genomes, contain contaminating sequence from a variety of microorganisms [4]. Regarding gene prediction errors in malaria parasites specifically, Lu et al. reported that about 20% of genes are incorrectly predicted in the P. falciparum genome database, mostly because of errors arising from the gene prediction software used [5].

During a search for repetitive protein fragments in the P. vivax genome, conducted on the nucleotide sequences deposited in PlasmoDB [6], we found a 47 amino acid (aa) sequence (KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH) repeated a total of 17 times in several annotated contigs. A His6 tag (Fig 1A) was attached to the 3′ end, and multiple restriction enzyme sites (Fig 1B) were present within the 141 bp nucleotide sequence (AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT). When run through a VecScreen search (NCBI, http://www.ncbi.nlm.nih.gov/tools/vecscreen/), this sequence shows significant similarity to the promoter probe vector pMQ354 (Fig 1C). These features suggest cloning vector sequence contamination.

We performed BLASTN searches of these 17 coding sequences against the whole-genome shotgun (WGS) contigs of four whole genome sequences (India VII [GenBank: AFMK01000000], North Korean [GenBank: AFBK01000000], Brazil I [GenBank: AFNI01000000], Mauritania I [GenBank: AFNJ01000000]) [7]. All hits were aligned with the reference sequence. The alignments showed missing or substituted base pairs at the 3′ end of the query sequences. As a result, the correct stop codon of the parasite gene was absent, and the vector sequence was incorporated into the predicted protein product, which then terminated at the vector stop codon. Because frame shifts were possible, we translated the coding sequence in all three frames (Fig 1A) and used the protein translations of frames two and three as query sequences against the PlasmoDB protein database. This yielded five and four sequence hits respectively, and these nine sequences were aligned and corrected as described above. In total, we discovered 26 sequences in PlasmoDB contaminated by the vector sequence (Table 1).
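The sequence features described above can be checked with a few lines of code. The following is a minimal sketch, not the authors' pipeline: it translates the 141 bp sequence with the standard genetic code, confirms the terminal His6 run, and scans for a handful of well-known restriction enzyme recognition sites. The enzyme list is an illustrative selection (an assumption), not the set shown in Fig 1B, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The 141 bp sequence and its 47 aa translation, as reported in the text.
VECTOR_DNA = (
    "AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG "
    "AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC "
    "CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT"
).replace(" ", "")
REPORTED_PROTEIN = "KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH"

# Illustrative (assumed) panel of enzymes; all sites are palindromic,
# so searching the forward strand is sufficient.
SITES = {
    "PstI": "CTGCAG", "EcoRV": "GATATC", "NotI": "GCGGCCGC",
    "XhoI": "CTCGAG", "XbaI": "TCTAGA", "ApaI": "GGGCCC",
    "SacII": "CCGCGG", "BstBI": "TTCGAA", "MluI": "ACGCGT",
    "AgeI": "ACCGGT",
}


def translate(dna):
    """Translate a DNA string in its first reading frame."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def has_his6_tail(protein):
    """True if the protein ends in six consecutive histidines."""
    return protein.endswith("H" * 6)


def find_sites(dna, sites):
    """Map enzyme name to the 0-based position of its first site, if present."""
    return {name: dna.find(site) for name, site in sites.items() if site in dna}


protein = translate(VECTOR_DNA)
print(len(VECTOR_DNA), "bp ->", len(protein), "aa")
print("matches reported protein:", protein == REPORTED_PROTEIN)
print("His6 tail:", has_his6_tail(protein))
for name, pos in sorted(find_sites(VECTOR_DNA, SITES).items(), key=lambda kv: kv[1]):
    print(f"{name:6s} at position {pos}")
```

A dense run of recognition sites followed by a His6 tag is typical of a vector multiple cloning site, which is why a simple scan of this kind is a useful first screen before a VecScreen search.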
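The three-frame translation used to generate additional query sequences can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the helper name `three_frame_translations` is hypothetical, the standard genetic code is assumed, and stop codons are rendered as `*`. The text does not say how the frame-two and frame-three translations were trimmed before querying PlasmoDB, so the sketch returns them untrimmed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The 141 bp sequence as reported in the text.
VECTOR_DNA = (
    "AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG "
    "AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC "
    "CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT"
).replace(" ", "")


def translate(dna, offset=0):
    """Translate complete codons of dna, starting at the given 0-based offset."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def three_frame_translations(dna):
    """Return forward-strand translations keyed by frame number (1, 2, 3)."""
    return {frame: translate(dna, frame - 1) for frame in (1, 2, 3)}


frames = three_frame_translations(VECTOR_DNA)
for frame, protein in frames.items():
    print(f"frame {frame}: {protein}")
```

Each translation could then be submitted as a protein query; only the forward strand is handled here, since the text describes three frames rather than six.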
Fig. 1

Cloning vector sequence contamination in PlasmoDB. a: A 141 bp vector-derived sequence with a His6 tag occurs repeatedly in the Plasmodium vivax nucleotide database. b: Dozens of restriction enzyme sites are present in the sequence. c: A VecScreen search showed that the contaminating sequence strongly matches pMQ354. d: Typical errors in the Sal-1 strain sequencing results caused by the contaminating sequence. The missing ends are marked in yellow, and contaminating vector sequences are underlined.

Table 1

Correction of 26 genes affected by a contaminated cloning vector sequence in PlasmoDB

ID   PlasmoDB ID      GenBank accession number   Length before correction (bp)   Length after correction (bp)
1    PVX_253300       XM_001612328               1,086                           945
2    PVX_250300       XM_001612323               1,047                           906
3    PVX_211290 (a)   XM_001612311               945                             807
4    PVX_226290 (a)   XM_001612298               792                             741
5    PVX_214290 (a)   XM_001612318               861                             792
6    PVX_215290 (a)   XM_001612317               861                             793
7    PVX_220290       XM_001612333               654                             513
8    PVX_252300       XM_001612332               1,149                           1,008
9    PVX_222290 (b)   XM_001612349               1,233                           1,098
10   PVX_196290 (b)   XM_001612337               1,173                           1,101
11   PVX_195290       XM_001612373               1,893                           1,752
12   PVX_231290       XM_001612334               639                             498
13   PVX_213290       XM_001612274               513                             441
14   PVX_249300       XM_001612331               1,113                           972
15   PVX_227290       XM_001612370               1,902                           1,761
16   PVX_240290 (c)   XM_001612308               942                             801
17   PVX_235290 (c)   XM_001612320               717                             576
18   PVX_254300       XM_001612327               1,062                           921
19   PVX_200290 (d)   XM_001612305               921                             876
20   PVX_201290 (d)   XM_001612303               828                             780
21   PVX_206290 (d)   XM_001612319               924                             876
22   PVX_208290 (d)   XM_001612329               1,017                           876
23   PVX_216290 (e)   XM_001612279               711                             570
24   PVX_218290 (e)   XM_001612281               711                             570
25   PVX_237290 (e)   XM_001612314               711                             570
26   PVX_217290 (e)   XM_001612282               621                             570

a, b, c, d, e: each letter marks a group of duplicated sequences.
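The change in length for each gene follows from Table 1 by subtraction. The sketch below is a hedged illustration rather than part of the original analysis: it tabulates how many base pairs each correction removed and how often that change equals the full 141 bp of the vector-derived sequence. The function and variable names are hypothetical, and the values are transcribed from the table as printed here.

```python
from collections import Counter

VECTOR_LEN = 141  # length of the vector-derived sequence reported in the text

# (PlasmoDB ID, length before correction, length after correction), in bp.
TABLE_1 = [
    ("PVX_253300", 1086, 945), ("PVX_250300", 1047, 906),
    ("PVX_211290", 945, 807), ("PVX_226290", 792, 741),
    ("PVX_214290", 861, 792), ("PVX_215290", 861, 793),
    ("PVX_220290", 654, 513), ("PVX_252300", 1149, 1008),
    ("PVX_222290", 1233, 1098), ("PVX_196290", 1173, 1101),
    ("PVX_195290", 1893, 1752), ("PVX_231290", 639, 498),
    ("PVX_213290", 513, 441), ("PVX_249300", 1113, 972),
    ("PVX_227290", 1902, 1761), ("PVX_240290", 942, 801),
    ("PVX_235290", 717, 576), ("PVX_254300", 1062, 921),
    ("PVX_200290", 921, 876), ("PVX_201290", 828, 780),
    ("PVX_206290", 924, 876), ("PVX_208290", 1017, 876),
    ("PVX_216290", 711, 570), ("PVX_218290", 711, 570),
    ("PVX_237290", 711, 570), ("PVX_217290", 621, 570),
]


def removed_bp(rows):
    """Return the net number of base pairs removed per gene."""
    return {gene: before - after for gene, before, after in rows}


removed = removed_bp(TABLE_1)
full_vector = [gene for gene, n in removed.items() if n == VECTOR_LEN]
tally = Counter(removed.values())

print(f"{len(full_vector)} of {len(removed)} genes shrink by exactly {VECTOR_LEN} bp")
for n_bp, n_genes in sorted(tally.items(), reverse=True):
    print(f"{n_bp:4d} bp removed: {n_genes} gene(s)")

# Sanity check: a coding sequence length is normally a multiple of three,
# so any entry flagged here is worth re-checking against the source record.
for gene, before, after in TABLE_1:
    if after % 3:
        print(f"{gene}: corrected length {after} bp is not a multiple of 3")
```

The tally only summarizes the table; it does not by itself explain why some corrections change the length by less than 141 bp.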

Generally, cloning vector sequences are readily recognized by a variety of tools, such as VecScreen. The P. vivax database has been updated more than ten times [8], and yet this vector sequence contamination persists, suggesting that it may have special characteristics that make it difficult to identify automatically. PCR amplification of Sal-1 genomic DNA using primers specific for the putative contaminating sequence would show definitively whether these sequences really are present in the genome, a scenario we believe to be highly unlikely. The publication of four geographical reference strain whole genome sequences now provides an opportunity to correct the Sal-1 reference genome sequence. Given our findings, further interrogation of the P. vivax genome deposited in PlasmoDB may reveal additional contamination. Any previous work that made use of these sequences may also require reappraisal.
References

1.  [Plasmodium vivax specific peptides prediction and screening based on repetitive protein sequences and linear B cell epitope].

Authors:  Zhi-yong Tao; Sui Xu; Yuan-yuan Wang; Qiang Fang; Hui Xia; Qi Gao
Journal:  Zhongguo Xue Xi Chong Bing Fang Zhi Za Zhi       Date:  2014-06

2.  The Plasmodium vivax genome sequencing project.

Authors:  Jane Carlton
Journal:  Trends Parasitol       Date:  2003-05

3.  Comparative genomics of the neglected human malaria parasite Plasmodium vivax.

Authors:  Jane M Carlton; John H Adams; Joana C Silva; Shelby L Bidwell; Hernan Lorenzi; Elisabet Caler; Jonathan Crabtree; Samuel V Angiuoli; Emilio F Merino; Paolo Amedeo; Qin Cheng; Richard M R Coulson; Brendan S Crabb; Hernando A Del Portillo; Kobby Essien; Tamara V Feldblyum; Carmen Fernandez-Becerra; Paul R Gilson; Amy H Gueye; Xiang Guo; Simon Kang'a; Taco W A Kooij; Michael Korsinczky; Esmeralda V-S Meyer; Vish Nene; Ian Paulsen; Owen White; Stuart A Ralph; Qinghu Ren; Tobias J Sargeant; Steven L Salzberg; Christian J Stoeckert; Steven A Sullivan; Marcio M Yamamoto; Stephen L Hoffman; Jennifer R Wortman; Malcolm J Gardner; Mary R Galinski; John W Barnwell; Claire M Fraser-Liggett
Journal:  Nature       Date:  2008-10-09       Impact factor: 49.962

4.  PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data.

Authors:  Amit Bahl; Brian Brunk; Jonathan Crabtree; Martin J Fraunholz; Bindu Gajria; Gregory R Grant; Hagai Ginsburg; Dinesh Gupta; Jessica C Kissinger; Philip Labo; Li Li; Matthew D Mailman; Arthur J Milgram; David S Pearson; David S Roos; Jonathan Schug; Christian J Stoeckert; Patricia Whetzel
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

5.  Genome sequence of the human malaria parasite Plasmodium falciparum.

Authors:  Malcolm J Gardner; Neil Hall; Eula Fung; Owen White; Matthew Berriman; Richard W Hyman; Jane M Carlton; Arnab Pain; Karen E Nelson; Sharen Bowman; Ian T Paulsen; Keith James; Jonathan A Eisen; Kim Rutherford; Steven L Salzberg; Alister Craig; Sue Kyes; Man-Suen Chan; Vishvanath Nene; Shamira J Shallom; Bernard Suh; Jeremy Peterson; Sam Angiuoli; Mihaela Pertea; Jonathan Allen; Jeremy Selengut; Daniel Haft; Michael W Mather; Akhil B Vaidya; David M A Martin; Alan H Fairlamb; Martin J Fraunholz; David S Roos; Stuart A Ralph; Geoffrey I McFadden; Leda M Cummings; G Mani Subramanian; Chris Mungall; J Craig Venter; Daniel J Carucci; Stephen L Hoffman; Chris Newbold; Ronald W Davis; Claire M Fraser; Bart Barrell
Journal:  Nature       Date:  2002-10-03       Impact factor: 49.962

6.  Unexpected cross-species contamination in genome sequencing projects.

Authors:  Samier Merchant; Derrick E Wood; Steven L Salzberg
Journal:  PeerJ       Date:  2014-11-20       Impact factor: 2.984

7.  The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum.

Authors:  Daniel E Neafsey; Kevin Galinsky; Rays H Y Jiang; Lauren Young; Sean M Sykes; Sakina Saif; Sharvari Gujja; Jonathan M Goldberg; Sarah Young; Qiandong Zeng; Sinéad B Chapman; Aditya P Dash; Anupkumar R Anvikar; Patrick L Sutton; Bruce W Birren; Ananias A Escalante; John W Barnwell; Jane M Carlton
Journal:  Nat Genet       Date:  2012-08-05       Impact factor: 38.330

8.  cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome.

Authors:  Fangli Lu; Hongying Jiang; Jinhui Ding; Jianbing Mu; Jesus G Valenzuela; José M C Ribeiro; Xin-zhuan Su
Journal:  BMC Genomics       Date:  2007-07-27       Impact factor: 3.969

