
Cleaning the GenBank Arabidopsis thaliana data set.

P G Korning, S M Hebsgaard, P Rouzé, S Brunak.

Abstract

Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data are unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
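The abstract does not list the individual sanity checks, so the following Python sketch is only an illustration of what gene structure checks of this kind might look like. The function name `check_gene_structure`, the coordinate convention (1-based, inclusive, forward strand, exons covering the coding region from start codon to stop codon), and the specific checks (GT/AG intron borders, reading-frame length, start and stop codons, internal stops, a minimum intron length) are all assumptions made for the example, not the authors' procedure.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(
    seq: str,
    exons: List[Tuple[int, int]],
    min_intron_len: int = 0,  # hypothetical threshold; 0 disables the check
) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed.

    Assumptions (not taken from the paper): `exons` holds 1-based inclusive
    (start, end) pairs on the forward strand, in order, and together they
    span the coding region from the start codon through the stop codon.
    """
    seq = seq.upper()
    problems: List[str] = []

    # Exons must be ordered, non-overlapping and inside the sequence.
    prev_end = 0
    for start, end in exons:
        if start > end or start <= prev_end or end > len(seq):
            problems.append(f"bad exon coordinates {start}..{end}")
            return problems  # later checks are meaningless with bad coordinates
        prev_end = end

    # Each intron lies between two consecutive exons.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = seq[end1:start2 - 1]
        label = f"intron {end1 + 1}..{start2 - 1}"
        if intron[:2] != "GT":
            problems.append(f"{label}: donor site is not GT")
        if intron[-2:] != "AG":
            problems.append(f"{label}: acceptor site is not AG")
        if len(intron) < min_intron_len:
            problems.append(f"{label}: shorter than {min_intron_len} nt")

    # The spliced coding sequence should form an open reading frame.
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("coding sequence does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("coding sequence does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("coding sequence length is not a multiple of 3")
    else:
        for i in range(0, len(cds) - 3, 3):
            if cds[i:i + 3] in STOP_CODONS:
                problems.append(f"in-frame stop codon at CDS position {i + 1}")

    return problems


# Toy example: two exons separated by one GT...AG intron.
toy_seq = "ATGGCC" + "GTAAGTTTTTCAG" + "AAATAA"
print(check_gene_structure(toy_seq, [(1, 6), (20, 25)]))
```

A failed check in such a sketch is best treated as a flag for manual review rather than proof of an error, since rare non-canonical splice sites do occur in real genes.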

Year:  1996        PMID: 8628656      PMCID: PMC145627          DOI: 10.1093/nar/24.2.316

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


  18 in total

1.  A splice site mutant of maize activates cryptic splice sites, elicits intron inclusion and exon exclusion, and permits branch point elucidation.

Authors:  S Lal; J H Choi; J R Shaw; L C Hannah
Journal:  Plant Physiol       Date:  1999-10       Impact factor: 8.340

2.  A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites.

Authors:  N Tolstrup; P Rouzé; S Brunak
Journal:  Nucleic Acids Res       Date:  1997-08-01       Impact factor: 16.971

3.  The maize genome contains a helitron insertion.

Authors:  Shailesh K Lal; Michael J Giroux; Volker Brendel; C Eduardo Vallejos; L Curtis Hannah
Journal:  Plant Cell       Date:  2003-02       Impact factor: 11.277

4.  EUGENE'HOM: A generic similarity-based gene finder using multiple homologous sequences.

Authors:  Sylvain Foissac; Philippe Bardou; Annick Moisan; Marie-Josée Cros; Thomas Schiex
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

5.  DNA barcoding insect-host plant associations.

Authors:  José A Jurado-Rivera; Alfried P Vogler; Chris A M Reid; Eduard Petitpierre; Jesús Gómez-Zurita
Journal:  Proc Biol Sci       Date:  2009-02-22       Impact factor: 5.349

6.  Identification of protein-coding regions in Arabidopsis thaliana genome based on quadratic discriminant analysis.

Authors:  M Q Zhang
Journal:  Plant Mol Biol       Date:  1998-07       Impact factor: 4.076

7.  Splicing of precursors to mRNA in higher plants: mechanism, regulation and sub-nuclear organisation of the spliceosomal machinery. (Review)

Authors:  G G Simpson; W Filipowicz
Journal:  Plant Mol Biol       Date:  1996-10       Impact factor: 4.076

8.  Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences.

Authors:  J Kleffe; K Hermann; W Vahrson; B Wittig; V Brendel
Journal:  Nucleic Acids Res       Date:  1996-12-01       Impact factor: 16.971

9.  Sequence analysis of an 81 kb contig from Arabidopsis thaliana chromosome III.

Authors:  F Quigley; P Dao; A Cottet; R Mache
Journal:  Nucleic Acids Res       Date:  1996-11-01       Impact factor: 16.971

10.  Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information.

Authors:  S M Hebsgaard; P G Korning; N Tolstrup; J Engelbrecht; P Rouzé; S Brunak
Journal:  Nucleic Acids Res       Date:  1996-09-01       Impact factor: 16.971
