Abstract
FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information. Database URL: http://flybase.org
Year: 2012 PMID: 23160412 PMCID: PMC3500518 DOI: 10.1093/database/bas039
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Literature curation pipeline. A weekly PubMed database search identifies new Drosophila-related publications. The citation details for these articles are added to our bibliography and, where possible, the associated PDF is downloaded. The correspondence email address for each new article is extracted from the PDF and used to invite the author to use our Fast-Track Your Paper (FTYP) tool. Gene-to-publication links are generated either by the author using this tool or by a FlyBase curator ‘skim’ curating the article. These links are published to the FlyBase website at the first opportunity. Data types found in the article, flagged either by the authors or by curators, are then used to generate a priority list for ‘full curation’, where we extract detailed genetic and molecular information.
Data-type flags used in literature curation (taken from an article by Bunt et al., 2012; see Ref. 2)
| Data-type flags | Data presented in an article |
|---|---|
| Drosophila reagents | |
| New allele or aberration | Generation of a new classical allele or chromosomal aberration in a Drosophilid genome |
| New transgene | Generation of a new transgenic construct |
| Gene characterization | |
| Initial characterization | Initial characterization of a Drosophilid gene |
| Merge of gene reports | Evidence suggesting the merge of two or more FlyBase gene reports |
| Gene rename | New gene symbol or name for an existing gene in FlyBase |
| Expression | |
| Expression in a wild-type background | New temporal or spatial expression data of any |
| Expression in a mutant background | Expression data of any |
| Phenotypes and interactions | |
| Phenotypic analysis | Novel phenotypic data |
| Physical interaction | Physical interactions involving |
| Genome annotation data | |
| Changes to | New experimental data relevant to |
| Changes to non- | New experimental data relevant to the gene model structure of non- |
| Mapping of features to genome | |
| | Experimental definition of |
The main CVs used in literature curation
| Ontology | Example search term (CV ID) |
|---|---|
| fly_anatomy | dMP2 neuron (FBbt:00001602) |
| fly_development | Pupal stage P6 (FBdv:00005353) |
| Term qualifier | Nutrition conditional (FBcv:0000714) |
| Phenotypic class | Smell perception defective (FBcv:0000404) |
| Sequence ontology | Engineered_foreign_gene (SO:0000281) |
| Origin of mutation | P-element activity (FBcv:0000486) |
| Allele class | Amorphic allele (FBcv:0000688) |
| Cellular component | Germ cell nucleus (GO:0043073) |
| Molecular function | Satellite DNA binding (GO:0003696) |
| Biological process | mRNA processing (GO:0006397) |
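
The example column above pairs a term name with an identifier of the form `PREFIX:number` in parentheses. As a minimal sketch of how a text-mining step might split such entries, the following Python assumes only that layout; the function name `parse_cv_entry` and the returned dictionary keys are illustrative and are not part of any FlyBase tool.

```python
import re

# Matches "term name (PREFIX:digits)", the layout used in the table's example column.
CV_ENTRY = re.compile(r"^(?P<name>.+?)\s*\((?P<prefix>[A-Za-z]+):(?P<number>\d+)\)$")


def parse_cv_entry(text):
    """Split an entry such as 'dMP2 neuron (FBbt:00001602)' into its parts.

    Hypothetical helper for illustration only.
    """
    match = CV_ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"not a 'name (PREFIX:number)' entry: {text!r}")
    prefix = match.group("prefix")
    return {
        "name": match.group("name"),
        "id": f"{prefix}:{match.group('number')}",
        "prefix": prefix,
    }


# Entries copied from the table above.
examples = [
    "dMP2 neuron (FBbt:00001602)",
    "mRNA processing (GO:0006397)",
    "Engineered_foreign_gene (SO:0000281)",
]
parsed = [parse_cv_entry(entry) for entry in examples]
```

Keeping the prefix separate makes it straightforward to route a matched term to the ontology it belongs to (for example, `FBbt` for fly_anatomy or `GO` for Gene Ontology terms).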
Figure 2. Literature curation into proformae. Text files composed of various proformae are used to capture data from the literature. (A) The proformae are ordered such that each curation record has to start with a publication proforma, so all objects mentioned subsequently can be attributed to the relevant publication. Allele proformae are added underneath the parent gene proforma, so all allele information can be related back to the parent gene. (B) Proformae are split into four different types of fields. The fields start with an exclamation mark (for processing) and each field has a field code, e.g. GA1a is the allele symbol field (all fields in the allele proforma are coded GAx).
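
To make the field convention in panel (B) concrete, here is a rough Python sketch of reading proforma-style lines. Only two details come from the caption: fields start with an exclamation mark, and allele fields carry codes beginning with `GA` (such as `GA1a`). Everything else is an assumption made for illustration, including the `! <code>. <label> :<value>` line layout, the codes `GA99` and `ZZ1`, the sample values, and the function names.

```python
import re

# ASSUMED layout: "! <code>. <label> :<value>". The caption confirms only the
# leading "!" and the existence of field codes such as GA1a.
FIELD_LINE = re.compile(
    r"^!\s*(?P<code>[A-Z]+\d+[a-z]?)\.\s*(?P<label>[^:]*):(?P<value>.*)$"
)


def parse_fields(text):
    """Return (code, label, value) tuples for every line that looks like a field."""
    fields = []
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields.append(
                (
                    match.group("code"),
                    match.group("label").strip(),
                    match.group("value").strip(),
                )
            )
    return fields


def is_allele_field(code):
    """Allele proforma fields are coded GAx, per the caption."""
    return code.startswith("GA")


# Hypothetical sample: GA99 and ZZ1 are made-up codes, and the values are placeholders.
sample = """\
free text that is not a field line
! ZZ1. Placeholder non-allele field :some value
! GA1a. Allele symbol :exampleGene[1]
! GA99. Hypothetical allele field :placeholder
"""

fields = parse_fields(sample)
allele_fields = [field for field in fields if is_allele_field(field[0])]
```

Filtering on the code prefix, rather than on the human-readable label, mirrors the caption's point that the codes exist for processing.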
Figure 3. Phenotype curation. Example data entries for a section of text [taken from an article by Baines (2003), see Ref. 5]. First, we identify the object we are ascribing the phenotype to, then we concisely curate the phenotype as free text, relating it to the object (which is placed between ‘at sign’ symbols as these symbols are hyperlinked). We then annotate the phenotype to CV terms, in this case, to terms from our ‘phenotypic class’ and ‘fly_anatomy’ ontologies.
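
The ‘at sign’ convention described in this caption lends itself to simple pattern matching. The Python sketch below pulls the delimited objects out of a free-text phenotype statement. The sentence and the symbols `geneA[1]` and `geneB[2]` are invented placeholders rather than text from the cited article, and `extract_objects` is a hypothetical helper.

```python
import re

# An object is any run of non-"@" characters enclosed by a pair of "@" symbols.
AT_DELIMITED = re.compile(r"@([^@]+)@")


def extract_objects(free_text):
    """Return the objects marked up between 'at sign' symbols, in order of appearance."""
    return AT_DELIMITED.findall(free_text)


# Invented example statement; the symbols are placeholders.
sentence = (
    "@geneA[1]@ embryos show an altered response "
    "compared with @geneB[2]@ controls."
)
objects = extract_objects(sentence)
```

A curation aid could then check each extracted symbol against the list of known genetic entities before the statement is annotated with CV terms.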