Giulia Antonazzo, Jose M Urbano, Steven J Marygold, Gillian H Millburn, Nicholas H Brown.
Abstract
Brief summaries describing the function of each gene's product(s) are of great value to the research community, especially when interpreting genome-wide studies that reveal changes to hundreds of genes. However, manually writing such summaries, even for a single species, is a daunting task; for example, the Drosophila melanogaster genome contains almost 14 000 protein-coding genes. One solution is to use computational methods to generate summaries, but this often fails to capture the key functions or express them eloquently. Here, we describe how we solicited help from the research community to generate manually written summaries of D. melanogaster gene function. Based on the data within the FlyBase database, we developed a computational pipeline to identify researchers who have worked extensively on each gene. We e-mailed these researchers to ask them to draft a brief summary of the main function(s) of the gene's product, which we edited for consistency to produce a 'gene snapshot'. This approach yielded 1800 gene snapshot submissions within a 3-month period. We discuss the general utility of this strategy for other databases that capture data from the research literature. Database URL: https://flybase.org/
Year: 2020 PMID: 31960022 PMCID: PMC6971343 DOI: 10.1093/database/baz152
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Detailed descriptions of the GO scores and Evidence score variables from the totalScore formula
| Description |
|---|
| Sum of the scores given to every GO term associated with a gene, adding 1 point per GO term, with ROOT terms giving 0.5 points and no term giving 0 points. |
| 200, 100 and 50 points if there is at least one biological process, molecular function and cellular component GO term, respectively. |
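The two rows of the table above can be read as two small functions. The following is a minimal sketch, not the authors' pipeline code: the function names are hypothetical, the way the two values enter the totalScore formula is not given here, and it is an assumption that a root term counts as "at least one term" for its aspect. The three root identifiers are the standard GO roots for biological process, molecular function and cellular component.

```python
# Standard identifiers of the three GO root terms (one per aspect).
ROOT_TERMS = {"GO:0008150", "GO:0003674", "GO:0005575"}

# Points from the second table row: awarded once per aspect that has a term.
ASPECT_POINTS = {
    "biological_process": 200,
    "molecular_function": 100,
    "cellular_component": 50,
}


def go_term_score(term_ids):
    """1 point per GO term, 0.5 per root term; no terms gives 0."""
    return sum(0.5 if term in ROOT_TERMS else 1.0 for term in term_ids)


def aspect_score(aspects_with_terms):
    """Add the fixed points for every aspect that has at least one GO term."""
    present = set(aspects_with_terms)
    return sum(pts for aspect, pts in ASPECT_POINTS.items() if aspect in present)


# Hypothetical gene: one root term plus two specific terms, covering two aspects.
terms = ["GO:0008150", "GO:0007155", "GO:0005515"]
print(go_term_score(terms))  # 2.5
print(aspect_score(["biological_process", "molecular_function"]))  # 300
```

Keeping the two quantities separate makes it easy to combine them later with whatever weighting the full formula uses.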
Detailed descriptions of the variables used in the scoring function for ranking authors based on their contribution to the knowledge about a gene
| Description |
|---|
| Number of papers published by the author on the selected gene in the last 10 years |
| Number of papers published by the author describing new genetic reagents (allele or transgenic construct) for the selected gene |
| Number of papers published by the author on the selected gene, where the paper is linked to fewer than 500 genes in total (to exclude high-throughput analyses) |
| Number of papers published by the author on the selected gene, where the paper describes phenotypes of classical allele(s) of the selected gene |
| Number of papers published by the author on the selected gene including experimental evidence GO terms (Biological Process and Molecular Function aspects) |
| Number of papers published by the author on the selected gene including expression data and/or antibody information |
| Number of papers published by the author on the selected gene including ‘gene source’ information (i.e. papers first characterizing a gene) |
An example illustrating how, for a given gene, various values quantify how much each author has contributed to the information stored in FlyBase on that gene
| Author | E-mail | LTP papers | 10y Papers | Clas. Al. w Pheno | GO exp | Exp + AB | Mut + Nat. Les. | Gene Src. |
|---|---|---|---|---|---|---|---|---|
| Author1 | author1@... | 24 | 8 | 12 | 1 | 0 | 5 | 0 |
| Author2 | author2@... | 9 | 5 | 7 | 3 | 0 | 5 | 0 |
| Author3 | author3@... | 12 | 4 | 7 | 3 | 0 | 5 | 0 |
| Author4 | author4@... | 28 | 6 | 9 | 0 | 0 | 7 | 0 |
| Author5 | author5@... | 14 | 0 | 10 | 1 | 0 | 7 | 0 |
| Author6 | author6@... | 17 | 2 | 6 | 2 | 0 | 5 | 0 |
| Author7 | author7@... | 15 | 5 | 7 | 3 | 0 | 1 | 0 |
| Author8 | author8@... | 28 | 10 | 4 | 0 | 0 | 4 | 0 |
| Author9 | author9@... | 12 | 7 | 3 | 1 | 0 | 3 | 0 |
| Author10 | author10@... | 14 | 1 | 2 | 0 | 3 | 3 | 0 |
A final score is calculated as a function of these values, and the authors can be ranked on the basis of this score. See Materials and methods for descriptions of the column headings. LTP papers: low-throughput papers; 10y Papers: papers in the last 10 years; Clas. Al. w Pheno: classical alleles with phenotypes; GO exp: experimental evidence GO terms; Exp + AB: expression plus antibodies; Mut + Nat. Les.: mutagen plus nature of lesion; Gene Src.: gene source.
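The ranking step described in the note can be sketched as follows. This is an illustration under stated assumptions, not the scoring function used by the authors: the actual function and its weights are not given here, so the sketch substitutes a plain weighted sum with hypothetical weights of 1, and it assumes that the seven count columns follow the order of the abbreviations in the note. Four rows are copied from the example table.

```python
# Count columns, ASSUMED to follow the order of the abbreviations in the note.
COLUMNS = ["LTP papers", "10y Papers", "Clas. Al. w Pheno", "GO exp",
           "Exp + AB", "Mut + Nat. Les.", "Gene Src."]

# Four rows copied from the example table.
AUTHORS = {
    "Author1": [24, 8, 12, 1, 0, 5, 0],
    "Author2": [9, 5, 7, 3, 0, 5, 0],
    "Author3": [12, 4, 7, 3, 0, 5, 0],
    "Author8": [28, 10, 4, 0, 0, 4, 0],
}

# HYPOTHETICAL weights: 1 for every column, i.e. a plain sum of the counts.
WEIGHTS = [1.0] * len(COLUMNS)


def score(counts, weights=WEIGHTS):
    """Weighted sum of one author's counts."""
    return sum(w * c for w, c in zip(weights, counts))


def rank(authors, weights=WEIGHTS):
    """Author names sorted from highest to lowest score."""
    return sorted(authors, key=lambda name: score(authors[name], weights),
                  reverse=True)


for name in rank(AUTHORS):
    print(name, score(AUTHORS[name]))
```

Passing a different `weights` list to `rank` shows how strongly the ordering depends on the choice of weights, which is why the placeholder values should not be read as the published ones.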
The effect of reminders on the fraction of genes for which snapshots were obtained
| | | | | |
|---|---|---|---|---|
| Second cycle (1690 genes) | 496/1690 (29%) | 355/1104 (32%) | 76/672 (11%) | 205/693 (30%) |
| Third cycle (692 genes) | 160/692 (23%) | 128/532 (24%) | NA | NA |
Note: responses are shown only for authors assigned up to five genes. Counts of genes are followed by percentages in parentheses, and all percentages are relative to the number of genes falling into the given category.
Figure 1. Author response rates in the different cycles. Pilot cycle: semi-manually selected authors. Second cycle: predicted authors, no gene categorization. Third cycle: predicted authors, with gene categorization.
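As a quick arithmetic check, each percentage in the table can be recomputed from the pair of counts in the same cell. The sketch below assumes only that every cell reads "count/category size (rounded percentage)"; the meaning of the individual columns is not recoverable from the table as shown, so the cells are kept as unlabeled pairs and the helper name is hypothetical.

```python
def percent(count, category_size):
    """Percentage of a category, rounded to the nearest whole number."""
    return round(100 * count / category_size)


# (count, category size) pairs copied from the table, in row order.
second_cycle = [(496, 1690), (355, 1104), (76, 672), (205, 693)]
third_cycle = [(160, 692), (128, 532)]

print([percent(c, n) for c, n in second_cycle])  # [29, 32, 11, 30]
print([percent(c, n) for c, n in third_cycle])   # [23, 24]
```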
The rate of snapshot submission in the second e-mailing cycle, with genes stratified by the type of annotation data in FlyBase
| GO data? | Alleles? | Orthologs? | | | | | |
|---|---|---|---|---|---|---|---|
| Y (exp) | Y | Y | 3214 | 2578 | 937 | | |
| Y (exp) | Y | N | 813 | 635 | 260 | | |
| Y (exp) | N | Y | 1573 | 491 | 144 | | |
| Y (exp) | N | N | 601 | 213 | 114 | | |
| Y (non-exp) | Y | Y | 758 | 264 | 37 | 14% | 5% |
| Y (non-exp) | Y | N | 196 | 61 | 11 | 18% | 6% |
| Y (non-exp) | N | Y | 2544 | 271 | 43 | 16% | 2% |
| Y (non-exp) | N | N | 923 | 110 | 25 | 23% | 3% |
| no GO | Y | Y | 155 | 42 | 1 | 2% | 1% |
| no GO | Y | N | 272 | 48 | 4 | 8% | 1% |
| no GO | N | Y | 517 | 22 | 1 | 5% | 0.2% |
| no GO | N | N | 2341 | 40 | 3 | 8% | 0.1% |
| Total | | | 13 907 | 4775 | 1580 | 33% | 11% |
Abbreviations: exp = experimental GO data; non-exp = non-experimental GO data; no GO = no GO data available; Alleles = Alleles with phenotypes
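The Total row of the table can be reproduced from the twelve strata. The headings of the count columns did not survive, so the names `first`, `second` and `third` below are placeholders for the three count columns in their printed order. The only relationship the sketch relies on is arithmetic: the two percentage columns match the third count divided by the second and by the first count, respectively.

```python
# (first, second, third) counts for the twelve strata, in table order.
ROWS = [
    (3214, 2578, 937), (813, 635, 260), (1573, 491, 144), (601, 213, 114),
    (758, 264, 37), (196, 61, 11), (2544, 271, 43), (923, 110, 25),
    (155, 42, 1), (272, 48, 4), (517, 22, 1), (2341, 40, 3),
]


def totals(rows):
    """Column sums over all strata."""
    return tuple(sum(column) for column in zip(*rows))


def percentages(first, second, third):
    """Third count as a whole-number percentage of the second and first counts."""
    return round(100 * third / second), round(100 * third / first)


first, second, third = totals(ROWS)
print(first, second, third)               # 13907 4775 1580
print(percentages(first, second, third))  # (33, 11)
```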
Figure 2. The relationship between the number of genes for which a given author was asked to provide snapshots and the average fraction of genes for which snapshots were returned (not including authors sent a spreadsheet with many genes). The numbers above the data points indicate the number of authors in each category.
Figure 3. Screenshot of the top of a gene report page showing the Gene Snapshot section.
Figure 4. Overview of pipeline to produce Gene Snapshots.