How much does curation cost?

Peter D Karp

Abstract

NIH administrators have recently expressed concerns about the cost of curation for biological databases. However, they did not articulate the exact costs of curation. Here we calculate the cost of biocuration of articles for the EcoCyc database as $219 per article over a 5-year period. That cost is 6-15% of the cost of open-access publication fees for publishing biomedical articles, and we estimate that it is 0.088% of the cost of the overall research project that generated the experimental results. Thus, curation costs are small in an absolute sense, and represent a minuscule fraction of the cost of the research.
© The Author(s) 2016. Published by Oxford University Press.

Year:  2016        PMID: 27504008      PMCID: PMC4976296          DOI: 10.1093/database/baw110

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


Perspective

In a recent article, Bourne et al. (1) argue that the recent flat budgets for biomedical research imply that biomedical databases must chart a new course and explore new business models and methodologies. They are very concerned about the costs of databases in general, and of curation in particular. But to put these issues into proper perspective, it is important to understand how much curation actually costs. Although many of us might expect curation to be quite expensive, we will show that its costs are quite modest, both on a per-article basis and as a fraction of the cost of the original research.

We estimate the cost of curation for the EcoCyc database (2) using the following methodology. We want to be clear about this methodology to encourage other groups to perform similar estimates, and because the calculation involves a number of considerations. Our overall approach is to divide the cost of curation work over a given time period by the number of publications curated during that time period.

What do we mean by the cost of curation work? Our analysis considers curation only; we omit the costs of database and website operations, quality assurance, software development, outreach, preparation of publications, and bioinformatics research that are performed by EcoCyc and by other database projects. We exclude these other tasks because they do not in fact involve curation per se. Furthermore, if we want to understand the benefits of replacing professional curation with, say, crowd-sourced curation or automated text mining, such a replacement would not obviate other database costs such as outreach and website operations, so it is important to understand the costs of curation itself. We do include the costs of managing curators in our estimate of curation costs.
Because in the EcoCyc project some of the preceding non-curation tasks are performed by curators, we have had to estimate what fraction of their time curators actually spend doing curation (that estimate ranges from 80% for some curators on the project to 100% for others). Note that costs will vary significantly across institutions because of variations in indirect costs (two EcoCyc curators work in Mexico and Australia, and NIH pays only an 8% indirect cost rate to foreign institutions) and in the cost of living (labor costs are quite low in Mexico).

What do we mean by the number of publications? We queried past versions of EcoCyc to determine the number of publications cited by the database in each version, using a program that interrogates every field in EcoCyc that could include a citation. These statistics are subject to possible over-counting and under-counting. Under-counting could result if a curator neglected to cite a curated publication within EcoCyc, which we think would be extremely rare. Over-counting could occur if a curator cited an article that they thought relevant to, say, a gene that they were curating, when they had not actually curated the article. This situation does occur occasionally, but we think its incidence is <5%.

During the 5 years from May 2011 to 2016, the number of publications cited by EcoCyc increased by 9606 (from 21 448 to 31 054). Our curation costs during this period were $2.1M, yielding a curation cost per publication of $219. Thus, curation costs per publication were 6 to 15% of the cost of an open-access publication fee for a biomedical article (open-access fees typically range from $1500 to $3500).

Furthermore, let us calculate the cost of curation as a fraction of the cost of performing the research. Let us postulate that an average NIH research grant (R01) has a budget of $250 000 per year and produces one publication per year.
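The per-publication figure and its comparison with open-access fees can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the variable names are our own, and the $2.1M total is the rounded figure quoted above, so the result is approximate.

```python
# Figures quoted in the text; variable names are illustrative assumptions.
pubs_start = 21_448             # publications cited by EcoCyc, May 2011
pubs_end = 31_054               # publications cited by EcoCyc, 2016
curation_cost_usd = 2_100_000   # rounded 5-year curation cost

# Cost of curation work divided by publications curated in the same period.
pubs_curated = pubs_end - pubs_start
cost_per_pub = curation_cost_usd / pubs_curated

# Typical open-access fee range quoted in the text.
oa_fee_low, oa_fee_high = 1_500, 3_500
share_of_high_fee = cost_per_pub / oa_fee_high  # lower bound of the range
share_of_low_fee = cost_per_pub / oa_fee_low    # upper bound of the range

print(f"publications curated: {pubs_curated}")
print(f"cost per publication: ${cost_per_pub:.0f}")
print(f"share of open-access fee: {share_of_high_fee:.1%} to {share_of_low_fee:.1%}")
```

Other database projects could substitute their own publication counts and curation budgets to obtain a comparable estimate.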
The $219 curation cost is thus 0.088% of the cost of the overall research project, a minuscule price to pay for accurately curated and computable biological knowledge. We can also compare the curation cost to the cost of coffee breaks for the project. Imagine that a scientist who works 10 h per day on average takes one five-minute coffee break each day. They spend 0.83% of their time on such breaks. Thus, the curation cost is slightly more than one-tenth the cost of coffee breaks, a cost that is considered negligible.

We should not expect curation costs to be identical for every database, because many factors influence them. Some databases may accept higher error rates than others [the error rate for EcoCyc curation has been estimated at 1.40% (3)]. Database curation procedures vary significantly, and we believe EcoCyc curation is likely to be relatively high on the scale of complexity: EcoCyc curators author long mini-reviews for genes and pathways; they extract a large number of database fields (350) for many different datatypes, ranging from metabolic pathways to gene essentiality; and they capture molecular interactions at a level of detail that enables generation of metabolic models from EcoCyc and capture of mechanisms of gene regulation for multiple types of regulation. On the other hand, EcoCyc curation costs are lowered by the factors noted earlier for the non-U.S. groups who participate in EcoCyc (a lower indirect cost rate and lower labor costs). We estimate that if the non-U.S. curation had been performed at a U.S. university, the cost would rise to approximately $320 per publication. This roughly 50% increase, although significant, would not undermine the conclusion of this article: that curation costs are minute when compared to the cost of the research.
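The comparisons above reduce to a handful of ratios. The following Python sketch reproduces them from the figures quoted in the text. The variable names are our own, and the coffee-break comparison assumes, as the text implicitly does, that a share of working time corresponds to the same share of project cost.

```python
# Figures quoted in the text; variable names are illustrative assumptions.
curation_cost_per_pub = 219     # USD per curated publication
r01_budget_per_pub = 250_000    # postulated: one R01 year yields one publication

# Curation as a fraction of the research cost behind one publication.
curation_share = curation_cost_per_pub / r01_budget_per_pub

# One five-minute coffee break in a 10-hour working day.
coffee_minutes = 5
workday_minutes = 10 * 60
coffee_share = coffee_minutes / workday_minutes

# Curation cost relative to coffee-break cost (time share taken as cost share).
curation_vs_coffee = curation_share / coffee_share

# Hypothetical scenario from the text: all curation performed at a U.S. university.
us_cost_per_pub = 320
relative_increase = us_cost_per_pub / curation_cost_per_pub - 1

print(f"curation share of research cost: {curation_share:.3%}")
print(f"coffee-break share of time:      {coffee_share:.2%}")
print(f"curation / coffee breaks:        {curation_vs_coffee:.3f}")
print(f"increase in U.S.-only scenario:  {relative_increase:.0%}")
```

Because the inputs are the rounded $219 and $320 figures, the computed increase is only approximately the 50% quoted in the text.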
In a future perspective we will examine whether the curation process is inefficient, and whether the other approaches suggested by Bourne et al., such as direct curation by the authors of a publication, are workable.

In summary, the $219 per publication cost of curation is a minuscule fraction (0.088%) of the cost of research, approximately one-tenth the cost of the coffee breaks for the researchers who performed the research. Open-access publication fees, which the scientific community apparently considers to be a reasonable tax on the research project, cost approximately 1% of the budget of a research project, significantly more than the cost of curation.

Funding

This work was supported by SRI International.

Conflict of interest

None declared.

References

1.  Philip E Bourne; Jon R Lorsch; Eric D Green. Perspective: Sustaining the big-data ecosystem. Nature, 2015-11-05.

2.  Ingrid M Keseler; Amanda Mackie; Martin Peralta-Gil; Alberto Santos-Zavaleta; Socorro Gama-Castro; César Bonavides-Martínez; Carol Fulcher; Araceli M Huerta; Anamika Kothari; Markus Krummenacker; Mario Latendresse; Luis Muñiz-Rascado; Quang Ong; Suzanne Paley; Imke Schröder; Alexander G Shearer; Pallavi Subhraveti; Mike Travers; Deepika Weerasinghe; Verena Weiss; Julio Collado-Vides; Robert P Gunsalus; Ian Paulsen; Peter D Karp. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res, 2012-11-09.

3.  Ingrid M Keseler; Marek Skrzypek; Deepika Weerasinghe; Albert Y Chen; Carol Fulcher; Gene-Wei Li; Kimberly C Lemmer; Katherine M Mladinich; Edmond D Chow; Gavin Sherlock; Peter D Karp. Curation accuracy of model organism databases. Database (Oxford), 2014-06-12.
