BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, which makes it virtually impossible to reproduce and validate analyses and drastically constrains collaboration and re-use of data.

RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse, which simplifies the setup of QSAR datasets and allows for exporting in QSAR-ML as well as in traditional CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services.

CONCLUSIONS: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problem of recording which software components were used and in which versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence to work collectively, and it also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible to the community.
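The idea of a dataset that records its own setup can be illustrated with a minimal sketch. The element and attribute names below (`dataset`, `structure`, `descriptor`, `ontologyId`, `provider`, `version`) are hypothetical and are not taken from the QSAR-ML schema; the identifiers and version strings are likewise invented for illustration. The sketch only shows the principle that each descriptor is stored as an ontology identifier together with the implementation and version that computed it, so that the setup survives export and can be compared between datasets.

```python
import xml.etree.ElementTree as ET


def build_dataset(structures, descriptors):
    """Build an XML element recording structures and descriptor setup.

    All element and attribute names are illustrative assumptions,
    not the actual QSAR-ML schema.
    """
    root = ET.Element("dataset")
    struct_list = ET.SubElement(root, "structures")
    for s in structures:
        ET.SubElement(struct_list, "structure", id=s["id"], smiles=s["smiles"])
    desc_list = ET.SubElement(root, "descriptors")
    for d in descriptors:
        # Each descriptor is pinned to an ontology identifier plus the
        # provider (software implementation) and its version.
        ET.SubElement(
            desc_list,
            "descriptor",
            ontologyId=d["ontology_id"],
            provider=d["provider"],
            version=d["version"],
        )
    return root


def descriptor_setup(root):
    """Return the sorted (ontology id, provider, version) triples of a dataset."""
    return sorted(
        (d.get("ontologyId"), d.get("provider"), d.get("version"))
        for d in root.iter("descriptor")
    )


# Hypothetical example data: one structure, one descriptor implementation.
example = build_dataset(
    structures=[{"id": "mol1", "smiles": "CCO"}],
    descriptors=[
        {"ontology_id": "example:xlogp", "provider": "example-provider", "version": "1.0.0"}
    ],
)

# Serialize and parse again to mimic exchanging the dataset as a file.
xml_text = ET.tostring(example, encoding="unicode")
roundtrip = ET.fromstring(xml_text)

# The same descriptor computed by a newer implementation is a different setup.
newer = build_dataset(
    structures=[{"id": "mol1", "smiles": "CCO"}],
    descriptors=[
        {"ontology_id": "example:xlogp", "provider": "example-provider", "version": "2.0.0"}
    ],
)
same_setup = descriptor_setup(roundtrip) == descriptor_setup(example)
changed_setup = descriptor_setup(newer) != descriptor_setup(example)
```

Comparing the triples returned by `descriptor_setup` is one simple way to check whether two datasets were produced with the same descriptor implementations before joining them.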