Lisa M Gandy1, Jordan Gumm1, Benjamin Fertig2, Anne Thessen2, Michael J Kennish3, Sameer Chavan4, Luigi Marchionni5, Xiaoxin Xia5, Shambhavi Shankrit5, Elana J Fertig5.
Abstract
Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information-retrieval-inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved high accuracy (on the order of 85-100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma-separated values (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases.
Year: 2017 PMID: 28437440 PMCID: PMC5402950 DOI: 10.1371/journal.pone.0175860
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Number of datasets available before and after curation.
Fig 2. System figure for the Synthesizer application.
Table 1. Example spreadsheets with data types.
| Spreadsheet | Column header | Column Datatype | Column values |
|---|---|---|---|
| A | gender | discrete | male, female |
| B | sex | discrete | male, female |
| B | grade | discrete | moderate, poor, well |
Table 2. COCA collocates for the gender column.
| Term | Collocates |
|---|---|
| gender | difference, age, ethnic, effect, issue, significant, race, role, between, class |
| female | both, participate, voice, bodily, figure, student, male, than, athlete, first |
| male | young, black, white |
Fig 3. A) Venn diagram showing the common collocates between the column “gender” in spreadsheet A and the column “sex” in spreadsheet B of Table 1. B) Eq 2 applied to the running example to compute C) the matrix of cosine similarity measures between spreadsheets A and B (cs_matrix).
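The comparison described in this caption can be sketched in Python. Eq 2 is not reproduced in this record, so the sketch assumes the standard cosine similarity over bags of collocates. The helper names `column_bag` and `cosine` are hypothetical and are not taken from the Synthesizer code. Only the collocates listed in the table above are used, so the header “sex” and the values of “grade” contribute nothing here, whereas the real algorithm would presumably look them up in COCA.

```python
from collections import Counter
from math import sqrt

# Collocates copied from the COCA table above; terms not listed there
# (e.g. "sex", "grade") are simply absent in this sketch.
COLLOCATES = {
    "gender": ["difference", "age", "ethnic", "effect", "issue",
               "significant", "race", "role", "between", "class"],
    "female": ["both", "participate", "voice", "bodily", "figure",
               "student", "male", "than", "athlete", "first"],
    "male": ["young", "black", "white"],
}


def column_bag(header, values):
    """Bag of collocates for a column header plus its discrete values."""
    bag = Counter()
    for term in [header, *values]:
        bag.update(COLLOCATES.get(term, []))  # unknown terms add nothing
    return bag


def cosine(a, b):
    """Standard cosine similarity between two term-count bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


gender_a = column_bag("gender", ["male", "female"])        # spreadsheet A
sex_b = column_bag("sex", ["male", "female"])              # spreadsheet B
grade_b = column_bag("grade", ["moderate", "poor", "well"])  # spreadsheet B

print(round(cosine(gender_a, sex_b), 2))    # 0.75
print(round(cosine(gender_a, grade_b), 2))  # 0.0
```

Under these assumptions the “gender” and “sex” columns score about 0.75 because they share the collocates of “male” and “female”, while “gender” and “grade” share none and score 0.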
Fig 4. Synthesizer upload files.
Fig 5. Synthesizer user interface.
Fig 6. Global find and replace.
Fig 7. Synthesize cosine similarity threshold selection on two training HNSCC datasets (GSE6791 and GSE3292).
A) Cosine similarity between labels and columns of each pair of columns in the two datasets. B) Accuracy of automatic merging as a function of the cosine similarity merging threshold. C) Accuracy of resulting merging for each set of columns using a threshold cosine similarity value of 0.5.
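How a merging threshold such as 0.5 might be applied to a matrix of cosine similarities can be sketched as follows. This record does not describe the paper's exact matching procedure, so the best-match-per-column rule, the function name `propose_merges`, and the scores in `cs_matrix` are all illustrative assumptions, not results from the study.

```python
def propose_merges(cs_matrix, threshold=0.5):
    """Map each column of spreadsheet A to its best-scoring column of
    spreadsheet B, keeping only pairs at or above the threshold.

    Assumed rule for illustration; not necessarily the published method.
    """
    merges = {}
    for col_a, row in cs_matrix.items():
        if not row:
            continue
        best_b = max(row, key=row.get)  # highest cosine similarity in the row
        if row[best_b] >= threshold:
            merges[col_a] = best_b
    return merges


# Illustrative scores only: rows are columns of A, keys are columns of B.
cs_matrix = {"gender": {"sex": 0.75, "grade": 0.0}}

print(propose_merges(cs_matrix))                 # {'gender': 'sex'}
print(propose_merges(cs_matrix, threshold=0.9))  # {}
```

Raising the threshold trades fewer automatic merges for fewer false matches, which is the trade-off panel B plots as accuracy against the threshold value.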
Fig 8. Synthesize accuracy with regards to cancer genomics datasets.
Fig 9. Synthesize accuracy with regards to ecology datasets.