Jai Ram Rideout, John H Chase, Evan Bolyen, Gail Ackermann, Antonio González, Rob Knight, J Gregory Caporaso.
Abstract
BACKGROUND: Bioinformatics software often requires human-generated tabular text files as input and has specific requirements for how those data are formatted. Users frequently manage these data in spreadsheet programs, which is convenient for researchers who are compiling the requisite information because the spreadsheet programs can easily be used on different platforms including laptops and tablets, and because they provide a familiar interface. It is increasingly common for many different researchers to be involved in compiling these data, including study coordinators, clinicians, lab technicians and bioinformaticians. As a result, many research groups are shifting toward using cloud-based spreadsheet programs, such as Google Sheets, which support the concurrent editing of a single spreadsheet by different users working on different platforms. Most of the researchers who enter data are not familiar with the formatting requirements of the bioinformatics programs that will be used, so validating and correcting file formats is often a bottleneck prior to beginning bioinformatics analysis.
MAIN TEXT: We present Keemei, a Google Sheets Add-on, for validating tabular files used in bioinformatics analyses. Keemei is available free of charge from Google's Chrome Web Store. Keemei can be installed and run on any web browser supported by Google Sheets. Keemei currently supports the validation of two widely used tabular bioinformatics formats, the Quantitative Insights into Microbial Ecology (QIIME) sample metadata mapping file format and the Spatially Referenced Genetic Data (SRGD) format, but is designed to easily support the addition of others.
Keywords: Cloud; Data validation; Metadata; Plugin; QIIME; Spreadsheet; Tabular file format
Year: 2016 PMID: 27296526 PMCID: PMC4906574 DOI: 10.1186/s13742-016-0133-6
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1 Keemei screenshots of a user validating a QIIME sample metadata mapping file. From the Google Sheets menu bar, the user selects Add-ons > Keemei > Validate QIIME mapping file to validate their spreadsheet against the QIIME (quantitative insights into microbial ecology) sample metadata mapping file format specification. Validation results are displayed directly in the spreadsheet and in a sidebar interface on the right side of the spreadsheet. Invalid cells are highlighted in the spreadsheet, where a red background color indicates the cell has one or more errors or warnings associated with it, and a yellow background color indicates the cell has one or more warnings associated with it. Hovering the mouse over a cell will display the reason(s) why the cell is invalid. The sidebar contains a summary of the validation (e.g. file format validated against, number of invalid cells, etc.) and lists invalid cells. The user can click on a cell in the sidebar to view the reason(s) why the cell is invalid (similar to hovering over a cell in the spreadsheet). In this screenshot, the user has clicked on invalid cell A3; we see that cells A3 and A5 contain duplicate sample identifiers, which are disallowed in QIIME mapping files. The data in this screenshot are derived from [13].
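The duplicate-identifier check described in this caption can be illustrated with a minimal sketch. It is written in Python for readability and is not Keemei's own code; the function name `find_duplicate_ids`, its parameters, and the message wording are all hypothetical. It assumes the sheet is a list of rows with one header row and sample identifiers in the first column, and it reports invalid cells by spreadsheet address, as the sidebar does.

```python
from collections import defaultdict


def find_duplicate_ids(rows, id_column=0, header_rows=1):
    """Map each cell holding a duplicated sample ID to a message.

    Hypothetical helper for illustration only. Cell addresses use
    spreadsheet notation (e.g. "A3"); only columns A-Z are handled.
    """
    column_letter = chr(ord("A") + id_column)

    # Group 1-based sheet row numbers by the identifier they contain.
    rows_by_id = defaultdict(list)
    for offset, row in enumerate(rows[header_rows:]):
        sheet_row = header_rows + offset + 1
        rows_by_id[row[id_column]].append(sheet_row)

    # Every cell whose identifier occurs more than once is invalid.
    invalid = {}
    for sample_id, sheet_rows in rows_by_id.items():
        if len(sheet_rows) < 2:
            continue
        for sheet_row in sheet_rows:
            others = [f"{column_letter}{r}" for r in sheet_rows if r != sheet_row]
            invalid[f"{column_letter}{sheet_row}"] = (
                f"Duplicate sample ID {sample_id!r}; also in {', '.join(others)}"
            )
    return invalid


sheet = [
    ["#SampleID", "Description"],  # row 1: header
    ["S1", "gut"],                 # row 2
    ["S2", "skin"],                # row 3
    ["S3", "gut"],                 # row 4
    ["S2", "tongue"],              # row 5: repeats the ID in row 3
]
for cell, reason in sorted(find_duplicate_ids(sheet).items()):
    print(cell, reason)
```

With this made-up sheet, the repeated identifier sits in rows 3 and 5, so the two cells flagged are A3 and A5, the same pair shown in the screenshot.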
Fig. 2 Keemei screenshots of a user focusing on an invalid cell in a QIIME mapping file. Keemei’s sidebar provides a way to focus on an invalid cell in order to correct it. This feature is especially useful when working with large sheets that would require scrolling to find and correct invalid cells. By clicking on the magnifying glass next to invalid cell O46 in the sidebar (red border added for clarity), cell O46 is made active and the spreadsheet scrolls to the cell’s location so that the user can make any necessary corrections. The data in this screenshot are derived from [14]. QIIME, quantitative insights into microbial ecology.
Fig. 3 Validation performance as dataset size and error rate increase. Keemei’s validation runtime (walltime measured in seconds) is plotted against an increasing number of spreadsheet rows (dataset size) with a fixed number of columns (24). Each dataset size contains simulated QIIME (quantitative insights into microbial ecology) mapping file data with a varying percentage of invalid cells. A description of the simulated data is provided in the main text and the simulated data are available in Additional file 1.
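The general shape of such a measurement (fixed column count, varying row count and fraction of invalid cells, wall time recorded per run) can be sketched as follows. This is an illustrative stand-in only: it does not reproduce the authors' simulated data or Keemei's validation rules, and it runs locally in Python rather than inside Google Sheets. The names `simulate_sheet`, `count_invalid` and `time_validation` are hypothetical, and the single toy rule treats an empty cell as invalid.

```python
import random
import time


def simulate_sheet(n_rows, n_cols=24, error_rate=0.1, seed=0):
    """Build an n_rows x n_cols sheet where roughly error_rate of the
    cells are empty (the toy notion of "invalid" used in this sketch)."""
    rng = random.Random(seed)
    return [
        ["" if rng.random() < error_rate else "ok" for _ in range(n_cols)]
        for _ in range(n_rows)
    ]


def count_invalid(sheet):
    """Toy validator: count empty cells."""
    return sum(cell == "" for row in sheet for cell in row)


def time_validation(n_rows, error_rate):
    """Return (wall time in seconds, number of invalid cells found)."""
    sheet = simulate_sheet(n_rows, error_rate=error_rate)
    start = time.perf_counter()
    invalid = count_invalid(sheet)
    return time.perf_counter() - start, invalid


for n_rows in (100, 1000, 10000):
    for error_rate in (0.0, 0.1, 0.5):
        elapsed, invalid = time_validation(n_rows, error_rate)
        print(f"rows={n_rows} error_rate={error_rate} "
              f"invalid={invalid} seconds={elapsed:.4f}")
```

Only the sheet construction is seeded; the timings themselves vary from run to run and from machine to machine, so they say nothing about Keemei's actual performance.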
Fig. 4 Validation performance as the number of validation rules increases. Keemei’s validation runtime (walltime measured in seconds) is plotted against an increasing number of validation rules applied to simulated data. The dataset size and error rate are held constant. A description of the simulated data is provided in the main text and the simulated data are available in Additional file 1.