
Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR.

Sebastian Beier, Anne Fiebig, Cyril Pommier, Isuru Liyanage, Matthias Lange, Paul J Kersey, Stephan Weise, Richard Finkers, Baron Koylass, Timothee Cezard, Mélanie Courtot, Bruno Contreras-Moreira, Guy Naamati, Sarah Dyer, Uwe Scholz.

Abstract

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous, machine-actionable data flow, generic elements need to be further specified. We strongly support the FAIR principles and see a need to facilitate them through technical implementation specifications; they form the basis for the VCF extensions proposed here. We have learned from existing applications of VCF that defining relevant metadata using controlled standards and vocabularies, and consistently using cross-references via resolvable (machine-readable) identifiers, are particularly necessary, and we propose how to encode them. VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently has the reach of VCF. For the sake of simplicity, we discuss only VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.

Copyright: © 2022 Beier S et al.
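
The abstract's point that the VCF standard fixes the syntax of metadata but not its content can be illustrated with a minimal sketch. It assumes only the general `##key=value` and `##KEY=<ID=...,...>` meta-information line syntax of the VCF specification. The helper names (`parse_meta_line`, `parse_header`, `missing_keys`), the example header and the list of expected keys are hypothetical and are not the fields recommended by the authors.

```python
import re

# One key=value pair inside <...>: the value is either a quoted string
# (which may contain commas) or a run of characters up to the next comma.
_PAIR = re.compile(r'([^=,<>]+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Parse one '##key=value' meta-information line.

    Returns (key, value): value is a str for simple lines and a dict
    for structured lines such as ##INFO=<ID=...,Description="...">.
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    key, sep, value = line[2:].rstrip("\n").partition("=")
    if not sep:
        raise ValueError("meta-information line lacks '='")
    if value.startswith("<") and value.endswith(">"):
        fields = {}
        for name, raw in _PAIR.findall(value[1:-1]):
            if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
                raw = raw[1:-1]  # drop the surrounding quotes
            fields[name.strip()] = raw
        return key, fields
    return key, value


def parse_header(lines):
    """Collect meta-information lines into {key: [values, ...]}.

    Stops at the first line that does not start with '##'.
    """
    header = {}
    for line in lines:
        if not line.startswith("##"):
            break
        key, value = parse_meta_line(line)
        header.setdefault(key, []).append(value)
    return header


def missing_keys(header, expected):
    """Return the expected keys that are absent from the parsed header."""
    return [key for key in expected if key not in header]


# Hypothetical header: syntactically valid, but sparse in descriptive metadata.
EXAMPLE = [
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth, all samples">',
    '##SAMPLE=<ID=S1,Description="Hypothetical sample">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

header = parse_header(EXAMPLE)
# Illustrative check only; the expected keys here are an assumption.
print(missing_keys(header, ["fileformat", "reference", "SAMPLE"]))
```

A header can pass a purely syntactic parse like this one and still say nothing about the reference assembly or the biological material, which is the gap the proposed recommendations address.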

Keywords:  ELIXIR; FAIR; data management; genotyping; phenotyping; plant; snp; vcf

Year: 2022 | PMID: 35811804 | PMCID: PMC9218589 | DOI: 10.12688/f1000research.109080.2

Source DB: PubMed | Journal: F1000Res | ISSN: 2046-1402



