
Genomic data of two Greek Vitis varieties.

George Tsiolas, Sofia Michailidou, Antiopi Tsoureki, Anagnostis Argiriou.

Abstract

The genetic material of Vitis varieties is crucial for the wine sector. In addition, genomic technologies applied to Vitis germplasm characterization are important for the conservation of indigenous genetic reservoirs. Until recently, the most common way to identify Vitis varieties genetically was the use of Simple Sequence Repeats (SSRs) along with SNP chips. Yet, with the progress in Next Generation Sequencing (NGS) technologies and the reduced sequencing cost per base, genetic identification methods for plant species have shifted. Among them, low-coverage Whole-Genome Sequencing (lcWGS) with downstream bioinformatic analysis for variant discovery and phylogenetic characterization is gaining scientific attention. In this dataset, shotgun sequencing data of two Greek Vitis varieties, 'Razaki' and 'Vlachiko', are presented. The cultivars were collected from the ampelographic collection of the Aristotle University of Thessaloniki (AUTH) and have previously been characterized phenotypically and genetically. WGS libraries were sequenced on an Illumina® NovaSeq 6000 platform with the Illumina® NovaSeq 6000 S2 Reagent Kit (300 cycles). Raw sequence data used for analysis are available in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA805368. Reads were aligned to the Vitis vinifera reference genome available from the EnsemblPlants database, and formal analysis was conducted with the Genome Analysis Toolkit 4 (GATK4) pipeline. The data can be used to enrich our knowledge of the genetic background of Vitis cultivars and can also serve as a starting point for the scientific community towards the construction of a genomic database of Vitis cultivars.
© 2022 The Authors. Published by Elsevier Inc.

Keywords:  SNPs; Variant analysis; Vitis cultivars; Whole-genome sequencing

Year:  2022        PMID: 35572792      PMCID: PMC9092844          DOI: 10.1016/j.dib.2022.108216

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

The data add new knowledge on Vitis genetic variation at the variety level. The data provide information on the genomic background of two Greek Vitis varieties that can be used for future identification of unknown grapevine varieties. Viticulturists will benefit from results related to the functional characteristics of each variety through genomic selection. The data produced contribute to the preservation of these Vitis varieties and to their adoption in plant breeding schemes.

Data Description

Genomic sequencing data were generated on an Illumina® NovaSeq 6000 platform using two paired-end libraries with an insert size of approximately 300 bp. In total, 25.62 Gbases were generated, with 98% of bases above Q30: 12.91 Gbases for the ‘Razaki’ variety and 12.71 Gbases for the ‘Vlachiko’ variety (Table 1). Total coverage of each variety's genome was greater than 25x, which is sufficient to identify single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) across their genomes [1]. Raw sequencing data are available under the BioProject accession PRJNA805368 at NCBI's Sequence Read Archive.
Table 1

Generated genomic data of Vitis varieties ‘Razaki’ and ‘Vlachiko’.

| BioSample | SRA Accession | Variety | Raw Reads | Bases | Depth of coverage (x) |
| --- | --- | --- | --- | --- | --- |
| SAMN25855225 | SRR17982063 | Razaki | 47,292,258 | 12,917,903,048 | 26.57 |
| SAMN25855226 | SRR17982062 | Vlachiko | 47,404,592 | 12,714,283,858 | 26.15 |
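
The depth values in Table 1 can be reproduced with a short calculation. This is a minimal sketch, assuming that depth of coverage was computed as the total number of sequenced bases divided by the reference genome length reported in Table 2 (486,265,422 bp); the function and variable names are illustrative and not taken from the article.

```python
# Reference genome length in bp, as reported in Table 2 of this article.
REFERENCE_LENGTH_BP = 486_265_422

# Total sequenced bases per variety, taken from Table 1.
SEQUENCED_BASES = {
    "Razaki": 12_917_903_048,
    "Vlachiko": 12_714_283_858,
}


def depth_of_coverage(total_bases: int, genome_length: int) -> float:
    """Mean depth, assumed here to be sequenced bases / genome length."""
    return total_bases / genome_length


for variety, bases in SEQUENCED_BASES.items():
    depth = depth_of_coverage(bases, REFERENCE_LENGTH_BP)
    print(f"{variety}: {depth:.2f}x")
```

Under this assumption the printed values agree with the depth column of Table 1 to two decimal places.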
The majority of the discovered variants were SNPs, with approximately 5.8 × 10^6 events in each variety, followed by insertions/deletions, with over 900 × 10^3 events per variety (Table 2). These numbers include SNPs and InDels found in unique sequences of the reference genome as well as in the repetitive genome fractions. The SNPs were primarily found in intergenic regions, in contrast to the indels, which were mainly found in intragenic regions, causing frameshifts. Frameshift variants due to indels number 6,847 in ‘Razaki’ and 7,024 in ‘Vlachiko’. Missense variants due to SNPs number 119,532 in ‘Razaki’ and 126,314 in ‘Vlachiko’. SNPs and indels responsible for the gain of stop codons number 2,882 in ‘Razaki’ and 2,910 in ‘Vlachiko’. The number and type of variants for each affected Sequence Ontology (SO) term are presented in detail in Table 3 and Fig. 1.
Table 2

Type and number of variants per variety.

| Summary Variant Statistics | Razaki InDels | Razaki SNPs | Vlachiko InDels | Vlachiko SNPs |
| --- | --- | --- | --- | --- |
| Total number of loci | 889,874 | 5,687,476 | 927,939 | 5,931,063 |
| Number of variants (before filtering) | 926,009 | 5,735,996 | 968,309 | 5,984,608 |
| Number of variants processed (after filtering) | 915,881 | 5,704,855 | 957,137 | 5,950,310 |
| Number of multi-allelic variants (more than two alleles) | 36,135 | 48,520 | 40,370 | 53,545 |
| Number of effects | 1,700,732 | 9,494,769 | 1,769,227 | 9,906,520 |
| Reference genome total length | 486,265,422 | 486,265,422 | 486,265,422 | 486,265,422 |
| Reference genome effective length | 486,265,422 | 486,265,422 | 486,265,422 | 486,265,422 |
| Variant rate | 1 every 530 bases | 1 every 85 bases | 1 every 508 bases | 1 every 81 bases |
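
The derived rows of Table 2 can be checked against its count rows. The sketch below assumes that the variant rate is the reference genome length divided by the number of variants after filtering, rounded down, and it also checks an arithmetic relation that holds in these numbers (multi-allelic variants equal variants before filtering minus loci). Both relations are inferred from the table values rather than stated in the article, and all names are illustrative.

```python
REFERENCE_LENGTH_BP = 486_265_422  # reference genome total length, Table 2

# (variety, type) -> (total loci, variants before filtering, variants after filtering)
TABLE_2 = {
    ("Razaki", "InDels"): (889_874, 926_009, 915_881),
    ("Razaki", "SNPs"): (5_687_476, 5_735_996, 5_704_855),
    ("Vlachiko", "InDels"): (927_939, 968_309, 957_137),
    ("Vlachiko", "SNPs"): (5_931_063, 5_984_608, 5_950_310),
}


def variant_rate(genome_length: int, n_variants: int) -> int:
    """Average number of bases per variant, rounded down (assumed definition)."""
    return genome_length // n_variants


def multi_allelic(loci: int, before_filtering: int) -> int:
    """Difference between variant count and locus count (relation observed in Table 2)."""
    return before_filtering - loci


for (variety, kind), (loci, before, after) in TABLE_2.items():
    rate = variant_rate(REFERENCE_LENGTH_BP, after)
    extra = multi_allelic(loci, before)
    print(f"{variety} {kind}: 1 variant every {rate} bases, {extra} multi-allelic")
```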
Table 3

The number of variants and the affected Sequence Ontologies (SO).

| Sequence Ontologies (SO) affected | Razaki InDels | Razaki SNPs | Vlachiko InDels | Vlachiko SNPs |
| --- | --- | --- | --- | --- |
| 3_prime_UTR_truncation | 3 | 44,682 | 2 | 0 |
| 3_prime_UTR_variant | 10,404 | 0 | 11,375 | 48,289 |
| 5_prime_UTR_premature_start_codon_gain_variant | 0 | 3,776 | 0 | 3,889 |
| 5_prime_UTR_truncation | 4 | 0 | 2 | 0 |
| 5_prime_UTR_variant | 4,571 | 23,523 | 4,792 | 24,673 |
| bidirectional_gene_fusion | 1 | 0 | 0 | 0 |
| conservative_inframe_deletion | 786 | 0 | 794 | 0 |
| conservative_inframe_insertion | 807 | 0 | 907 | 0 |
| disruptive_inframe_deletion | 1,335 | 0 | 1,391 | 0 |
| disruptive_inframe_insertion | 952 | 0 | 1,012 | 0 |
| downstream_gene_variant | 371,034 | 1,811,783 | 386,922 | 1,906,153 |
| exon_loss_variant | 13 | 0 | 13 | 0 |
| frameshift_variant | 6,847 | 0 | 7,024 | 0 |
| gene_fusion | 1 | 0 | 3 | 0 |
| initiator_codon_variant | 0 | 51 | 0 | 50 |
| intergenic_region | 686,851 | 4,261,933 | 709,662 | 4,401,118 |
| intragenic_variant | 5 | 0 | 0 | 0 |
| intron_variant | 203,918 | 1,154,763 | 1,241,972 | 1,241,972 |
| missense_variant | 119,532 | 0 | 126,314 | 0 |
| non_coding_transcript_exon_variant | 968 | 122 | 1,216 | 133 |
| non_coding_transcript_variant | 0 | 307 | 0 | 331 |
| splice_acceptor_variant | 545 | 356 | 554 | 340 |
| splice_donor_variant | 522 | 428 | 570 | 422 |
| splice_region_variant | 14,691 | 3,220 | 16,125 | 3,346 |
| start_lost | 405 | 181 | 412 | 170 |
| start_retained_variant | 0 | 15 | 0 | 16 |
| stop_gained | 2,591 | 291 | 2,628 | 282 |
| stop_lost | 517 | 133 | 524 | 129 |
| stop_retained_variant | 231 | 25 | 253 | 35 |
| synonymous_variant | 95,902 | 0 | 103,131 | 0 |
| upstream_gene_variant | 1,973,306 | 412,606 | 2,045,032 | 423,890 |
Fig. 1

Heatmap of variants for ‘Razaki’ and ‘Vlachiko’ varieties. Rows depict the affected Sequence Ontologies and columns the SNPs and InDels for each variety. Color scale refers to log10[(variants)+1].
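
The colour-scale transform named in the caption can be sketched as follows. This is an illustration only: the counts are taken from Table 3, the function name is hypothetical, and the plotting itself is not shown. Adding 1 before taking the logarithm keeps cells with zero variants defined, since they map to 0.

```python
import math


def heatmap_value(n_variants: int) -> float:
    """Colour-scale value as described in the Fig. 1 caption: log10(variants + 1)."""
    return math.log10(n_variants + 1)


# A few counts from Table 3, spanning zero to millions of variants.
examples = {
    "bidirectional_gene_fusion, Vlachiko InDels": 0,
    "frameshift_variant, Razaki InDels": 6_847,
    "intergenic_region, Razaki SNPs": 4_261_933,
}

for label, count in examples.items():
    print(f"{label}: {count} -> {heatmap_value(count):.2f}")
```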


Experimental Design, Materials and Methods

Sampling and library construction

Leaf tissues were obtained from two grapevine varieties, ‘Razaki’ and ‘Vlachiko’, which are part of the Ampelographic Collection of the Aristotle University of Thessaloniki. Leaves were ground to a fine powder in liquid nitrogen, and DNA was subsequently extracted using the NucleoSpin Plant II kit (MACHEREY-NAGEL, Düren, Germany) according to the manufacturer's instructions. The quality of the extracted DNA was assessed on a 0.8% agarose gel stained with 0.5 µg/ml ethidium bromide. DNA concentration was estimated fluorometrically on a Qubit 4.0 Fluorometer using the Qubit® dsDNA BR assay kit (Invitrogen, Carlsbad, CA, USA). Libraries were prepared with the Nextera DNA Flex library preparation kit following the manufacturer's instructions for an average insert size of 300 bp. Initially, libraries were quantified with the Qubit dsDNA BR kit, and their average size was estimated by capillary fragment electrophoresis on a 5400 Fragment Analyzer system (Agilent Technologies, Santa Clara, CA, USA) using the DNF-477-0500 kit. Finally, libraries were quantified by qPCR using the KAPA Library Quantification kit for Illumina® sequencing platforms (Kapa Biosystems; Roche Diagnostics Corporation, Indianapolis, IN, USA) on a Rotor-Gene Q thermocycler (Qiagen, Hilden, Germany) and normalized in relation to their size. Libraries were sequenced on an Illumina® NovaSeq 6000 platform using the NovaSeq 6000 S2 Reagent Kit (300 cycles).

Bioinformatics and data analysis

The quality of the reads was evaluated with FastQC [2]. Raw reads were aligned to the reference genome of Vitis vinifera (12x) from EnsemblPlants (http://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/vitis_vinifera/dna/) with Minimap2 [3] using the command line options -x sr -a -R '@R\tID:\tLB:\tPL:ILLUMINA\tPM:NOVASEQ\tSM:', without removing duplicate reads at this step. For variant discovery, the Genome Analysis Toolkit 4 (GATK4) [4] pipeline was used. In detail, duplicate reads were marked with MarkDuplicatesSpark, and base quality scores were recalibrated with BaseRecalibrator using the filtered variants. The first round of variant discovery was performed with HaplotypeCaller. Identified variants were filtered with VariantFiltration in order to filter out variants with values of QD<2.0, FS>60.0, MQ<40.0, SOR>4.0, MQRankSum<-12.5 and ReadPosRankSum<-8.0. Final variants were obtained after the filtration of technical variants with the BaseRecalibrator and ApplyBQSR tools. The annotation of the final variants was performed with SnpEff [5].
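
The hard-filtering thresholds listed above can be expressed as a small predicate, together with a helper that assembles a matching `gatk VariantFiltration` argument list. This is a sketch under stated assumptions, not the authors' script: it assumes a variant is flagged when any one threshold is crossed, that a missing annotation does not trigger its filter, and that each threshold is passed as its own `--filter-expression`/`--filter-name` pair. The filter names and file paths are hypothetical placeholders.

```python
import operator

# (annotation, comparison, threshold): the variant is flagged when the comparison holds.
HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("SOR", operator.gt, 4.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]

SYMBOLS = {operator.lt: "<", operator.gt: ">"}


def failed_filters(info: dict) -> list:
    """Return the annotations whose threshold is crossed for one variant."""
    failed = []
    for name, compare, threshold in HARD_FILTERS:
        value = info.get(name)
        if value is None:
            continue  # assumption: a missing annotation does not fail the filter
        if compare(value, threshold):
            failed.append(name)
    return failed


def variant_filtration_argv(reference: str, vcf_in: str, vcf_out: str) -> list:
    """Build an argument list for `gatk VariantFiltration` (filter names are hypothetical)."""
    argv = ["gatk", "VariantFiltration", "-R", reference, "-V", vcf_in, "-O", vcf_out]
    for name, compare, threshold in HARD_FILTERS:
        argv += [
            "--filter-expression", f"{name} {SYMBOLS[compare]} {threshold}",
            "--filter-name", f"{name}_filter",
        ]
    return argv


# Example: low QD only, so a single filter is triggered.
print(failed_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0, "SOR": 1.0}))
# Hypothetical file names, shown only to illustrate the command shape.
print(" ".join(variant_filtration_argv("reference.fa", "raw.vcf.gz", "filtered.vcf.gz")))
```

Note that VariantFiltration annotates the FILTER field of failing records rather than deleting them, so removing the flagged variants from the file requires a separate selection step.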

Ethics Statements

The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.

CRediT Author Statement

George Tsiolas: Methodology, Analysis, Writing and Editing. Sofia Michailidou: Methodology, Writing, Review and Editing. Antiopi Tsoureki: Analysis. Anagnostis Argiriou: Conceptualization, Review, Funding Acquisition, Project Administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Subject: Biological sciences: Omics: Genomics
Specific subject area: Low coverage whole genome sequencing of two Greek Vitis cultivars for cultivar identification and variant discovery
Type of data: Tables and Figures
How the data were acquired: WGS libraries were constructed using Illumina's Nextera DNA Flex library preparation kit. Sequencing was performed on an Illumina NovaSeq 6000 platform using the Illumina NovaSeq 6000 S2 Reagent Kit (300 cycles). The variant discovery was conducted using the Genome Analysis Toolkit 4 pipeline.
Data format: Raw and Analyzed
Description of data collection: Leaves from two grapevine varieties, ‘Razaki’ (white grape variety) and ‘Vlachiko’ (red grape variety), were obtained from the Ampelographic Collection of the Aristotle University of Thessaloniki.
Data source location: Institution: Institute of Applied Biosciences – Centre for Research and Technology Hellas; City: Thessaloniki; Country: Greece; Latitude and longitude for analyzed data: 40.56806, 22.99713
Data accessibility: Repository name: NCBI SRA; Data identification number: PRJNA805368; Direct URL to data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA805368, https://www.ncbi.nlm.nih.gov/sra/?term=SRR17982062, https://www.ncbi.nlm.nih.gov/sra/?term=SRR17982063

References

[1] Kai Song, Li Li, Guofan Zhang. Coverage recommendation for genotyping analysis of highly heterologous species using next-generation sequencing technology. Sci Rep, 2016-10-20.

[3] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 2018-09-15.

[4] Geraldine A Van der Auwera, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault, Eric Banks, Kiran V Garimella, David Altshuler, Stacey Gabriel, Mark A DePristo. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics, 2013.

[5] Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J Land, Xiangyi Lu, Douglas M Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 2012 Apr-Jun.
