
Genotyping-by-sequencing data of 272 crested wheatgrass (Agropyron cristatum) genotypes.

Pingchuan Li, Bill Biligetu, Bruce E Coulman, Michael Schellenberg, Yong-Bi Fu.

Abstract

Crested wheatgrass [Agropyron cristatum L. (Gaertn.)] is an important cool-season forage grass widely used for early spring grazing. However, the genomic resources for this non-model plant are still lacking. Our goal was to generate the first set of next generation sequencing data using the genotyping-by-sequencing technique. A total of 272 crested wheatgrass plants representing seven breeding lines, five cultivars and five geographically diverse accessions were sequenced with an Illumina MiSeq instrument. These sequence datasets were processed using different bioinformatics tools to generate contigs for diploid and tetraploid plants and SNPs for diploid plants. Together, these genomic resources form a fundamental basis for genomic studies of crested wheatgrass and other wheatgrass species. The raw reads were deposited into Sequence Read Archive (SRA) database under NCBI accession SRP115373 (https://www.ncbi.nlm.nih.gov/sra?term=SRP115373) and the supplementary datasets are accessible in Figshare (10.6084/m9.figshare.5345092).

Keywords:  Crested wheatgrass; Diploid; Genotyping-by-sequencing; Raw sequence data; Tetraploid

Year:  2017        PMID: 29214201      PMCID: PMC5712052          DOI: 10.1016/j.dib.2017.09.030

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

These data are the first large set of next generation sequencing datasets obtained from genomic DNA of 272 diploid and tetraploid crested wheatgrass plants. The plants represent seven breeding lines, five cultivars and five geographically diverse accessions.

These datasets can be utilized to enhance genetic and genomic studies, genetic diversity assessments, and marker-assisted breeding of crested wheatgrass.

The SNP datasets can be directly explored for the development of useful molecular markers to investigate genetic variability of crested wheatgrass and to facilitate the molecular breeding of this plant.

Data

Our sequencing efforts in crested wheatgrass generated two new sets of genomic data. The first set consists of 608 FASTQ files generated for 272 diploid and tetraploid crested wheatgrass plants representing seven breeding lines, five cultivars and five geographically diverse accessions (see Table 1). These sequences were obtained from genomic DNA with the genotyping-by-sequencing (GBS) technique, using an Illumina MiSeq instrument in seven runs with 250-bp paired-end reads. Plants from two accessions were sequenced twice as a control for quality assessment. Raw reads for all 17 accessions were deposited into NCBI's SRA database under accession number SRP115373 (https://www.ncbi.nlm.nih.gov/sra?term=SRP115373). The second set contains several metadata files generated from bioinformatics analysis of the sequence reads, including contigs for diploid and tetraploid plants and SNPs for diploid plants. These supplementary datasets have been deposited in Figshare (10.6084/m9.figshare.5345092).

Table 1

List of 17 crested wheatgrass accessions used in the study.

| Accession | CN number (a) | Alternative identification (a) | Origin | Ploidy |
|---|---|---|---|---|
| AC Parkland | | FOR483 | Canada | 2x |
| Fairway | CN32968 | FOR533 | Canada | 2x |
| PGR 16452 | CN43215 | | Kazakhstan | 2x |
| PGR 16454 | CN43217 | | Iran | 2x |
| S9542 | | S9542 | Canada | 2x |
| AC Goliath | CN108673 | | Canada | 4x |
| AC Newkirk | | FOR552 | Canada | 4x |
| Karabalykskij 202 | CN31068 | | Kazakhstan | 4x |
| Kirk | CN108662 | | Canada | 4x |
| PGR 16830 | CN43478 | | Kazakhstan | 4x |
| S8959E | | FOR917 | Canada | 4x |
| S9491 | | S9491 | Canada | 4x |
| S9514 | | S9514 | Canada | 4x |
| S9516 | | S9516 | Canada | 4x |
| S9544 | | S9544 | Canada | 4x |
| S9556 | | S9556 | Canada | 4x |
| Vysokij 9 | CN30995 | | Siberia | 4x |

(a) CN number is the accession identification in Plant Gene Resources of Canada, Agriculture and Agri-Food Canada (AAFC), while the alternative accession labels including FOR or S are from the joint forage breeding program of the University of Saskatchewan and AAFC.


Experimental design, materials and methods

Plant materials and DNA extraction

This study selected 17 crested wheatgrass accessions consisting of seven breeding lines, five cultivars and five geographically diverse accessions (Table 1). Five accessions are diploid and 12 accessions are tetraploid. These accessions were acquired from the USDA plant germplasm system, Plant Gene Resources of Canada, and the joint forage breeding program of the University of Saskatchewan and Agriculture and Agri-Food Canada. Seeds were randomly selected from each accession and grown for six weeks in a greenhouse at the Saskatoon Research and Development Centre, Agriculture and Agri-Food Canada, under a 12 h photoperiod at 25 °C during the daytime. Young leaf tissues were collected from 16 randomly selected plants of each of the 17 accessions and stored at −80 °C prior to DNA extraction. For each of the 272 samples, DNA was extracted from 0.1 g of ground tissue following the protocol of the NucleoSpin® Plant II Kit (Macherey-Nagel, Bethlehem, PA, USA) and eluted in a 1.5-ml Eppendorf tube with Elution Buffer. DNA quality was assessed with a NanoDrop 8000 (Thermo Scientific) by comparing the absorbance at 260 and 280 nm. DNA samples were further quantified with the Quant-iT™ PicoGreen® dsDNA assay kit (Invitrogen) and subsequently diluted to 60 ng/μl with 1× TE buffer prior to sequencing analysis.
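The dilution step follows the usual C1·V1 = C2·V2 relationship. The sketch below is only an illustration: the function name and the stock concentration and volume are hypothetical, since the article reports only the 60 ng/μl target.

```python
def te_buffer_to_add_ul(stock_ng_per_ul: float, stock_volume_ul: float,
                        target_ng_per_ul: float = 60.0) -> float:
    """Volume of 1x TE buffer (in ul) needed to bring a DNA sample
    down to the target concentration, using C1 * V1 = C2 * V2."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already below the target concentration")
    final_volume_ul = stock_ng_per_ul * stock_volume_ul / target_ng_per_ul
    return final_volume_ul - stock_volume_ul


# Hypothetical sample: 20 ul of eluate quantified at 150 ng/ul.
print(te_buffer_to_add_ul(150.0, 20.0))  # 30.0 ul of TE buffer
```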

GBS library preparation and sequencing

The complexity-reduced and multiplexed GBS libraries were prepared following the published gd-GBS protocol [1]. In brief, each library preparation started with 200 ng of purified genomic DNA, which was digested with the restriction enzymes PstI and MspI. Ligations between specifically customized 5′/3′ adapters and inserts by T4 ligase were carried out following the standard product protocol. Ligated fragments were purified with the AMPure XP kit and subsequently amplified and indexed with Illumina TruSeq HT multiplexing primers. Six library pools were made and each consisted of 48 indexed samples (3 accessions × 16 individual plants). One extra library pool was included using 32 samples from two randomly selected accessions as a control. Prior to pooling of samples into a library, amplicon fragments from the TruSeq HT kits were pre-selected with a Pippin instrument for an insert size ranging between 250 and 450 bp, but the actual fragment size varied between 400 and 600 bp. Each pooled library was diluted to 6 pM and denatured with 5% of sequencing-ready Illumina PhiX Library Control, which serves as a calibration for sequencing confidence. Sequencing was performed at the Saskatoon Research and Development Centre using an Illumina MiSeq instrument with 250-bp paired-end reads. Seven MiSeq runs generated 608 FASTQ sequence files for 272 plants from 17 accessions. Note that 32 plants from two accessions were sequenced twice as a control for quality assessment.
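The reported number of FASTQ files can be checked with simple arithmetic. The sketch below assumes that every sequenced library yields one forward (R1) and one reverse (R2) file, which is the usual paired-end layout but is not stated explicitly in the article. The function name is illustrative.

```python
ACCESSIONS = 17
PLANTS_PER_ACCESSION = 16
RESEQUENCED_ACCESSIONS = 2   # accessions sequenced a second time as a control
FILES_PER_LIBRARY = 2        # assumption: one R1 and one R2 FASTQ file per library


def expected_fastq_files() -> int:
    """Number of FASTQ files expected from the design described above."""
    plants = ACCESSIONS * PLANTS_PER_ACCESSION                      # 272 plants
    control_libraries = RESEQUENCED_ACCESSIONS * PLANTS_PER_ACCESSION  # 32 repeats
    return (plants + control_libraries) * FILES_PER_LIBRARY


print(expected_fastq_files())  # 608
```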

Contig assembly and sequence similarity analysis

Contig sequences were assembled using the protein-associated SNP prediction and genotyping pipeline paSNPg [2], which was specifically developed for non-model species and requires two inputs: the raw MiSeq sequence reads and the relevant plant Ensembl PEP package [3]. A Pep_database.tar.bz2 tarball was prepared by following the paSNPg protocol to merge the PEP data of 44 plant species and was supplied together with the paired-end sequence reads (FASTQ files) as a combined input for paSNPg. The default settings were adopted for k-mer size (100 bp) and minimal percentage of sample size (MPSS, 80%). Contigs were generated mainly through the Minia routine [4] implemented in the paSNPg pipeline. The analysis generated 6674 contigs for diploid crested wheatgrass plants with the default parameter settings of 75% identical match and 99% alignment length. Among those contigs, 768 (11.5%) were associated with exons of coding genes. A total of 7792 contigs were assembled for tetraploid crested wheatgrass plants, of which 809 (10.3%) were associated with exons of coding genes. Contigs were also assembled separately for each diploid or tetraploid accession, using the same parameter settings as for the combined analysis of diploid or tetraploid plants. The outcomes are summarized in Table 2. A sequence similarity analysis was also performed between the diploid-based and tetraploid-based contigs using a Blastn search with an E-value cut-off of 1e-100 [5]. The parsed results revealed that 4477 (67%) diploid-based contig sequences matched 4461 (57%) tetraploid-based contigs.

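The parsing step of the similarity analysis can be sketched as follows. This is not the authors' script. It assumes the Blastn hits were written in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11) and that a hit passes when its E-value is at or below the cut-off. The contig names in the example are made up.

```python
from typing import Iterable, Set, Tuple


def matched_contigs(blast_lines: Iterable[str],
                    max_evalue: float = 1e-100) -> Tuple[Set[str], Set[str]]:
    """Return the sets of query and subject contig IDs that take part in
    at least one hit with an E-value at or below the cut-off."""
    queries: Set[str] = set()
    subjects: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            queries.add(fields[0])
            subjects.add(fields[1])
    return queries, subjects


# Made-up hits: diploid contigs as queries, tetraploid contigs as subjects.
example_hits = [
    "dip_1\ttet_7\t99.1\t240\t2\t0\t1\t240\t1\t240\t1e-120\t430",
    "dip_1\ttet_9\t98.0\t240\t4\t0\t1\t240\t1\t240\t2e-105\t400",
    "dip_2\ttet_7\t85.0\t120\t18\t0\t1\t120\t1\t120\t3e-30\t150",
    "dip_3\ttet_4\t100.0\t241\t0\t0\t1\t241\t1\t241\t0.0\t446",
]
diploid_ids, tetraploid_ids = matched_contigs(example_hits)
print(len(diploid_ids), len(tetraploid_ids))  # 2 3
```

Because one contig can hit several partners, the numbers of matched query and subject contigs need not be equal, as with the 4477 and 4461 reported above.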
Table 2

The MiSeq sequence data profile and empirical genomic coverage (EgC) for 17 crested wheatgrass accessions.

| Accession (a) | Raw reads | Trimmed reads | With 5′ Hinfl residues (b) | Contig × length (bp) (c) | Total contig length (bp) | EgC (%) (d) | Biosample accession |
|---|---|---|---|---|---|---|---|
| AC Parkland | 6,823,899 | 5,968,413 | 5,948,980 | 12,749 × 239 | 3,057,232 | 0.044 | SAMN07502767 – SAMN07502782 |
| Fairway | 8,838,311 | 7,023,400 | 5,312,458 | 18,204 × 240 | 4,376,232 | 0.063 | SAMN07502815 – SAMN07502846 |
| PGR16452 | 8,415,793 | 6,919,568 | 5,252,776 | 34,236 × 241 | 8,283,160 | 0.120 | SAMN07502847 – SAMN07502878 |
| PGR16454 | 7,750,962 | 6,400,207 | 6,070,265 | 27,981 × 241 | 6,750,278 | 0.098 | SAMN07502879 – SAMN07502894 |
| S9542 | 7,905,221 | 6,248,364 | 6,048,368 | 18,528 × 235 | 4,371,062 | 0.063 | SAMN07502991 – SAMN07503006 |
| AC Goliath | 7,084,059 | 5,981,098 | 5,961,130 | 12,227 × 241 | 2,947,930 | 0.022 | SAMN07502735 – SAMN07502750 |
| AC Newkirk | 6,832,714 | 5,621,605 | 5,605,880 | 9471 × 239 | 2,265,760 | 0.017 | SAMN07502751 – SAMN07502766 |
| Karabalykskij 202 | 6,826,807 | 5,412,874 | 5,134,977 | 8754 × 238 | 2,089,266 | 0.015 | SAMN07502799 – SAMN07502814 |
| Kirk | 6,103,771 | 5,172,052 | 5,153,987 | 8580 × 241 | 2,069,305 | 0.015 | SAMN07502911 – SAMN07502926 |
| PGR16830 | 7,208,410 | 5,978,683 | 5,668,841 | 10,722 × 238 | 2,560,384 | 0.019 | SAMN07502895 – SAMN07502910 |
| S8959E | 6,312,324 | 5,373,354 | 5,353,217 | 9546 × 241 | 2,300,750 | 0.017 | SAMN07502927 – SAMN07502942 |
| S9491 | 8,412,288 | 6,951,698 | 6,910,099 | 14,807 × 240 | 3,563,450 | 0.026 | SAMN07502943 – SAMN07502958 |
| S9514 | 8,194,719 | 6,883,223 | 6,836,397 | 9328 × 240 | 2,238,810 | 0.017 | SAMN07502959 – SAMN07502974 |
| S9516 | 8,010,454 | 6,687,373 | 6,635,121 | 11,730 × 240 | 2,815,457 | 0.021 | SAMN07502975 – SAMN07502990 |
| S9544 | 8,496,678 | 7,132,507 | 6,896,082 | 16,800 × 236 | 3,968,754 | 0.029 | SAMN07503007 – SAMN07503022 |
| S9556 | 7,440,684 | 6,235,134 | 6,026,824 | 17,085 × 237 | 4,050,705 | 0.030 | SAMN07503023 – SAMN07503038 |
| Vysokij 9 | 6,874,703 | 5,872,351 | 5,855,738 | 12,528 × 239 | 2,999,840 | 0.022 | SAMN07502783 – SAMN07502798 |

(a) Only the forward (R1) sequence reads from all individual plants of each accession were used for contig assembly.

(b) The number of sticky ends of PstI 'TGCA' found in 5′ end.

(c) The length representing the average contig length in base pairs.

(d) Based on the average genome size of crested wheatgrass estimated through flow cytometry with those of Triticum durum and Triticum aestivum.
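The count of reads carrying the PstI 'TGCA' residue at the 5′ end can be illustrated with a short sketch. It assumes plain, uncompressed FASTQ with exactly four lines per record. The function name and the example reads are hypothetical, and this is not the routine used in the paSNPg pipeline.

```python
from typing import Iterable, Tuple


def count_reads_with_residue(fastq_lines: Iterable[str],
                             residue: str = "TGCA") -> Tuple[int, int]:
    """Return (reads starting with the residue, total reads) for a
    four-line-per-record FASTQ stream."""
    total = 0
    with_residue = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if line.strip().upper().startswith(residue):
                with_residue += 1
    return with_residue, total


# Three made-up reads; two begin with the PstI residue.
example_fastq = """@read1
TGCAGGCTA
+
IIIIIIIII
@read2
ACGTTGCAA
+
IIIIIIIII
@read3
tgcaTTTGA
+
IIIIIIIII
""".splitlines()
print(count_reads_with_residue(example_fastq))  # (2, 3)
```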

Additional analysis was made to assess the gd-GBS empirical genome coverage (EgC) for each accession. The genome sizes of crested wheatgrass plants were estimated through flow cytometry based on the genome sizes of Triticum durum and Triticum aestivum. Average genome sizes of 6898 Mbp and 13,527 Mbp were obtained for diploid and tetraploid plants, respectively. The EgC values ranged from 0.044% to 0.120% with a mean of 0.078% for diploid plants, and from 0.015% to 0.030% with a mean of 0.021% for tetraploid plants (Table 2).
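The article does not spell out the EgC formula. The values in Table 2 are consistent with total contig length divided by the estimated genome size, expressed as a percentage, and the sketch below rests on that assumption. The constant and function names are illustrative.

```python
DIPLOID_GENOME_BP = 6_898_000_000      # 6898 Mbp, average diploid genome size
TETRAPLOID_GENOME_BP = 13_527_000_000  # 13,527 Mbp, average tetraploid genome size


def empirical_genome_coverage(total_contig_bp: int, genome_bp: int) -> float:
    """EgC in percent, assumed here to be total contig length / genome size * 100."""
    return 100.0 * total_contig_bp / genome_bp


# Total contig lengths taken from Table 2.
print(f"{empirical_genome_coverage(3_057_232, DIPLOID_GENOME_BP):.3f}")     # AC Parkland: 0.044
print(f"{empirical_genome_coverage(8_283_160, DIPLOID_GENOME_BP):.3f}")     # PGR16452: 0.120
print(f"{empirical_genome_coverage(2_069_305, TETRAPLOID_GENOME_BP):.3f}")  # Kirk: 0.015
```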

Protein-associated SNP identification

SNP calling was performed only for diploid plants using the paSNPg pipeline, as it was developed specifically for diploid non-model species. Currently, there is no effective pipeline available for SNP calls from genomic sequences of tetraploid plants. A total of 11,854 nuclear SNPs were discovered in diploid plants, of which 1738 (14.7%) were associated with exons of coding genes. However, the numbers of total nuclear SNPs and exon-associated SNPs without missing values across diploid plants were smaller, at 1158 and 308, respectively.

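Selecting SNPs without missing values across plants amounts to a simple completeness filter. The sketch below uses a hypothetical in-memory layout (SNP ID mapped to one genotype call per plant, with None for a missing call). It does not reflect the actual paSNPg output format.

```python
from typing import Dict, List, Optional


def snps_without_missing(genotypes: Dict[str, List[Optional[str]]]) -> List[str]:
    """Return the IDs of SNPs that have a genotype call for every plant."""
    return [snp_id for snp_id, calls in genotypes.items()
            if all(call is not None for call in calls)]


# Made-up calls for three SNPs across four plants.
example_genotypes = {
    "snp_001": ["A/A", "A/G", "G/G", "A/A"],
    "snp_002": ["C/C", None, "C/T", "C/C"],   # one plant has no call
    "snp_003": ["T/T", "T/T", "T/T", "T/T"],
}
print(snps_without_missing(example_genotypes))  # ['snp_001', 'snp_003']
```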
Specifications Table

Subject area: Agricultural and biological sciences
More specific subject area: Plant genomics
Organism: Crested wheatgrass [Agropyron cristatum L. (Gaertn.)]
Type of data: Genotyping-by-sequencing data
How data was acquired: Illumina MiSeq on GBS libraries
Data format: Raw reads in FASTQ format and meta data in plain text
Experimental factors: Crested wheatgrass plants grown in greenhouse
Experimental features: gd-GBS protocol
Data source location: Agriculture and Agri-Food Canada and University of Saskatchewan
Data accessibility: Raw sequence data are available from NCBI SRA with accession SRP115373 and meta data are accessible from Figshare

References

S F Altschul, W Gish, W Miller, E W Myers, D J Lipman. Basic local alignment search tool. J Mol Biol, 1990-10-05.

Phil Carter, Jinfeng Liu, Burkhard Rost. PEP: Predictions for Entire Proteomes. Nucleic Acids Res, 2003-01-01.

Kamil Salikhov, Gustavo Sacomoto, Gregory Kucherov. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol Biol, 2014-02-24.