Draft genome sequence data of maqui (Aristotelia chilensis) and identification of SSR markers.

Adriana Bastías1, Francisco Correa2, Pamela Rojas2, Constanza Martin2, Jorge Pérez-Diaz3, Cristian Yáñez3, Mara Cuevas3, Ricardo Verdugo3, Boris Sagredo2.   

Abstract

Maqui (Aristotelia chilensis [Molina] Stunz) is a small dioecious tree belonging to the Elaeocarpaceae family. Maqui fruit has high levels of antioxidant activity, which are due to its elevated anthocyanin and polyphenol content. Here we describe draft genome sequence data of maqui (A. chilensis). The genomic sequence datasets were obtained using the Illumina NextSeq platform. The nucleotide sequences of the raw reads and the assembled draft genome are available at NCBI's Sequence Read Archive as BioProject PRJNA544858. In addition, a total of 210,067 microsatellite, or simple sequence repeat (SSR), markers were identified.
© 2019 The Authors.

Keywords:  Aristotelia chilensis; Draft genome; Illumina NextSeq platform; Maqui; Microsatellite; SSR markers; Sequencing

Year:  2019        PMID: 31673575      PMCID: PMC6817651          DOI: 10.1016/j.dib.2019.104545

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Data

Here we describe raw sequence reads, an assembled draft genome, and an SSR analysis obtained from genomic DNA of maqui (A. chilensis). Both the raw data and the assembled draft genome are available at NCBI's Sequence Read Archive under BioProject PRJNA544858 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA544858). Genomic DNA was obtained from fresh maqui leaves. A library with a 300 bp insert size was sequenced in paired-end mode on the Illumina NextSeq 550 platform, generating around 187 million 2 × 151 bp reads. Read quality was assessed with FastQC v0.11.5, and the data were then trimmed and filtered with fastp to remove reads containing more than 5% unknown nucleotides, low-quality reads (more than 50% of bases with Q-value ≤ 20), all unpaired reads, and short reads (<35 bp). After filtering, 95.87% of the total reads were suitable for genome assembly (Table 1). A draft genome of maqui was then obtained by de novo assembly with MaSuRCA software [1] (Table 2).
Table 1

Dataset of maqui (A. chilensis) reads obtained by Illumina NextSeq 550 sequencing before and after filtering.

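| Species | Total reads (×2), before filtering | GC (%), before filtering | Total reads (×2), after filtering | GC (%), after filtering | % of total reads |
|---|---|---|---|---|---|
| A. chilensis | 187,132,040 | 36 | 179,407,345 | 35.13 | 95.87 |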
Table 2

Data on contig measurements that were assembled by MaSuRCA software with high-quality reads.

| Item | Number | Description |
|---|---|---|
| Total number of sequences | 58,451 | Counts |
| N50 | 13,213 | A + T + C + G + N (bp) |
| Max contig | 113,184 | (A + T + C + G), not including Ns |
| Min contig | 500 | (A + T + C + G), not including Ns |
| Total length of sequences | 326,414,674 | A + T + C + G + N (bp) |
| Total valid length of sequences | 326,169,547 | A + T + C + G (bp) |
| Unknown bases (Ns) in sequences | 245,127 | bp |
| Percentage of unknown bases | 0.08 | Percentage (%) |
| GC content | 35.13 | (G + C)/(A + T + C + G), not including Ns (%) |
The final genome assembly had a total length of 326 Mb, comprising 58,451 scaffolds, with a mean coverage of 140X. The scaffold N50 of this assembly was 13.2 kb, and unclosed gap regions represented 0.08% of the assembly. The G + C content of the genome assembly, excluding gaps, was estimated to be 35.13%. The draft genome was constructed from 343,326,678 reads, which is 95.68% of the 358,814,690 filtered reads (2 × 179,407,345; Table 1).

To check the draft genome, raw transcriptomic reads from maqui were downloaded from the NCBI database (BioProject PRJNA255387) and mapped to the draft genome with the HISAT2 alignment program [2]; 93.61% of the filtered RNA sequences were mapped. The assembled A. chilensis draft genome was also analyzed with BUSCO [3] using the embryophyta database (Fig. 1). We found 1244 complete orthologous genes (C: 90.4%), of which 1220 were complete and single-copy (S: 88.7%) and 24 were complete and duplicated (D: 1.7%), together with 84 fragmented genes (F: 6.1%) and 47 missing BUSCO genes (M: 3.5%).
Fig. 1

Percentage of the 1375 single-copy orthologous genes, constructed from 60 plant species, recovered by BUSCO analysis.

The assembled draft genome of maqui was used to identify microsatellite sequences, or simple sequence repeats (SSRs) (Table 3). Di- to hexanucleotide microsatellites, with repeat motifs of 2 to 6 bp and a total length ≥12 bp, were considered. This corresponds to dinucleotide motifs with ≥6 repeats, trinucleotide motifs with ≥4 repeats, and tetra-, penta- and hexanucleotide motifs with ≥3 repeats. A total of 210,067 perfect SSR markers were identified in maqui (Table 3). Among the identified SSRs, dinucleotide motifs (54.87%) were the most common, followed by tetranucleotide (17.73%) and trinucleotide motifs (15.70%) (Table 4). We also examined the distribution of maqui microsatellites with regard to motif length, motif type, and number of repeats (Fig. 2). A total of 111,531 primer pairs were designed from the flanking sequences of di- to hexanucleotide microsatellites of maqui (A. chilensis) and are available in Table S1.
Table 3

Dataset of microsatellite (SSRs) searches of maqui (A. chilensis) using PERF software.

| Item | Number | Description |
|---|---|---|
| Total number of perfect SSRs | 210,067 | Counts |
| Total length of perfect SSRs | 3,153,200 | bp |
| Average length of SSRs | 15.02 | Total SSR length/total SSR counts (bp) |
| SSRs per sequence | 4 | Total SSR counts/sequence counts |
| % of sequence occupied by SSRs | 0.97 | Total SSR length/total sequence size (%) |
| Relative abundance | 644.04 | Total SSRs/total valid length (loci/Mb) |
| Relative density | 9667.36 | Total SSR length/total valid length (bp/Mb) |
Table 4

Distribution of di- to hexanucleotide microsatellite motifs in the assembled genomic DNA of maqui (A. chilensis).

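| Type | Counts | Length (bp) | Percent (%) | Relative abundance (loci/Mb) | Relative density (bp/Mb) |
|---|---|---|---|---|---|
| Di | 115,254 | 1,765,324 | 54.87 | 353.36 | 5412.29 |
| Tri | 32,972 | 480,600 | 15.7 | 101.09 | 1473.47 |
| Tetra | 37,247 | 481,296 | 17.73 | 114.2 | 1475.6 |
| Penta | 15,190 | 242,440 | 7.23 | 46.57 | 743.29 |
| Hexa | 9,404 | 183,540 | 4.48 | 28.83 | 562.71 |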
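
The percentage, abundance, and density columns of Tables 3 and 4 follow from the per-class counts, the per-class SSR lengths, and the total valid assembly length (326,169,547 bp, Table 2). The following is a minimal Python sketch of that arithmetic. The names SSR_CLASSES and per_mb are illustrative only and are not part of PERF or of the authors' pipeline.

```python
# Counts and total SSR lengths (bp) per motif class, copied from Table 4.
SSR_CLASSES = {
    "Di": (115_254, 1_765_324),
    "Tri": (32_972, 480_600),
    "Tetra": (37_247, 481_296),
    "Penta": (15_190, 242_440),
    "Hexa": (9_404, 183_540),
}

# Total valid length of the assembly (A + T + C + G, no Ns), from Table 2.
VALID_LENGTH_BP = 326_169_547


def per_mb(value, valid_length_bp=VALID_LENGTH_BP):
    """Scale a count (loci) or a length (bp) to a per-megabase figure."""
    return value / (valid_length_bp / 1_000_000)


total_count = sum(count for count, _ in SSR_CLASSES.values())
total_length = sum(length for _, length in SSR_CLASSES.values())

for name, (count, length) in SSR_CLASSES.items():
    share = 100 * count / total_count   # Percent (%) column
    abundance = per_mb(count)           # loci/Mb
    density = per_mb(length)            # bp/Mb
    print(f"{name}\t{share:.2f}\t{abundance:.2f}\t{density:.2f}")

print(f"All\t{per_mb(total_count):.2f}\t{per_mb(total_length):.2f}")
```

The per-class counts and lengths sum to the totals in Table 3, which is a quick consistency check between the two tables.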
Fig. 2

Distribution of di- to hexanucleotide SSRs from maqui (A. chilensis) by number of repeats. The graph is based on a total of 210,067 SSRs detected in non-redundant maqui genomic DNA. Di, tri, tetra, penta and hexa refer to dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides, respectively.

Experimental design, materials, and methods

Plant material

Young maqui (A. chilensis) leaves were collected at INIA-Rayentue, Rengo, O'Higgins Region, Chile (latitude 34°19′16.1″S, longitude 70°50′03.6″W). Samples were frozen in liquid nitrogen and stored at −80 °C until DNA extraction and subsequent analysis.

Genomic DNA extraction

Genomic DNA of maqui (A. chilensis) was extracted as described by Bastías et al. (2016) [4], using the DNeasy Plant Mini Kit (Qiagen) according to the manufacturer's instructions.

DNA sequencing

Paired-end de novo DNA sequencing was performed on the Illumina NextSeq 550 platform. Approximately 187 million 2 × 151 bp reads were generated from a library with a 300 bp insert size. The sequence quality of the raw genomic data was assessed using FastQC v0.11.5 software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Quality trimming and filtering were performed using fastp (https://github.com/OpenGene/fastp) [5]: reads containing more than 5% unknown nucleotides, low-quality reads (more than 50% of bases with Q-value ≤ 20), and all unpaired reads were discarded. Short reads (<35 bp) were then removed from the filtered data.
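
The filtering rules above can be restated as a small Python sketch. This is not fastp and does not reproduce its trimming; it only applies the four stated criteria to reads that are already in memory. The function names, the Phred+33 quality encoding, and the in-memory tuple layout are assumptions made for illustration.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_filter(seq, quality_string, max_n_frac=0.05, low_q=20,
                  max_low_frac=0.50, min_len=35):
    """Return True if a single read satisfies the stated thresholds."""
    if len(seq) < min_len:
        return False  # short read (<35 bp)
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False  # more than 5% unknown nucleotides
    quals = phred33(quality_string)
    low = sum(1 for q in quals if q <= low_q)
    if low / len(quals) > max_low_frac:
        return False  # more than 50% of bases with Q <= 20
    return True


def filter_pairs(pairs):
    """Keep a pair only if both mates pass, so no unpaired reads remain.

    `pairs` is an iterable of ((seq1, qual1), (seq2, qual2)) tuples.
    """
    return [
        (r1, r2) for r1, r2 in pairs
        if passes_filter(*r1) and passes_filter(*r2)
    ]


if __name__ == "__main__":
    good = ("ACGT" * 10, "I" * 40)                # 40 bp, all Q40
    many_n = ("N" * 3 + "ACGT" * 9 + "A", "I" * 40)  # 3/40 = 7.5% Ns
    kept = filter_pairs([(good, good), (good, many_n)])
    print(len(kept), "pair(s) kept")
```

Dropping a pair when either mate fails is one simple way to guarantee that no unpaired reads remain, which matches the stated outcome but not necessarily the tool's internal order of operations.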

Genome assembly

De novo assembly of the clean reads was then performed to generate contigs and scaffolds. We used MaSuRCA (http://www.genome.umd.edu/masurca.html) [1] with an optimized k-mer length of 85, calculated with KmerGenie software [6]. Assembly statistics were obtained with QUAST (quality assessment tool for genome assemblies) [7].
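
For readers unfamiliar with the assembly statistics reported in Table 2, the sketch below shows how N50 and GC content (excluding Ns) are conventionally defined. It is an illustration in Python, not QUAST's implementation, and the function names are chosen here for clarity.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half
    of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequences given")


def gc_content(seq):
    """(G + C) / (A + T + C + G) as a percentage, ignoring Ns."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100 * gc / acgt if acgt else 0.0


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]   # hypothetical lengths
    print("N50:", n50(toy_contigs))
    print("GC%:", gc_content("GGCCAATTNN"))
```

In the toy example the total is 290, so the cumulative length first reaches half of it (145) at the second-longest contig.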

Assessing genome assembly completeness with benchmarking universal single-copy orthologs (BUSCO)

The assembled A. chilensis genome was assessed with BUSCO [3] against the embryophyta database, which consists of 1375 orthologs constructed from 60 species.
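
The BUSCO percentages reported in the Data section can be recomputed from the gene counts (1220 single-copy, 24 duplicated, 84 fragmented, 47 missing, out of 1375). The Python sketch below does this. Deriving C as S + D and M as the remainder after rounding is an assumption made here because it reproduces the reported figures (47/1375 on its own rounds to 3.4%, not 3.5%); the source does not state how the percentages were rounded.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Recompute BUSCO-style percentages from raw category counts."""
    total = single + duplicated + fragmented + missing
    s = round(100 * single / total, 1)
    d = round(100 * duplicated / total, 1)
    f = round(100 * fragmented / total, 1)
    c = round(s + d, 1)              # complete = single-copy + duplicated
    m = round(100 - s - d - f, 1)    # assumption: missing taken as remainder
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": total}


if __name__ == "__main__":
    print(busco_summary(single=1220, duplicated=24, fragmented=84, missing=47))
```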

Identification of Putative SSRs and primer design

We analyzed perfect SSRs only. The contig sequences in FASTA format were screened with PERF software [8] for repeat motifs of 2–6 bp and a total length of ≥12 bp, which corresponds to dinucleotide repeats ≥6, trinucleotide repeats ≥4, and tetra-, penta- and hexanucleotide repeats ≥3. The program allows for direct primer design using PRIMER 3 [9] by searching for microsatellite repeats and primer annealing sites in the flanking regions.
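
To make the thresholds concrete, the following Python sketch finds perfect SSRs with a regular expression per motif size. It is not PERF's algorithm, which uses a different and much faster approach; the function names and the returned tuple layout are illustrative assumptions. Motifs that are themselves repeats of a shorter unit (for example ATAT, or a homopolymer) are skipped so that each repeat is reported at its smallest motif size.

```python
import re

# Minimum repeat counts per motif size, as stated in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    for size in range(1, n):
        if n % size == 0 and motif[:size] * (n // size) == motif:
            return False
    return True


def find_perfect_ssrs(seq):
    """Return sorted (start, end, motif, repeats) tuples, 0-based, end-exclusive."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # One motif of k bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


if __name__ == "__main__":
    toy = "GGC" + "AT" * 6 + "CCT" + "AAG" * 4 + "T"   # hypothetical sequence
    for hit in find_perfect_ssrs(toy):
        print(hit)
```

Because each motif size is scanned independently and matches do not overlap within one scan, this sketch can miss a repeat that begins inside a skipped non-primitive match; a production tool handles such cases more carefully.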

Specifications table

Subject area: Genomics
More specific subject area: Plant Genomics
Type of data: Tables and figures
How data was acquired: Paired-end DNA sequencing was performed on the Illumina NextSeq 550 platform.
Data format: Raw and analyzed data of draft genome assembly; SSR table
Experimental factors: Leaves of maqui, DNA extraction and de novo sequencing.
Experimental features: Genomic DNA was extracted from leaves of maqui (Aristotelia chilensis) with the DNeasy Plant Mini Kit (QIAGEN, USA). The paired-end library was sequenced on the Illumina NextSeq 550 platform. De novo assembly was done with MaSuRCA software. SSR identification was performed with PERF software.
Data source location: Rengo, Chile, INIA-Rayentue (Avda. Salamanca s/n, Km 105 ruta 5 sur, sector Los Choapinos). Latitude 34°19′16.1″S and longitude 70°50′03.6″W.
Data accessibility: The nucleotide sequences of raw reads and assembled draft genome are available at NCBI's Sequence Read Archive as BioProject PRJNA544858 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA544858)
Related research article: Bastías, A., Correa, F., Rojas, P., Almada, R., Muñoz, C., Sagredo, B., 2016. Identification and Characterization of Microsatellite Loci in Maqui (Aristotelia chilensis [Molina] Stunz) Using Next-Generation Sequencing (NGS). PLoS ONE 11(7): e0159825. https://doi.org/10.1371/journal.pone.0159825

Value of the data

The raw sequence reads and assembled draft genome of maqui (Aristotelia chilensis) help establish a genomic platform for this plant species.

The draft genome data can facilitate the identification of the molecular mechanisms that underlie the properties of maqui products, and thereby contribute to improving them through classical and/or biotechnological approaches.

The draft genome data will accelerate functional genomics research in this species.

The newly developed SSR marker dataset for maqui should be a useful tool for assessing genetic diversity and understanding genetic structure, facilitating the implementation of an effective conservation system for its natural populations.

References

1.  The MaSuRCA genome assembler.

Authors:  Aleksey V Zimin; Guillaume Marçais; Daniela Puiu; Michael Roberts; Steven L Salzberg; James A Yorke
Journal:  Bioinformatics       Date:  2013-08-29

2.  Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

Authors:  Mihaela Pertea; Daehwan Kim; Geo M Pertea; Jeffrey T Leek; Steven L Salzberg
Journal:  Nat Protoc       Date:  2016-08-11

3.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09

4.  Identification and Characterization of Microsatellite Loci in Maqui (Aristotelia chilensis [Molina] Stunz) Using Next-Generation Sequencing (NGS).

Authors:  Adriana Bastías; Francisco Correa; Pamela Rojas; Rubén Almada; Carlos Muñoz; Boris Sagredo
Journal:  PLoS One       Date:  2016-07-26

5.  fastp: an ultra-fast all-in-one FASTQ preprocessor.

Authors:  Shifu Chen; Yanqing Zhou; Yaru Chen; Jia Gu
Journal:  Bioinformatics       Date:  2018-09-01

6.  Informed and automated k-mer size selection for genome assembly.

Authors:  Rayan Chikhi; Paul Medvedev
Journal:  Bioinformatics       Date:  2013-06-03

7.  QUAST: quality assessment tool for genome assemblies.

Authors:  Alexey Gurevich; Vladislav Saveliev; Nikolay Vyahhi; Glenn Tesler
Journal:  Bioinformatics       Date:  2013-02-19

8.  PERF: an exhaustive algorithm for ultra-fast and efficient identification of microsatellites from large DNA sequences.

Authors:  Akshay Kumar Avvaru; Divya Tej Sowpati; Rakesh Kumar Mishra
Journal:  Bioinformatics       Date:  2018-03-15

9.  Primer3 on the WWW for general users and for biologist programmers.

Authors:  S Rozen; H Skaletsky
Journal:  Methods Mol Biol       Date:  2000
