Literature DB >> 25774672

Complete chloroplast genome of the multifunctional crop globe artichoke and comparison with other Asteraceae.

Pasquale L Curci¹, Domenico De Paola¹, Donatella Danzi¹, Giovanni G Vendramin², Gabriella Sonnante¹.

Abstract

With over 20,000 species, Asteraceae is the second largest plant family. High-throughput sequencing of nuclear and chloroplast genomes has allowed for a better understanding of the evolutionary relationships within large plant families. Here, the globe artichoke chloroplast (cp) genome was obtained by a combination of whole-genome and BAC clone high-throughput sequencing. The artichoke cp genome is 152,529 bp in length, consisting of two single-copy regions separated by a pair of inverted repeats (IRs) of 25,155 bp, representing the longest IRs found in the Asteraceae family so far. The large (LSC) and the small (SSC) single-copy regions span 83,578 bp and 18,641 bp, respectively. The artichoke cp sequence was compared to the other eight Asteraceae complete cp genomes available, revealing an IR expansion at the SSC/IR boundary. This expansion consists of 17 bp of the ndhF gene generating an overlap between the ndhF and ycf1 genes. A total of 127 cp simple sequence repeats (cpSSRs) were identified in the artichoke cp genome, potentially suitable for future population studies in the Cynara genus. Parsimony-informative regions were evaluated and allowed to place a Cynara species within the Asteraceae family tree. The eight most informative coding regions were also considered and tested for "specific barcode" purpose in the Asteraceae family. Our results highlight the usefulness of cp genome sequencing in exploring plant genome diversity and retrieving reliable molecular resources for phylogenetic and evolutionary studies, as well as for specific barcodes in plants.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2015 PMID： 25774672 PMCID： PMC4361619 DOI： 10.1371/journal.pone.0120589

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Cynara cardunculus L. is a complex species belonging to the second largest family of plants, Asteraceae, with over 20,000 species [1]. It includes two crops, the globe artichoke [C. cardunculus L. var. scolymus (L.) Fiori] and the cultivated or leafy cardoon (C. cardunculus var. altilis DC), and the wild cardoon (C. cardunculus var. sylvestris Lam.). The wild perennial cardoon has been recognized as the ancestor of both cultigens [2,3]. The globe artichoke is a diploid outcrossing crop (2n = 2x = 34) originating in the Mediterranean region. It fulfills an important role in human nutrition in this area, where it is mainly consumed as a vegetable for its large and edible immature flower heads. The globe artichoke is also well known for its beneficial properties, due to a high content in polyphenols such as flavonoids, caffeic acid, chlorogenic acid and cynarin [4,5,6], and to a particular abundance of inulin in roots [7]. Due to the high level of heterozygosity in its genome, for centuries the artichoke has been mainly asexually propagated in order to ensure commercial uniformity [8]. Recently, an increasing number of seed-propagated varieties have been released [9]. Artichoke cultivation is mainly located in Europe (where Italy is the leading producer) and in North Africa. More recently it has spread in California, Peru and Argentina (http://faostat.fao.org, 2012). Despite much interest in the phylogenetics of Asteraceae, several relationships still need to be clarified. The Cynareae, Dicomeae, and Tarchonantheae formed a well-supported trichotomy in the cp metatree by Panero and Funk [10], which did not include any Cynara type species, but the relationships within the Cynareae (synonymous Cardueae) and between the Cynareae and the other tribes of the Carduoideae subfamily are not as well resolved [10,11]. Chloroplasts, which originate from ancient eubacterial invasions [12], are multifunctional organelles possessing their own genetic material. In most higher plants, including Angiosperms, the chloroplast (cp) genome forms a double stranded, circular molecule ranging from 120 to 160 kb that is highly conserved in size, structure and gene content [13,14]. The plant cp genomes typically harbor a quadripartite structure consisting of two inverted repeats (IRs) separated by two regions of unique DNA, the large (LSC) and small (SSC) single-copy regions [15]. Substitution rates in plant cp genomes are much lower than those in their nuclear genomes [16] and the very low level of recombination and primarily uniparental inheritance makes cp genomes a valuable source of genetic markers for phylogenetic analyses [17,18]. For these reasons, cp genomes are also useful tools for DNA barcoding. While universal DNA barcodes are not available for plants [19], Li et al. [20] recently proposed the use of taxon-specific barcodes for species identification using dedicated DNA cp-regions that have a sufficiently high mutation rate. Since the publication of the first cp genome [21], the number of complete cp genomes available (http://www.ncbi.nlm.nih.gov/genome/) has increased rapidly thanks to the development of high-throughput technologies [20]. However, only a few representatives from the Asteraceae family have been completely sequenced and analyzed. Here we present the complete cp genome sequence of the globe artichoke, obtained by a combination of data retrieved from genome and BAC clone sequencing. This is the first published cp genome belonging to the subfamily Carduoideae and thus represents a solid resource for phylogenetic studies and comparative genomics of the Asteraceae. In this manuscript, we also searched for the most valuable regions for barcoding with potential applications across the large family of Asteraceae.

Materials and Methods

Chloroplast sequencing and analyses

Genomic DNA was extracted from young leaves of globe artichoke, variety “Brindisino” according to Sonnante et al. [22]. Whole genomic DNA was sent to IGA Technology Services (Udine, Italy) in order to perform Illumina sequencing, using the GAIIx platform (200–350 bp library insert size, 75 bp paired-end reads). Short reads were deposited in the NCBI Short Read Archive under the accession number SRP049578. A BAC library of the globe artichoke was used to search clones harboring the cp genome. A total of 57,600 clones from the same genotype, representing approximately five haploid genome equivalents, were screened by a multidimensional pooling strategy, using cp specific primer pairs (S1 Table). The identified BAC clone was isolated by plasmid DNA extraction and purification with Plasmid Midi Kit (Qiagen, Milan, Italy) following the manufacturer instructions, and finally sent to IGA Technology Services (Udine, Italy) for 250 bp paired-end reads MiSeq (Illumina) sequencing. Short reads were deposited in the NCBI Short Read Archive under the accession number SRR1648410. Sequence data were analyzed with the CLC Genomics Workbench 6.0.1 (CLC Bio, Aarhus, Denmark), using de novo and reference-guided assembling methods, alternatively, with the following parameters: mismatch cost = 2, insertion cost = 3, deletion cost = 3, length fraction = 0.8, similarity fraction = 0.9. The four junctions between IRs and SSC/LSC were checked by standard PCR amplification with specific primers (S1 Table) and Sanger sequencing. Gene annotation was carried out with DOGMA (Dual Organellar GenoMe Annotator) [23] to identify coding sequences (cds), rRNAs, and tRNAs using the Plant plastid genetic code and BLAST homology searches. To verify the exact gene and exon boundaries, we compared artichoke annotations with those of lettuce (DQ383816). All tRNA genes were further confirmed with the online tRNAscan-SE 1.21 search server [24]. Codon usage was calculated for all exons of protein-coding genes (pseudogenes were not calculated) with Acua 1.0 [25]. Base composition was evaluated with DNA/RNA Base Composition Calculator (http://www.currentprotocols.com/WileyCDA/CurPro3Tool/toolId-7.html). Repeated elements in the artichoke cp genome were investigated by means of two web-based programs. The first one, Tandem Repeat Finder [26], was set at 2, 7 and 7 for match, mismatch and indel, respectively. The minimum alignment score and maximum period size were set as 50 and 500, respectively. The second program, REPuter [27], was set with the identity parameter no less than 90% (hamming distance equal to 3). Among dispersed repeats, the minimum repeat size investigated was 30 bp for direct and 20 bp for palindromic, respectively. Overlapping repeats were merged into one repeat motif whenever possible. Microsatellites (SSRs) were predicted using MISA (MIcroSAtellite, http://pgrc.ipk-gatersleben.de/misa/) and the software tool IMEx (Imperfect Microsatellite Extractor) [28]. We identified SSRs as mononucleotide repeats ≥8 bases, dinucleotides ≥10 bases (5 repeats), trinucleotides and tetranucleotides ≥12 bases (4 and 3 repeats, respectively), pentanucleotides ≥15 bp (3 repeats), and exanucleotides ≥18 bp (3 repeats).

Comparative and phylogenetic analyses of Asteraceae cp genomes and development of barcode markers

Full alignments of nine Asteraceae cp genomes were performed using mVISTA program [29] in Shuffle-LAGAN mode. Genomes retrieved from NCBI were: Helianthus annuus (NC007977), Lactuca sativa (DQ383816), Parthenium argentatum (GU120098), Guizotia abyssinica (EU549769), Jacobaea vulgaris (HQ234669), Ageratina adenophora (NC_015621), Artemisia frigida (NC_020607), Chrisanthemum x morifolium (JQ362483), and C. cardunculus var. scolymus annotation was used as a reference. Regions promisingly valuable as phylogenetic markers across the Asteraceae family were investigated with Mega 6 [30] using default parameters (gap opening penalty: 15; gap extension penalty: 6.66; DNA weight matrix: IUB; transition weight: 0.5; negative matrix: off; delay divergent cutoff: 30%). Each alignment was imported in PAUP* 4.0b10 [31] for a phylogenetic analysis using the parsimony criterion. The robustness of every tree was confirmed with 1,000 bootstrap replicates, and the consistency (CI) and retention (RI) indexes were calculated. For barcoding applications, eight coding regions of the genes ccsA, matK, ndhA, rbcL, accD, clpP, rps16 and ycf1 were chosen for primer design with Primer 3 software [32] to obtain amplicons of 800 bp on average. PCR reactions were performed using a 9700 thermal cycler (Applied Biosystems, Foster City, CA) in 10 μl reaction mixtures containing 50 ng template DNA, 0.02 μM forward and reverse primer, 0.2 mM of each dNTP, 1x buffer, 0.4 U Taq DNA polymerase (Life Technologies, Foster City, CA) and 1.5 mM MgCl2. Thermal profile for the amplification was 3 min of initial denaturation at 94°C, 35 cycles of 30 sec at 94°C, 30 sec at optimal primer temperature (56°C for all genes, except for rps16, at 58°C) and 1 min extension at 72°C, followed by a final 7 min incubation at 72°C. The amplified fragments were checked on 1.5% agarose gel with a 100 bp molecular size standard (Fermentas, Vilnius, Lithuania). Six cp protein-coding genes (matk, ndhD, ndhF, ndhI, rbcL, rpoB) and the first exon of rpoC1 were extracted from 69 accessions: 60 from Panero and Funk [10], eight from the NCBI database, and one corresponding to the globe artichoke here described. These species belong to seven Asteraceae subfamilies: Asteroideae, Corymbioideae, Cichorioideae, Gymnarrhenoideae, Pertyoideae, Carduoideae, and Hecastocleidoideae. Extracted sequences were then concatenated through copy and paste and aligned with Fast Statistical Alignment (FSA) [33] web-server, setting gap factor as 1 and model type as Tamura-Nei [34]. All positions containing gaps or missing data were eliminated. Maximum parsimony (MP) analyses were performed with PAUP*4.0b10. Heuristic tree searches were conducted with 10 random-taxon-addition replicates, tree bisection-reconnection (TBR) branch swapping, and with “multrees” option in effect. Non-parametric bootstrap analysis was carried out under 1,000 replicates with TBR branch swapping. Maximum likelihood (ML) analysis was performed with RaxML Blackbox [35] using the Gamma model of rate heterogeneity.

Results and Discussion

Chloroplast genome assembly and annotation

Reads from an Illumina partial sequencing of the “Brindisino” globe artichoke nuclear genome were used to assemble the cp genome. To this end, the total reads (33 million) were filtered by aligning them on the cp genome from L. sativa (DQ383816), chosen for its phyletic proximity to Cynara genus. We thus obtained 1,308,860 mapped reads (coverage 643x), covering about 90% of the entire cp genome. In order to complete the cp sequence, we screened a BAC library obtained from the same genotype used for nuclear genome sequencing. By means of specific primer pairs, a clone harboring the artichoke cp genome was successfully isolated. Illumina BAC sequencing produced longer reads that helped complete the entire cp sequence. The total amount of reads obtained by the two approaches was merged and assembled using de novo and reference-guided methods, separately. The two assemblies produced an almost identical cp sequence, except for six insertion/deletions (indel) events, four insertions and two deletions in the reference-guided assembly compared to the de novo one. Subsequent PCR amplifications and Sanger sequencing revealed that the de novo assembly was correct five out of six times compared to the reference-guided assembly. Artichoke cp complete sequence was hence adjusted according to these findings. Finally, the four junctions between the IRs and SSC/LSC were confirmed by PCR amplifications and Sanger sequencing.

Genome organization and gene content

The artichoke cp genome is 152,529 bp in length. The canonical quadripartite structure consists of one LSC of 83,578 bp, one SSC of 18,641 bp and a pair of IRs of 25,155 bp each (Fig. 1). This genome contains 114 unique genes, including 30 tRNA, 4 rRNA, and 80 predicted protein-coding genes (Table 1). The tRNA-coding genes represent all the 20 amino acids and are distributed throughout the genome, one in the SSC region, 22 in the LSC region and seven in the IR region. Seven genes coding for tRNA, four rRNA genes and six protein-coding genes (rpl2, rpl23, ycf2, ndhB, rps7, ycf15) are completely duplicated in the IR regions. Therefore, the total number of genes present in the artichoke cp genome is 131 (Fig. 1).

Fig 1

Artichoke cp genome map.

Table 1

Genes present in the globe artichoke cp genome.

Category	Gene name
Photosystem I	psaA, B, C, I, J, ycf3 ^a , ycf4
Photosystem II	psbA, B, C, D, E, F, H, I, J, K, L, M, N, T, Z/lhbA
Cytochrome b6/f	petA, B ^b , D ^b , G, L, N
ATP synthase	atpA, B, E, F ^b , H, I
Rubisco	rbcL
NADH oxidoreductase	ndhA ^b , B ^b ^, ^c , C, D, E, F, G, H, I, J, K
Large subunit ribosomal proteins	rpl2 ^b ^, ^c , 14, 16, 20, 22, 23 ^c , 32, 33, 36
Small subunit ribosomal proteins	rps2, 3, 4, 7 ^c , 8, 11, 12 ^b ^, ^c ^, ^d , 14, 15, 16 ^b , 18, 19 ^c ^, ^e
RNAP	rpoA, rpoB, C1 ^a , C2
Other proteins	accD, ccsA, cemA, clpP ^a , matK, infA
Proteins of unknown function	ycf1 ^c ^, ^e , ycf2 ^c , ycf15 ^c , ycf68 ^c ^, ^e
Ribosomal RNAs	rrn23 ^c , 16 ^c , 5 ^c , 4.5 ^c
Transfer RNAs	trnA(UGC) ^b ^, ^c , trnC(GCA), D(GUC), E(UUC), F(GAA), G(UCC),G(GCC), H(GUG), I(CAU) ^c , I(GAU) ^b ^, ^c , K(UUU) ^b , L(UAA) ^b , L(UAG), L(CAA) ^c , fM(CAU), M(CAU), N(GUU) ^c , P(UGG), Q(UUG), R(ACG) ^c , R(UCU), S(GCU), S(GGA), S(UGA), T(CGU), T(UGU), V(UAC) ^b , V(GAC) ^c , W(CCA), Y(GUA)

aGene containing two introns

bGene containing a single intron

cTwo gene copies in the IRs

dGene divided into two independent transcription units

ePseudogenes

Artichoke cp genome map.

Genes shown on the outside of the large circle are transcribed clockwise; genes on the inside are transcribed counterclockwise. Thick lines of the small circle indicate IRs. Pseudogenes are marked with '*'. aGene containing two introns bGene containing a single intron cTwo gene copies in the IRs dGene divided into two independent transcription units ePseudogenes Protein-coding genes make up 49.9% (76,085 nt for 86 genes), tRNAs (2,796 nt for 37 genes) and rRNAs (8,842 nt for eight genes) represent 1.8% and 5.8% of the genome, respectively. The remaining 42.5% are non-coding introns, intergenic spacers, or pseudogenes. We have identified three pseudogenes: ycf68, in the IR, contains a premature stop codon in its coding sequence; the remaining two pseudogenes, ycf1 and rps19, are located in the boundary regions between IRb/SSC and IRa/LSC, respectively. The lack of their protein-coding ability is due to a partial gene duplication. The average AT content of the artichoke cp genome is 62.3%. The AT content of the LSC and SSC regions is 64.2% and 68.6%, respectively, whereas that of the IR regions is 56.9%; these data are congruent to what has been found in other cp genomes, e.g. Sesamum and Camellia genera [36,37]. The low AT content in the IR regions is due to the reduced presence of AT nucleotides in the four rRNA genes: rrn16, rrn23, rrn4.5, and rrn5. The IR regions may help stabilize the cp genome, as evidenced in a group of legumes that have lost a copy of the IR and are subject to more rearrangements compared to those that have not [38]. In the artichoke cp genome there are 18 intron-containing genes (Table 2). Among them, 16 genes (eight protein-coding and six tRNA genes) have a single intron and two genes (ycf3, clpP) have two introns. Out of the 18 genes with introns, 13 (nine protein-coding and four tRNA genes) are located in the LSC, one protein coding in the SSC and four (two protein coding and two tRNAs) in the IR region. The trnK-UUU intron is the largest one (2,530 bp) and includes the matK gene. The rps12 gene is a trans-spliced gene: its 5’ end exon is located in the LSC region and the two remaining exons are located in the IR regions. In the ndhD and psbL genes, we observed that ACG is used as an alternative start codon instead of the canonical ATG. The ACG start codon has been shown to convert to an AUG initiation site as reported in N. tabacum [39]. One GUG start codon was found in rps19. GUG codons have been reported to be more efficient than ACG in initiating translation and have a relative strength varying from 15 to 30% of AUG activity [40].

Table 2

Intron containing genes in the globe artichoke cp genome; exon and intron size.

Gene	Region	Exon I (bp)	Intron I (bp)	Exon II (bp)	Intron II (bp)	Exon III (bp)
rpl16	LSC	9	1004	399	-	-
rps16	LSC	40	854	215	-	-
rpoC1	LSC	432	732	1638	-	-
atpF	LSC	145	707	410	-	-
ycf3	LSC	124	742	230	698	153
clpP	LSC	71	626	291	803	229
petB	LSC	6	765	642	-	-
rps12*	LSC	243	-	114	-	-
petD	LSC	8	705	475	-	-
rpl2	IR	391	665	434	-	-
ndhB	IR	782	670	751	-	-
ndhA	SSC	553	1060	539	-	-
trnK-UUU	LSC	37	2530	35	-	-
trnA-UGC	IR	38	821	35	-	-
trnL-UAA	LSC	37	440	50	-	-
trnG-UCC	LSC	47	707	23	-	-
trnI-GAU	IR	43	510	35	-	-
trnV-UAC	LSC	38	574	38	-	-

* rps12 gene is subjected to trans-splicing.

* rps12 gene is subjected to trans-splicing. A total of 78,891 nt and 26,297 codons represent the coding capacity of 86 protein-coding genes in the artichoke cp genome (S2 Table). Leucine (2,792 codons meaning 10.6% of the total) and cysteine (293 corresponding to 1.1%) are the most and the least abundant amino acids, respectively. The codon usage is biased towards a high representation of A and T at the third codon position, which is similar to the majority of angiosperm cp genomes [41,42]. The whole artichoke cp sequence along with gene annotations was submitted to GenBank (accession number: KM035764).

Repeat structure and sequence analysis

Repeat regions are considered to play an important role in genome recombination and rearrangement [43,44]. We divided these regions in two categories: direct (D) and palindromic (P) repeats. With a 100% match criterion in repeat copies, Tandem Repeat Finder (TRF) identified ten sets of repeats longer than 10 bp. With a >90% criterion, TRF detected 12 other sets of repeats giving 22 total sets, nine in cds regions, two in intronic regions, and 11 in intergenic regions (S3 Table). REPuter allowed us to identify 21 repeats. Six repeats had a 0 hamming distance, that is, a complete identity with each other. We compared the redundant output of REPuter with TRF and checked the tandem repeats; dispersed repeats (direct and palindromic) were analyzed separately. Fifteen palindromic repeats and six direct repeats were identified. Therefore the total number of repeats was 43 and their copy number ranged between 2 and 4 (S3 Table). We analyzed the length of these repeats: 26 were 10–20 bp, 10 were 21–30 bp, five were 31–40 bp and two were 41–50 bp. Among all the repeats, 50% were in intergenic space regions, 13% in the intronic regions, 34% in the coding regions and 3% in the regions spanning from spacers to gene. The longest repeat was organized in tandem. It measured 45 bp in length and was located in the ycf1 gene. Among the coding regions, the richest in repeats was the ycf1 gene, which contained six repeats: five direct and one palindromic. As reported for other genomes [45,46], the ycf2 gene was also rich in repeats (four) carrying two direct and two palindromic repeats. It has already been demonstrated that these two coding and divergent regions are often associated with many repeat events [47].

SSR analysis

Chloroplast SSRs (cpSSRs) are generally short mononucleotide tandem repeats that, when located in the non-coding regions of the cp genome, commonly show intraspecific variation in repeat number [17]. CpSSRs can exhibit high variation within the same species and thus are considered valuable markers for population genetics [48,49] and phylogenetic analyses [50]. We analyzed SSRs with two programs, IMEx and MISA and obtained comparable results except for 12 SSRs (including four mononucleotides, four dinucleotides and four trinucleotides), which were identified only by IMEX. The total output consisted of 127 repeats: 61% (77 SSRs) in the LSC region, 25% (32 SSRs) in the SSC region, and 14% (18 SSRs) in the IR regions. Furthermore, 46% were in spacer regions, 42% in coding regions, 10% in intronic regions and 2% in pseudogene regions. We found a total of 109 homopolymers corresponding to 86% of the total SSRs, five dinucleotide (4%), six trinucleotide (5%), and seven tetranucleotide (5%) repeats (Fig. 2). Among the 109 mononucleotide repeats, only two belonged to the C/G type while all the others were A/T type; 61.5% of the mononucleotide repeats were in non-coding regions. This higher proportion of poly(A)/(T) relative to poly(G)/(C) has already been reported in Asteraceae [51,52] and other plant families [36,50,53]. The coding cp-regions with the highest number of repeats were ycf1 with 16 SSRs, followed by ycf2 with eight SSRs in the two IRs (S4 Table). These results are consistent with those from other species, e.g. Vigna radiata, P. argentatum and G. abyssinica [41,51,54] emphasizing that the highly variable ycf1 coding region can represent, also in Cynara, an interesting region suitable for phylogenetic studies or DNA barcoding possibly also at low taxonomic levels [55].

Fig 2

Total repeat and SSR distribution in C. cardunculus var. scolymus chloroplast genome.

(a) Repeat distribution among four different regions: coding sequence, intronic sequence, intergenic space region and overlapping region. (b) SSR distribution according to type: mononucleotide, dinucleotide, trinucleotide, and tetranucleotide repeats. SSR number and percentages (in brackets) are provided. (c) SSR type distribution between coding and non-coding regions.

Total repeat and SSR distribution in C. cardunculus var. scolymus chloroplast genome.

Comparison with other Asteraceae cp genomes, barcode markers and phylogenetic analyses

Structural differences: a continuing expansion of IRs in Cynara

The artichoke cp genome is the third in size among the nine Asteraceae complete cp genomes (Table 3), smaller than those of P. argentatum (152,803 bp) and L. sativa (152,772 bp), and features the largest IR region (25,155 bp). It is important to note that the two L. sativa genomes available in GenBank (NC_007578 and DQ383816) differ between each other by 6 bp and in the relative orientation of their SSC region. This incongruence can be due to polymorphisms between the strains investigated, to differences in the assembly methods, and/or to sequencing errors. The possible existence of an inverted SSC in Asteraceae genomes is still to be confirmed but cannot be excluded given the nature of the flip-flop mechanism of the inverted repeats [56]. For Ar. frigida, Liu et al. [57] claimed to have observed a totally inverted SSC in their assembly. However, the specific primers they used to validate the presumed inversion event would amplify the SSC no matter its orientation.

Table 3

Size comparison among nine cp genomes completely sequenced in the Asteraceae family.

Species	Accession Number	Genome size (bp)	LSC (bp)	SSC (bp)	IR (bp)
Parthenium argentatum	NC_013553	152803	84593	18842	24684
Latuca sativa cv. salinas	DQ383816	152772	84105	18599	25034
Cynara cardunculus var. scolymus	KM035764	152529	83578	18641	25155
Guizotia abyssinica	EU549769	151762	83535	18227	24999
Helianthus annuus	NC007977	151104	83530	18308	24633
Artemisia frigida	NC_020607	151076	82740	18394	24971
Chrysanthemum x morifolium	JQ362483	151033	82780	18347	24953
Ageratina adenophora	NC_015621	150698	84829	18359	23755
Jacobaea vulgaris	HQ234669	150689	82855	18276	24779

Species are ordered by genome size.

LSC: Large Single-Copy

SSC: Small Single-Copy

IR: Inverted Repeat

Species are ordered by genome size. LSC: Large Single-Copy SSC: Small Single-Copy IR: Inverted Repeat A multiple sequence alignment (MSA) was performed among all nine Asteraceae cp genomes sequenced to date, and served as a basis for investigating similarity levels (Fig. 3). In accordance with other angiosperms, the IRs and the coding regions are more conserved than the single-copy and non-coding regions, respectively. The IR regions of cp genome are much conserved in land plants compared to the single copy regions, mainly due to the presence of the rRNA gene group [47]. They only differ in length due to their contraction and expansion at the junction of LSC and SSC. This represents the main cause for size variation in the cp genomes [58,59].

Fig 3

Visualization of alignment of nine Asteraceae cp genome sequences.

Visualization of alignment of nine Asteraceae cp genome sequences.

VISTA-based identity plot showing sequence identity among eight cp genomes already published (see Materials and Methods for accession numbers) and the artichoke cp genome, set as a reference. Sequence identity is shown as a percentage between 50–100% on y-axis. On x-axis, artichoke genes are indicated on top lines, and arrows represent their orientation. Genome regions are distinguished by colors. CNS: conserved non-coding sequences. The IR-LSC/SSC borders with full annotations for the adjacent genes were compared across the nine sequenced Asteraceae cp genomes (Fig. 4). In this comparison, it was necessary to adjust sequence annotations for J. vulgaris, Ag. adenophora, H. annuus, G. abyssinica and P. argentatum, so that all sequences started from the first nucleotide after IRa. At the LSC/IRb border, the IRb expanded by 60 bp towards the rps19 gene in C. cardunculus, L. sativa and Ar. frigida, by 41 bp in J. vulgaris and by 101 bp in H. annuus. The same IR expanded by 567 bp in the ycf1 gene at the IRb/SSC border, both in C. cardunculus and J. vulgaris. At this position, the smallest and biggest expansions occur in Ag. adenophora (468 bp) and H. annuus (576 bp), respectively. In seven out of nine species, the complete ycf1 gene spans across IRb and SSC and appears as a pseudogene in the IRa region. The contrary happens in Ar. frigida and its inverted SSC [57]; in P. argentatum, the ycf1 gene is entirely located in the SSC region.

Fig 4

Comparison of the border positions of LSC, SSC, and IR regions among nine sequenced Asteraceae chloroplast genomes.

Genes are indicated in boxes and their extensions in the corresponding regions are displayed above boxes.

Comparison of the border positions of LSC, SSC, and IR regions among nine sequenced Asteraceae chloroplast genomes.

Genes are indicated in boxes and their extensions in the corresponding regions are displayed above boxes. The ndhF gene in C. cardunculus overlaps the SSC/IRa border by 17 bp, revealing an expansion of the IR compared to the other Asteraceae cp genomes sequenced so far. In this way, 17 bp at the 3’ end of ndhF gene are overlapping with ycf1 gene at the IRb/SSC border and with ycf1 pseudogene at the SSC/IRa border. In all other eight species, the same ndhF gene is entirely located in the SSC region varying only in distance from the SSC/IRa border. This distance is only 4 bp and 5 bp in L. sativa and J. vulgaris, respectively, whereas in P. argentatum it is 1,018 bp. In Ar. frigida the same gene is 76 bp distant from the IRb/SSC border, because of its inverted SSC region.

Highly informative regions and barcoding perspectives

Based on MSA of the nine Asteraceae cp genomes (Fig. 3), we focused on coding regions and retrieved the most promising sequences suitable for the development of reliable molecular markers in the Asteraceae family. The intergenic sequences may not be appropriate for phylogenetic analyses at the family level due to their high variation and a lack of high quality alignments [60] and thus should rather be used at a lower taxonomic rank. After aligning separately the selected coding regions and investigating the parsimony-informative ratios, we analyzed the most divergent regions (Table 4). Ycf1 and rps16 displayed the highest percentage of parsimony-informative characters (8.6% and 6.1% respectively), while the other coding regions analyzed (ccsA, rbcL, ndhA, ndhF, matK, clpP, accD, petD, petB and rpoC1) showed an interesting parsimony-informative ratio ranging from 3.9% to 5.4%. With this analysis, we confirmed the informative values for well-known regions previously adopted for the Asteraceae family (i.e. rbcL, rps16, ndhF, matK), or recently observed, such as rpoC1, ycf1 and clpP genes [60,52]. Moreover, the genes accD, ccsA, ndhA, petB and petD were identified in this work as highly parsimony-informative regions and thus can be considered in future phylogenetic studies in this family. Ycf1, clpP and accD are essential genes for cell survival and plant development in some taxa, but not in others [61,62,63]. Rps16, the gene coding for the ribosomal protein S16, appears non-functional or lost in several plant lineages, e.g. Medicago truncatula, Phaseolus vulgaris, V. radiata and the Populus genus [64]. Due to their pivotal role, these genes can be substituted by nuclear-encoded versions when the cp forms are not functional or lacking [65].

Table 4

Coding regions and their parsimony-informative rate.

No.	Region	Length	Aligned length	Conserved sites	No. Pars. uninf.	Pars. inf.	Pars.inf. %	C.I.	R.I.
1	accD	1530	1610	1414	130	66	4.10	0.91	0.79
2	ccsA	969	975	859	63	53	5.44	0.88	0.79
3	cemA	690	690	632	40	18	2.61	0.94	0.85
4	clpP	2020	2127	1894	144	89	4.18	0.9	0.81
5	matK	1521	1527	1342	119	66	4.32	0.94	0.88
6	ndhA	2152	2307	2015	190	102	4.42	0.90	0.77
7	ndhI	501	501	470	18	13	2.59	0.92	0.87
8	ndhK	678	678	592	75	11	1.62	0.95	0.67
9	petB	1413	1444	1285	103	56	3.88	0.88	0.74
10	petD	1188	1269	1160	58	51	4.02	0.92	0.88
11	rbcL	1434	1458	1340	52	66	4.53	0.78	0.67
12	rpoA	1008	1014	896	93	25	2.47	0.92	0.72
13	rpoC1	2802	2918	2030	778	110	3.77	0.94	0.66
14	rpoC2	4158	4176	3763	288	125	2.99	0.92	0.79
15	rps16	1109	1238	1034	129	75	6.06	0.92	0.82
16	ycf1	5304	5568	3585	1505	478	8.58	0.89	0.60
17	ndhD	1503	1539	1420	79	40	2.60	0.85	0.64
18	ndhF	2250	2260	1983	173	104	4.60	0.89	0.78
19	rpoB	3183	3183	2965	140	76	2.39	0.91	0.82

Length: refers to sequence length in Cynara cardunculus var. scolymus

Aligned length: refers to the alignment of nine Asteraceae considered in the comparative analysis (see Materials and Methods)

Pars.: parsimony

Uninf. uninformative

Inf.: informative

C.I.: consistency index

R.I.: retention index

Length: refers to sequence length in Cynara cardunculus var. scolymus Aligned length: refers to the alignment of nine Asteraceae considered in the comparative analysis (see Materials and Methods) Pars.: parsimony Uninf. uninformative Inf.: informative C.I.: consistency index R.I.: retention index The presence of intronic sequences in both ndhA and rps16 genes contributes to the divergence at these two loci. MatK gene has been shown to have a high evolutionary rate and a suitable length for barcoding applications. RbcL is a good candidate for DNA barcoding in plants at the family and genus level too, since it can be easily amplified and sequenced in most land plants. Nevertheless, it shows a slow evolutionary rate and a lower divergence compared to the other plastid genes in flowering plants [66]. Thanks to their complementary features, matK and rbcL have been recommended by the Consortium for the Barcode of Life (CBOL) Plant Working Group in combination as multi-locus DNA barcodes in plants [20]. In order to propose possible barcode regions for the Asteraceae family, we focused on eight of the genes described above, which displayed a rate of informativity above 4%: ycf1, rps16, ccsA, rbcL, ndhA, matK, clpP and accD (Table 4). Based on MSA among the nine species completely sequenced, we designed “universal” primer pairs (S5 Table) which can be used in the whole Asteraceae family. In order to test their efficiency, we amplified a group of species (L. serriola, Matricaria chamomilla, Gerbera hybrida, Cr. x morifolium, H. annuus and C. cardunculus). These species are representatives of the four major Asteraceae subfamilies (Asteroideae, Cichorioideae, Carduoideae and Multisioideae) which are estimated to include 99% of the Asteraceae species [1]. We obtained 100% successful amplifications (S1 Fig.) with specific products of the expected sizes, suggesting that these primer pairs can be useful for species barcoding within the Asteraceae family.

Phylogenetic relationships within Asteraceae

Asteraceae is one of the largest families in the plant kingdom. Several studies have analyzed the phylogenetic relationships in this family based on cp sequences. One of the most comprehensive analyses included 108 taxa and was based on ten cp regions, seven of which were coding genes, and the remaining ones noncoding sequences [10]. However, this study did not involve the genus Cynara and most of the species for which cp genome has been completely sequenced. Therefore, in order to place new species in the Asteraceae metatree, we selected 60 taxa from that work, belonging to the main Asteraceae subfamilies, and added the nine completely sequenced cp genomes, including C. cardunculus. For this purpose, we retrieved six intronless genes (matk, ndhD, ndhF, ndhI, rbcL, rpoB) and the first exon of rpoC1. Gene sequences for each single taxon were concatenated and then aligned. Total alignment was 13,875 bp in length, comprising 10,491 constant characters, 1,573 singleton characters and 1,811 parsimony-informative characters. Maximum Parsimony and ML analyses were performed using Acicarpha spatulata as outgroup. The two trees obtained displayed comparable topologies and only slightly better bootstrap values were obtained with ML method. With MP analysis, a phylogenetic tree of 7,380 total length was obtained (Fig. 5), whereas ML delivered a tree with a sum of branch lengths of SBL = 0.7888 (S2 Fig.). The consensus MP tree, displaying bootstrap values higher than 70% in almost all nodes, was highly comparable with the MP tree obtained by Panero and Funk [10], even though in our analysis we did not include the non-coding cp regions trnL-trnF, 23S-trnA, and trnK partial intron. Moreover, we added the species with newly sequenced cp genomes, placing them in the Asteraceae phylogenetic tree.

Fig 5

Phylogenetic tree based on maximum parsimony of 69 accessions belonging to the Asteraceae family.

Phylogenetic tree based on maximum parsimony of 69 accessions belonging to the Asteraceae family.

Seven coding regions were used: matk, ndhD, ndhF, ndhI, rbcL, rpoB and the first exon of rpoC1, for a total of 1,811 parsimony-informative characters. Sequences from C. cardunculus were obtained from this work. Bootstrap values for each node were set greater than 50%. Species for which the complete cp genome is available are shaded. Within the Asteroideae subfamily, our MP tree showed Cr. x morifolium grouping with Ar. frigida within Anthemideae tribe. As expected, Jacobea vulgaris clustered in the Senecioneae tribe. However, the relationship between this tribe and other tribes of Asteroideae was not solved. This result is in agreement with those obtained by Panero and Funk [10] who found the position of Senecioneae equivocal. Helianthus annuus, P. argentatum, G. abyssinica, and Ag. adenophora grouped in the Heliantheae alliance. The Asteroideae subfamily is sister to Cichorioideae subfamily, including L. sativa in the Cichorieae tribe. Both Asteroideae and Cichorioideae are related to the Carduoideae subfamily; here the previous tricotomy among Tarchonantheae, Dicomeae, and Cynareae [10] was solved by grouping Dicomeae and Tarchonantheae tribes, although with low bootstrap support (58% and 64% in MP and ML trees, respectively). Taxa within the Cynareae tribe form two groups. The first one is composed of Carthamus tinctorius and Centaurea melitensis, and, to a higher level, C. cardunculus. The second group includes Atractylis cancellata and Echinops ritro. The phylogenetic results on Cynareae tribe are consistent with those observed in more detailed studies specific to Cardueae (Cynareae) based on morphological [67] and molecular evidence [11], although a lower number of species and genera was considered in our analysis.

Conclusions

The C. cardunculus chloroplast genome represents the first complete sequence from the large Carduoideae subfamily, within the widespread family of Asteraceae. The comparison with the eight other Asteraceae complete genomes sequenced so far demonstrated that the artichoke cp genome is well conserved in gene content and order but that it also features a relevant number of simple sequence repeats, which could be further explored for population studies within Cynara genus. The most parsimony-informative regions identified in this study are of potential interest for future phylogenetic studies of the Asteraceae and may serve as a solid resource for barcoding applications.

Test results for the eight barcoding-candidate primer pairs designed for selected parsimony-informative regions.

1: Cr. x morifolium; 2: M. chamomilla; 3: Gerbera hybrida; 4: L. serriola; 5: H. annuus; 6: C. cardunculus. Left lane in each gel: 100 bp DNA ladder. (TIF) Click here for additional data file.

Phylogenetic tree based on maximum likelihood of 69 accessions belonging to the Asteraceae family.

Seven coding regions were used: matk, ndhD, ndhF, ndhI, rbcL, rpoB and the first exon of rpoC1. The analysis was performed using RaxML Blackbox with the Gamma model of rate heterogeneity. Bootstrap support values were set greater than 50%. (TIF) Click here for additional data file.

Primer pairs used for BAC identification and junction validation.

The name of primers refer to the cp regions amplified. (DOCX) Click here for additional data file.

Codon-anticodon recognition pattern and codon usage for Cynara cardunculus var. scolymus chloroplast genome.

(DOCX) Click here for additional data file.

Direct and palindromic repeats in C. cardunculus var. scolymus cp genome.

(DOCX) Click here for additional data file.

SSRs found in artichoke cp genome.

(DOCX) Click here for additional data file.

Barcoding primer pair candidates.

Sequences were designed on the most parsimony-informative coding regions revealed by the multialignment of the nine Asteraceae complete cp genomes. (DOCX) Click here for additional data file.

57 in total

1. A comparative analysis of the Lactuca and Helianthus (Asteraceae) plastid genomes: identification of divergent regions and categorization of shared repeats.

Authors: Ruth E Timme; Jennifer V Kuehl; Jeffrey L Boore; Robert K Jansen
Journal: Am J Bot Date: 2007-03 Impact factor: 3.844

2. IMEx: Imperfect Microsatellite Extractor.

Authors: Suresh B Mudunuri; Hampapathalu A Nagarajaram
Journal: Bioinformatics Date: 2007-03-22 Impact factor: 6.937

3. A rapid bootstrap algorithm for the RAxML Web servers.

Authors: Alexandros Stamatakis; Paul Hoover; Jacques Rougemont
Journal: Syst Biol Date: 2008-10 Impact factor: 15.683

4. The value of sampling anomalous taxa in phylogenetic studies: major clades of the Asteraceae revealed.

Authors: Jose L Panero; V A Funk
Journal: Mol Phylogenet Evol Date: 2008-03-28 Impact factor: 4.286

5. Antioxidative and protective properties of extracts from leaves of the artichoke (Cynara scolymus L.) against hydroperoxide-induced oxidative stress in cultured rat hepatocytes.

Authors: R Gebhardt
Journal: Toxicol Appl Pharmacol Date: 1997-06 Impact factor: 4.219

6. VISTA: computational tools for comparative genomics.

Authors: Kelly A Frazer; Lior Pachter; Alexander Poliakov; Edward M Rubin; Inna Dubchak
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

7. Primer3--new capabilities and interfaces.

Authors: Andreas Untergasser; Ioana Cutcutache; Triinu Koressaar; Jian Ye; Brant C Faircloth; Maido Remm; Steven G Rozen
Journal: Nucleic Acids Res Date: 2012-06-22 Impact factor: 16.971

8. Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome sequences: effects of taxon sampling and phylogenetic methods on resolving relationships among rosids.

Authors: Robert K Jansen; Charalambos Kaittanis; Christopher Saski; Seung-Bum Lee; Jeffrey Tomkins; Andrew J Alverson; Henry Daniell
Journal: BMC Evol Biol Date: 2006-04-09 Impact factor: 3.260

9. ACUA: a software tool for automated codon usage analysis.

Authors: Umashankar Vetrivel; Vijayakumar Arunkumar; Sudarsanam Dorairaj
Journal: Bioinformation Date: 2007-10-06

10. Comparative chloroplast genomes of camellia species.

Authors: Jun-Bo Yang; Shi-Xiong Yang; Hong-Tao Li; Jing Yang; De-Zhu Li
Journal: PLoS One Date: 2013-08-23 Impact factor: 3.240

49 in total

1. The newly developed single nucleotide polymorphism (SNP) markers for a potentially medicinal plant, Crepidiastrum denticulatum (Asteraceae), inferred from complete chloroplast genome data.

Authors: Hoang Dang Khoa Do; Joonhyung Jung; JongYoung Hyun; Seok Jeong Yoon; Chaejin Lim; Keedon Park; Joo-Hwan Kim
Journal: Mol Biol Rep Date: 2019-04-12 Impact factor: 2.316

2. Sequence Characteristics and Phylogenetic Analysis of the Artemisia argyi Chloroplast Genome.

Authors: Changjie Chen; Yuhuan Miao; Dandan Luo; Jinxin Li; Zixin Wang; Ming Luo; Tingting Zhao; Dahui Liu
Journal: Front Plant Sci Date: 2022-06-20 Impact factor: 6.627

3. Comparative analyses of chloroplast genomes from 13 Lagerstroemia (Lythraceae) species: identification of highly divergent regions and inference of phylogenetic relationships.

Authors: Gang Zheng; Lingling Wei; Li Ma; Zhiqiang Wu; Cuihua Gu; Kai Chen
Journal: Plant Mol Biol Date: 2020-01-29 Impact factor: 4.076

4. Genetic diversity and population structure of Cynara cardunculus L. in southern Portugal.

Authors: Maria Miguel Castro; Daniela Rosa; Ana M Ferro; Ana Faustino; Ana Paulino; Teresa Brás; Eliana Machado; Carla Pinto Cruz; Anabela D F Belo; Paula Nozes; João Portugal; Sofia Ramôa; Diogo Mendonça; Fernanda Simões; Maria F Duarte; Liliana Marum
Journal: PLoS One Date: 2021-06-09 Impact factor: 3.240

5. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny.

Authors: Davide Scaglione; Sebastian Reyes-Chin-Wo; Alberto Acquadro; Lutz Froenicke; Ezio Portis; Christopher Beitel; Matteo Tirone; Rosario Mauro; Antonino Lo Monaco; Giovanni Mauromicale; Primetta Faccioli; Luigi Cattivelli; Loren Rieseberg; Richard Michelmore; Sergio Lanteri
Journal: Sci Rep Date: 2016-01-20 Impact factor: 4.379

6. Complete Chloroplast Genome Sequence of Aquilaria sinensis (Lour.) Gilg and Evolution Analysis within the Malvales Order.

Authors: Ying Wang; Di-Feng Zhan; Xian Jia; Wen-Li Mei; Hao-Fu Dai; Xiong-Ting Chen; Shi-Qing Peng
Journal: Front Plant Sci Date: 2016-03-08 Impact factor: 5.753

7. The Complete Chloroplast Genome of Ye-Xing-Ba (Scrophularia dentata; Scrophulariaceae), an Alpine Tibetan Herb.

Authors: Lianghong Ni; Zhili Zhao; Gaawe Dorje; Mi Ma
Journal: PLoS One Date: 2016-07-08 Impact factor: 3.240

8. The Chloroplast Genome of Utricularia reniformis Sheds Light on the Evolution of the ndh Gene Complex of Terrestrial Carnivorous Plants from the Lentibulariaceae Family.

Authors: Saura R Silva; Yani C A Diaz; Helen Alves Penha; Daniel G Pinheiro; Camila C Fernandes; Vitor F O Miranda; Todd P Michael; Alessandro M Varani
Journal: PLoS One Date: 2016-10-20 Impact factor: 3.240

9. Chloroplast Genome Sequence of Pigeonpea (Cajanus cajan (L.) Millspaugh) and Cajanus scarabaeoides (L.) Thouars: Genome Organization and Comparison with Other Legumes.

Authors: Tanvi Kaila; Pavan K Chaduvla; Swati Saxena; Kaushlendra Bahadur; Santosh J Gahukar; Ashok Chaudhury; T R Sharma; N K Singh; Kishor Gaikwad
Journal: Front Plant Sci Date: 2016-12-09 Impact factor: 5.753

10. The Complete Plastid Genome of Lagerstroemia fauriei and Loss of rpl2 Intron from Lagerstroemia (Lythraceae).

Authors: Cuihua Gu; Luke R Tembrock; Nels G Johnson; Mark P Simmons; Zhiqiang Wu
Journal: PLoS One Date: 2016-03-07 Impact factor: 3.240