
SNP discovery in complex allotetraploid genomes (Gossypium spp., Malvaceae) using genotyping by sequencing.

Carla Jo Logan-Young¹, John Z Yu², Surender K Verma¹, Richard G Percy², Alan E Pepper¹.

Abstract

PREMISE OF THE STUDY: Single-nucleotide polymorphism (SNP) marker discovery in plants with complex allotetraploid genomes is often confounded by the presence of homeologous loci (along with paralogous and orthologous loci). Here we present a strategy to filter for SNPs representing orthologous loci.
METHODS AND RESULTS: Using Illumina next-generation sequencing, 54 million reads were collected from restriction enzyme-digested DNA libraries of a diversity of Gossypium taxa. Loci with one to three SNPs were discovered using the Stacks software package, yielding 25,529 new cotton SNP combinations, including those that are polymorphic at both interspecific and intraspecific levels. Frequencies of predicted dual-homozygous (aa/bb) marker polymorphisms ranged from 6.7–11.6% of total shared fragments in intraspecific comparisons and from 15.0–16.4% in interspecific comparisons.
CONCLUSIONS: This resource provides dual-homozygous (aa/bb) marker polymorphisms. Both in silico and experimental validation efforts demonstrated that these markers are enriched for single orthologous loci that are homozygous for alternative alleles.


Keywords: Gossypium; genotyping by sequencing; interspecific; intraspecific; next-generation sequencing; polyploid; single-nucleotide polymorphisms

Year: 2015 PMID: 25798340 PMCID: PMC4356317 DOI: 10.3732/apps.1400077

Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936

Cottons (Gossypium L. spp.) provide the leading natural fiber for textiles, as well as an important seed product for feed, food, and oil (Campbell et al., 2010). The most widely grown species are the allotetraploids G. hirsutum L. (upland cotton) and G. barbadense L. (Sea Island cotton). These species are both descended from an allopolyploidization event involving an A-genome diploid species, related to modern G. herbaceum L. and G. arboreum L., and a D-genome diploid species, related to modern G. raimondii Ulbr. (Percival and Kohel, 1990).

Recent developments in next-generation sequencing (NGS) technology have lowered the cost of sequencing per base, and enabled the genotyping by sequencing (GBS) approach for developing informative single-nucleotide polymorphism (SNP) markers in species with large, complex genomes (Elshire et al., 2011), including species without a reference genome sequence (Glaubitz et al., 2014). In this study, we employed a simple and cost-effective GBS approach to identify intraspecific and interspecific SNPs within and between allotetraploid cottons G. hirsutum and G. barbadense.

A major difficulty in the characterization and utilization of SNPs in polyploid species is determining whether a polymorphism detected by short-read NGS is the result of alternative alleles at a single locus or the presence of multiple homeologous loci. To identify markers that were likely to represent alternative alleles at a single orthologous locus, we used the Stacks bioinformatics pipeline (Catchen et al., 2011) as a filter to enrich for codominant markers composed of pairs of alleles that were homozygous in the respective taxa used for comparison (Fig. 1). Given high enough sequence coverage to accurately identify all relevant alleles and loci, the sstacks algorithm implemented by Stacks assigns an aa/bb marker type in this situation (Scenario 1).
In contrast, detection of polymorphisms between homeologous loci will likely give rise to an ab/ab marker type prediction (Scenario 2). Detection of a polymorphism between paralogs within a subgenome will give rise to an aa/ab marker type (Scenario 3). Markers that are homozygous in one parent and heterozygous in the other parent will also give rise to an aa/ab marker type prediction (Scenario 4). A host of other combinations will give rise to either ab/ab or aa/ab patterns (not shown).

Fig. 1.

Predicted marker type categories from the sstacks algorithm for four common genetic scenarios (out of many) that give rise to apparent GBS polymorphisms between two allotetraploids. Red lines indicate sequences that can be clearly assigned to the AT subgenome, and blue lines indicate those that can be assigned to the DT subgenome. Gray lines indicate regions of high sequence similarity between homeologs or paralogs (e.g., no differences outside of the SNP of interest). Marker type predictions are based on the assumption that there is adequate sequence coverage to accurately score all alleles at all relevant loci.

Given the array of possible scenarios that can give rise to candidate SNPs using GBS, we focused our efforts on markers with a simple aa/bb biallelic pattern. We considered these to be the best candidates for single loci with codominant polymorphisms between cotton accessions that would be useful for downstream applications such as genetic diversity studies, linkage and QTL mapping, genome-wide association studies (GWAS), and marker-assisted selection.
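The marker type labels used above can be illustrated with a small sketch. This is not the sstacks algorithm; it is a simplified classifier, written under the assumption that each sample is summarized by the set of alleles observed at a shared locus. The function name `marker_type` and the extra labels "aa/aa" and "other" are illustrative choices that do not come from the study.

```python
def marker_type(alleles_1, alleles_2):
    """Label a shared locus from the alleles seen in two samples.

    Simplified illustration only; Stacks applies its own statistical
    model and coverage thresholds before assigning marker types.
    """
    a1, a2 = frozenset(alleles_1), frozenset(alleles_2)
    if not a1 or not a2:
        raise ValueError("each sample needs at least one observed allele")
    if len(a1) == 1 and len(a2) == 1:
        # One allele per sample: monomorphic, or dual homozygous (Scenario 1).
        return "aa/aa" if a1 == a2 else "aa/bb"
    if len(a1) == 2 and a1 == a2:
        # Both samples show the same two alleles (e.g., Scenario 2).
        return "ab/ab"
    small, large = sorted((a1, a2), key=len)
    if len(small) == 1 and len(large) == 2 and small < large:
        # One sample shows one allele, the other shows it plus a second
        # allele (Scenarios 3 and 4).
        return "aa/ab"
    return "other"


print(marker_type({"A"}, {"G"}))
print(marker_type({"A", "G"}, {"A", "G"}))
print(marker_type({"A"}, {"A", "G"}))
```

Only loci labeled aa/bb would pass the filter described in the text; every other label would be set aside.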

METHODS AND RESULTS

GBS was performed by a method similar to the general strategy outlined previously (Elshire et al., 2011), with major differences. Genomic DNAs from cotton taxa (Table 1) were extracted from liquid N2 flash-frozen seedlings using a protocol described previously (Pepper and Norwood, 2001) with the addition of 1/10th volume of Plant RNA Isolation Aid (Ambion, Austin, Texas, USA) to the initial extraction buffer. Genomic DNA (250 ng) was digested with 10 units of restriction endonucleases HinP1I or BsrGI for 2 h at 37°C. HinP1I is a CpG methylation-sensitive four-base cutter (G|CGC), while BsrGI is a methylation-insensitive six-base cutter (T|GTACA). Adapter ligations were carried out in 50-μL reactions in the presence of 40 pmol of the restriction-enzyme appropriate combination of annealed, 6-bp bar-coded, Illumina-compatible P5 and P7 adapters (Appendix 1) and 1600 cohesive-end ligation units of T4 DNA ligase (New England Biolabs, Ipswich, Massachusetts, USA) for 1 h at 37°C. Because active restriction enzymes were still present in these reactions, a final incubation for 30 min at 37°C was performed to cleave any chimeric ligation products between genomic DNA fragments (the adapters were designed to abolish the HinP1I or BsrGI recognition site upon ligation). All samples were pooled and then purified using MinElute columns (QIAGEN, Valencia, California, USA). Single-copy regions of the plastid genome can be present in >1500 copies per cell in leaf tissue (Zoschke et al., 2007) and may thus represent a major contaminant of GBS libraries. In silico restriction digestion of the G. hirsutum plastid genome (Lee et al., 2006) showed that no fragments were present between 186 bp and 218 bp in HinP1I digests and between 144 bp and 300 bp in BsrGI digests. 
To exclude fragments originating from the plastid genome, and to achieve complexity reduction, size selection (190–210 bp for HinP1I and 160–290 bp for BsrGI) was performed using 2.5% agarose gels stained with Gel Green (Biotium, Hayward, California, USA). Size-selected fractions were treated with NEBNext Fill-in and ssDNA Isolation Module (New England Biolabs) to obtain single-stranded biotinylated fragments to use as template for PCR amplification with Illumina-compatible primers PCR 1.01 and PCR 2.01 (Appendix 1). PCR cycling conditions consisted of 98°C for 30 s followed by 20–30 cycles of 98°C for 12 s, 65°C for 30 s, 72°C for 30 s, with a final extension time of 1 min at 72°C using Phusion polymerase (New England Biolabs). Because of the very narrow range of fragment sizes that were extracted in the size selection step, 20 cycles of PCR were required for amplification of BsrGI libraries and 30 cycles were required for HinP1I libraries. The samples were purified using Agencourt AMPure XP beads (Beckman Coulter, Brea, California, USA), then quantified using the AccuBlue High Sensitivity Quantitation Kit (Biotium) on the VICTOR X3 Multilabel Plate Reader (PerkinElmer, Akron, Ohio, USA). Samples were diluted to 10 nM, then provided to the Texas A&M AgriLife Genomics & Bioinformatics Services for sequencing on the Illumina GAII and/or HiSeq 2000 instrument (Illumina, San Diego, California, USA).
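The in silico digestion used to choose the size-selection windows can be sketched as follows. This is a minimal illustration, assuming a linear sequence (the plastid genome is circular, so a real analysis would also join the first and last fragments) and a toy input string in place of the G. hirsutum plastid genome. The recognition sites and cut positions are those given in the text; the names `ENZYMES`, `digest`, and `in_window` are hypothetical.

```python
import re

# Recognition site and cut offset within the site, as stated in the text:
# HinP1I cuts G|CGC and BsrGI cuts T|GTACA.
ENZYMES = {"HinP1I": ("GCGC", 1), "BsrGI": ("TGTACA", 1)}


def digest(seq, enzyme):
    """Return fragment lengths from a complete digest of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # A lookahead pattern finds overlapping occurrences of the site.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def in_window(lengths, low, high):
    """Keep fragment lengths that fall inside a size-selection window."""
    return [n for n in lengths if low <= n <= high]


toy = "AAAGCGCTTTTTGCGCAA"  # toy sequence, not real plastid DNA
fragments = digest(toy, "HinP1I")
print(fragments)
print(in_window(fragments, 5, 9))
```

Applied to a real plastid sequence, an empty result from `in_window` for a candidate window would correspond to the plastid-free size ranges reported above.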

Table 1.

Seed sources, taxonomy, and preliminary GBS statistics for a set of diploid (A1-27, D5-1) and allotetraploid cottons.

Year	Scientific name	Name or designation	PI no.	Origin	BsrGI sorted sequences	BsrGI unique stacks	HinP1I sorted sequences	HinP1I unique stacks
2003	G. herbaceum L.	A₁-27	PI 408785	Peru	1,678,012	12,638	1,743,678	43,883
1989	G. raimondii Ulbr.	D₅-1	PI 530898	Ecuador	949,301	4215	15,323,076	8932
1984	G. barbadense L.	K-56	PI 274514	Sinchao Chico, Piura, Peru	3,995,793	16,668	3,440,581	17,332
2005	G. hirsutum L.	TM-1	PI 607172	College Station, Texas, USA	2,752,301	22,862	2,609,330	10,792
2002	G. barbadense L.	Pima 3-79		Sacaton, Arizona, USA	2,756,143	23,077	1,816,093	9059
2005	G. hirsutum L.	TX-231	PI 163725	Zacapa, Zacapa, Guatemala	318,042	7895	592,566	1739

Note: PI no. = Plant Introduction number, National Plant Germplasm System.

A total of ca. 54 million raw 76-bp paired-end reads were imported into the Geneious bioinformatics package (Drummond et al., 2012) to trim for quality (P = 0.05). The length of all fragments was trimmed to 66 bp for analysis using Stacks ver. 0.998 (Catchen et al., 2011). Sequences were preprocessed using the “process_radtags” script, in which the 6-bp barcodes were sorted and removed, yielding 60-bp fragments. The process_radtags options included: -c (clean data) and -q (discard low quality reads). Barcode-sorted FASTQ files were processed in pairwise combinations using the “denovo_map.pl” script. The denovo_map.pl parameters included: -n 3 (mismatches allowed between loci during catalog building), -m 3 (minimum number of identical, raw reads required to create a stack), -M 2 (number of mismatches allowed between loci when processing a single individual), and -t (remove, or break up, highly repetitive RAD-Tags during ustacks). Output data files were loaded to MySQL tables (Oracle Corporation, Redwood Shores, California, USA), and SNPs between taxa were annotated in a pairwise manner. We obtained an average of ca. 2 million raw sequences and ca. 15,000 unique ‘stacks’ (loci) for each sample (Table 1).

To examine the partitioning of GBS markers into the gene-rich component of the genome, we used BLASTN (Altschul et al., 1990) to search the 22,862 unique BsrGI stacks and 10,792 unique HinP1I stacks from G. hirsutum TM-1 against predicted coding DNA sequences (CDSs) and transcripts from the diploid G. raimondii (Paterson et al., 2012; Wang et al., 2012) using a significance threshold of <1e-6. A very large proportion of HinP1I fragments (∼35–50%) showed some degree of sequence similarity to transcribed regions of the G. raimondii genome (Table 2). 
This proportion was far lower for BsrGI fragments (∼3–9%). Because HinP1I is methylation sensitive, the HinP1I libraries may be enriched in transcribed regions, which are hypo-methylated in plant genomes (Feng et al., 2010). We also examined our GBS libraries for the presence of repetitive DNA using BLAST searches of all BsrGI and HinP1I stacks against the TIGR Brassicaceae Repeat Database ver2_0_0 (Ouyang and Buell, 2004). We considered this database to be an appropriate proxy for cotton repetitive sequences because cotton (Malvaceae) and the Brassicaceae are sister taxa (Soltis et al., 2000). The presence of fragments with similarity to known repetitive elements was extremely low (<2%) in both BsrGI and HinP1I libraries. Organellar DNA contamination was examined using BLAST searches to the G. hirsutum plastid and mitochondrial genomes (Lee et al., 2006; Liu et al., 2013). The presence of fragments with similarity to the plastid and mitochondrial genomes was low in both libraries (Table 2).
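The preprocessing steps described above (trimming to 66 bp, sorting on the 6-bp barcode, and requiring at least three identical reads per stack) can be sketched in a few lines. This is a simplified illustration and not a reimplementation of process_radtags or ustacks: it does no quality filtering or barcode rescue and does not merge stacks that differ by mismatches. The barcode sequences and function names are hypothetical.

```python
from collections import Counter, defaultdict

BARCODE_LEN = 6   # barcode length stated in the text
READ_LEN = 66     # trimmed read length stated in the text


def demultiplex(reads, barcodes):
    """Group reads by leading barcode and strip it, leaving 60-bp fragments."""
    by_sample = defaultdict(list)
    for read in reads:
        read = read[:READ_LEN]
        if len(read) < READ_LEN:
            continue  # discard reads shorter than the trimmed length
        sample = barcodes.get(read[:BARCODE_LEN])
        if sample is not None:
            by_sample[sample].append(read[BARCODE_LEN:])
    return dict(by_sample)


def exact_stacks(fragments, min_depth=3):
    """Keep sequences seen at least min_depth times (compare the -m 3 option)."""
    counts = Counter(fragments)
    return {seq: n for seq, n in counts.items() if n >= min_depth}


# Hypothetical barcodes; the real adapter sequences are listed in Appendix 1.
barcodes = {"ACGTAC": "TM-1", "TTGCAA": "Pima 3-79"}
reads = ["ACGTAC" + "A" * 60] * 3 + ["TTGCAA" + "C" * 60] * 2
samples = demultiplex(reads, barcodes)
for name, fragments in samples.items():
    print(name, len(fragments), len(exact_stacks(fragments)))
```

In this toy input the TM-1 sequence reaches the depth threshold and forms one stack, while the Pima 3-79 sequence, seen only twice, does not.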

Table 2.

BLASTN results (significance value <1e-6) using total stacks from Gossypium hirsutum cv. TM-1 or selected aa/bb markers across all taxa as subject.

Database	BsrGI TM-1	BsrGI aa/bb	HinP1I TM-1	HinP1I aa/bb
JGI CDS	1059 (4.60%)	109 (3.10%)	3822 (35.40%)	343 (38.20%)
JGI transcript	2018 (8.80%)	245 (7.10%)	4943 (45.80%)	448 (49.90%)
BGI CDS	1941 (8.50%)	341 (9.80%)	3767 (34.90%)	325 (36.20%)
Brassicaceae repeats	102 (0.45%)	0 (0.00%)	170 (1.50%)	5 (0.56%)
Plastid	6 (0.03%)	1 (0.03%)	171 (1.50%)	6 (0.67%)
Mitochondrial	17 (0.07%)	8 (0.23%)	843 (7.81%)	4 (0.45%)
Total	22,862	3474	10,792	897

Note: BGI = Beijing Genomics Institute; CDS = coding DNA sequences; JGI = Joint Genome Institute.

Results are searched against the databases indicated: JGI CDS and JGI transcript from G. raimondii (Paterson et al., 2012; Wang et al., 2012); Brassicaceae repeats from TIGR Brassicaceae Repeat Database ver2_0_0 (Ouyang and Buell, 2004); plastid from G. hirsutum plastid genome (Lee et al., 2006); and mitochondrial from G. hirsutum mitochondrial genome (Liu et al., 2013).
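Percentages of the kind shown in Table 2 can be derived from tabular BLAST output. The sketch below assumes the standard 12-column tabular format (BLAST+ `-outfmt 6`), in which the query ID is the first field and the e-value the eleventh; whether the original searches were exported this way is not stated, and the function names are illustrative.

```python
def queries_with_hits(blast_lines, max_evalue=1e-6):
    """Return the set of query IDs with at least one hit below max_evalue."""
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            hits.add(fields[0])
    return hits


def percent_with_hits(blast_lines, n_queries, max_evalue=1e-6):
    """Percentage of all queried stacks that have a significant hit."""
    return 100.0 * len(queries_with_hits(blast_lines, max_evalue)) / n_queries


# Toy records with made-up identifiers, for illustration only.
example = [
    "stack1\tcds_A\t100.0\t60\t0\t0\t1\t60\t11\t70\t1e-25\t111",
    "stack1\tcds_B\t95.0\t60\t3\t0\t1\t60\t5\t64\t1e-10\t80.5",
    "stack2\tcds_A\t88.0\t25\t3\t0\t1\t25\t40\t64\t0.01\t30.1",
]
print(percent_with_hits(example, n_queries=4))
```

A query is counted once regardless of how many subject sequences it matches, which is how a per-stack percentage needs to be tallied.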

For loci that were shared between any two taxa, polymorphisms were identified and categorized based on marker type predictions from Stacks. Across all taxa, totals of 18,073 and 5014 aa/bb pairwise SNP combinations were identified from the BsrGI and HinP1I libraries, respectively (Tables 3 and 4). Data for these marker sets have been submitted to the CottonGen SNP database (http://www.cottongen.org). To determine the extent of overlap between this SNP collection and those already available on the CottonGen database, a subset of 1475 BsrGI fragments and 551 HinP1I fragments from G. hirsutum cultivar TM-1 were searched against all 183,035 CottonGen SNPs using the batch BLAST tool (http://www.cottongen.org/tools/batch_blast; searched 3 January 2015). For BsrGI, only two fragments (0.13%) had 100% matches in the CottonGen SNP database, and only 19 fragments (1.3%) had matches to similar sequences in the database (with up to three mismatches). For HinP1I, only nine fragments (1.6%) had a 100% match in the CottonGen SNP database, and 27 fragments (4.9%) were similar up to three mismatches. Thus, the overlap between this and other cotton SNP collections was very low.

Table 3.

Numbers of BsrGI shared stacks (loci) and dual homozygous (aa/bb) marker loci across a set of intraspecific and interspecific combinations of Gossypium taxa.

Pairwise combination	Shared stacks	aa/bb Markers
A₁/D₅	413	216
Pima 3-79/A₁	3329	1538
Pima 3-79/D₅	2123	1057
Pima 3-79/K-56	10,623	859
Pima 3-79/TM-1	12,408	2040
Pima 3-79/TX-231	5322	1183
TM-1/A₁	3172	1550
TM-1/D₅	2003	1041
TM-1/K-56	8910	2171
TM-1/TX-231	5492	575
TX-231/A₁	2031	959
TX-231/D₅	1328	690
TX-231/K-56	4836	1512
K-56/A₁	3421	1616
K-56/D₅	2072	1066

Table 4.

Numbers of HinP1I shared stacks (loci) and dual homozygous (aa/bb) marker loci across a set of intraspecific and interspecific combinations of Gossypium taxa.

Pairwise combination	Shared stacks	aa/bb Markers
A₁/D₅	1201	502
Pima 3-79/A₁	2987	862
Pima 3-79/D₅	1387	351
Pima 3-79/K-56	931	62
Pima 3-79/TM-1	4921	740
Pima 3-79/TX-231	856	167
TM-1/A₁	3198	899
TM-1/D₅	1528	323
TM-1/K-56	906	182
TM-1/TX-231	961	111
TX-231/A₁	740	256
TX-231/D₅	444	119
TX-231/K-56	342	79
K-56/A₁	770	241
K-56/D₅	429	120

The proportion of aa/bb polymorphic loci to total (shared) loci was similar between BsrGI and HinP1I across all combinations of taxa (Fig. 2). In an intraspecific comparison within G. barbadense between cultivated variety Pima 3-79 and landrace K-56 (Peru), 6.7–8.1% of all markers showed aa/bb polymorphisms. In an intraspecific comparison within G. hirsutum between cultivated variety TM-1 and landrace TX-231 (Guatemala), 10.5–11.6% of markers showed aa/bb polymorphisms. An interspecific comparison between G. barbadense Pima 3-79 and G. hirsutum TM-1 showed the highest level of polymorphism (15–16.4%). These values correspond to approximate SNP frequencies of >0.0012–0.002 substitutions per base pair for intraspecific comparisons and >0.0028 substitutions per base pair for interspecific comparisons.
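The aa/bb proportions quoted here follow directly from the counts in Tables 3 and 4, as the sketch below shows. The per-base figure it computes is only a rough lower bound, assuming one SNP per aa/bb locus and 60 sequenced bases per fragment; the effective fragment length behind the published per-base estimates is not stated, so those values are not expected to be reproduced exactly. Function and variable names are illustrative.

```python
FRAGMENT_BP = 60  # fragment length after barcode removal, per the text

# (shared stacks, aa/bb markers) taken from Table 3 (BsrGI) and Table 4 (HinP1I)
COUNTS = {
    "Pima 3-79/K-56": {"BsrGI": (10623, 859), "HinP1I": (931, 62)},
    "TM-1/TX-231": {"BsrGI": (5492, 575), "HinP1I": (961, 111)},
    "Pima 3-79/TM-1": {"BsrGI": (12408, 2040), "HinP1I": (4921, 740)},
}


def aabb_percent(shared, aabb):
    """Percentage of shared loci scored as dual homozygous (aa/bb)."""
    return 100.0 * aabb / shared


def min_snps_per_bp(shared, aabb, fragment_bp=FRAGMENT_BP):
    """Lower-bound SNP density, assuming one SNP per aa/bb locus."""
    return aabb / (shared * fragment_bp)


for pair, by_enzyme in COUNTS.items():
    for enzyme, (shared, aabb) in by_enzyme.items():
        print(pair, enzyme,
              round(aabb_percent(shared, aabb), 1),
              round(min_snps_per_bp(shared, aabb), 4))
```

Because loci were allowed to carry one to three SNPs, the true density is at least the value returned by `min_snps_per_bp`.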

Fig. 2.

BsrGI and HinP1I GBS polymorphism in tetraploid Gossypium spp. The proportion of highly informative (aa/bb) markers relative to total shared loci (stacks) in intraspecific and interspecific pairwise comparisons is shown. 3-79 = G. barbadense cv. Pima 3-79; K-56 = G. barbadense accession K-56; TM-1 = G. hirsutum cv. TM-1; TX-231 = G. hirsutum accession TX-231.

To test the efficacy of selecting aa/bb markers to enrich for orthologous loci SNPs, we employed both in silico and experimental validation. The in silico validation made use of the available A and D diploid genome sequences by determining whether the predicted aa/bb dual homozygous markers had the expected evolutionary pattern for sequences from Scenario 1 (Fig. 1). To perform this analysis, a subset of G. hirsutum TM-1 vs. G. barbadense Pima 3-79 aa/bb markers was searched against the complete genome sequences of G. raimondii (D-genome diploid) (Paterson et al., 2012; Wang et al., 2012) and G. arboreum (A-genome diploid) (Li et al., 2014) using BLASTN at a significance threshold of <1e-6. For each aa/bb marker, the tetraploid sequences and positive hits from diploid genomes were aligned using the map-to-reference tool implemented by the Geneious bioinformatics package (Drummond et al., 2012). Alignments were constructed for 549 HinP1I and 1413 BsrGI markers. Minimum spanning distance (smallest number of mutational changes to transition from one sequence to another) was used to classify the alignments into one of five categories (Fig. 3). Dual-homozygous markers that were polymorphic between the cotton species at a single locus (Fig. 1, Scenario 1) were expected to give rise to the alignment pattern designated Category I. In contrast, polymorphisms between homeologs present in the AT and DT subgenomes (Fig. 1, Scenario 2) were expected to give an alignment similar to Category II (Fig. 3), in which the TM-1 and Pima 3-79 alleles were each most similar to fragments in one of the two diploid species. 
If a putative marker is actually a polymorphism between paralogs within a given subgenome, the expected alignment pattern would be exemplified by that shown for Category III (Fig. 3). For some markers, a likely subgenome identity could be discerned, but the marker showed unresolvable similarities to several paralogs within that subgenome (Category IV). This category may still manifest as aa/bb interspecific polymorphic markers, depending on the presence or absence of HinP1I or BsrGI flanking restriction sites. Finally, for some markers, the likely subgenome of the locus could not be determined (Category V) because of unresolvable similarities to fragments in both the AT and DT subgenomes. Again, these markers may still represent aa/bb markers, depending on flanking restriction sites.

For the alignments of HinP1I and BsrGI markers (Table 5), positive BLAST hits were identified in one or both diploid genomes for 99.5% of BsrGI fragments and 98.5% of HinP1I fragments. Alignments for 83.7% of BsrGI markers and 69.4% of HinP1I markers (79.7% of markers overall) had the Category I pattern that was expected of aa/bb dual-homozygous single-locus markers, while only 1.7% of BsrGI markers and 10.7% of HinP1I markers (4.2% overall) had patterns indicating that they represented polymorphisms between homeologous loci from the AT and DT subgenomes (Category II) and only 1.5% of markers had alignment patterns suggesting that the polymorphism was between two paralogs in a given subgenome (Category III). Given that markers in both Categories IV and V can also, in principle, give rise to the aa/bb marker type, depending on flanking restriction sites, our in silico validation rate may actually be as high as 96.2% for BsrGI and 89.3% for HinP1I (94.2% overall).
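The category percentages and validation rates in this paragraph can be recomputed from the counts reported in Table 5. The short sketch below does so per enzyme and for the pooled set; the dictionary layout and function names are illustrative choices.

```python
# Category counts from Table 5 (aa/bb markers, TM-1 vs. Pima 3-79)
TABLE5 = {
    "BsrGI": {"I": 1183, "II": 24, "III": 30, "IV": 99, "V": 77},
    "HinP1I": {"I": 381, "II": 59, "III": 0, "IV": 16, "V": 93},
}


def pooled(table):
    """Sum the category counts across enzymes."""
    total = {}
    for counts in table.values():
        for category, n in counts.items():
            total[category] = total.get(category, 0) + n
    return total


def percent(counts, categories):
    """Percentage of aligned fragments falling in the given categories."""
    return 100.0 * sum(counts[c] for c in categories) / sum(counts.values())


for label, counts in {**TABLE5, "overall": pooled(TABLE5)}.items():
    category_i = round(percent(counts, ["I"]), 1)
    # Upper bound on validation: Categories I, IV, and V can all yield aa/bb.
    upper = round(percent(counts, ["I", "IV", "V"]), 1)
    print(label, category_i, upper)
```

Pooling is done on the raw counts, not by averaging the two percentages, because the BsrGI set is more than twice the size of the HinP1I set.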

Fig. 3.

Representative examples of the five categories of sequence alignments observed in TM-1 vs. Pima 3-79 polymorphic markers with aa/bb marker type assignment from Stacks. Nucleotides on a black background indicate the site of the key Pima 3-79 polymorphism relative to the TM-1 reference sequence. Nucleotides on a gray background indicate additional mismatches relative to the TM-1 reference sequence. The top two lines in each category indicate the TM-1 and Pima 3-79 fragment sequences, respectively. The prefix B indicates BsrGI markers, and H indicates HinP1I. Additional lines in the alignment represent fragments from diploid genomes along with chromosomal assignments. BGI_A = Gossypium arboreum (Li et al., 2014); JGI_D = G. raimondii (Paterson et al., 2012); BGI_D = G. raimondii (Wang et al., 2012); scaf = scaffold.

Table 5.

Categorization of marker alignments of aa/bb markers polymorphic between Gossypium hirsutum TM-1 and G. barbadense Pima 3-79. Alignments included TM-1 and Pima 3-79 alleles, along with any BLAST hits (1e-6) to sequenced A- and D-genome diploid species (Paterson et al., 2012; Wang et al., 2012; Li et al., 2014). The five categories are described in the text and illustrated in Fig. 3.

Category	BsrGI	HinP1I
Total fragments aligned	1413	549
Category I	1183	381
Category II	24	59
Category III	30	0
Category IV	99	16
Category V	77	93
Category V without BLAST hits to diploids	7	8
Fragments assigned to a subgenome	1312	397
Fragments assigned to AT	841	234
Fragments assigned to DT	471	163

Experimental validation was performed using the PCR-based cleaved amplified polymorphic sequence (CAPS) marker method (Konieczny and Ausubel, 1993). Only a small proportion of markers were suitable for experimental validation based on the following criteria: (1) the SNP had to be near the middle of the 60-bp sequence to allow for design of flanking primers, (2) flanking sequences had to have suitable G+C content for primer design (30–60%) and lack simple sequence repeats, (3) specific primers had to be designed (using the alignments) to avoid amplification of known paralogs and homeologs, and (4) the SNP had to occur within the recognition site of a commercially available restriction enzyme. Only 22 TM-1 vs. Pima 3-79 markers (three HinP1I and 19 BsrGI) met all of these criteria; all of these fell into Category I (above) when examined in evolutionary alignments. Primer pairs shown in Table 6 were used in PCR amplification with the KAPA3G Plant PCR kit (Kapa Biosystems, Wilmington, Massachusetts, USA), as per the manufacturer’s recommended protocol. Amplification products were examined using 4% agarose gel electrophoresis (E-Gel EX, Life Technologies, Grand Island, New York, USA) before and after restriction digestion. Of the 22 CAPS markers, one marker (Bsr1616) yielded multiple PCR amplicons, none of which were of the predicted size. One marker (Bsr18072) showed unexpected partial digestion in both TM-1 and Pima 3-79 accessions. Of those markers that could be definitively scored, 20/21 (95%) showed the predicted pattern of restriction digestion for polymorphic markers that were homozygous within each of the two taxa examined.
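Some of the CAPS suitability criteria listed above lend themselves to a simple automated screen. The sketch below is an illustration under stated assumptions, not the procedure used in the study: it takes two equal-length allele sequences that differ at a single position, uses whole-fragment G+C content as a stand-in for flanking-sequence G+C, and applies a hypothetical 15-bp margin for "near the middle". Checks for simple sequence repeats and for paralog-specific primers are not implemented. The recognition sequences are those of seven enzymes that appear in Table 6.

```python
# Recognition sequences of several enzymes listed in Table 6
SITES = {"TaqI": "TCGA", "AluI": "AGCT", "MseI": "TTAA", "HpaII": "CCGG",
         "NlaIII": "CATG", "BfaI": "CTAG", "MluCI": "AATT"}


def gc_percent(seq):
    """G+C content of a sequence as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def site_covers(seq, site, pos):
    """True if some occurrence of site in seq spans index pos."""
    start = seq.find(site)
    while start != -1:
        if start <= pos < start + len(site):
            return True
        start = seq.find(site, start + 1)
    return False


def caps_enzymes(allele_a, allele_b, margin=15):
    """Return {enzyme: allele that is cut} for a candidate CAPS marker.

    margin is a hypothetical threshold: the SNP must lie at least this
    many bases from either end of the fragment.
    """
    a, b = allele_a.upper(), allele_b.upper()
    if len(a) != len(b):
        raise ValueError("alleles must have equal length")
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return {}
    pos = diffs[0]
    if not margin <= pos < len(a) - margin:
        return {}
    if not 30.0 <= gc_percent(a) <= 60.0:
        return {}
    result = {}
    for name, site in SITES.items():
        cut_a, cut_b = site_covers(a, site, pos), site_covers(b, site, pos)
        if cut_a != cut_b:
            result[name] = "a" if cut_a else "b"
    return result


# Made-up 60-bp alleles differing by one G/A substitution inside a TaqI site
left, right = "ACTGGTCAGTACCTGAGGCTAGTCAGGA", "GTCCAGTTGACCTGAGTACAGGTCATGC"
print(caps_enzymes(left + "TCGA" + right, left + "TCAA" + right))
```

A marker passes this screen only when the enzyme site overlaps the SNP in exactly one allele, which is what produces the differential digestion pattern scored in Table 6.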

Table 6.

Cleaved amplified polymorphic sequence validation of 22 aa/bb markers that are polymorphic between Gossypium hirsutum TM-1 and G. barbadense Pima 3-79.

Locus	Primer sequences (5′–3′)	Enzyme	Predicted cut^a	Cut TM-1	Cut 3-79
Bsr1195	F: CGTACACAAAGTATTTAGAGAATATAA	MluCI	Pima 3-79		X
	R: CAAAAAGGTACGTTCCATGAAAAG
Bsr1616^b	F: CGTACACATGGTGAACACTTAGTAC	BfaI	TM-1	(Multiple amplicons)
	R: GTAGACAAGAGAGCTACGAGATAAAC
Bsr3721	F: CACGTCCTAGGACACGGGCTAT	NlaIII	Pima 3-79		X
	R: GTGTGACCGTGTGTGGCACACTA
Bsr5368	F: CGTACAATTAGGTGTTTCGCTCTTAG	NlaIII	TM-1	X
	R: AGCTCTAGTATCATAACTACAGTTAGC
Bsr7080	F: CGTACATGGAACTTTTTAAGGAGGC	AluI	TM-1	X
	R: ACATTTAATGCAAGTGCATGTAT
Bsr7402	F: CGTACAAGACTCACCCACAAGT	TaqI	TM-1	X
	R: GGCTTGATGCTGGGATTATATACAC
Bsr9628	F: CGTACAATAGAGTTACAATAAACTCG	TaqI	Pima 3-79		X
	R: GTTTTTGCCGAACTTTATTCATAACA
Bsr12910	F: CGTACAGTCAACCGCCTTAAAAATTTA	MseI	TM-1	X
	R: CTTTTACGGTGTTTTTGTTTTGACATC
Bsr13288	F: CATCAGCATAAGGAACACGTGGCAC	HpyCH4IV	Pima 3-79		X
	R: TTGACGGAATAACCAGACAAGAACA
Bsr14160	F: CGTACATGAGTACTAAAGAGATTGG	NlaIII	TM-1	X
	R: GATATCTTTAATAGGGGGTGCAAC
Bsr17257	F: CAAAGACCTCCCCCACCTACTTC	HpaII	TM-1	X
	R: TCAGCACCCTGTGGTACCTCAAG
Bsr17701	F: CAACAACCTGCCTCACCTGCTTC	MluCI	TM-1	X
	R: TTAGCACCTTATGGCATCTCAGGA
Bsr18072	F: CGTACAAGAACCTCCCCCACC	HpaII	TM-1	X	X
	R: CAGCACCCTGTGGCATCTCTG
Bsr18083	F: CGTACAAACCTGAGATTTCAGGTC	HpyCH4IV	Pima 3-79		X
	R: CCCTGATATGTATTGGTCGGGC
Bsr18484	F: CGTACATTAACCCGGTTCAGGTG	NlaIII	Pima 3-79		X
	R: ACTGGATCCATTAGTTAGAATCGGG
Bsr18818	F: CGTACAGTTATAAGAGAAATTCCAC	BfaI	TM-1	X
	R: CTCTTCAACCCCTTGTTTTGTGATC
Bsr20063	F: CGTACATGATAAGGACAAGAGTATT	MseI	Pima 3-79		X
	R: CAGTTTTGTCCGGTACGGTCTGGCA
Bsr20113	F: CGTACAACAATCATACAAGGAAT	NlaIII	TM-1	X
	R: GTCTCTAGACCCGTTCCTTCATG
Bsr20829	F: CGTACAACTCAAGTGTACCACT	TaqI	Pima 3-79		X
	R: TTCCTGTTGAATTTATCTGAAATATC
Hin2726	F: CGCATGCATGTTAGCAAGCAGTG	HpyCH4V	Pima 3-79		X
	R: CGTGATTCGACGAAAACCAATC
Hin3799	F: CCAGTTCTATCATGGCAAGATTCC	HpaII	TM-1	X
	R: GGAAGTTTCAACGAGAGAGTTGAAAG
Hin9147	F: CAGCCCACCACTTTTCCTTACC	BfaI	TM-1	X
	R: TGTGCAGAATTGAGGGTTGCCT

^a Predicted cut site is based on an alignment of GBS fragment sequences.

^b Bsr1616 yielded multiple PCR amplicons, none of which matched the expected size.


CONCLUSIONS

Cultivated cottons have complex allotetraploid genomes with high levels of repetitive DNA and a small proportion of gene-encoding DNA (Li et al., 2014). These characteristics greatly complicate efforts to apply GBS approaches. Foremost among these difficulties is the presence of homeologous gene copies (homeologs) inherited from the diploid ancestors. Furthermore, all plant genomes have paralogous regions arising from gene duplication processes other than allotetraploidization (such as tandem duplication). To filter out the confounding polymorphisms between homeologs and between paralogs, we selected for markers with a dual-homozygous aa/bb marker prediction from the Stacks algorithm. The likelihood of such a pattern arising from more than one orthologous locus by a mutational process or by gene conversion was considered to be small compared to the straightforward interpretation of alternative alleles at a single locus. The resulting filtered marker set was highly enriched for markers with evolutionary patterns that were consistent with alternative, codominant alleles at a single locus within a particular subgenome. The application of this filter also reduced the number of markers with sequence similarity to repetitive elements and organellar genomes (Table 2). BLASTN searches against G. raimondii transcript and CDS databases indicated that markers derived from the methylation-sensitive restriction enzyme HinP1I were highly enriched in gene-related sequences (Table 2). Thus we consider this marker set to be highly informative for mapping traits in gene-encoding regions of the genome.

The total set of 18,073 BsrGI-derived and 5014 HinP1I-derived polymorphic markers selected by this strategy can be used in a variety of applications across a range of taxa. For example, 921 SNPs between the photoperiodic G. barbadense landrace K-56 (Peru) and the photoperiod-independent cultivar Pima 3-79 could be used for mapping the photoperiodism trait. 
They could also be employed as a resource for marker-assisted conversion of photoperiodic germplasm to photoperiod independence (Percy, 2009). Similarly, our collection included 686 SNPs between the photoperiodic G. hirsutum landrace TX-231 (Guatemala) and the photoperiod-independent cultivar TM-1. Finally, our collection included ca. 2000 SNP markers that can be applied to the TM-1 × Pima 3-79 interspecific recombinant inbred line (RIL) population (Yu et al., 2012) by providing a linkage-based framework for ongoing genome sequencing and chromosome assembly efforts in the allotetraploid cottons G. hirsutum and G. barbadense. These new markers add to existing collections of cotton SNPs developed from: (1) comparative transcriptome sequencing, (2) shallow depth genome sequencing, (3) genome reduction based on restriction site conservation (GR-RSC), detected by (4) Roche 454 pyrosequencing, and (5) genotyping by sequencing (Van Deynze et al., 2009; Byers et al., 2012; Lacape et al., 2012; Rai et al., 2013; Zhu et al., 2014). Because of the small fragment size (60 bp) and intrinsic similarities between orthologs and paralogs, the SNP loci described here may be of limited use for non-sequence-based genotyping approaches (e.g., KASPar, Illumina GoldenGate) (Hyten et al., 2008; Byers et al., 2012). However, this unique GBS approach can facilitate the discovery of large sets of informative markers that can be employed to genotype extensive collections of biological samples and experimental populations (RILs, F2, backcross) using barcoding and multiplex sequencing strategies (Elshire et al., 2011). Currently, a single lane on the Illumina HiSeq 2500 instrument can be used to genotype 96 samples at an average coverage of ca. 2 million reads per sample (the depth of coverage used in this work). 
For some experimental purposes, the most flexible and cost-effective approach may be to use a “white list” of marker sequence and polymorphism data, such as that provided here, to design a targeted set of oligonucleotides that capture and enrich selected genomic fragments for resequencing using available NGS technologies. It is important to note that the overall strategy for marker discovery and annotation that we have provided in this study can be extended to any species, including those that are allopolyploid.


1. Sampling nucleotide diversity in cotton.

Authors: Allen Van Deynze; Kevin Stoffel; Mike Lee; Thea A Wilkins; Alexander Kozik; Roy G Cantrell; John Z Yu; Russel J Kohel; David M Stelly
Journal: BMC Plant Biol Date: 2009-10-20 Impact factor: 4.215

2. The draft genome of a diploid cotton Gossypium raimondii.

Authors: Kunbo Wang; Zhiwen Wang; Fuguang Li; Wuwei Ye; Junyi Wang; Guoli Song; Zhen Yue; Lin Cong; Haihong Shang; Shilin Zhu; Changsong Zou; Qin Li; Youlu Yuan; Cairui Lu; Hengling Wei; Caiyun Gou; Zequn Zheng; Ye Yin; Xueyan Zhang; Kun Liu; Bo Wang; Chi Song; Nan Shi; Russell J Kohel; Richard G Percy; John Z Yu; Yu-Xian Zhu; Jun Wang; Shuxun Yu
Journal: Nat Genet Date: 2012-08-26 Impact factor: 38.330

3. Genome sequence of the cultivated cotton Gossypium arboreum.

Authors: Fuguang Li; Guangyi Fan; Kunbo Wang; Fengming Sun; Youlu Yuan; Guoli Song; Qin Li; Zhiying Ma; Cairui Lu; Changsong Zou; Wenbin Chen; Xinming Liang; Haihong Shang; Weiqing Liu; Chengcheng Shi; Guanghui Xiao; Caiyun Gou; Wuwei Ye; Xun Xu; Xueyan Zhang; Hengling Wei; Zhifang Li; Guiyin Zhang; Junyi Wang; Kun Liu; Russell J Kohel; Richard G Percy; John Z Yu; Yu-Xian Zhu; Jun Wang; Shuxun Yu
Journal: Nat Genet Date: 2014-05-18 Impact factor: 38.330

Review 4. Epigenetic reprogramming in plant and animal development.

Authors: Suhua Feng; Steven E Jacobsen; Wolf Reik
Journal: Science Date: 2010-10-29 Impact factor: 47.728

5. Deep sequencing reveals differences in the transcriptional landscapes of fibers from two cultivated species of cotton.

Authors: Jean-Marc Lacape; Michel Claverie; Ramon O Vidal; Marcelo F Carazzolle; Gonçalo A Guimarães Pereira; Manuel Ruiz; Martial Pré; Danny Llewellyn; Yves Al-Ghazi; John Jacobs; Alexis Dereeper; Stéphanie Huguet; Marc Giband; Claire Lanaud
Journal: PLoS One Date: 2012-11-15 Impact factor: 3.240

6. A high-density simple sequence repeat and single nucleotide polymorphism genetic map of the tetraploid cotton genome.

Authors: John Z Yu; Russell J Kohel; David D Fang; Jaemin Cho; Allen Van Deynze; Mauricio Ulloa; Steven M Hoffman; Alan E Pepper; David M Stelly; Johnie N Jenkins; Sukumar Saha; Siva P Kumpatla; Manali R Shah; William V Hugie; Richard G Percy
Journal: G3 (Bethesda) Date: 2012-01-01 Impact factor: 3.154

7. Development and mapping of SNP assays in allotetraploid cotton.

Authors: Robert L Byers; David B Harker; Scott M Yourstone; Peter J Maughan; Joshua A Udall
Journal: Theor Appl Genet Date: 2012-01-18 Impact factor: 5.699

8. The complete chloroplast genome sequence of Gossypium hirsutum: organization and phylogenetic relationships to other angiosperms.

Authors: Seung-Bum Lee; Charalambos Kaittanis; Robert K Jansen; Jessica B Hostetler; Luke J Tallon; Christopher D Town; Henry Daniell
Journal: BMC Genomics Date: 2006-03-23 Impact factor: 3.969

9. The complete mitochondrial genome of Gossypium hirsutum and evolutionary analysis of higher plant mitochondrial genomes.

Authors: Guozheng Liu; Dandan Cao; Shuangshuang Li; Aiguo Su; Jianing Geng; Corrinne E Grover; Songnian Hu; Jinping Hua
Journal: PLoS One Date: 2013-08-05 Impact factor: 3.240

10. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline.

Authors: Jeffrey C Glaubitz; Terry M Casstevens; Fei Lu; James Harriman; Robert J Elshire; Qi Sun; Edward S Buckler
Journal: PLoS One Date: 2014-02-28 Impact factor: 3.240

6 in total

1. Genetic diversity analysis of Gossypium arboreum germplasm accessions using genotyping-by-sequencing.

Authors: Ruijuan Li; John E Erpelding
Journal: Genetica Date: 2016-09-07 Impact factor: 1.082

2. Development, genetic mapping and QTL association of cotton PHYA, PHYB, and HY5-specific CAPS and dCAPS markers.

Authors: Fakhriddin N Kushanov; Alan E Pepper; John Z Yu; Zabardast T Buriev; Shukhrat E Shermatov; Sukumar Saha; Mauricio Ulloa; Johnie N Jenkins; Abdusattor Abdukarimov; Ibrokhim Y Abdurakhmonov
Journal: BMC Genet Date: 2016-10-24 Impact factor: 2.797

3. Diversity analysis of cotton (Gossypium hirsutum L.) germplasm using the CottonSNP63K Array.

Authors: Lori L Hinze; Amanda M Hulse-Kemp; Iain W Wilson; Qian-Hao Zhu; Danny J Llewellyn; Jen M Taylor; Andrew Spriggs; David D Fang; Mauricio Ulloa; John J Burke; Marc Giband; Jean-Marc Lacape; Allen Van Deynze; Joshua A Udall; Jodi A Scheffler; Steve Hague; Jonathan F Wendel; Alan E Pepper; James Frelichowski; Cindy T Lawley; Don C Jones; Richard G Percy; David M Stelly
Journal: BMC Plant Biol Date: 2017-02-03 Impact factor: 4.215

4. High-density 80 K SNP array is a powerful tool for genotyping G. hirsutum accessions and genome analysis.

Authors: Caiping Cai; Guozhong Zhu; Tianzhen Zhang; Wangzhen Guo
Journal: BMC Genomics Date: 2017-08-23 Impact factor: 3.969

5. Identification of genes related to salt stress tolerance using intron-length polymorphic markers, association mapping and virus-induced gene silencing in cotton.

Authors: Caiping Cai; Shuang Wu; Erli Niu; Chaoze Cheng; Wangzhen Guo
Journal: Sci Rep Date: 2017-04-03 Impact factor: 4.379

6. A first linkage map and downy mildew resistance QTL discovery for sweet basil (Ocimum basilicum) facilitated by double digestion restriction site associated DNA sequencing (ddRADseq).

Authors: Robert Pyne; Josh Honig; Jennifer Vaiciunas; Adolfina Koroch; Christian Wyenandt; Stacy Bonos; James Simon
Journal: PLoS One Date: 2017-09-18 Impact factor: 3.240
