Literature DB >> 26722221

Genome Wide Sampling Sequencing for SNP Genotyping: Methods, Challenges and Future Development.

Zhihua Jiang¹, Hongyang Wang², Jennifer J Michal¹, Xiang Zhou¹, Bang Liu³, Leah C Solberg Woods⁴, Rita A Fuchs⁵.

Abstract

Genetic polymorphisms, particularly single nucleotide polymorphisms (SNPs), have been widely used to advance quantitative, functional and evolutionary genomics. Ideally, all genetic variants among individuals should be discovered when next generation sequencing (NGS) technologies and platforms are used for whole genome sequencing or resequencing. In order to improve the cost-effectiveness of the process, however, the research community has mainly focused on developing genome-wide sampling sequencing (GWSS) methods, a collection of reduced genome complexity sequencing, reduced genome representation sequencing and selective genome target sequencing. Here we review the major steps involved in library preparation, the types of adapters used for ligation and the primers designed for amplification of ligated products for sequencing. Unfortunately, currently available GWSS methods have their drawbacks, such as inconsistency in the number of reads per sample library, the number of sites/targets per individual, and the number of reads per site/target, all of which result in missing data. Suggestions are proposed here to improve library construction, genotype calling accuracy, genome-wide marker density and read mapping rate. In brief, optimized GWSS library preparation should generate a unique set of target sites with dense distribution along chromosomes and even coverage per site across all individuals.

Entities: Chemical Disease Gene Species

Keywords: dominant and co-dominant markers, genome wide association study, human, animals and plants; restriction site associated DNA sequencing, genotyping by sequencing, reduced representation library sequencing, complexity reduction of polymorphism sequencing

Mesh：

Year: 2016 PMID： 26722221 PMCID： PMC4679402 DOI： 10.7150/ijbs.13498

Source DB: PubMed Journal: Int J Biol Sci ISSN： 1449-2288 Impact factor: 6.580

Introduction

Single nucleotide polymorphisms (SNPs) are the predominant forms of sequence variations in both plant and animal genomes 1. For example, approximately 150 million SNPs have been discovered in the human genome (see dbSNP database at the National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/SNP/). The urgent need for simultaneous genotyping of many thousands of SNPs on large numbers of individual samples has accelerated the innovation of novel SNP genotyping platforms, such as oligonucleotide arrays (Affymetrix, Inc., Santa Clara, CA, USA) 2,3 and BeadArray microarrays (Illumina, Inc., San Diego, CA, USA) 4. Currently, Affymetrix offers genotyping arrays for livestock and aquaculture species (buffalo, cattle, chicken, pig, salmon and trout), crops (cotton, maize, soybean, strawberry and wheat) and biomedical and model organisms (human, dog, mouse and Arabidopsis thaliana) (http://www.affymetrix.com/estore/), while Illumina markets whole genome genotyping BeadArrays for human and non-human species (cattle, dog, maize, pig and sheep) (http://www.illumina.com/index-d.html). These SNP arrays or chips have been widely used in 1) long range linkage disequilibrium analyses for determination of population structures/individual ancestries, 2) genome wide association studies (GWAS) for understanding genetic complexity of complex phenotypes that affect human health or economically important traits in animals and plants, 3) detection of genome instability for evaluation of somatic copy number lesions and germline copy number variation and 4) even transcriptome profiling for investigating allele-specific expression patterns 5-7. However, these commercially available SNP arrays or chips cannot be easily modified to suit individual experimental designs. In addition, relevant research cannot be conducted for species that do not have commercially available SNP arrays/chips. It is true that customized SNP arrays/chips can be manufactured, but they are cost-prohibitive to many researchers. Next generation sequencing (NGS) technologies, on the other hand, generate millions of reads in a relatively short amount of time, making them powerful tools for genome research 8,9. For example, one full NGS run can produce 1 million to 5,000 million reads in 8 hours to 11 days, depending on the platform 10. Such a high throughput capability of genome sequencing or re-sequencing projects should efficiently and accurately discover and genotype many thousands of genetic polymorphisms, mainly SNPs, which can be used to investigate quantitative, functional, and evolutionary genomics in human, animals, and plants. To date, high (20x to 30x, for example) or low (approximately 5x, for example) whole genome sequencing or resequencing coverage has been proposed as a useful strategy to discover these genetic variants 11. It is obvious that a high coverage strategy can lead to a high cost project, in particular when it deals with large eukaryotic genomes (such as mammalian and many plant species) and large sample sizes. However, a low coverage strategy also has drawbacks because SNP calling is difficult and will result in low genotyping accuracy and high genotyping errors 11. Furthermore, whole genome sequencing/resequencing for SNP genotyping is technically unnecessary and is cost-prohibitive, because linkage disequilibrium relationships (rvalues) can be as high as 95% - 100% among genetic markers within a gene or a genomic region 12,13. In practice, only one of the highly linked markers is usually selected as a tagged SNP in data analysis. In order to address the limitations associated with SNP arrays/chips and the high/low coverage whole genome sequencing/resequencing platforms described above, the genome research community has been developing alternative strategies to discover and genotype genetic variants in a cost effective manner. Basically, these alternative methods/techniques are NGS-based, but different laboratory procedures can result in different data outcomes in terms of reduced genome complexities, reduced genome representations or selected genome targets. In contrast to whole genome sequencing/re-sequencing, a basic feature associated with these methods/techniques is to have a subset of a genome sampled and sequenced. As such, we collectively term these techniques as genome-wide sampling sequencing (GWSS) methods. Recently, Davey and co-workers 14 indicated that these methods are applicable to both model organisms with known genomes and non-model organisms with unknown genomes. For example, GWSS applications were used for SNP discovery in common bean 15, eggplant 16, flax 17 and pigs 18. Traditionally, GWSS methods involve three common steps: 1) DNA digestion with restriction enzymes, 2) ligation with adapters that fit into the relevant sequencing platforms and 3) PCR amplification to increase the yield of library products for sequencing. The genome sampling process depends on the use of rare-cutter enzymes, size selection of products, or selective amplification of products. Because exome sequencing samples all exons within a genome, we will include it in the present review. Nevertheless, these methods have their drawbacks, such as inconsistency in not only the number of reads per sample library, but also the number of sites/target per individual and the number of reads per site/target. Consequently, missing data and uneven distribution of markers along each chromosome is unavoidable. Therefore, GWSS methods must be improved to address these issues.

GWSS method review

Currently available GWSS methods have mainly evolved from reduced representation (library) sequencing (RRS or RRLS) 19,20, complexity reduction of polymorphism sequencing (CRoPSTM) 21, restriction site associated DNA sequencing (RAD-seq) 22 and genotyping by sequencing (GBS) 23 methods. However, cross-interpretations of different assays with the same terminologies or cross-labeling of the same assays with different terminologies have occurred frequently in the literature. After careful review of the detailed laboratory procedures of each method, the currently available methods can be classified into four categories: 1) GWSS without size selection, 2) GWSS with semi-size selection, 3) GWSS with size selection and 4) GWSS with selective amplification. Generally speaking, GWSS libraries can be constructed by digesting DNA with one or two restriction enzymes and analyzing the subsequent library with Illumina or Ion Torrent sequencing platforms.

GWSS without size selection

Examples of methods designed to prepare libraries without size selection include GBS 23, GBS with two restriction enzymes 24 and genome reducing and sequencing (GGRS) 25 (Supplemental Table 1). The main steps in these methods include digestion of DNA with restriction enzymes, simultaneous ligation of adapters, and selective amplification of products with PCR using primers complementary to ligated adapter sequences (Figure 1). Of the three methods above, only GGRS 25 employs selection of 300- to 500-bp PCR products for sequencing. Size selection at this stage of library preparation probably does not significantly affect the sequencing outcome in terms of genome complexity when compared to GBS and two-enzyme GBS methods. Rather, these three methods differ in the design of their adapters. Sample preparation with the GBS method uses a barcode adapter and a common adapter, which allow for multiplexing of 48 or 96 samples 23. In comparison, when libraries are prepared with two-enzyme GBS, a conventional adapter is designed for ligation to one cohesive end of the fragmented DNA, while a Y-type adapter is designed for ligation to the other end of the DNA fragment. This design ensures specific amplification of only fragments produced by digestion with the two restriction enzymes 24. The GGRS approach requires usage of only one Y-type adapter for library preparation 25. In brief, these methods rely on usage of five- or 6-cutter enzymes to prepare sequencing libraries with reduced genome complexity. In theory, DNA fragments too long to be amplified by polymerase would be excluded from the library. Overall, libraries prepared with GWSS without size-selection methods are relatively simple, robust, time and cost-efficient approaches 23. Obviously, these methods require specific sets of customized barcoded adapters for different enzymes or enzyme combinations.

Figure 1

Classification of GWSS methods. 1) GWSS without size selection after DNA digestion and ligation. Size selection may be employed after PCR amplification. 2) GWSS with semi-size selection. These methods perform size selection after ligation with adapters. Some methods may combine random shearing with size selection and then ligate with the second adapter. 3) GWSS with size selection immediately after DNA digestion. 4) GWSS with selective amplification. These methods are dependent on primer design for selective amplification of products for sequencing. Theoretically, the former two groups of methods can lead to reduction of genome complexity, while the latter two groups of methods may result in reduction of genome representation.

GWSS with semi-size selection

Some GWSS that utilize semi-size selection during library preparation include restriction-site associated DNA sequencing (RAD-seq) 22, paired-end reduced representation libraries (paired-end RPLs) 10, double-digest RAD-seq (DDRAD-seq) 26 and flexible and scalable GBS 11 (Supplemental Table 1). Although these authors used different terminologies to describe their methods, the library construction processes are actually quite similar. In brief, none of these methods include a DNA fragment size selection step immediately after restriction digestion (Figure 1). Instead, size selection is performed in ensuing steps of the library preparation process. Fragment size selection is performed twice when libraries are prepared with RAD-seq and paired-end RPLs methods. After the first adapter (the P1 adapter designed for the Illumina sequencing platform, for example) is ligated to DNA fragments, the ligated products are randomly sheared. Next, a size selection step is performed, followed by end repair, dA tailing and ligation with the P2 primer (Y adapter design) 10,22. After PCR amplification, size selection is repeated and products are subsequently sequenced. In comparison to the GWSS methods without size selection described above, random shearing of ligated products in these two methods may improve the possibility that large fragments are sequenced. However, random shearing may affect the position of the reads, which might complicate read mapping, particularly when the paired-end sequencing strategy is used. When libraries are prepared with the DDRAD-seq and flexible and scalable GBS procedures, size selection is conducted after ligation with both 5' and 3' adapters (Supplemental Table 1). The flexible and scalable GBS 11 technique uses restriction enzymes to produce DNA fragments with blunt ends. Obviously, blunt ends do not require end-repair, but dA-tailing is used for ligation to universal Illumina adapters. If the ligation step is not optimized, size selection of ligated products may lead to inefficient genome representation reduction. As such, the reduced genome complexity of libraries prepared with GWSS methods that do not employ size selection or use only semi-size selection may not be significantly different if DNA is digested with the same enzyme(s).

GWSS with size selection

Seven examples of GWSS library preparation with size selection are presented in this review, including reduced representation shotgun sequencing (RRS) 19, reduced representation libraries (RRLs) 20, paired-end reduced representation libraries (paired-end RPLs) 27, GBS with one enzyme digest 28, GBS with two enzyme digest 29, improved RRLs (iRRL) 30 and type IIB endonucleases restriction-site associated DNA (2b-RAD) 31 (Supplemental Table 1). Terminologically, these methods seem quite different, but their library preparation procedures are almost identical: DNA digestion, size selection, adapter ligation, PCR amplification and sequencing (Figure 1). Only the original RRS method 19 used the traditional cloning technique. When libraries were prepared for 66 cattle DNA samples using the RRLs method, Van Tassell and colleagues 20 discovered over 62,000 SNPs based on approximately 50 million reads. Kerstens and co-workers 27 used the same method to investigate structure variation in chickens. They made four DNA pools: a white egg layer line, brown egg layer line, and two different broiler lines, each with 25 individuals. The AluI digested DNA was separated on a 10% precast polyacrylamide gel (BioRad). DNA fragments of 100-200 bp in length were sliced out of the gel, followed by elution and precipitation for library preparation using the Illumina sequencing technology platform. Greminger et al. 30 reported an iRRL protocol for discovery of genetic markers using samples derived from two orangutan populations. They selected HaeIII for their method after analyzing in silico digestion with 23 blunt-end cutters (4-6 bp recognition sites) in multiple combinations. During the library preparation, Greminger and colleagues 30 paid special attention to size selection: the digested DNA fragments were separated on high-resolution gels and only 104 bp -123 bp fragments were excised from the gels and recovered by electro elution. The remaining steps in their method followed the SOLiD ChiP-Seq protocol 30. The 2b-RAD method 31 uses one enzyme, such as BsaXI or AlfI for digestion, but it can produce tags of uniform size for amplification and sequencing.

GWSS with selective amplification

Two examples of GWSS with selective amplification are listed in Supplemental Table 1. The CRoPSTM method, which is based on the AFLP® (Amplified fragment length polymorphisms) technique 21,32,33 belongs to this group of GWSS methods. This technology involves digestion of genomic DNA with a pair of enzymes, followed by ligation with adapters and PCR amplification (pre-amplification and selective-amplification). The amplified products are separated on a gel for traditional AFLP analysis 32 or sequenced using a Genome Sequencer (GS) 20/GS FLX next-generation sequencing platform 21. Although it is called scalable GBS, the method developed by Sonah and colleagues 34 is also a good example of GWSS with selective amplification. The authors used longer 3' primers that contained three parts: the entire common adapter, the 3' restriction site, and an extension of 1 or 2 bases into the insert fragments. In brief, these methods do not involve size selection, but employ selective amplification of products for sequencing (Figure 1). Although it is a totally different approach, here we briefly discuss the whole exome sequencing (WES) method 35. The WES technique usually starts with construction of a whole genome shotgun library. Fragments that range in size from 150 bp to 250 bp are collected after gel electrophoresis, and exome capture arrays are designed based on Refseq mRNA sequences. The shotgun library is then hybridized to the exome arrays and exon region related products are captured. The hybridized products are recovered from the slides and amplified for sequencing 36. The WES method selectively sequences 2 - 3% of the genome; however, this represents more than 98% of the exon regions 35.

GWSS technical challenges

Based on the methods we have reviewed, it appears that size selection and selective amplification are key steps in determination of the final composition of a GWSS library. Almost every digested DNA fragment prepared by GWSS without size selection and selective amplification will be sequenced unless it is too long to be amplified by PCR. This challenge can be overcome by GWSS methods that include random shearing of ligated products followed by size selection. Libraries prepared with GWSS without size selection and GWSS with semi-size selection methods have reduced genome complexity since partial (for most products) or full sequence (relative few products when they are short, particularly when sequenced with paired-end reads) can be collected for each digested product. Size selection immediately after restriction digestion can certainly reduce genome representation as not all digested products will be sequenced. However, quality control of size selection can be practically challenging. When techniques that utilize selective amplification are employed, the reduction of genome representation depends on primer design, but mismatch amplification can jeopardize the process. Therefore, the genome research community must address these problems to improve preparation of GWSS libraries.

Inconsistency in the number of reads per sample

Although the same set of primers is used to amplify a pool of libraries/samples under the same PCR conditions, the number of reads per library or per sample varies. Using RAD-seq, Baird and colleagues 22 tested 96 F2 fish and found variable numbers of reads sequenced per sample: 68% of individuals had between 50 K and 150 K RAD tags sequenced. Using GBS, Elshire and coworkers 23 reported a high CV (coefficient of variation) in the number of reads containing the appropriate barcode. Further, the cut sites with reads collected ranged from 23% among barley samples to 43% among maize samples when ApeKI was used to digest the genomic DNA. When 47 cattle DNA samples were digested with PstI, De Donato and colleagues 37 observed that read number per individual varied by 39%. Byrne and coworkers 38 investigated genetic diversity among eight perennial ryegrass varieties. After separate digestion with ApeKI and PstI, the CV for the number of reads per sample ranged from 27% to 62%. Moreover, the barcode (F(31, 223) = 20.93, p = 0.001) and barcode length (F(4, 250) = 29.89, p = 0.001) had significant effects on read numbers per sample 38.

Inconsistency in the number of sites sequenced per sample

In addition to variation in the number of reads per library, Chen et al. 25 observed disparities in the number of sites with reads mapped per library, ranging from 171,472 to 447,051 sites per sample and site number among individuals had a CV of about 21%. When eight restriction enzymes (MlyI, AluI, RsaI, DraI, EcoRV, StuI, HaeIII and HincII) were used to create blunt-end GBS libraries, Heffelfinger and coworkers 11 found that total percentage of predicted sites covered with reads ranged from 5.04% (digested with HincII) to 26.84% (digested with HaeIII) in maize and from 8.67% (digested with StuI) to 51.32% (digested with AluI) in rice.

Inconsistency in the number of reads per site

Beissinger and colleagues 28 collected over 118 million GBS sequence reads to determine the number of reads per site using Zea mays inbred B73 as a reference and enzyme ApeKI for digestion. During the library preparation, they selected ApeKI fragments of 70 to 318 bp in length for sequencing. This strategy should have yielded a predicted total of 1,406,269 sites as the number of reads per site is expected to follow a Poisson distribution with a mean equal to the average coverage (~40 reads per site). However, they found 0 reads for 1,021,382 sites, while the remaining sites had 1 - 95,014 read(s) per site.

Missing data

Inconsistencies in the number of reads per sample library, the number of sites/target per individual, and the number of reads per site/target can easily result in missing data. For example, Beissinger and colleagues 28 revealed that highly overrepresented sites (>500 reads per site), which comprised ~0.5% of the sequenced ApeKI sites in maize, accounted for 41.7% of the total reads. The authors predicted that the missing data would be 30%, 20%, or 10% of sites when sequencing produced 23, 41, or 80 million reads, respectively. Working on perennial ryegrass with ApeKI digestion, Byrne and coworkers 38 were able to discover 643,798 SNPs with a minor allele frequency of ≥5%. However, only 80,902 (~12.5%) SNPs were successfully genotyped in 75% of individual samples at 5X coverage. Although over 270,000 SNPs were identified using GBS in an F1 population of apples (Malus x domestica) segregating for skin color, Gardner et al. 29 found that a high proportion of data for SNPs was missing across samples and therefore could not be used for genetic mapping. Out of the 85 million DNA sequence reads produced in that study, only 6% could be used to generate a linkage map with 3,967 SNPs in the species. In addition, markers were not evenly distributed along each chromosome. For example, the physical distance between “adjacent” markers exceeded 15 Mb in some cases 29. Using outbred male Sprague-Dawley (SD) rats, GBS analysis only revealed a total of 2,256 high confidence SNPs with a minor allele frequency of larger than 0.2 39. Such a number of SNPs seems too few to pursue any GWAS analysis.

Suggestions for future GWSS development

Improvement of library construction process

Preparation of a GWSS library is relatively simple, but the process needs to be further optimized in order to radically address the challenges discussed above. First, we may need to prepare a library for each individual rather than a library for pooled samples. Such a “one sample-one library” approach will enable quantification of each library before sequencing so that equal amounts of libraries are sequenced and yield a similar number of total reads, thus minimizing inconsistency in number of reads per sample. Second, we need to compare efficiencies of primers designed with different binding positions that control the number of sites amplified. Currently, there are three types of primers used in final amplification of libraries for sequencing. These primers bind to 1) attachment sites (flowcell binding regions) alone (e.g., RAD-seq 22) (Figure 2), 2) attachment sites + sequencing read sites (e.g., GBS 23) (Figure 2), and 3) attachment sites + sequencing read sites + 3' restriction sites + extension of 1 or 2 bases into the insert fragments 34. Theoretically, this last set of primers is more specific and should amplify only the targeted sites; however, in reality this may not be true and further investigation is warranted. Third, the Y-adapter design and the subsequent primers for amplification should be well matched. As illustrated in Figure 2, Chen and colleagues 25 used a universal adapter as well as a reverse primer to initiate synthesis. Notably, both forward and reverse primers can initiate the first round of amplification, as they share 14 identical nucleotides at their 3' ends. Lastly, size selection of products for sequencing is also important. Gel separation and subsequent retrieval of products may contribute to variation in the number of sites involved in reduced genome complexity sequencing. Gel-free size selection with beads should reduce site variation. On the other hand, reducing uneven coverage of reads per site remains one of the most challenging issues in the construction of reduced genome complexity libraries 28.

Figure 2

Examples of GWSS NGS library construction methods used to discover genome-wide representative variants, mainly SNPs, among individuals. RE indicates the use of a single restriction enzyme in the assay, while RE1 and RE2 indicate the use of two enzymes for digestion of genomic DNA. The buffer zone contains additional nucleotides to tighten the restriction site ligation. The first four examples are designed for assessment with Illumina analyzers, while the last example is a library built to meet Ion Torrent platform requirements.

Improvement of genotype calling accuracy

As discussed above, GWSS methods rely on restriction enzyme digestions. Table 1 lists 23 restriction enzymes that have been frequently used. Use of these enzymes can potentially produce two sources of markers: 1) genetic (SNPs, insertions/deletions and short tandem repeats, for example) and epigenetic markers inside the recognition sites and 2) genetic markers outside of the restriction enzymes recognition sites. Since the restriction enzymes AluI, BsaXI, DraI, HaeIII, HindIII, MluCI, MlyI, MseI, MspI, NlaIII, PstI, SbfI and SphI are resistant to Dam (methylated by the dam methylase), Dcm (methylated by the dcm methylase) and CpG (methylated by the CpG methylase) blocks, digestion with these enzymes will not produce epigenetic polymorphisms. However, the remaining ten enzymes shown in Table 1 may produce genetic, epigenetic or both types of polymorphisms depending on nucleotide mutations and the methylation status of recognition sequences. Nevertheless, genetic and epigenetic markers associated with recognition sites can jeopardize genotype calling accuracy in individuals because they are dominant markers, rather than the co-dominant markers. As such, only homozygotes and heterozygotes with one allele that contain an enzyme cut site or that lack methylation blocks can be digested, ligated, amplified, sequenced, and scored. Therefore, GWSS methods need to use enzymes with recognition sequences that are rarely mutated in the genome and that are insensitive/resistant to Dam, Dcm, and CpG blocks, thus producing millions of NGS reads that can be used to maximize the discovery of co-dominant markers for linkage mapping or genome wide association studies. In order to be most effective, these enzymes must also produce an even distribution of fragment sizes and cannot generate any concentrated bands after DNA digestion.

Table 1

Restriction enzymes frequently used in the construction of GWSS libraries.

Enzyme	Recognition Sequence	Dam Blocked?	Dcm Blocked?	CpG blocked?
AluI [11,27]	AG/CT	No	No	No
ApeKI [23,28,34,37,38]	G/CWGC	No	No	Yes
AvaII [25]	G/GWCC	No	Yes	Yes
BsaXI [31]	(9/12)ACNNNNNCTCC(10/7)	No	No	No
DraI [11,20]	TTT/AAA	No	No	No
EcoRI [22,26,33]	G/AATTC	No	No	Yes
EcoRV [11]	GAT/ATC	No	No	Yes
EcoT22I [37]	ATGCA/T	?	?	?
HaeIII [11,20,30]	GG/CC	No	No	No
HincII [11]	GTY/RAC	No	No	Yes
HindIII [29]	A/AGCTT	No	No	No
HpaII [21]	C/CGG	No	No	Yes
MluCI [26]	/AATT	No	No	No
MlyI [11]	GAGTC(N)5/	No	No	No
MseI [21,33]	T/TAA	No	No	No
MspI [24,26,29,40]	C/CGG	No	No	No
NlaIII [26]	CATG/	No	No	No
PstI [10,24,37,38,40]	CTGCA/G	No	No	No
RsaI [11]	GT/AC	No	No	Yes
SbfI [10,12,26]	CCTGCA/GG	No	No	No
SphI [26]	GCATG/C	No	No	No
StuI [11]	AGG/CCT	No	Yes	No
TaqI [33]	T/CGA	Yes	No	No

Source: Reprinted from www.neb.com (2015) with permission from New England Biolabs.

Improvement of marker density

As listed in Table 1, four-, five- or six-cutter enzymes have been mainly used in library preparation. Yield estimates determined by in silico digestion of the soybean genome (1.1 - 1.5 Gb in length) were up to 9.5 million fragments for MseI (four-cutter), 800 K fragments for ApeKI (five-cutter) and 100 K fragments for PstI (six-cutter), respectively 34. These numbers clearly show that five-cutter enzymes provide a reasonable number of cut sites (i.e., roughly one site per kb of sequence). When five- or six-cutter enzymes are used; however, the average fragment size is still too large to fit into NGS platforms. Therefore, the first adapter-ligated products are often randomly sheared, size-selected and then ligated with the second adapter to yield constructs that contain flowcell binding regions, sequencing primer binding sites, and barcodes for library separation. Obviously, these steps add complexity to library preparation. Six-cutter enzymes are often combined with four-cutter enzymes. When only two-enzyme digested products are targeted, the six-four cutter enzyme combination probably doesn't add many more sites to the library pool as compared to six-cutter digestion enzyme alone. Nevertheless, missing sites (over 70% sites with zero reads, see Beissinger et al. 28) and missing genotypes (marker gaps up to 15 Mb in length, see Gardner et al. 29) indicate that marker density needs to be improved. The use of four-cutter enzymes in library preparation may be the answer to this problem. In this case, size selection will play an important role in the reduction of genome representation. In particular, in silico digestion should be performed, followed by selection of products of appropriate sizes that evenly distributed along each chromosome with reasonable distances between any adjunct sites.

Improvement of read mapping rate

Ligation is one of the critical steps involved in the preparation of GWSS libraries. Figure 2 illustrates the ideal ligation of products. In reality, many other types of ligated products are fabricated in the reaction, which eventually lead to artifacts among the sequenced reads. Our recent experience shows that even a 120 bp read can contain three digested fragments derived from different chromosomes. For example, after ligated artifacts were separated, the number of reads of ≥36 bp in length increased from 106 million reads to 132 million reads. As a consequence, the unique mapping rate also improved from ~45% to ~65%. Therefore, these artifacts cannot be ignored. In the AFLP technique 32, modified adapters are used to carry out both enzyme digestion and ligation in one step. Once ligated, these adapters are no longer recognized by the enzyme. Whether or not this strategy can be integrated in the construction of GWSS libraries needs to be tested. In brief, we continue to face challenges in making NGS-based GWSS libraries, even though the procedure is relatively simple. The library construction process can certainly be improved if we redesign adapters and primers, reconsider selection of restriction enzymes and optimize PCR conditions. The ultimate goal is to amplify a unique set of targets with dense distribution and even coverage across all individual samples. Supplementary Tables. Click here for additional data file.

39 in total

1. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays.

Authors: Hajime Matsuzaki; Shoulian Dong; Halina Loi; Xiaojun Di; Guoying Liu; Earl Hubbell; Jane Law; Tam Berntsen; Monica Chadha; Henry Hui; Geoffrey Yang; Giulia C Kennedy; Teresa A Webster; Simon Cawley; P Sean Walsh; Keith W Jones; Stephen P A Fodor; Rui Mei
Journal: Nat Methods Date: 2004-11 Impact factor: 28.547

2. AFLP: a new technique for DNA fingerprinting.

Authors: P Vos; R Hogers; M Bleeker; M Reijans; T van de Lee; M Hornes; A Frijters; J Pot; J Peleman; M Kuiper
Journal: Nucleic Acids Res Date: 1995-11-11 Impact factor: 16.971

Review 3. Genome-wide genetic marker discovery and genotyping using next-generation sequencing.

Authors: John W Davey; Paul A Hohenlohe; Paul D Etter; Jason Q Boone; Julian M Catchen; Mark L Blaxter
Journal: Nat Rev Genet Date: 2011-06-17 Impact factor: 53.242

4. Structural variation in the chicken genome identified by paired-end next-generation DNA sequencing of reduced representation libraries.

Authors: Hindrik Hd Kerstens; Richard Pma Crooijmans; Bert W Dibbits; Addie Vereijken; Ron Okimoto; Martien Am Groenen
Journal: BMC Genomics Date: 2011-02-03 Impact factor: 3.969

5. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species.

Authors: Robert J Elshire; Jeffrey C Glaubitz; Qi Sun; Jesse A Poland; Ken Kawamoto; Edward S Buckler; Sharon E Mitchell
Journal: PLoS One Date: 2011-05-04 Impact factor: 3.240

6. Identification of SNP and SSR markers in eggplant using RAD tag sequencing.

Authors: Lorenzo Barchi; Sergio Lanteri; Ezio Portis; Alberto Acquadro; Giampiero Valè; Laura Toppino; Giuseppe Leonardo Rotino
Journal: BMC Genomics Date: 2011-06-10 Impact factor: 3.969

7. Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms.

Authors: Maja P Greminger; Kai N Stölting; Alexander Nater; Benoit Goossens; Natasha Arora; Rémy Bruggmann; Andrea Patrignani; Beatrice Nussberger; Reeta Sharma; Robert H S Kraus; Laurentius N Ambu; Ian Singleton; Lounes Chikhi; Carel P van Schaik; Michael Krützen
Journal: BMC Genomics Date: 2014-01-10 Impact factor: 3.969

8. Quantitative genomics of 30 complex phenotypes in Wagyu x Angus F₁ progeny.

Authors: Lifan Zhang; Jennifer J Michal; James V O'Fallon; Zengxiang Pan; Charles T Gaskins; Jerry J Reeves; Jan R Busboom; Xiang Zhou; Bo Ding; Michael V Dodson; Zhihua Jiang
Journal: Int J Biol Sci Date: 2012-06-12 Impact factor: 6.580

9. Application of genotyping-by-sequencing on semiconductor sequencing platforms: a comparison of genetic and reference-based marker ordering in barley.

Authors: Martin Mascher; Shuangye Wu; Paul St Amand; Nils Stein; Jesse Poland
Journal: PLoS One Date: 2013-10-03 Impact factor: 3.240

10. Genotyping-by-Sequencing in Plants.

Authors: Stéphane Deschamps; Victor Llaca; Gregory D May
Journal: Biology (Basel) Date: 2012-09-25

15 in total

1. Genotyping by sequencing-based linkage map construction and identification of quantitative trait loci for yield-related traits and oil content in Jatropha (Jatropha curcas L.).

Authors: Vijay Yepuri; Saakshi Jalali; Vishwnadharaju Mudunuri; Sai Pothakani; Nagesh Kancharla; S Arockiasamy
Journal: Mol Biol Rep Date: 2022-03-03 Impact factor: 2.742

2. Optimized double-digest genotyping by sequencing (ddGBS) method with high-density SNP markers and high genotyping accuracy for chickens.

Authors: Yuzhe Wang; Xuemin Cao; Yiqiang Zhao; Jing Fei; Xiaoxiang Hu; Ning Li
Journal: PLoS One Date: 2017-06-09 Impact factor: 3.240

3. GBS-based single dosage markers for linkage and QTL mapping allow gene mining for yield-related traits in sugarcane.

Authors: Thiago Willian Almeida Balsalobre; Guilherme da Silva Pereira; Gabriel Rodrigues Alves Margarido; Rodrigo Gazaffi; Fernanda Zatti Barreto; Carina Oliveira Anoni; Cláudio Benício Cardoso-Silva; Estela Araújo Costa; Melina Cristina Mancini; Hermann Paulo Hoffmann; Anete Pereira de Souza; Antonio Augusto Franco Garcia; Monalisa Sampaio Carneiro
Journal: BMC Genomics Date: 2017-01-11 Impact factor: 3.969

4. A SNP Based Linkage Map of the Arctic Charr (Salvelinus alpinus) Genome Provides Insights into the Diploidization Process After Whole Genome Duplication.

Authors: Cameron M Nugent; Anne A Easton; Joseph D Norman; Moira M Ferguson; Roy G Danzmann
Journal: G3 (Bethesda) Date: 2017-02-09 Impact factor: 3.154

5. Construction of a highly saturated linkage map in Japanese plum (Prunus salicina L.) using GBS for SNP marker calling.

Authors: Basilio Carrasco; Máximo González; Marlene Gebauer; Rolando García-González; Jonathan Maldonado; Herman Silva
Journal: PLoS One Date: 2018-12-03 Impact factor: 3.240

6. MTHFR C677T polymorphism analysis: A simple, effective restriction enzyme-based method improving previous protocols.

Authors: Francesca Antonaros; Giulia Olivucci; Elena Cicchini; Giuseppe Ramacieri; Maria Chiara Pelleri; Lorenza Vitale; Pierluigi Strippoli; Chiara Locatelli; Guido Cocchi; Allison Piovesan; Maria Caracausi
Journal: Mol Genet Genomic Med Date: 2019-03-13 Impact factor: 2.183

7. A robust sequencing assay of a thousand amplicons for the high-throughput population monitoring of Alpine ibex immunogenetics.

Authors: Camille Kessler; Alice Brambilla; Dominique Waldvogel; Glauco Camenisch; Iris Biebach; Deborah M Leigh; Christine Grossen; Daniel Croll
Journal: Mol Ecol Resour Date: 2021-07-07 Impact factor: 8.678

Review 8. Integrating Genomic Data Sets for Knowledge Discovery: An Informed Approach to Management of Captive Endangered Species.

Authors: Kristopher J L Irizarry; Doug Bryant; Jordan Kalish; Curtis Eng; Peggy L Schmidt; Gini Barrett; Margaret C Barr
Journal: Int J Genomics Date: 2016-06-08 Impact factor: 2.326

9. The Pattern and Distribution of Induced Mutations in J. curcas Using Reduced Representation Sequencing.

Authors: Fatemeh Maghuly; Stephan Pabinger; Julie Krainer; Margit Laimer
Journal: Front Plant Sci Date: 2018-04-23 Impact factor: 5.753

Review 10. Altered expression of long non-coding RNA GAS5 in digestive tumors.

Authors: Shounan Lu; Zhilei Su; Wen Fu; Zhankun Cui; Xingming Jiang; Sheng Tai
Journal: Biosci Rep Date: 2019-01-18 Impact factor: 3.840