Cheng-Hong Yang, Kuo-Chuan Wu, Li-Yeh Chuang, Hsueh-Wei Chang.
Abstract
DNA barcode sequences are accumulating in large data sets. A barcode is generally a sequence longer than 1000 base pairs, which creates a computational burden. Although DNA barcodes were originally envisioned as straightforward species tags, the use of barcode sequences for identification is rarely emphasized at present. Single-nucleotide polymorphism (SNP) association studies suggest that SNPs may be ideal targets for feature selection to discriminate between species. We hypothesize that SNP-based barcodes may be more effective than full-length DNA barcode sequences for species discrimination. To test this hypothesis, we evaluated a ribulose diphosphate carboxylase (rbcL) SNP barcoding (RSB) strategy using a decision tree algorithm. After alignment and trimming, 31 SNPs were discovered in the rbcL sequences from 38 Brassicaceae plant species. During decision tree construction, these SNPs were used to set up decision rules that assign the sequences to 2 groups, level by level. After algorithm processing, 37 nodes and 31 loci were required to discriminate the 38 species. Finally, sequence tags consisting of 31 rbcL SNP barcodes were identified for discriminating the 38 Brassicaceae species, based on the decision tree-selected SNP pattern obtained with the RSB method. Taken together, this study provides the rationale that the SNP aspect of the DNA barcode for the rbcL gene is a useful and effective sequence for tagging 38 Brassicaceae species.
Keywords: Decision tree; SNP; barcoding; rbcL gene; species tag
Year: 2018 PMID: 29551885 PMCID: PMC5846911 DOI: 10.1177/1176934318760856
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Table 1. The 38 rbcL sequences of the plant family Brassicaceae retrieved from GenBank.
| Species name | Length, bp | Accession no. | Position (a) |
|---|---|---|---|
| | 1347 | AY167983.1 | 114–1167 |
| | 1332 | FN594834.1 | 109–1162 |
| | 1292 | AY174633.1 | 82–1135 |
| | 1154 | DQ310542.1 | 30–1083 |
| | 1348 | KM360667.1 | 120–1173 |
| | 1366 | KF602144.1 | 138–1191 |
| | 1381 | JX848436.1 | 140–1193 |
| | 1408 | KM360682.1 | 120–1173 |
| | 1347 | AY167981.1 | 114–1167 |
| | 1300 | HE616642.1 | 99–1152 |
| | 1384 | KM360691.1 | 96–1149 |
| | 1380 | FN594833.1 | 128–1181 |
| | 1366 | FN594827.1 | 113–1166 |
| | 1152 | JN847840.1 | 29–1082 |
| | 1324 | FN594843.1 | 91–1144 |
| | 1380 | JX848439.1 | 139–1192 |
| | 1202 | HE616652.1 | 1–1054 (b) |
| | 1345 | KM360756.1 | 120–1173 |
| | 1347 | AY167980.1 | 114–1167 |
| | 1374 | FN594846.1 | 114–1167 |
| | 1399 | AM234933.1 | 111–1164 |
| | 1401 | KM360815.1 | 113–1166 |
| | 1408 | KM360830.1 | 120–1173 |
| | 1392 | FN594830.1 | 138–1191 |
| | 1324 | KT626727.1 | 114–1167 |
| | 1341 | JQ933388.1 | 111–1164 |
| | 1345 | KM360861.1 | 120–1173 |
| | 1383 | JQ933404.1 | 111–1164 |
| | 1380 | FN594826.1 | 128–1181 |
| | 1318 | KT626750.1 | 114–1167 |
| | 1084 | DQ310543.1 | 30–1083 |
| | 1340 | FN594852.1 | 111–1164 |
| | 1336 | JQ933435.1 | 111–1164 |
| | 1379 | JX848443.1 | 138–1191 |
| | 1324 | KT626842.1 | 114–1167 |
| | 1347 | AY167982.1 | 114–1167 |
| | 1336 | JQ933355.1 | 111–1164 |
| | 1345 | KM361012.1 | 120–1173 |

Abbreviation: bp, base pair.
(a) The position is listed in the reference of its own accession number.
(b) The reference sequence for single-nucleotide polymorphism barcoding and position.
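
Every range in the Position column spans 1054 positions (for example, 1167 − 114 + 1 = 1054), which is consistent with all sequences being trimmed to a common length. The following is a minimal sketch of such a trimming step, assuming the ranges are 1-based and inclusive. The function name `trim_to_range` is hypothetical, and the sequence used in the example is a placeholder, not real rbcL data.

```python
def trim_to_range(sequence: str, position: str) -> str:
    """Return the part of `sequence` covered by a range such as '114–1167'.

    Assumption: the range is 1-based and inclusive at both ends.
    """
    # Accept either an en dash or a plain hyphen as the separator.
    start_text, end_text = position.replace("–", "-").split("-")
    start, end = int(start_text), int(end_text)
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"range {position} does not fit a sequence of length {len(sequence)}")
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[start - 1:end]


# Placeholder sequence with the length listed for AY167983.1 (1347 bp).
placeholder = "A" * 1347
trimmed = trim_to_range(placeholder, "114–1167")
print(len(trimmed))
```

Applying the same function to each accession with its own range would give sequences of equal length that can be compared column by column.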
Figure 1. Flowchart of rbcL SNP barcoding. The DTSB method is performed in three steps: step 1, data processing; step 2, decision tree construction; and step 3, barcode creation. In step 1, 7 rbcL sequences from different species (S1-S7) were retrieved from GenBank. After alignment, the protruding ends were trimmed so that all pretested rbcL sequences have the same length. In step 2, the aligned sequences were fed into decision tree processing, which creates the root node of the tree, creates the decision rules, and applies a separation rule that splits the sequences into 2 categories (left leaf [LL] node and right leaf [RL] node). In step 3, the tree was constructed and the SNP barcodes of the rbcL gene for species tagging were generated. Here, hypothetical sequences and SNP barcodes provide an example of species tags. SNP indicates single-nucleotide polymorphism.
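
The steps in Figure 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not state the decision rule, so the sketch simply picks the SNP column that gives the most balanced two-way split. The names `find_snp_columns`, `build_tree`, and `internal_positions` and the toy sequences S1 to S4 are hypothetical.

```python
from collections import Counter


def find_snp_columns(aligned):
    """Return 0-based column indices at which the aligned sequences differ."""
    length = len(next(iter(aligned.values())))
    return [i for i in range(length)
            if len({seq[i] for seq in aligned.values()}) > 1]


def build_tree(aligned, columns):
    """Split the species into 2 groups, level by level, until each leaf is one species.

    Assumption: the split column is the one with the most balanced split.
    """
    names = sorted(aligned)
    if len(names) == 1:
        return names[0]  # leaf: a single species
    best = None
    for col in columns:
        counts = Counter(aligned[name][col] for name in names)
        if len(counts) < 2:
            continue  # this column does not vary within the current group
        base, count = counts.most_common(1)[0]
        imbalance = abs(2 * count - len(names))
        if best is None or imbalance < best[0]:
            best = (imbalance, col, base)
    if best is None:
        raise ValueError("identical sequences cannot be separated: " + ", ".join(names))
    _, col, base = best
    left = {n: s for n, s in aligned.items() if s[col] == base}
    right = {n: s for n, s in aligned.items() if s[col] != base}
    return {"position": col + 1,  # 1-based position in the trimmed alignment
            "left_base": base,
            "left": build_tree(left, columns),
            "right": build_tree(right, columns)}


def internal_positions(tree):
    """List the positions tested at every internal node of the tree."""
    if isinstance(tree, str):
        return []
    return ([tree["position"]]
            + internal_positions(tree["left"])
            + internal_positions(tree["right"]))


# Hypothetical aligned, trimmed sequences (not real rbcL data).
toy = {"S1": "ACGTAC", "S2": "ACGTTC", "S3": "ATGTAC", "S4": "ATGTTG"}
tree = build_tree(toy, find_snp_columns(toy))
print(internal_positions(tree))
```

A binary tree that separates n species into single-species leaves has n − 1 internal nodes, which agrees with the 37 nodes reported for 38 species. The number of distinct loci can be smaller than the number of nodes because the same position may be tested in different branches.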
Figure 2. Decision tree construction for the 38 aligned rbcL sequences. The number within each circle is the nucleotide position in the trimmed rbcL alignment. Decisions are made sequentially from top to bottom, level by level. At each level, the letters (nucleotides) on the left and right sides indicate the left leaf (LL) and right leaf (RL) nodes, as shown in Figure 1. After the nucleotides are collected from the top level to the bottom level, the SNP barcode of rbcL provides the tag sequence for each species, as shown in Figures 3 and 4. SNP indicates single-nucleotide polymorphism.
Figure 3. Sorting of the SNP barcodes of rbcL sequences from 38 Brassicaceae species. The nucleotides chosen for decision tree construction depend on the decision tree rule rather than on the order of nucleotide positions. Only after sorting do the nucleotides appear in positional order. Yellow background indicates the minimal SNPs needed to generate species-specific SNP patterns. The position index is based on the trimmed rbcL sequence, which is the same as the sequence from accession no. HE616652.1 (Table 1). SNP indicates single-nucleotide polymorphism.
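
The sorting described for Figure 3 can be illustrated with a short sketch: the positions selected by the tree arrive in tree order, are sorted by position, and the nucleotide at each position is read to form the tag for each species. The function names `snp_barcodes` and `all_unique`, the toy sequences, and the selected positions are hypothetical and chosen only for illustration.

```python
def snp_barcodes(aligned, positions):
    """Read each species' nucleotides at the selected 1-based positions.

    Positions are de-duplicated and sorted so the barcode follows positional order.
    """
    ordered = sorted(set(positions))
    return {name: "".join(seq[p - 1] for p in ordered)
            for name, seq in aligned.items()}


def all_unique(barcodes):
    """Check that no two species share the same SNP barcode."""
    return len(set(barcodes.values())) == len(barcodes)


# Hypothetical aligned, trimmed sequences (not real rbcL data).
toy = {"S1": "ACGTAC", "S2": "ACGTTC", "S3": "ATGTAC", "S4": "ATGTTG"}
# Positions in the order a tree might select them, with one position repeated.
selected = [5, 2, 5]

barcodes = snp_barcodes(toy, selected)
print(barcodes)
print(all_unique(barcodes))
```

The uniqueness check mirrors the purpose of the tags: a set of positions is sufficient only if every species ends up with a distinct barcode.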
Figure 4. The SNP barcodes for 38 species of Brassicaceae are sorted, and each SNP barcode (see Figure 3) is converted to a barcode pattern. SNP indicates single-nucleotide polymorphism.