Renato Renison Moreira Oliveira, Gisele Lopes Nunes, Talvâne Glauber Lopes de Lima, Guilherme Oliveira, Ronnie Alves.
Abstract
BACKGROUND: Taxonomic identification of plants and insects is a laborious process that demands expert taxonomists and time, and species are often difficult to distinguish on morphology alone. DNA barcodes allow rapid species discovery and identification and have been widely used for taxonomic identification by targeting known gene regions that permit discrimination between species. DNA barcode sequence analysis is usually carried out with processes and tools that still demand a high level of interaction with the user or researcher. To minimize such interaction, we proposed PIPEBAR, a pipeline for the analysis of DNA chromatograms from Sanger sequencing platforms that ensures high-quality consensus sequences along with efficient running time. We also proposed a paired-end read assembly tool, OverlapPER, which can be used in sequence with PIPEBAR or independently of it.
Keywords: DNA barcode; DNA sequencing; Paired-end assembly; Sanger
Year: 2018 PMID: 30089465 PMCID: PMC6083499 DOI: 10.1186/s12859-018-2307-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 PIPEBAR workflow. a, b Conversion of chromatogram files to FASTQ files. c, d Trimming and filtering of the FASTQ files for low-quality bases based on Phred quality. e Overlapping of paired-end reads by OverlapPER, considering both substitution and indel errors. f Report file of the barcode sequences produced, given a Phred quality value as a parameter. g FASTA files for the merged reads and FASTQ files for the unmerged reads. h When a coding region is analyzed, PIPEBAR applies stop-codon and frameshift correction
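The trimming and filtering steps (c, d) can be illustrated with a minimal Python sketch. It is not PIPEBAR's code: the function names, the Phred+33 encoding, the end-trimming strategy and the default threshold of 20 are all assumptions made for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_low_quality_ends(seq, qual, min_q=20):
    """Remove bases scoring below min_q from both ends of a read.

    min_q=20 is an illustrative default, not a value taken from the paper.
    """
    scores = phred_scores(qual)
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def passes_filter(qual, min_mean_q=20):
    """Keep a read only if it is non-empty and its mean Phred score is high enough."""
    scores = phred_scores(qual)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q
```

In this sketch a read is first trimmed and then kept or discarded according to its mean quality; a real pipeline may instead use a sliding window or a minimum length cutoff.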
Fig. 2 Stop-codon and frameshift corrections. PIPEBAR translates the sequence in 3 forward and 3 reverse frames and selects the frame in which the impact of the stop codons found is minimal. Once the best translation frame is identified, the stop codons located at the extremities of the sequence are trimmed, yielding a sequence that is ready to be submitted to the NCBI and BOLD databases
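The frame selection described in this caption can be sketched in Python as follows. This is a simplified illustration and not PIPEBAR's implementation: it assumes the standard genetic code, measures the "impact" of stop codons simply by counting them, and does not attempt frameshift correction. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete codons starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def best_frame(seq):
    """Return (strand, frame, codon_list) for the frame with the fewest stop codons.

    Ties are resolved in favour of the first frame examined
    (forward frames 0, 1, 2, then reverse frames 0, 1, 2).
    """
    candidates = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            cs = codons(s, frame)
            n_stops = sum(c in STOP_CODONS for c in cs)
            candidates.append((n_stops, strand, frame, cs))
    _, strand, frame, cs = min(candidates, key=lambda c: c[0])
    return strand, frame, cs


def trim_terminal_stops(cs):
    """Drop stop codons located at either extremity of a codon list."""
    start, end = 0, len(cs)
    while start < end and cs[start] in STOP_CODONS:
        start += 1
    while end > start and cs[end - 1] in STOP_CODONS:
        end -= 1
    return cs[start:end]
```

For example, `best_frame("TAAGGCCGGA")` skips forward frame 0, which starts with the stop codon `TAA`, and selects forward frame 1.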
Comparison of PIPEBAR with SeqTrace and Geneious on the 3 datasets: the total number of barcodes produced at the end of the pipeline execution, the mean percentage similarity of the resulting barcodes to their respective BOLD reference sequences, the time spent by each pipeline from sequence trimming to the final results, and the total number of mismatches and gap openings obtained by applying Blastn [44] against the FASTA of ab1 files retrieved from BOLD
| | Dataset 1 (841 plant marker genes) | | | Dataset 2 (558 animal marker genes) | | | Dataset 3 (490 fungi marker gene) | | |
|---|---|---|---|---|---|---|---|---|---|
| | PIPEBAR | SeqTrace | Geneious | PIPEBAR | SeqTrace | Geneious | PIPEBAR | SeqTrace | Geneious |
| Resulting barcodes | 830 | 841 | 829 | 557 | 558 | 555 | 448 | 487 | 438 |
| Mean % identity | 99.88 ± 0.17 | 99.68 ± 0.41 | 99.92 ± 0.12 | 99.88 ± 0.16 | 99.56 ± 0.44 | 99.91 ± 0.11 | 99.67 ± 0.52 | 98.79 ± 1.73 | 99.73 ± 0.43 |
| Mean length (bp) | 557.51 ± 49.2 | 575 ± 161.6 | 549.81 ± 42.5 | 637.82 ± 28.98 | 638.54 ± 29.36 | 638.45 ± 28.85 | 618.50 ± 48.01 | 585.52 ± 81.53 | 619 ± 45.3 |
| Mismatches | 372 | 941 | 294 | 140 | 383 | 96 | 267 | 930 | 224 |
| Gap openings | 91 | 266 | 41 | 17 | 115 | 12 | 82 | 341 | 72 |

Run time (s): only two values per dataset are given, without tool attribution: 367 and 197 (Dataset 1), 296 and 98 (Dataset 2), 231 and 160 (Dataset 3)
Fig. 3 Merging process for paired-end reads. a The OverlapPER script first finds a seed (a short sequence in one of the reads, shown in bold). b The reads are positioned according to the seed found and the total overlap is determined. c The total overlap is analyzed. If there is a hit in the alignment, the identity score is incremented. If a base is aligned to a gap, the identity score is incremented. In case of a mismatch in the alignment, if the next 5 bases (tolerance) are identical, the mismatch score is incremented; otherwise a gap insertion is repeated 4 times until the next 5 bases are identical. Nucleotides in bold represent a hit in the alignment
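The seed-and-position idea in this caption can be sketched in Python under strong simplifying assumptions. This is not OverlapPER's code: the sketch handles substitutions only (no gap insertion or 5-base tolerance), takes the seed from the start of the reverse-complemented mate, and keeps the forward base wherever the two reads disagree. The seed length of 8 is hypothetical; the minimum overlap of 10 bp and minimum identity of 90% mirror the benchmark parameters reported with the tool comparison table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, seed_len=8, min_overlap=10, min_identity=0.90):
    """Merge a forward read with its reverse mate, or return None.

    Substitution-only sketch: indels are not handled.
    """
    mate = reverse_complement(rev)
    seed = mate[:seed_len]           # short exact sequence used as anchor
    pos = fwd.find(seed)             # position the mate along the forward read
    if pos == -1:
        return None
    overlap_len = min(len(fwd) - pos, len(mate))
    if overlap_len < min_overlap:
        return None
    # Count identical positions across the whole overlap.
    hits = sum(a == b for a, b in zip(fwd[pos:pos + overlap_len], mate))
    if hits / overlap_len < min_identity:
        return None
    # Forward read plus the part of the mate that extends beyond it.
    return fwd + mate[overlap_len:]
```

A pair whose overlap is shorter than `min_overlap`, or whose overlap identity falls below `min_identity`, is left unmerged and `None` is returned.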
Fig. 4 Example of an erroneous sequence generated by SeqTrace. Sequences PCCMN351-FWD and PCCMN351-REV are the trimmed sequences that should overlap, originating from Dataset 1. SeqTrace erroneously inserts gaps (from base 100 to 500) that connect the two sequences without having any insert distance information. The figure was generated using Geneious by aligning the PCCMN351-FWD and PCCMN351-REV sequences to their respective barcode generated by SeqTrace. The qualities of the forward and reverse sequences are shown as histograms. As SeqTrace does not generate a FASTQ file, we could not evaluate the quality of the barcode generated
Overlap similarities and lengths of the sequence pairs in Dataset 1 that were assembled by neither PIPEBAR nor Geneious
| Sequence ID | Overlap similarity (%) | Overlap length (bp) |
|---|---|---|
| BBYUK2200-ITS | – | 0 |
| MKTRT2524-rbcL | – | 0 |
| PCCMN290-ITS | – | 0 |
| PCCMN303-ITS | – | 0 |
| PCUBC495-ITS | – | 0 |
| PCUBC568-ITS | 100% | 20 |
| PCUBC799-ITS | 91% | 12 |
| VASCB012-ITS | 27.3% | 189 |
| VASCB062-ITS | 40.9% | 104 |
The similarities were calculated by aligning the overlapping regions from each sequence pair using MAFFT [45]
Analysis of PIPEBAR's barcodes with respect to the mean percentage similarity of the generated sequences to their respective Geneious sequences, the mean length of the alignment, and the total number of mismatches and gap openings obtained by applying Blastn against the Geneious reference sequences
| | Mean % identity | Mean length (bp) | Mismatches | Gap openings |
|---|---|---|---|---|
| Dataset 1 | 99.9 ± 0.08 | 545.74 ± 105.95 | 191 | 45 |
| Dataset 2 | 99.97 ± 0.03 | 669.43 ± 19.26 | 59 | 21 |
| Dataset 3 | 99.93 ± 0.12 | 596.72 ± 93.72 | 57 | 42 |
Results obtained by OverlapPER, PEAR, FLASH, COPE and BBMerge
| Tool | Total merged pairs | % merged pairs | Mean length of merged sequences | Mean % identity | Mean mismatch | Mean gap opening | Run time (s) |
|---|---|---|---|---|---|---|---|
| OverlapPER | 999,706 | 99.97% | 391.69 ± 18.69 | 97.52% ± 0.70% | 7.10 ± 2.66 | 2.62 ± 0.9 | 511.26 ± 6.87 |
| PEAR | 995,648 | 99.56% | 391.19 ± 20.87 | 96.22% ± 1.31% | 11.67 ± 4.87 | 3.04 ± 1.18 | 1363.37 ± 3.22 |
| FLASH | 326,686 | 32.67% | 391.90 ± 19.54 | 97.38% ± 0.73% | 7.55 ± 2.87 | 2.71 ± 0.97 | 49.93 ± 2.74 |
| COPE | 292,303 | 29.23% | 392.34 ± 19.46 | 97.45% ± 0.71% | 7.30 ± 2.77 | 2.70 ± 0.97 | 468.45 ± 0.81 |
| BBMerge | 201,842 | 20.18% | 392.66 ± 19.25 | 97.49% ± 0.70% | 7.03 ± 2.69 | 2.83 ± 0.95 | 25.23 ± 0.79 |
Parameters: minimum overlap of 10 bp and minimum identity of 90%. Mean identity, mismatch and gap openings are shown in comparison to the reference genome
A total of 1,000,000 simulated reads were used as input for the evaluated tools. The results show the absolute number of merged sequence pairs, the percentage of merged pairs, the mean length of the resulting merged sequences, the mean percentage identity when aligning the resulting sequences to the reference genome, the mean number of mismatches and gap openings resulting from the alignment, and the mean run time taken by each tool