Wanfei Liu, Shuangyang Wu, Qiang Lin, Shenghan Gao, Feng Ding, Xiaowei Zhang, Hasan Awad Aljohi, Jun Yu, Songnian Hu.
Abstract
The rapid development of high-throughput sequencing technologies has dramatically reduced the cost and time required for de novo genome sequencing and genome resequencing projects, and new genome sequences are now released every week. The resulting plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks efficiently, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. On the four DNA-seq datasets tested in this study, RGAAT detects sequence variants with precision, specificity, and sensitivity comparable to GATK, and with higher precision and specificity than Freebayes and SAMtools. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost.
Keywords: Genome annotation; Genome assembly; Genome comparison; Variant identification
Year: 2018 PMID: 30583062 PMCID: PMC6364042 DOI: 10.1016/j.gpb.2018.03.006
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. Workflow of RGAAT
RGAAT includes three main modules: variant identification, coordinate conversion, and genome assembly/annotation. Variant identification is based on sequence alignment or genome comparison results; annotation transfer is based on the coordinate conversion derived from the variant calling results. The output of RGAAT includes a new genome assembly and its annotation.
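As a rough illustration of how sequence variants can drive coordinate conversion, the following Python sketch shifts a reference position by the net length change of all upstream insertions and deletions. It is not RGAAT's implementation: the function name `ref_to_query`, the VCF-style tuple layout, and the toy variants are assumptions made for this example.

```python
def ref_to_query(pos, variants):
    """Map a 1-based reference position to a query position.

    variants: non-overlapping (ref_pos, ref_allele, alt_allele) tuples in a
    VCF-like layout, where an indel shares its first base with the reference.
    Returns None when the reference base is deleted in the query.
    """
    offset = 0
    for vpos, ref, alt in sorted(variants):
        span_end = vpos + len(ref) - 1
        if pos < vpos:
            break  # all remaining variants lie downstream of pos
        if pos > span_end:
            # Variant is fully upstream: apply its net length change.
            offset += len(alt) - len(ref)
        elif pos - vpos >= len(alt):
            return None  # position falls inside a deleted stretch
    return pos + offset


# Hypothetical variants: a 2-bp insertion after position 5 and a 2-bp
# deletion of reference positions 11-12.
toy_variants = [(5, "A", "ATT"), (10, "GCC", "G")]
print(ref_to_query(6, toy_variants))   # → 8
print(ref_to_query(11, toy_variants))  # → None
print(ref_to_query(13, toy_variants))  # → 13
```

Substitutions leave the offset unchanged, so only indels move downstream coordinates in this sketch.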
Figure 2. Workflow of variant identification based on sequence alignment (SAM/BAM) in RGAAT
Variant identification based on sequence alignment includes three stages: read processing, variant discovery, and variant filtering. During the variant discovery stage, RGAAT applies a combination of criteria (listed in the box with dashed blue borders) to increase the sensitivity of identification. During the variant filtering stage, RGAAT applies a combination of criteria (listed in the box with dashed orange borders) to ensure the accuracy of identification. Finally, all attributes of the candidate variants are recorded to reduce the false positive rate.
The four HiSeq datasets analyzed in this study
| Data type | 100-bp paired-end exome | 101-bp paired-end exome | 101-bp paired-end genome | 101-bp paired-end genome |
| Data source | GCAT | PRJCA001139 | PRJCA000261 | PRJCA000261 |
| Coverage | 30× | 77× | 5167× | 1788× |
| Mapping software | Bowtie2(2.2.4) | BWA(0.7.10-r789) | Bowtie2(2.2.4) | Bowtie2(2.2.4) |
| No. of mapped reads | 17,884,489 | 60,821,606 | 9,402,591 | 9,363,153 |
| No. (percentage) of mapped clean reads | 17,167,791 (95.99%) | 59,862,730 (98.42%) | 9,338,188 (99.32%) | 9,198,701 (98.24%) |
| No. of raw variants | 789,170 | 1,182,110 | 157,076 | 488,918 |
| No. of variants after the first filtering | 189,106 | 273,382 | 223 | 742 |
| No. of final filtered variants | 123,660 | 201,100 | 221 | 742 |
Note: dbSNP for human samples and the manually curated variants for chloroplast and mitochondria sequences were used to evaluate the performance of variant calling in RGAAT. During the read filtering step, unmapped reads, multi-mapped reads, PCR duplicates, reads with low quality or high mismatch, and paired-end reads mapped to different chromosomes or separated by a large distance were removed. At the first filtering step, variants with low average read base quality, low uniquely-mapped allele frequency, high reference frequency, low read depth, or low variant read number were removed. Variants with low read depth, low allele frequency, or low average read quality were then filtered out to obtain the final variants. GCAT, Genome Comparison and Analytic Testing platform.
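The filtering criteria named in the note (read depth, variant read number, allele frequency, and average base quality) can be pictured as a chain of threshold checks. The sketch below is only an illustration: the `Candidate` fields, the `passes_filters` function, and every threshold value are hypothetical and are not taken from RGAAT.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pos: int                  # 1-based reference position
    depth: int                # reads covering the site
    alt_reads: int            # reads supporting the variant allele
    mean_base_quality: float  # average Phred quality of supporting bases


def passes_filters(c, min_depth=10, min_alt_reads=3,
                   min_allele_freq=0.2, min_base_quality=20.0):
    """Apply depth, read-count, frequency, and quality cut-offs in turn.

    All default thresholds are placeholders chosen for this example.
    """
    if c.depth <= 0 or c.depth < min_depth:
        return False
    if c.alt_reads < min_alt_reads:
        return False
    if c.alt_reads / c.depth < min_allele_freq:
        return False
    return c.mean_base_quality >= min_base_quality


candidates = [
    Candidate(101, depth=40, alt_reads=18, mean_base_quality=35.0),
    Candidate(202, depth=4, alt_reads=3, mean_base_quality=36.0),    # low depth
    Candidate(303, depth=50, alt_reads=2, mean_base_quality=34.0),   # few variant reads
    Candidate(404, depth=30, alt_reads=12, mean_base_quality=12.0),  # low quality
]
kept = [c.pos for c in candidates if passes_filters(c)]
print(kept)  # → [101]
```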
Figure 3. Workflow of variant identification based on genome comparison in RGAAT
First, the genome comparison is performed with BLAT, and the redundant reads are filtered out by the combination of processes indicated in the box with dashed black borders. Second, variants including SNPs and INDELs are identified as indicated in the box with dashed blue borders. Finally, the variants are ready for downstream analysis.
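To make the SNP and INDEL identification step concrete, here is a minimal Python sketch that walks a gapped pairwise alignment column by column and reports mismatches and gap runs. It assumes the alignment is already available as two equal-length strings with `-` for gaps; the function name `call_variants` and the output tuple layout are invented for this example and do not reflect RGAAT's internal format or BLAT's output.

```python
def call_variants(ref_aln, qry_aln):
    """Return (ref_pos, type, ref_allele, alt_allele) tuples.

    ref_pos is 1-based. An insertion is reported at the reference base that
    precedes it; a deletion is reported at its first deleted reference base.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0  # position of the last reference base consumed
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == "-":
            # Gap in the reference: collect the whole inserted run.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            variants.append((ref_pos, "INS", "", qry_aln[i:j]))
            i = j
        elif qry_aln[i] == "-":
            # Gap in the query: collect the whole deleted run.
            j = i
            while j < n and qry_aln[j] == "-":
                j += 1
            variants.append((ref_pos + 1, "DEL", ref_aln[i:j], ""))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if ref_aln[i].upper() != qry_aln[i].upper():
                variants.append((ref_pos, "SNP", ref_aln[i], qry_aln[i]))
            i += 1
    return variants


print(call_variants("ACGT-ACGTA", "ACCTGAC--A"))
# → [(3, 'SNP', 'G', 'C'), (4, 'INS', '', 'G'), (7, 'DEL', 'GT', '')]
```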
Figure 4. Workflow of annotation transfer
The annotation transfer pipeline integrates two sets of information: the transfer process based on variant identification (blue) and the transfer process based on genome comparison (yellow). The integrated annotations are classified into three categories (dashed black borders): no locus transferred, two loci transferred, and one locus transferred. Annotations in the first category are discarded, those in the second are transferred directly, and those in the third require two extra steps (dashed blue borders) to infer the untransferred locus by extension into both the upstream and downstream regions.
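The three categories can be sketched in Python under one reading of the caption: that the two "loci" are the start and end coordinates of a feature, and that extension means searching nearby positions for one that can be converted. Both points are assumptions, as are the names `classify_feature` and `rescue_by_extension`, the `max_extend` limit, and the toy coordinate map.

```python
def classify_feature(start, end, convert):
    """Count how many of the two boundary coordinates convert() can map."""
    mapped = sum(convert(p) is not None for p in (start, end))
    labels = ("no locus transferred",
              "one locus transferred",
              "two loci transferred")
    return labels[mapped]


def rescue_by_extension(pos, convert, max_extend=50):
    """Return the query coordinate of the nearest convertible neighbour.

    Positions are tried at increasing distance, upstream first and then
    downstream; None is returned if nothing maps within max_extend bases.
    """
    for step in range(1, max_extend + 1):
        for neighbour in (pos - step, pos + step):
            hit = convert(neighbour)
            if hit is not None:
                return hit
    return None


# Hypothetical reference-to-query map; unlisted positions do not convert.
toy_map = {100: 102, 101: 103, 105: 105}
convert = toy_map.get

print(classify_feature(100, 105, convert))  # → two loci transferred
print(classify_feature(100, 103, convert))  # → one locus transferred
print(rescue_by_extension(103, convert))    # → 103
```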
Comparison of performance in variant identification using different tools
| Sample | Tool | TP | FP | TN | FN | Precision | Sensitivity | Specificity |
| Human (a) | RGAAT | 20,032 | 1432 | 46,467,105 | 3471 | 93.33% | 85.23% | 100.00% |
| | GATK | 19,861 | 1121 | 46,467,416 | 3321 | 94.66% | 85.67% | 100.00% |
| | SAMtools | 20,537 | 3227 | 46,465,310 | 2248 | 86.42% | 90.13% | 99.99% |
| | Freebayes_Q40 | 18,925 | 1854 | 46,466,683 | 4376 | 91.08% | 81.22% | 100.00% |
| Human (b) | RGAAT | 14,487 | 468 | 59,286 | 2045 | 96.87% | 87.63% | 99.22% |
| | GATK | 14,180 | 732 | 59,311 | 2063 | 95.09% | 87.30% | 98.78% |
| | SAMtools | 14,862 | 802 | 59,230 | 1392 | 94.88% | 91.44% | 98.66% |
| | Freebayes_Q40 | 13,688 | 555 | 59,321 | 2722 | 96.10% | 83.41% | 99.07% |
| Human (c) | RGAAT | 38,527 | 785 | 48,223,655 | 2,926,351 | 98.00% | 1.30% | 100.00% |
| | GATK | 42,773 | 1553 | 48,222,887 | 2,922,105 | 96.50% | 1.44% | 100.00% |
| | SAMtools | 41,963 | 1882 | 48,222,558 | 2,922,915 | 95.71% | 1.42% | 100.00% |
| Chloroplast (d) | RGAAT | 198 | 23 | 154,300 | 14 | 89.59% | 93.40% | 99.99% |
| | GATK | 163 | 123 | 154,200 | 49 | 56.99% | 76.89% | 99.92% |
| | SAMtools | 181 | 83 | 154,240 | 31 | 68.56% | 85.38% | 99.95% |
| Mitochondria (e) | RGAAT | 624 | 118 | 677,331 | 60 | 84.10% | 91.23% | 99.98% |
| | GATK | 560 | 238 | 677,211 | 124 | 70.18% | 81.87% | 99.96% |
| | SAMtools | 581 | 220 | 677,229 | 103 | 72.53% | 84.94% | 99.97% |
Note: The variants of NIST Genome in a Bottle (GIAB, a) and the variants of Illumina OMNI SNP Array (b) were obtained from GCAT; the dbSNP variants for human exome (c) were deposited in GSA (accession No. PRJCA00113); the manually curated variants for chloroplast (d) and mitochondria (e) were deposited in GSA (accession No. PRJCA00261). TP, true positive; FP, false positive; TN, true negative; FN, false negative.
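The percentages in this table follow from the standard confusion-matrix definitions: precision = TP / (TP + FP), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). The short Python sketch below recomputes them for the RGAAT chloroplast row; the function name `classification_metrics` is an assumption of this example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, sensitivity, and specificity as percentages."""
    return {
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


# RGAAT on the chloroplast dataset: TP=198, FP=23, TN=154,300, FN=14.
metrics = classification_metrics(tp=198, fp=23, tn=154_300, fn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
# precision: 89.59%
# sensitivity: 93.40%
# specificity: 99.99%
```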
Comparison of performance in annotation transfer using RGAAT and RATT
| Comparison | Sample | No. of reference features | No. of target features | Software | No. of transferred features | No. of problematic features | TP | FP | FN | Precision | Sensitivity |
| Between different genome assemblies | Chloroplast | 336 | 336 | RGAAT | 336 | 0 | 336 | 0 | 0 | 100.00% | 100.00% |
| | | | | RATT | 305 | 3 | 303 | 2 | 33 | 99.34% | 90.18% |
| Between different strains | | 8540 | 7996 | RGAAT | 8417 | 29 | 7796 | 318 | 198 | 96.08% | 97.50% |
| | | | | RATT | 8223 | 41 | 7648 | 314 | 348 | 96.06% | 95.65% |
| | | 2430 | - | RGAAT | 2430 | 5 | - | - | - | - | - |
| | | | | RATT | 2430 | 22 | - | - | - | - | - |
| Between different species | | 686 | 661 | RGAAT | 652 | 186 | 636 | 16 | 25 | 97.55% | 96.22% |
| | | | | RATT | 647 | 174 | 633 | 14 | 28 | 97.83% | 95.76% |
| | | 2430 | 1106 | RGAAT | 1401 | 105 | 670 | 671 | 436 | 49.96% | 60.58% |
| | | | | RATT | 993 | 13 | 661 | 321 | 445 | 67.31% | 59.76% |
Note: The number of reference features is the number of annotations in the source genome; the number of target features is the number of reference annotations already available for the target genome; the number of transferred features is the number of annotations transferred by the software based on the annotation of the source genome and the comparison of the two genome sequences; and the number of problematic features is the number of annotations only partially transferred because of interruption by stop codons. TP, FP, FN, precision, and sensitivity are calculated from the number of transferred features and the number of target features. “-” indicates that the query genome contains no feature of this kind; the number of FP is overestimated due to the inclusion of pseudogenes. TP, true positive; FP, false positive; FN, false negative.
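The note does not give explicit formulas, but the reported percentages appear consistent with precision = TP / (TP + FP) and sensitivity = TP / (number of target features). The Python sketch below checks that reading against the RGAAT row for different strains; the formulas are inferred from the table values rather than stated by the authors, and the function name `transfer_metrics` is hypothetical.

```python
def transfer_metrics(tp, fp, n_target):
    """Return (precision, sensitivity) as percentages.

    Assumed definitions: precision is the share of TP among TP + FP, and
    sensitivity is the share of target features recovered as TP.
    """
    precision = 100.0 * tp / (tp + fp)
    sensitivity = 100.0 * tp / n_target
    return precision, sensitivity


# RGAAT, between different strains: TP=7796, FP=318, target features=7996.
precision, sensitivity = transfer_metrics(tp=7796, fp=318, n_target=7996)
print(f"precision: {precision:.2f}%")      # precision: 96.08%
print(f"sensitivity: {sensitivity:.2f}%")  # sensitivity: 97.50%
```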