Jinhwa Kong, Sun Huh, Jung-Im Won, Jeehee Yoon, Baeksop Kim, Kiyong Kim.
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as a case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
Year: 2019 PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Overview of the genome analysis process of the GAAP pipeline system. The overall workflow of the system is shown, and all software tools and annotation databases are summarized.
SOAPec commands.
| KmerFreq_AR -k 17 -t 10 -p prefix readlist.txt |
| Corrector_AR -k 17 -l 3 -r 50 -t 10 prefix.freq.cz prefix.freq.cz.len readlist.txt |
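Both SOAPec commands take readlist.txt, a plain-text list of the FASTQ files to process, one path per line. The following is a minimal sketch for building that list, assuming the reads follow the `*_1.fastq` / `*_2.fastq` naming used elsewhere in this section. The function name `make_readlist` and the directory layout are hypothetical, and keeping the two mates of a pair on adjacent lines is an assumption that should be checked against the SOAPec documentation.

```shell
# Sketch: write one FASTQ path per line into a read list for SOAPec.
# make_readlist and the directory layout are illustrative assumptions.
make_readlist() {
    read_dir=$1
    out_file=$2
    : > "$out_file"                          # start with an empty list
    for fwd in "$read_dir"/*_1.fastq; do
        [ -e "$fwd" ] || continue            # skip an unmatched glob pattern
        rev=${fwd%_1.fastq}_2.fastq          # derive the mate file name
        printf '%s\n' "$fwd" >> "$out_file"
        if [ -e "$rev" ]; then               # assumption: mates sit on adjacent lines
            printf '%s\n' "$rev" >> "$out_file"
        fi
    done
}

# Demo with empty placeholder files so the sketch runs without real data.
demo_dir=$(mktemp -d)
touch "$demo_dir/DNAread170_1.fastq" "$demo_dir/DNAread170_2.fastq"
make_readlist "$demo_dir" "$demo_dir/readlist.txt"
cat "$demo_dir/readlist.txt"
```

The resulting file can then be passed as the last argument of KmerFreq_AR and Corrector_AR.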
FastqToSam command.
| java -jar picard.jar FastqToSam F1=forward_reads.fq F2=reverse_reads.fq |
| O=unaligned_read_pairs.sam SM=sample001 |
MarkDuplicates command.
| java -jar picard.jar MarkDuplicates I=unaligned_read_pairs.sam O=output_duplicate.sam |
| M=output_duplicate_report.txt REMOVE_DUPLICATES=true |
SOAPdenovo configuration file.
| [LIB] |
| max_rd_len=101 |
| avg_ins=170 |
| reverse_seq=0 |
| asm_flags=3 |
| rank=1 |
| q1=DNAread170_1.fastq |
| q2=DNAread170_2.fastq |
| [LIB] |
| max_rd_len=101 |
| avg_ins=2900 |
| reverse_seq=1 |
| asm_flags=2 |
| rank=2 |
| q1=DNAread2900_1.fastq |
| q2=DNAread2900_2.fastq |
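In this configuration, each [LIB] stanza describes one sequencing library: avg_ins is the average insert size, reverse_seq is 0 for forward-reverse short-insert pairs and 1 for reverse-forward mate pairs, asm_flags selects where the reads are used (3 for both contig and scaffold construction, 2 for scaffolding only), and rank sets the order in which libraries are used during scaffolding. When several libraries are involved, the stanzas can be generated rather than typed by hand. The sketch below assumes a hypothetical helper `lib_stanza` and writes the same two libraries as the table above.

```shell
# Sketch: emit one SOAPdenovo [LIB] stanza per library.
# lib_stanza is a hypothetical helper; the keys mirror the configuration table.
lib_stanza() {
    avg_ins=$1; reverse_seq=$2; asm_flags=$3; rank=$4; prefix=$5
    cat <<EOF
[LIB]
max_rd_len=101
avg_ins=$avg_ins
reverse_seq=$reverse_seq
asm_flags=$asm_flags
rank=$rank
q1=${prefix}_1.fastq
q2=${prefix}_2.fastq
EOF
}

cfg_dir=$(mktemp -d)
{
    lib_stanza 170  0 3 1 DNAread170     # short-insert paired-end library
    lib_stanza 2900 1 2 2 DNAread2900    # long-insert mate-pair library
} > "$cfg_dir/config.file"
cat "$cfg_dir/config.file"
```

Adding another library is then a single extra `lib_stanza` call with its own insert size and rank.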
SOAPdenovo command.
| SOAPdenovo all -s config.file -K kmerlength -o outprefix |
Preparing ALLPATHS input files.
| PrepareAllPathsInputs.pl DATA_DIR=/data PLOIDY=1 |
| IN_GROUPS_CSV=in_groups.csv IN_LIBS_CSV=in_libs.csv |
Running ALLPATHS command.
| RunAllPathsLG PRE=<pre> REFERENCE_NAME=test.genome |
| DATA_SUBDIR=data RUN=run SUBDIR=test |
ABySS command.
| abyss-pe k=kmerlength name=outprefix lib='pea peb' mp='mp1' |
| pea='DNAread170_1.fastq DNAread170_2.fastq' |
| peb='DNAread400_1.fastq DNAread400_2.fastq' |
| mp1='DNAread2900_1.fastq DNAread2900_2.fastq' |
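The `kmerlength` placeholder is left open in the command above. One common approach, not necessarily the one used in GAAP, is to assemble with several k values in separate directories and compare the resulting contiguity. The sketch below only prints the commands (a dry run), so it can be inspected before anything is executed. The function name `print_kmer_sweep` and the k range are illustrative, and the `../` prefixes assume the reads sit one level above each per-k directory.

```shell
# Sketch: print (not run) one abyss-pe command per candidate k-mer length.
# print_kmer_sweep and the k range are illustrative assumptions.
print_kmer_sweep() {
    k_min=$1; k_max=$2; k_step=$3
    libs="lib='pea peb' mp='mp1'"
    libs="$libs pea='../DNAread170_1.fastq ../DNAread170_2.fastq'"
    libs="$libs peb='../DNAread400_1.fastq ../DNAread400_2.fastq'"
    libs="$libs mp1='../DNAread2900_1.fastq ../DNAread2900_2.fastq'"
    k=$k_min
    while [ "$k" -le "$k_max" ]; do
        # Each run gets its own directory (k41, k51, ...) so outputs do not collide.
        printf 'mkdir -p k%s && abyss-pe -C k%s k=%s name=outprefix %s\n' \
            "$k" "$k" "$k" "$libs"
        k=$((k + k_step))
    done
}

print_kmer_sweep 41 61 10
```

Piping the printed lines to `sh` would launch the assemblies one after another; reviewing them first is the point of the dry run.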
Velveth command.
| velveth output_directory/ hash_length -fastq |
| -shortPaired DNAread170_1.fastq DNAread170_2.fastq |
| -shortPaired2 DNAread400_1.fastq DNAread400_2.fastq |
| -longPaired DNAread2900_1.fastq DNAread2900_2.fastq |
Velvetg command.
| velvetg output_directory/ -cov_cutoff auto -exp_cov auto -ins_length 170 -ins_length2 400 |
| -ins_length_long 2900 -scaffolding yes |
GapCloser command.
| GapCloser -a scaffold.scafSeq -b config.file -o genome.fasta -l readlength |
Trinity command.
| Trinity - |
| - |
RepeatMasker command.
| RepeatMasker - |
Augustus command.
| augustus - |
An example of maker_opt.ctl in MAKER.
| #- |
| genome=genome.fasta |
| #- |
| est=trinity.fasta |
| #- |
| protein=#protein sequence file in fasta format |
| #- |
| snaphmm=similar_snap.hmm |
| augustus_species=similar_species |
MAKER command.
| maker |
Figure 2. Workflow of running MAKER in GAAP. First, MAKER is run using scaffolds, transcripts, and proteins from similar species, and the results are used to train SNAP and Augustus. Next, the trained models are fed back into MAKER, along with the assembled scaffolds and transcripts, to obtain the final annotation results.
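Between the two MAKER rounds described in Figure 2, the control file has to point at the newly trained models instead of those from a similar species. A minimal sketch of that edit follows, assuming a hypothetical helper `set_ctl_key` (not part of MAKER) and placeholder names for the trained models. It rewrites the whole `key=value` line, so any trailing comment on that line is dropped.

```shell
# Sketch: update key=value entries in a MAKER control file between rounds.
# set_ctl_key is a hypothetical helper, not part of MAKER.
set_ctl_key() {
    ctl_file=$1; key=$2; value=$3
    tmp_file=$(mktemp)
    # Rewrite only the line that starts with "key="; copy all other lines unchanged.
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 { print k "=" v; next }
        { print }
    ' "$ctl_file" > "$tmp_file" && mv "$tmp_file" "$ctl_file"
}

# Demo on a cut-down control file holding the keys shown in maker_opt.ctl.
ctl=$(mktemp)
printf '%s\n' 'genome=genome.fasta' 'est=trinity.fasta' \
    'snaphmm=similar_snap.hmm' 'augustus_species=similar_species' > "$ctl"

# Second round: swap in models trained on first-round output (placeholder names).
set_ctl_key "$ctl" snaphmm trained_snap.hmm
set_ctl_key "$ctl" augustus_species trained_species
cat "$ctl"
```

Keeping one control file per round, rather than editing in place, makes it easier to reproduce each pass later.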
EVM weights file.
| ABINITIO_PREDICTION augustus 1 |
| ABINITIO_PREDICTION maker 1 |
| PROTEIN genewise_protein_alignments 5 |
| TRANSCRIPT PASA_transcript_assemblies 10 |
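Each row of the weights file gives an evidence class, the evidence source name, and a numeric weight. A higher weight means EVM trusts that evidence more, so here the PASA transcript assemblies (10) outrank the protein alignments (5) and the ab initio predictions (1). Because a malformed row is easy to miss, a quick format check can help before running EVM. The sketch below assumes a hypothetical function `check_evm_weights` and the four class names listed in its pattern; it checks layout only, not whether the source names match the evidence files.

```shell
# Sketch: sanity-check the three-column layout of an EVM weights file.
# check_evm_weights is a hypothetical helper; it returns non-zero on any problem.
check_evm_weights() {
    awk '
        NF == 0 { next }                                   # ignore blank lines
        NF != 3 { printf "line %d: expected 3 fields\n", NR; bad = 1; next }
        $1 !~ /^(ABINITIO_PREDICTION|OTHER_PREDICTION|PROTEIN|TRANSCRIPT)$/ {
            printf "line %d: unknown evidence class %s\n", NR, $1; bad = 1
        }
        $3 !~ /^[0-9]+(\.[0-9]+)?$/ {
            printf "line %d: weight is not a number\n", NR; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo using the rows from the weights table.
weights=$(mktemp)
printf '%s\t%s\t%s\n' \
    ABINITIO_PREDICTION augustus 1 \
    ABINITIO_PREDICTION maker 1 \
    PROTEIN genewise_protein_alignments 5 \
    TRANSCRIPT PASA_transcript_assemblies 10 > "$weights"
check_evm_weights "$weights" && echo "weights file OK"
```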
EVM partitioning command.
| EvmUtils/partition_EVM_inputs.pl - |
| - |
| - |
| - |
EVM execution command.
| EvmUtils/write_EVM_commands.pl - |
| - |
| - |
| - |
| EvmUtils/execute_EVM_commands.pl commands.list |
EVM combining command.
| EvmUtils/recombine_EVM_partial_outputs.pl - |
EVM conversion command.
| EvmUtils/convert_EVM_outputs_to_GFF3.pl - |
| - |
PASA database generation command.
| scripts/Load_Current_Gene_Annotations.dbi -c config.file -g genome.fasta -P evm.out.gff3 |
PASA configuration file.
| # MySQL settings |
| MYSQLDB=myPasaDB |
PASA execution command.
| scripts/Launch_PASA_pipeline.pl -c config.file -A -g genome.fasta -t trinity.fasta |