James C Estill, Jeffrey L Bennetzen.
Abstract
BACKGROUND: High quality annotation of the genes and transposable elements in complex genomes requires a human-curated integration of multiple sources of computational evidence. This evidence includes results from a diversity of ab initio prediction programs as well as homology-based searches. Most of these programs operate on a single contiguous sequence at a time, and the results are generated in a diverse array of readable formats that must be translated to a standardized file format. These translated results must then be concatenated into a single source and presented in an integrated form for human curation.
Year: 2009 PMID: 19545381 PMCID: PMC2705364 DOI: 10.1186/1746-4811-5-8
Source DB: PubMed Journal: Plant Methods ISSN: 1746-4811 Impact factor: 4.993
Figure 1. An overview of the workflow supported by the current version of the DAWGPAWS suite of programs.
DAWGPAWS annotation scripts for generating computational annotation results in batch mode.
| Program | Feature type | Script |
| --- | --- | --- |
| EuGène | Gene | batch_eugene.pl |
| GeneID | Gene | batch_geneid.pl |
| GeneMark.hmm | Gene | batch_genemark.pl |
| Genscan | Gene | batch_genescan.pl |
| Find_LTR | TE | batch_findltr.pl* |
| LTR_STRUC | TE | batch_ltrstruc.vbs |
| LTR_FINDER | TE | batch_ltrfinder.pl* |
| LTR_seq | TE | batch_ltrseq.pl* |
| FINDMITE | TE | batch_findmite.pl* |
| Tandem Repeats Finder | Repeat | batch_trf.pl |
| HMMER | TE homology | batch_hmmer.pl* |
| NCBI-BLAST | TE and gene homology | batch_blast.pl* |
| RepeatMasker | TE homology | batch_repmask.pl* |
| TEnest | TE homology | batch_tenest.pl |
These scripts operate on a directory of FASTA files and generate both the native results of the annotation program and results in the GFF file format. The exception is the batch_ltrstruc.vbs Visual Basic script, which must be used in conjunction with cnv_ltrstruc2gff.pl to generate results in GFF.
* Indicates programs that make use of a configuration file. The nature and format of the configuration file is described in the individual help file for each of those programs.
DAWGPAWS scripts for conversion of annotation results from native program output to GFF.
| Program | Feature type | Script |
| --- | --- | --- |
| FGENESH | Gene | cnv_fgenesh2gff.pl |
| GeneMark.hmm | Gene | cnv_genemark2gff.pl |
| Find_LTR | TE | cnv_findltr2gff.pl |
| LTR_FINDER | TE | cnv_ltrfinder2gff.pl |
| LTR_seq | TE | cnv_ltrseq2gff.pl |
| LTR_STRUC | TE | cnv_ltrstruc2gff.pl |
| RepSeek | TE | cnv_repseek2gff.pl |
| NCBI-BLAST | TE and gene homology | cnv_blast2gff.pl |
| RepeatMasker | TE homology | cnv_repmask2gff.pl |
| TEnest | TE homology | cnv_tenest2gff.pl |
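As a rough illustration of the target of these conversions, the sketch below formats a single feature as one tab-delimited, nine-column GFF line. It is not DAWGPAWS code: the function name `format_gff_line` and all feature values are hypothetical, and the sketch assumes the common GFF convention of 1-based, inclusive coordinates with `.` for missing values.

```python
def format_gff_line(seqname, source, feature, start, end,
                    score=None, strand=".", frame=".", attribute=""):
    """Return one tab-delimited, nine-column GFF line.

    Coordinates are assumed to be 1-based and inclusive. GFF expects
    start <= end, so reversed coordinates are swapped.
    """
    if start > end:
        start, end = end, start
    score_text = "." if score is None else str(score)
    fields = [seqname, source, feature, str(start), str(end),
              score_text, strand, frame, attribute]
    return "\t".join(fields)


# Hypothetical feature values, for illustration only.
line = format_gff_line("contig_01", "ltr_finder", "LTR_retrotransposon",
                       5800, 1200, strand="+", attribute="ltr_001")
print(line)
```

Keeping every converter's output in this one column layout is what allows results from different programs to be concatenated into a single file for curation.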
Additional helper scripts included in the DAWGPAWS package.
| Script | Description |
| --- | --- |
| cnv_gff2game.pl | Converts GFF files to the game.xml format. |
| cnv_game2gff3.pl | Converts game.xml files to the GFF3 format. |
| batch_hardmask.pl | Given a directory of lowercase masked sequence files, this will replace lowercase residues with an N or X to indicate masking. |
| dir_merge.pl | Given annotation results scattered across multiple directories, this program can merge the results into subdirectories in a single parent directory. |
| vennseq.pl | Given GFF annotation results from multiple methods, this program generates an Euler diagram of these features using the VennMaster program. |
| batch_findgaps.pl | This program will annotate gaps in the query sequences in the input directory. |
| clust_write_shell.pl | This program writes shell scripts to run DAWGPAWS in a cluster environment running the Platform LSF queuing system. |
| cnv_seq2dir.pl | Given a FASTA file with multiple sequence records, this program generates a separate FASTA file for each sequence record. The sequence files produced are named using the sequence ID in the FASTA header in the input file. |
| fasta_merge.pl | This program merges all FASTA files in a directory into a single FASTA file. |
| fasta_shorten.pl | This program shortens the FASTA header by limiting the header length, or by splitting the header at a delimiting character. Some annotation programs limit the length of the FASTA header they accept, and this program allows input files to meet that limitation. |
| fetch_tenest.pl | Fetches multiple results from the Plant GDB TEnest server and converts the results to GFF. |
| gff_seg.pl | Given a GFF file that contains point or segment data, this will extract segments with score values that exceed a threshold value. |
| ltrstruc_prep.pl | Because the LTR_STRUC program only runs under the Windows environment, this program converts FASTA sequences from UNIX to DOS line endings and generates the name and flist files required for LTR_STRUC. |
| seq_oligiocount.pl | This program allows for the generation of a GFF file that counts the number of times an oligomer in the genomic contig occurs in a reference shotgun sequence database. |
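To make two of the helper tasks above concrete, the sketch below shows hard masking of lowercase (soft-masked) residues, as described for batch_hardmask.pl, and a simple gap finder in the spirit of batch_findgaps.pl. This is not the DAWGPAWS implementation: the function names are hypothetical, and treating a gap as a run of `N` characters of some minimum length is an assumption of this example.

```python
import re


def hard_mask(sequence, mask_char="N"):
    """Replace every lowercase (soft-masked) residue with mask_char."""
    return "".join(mask_char if base.islower() else base for base in sequence)


def find_gaps(sequence, min_length=1):
    """Return (start, end) pairs for runs of N, 1-based and inclusive.

    Assumption: a gap is any run of N or n at least min_length long.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start() + 1, match.end()))
    return gaps


# Hypothetical soft-masked sequence containing one run of five Ns.
soft_masked = "ACGTacgtacGGCCNNNNNTTAA"
print(hard_mask(soft_masked))
print(find_gaps(soft_masked, min_length=3))
```

Note that gaps are located on the original sequence here. After hard masking with `N`, masked repeats and true gaps can no longer be told apart, which is one reason to mask with `X` instead.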
Common command line options used throughout the DAWGPAWS suite of programs.
| Option | Description |
| --- | --- |
| --indir | For batch scripts, this indicates the input directory containing the FASTA files to annotate. For conversion scripts, this indicates the input file to convert from the native format to the GFF format. |
| --outdir | For batch scripts, this indicates the output directory containing the annotation results for the program and the GFF results. |
| --config | For programs that make use of a configuration file, this indicates the path to the configuration file to use. |
| --seqname | For conversion scripts, this indicates the sequence id to use in the GFF output file. |
| --param | For conversion scripts, this indicates the name of the parameter set used with the annotation program. This option allows the user to distinguish among multiple parameter sets for the same annotation program, and this parameter name is appended to the source column of the GFF output file. |
| --program | For conversion scripts, this indicates the name of the program used to generate the annotation result. |
| --version | Print the current version of the script. |
| --usage | Print a short program usage message. |
| --help | Print a short help message including the common usage and all program options available at the command line. |
| --man | Print the full program manual. |
| --verbose | This will run the program with maximum verbosity. This option will generate status updates while the program is running, and will maximize the error reporting functions of the script. All verbose statements are written to the standard error output stream. |
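The following sketch shows how the shared options could be assembled into an argument list for one of the batch scripts. It only builds the command and does not run it. The helper name `build_batch_command` and the directory and configuration file paths are hypothetical, and invoking the script directly by name assumes it is executable and on the search path.

```python
def build_batch_command(script, indir, outdir, config=None, verbose=False):
    """Assemble an argument list from the common DAWGPAWS options."""
    command = [script, "--indir", indir, "--outdir", outdir]
    if config is not None:
        # Only some batch scripts make use of a configuration file.
        command += ["--config", config]
    if verbose:
        command.append("--verbose")
    return command


# Hypothetical paths, for illustration only.
command = build_batch_command("batch_blast.pl", "fasta_in/", "annot_out/",
                              config="blast_config.txt", verbose=True)
print(" ".join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces and the list is later passed to a process launcher.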
Figure 2. Screen capture image of gene and TE annotation results visualized in the Apollo genome annotation program. The example shown is for a wheat BAC that has been annotated and curated with the assistance of DAWGPAWS.
Figure 3. Screen capture image of the TE annotation results and oligomer counts visualized in the GBrowse genome annotation visualization program. The example shown is for a 15 kb segment of a BAC with a wheat DNA insert.