Jin Xiao1,2, Manoj Kumar Sekhwal3,4, Pingchuan Li5, Raja Ragupathy6, Sylvie Cloutier7, Xiue Wang8, Frank M You9.
Abstract
Pseudogenes are paralogs generated from ancestral functional genes (parents) during genome evolution that contain critical sequence defects, such as a missing promoter, a premature stop codon or frameshift mutations. Generally, pseudogenes are functionless, but recent evidence demonstrates that some of them have potential roles in regulation. The majority of pseudogenes are generated from functional progenitor genes either by gene duplication (duplicated pseudogenes) or retro-transposition (processed pseudogenes). Pseudogenes are primarily identified by comparison to their parent genes. Bioinformatics tools for pseudogene prediction have been developed, among which PseudoPipe, PSF and Shiu's pipeline are publicly available. We compared these three tools using the well-annotated Arabidopsis thaliana genome and its 924 known pseudogenes as a test data set. PseudoPipe and Shiu's pipeline identified ~80% of A. thaliana pseudogenes, of which 94% were shared, while PSF failed to generate adequate results. A need to improve the accuracy of bioinformatics tools for pseudogene prediction in plant genomes was thus identified, with the ultimate goal of improving the quality of genome annotation in plants.
Keywords: bioinformatics tools; duplicated; genome-wide; plants; processed; pseudogenes
Year: 2016 PMID: 27916797 PMCID: PMC5187791 DOI: 10.3390/ijms17121991
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Illustration of pseudogene formation. Note: 1: Repeats associated with retro-transposition; 2: The hashtag symbol (#) indicates deleterious mutations; 3: The dashed box indicates truncation.
Figure 2. The emergence of processed or duplicated pseudogenes. Functional pseudogenes are indicated in the green framework. Solid arrows represent paths supported by references, while dotted arrows show predicted paths. HBBP1: Hemoglobin Subunit Beta Pseudogene 1.
Comparison between processed and duplicated pseudogenes.
| | Processed Pseudogenes | Duplicated Pseudogenes |
|---|---|---|
| 1 | Arise from mRNA that was reverse-transcribed and re-integrated into the genome | Arise from gene duplication |
| 2 | Lack introns and promoters | Possess promoters, exon-intron structure and other upstream regulatory sequences |
| 3 | Possess a poly-A tail at 3′ end | No 3′ poly-A tail |
| 4 | Possess flanking direct repeats associated with TE insertion sites | No flanking direct repeats |
| 5 | Mostly present at different loci from their parent genes | Some are present as a cluster with their parent gene as a consequence of tandem segmental duplication |
| 6 | Have 3′ or 5′ truncations | Have 3′ truncations |
| 7 | Generally shorter | Comparatively longer |
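
The distinguishing features in the table above can be turned into a simple rule-based classifier. The following is a minimal sketch, not a method from any of the tools discussed here: the `Candidate` fields, the vote counting, and the thresholds are all illustrative assumptions, and a real pipeline would derive these features from alignments against the parent gene.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical feature record for one pseudogene candidate."""
    has_introns: bool                  # exon-intron structure retained
    has_poly_a_tail: bool              # poly-A tract at the 3' end
    has_flanking_direct_repeats: bool  # direct repeats at the insertion site
    near_parent: bool                  # clustered with its parent gene


def classify(c: Candidate) -> str:
    """Count features that point to retro-transposition (rows 2-5 of the table).

    The thresholds (>= 3 and <= 1) are assumptions chosen for illustration.
    """
    processed_votes = sum([
        not c.has_introns,
        c.has_poly_a_tail,
        c.has_flanking_direct_repeats,
        not c.near_parent,
    ])
    if processed_votes >= 3:
        return "processed"
    if processed_votes <= 1:
        return "duplicated"
    return "ambiguous"


if __name__ == "__main__":
    retro = Candidate(has_introns=False, has_poly_a_tail=True,
                      has_flanking_direct_repeats=True, near_parent=False)
    tandem = Candidate(has_introns=True, has_poly_a_tail=False,
                       has_flanking_direct_repeats=False, near_parent=True)
    print(classify(retro), classify(tandem))
```

Returning "ambiguous" for mixed evidence reflects that old pseudogenes may have lost their poly-A tail or flanking repeats through later mutation, so a hard two-way split would overstate confidence.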
Bioinformatics pipelines or approaches for predicting pseudogenes.
| Pipeline/Method | Input Data | Brief Description | Availability | * Ref. |
|---|---|---|---|---|
| Harrison’s Approach | Protein and genome sequence, annotation information | Using protein sequences to find pseudogenes in intergenic regions by FASTA alignment; refinement of alignments for validation and classification | Method, not a pipeline tool | [ |
| Sakai’s Approach | cDNA and genome sequence | Using cDNA to search and extract corresponding regions from genome sequence by BLASTn; realignment of cDNAs to extract sequence; using est2genome for classification | Method, not a pipeline tool | [ |
| PPFINDER (Processed Pseudogene Finder) | Gene model and cDNA database | Using cDNA as evidence to determine parent genes in gene models; using parent genes to detect loci missing introns by BLASTN search; removing false candidates | | [ |
| PseudoFinder | Functional genes and genome sequence | Finding homologues of functional genes in a genome; classification into either pseudo or functional categories using Support Vector Machines (SVMs) based on a combination of features by BLASTz analysis | Not available online | [ |
| RetroFinder | GenBank mRNA and genome sequence | Alignment of mRNAs from GenBank to genome sequence by BLASTz; detection of biological features; heuristic weighting for known PPGs | Not available online | [ |
| GIS-PET (Gene identification signature-paired end tag) method | mRNA and genome sequence | Using 5′ and 3′ paired-end-tag (PET) of mRNAs to select candidates based on homology; using the shortest candidate to search the genome by BLAT | Method, not a pipeline tool | [ |
| PseudoPipe | Genome sequence (repeat-masked), parent proteins and their exon coordinates | Using protein sequence to find pseudogenes in repeat-masked intergenic regions by tBLASTn; realignment of candidates to corresponding parent(s) by FASTA to validate and classify pseudogenes | | [ |
| Shiu’s pipeline | Parent proteins and genome sequence (repeat-masked and intergenic) | Using protein sequence to find pseudogenes in repeat-masked intergenic regions by tBLASTn; realignment of candidates to corresponding parent(s) by FASTA to validate pseudogenes. Similar to PseudoPipe | | [ |
| PSF (Pseudogene Finder) | Same as Shiu’s pipeline | Using protein sequence to find pseudogenes in repeat-masked intergenic regions directly by Pro-map to detect disruption events and classify pseudogenes | | [ |
* Ref.: Reference.
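
Several of the approaches above validate candidates by looking for disruptions relative to the parent gene, such as premature stop codons or frameshifts. The sketch below illustrates the idea on a bare nucleotide string. It is a simplification under stated assumptions: the function names are hypothetical, the sequence is assumed to be already in the parent's reading frame, and a length that is not a multiple of three is used only as a crude proxy for a frameshift. The listed tools detect these events from protein-to-genome alignments instead.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop codon
    that occurs before the final complete codon, or None if there is none."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The final complete codon is skipped: a stop there is the normal terminator.
    for i in range(n_codons - 1):
        if seq[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None


def length_suggests_frameshift(cds: str) -> bool:
    """Crude proxy (assumption): a length not divisible by 3 hints at an indel."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    intact = "ATGAAATAA"        # ATG AAA TAA: stop only at the end
    disrupted = "ATGTAGAAATAA"  # ATG TAG AAA TAA: stop at codon index 1
    print(first_premature_stop(intact), first_premature_stop(disrupted))
```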
Figure 3. A flow chart of genome-wide pseudogene prediction methods.
Comparison of three bioinformatics tools employed for pseudogene prediction using the Arabidopsis thaliana genome sequence with its 924 known pseudogenes.
| Tool | No. of Total Pseudogenes Identified | No. of Parents Associated | No. of Known Pseudogenes Identified | Known Pseudogenes Identified (%) |
|---|---|---|---|---|
| PseudoPipe | 4108 | 2550 | 751 | 81.3 |
| Shiu’s pipeline | 3531 | 2317 | 729 | 78.9 |
| PSF | 801 | 604 | 55 | 6.0 |
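
The last column of the table follows directly from the fourth: each percentage is the number of known pseudogenes recovered divided by the 924 known pseudogenes. A short sketch that reproduces the arithmetic (the variable and function names are illustrative, the counts are taken from the table):

```python
KNOWN_TOTAL = 924  # known A. thaliana pseudogenes in the test data set

# Known pseudogenes recovered by each tool, from the table above.
known_found = {
    "PseudoPipe": 751,
    "Shiu's pipeline": 729,
    "PSF": 55,
}


def percent_recovered(found: int, total: int = KNOWN_TOTAL) -> float:
    """Share of known pseudogenes recovered, as a percentage rounded to 0.1."""
    return round(100 * found / total, 1)


if __name__ == "__main__":
    for tool, found in known_found.items():
        print(f"{tool}: {percent_recovered(found)}%")
```

This measures only recovery of the known set. The remaining predictions of each tool (for example, 4108 - 751 for PseudoPipe) are not necessarily false positives, since they may include pseudogenes absent from the annotation.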
Figure 4. Comparisons of three bioinformatics tools for pseudogene prediction using the A. thaliana genome with its 924 known pseudogenes. (A) Identified known pseudogenes; (B) all identified pseudogenes.