Tanguy Lallemand1, Martin Leduc1, Claudine Landès1, Carène Rizzon2, Emmanuelle Lerat3.
Abstract
Gene duplication is an important evolutionary mechanism that provides new genetic material and thus opportunities for an organism to acquire new gene functions, with major implications such as speciation events. Various processes are known to allow a gene to be duplicated, and different models explain how duplicated genes can be maintained in genomes. Because of their particular importance, identifying duplicated genes is essential when studying genome evolution, but it can still be a challenge because of the various fates duplicated genes can encounter. In this review, we first describe the evolutionary processes that lead to the formation of duplicated genes, and then the various bioinformatic approaches that can be used to identify them in genome sequences. These bioinformatic approaches differ according to the underlying duplication mechanism. Hence, understanding the specificity of the duplicated genes of interest is a great asset for tool selection and should be taken into account when exploring a biological question.
Keywords: bioinformatic tools; gene duplication; genome evolution; paralogous genes; synteny
Year: 2020 PMID: 32899740 PMCID: PMC7565063 DOI: 10.3390/genes11091046
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The different types of duplications. (A) Whole genome duplication, which implies complete chromosome duplication. (B) Tandem duplications, which produce identical adjacent sequences. (C) Retroduplication, which produces a retrocopy of a gene devoid of introns and with a polyA tail. (D) Transduplication, in which a DNA transposon acquires fragments of genes. (E) Segmental duplications, which correspond to long stretches of duplicated sequences with high identity.
Estimated number of duplicated genes in different species.
| Species | No. of Considered Genes | No. of Estimated Duplicated Genes | % Estimated Duplicated Genes | Methodology | Duplicated Gene Types | References |
|---|---|---|---|---|---|---|
|
| 25,557 | 11,937 | 46.7 | All-against-all nucleotide sequence similarity searches using | Not specified, all paralogous pairs were searched | [ |
| 27,558 | 12,761 | 46.3 * | All-against-all protein sequence similarity search using | Not specified, gene families were obtained | [ | |
| 25,972 | 10,483–17,406 | 40.4–67 | All-against-all protein sequence similarity search using | Not specified, gene families were all obtained (gene families) | [ | |
| 22,810 | 21,622 | 94.8 * | All-against-all protein sequence similarity search using | WGD, tandem, proximal, DNA-based transposed, retrotransposed, and dispersed duplications | [ | |
| 33,869–>19,727 | 12,981 | 65.8 | All-against-all protein sequence similarity search using | Gene families (tandem duplications searched among families) | [ | |
| 13,298 | 11,386 | 85–97 | All-against-all protein sequence similarity search using | Not specified, distant duplicates | [ | |
| 31,126 | 14,473 | 46.5 * | Ensembl family database and genes >300 nt. Tandem duplications were then searched for among families. | Gene families (tandem duplications searched for among families) | [ | |
| 20,415 | 15,569 | 76.3 | Pooling of different datasets from [ | WGD and SSD | [ | |
| 22,447 | 11,740 | 52.3 * | Ensembl version 77, >50% sequence identity, and high confidence for paralogy. | WGD and SSD | [ | |
|
| 21,305 | 14,043 | 65.9 | All-against-all protein sequence similarity search using | Gene families (tandem duplications searched for among families) | [ |
| 27,736 | 16,091 | 58.01 | Ensembl family database and genes >300 nt. Tandem duplications were then searched for among families. | Gene families (tandem duplications were searched for among families) | [ | |
|
| 18,468 | 12,466 | 67.5 | All-against-all protein sequence similarity search using | Gene families (tandem duplications searched for among families) | [ |
| 27,194 | 16,446 | 60.48 * | Ensembl family database and genes >300 nt. Tandem duplications were then searched for among families. | Gene families (tandem duplications searched for among families) | [ | |
| 18,562 | 9,149 | 49.3 | All-against-all nucleotide sequence similarity searches using | Not specified, all paralogous pairs were searched | [ | |
| 42,534 | 8,244–19,322 | 19.4–45.4 | All-against-all protein sequence similarity search using | Not specified, gene families were all obtained (gene families) | [ | |
| 27,910 | 21,461 | 76.9 * | All-against-all protein sequence similarity search using | WGD, tandem, proximal, DNA-based transposed, retrotransposed, and dispersed duplications | [ |
* These values have been calculated according to the information provided in the corresponding reference article.
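The percentages in the table are the ratio of estimated duplicated genes to considered genes, and several of the listed methodologies search for tandem duplications among gene families. The following is a minimal Python sketch of both steps, not the procedure of any cited study. The function names, the input layout (gene identifier, chromosome, start coordinate), and the `max_spacers` adjacency criterion are assumptions made for illustration.

```python
from collections import defaultdict


def percent_duplicated(n_duplicated, n_considered):
    """Percentage of duplicated genes, rounded to one decimal place."""
    return round(100.0 * n_duplicated / n_considered, 1)


def tandem_pairs(genes, family_of, max_spacers=0):
    """Return pairs of same-family genes that are neighbours on a chromosome.

    genes       -- iterable of (gene_id, chromosome, start) tuples
    family_of   -- dict mapping gene_id to a gene family identifier
    max_spacers -- hypothetical criterion: number of unrelated genes
                   tolerated between the two members of a pair
    """
    # Group genes by chromosome so that order is only compared locally.
    by_chrom = defaultdict(list)
    for gene_id, chrom, start in genes:
        by_chrom[chrom].append((start, gene_id))

    pairs = []
    for items in by_chrom.values():
        ordered = [gene_id for _, gene_id in sorted(items)]
        for i, gene in enumerate(ordered):
            # Look at the next (1 + max_spacers) genes along the chromosome.
            for j in range(i + 1, min(i + 2 + max_spacers, len(ordered))):
                other = ordered[j]
                fam = family_of.get(gene)
                if fam is not None and fam == family_of.get(other):
                    pairs.append((gene, other))
    return pairs


# First row of the table: 11,937 duplicated genes out of 25,557 considered.
first_row_percent = percent_duplicated(11937, 25557)

# Toy data (invented for illustration only).
toy_genes = [("g1", "chr1", 100), ("g2", "chr1", 200),
             ("g3", "chr1", 300), ("g4", "chr2", 50)]
toy_families = {"g1": "F1", "g2": "F1", "g3": "F2", "g4": "F1"}
toy_tandems = tandem_pairs(toy_genes, toy_families)
```

With `max_spacers=0` only strictly adjacent genes count as tandem duplicates. Published analyses often tolerate a few intervening genes, so the threshold should be set to match the definition used in the study being reproduced.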
Summary of the characteristics of different existing tools for identifying syntenic blocks.
| Name | Input | Output Text | Output Plots | Main Algorithm | Specificities | Gene Orientation | Genome Number | Documentation | Programming Language | Interface | References |
|---|---|---|---|---|---|---|---|---|---|---|---|
|
| Tabulated text | Graphical visualization | Custom Greedy Graph | Typical implementation of the collinearity strategy | Yes | N | Complete | C++ | Command line interface | [ | |
|
| Tabulated text | Graphical visualization | Able to detect transposed gene duplications, detection of the type of duplicates | No | N | Incomplete and with errors | C++ | Command line interface | [ | ||
|
| Species gene list and gene tree | Tabulated text | Graphical visualization | Uses gene trees to define gene homologies. Takes into account gene orientations, and tandem duplication blocks | Yes | 2 | Complete | Python | Command line interface | [ | |
|
| List of protein-coding genes and their associated amino-acid sequences | Text files containing homology relationships (RBH and non-RBH) and syntenic blocks description | Chromosomal painting representation, genome-wide dotplot | Computes Reciprocal Best-Hits (RBH) to reconstruct the backbones of the synteny blocks and complete with non-RBH syntenic homologs | Only one parameter: the synteny block stringency. Use | only in visualizations | N | Complete | Python, bash | Command line interface | [ |
|
| Nucleic sequences | Tabulated text | Multiple interactive plots | Cross-correlation, implemented as a fast Fourier transform | Based on a search strategy at a global level and cross-correlation at the local level | Yes | 2 | Short | C++, on linux | Command line interface | [ |
|
| Homologous genes and associated E-value | Tabulated text | Dot plot | Identification of chains of ordered gene pairs by searching paths in a directed acyclic graph | Use of dynamic programming, making it fast and highly reliable. Many software tools are based on this algorithm | No | 2 | Short | C++, Perl | Command line interface, Graphical user interface | [ |
|
| Any type of genetic markers (physical or genetic distance between markers, gene numbers) | Tabulated text with syntenic blocks and associated p-value | None | Dynamic programming algorithm based on the Smith-Waterman algorithm | Statistical inference, high computational efficiency, and flexibility of input data types | No | 2 | Not available | C++, Perl | Command line interface | [ |
|
| Sequences or alignments and an annotation file | Text file gathering alignments | None | Profile-profile alignment setting, which is an extension of the Waterman-Eggert algorithm | Implementing a phylogenetic scoring function | - | N | Complete | C++ | Command line interface | [ |
|
| List of the linear order and orientation of features on each contig and list of the pairwise homologies between features | Text file results | Dot Plot | Dynamic programming algorithm based on the Smith-Waterman algorithm | Modeling of the probability of observing segmental homologies assumed by chance and taking this model into account to parameterize the algorithm and the statistical evaluation of its output | Yes | 2 | Not available | C++ | Command line interface | [ |
|
| Set of anchors (e.g., local alignments or pairs of similar genes) | Text file where each genome is represented as a shuffled sequence of the syntenic blocks | Dot Plot | Construction of an A-Bruijn graph | Graph-based algorithm that identifies non-overlapping syntenic blocks | No | N | Not available | C# | Command line interface | [ |
|
| Two text files containing gene names and/or coordinates | Dot Plot | Homology matrix based algorithm | Typical implementation of the collinearity strategy. Identifies large-scale syntenic blocks despite high levels of background noise | No | 2 | Short | Perl, and requires the BioPerl and GD.pm modules | Command line interface | [ | |
|
| Genomic locations of anchor or | Genomic locations of chains and orthologous segments | Dot Plot and a synteny map | Machine Learning and Markov Chains | Use Markov chain models and machine learning techniques. Automatically optimizes the parameters used in the Markov chain models. Scoring scheme based on stochastic models | Yes | N | Complete | C++ | Command line interface | [ |
|
| Genome sequences in FASTA format and associated GFF files | Homologous genes, diagonals, and identified syntenic blocks. | Visualization available and interactive |
| Interactive visualizations. Calculates synonymous and nonsynonymous mutation rates for syntenic gene pairs using | No | N | Complete | No requirements | Web user interface | [ |
|
| Information about markers and the homologous groups. | Tabulated text | Three interactive visualizations Whole Genome Synteny, Chromosome Level Synteny, Synteny Around a Marker | Ternary search trees (TST) | On-the-fly computations allowing fast parameters adjustments | Yes | N | Complete | No requirements | Web user interface | [ |
|
| Protein sequences in FASTA format and genome annotation in BED | Output files from | Multiple synteny plots |
| Efficient tool for users without programming skills. Precomputed data for 18 plant genomes | No | N | Not available | No requirement | Web user interface | [ |
|
| Genome file and a file storing orthologous relationships among genes in all input genomes | Cluster file, with all the syntenic blocks detected, Stat file with information related to the size distribution of the syntenic blocks | One associated plot | Depth-first search method, can also use | Fast and easy to use. Can be applied using any types of markers as an input as long as their relationships can be established | Yes | N | Complete | C++ | Web user interface, Command Line | [ |
N: Theoretically arbitrary number of studied genomes.
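Several tools in the table rely on a collinearity strategy: they chain pairs of homologous genes (anchors) whose order is conserved in both genomes, often using dynamic programming. The following is a simplified Python sketch of that general idea, not the algorithm of any tool listed above. The function name, the representation of anchors as gene-rank pairs, and the `max_gap` parameter are assumptions made for illustration. Only forward-oriented chains are considered.

```python
def best_collinear_chain(anchors, max_gap=5):
    """Longest chain of anchors with increasing ranks in both genomes.

    anchors -- iterable of (rank_in_genome_a, rank_in_genome_b) pairs,
               one per homologous gene pair
    max_gap -- hypothetical limit on the rank difference allowed between
               consecutive anchors of a chain, in either genome
    """
    anchors = sorted(set(anchors))
    if not anchors:
        return []

    # score[j]: length of the best chain ending at anchor j.
    # prev[j]:  index of the preceding anchor in that chain, if any.
    score = [1] * len(anchors)
    prev = [None] * len(anchors)

    for j, (a_j, b_j) in enumerate(anchors):
        for i in range(j):
            a_i, b_i = anchors[i]
            step_a, step_b = a_j - a_i, b_j - b_i
            # Both ranks must increase, and neither gap may be too large.
            if 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                if score[i] + 1 > score[j]:
                    score[j] = score[i] + 1
                    prev[j] = i

    # Trace back from the highest-scoring anchor.
    end = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while end is not None:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


# Toy anchors (invented): four collinear pairs plus two off-diagonal ones.
toy_anchors = [(1, 1), (2, 2), (3, 10), (4, 3), (5, 4), (20, 20)]
toy_chain = best_collinear_chain(toy_anchors)
```

This quadratic-time version reports a single best chain. Real tools typically also handle inverted segments, score anchors by similarity rather than counting them, and report every chain above a significance threshold.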