Literature DB >> 29899658

Integrating Phylogenetic and Network Approaches to Study Gene Family Evolution: The Case of the AGAMOUS Family of Floral Genes.

Daniel S Carvalho1,2, James C Schnable1,2, Ana Maria R Almeida3.   

Abstract

The study of gene family evolution has benefited from the use of phylogenetic tools, which can greatly inform studies of both relationships within gene families and functional divergence. Here, we propose the use of a network-based approach that in combination with phylogenetic methods can provide additional support for models of gene family evolution. We dissect the contributions of each method to the improved understanding of relationships and functions within the well-characterized family of AGAMOUS floral development genes. The results obtained with the two methods largely agreed with one another. In particular, we show how network approaches can provide improved interpretations of branches with low support in a conventional gene tree. The network approach used here may also better reflect known and suspected patterns of functional divergence relative to phylogenetic methods. Overall, we believe that the combined use of phylogenetic and network tools provide a more robust assessment of gene family evolution.

Entities:  

Keywords:  functional divergence; gene families; network analysis; phylogenetics

Year:  2018        PMID: 29899658      PMCID: PMC5993073          DOI: 10.1177/1176934318764683

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Introduction

Advances in sequencing technology have lead to dramatic expansions in the number of sequenced genes within most gene families, both through the use of whole genome or whole transcriptome sequencing and through broader taxon sampling. Gene families are generally studied through the use of phylogenetic approaches to identify closely and distantly related sequences, as well as to classify divergence between gene copies into those resulting from speciation (orthology) or gene duplication (paralogy).[1,2] Thus, phylogenetic approaches are widely employed to study how sequence divergence can lead to divergence of structure and/or function.[3,4] When coupled with genome context information, this approach can provide insightful understanding of gene regulation and function. For instance, it is well known that orthologous genes conserved at syntenic locations in the genome are more likely to exhibit conserved regulation[5] and function[6] than genes at nonsyntenic locations. However, the prevalence of whole genome duplications in plants poses challenges to the study of gene family evolution using exclusively phylogeny-based methods[3] due to the diverse outcomes of duplicated genes. Whole genome duplications produce syntenic paralogs that can be reciprocally lost,[7,8] sub- or neofunctionalized,[9] or even retained in the same functional roles as a result of relative or absolute dosage constraints.[10] A fundamental assumption of any phylogenetic reconstruction is that the observed changes occur exclusively through a hierarchical bifurcated branching process. This model is certainly a good representation of a major evolutionary force (ie, descent with modification)[11]; however, many will argue that it fails to capture the diversity of evolutionary processes which shape the gene content of extant species.[12,13] One way to address the complexity of evolutionary processes is to apply network approaches to address questions related to cell organization and functioning,[14] human diseases relationships,[15] and plant gene function prediction.[16] Network approaches have also been successfully applied to study fungi evolution based on enzymes related to the chitin synthase pathway.[17] Recently, Carvalho et al[18] have used a network-based approach to address the origin of the mitochondria, providing a new perspective on the study mitochondrial evolution. Network-based approaches can overcome some of the limitations of phylogenetic methods. For instance, these approaches do not require the assumption of a hierarchical bifurcating framework and therefore may be capable of dealing with more complex biological patterns and phenomena.[19-21] Networks are generally less precise in their ability to reconstruct the divergence points of different groups within a gene family; however, they may be able to capture additional insight into function evolution and divergence using information which might be lost in phylogenetic reconstructions. In this study, we compare the information gained from conventional phylogenetic analysis and a network-based approach using a well-characterized subfamily of floral transcription factors, the AGAMOUS floral genes. The AGAMOUS gene subfamily comprises MADS-box transcription factors and is involved in important aspects of flower and fruit development.[22] Among angiosperms (flowering plants), the AGAMOUS subfamily is traditionally divided into the C and D lineages. C lineage genes include the closest relatives of the Arabidopsis thaliana AGAMOUS (AG) gene[23,24] in all angiosperms, as well as close relatives of the SHATTERPROOF (SHP) gene, present exclusively in core eudicots. D lineage genes, on the other hand, include angiosperm SEEDSTICK (STK) genes.[25,26] The C/D split likely occurred after the split between gymnosperms and angiosperms and gymnosperms usually carry a single-gene copy of the AGAMOUS subfamily. While D lineage genes are usually related to ovule development, C lineage genes have been implicated in stamen and carpel development. Particularly, in core eudicots, SHP genes have also been shown to be involved in fruit development and ripening.[27-30] This gene subfamily has been extensively studied and mutant characterization has provided insights into their functional roles in carpel, ovule, and fruit development as well as floral meristem termination. The AGAMOUS subfamily has undergone several instances of duplication followed by neo- and subfunctionalization throughout its evolutionary history in angiosperms,[26,31] and understanding the evolutionary history of this group has proven challenging as a result of low support for deep nodes on the tree. Here, we investigate the contributions of a a similarity-based phylogenetic network approach to our understanding of AGAMOUS gene family evolution.[32-34] The phylogenetic network methods used here do not require the assumption of a scale-free topology, or the need to calculate gene correlation based on expression data,[16,35] which makes the approach used more straightforward. Also, the approach used here does not rely on an existing tree to generate the networks, as most phylogenetic networks. Overall, both the phylogeny and network results showed consistent clustering of the gene families. However, our results suggest that the network approach was less affected by sequence divergence. We demonstrate that a combination of both methods can provide additional insight into evolutionary events and functional divergence within gene families.

Methods

Sequence search and multiple sequence alignment

C and D lineage AGAMOUS nucleotide sequences were retrieved on Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) and NCBI (National Center for Biotechnology Information). Species of origin and accession numbers for each sequence included in this analysis are provided in Table 1. A multiple sequence alignment was performed using the ClustalW[36] alignment tool within Geneious v7.0.4,[37] based on translated nucleotides. Further refinements were made manually, using translated sequences as a way to guide manual curation. Manual curation of the multiple sequence alignment was performed using a codon-preserving approach and taking into account domains and motifs previously described in the literature.[25] Unalignable regions were removed prior to further analysis. The final multiple sequence alignment included 549 nucleotides. The alignment statistics obtained from HMMSTAT, from HMMER3 package,[38] were eff_nseq = 2.72, M = 531, relent = 0.45, info = 0.45, p relE = 0.31, and compKL = 0.02. jModelTest 2.1.1 was used to estimate the best-fit evolutionary model of nucleotide evolution.[39] A protein multiple sequence alignment was also performed with the same sequences and used in downstream phylogenetic analysis.
Table 1.

List of species and sequence identifiers used in this study.

Gene nameSpecies nameClade nameAccession or gene ID
GinbiMADS5 Ginkgo biloba GinkgoaceaeGU563899
PiabDAL2 Picea abies PinaceaeX79280.1
PiradAG Pinus radiata PinaceaeAF023615
TbaccAG Taxus baccata TaxaceaeJF519754
CryjaMADS4 Cryptomeria japonica TaxodiaceaeHM177453
ShenAG Saruma henryi AristolochiaceaeAY464101
ChlspiSTK Chloranthus spicatus ChloranthaceaeAY464099
PeamAG1 Persea americana LauraceaeDQ398021
PeamAG2 Persea americana LauraceaeDQ398022
MafiAG1 Magnolia figo MagnoliaceaeJQ326236
MapreSTK Magnolia praecossisima MagnoliaceaeAB050653
MialAG Michelia alba MagnoliaceaeJQ326219
ElguiAG1 Elaeis guineensis ArecaceaeAY739698
ElguiAG2 Elaeis guineensis ArecaceaeAY739699
ElguiSTK Elaeis guineensis ArecaceaeXP_010912706.1
BdiAG1* Brachypodium distachyon PoaceaeBradi2g06330.1
BdiAG2* Brachypodium distachyon PoaceaeBradi4g40350.1
BdiAG3* Brachypodium distachyon PoaceaeBradi2g25090.1
OsMADS3 Oryza sativa PoaceaeL37528
OsMADS13 Oryza sativa PoaceaeAF151693
OsMADS21 Oryza sativa PoaceaeFJ750944
OsMADS58 Oryza sativa PoaceaeAB232157
SbAG1* Sorghum bicolor PoaceaeSb03g002525
SbAG2* Sorghum bicolor PoaceaeSb08g006460 or Sobic.008G072900.1
SbAG3* Sorghum bicolor PoaceaeSb09g006360 or Sobic.009G075500.3
ZAG2* Zea mays PoaceaeGRMZM2G160687-T03
ZMM2* Zea mays PoaceaeGRMZM2G359952-T01
ZAG1* Zea mays PoaceaeGRMZM2G052890-T01
LaschisAG Lacandonia schismatica TriuridaceaeGQ214163
LaschisSTK Lacandonia schismatica TriuridaceaeGQ214164
GongaSTK Gongora galeata OrchidaceaeAIU94767.1 or KF914206.1
GongaAG Gongora galeata OrchidaceaeAIU94768.1
HyvilSTK Hypoxis villosa HypoxidaceaeAIU94766.1 or KF914205.1
HyvilAG Hypoxis villosa HypoxidaceaeAIU94771.1
AspvirSTK Asparagus virgatus AsparagaceaeAB175825.1
AspvirAG Asparagus virgatus AsparagaceaeBAD18011.1
BgilAG Berberis gilgiana BerberidaceaeAY464106
EupleAG1 Euptelea pleiosperma EupteleaceaeGU357452
EupleAG2 Euptelea pleiosperma EupteleaceaeGU357453
AkquiAG Akebia quinata LardizabalaceaeAY464107
AktriAG Akebia trifoliata LardizabalaceaeAY627635
AktriSTK Akebia trifoliata LardizabalaceaeAY627629
HogrAG1 Holboellia grandiflora LardizabalaceaeJQ806406
HogrAG2 Holboellia grandiflora LardizabalaceaeJQ806407
EscaAG1 Eschscholzia californica PapaveraceaeDQ088996
EscaAG2 Eschscholzia californica PapaveraceaeDQ088997
EscaSTK Eschscholzia californica PapaveraceaeDQ088998
AqAG1* Aquilegia coerulea RanunculaceaeAquca-136-00009.1
AqAG2* Aquilegia coerulea RanunculaceaeAquca-022-00039.1
AqAGL11* Aquilegia coerulea RanunculaceaeAquca-136-00010.1
ThathAG1 Thalictrum thalictroides RanunculaceaeJN887118
ThathAG2 Thalictrum thalictroides RanunculaceaeAY867879
MedilSTK Meliosma dilleniifolia SabiaceaeAY464105
AlyrAG* Arabidopsis lyrata Brassicaceae946287
AlyrSHP1* Arabidopsis lyrata Brassicaceae486333
AlyrSHP2* Arabidopsis lyrata Brassicaceae321962
AlyrSTK* Arabidopsis lyrata Brassicaceae489841
ATSHP1 Arabidopsis thaliana BrassicaceaeAT3G58780
ATSHP2 Arabidopsis thaliana BrassicaceaeAT2G42830
ATSTK Arabidopsis thaliana BrassicaceaeAT4G09960 or NM_001203767.1
BraAG* Brassica rapa BrassicaceaeBrara.K01743.1
BraSHP1* Brassica rapa BrassicaceaeBrara.G01817.1
BraSHP2* Brassica rapa BrassicaceaeBrara.E00310.1
BraSTK* Brassica rapa BrassicaceaeBrara.C02624.1
CaruAG* Capsella rubella BrassicaceaeCarubv10005558m
CaruSHP1* Capsella rubella BrassicaceaeCarubv10019520m
CaruSHP2* Capsella rubella BrassicaceaeCarubv10025002m
CaruSTK* Capsella rubella BrassicaceaeCarubv10003771
ThhSHP1* Thellungiella halophila BrassicaceaeThhalv10006196
ThhSHP2* Thellungiella halophila BrassicaceaeThhalv10017047
ThhSTK* Thellungiella halophila BrassicaceaeThhalv10028938
CapaSHP* Carica papaya CaricaceaeEvm. TU supercontig_50.73
MetrAG* Medicago truncatula FabaceaeMedtr8g087860.1
MetrSHP Medicago truncatula FabaceaeJX308825
MetrSTK* Medicago truncatula FabaceaeMedtr 3g 005530.1
GoraAG1* Gossypium raimondii MalvaceaeGorai.N017200.1
GoraAG2* Gossypium raimondii MalvaceaeGorai.011G035500.1
GoraSHP* Gossypium raimondii MalvaceaeGorai012G042600.1
GoraSTK1* Gossypium raimondii MalvaceaeGorai.009G265100.1
GoraSTK2* Gossypium raimondii MalvaceaeGorai.009G288000.1
ThecAG* Theobroma cacao MalvaceaeThecc1E6029596t1
ThecSHP* Theobroma cacao MalvaceaeThecc1EG001841t1
ThecSTK* Theobroma cacao MalvaceaeThecc1EG036541t1
MguAG* Mimulus guttatus PhrymaceaeMgv1a012605 or Migut.M00986.1
MguSTK* Mimulus guttatus PhrymaceaeMgv1a013047m or Migut.C01334.1
PotriAG* Populus trichocarpa SalicaceaePotri.011G075800.1
PotriSTK* Populus trichocarpa SalicaceaePotri.013G104900.1
PotriSTK2* Populus trichocarpa SalicaceaePotri.019G077200.1
TAG Solanum lycopersicum SolanaceaeL26295.1
TSHP Solanum lycopersicum SolanaceaeAY098735
TSTK Solanum lycopersicum SolanaceaeNM_001247265.2
ViviSHP* Vitis vinifera VitaceaeGSVIVG01000802001
ViviAG* Vitis vinifera VitaceaeGSVIVT01021303001

Genes retrieved from NCBI (National Center for Biotechnology Information) (genes with * were retrieved from Phytozome) (long).

List of species and sequence identifiers used in this study. Genes retrieved from NCBI (National Center for Biotechnology Information) (genes with * were retrieved from Phytozome) (long).

Phylogenetic analysis

Maximum likelihood analysis was performed using PhyML 3.0 (http://www.atgc-montpellier.fr/phyml/)[40,41] with the TN93 model,[42] a gamma distribution parameter of 1.107. Bootstrap support was calculated based on 100 iterations. The most likely tree was computed based on the PhyML-estimated parameters: transition/transversion ratio for purines of 2.541, transition/transversion ratio for pyrimidines of 4.342, and nucleotides frequencies of f(A) = 0.33406, f(C) = 0.20359, f(G) = 0.24537, and f(T) = 0.21698. An ML tree of the protein sequence multiple sequence alignment was also performed on PhyML 3.0 using the LG model of amino acid substitution.

Obtaining identity matrix

A pairwise distance matrix, based on a nucleotide multiple sequence alignment of the 93 sequences, was calculated using MEGA7. Even though the length of the final alignment obtained was 543 positions, removal of gaps and missing data was performed to calculate the distance matrix, resulting in a final set of 372 informative positions in the final filtered data set.[43] The number of base substitutions per site between sequences was calculated using the maximum composite likelihood model.[44] To obtain the identity value of the sequence pairs, we subtracted 1 from the distance value of every term of the distance matrix to finally obtain the identity matrix.

Network analysis

Once the gene identity matrix was generated, a set of 101 networks were created based on the identity threshold between sequence pairs (1 network for each threshold, 0% through 100%), which is represented by the parameter σ. In each network, each nucleotide sequence is represented by a single node. Two nodes (say i and j) are considered connected if the identity threshold is greater than σ. The networks were represented in the format of an adjacency matrix M(σ), where the matrix elements M (pairs of sequences) were either 1, if they were connected, or 0, if they were not connected.[45] Then, neighborhood matrices were built for each one of the M(σ).[46,47] Each element from represents the number of steps in the shortest path connecting 2 nodes i and j. Whenever 2 nodes are not connected and belong in different clusters, . A neighborhood matrix shows the number of edges connecting 2 nodes in the network. The neighborhood matrices were later used to calculate the network distance δ(σ, σ + Δσ) between the pairs of successive networks (in this case, Δσ = 1), to find the network with the most meaningful biological information, as previously described.[45] Further description of the symbols used here is in Table 2.
Table 2.

Summary of symbols.

SymbolDenominationDescription
σ Identity thresholdThreshold value used to build a network, based on similarity values ranging from 0% to 100%. Pairs of sequences that have an identity value greater than or equal to σ mean that they are connected
M(σ)Adjacency matrix at σAdjacency matrix obtained at a certain value of σ, composed of 0 and 1, representing whether a pair of sequences is connected (represented by 1) or disconnected (represented by 0)
Mij Element of the adjacency matrixElement of the adjacency matrix and represents the presence (1) or absence (0) of an edge between sequences i and j of an adjacency matrix M
M^(σ) Neighborhood matrixMatrix composed of elements representing the least number of edges necessary to connect a pair of sequences
m^ij Element of the neighborhood matrixElement of the neighborhood matrix and represents the least number of edges connecting sequences i and j
ΔσIncrements of σValue incremented to σ, ie, Δσ = 1 means that the σ increases by 1
δ(σ, σ + Δσ)Network distance between 2 networksRepresents the network distance δ between the networks obtained at σ and σ + Δσ
Summary of symbols. Gephi was used to visualize and further interrogate the networks.[48] The modularity calculation from Gephi, based on Blondel et al[49] and resolution from Lambiotte et al,[50] was used to classify individual nodes into communities. To summarize the network approach applied here, we describe the main steps performed: Alignment of gene sequences; Calculation of genetic distances and generation of identity matrix; Calculation of network distances; Identification of best σ; Network generation and analysis under most informative value. The proposed approach used here requires less than 10 seconds to run on an Acer Intel Core i7-6700 CPU @ 3.40GHz for all data sets tested to date (<100 sequences). The scripts used here can be found on GitHub (https://github.com/deCarvalho90/network_analysis) and the software with a graphical interface is available in the work by Goés-Neto et al.[51]

Results

The maximum likelihood phylogeny of AGAMOUS genes presented in Figure 1 is consistent with the topology previously published studies of the AGAMOUS gene family.[25,26,31] The most likely nucleotide tree had a log likelihood score of −20654.546986. The ML protein tree had poor support for main clades and therefore was not used in subsequent analysis (data not shown).
Figure 1.

Phylogenetic tree of the AGAMOUS family genes. Main functional groups are highlighted in black boxes along the tree.

Phylogenetic tree of the AGAMOUS family genes. Main functional groups are highlighted in black boxes along the tree. Gymnosperm AGAMOUS genes (here termed C/D homologues) form a paraphyly at the base of the unrooted tree. An initial duplication event separates C and D lineage angiosperm genes and likely occurred in the common ancestor of angiosperms. Basal angiosperm C lineage homologues, although clustering with D lineage genes, exhibit expression patterns, and likely function, similar to that of core eudicot C lineage genes. D lineage genes form a monophyletic clade that includes all other angiosperm species included in this study. Monocot D lineage genes appear as a paraphyly at the base of the D lineage clade; however, the relationships among D lineage genes otherwise are largely consistent with known species relationships. The relationships of C lineage genes are more convoluted. The base of this subtree is a polyphyly including monocot, basal eudicot, and core eudicot genes. At the base of the core eudicots, a second duplication event resulted in the split of the AGAMOUS and PLENA/SHATTERPROOF (SHP) lineages. A third duplication, likely at the base of the Brassicales, resulted in 2 copies of SHP genes in this group (SHP1 and SHP2; Figure 1). Basal angiosperm C lineage genes form a group that diverges after the gymnosperm C/D lineage, but before the angiosperm C/D lineage split. The artificial polyphyletic group of the paleoAGAMOUS includes monocot and basal eudicot sequences. While the basal eudicot genes group with other core eudicot AGAMOUS genes, monocot paleoAGAMOUS genes share a most recent common ancestor with D lineage genes. It is important to notice, however, that the low branch support in many areas of the AGAMOUS gene tree poses challenges to the interpretation of the evolutionary relationship between clades. The network distance graph showed its highest peak at 75% identity, which means that the network generated at that peak is the most distant from the others (Figure 2A). Also, it means that the network presents a clear community structure with relevant evolutionary information. Despite the fact that the network with the biggest distance was obtained at 75% identity, the community structure was already too fragmented to answer questions about the evolution of the gene families analyzed in the phylogeny (Figure S1A). Even though the network obtained at 75% was too fragmented, the network still provided relevant information about the functional divergence of the genes. However, we wanted to see how the community structure would behave in a scenario closer to the phylogeny. To do so, we had to find the network where all sequences were connected in a way that it would still be possible to retrieve a community structure. A similar situation occurred in the work by Carvalho et al,[18] and the problem was solved by analyzing other networks in different peaks. Here, we attempted to solve this problem by analyzing the network at 51% to find the last network where all sequences were connected. However, it was not possible to see a clear community structure in this network due to the high degree of connectivity between nodes (Figure S1B). Finally, in this study, we focused mainly on the network obtained at the identity threshold 67, which meant that 2 sequences had to have an identity value of 67% or higher to be connected. The choice of the network threshold was based on the fact that all sequences in this study were connected, with exception of the out-group sequences, which reflected a scenario similar to the phylogeny.
Figure 2.

(A) Network distance graph based on the δ(σ, σ + Δσ) distance. The values for the analyzed networks obtained at 51%, 67%, and 75% are marked. (B) Network obtained at 67% identity. Nodes are colored based on the community they belong to (C1-C5), as result of the modularity algorithm (see methods). The sequences that do not belong to any community are represented as gray nodes. Network obtained at 67% identity, colored based on (C) gene function and (D) species phylogenetic placement.

(A) Network distance graph based on the δ(σ, σ + Δσ) distance. The values for the analyzed networks obtained at 51%, 67%, and 75% are marked. (B) Network obtained at 67% identity. Nodes are colored based on the community they belong to (C1-C5), as result of the modularity algorithm (see methods). The sequences that do not belong to any community are represented as gray nodes. Network obtained at 67% identity, colored based on (C) gene function and (D) species phylogenetic placement. After applying the modularity calculation (see section “Methods”) in the 67% network, it was possible to see the emergence of the community structure of the network, containing 5 communities (C1-C5; Figure 2B). Each one of the communities mainly cluster genes that have similar functions. In C1, 3 out of the 5 nodes from gymnosperm C/D homologues are connected. Even though the 5 nodes are not connected, this result was expected due to the fact that they are part of the most distant out-group sequences as seen in Figure 1. In C2, however, the functions of the nodes are related to AG, paleoAG and basal angiosperm C homologues. This might suggest that the AGAMOUS genes have retained a function very similar to their basal angiosperm C homologues. In C3, the SHP genes are clustered together, but in a different community of the AG genes, also suggesting functional divergence. The genes clustered in C4 comprise the STK genes. Even though the communities were mostly composed of genes with similar functions, 3 genes exhibited unexpected placements. For instance, the SHP gene from Vitis vinifera (ViviSHP) clustered with other AG genes in C2, instead of with other SHP genes in C3. Similarly, Sorghum bicolor SbAG2, a STK gene, clustered in C5, instead of the expected C4, whereas Sorghum bicolor SbAG3, a paleoAG gene, clustered in C4, instead of the expected C5. Finally, the genes clustered in C5 belong to the monocots paleoAG. This result might suggest that monocot paleoAG genes are evolving under different evolutionary forces than the paleoAG and AG lineages. Finally, we can notice that the grouping obtained by both methods were consistent with one another by comparing Figures 1 and 2D. Also, the results obtained at the 67% threshold is largely congruent with the one obtained for the protein network generated (Figure S3), obtained at the 60% threshold (highest peak). However, the protein network showed lower resolution because it clustered together AG, eudicot paleoAG, and SHP genes, whereas we see a clear separation of SHP from the other genes. Even though the 75% network showed a fragmented community structure for this study, we can notice that it shows that the STK sequences from maize, sorghum, rice, and brachypodium are in a separate community. This information might suggest that STK genes from grasses might be undergoing a functional divergence compared with the remaining STK genes; however, limitations in gene functional annotation do not enable us to further support this inference. Both the phylogenetic and network-based analyses returned largely consistent sets of gene clusters. However, the grouping of monocots paleoAG sequences in a separate cluster (C5) than other C homologues from basal angiosperm, basal eudicot, and eudicot sequences (jointly clustered in C2) in the network-based analysis suggest 2 testable hypothesis: (1) monocot sequences are undergoing different and independent evolutionary processes when compared with other non-monocot AG homologues and (2) non-monocot AG sequences are clustered with euAG genes due to conservation of function.

Discussion

The use of phylogenetic methods to study gene family evolution has provided vast increases in the understanding of molecular evolution, and the utility of these methods for reconstructing ancestral relationships remains unparalleled. However, in many cases, complex evolutionary processes including neofunctionalization, repeated co-option into new biological roles, as has occurred in independent origins of C4 photosynthesis,[52] high birth/death gene families, and reciprocal gene loss following gene or genome duplication, reconstructing phylogenetic relationships may not be the most effective method for identifying genes with equivalent functional roles. Among the contributions of a network approach to gene family studies is the interpretation of the relationships among gene sequences that are not limited to a bifurcating pattern, which is often the case in a phylogenetic framework. A network approach allows for the emergence of patterns that are not seen otherwise. Here, we propose the use of network-based approach which has complementary sets of strengths and weaknesses to conventional phylogenetic methods and tested the contributions of these methods using data from the well-characterized AGAMOUS family of floral transcription factors. For instance, in the phylogenetic tree, the non-monocot euAG and paleoAG genes are not clustered with the basal C homologues. Rather, in the networks we notice that these genes are clustered together suggesting a higher functional conservation between them, which is not seen in the tree. Also, from the tree alone we cannot infer whether either euAG or SHP genes neofunctionalized. However, because all the euAG and non-monocot paleoAG are clustered together with the ancestral C homologues, and apart from the SHP genes, we can infer that the SHP genes neofunctionalized, whereas the euAG and non-monocot paleoAG retained an ancestral function. We believe that a combined approach might help with discerning functional and structural evolution in a way that neither methods can provide on its own. In agreement with the literature,[26,53] the network-based analysis recovered clusters of paeloAG and AG genes from basal angiosperms, basal eudicots, and core eudicots, potentially indicating conserved functional roles for the genes included in these clusters despite sequence divergence. In contrast, the position of the basal angiosperms’ C lineage in the phylogenetic tree leads to uncertain interpretations of conserved or divergent function with respect to the D lineage. The network-based approach also separated the STK and paleoAG genes within the monocot lineage, despite the close phylogenetic relatedness of these 2 gene clades, consistent with reports of distinct functional roles for these 2 sets of genes in monocots.[54,55] For instance, paleoAG gene from maize has undergone a duplication event in the common ancestor of maize, wheat, and rice[25] which leads to subfunctionalization of these genes that perform functions still related to, but different from Arabidopsis AG.[56] A similar process also occurred in rice.[57] These differences may be the reason the monocot paleoAG clustered together in the network, but in a different community than the remaining AG gene sequences. Moreover, genetic networks of the inflorescence meristems can vary a lot between grasses and eudicots because several changes in these regulatory networks are either only present in grasses or perform a different function in eudicots.[58] However, network-based approaches to studying gene families bring with them their own set of limitations. Some of these are inherent to the particular methodology used here, whereas others are a result of the relative immaturity of statistical and software tools for applying these methods to the analysis of gene family evolution. For example, a range of statistical methods are widely available for estimating the level of support for individual branches/clades within a given phylogeny, such as jackknife, bootstrap, and posterior probabilities.[59,60] In contrast, methods for calculation of cluster support in a biological context are far less mature, at least for the implementation employed here. The use of sequence identity as a measure of distance, while computationally tractable, also means discarding a great deal of information on the frequency of different types of substitutions at both the nucleotide and amino acid levels which can be incorporated into many modern phylogenetic algorithms.[61] Figure 3 summarizes the contributions and relative strengths and weaknesses of phylogenetic and network-based approaches to the study of gene family evolution. We propose that the combination of both methods can provide more assessment of both functional and historical relationships between sequences than either approach alone.
Figure 3.

Schematic diagram of results based on phylogenetic (left) and network (right) analyses. Potential contributions of each approach, as well as benefits steaming from the combination of both methods are described below the diagrams.

Schematic diagram of results based on phylogenetic (left) and network (right) analyses. Potential contributions of each approach, as well as benefits steaming from the combination of both methods are described below the diagrams.

Conclusions

Investigating the contributions of a particular network-based approach to the study of the evolution of a well-known family of transcription factor genes involved in floral development supports the idea that network-based approaches, when used in conjunction with phylogenetic methods, can be used to improve our understanding of functional conservation or divergence within gene family evolution. The network-based analysis of gene families used here currently lacks the robust ecosystem of computational tools and statistical approaches developed for phylogenetic analysis; however, it can already provide an independent assessment of relationship structures which can aid in the interpretation of phylogenetic data, especially in areas of the tree exhibiting low branch support. In particular, network analysis can be used to generate testable hypotheses regarding the conservation or divergence of gene function in cases of potential subfunctionalization or neofunctionalization. In combination, we believe that these methods provide a robust framework that expands the power of gene family evolution studies. Click here for additional data file. Supplemental material, Supplementary_Material for Integrating Phylogenetic and Network Approaches to Study Gene Family Evolution: The Case of the AGAMOUS Family of Floral Genes by Daniel S Carvalho, James C Schnable and Ana Maria R Almeida in Evolutionary Bioinformatics
  55 in total

Review 1.  Network biology: understanding the cell's functional organization.

Authors:  Albert-László Barabási; Zoltán N Oltvai
Journal:  Nat Rev Genet       Date:  2004-02       Impact factor: 53.242

2.  The human disease network.

Authors:  Kwang-Il Goh; Michael E Cusick; David Valle; Barton Childs; Marc Vidal; Albert-László Barabási
Journal:  Proc Natl Acad Sci U S A       Date:  2007-05-14       Impact factor: 11.205

3.  A new bioinformatics analysis tools framework at EMBL-EBI.

Authors:  Mickael Goujon; Hamish McWilliam; Weizhong Li; Franck Valentin; Silvano Squizzato; Juri Paern; Rodrigo Lopez
Journal:  Nucleic Acids Res       Date:  2010-05-03       Impact factor: 16.971

4.  Functional diversification of AGAMOUS lineage genes in regulating tomato flower and fruit development.

Authors:  Irvin L Pan; Ryan McQuinn; James J Giovannoni; Vivian F Irish
Journal:  J Exp Bot       Date:  2010-03-24       Impact factor: 6.992

5.  Characterization of an AGAMOUS homologue from the conifer black spruce (Picea mariana) that produces floral homeotic conversions when expressed in Arabidopsis.

Authors:  R Rutledge; S Regan; O Nicolas; P Fobert; C Côté; W Bosnich; C Kauffeldt; G Sunohara; A Séguin; D Stewart
Journal:  Plant J       Date:  1998-09       Impact factor: 6.417

6.  Genetic interactions among floral homeotic genes of Arabidopsis.

Authors:  J L Bowman; D R Smyth; E M Meyerowitz
Journal:  Development       Date:  1991-05       Impact factor: 6.868

7.  Detecting network communities: an application to phylogenetic analysis.

Authors:  Roberto F S Andrade; Ivan C Rocha-Neto; Leonardo B L Santos; Charles N de Santana; Marcelo V C Diniz; Thierry Petit Lobão; Aristóteles Goés-Neto; Suani T R Pinho; Charbel N El-Hani
Journal:  PLoS Comput Biol       Date:  2011-05-05       Impact factor: 4.475

8.  Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data.

Authors:  Matthew Kearse; Richard Moir; Amy Wilson; Steven Stones-Havas; Matthew Cheung; Shane Sturrock; Simon Buxton; Alex Cooper; Sidney Markowitz; Chris Duran; Tobias Thierer; Bruce Ashton; Peter Meintjes; Alexei Drummond
Journal:  Bioinformatics       Date:  2012-04-27       Impact factor: 6.937

9.  What are the Evolutionary Origins of Mitochondria? A Complex Network Approach.

Authors:  Daniel S Carvalho; Roberto F S Andrade; Suani T R Pinho; Aristóteles Góes-Neto; Thierry C P Lobão; Gilberto C Bomfim; Charbel N El-Hani
Journal:  PLoS One       Date:  2015-09-02       Impact factor: 3.240

10.  Evolution of fruit development genes in flowering plants.

Authors:  Natalia Pabón-Mora; Gane Ka-Shu Wong; Barbara A Ambrose
Journal:  Front Plant Sci       Date:  2014-06-26       Impact factor: 5.753

View more
  2 in total

1.  Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes.

Authors:  Tao Zhao; M Eric Schranz
Journal:  Proc Natl Acad Sci U S A       Date:  2019-01-23       Impact factor: 11.205

2.  Identification of Key Pathways and Genes Related to the Development of Hair Follicle Cycle in Cashmere Goats.

Authors:  Jianfang Wang; Jie Sui; Chao Mao; Xiaorui Li; Xingyi Chen; Chengcheng Liang; Xiaohui Wang; Si-Hu Wang; Cunling Jia
Journal:  Genes (Basel)       Date:  2021-01-27       Impact factor: 4.096

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.