Literature DB >> 34917776

Identification of deleterious single nucleotide polymorphism (SNP)s in the human TBX5 gene & prediction of their structural & functional consequences: An in silico approach.

A M U B Mahfuz¹, Md Arif Khan¹, Promita Deb¹, Sharmin Jahan Ansary¹, Rownak Jahan¹.

Abstract

T-box transcription factor 5 gene (TBX5) encodes the transcription factor TBX5, which plays a crucial role in the development of heart and upper limbs. Damaging single nucleotide variants in this gene alter the protein structure, disturb the functions of TBX5, and ultimately cause Holt-Oram Syndrome (HOS). By analyzing the available single nucleotide polymorphism information in the dbSNP database, this study was designed to identify the most deleterious TBX5 SNPs through in silico approaches and predict their structural and functional consequences. Fifty-eight missense substitutions were found damaging by sequence homology-based tools: SIFT and PROVEAN, and structure homology-based tool PolyPhen-2. Various disease association meta-predictors further scrutinized these SNPs. Additionally, conservation profile of the amino acid residues, their surface accessibility, stability, and structural integrity of the native protein upon mutations were assessed. From these analyses, finally 5 SNPs were detected as the most damaging ones: [rs1565941579 (P85S), rs1269970792 (W121R), rs772248871 (V153D), rs769113870 (E208D), and rs1318021626 (I222N)]. Analyses of stop-lost, nonsense, UTR, and splice site SNPs were also conducted. Through integrative bioinformatics analyses, this study has identified the SNPs that are deleterious to the TBX5 protein structure and have the potential to cause HOS. Further wet-lab experiments can validate these findings.

Entities: Chemical

Keywords: Congenital heart disease; Holt-oram syndrome; Mutation; SNP; TBX5

Year: 2021 PMID： 34917776 PMCID： PMC8646135 DOI： 10.1016/j.bbrep.2021.101179

Source DB: PubMed Journal: Biochem Biophys Rep ISSN： 2405-5808

Introduction

Heart is the first organ to develop in the human body and numerous genes and their products orchestrate this process. Among these genes, TBX5 is an important one. It is a member of the T-box gene family that encodes transcription factors with a highly conserved DNA binding domain (T-box) of 180–200 amino acids [1]. Members of this gene family have been found from metazoans to humankind, and in humans, more than 20 members of this gene family have been reported till now [2,3]. Moreover, several human diseases are associated with mutations in the T-box genes, namely, DiGeorge Syndrome (TBX1), Ulnar-Mammary Syndrome (TBX3), Small Patella Syndrome (TBX4), Holt-Oram Syndrome (TBX5), Spondylocostal Dysostosis (TBX6), Cousin Syndrome (TBX15), Isolated ACTH Deficiency (TBX19), Congenital Heart Defects (TBX20), X-Linked Cleft Palate with or without Ankyloglossia (TBX22) [4]. According to the recent information retrieved from NCBI and UniProt databases, the TBX5 gene is located in the long arm of chromosome 12 at position 24.21 (12q24.21), contains ten exons, and has three known transcript variants (1, 3, and 4). Transcript variants 1 and 4 encode a protein of 518 amino acid residues and transcript variant 3 encodes a 468 (51–518) amino acid containing protein with a shortened N-terminus. In the TBX5 transcription factor, the T-box spans from amino acid position 58 to 238. The binding site of TBX5 has been identified to be present in the upstream regions of various cardiac-expressed genes, such as cardiac α actin, atrial natriuretic factor, cardiac myosin heavy chain α, cardiac myosin heavy chain β, myosin light chain 1A, myosin light chain 1V and Nkx2-5. The consensus binding site of TBX5 corresponds to the first eight bases from one-half of the Brachyury binding 22 palindromic consensus sequence. It was shown in vitro that TBX5, when in total length, binds to the target site mainly as a dimer, and when truncated, tends to bind mainly as a monomer [5]. It was further showed in vitro that TBX5 can form a heterodimer complex with Nkx2-5, another essential transcription factor for cardiac development [6]. During the early stages of heart development, TBX5 acts mainly as a transcriptional activator of genes responsible for cardiomyocyte maturation and cardiac septa formation. It is also necessary for effective establishment of the cardiac conduction system [7]. TBX5 also plays a vital role in the very initiation of upper (fore) limbs [8], but the necessity of its presence during limb outgrowth phase remains controversial [8,9]. Holt-Oram Syndrome (HOS) is an autosomal dominant disease due to mutation in the TBX5 gene. It is considered 100% penetrant with varying expressivity in the heart and upper extremity [10]. However, incomplete penetrance [11], lack of penetrance [12], somatic mosaicism [12], and probable germinal mosaicism [13] have also been reported for this disease. The majority of the patients suffer from this syndrome due to de novo mutations [14]. Cardiac presentations include atrial septal defect, ventricular septal defect, atrioventricular septal defect, pulmonary atresia/stenosis, double outlet right ventricle, aortic valve insufficiency/stenosis, tricuspid valve atresia, mitral valve abnormality, patent ductus arteriosus, tetralogy of pentalogy of Fallot, common arterial truncus, dextrocardia, right aortic arch, and other non-specified congenital heart diseases [15]. Upper limb anomalies may include a hypoplastic or absent thumb, triphalangeal thumb, accessory/bifid thumb, aplasia/hypoplasia of hand and fingers, syndactyly, radial with or without ulnar aplasia or hypoplasia, radio-ulnar synostosis, hypoplasia of humerus, aplasia of humerus, and clavicular abnormalities [15]. Nonsense mutation [16], missense mutation [17], frameshift mutation [16,17], deletion [18], duplication [19], and splice site mutation [[20], [21], [22]] of the TBX5 gene have been reported to be responsible for HOS. It has been found that missense mutations tend to cause more serious cardiac anomalies than mutations that cause a truncated TBX5 protein (nonsense, frameshift, splice site point variants, intragenic deletions, or duplications) [12]. HOS is caused by heterozygous mutation. To the best of our knowledge, only one homozygous missense mutation in the TBX5 gene has been reported until present but only with cardiac involvement (atrioventricular septal defect) and no skeletal deformity [23]. TBX5 missense mutations are also associated with familial [24] and sporadic dilated cardiomyopathy [25], atrial fibrillation, and lone atrial fibrillation [26]. Furthermore, TBX5 non-synonymous (missense and nonsense) mutations have also been reported in association with tetralogy of Fallot without any skeletal abnormality [27] and a TBX5 3′UTR variant makes its carriers more prone to congenital heart diseases [28]. Considering the above-stated findings, this study was conducted to find out the most deleterious TBX5 SNPs and analyze their structural and functional outcomes in silico.

Materials and methods

The overall methodology adopted in this study is graphically presented in Fig. 1.

Fig. 1

An overview of the steps followed to identify the deleterious TBX5 SNPs and analyze their structural and functional consequences.

Retrieval of amino acid sequence and wild structure

Amino acid sequence of the human TBX5 protein was downloaded in FASTA format from the UniProt database (UniProt ID: Q99593) (https://www.uniprot.org/uniprot/Q99593). Experimental structures of TBX5 were collected from Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) (https://www.rcsb.org/).

Retrieval of SNPs

SNP information of the human TBX5 gene was collected using NCBI Variation Viewer (https://www.ncbi.nlm.nih.gov/variation/view/). Nonsense, stop lost, splice acceptor, splice donor, and UTR SNPs of the TBX5 gene along with their rs (reference SNP) IDs were retrieved from Variation Viewer applying the following filters: dbSNP under ‘Source database’ menu, single nucleotide variant under ‘Variant type’ menu, and nonsense (stop gained)/stop lost/splice acceptor variant/splice donor variant/5 prime UTR variant/3 prime UTR variant under ‘Molecular consequence’ menu. In addition, information about the missense SNPs was downloaded directly from the NCBI dbSNP database (https://www.ncbi.nlm.nih.gov/snp/).

Identification and analysis of damaging missense SNPs

A total of 355 missense SNPs were retrieved from dbSNP.

Prediction of damaging missense SNPs by sequence-homology based tools

The missense SNPs were first subjected to analysis by sequence-homology-based predictor PROVEAN [29]. PROVEAN predicts the effects of single or multiple amino acid substitutions or indel mutations on the functions of a protein. In PROVEAN, BLAST hits with more than 75% global sequence identity are clustered first. The top 30 clusters constitute the supporting sequence set, and a delta alignment score is computed for every supporting sequence. These scores are averaged within and across clusters to obtain the final PROVEAN score. An amino acid substitution is predicted ‘Deleterious’ if the PROVEAN score is below or equal to the predefined threshold value of −2.5, and is considered ‘Neutral’ if the PROVEAN score is above this threshold value [29]. In this study, a list of TBX5 gene SNPs defined by their chromosomal positions, reference alleles, and variant alleles was used as input to the PROVEAN server. PROVEAN also provides scores from SIFT, another prediction tool. SIFT (Sorting Intolerant From Tolerant) presumes that essential amino acids are well conserved in a protein family, and a replacement in a conserved position is predicted as ‘Damaging’ [30]. SIFT scores lie between 0 and 1, and amino acid substitutions with a score ≤0.05 are predicted to alter protein function, and hence are considered ‘Damaging’ [31]. 85 SNPs of the TBX5 gene were predicted as 'Deleterious' by PROVEAN, and 236 SNPs were predicted as ‘Damaging’ by SIFT. 80 SNPs were predicted as harmful by both tools.

Prediction of damaging missense SNPs by structure-homology based tool

PolyPhen-2 (Polymorphism Phenotyping version 2) prediction is made on eight sequence-based and three structure-based parameters. Two pairs of datasets, viz. HumDiv and HumVar were recruited to train and test two PolyPhen-2 models. PolyPhen-2 computes the Naive Bayes posterior probability of a given mutation's being damaging and estimates the false-positive rate and true-positive rate. It classifies SNPs as ‘Probably Damaging,’ ‘Possibly Damaging,’ and ‘Benign’ based on the false-positive rate (FPR) thresholds [32]. The 80 SNPs that were predicted harmful by both PROVEAN and SIFT were next analyzed by both HumDiv and HumVar trained PolyPhen-2 using the ‘batch query’ option. A list of SNPs defined by chromosome number, chromosomal position, reference allele, and variant allele was used as the input. 58 SNPs were predicted as ‘Probably Damaging’ by either HumDiv or HumVar PolyPhen-2, and these SNPs were considered of high fidelity in disease association.

Prediction of pathogenic amino acid substitutions by MutPred2

MutPred2 was developed by integrating genetic and molecular information to predict pathogenicity caused by amino acid substitutions [33]. The cutoff value for predicted pathogenicity is 0.5, and the higher the score, the greater the probability of that amino acid substitution's disease association. TBX5 protein sequence in FASTA format and wild type and substituted amino acids of 58 high-confidence deleterious SNPs with their respective positions were submitted to MutPred2 server using the default P-value threshold (0.05).

Scrutinization of predicted pathogenicity by meta-predictors

Several meta-predictors have been developed for predicting disease-causing SNPs in recent years. Such meta-predictors employ various single pathogenicity prediction tools and provide a consensus score derived from different calculations. It is well established that ensemble methods render improved predictive performance than any of its constituent algorithms alone. We decided to crosscheck the predicted 58 pathogenic SNPs with 5 such meta-predictors, namely i) PredictSNP2, ii) PredictSNP1, iii) MetaLR, iv) MetaSVM, and v) REVEL. PredictSNP2 generates a consensus score from the predictions of five tools- CADD, DANN, FATHMM, FunSeq2, and GWAVA [34]. The input to PredictSNP2 was similar to that used for PolyPhen-2. After analysis by PredictSNP2, the SNPs were sent for further analysis by PredicteSNP1. PredictSNP1 has incorporated six individual predictors- MAPP, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT, and SNAP [35]. MetaSVM and MetaLR are two meta-classifiers developed for the dbNSFP (database for non-synonymous SNPs' functional predictions) project. They calculate ensemble scores based on the predictions of 10 component tools and the maximum frequency observed in the 1000 genomes project populations [36]. The difference between them lies in their adopted approaches- one uses Support Vector Machine (MetaSVM) algorithm and the other uses Logistic Regression (MetaLR) algorithm. They were both accessed through Ensembl Variant Effect Predictor (VEP) (http://www.ensembl.org/Tools/VEP). rsID (Reference SNP ID number)s of the SNPs were used as the inputs and the chosen reference genome was GRCh37. REVEL (Rare Exome Variant Ensemble Learner) is a meta-predictor that integrates 13 tools to predict the probability of disease causation by a missense SNP. It is particularly efficient in differentiating pathogenic from rare neutral variants with allele frequencies <0.5% [37]. REVEL scores range between 0 and 1, and the threshold score for pathogenic variants is 0.5. REVEL scores in this study were obtained alongside MetaSVM and MetaLR scores using the Ensembl VEP.

Identification of conserved amino acid residues

Structural and functional importance of an amino acid residue in a protein is often strongly related to its level of conservation. ConSurf ranks evolutionary conservation status of amino acid and nucleic acid positions in protein and DNA/RNA molecules, respectively [38]. The continuous conservation scores of residues are split into an integer scale of 1–9 and depicted with a color scheme. The most variable positions (grade 1) are colored turquoise, intermediate conserved positions (grade 5) are colored white, and the most conserved positions (grade 9) are colored maroon. TBX5 protein sequence in FASTA format served as the input to ConSurf.

Solvent accessible surface area (SASA) and secondary structure prediction

Knowing the solvent accessibility of amino acid residues is essential to identify the interaction interfaces and active sites in a fully folded protein. Amino acid substitutions in such sites may bring change in binding affinity [39], or disturb catalytic activity if the protein is an enzyme [40]. NetSurfP-2.0 predicted surface accessibility of TBX5 residues and its secondary structure. It accepts protein sequence in FASTA format as the input and recruits deep neural networks that have been trained on crystal protein structures [41].

Prediction of change in stability of the mutated proteins by iStable 2.0

Amino acid substitutions caused by missense SNPs can significantly alter the folding free energy. Missense SNPs decreasing the stability of a native protein can affect the functions of that protein and ultimately result in diseases [42]. Missense SNPs causing destabilized proteins are associated with myopathy, von Willebrand disease, retinitis pigmentosa, hemophagocytic lymphohistiocytosis, and prion diseases. However, stabilizing missense SNPs can also cause disease [42]. In our study, changes in TBX5 protein stability due to missense SNPs were predicted by iStable 2.0 [43]. iStable 2.0 results are derived from 11 protein stability prediction tools [43]. For amino acid substitutions up to F232V, chain A of the TBX5 dimer crystal structure containing 1–239 residues (PDB ID: 5BQD) was used as the native structure. Due to the unavailability of an experimental structure, rest of the substitutions were analyzed using the protein's FASTA sequence.

Prediction of the effects of amino acid substitutions on the structural integrity of TBX5 protein by Missense3D and normal mode analysis

To analyze the impacts of amino acid substitutions on TBX5 three-dimensional structure, Missense3D server (Missense3D | Structural Bioinformatics Group | Imperial College London) was utilized. Missense3D considers 16 structural features to differentiate disease-associated SNPs from the neutral ones [45]. For substitutions up to F232V, ‘Position on Protein Sequence’ input option was chosen, and the UniProt ID (Q99593) and PDB code & chain ID (5BQD; A) along with the substitutions were used as inputs. For the rest of the substitutions, the energy minimized model generated by Robetta was used [see 2.8]. The mutated protein structures due to SNPs were later subjected to normal mode analysis by DynaMut [44]. DynaMut calculates probable consequences of an amino acid replacement on the stability of a submitted protein from vibrational entropy changes. Normal mode analysis is a widely applied tool for protein structure study [44]. The input was similar to Missense3D.

Molecular dynamics (MD) simulation

MD simulation studies of the five mutant structures distilled from the consensus of Missense3D, DynaMut, and iStable 2.0 [see 3.2.7], and the native TBX5 protein structure (PDB ID:5BQD, chain A) were conducted by employing GROMOS96 43a1 force field [46]. To mimic the physiological environment, a triclinic simulation box (36 × 36 × 44 Å) and 0.9% (0.15 M) NaCl concentration was chosen. Depending on total positive and negative charges, the system was made neutral by adding sodium or chlorine ions. Chosen solvation model was SPC. The system was energy minimized before starting simulation by applying Steepest Descent Algorithm for 5000 steps. MD simulation was continued for 100 ns under 300K constant temperature and 1.0 bar constant pressure for each protein structure. The trajectories were saved at 100 ps intervals.

Prediction of post-translational modification site missense SNPs

Protein post-translational modification (PTM)s occur at the time of or after protein synthesis and such modifications are mediated by different enzymes or covalent bonds with other structures. SNPs in the PTM sites can result in diseases [47]. SNPs in the putative PTM sites of the TBX5 protein were detected by AWESOME [47]. AWESOME predicts the PTM site SNPs based on the evaluation of 20 PTM site predictors. Gene name was used as the input to AWESOME.

Analysis of nonsense and stop lost SNPs

31 nonsense (stop gained) and one stop lost SNPs were retrieved from NCBI Variation Viewer. Among the 31 nonsense SNPs, 29 are already annotated as ‘Pathogenic’ in the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/). Rest 2 nonsense SNPs, rs1393693495 (W514Ter) and rs1555226308 (Q156Ter), and 1 stop lost SNP, rs1394777873 (Ter519Q) were analyzed by PredictSNP2 [35].

Prediction of the effects of UTR SNPs

Untranslated region (UTR)s play a pivotal role in the regulation of gene expression post-transcriptionally. They ascertain nucleo-cytoplasmic transport of mRNA, maintain translational efficiency, subcellular localization, and RNA stability [48]. The essential regulators in the 5′-UTR are Kozak consensus sequence (ACCAUGG) [Shine-Dalgarno consensus sequence (AGGAGG) in prokaryotes], CpG (5′—C—phosphate—G—3′) sites, uORFs (upstream open reading frames), IRESs (Internal Ribosome Entry Sites), and RBP (Ribosome-binding protein) binding sites. 3′-UTR length, RBP binding sites, and miRNA binding sites are the critical players in the 3′-UTR [49]. Single nucleotide variants in these regions have been found to be associated with many diseases like thalassemia intermedia, atrial septal defect [50], campomelic dysplasia [51], hereditary chronic pancreatitis [52], Marie Unna hereditary hypotrichosis [53], Charcot-Marie-Tooth disease [54], and cerebral amyloid angiopathy [55] to name a few. So, it is essential to identify potentially harmful SNPs in the 3′ & 5′-UTR. TBX5 UTR SNPs that were predicted to affect transcriptional motifs were retrieved from the UTR database (UTRdb (cnr.it)) [56]. ‘UTRef’ (NCBI RefSeq transcripts) as the searching database and ‘Gene Name’ as the accession type were chosen during retrieval. MicroRNA (miRNA)s also have critical roles in post-transcriptional regulation of gene expression through mRNA cleavage or translational repression [57]. miRNAs are small non-coding RNAs having about 22 nucleotides. miRNA seed region (positions 2–8 on miRNA) is the most crucial sequence in finding the complementary sequence on mRNA and subsequently binding with it. SNPs in the miRNA target sites in the 3′ UTR may create or delete an miRNA target or change the binding efficiency between miRNA and mRNA. Creation or deletion of an miRNA target site may also affect binding in the neighboring miRNA target sequences. Such SNPs have been reported in several diseases, including cancers, psychiatric illnesses, cardiomyopathy, asthma, and Parkinson's disease [58,59]. In our study, SNPs in the predicted miRNA target sites were identified using the MirSNP server [59]. Gene name was used as the input to MirSNP.

Prediction of the effects of splice site SNPs

RNA splicing is necessary for converting a precursor mRNA to a mature mRNA ready for translation. Various cis-regulatory elements and trans-acting elements maintain this splicing process. Cis-regulatory elements include 5′ and 3′ consensus splice site sequences in the exon-intron boundaries, branch point sequences, and polypyrimidine tract sequences. Trans-acting elements include spliceosomal small nuclear RNAs, proteins, and various other splicing repressors and activators, many of which are still unknown [60,61]. Mutations that hamper the controlled process of splicing lead to a good number of diseases. Around 9% of all the pathogenic mutations recorded in the Human Gene Mutation Database are splicing ones [60,61]. SNPs in the splice donor and acceptor sites may break an established splice site or create a new splice site. They may also activate cryptic sequences (sequences similar to splice sites) or affect splicing enhancers and silencers. They may even intercept the binding of spliceosomal components by making a conformational change in the mRNA secondary structure [60]. In this study, seven splice donor single nucleotide variants and seven splice acceptor single nucleotide variants were retrieved from dbSNP using NCBI Variation Viewer. Among the seven donor variants, two are annotated as ‘Likely-pathogenic’. Among the seven acceptor variants, one is annotated as ‘Pathogenic’ and three are annotated as ‘Likely-pathogenic’ in the ClinVar database. Excluding them, the rest 8 SNPs were analyzed employing Human Splicing Finder (HSF) [62]. For our analysis, ‘Analyze a sequence’ and ‘Paste your own sequence’ options were chosen and individual variants with a flanking sequence of 100 nucleotides in both the 5′ and 3′ directions were used as inputs.

Prediction of the effects of deep functional intronic SNPs

Intronic regions located >100 base pairs away from the exon-intron borders are known as deep intronic regions [63]. Although they have been overlooked for a long time, they are now considered important, and mutations in these regions have been found in association with more than 75 diseases. Deep intronic point mutations, deletions, and insertions are held accountable for these diseases, with the first one being the most frequent [63]. The rs (reference SNP) IDs of all the TBX5 intron variants were submitted to SNP Function Prediction (FuncPred; SNPinfo Web Server (nih.gov)) using default settings [64]. 19 SNPs were predicted by FuncPred to affect TF binding sites and were next analyzed by RegulomeDB 2.0 server. RegulomeDB 2.0 is an annotated database of SNPs in the human intergenic regions that have either established or predicted regulatory roles [65].

3D structure prediction of full TBX5 protein

Currently, there are four experimental structures of TBX5 (2X6U, 2X6V, 4SOH, and 5BQD) deposited in the Protein Data Bank (PDB). Nevertheless, all of them cover less than half of the total protein. So, we decided to predict the whole structure of TBX5 protein by three reliable protein structure prediction tools, namely I-TASSER [66], Phyre2 [67], and Robetta [68]. Energy minimization of the generated TBX5 models was achieved by applying YASARA force field. YASARA force field is a combination of AMBER all-atom force field and multi-dimensional knowledge-based torsional potentials [69]. The energy minimized models were further refined by GalaxyRefine server (http://galaxy.seoklab.org/cgi-bin/submit.cgi?type=REFINE).

Evaluation of TBX5 3D models

The quality of the three predicted TBX5 models was evaluated by PROCHECK [70], Verify3D [71], and SWISS-MODEL Structure Assessment Tool [72]. PROCHECK examines specific stereochemical and geometrical properties of the query protein structure and compares them against a set of ideal values of these properties obtained from well-refined high-resolution protein structures. It assesses residue-by-residue geometry and overall structural geometry of the query structure [70]. A Ramachandran plot of φ-ψ torsion angles generated by PROCHECK is the most helpful quality indicator and a good quality model is expected to have over 90% of its residues in the most favored regions. Verify3D checks the agreement between a protein tertiary structure and its primary structure [71]. SWISS-MODEL Structure Assessment Tool provides assessments from MolProbity and QMEAN algorithms. These algorithms evaluate a model quality at both global (i.e. for the entire structure) and local (i.e. per residue) levels [72,73]. QMEAN Z-score is an indicator of the “degree of nativeness” of a model [74].

Results

SNP information retrieval

Searching the NCBI Variation Viewer using dbSNP as the ‘source database’ and single nucleotide variant (SNV) as the ‘variant type’ returned 14,413 rsIDs associated with the TBX5 gene as single nucleotide variants. Three hundred fifty-five rsIDs were found to be missense variants, 30 rsIDs to be nonsense (stop gained) variants, one rsID to be ‘stop lost’ variant, 210 rsIDs to be synonymous variants, 11,300 rsIDs to be intron variants, 821 rsIDs to be non-coding transcript variants, seven rsIDs to be splice acceptor and seven rsIDs to be splice donor variants, 689 rsIDs to be five prime UTR variants, and 345 rsIDs to be three prime UTR variants. Fig. 2 is a pie chart showing different types of SNPs. On further inspection, 404 missense variants with 355 rsIDs (3 nonsense variants are included in rs377649723, rs765204502, and rs903933027, and 11 synonymous variants are included in rs186780790, rs377532269, rs533581420, rs374600913, rs753688559, rs762204624, rs771883815, rs777853147, rs972633602, rs973631007, and rs769107196) and 28 nonsense variants with 28 rs IDs (27 stop gained and 1 stop lost-rs1394777873) were found [total nonsense variants= 28+3= 31]. The reason behind a single rsID having more than one variant is these variants were reported on the same chromosomal location. As per the NCBI rule, only one rsID is assigned to a specific chromosomal location [75].

Fig. 2

Pie-chart displaying the distribution of TBX5 SNPs.

Prediction of missense SNPs by sequence-homology based tools

The retrieved 404 missense SNPs were analyzed first using PROVEAN. Out of 404 submitted substitutions, PROVEAN predicted 85 as 'Deleterious' and 301 as ‘Neutral’, and SIFT predicted 236 as ‘Damaging’ and 150 as ‘Tolerated’. 242 substitutions were predicted as 'Deleterious' or ‘Damaging’ by either PROVEAN or SIFT, respectively, and 80 substitutions were predicted as 'Deleterious' or ‘Damaging’ by both tools. Results of rs571328934, rs755664220, rs756902625, rs763178387, rs769107196, rs769895888, rs777370158, rs907398622, rs945043550, rs1263504354, rs1407437475, and rs1489969947 could not be predicted by PROVEAN.

Prediction of missense SNPs by structure-homology based tool

For high fidelity, the 80 SNPs predicted by both PROVEAN and SIFT as 'Deleterious'/‘Damaging’ were next analyzed by PolyPhen-2. Both HumDiv and HumVar dataset trained PolyPhen-2 models were used. The HumDiv trained PolyPhen-2 predicted 58 substitutions as ‘probably damaging’ (Supplementary Tables 1) and 16 substitutions as ‘possibly damaging’, and 6 substitutions as ‘benign’. The HumVar trained model predicted 56 substitutions as ‘probably damaging’ (Supplementary Tables 1) and 10 substitutions as ‘possibly damaging’, and 14 substitutions as ‘benign’. The 58 SNPs that were predicted as ‘probably damaging’ by HumDiv were also predicted as ‘probably damaging by HumVar except for rs190881877 (V205F) and rs944423586 (P337H), which were predicted as ‘possibly damaging’. rs1411518530 is associated with two nucleotide variants (G > C, T), but they encode the same amino acid (D110E) and were considered a single variant in the subsequent evaluations. Predictions by PROVEAN, SIFT, and PolyPhen-2 are available in Supplementary Table 1. The threshold score of pathogenicity, as predicted by MutPred2, is 0.5 and a substitution having a MutPred2 score ≥0.8 can be considered highly likely to be pathogenic. All but two (K226R and P337H) of the 58 high-confidence deleterious SNPs have a MutPred2 prediction score ≥0.5. Supplementary Table 2 provides MutPred2 outcomes. Pathogenicity of the aforementioned substitutions was further examined by five deleteriousness meta-predictors whose predictions are based on various ensemble methods. Forty-eight substitutions were predicted ‘Damaging’ by all the meta-predictors and the rest were predicted ‘Damaging’ by at least two meta-predictors. The results are shown in Table 1.

Table 1

Amino acid substitution and disease causation predictions by five meta-predictors.

AAS	PredictSNP 1		PredictSNP 2		MetaLR		MetaSVM		REVEL
AAS	Prediction	Expected Accuracy	Prediction	Expected Accuracy	Prediction	Score	Prediction	Score	Prediction	Score
L58P	D^†	0.87	D	0.87	D	0.93	D	1.08	D	0.98
T72K	D	0.87	D	0.87	D	0.88	D	1.02	D	0.94
T72 M	D	0.87	D	0.87	D	0.83	D	0.86	D	0.95
G80R	D	0.87	D	0.87	D	0.94	D	1.10	D	0.98
F84L	D	0.72	D	0.87	D	0.91	D	1.06	D	0.97
P85L	D	0.87	D	0.87	D	0.96	D	1.09	D	0.96
P85S	D	0.87	D	0.87	D	0.96	D	1.10	D	0.97
V91A	D	0.65	D	0.87	D	0.78	D	0.62	D	0.73
G93S	D	0.87	D	0.87	D	0.96	D	1.10	D	0.97
G93D	D	0.87	D	0.87	D	0.96	D	1.10	D	0.94
G93V	D	0.61	D	0.87	D	0.96	D	1.10	D	0.94
P96S	N^§	0.63	D	0.87	D	0.86	D	0.87	D	0.83
Y100C	D	0.87	D	0.87	D	0.96	D	1.10	D	0.98
Y100F	D	0.76	D	0.82	D	0.93	D	1.08	D	0.93
M104T	D	0.65	D	0.87	D	0.86	D	0.93	D	0.93
D105 N	D	0.87	D	0.87	D	0.89	D	1.01	D	0.87
V107E	D	0.87	D	0.82	D	0.76	D	0.67	D	0.92
P108T	D	0.72	D	0.87	D	0.86	D	0.92	D	0.88
D110E	D	0.87	N	0.65	D	0.83	D	0.76	D	0.72
D110 N	D	0.87	D	0.87	D	0.86	D	0.91	D	0.77
D111Y	D	0.76	D	0.87	D	0.82	D	0.79	D	0.87
R113I	D	0.87	D	0.82	D	0.89	D	1.00	D	0.95
R113K	D	0.61	D	0.87	D	0.83	D	0.84	D	0.92
Y114H	D	0.87	D	0.87	D	0.85	D	0.89	D	0.97
W121R	D	0.87	D	0.87	D	0.97	D	1.08	D	0.98
G125R	D	0.87	D	0.87	D	0.89	D	1.00	D	0.90
G133C	N	0.63	D	0.87	D	0.76	D	0.60	D	0.68
L135R	D	0.87	D	0.87	D	0.78	D	0.72	D	0.87
V137A	D	0.51	D	0.87	D	0.79	D	0.74	D	0.89
D140E	N	0.65	D	0.87	D	0.79	D	0.65	D	0.74
P142S	D	0.87	D	0.87	D	0.84	D	0.90	D	0.76
A143T	D	0.55	D	0.87	D	0.81	D	0.73	D	0.59
G145W	D	0.87	D	0.87	D	0.98	D	1.04	D	0.89
V153D	D	0.87	D	0.82	D	0.89	D	1.00	D	0.95
L160F	D	0.87	D	0.87	D	0.89	D	1.02	D	0.88
H170L	D	0.72	N	0.63	D	0.78	D	0.75	D	0.96
N174D	D	0.87	D	0.87	D	0.87	D	0.99	D	0.81
V186G	D	0.87	D	0.87	D	0.87	D	0.93	D	0.94
V186 M	D	0.61	D	0.87	D	0.86	D	0.86	D	0.85
D189Y	D	0.87	D	0.87	D	0.84	D	0.80	D	0.68
H204D	D	0.87	D	0.87	D	0.80	D	0.69	D	0.88
V205F	D	0.72	D	0.87	D	0.80	D	0.70	D	0.75
E208D	D	0.51	D	0.87	D	0.84	D	0.80	D	0.71
A213T	D	0.55	D	0.87	D	0.85	D	0.71	D	0.72
I222N	D	0.87	D	0.87	D	0.89	D	0.94	D	0.90
T223 M	D	0.87	D	0.87	D	0.91	D	1.05	D	0.96
K226R	D	0.76	D	0.87	D	0.92	D	1.01	D	0.88
K226E	D	0.87	D	0.87	D	0.93	D	1.10	D	0.96
F232V	D	0.87	D	0.82	D	0.92	D	1.08	D	0.95
R237P	D	0.87	D	0.87	D	0.91	D	1.06	D	0.95
R237Q	D	0.87	D	0.87	D	0.89	D	1.00	D	0.92
R237W	D	0.87	D	0.87	D	0.90	D	1.03	D	0.91
S261C	D	0.55	D	0.82	D	0.71	D	0.54	D	0.88
C323Y	D	0.72	D	0.87	D	0.85	D	0.64	T	0.47
P337H	D	0.61	D	0.87	T^‡	0.20	T	−0.66	T	0.20
R375W	D	0.76	D	0.87	T	0.33	T	−0.32	T	0.38
W401C	D	0.72	D	0.82	T	0.33	T	−0.36	T	0.44

AAS = Amino Acid Substitution, D = Deleterious/Damaging/Disease, T = Tolerated, §N=Neutral.

Amino acid substitution and disease causation predictions by five meta-predictors. AAS = Amino Acid Substitution, D = Deleterious/Damaging/Disease, T = Tolerated, §N=Neutral.

Prediction of amino acid conservation status, solvent accessibility, and TBX5 secondary structure

ConSurf provided position-specific amino acid conservation score of the TBX5 residues. It also ranked the amino acid residues in a color-coded scale of integers ranging from 1 to 9 where 1 indicates a highly variable and 9 indicates the most conserved residues. It was found that most of the substitutions are located in highly conserved positions. NetSurfP-2.0 predicted solvent accessibility (exposed or buried) of the amino acids and provided each residue's relative and total accessible surface area. 37 substitutions were found to occur in buried residues and 21 substitutions were found to be in exposed residues. Table 2 summarizes ConSurf and NetSurfP-2.0 outcomes. Fig. 3, Fig. 4 are graphical presentations of these results.

Table 2

Predictions by ConSurf and NetSurfP-2.0.

ConSurf					NetSurfP-2.0
Position	AA	Normalized Conservation Score	COLOR	RESIDUE VARIETY	Assigned Class	RSAa	ASAb (Å)
58	L	−0.784	8	T,L,M,H	Buried	0.068	12.397
72	T	−1.098	9	T, S	Buried	0.094	13.02
80	G	−0.858	8	N,H,G	Buried	0.184	14.447
84	F	−0.965	9	C, F	Buried	0.184	37.024
85	P	−0.917	9	G, V, P	Buried	0.094	13.32
91	V	−0.972	9	L,X,V,C,M	Buried	0.062	9.596
93	G	−0.889	8	K,G,R,X	Exposed	0.526	41.429
96	P	−0.715	8	S,A,K,P,E	Exposed	0.369	52.33
100	Y	−1.095	9	Y	Buried	0.068	14.536
104	M	−0.334	6	M,V,I,A,T	Buried	0.009	1.773
105	D	−1.138	9	D	Buried	0.174	25.002
107	V	−1.025	9	I, V	Buried	0.163	25.004
108	P	−0.651	8	A,T,S,Q,P	Buried	0.218	31.002
110	D	−1.138	9	D	Exposed	0.347	50.019
111	D	−0.774	8	E, D	Exposed	0.653	94.129
113	R	−1.137	9	R	Exposed	0.277	63.479
114	Y	−1.095	9	Y	Buried	0.064	13.575
121	W	−0.647	8	W,R,C	Buried	0.249	59.799
125	G	−0.861	8	G,P,S	Exposed	0.293	23.038
133	G	−0.794	8	P,R,G,K	Exposed	0.507	39.935
135	L	−0.875	9	M,L,P	Buried	0.216	39.513
137	V	−0.981	9	P,I,V,L	Exposed	0.292	44.873
140	D	−0.944	9	A,S,D	Exposed	0.559	80.603
142	P	−1.116	9	P	Exposed	0.297	42.08
143	A	−0.875	8	P,V,G,S,A	Buried	0.223	24.613
145	G	−0.901	9	D,G,T	Buried	0.042	3.271
153	V	−0.761	8	I, V, L	Buried	0.041	6.232
160	L	−0.81	8	R,P,T,L	Buried	0.034	6.174
170	H	−0.889	8	H,F,S,Q	Buried	0.165	29.977
174	N	−0.884	8	A,G,S,N,R	Exposed	0.279	40.854
186	V	−0.825	8	A,L,S,M,V,P	Buried	0.046	7.095
189	D	−0.802	8	H,P,A,S,D,T	Exposed	0.44	63.434
204	H	−0.657	8	H,R,F,Q,Y,A	Buried	0.141	25.641
205	V	−0.275	6	W,A,S,L,V,M,I	Exposed	0.44	67.687
208	E	−0.846	8	L,W,P,V,E	Exposed	0.283	49.385
213	A	−0.689	8	G,T,S,A,P,M	Buried	0.028	3.079
222	I	−0.944	9	I,V,N,W	Buried	0.044	8.056
223	T	−1.012	9	T, Q, N	Buried	0.159	22.092
226	K	−1.021	9	R, K	Buried	0.128	26.428
232	F	−0.808	8	N,Y,F,L	Exposed	0.4	80.298
237	R	−1.099	9	R	Exposed	0.545	124.801
261	S	−0.451	7	N,D,S,T	Exposed	0.424	49.711
323	C	−0.522	7	C,R,Y,L,S,G	Buried	0.247	34.65
337	P	1.6	1	G,T,D,L,S,Q,E,P,A,H	Exposed	0.578	81.958
375	R	−0.72	8	G,S,Q,R	Exposed	0.412	94.327
401	W	−1.02	9	W	Exposed	0.309	74.289

RSA = Relative Surface Accessibility.

ASA = Absolute Surface Accessibility.

Fig. 3

Detection of conserved amino acids by ConSurf. Here, e = an exposed residue, b = a buried residue, f = a functional residue (highly conserved and exposed), and s = a structural residue (highly conserved and buried).

Fig. 4

Solvent accessibility predictions of TBX5 residues. Here, Relative Surface Accessibility: Red upward elevations are exposed residues and sky-blue low elevations are buried residues, thresholded at 25%; Secondary Structures: orange spiral = helix, indigo arrow = strand, and pink straight line = coil; disorder: swollen black line, thickness of line corresponds to the probability of disorder. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Predictions by ConSurf and NetSurfP-2.0. RSA = Relative Surface Accessibility. ASA = Absolute Surface Accessibility. Detection of conserved amino acids by ConSurf. Here, e = an exposed residue, b = a buried residue, f = a functional residue (highly conserved and exposed), and s = a structural residue (highly conserved and buried). Solvent accessibility predictions of TBX5 residues. Here, Relative Surface Accessibility: Red upward elevations are exposed residues and sky-blue low elevations are buried residues, thresholded at 25%; Secondary Structures: orange spiral = helix, indigo arrow = strand, and pink straight line = coil; disorder: swollen black line, thickness of line corresponds to the probability of disorder. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.) The amino acid substitutions were next analyzed by protein stability prediction tool, iStable 2.0. iStable 2.0 predicted 47 substitutions as destabilizing (Table 3).

Table 3

Prediction of stability of the mutated proteins by iStable 2.0, DynaMut, and Missense3D.

AAS	iStable 2.0		DynaMut		Missense3D
AAS	Confidence Score	Stability	ΔΔG (kcal/mol)	Stability	Structural Damage	Comment
L58P	−1.7689416	Decrease	−1.348	Destabilizing	No
T72K	−1.0673743	Decrease	0.677	Stabilizing	Yes	Buried charge introduced
T72 M	−0.067792356	Increase	0.472	Stabilizing	No
G80R	−1.4305459	Decrease	0.826	Stabilizing	Yes	Clash alert, Buried charge introduced, Disallowed phi/psi alert, Buried Gly replaced
F84L	−1.3939497	Decrease	−0.815	Destabilizing	No
P85L	−0.40284556	Increase	−0.257	Destabilizing	Yes	Cis proline replaced
P85S	−1.9663911	Decrease	−0.991	Destabilizing	Yes	Cis proline replaced
V91A	−2.363688	Decrease	−1.751	Destabilizing	No
G93S	−1.005738	Decrease	−0.948	Destabilizing	No
G93D	−1.0917563	Decrease	−0.592	Destabilizing	No
G93V	−1.1831716	Increase	−0.153	Destabilizing	No
P96S	−1.0492245	Decrease	1.117	Stabilizing	No
Y100C	−1.5898333	Decrease	−0.83	Destabilizing	No
Y100F	−1.0702753	Decrease	0.061	Stabilizing	No
M104T	−3.143975	Decrease	−0.444	Destabilizing	No
D105 N	−0.9833634	Decrease	0.443	Stabilizing	Yes	Buried salt bridge breakage
V107E	−1.7875004	Decrease	0.555	Stabilizing	No
P108T	−0.82126737	Decrease	0.924	Stabilizing	No
D110E	−0.68696284	Decrease	−1.513	Destabilizing	No
D110 N	−0.63278186	Decrease	−1.912	Destabilizing	No
D111Y	−0.11971706	Decrease	−0.260	Destabilizing	No
R113I	−0.0076336265	Increase	−0.166	Destabilizing	No
R113K	−1.2718449	Decrease	−0.449	Destabilizing	No
Y114H	−2.133617	Decrease	−1.729	Destabilizing	No
W121R	−1.7060409	Decrease	−1.785	Destabilizing	Yes	Cavity altered (expansion of the cavity volume by 76.464 Å^3)
G125R	−0.8936368	Decrease	1.276	Stabilizing	Yes	Disallowed phi/psi alert
G133C	−0.7795974	Decrease	0.599	Stabilizing	No
L135R	−0.6822803	Decrease	−0.049	Destabilizing	No
V137A	−1.5604656	Decrease	−0.738	Destabilizing	No
D140E	−0.47431314	Increase	0.009	Stabilizing	No
P142S	−1.0563802	Decrease	0.281	Stabilizing	Yes	Cis proline replaced
A143T	−1.4433032	Decrease	0.885	Stabilizing	No
G145W	−0.910699	Increase	−0.37	Destabilizing	Yes	Clash alert, Buried Gly replaced
V153D	−2.2433639	Decrease	−1.831	Destabilizing	Yes	Buried hydrophilic residue introduced, Buried charge introduced
L160F	−1.6011777	Decrease	−0.659	Destabilizing	No
H170L	0.340119	Increase	0.057	Stabilizing	No
N174D	−0.6325445	Decrease	−0.051	Destabilizing	No
V186G	−2.7898633	Decrease	−0.109	Destabilizing	No
V186 M	−0.37593216	Decrease	0.347	Stabilizing	No
D189Y	−1.0340729	Decrease	−0.006	Destabilizing	No
H204D	−1.2200044	Decrease	−1.702	Destabilizing	No
V205F	−1.4460453	Decrease	0.454	Stabilizing	No
E208D	−1.1203609	Decrease	−1.772	Destabilizing	Yes	Buried H-bond breakage
A213T	−1.3542162	Decrease	−1.086	Destabilizing	No
I222N	−2.3990505	Decrease	−1.566	Destabilizing	Yes	Buried hydrophilic residue introduced
T223 M	−0.053840578	Decrease	0.863	Stabilizing	No
K226R	−0.55279315	Decrease	−0.600	Destabilizing	No
K226E	−1.2064422	Decrease	−1.730	Destabilizing	No
F232V	−0.04029095	Decrease	−0.108	Destabilizing	No
R237P	−1.6782372	Decrease	−0.338	Destabilizing	No
R237Q	−0.98735225	Decrease	−0.627	Destabilizing	No
R237W	−0.68915725	Increase	0.903	Stabilizing	No
S261C	−0.1880107	Increase	1.311	Stabilizing	Yes	Cavity altered (expansion of the cavity volume by 113.616 Å^3)
C323Y	−0.45005268	Increase	0.91	Stabilizing	Yes	Clash alert
P337H	−0.93664145	Decrease	0.614	Stabilizing	No
R375W	−0.0087457895	Increase	−0.659	Destabilizing	No
W401C	−0.64045143	Decrease	1.108	Stabilizing	Yes	Cavity altered (expansion of the cavity volume by 103.68 Å^3)

Prediction of stability of the mutated proteins by iStable 2.0, DynaMut, and Missense3D. The amino acid substitutions were subsequently analyzed by Missense3D to see how these amino acid substitutions would affect protein tertiary structure. 15 substituted structures were predicted to be ‘Damaging’ by Missense3D (Table 3). Details of Missense3D predictions can be found in Supplementary Figs. 3–17. DynaMut predicted 37 substitutions as destabilizing for the TBX5 protein. Among the 15 substitutions predicted to be detrimental by Misssense3D, 13 were also predicted as damaging by either iStable2.0 or DynaMut, and 5 were predicted to be destabilizing by both (Fig. 5). Table 3 provides predictions of these tools.

Fig. 5

Close-up views of the 5 substitutions predicted to be damaging by iStable 2.0, Missense3D, and DynaMut. These 5 substitutions are: A) P85S, B) W121R, C) V153D, D) E208D, and E) I222N. Native amino acids are shown in green and mutated amino acids are shown in red in this figure. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Molecular dynamics simulation

Differences between the backbone atom RMSD profiles of the native (5BQD_A) and mutant structures are visible in Fig. 6 (A). RMSD profiles of most mutant structures (P85S, W121R, V153D, and E208D) showed more deviations than the native structure. RMSD of these mutants showed an upward pattern from approximately 50 ns, whereas RMSD of 5BQD_A remained uniform from 15 ns onward. From the RMSF profiles (Fig. 6 (B)), it can be found that residue-wise RMSF values of the mutant structures were greater than the native structure for most residues. Greater deviations in the RMSD and RMSF profiles support more flexible nature of these TBX5 mutants. However, RMSD and RMSF profiles of I222N were found more stable than 5BQD_A.

Fig. 6

(A) RMSD values of the backbone atoms of native and mutant TBX5 proteins, (B) Residue-wise RMSF values of native and mutant TBX5 proteins, (C) radius of gyration values, (D) SASA values, and (E) number of hydrogen bonds of native and mutant TBX5 proteins. The radius of gyration (Rg) profile of 5BQD_A remained regular during the simulation period (Fig. 6 (C)). All the mutants showed increased Rg values than 5BQD_A up to 50 ns. Then, the Rg values of W121R and E208D decreased than 5BQD_A but became fluctuating, especially of W121R. V153D and I222N maintained a higher Rg profile than 5BQD_A throughout the simulation period. Rg profile reflects the compactness of a protein. From Fig. 6(C), it can be deduced that none of the mutants would be as stable as the native protein. SASA values of P85S, W121R, and I222N were greater than 5BQD_A for most of the time, and V153D and E208D had lower SASA than 5BQD_A. A lower than normal SASA may prevent effective contact between TBX5 and DNA, and thereby interfere with its activity as a transcription factor. E208D and I222N had more hydrogen bonds present than 5BQD_A during the simulation (Fig. 6 (E)). This may be the cause of stable RMSD and RMSF of I222N and decreased SASA and Rg values of E208D. Other mutants had hydrogen bonds more or less similar to 5BQD_A. The above-mentioned findings from molecular dynamics simulations suggest TBX5 mutant proteins bear the possibility to hamper the biological activities of the native protein, because these mutants become either more flexible or rigid compared to the native state. Among the 58 high-confidence deleterious SNPs, AWESOME predicted P108T, D111Y, and A143T substitutions could create new phosphorylation sites (AWESOME scores are −0.711, −1.458, and −0.05, respectively), and S261C substitution can cause loss of a phosphorylation site (AWESOME score 2.752). R237P, R237Q, R237W, and R375W can cause loss of methylation sites (AWESOME scores are 0.834, 0.834, 0.834, and 0.768, respectively). S261C substitution can cause loss of serine mediated O-linked N-acetylgalactosamine (AWESOME score 0.8291) and O-linked N-acetylglucosamine formation (AWESOME score 0.4322). P108T and A143T substitutions can cause threonine mediated O-linked N-acetylglucosamine formation (AWESOME scores are −0.3507 and −0.354, respectively).

Evidence of deleteriousness from experiments

Among our identified 58 missense SNPs, G80R [76], G125R [77], T223 M [78], R237P [79], R237Q, and R237W [76] have been confirmed experimentally to be responsible for HOS. G80R, R237P, R237Q, and R237W cause loss of function, whereas G125R leads to gain of function [79].

Identification and analyzing nonsense and stop lost SNPs

2 stop gained (nonsense) and 1 stop lost SNP were analyzed by PredictSNP2. It predicted the nonsense mutations as deleterious and the stop lost mutation as neutral. Q156Ter nonsense mutation is significant because it disrupts the highly conserved T-box (amino acid position 58–238). Table 4 shows the results from PredictSNP2 server.

Table 4

Predictions about nonsense and stop lost SNPs by PredictSNP2.

Variant	PredictSNP2		CADD		DANN		FATHMM		FunSeq2		GWAVA
Variant	P	S	P	S	P	S	P	S	P	S	P	S
W514Ter	Del	0.6777	Del	46	Del	0.9959	Del	0.9623	Del	5	?	0.39
Q156Ter	Del	1	Del	38	Del	0.9984	Del	0.9167	Del	5	Del	0.5
Ter519Q	Neu	−0.1146	Neu	14.06	Neu	0.758	Del	0.9717	Neu	0	Del	0.31

P= Prediction, S= Score, Del = Deleterious, and Neu = Neutral.

Predictions about nonsense and stop lost SNPs by PredictSNP2. P= Prediction, S= Score, Del = Deleterious, and Neu = Neutral. UTR SNPs that can affect transcriptional motifs were retrieved from the UTRdb server. Three SNPs in the 3′ UTR (rs28730760, rs28730761, and rs883079) were found to be present in the polyadenylation (poly-A) sites, and hence may be pathogenic. In addition, UTRdb output returned six transcriptional motif matches in the 5′ UTR and three transcriptional motif matches in the 3′ UTR. Table 5 provides results from UTRdb.

Table 5

Transcriptional motifs present in TBX5 untranslated regions.

Signal Name	UTR region	Match total	NCBI Reference Sequence	Genomic Position (NCBI36/hg18)
Uorf (Upstream Open Reading Frame)	5′	1	NM_181486.4	Chr12:113328041–113328106
		2	NM_000192.3	Chr12:113330378–113330629
		2	NM_000192.3	Chr12:113330136–113330216
		2	NM_080717.3	Chr12:113330378–113330629
		2	NM_080717.3	Chr12:113330136–113330216
IRES (Internal Ribosome Entry Site)	5′	1	NM_000192.3	Chr12:113326087–113330055
CPE (Cytoplasmic polyadenylation element)	3′	1	NM_181486.4	Chr12:113276118–113276332
K-Box	3′	1	NM_181486.5	Chr12:113276848–113276855
PAS (Polyadenylation Signal)	3′	1	NM_181486.6	Chr12:113276118–113276144

Transcriptional motifs present in TBX5 untranslated regions. 3′ UTR SNPs that might create or break an miRNA target site, or enhance or decrease miRNA binding to mRNA were predicted by MirSNP. A total number of 86 SNPs in the 3′ untranslated region of the TBX5 gene were predicted to alter miRNA-mRNA binding. Supplementary Table 3 shows the results from MirSNP server. Human Splicing Finder (HSF) web tool was employed to predict the effects of SNPs located in the 5′ and 3′ splice sites. A total number of 11 splice site SNPs were subjected to assessment. All were found to have ‘probably no impact on splicing’ by HSF. Table 6 shows the predictions obtained from HSF.

Table 6

Predictions about splice site SNPs by HSF.

rsIDs	Alleles	Predicted Signal	Interpretation
Splice Acceptor Variant
rs1555226330	C > A	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs1031873727	G > A	ESS Site broken	Alteration of an intronic ESS site, Probably no impact on splicing
rs1031873727	G > T	ESS Site broken	Alteration of an intronic ESS site, Probably no impact on splicing
rs891464399	G > A	New ESE Site	Creation of an intronic ESE site, Probably no impact on splicing
rs958951320	C > T	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs181078973	A > C	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs181078973	A > G	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs374919751	C > G	New ESE Site	Creation of an intronic ESE site, Probably no impact on splicing
rs374919751	C > T	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs1288475235	C > A	No significant splicing motif alteration detected	This mutation has probably no impact on splicing
rs1015550731	T > C	ESS Site broken	Alteration of an intronic ESS site, Probably no impact on splicing

Predictions about splice site SNPs by HSF. 19 among all the TBX5 intron variants were predicted to affect transcription factor binding sites by FuncPred. These 19 SNPs were further analyzed by RegulomeDB 2.0 server, and only two among these SNPs (rs12827969 and rs61931002) were considered functionally important. Table 7 summarizes results from SNP FuncPred and RegulomeDB.

Table 7

Predictions regarding deep intronic SNPs.

SNP FuncPred						RegulomeDB
rsID	Chromosome	Position	Allele	Regulatory Potential	Conservation	Rank	Probability
rs11067100	12	113328237	G/T	0.271604	0.998	4	0.60906
rs11067101	12	113328586	A/G	0.186436	0	4	0.60906
rs11613605	12	113326220	C/G	0.099303	0	4	0.60906
rs11837917	12	113329990	G/T	0.096837	0	4	0.60906
rs12319868	12	113329396	G/T	0.150806	0	4	0.60906
rs12372585	12	113330591	A/G	0.225674	0.19	4	0.60906
rs12423887	12	113327989	C/T	0.410483	0.025	4	0.60906
rs1248046	12	113329889	T/C	0	0	4	0.60906
rs12827969	12	113326517	A/T	0.181987	1	2b	0.80975
rs1522368	12	113329052	A/G	0.148659	0.997	3a	0.6893
rs1522369	12	113329156	A/G	0.076716	0	4	0.60906
rs2113436	12	113326379	G/A	0.469676	0.006	4	0.60906
rs35203448	12	113328816	A/G	0.113463	0.5	4	0.60906
rs3782467	12	113327309	C/G	0.000319	0.001	4	0.60906
rs4553413	12	113328827	A/G	0.0709	0.567	4	0.60906
rs57820630	12	113330823	C/T	NA	NA	4	0.60906
rs61931002	12	113326665	A/G	NA	NA	2b	0.64343
rs736560	12	113327637	A/C	0.393482	0	4	0.60906
rs7957609	12	113331519	A/G	0.316415	0.022	4	0.60906

Predictions regarding deep intronic SNPs. RegulomeDB ranks: 1a-1f: likely to affect transcription factor binding and linked to expression of a gene target, 2a-2c: likely to affect binding, 3a-3b: less likely to affect binding, 4,5,6: minimal binding evidence. Due to the unavailability of a TBX5 full-length crystal structure, full-length 3D models of TBX5 were created employing three different protein structure prediction tools which were later energy minimized and refined. Supplementary Fig. 1 shows cartoon presentations of the energy minimized and refined models. 3D Model evaluation is indispensable for checking the credibility of a generated model. PROCHECK, Verify3D, and SWISS-MODEL Structure Assessment Tool assessed the 3D models generated in the current study. Ramachandran plot is a reliable qualitative indicator of protein structure [70]. PROCHECK calculated Ramachandran plots for the created models. A model is considered reasonable by PROCHECK if >90% of the residues are present in the most favored regions of Ramachandran plot. Only one model (generated by Robetta and later energy minimized and refined) was found to fulfill this criterion. It has 94.5% of residues in the most favored regions. Supplementary Fig. 2 shows the Ramachandran plots calculated by PROCHECK. Verify3D is another tool that approves a model if its 80% or more amino acid residues score ≥0.2 in the 3D/1D profile. However, none of the models could fulfill this requirement. MolProbity score and QMEAN Z-score of our models were obtained from SWISS-MODEL Structure Assessment Tool. A low MolProbity score and a high QMEAN Z-score are preferred. A negative QMEAN Z-score indicates instability in protein structure [71]. All of our models have a negative QMEAN Z-score. This disagreement can be explained by the fact that presently available ab initio protein modeling tools predict poorly about the proteins having more than 120 residues [80], and for our protein models, 279 residues (5BQD-A chain extends from 1 to 239 residues only and the full-length protein has 518 residues) do not have any template. However, among the models, energy minimized and refined Robetta model has the lowest MolProbity score and the least negative QMEAN Z-score. Considering Ramachandran plot, MolProbity scores, and QMEAN Z-scores, it can be decided that Robetta outperformed two other protein structure prediction tools used in this study. Supplementary Table 4 summarizes the results obtained from PROCHECK, Verify3D, and SWISS-MODEL Structure Assessment Tool, and Supplementary Fig. 2 shows the Ramachandran plots obtained from PROCHECK.

Discussion

Recently bioinformatics approaches have become popular for identifying different human gene SNPs and predicting their structural and functional consequences. In the past few years, damaging SNPs of CCR6 gene [81], SMPX gene [82], human aldehyde oxidase gene [83], PTEN gene [84], folate pathway genes [85], human bone morphogenetic protein receptor type 1A gene [86], SKP2 gene [87], human adiponectin receptor 2 gene [88], MMP9 gene [89], etc. have been identified with the help of different computational biology tools. With the rapid development of genomics, the number of reported SNPs in different databases is continuously increasing for the last few years. However, determining the SNPs that result in diseases is challenging. Various pathogenicity prediction tools can narrow this number to a list of high-confidence deleterious SNPs [90]. TBX5 gene is an essential regulator of cardiac and upper limb development, and mutations in this gene are responsible for Holt-Oram syndrome, especially missense mutations. It is estimated that 1 in 100,000 neonates are born with Holt-Oram syndrome, and male and female children are affected equally. Since HOS is an autosomal dominant inheritance, it is imperative that all individuals with HOS and their parents and siblings get genetic counseling [91]. In the past years, some mutations, including SNPs have been reported in association with the TBX5 gene. However, most of the SNPs of this gene have not been characterized yet. In this study, we have tried to fill this gap. In our current study, we have analyzed all currently available TBX5 SNPs. Since results from a single bioinformatics tool may be inconclusive, we have appointed 9 highly reliable prediction tools for identifying deleterious missense SNPs. The 58 deleterious missense SNPs that were initially identified by concordance among PROVEAN, SIFT, and PolyPhen-2 were also found to be damaging by 6 other tools (PredictSNP1, PredictSNP2, MetaLR, MetaSVM, and REVEL) with few exceptions. Among these 58 missense SNPs, 5 missense SNPs (G80R, G125R, T223 M, R237P, R237Q, and R237W) have been proven to be responsible for HOS [[76], [77], [78], [79]]. The amino acid conservation analysis found that 55 among the 58 missense SNPs cause amino acid substitution in highly conserved positions (ConSurf score 7–9). This further corroborates the importance of identified SNPs. Furthermore, 8 of these 58 missense SNPs (P108T, D111Y, A143T, R237P, R237Q, R237W, S261C, and R375W) were found to have the ability to modify post-translational sites. We have also analyzed the effects of these 58 SNPs on the stability of native TBX5 protein structure. 15 SNPs were predicted to cause structural damage by Missense3D. G80R, T72K, and V153D substitutions introduce buried charges. V153D and I22 N mutations substitute buried hydrophobic residues with hydrophilic residues. G80R and G145W replace buried glycines; P85L, P85S, and P142S replace cis prolines, D105 N breaks a buried salt bridge, and E208D disrupts a side-chain/side-chain H-bond and a side-chain/main-chain H-bond. W121R, S261C, and W401C cause an expansion of the internal protein cavity, and thus make the protein structure unstable. G80R, G145W, and C323Y substitutions trigger clash alerts. These three substitutions lead to a MolProbity clash score ≥30, and the increase in clash score is >18 when compared to the wild-type residues. G80R and G125R mutations induce disallowed phi/psi alerts. 5 of these 15 substitutions (P85S, W121R, V153D, E208D, and I222N) were predicted to destabilize the native protein structure by the consensus of Missense3D, DynaMut, and iStable 2.0. This finding was further supported by MD simulation results. These 5 SNPs [rs1565941579 (P85S), rs1269970792 (W121R), rs772248871 (V153D), rs769113870 (E208D), and rs1318021626 (I222N)], which have not yet been reported, are most likely to cause HOS. We have also employed 6 other tools to identify deleterious non-coding SNPs in the TBX5 gene. 86 SNPs in the 3′ UTR of the TBX5 gene were identified that can affect miRNA-mRNA binding. No significant SNP was identified in the splice sites of TBX5, and 2 deep intronic SNPs were identified to affect transcription factor binding and gene expression. Due to the unavailability of a full-size experimental TBX5 protein structure, TBX5 3D models were also predicted employing I-TASSER, Phyre2, and Robetta. Later, the 3D structures were evaluated by PROCHECK, Verify3D, and SWISS-MODEL Structure Assessment Tool. A model is considered good if >90% of its residues are present in the most favored regions of the Ramachandran plot. In the case of our TBX5 models, only Robetta generated model passed this requirement. On the other hand, verify3D results and QMEAN Z-scores indicate poor quality of the generated models. This can be attributed to the inability of the available ab initio protein modeling tools to predict the correct structure of proteins having more than 120 residues [80]. These template-less regions are responsible for the deviations of our models from an ideal one.

Conclusion

This extensive in silico study attempts to find the most damaging TBX5 SNPs. Our obtained results can guide further wet-lab studies of TBX5 gene-related diseases and may potentiate finding cure in the future.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

All data generated or analyzed during this study are included in this paper and the supplementary file.

Funding

This work did not receive any funding.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

82 in total

1. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm.

Authors: Prateek Kumar; Steven Henikoff; Pauline C Ng
Journal: Nat Protoc Date: 2009-06-25 Impact factor: 13.491

2. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.

Authors: Chengliang Dong; Peng Wei; Xueqiu Jian; Richard Gibbs; Eric Boerwinkle; Kai Wang; Xiaoming Liu
Journal: Hum Mol Genet Date: 2014-12-30 Impact factor: 6.150

3. MolProbity: More and better reference data for improved all-atom structure validation.

Authors: Christopher J Williams; Jeffrey J Headd; Nigel W Moriarty; Michael G Prisant; Lizbeth L Videau; Lindsay N Deis; Vishal Verma; Daniel A Keedy; Bradley J Hintze; Vincent B Chen; Swati Jain; Steven M Lewis; W Bryan Arendall; Jack Snoeyink; Paul D Adams; Simon C Lovell; Jane S Richardson; David C Richardson
Journal: Protein Sci Date: 2017-11-27 Impact factor: 6.725

Review 4. TBX5: A Key Regulator of Heart Development.

Authors: J D Steimle; I P Moskowitz
Journal: Curr Top Dev Biol Date: 2016-09-28 Impact factor: 4.897

5. TBX5 intragenic duplication: a family with an atypical Holt-Oram syndrome phenotype.

Authors: Chirag Patel; Lee Silcock; Dominic McMullan; Louise Brueton; Helen Cox
Journal: Eur J Hum Genet Date: 2012-02-15 Impact factor: 4.246

6. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules.

Authors: Haim Ashkenazy; Shiran Abadi; Eric Martz; Ofer Chay; Itay Mayrose; Tal Pupko; Nir Ben-Tal
Journal: Nucleic Acids Res Date: 2016-05-10 Impact factor: 16.971

Review 7. Mutations at protein-protein interfaces: Small changes over big surfaces have large impacts on human health.

Authors: Harry C Jubb; Arun P Pandurangan; Meghan A Turner; Bernardo Ochoa-Montaño; Tom L Blundell; David B Ascher
Journal: Prog Biophys Mol Biol Date: 2016-11-29 Impact factor: 3.667

8. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions.

Authors: Jaroslav Bendl; Miloš Musil; Jan Štourač; Jaroslav Zendulka; Jiří Damborský; Jan Brezovský
Journal: PLoS Comput Biol Date: 2016-05-25 Impact factor: 4.475

9. A TBX5 3'UTR variant increases the risk of congenital heart disease in the Han Chinese population.

Authors: Feng Wang; Dong Liu; Ran-Ran Zhang; Li-Wei Yu; Jian-Yuan Zhao; Xue-Yan Yang; Song-Shan Jiang; Duan Ma; Bin Qiao; Feng Zhang; Li Jin; Yong-Hao Gui; Hong-Yan Wang
Journal: Cell Discov Date: 2017-07-25 Impact factor: 10.849

10. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2.

Authors: Vikas Pejaver; Jorge Urresti; Jose Lugo-Martinez; Kymberleigh A Pagel; Guan Ning Lin; Hyun-Jun Nam; Matthew Mort; David N Cooper; Jonathan Sebat; Lilia M Iakoucheva; Sean D Mooney; Predrag Radivojac
Journal: Nat Commun Date: 2020-11-20 Impact factor: 14.919

1 in total

1. Impact of deleterious missense PRKCI variants on structural and functional dynamics of protein.

Authors: Hania Shah; Khushbukhat Khan; Naila Khan; Yasmin Badshah; Naeem Mahmood Ashraf; Maria Shabbir
Journal: Sci Rep Date: 2022-03-08 Impact factor: 4.379

1 in total