Gh Rasool Bhat1, Itty Sethi2, Bilal Rah1, Rakesh Kumar3, Dil Afroze1.
Abstract
Bioinformatics is an amalgamation of biology, mathematics and computer science. It gathers molecular-level information from biology and applies informatics techniques to understand and organize these data in a useful manner. With the help of bioinformatics, experimental data are stored in several online databases, such as nucleotide databases, protein databases, GenBank and others. The data stored in these databases are used as a reference for experimental evaluation and validation. To date, several online tools have been developed to analyze genomic, transcriptomic, proteomic, epigenomic and metabolomic data. Some of them include Human Splicing Finder (HSF), Exonic Splicing Enhancer Mutation taster, and others. A number of SNPs are observed in non-coding, intronic regions and play a role in the regulation of genes, which may or may not directly affect protein expression. Many mutations are thought to influence the splicing mechanism by affecting existing splice sites or creating new ones. HSF was developed to predict the effect of a mutation (SNP) on the splicing mechanism/signal. The tool is therefore helpful in predicting the effect of mutations on splicing signals and can provide data for a better understanding of intronic mutations, which can be further validated experimentally. Additionally, rapid advancement in proteomics has steered researchers to organize the study of protein structure, function, relationships, and dynamics in space and time. The effective integration of all of these technological interventions will eventually drive next-generation systems biology, which will provide valuable biological insights in research, diagnostics, therapeutics and the development of personalized medicine.
Keywords: Human Splice finder (HSF); Next Generation Sequencing (NGS); Single nucleotide polymorphisms (SNPs); bioinformatics; in silico
Year: 2022 PMID: 35664302 PMCID: PMC9159363 DOI: 10.3389/fgene.2022.865182
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 Timeline of DNA sequencing, showing some of the most significant and ground-breaking developments. NG stands for next generation, and PCR for polymerase chain reaction. SMS stands for single molecule sequencing, and SeqLL for sequence the lower limit.
FIGURE 2 Illustrates the various steps involved in the bioinformatics workflow for next-generation sequencing (NGS), such as raw data quality control, alignment, post-alignment processing, variant filtration, annotation, and reporting of variants.
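The first step named in this workflow, raw data quality control, can be sketched as a simple read filter. This is a minimal illustration only, not a method taken from the paper: it assumes Phred+33 encoded quality strings (the common FASTQ convention), and the function names, the toy reads and the mean-quality cut-off of 20 are all hypothetical choices made for the example.

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def mean_quality(quality: str) -> float:
    """Mean Phred score of a read; an empty read scores 0.0."""
    scores = phred33_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(reads, min_mean_q=20.0):
    """Keep (name, sequence, quality) records whose mean quality passes the cut-off.

    The default cut-off of 20 is an assumption for illustration, not a
    recommendation from the source.
    """
    return [r for r in reads if mean_quality(r[2]) >= min_mean_q]


# Toy records, invented for illustration only.
reads = [
    ("read1", "ACGT", "IIII"),  # 'I' decodes to Q40
    ("read2", "ACGT", "!!!!"),  # '!' decodes to Q0
    ("read3", "ACGT", "5555"),  # '5' decodes to Q20
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
```

In practice this step is usually handled by dedicated quality-control software; the sketch only shows the underlying idea of decoding per-base qualities and discarding low-quality reads before alignment.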
Demonstrates a list of commonly used tools for performing an NGS functional filter, along with examples.
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | | Investigates the patterns of conservation (positive scores) or acceleration (negative scores) for various annotation classes and clades of interest using a neutral evolution model | |
| 2 | | Predicts, based on sequence homology, whether an amino acid (AA) change would affect protein function and possibly alter the phenotype. A variant with a score of less than 0.05 is considered deleterious | |
| 3 | | Uses a naive Bayes classifier to predict the functional impact of an AA substitution based on its individual properties. Two tools are included: HumDiv (intended for use in complex phenotypes) and HumVar (designed for Mendelian disease diagnosis). Higher scores (>0.85) indicate more confidently predicted damaging variants | |
| 4 | | Scores all human SNVs and indels using a combination of genomic annotations. It prioritizes functional, deleterious, and disease-causing variants across functional categories, effect sizes, and genetic architectures. Pathogenic variants should be identified using a cut-off score of 10 or above | |
| 5 | | Evaluates evolutionary conservation, splice-site alterations, protein loss, and changes that could affect mRNA levels. Variants are classed as either polymorphisms or disease-causing | |
| 6 | | Extracts structural and evolutionary information from a query nsSNP and predicts its phenotypic effect using a machine learning method (Random Forest). Variants are divided into two categories: neutral and disease | |
| 7 | | Analyses SNPs based on their geometric position and conservation information, resulting in an interactive visualisation of disease and non-disease associations linked with each SNP | |
| 8 | | Annotates variants based on a variety of criteria, including whether SNPs or CNVs affect protein function (gene-based), locating variants in specified genomic regions outside of protein-coding regions (region-based), and locating known variants in public and licensed databases (filter-based) | |
| 9 | | Determines the impact of numerous variants (SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequences, as well as regulatory domains | |
| 10 | | Annotates and classifies SNVs based on their effects on annotated genes, such as synonymous/nsSNP, start or stop codon gains or losses, genomic positions, and so on. Considered a structurally based annotation tool | |
| 11 | | Provides dbSNP rs IDs, gene names and accession numbers, variant functions, protein locations and AA changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association for SNVs and small indels | |
The bold values are the names of software/tools.
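The score cut-offs quoted in rows 2, 3 and 4 of the table above (below 0.05 deleterious, above 0.85 damaging, 10 or above pathogenic) can be applied together as a simple functional filter. The following is a hedged sketch: the function name, argument names and result keys are hypothetical and refer to the table rows rather than to any tool's actual output format.

```python
def flag_variant(homology_score=None, bayes_score=None, combined_score=None):
    """Apply the cut-offs quoted in rows 2-4 of the table to one variant.

    Arguments are hypothetical names for the three scores:
      homology_score - row 2, sequence-homology score (< 0.05 is deleterious)
      bayes_score    - row 3, naive Bayes score (> 0.85 is damaging)
      combined_score - row 4, combined annotation score (>= 10 is pathogenic)
    Scores left as None are skipped.
    """
    flags = {}
    if homology_score is not None:
        flags["row2_deleterious"] = homology_score < 0.05
    if bayes_score is not None:
        flags["row3_damaging"] = bayes_score > 0.85
    if combined_score is not None:
        flags["row4_pathogenic"] = combined_score >= 10
    return flags


# Toy scores, invented for illustration only.
print(flag_variant(homology_score=0.01, bayes_score=0.90, combined_score=25))
print(flag_variant(homology_score=0.50, bayes_score=0.20, combined_score=3))
```

Keeping the three flags separate, rather than collapsing them into a single verdict, leaves room for the predictors to disagree, which is a common reason for carrying candidate variants forward to experimental validation.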
Demonstrates various software used in third generation sequencing.
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | MinHash Alignment Process (MHAP) | Detects long read overlaps | |
| 2 | Minimap/miniasm | | |
| 3 | DALIGN | Finds overlaps and local alignments in very noisy long-read DNA sequencing data sets | |
| 4 | Graphmap | Enables single-nucleotide variant calling on the human genome with sensitivity increased by 15%; provides precise detection of structural variants from 100 bp to 4 kbp in length | |
| 5 | BLASR | Maps long reads affected by insertion and deletion errors | |
| 6 | Nanocorrect | Error correction in long reads | |
| 7 | PBJelly | Gap closing in genome assembly | |
| 8 | HGAP | De novo assembly | |
| 9 | PoreSeq | Variant calling | |
| 10 | Nanocorr | Error correction/ | |
| 11 | Nanocall | Variant calling | |
| 12 | DeepNano | Base caller | |
| 13 | Nanopolish | Enhances the base quality | |
FIGURE 3 Output window with complete list of scores. High scores are represented as color-coded bars. The height of each bar indicates the score value (motif score), and its width and placement on the x-axis represent the length of the motif (6–8 nt) and its position along the sequence.
FIGURE 4 The diverse and dynamic methods of proteome regulation give the human genome a higher level of complexity. There are roughly 20,300 genes in the human genome. The molecular basis of the cellular phenotype (that is, the tissue cell types) is determined by the specific expression of a subset of the genome (11,000 genes). The sophisticated methods of protein regulation, such as splicing variations, post-translational modifications (PTMs), protein–protein interactions (PPIs), and subcellular localization, add a considerably higher order of complexity. This results in tissue- and organelle-specific protein networks that respond differently to perturbations over time (for example, ageing or drug treatment).
Demonstrates the Protein sequence analysis tool.
| S.No | Software | Description | Ref |
|---|---|---|---|
| 1 | Expasy | A molecular server dedicated to protein and nucleic acid sequence analysis | |
| 2 | Frame plot | Protein-coding region prediction in bacterial DNA | |
| 3 | MPEx | Membrane Protein Explorer (MPEx) is a tool that uses hydropathy plots based on thermodynamic principles to explore the topology and other properties of membrane proteins | |
| 4 | Predict Protein | Predict Protein is an online service that analyses protein sequences and predicts their structure and function. After users submit protein sequences or alignments, Predict Protein returns multiple sequence alignments, PROSITE sequence motifs, low-complexity regions (SEG), nuclear localization signals, regions lacking regular structure (NORS), and secondary structure predictions | |
| 5 | ProDom | ProDom is a database of protein domain families built by grouping homologous regions. The ProDom construction technique, MKDOM2, uses recursive PSI-BLAST searches [ALTS2]. Non-fragmentary protein sequences from the SWISS-PROT and TrEMBL databases were used as the starting point | |
| 6 | Prot Scale | Prot Scale lets you compute and visualise the profile generated by any amino acid scale on a given protein. Each type of amino acid is assigned a numerical value on an amino acid scale | |
| 7 | Sequence Manipulation Suite (SMS) | The Sequence Manipulation Suite is a set of JavaScript tools for generating, formatting, and analysing short DNA and protein sequences in BioSyn’s Gizmo Tools | |
| 8 | Worldwide Protein Data Bank (wwPDB) | The wwPDB hosts a single Protein Data Bank Archive of macromolecular structural data that is freely and openly accessible to the entire world | |
FIGURE 5 Typical workflow for identifying, validating, and stratifying protein-based biomarker signatures. Proteomics based on mass spectrometry (MS) is used for in-depth quantitative characterization of a disease model’s proteome and its appropriate controls. Following the application of strict statistics, a list of candidate proteins that can be used as a phenotypic signature is defined. These markers are verified in large patient cohorts using more specific methodologies, such as MS-based approaches (for example, selected reaction monitoring (SRM)) or antibody-based approaches. To confirm that the biomarker has a direct mechanistic involvement in the disease, the biological connections between the signature proteins and the disease phenotype should be biochemically confirmed.
Demonstrates various in silico approaches used in Pharmacogenomics.
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | | A comprehensive resource that compiles information on the impact of genetic variation on drug response, such as dosing guidelines, drug labels, gene-drug connections, and the genotype-phenotype link | |
| 2 | DGIdb | DGIdb is a database and web interface for identifying drug-gene interactions, both known and unknown | |
| 3 | | Covers data on marketed drugs and any adverse drug reactions that have been reported. Public documents and package inserts were used to gather the data. Side effect frequency, drug and side effect categories, and connections to additional information, such as drug–target relationships, are all included in the available data | |
| 4 | Drug Bank Online | Drug Bank Online is a comprehensive, free-to-use online database of drug and drug target information | |
| 5 | | Uses data from the scientific literature and new research findings to describe interactions between chemicals and genes/proteins, chemicals and diseases, and diseases and genes/proteins in humans | |
| 6 | | The database contains data on the link between tumour cell genomes and anti-cancer drug sensitivity. The sensitivity patterns of human cancer cell lines to a wide range of anti-cancer treatments were compared to genomic and expression data in order to find genetic factors that are predictive of sensitivity | |
The bold values are the names of software/tools.
Showing various in silico approaches in Epigenomics.
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | | R package and executable for the statistical analysis and visualization of differentially methylated regions (DMRs) from CpG count matrices (Bismark genome-wide cytosine reports). It primarily employs the dmrseq and bsseq algorithms for upstream pre-processing, downstream analysis, and data display | |
| 2 | | A whole genome bisulfite sequencing (WGBS) workflow for DNA methylation alignment and quality control that starts with raw reads (FastQ) and ends with a CpG count matrix (Bismark genome-wide cytosine reports) | |
| 3 | | A Bioconductor (R) package for comprehensive analysis of DNA methylation data from Illumina Infinium arrays (450 K and EPIC) and BS-seq. MeDIP-seq and MBD-seq are also supported after some external processing | |
| 4 | | A Bioconductor (R) package for the analysis of methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) | |
| 5 | | A Bioconductor (R) package for Illumina Infinium arrays (450 K and EPIC) that enables complete analysis and takes cellular heterogeneity into account | |
| 6 | | A Bioconductor (R) package for the identification of DMRs in the human genome using WGBS and Illumina Infinium array (450 K and EPIC) data | |
| 7 | | Integrative analysis of DNA methylation and gene expression data | |
| 8 | | Visualization of Epigenome-Wide Association Study (EWAS) results from a genomic region | |
The bold values are the names of software/tools.
Showing various enrichment tools.
| S.No | Software name | Software Description | Ref |
|---|---|---|---|
| 1 | | The enrichment P-value is calculated for each term from the pre-selected list of interesting genes. The enriched terms are then listed in a simple linear text format. This is the most traditional algorithm, and the majority of enrichment analysis tools still rely on it | |
| 2 | | The enrichment analysis takes into account all genes (without pre-selection) and their related experimental values. The distinguishing characteristics of this strategy are: (i) unlike Classes I and II, there is no requirement to pre-select interesting genes; (ii) experimental values are integrated into the P-value computation | |
| 3 | | This approach carries on the spirit of SEA. However, the term–term/gene–gene associations are taken into account when calculating the enrichment P-value. The benefit of this technique is that term–term/gene–gene relationships may contain biological meaning that is not captured by a single term or gene. This type of network/modular analysis is more in line with the structure of biological data | |
The bold values are the names of software/tools.
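The per-term enrichment P-value described in row 1 of the table above can be illustrated with a one-sided hypergeometric test. The table does not name the statistical test, so the choice of the hypergeometric distribution is an assumption made for this sketch, as are the function name, parameter names and toy numbers.

```python
from math import comb


def hypergeom_enrichment_p(N: int, K: int, n: int, k: int) -> float:
    """One-sided enrichment P-value, P(X >= k), under a hypergeometric model.

    N - genes in the background
    K - background genes annotated with the term
    n - genes in the pre-selected interesting list
    k - interesting genes annotated with the term
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers, invented for illustration only.
p = hypergeom_enrichment_p(N=20000, K=100, n=200, k=5)
print(f"enrichment P-value: {p:.3g}")
```

A real analysis would repeat this calculation for every term and then correct for multiple testing, a step this sketch leaves out.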
Different omics levels of gene-function relationship.
| S.No | Level of Analysis | Description | Method of Analysis |
|---|---|---|---|
| 1 | Genome | Complete set of genes of an organism or its organelles | WGS, WES, DNA microarray |
| 2 | Transcriptome | Complete set of messenger RNA molecules present in a cell, tissue or organ | RNA-Sequencing, Expression microarray, Spatially resolved transcriptomics |
| 3 | Proteome | Complete set of protein molecules present in a cell, tissue or organ | Peptide/protein microarrays (RPPA), Mass spectrometry, Imaging mass cytometry |
| 4 | Metabolome | Complete set of metabolites (low-molecular-weight intermediates) in a cell, tissue or organ | Nuclear magnetic resonance spectrometry, Mass spectrometry, Infrared spectroscopy |
| 5 | Methylome | Complete set of methylation sites within a genome | Bisulfite-Sequencing, ChIP-Seq |
| 6 | Microbiome | Complete set of genes of all microbes (bacteria, fungi, protozoa and viruses) in a cell, tissue or organ | DNA-Sequencing, 16S rRNA-Sequencing |
| 7 | Lipidome | Complete set of all biomolecules defined as lipids | Mass Spectrometry |
WGS, Whole-genome Sequencing; WES, Whole-exome sequencing; ChIP, chromatin immunoprecipitation.
Demonstrates various single cell sequencing technologies.
| S.No | Tool name | Description | Ref |
|---|---|---|---|
| 1 | SCI-seq | Construction of single-cell libraries and detection of cell copy number variation | |
| 2 | LIANTI | Finding copy number variation and disease-related mutations | |
| 3 | scCOOL-seq | Uncovering of chromatin status/nucleosome localization, DNA methylation, copy number variation and ploidy | |
| 4 | Microwell-seq | Enhances the detection abundance of single cell sequencing technology | |
| 5 | SPLiT-seq | Single cell transcriptome sequencing | |
| 6 | Single-Nucleus RNA-Seq + DroNc-Seq | A variety of cells can be accurately analyzed. It may be used in the Human Cell Atlas Project in the future | |
Shows list of deep learning techniques in genomics.
| S.No | Tools | Prediction | Ref |
|---|---|---|---|
| 1 | | Target prediction | |
| 2 | | miRNA target | |
| 3 | | Case control pre-processing step for clustering. Prediction of transcriptomic machinery | |
| 4 | | Gene expression interference | |
| 5 | | Classifies gene expression | |
| 6 | | Predicts quantitative epigenetic variation | |
| 7 | | Predicts tissue-of-origin, normal or disease state and cancer type | |
| 8 | | Predicts missing methylation states and detects sequence motifs | |
| 9 | | Predicts the function of DNA directly from sequence alone | |
| 10 | | Optimizes synthetic gene sequences | |
The bold values are the names of software/tools.
Showing the various bioinformatic software tools used in circRNAs analysis.
| Tool name | TT | Installation Type | ATMR | PL | CV | Platform | Ref |
|---|---|---|---|---|---|---|---|
| CIRCexplorer | De novo; annotation | pip, Conda, Docker | STAR, BWA | | v2.3.8 | Unix/Linux | |
| CircPro | De novo; annotation | MID | BWA (CIRI2) | Perl | - | Unix/Linux | |
| MapSplice | De novo; annotation | Conda | Bowtie | | v2.2.1 | Unix/Linux | |
| circRNA_finder | De novo | MID | STAR | Perl, AWK | v1.2 | Unix/Linux | |
| CircRNAFisher | De novo | MID | Bowtie2 | Perl | v0.1 | Unix/Linux | |
| miARma | De novo | Docker, Virtual box image | BWA (CIRI) | Perl, | v1.7.5 | Unix/Linux, Windows | |
| CIRI | De novo | MID | BWA | Perl | v2.0.6 | Unix/Linux | |
| ACFS | | MID | BWA BLAT | Perl | v2.0 | Unix/Linux | |
| CircDBG | Annotation | CR | k-mer (no aligner needed) | C++ | - | Unix/Linux | |
Header Abbreviations: TT, tools type; IT, installation type; CV, current version; Ref, reference; ATMR, aligner or tools or method required; PL, programming language.