
Databases and QSAR for cancer research.

Adeel Malik, Hemajit Singh, Munazah Andrabi, Syed Akhtar Husain, Shandar Ahmad.

Abstract

In this review, we survey bioinformatics databases and quantitative structure-activity relationship studies reported in the published literature. Databases ranging from the most general to specialized cancer-related ones are included. The most commonly used methods of structure-based analysis of molecules are reviewed, along with some case studies in which they have been used in cancer research. This article is expected to be of use to general bioinformatics researchers interested in cancer and will also provide an update to those who have been actively pursuing this field of research.

Year:  2007        PMID: 19458762      PMCID: PMC2675501     

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

Bioinformatics has played a crucial role in structure-based drug and target discovery, and in the diagnosis and analysis of various diseases and their diversity. In particular, it has enormous potential for application in cancer research, which has been only partially exploited so far. Essentially all bioinformatics starts with a database and proceeds to some kind of knowledge discovery and prediction. In this article, we review bioinformatics databases and different types of quantitative structure-activity relationship (QSAR) studies that have either been used in cancer research or have the potential for such application.

Bioinformatics databases

Biological experiments result in useful information. Until not very long ago, this information remained scattered across published literature, technical lab reports and patent files. However, there has been a tremendous effort during the last couple of decades to compile, share, standardize and model biological information (e.g. Wu et al. 2003; Bairoch and Boeckmann 1991; Benson et al. 2005; Hamosh et al. 2005; Bateman et al. 2004; Boguski et al. 1993; Bauer et al. 2005; Smigielski et al. 2000; Wu et al. 2001; Berman 2000; Hulo et al. 2006; Attwood et al. 2000; Gromiha et al. 1999; Mulder et al. 2002; Pongor et al. 1992; Kanehisa and Goto 2000; Dowell et al. 2001). There has also been relatively recent interest in improving the quality of databases, developing web interfaces and integrating databases (Achard et al. 2001; He et al. 2005; Hanisch et al. 2002; Westbrook et al. 2002; Arauzo-Bravo and Ahmad 2005). These efforts have made it possible to know the state of the art in a given area of biology and provide a basis for what is sometimes called in-silico biology, as opposed to in-vivo and in-vitro biology. Some of the most widely used databases are listed in Table 1.
Table 1. General Bioinformatics Databases.

Database Name | URL | Description

Major sequence repositories
DNA Data Bank of Japan (DDBJ) | http://www.ddbj.nig.ac.jp | All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
EMBL Nucleotide Sequence Database | http://www.ebi.ac.uk/embl.html | All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
GenBank | http://www.ncbi.nlm.nih.gov/ | All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
NCBI Reference Sequence Project | http://www.ncbi.nlm.nih.gov/RefSeq/ | Non-redundant collection of naturally-occurring biological molecules
Ensembl | http://www.ensembl.org/ | Annotated information on eukaryotic genomes
UCSC Genome Browser | http://genome.ucsc.edu/ | Genome assemblies and annotation
UniGene | http://www.ncbi.nlm.nih.gov/UniGene/ | Non-redundant, gene-oriented clusters

Protein Databases
CSDBase | http://www.chemie.uni-marburg.de/~csdbase/ | Cold shock domain-containing proteins
DExH/D Family Database | http://www.helicase.net/dexhd/dbhome.htm | DEAD-box, DEAH-box and DExH-box proteins
Endogenous GPCR List | http://www.tumor-gene.org/GPCR/gpcr.html | G protein-coupled receptors; expression in cell lines
EXProt | http://www.cmbi.Kun.nl/EXProt/ | Proteins with experimentally-verified function
GenProtEC | http://genprotec.mbl.edu | E. coli K-12 genome, gene products and homologs
Histone Database | http://research.nhgri.nih.gov/histones/ | Histone and histone fold sequences and structures
HIV Molecular Immunology Database | http://hiv-web.lanl.gov/content/immunology/index | HIV epitopes
HIV RT and Protease Sequence Database | http://hivdb.stanford.edu | HIV reverse transcriptase and protease sequences
Homeodomain Resource | http://genome.nhgri.nih.gov/homeodomain/ | Homeodomain sequences, structures and related genetic and genomic information
HUGE | http://www.kazusa.or.jp/huge/ | Large (>50 kDa) human proteins and cDNA sequences
IMGT | http://imgt.cines.fr | Immunoglobulin, T cell receptor and MHC sequences from human and other vertebrates
IMGT/HLA | http://www.ebi.ac.uk/imgt/hla/ | Polymorphic sequences of human MHC and related genes
IMGT/MHC Database | http://www.ebi.ac.uk/ipd/mhc/index.html | Major histocompatibility complex sequences
InBase | http://www.neb.com/neb/inteins.html | All known inteins (protein splicing elements): properties, sequences, bibliography
InterPro | http://www.ebi.ac.uk/interpro | Protein families and domains
LGICdb | http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php | Ligand-gated ion channel subunit sequences
Nuclear Protein Database (NPD) | http://npd.hgu.mrc.ac.uk | Proteins localized in the nucleus
NRMD | http://www.receptors.org/NR/ | Nuclear receptor superfamily
NUREBASE | http://www.ens-lyon.fr/LBMC/laudet/nurebase.html | Nuclear hormone receptors
ooTFD | http://www.ifti.org/ootfd | Transcription factors and gene expression
PANTHER | http://www.pantherdb.org/ | Gene products organized by biological function
Peptaibol | http://www.cryst.bbk.ac.uk/peptaibol/home.shtml | Peptaibol (antibiotic peptide) sequences
Phospho.ELM | http://phospho.elm.eu.org/ | Protein phosphorylation sites
PKR | http://www.kinasenet.org/pkr/Welcome.do | Protein kinase sequences, enzymology, genetics and molecular and structural properties
Prolysis | http://delphi.phys.univ-tours.fr/Prolysis/ | Proteases and natural or synthetic protease inhibitors
Protein Information Resource (PIR) | http://pir.georgetown.edu | Comprehensive, annotated, non-redundant protein sequence databases
ProtoNet | http://www.protonet.cs.huji.ac.il/ | Hierarchical clustering of protein sequences
RTKdb | http://pbil.univ-lyon1.fr/RTKdb/ | Receptor tyrosine kinase sequences
SEVENS | http://sevens.cbrc.jp | 7-transmembrane helix receptors
SWISS-PROT/TrEMBL | http://www.expasy.org/sprot | Curated protein sequences
TIGRFAMs | http://www.tigr.org/TIGRFAMs | Functional identification of proteins
trEST, trGEN, Hits | http://hits.isb-sib.ch | Hypothetical protein sequences

Structure
ASTRAL | http://astral.stanford.edu/ | Sequences of domains of known structure, selected subsets and sequence-structure correspondences
BioMagResBank | http://www.bmrb.wisc.edu/ | NMR spectroscopic data from proteins, peptides, and nucleic acids
CATH | http://www.biochem.ucl.ac.uk/bsm/cath_new | Protein domain structures
CKAAPs DB | http://ckaaps.sdsc.edu/perl/browser.pl | Structurally-similar proteins with dissimilar sequences
CSD | http://www.ccdc.cam.ac.uk/products/csd/ | Crystal structure information for organic and metal organic compounds
Database of Macromolecular Movements | http://bioinfo.mbb.yale.edu/Mol-MovDB/ | Descriptions of protein and macromolecular motions, including movies
Decoys 'R' Us | http://dd.stanford.edu/ | Computer-generated protein conformations based on sequence data
DSMM | http://projects.eml.org/mcm/database/dsmm | Database of Simulated Molecular Motions
Gene3D | http://cathwww.biochem.ucl.ac.uk:8080/Gene3D/ | Precalculated structural assignments for genes within whole genomes
GTOP | http://spock.genes.nig.ac.jp/~genome/gtop.html | Protein fold predictions from genome sequences
HIC-Up | http://alpha2.bmc.uu.se/hicup/ | Structures of small molecules ('hetero-compounds')
HSSP | http://www.sander.ebi.ac.uk/hssp/ | Structural families and alignments; structurally-conserved regions and domain architecture
LPFC | http://smi-web.stanford.edu/projects/helix/LPFC/ | Library of protein family core structures
MMDB | http://www.ncbi.nlm.nih.gov/Structure/ | All experimentally-determined three-dimensional structures, linked to NCBI Entrez
ModBase | http://modbase.compbio.ucsf.edu/modbase-cgi-new/index.cgi | Annotated comparative protein structure models
NDB | http://ndbserver.rutgers.edu/ | Nucleic acid-containing structures
NTDB | http://ntdb.chem.cuhk.edu.hk | Thermodynamic data for nucleic acids
PALI | http://pauling.mbu.iisc.ernet.in/~pali | Phylogeny and alignment of homologous protein structures
PASS2 | http://caps.ncbs.res.in/campass/pass.html | Structural motifs of protein superfamilies
PDB | http://www.pdb.org/ | Structure data determined by X-ray crystallography and NMR
PDB-REPRDB | http://mbs.cbrc.jp/pdbreprdb-cgi/reprdb_menu.pl | Representative protein chains, based on PDB entries
PDBsum | http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ | Summaries and analyses of PDB structures
ProTherm | http://gibk26.bse.kyutech.ac.jp/jouhou/therm/protherm.html | Thermodynamic data for wild-type and mutant proteins
PSSH | http://srs3d.ebi.ac.uk/ | Alignments between protein sequences and tertiary structures
RNABase | http://www.rnabase.org | RNA-containing structures from PDB and NDB
SCOP | http://scop.mrc-lmb.cam.ac.uk/scop | Familial and structural protein relationships
SCOR | http://scor.lbl.gov | RNA structural relationships
Sloop | http://www-cryst.bioc.cam.ac.uk/~sloop/ | Classification of protein loops
Structure-Superposition Database | http://ssd.rbvi.ucsf.edu | Pairwise superposition of TIM-barrel structures
SUPERFAMILY | http://supfam.org | Assignments of proteins to structural superfamilies

Retrieval Systems and Database Structure
TESS | http://www.cbil.upenn.edu/cgi-bin/tess/tess | Transcription element search system
Virgil | http://www.infobiogen.fr/services/virgil | Database interconnectivity

The cancer research community has not remained indifferent to the importance of databases. From big organizations such as the National Cancer Institute (NCI; http://www.cancer.gov) to smaller research groups, scientists have developed databases relating to the genetics, molecular biology, microarray data, clinical reports and several other aspects of cancer. Table 2 lists some of the most prominent databases that have emerged for cancer research. Some of these databases are discussed below:
Table 2. Cancer related bioinformatics databases.

Database Name | URL | Description
Atlas of Genetics and Cytogenetics in Oncology and Haematology | http://www.infobiogen.fr/services/chromcancer/ | Cancer-related genes, chromosomal abnormalities in oncology and haematology, and cancer-prone diseases
Cancer Chromosomes | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cancerchromosomes | Cytogenetic, clinical and reference information on cancer-related aberrations
CGED | http://cged.hgc.jp/cgi-bin/input.cgi | Cancer gene expression database
COSMIC | http://www.sanger.ac.uk/genetics/CGP/cosmic/ | Catalogue of somatic mutations in cancer: sequence data, samples and publications
Germline p53 Mutations | http://www.lf2.cuni.cz/win/projects/germline_mut_p53.htm | Mutations in human tumor and cell line p53 gene
IARC TP53 Database | http://www.p53.iarc.fr/index.html | Human TP53 somatic and germline mutations
MTB | http://tumor.informatics.jax.org/mtbwi/index.do | Mouse tumor biology database: tumor types, genes, classification, incidence, pathology
OncoMine | http://www.oncomine.org/ | Cancer microarray data by gene or cancer type
Oral Cancer Gene Database | http://www.tumor-gene.org/Oral/oral.html | Cellular and molecular data for genes involved in oral cancer
RB1 Gene Mutation DB | http://www.verandi.de/joomla/ | Mutations in the human retinoblastoma (RB1) gene
RTCGD | http://rtcgd.ncifcrf.gov/ | Mouse retroviral tagged cancer gene database
SNP500Cancer | http://snp500cancer.nci.nih.gov | Re-sequenced SNPs from 102 reference samples
SV40 Large T-Antigen Mutants | http://www.pitt.edu/pipaslab/ * | Mutations in SV40 large tumor antigen gene
Tumor Gene Family Databases | http://www.tumor-gene.org/tgdf.html | Cellular, molecular and biological data about genes involved in various cancers

* These sites could not be opened at the time of revising the manuscript.

Cancer Chromosomes database

(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cancerchromosomes) Cancer Chromosomes integrates data from three sources: the NCI/NCBI SKY/M-FISH & CGH Database, the NCI Mitelman Database of Chromosome Aberrations in Cancer, and the NCI Recurrent Aberrations in Cancer (Knutsen et al. 2005). This is a publicly available database and can be searched for cytogenetic, clinical, and/or reference information. Similarity reports demonstrating cytogenetic and clinical relatedness at varying levels of specificity are also returned on querying this database.

CGED (Cancer Gene Expression Database)

(http://cged.hgc.jp/cgi-bin/input.cgi) CGED is a database containing expression profiles and accompanying clinical information for genes related to breast, colorectal, and hepatocellular cancer (Kato et al. 2005). The data in CGED were obtained through collaborative efforts at the Nara Institute of Science and Technology and Osaka University School of Medicine to identify genes of clinical importance. The expression data were obtained by a high-throughput RT-PCR technique (adaptor-tagged competitive PCR). The data can be retrieved either by gene identifiers or by functional categories defined by Gene Ontology terms or the SwissProt annotation. Gene expression data are displayed in mosaic plots. The database can also display the expression patterns of multiple genes, selected by name or by a similarity search of the patterns. The sorting function makes it easy for users to recognize relationships between gene expression and clinical parameters.

The Atlas of Genetics and Cytogenetics in Oncology and Haematology

(http://www.infobiogen.fr/services/chromcancer) The Atlas of Genetics and Cytogenetics in Oncology and Haematology is a database containing information about genes related to cancer (Huret et al. 2000). The information is organized as cards on cancer-related genes, chromosomal abnormalities, cancers, and cancer-prone diseases. These cards are well-structured papers that form the body of the Atlas. Cards on genes include data on DNA/RNA, protein, mutations, and diseases. Cards on leukemias and solid tumours include data on clinics, cytogenetics, genes, hybrid genes and fusion proteins. Cards on cancer-prone diseases include data on inheritance mode, clinics, neoplastic risk, cytogenetics, genes and proteins, and mutations. The cards are linked to PubMed, the NCBI database of published literature, and to other major databases (nomenclature, cartography, gene structure, transcripts, proteins, domain families, diseases, mutations, probes). The database has another component called Deep Insights and Case Reports. Deep Insights are review articles on special topics, and the Case Reports section is dedicated to rare cytogenetic entities of leukemia, including the associated prognosis. This database, also referred to as The Atlas, is part of the genome project and participates in research in cancer epidemiology.

Database of germline p53 mutations

(http://www.lf2.cuni.cz/projects/germline_mut_p53.htm) Somatic mutations in the p53 tumor suppressor gene are found in many human cancers (Le Roux et al. 2005). In addition, germline p53 mutations have been identified in individuals from cancer-prone families and in isolated cancer patients affected at a young age or suffering from multiple tumours (Harris 1996; Hollstein et al. 1991). A large fraction of the cancer-prone families with germline p53 mutations meet the criteria of Li-Fraumeni syndrome (LFS) (Li et al. 1988). This syndrome is a rare familial autosomal dominant cancer syndrome characterised by early-onset sarcomas, brain tumours, premenopausal breast cancer, leukaemias and adrenocortical tumours. It is with this in view that a database dedicated to p53 mutations has been developed. The genotype-phenotype correlations compiled in this database may improve counseling and preventive approaches in the affected families. This is a comprehensive database of those cases of germline p53 mutations for which sufficient detail is given in the literature. In addition to listing all mutations, the database includes detailed information about the families, affected individuals and their tumours. It therefore provides a powerful means for drawing correlations between various aspects of germline p53 mutations. Each p53 mutation is described by the type of the mutation, the exon and codon affected, and the nucleotide and amino acid change. In addition, the database holds information on the family history of cancer, the diagnosis of LFS, each affected individual (sex, generation, p53 status, from which parent the mutation was inherited) and each tumour (type, age of onset, p53 status (loss of heterozygosity and immunostaining)). Each entry cites the original research article(s) as reference.

COSMIC database

(http://www.sanger.ac.uk/genetics/CGP/cosmic/) COSMIC is a database designed to store and display somatic mutation information relevant to cancer (Forbes et al. 2006). In particular, it contains information relating to human cancers. COSMIC contains information on publications, samples and mutations implicated in cancer. It also includes samples that were found to be negative for mutations during screening. This allows the calculation of frequency data, normalized by control frequencies, for mutations in different genes in different cancer types. Samples entered include benign neoplasms and other benign proliferations, in situ and invasive tumors, recurrences, metastases and cancer cell lines. A histology and tissue ontology has also been created in this database. All mutations are mapped to a single version of each gene. The data can be queried by tissue, histology or gene and displayed as a graph or a table, or exported to other formats.
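Because mutation-negative samples are recorded alongside positive ones, a per-gene, per-tissue mutation frequency can be computed with a meaningful denominator. The following is a minimal sketch of that idea in Python; the record layout, the gene and tissue labels, and the function name `mutation_frequencies` are assumptions made for illustration, and the values are invented rather than taken from COSMIC.

```python
from collections import defaultdict

# Hypothetical screening records: (gene, tissue, mutation_found).
# Invented for illustration; these are not COSMIC data.
records = [
    ("GENE_A", "skin", True),
    ("GENE_A", "skin", True),
    ("GENE_A", "skin", False),
    ("GENE_A", "colon", True),
    ("GENE_A", "colon", False),
    ("GENE_A", "colon", False),
    ("GENE_A", "colon", False),
    ("GENE_B", "skin", False),
    ("GENE_B", "skin", False),
]


def mutation_frequencies(rows):
    """Return {(gene, tissue): mutated / screened}, counting negative samples."""
    screened = defaultdict(int)
    mutated = defaultdict(int)
    for gene, tissue, found in rows:
        screened[(gene, tissue)] += 1
        if found:
            mutated[(gene, tissue)] += 1
    return {key: mutated[key] / screened[key] for key in screened}


freqs = mutation_frequencies(records)
for (gene, tissue), value in sorted(freqs.items()):
    print(f"{gene} {tissue}: {value:.2f}")
```

Without the negative samples, only counts of observed mutations would be available, and genes screened in very different numbers of samples could not be compared.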

EHCO database

(http://ehco.nchc.org.tw) EHCO (Encyclopedia of Hepatocellular Carcinoma genes Online) is an integrative database for HCC (hepatocellular carcinoma) research. It carries gene annotations collected by computer-assisted mining, manual curation, and extraction from public databases. Currently EHCO contains information for about 3500 HCC-related genes. Various entries in this database can be compared online. Detailed annotations for particular genes, including sequence, ontology, cited literature, and expression profiles are also available.

Human p53 database

(http://metalab.unc.edu/dnam/mainpage.html) A collection of databases relating to mutations in the p53 gene, lacI and lacZ is available on this website (Cariello et al. 1994; Cariello et al. 1996). There are nearly 6000 entries for p53, 200 for lacZ and 1500 for lacI. In addition, 1500 transgenic and 8000 bacterial entries are included. Each database comes with its own analysis software. All these databases include information about mutations such as base position, the nature of the mutation, amino acid position, molecular weight and the name of the mutant amino acid, the local sequence around a mutation, and the literature citation that is the source of the listed information. Information specific to the p53 database includes cancer type, cell origin and loss of heterozygosity.

IARC TP53 Database

(http://www.p53.iarc.fr/index.html) The IARC TP53 Database compiles data on human somatic and germline TP53 genetic variations that are reported in the published literature (Olivier et al. 2002; Hernandez-Boussard 1999; Hainaut et al. 1997; Hainaut et al. 1998; Hollstein et al. 1994; Hollstein et al. 1996). With over 18,500 somatic and 225 germline mutations and 1,000 citations in the world literature, this database is now recognized as a major source of information on TP53 mutation patterns in human cancer. It can be searched and analyzed online and is useful for drawing hypotheses on the nature of the molecular events involved in TP53 mutagenesis and on the natural history of cancer.

ITTACA Gene expression and clinical database

(http://bioinfo-out.curie.fr/ittaca/) ITTACA is a database of microarray experimental results and clinical information retrieved from published papers (Elfilali et al. 2006). It contains information on breast carcinoma, bladder carcinoma, and uveal melanoma. The online service also allows some basic statistical analysis of the database, such as the comparison of expression distribution profiles, tests for differential expression, and patient survival analyses.

The Mouse Tumor Biology Database (MTB)

(http://www.informatics.jax.org) The MTB database compiles and shares information about tumor frequency, genetics, and pathology in genetically defined mice (i.e., transgenics, targeted mutations, and inbred strains) (Bult et al. 2001). The database collects crucial information reported in medical journals about the incidence of different types of tumors in different strains, and about mutations in specific genes and the tumors corresponding to them. Existing standards for anatomy, tumor names, gene names, and strain names are well enforced, enabling direct links to information across MTB entries and to other relevant databases.

The Tumor Gene Family Databases (TGDBs)

(http://condor.bcm.tmc.edu/ermb/tgdb/tgdf.html) TGDB is made up of two databases, viz. the Oral Cancer Gene Database (OrCGDB) and the Breast Cancer Gene Database (BCGD). Both databases contain information on the mechanism of oncogenic activation, regulation, frequency of involvement in various tumor types, and chromosomal location for the genes involved in cancer (e.g. proto-oncogenes and tumor suppressor genes). Data about the encoded proteins include the cell type in which they are found, subcellular location, DNA, protein, and ligand binding, role in development, and normal biochemical function.

QSAR and in-silico analysis of molecular recognition

Once the molecular mechanism and the chemistry of a disease are understood, the next crucial task is to find a suitable cure for it. A typical requirement is to find a suitable drug target and the drug itself (Brooijmans and Kuntz 2003). Target discovery draws heavily on bioinformatics tools today, and in the case of cancer both DNA and protein molecules can be potential drug targets (Choudhary et al. 2005; Bandyopadhyaya et al. 2005; Bhongade et al. 2004; Asseffa et al. 2003; Gellert et al. 2005; Khaleque et al. 2006; Yao et al. 2005; McColl et al. 2005). Drug discovery is a complex, expensive and very time-consuming exercise, as there is no single systematic way to automatically discover a drug even when the disease and targets are well understood (Dixit and Mitra 2002). There may be millions of candidate molecules if in-silico filtering is not performed. Experiments cannot be performed on such a large number of drug candidates because of prohibitive costs in both time and money. Quantitative structure-activity relationship (QSAR) studies take center stage when a protein (typically an enzyme) is the target and there is a need to find a suitable molecule that can control (inhibit) its activity. The basic principle of such a study is the structure-dependence of chemical activity. QSAR predates the widespread use of computers, because chemical structure has always been able to explain at least some aspects of chemical properties. However, the availability of powerful computers and high-quality databases of molecular libraries and interactions has made QSAR an essential component of drug discovery today. The role of structure in determining the activity of a chemical compound is illustrated by the example of a protein-ligand complex in Fig 1.
Figure 1.

Identical A & B Chain Residues of 1A1E in complex with its ligand (ACE-PTR-GLU-DIY). Ligand in red.

QSAR-based (in-silico) analysis may be better regarded as an exercise to screen or filter drug candidates before they are subjected to more intensive calculations such as docking, to an experimental measurement of activity (in-vitro), and finally to tests under real conditions (in-vivo). Often this step picks about a dozen drug candidates from a library of millions of well-studied molecules. Traditional QSAR is specific to a particular target or enzyme, and all the screening is performed on drug candidates (ligand molecules). These ligand molecules are very diverse, and in order to screen them suitably we need to describe their structure as well as their chemical nature. This leads to the issue of finding descriptors of the molecular properties of ligands and drugs. Hundreds of molecular properties or descriptors are used to represent molecules (Labute 2000; Xue and Bajorath 2000; Wildman and Crippen 2002; Gozalbes et al. 2002). These properties may be purely geometric, topological, electromagnetic, classical or quantum-mechanical. Often, the screening is carried out by predicting the activity of a protein-ligand combination from the known descriptors of the ligand. Principal Component Analysis (PCA), neural networks and multivariate regression are the major techniques used for this purpose. In the following we review some of these techniques, with special reference to cases where a successful application to cancer has been reported. A large number of molecular descriptors are available and used (Todeschini and Consonni; Labute 2000; Wildman and Crippen 2002; Hansch et al. 1995; Basak et al. 1980; Gozalbes et al. 2002; Pirard and Picket 2000; Basak et al. 1981; Basak et al. 1982; Kier and Hall 1999; Raevsky 1999; Xue and Bajorath 2000).
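The idea of predicting activity from ligand descriptors and then ranking candidates can be sketched with the simplest possible model: an ordinary least-squares line relating one descriptor to a measured activity. The sketch below is an illustration only. The descriptor values, activities and ligand names are synthetic (the activities are generated exactly as 1.0 + 0.8 × descriptor so that the recovered coefficients are known), and real QSAR models use many descriptors and require validation.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic training set (not from any study): one descriptor value per
# ligand and an activity generated exactly as 1.0 + 0.8 * descriptor.
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [1.4, 1.8, 2.2, 2.6, 3.0]

intercept, slope = fit_line(descriptor, activity)


def predict(value):
    """Predicted activity of an untested ligand from its descriptor value."""
    return intercept + slope * value


# Screen hypothetical candidates and rank them by predicted activity.
candidates = {"ligand_X": 0.8, "ligand_Y": 2.2, "ligand_Z": 1.6}
ranked = sorted(candidates, key=lambda name: predict(candidates[name]), reverse=True)
print(ranked)
```

Only the top-ranked candidates would then be passed on to docking or experimental measurement, which is the filtering role described above.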
Molecular descriptors used in QSAR for a unique representation and identification of ligand molecules, which are likely to be drug candidates, may be classified as follows:

Constitutional descriptors: molecular weight, van der Waals volume, electronegativities, polarizability, number of atoms, non-H atoms, number of H bonds, multiple bonds, bond orders, aromatic ratio, number of rings, number of double and triple bonds, aromatic bonds, 3 different types of (n-membered) rings, benzene-like rings.

Topological descriptors: total structure connectivity index, Pogliani index, ramification index, polarity number, average vertex distance degree, mean square distance index (Balaban), Schultz Molecular Topological Index (MTI), square reciprocal distance sum index, quasi-Wiener index (Kirchhoff number), spanning tree number, hyper-distance-path index, reciprocal hyper-distance-path index, detour index, hyper-detour index, reciprocal hyper-detour index, distance/detour index, all-path Wiener index, Wiener-type index from Z weighted distance matrix (Barysz matrix), molecular electrotopological variation, E-state topological parameter, Kier symmetry index, eccentricity, mean distance degree deviation, unipolarity, centralization, variation.

Walk and path counts: molecular walk counts, total walk count, self-returning walk counts, molecular path counts, molecular multiple path counts, total path count, conventional bond-order ID number, Randic ID number, Balaban ID number, ratio of multiple path count over path count, difference between multiple path count and path count.

Connectivity indices: connectivity indices, average connectivity indices, valence connectivity indices, average valence connectivity indices, solvation connectivity indices, modified, reciprocal distance Randic-type index, reciprocal distance squared Randic-type index.

Information indices: information index on molecular size, total information index of atomic composition, mean information index on atomic composition, mean information content on the distance equality, mean information content on the distance magnitude, mean information content on the distance degree equality, mean information content on the distance degree magnitude, total information content on the distance equality, total information content on the distance magnitude, mean information content on the vertex degree equality, mean information content on the vertex degree magnitude, graph vertex complexity index, graph distance complexity index (log), Balaban U index, Balaban V index, Balaban X index, Balaban Y index, Basak indices of neighborhood symmetry.

2D autocorrelations: Broto-Moreau autocorrelations of a topological structure, Moran autocorrelations, Geary autocorrelations.

Edge adjacency indices: edge connectivity index of order 0, edge connectivity index of order 1, eigenvalues from edge adj. matrix weighted by edge degrees, eigenvalues from edge adj. matrix weighted by dipole moments, eigenvalues from edge adj. matrix weighted by resonance integrals, spectral moments from edge adj. matrix, spectral moments from edge adj. matrix weighted by edge degrees, spectral moments from edge adj. matrix weighted by dipole moments, spectral moments from edge adj. matrix weighted by resonance integrals.

Eigenvalue-based indices: Lovasz-Pelikan index (leading eigenvalue), leading eigenvalue from Z weighted distance matrix (Barysz matrix), leading eigenvalue from mass weighted distance matrix, leading eigenvalue from van der Waals weighted distance matrix, leading eigenvalue from electronegativity weighted distance matrix, leading eigenvalue from polarizability weighted distance matrix.

Geometrical descriptors: 3D-Wiener index, 3D-Balaban index, 3D-Harary index, average geometric distance degree, D/D index, average distance/distance degree, gravitational index G1, gravitational index G2 (bond-restricted), radius of gyration (mass weighted), span R, average span R.

Functional group counts: terminal primary C(sp3), total secondary C(sp3), total tertiary C(sp3), total quaternary C(sp3), ring secondary C(sp3), ring tertiary C(sp3), ring quaternary C(sp3), aromatic C(sp2), unsubstituted benzene C(sp2), substituted benzene C(sp2), non-aromatic conjugated C(sp2), terminal primary C(sp2), aliphatic secondary C(sp2), aliphatic tertiary C(sp2), allenes groups, terminal C(sp), non-terminal C(sp), cyanates (aliphatic), cyanates (aromatic), isocyanates (aliphatic), isocyanates (aromatic), thiocyanates (aliphatic), thiocyanates (aromatic), isothiocyanates (aliphatic), isothiocyanates (aromatic).

Charge descriptors: maximum positive charge, maximum negative charge, total positive charge, total negative charge, total absolute charge (electronic charge index – ECI), mean absolute charge (charge polarization), total squared charge, relative positive charge, relative negative charge, submolecular polarity parameter, topological electronic descriptor, topological electronic descriptor (bond restricted), partial charge weighted topological electronic descriptor, local dipole index.

Molecular properties: unsaturation index, hydrophilic factor, Ghose-Crippen molar refractivity, topological polar and non-polar surface area.

Many more descriptors may be calculated; a comprehensive review of molecular descriptors is presented by Karelson (2000). Many free and commercial software packages also provide current lists of descriptors (e.g. http://www.talete.mi.it/products/dragon_molecular_descriptors.htm and http://preadmet.bm-drc.org/preadmet/query/query1.php, from which many of the descriptors listed above were compiled).
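To make the distinction between descriptor classes concrete, the sketch below computes one constitutional descriptor (molecular weight) and one topological descriptor for two isomers of C4H10. The plain Wiener index (the sum of shortest-path bond counts over all pairs of non-hydrogen atoms) is used here as the simplest member of the distance-based family to which the Wiener-type indices listed above belong; this choice, the hand-written adjacency lists and the function names are assumptions made for illustration.

```python
from collections import deque

# Standard atomic masses (rounded) for the few elements used here.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Constitutional descriptor: sum of atomic masses for an element-count dict."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def wiener_index(adjacency):
    """Topological descriptor: sum of shortest-path lengths over all atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end


# Hydrogen-suppressed carbon skeletons of the two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
c4h10 = {"C": 4, "H": 10}

print(round(molecular_weight(c4h10), 3))
print(wiener_index(n_butane), wiener_index(isobutane))
```

Both isomers share the same molecular weight, while the topological index differs (10 for the linear chain, 9 for the branched one), which is why descriptors from several classes are usually combined.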
An excellent coverage of issues and topics related to QSAR is also provided in the textbook by Gasteiger and Engel (2003). After the descriptors of molecules have been calculated, redundant descriptors are removed using Principal Component Analysis or multivariate analysis (Jolliffe 1986; Xue and Bajorath 2000). Many commercial and some free software programs are now available that can calculate some of the descriptors and/or develop a QSAR model from them. Some of these programs are listed in Table 3. These programs may return a few key descriptors (such as the 5 descriptors in Molinspiration) or a very large number of them (e.g. DRAGON gives more than 1500 descriptors), which then need to be reduced by some analysis.
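How a principal component exposes redundancy among descriptors can be sketched in a few lines. The example below builds the correlation matrix of three descriptors for five hypothetical molecules and extracts only the leading principal component by power iteration; a full PCA would extract all components, normally with a numerical library. The descriptor values and function names are invented for illustration.

```python
import math


def standardize(columns):
    """Scale each descriptor column to zero mean and unit sample variance."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        scaled.append([(x - mean) / sd for x in col])
    return scaled


def correlation_matrix(columns):
    """Pearson correlation matrix between descriptor columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def leading_component(matrix, iterations=200):
    """Power iteration: leading eigenvalue and eigenvector of a symmetric matrix."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        new = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    # Rayleigh quotient of the converged unit vector.
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, matrix))
    return eigenvalue, vec


# Hypothetical descriptor table: five molecules, three descriptors.
mol_weight = [100.0, 150.0, 200.0, 250.0, 300.0]
atom_count = [7.0, 11.0, 14.0, 18.0, 21.0]  # nearly collinear with mol_weight
log_p = [1.2, 0.3, 2.5, 0.9, 1.8]

corr = correlation_matrix([mol_weight, atom_count, log_p])
eigenvalue, loadings = leading_component(corr)
explained = eigenvalue / len(corr)  # trace of a correlation matrix = number of descriptors

print(f"first component explains {explained:.0%} of the variance")
print("loadings:", [round(x, 2) for x in loadings])
```

The two nearly collinear descriptors receive almost identical loadings on the first component, which signals that one of them adds little independent information and can be dropped or replaced by the component score.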
Table 3. Available QSAR and molecular descriptor programs.

Name of the software | Brief description | URL/Reference
--- | --- | ---
PEST | Shape properties, wavelet decomposition properties, electrostatic potential, electronic kinetic energy density etc. | C. Matthew Sundling, N. Sukumar and Curt Breneman, Rensselaer Polytechnic Institute, http://www.chem.rpi.edu/chemweb/recondocpest.html
Pharma Algorithm’s QSAR Builder | QSAR and QSPR modeling; excess molar refraction, H-bond acidity, H-donor capability, H-bond basicity, H-acceptor capability, hexadecane/gas partition coefficient, LogP partition coefficient, TPSA (topological polar surface area) | Hugo Kubinyi, Professor of Pharmaceutical Chemistry at the University of Heidelberg, Germany, http://apalgorithms.com/qsar_builder.htm
Bioreason ClassPharmer | QSAR, QSPR | Commercial by Bioreason, http://www.bioreason.com
ChemTK | Molecule design, descriptors and modeling | http://www.sageinformatics.com/chemtk.html
Molinspiration toolkit | Java-based software and free online calculation of fragments and basic properties/descriptors | http://www.molinspiration.com
ShapeSig | Shape descriptors and statistical analysis | http://histidine.umdnj.edu/~shape/index.php
Cerius (QSAR module) | Modeling and QSAR; includes MOPAC quantum mechanical calculations, alignments etc. | http://www.accelrys.com
CODESSA | QSAR program | http://www.semichem.com/codessa/default.php
HASL | 3D QSAR | http://www.bio.com/store/product.jhtml?id=prod300024
QTRFIT | Rigid body superposition | http://www.osc.edu/PET/CCM/skeleton/software/tested/source/qtrfit/qtrfit.html
DRAGON | 1664 molecular descriptors | http://www.talete.mi.it/main_net.htm
Cancer researchers have frequently used these methods for systematic filtering of potential drug candidates or for generalizing the principles governing the choice of ligands that bind to a particular family of proteins in a selective and competitive way. Several aspects of cancer have been studied using QSAR techniques. Classical efforts at using QSAR for cancer drug research date back to the 1970s (e.g. Hansch 1979), when the focus was on discovering drugs for chemotherapy, and antitumour drugs have remained a regular subject of QSAR investigation (Ren and Lien 2004). As cases of multidrug resistance were observed, a need was felt for alternative drugs with the same action. Thus, a large number of researchers have focused on multidrug resistance with regard to chemotherapy and have employed QSAR as a means of addressing this problem. For example, Breier et al. (2000) studied multidrug resistance (MDR) in the L1210/VCR-1 and L1210/VCR-2 cell lines with regard to leukemia treatment. They related the developed adaptation and drug resistance to structural descriptors of the drugs, viz. binding energy, molecular weight, pKa, log P etc. Klopman et al. (1997) studied 609 diverse compounds to understand drug resistance in P388/ADR resistant cell lines and identified several structural characteristics relevant to MDR, such as log P and graph index. More advanced QSAR techniques such as Comparative Molecular Similarity Indices Analysis (CoMSIA) have been used to study antiviral and anticancer drugs targeting thymidine kinase (e.g. Bandyopadhyaya et al. 2005; Bhongade and Gadad 2004). The principle of CoMSIA is the alignment of drug molecules and the comparison of their similarity indices (selected descriptors). A similar approach, called Comparative Molecular Field Analysis (CoMFA), focuses on molecular field descriptors for this purpose (Cramer et al. 1988).
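Relating an activity to descriptors such as log P or molecular weight, as in the classical studies mentioned above, is at its simplest a multiple linear regression. The sketch below is an illustration only and does not reproduce the model of any cited study: the function names are hypothetical, and the toy data are generated from an arbitrary linear rule rather than taken from experiment.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (descriptors not perfectly collinear).
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_qsar(descriptors, activities):
    """Least-squares fit via the normal equations.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    rows = [[1.0] + list(d) for d in descriptors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, descriptor_row):
    """Predicted activity for one molecule from its descriptor values."""
    return coefficients[0] + sum(
        c * v for c, v in zip(coefficients[1:], descriptor_row)
    )


# Toy data (not experimental): two descriptors per molecule, with the
# "activity" generated from an arbitrary linear rule for illustration.
toy_descriptors = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.0), (0.5, 2.5), (2.5, 1.0)]
toy_activities = [2.0 + 0.5 * x1 - 0.3 * x2 for x1, x2 in toy_descriptors]

model = fit_qsar(toy_descriptors, toy_activities)
print([round(c, 3) for c in model])
```

With real data the fit would not be exact, and the model would need to be validated on compounds that were not used to derive the coefficients.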
Epidermal Growth Factor Receptors (EGFR) are one of the most popular classes of proteins studied by QSAR methods. Assefa et al. (2003) used CoMFA for such a study and concluded that electrostatic and hydrophobicity descriptors play the most important role in EGFR target binding. Similarly, electrotopological state atom (ETSA) indices have been shown to play the most important role in the antitumour effect of pyridoacridine ascididemin analogues (Debnath et al. 2003). Thus, if a drug is available for chemotherapy and more such drugs are required to provide redundancy against drug resistance, the previously known successful drug or inhibitor is compared with a large data set of diverse molecules, and those whose molecular similarity indices (CoMSIA) or molecular fields (CoMFA) resemble those of that drug are selected for potential use. The most recent QSAR-related cancer studies have focused on genomic aspects of cancer-related drug discovery (Workman 2001; Jung et al. 2003). This allows for individual prescriptions based on the genetic makeup of the patient. Thus, the availability of a large number of drugs with similar inhibitory ability but diverse genetic response opens a myriad of possibilities for cancer-related research for populations and individuals.
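The screening idea described above, comparing a known drug with a library and keeping the most similar molecules, can be sketched with plain descriptor vectors. This is a deliberate simplification and is not CoMSIA or CoMFA, both of which require 3D alignment and field calculations. The names and descriptor values below are hypothetical, and the descriptors are assumed to have been scaled to comparable ranges beforehand.

```python
import math


def euclidean_distance(a, b):
    """Distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(reference, library, top_n=2):
    """Names of the `top_n` library molecules closest to the reference.

    reference: descriptor vector of the known drug.
    library: dict mapping molecule name to its descriptor vector.
    """
    ordered = sorted(
        library.items(),
        key=lambda item: euclidean_distance(reference, item[1]),
    )
    return [name for name, _ in ordered[:top_n]]


# Made-up, pre-scaled descriptor vectors for illustration only.
known_drug = (1.0, 0.5, 0.2)
candidates = {
    "mol_A": (1.1, 0.5, 0.2),
    "mol_B": (3.0, 2.0, 1.0),
    "mol_C": (0.9, 0.6, 0.25),
}
print(rank_by_similarity(known_drug, candidates))
```

Other similarity measures could be substituted for the Euclidean distance without changing the overall workflow.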

Summary

A number of databases directly and indirectly useful for cancer research have been reviewed. QSAR techniques and their application to cancer research have been outlined.
  72 in total

1.  Classification of kinase inhibitors using BCUT descriptors.

Authors:  B Pirard; S D Pickett
Journal:  J Chem Inf Comput Sci       Date:  2000 Nov-Dec

2.  ProML--the protein markup language for specification of protein sequences, structures and families.

Authors:  Daniel Hanisch; Ralf Zimmer; Thomas Lengauer
Journal:  In Silico Biol       Date:  2002

3.  PIML: the Pathogen Information Markup Language.

Authors:  Yongqun He; Richard R Vines; Alice R Wattam; Georgiy V Abramochkin; Allan W Dickerman; J Dana Eckart; Bruno W S Sobral
Journal:  Bioinformatics       Date:  2004-08-05       Impact factor: 6.937

4.  A new rapid and effective chemistry space filter in recognizing a druglike database.

Authors:  Suxin Zheng; Xiaomin Luo; Gang Chen; Weiliang Zhu; Jianhua Shen; Kaixian Chen; Hualiang Jiang
Journal:  J Chem Inf Model       Date:  2005 Jul-Aug       Impact factor: 4.956

Review 5.  Nuclear proteins: promising targets for cancer drugs.

Authors:  Y-L Yao; W-M Yang
Journal:  Curr Cancer Drug Targets       Date:  2005-12       Impact factor: 3.428

6.  The SBASE domain library: a collection of annotated protein segments.

Authors:  S Pongor; V Skerl; M Cserzö; Z Hátsági; G Simon; V Bevilacqua
Journal:  Protein Eng       Date:  1993-06

7.  Cytotoxic activity of several unrelated drugs on L1210 mouse leukemic cell sublines with P-glycoprotein (PGP) mediated multidrug resistance (MDR) phenotype. A QSAR study.

Authors:  A Breier; Z Drobná; P Docolomansky; M Barancik
Journal:  Neoplasma       Date:  2000       Impact factor: 2.575

8.  IARC Database of p53 gene mutations in human tumors and cell lines: updated compilation, revised formats and new visualisation tools.

Authors:  P Hainaut; T Hernandez; A Robinson; P Rodriguez-Tome; T Flores; M Hollstein; C C Harris; R Montesano
Journal:  Nucleic Acids Res       Date:  1998-01-01       Impact factor: 16.971

9.  The PROSITE database.

Authors:  Nicolas Hulo; Amos Bairoch; Virginie Bulliard; Lorenzo Cerutti; Edouard De Castro; Petra S Langendijk-Genevaux; Marco Pagni; Christian J A Sigrist
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

10.  COSMIC 2005.

Authors:  S Forbes; J Clements; E Dawson; S Bamford; T Webb; A Dogan; A Flanagan; J Teague; R Wooster; P A Futreal; M R Stratton
Journal:  Br J Cancer       Date:  2006-01-30       Impact factor: 7.640
