Ankush Sharma, Giovanni Colonna.
Abstract
Biomedical institutions rely on data evaluation and are turning into data factories. Big-data storage centers, supercomputing systems, and increased algorithmic efficiency allow us to analyze the ever-increasing amount of data generated every day in biomedical research centers. In network science, the principal intrinsic problem is how to integrate the data and information from different experiments on genes or proteins. Data curation, undertaken by a biobank curator, is an essential process for annotating known genes or proteins with new functional data, and it is then reflected in the calculated networks. We provide an example of how protein-protein networks today have space-time limits. The next step is the integration of data and information from different biobanks. Omics data and networks are essential parts of this step but also have flawed protocols and errors. Consider data from patients with cancer: from biopsy procedures to experimental tests, archiving methods, and computational algorithms, these data are continuously handled and so require critical and continuous "updates" to obtain reproducible, reliable, and correct results. We show, as a second example, how all this distorts studies in hepatocellular carcinoma. It is not unlikely that these flawed data have been polluting biobanks for some time before stringent conditions for the veracity of data were implemented in Big data. Therefore, all this could contribute to errors in future medical decisions.
Year: 2021 PMID: 33475988 PMCID: PMC7847983 DOI: 10.1007/s40291-020-00505-3
Source DB: PubMed Journal: Mol Diagn Ther ISSN: 1177-1062 Impact factor: 4.074
Fig. 1 The total number of first-order functional interactions (52 interactors) found for human SELK through STRING. The graph shows the maximum enrichment for SELK (53 nodes; 540 edges; average node degree: 20.4; average local clustering coefficient: 0.833; p < 1.0e−16).
Effect of phosphorylation on the structural organization of SELK
| ID | PTM | K | FCR | NCPR | Hydropathy | Isoelectric point | Plot region |
|---|---|---|---|---|---|---|---|
| Seq1 | Native polypeptide | 0.265 | 0.275 | 0.118 | 2.914 | 10.86 | 2 |
| Seq2 | +1 phosphate | 0.239 | 0.314 | 0.078 | 2.914 | 10.24 | 2 |
| Seq3 | +2 phosphates | 0.265 | 0.353 | 0.039 | 2.808 | 9.68 | 3 |
| Seq4 | +3 phosphates | 0.262 | 0.392 | 0.000 | 2.686 | 8.49 | 3 |
| Seq5 | +4 phosphates | 0.219 | 0.431 | -0.039 | 2.573 | 7.36 | 3 |
| Seq6 | +5 phosphates | 0.196 | 0.471 | -0.078 | 2.451 | 6.91 | 3 |
| Seq7 | +6 phosphates | 0.177 | 0.510 | -0.118 | 2.365 | 6.52 | 3 |
K, FCR, and NCPR values were calculated with the Pappu lab CIDER server (http://pappulab.wustl.edu/CIDER/analysis/) [109], hydropathy values according to Kyte and Doolittle [110], and the isoelectric point with the Bachem peptide calculator (https://www.bachem.com/de/service-support/peptide-calculator/). K (charge patterning parameter) describes the extent of charged amino acid mixing in a sequence; for a sequence of fixed composition, K ranges from 0 to 1. FCR is the fraction of charged residues. As the fraction of charged residues increases, the way those charges are spread across the sequence has a relatively greater impact. Hydropathy is the 0–9 scaled Kyte–Doolittle hydropathy score for the sequence (9 most hydrophobic, 0 least hydrophobic) [110]. Plot region is the region of the Das–Pappu phase plot in which the sequence falls. Region 2 contains sequences that may be collapsed or expanded and whose behavior may depend on other factors (salt concentration, ligand binding, interactions, etc.). Region 3 contains strong polyampholytes (coils, hairpins, and chimeras), where the types of structures that form may depend on the K value.
ID sequence identification, NCPR net charge per residue, PTM post-translational modification
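The region assignments in the table can be checked with a minimal sketch of the Das–Pappu classification. The boundaries used here (FCR of 0.25 and 0.35, |NCPR| of 0.35) are assumed from the CIDER documentation rather than taken from this text, the residue sets and the toy sequence are hypothetical, and regions 4 and 5 are not distinguished.

```python
# Residues counted as charged (assumption: K/R positive, D/E negative, His ignored).
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_fractions(seq):
    """Return (FCR, NCPR) for an unmodified amino acid sequence."""
    n = len(seq)
    pos = sum(aa in POSITIVE for aa in seq)
    neg = sum(aa in NEGATIVE for aa in seq)
    return (pos + neg) / n, (pos - neg) / n


def phase_plot_region(fcr, ncpr):
    """Classify a sequence on the Das-Pappu phase plot.

    Thresholds are assumed, not quoted from the source text.
    """
    if abs(ncpr) > 0.35:
        return "4/5"  # strong polyelectrolytes (sign not resolved here)
    if fcr < 0.25:
        return "1"    # weak polyampholytes / polyelectrolytes
    if fcr <= 0.35:
        return "2"    # collapsed or expanded, context dependent
    return "3"        # strong polyampholytes


# (ID, FCR, NCPR, region reported in the table above)
TABLE_ROWS = [
    ("Seq1", 0.275, 0.118, "2"),
    ("Seq2", 0.314, 0.078, "2"),
    ("Seq3", 0.353, 0.039, "3"),
    ("Seq4", 0.392, 0.000, "3"),
    ("Seq5", 0.431, -0.039, "3"),
    ("Seq6", 0.471, -0.078, "3"),
    ("Seq7", 0.510, -0.118, "3"),
]

for seq_id, fcr, ncpr, reported in TABLE_ROWS:
    print(seq_id, "computed:", phase_plot_region(fcr, ncpr), "reported:", reported)

# Hypothetical toy sequence, only to show how FCR and NCPR are obtained.
print(charge_fractions("KKDEAAAAAA"))
```

Under these assumed thresholds, the shift from region 2 to region 3 between Seq2 and Seq3 follows from FCR crossing 0.35, while NCPR stays well inside the polyampholyte range in every row.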
Fig. 2 The graphs obtained for the SELK protein using BioGRID and STRING. a Shows the BioGRID result for SELK only. BioGRID, unlike STRING, does not allow node enrichments but reports all the nodes that physically interact with SELK. b Shows the STRING result for SELK enriched with several interactors, similar to that from BioGRID. c Shows the STRING result for multiple searches with the set of proteins (including SELK) as reported by BioGRID in (a). d Shows the network parameters and functions reported by STRING for (b) and (c)
Hub genes in hepatocellular carcinoma extracted from the literature
| Set of 208 non-redundant genes extracted from a total of 324 hub genes found in the literature. The acronyms of the genes are listed alphabetically | Redundancy of the 61 hub genes. In parentheses the number of times that each gene has been found for a total of 177 times |
|---|---|
Fig. 3 a The whole set of extracted proteins; b the 23 proteins with greater connectivity; c the degree distribution of the 209 nodes; d the comparison between the degree distribution of the 209 nodes and that in the human proteome
Fig. 4 The different PPI networks obtained from the hub genes of hepatocellular carcinoma. STRING has translated the names of the genes into those of the respective proteins. a The network obtained for the 208 hub genes found in the literature with a confidence score of 0.4. The proteins that have not shown relations with the rest of the network are visible in the upper-right part of the panel. b The same network but with a confidence score of 0.9 to obtain relationships with greater significance. The number of proteins without interactions has increased (see also Table 3). c and d The PPI networks obtained using only the experimentally validated interactions as data sources. The networks have a confidence score of c 0.4 and d 0.9. Notice how using only experimentally validated interactions reduces the significant relationships between proteins, leading to the collapse of the network (d). High-resolution figures of the networks are shown in supplementary material figures 1S to 4S
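The effect described in the caption, where a stricter confidence score or a restriction to experimental evidence leaves more proteins without partners, can be illustrated with a minimal sketch. The node names, scores, and channel labels below are hypothetical toy data, and giving each edge a single channel is a simplification; this is not the STRING scoring scheme or its API.

```python
# Hypothetical toy edges: (protein_a, protein_b, confidence score, evidence channel).
TOY_EDGES = [
    ("A", "B", 0.95, "experimental"),
    ("B", "C", 0.60, "textmining"),
    ("C", "D", 0.45, "experimental"),
    ("D", "E", 0.92, "database"),
]
TOY_NODES = {"A", "B", "C", "D", "E"}


def filter_network(edges, min_score, channels=None):
    """Keep edges at or above min_score; optionally restrict to given channels."""
    return [
        (a, b)
        for a, b, score, channel in edges
        if score >= min_score and (channels is None or channel in channels)
    ]


def isolated_nodes(nodes, kept_edges):
    """Return the nodes that take part in no retained edge."""
    connected = {node for edge in kept_edges for node in edge}
    return set(nodes) - connected


for label, min_score, channels in [
    ("all channels, 0.4", 0.4, None),
    ("all channels, 0.9", 0.9, None),
    ("experimental only, 0.4", 0.4, {"experimental"}),
    ("experimental only, 0.9", 0.9, {"experimental"}),
]:
    kept = filter_network(TOY_EDGES, min_score, channels)
    print(label, "edges:", len(kept), "isolated:", sorted(isolated_nodes(TOY_NODES, kept)))
```

In this toy graph the four settings mirror panels a to d: each tightening removes edges and enlarges the set of disconnected nodes, until only a small fragment of the network survives.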
Network parameters in STRING
| Network parameters | Source: 2 channels, confidence score = 0.4 | Source: 2 channels, confidence score = 0.9 | Source: 1 channel (Exp), confidence score = 0.4 | Source: 1 channel (Exp), confidence score = 0.9 |
|---|---|---|---|---|
| Interactions | 3502 | 1385 | 359 | 52 |
| Number of connected nodes | 204 | 173 | 91 | Collapsed network with three clusters of 28 nodes |
| Average node degree | 34.3 | 13.6 | 3.52 | 0.51 |
| Average local clustering coefficient | 0.652 | 0.542 | 0.283 | 0.163 |
The average node degree is the average number of interactions (at the score threshold) that a protein has in the network. The clustering coefficient is a measure of how connected the nodes in the network are. "2 channels" denotes data retrieved from the experimentally proven data channel as well as from the databases and text mining channel.
Exp stands for experimental channel only
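The two network parameters defined above can be sketched in a few lines. The average node degree is taken as 2E/N for an undirected network with E interactions and N nodes. The tabulated degrees are reproduced if N is held at 204 in every column; that node count is an inference from the numbers, not something stated in the text. The toy graph used for the clustering coefficient is hypothetical.

```python
from itertools import combinations


def average_node_degree(n_edges, n_nodes):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


def average_local_clustering(adjacency):
    """Mean over all nodes of the fraction of neighbour pairs that are linked.

    adjacency maps each node to the set of its neighbours (undirected graph).
    Nodes with fewer than two neighbours contribute a coefficient of 0.
    """
    total = 0.0
    for neighbours in adjacency.values():
        k = len(neighbours)
        if k < 2:
            continue
        linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
        total += 2 * linked / (k * (k - 1))
    return total / len(adjacency)


# Interaction counts from the table; N = 204 is an assumed, inferred node count.
ASSUMED_NODES = 204
for interactions in (3502, 1385, 359, 52):
    print(interactions, round(average_node_degree(interactions, ASSUMED_NODES), 2))

# Hypothetical toy graph: a triangle (a, b, c) with one pendant node d on c.
TOY_GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_local_clustering(TOY_GRAPH))
```

Because the denominator stays fixed in this reading, the average degree falls in direct proportion to the number of retained interactions, which is why the experimental-only network at a score of 0.9 drops below one interaction per protein.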
Fig. 5 Schematic representation of Ontology and Knowledge Modeling. The part enclosed in the red box shows the processes where human error may be more likely. The meaning of the terms is explained in the Electronic Supplementary Material (Box 1)
| Large-scale multimodal data on patient health is an important factor in the development of personalized medicine. |
| Biomedical Big data is inherently versatile. However, its generation, storage, and analysis can introduce inaccuracies. These occur when the different structure–function relationships of intrinsically disordered proteins and the post-translational modifications of proteins are used without a spatiotemporal interpretation, which leads to inaccurate interpretations of hub genes’ protein–protein interaction networks. |
| We encourage the biomedical community to search for a solution that can rectify data retrospectively, as pitfalls and flaws must be avoided when generating clinical decisions in personalized medicine. |