Shilpa Verma, Rajesh Bhatia, Sandeep Harit, Sanjay Batish.
Abstract
The need for scholarly knowledge mining and management has grown significantly as the academic literature and its links to authors expand rapidly. Information extraction, ontology matching, and access to academic components and their relations have become more critical than ever. As scientific literature grows, scholarly knowledge graphs have therefore become critical to applications in which semantics give meaning to concepts. The objective of this study is to report a literature review on knowledge graph construction, refinement, and utilization in the scholarly domain. Based on the scholarly literature, the study presents a complete assessment of current state-of-the-art techniques. We present an analytical methodology to investigate the existing status of scholarly knowledge graphs (SKG) by structuring scholarly communication. This review investigates the application of machine learning, rule-based learning, and natural language processing tools and approaches to construct SKG. It further reviews knowledge graph utilization and refinement to provide a view of current research efforts. In addition, we discuss existing applications and challenges across construction, refinement, and utilization. This research will help to identify frontier trends of SKG and motivate future researchers to carry the work forward.
Keywords: Knowledge graph construction; Knowledge graph embedding; Scholarly communication; Utilization
Year: 2022 PMID: 35965491 PMCID: PMC9361271 DOI: 10.1007/s40747-022-00806-6
Source DB: PubMed Journal: Complex Intell Systems ISSN: 2199-4536
Fig. 1 Classification of scholarly knowledge graphs
Research questions and motivation
| Research questions (RQ) | Motivation |
|---|---|
| What types of entities and relationships are extracted during the information extraction task? | The specific sets of entities and relations extracted from the literature, along with their domains, need to be reviewed in order to identify the current status in various domains |
| What approaches have been used for scholarly knowledge graph construction? | A vital step in constructing knowledge graphs in the scholarly domain is knowledge extraction, so the extraction tools that support it need to be explored. The type of knowledge discovery is also an important aspect to cover. Storing and visualizing knowledge graphs to provide various application services is a promising field |
| What ontologies and OpenIE tools are applied? | It is important to provide an overview of the ontologies designed or reused, along with the off-the-shelf tools applied to scholarly knowledge graphs, to show the importance of semantic representation of scholarly communication |
| Which studies have deployed and leveraged knowledge graphs as an application service? | Knowledge graph utilization studies, along with their links, key features, objectives, domains, and mappings, are important to discuss. This concerns storing, accessing, and updating the required knowledge in suitable output formats |
| What application scenarios have been covered in KGR, along with the embedding approaches used for the data completion task? | It is important to analyze knowledge graph embedding approaches by embedding type, triple type, dataset, and evaluation, along with application scenarios in the context of recommendation and data exploration in the scholarly domain |
Fig. 2 Pictorial view of example entities/relationships and triples in a scholarly knowledge graph (SKG)
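To make the notion of entities, relationships, and triples concrete, the following is a minimal Python sketch of a scholarly knowledge graph stored as (subject, predicate, object) triples. The entity identifiers, the predicate names (`writes`, `cites`, `affiliated_with`, `published_in`), and the helpers `build_index`, `objects_of`, and `co_authors` are hypothetical illustrations; they are not taken from the figure or from any particular SKG.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical example data; none of these identifiers come from a real graph.
TRIPLES = [
    Triple("author:A1", "writes", "paper:P1"),
    Triple("author:A2", "writes", "paper:P1"),
    Triple("author:A1", "affiliated_with", "org:O1"),
    Triple("paper:P1", "cites", "paper:P2"),
    Triple("paper:P1", "published_in", "venue:V1"),
]


def build_index(triples):
    """Index triples by (subject, predicate) so objects can be looked up directly."""
    index = defaultdict(set)
    for t in triples:
        index[(t.subject, t.predicate)].add(t.obj)
    return index


def objects_of(index, subject, predicate):
    """Return the sorted objects linked to `subject` through `predicate`."""
    return sorted(index.get((subject, predicate), set()))


def co_authors(triples, author):
    """Derive co-authors: other subjects that write a paper `author` also writes."""
    papers = {t.obj for t in triples
              if t.predicate == "writes" and t.subject == author}
    return sorted({t.subject for t in triples
                   if t.predicate == "writes"
                   and t.obj in papers
                   and t.subject != author})


if __name__ == "__main__":
    index = build_index(TRIPLES)
    print(objects_of(index, "paper:P1", "cites"))
    print(co_authors(TRIPLES, "author:A1"))
```

The `co_authors` helper shows how a relation that is not stored explicitly (co-authorship) can be derived from stored `writes` triples, which is one simple form of the knowledge discovery the review discusses.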
Fig. 3 Conceptual view of the process of data mining in scholarly knowledge graphs
Graphs supporting scholarly infrastructures
| Infrastructure | URL | Data representation format | Data size/no. of triples | Data export | Ontology used | Linked data resources | Data access | Research entities |
|---|---|---|---|---|---|---|---|---|
| Microsoft Academic Knowledge Graph [ | | RDF, N-Triple | Multidisciplinary, 210 million publications, 8 billion triples | SPARQL | Yes | MAG, DBpedia, Wikidata, OpenCitations, and the Global Research Identifier Database (GRID) | Open | Author, Paper, Citation, Field of study, Journal, Affiliation, Conference instance, Conference series |
| SciGraph [ | | JSON-LD, N-Triple, Turtle, RDF | 2 billion triples | SPARQL | Yes | Springer Nature, Dimensions.ai, GRID | Open | Authors, Funders, grants, research projects, conferences, affiliations and publications |
| ScholarlyData [ | | HTML, RDF-XML, N-Triples and JSON-LD | Computer Science conferences and workshops, 1,128,618 triples | SPARQL | Yes | Events, ORCID, DOI | Open | Academic event, Affiliation, Organization, person |
| OpenAIRE [ | | RDF-XML, HTTP responses, RDF data, JSON | 480Mi | SPARQL endpoint | – | Repository, Funders, Archives, databases, Publishers | Open | Literature, datasets, software, funders, grants, organizations, researchers, data sources |
| Open Research Knowledge Graph [ | | JSON, RDF serializations | – | REST API, SPARQL | Yes | Literature, Research repository and terminology | Open | Literature and its content |
| ResearchGraph [ | | XML, RDF-XML Triplestore, JSON-LD | 250 million nodes | Cloud hosted services, REST API, GraphQL | Yes | PID, Literature, Repository, Publishers, Funders, aggregators, discovery | Controlled | Academic articles, datasets, funders, grants, organizations, researchers |
| OpenCitations [ | | RDF Triplestore | 55M publications and 655M bibliographic citations | SPARQL | Yes | Bibliographic and citation metadata | Open | Researchers, Funders, Data repositories, Publishers |
| OpenResearch [ | | CSV, RDF | Computer science Conferences, 9077 Events and 1061 Event series | ExportRDF, SPARQL | Yes | Repository, Funders, Archives, databases, Publishers, ORCID | Open | Events and its contents (EventTitle, country, topic) |
| PID [ | | RDF | 30 million nodes | GraphQL | – | PID providers | Open | Publications, datasets, Software, Funders, Research Organization, Researcher |
| Open Academic Graph [ | | JSON | 0.7 billion entities and 2 billion relationships | – | – | MAG and AMiner | Open | Venue, paper, Author, Affiliation |
Information extraction from scientific documents
| References | Level | Input/field | Fact | Domain | Approach | Tasks | Source integration | Metrics |
|---|---|---|---|---|---|---|---|---|
| [ | Entity | Abstracts | Triple | DI | Supervised | NER, RE, CR | – | P, R, F |
| [ | Relation | Full-text | Triple | DS | Unsupervised | CLS | – | P, R, F, Accuracy |
| [ | Entity | Abstract and full-text | Triple | DS | Conditional Random field | NER, CLS | SciBERT, MAKG | P, R, F |
| [ | Entity and relation | Full-text | – | DI | Bi-LSTM | NER, CR, RE | SciBERT | P, R, F |
| [ | Entity | Full-text | Triple | DS | Conditional Random field | SL | – | P, R, F |
| [ | Entity | Full-text | Triple | DS | – | NER | DBpedia, Wikidata and BioPortal | – |
| [ | Entity | Sentences | – | DS | Supervised | RE | – | P, R, F |
| [ | Relation | Full-text | – | DI | TF-IDF, Graph embedding | CR, ECLS, RE | DBpedia, ORKG | P, R, F |
| [ | Entity | Full-text | – | DI | Semi-supervised | NER | SciSpacy, UMLS | P, R, F |
| [ | Entity | Full-text | – | DI | Unsupervised pre-training | NER, CLS, RCLS, Parsing | ScispaCy, BERT | F |
| [ | Entity | Abstract | – | DI | Supervised | SL, CLS | – | P, R, F |
| [ | Entity and relation | Sentences | Triple | CD | – | TE, CR, EL, RL | RoBERTa | P, R, F |
| [ | Concept | Keyphrases | – | DS | Unsupervised | CLS | BabelNet | Accuracy |
| [ | Concept | Sentence | Triples | DI | Unsupervised | RE | GROBID | P |
| [ | Concept | Sentence | Tuple | DI | Semi-supervised | SL | – | P, R, F |
| [ | Concept | Sentence | Quad | DS | – | – | – | – |
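Most studies in the table above report P, R, and F. Assuming these stand for precision, recall, and F1-score, as is conventional in information extraction, the following Python sketch shows how they could be computed by exact matching of extracted triples against a gold standard. The triples and the function name `precision_recall_f1` are hypothetical, and individual studies may use different matching rules (for example partial span matching).

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 by exact match between two collections.

    predicted: items produced by an extractor (here, triples)
    gold:      items annotated as correct
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and extracted triples, for illustration only.
GOLD = {
    ("paper:P1", "uses", "method:M1"),
    ("paper:P1", "evaluates_on", "dataset:D1"),
    ("method:M1", "part_of", "task:T1"),
    ("paper:P1", "cites", "paper:P2"),
}
PREDICTED = {
    ("paper:P1", "uses", "method:M1"),
    ("paper:P1", "evaluates_on", "dataset:D1"),
    ("paper:P1", "uses", "dataset:D1"),  # a wrong extraction
}

if __name__ == "__main__":
    p, r, f = precision_recall_f1(PREDICTED, GOLD)
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy case two of the three extracted triples are correct and two of the four gold triples are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.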
Fig. 4 Algorithmic view of (a) neural network-enabled KG creation, (b) natural language processing-enabled KG creation, (c) rule learning-based knowledge graph creation
Scholarly knowledge graph construction
| References | Level | Input/field | Fact | Entity | Relation | Domain | Approach | Tasks | Source/tool integration | Metrics | Application |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | ER | Text | Span | Task, Method, Metric, Material, Other-ScientificTerm and Generic | Compare, Part-of, Conjunction, Feature-of, Used-for, HyponymOf | DI | NN Classifier | Classification | SciBERT, JSON | P, R, F | – |
| [ | ER | Text | Triple | Task, Method, Metric, Material, Other-ScientificTerm and Generic | Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, HyponymOf | DI | GAT | Entity Alignment and Deduplication | DyGIE++ | P, R, F | Summary generation |
| [ | ER | Text | – | Paper, Word, Author, Laboratory, location, Institution | Cites, is_similar, includes, connects, writes, co_authors, affiliates_with | DS | Word Embedding | Link prediction | CSV, Neo4j | Accuracy, R, P | Discovering future research collaborations |
| [ | ER | Text | Triple | Drugs, genes, proteins, pathways and enzymes | HasTarget, hasEnzyme, hasTransporter, isPresentIn, isImplicatedIn | DS | CNN and LSTM network | Classification | Bio2RDF, SPARQL | P, R, F | Drug–drug interaction prediction |
| [ | ER | Text | Triple | CHEMICAL, PROTEIN, DISEASE | CHEMICAL-PROTEIN, CHEMICAL-INDUCED-DISEASE | DS | Graph Convolution Network Auto Encoder | Link Prediction | SciBERT | P | Association of biomedical entities |
| [ | Concept | Text | Triple | Software mentions | Replaced_by | DS | Bi-LSTM, transfer learning | Entity disambiguation | JSON-LD, SPARQL | Manual, F | Software usage in social science |
| [ | Concept | Text | – | Author, Material | – | DS | Naive Bayes Classifier, CTANE | Classification, Deduplication | | P, R | Scientific research trend analysis |
| [ | Concepts | Text | – | – | – | DS | Conditional random Field, TF-IDF | Content segmentation and extraction | – | P, R, F | Chinese word extraction from Geoscience literature |
| [ | ER | Text | Triple | Disease, Patient | Treat, Not treat | DS | SemRep | Classification | – | P, F | Drug Repurposing |
| [ | ER | Text | Triple | – | Agent, Patient | DI | Semantic Role Labeling | Ontology Linking | Stanford’s CoreNLP, RDF turtle | Manual | Semi-automatic method to generate KG |
| [ | Concepts | Text | – | Title, Abstract and Citation | Cited, Aim, Method, Result | DI | Sequence labeling, BERT embeddings | Concept Extraction, Graph Construction | DBSCAN | P, R, F | Research trend analysis |
| [ | E | Text and figures | Triple | Gene nodes, Disease nodes, Chemical nodes, and Organism | Gene-Chemical-Interaction Relationships, Chemical-Disease Associations, Gene-Disease Associations, Chemical-GO Enrichment Associations and Chemical-Pathway Enrichment Associations | DI | Sequence embedding | NER, Event extraction | OCR, BioBert | Manual, F | Multimedia extraction, Question answering, report generation |
| [ | Concept | Text | Keyphrases | Papers, Authors, entities, entity mentions | Citations, Authorship, mention-mention, Entity-entity relations | DI | Sequence labeling | Entity extraction, Linking | Tagme, MetaMap Lite, ScienceParse | P, R | Data discovery and ranking |
| [ | Concepts | Text | – | Background, objective, solution, and finding | – | DI | BERT | Reasoning | BERT, CSV, SPARQL | P, R | Abstract Knowledge representation and ontology element identification |
| [ | ER | Text | Triple | Paper, Author, Affiliations | – | CD | Rule mapping | Instance matching | MAG, DBLP, SWRC ontology, Dublin Core and FOAF, Scrapy, CSV, SPARQL | R, Accuracy | To create KG pipeline |
Scholarly knowledge graph fusion-based construction
| References | Level | Input/field | Fact | Entity | Relation | Domain | Tasks | Source integration | Metrics | Application |
|---|---|---|---|---|---|---|---|---|---|---|
| [ | Keyphrases | Text | Triple | – | – | DS | Annotation, RDFization | Bibo, foaf, prov, Wikidata, sio, RDF, Neo4j | | Scientific literature semantic data management in agriculture domain |
| [ | Concept | Text | Triple | Publication, OntologyClass, Drug, ChemicalSubstance, BiologicalProcess, Disease, Protein, Gene, PhenotypicFeature, MolecularActivity | – | DS | Classification, link prediction | Biolink, HPO and the Mondo disease ontology, RO, RDF, Neo4j | – | Prediction and querying |
| [ | ER | Text | Triple | Publications, people, Campaigns, Environmental variables, Species, locations | Contributor, has_subject, reported_by, participant, has_measurement, collect, recorded_by, has_place | DS | Cross linking | NMDS, GBIF, OBIS, foaf, RDF, SPARQL | – | Data augmentation and meta analysis for ocean science |
| [ | E | Text | Triple | Article, Title, DOI, Introduction, Author Name, Treatment, Nomenclature, Materials, section, Taxonomy concepts | – | DS | Disambiguation | SPAR, foaf, RDF4R and ROpenBio, RDF, SPARQL | Competency questions | FAIR-compliant biodiversity literature-based knowledge management system |
| [ | ER | Text | Triple | Publications, patents, topics and industrial sectors | HasTopic, hasAffiliationType, hasAssigneeType, hasIndustrialSector | CD | Topic detection, Classification | DBpedia, MAG, CSO, GRID, SKO, PROV-O, INDUSO | Manual | Cross-domain knowledge graph |
| [ | ER | Text | Triple | Publications, researchers, publication venues, scientific institutions | Co-authorship, citation, and collaboration | DS | Network detection | METIS, SemEP, KORONA ontology | | To generate communities of researchers for collaboration recommendation based on co-author networks |
| [ | Concepts | Text | Triple | Concepts and documents | Mentions | DI | Triple filtering, linking concepts | RnnOIE, RDF, Cypher | P, R, F | Open information extraction and literature KG from clinical trials methodological articles |
| [ | ER | Text | Triple | Task, Method, Metric, Material, Other-ScientificTerm and Generic | Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, HyponymOf | DI | Triple refining and Entity merging | OpenIE, CSO classifier | P, R, F, Manual | Generic knowledge graph construction |
| [ | ER | Text | Triples | Research topics, tasks, methods, metrics, materials | Verbs (uses, includes, is, evaluates, provides, supports, improves, requires, and predicts) | DS | ER extraction | DyGIE++, Stanford CoreNLP, CSO Classifier | P, R, F | Domain-specific KG generation |
| [ | ER | Text | Triple | Task, Method, Metric, Material, Other-ScientificTerm and Generic | Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, HyponymOf | DS | Entity relationship merging | OpenIE, CSO classifier | Manual | KG construction using openIE |
| [ | E | Text | Triple | Paper, Authors, Institution, Concepts, Topics | Authored_by, affiliated_with, associated_concept, associated_topic, cites | DS | Concept and author normalization, Citation linking | Comprehend Medical, Apache TinkerPop Gremlin and SPARQL | Manual | Question answering and paper recommendations |
Scholarly knowledge graphs utilization
| Model | Objective | Key features | Link for visualization | Data model/domain | Method used | Technical details |
|---|---|---|---|---|---|---|
| GraphWriter | KG Utilization | Graph to text generation | – | AGENDA | GAT capturing global context | – |
| Graformer [ | KG Utilization | Graph to text generation | – | AGENDA and WebNLG | Self-attention Graph method | – |
| SciKGraph [ | Visualization | Tracks the evolution of a scientific field at a concept level | | SciKGraph framework | Clustering | Python 3.7, Flask 1.1.1, HTML 5, CSS 3, Bootstrap 3.3.7, and JavaScript 6 |
| ResearchFlow [ | KG Utilization | To quantify the research topic trends across academia and industry | | AIDA KG | Diachronic analysis | – |
| AIDA dashboard [ | Visualization, Web application | Analytics about research dynamics | | AIDA KG | Classification and tagging | Python, HTML5 and JavaScript |
| Aurora [ | Querying | Generates overviews of research domains | | OpenResearch | Crowdsourcing platform | SPARQL endpoint |
| TDMS-IE [ | Tabular visualization | Automatic construction of NLP leaderboards and summarization of scientific results | | NLP-TDMS | Classification | – |
| CL-scholar [ | Querying | Searches and explores current research progress in the computational linguistics community | | ACL Anthology | OCR++ for extracting metadata | ReactJS, supports REST API, NodeJS server, MongoDB |
| Whyis [ | KG Exploration | Semantic meta analysis capabilities | | DrugBank, Uniprot | Stouffer’s Z-Method | Extensible Stylesheets Language Template (XSLT) to generate RDF |
| Covid-KG [ | Visualization | Dense tag clouds and heatmaps | | CORD-19 | Data indexing | Elasticsearch and Kibana dashboard |
| SemSpect [ | Visualization | Aggregated Tree overview | | SciGraph | Classification | Client-server application with HTML5/JavaScript UI, Java REST backend, Neo4j for storage |
| Covid Linked Data Visualizer [ | Visualization and querying | Enriching, reusing and adapting pipeline | | CORD-19 | Argumentative Clinical Trial Analysis tool | Python and R Jupyter notebooks, JSON format, SPARQL endpoint |
| BiKMi [ | Web Application | Cause-and-effect network | | CORD-19 | Biological Expression Language derived network | Python Django and OrientDB |
| KGTK [ | KG Utilization and exploration | Represents graphs in tables for data science applications | | CORD-19 | ConceptNet, BERT | Scikit-learn, SpaCy, TSV for edges, RDF, Neo4j, Gephi, SPARQL |
Summary of knowledge graph embeddings in scholarly domain
| Embedding type (ET) | References | Applied ET | Triple | Dataset | Task | Best performing ET | Evaluation metrics | Application |
|---|---|---|---|---|---|---|---|---|
| Translational | [ | TransE, RotatE, ComplEx, Trans4E | | AIDA | Link Prediction | Trans4E | MRR, Hits | – |
| Translational | [ | TransE, TransH, TransR, TransD | | DBLP | Prediction | TransD | P, R, MRR | Paper Recommendation |
| Translational | [ | TransE, ComplEx, ConvE, RotatE, Trans-RS, TransE-SM and RotatE-SM | | DBLP, semanticscholar, springernature, grid | Link Prediction | RotatE-SM | MRR, Hits | Author Recommendation |
| Translational | [ | TransE, RESCAL, TransH, TransR, TransD, TransP | | maui-semeval 2010 | Entity Typing | TransP | P, R, F | Scholar Profile construction |
| Translational | [ | TransE, DBOW | – | DBLP, semanticscholar | Classification | TransE and DBOW combined | P, R, F | Paper Recommendation |
| Translational | [ | TransE, RotatE, DistMult, ComplEx | | PubMed and CORD-19 | Link Prediction | TransE | MRR, Hits | Drug repurposing and to generate mechanistic explanations |
| Translational | [ | DKRL, DistMult, TransE, TransH, TransR | | DBLP, semanticscholar, springernature, grid | Link Prediction | TransE | MR, Hits | Co-authorship Recommendation |
| Translational | [ | RotatE | | PubMed | Link Prediction | RotatE | AUC | Recommending drug candidates for repurposing |
| Translational | [ | TransD, TransD, TransH, TransE | | AIDA, MAG | Link Prediction | TransD | P, R, MRR, NDCG | – |
| Translational | [ | TransD, TransE, DistMult, and ComplEx, RotatE, Node2Vec | | CORD-19 | Link Prediction | TransD | P, ROC | Association analysis |
| Multiplicative | [ | TransE, TransH, DistMult, HolE, ComplEx | | AK18K | Link Prediction | HolE | MRR, Hits | Scholar classification and scholar clustering |
| Multiplicative | [ | CP, Word2Vec | – | – | Data Exploration | – | – | Querying and Browsing |
| Multiplicative | [ | TransE, TransD, TransR and ComplEx | | Collected using web crawler | Link Prediction | ComplEx | Mean Rank, Hits | Paper, author and venue recommendations |
| Multiplicative | [ | TransE, DistMult, and ComplEx | | AIDA35k | Link Prediction | WGE | MSE, MAE, F, Accuracy | Classifying research articles |
| Multiplicative | [ | ComplEx, SimplE, TransE, CrossE, RDF2Vec | | CORD-19 | Link Prediction | ComplEx | AUPR, F | Drug–drug interaction prediction |
| Deep Learning | [ | TransE, DistMult, and ComplEx, ConvE, ConvTransE | | PubMed | Link Prediction | ConvTransE | Hits | Classifying drug candidates for repurposing |
| Deep Learning | [ | ConvCN | | Aminer | Link prediction | ConvCN | MRR and Hits | Citation Recommendation |
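Several rows above pair the translational model TransE with the link-prediction metrics MRR and Hits. As a rough illustration, the Python sketch below scores triples with the standard TransE form, the negative distance between h + r and t, ranks the true tail among all entities, and then computes mean reciprocal rank and Hits@k. The 2-D vectors, entity and relation names, and the helper names are hypothetical and set by hand; the surveyed studies learn embeddings from data and often apply filtered ranking, which this sketch omits.

```python
import math

# Hypothetical 2-D embeddings chosen by hand for illustration only;
# real models learn these vectors from training triples.
ENTITY_VECS = {
    "author:A1": (0.0, 0.0),
    "paper:P1": (1.0, 0.0),
    "paper:P2": (1.0, 1.0),
    "venue:V1": (2.5, 2.0),
}
RELATION_VECS = {
    "writes": (1.0, 0.0),
    "cites": (0.0, 1.0),
    "published_in": (1.0, 1.0),
}
# Hypothetical held-out triples (head, relation, true tail).
TEST_TRIPLES = [
    ("author:A1", "writes", "paper:P1"),
    ("paper:P1", "cites", "paper:P2"),
    ("paper:P1", "published_in", "venue:V1"),
]


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def tail_rank(head, relation, true_tail):
    """Rank of the true tail among all entities (1 = best, unfiltered setting)."""
    h, r = ENTITY_VECS[head], RELATION_VECS[relation]
    true_score = transe_score(h, r, ENTITY_VECS[true_tail])
    better = sum(
        1
        for name, vec in ENTITY_VECS.items()
        if name != true_tail and transe_score(h, r, vec) > true_score
    )
    return better + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


RANKS = [tail_rank(h, r, t) for h, r, t in TEST_TRIPLES]

if __name__ == "__main__":
    print("ranks:", RANKS)
    print("MRR:", round(mean_reciprocal_rank(RANKS), 3))
    print("Hits@1:", round(hits_at_k(RANKS, 1), 3))
```

With these toy vectors the first two true tails rank first and the third ranks second (`paper:P2` lies closer to the translated head than `venue:V1`), so MRR is (1 + 1 + 1/2) / 3 and Hits@1 is 2/3.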