
Ontology based recommender system using social network data.

Mohamad Arafeh¹, Paolo Ceravolo¹, Azzam Mourad², Ernesto Damiani³, Emanuele Bellini⁴.

Abstract

Online Social Networks (OSNs) are considered a key source of information for real-time decision making. However, several constraints reduce the amount of information a researcher can obtain while increasing the time required by social network mining procedures. In this context, this paper proposes a new framework for sampling OSNs. Domain knowledge is used to define tailored strategies that decrease the budget and time required for mining while increasing recall. An ontology supports our filtering layer in evaluating the relatedness of nodes. Our approach demonstrates that the same mechanism can be extended to prompt recommendations to users. Our test cases and experimental results emphasize the importance of the strategy definition step in our social miner and of applying ontologies to the knowledge graph in the domain of recommendation analysis.


Keywords: Big data; Data analysis; Data miner; Data sampling; Ontology; Recommender system; Social network

Year: 2020 PMID: 33071400 PMCID: PMC7546693 DOI: 10.1016/j.future.2020.09.030

Source DB: PubMed Journal: Future Gener Comput Syst ISSN: 0167-739X Impact factor: 7.187

Introduction

In recent years, significant attention has been devoted to mining Online Social Networks (OSNs) in real time. Decision analytic methods, from marketing to emergency management and from politics to business and management, benefit from real-time or near real-time event processing. Event detection is, for example, crucial in traffic management [1], fire control [2], TV show hosting [3], and smart-city management systems [4], [5]. In combination with other data sources, OSNs can boost complex decision making and risk management methodologies [6]. For instance, Twitter has been effectively exploited in many real-world incidents to communicate disaster warnings and disseminate information, capture evolving trends, control resource consumption, or discover effective mitigation strategies bottom-up [7], [8], [9]. However, OSN data must be treated with proper confidence levels. The Coronavirus pandemic is a vivid case study. During epidemics that arose before the advent of social media, experts had to wait for a publication in an academic journal to learn about the progress of a disease outbreak. Nowadays, sharing information between experts is much faster. On the other hand, the ease with which information is published and the speed with which it spreads pose new challenges when this information is incorrect or false (e.g. fake news). The so-called infodemic needs to be dealt with promptly, eliminating the noise generated by unverified news and by the alarms caused by the fear of contagion, and spreading reliable information in the shortest time possible. The reliability of social network-based event analysis depends on several factors. The first is the actual presence of users on the ground acting as sensors.
In Florence, during a recent Arno river embankment collapse caused by a water pipe disruption, no relevant variation on Twitter was detected, simply because the event happened at 6.15 AM on 25/05/2016 and there were no Twitter users on site to comment [10]. Social media are general-purpose communication platforms; for this reason, filtering the activities that are related to the domain of analysis is crucial to avoid introducing selection bias. To do so, a new class of agile and cost-effective methods and tools that support operators in analyzing, at a deeper level and closer to real time, the huge amount of data generated by OSNs is paramount. Twitter, for example, gives researchers a gateway to billions of data items about users' links, written contents, and community circles, allowing analysts to improve their algorithms, primarily in Natural Language Processing [11], Link Prediction [12], Community Detection [13] and Sentiment Analysis [14]. However, such methods require a considerable amount of data to be processed, which affects both the timeliness of the results and the resources needed. Several approaches have been proposed for mining OSNs while limiting the time and the budget required for mining [15], [16], [17], [18], [19], [20], [21], [22]. However, to the best of our knowledge, none of them is capable of using the mix of strategies we propose in this paper. In the present work, we propose a mining platform that helps researchers and data collectors mine and directly analyze social networks by defining API-specific and budget-constrained strategies that filter data collection based on concurrent sampling and ontology-enhanced filtering algorithms [23]. To test our approach, we used it to create a content-based recommender system. In our proposed architecture, recommendations are the result of the graph projection of social network nodes with their relationships and roles.
In our approach, we use machine learning to detect the entity of each account from its shared contents, and ontologies to build a knowledge graph that maps the relations between the accounts and their environment. Additionally, we use our platform to build an ontology-enhanced knowledge graph and exploit it for recommendation purposes. The paper is organized as follows. In Section 2 we present the related work. In Section 3 we describe our proposed platform, including its architecture and components. In Section 4, we show the implementation of the ontologies as part of our platform and the complete workflow of our recommender system analysis. In Section 5, we present our system implementation. In Section 6, we present our experimental analysis, and Section 7 concludes the paper.

Related works

The value of the data collected from OSNs lies in their potential to reveal hidden patterns or predict future dynamics or trends [24], [25] that are mostly impossible to uncover in other ways. However, the quality of the results of the analysis is intrinsically related to the method through which the datasets are created. Although extensive research has been carried out on data collection, some uncertainty remains about the existence of a standard sampling methodology to efficiently collect datasets from OSNs [26]. Most OSN platforms provide APIs that allow anyone to query and amass large amounts of information in a relatively short time. Usually, the process requires the registration of an application first; the platform then returns a set of tokens granting access to the streaming API. As of 2015, Twitter, which is the main source of information for researchers, decided to limit access to its data. Twitter supplies 10% randomly sampled tweets (known as the gardenhose) from its firehose for a fee, and 1% randomly sampled tweets for free in real time through the Sample API. The randomness of a sample (each element has an equal probability of being chosen) is of high importance for methodological integrity, as a randomly selected sample is regarded as a valid representation of the total population. However, the company does not reveal details about its data sampling mechanisms, which affects the reliability of the outcomes based on them [27]. Current studies have used non-probability methods to collect data, e.g. Guillory et al. 2018 [28] and Hsieh and Murphy 2017 [29], given the inherent coverage error of the social media user base in representing the general population. The non-probability design that these studies employ prevents inference to populations larger than the respondent sample.
[30] Another limitation is related to the number of queries that can be executed in a 15-minute window, which lowers the amount of information available for the analysis while increasing the time needed to mine the required resources. A further limitation is the maximum number of request calls allowed in a period, which affects the informative potential of the generated datasets when a huge number of tweets is generated (and then lost) in certain cases (e.g. emergencies). A workaround is to create several accounts that access the APIs in an alternating fashion. Finally, access to historical Twitter data is not allowed via the Twitter API; this, together with the limited number of characters per message and similar constraints, forces developers to set up specific architectures and strategies for collecting tweets while attempting to obtain them with sufficient reliability [31], [32]. To avoid the aforementioned limitations, an upgrade from the standard to the premium or enterprise API is needed, subject to the payment of fees. However, researchers are finding new ways to address this issue: the web scraper mentioned in [33] works by parsing hypertext tags and retrieving the plain text information embedded in them. Since web scrapers do not obtain information through the API, they are not restricted by the limitations posed by OSN providers and can thus mine large amounts of data with less time and budget. However, using a web scraping tool has its own limitations. A web scraper cannot be used for long-term monitoring, since websites change constantly over time and web scrapers must be updated accordingly. This condition is particularly evident in [34], where a web-based crawler for collecting vulnerability information from the dark web has to be adapted each time a harvesting campaign is about to start. Additionally, each OSN requires a custom web scraper.
This issue is no different from using OSN APIs, since each platform provides its own. However, the development of a reliable scraper requires a fair amount of work and knowledge that a researcher may not have. Finally, scraping may expose the researcher to legal action: e.g. in [4], LinkedIn sued people who anonymously scraped its website, on grounds such as violation of the Computer Fraud and Abuse Act (CFAA), trespass, and breach of contract. Another approach discussed in the literature is sampling, where a small fraction of the OSN users is mined to create a sample representative of the whole OSN. There are several sampling techniques for OSNs that aim to optimize the effort in terms of time, computation load, and dataset representativeness. In [10], a sampling-based algorithm is proposed for efficiently exploring a user's social network while respecting its structure and for quickly approximating quantities of interest. In [35] and [32], the sampling strategy is based on the concept of a Channel, which consists of a set of simple and complex search queries performed on the Twitter platform by the Crawler engine. The simplest Channel collects and analyzes tweets referring to a single Twitter user, user citation, hashtag, or keyword. Complex Channels may consist of several queries designed according to the search query syntax of the Twitter APIs by combining keywords, user IDs, hashtags, citations, etc., with operators (e.g., and, or, from). Thus, a user can design their own channel and run a collection process on the OSN. However, this method is limited by the fact that the user has to make assumptions when defining the query filters (e.g. hashtags), which may not be appropriate for creating a comprehensive dataset to analyze a specific phenomenon. In [26], the authors explored the use of random search algorithms to sample OSNs, such as a Brownian walk (based on a normal distribution), a spiral-inspired walk, and a Reservoir sampling algorithm.
The aim is to define a standard sampling methodology applicable where the OSN information flow is readily available. In [6], four sampling algorithms based on learning automata (DLAS, EDLAS, ICLA-NS, and FLAS) are explored to produce scaled-down representative subgraphs of an OSN. The random walk exploring strategy, adopted in [15], [16], [17], [18], provides the base method to ensure unbiased sampling. Random walk, however, requires a long mixing time, i.e. a long startup period before guaranteeing good accuracy [19]. An effective way of overcoming this issue is to incorporate uniform node sampling (UNI) into random walk sampling, enabling the strategy to jump to other parts of the graph. Different authors have developed this random walk with jumps approach. An alternative solution to the same issue is to build a multi-layered social network, where multiple sources can be followed to exit the dead ends or local boundaries that a random walk can enter. Additionally, other forms of traversal algorithms are mentioned in [36], [37], [38] to boost data transmission performance and to reduce energy and data consumption. Researchers have also pointed out that sampling based on social media APIs is biased by policies that are designed to save the vendor's resources rather than to optimize the sampling power of the data [39], [40]. This means that scholars using data obtained via APIs need to apply caution when drawing inferences from such data. In particular, it has been observed that biases arise from the order in which connected nodes are returned [23], which is based on the age of the link between two nodes, and from the fixed time frames used for selecting the nodes to be included in the sample APIs [41]. In the domain of recommender systems, ratings and features are widely used to infer recommendation probabilities. Depending on their needs, researchers have used either one or both of them to achieve their results.
In [42], Nilashi et al. propose an ontology-based recommendation system combined with dimensionality reduction to mitigate the issues of a sparse dataset. It uses users' ratings and features as input to infer a probability of recommendation. Similarly, the use of dimensionality reduction is discussed in [43], where the authors apply these methods in two real-world experiments and compare them with collaborative filtering. In [44], the authors demonstrate a relationship between an item and its location by developing a location-aware recommendation system. The system takes advantage of data localization and ratings to produce recommendation probabilities. As an alternative to recommendation that focuses on ratings, Yao et al. propose in [45] a recommendation system for the auto industry, where item ratings are believed to be scarce and inapplicable. Hence, the authors take advantage of the customers' common features to build relations between them. The approach followed in the present work is to let the user compose strategies using a mixture of approaches and constraints. This supports the definition of API-specific and domain-specific solutions, with the ability to compare alternative strategies in order to assess them in real-world scenarios. Additionally, we use our platform to build ontologies by taking advantage of the graph nature of the saved data. Furthermore, we propose an ontology-enhanced graph analysis in the scope of recommendation systems, which, to the best of our knowledge, none of the current approaches has addressed yet.

A framework for sampling social networks

In this section, we present our framework. We discuss the system hierarchy that organizes our architecture and its abstraction layer, and we detail the workflow guiding the mining procedures, the network space exploring algorithms, and the different filtering strategies that make the system effective.

System architecture and components

A social network is an ever-expanding data source. For this reason, an effective mining procedure must rely on real-time data collection. In addition, an evolving domain may require extending the computational capacity of the system. To address this issue, we combined different technologies and interfaced the separate components through abstract classes, which keeps the architecture scalable. Fig. 2 presents the hierarchy of our architecture, listing the abstract classes that compose it.

Fig. 2

The architectural hierarchy.

The system architecture directly reflects the elements of a strategy, with a software component for managing each element independently. At the root level, we have a class for defining mining strategies. Each strategy is a combination of multiple settings managed by separate components.

Data Sources. A strategy has to contain a connection to an input source that the miner uses for querying data. Sources can be online, like Twitter or Facebook, or locally available, like a local SQL database or data files (CSV, Excel). Abstracting this component lets the platform accept an open-ended set of sources and keep up with the daily growth of the available data.

Network Space Exploring Algorithm. As described later in this section, the network space exploring algorithm is used to navigate through a data source, embedding the mined data as a graph. Such algorithms are also used to sample the data source, decreasing the time and budget spent on mining and data analysis.

Event Subscribers. This component allows further components to be integrated beyond those provided in the architecture. The idea came from the need to continually add features on the one hand and to transform data on the other. At each step, this component broadcasts the current status to its subscribers, thus allowing them to modify and transform the data. It accepts any number of custom-defined filters and data embedders. As an example, we have introduced an ontology-enhanced event subscriber as an advanced filtering technique to eliminate nodes that are unrelated to the specified case. More information is available in Section 4.

Analytics. This component contains the algorithms used to run analytics on the mined data. Currently, we support all the graph algorithms natively supported by Neo4j, including centrality, community detection, pathfinding, similarity, and link prediction algorithms. Additionally, we have introduced a new recommender system algorithm that assigns a recommendation probability to each node in the knowledge graph through a given ontology model.
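The component hierarchy described above can be pictured as a small set of abstract classes. The following Python fragment is only an illustrative sketch: the class names, method names, and the `InMemorySource` helper are assumptions made for this example and are not taken from the authors' implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class DataSource(ABC):
    """Input queried by the miner (online API, SQL database, CSV file, ...)."""

    @abstractmethod
    def neighbours(self, node_id: str) -> List[str]:
        """Scan a node: return the ids of the nodes linked to it."""

    @abstractmethod
    def fetch(self, node_id: str) -> Dict:
        """Fetch the full attributes of a node."""


class EventSubscriber(ABC):
    """Receives the node at each step; returning False rejects the node."""

    @abstractmethod
    def notify(self, event: str, node: Dict) -> bool:
        ...


@dataclass
class Strategy:
    """Root of the hierarchy: one setting per architectural component."""
    source: DataSource
    navigator: str  # name of the space exploring algorithm (hypothetical field)
    subscribers: List[EventSubscriber] = field(default_factory=list)
    analytics: List[str] = field(default_factory=list)


class InMemorySource(DataSource):
    """Toy local source backed by dictionaries, standing in for a file or DB."""

    def __init__(self, edges: Dict[str, List[str]], attributes: Dict[str, Dict]):
        self._edges = edges
        self._attributes = attributes

    def neighbours(self, node_id: str) -> List[str]:
        return list(self._edges.get(node_id, []))

    def fetch(self, node_id: str) -> Dict:
        return {"id": node_id, **self._attributes.get(node_id, {})}
```

Because the strategy only depends on the abstract `DataSource`, a Twitter-backed source and a local file source would be interchangeable in such a design.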

System workflow

Fig. 3 describes in detail the data processing workflow of our system. The process starts when the network space exploring algorithm navigates the social network graph and chooses the first node. The result differs depending on the algorithm selected when defining the strategy. The next step is scanning the selected node, which reveals further routes of the social graph to explore. Obtaining more detail on a node requires additional specialized requests to fetch it.

Fig. 3

The system workflow.

To reduce the number of API requests, we introduced a caching system that answers a call whenever the result is already available in the cache. Pre-Scan, Post-Scan, and Post-Fetch are event subscribers that run before scanning, after scanning, and after fetching, respectively. Such subscribers can be used as data filters and mappers that introduce new procedural information into the knowledge graph. Examples of Pre-Scan subscribers are max-level filters, which prevent adding any node located beyond the specified level, and max-fetch filters, which limit the number of nodes that can be scanned, thus further reducing the number of API requests. Post-Fetch filters are usually used for filtering data based on node attributes, since such information is only available after the fetch request. Post-Fetch subscribers can also act as Entity Detectors, which introduce new information into the knowledge graph based on predefined procedures that are subscribed to receive updates when such an event arises.
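The caching step can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the cache is an in-process dictionary, and `CachedSource`, `fetch_fn`, and `api_calls` are hypothetical names used only to show the idea.

```python
class CachedSource:
    """Wraps a fetch function and answers repeated requests from a cache."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn  # the expensive call, e.g. an API request
        self._cache = {}           # assumption: a plain dict stands in for the cache store
        self.api_calls = 0         # counts how many real requests were issued

    def fetch(self, node_id):
        if node_id not in self._cache:
            # Cache miss: spend one request and remember the answer.
            self.api_calls += 1
            self._cache[node_id] = self._fetch_fn(node_id)
        return self._cache[node_id]


source = CachedSource(lambda node_id: {"id": node_id})
source.fetch("alice")
source.fetch("alice")  # served from the cache, no new request
source.fetch("bob")
```

After these three calls only two requests have reached the underlying source, which is the saving the caching system aims for when the same node is reached through different routes.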

Network space exploring algorithm

An important part of our framework depends on the space exploring algorithms, as they are a crucial component when defining a strategy. With such algorithms, we can navigate through social networks and embed them as a graph in our data storage system. Graph databases are used to store OSN accounts and posted contents as nodes, while the links between them are captured by edges. Fig. 1 shows a graphical representation of the stored data. Edges can be labeled to define the type of relationship connecting two nodes, e.g. friends, follower, or co-authors. A post and its originator can be represented as two nodes connected by an edge labeled posted-it. A post and a reader can be connected by like-it or hate-it. To address the limits posed by OSN providers, navigation algorithms combined with filters are used as data samplers working on a subset of data selected to be representative of the whole dataset. Sampling a social network reduces the time and budget required to collect the minimum information needed. A similar approach has been discussed in [46], in which the authors comparatively assessed the accuracy of deterministic and probabilistic navigation algorithms. Since each case requires a different strategy, we built our platform to support the implementation of different navigation algorithms, allowing us to compare their performance and accuracy under different settings. The algorithms available in the platform can be divided into two groups: deterministic, like Breadth-First, and probabilistic, like Forest Fire, Random Walker, and Metropolis Hasting. Additional focus has been given to probabilistic approaches, which accept hyper-parameters that can change the shape of the mined data, making them better suited to data sampling while further widening the traversed space of the network. The hyper-parameters most frequently used in the platform are forward weight and iterations.
Forward weight controls the forward and backward jump rate of the random walker. Its value ranges between 0 and 1, and the higher the value, the deeper the level explored by the algorithm. This parameter significantly affects the accuracy of the results.

Fig. 1

Relationships between nodes in a graph database.

Node filtering strategies

Usually, any data analytics procedure includes data cleaning. Filtering is intended to prune irrelevant data, thus reducing the number of wasted requests. Filters such as minimum account followers, creation dates, and scam detectors can be exploited to identify fake accounts. In our framework, filters are part of the event subscribers. A filter can be attached to receive continuous updates about the nodes in all three of their states: before scanning (Pre-Scan), after scanning (Post-Scan), and after fetching (Post-Fetch). The information provided to an event subscriber depends on the state it subscribes to. In the Pre-Scan state, only the Id of the node and its current level are available, making this state appropriate for filters that depend on the level of the node. The Post-Scan state provides more information about the shape of the network and how it will be extended, since the scanned node reveals all of its possible children. Therefore, filters that depend on the number of children of a node, like minimum Twitter account followers, are ideal for this state. The importance of the Post-Scan event lies in providing the last piece of information that can eliminate a node before the actual fetch happens, thus saving limited resources. Finally, the Post-Fetch state has the most information about the node and typically a higher impact on computational resources. Taking all the capabilities of the filters into consideration, it is possible to significantly narrow the area of interest, thereby decreasing the time and the budget required.
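The three filter states can be sketched as follows. The example is a simplification under stated assumptions: a node is a dictionary that already carries its level, children, and attributes, whereas in the framework each piece of information only becomes available at the corresponding state. The helper names (`max_level`, `min_followers`, `attribute_equals`, `first_rejecting_state`) are hypothetical.

```python
def max_level(limit):
    # Pre-Scan filter: only the node id and its level are known at this point.
    return lambda node: node["level"] <= limit


def min_followers(minimum):
    # Post-Scan filter: the scan has revealed the children (followers).
    return lambda node: len(node["children"]) >= minimum


def attribute_equals(key, value):
    # Post-Fetch filter: the full attributes are available after the fetch.
    return lambda node: node["attributes"].get(key) == value


def first_rejecting_state(node, filters):
    """Return the name of the first state whose filter rejects the node, or None."""
    for state in ("pre_scan", "post_scan", "post_fetch"):
        for accept in filters.get(state, []):
            if not accept(node):
                return state
    return None


filters = {
    "pre_scan": [max_level(2)],
    "post_scan": [min_followers(2)],
    "post_fetch": [attribute_equals("country", "IT")],
}
node = {"level": 1, "children": ["a", "b"], "attributes": {"country": "IT"}}
```

The earlier a node is rejected, the cheaper the rejection: a Pre-Scan rejection costs no request at all, while a Post-Fetch rejection happens only after the fetch request has been spent.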

Ontology enhanced event subscribers for a social network-based recommender system

We built our ontology-based recommender system as a layer on top of the framework described above. Using the capabilities of the event subscribers, we intercept the process of saving a node and update the graph according to the selected ontology. Fig. 4 presents the complete steps of the proposed recommender system. The process starts by building an ontology model. In our approach, this model is manually created using prior knowledge of the domain of analysis. The next steps rely on the mining framework, which handles collecting and fetching data from social networks. With the continuous updates on the state of the graph provided by the framework, we examine each received node and detect its role in the graph. Once the required knowledge about the node is complete, we start the filtering procedure by matching the node with the model. The process repeats until the stop conditions are reached; the recommendation analysis is then executed to assign a recommendation value to each node.

Fig. 4

Recommender system component workflow.

Ontology model

In the adopted scenario, we study the users interacting with the Twitter account of an academic conference in order to create a recommendation system. An academic ontology has been manually developed and is presented in Fig. 5. The conference class is one of a kind and is manually provided when data collection starts. The followers of each conference are divided into teachers, students, and attendees. To increase the accuracy of the recommender system, other properties that may affect the results have also been taken into consideration, e.g. the location of the conference, the location of the students and teachers, and the institution in which a student study_in or a teacher teach_at.

Fig. 5

Academic ontology model.
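One simple way to hold such a model in code is as a set of (subject class, relation, object class) triples. The sketch below is an assumption-laden illustration: only `study_in` and `teach_at` are relation names given in the text, while `follows` and `located_in` are hypothetical labels chosen for this example, and the exact shape of the model in Fig. 5 may differ.

```python
# (subject class, relation, object class) triples of a toy academic model.
ACADEMIC_ONTOLOGY = {
    ("Teacher", "follows", "Conference"),      # "follows" is an assumed label
    ("Student", "follows", "Conference"),
    ("Attendee", "follows", "Conference"),
    ("Student", "study_in", "Institution"),    # relation named in the text
    ("Teacher", "teach_at", "Institution"),    # relation named in the text
    ("Conference", "located_in", "Location"),  # "located_in" is an assumed label
    ("Student", "located_in", "Location"),
    ("Teacher", "located_in", "Location"),
}


def is_allowed(subject, relation, obj, ontology=ACADEMIC_ONTOLOGY):
    """True when the model describes this kind of relationship."""
    return (subject, relation, obj) in ontology


def relations_of(subject, ontology=ACADEMIC_ONTOLOGY):
    """Sorted relation labels that a given class may start."""
    return sorted({relation for s, relation, _ in ontology if s == subject})
```

For example, `relations_of("Student")` lists the relations a student node may have, and `is_allowed("Student", "teach_at", "Institution")` is false because the model does not describe students teaching at an institution.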

Data collection

Data collection is handled by our mining framework using a mining strategy. For example, the level breakdown of a space exploring algorithm allows us to focus on a specific area of the network rather than the whole, thus avoiding a large amount of data processing that may lead to few or no results. In our case, the focus was on the lower levels. Therefore, Breadth-First is a relevant network space exploring algorithm for this task. However, other algorithms can also be tuned to focus on the lower levels. For example, with a small forward probability, RandomWalker and Metropolis Hasting favor moving backward over moving forward, making the lower-level nodes more significant.

Entity detection approaches

For a recommendation system to work accurately, nodes must be assigned to the classes specified by the ontology model. Usually, specialized networks have well-defined entities assigned to each node. This is, however, not the case in generalist networks like Twitter or other public social networks. We rely on labeled accounts and their descriptions to train an entity classifier. With the description as input and the known entity as label, we obtained effective training and test sets. The entity classifier is built using a classical supervised learning algorithm; we selected a support vector machine. Once the classifier is built, we can determine the entity of each node from its description. The classifier is used as a Post-Fetch event subscriber, thus receiving nodes when they are fetched and updating their entity accordingly.
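To make the idea concrete, the following is a minimal sketch of a linear support vector machine trained on account descriptions. The text only states that a support vector machine was used; the bag-of-words features, the sub-gradient training loop, the two-class setting, and the toy descriptions are all assumptions made for illustration.

```python
from collections import Counter


def featurize(description):
    """Bag-of-words features: lower-cased token counts."""
    return Counter(description.lower().split())


def score(weights, features):
    return sum(weights.get(word, 0.0) * count for word, count in features.items())


def train_linear_svm(descriptions, labels, epochs=20, eta=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss (labels are +1 / -1)."""
    weights = {}
    data = [(featurize(d), y) for d, y in zip(descriptions, labels)]
    for _ in range(epochs):
        for features, y in data:
            violated = y * score(weights, features) < 1  # margin check
            for word in weights:                          # regularisation shrink
                weights[word] *= 1 - eta * lam
            if violated:                                  # hinge-loss sub-gradient step
                for word, count in features.items():
                    weights[word] = weights.get(word, 0.0) + eta * y * count
    return weights


def predict_entity(weights, description):
    return "Teacher" if score(weights, featurize(description)) >= 0 else "Student"


# Toy labelled descriptions (invented for this example): +1 = Teacher, -1 = Student.
DESCRIPTIONS = [
    "professor of computer science",
    "lecturer and researcher",
    "phd student in physics",
    "undergraduate student",
]
LABELS = [1, 1, -1, -1]
weights = train_linear_svm(DESCRIPTIONS, LABELS)
```

Used as a Post-Fetch subscriber, such a classifier would call `predict_entity` on the fetched description and store the result on the node. A real setting has more than two classes, which could be handled with one classifier per class.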

Filtering & ontology matching

The amount of data a social network can provide is substantial but uncontrolled. As a consequence, a large portion can be cleaned and filtered out. Apart from the filters mentioned in the previous section (MaxLevel and MinFollowers), a new filter is deployed to reduce the amount of data exposed to the recommendation system. In this step, we seek to eliminate all possibly inaccurate or faulty recommendations. We propose a graph-based ontology matching filter that matches the ontology model against the labeled graph available after the discovery of the node entities. An approach similar to DSSim-ontology [47] is exploited to extract the similarities between the node environment and our model. The goal is to eliminate nodes that do not follow a shape equivalent to the provided model.
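The matching step can be approximated by comparing the labeled relations around a node with the triples of the model. The sketch below is a deliberately simplified stand-in and not the DSSim-based similarity used by the authors: it measures the fraction of a node's relations that the model describes. The function names, the threshold, and the relation labels are assumptions.

```python
def match_ratio(node_triples, model_triples):
    """Fraction of the node's labelled relations that the model also describes."""
    if not node_triples:
        return 0.0
    return len(node_triples & model_triples) / len(node_triples)


def keep_node(node_triples, model_triples, threshold=1.0):
    """Keep a node only when enough of its environment matches the model."""
    return match_ratio(node_triples, model_triples) >= threshold


# Toy model and node environments (labels are illustrative assumptions).
MODEL = {
    ("Student", "follows", "Conference"),
    ("Student", "study_in", "Institution"),
}
matching_node = {
    ("Student", "follows", "Conference"),
    ("Student", "study_in", "Institution"),
}
partial_node = {
    ("Student", "follows", "Conference"),
    ("Student", "sells", "Product"),  # relation unknown to the model
}
```

With the default threshold, `matching_node` is kept and `partial_node` is eliminated; lowering the threshold to 0.5 would keep both, which shows how the strictness of the shape equivalence could be tuned.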

Recommendation assignment module

The recommendation assignment module is the final step of our ontology-based recommendation approach. Fig. 6 shows a sample of the graph resulting from the mining process. In this phase, we use the entities extracted from the generated knowledge graph to test for content similarity with the defined ontology model. Each relationship is associated with a weight relative to its equivalent in the model. For instance, a node that follows a structure similar to the model is assigned the same weight.

Fig. 6

Example of a labeled graph result.

Node levels have also been taken into consideration. The farther the following node is, the lower the possibility of it being recommended. This case is handled by raising the weight to the power of the level of the node. Moving to the recommendation algorithm, we adopted an improved version of the Adamic Adar algorithm to calculate the possibility of a node being recommended to another. Let be the possibility of node being recommended to attend an event . Let be the nodes adjacent to the node , , the nodes adjacent to node , and the nodes that are adjacent to and are remotely related to the node . Finally, let be the intersection between and including , and their adjacent nodes. The recommendation evaluation can be defined as: Based on Eq. (3), if , then the node log allocation index will be ignored and the results of the calculation will be lowered. Such a case happens when a relationship exists between both nodes that is not described in the defined model. For example, a follower node lives in a location that none of the attendees lives in.
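For orientation, the standard Adamic Adar index sums 1 / log(degree) over the common neighbours of two nodes. The sketch below shows that standard index and, next to it, a level-weighted variant in which each common neighbour contributes its weight raised to the power of its level. The variant is an assumption inspired by the description above and is not the authors' improved formula; the graph, weights, and levels are invented for the example.

```python
import math


def adamic_adar(graph, x, y):
    """Standard Adamic Adar index: sum of 1 / log(degree) over common neighbours."""
    common = graph[x] & graph[y]
    return sum(1.0 / math.log(len(graph[u])) for u in common)


def levelled_adamic_adar(graph, x, y, weight, level):
    """Assumed variant: each common neighbour contributes weight ** level / log(degree)."""
    common = graph[x] & graph[y]
    return sum(
        (weight.get(u, 1.0) ** level.get(u, 1)) / math.log(len(graph[u]))
        for u in common
    )


# Undirected toy graph as adjacency sets.
graph = {
    "conf": {"bob", "rome"},
    "carol": {"bob", "rome"},
    "bob": {"conf", "carol"},
    "rome": {"conf", "carol", "dave"},
    "dave": {"rome"},
}
weight = {"bob": 0.5, "rome": 1.0}  # relationship weights relative to the model
level = {"bob": 2, "rome": 1}       # distance of each neighbour in levels

plain = adamic_adar(graph, "carol", "conf")
levelled = levelled_adamic_adar(graph, "carol", "conf", weight, level)
```

Because weights are at most 1, raising them to the level lowers the contribution of distant neighbours, so `levelled` is smaller than `plain` here. A common neighbour always has degree of at least 2 when the two nodes differ, so the logarithm in the denominator is never zero.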

System implementation

While the mining process is always the same, the mining source, graph navigation algorithm, event subscribers, and graph analysis vary between strategies. Accordingly, to reduce complexity and increase scalability, we abstracted all the main components of the architecture, allowing each of them to have multiple implementations. Mined social network graphs are stored in the Neo4j graph database. Using this technology, we can keep up with the dynamic nature of social networks. Additionally, Neo4j offers performant querying and access to various graph algorithms that assist data analysis, such as centrality algorithms (e.g. PageRank). MongoDB is used to keep a history of the defined strategies and to cache social network query results.

Mining process

The mining process manages the interaction between the architecture components, using a set of seed nodes to start the exploring process. It provides the navigator with the required data source while regularly broadcasting to a set of event subscribers. Additional information is presented in Algorithm 1. We start by creating a root for the graph to be mined. Then the main components are initialized from the strategy, which specifies the implementations to be adopted in the mining process, e.g. the Breadth-First Navigator for exploring and Twitter as a data source. The initialized navigator scans the root to get the seed nodes. For each seed, it branches a navigator to explore further nodes. Navigation starts from the seed and ends when the navigator has no more nodes to provide. The node selected for scanning is first broadcast to the Pre-Scan event subscribers, which can modify the node properties and prevent it from being fetched or scanned further by marking it as unscannable. After a node is accepted and the scan is performed, the node is sent to the Post-Scan subscribers. Similarly, the node is marked as either fetchable or not by comparing its new properties with the working strategy. When the node is fetched, it is sent to the Post-Fetch event subscribers, which decide whether to keep it, if it has proven beneficial to the study, or prune it otherwise. As mentioned above, the difference between Pre-Scan, Post-Scan, and Post-Fetch is that the latter has access to an array of attributes that are not available in the previous states. Finally, the fetched node is cached and reused when the same node is requested again.
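The loop described above can be sketched in Python as follows. This is not Algorithm 1 itself but a simplified illustration under stated assumptions: exploration is breadth-first, subscribers are plain callables that return True to keep a node, a rejected node is not explored further, and the function and parameter names are hypothetical.

```python
from collections import deque


def mine(seeds, neighbours, fetch, pre_scan=(), post_scan=(), post_fetch=()):
    """Breadth-first mining loop with Pre-Scan, Post-Scan and Post-Fetch hooks."""
    cache, graph = {}, {}
    queue = deque((seed, 1) for seed in seeds)
    seen = set(seeds)
    while queue:
        node_id, level = queue.popleft()
        # Pre-Scan: only the id and the level are known.
        if not all(accept(node_id, level) for accept in pre_scan):
            continue  # marked as unscannable
        children = neighbours(node_id)  # the scan reveals the children
        # Post-Scan: last chance to reject before spending a fetch request.
        if not all(accept(node_id, children) for accept in post_scan):
            continue  # marked as not fetchable
        if node_id not in cache:
            cache[node_id] = fetch(node_id)  # fetched once, then reused
        node = cache[node_id]
        # Post-Fetch: full attributes are available.
        if not all(accept(node) for accept in post_fetch):
            continue  # pruned
        graph[node_id] = node
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
    return graph


# Toy local source: one seed, two followers, one second-level follower.
edges = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
attrs = {
    "seed": {"country": "IT"},
    "a": {"country": "IT"},
    "b": {"country": "FR"},
    "c": {"country": "IT"},
}
mined = mine(
    ["seed"],
    neighbours=lambda n: edges.get(n, []),
    fetch=lambda n: attrs[n],
    pre_scan=[lambda node_id, level: level <= 2],
    post_fetch=[lambda node: node["country"] == "IT"],
)
```

In this run, node `b` is pruned after its fetch because of its country attribute, while node `c` is rejected before any request is made because it lies beyond the maximum level.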

RandomWalk

As an example of the navigator procedure invoked by the mining process, we illustrate the RandomWalk algorithm. It belongs to the network space exploring components and is one of the three implementations, alongside Breadth-First and Metropolis-Hastings. The pseudo-code in Algorithm 2 shows the process in detail. It first draws a random number and compares it with the weight defined in the strategy to decide whether to go deeper into the graph or return to a higher level. The larger the weight, the deeper the walker goes.
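The forward-weight decision can be illustrated with a minimal sketch. It is not a transcription of Algorithm 2: the function name random_walk, the dictionary-based graph, and the rule of stepping back when a node has no children are assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def random_walk(children: Dict[str, List[str]],
                root: str,
                forward_weight: float,
                iterations: int,
                rng: random.Random) -> List[str]:
    """Return the nodes visited by a walker, in order of first visit."""
    parent: Dict[str, Optional[str]] = {root: None}
    visited: List[str] = [root]
    current = root
    for _ in range(iterations):
        options = children.get(current, [])
        # Draw a random number and compare it with the forward weight.
        if rng.random() < forward_weight and options:
            nxt = rng.choice(options)        # go one level deeper
            parent.setdefault(nxt, current)  # remember how we got there
            current = nxt
        elif parent[current] is not None:
            current = parent[current]        # return to the higher level
        if current not in visited:
            visited.append(current)
    return visited


chain = {"r": ["a"], "a": ["b"], "b": ["c"]}
shallow = random_walk(chain, "r", 0.0, 50, random.Random(0))
deep = random_walk(chain, "r", 1.0, 10, random.Random(0))
print(shallow, deep)
```

With a forward weight of 0 the walker never leaves the root, while a weight of 1 drives it to the deepest node of the chain; intermediate values such as the 0.2 and 0.8 used in the experiments trade depth against breadth.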

Experimental results and analysis

In this section, we present the experimental environment used to perform our tests. We then analyze in detail the results of the comparison between the different space exploring algorithms and the best use case of each. Finally, the enhanced recommendation algorithm is compared with its original counterpart, showing the difference between the results of both algorithms.

Experimental setup

For the experimental setup, we ran our framework on a virtual machine set up with 2 cores and 4 processors each, resulting in a machine with 8 threads at a frequency of 3.6 GHz. RAM is limited to 8 GB. To speed up our experimental analysis, we pointed our miners to local data sources, which helped us compare different strategies under different conditions in a minimum amount of time. We evaluated the accuracy of the strategies on two data sets. In the first, we built and tested our strategies using data provided by random generators. In the second, experiments were performed on mined Twitter accounts. For the recommendation system, we used a graph simulator to generate a knowledge graph that went through all the steps of the framework before being analyzed. The throughput of the analysis is reported and compared to that of the original algorithm.

Network space algorithm comparison

We consider a case where we need to mine the first-level followers of a set of Twitter accounts that satisfy the predefined attribute conditions described in the strategy. Table 1 gives an overview of the data generated using four seed nodes for the experiments, focusing on the percentage of each seed’s followers of Italian origin. Our experimental results are based on our miners fetching followers from this particular country.

Table 1

Set 1 data overview.

Seed	1	2	3	4
Italy	62%	27%	50%	25%

The mining strategies tested are listed in Table 2. In all of them, we filtered the data and excluded any account not located in Italy. The first mining strategy uses Breadth-First as the exploring algorithm, fetching only 10% of the maximum account followers. The remaining strategies use Random Walk or Metropolis-Hastings as navigation algorithms. Different percentages of the maximum number of accounts to be fetched were used to test the accuracy of the algorithms. A lower fetching percentage corresponds to a higher sampling ratio, since fewer accounts are fetched and included in the test results. For Random Walk and Metropolis-Hastings, we used 500 iterations and a forward weight of 0.2. Additionally, for Metropolis-Hastings, we used a Normal distribution to generate the next mining position (see Table 2).

Table 2

Set 1 mining strategies.

	S1	S2	S3	S4	S5
Exploring algorithm	BF	RW	MH	RW	MH
Account fetched	10%	10%	10%	5%	5%
Location filter	Italy	Italy	Italy	Italy	Italy
Iterations	–	500	500	500	500
Forward weight	–	0.2	0.2	0.2	0.2
Distribution	–	–	Normal	–	Normal

Experimental results in Fig. 7 show the density of the Italian accounts for each seed under each strategy. We observe that the first strategy deviates the most from the original data. Using Breadth-First as the navigation algorithm yields 53% for seed 1 compared to 62%, 78% for seed 3 compared to 50%, and 78% for seed 4 compared to 25%. This happens when Breadth-First is combined with an attribute filter and a fetch filter for sampling while an important portion of the data of interest is located in the last part of the data set. The other strategies show acceptable results, since they reflect the original data proportionally even when using a higher sampling ratio.

Fig. 7

Density of the Italian accounts.
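The bias of a fetch-limited, in-order traversal can be reproduced on synthetic data. The sketch below is an assumption-laden illustration, not the experiment itself: the follower list is invented, accounts of interest are deliberately clustered at one end, and uniform random sampling stands in for the randomized navigators.

```python
import random
from typing import List


def density(accounts: List[str], country: str = "Italy") -> float:
    """Share of accounts located in the given country."""
    return sum(1 for c in accounts if c == country) / len(accounts)


# Hypothetical follower list: 25% Italian accounts, all clustered at the front.
followers = ["Italy"] * 250 + ["Other"] * 750

budget = len(followers) // 10            # fetch only 10% of the followers
prefix_sample = followers[:budget]       # what an in-order traversal sees
rng = random.Random(42)
random_sample = rng.sample(followers, budget)

print(density(followers), density(prefix_sample), density(random_sample))
```

The in-order sample reports a density far from the true 25%, while the random sample stays close to it. Whether the in-order estimate is too high or too low depends on where the accounts of interest sit in the list.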

In the second experiment, we used a data set from Kaggle [6] called Twitter Friends and hashtags. It is a collection of Twitter users that includes each user’s information, friends, last seen, language, trending topics the user tweeted about, and other attributes. We are interested in forming communities based on the tags a user frequently uses when posting on Twitter. Our results show the density of the communities detected from the data. Since we do not have a global overview of the data, we include a strategy that iterates over a large portion of the data set. For comparison, we also include other strategies that can be considered samples of the original one. Table 3 shows the mining strategies used by the miners. The first strategy is intended to give a general picture that serves as a baseline for comparison. Therefore, Breadth-First is used as the navigator, since we are not focusing on sampling, and we do not restrict the maximum number of accounts that can be fetched for each node. The maximum depth of 8 levels can be considered very high, since the data grows exponentially as the depth level increases. For the other strategies, we used all three algorithms. Only 50 accounts can be fetched under each node. Breadth-First is limited to the first three levels, while no depth limit is set for Random Walk and Metropolis-Hastings; instead, a forward weight of 0.8 allows these navigators to extend their exploration deep into the lower levels and therefore gives them a wider view of the data.

Table 3

Set 2 mining strategies.

	S1	S2	S3	S4
Exploring algorithm	BF	BF	RW	MH
Account fetched	–	50	50	50
Max depth	8	3	–	–
Iterations	–	–	500	500
Forward weight	–	–	0.8	0.8
Distribution	–	–	–	Normal

The number of communities detected by the miners is reported in Table 4. The first strategy holds the most complete knowledge of the data set and can therefore be used to measure the accuracy of the other sampling strategies. In total, 29 communities are identified by the first strategy, followed by 19 when employing the Random Walk navigator. Metropolis-Hastings (S4) yielded the worst accuracy of all strategies.

Table 4

Community detected from Set 2.

Strategy	Communities detected
S1	29
S2	14
S3	19
S4	10

Finally, Table 5 reports the number of API requests made and the total execution time in seconds for each strategy. The numbers are based on simulated Twitter APIs, taking into consideration the limitations imposed at the time of writing. The results reveal that a well-defined strategy can achieve high accuracy compared to the original, with a significant reduction in time and API requests.

Table 5

Set 2 execution time.

	S1	S2	S3	S4
API Req.	13 650	379	289	177
Exec. time (s)	56 456	1879	1400	970
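To make the trade-off in Tables 4 and 5 concrete, the short script below expresses each sampling strategy relative to the S1 baseline. The figures are copied from the tables; the helper name relative_to_baseline is hypothetical, and using the ratio of community counts as a proxy for accuracy is an assumption, since the counts alone do not show that the same communities were found.

```python
from typing import Dict

# Values copied from Table 4 and Table 5.
communities = {"S1": 29, "S2": 14, "S3": 19, "S4": 10}
api_requests = {"S1": 13650, "S2": 379, "S3": 289, "S4": 177}
exec_time_s = {"S1": 56456, "S2": 1879, "S3": 1400, "S4": 970}


def relative_to_baseline(values: Dict[str, int],
                         baseline: str = "S1") -> Dict[str, float]:
    """Express every non-baseline strategy as a fraction of the baseline."""
    return {k: v / values[baseline] for k, v in values.items() if k != baseline}


coverage = relative_to_baseline(communities)
request_share = relative_to_baseline(api_requests)
time_share = relative_to_baseline(exec_time_s)

for s in coverage:
    print(f"{s}: {coverage[s]:.1%} of communities, "
          f"{request_share[s]:.1%} of requests, {time_share[s]:.1%} of time")
```

Under this reading, the Random Walk strategy (S3) recovers roughly two thirds of the baseline community count while issuing about 2% of the API requests and taking about 2.5% of the execution time.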

Ontology enhanced recommender analysis

Figs. 8 and 9 show the difference between the analysis results of the Ontology Enhanced Adamic Adar algorithm and the original one over a set of nodes. Since we assign a different weight to each level, Fig. 8 includes nodes that are first-degree followers of the attendee, while Fig. 9 includes nodes that are second-degree followers of the attendee. One of the most noticeable differences is the height of the line, which is scaled down in the case of the enhanced algorithm. The reason is that the weight assigned to each relationship is a constant 0.5, so the lines are shifted down, since the results scale proportionally with the weight.
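The proportional scaling can be checked with a small sketch. The standard Adamic-Adar index sums 1/log(degree) over the common neighbours of two nodes. The weighted variant below is only an assumed form of the enhancement, in which each common-neighbour term is multiplied by a relationship weight; the function names and the toy graph are hypothetical.

```python
import math
from typing import Callable, Dict, Set

Graph = Dict[str, Set[str]]
Weight = Callable[[str, str, str], float]


def unit_weight(x: str, z: str, y: str) -> float:
    return 1.0  # reduces the index to the original Adamic-Adar score


def half_weight(x: str, z: str, y: str) -> float:
    return 0.5  # constant relationship weight, as in the comparison above


def adamic_adar(adj: Graph, x: str, y: str,
                weight: Weight = unit_weight) -> float:
    """Adamic-Adar index with an optional per-relationship weight."""
    common = adj[x] & adj[y]
    return sum(weight(x, z, y) / math.log(len(adj[z]))
               for z in common if len(adj[z]) > 1)


adj: Graph = {
    "attendee": {"a", "b"},
    "candidate": {"a", "b"},
    "a": {"attendee", "candidate", "c"},
    "b": {"attendee", "candidate"},
    "c": {"a"},
}
original = adamic_adar(adj, "attendee", "candidate")
weighted = adamic_adar(adj, "attendee", "candidate", half_weight)
print(original, weighted)
```

With a constant weight the enhanced score is exactly the original score times that weight, which is why the curve is shifted down uniformly; the curves only change shape once the weight varies with the ontology-based similarity of the nodes.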

Fig. 8

Level 1 comparison.

Fig. 9

Level 2 comparison.

Moreover, in Fig. 9, the change in weight causes a scale-down by a factor of two with respect to the results of the recommendation system in Fig. 8. Other noticeable changes concern points in the graphs where the original and the enhanced algorithms diverge. For instance, point 25 and the points in the range from 30 to 45 show quite different results. The reason is related to the remote nodes that are adjacent to both the current node and the node of interest. For example, a second-degree follower may be one who lived in the same location as someone attending the event. Such differences are due to assigning weights based on the similarities between the nodes in the knowledge graph and their equivalents in the ontology model.

Recommendation system experiments

In this experiment, we employed data produced by MovieLens, which consists of 100k ratings from different users [48]. It also contains additional demographic information about each user, including gender, age, and occupation. We built a new ontology for the MovieLens dataset, illustrated in Fig. 10, that benefits from these features. For each movie, we calculated the watching ratio per occupation, age, and gender. Ages are split into four groups: Child, Teen, Grown, and Elder. We then build a graph using this ontology and apply our proposed algorithm to calculate the likelihood of each user watching a specific movie. The results are normalized between 1 and 5 and act as a possible rating of a movie by a user.
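The text does not state which normalization is used to map likelihood scores onto the rating scale. One common choice, shown here purely as an assumption, is min-max scaling; the function name normalize_to_ratings and the sample scores are hypothetical.

```python
from typing import Dict


def normalize_to_ratings(scores: Dict[str, float],
                         low: float = 1.0,
                         high: float = 5.0) -> Dict[str, float]:
    """Min-max scale raw likelihood scores onto the [low, high] rating range."""
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        # All scores are equal: map everything to the middle of the range.
        return {k: (low + high) / 2 for k in scores}
    span = largest - smallest
    return {k: low + (v - smallest) * (high - low) / span
            for k, v in scores.items()}


raw_scores = {"movie_1": 0.0, "movie_2": 0.5, "movie_3": 2.0}
ratings = normalize_to_ratings(raw_scores)
print(ratings)
```

The lowest score maps to 1, the highest to 5, and intermediate scores fall linearly in between, so the outputs can be compared directly with observed ratings.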

Fig. 10

MovieLens representative ontology.

To evaluate the results, we apply Precision and Recall to each user over his/her watched movies, to explore how accurately our solution recognizes the items likely to be highly rated by the users. In this context, an item is considered Highly Rated if its rating is more than 3. The precision values reflect the percentage of correctly identified items, while the recall values show the percentage of items identified from the original set. Additionally, the Mean Absolute Error (MAE) is used to assess the accuracy of the results. These metrics provide insights into the expected error margins. Figs. 11, 12, and 13 present the results of our experiments. In each iteration, we evaluate the analysis scores of each user based on his/her list of ratings. In Fig. 11, the results reveal a high precision for the majority of the users together with a low recall, which indicates that few highly rated items are identified, but those that are identified are mostly correct. The gap between precision and recall can be observed more precisely in Fig. 12, where low values indicate a clear difference between the two scores. Furthermore, in Fig. 13, the MAE results are fairly high compared to other algorithms that rely on ratings [49]. It is worth mentioning that, unlike approaches in the literature, our proposed approach does not need the original ratings of the users, which is a major advantage when there is limited or no knowledge about the preferences and historical data of current and new users.
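The per-user metrics can be computed as in the following sketch. The threshold of 3 for a Highly Rated item comes from the text; the function name evaluate_user and the sample ratings are hypothetical, and the F1 score is included because Fig. 12 reports it.

```python
from typing import Dict, Tuple


def evaluate_user(predicted: Dict[str, float],
                  actual: Dict[str, float],
                  threshold: float = 3.0) -> Tuple[float, float, float, float]:
    """Return (precision, recall, F1, MAE) for one user's rated items."""
    items = predicted.keys() & actual.keys()
    if not items:
        return 0.0, 0.0, 0.0, 0.0
    predicted_high = {i for i in items if predicted[i] > threshold}
    actual_high = {i for i in items if actual[i] > threshold}
    hits = len(predicted_high & actual_high)
    precision = hits / len(predicted_high) if predicted_high else 0.0
    recall = hits / len(actual_high) if actual_high else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    mae = sum(abs(predicted[i] - actual[i]) for i in items) / len(items)
    return precision, recall, f1, mae


# Hypothetical predictions and observed ratings for a single user.
predicted = {"m1": 4.5, "m2": 2.0, "m3": 3.5, "m4": 1.0}
actual = {"m1": 5.0, "m2": 4.0, "m3": 4.0, "m4": 2.0}
precision, recall, f1, mae = evaluate_user(predicted, actual)
print(precision, recall, f1, mae)
```

In this toy case every item predicted as Highly Rated is indeed highly rated, but one highly rated item is missed, giving the same pattern of high precision and lower recall discussed above.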

Fig. 11

Recommendations scores precision, recall.

Fig. 12

Recommendations scores F1-test.

Fig. 13

Recommendations scores MAE.

Conclusions

OSNs can be considered the main source of information for any Big Data analysis study. Our aim in this paper was to develop a scalable platform that keeps pace with the continuous development of OSNs and bypasses the restrictions that limit the effectiveness of the mined data. In this regard, we have introduced domain-specific sampling strategies that serve as input for platform miners. Moreover, we have demonstrated the capabilities of our platform by employing ontologies to reinforce our graphical representation with stronger relations and used them as part of the proposed recommendation system. In our experiments, we have explored the importance of the strategy definition step as well as its impact on the quality of the results. Additionally, we have illustrated the effect of ontologies on the graph and applied them to a real-world dataset for a recommendation system.

CRediT authorship contribution statement

Mohamad Arafeh: Conception and design of study, acquisition of data, analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Paolo Ceravolo: Conception and design of study, analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Azzam Mourad: Conception and design of study, analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Ernesto Damiani: Conception and design of study, analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Emanuele Bellini: Conception and design of study, analysis and/or interpretation of data, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
