Asma Rani (1, 2), Navneet Goyal (2), Shashi K Gadia (3). 1. Dr. B. R. Ambedkar Institute of Technology, Port Blair, India. 2. Birla Institute of Technology and Science, Pilani, India. 3. Iowa State University, Ames, USA.
Abstract
Social media plays a vital role in information sharing at massive scale owing to its easy access, low cost, and fast dissemination of information. Its ability to disseminate information to a wide audience makes determining the provenance of digital content a critical challenge. Social Data Provenance describes the origin, derivation process, and transformations of social content throughout its lifecycle. In this paper, we present a Big Social Data Provenance (BSDP) framework for key-value pair (KVP) databases using the novel concept of the Zero-Information Loss Database (ZILD). In the proposed framework, a huge volume of social data is first fetched from social media (Twitter's network) through live streaming and simultaneously modelled in a KVP database using a query-driven approach. The framework is capable of capturing, storing, and querying provenance information for different query sets, including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries on Big Social Data. We evaluate the performance of the proposed framework in terms of provenance capturing overhead for select, aggregate, and data update queries, and average execution time for various provenance queries.
In the computing world, data are defined as factual information in digital form used as a basis for qualitative and quantitative analysis. The twenty-first century will be known as the century of data, as it has witnessed unprecedented growth of data in almost all domains [5, 6]. With the rapid evolution of social media and web-based communication, people have become more enthusiastic about sharing their thoughts, ideas, opinions, and other content through social media platforms, causing exponential growth in the size of social data [29]. Social media platforms are currently the major source of unstructured data. Unstructured data is characterized by an ad hoc schema and therefore does not fit well into SQL databases. The growth of unstructured data has led to interest in NoSQL databases, which are much better suited owing to their flexible schema. Therefore, our first motivation in this paper is to build a flexible KVP data model in Apache Cassandra using a novel query-driven approach that correlates big social data through relationships and dependencies.
Over the past few years, social media has become a common platform for global conversation owing to its giant size, vast availability, intense speed, and wide range of content. On the other hand, several illegitimate activities are carried out by misusing this social content through social engineering [18, 19, 56] to accomplish various objectives. One of the main causes of illegitimate activities on social media is the separation of digital content from its provenance [12]. Our second motivation is therefore to explore the need for provenance information associated with the digital content published on social media and to design an efficient social data provenance framework for key-value pair (KVP) databases. Social Data Provenance [30] involves the following three dimensions: “What”, “Who”, and “When”. What describes the social media posts, Who describes the correlations among social media users, and When characterizes the evolution of users’ behaviour over time. Like data provenance [24, 49], social data provenance also describes the ownership and origin of such information.
Challenges
The term “Big Data” is characterized by 7 V’s, viz. Volume, Velocity, Veracity, Variety, Variability, Visualization, and Value. The veracity of big data, defined as the quality, accuracy, and truthfulness of the data source, is directly linked with data provenance. Currently, big data and social media have become nearly synonymous, as the major portion (over 90%) of the world’s data is produced through social media platforms such as Twitter, Facebook, and Instagram. This rapidly growing, large-sized, human-generated data is known as Big Social Data (BSD) [37, 47]. One of the major challenges faced by big social data applications is designing a flexible data model in NoSQL databases, as traditional data modelling approaches are not suitable for correct and efficient data model design in such databases. Provenance about the derivation history of big social data is usually called Big Social Data Provenance (BSDP) [21]. In social data analytics, the credibility of an analysis generally depends upon the quality and truthfulness of the input data, which is assured by social data provenance [54]. In this way, social data provenance plays a major role in clarifying opinions, avoiding rumours, supporting investigations, and explaining how, when, and by whom information is created. However, distilling provenance information from such a huge amount of complex data is an extremely tedious task owing to its diverse formats.
Barbier [3] identified the following issues as key challenges in capturing, storing, and querying provenance for social data:
- Currently, no social media platform provides any provenance information that lets users identify the originators or sources of published information.
- A wide variety of digital content, including text, images, and multimedia files, is dynamically generated through various social media sites. However, no common format is available for understanding the provenance information associated with such data.
- No common application programming interface (API) or architectural solution is provided by developers to access and manage provenance data.
- There is no widely accepted mechanism with the potential to trace provenance objects in such unstructured, distributed data.
In addition, several other challenges, such as designing automatic provenance capturing mechanisms, minimizing provenance capturing and querying overhead, choosing the granularity levels at which provenance needs to be captured, and analysing provenance data through provenance visualization, are explored for provenance support in big data applications in [7, 14, 15, 23]. Because of these challenges, the need to capture and query provenance information associated with social data has attracted growing interest in the era of social data analytics.
Needs of social data provenance
In computer science, provenance has been studied mainly from two perspectives: database provenance (or data provenance) and scientific workflow provenance (or workflow provenance) [50]. Workflow provenance is coarse-grained: it captures information about a process and the entities involved in it, treating the process as a black box. Data provenance is fine-grained: it focuses on how a result is derived, what queries are executed, and what operations are performed on the data. In this paper, our main focus is on “Data Provenance”. Social Data Provenance describes the origin, derivation, and transformations of social content throughout its lifecycle. It is likewise divided into two categories based on its granularity level, viz. fine-grained and coarse-grained provenance [50]. Recently, social data provenance has gained a lot of attention, as it serves purposes such as audit trails, data discovery, update propagation, incremental maintenance, rumour identification, and justification of query results. Several web-based tools [30, 42] have been developed to capture pre-defined provenance attributes such as name, gender, religion, and location from the different social networking accounts associated with a particular Twitter user. Although these attributes capture the details of a social media user, they provide neither a provenance path nor a propagation and update history for social content published on a social media platform. To reconstruct and integrate the provenance of messages in social media, a workflow provenance model, PROV-SAID [16, 51, 52], based on the W3C PROV data model has also been proposed for a small dataset. However, most of the existing approaches do not scale to track provenance metadata for social media efficiently, and they are suitable for capturing workflow provenance at the coarse-grained level only.
Further, although social data changes constantly over time, no existing framework is capable of capturing provenance for historical queries, which is an essential requirement of social data provenance. Such a framework is therefore necessary to engender trust among social media users. To accomplish this, we propose a Big Social Data Provenance (BSDP) framework for key-value pair databases that is capable of capturing, storing, and querying provenance information for different query sets, including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries, on a live-streamed Twitter data set.
Relational databases have been the mainstay of the data community for decades, starting from the mid-1980s. They are ideal for structured data and predictable workloads, but they do not scale well to big data, which encompasses not just structured data but also semi-structured and unstructured data. NoSQL (Not only SQL) databases have been proposed as an alternative to SQL databases to handle the challenges posed by big data, as they efficiently support low latency, horizontal scalability, efficient storage, high availability, high concurrency, and reduced operational costs [8, 17, 35]. Apache Cassandra is one of the most popular key-value pair (KVP) databases in the NoSQL family. The key strengths of Apache Cassandra [20] are its simplicity, scalability, and a very fast, streamlined NoSQL architecture in which each column is a data structure that contains a key, a value, and a timestamp. It has also been used in application development by Facebook, Twitter, Cloudkick, Mahalo, etc. Since social media is currently the major source of unstructured data, Apache Cassandra is a good choice for handling such extremely high volumes of unstructured data.
Research contributions
For applications related to auditing, security, and accountability, all the operations performed on a database need to be restorable so that they produce the same results as their previous executions. This leads to the requirement of managing all updates (i.e., insert, delete, and update operations) without any loss of information, as provenance data. Conventional (snapshot) database systems do not maintain the history of data objects and store only the current snapshot of the data; as a result, they are ill-suited for such applications. The Zero-Information Loss Database (ZILD) [4] is a special type of database that is based on temporal databases and maintains temporal data as a history of all updates, along with complete information about the operational activities performed in the database. It is therefore well suited for designing a provenance framework, especially for capturing provenance for update, insert, delete, and historical/standing queries. In this paper, we design and develop a Big Social Data Provenance (BSDP) framework for key-value pair databases [45]. The proposed framework is able to answer the following questions: What type of provenance data should be reconstructed from social media? To what extent will it be useful? How should this provenance data be captured? How and where should it be stored? How should it be queried and analysed? In summary, the main contributions of this paper are:
- BSDP, a novel provenance solution for live-streamed big social data that integrates both online and offline modules and captures fine-grained and coarse-grained provenance.
- Fine-grained provenance is captured in the form of Provenance Path Expressions that consist of the keyspace, column family, row key, and column name that contributed towards each result tuple, while coarse-grained provenance is captured in the form of query statements with their execution times.
- We introduce a novel query-driven data model design methodology for Apache Cassandra.
- Social data changes constantly over time; however, no existing approach is capable of capturing provenance for historical queries, which is an essential requirement of social data provenance. In contrast, our framework aims to maintain all data update (i.e., insert, delete, and update) operations without any loss of information.
- It supports historical data queries (i.e., querying a data element at a given time in the past or within a time range specified in the query statement) using User-Defined Constructs (UDCs) in CQL (Cassandra Query Language), and it captures provenance for standing/historical queries (i.e., it traces the provenance of all the result tuples of a query executed in the past).
- Our proposed framework is developed around the novel concept of the Zero-Information Loss Database (ZILD) [4]. By a Zero-Information Loss Database, we mean that no data value, no user, and no query or its result is ever lost. ZILDs are very useful in tracking any “data manipulations” that have taken place on social media.
- Existing solutions for social data provenance are dedicated to a particular social media platform, offer limited query support, and are suitable only for small data sets. In contrast, our framework provides a generalized provenance solution that can extract real-life social data from different social media platforms through live streaming using their supporting APIs, for instance the Graph API for the Facebook social graph and Twitter’s Streaming API for Twitter’s network.
- We propose different provenance generation algorithms for select, aggregate, standing, and data update queries with insert, delete, and update operations. All the captured provenance is stored in a Zero-Information Loss Key-Value Pair Database (ZILKVD).
In addition, all the extracted social data and their provenance information are stored in a common keyspace of Apache Cassandra using a query-driven approach for fast read/write operations and efficient provenance visualization. A case study of the Twitter social network is given to show the feasibility and usefulness of the proposed framework in capturing, storing, and querying social data provenance.
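To make the two provenance granularities concrete, the following is a minimal Python sketch rather than the framework's actual implementation. It assumes a toy in-memory store in place of Apache Cassandra; the names `ProvenancePath` and `run_select`, and the slash-separated rendering of a path expression, are hypothetical and chosen only for illustration. Each result tuple is paired with the keyspace, column family, row key, and column name that contributed to it (fine-grained), and the query statement is recorded with its execution time (coarse-grained).

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenancePath:
    """Fine-grained provenance: the location of one contributing value."""
    keyspace: str
    column_family: str
    row_key: str
    column_name: str

    def expression(self) -> str:
        # Hypothetical rendering; the paper's exact expression syntax may differ.
        return f"{self.keyspace}/{self.column_family}/{self.row_key}/{self.column_name}"


def run_select(store, keyspace, column_family, columns, predicate, statement):
    """Run a toy select and capture both provenance granularities."""
    started = time.perf_counter()
    results = []
    for row_key, row in store[keyspace][column_family].items():
        if predicate(row):
            values = {c: row[c] for c in columns}
            paths = [ProvenancePath(keyspace, column_family, row_key, c) for c in columns]
            results.append((values, paths))
    elapsed = time.perf_counter() - started
    # Coarse-grained provenance: the statement and how long it took to run.
    coarse = {"statement": statement, "execution_seconds": elapsed}
    return results, coarse


# Toy data: keyspace -> column family -> row key -> {column name: value}
store = {"twitter_ks": {"tweets": {
    "t1": {"user": "alice", "body": "hello"},
    "t2": {"user": "bob", "body": "hi"},
}}}

results, coarse = run_select(
    store, "twitter_ks", "tweets", ["body"],
    lambda row: row["user"] == "alice",
    "SELECT body FROM tweets WHERE user = 'alice'",
)
```

Here each element of `results` carries its own list of path expressions, so a result tuple can be traced back to the exact cell it came from without re-running the query.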
Related work
Social data analytics is an emerging research field that integrates social communication with data analytics. It extracts meaningful insight from extremely large data sets and can be used to understand users’ behaviour and to model social interactions among social media users. Big Social Data [47] is mainly characterized by 3 V’s, viz. volume, velocity, and variety, where volume refers to rapidly growing social data, velocity to the dissemination of information at tremendous speed, and variety to the diverse formats of social data. The volume, velocity, and variety of Big Social Data make it challenging to capture provenance [12] and to evaluate the trustworthiness of social data [29]. Therefore, an efficient provenance data management system is required to trace provenance information through provenance capturing and querying for social data generated from various social media platforms. The importance of social data provenance in social media is also presented in [21, 36, 46], along with several key challenges such as measuring the quality and truthfulness of social data, provenance storage, and provenance querying [7, 9, 14, 15, 23, 49, 53]. Several research works examine the suitability of NoSQL databases for managing big social data with efficient storage, fast querying, and horizontal scalability [20, 33]. Different approaches have been proposed to model huge volumes of Twitter data in the Apache Cassandra NoSQL database for efficient querying [11, 26, 40].
A provenance data model for data-intensive workflows is proposed in [13] to capture provenance information for MapReduce workflows using the Kepler–Hadoop framework. The proposed provenance model is a good starting point for scientific workflows; however, it is not very efficient in terms of storage space and query execution overhead.
In line with the provenance data model for scientific workflows, the RAMP model is proposed in [28, 39] for Generalized Map and Reduce Workflows (GMRWs), using a wrapper-based approach for provenance capturing and tracing. In this model, all the transformations are either map or reduce functions, rather than one map function followed by one reduce function. Further, the HadoopProv model [2] is introduced for provenance tracking in MapReduce workflows, where provenance tracking takes place in the map and reduce phases separately and construction of the provenance graph is deferred to the query stage to minimize the temporal overhead. Several web-based tools [30, 42] have been developed to capture pre-defined provenance attributes such as name, gender, religion, and location from the different social networking accounts associated with a particular Twitter user. Although these attributes capture the details of a social media user, they provide neither a provenance path nor a propagation and update history for social content published on a social media platform. Further, a provenance path algorithm [25] is proposed to capture the provenance path of a piece of information and to explain how it propagates in a social network, but only to a few known recipients. To reconstruct and integrate the provenance of messages in social media, a workflow provenance model, PROV-SAID [16, 51, 52], based on the W3C PROV data model is proposed. Although the proposed solution identifies posted tweets that are copied from other published tweets without giving credit to the original tweeter (as a retweet would), it is suitable for small datasets only. Applications of the standard PROV-DM model are proposed in [27] to manage provenance data for bioinformatics workflows in a cloud computing environment using different families of NoSQL databases.
A provenance framework based on the algebraic structure of semirings is presented in [41] for three specific graph algorithms; it computes the provenance of regular path queries (RPQs) over graph databases by applying annotations such as labels and weight functions, which is quite a complex process. A provenance model for vertex-centric graph computation and a declarative Datalog-based query language are presented in [38] to capture and query graph analytics provenance in both online and offline modes. Further, a provenance model for stream processing systems (s2p) is proposed in [55]. Although this model is suitable for capturing fine-grained (operator-level) and coarse-grained (process-level) provenance through online and offline parts, it does not provide provenance support for historical queries.
To satisfy the need for big data provenance, a rule-based framework for provenance identification and collection from log files is proposed in [22]. The proposed framework reduces source code instrumentation, yet raises several questions about the completeness of provenance information, as logs may not capture complete information, including the derivation process. Another big data provenance framework is proposed in [10] for provenance collection and storage in an unstructured or semi-structured format for scientific applications. It is lightweight and built on a multi-layered provenance architecture that supports a wide range of provenance queries. A provenance model for Apache Cassandra, a key-value pair database, is proposed in [31, 32] to capture provenance information using provenance policies. In this model, provenance querying is performed through resource expressions and a set of predefined operators. The model is implemented on a small patient information system and uses the legacy Thrift APIs rather than CQL3, which makes queries difficult to write.
Various change data capture (CDC) schemes are investigated in [48] for Apache Cassandra to track modifications in source data. The logic of each scheme is implemented in Cassandra by combining a MapReduce framework with distributed computing. A layer-based architecture for provenance collection and querying in scientific applications is presented in [1], which stores semi-structured provenance documents in MongoDB in BSON format. The proposed architecture works well for simple queries but does not respond efficiently to complex queries.
From the available literature, it is evident that most existing provenance models are suitable for capturing provenance for workflows at the coarse-grained level only, rather than at the fine-grained level. Moreover, some of them are not suitable for capturing provenance information for large social media data sets across all types of query sets. In this paper, we try to bridge this gap by designing an efficient big social data provenance framework on top of a key-value pair database for capturing, storing, and querying provenance information for different query sets, including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries, on live-streamed big social data. A summary of the characteristics of existing provenance solutions for social data and of our proposed BSDP framework is given in Table 1.
Table 1
Summary of characteristics of existing provenance solutions for Social Data
(Columns marked "Capture" fall under the header "Provenance capture"; columns marked "Visualization" fall under the header "Provenance visualization".)

| Provenance model | Data model design | Provenance granularity | Capture: select query | Capture: aggregate query | Capture: standing query | Capture: update query | Capture: insert query | Capture: delete query | Visualization: justifying query results | Visualization: historical data | Application domain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RAMP [39] (2011) | No | Workflow level | Yes | Yes | No | No | No | No | Yes | No | Generic |
| Web-based tool [42] (2013) | No | Fine-grained level (pre-defined attributes of social media user) | Yes | No | No | No | No | No | Yes (limited) | No | Generic |
| Seeking provenance paths [25] (2013) | No | Fine-grained level | Yes | No | No | No | No | No | Yes | No | Generic |
| Hadoop-Prov [2] (2013) | No | Workflow level | Yes | Yes | No | No | No | No | Yes | No | Generic |
| Milieu [10] (2013) | No | Workflow level | Yes | Yes | No | No | No | No | No | No | Generic |
| KVPM [32] (2013) | No | Fine-grained level (column level) | No | No | No | Yes | No | No | No | Yes | Application specific (patient information system) |
| Layer based architecture [1] (2014) | No | Workflow level | Yes | No | No | Yes | No | No | No | Yes | Generic |
| Semiring provenance [41] (2015, 2018) | No | Fine-grained level | Yes | No | No | No | No | No | Yes | No | Limited for three specific graph algorithms |
| PROV-SAID [16, 51, 52] (2018) | No | Workflow level | Yes | No | No | No | No | No | Yes | No | Application specific |
| SFM [30] (2019) | No | Fine-grained level (only metadata information of tweet) | Yes | No | No | No | No | No | Yes (limited) | No | Generic |
| Ariadne [38] (2019) | No | Fine-grained level | Yes | No | No | No | No | No | Yes | No | Vertex centric graph analysis |
| s2p [55] (2021) | No | Fine-grained (operator level like map, reduce, filter, window) | Yes | Yes | No | No | No | No | Yes | No | Generic |
| Proposed (BSDP) framework | Yes | Fine-grained level | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Generic |
Proposed provenance framework
In this paper, we propose a Big Social Data Provenance (BSDP) framework built upon a Zero-Information Loss Key-Value Pair Database (ZILKVD) that efficiently captures provenance for all queries, including select, aggregate, standing/historical, and data update queries with insert, delete, and update operations. ZILKVD [45] is developed based on the concept of the Zero-Information Loss Database (ZILD) [4, 43, 44, 46]. The proposed framework is very beneficial in tracing the origin and derivation history of a query result. It also supports provenance querying for historical data. The major steps involved in designing the proposed provenance framework are as follows:
- Fetching a huge volume of real-life social data from Twitter’s network through live streaming using the Twitter Streaming APIs.
- Designing an efficient key-value pair (KVP) data model based on a query-driven approach that correlates big social data through relationships and dependencies, in formats appropriate for further analysis.
- Designing the ZILKVD architecture with data version support to maintain all insert, delete, and update operations in the form of provenance data, which aids historical data queries and standing queries.
- Proposing three provenance generation algorithms, viz. SelectProv, AggreProv, and StandProv, to generate provenance information for select, aggregate, and standing/historical queries, and to store the captured provenance in ZILKVD.
- Providing provenance querying support for historical data and for tracing the origin and derivation history of a query result.
In addition to the above tasks, a performance analysis of the proposed provenance capturing and querying algorithms is also presented for different query sets.
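The zero-information-loss idea behind the version-support step can be illustrated with a small sketch. The Python code below is not ZILKVD itself; it is an in-memory toy under stated assumptions: the class name `ZeroLossStore`, its method names, and the use of integer logical timestamps are all hypothetical. It shows how insert, update, and delete operations can be appended to a per-cell history instead of overwriting data, so that a historical query can ask for the value visible at a past time.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Version:
    op: str               # "insert", "update", or "delete"
    value: Optional[Any]  # None for a delete marker
    timestamp: int        # logical time of the operation (assumed integer)


class ZeroLossStore:
    """Append-only cell history: updates and deletes never overwrite old data."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Version]] = {}

    def _append(self, row_key: str, column: str, version: Version) -> None:
        versions = self._history.setdefault((row_key, column), [])
        if versions and version.timestamp < versions[-1].timestamp:
            raise ValueError("timestamps must not go backwards for a cell")
        versions.append(version)

    def insert(self, row_key: str, column: str, value: Any, ts: int) -> None:
        self._append(row_key, column, Version("insert", value, ts))

    def update(self, row_key: str, column: str, value: Any, ts: int) -> None:
        self._append(row_key, column, Version("update", value, ts))

    def delete(self, row_key: str, column: str, ts: int) -> None:
        # A delete is recorded as a marker; earlier versions stay queryable.
        self._append(row_key, column, Version("delete", None, ts))

    def value_at(self, row_key: str, column: str, ts: int) -> Optional[Any]:
        """Historical query: the value visible at time ts, or None."""
        visible: Optional[Version] = None
        for version in self._history.get((row_key, column), []):
            if version.timestamp > ts:
                break
            visible = version
        if visible is None or visible.op == "delete":
            return None
        return visible.value

    def history(self, row_key: str, column: str) -> List[Version]:
        """Full operation history of one cell, oldest first."""
        return list(self._history.get((row_key, column), []))


store = ZeroLossStore()
store.insert("user42", "location", "Pilani", ts=10)
store.update("user42", "location", "Ames", ts=20)
store.delete("user42", "location", ts=30)
```

With this history in place, `store.value_at("user42", "location", 15)` looks up the value as it stood between the insert and the update, while `store.history(...)` exposes every operation as provenance data.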
Social data streaming (Twitter case study)
Over the past few years, more than 90% of all data has been contributed by the heavy use of various social media platforms. Leading platforms such as Twitter, Facebook, and Instagram are responsible for this mammoth volume of data. Of these platforms, Twitter is one of the richest mines of specific, publicly available, and retrievable social data, and it allows users to share their thoughts with a massive audience. It is tuned for very fast communication over the internet, with more than 150 million active users publishing approximately 500 million tweets daily. A Twitter user can either create a tweet or retweet information that has already been tweeted by another user. A Twitter user can also choose to follow other users. For instance, if user A follows user B, then user A can see B’s tweets in A’s ‘timeline’. Twitter’s popularity as a massive source of information has led to research in various domains [34]. Researchers can obtain this information from Twitter through the publicly available Twitter APIs. These APIs fall into two categories: REST APIs for conducting specific searches, reading user profiles, or posting new tweets, and Streaming APIs for collecting a continuous stream of public information. In our framework, we use the Streaming APIs to continuously stream tweets and related information whenever a new tweet is published, as shown in Fig. 1.
Fig. 1
Twitter data streaming
Twitter provides an open standard for authorization known as Open Authorization (OAuth). This mechanism allows controlled and limited access to protected information. Traditional authentication mechanisms are vulnerable to credential theft, while OAuth provides a more secure approach that does not require the application to use the user’s username and password. Using a three-way handshake, it allows users to grant third parties access to their data. Because the user’s Twitter password is never shared with the third-party application, the user’s confidence in the application is also improved. Twitter APIs can only be accessed by a Twitter application using the OAuth authorization mechanism. To get authorization for accessing protected data, the user first creates a Twitter application, also known as the consumer. After this application is registered on Twitter, Twitter issues it a consumer key and a consumer secret key that uniquely identify the application. Using the consumer key and consumer secret key, the application creates a unique Twitter link through which the user authenticates to Twitter. After verifying the user’s identity, Twitter issues an OAuth verifier to the user. The application uses this OAuth verifier to request an Access Token and Access Token Secret that are unique to the user. The Twitter application then authenticates the user on Twitter using the Access Token and Access Token Secret and makes API calls on behalf of the user; see Fig. 2. Using these access credentials, we fetched all the tweets related to a specific event through live streaming to design an efficient key-value pair data model, as explained in Algorithm 1 (i.e., TweetCassandra). The two inputs to the algorithm are (1) the Twitter API access credentials, i.e., Consumer Key, Consumer Secret Key, Access Token, and Access Token Secret; and (2) the event name (E) for which related tweets are to be fetched.
An efficient query-driven KVP data model in Apache Cassandra is obtained as the output of this algorithm. After successful authorization on Twitter’s network using the access credentials, the tweet set (T) related to the input event (E) is fetched through live streaming of social data (line 1 of the algorithm). Then, each fetched tweet (t) of the tweet set is pre-processed to extract the following information: the User (U) who posted the tweet, the Hashtags (H) and Mentioned Users (M) in the tweet, the Tweet Body (T) in UTF encoding, and other related information (lines 2 and 3). Simultaneously, for each User (U), the following related information is also extracted from the user’s Twitter profile: the list of the user’s friends, the list of the user’s followers, and the user’s profile attributes such as user_name, screen_name, profile creation date, Twitter id, and location (line 4). Similarly, the friend details and follower details of each user are extracted (lines 5 to 10). Finally, all the extracted information is stored in the corresponding column families of Apache Cassandra in an appropriate format. This information is continuously streamed and populated into different column families to build an effective query-driven KVP data model for efficient queries.
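The pre-processing step described above can be sketched in Python as follows. This is an illustration only, not Algorithm 1 itself: the function names `preprocess_tweet` and `matches_event` are hypothetical, no network access or authorization is performed, and the input is assumed to be a dictionary shaped like a tweet object from the Twitter v1.1 API (fields `id_str`, `text`, `user`, and `entities`). UTF-8 is assumed for the body encoding.

```python
from typing import Any, Dict


def preprocess_tweet(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the posting user, hashtags, mentions, and body from one tweet."""
    entities = tweet.get("entities", {})
    user = tweet["user"]
    return {
        "tweet_id": tweet["id_str"],
        "user": user["screen_name"],
        # Hashtags are lower-cased so that event matching is case-insensitive.
        "hashtags": [tag["text"].lower() for tag in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        # Body stored as UTF-8 bytes (assumed encoding).
        "body_utf8": tweet["text"].encode("utf-8"),
        "profile": {
            "user_name": user.get("name"),
            "screen_name": user["screen_name"],
            "location": user.get("location"),
        },
    }


def matches_event(record: Dict[str, Any], event: str) -> bool:
    """Keep a tweet only if the event name appears among its hashtags."""
    return event.lower() in record["hashtags"]


# A hand-written sample shaped like a streamed tweet (not real data).
sample = {
    "id_str": "1001",
    "text": "Tracing #Provenance with @bob",
    "user": {"name": "Alice", "screen_name": "alice", "location": "Port Blair"},
    "entities": {
        "hashtags": [{"text": "Provenance"}],
        "user_mentions": [{"screen_name": "bob"}],
    },
}

record = preprocess_tweet(sample)
```

In a full pipeline, each `record` that passes `matches_event` would then be written to the corresponding column families; friend and follower lists would require additional API calls that this sketch omits.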
Fig. 2
Open authentication process of Twitter
KVP data model design in Apache Cassandra
The rapid proliferation of social data poses several challenges in the field of social data analytics, such as efficient data model design and querying techniques, and traditional data management and processing tools are unable to handle this ever-growing data. Relational (SQL) databases are ideal for structured data and predictable workloads but do not scale well to big data, which encompasses not just structured data but also semi-structured and unstructured data. Social media platforms are currently the major source of unstructured data. Unstructured data is characterized by an ad hoc schema and therefore does not fit well into SQL databases. The growth of unstructured data has led to interest in NoSQL databases, which are much better suited owing to their flexible schema. NoSQL represents a family of databases that differ considerably from one another; their main commonality is that they use a data model whose structure differs from the traditional row-column relational model of RDBMSs. The four kinds of NoSQL databases are graph, document, column-oriented, and key-value pair databases. The basic architecture of a KVP database consists of a two-column hash table in which each row contains a unique id known as a “key” and a “value” associated with this key. KVP databases are a good choice for handling extremely high volumes of data in a distributed processing environment, as they have built-in redundancy that can tolerate the loss of storage nodes. The key strengths of KVP databases are their simplicity, scalability, and a very fast, streamlined architecture, which enable extremely fast read and write operations. Apache Cassandra is one of the most popular KVP databases in the NoSQL family.
It is a distributed column family store in which each column is a data structure that contains a key, a value, and a timestamp; therefore, it is also named as key-value pair column-oriented data store, see Fig. 3. The brief introduction of elementary components of information in Apache Cassandra is given below:
Fig. 16
Cassandra column, row and column-family structure
Column: A column is the smallest unit of information and contains a key, a value, and a timestamp.
Super Column: A super column, or composite column, is a group of similar columns, or of columns likely to be queried together, under a common name.
Row: A row is a group of orderable columns, i.e., columns stored in sorted order by column name, with a unique row key (primary key) that uniquely identifies the data.
Column Family: A column family is similar to a table in a relational database but has no pre-defined schema, so different rows may have different numbers of columns. Column families are stored in separate files on disk.
Keyspace: A keyspace is the highest level of information in Apache Cassandra, analogous to a database in a relational system; it is a set of related column families. It also maintains the information about data replication and the replication strategy on nodes.
Although Apache Cassandra is known for flexible data management and for handling some of the world’s biggest datasets on multi-node clusters deployed across different data centres, one of the major challenges that big social data applications face when choosing it is data model design, which differs significantly from traditional data model design methodologies. The traditional methodology (i.e., the one used in relational databases) is purely data-driven. In contrast, data model design for Cassandra begins with application-specific queries and is purely query-driven. Several SQL constructs, such as table joins, are not supported by Cassandra Query Language (CQL), and its support for data aggregation is limited. Therefore, data modelling in Cassandra relies on denormalization of the database schema, so that a complex query can retrieve the required information from a single column family.
As a result, data duplication across Cassandra column families is common in order to support a variety of queries. Database schema design for big social data in Cassandra requires an understanding not only of the relationships and dependencies among social data but also of how this data will be accessed. In this paper, we applied a query-driven methodology to KVP data model design. By query-driven, we mean designing the data model on the basis of the types of queries the database will be required to support. This approach provides the sequence of tasks and also helps determine what data will be needed and when. In our proposed framework, we designed a query-driven data model based on the frequent queries to be executed on the Twitter dataset. Initially, all the tweets posted by different Twitter users in response to a particular event are fetched through Twitter’s Streaming APIs. Not all of this information is useful for our data model; therefore, only the required information, viz. tweet id, tweet text, tweet published date, hashtags, user_name, screen_name, profile created date, twitter id, location, friend list, follower list, etc., is extracted from the input list of tweet objects. Simultaneously, the extracted data is pre-processed to convert it into the required format. Afterwards, the pre-processed data is stored in different column families of Apache Cassandra. A snapshot of the KVP data model design in Apache Cassandra is given in Fig. 4. The proposed data model contains a keyspace named “NewTwitter_Keyspace” that consists of 20 column families. The column names of these column families, with their row keys, are also shown in Fig. 4. All 20 column families are organized on the basis of the social data set fetched from Twitter’s network to support different query sets for capturing, storing, and querying provenance.
Cassandra Query Language (CQL) is used for querying and to communicate with Apache Cassandra.
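To make the query-driven, denormalized design concrete, the sketch below writes one pre-processed tweet into several column families, each keyed for the query it serves. It is an in-memory illustration under stated assumptions: nested dictionaries stand in for Cassandra column families, the function name `fan_out` and the sample tweet are hypothetical, and only three of the 20 column families are shown, with the row keys the paper mentions for them.

```python
from datetime import datetime
from typing import Any, Dict

# column family name -> {row key tuple -> stored value}
Store = Dict[str, Dict[tuple, Any]]


def fan_out(tweet: Dict[str, Any], store: Store) -> None:
    """Write one tweet into every column family that serves a query."""
    name, tid = tweet["screen_name"], tweet["tweet_id"]
    ts: datetime = tweet["published_date"]
    # Query: hashtags used by a user -> keyed by (screen_name, tweet_id).
    store.setdefault("user_tweet_hashtag", {})[(name, tid)] = list(tweet["hashtags"])
    # Query: tweets by a user on a day -> keyed by (screen_name, day, date).
    store.setdefault("tweets_user_day", {})[(name, ts.day, ts)] = tweet["tweet_body"]
    # Query: tweets per day -> keyed by (day, date).
    store.setdefault("tweets_day", {})[(ts.day, ts)] = tweet["tweet_body"]


store: Store = {}
fan_out(
    {
        "screen_name": "some_user",
        "tweet_id": "1",
        "published_date": datetime(2019, 10, 8, 9, 30),
        "hashtags": ["Vikramlander"],
        "tweet_body": "made-up tweet text",
    },
    store,
)
print(sorted(store))
```

The same tweet body is deliberately stored twice; this duplication is the price of answering each query from a single column family.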
Fig. 17
KVP data model in Apache Cassandra
ZILKVD architecture design
The proposed provenance framework for big social data is developed on top of a Zero-Information Loss Key-Value Pair Database (ZILKVD). ZILKVD is designed using the concept of the Zero-Information Loss Database [4] to record all insert, delete, and update operations as provenance data without losing any information. The architecture of ZILKVD consists of the following components: Query Parser, Query Rewriter, Query Generator, Processing Module, and KVP Database, see Fig. 5. When a user issues a query, it is sent to the Query Parser, which parses the query and identifies its type, i.e., Insert (I), Update (U), or Delete (D).
Fig. 18
Zero-information loss KVP database architecture
If the issued query is an “Insert Query” (i.e., it inserts a new row in the database), the parsed results are sent to the Query Rewriter, which generates the corresponding rewritten insert query. In the rewritten query, the “valid_from” column of the new row in the corresponding column family is set to the current date/time, and the query is then sent to the KVP database for execution.
If the issued query is a “Delete Query” (i.e., it deletes an existing row from the database), the parsed results are sent to the Query Generator, which generates a corresponding update query. In this update query, the “valid_to” column of the row to be deleted is set to the current date/time, and the query is then sent to the KVP database for execution.
If the issued query is an “Update Query” (i.e., it updates an existing row in the database), the parsed results are sent to both the Query Generator and the Processing Module. The Query Generator first generates a select query, which is executed on the KVP database to retrieve the values of the primary key columns of the row to be updated, the old value of the column before the update, and its write time in the database. This information is sent to the Processing Module, which generates the corresponding Provenance Path Expression (ProvPathExp) in the format “Keyspace/Column_Family/RowKey/Update_Column_Name” and sends it back to the Query Generator.
The Query Generator then generates an insert query that records the query statement, ProvPathExp, old_value, the write time of old_value (i.e., its valid_from time), new_value, the current date/time, etc., in the “update_provenance” column family.
Afterwards, both queries (i.e., the generated insert query and the issued update query U) are executed on the KVP database to maintain the complete history of data update operations. Finally, the Query Id, Query Statement, its time of execution, etc., are inserted in the “query_table” column family through an insert query executed on the KVP database.
The high-level details of the implementation of the ZILKVD design are given in Algorithms 2 and 3. Algorithm 2 takes two inputs, (1) a KVP database and (2) a query Q (i.e., an insert, delete, or update query), and outputs a ZILKVD database with the complete history maintained.
According to Algorithm 2, the input query Q is first parsed to retrieve the required information, i.e., the parsed result, and to identify the query type (line 1). If Q is an insert query, the corresponding rewritten insert query is generated and sent for execution on the database (lines 3 and 4). If Q is a delete query, a corresponding update query is generated and sent for execution on the database (lines 6 and 7). If Q is an update query, Algorithm 3, i.e., UpdateCassProv, is called (line 9). Algorithm 3 takes the query Q and its parsed result as inputs and produces the provenance path expressions of the updated columns, together with the updated “query_table” and “update_provenance” column families, as outputs. According to Algorithm 3, all the required information, such as the keyspace (KS), column family (CF), and primary key (PK), is retrieved from the parsed result (line 1).
If Q contains a “Where Clause”, the primary key value is retrieved and assigned to RK to uniquely identify a row (lines 2 to 4). Afterwards, a corresponding select query is generated and executed to retrieve the old value of the column before the update and its write time in the database (lines 5 and 6). The provenance path (i.e., KS/CF/RK/CNu) is then generated, and a record containing Q, the provenance path, the old value and its write time, the new value, and the current date/time, among other fields, is added to the “update_provenance” column family (lines 7 and 8).
Similarly, if Q does not contain a “Where Clause”, the select query is again generated and executed, and all its results are stored in RS (line 11). For each result tuple r of RS, the required values are retrieved and the primary key value is assigned to RK. The corresponding provenance path (i.e., KS/CF/RK/CNu) is then generated, and the “update_provenance” column family is updated with the same fields as above (lines 12 to 16).
Finally, Q is executed and the “query_table” column family is updated with the query id, the query statement Q, the current date/time, etc. (lines 19 and 20). A demonstration of the above algorithms with illustrative Example Query 1 is given below:
Example Query 1: Update the location of the user with name “DDNewsAndhra”.
update user_details set location='Andhra' where screen_name='DDNewsAndhra';
Initially, Example Query 1 is passed as the input query (Q) to Algorithm 2, where it is parsed to identify its type (i.e., Update Query) and to retrieve the required information. Both the query (Q) and its parsed result are then passed as inputs to Algorithm 3. The provenance path expression (i.e., ProvPathExp) of the updated tuples, along with the updated column families “query_table” and “update_provenance” of the underlying KVP database, is obtained as the output of the algorithm. A snapshot of the “update_provenance” column family is shown in Fig. 6.
Fig. 19
A snapshot of “update_provenance” column family
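The insert, delete, and update handling described in this section can be sketched with a small in-memory stand-in for the KVP database. This is only an illustration of the zero-information-loss idea under stated assumptions: the class `ZilStore`, its method names, the record field names, and the value "old place" are hypothetical, timestamps are passed in explicitly instead of being read from the clock, and only the update path logs to `query_table` because that is the only path for which the text states it.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple


class ZilStore:
    """Toy in-memory stand-in for a KVP database that never discards data."""

    def __init__(self, keyspace: str) -> None:
        self.keyspace = keyspace
        # (column_family, row_key) -> {column: (value, writetime)}
        self.rows: Dict[Tuple[str, str], Dict[str, Tuple[Any, datetime]]] = {}
        self.update_provenance: List[Dict[str, Any]] = []
        self.query_table: List[Dict[str, Any]] = []

    def insert(self, cf: str, rk: str, values: Dict[str, Any], now: datetime) -> None:
        # Rewritten insert: the new row carries valid_from = now, valid_to unset.
        row = {col: (val, now) for col, val in values.items()}
        row["valid_from"] = (now, now)
        row["valid_to"] = (None, now)
        self.rows[(cf, rk)] = row

    def delete(self, cf: str, rk: str, now: datetime) -> None:
        # A delete becomes an update that closes the row's validity interval.
        self.rows[(cf, rk)]["valid_to"] = (now, now)

    def update(self, cf: str, rk: str, col: str, new: Any,
               now: datetime, stmt: str) -> str:
        old, old_writetime = self.rows[(cf, rk)][col]   # the generated select
        path = f"{self.keyspace}/{cf}/{rk}/{col}"       # ProvPathExp
        self.update_provenance.append({
            "query": stmt, "path": path, "old_value": old,
            "old_value_writetime": old_writetime,
            "new_value": new, "updated_at": now,
        })
        self.rows[(cf, rk)][col] = (new, now)           # the issued update
        self.query_table.append({"query_id": f"q{len(self.query_table) + 1}",
                                 "query": stmt, "executed_at": now})
        return path


db = ZilStore("NewTwitter_Keyspace")
db.insert("user_details", "DDNewsAndhra", {"location": "old place"}, datetime(2019, 10, 1))
path = db.update(
    "user_details", "DDNewsAndhra", "location", "Andhra", datetime(2019, 12, 1),
    "update user_details set location='Andhra' where screen_name='DDNewsAndhra';",
)
db.delete("user_details", "DDNewsAndhra", datetime(2019, 12, 2))
print(path)  # NewTwitter_Keyspace/user_details/DDNewsAndhra/location
```

After these three calls the row is still physically present, its old location survives in `update_provenance`, and its `valid_to` marks when it was logically deleted.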
Provenance generation algorithm
We designed and implemented three provenance generation algorithms for select, aggregate, and standing queries, respectively. The high level details of all the algorithms along with their illustrative example queries are given in the following subsections:
Provenance generation for select queries
The proposed framework supports capturing provenance information for select queries. The high-level details of the provenance generation algorithm for select queries, i.e., “SelectProv”, are given in Algorithm 4. The algorithm takes a select query and its query id as inputs. Its outputs are a comma-separated list of provenance path expressions (P) for each value in every result tuple of the query result, together with the updated column families “select_provenance” and “query_table”. Initially, the query is parsed, and the keyspace (KS), column family (CF), primary key (PK), column names (CN), etc., are retrieved from the query statement as the parsed result (lines 1 and 2). Then, a rewritten select query is generated by appending a predicate on “valid_to” to the query statement (line 3); this predicate is set to null so that only currently existing rows are retrieved. Afterwards, the rewritten query is executed and all its result tuples are stored in the record set (RS) (line 4). For each result tuple r of the record set, a unique result tuple id is generated from the query id (lines 5, 7 and 17). Initially, P is set to null for all columns of each result tuple (line 8). Then, the primary key value is retrieved from the result tuple and assigned to RK (lines 9 and 10). After that, for each non-key column of r, a provenance path expression (i.e., KS/CF/RK/CN) is generated, added to the corresponding r, and appended to P (lines 11 to 14). A provenance path expression consists of a keyspace name, column family, row key, and column name in the form “keyspace/columnfamily/rowkey/columnname”. It provides detailed provenance for each result tuple in the query result at different granularity levels, i.e., it shows how a value in a result tuple is derived.
Finally, the column families “select_provenance” and “query_table” are also updated (lines 14 to 20). A demonstration of Algorithm 4 with illustrative Example Queries 2 and 3 is given below:
Example Query 2: Display the location of the user with Screen_Name ‘Gagan4041’.
select location from user_details where screen_name='Gagan4041';
The result of this select query contains two columns, “LOCATION” and “LOCATION_PROVENANCE”, with values “India” and “[NewTwitter_Keyspace/user_details/Gagan4041/location]”, respectively. The value under “LOCATION_PROVENANCE” justifies the query result “India”: it explains that the value in the result set is derived from keyspace NewTwitter_Keyspace, column family user_details, row key Gagan4041, and column location.
Example Query 3: Display all hashtags used in the tweets posted by the user with Screen_Name ‘mkzangid’.
select hashtag from user_tweet_hashtag where screen_name='mkzangid';
The result of this query is shown in Fig. 7. It shows that the user “mkzangid” used the hashtag “Vikramlander” in two tweets, with tweet ids “1181510817377767426” and “1181512471518990342”. The provenance path expression under the column “Hashtag_Provenance” shows how the value in the result set was derived: “Vikramlander” comes from two different rows with row keys (composite primary key of screen_name and tweet id) “mkzangid-1181510817377767426” and “mkzangid-1181512471518990342”.
Fig. 20
Example query 3 result
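The core of the select-provenance procedure, filtering to current rows and attaching a “keyspace/columnfamily/rowkey/columnname” path to every returned column, can be sketched as below. This is an in-memory approximation and not Algorithm 4 itself: rows are plain dictionaries, the where-clause is a Python predicate, the function name `select_prov` is hypothetical, the second sample row is made up, and the tuple id format follows the “q6t1” example that appears later in the paper.

```python
from typing import Any, Callable, Dict, List


def select_prov(keyspace: str, cf: str, rows: List[Dict[str, Any]], key_col: str,
                columns: List[str], where: Callable[[Dict[str, Any]], bool],
                query_id: str) -> List[Dict[str, Any]]:
    """Return matching current rows, each annotated with provenance paths."""
    results: List[Dict[str, Any]] = []
    for row in rows:
        # Rewritten predicate: only rows whose valid_to is unset are current.
        if row.get("valid_to") is not None or not where(row):
            continue
        rk = row[key_col]
        results.append({
            "tuple_id": f"{query_id}t{len(results) + 1}",
            "values": {c: row[c] for c in columns},
            "provenance": [f"{keyspace}/{cf}/{rk}/{c}" for c in columns],
        })
    return results


user_details = [
    {"screen_name": "Gagan4041", "location": "India", "valid_to": None},
    {"screen_name": "someone_else", "location": "Elsewhere", "valid_to": None},
]
out = select_prov("NewTwitter_Keyspace", "user_details", user_details,
                  "screen_name", ["location"],
                  lambda r: r["screen_name"] == "Gagan4041", "q6")
print(out[0]["tuple_id"], out[0]["provenance"])
```

Storing the returned records under the query id would correspond to the update of the “select_provenance” column family.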
Provenance generation for aggregate queries
The proposed framework also supports capturing provenance information for aggregate queries. The high-level details of the provenance generation algorithm for aggregate queries, i.e., “AggreProv”, are given in Algorithm 5. The algorithm takes an aggregate query and its query id as inputs and outputs a provenance vector (pv), in which pv[i] is the comma-separated list of provenance path expressions for the i-th result tuple of the query result. These provenance path expressions identify all the source rows and column names of a column family in a keyspace that contributed to the corresponding result tuple. The steps of this algorithm are very similar to those of Algorithm 4, i.e., “SelectProv”, except for the provenance vector. The provenance path is generated in the same way as in Algorithm 4; however, the algorithm iterates over all the source rows that contributed to one result row in order to build pv[i] (lines 13 to 21). Further, the provenance of the result tuples and the corresponding aggregate query are stored in the “select_provenance” and “query_table” column families, respectively. A demonstration of Algorithm 5 with illustrative Example Queries 4 and 5 is given below:
Example Query 4: Display the total number of tweets posted by the user “sunilthalia” on “08/10/2019”.
select count(tweet_body) from tweets_user_day where screen_name='sunilthalia' and published_day=8 and published_date>='2019-10-08' and published_date<'2019-10-09' group by screen_name allow filtering;
This aggregate query retrieves the total number of tweets posted by a specific user on a given day. It executes efficiently on the “tweets_user_day” column family, whose composite primary key is “screen_name, published_day, published_date”.
Figure 8 shows the partial result of this query: the total number of tweets posted by the given user on 08/10/2019 is 7 (under the column name “SYSTEM.COUNT(TWEET_BODY)”), along with a comma-separated list of provenance path expressions, including the column family name, for all 7 rows that contributed to the result (under the column name “SYSTEM.COUNT(TWEET_BODY)_PROVENANCE”).
Fig. 21
Example query 4 result
Example Query 5: Display the total number of tweets posted on each day in the month of October 2019.
select published_day, count(tweet_body) from tweets_day where published_date>='2019-10-01' and published_date<'2019-11-01' group by published_day allow filtering;
This aggregate query executes on the “tweets_day” column family, whose composite primary key is “published_day, published_date”, and counts the total number of tweets posted on each day of October 2019. The partial result is shown in Fig. 9: the total number of tweets posted on each day appears under the column name “SYSTEM.COUNT(TWEET_BODY)”, along with the day of posting and, under the column name “SYSTEM.COUNT(TWEET_BODY)_PROVENANCE”, a comma-separated list of provenance path expressions for all the rows that contributed to the aggregated result.
Fig. 22
Example query 5 result
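The provenance vector for a grouped count can be illustrated as follows. The sketch is an assumption-laden approximation of the idea behind “AggreProv”, not the algorithm itself: the function name `count_with_provenance` and the sample rows are hypothetical, composite row keys are joined with “-” as in the row keys quoted for Example Query 3, and rows with no value in the counted column are skipped in this sketch.

```python
from typing import Any, Dict, List, Tuple


def count_with_provenance(keyspace: str, cf: str, rows: List[Dict[str, Any]],
                          group_col: str, key_cols: List[str],
                          counted_col: str) -> Dict[Any, Tuple[int, List[str]]]:
    """Group rows, count them, and keep one provenance path per source row."""
    result: Dict[Any, Tuple[int, List[str]]] = {}
    for row in rows:
        if row.get(counted_col) is None:
            continue  # nothing to count for this row
        rk = "-".join(str(row[k]) for k in key_cols)
        count, pv = result.get(row[group_col], (0, []))
        path = f"{keyspace}/{cf}/{rk}/{counted_col}"
        result[row[group_col]] = (count + 1, pv + [path])
    return result


# Made-up rows of a "tweets_day"-like column family.
tweets_day = [
    {"published_day": 8, "published_date": "2019-10-08 09:30:00", "tweet_body": "a"},
    {"published_day": 8, "published_date": "2019-10-08 11:00:00", "tweet_body": "b"},
    {"published_day": 9, "published_date": "2019-10-09 07:15:00", "tweet_body": "c"},
]
per_day = count_with_provenance("NewTwitter_Keyspace", "tweets_day", tweets_day,
                                "published_day", ["published_day", "published_date"],
                                "tweet_body")
print(per_day[8][0], len(per_day[8][1]))
```

Each result row carries as many path expressions as it has contributing source rows, which is why aggregates over many rows cost more to annotate.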
Provenance generation for standing queries
The proposed framework also supports capturing provenance information for historical/standing queries using the data versioning support in ZILKVD. The high-level details of the provenance generation algorithm for standing queries, i.e., “StandProv”, are given in Algorithm 6. It takes a standing query (Q) along with its time of execution (t) as inputs and outputs a comma-separated list of provenance path expressions for all the result tuples in the query result. Initially, Q is parsed to retrieve the keyspace, column family, column names, primary key, etc. (lines 1 and 2). Afterwards, a rewritten select query is generated to retrieve the row key (RK), i.e., the values of the primary key columns of the column family, and each result tuple, with a predicate on “valid_to”. The value of this predicate is set to the time t given in the input, and the query is then executed on the database (lines 3 to 7). For every value in the result set of the rewritten query, its “writetime” (the time from which it exists in the database) is compared with t. If “writetime” is less than or equal to t, the provenance path expression is generated from the corresponding source row and column and added to the result tuple (lines 9 to 11). If “writetime” is greater than t, the corresponding column value and provenance path are retrieved from the “update_provenance” column family (lines 13 to 15). At the end, the column value and provenance path expression retrieved from the “update_provenance” column family replace the current ones in the result set, and the updated result set along with its provenance information is returned (lines 16 to 21).
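The writetime comparison at the heart of this procedure can be sketched for a single column value. The sketch makes several assumptions: the function name `value_as_of` and the record field names are hypothetical, the history is taken from the location values and timestamps listed in Tables 2 and 3, and time zones are ignored by treating all timestamps as naive local times.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


def value_as_of(current: Tuple[Any, datetime], path: str,
                update_provenance: List[Dict[str, Any]],
                t: datetime) -> Optional[Any]:
    """Return the value that was valid at time t for one provenance path."""
    value, writetime = current
    if writetime <= t:
        return value  # the current value already existed at time t
    # Otherwise look for the old version whose validity interval covers t.
    for rec in update_provenance:
        if rec["path"] == path and rec["old_value_writetime"] <= t < rec["updated_at"]:
            return rec["old_value"]
    return None  # the value did not exist yet at time t


path = "NewTwitter_Keyspace/user_details/MemeBaaaz/location"
history = [
    {"path": path, "old_value": "Meme Ki Duniya, India",
     "old_value_writetime": datetime(2019, 10, 2, 13, 33, 27),
     "updated_at": datetime(2019, 10, 23, 8, 20, 18)},
    {"path": path, "old_value": "Kolkata",
     "old_value_writetime": datetime(2019, 10, 23, 8, 20, 18),
     "updated_at": datetime(2019, 12, 17, 10, 22, 22)},
]
current = ("Mumbai", datetime(2019, 12, 17, 10, 22, 22))
print(value_as_of(current, path, history, datetime(2019, 10, 23, 8, 22, 16)))  # Kolkata
```

A standing query would apply this check to every column value in its result set and attach the matching provenance path to each.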
Provenance storage
In our proposed framework, all the captured provenance is stored for further analysis in the following three column families of Apache Cassandra: “query_table”, “select_provenance”, and “update_provenance”, see Fig. 10. The “query_table” column family stores provenance information about all executed queries, with their query ids and times of execution. The “select_provenance” column family stores the provenance path expressions for all the result tuples of select/aggregate queries, along with their query statement, result tuple id, and time of execution, as shown in Fig. 11. Similarly, the “update_provenance” column family keeps provenance information about all update operations, with the following attributes: query statement, provenance path expression, old value and its write time, new value, column type, and time of update (current date/time), see Fig. 6. The captured provenance is used in source tracing, update tracking, and querying historical data. Further, visualizing this provenance data helps in analysing and assessing the trustworthiness of a query result.
Fig. 23
Provenance storage
Fig. 24
A snapshot of select_provenance column family
Provenance querying
The proposed framework also supports querying provenance information for purposes such as audit trails, update tracking, source tracing, and data discovery. Provenance querying on the captured provenance serves two objectives: first, to determine how a result tuple of a select query was derived, i.e., querying provenance to find the source of information; and second, to track all the updates performed on a given data item, i.e., querying provenance for historical data. The framework provides two column families for these tasks, viz. “select_provenance” and “update_provenance”. The “select_provenance” column family stores the provenance path expressions for all the result tuples of select/aggregate queries, along with their query statement, result tuple id, and time of execution. This provenance information is used to find the source of information, as shown in Fig. 11. Similarly, the “update_provenance” column family stores provenance information about all the update operations performed, with the following parameters: query statement, provenance path expression, old value and its write time, new value, column type, and time of update (current date/time). This provenance information is used in provenance querying for historical data, see Fig. 6. In addition, the “query_table” column family is used in provenance querying to obtain information about all the queries executed up to a particular date, with their times of execution. Illustrative examples of provenance querying are given below:
Example provenance query 1: Explain how result tuple q6t1 of query q6 (as shown in Fig. 11) is derived.
This query is executed on the “select_provenance” column family to retrieve the provenance path expressions for result tuple q6t1 of query q6, along with its time of execution. Here, the provenance path expression of the result tuple is “[NewTwitter_Keyspace/user_details/Gagan4041/location]” and the time of query execution is “2019-12-16 05:02:34.266000+0000”. This indicates that the source keyspace of the tuple is “NewTwitter_Keyspace”, the column family is “user_details”, the row key is “Gagan4041”, and the column name is “location”. The “user_details” column family is then queried with this row key, column name, and execution time to retrieve all the rows that contributed to result tuple t1 of query q6, which justifies the result tuple. If the source has been modified after query execution, the original source can still be derived by querying historical data. To support provenance querying for historical data, we designed four User-Defined CQL Constructs (UDCs), viz. “all”, “instance”, “validon now”, and “validon date”. These constructs fall into two categories, viz. T1 (“all”, “instance”) and T2 (“validon now”, “validon date”).
The high-level details of the provenance querying algorithm for historical data, i.e., “QueryProv_HistData”, are given in Algorithm 7. It takes an extended query (i.e., a CQL query with UDCs) as input and returns the corresponding result set (RS) of historical data. In the beginning, the extended query is sent to the Query Parser to retrieve the UDCs (T1 and T2) used in it, along with the underlying CQL query Q (i.e., the CQL query without UDCs) and the parsed result (R) (lines 1 and 2). In addition, other information such as the Keyspace Name (KS), Column Family (CF), Primary Key (PK), and Column Name (CN) associated with Q is extracted from R (line 3). Query Q then executes on the related column families to retrieve the required historical data according to the following conditions (lines 4 to 16):
If UDCs T1 and T2 are “instance” and “validon now”, respectively, then query Q executes only on the column families mentioned in the issued query statement (lines 4 and 5).
If UDCs T1 and T2 are “instance” and “validon date”, respectively, then the “write time” of the current value is first fetched and compared with “validon date”. If the “write time” of the current value is less than “validon date”, then query Q executes only on the column families mentioned in the issued query statement; otherwise, it executes on “update_provenance” (lines 6 to 10).
If UDCs T1 and T2 are “all” and “validon now”, respectively, then query Q executes on both “update_provenance” and the column families mentioned in the issued query statement to retrieve the complete history of all the updates of a column value (lines 13 to 16).
Similarly, if UDCs T1 and T2 are “all” and “validon date”, respectively, then the “write time” of the current value is again fetched and compared with “validon date”. If the “write time” of the current value is less than “validon date”, then query Q executes on both “update_provenance” and the column families mentioned in the issued query statement; otherwise, it executes only on “update_provenance” (lines 13 to 16).
Demonstrations of Algorithm 7 with illustrative example provenance queries 2, 3, 4 and 5 are given below:
Example provenance query 2: Display all the location updates of a specific user named ‘MemeBaaaz’ till now.
select all location from user_details where screen_name='MemeBaaaz' validon now;
The extended query is parsed first to retrieve the UDCs used in it, i.e., “all” and “validon now”. The CQL query Q is then executed on the “user_details” and “update_provenance” column families to retrieve all the location updates of the user “MemeBaaaz”.
The result of this provenance query is shown in Table 2.
Table 2
Example provenance query 2 result
Location | Valid_from
Meme Ki Duniya, India | Wed Oct 02 13:33:27 IST 2019
Kolkata | Wed Oct 23 08:20:18 IST 2019
Mumbai | 2019-12-17 10:22:22.0
Example provenance query 3: Display all the location updates of a specific user named ‘MemeBaaaz’ until 23/10/2019 9:50 AM.
select all location from user_details where screen_name='MemeBaaaz' validon 2019-10-23 09:50:16;
The result of provenance query 3, i.e., all the location updates until ‘2019-10-23 09:50:16’, is shown in Table 3.
Table 3
Example provenance query 3 result
Location | Valid_from
Meme Ki Duniya, India | Wed Oct 02 13:33:27 IST 2019
Kolkata | Wed Oct 23 08:20:18 IST 2019
Example provenance query 4: Display the current location of a specific user named ‘MemeBaaaz’.
select instance location from user_details where screen_name='MemeBaaaz' validon now;
Provenance query 4 returns the user’s current location, “Mumbai”, which is valid from “2019-12-17 10:22:22.0”.
Example provenance query 5: Display the location of a specific user named ‘MemeBaaaz’ on 23/10/2019 at 8:22:16 AM.
select instance location from user_details where screen_name='MemeBaaaz' validon 2019-10-23 08:22:16;
Provenance query 5 returns the user’s location on 23/10/2019 at 8:22:16 AM, “Kolkata”, which is valid from “Wed Oct 23 08:20:18 IST 2019” to “2019-12-17 10:22:21.0”.
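The handling of the user-defined constructs can be sketched in two steps: stripping “all”/“instance” and “validon …” from the extended query to recover the plain CQL, then selecting the versions that the constructs ask for. This is a simplified illustration, not Algorithm 7: the regular expression is a hypothetical grammar inferred from the example queries, the function names are assumptions, and the current value and the “update_provenance” history are merged into one ordered version list instead of being read from separate column families. Timestamps from Tables 2 and 3 are treated as naive local times.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical grammar: select <all|instance> ... validon <now|YYYY-MM-DD HH:MM:SS>
UDC = re.compile(
    r"^select\s+(all|instance)\s+(.*?)\s+validon\s+"
    r"(now|\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*;?$",
    re.IGNORECASE | re.DOTALL,
)

Version = Tuple[str, datetime]  # (value, valid_from)


def parse_extended(q: str) -> Tuple[str, Optional[datetime], str]:
    """Split an extended query into T1, the as-of time (None = now), and plain CQL."""
    m = UDC.match(q.strip())
    if m is None:
        raise ValueError("not an extended query")
    t1, body, t2 = m.groups()
    as_of = None if t2.lower() == "now" else datetime.strptime(t2, "%Y-%m-%d %H:%M:%S")
    return t1.lower(), as_of, f"select {body};"


def resolve(versions: List[Version], t1: str, as_of: Optional[datetime]) -> List[Version]:
    """'all' returns every version visible at as_of; 'instance' only the latest one."""
    visible = [v for v in versions if as_of is None or v[1] <= as_of]
    return visible if t1 == "all" else visible[-1:]


history = [  # ordered by valid_from
    ("Meme Ki Duniya, India", datetime(2019, 10, 2, 13, 33, 27)),
    ("Kolkata", datetime(2019, 10, 23, 8, 20, 18)),
    ("Mumbai", datetime(2019, 12, 17, 10, 22, 22)),
]
t1, as_of, cql = parse_extended(
    "select instance location from user_details "
    "where screen_name='MemeBaaaz' validon 2019-10-23 08:22:16;"
)
print(cql, resolve(history, t1, as_of))
```

With the history above, the four example provenance queries yield three versions, two versions, “Mumbai”, and “Kolkata”, respectively, in line with Tables 2 and 3 and the results reported for queries 4 and 5.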
Data set and evaluation
To evaluate the performance of the proposed framework, all experiments are performed on a single-node Apache Cassandra cluster running on an Intel i7-8700 processor @ 3.20 GHz with 16 GB RAM and a 1 TB disk. Apache Cassandra version 3.11.3 is used for the experiments. In the proposed framework, big social data is fetched from Twitter’s network through live streaming and modelled in Apache Cassandra. The data set consists of around 2.4 lakh (240,000) Twitter users, 2.1 lakh (210,000) user friends, 1.8 lakh (180,000) user followers, and related information such as tweet body, tweet id, the tweeter’s screen name, tweet created date, and the user’s personal information. The proposed key-value pair data model contains a keyspace named “NewTwitter_Keyspace” with 20 column families that store this volume of social data. On execution of each query, the provenance information is captured and stored in three column families, viz. “select_provenance”, “update_provenance”, and “query_table”, which gradually increases the size of the database. Java version 8 is used as the front-end programming language to interact with Cassandra and Twitter’s network. Cassandra Query Language (CQL) is used to query and communicate with Apache Cassandra. The performance of the proposed framework, in terms of provenance capturing overhead and provenance query execution time for different query sets including select, aggregate, data update, and provenance queries, is presented in the following subsections.
Provenance capture analysis
To analyse provenance capture experimentally, several query sets of different types, including select, aggregate, and data update queries, are executed on the ZILKVD architecture. A sample set of select queries is shown in Table 4. Each query is first executed 12 times without provenance support, and the same set of queries is then executed again with provenance support. To calculate the average execution time of each query, we drop the minimum and maximum execution times and take the average of the remaining 10 values. The average execution times of all the select queries are shown in Fig. 12. The average execution time of select queries with provenance support is slightly larger than without it; for most select queries the overhead is minimal, with the exception of query Q8. Query Q8 produces a very large number of result tuples, which increases the execution time because the proposed framework captures and stores provenance for every tuple in the query result.
Table 4
Sample select queries
QID | Query
Q1 | Find location of user with Screen_Name=‘Gagan4041’
Q2 | Display all tweets by user with screen_name=‘SunilThalia’
Q3 | Display all hashtags used by a user in one tweet
Q4 | Display all hashtags used by a user in all tweets posted by that user
Q5 | Display all tweets posted by a user on one particular day
Fig. 25
Performance of select queries without and with provenance
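The averaging procedure used for the timings, 12 runs with the fastest and slowest dropped and the remaining 10 averaged, can be written down directly. The function name `trimmed_average` and the sample timings are hypothetical; the numbers are made up and are not measurements from the paper.

```python
from typing import List


def trimmed_average(times: List[float]) -> float:
    """Average the runs after dropping one minimum and one maximum."""
    if len(times) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    kept = sorted(times)[1:-1]  # drop the smallest and the largest value
    return sum(kept) / len(kept)


# Twelve made-up execution times in milliseconds, including two outliers.
runs = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 100.0]
print(trimmed_average(runs))  # 10.0
```

Dropping the extremes keeps a single cold-cache or unusually fast run from dominating the reported average.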
The proposed framework also provides provenance support for aggregate queries with aggregate functions such as count, max, and min. A sample set of aggregate queries is shown in Table 5. The performance of aggregate queries in terms of average execution time with and without provenance support is shown in Fig. 13. It indicates that the framework captures provenance efficiently for aggregate queries such as Q1, Q2, and Q4. However, a longer execution time is measured for queries in which aggregation is performed over a large number of input tuples, such as Q3 and Q5. Consider query Q3, i.e., “count the total number of tweets posted in one month”. Because the aggregation runs over all the tweets of that month, provenance must be captured for every row that contributes to the result, which adds measurable execution overhead.
Table 5
Sample aggregate queries
QID | Query
Q1 | Display total no. of tweets posted by a user on one particular day of a month
Q2 | Display total no. of tweets posted by a user in one month
Q3 | Display total tweets posted in one specific month
Q4 | Display all the users with tweets count in a specific month in descending order
Q5 | Display total no. of tweets posted every day of a specific month
Fig. 26
Performance of aggregate queries without and with provenance
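To illustrate why aggregate provenance grows with the number of input tuples, the sketch below counts tweets per user for one month and records the row key of every contributing row. It is a simplified in-memory illustration, not the authors' implementation: the dictionary layout, the field names `user` and `month`, and the function name are all assumptions.

```python
from collections import defaultdict

# Hypothetical tweets keyed by row key; field names are illustrative.
tweets = {
    "t1": {"user": "alice", "month": "2019-10"},
    "t2": {"user": "bob", "month": "2019-10"},
    "t3": {"user": "alice", "month": "2019-10"},
    "t4": {"user": "alice", "month": "2019-11"},
}


def count_by_user_with_provenance(rows, month):
    """Count tweets per user and keep the row keys behind each count."""
    counts = defaultdict(int)
    provenance = defaultdict(list)
    for row_key, row in rows.items():
        if row["month"] == month:
            counts[row["user"]] += 1
            # One provenance entry per contributing input row.
            provenance[row["user"]].append(row_key)
    return dict(counts), dict(provenance)


counts, prov = count_by_user_with_provenance(tweets, "2019-10")
```

In this sketch the provenance stored for each result tuple has exactly as many entries as the count itself, so a query that aggregates a whole month of tweets carries proportionally more provenance than one restricted to a single user and day.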
Provenance capture for data update queries is also supported by the proposed framework using the ZILKVD architecture. A sample set of data update queries is shown in Table 6. The performance of update queries in terms of average execution time with and without provenance support is shown in Fig. 14. It indicates that the framework efficiently captures provenance for update queries with minimal execution overhead. The captured provenance information for update queries is stored in the “update_provenance” column family. Parameters such as “value_type”, “old_value”, “new_value”, “old_value_writetime”, and “provenance_path_expression” are used to capture the provenance information. These parameters are further used for historical data queries and for queries evaluated as of a specific time in the past, i.e., standing/historical queries, as explained in Sect. 3.4.3.
Table 6
Sample update queries
QID | Query
Q1 | Update location of user with screen_name “DDNewsAndhra”
Q2 | Update location of friend named “Ashutosh” of user with screen_name “Bandho”
Q3 | Update url of user with screen_name “myshowmytalks”
Q4 | Delete user with screen_name “DDNewsAndhra”
Q5 | Insert a posted tweet in tweetset
Fig. 27
Performance of update queries without and with provenance
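The update provenance record described above can be sketched with a small in-memory stand-in for the key-value store. The parameter names come from the text; everything else is an assumption made for illustration, including the `(value, writetime)` layout, the placeholder values, the integer timestamps, and in particular the format of the path expression, which is not specified in this section.

```python
# Minimal in-memory stand-in for a KVP store; not the authors' Cassandra schema.
# Each column holds (value, writetime); values and timestamps are made up.
users = {"DDNewsAndhra": {"location": ("old_place", 1000)}}
update_provenance = []  # stand-in for the "update_provenance" column family


def update_with_provenance(store, log, row_key, column, new_value, now):
    """Overwrite a column and record the replaced value before it is lost."""
    old_value, old_writetime = store[row_key][column]
    log.append({
        "value_type": type(new_value).__name__,
        "old_value": old_value,
        "new_value": new_value,
        "old_value_writetime": old_writetime,
        # Hypothetical path format; the real expression syntax may differ.
        "provenance_path_expression": f"users/{row_key}/{column}",
    })
    store[row_key][column] = (new_value, now)


update_with_provenance(
    users, update_provenance, "DDNewsAndhra", "location", "new_place", 2000
)
```

Because the old value and its write time are saved before the overwrite, the earlier state of the column remains recoverable, which is what makes later historical queries possible.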
Finally, the overall performance of all types of queries with and without provenance support is shown in Fig. 15. The average query execution times for “update”, “select”, and “aggregate” queries with and without provenance support are summarized in Table 7. The results indicate that the proposed framework captures provenance efficiently for “update” and “select” queries, with average overheads of about 56 ms and 78 ms respectively, while a comparatively higher overhead of about 330 ms is measured for “aggregate” queries; see Fig. 16.
Fig. 28
Overall query performance without and with provenance
Table 7
Provenance performance overhead (ms)
Provenance support | Update queries | Select queries | Aggregate queries
Without provenance | 782 | 788 | 794
With provenance | 838 | 866 | 1124
Fig. 29
Provenance overhead for different query set
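The overhead implied by Table 7 can be worked out directly from its average execution times. The sketch below uses only the six values reported in the table; the helper name `overhead` and the dictionary layout are illustrative.

```python
# Average execution times (ms) from Table 7: (without, with) provenance.
table7_ms = {
    "update": (782, 838),
    "select": (788, 866),
    "aggregate": (794, 1124),
}


def overhead(without_ms, with_ms):
    """Return the absolute (ms) and relative (%) provenance overhead."""
    absolute = with_ms - without_ms
    relative = 100.0 * absolute / without_ms
    return absolute, relative


results = {name: overhead(*pair) for name, pair in table7_ms.items()}
for name, (abs_ms, rel_pct) in results.items():
    print(f"{name}: +{abs_ms} ms ({rel_pct:.1f}%)")
```

This gives roughly 56 ms (7.2%) for update queries, 78 ms (9.9%) for select queries, and 330 ms (41.6%) for aggregate queries.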
Provenance querying analysis
This section presents the performance analysis of querying the provenance information stored in Apache Cassandra. A set of different provenance queries is executed for this analysis; a sample set is shown in Table 8. Each provenance query is executed 12 times. To calculate the average execution time of each query, we drop the minimum and the maximum execution times and take the average of the remaining 10 values. The execution performance of all the queries is shown in Fig. 17, with average execution times reported in milliseconds (ms). According to Fig. 17, the average execution time of provenance queries varies from 1000 ms to 1800 ms. This shows that the proposed framework supports efficient provenance querying, both for justifying the answers of a query result and for historical data queries, at an acceptable level of precision.
Table 8
Sample provenance queries
QID | Query
Q1 | Display all the rows that contributed to produce the result tuple of query Q2 of Table 5
Q2 | Display the row keys of all the rows that contributed to produce result tuple t1 of query Q1 of Table 4
Q3 | Display all location updates of a specific user till now
Q4 | Display all location updates of a specific user till time 22/10/2019 8:00AM
Q5 | Display the location of a specific user at time 22/10/2019 8:00AM
Fig. 30
Querying provenance
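Historical provenance queries such as Q4 and Q5 of Table 8 can be illustrated with a sorted version history for one user's location. This is a sketch under stated assumptions, not the framework's query algorithm: the in-memory list, the placeholder location names, the write times, and the function names `updates_until` and `value_at` are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical version history (writetime, location), sorted by writetime.
history = [
    (datetime(2019, 10, 20, 9, 0), "loc_A"),
    (datetime(2019, 10, 21, 18, 30), "loc_B"),
    (datetime(2019, 10, 23, 7, 15), "loc_C"),
]


def updates_until(versions, t):
    """All versions written at or before time t (in the style of Q4)."""
    idx = bisect_right([written for written, _ in versions], t)
    return versions[:idx]


def value_at(versions, t):
    """The value visible at time t, or None if none existed (as in Q5)."""
    visible = updates_until(versions, t)
    return visible[-1][1] if visible else None


t = datetime(2019, 10, 22, 8, 0)  # 22/10/2019 8:00AM, as in Table 8
past_updates = updates_until(history, t)
location_then = value_at(history, t)
```

With these placeholder versions, the query time falls between the second and third writes, so two updates are returned and the second value is the one visible at that time.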
Applications
The proposed framework is beneficial for understanding the social processes and behaviour of social media users. Some application scenarios are given below:
It is a generalized framework that can also provide provenance solutions for other social media platforms. The proposed algorithm can be used to fetch social data from other social networks such as Facebook, Instagram, etc., by using their supporting APIs, for instance the Graph API for the Facebook Social Graph.
The proposed framework is applicable to applications where progressive user profile maintenance is required. For example, a social media user frequently updates their profile by adding, removing, or changing information. In such cases, our framework maintains all the data updates performed without losing any information.
In the current COVID-19 pandemic, health-related data is provided by almost all countries across the world. This data is valuable, but it comes in diverse formats and is scattered across different portals on the Internet. BSDP can be applied to extract and analyse this data for a better understanding of the current situation and to help fight the COVID-19 pandemic.
Conclusion and future work
In this paper, we designed and implemented a Zero-Information Loss Key-Value Pair Database (ZILKVD), on top of which a Big Social Data Provenance (BSDP) Framework has been developed to capture and query provenance for a live-streamed Twitter data set. The proposed framework is capable of capturing fine-grained provenance for various query sets, including select, aggregate, and data update queries with insert, delete, and update operations. It also supports provenance capture for historical/standing queries using the data version support in ZILKVD. The proposed ZILKVD architecture and KVP data model lead to a design methodology that provides a flexible provenance management system for social data.
The proposed framework is efficient in terms of average execution time for capturing and storing provenance for select and data update queries. However, a measurable execution overhead is observed for some aggregate queries, where the aggregation is performed over a larger number of input tuples. The proposed framework supports efficient provenance querying, both for justifying the answers of a query result and for historical data queries, at an acceptable level of precision. Our provenance capturing and querying algorithms prove to be promising, retrieving precise information with low latency.
However, our framework has the following limitations. First, the proposed framework currently provides single-layer provenance support (i.e., tracing the direct sources that contributed to a query result). Second, the BSDP framework is currently implemented for a single-node Apache Cassandra instance rather than for several distributed nodes in a cluster.
In the future, we plan to extend the BSDP framework with multi-layer provenance support (i.e., tracing both direct and indirect sources that contributed to a query result) by using multi-depth provenance querying. We also plan to extend our framework to a distributed environment where data is redundantly stored across multiple nodes in a cluster.