The evolution of your success lies at the centre of your co-authorship network.

Sandra Servia-Rodríguez1, Anastasios Noulas2, Cecilia Mascolo2, Ana Fernández-Vilas3, Rebeca P Díaz-Redondo3.   

Abstract

Collaboration among scholars and institutions is progressively becoming essential to the success of research grant procurement and to allow the emergence and evolution of scientific disciplines. Our work focuses on analysing if the volume of collaborations of one author together with the relevance of his collaborators is somewhat related to his research performance over time. In order to prove this relation we collected the temporal distributions of scholars' publications and citations from the Google Scholar platform and the co-authorship network (of Computer Scientists) underlying the well-known DBLP bibliographic database. By the application of time series clustering, social network analysis and non-parametric statistics, we observe that scholars with similar publications (citations) patterns also tend to have a similar centrality in the co-authorship network. To our knowledge, this is the first work that considers success evolution with respect to co-authorship.

Year:  2015        PMID: 25760732      PMCID: PMC4356565          DOI: 10.1371/journal.pone.0114302

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Success, according to the Merriam-Webster dictionary, is “getting or achieving wealth, respect, or fame”. But how can we measure wealth, respect or fame and, ultimately, success? These concepts are subjective and, as Barabási claimed, success is a collective phenomenon in the sense that a person (or other entity) is successful because others around him believe that he is (http://www.northeastern.edu/news/2013/06/scienceofsuccess/). Despite this, some objective measures have been proposed for quantifying success, and they depend on the domain (context) in which success is assessed. For instance, in the marketing domain, a campaign is successful if it increases profit. In social media, by contrast, we could quantify the success of a YouTube video by its number of views, the success of a tweet by the number of retweets received, or the success of a Twitter user by the number of users who follow him (his followers). In this paper we study the success of academic researchers, which can be tracked by monitoring their publication record in conferences and journals over the years. Many tools exist to gather data about published articles and citations, and we rely on these for our study. More specifically, we focus on (i) the dissemination of results and (ii) the attention that these results receive in the research community as objective signs of scholars’ success. That is, we focus on the publications (be they conference papers, journal articles, books or patents) that authors produce and the citations that these publications receive. We also focus on scholars’ collaborations and, more specifically, on analysing whether the volume of collaborations of one author, together with the relevance of his collaborators, is related to his research performance over time (temporal success).
Therefore, the two main innovative factors of this work are the temporal evolution and the relationship of one author’s success to the success of his collaborators:
Temporal evolution: an important factor when assessing the success of an author should be the variability in the diffusion of his achievements over time. We study the temporal variation of an author’s publication and citation record: the total number is not the only important measure; the pace of publication activity and the impact of this activity are also very relevant.
Collaboration: the collaboration among researchers [1, 2] or their institutions [3-5] drives researchers’ success. As an outstanding example, the recent study of Bellotti [5] indicates that, in order to be successfully funded, interaction over the years with different research groups counts more than working in a large university. Although this study is limited to research projects in the physics discipline funded by the Italian Ministry of University and Research, a look at the collaboration clusters winning research grant calls seems to confirm this informally too.
In order to test the relation between the volume of publications/citations of one author over time and the collaboration network of the co-authors of his papers, we collected the temporal distributions of scholars’ publications and citations from the Google Scholar platform (http://scholar.google.com/) and the co-authorship network (of Computer Scientists) underlying the well-known DBLP bibliographic database (http://www.informatik.uni-trier.de/~ley/db/). Although Google Scholar contains the temporal distributions of all the publications and citations of each author (at least the ones from venues indexed by Google Scholar), the fact that not all authors have a profile on this platform prevents us from having a complete view of the real collaboration network (henceforth, co-authorship network).
By contrast, the DBLP bibliographic database, despite only including Computer Science researchers, provides a complete view of the co-authorship network, both in terms of nodes (scholars) and edges. However, this bibliographic database does not include citation data and therefore cannot be used on its own to analyse the problem. As a solution, we combine the aforementioned datasets: we apply time series clustering to the citation data of Google Scholar, obtaining clusters of authors; we then consider a collaboration network extracted from DBLP and apply social network analysis to it. We observe that the (median) centrality of those scholars with similar publication (citation) patterns (i.e. in the same cluster) is different from the (median) centralities in the rest of the clusters. These findings open the door to potentially new publishing strategies and to the prediction of success for young scholars, potential recruits and journal editors.
The main novelty of our approach is to consider scholars’ whole timeline as a sign of their research activity. Although both citation and publication counts have been widely used as indicators of scholars’ impact [6-8], most of the existing metrics only consider one time window, which may be difficult to select as it depends on several cross-cutting factors [9, 10]. Given the importance of the timing of publications/citations in determining success, the complete timeline of a scholar may give a more complete view. Moreover, our work characterises scholars’ collaboration by their relative importance within the co-authorship network (centrality), taken as the most formal sign of academic teamwork. To establish the dependence among the longitudinal data (time series of citations and publications) without reducing each series to a single derived variable, we propose retaining all the information of the temporal research activity and exploring temporal patterns for scholars by clustering their timelines.
Then, after computing the degree, closeness, betweenness and eigenvector centrality measures on the collaboration network for the clustered authors and observing that they are far from being normally distributed, we use non-parametric statistics (the Kruskal-Wallis statistical test [11]) to check whether there are differences, in terms of authors’ centrality, among the different clusters; that is, whether the median centrality of the authors classified in the same cluster differs from the median centralities in the rest of the clusters.
The main findings of our study are:
(i) The volume of publications/citations over time of a Computer Scientist who started his career between 1979 and 2009 is related to the volume and relevance of his collaborators, that is, to his centrality in the co-authorship network.
(ii) This relation holds for most of the period considered, which means that collaboration affects (and is affected by) research performance over time. It is likely that the more collaborations an author has, the higher his number of publications will be and the more attention they will receive.

Materials and Methods

Datasets collection

We now describe the two datasets we have worked with and the reasoning behind this choice. The Google Scholar platform allows academics to create profile pages that contain, in addition to their affiliation and areas of interest, their history of publications and citation counts over time. A profile page on this platform allows other academics to see, at a glance, the author’s publications without having to search on the traditional Google Scholar page. Also, and more importantly, instead of only showing the total number of citations of a paper (author), it displays the distribution of the citations of the paper (author) over time. Google Scholar also allows researchers to link to collaborators; however, this function is not very popular and it is therefore almost impossible to gather collaboration network data from Scholar. To complement Google Scholar and obtain collaboration network information, we used the DBLP Computer Science Bibliography, a tool developed by researchers from the University of Trier: with it, we trace co-authorship in the work of Computer Science researchers. Although considering only Computer Science researchers limits the analysis, this gives us a complete network of co-authors to work with. We now describe how we obtained the data.

Google Scholar

Every author in Google Scholar has his own identifier, but there is no way to list the identifiers of all Google Scholar authors. One indirect way of getting these identifiers (or at least most of them) consists of crawling Google Scholar through the co-author connections that authors have included in their profiles. But, as mentioned above, the number of authors that have explicitly indicated their co-authors is very small. Many of them, however, have indicated their areas of research. As Google Scholar allows querying for authors interested in a given area, if we knew all the areas of interest in Google Scholar, we could get all authors’ identifiers (at least the ones that have indicated their areas of interest). Again, as authors can freely indicate their areas of expertise (there is no predefined set of them), there is no direct way to know all the areas of interest in Google Scholar. We have devised an algorithm that, starting with an author’s profile (acting as seed), retrieves his areas of interest and all the authors that have indicated the same areas as him. It then randomly selects one of these authors and repeats this procedure, retrieving the areas and the authors that have not been crawled yet. The algorithm finishes when there are no more authors to select. Our dataset contains information about 192,930 authors with a profile in Google Scholar and at least one publication indexed by Google Scholar, and about 9,030,060 papers. Table 1 summarises (i) the authors’ and (ii) the papers’ attributes contained in this dataset (previously retrieved from the Google Scholar platform).
Table 1

Google Scholar profiles dataset.

Author’s information | Paper’s information
name | title
affiliation | authors (name, not links to authors’ ids in Scholar)
domain | publication type (conference, journal, patent)
citations and publications time series | publication name
total number of citations (after 2008 and all) | publication date
h-index (after 2008 and all) | publisher
i10-index (after 2008 and all) | citations time series
areas of interest | total number of citations
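
The seed-based crawl described before Table 1 can be sketched as follows. This is only an illustration under stated assumptions: `areas_of` and `authors_in` are hypothetical lookups standing in for requests to profile pages (they are not a Google Scholar API), and the toy data are made up.

```python
import random
from typing import Callable, Iterable, Set


def crawl_profiles(
    seed: str,
    areas_of: Callable[[str], Iterable[str]],
    authors_in: Callable[[str], Iterable[str]],
    rng: random.Random,
) -> Set[str]:
    """Collect author ids reachable from `seed` through shared areas of interest.

    `areas_of` and `authors_in` are hypothetical lookups (assumptions made
    for this sketch), not calls to a real service.
    """
    discovered = {seed}   # every author id seen so far
    pending = [seed]      # discovered authors whose areas are not expanded yet
    seen_areas: Set[str] = set()
    while pending:
        # Select the next author at random among those not crawled yet.
        author = pending.pop(rng.randrange(len(pending)))
        for area in areas_of(author):
            if area in seen_areas:
                continue  # this area has already been queried
            seen_areas.add(area)
            for other in authors_in(area):
                if other not in discovered:
                    discovered.add(other)
                    pending.append(other)
    return discovered     # the loop ends when no author is left to select


# Toy in-memory data, made up for illustration.
AREAS = {"ana": ["networks"], "bob": ["networks", "ml"], "eve": ["ml"], "zed": ["biology"]}
BY_AREA = {}
for person, areas in AREAS.items():
    for area in areas:
        BY_AREA.setdefault(area, []).append(person)

found = crawl_profiles("ana", AREAS.__getitem__, BY_AREA.__getitem__, random.Random(0))
print(sorted(found))
```

In the toy data, the author `zed` shares no area with anyone reachable from the seed and is never discovered, which mirrors the limitation noted above: only authors connected to the seed through declared areas of interest can be collected.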

DBLP

The DBLP Computer Science Bibliography is a tool developed by researchers from the University of Trier to trace the work of Computer Science researchers. The whole DBLP content is available as an XML file with a separately documented structure. An updated version of the XML file is released daily. As of 2013, the dataset includes 1,227,149 authors and 9,822,354 co-authorship relations. The information in this dataset is presented from a publication-centred perspective. Publications are classified, following a BibTeX style, as: article, inproceedings, proceedings, book, incollection, phdthesis, mastersthesis or www. The data associated with each publication are shown in Table 2. The relevant information for our purposes is the authors’ names and publications, which allow us to obtain the co-authorship network, that is, the network in which nodes are the computer science authors and a link between two of them exists if they have co-authored at least one publication.
Table 2

DBLP dataset: publication’s information.

title | publication date | year | number | school | chapter
authors’ (editors’) name | publication type | journal | month | publisher | series
authors’ (editors’) address | pages | volume | url | note | isbn

The aforementioned datasets (Google Scholar and DBLP) have different levels of coverage: while Google Scholar contains information about researchers in any area of expertise who have created a profile on the platform, DBLP contains information about all the researchers in the Computer Science domain (or at least about those that have published, at least once, in a conference/journal indexed by DBLP). The two platforms assign different identifiers to users, which means that we have to match authors across both platforms. To this aim, we applied a conservative solution: the lexicographic comparison of authors’ names in both datasets. The number of authors with a profile in both datasets obtained in this way is 57,416, which represents around 30% of our entire Google Scholar dataset.
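
A minimal sketch of name-based matching between the two datasets is given below. The text only states that names are compared lexicographically; the normalisation step (lower-casing, accent stripping, whitespace collapsing) and the function names are assumptions added for illustration, and the names in the example are made up.

```python
import unicodedata


def normalise(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace (an assumed normalisation)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def match_authors(scholar_names, dblp_names):
    """Map each Scholar name to the DBLP name that is identical after normalisation."""
    dblp_index = {normalise(n): n for n in dblp_names}
    matches = {}
    for name in scholar_names:
        key = normalise(name)
        if key in dblp_index:
            matches[name] = dblp_index[key]
    return matches


# Made-up names for illustration only.
scholar = ["María Pérez", "Li  Wei", "Jane Doe"]
dblp = ["Maria Perez", "Li Wei", "John Roe"]
matches = match_authors(scholar, dblp)
coverage = len(matches) / len(scholar)
print(matches, round(coverage, 3))
```

With the figures reported above, the same coverage ratio is 57,416 / 192,930 ≈ 0.298, that is, around 30% of the Google Scholar profiles.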

Analysis Techniques

As mentioned, one of the main contributions of this work is the ability to track the evolution of authors’ success through the use of the temporal data contained in our datasets. We analyse the data longitudinally without resorting to aggregates or univariate summaries (e.g. averages or time slopes). Our approach can be briefly described as follows: we first apply hierarchical clustering to the Google Scholar longitudinal data containing citations and publications, an operation that divides our authors into groups with similar patterns of publications (citations). We then consider the DBLP collaboration network and apply social network analysis to determine author centrality. Finally, we study the relation between the author centralities and the clusters. In the next sections we detail the steps just described.

Time Series Analysis

We start by exploring the longitudinal dataset to discover groups of authors that share similar publication/citation dynamics, as a means to identify different patterns of success evolution in the pool of authors. Among unsupervised techniques for exploratory data analysis, clustering is a strong candidate for finding strongly related data points. An ample variety of algorithms has been proposed for clustering purposes (see [12] for a review). Given the scarce knowledge about temporal patterns in scholars’ activity to date, and considering its conceptual simplicity and good scaling with a large number of points, a hierarchical clustering algorithm (the Unweighted Pair Group Method with Arithmetic Mean, UPGMA [13]) seems appropriate, as it constructs a rooted tree (dendrogram) for inspection. This dendrogram reflects the structure present in a pairwise similarity matrix (or a dissimilarity matrix) through the following iterative steps:
1. Place each data point into its own singleton group.
2. Merge the two closest groups.
3. Update the distances D between the new cluster (group) and each of the old clusters (see below).
4. Repeat steps 2 and 3 until all the data are merged into a single cluster.
Given a distance measure d between points, UPGMA obtains the intergroup distance between the clusters C and H as
$$D(C, H) = \frac{1}{N_C N_H} \sum_{i \in C} \sum_{j \in H} d_{ij},$$
where $N_C$ ($N_H$) is the size of the cluster C (H) and $d_{ij}$ is the distance between the data points i and j.
For the case of UPGMA over time series, the pairwise matrix is composed of the (dis)similarity measures between every two time series (authors’ time series of publications or citations, as appropriate). Our methodology constructs this matrix by applying the well-known Dynamic Time Warping (DTW) method [14], which finds an optimal alignment between two time-dependent sequences $S_1$ and $S_2$ by warping the time dimension so as to minimise the difference between the two series; as a result, the time series need not be of equal length. Although DTW was initially applied to word recognition in continuous human speech [15, 16], its use has been extended to other domains. In our experiment, we alleviate the problem of alignment by classifying authors in advance by the year of their first publication, so that only time series of equal length are compared.
DTW works as follows. Let us suppose that we have two time series $S_1 = \{s_{1,1}, s_{1,2}, \ldots, s_{1,n}\}$ and $S_2 = \{s_{2,1}, s_{2,2}, \ldots, s_{2,m}\}$ that represent the temporal distribution of publications (citations) of author 1 and author 2 respectively. If $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$ are indices into $S_1$ and $S_2$, and $W = w_1, w_2, \ldots, w_K$ is an optimal mapping between the two sequences given as
$$w_k = (i_k, j_k), \quad k = 1, \ldots, K,$$
we can compute the dissimilarity or distance between the two sequences as
$$DTW(S_1, S_2) = \sum_{k=1}^{K} \delta(w_k),$$
where $\delta(w_k)$ is a non-negative function that computes the dissimilarity between the individual elements $s_{1,i_k}$ and $s_{2,j_k}$ of $S_1$ and $S_2$.
As with most clustering algorithms, the key issue in UPGMA is fixing the number of resulting clusters. UPGMA being a hierarchical method, this turns into the problem of identifying the relevant branches of the cluster tree (dendrogram pruning). To avoid deciding for each individual branch, the most widely used technique fixes a height in the dendrogram so that each contiguous branch below that height is considered a separate cluster. Unfortunately, selecting this height is not a trivial task and, as an alternative, Dynamic Tree Cut (DTC) [17] proposes pruning the dendrogram by taking its shape into consideration. DTC iteratively applies decomposition and combination of clusters until their number becomes stable. After a few large clusters are obtained by the fixed-height branch cut method, the joining heights of each cluster are analysed for a sub-cluster structure. Clusters with this sub-cluster structure are recursively split and, with the aim of avoiding over-splitting, very small clusters are joined to their neighbouring major clusters.
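
The combination of DTW distances and average-linkage (UPGMA) merging described in this section can be sketched in plain Python. This is a simplified illustration, not the procedure used in the study: the local cost |a - b|, the function names and the toy series are assumptions, and the dendrogram is cut at a fixed height (`cut_height`) rather than with Dynamic Tree Cut.

```python
from itertools import combinations


def dtw_distance(s1, s2):
    """Dynamic-programming DTW with |a - b| as the local cost (an assumed choice)."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s1[i - 1] - s2[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def upgma(series, cut_height):
    """Average-linkage clustering that stops merging once the two closest
    clusters are farther apart than `cut_height` (a fixed-height cut)."""
    d = {frozenset(p): dtw_distance(series[p[0]], series[p[1]])
         for p in combinations(range(len(series)), 2)}
    clusters = [(i,) for i in range(len(series))]  # step 1: singleton groups

    def linkage(c, h):
        # Mean pairwise distance between the members of c and h.
        return sum(d[frozenset((i, j))] for i in c for j in h) / (len(c) * len(h))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(a, b) > cut_height:
            break
        # Step 2: merge the two closest groups; distances are recomputed on demand.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return [sorted(c) for c in clusters]


# Toy yearly publication counts for four hypothetical authors of one cohort.
toy_series = [[0, 1, 2, 3], [0, 1, 2, 4], [5, 5, 5, 5], [5, 5, 6, 5]]
toy_clusters = upgma(toy_series, cut_height=2.0)
print(toy_clusters)
```

Recomputing the mean pairwise distance at every step is quadratic in the cluster sizes and is only meant to keep the sketch close to the formula for D(C, H); an implementation for hundreds of thousands of series would update distances incrementally.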

Social Network Analysis

Once we have discovered the structure of the temporal dataset, the next step is to establish whether this structure is related to scholars’ centrality in the co-authorship network. For that, we take advantage of the DBLP dataset to obtain the current co-authorship network. We represent this network as an undirected graph G = (V, E), where V denotes the set of scholars (authors) in the dataset, and an edge e ∊ E between two scholars u, v ∊ V exists if they have co-authored at least one publication. We will then be able to determine whether the position of an author in the network is related to the behaviour of his citations (publications) in Google Scholar, that is, whether the fact that two authors are included in the same cluster implies that they have similar centrality in the network (we have tried various measures of centrality). This will indicate whether researchers’ collaboration activity depends on the temporal distribution of their publications or of the citations that they receive. With this aim, we applied social network analysis to the co-authorship network extracted from DBLP. The next step consists in determining the centrality of each node within the network. Several measures of centrality have been proposed to date. The most fundamental and popular definitions of centrality were proposed by Freeman [18], who measures centrality based on degree, closeness and betweenness. These measures, together with the eigenvector centrality, are detailed below.
Degree centrality: The degree of a node $n_i$ in a graph is the number of direct connections that the node has. It is the simplest way of measuring centrality, being a local measure of a node’s importance. The degree centrality of a node reflects its popularity and relational activity:
$$C_D(n_i) = \sum_{j=1}^{N} a(n_i, n_j),$$
where N is the number of nodes in the network and $a(n_i, n_j)$ is an adjacency function: $a(n_i, n_j) = 1$ if and only if nodes $n_i$ and $n_j$ are connected, and $a(n_i, n_j) = 0$ otherwise.
Betweenness centrality: The betweenness centrality of a node $n_i$ is defined as the fraction of shortest paths between pairs of nodes in the network that pass through $n_i$. It represents the node’s capability to influence or control the interaction between the nodes that it links:
$$C_B(n_i) = \sum_{j < k} \frac{g_{jk}(n_i)}{g_{jk}},$$
where $g_{jk}(n_i)$ is the number of shortest paths connecting $n_j$ and $n_k$ that pass through $n_i$, and $g_{jk}$ is the total number of shortest paths connecting them.
Closeness centrality: The closeness centrality of a node measures how many steps are required to reach the node from every other node. It indicates the node’s availability, safety and security, depending on the application context considered:
$$C_C(n_i) = \left[ \sum_{j=1}^{N} d(n_i, n_j) \right]^{-1},$$
where N is the number of nodes in the network and $d(n_i, n_j)$ is the number of hops in the shortest path between $n_i$ and $n_j$.
Eigenvector centrality: The eigenvector centrality is a measure of the importance of a node in a network, based on the idea that a node is more central if it is related to nodes that are themselves central:
$$C_E(n_i) = \frac{1}{\lambda} \sum_{j=1}^{N} a(n_i, n_j)\, C_E(n_j),$$
where λ is a constant, N is the number of nodes in the network and $a(n_i, n_j)$ is the adjacency function defined above: $a(n_i, n_j) = 1$ if and only if nodes $n_i$ and $n_j$ are connected, and $a(n_i, n_j) = 0$ otherwise.
The next section shows the results of our analysis.
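
As an illustration of the four measures, the following self-contained sketch builds a small co-authorship graph from author lists and computes unnormalised centralities. Everything here is an assumption made for the example: the function names, the toy papers, the restriction of closeness to reachable nodes, and the use of Brandes’ algorithm and power iteration, which the text does not specify.

```python
from collections import deque
from itertools import combinations


def coauthorship_graph(publications):
    """Undirected graph: an edge links two authors who share at least one paper."""
    adj = {}
    for authors in publications:
        for a in authors:
            adj.setdefault(a, set())
        for a, b in combinations(set(authors), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree(adj):
    return {n: len(nbrs) for n, nbrs in adj.items()}


def closeness(adj):
    """Inverse of the summed hop distances to the nodes reachable from each node."""
    result = {}
    for source in adj:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = 1.0 / total if total else 0.0
    return result


def betweenness(adj):
    """Brandes' algorithm for unweighted graphs; each unordered pair counts once."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {n: [] for n in adj}
        sigma = dict.fromkeys(adj, 0.0)   # number of shortest paths from s
        sigma[s] = 1.0
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):         # accumulate dependencies backwards
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {n: value / 2.0 for n, value in score.items()}  # undirected: halve


def eigenvector(adj, iterations=200):
    """Power iteration on A + I, which shares its leading eigenvector with A."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {n: x[n] + sum(x[m] for m in adj[n]) for n in adj}
        norm = max(nxt.values()) or 1.0
        x = {n: value / norm for n, value in nxt.items()}
    return x


# Two made-up papers: authors a, b, c wrote one together; c and d wrote another.
graph = coauthorship_graph([["a", "b", "c"], ["c", "d"]])
deg, clo, btw, eig = degree(graph), closeness(graph), betweenness(graph), eigenvector(graph)
print(deg, btw)
```

In this toy network, author `c` is the only link between `d` and the other two authors, so it has the highest value under all four measures. On a network of the size of DBLP, a dedicated graph library would be a more practical choice than these quadratic loops.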

Results

Our methodology for analysing the data obtained from both Google Scholar and DBLP processes each dataset separately. Then, the results are analyzed through statistical tests to determine if there exists any relation between them. In the case of Google Scholar, we also process separately publications and citations, as they are different indicators of authors’ success: whereas a publication is reviewed by a group of previously selected experts in the domain (reviewers), any author, in or out of the domain, can cite a previously published publication. On the other hand, authors have started to publish (receive citations) in different years and their publications and citations are conditioned by the state of the research environment in every moment of their trajectory. The number of existing conferences and journals or the accessibility to bibliographic resources are only some of the factors that condition authors’ publications and citations in a given period. To avoid the impact of this aspect on the results, we decided to split authors in groups depending on the year in which they started to publish (in the case of publications) or to receive the first citation (in the case of citations). In the case of DBLP, as the co-authors network is the result of collaborations between authors through the years, we work with the whole network. With respect to the distribution of authors by year of their first publication (citation), Fig. 1A represents the number of authors classified by (i) the year in which they achieved their first publication and (ii) the year in which any of their publications received its first citation. Both the Computer Science discipline and the Google Scholar platform are relatively recent in comparison to other scientific disciplines and other scientific databases, which could explain why the curves of the number of authors per year are increasing.
Fig 1

Time series clustering in Google Scholar and Centrality measures in DBLP.

(a) Number of authors by the year of their first publication (citation). (b) Number of clusters of authors per year. (c) Number of authors per cluster per year (according to the temporal evolution of their publications). (d) Number of authors per cluster per year (according to the temporal evolution of their citations). (e) Degree centrality in the co-authorship network (for authors simultaneously in both datasets and for authors in the entire DBLP). (f) Betweenness centrality in the co-authorship network (for authors simultaneously in both datasets and for authors in the entire DBLP). (g) Closeness centrality in the co-authorship network (for authors simultaneously in both datasets and for authors in the entire DBLP). (h) Eigenvector centrality in the co-authorship network (for authors simultaneously in both datasets and for authors in the entire DBLP).
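
The split of authors into cohorts by the year of their first publication (citation), described before Fig. 1, can be sketched as below. The data layout (`author -> {year: count}`), the function name and the toy counts are assumptions made for illustration.

```python
from collections import defaultdict


def cohorts_by_first_year(yearly_counts):
    """Group authors by the first year with a non-zero count.

    Each series is trimmed to start at that first active year and padded with
    zeros up to the last observed year, so series within a cohort have equal length.
    """
    last_year = max(year for counts in yearly_counts.values() for year in counts)
    cohorts = defaultdict(dict)
    for author, counts in yearly_counts.items():
        active = [year for year, n in counts.items() if n > 0]
        if not active:
            continue  # no publication (citation) at all: nothing to cluster
        first = min(active)
        cohorts[first][author] = [counts.get(y, 0) for y in range(first, last_year + 1)]
    return dict(cohorts)


# Made-up yearly publication counts for three hypothetical authors.
toy_counts = {
    "a": {2008: 1, 2009: 0, 2010: 3},
    "b": {2008: 2, 2010: 1},
    "c": {2009: 1, 2010: 4},
}
toy_cohorts = cohorts_by_first_year(toy_counts)
print(toy_cohorts)
```

Each cohort can then be clustered separately, which keeps authors who started in different research environments apart and guarantees that only series of equal length are compared.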

Grouping authors

As mentioned above, we applied hierarchical clustering to the time series of authors’ publications (citations) in order to extract groups of authors with similar publication (citation) patterns. Fig. 1 contains information about the resulting clusters: the number of resulting clusters (Fig. 1B) and the number of authors per cluster in the case of publications (Fig. 1C) and citations (Fig. 1D). Focusing on Fig. 1B, we see that the number of resulting clusters increases with the year of the first publication (citation) and, therefore, with the number of time series (authors) to be clustered, at least until the late 2000s. That is, the more authors there are to cluster, the more different patterns emerge. Although this tendency is notable both in publications and citations, the increase is higher in the case of publications, highlighting the variability of publication records. With respect to the number of authors (time series) per cluster, we observe that the resulting clusters have, on average and independently of the year, a similar number of elements (authors). By contrast, the maximum number of elements in a cluster increases with the number of authors until the mid-2000s. When there are more authors who started to publish (receive citations) in the same year, (i) the diversity in terms of publication/citation patterns is higher and (ii), although on average the number of authors with similar patterns (classified in the same cluster) stays constant with respect to previous years, there are some big groups of authors (clusters) that share a similar pattern of publications (citations). However, there are certain periods in the mid-2000s in which, although (i) the number of authors who started their activity around these years and (ii) the diversity in terms of citation/publication patterns are higher than in previous years, fewer authors share similar patterns (fewer authors per cluster).
A feasible explanation for this could be that, as these authors are relatively new to the research area, they are in a “transitory” period of their careers, which could cause the high diversity of behaviour among them. Finally, Fig. 1 also shows that recent scholars, those who started to publish or be cited between 2010 and 2012 and have therefore had less than 3 years to develop their research activity, are, in general, authors with very few publications/citations, so that almost all of them present the same publication/citation pattern.

Obtaining authors’ centrality

The CCDF (Complementary Cumulative Distribution Function) curves corresponding to the centrality measures previously described are included in Fig. 1. For each measure, we provide two curves: one that represents the distribution of the given centrality for the authors in the whole DBLP dataset, and another restricted to those authors that appear simultaneously in DBLP and in the Google Scholar dataset (the subjects of our study). Specifically, Fig. 1E shows the distribution of degree centrality among authors, that is, the number of authors with whom they have collaborated at least once. Although the tendency of the curve is similar for all computer science authors and for computer science authors with a profile in Google Scholar, the degree is slightly higher in the second case. Looking at the CCDF for the whole dataset, we see that around 40% of the authors have a degree centrality higher than 5, whereas when only authors in our Google Scholar dataset are considered this percentage increases to 70%. This means that authors who own a profile in Google Scholar are more prone to co-author papers with different authors than the rest of computer science authors.
In the case of eigenvector centrality (Fig. 1H), most of the nodes (authors) in the network have an eigenvector centrality equal to zero. Specifically, only around 11% of the authors in the whole dataset have a positive eigenvector centrality, whereas this percentage increases to 27% when only authors in Google Scholar are taken into account. In terms of research collaborations, this means that scholars are prone to collaborate with both central and non-central authors, so that collaborations are quite distributed across the network instead of being topologically focused on only one or a few areas of the network.
In the case of betweenness centrality (Fig. 1F), around 30% of the authors that are simultaneously in both datasets (Google Scholar and DBLP) have no shortest path passing through them, whereas this percentage increases to almost 60% when the entire DBLP dataset is taken into account. This suggests that authors who own a profile in Google Scholar are more prone to serve as links between scholars (to co-author papers with well-connected authors) than the rest of computer science authors. With respect to authors with positive values of betweenness, there is no difference between the distribution of this variable among all authors in computer science (the whole DBLP dataset) and its distribution when only Computer Scientists with a profile in Google Scholar are considered.
Finally, in the case of closeness centrality (Fig. 1G), the distribution for authors in Google Scholar is similar to that of all computer science authors. Specifically, all computer science scholars have a closeness value higher than 0.12 and only 10% of them have a closeness value equal to 1, a percentage that drops to 5% for authors with a profile in Google Scholar. In terms of co-authorship relations, this means that only a small percentage of scholars are at a reachable distance, in terms of co-authorship links, from the rest of computer science authors, while the immense majority of authors are far from having collaborated with the rest of the nodes in the network.
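
For reference, an empirical CCDF of the kind read off Fig. 1 can be computed as in the following sketch. The function name and the degree values are made up for illustration and are not taken from the datasets.

```python
def ccdf(values, thresholds):
    """Fraction of values strictly greater than each threshold."""
    n = len(values)
    return {t: sum(v > t for v in values) / n for t in thresholds}


# Made-up degree values for ten hypothetical authors.
toy_degrees = [1, 2, 2, 3, 6, 7, 8, 10, 12, 20]
toy_ccdf = ccdf(toy_degrees, [0, 5])
print(toy_ccdf)
```

A statement such as "around 40% of the authors have a degree centrality higher than 5" corresponds to reading the value of this function at the threshold 5.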

Relating academic success and centrality: the Kruskal-Wallis test

The last step of our analysis checks whether there is a relation between the authors included in the same cluster and their centrality in the co-authorship network. To this aim, the common approach in the literature is the one-way analysis of variance [19] (one-way ANOVA), a statistical technique used to compare the means of two or more samples (the samples being, in our case, the authors classified into each cluster). Specifically, ANOVA tests the null hypothesis that samples in two or more groups are drawn from populations with the same mean values. In our scenario, this implies testing whether the authors classified in the previously calculated clusters (response or dependent variable) are drawn from authors with the same mean values of centrality in the co-authorship network (factor or independent variable). But the reliability of one-way ANOVA results is conditioned, among other assumptions, by the normality of the response variables, and the application of the Shapiro-Wilk statistical test [20] revealed that the centrality of the authors within a cluster is far from being normally distributed. Under these circumstances, we opted for a non-parametric alternative to the one-way ANOVA, the Kruskal-Wallis test [11]. In contrast to one-way ANOVA, which works with mean values, Kruskal-Wallis tests the null hypothesis that samples in two or more groups are drawn from populations with the same median values. Thus, we applied the Kruskal-Wallis test to determine whether the centralities of the authors with the same publication (citation) pattern had the same median value and whether this value was different from the median value of authors with different publication (citation) patterns. That is, whether authors with similar centrality in the network have similar publication (citation) patterns and whether these patterns are different from those of authors with different centrality values.
The Kruskal-Wallis results for the degree, betweenness, closeness and eigenvector centrality measures in the different years are shown in Tables 3 and 4 for publications and citations respectively. Taking the standard α = 0.05 as the significance level for all the Kruskal-Wallis tests conducted, the p-value is lower than the significance level, and therefore the null hypothesis can be rejected, for the most relevant cases: those that include authors whose first publication dates from between 1979 and 2009 (Table 3), for all centrality measures except closeness centrality in some years between 1979 and 1987. These results are highly conclusive because the cases in which the null hypothesis cannot be rejected are those in which the clustering itself is unreliable. For authors who started to publish between 2010 and 2012 (classified into two or more clusters according to their publication rate), the length of the time series (2 or 3 points) is not enough to obtain accurate similarity values and, consequently, a suitable clustering. A similar situation can be observed for authors that started to publish before 1979. In this period there is a huge diversity, both in measures and in years, of cases in which the null hypothesis cannot be rejected. In any case, because of the peculiarities of the Google Scholar dataset (which counts publications since 1974) and because Computer Science is a relatively recent discipline, the number of authors that started to publish before 1980 is quite limited compared with the number of computer scientists that started their activity later, so these results should not be considered relevant. Again, the clustering for this period may not be reliable enough.
With all of this, the conclusions so far are that (i) in around 80% of the period considered, authors’ centrality is highly related to the distribution of their publications over time and (ii) the periods in which this relation cannot be determined are those with very few authors or with extremely short publication time series.
Table 3

Kruskal-Wallis statistics results for the publications case.

Year df Degree (χ², p) Betweenness (χ², p) Closeness (χ², p) Eigenvector (χ², p)
1974 7 10.0 0.187 10.6 0.155 13.0 0.071 16.1 0.024
1975 11 15.0 0.183 18.1 0.079 7.3 0.778 13.4 0.268
1976 12 43.4 1.9e-05 39.3 9.2e-05 21.2 0.048 34.6 0.001
1977 10 22.6 0.012 24.3 0.007 16.7 0.081 17.0 0.073
1978 14 24.6 0.039 19.5 0.148 14.1 0.444 28.7 0.011
1979 11 32.2 0.001 28.5 0.003 21.4 0.029 20.3 0.041
1980 11 55.1 7.4e-08 46.1 3.1e-06 21.6 0.028 40.2 3.2e-05
1981 14 35.5 0.001 29.8 0.008 20.4 0.117 34.1 0.002
1982 16 54.0 5.3e-06 55.5 2.9e-06 29.0 0.024 50.5 1.9e-05
1983 18 51.8 4.0e-05 57.5 5.2e-06 34.1 0.012 45.0 4.1e-04
1984 21 61.5 7.6e-06 53.5 1.2e-04 31.7 0.063 43.5 0.003
1985 21 81.2 5.2e-09 79.0 1.2e-08 41.4 0.005 43.1 0.003
1986 21 97.4 8.2e-12 105.3 3.3e-13 31.5 0.066 48.1 0.001
1987 20 97.3 3.9e-12 94.8 1.1e-11 29.9 0.071 51.7 1.3e-04
1988 26 144.3 2.3e-18 138.7 2.3e-17 62.0 9.1e-05 89.3 7.1e-09
1989 34 136.4 3.2e-14 136.6 3.0e-14 93.9 1.6e-07 102.2 9.5e-09
1990 25 178.0 5.0e-25 173.9 2.9e-24 75.6 5.5e-07 119.4 2.9e-14
1991 47 162.0 1.5e-14 149.5 1.3e-12 90.7 1.4e-04 89.7 1.8e-04
1992 31 181.8 3.0e-23 178.6 1.0e-22 70.4 6.8e-05 87.5 2.7e-07
1993 44 203.8 2.0e-22 211.7 8.5e-24 107.6 3.0e-07 126.0 7.8e-10
1994 51 279.4 3.0e-33 298.4 1.1e-36 112.6 1.5e-06 151.2 7.4e-12
1995 61 277.1 2.6e-29 269.6 5.2e-28 124.9 2.7e-06 169.8 3.3e-12
1996 61 233.2 6.0e-22 241.8 2.3e-23 134.6 1.8e-07 147.0 4.6e-09
1997 66 331.4 5.4e-37 328.4 1.8e-36 123.4 2.4e-05 170.7 3.3e-11
1998 78 315.6 2.5e-30 302.5 3.5e-28 176.2 1.5e-09 195.0 5.3e-12
1999 87 323.4 6.6e-29 324.6 4.2e-29 158.1 4.9e-06 178.9 2.5e-08
2000 84 339.0 2.4e-32 310.6 1.0e-27 150.8 1.1e-05 182.9 2.6e-09
2001 91 401.6 1.3e-40 382.3 2.3e-37 181.7 5.3e-08 176.6 2.0e-07
2002 107 390.4 7.1e-34 375.4 1.7e-31 204.6 4.3e-08 218.0 1.4e-09
2003 140 506.2 6.5e-43 452.9 1.2e-34 247.3 5.9e-08 239.1 3.7e-07
2004 162 498.2 6.9e-36 429.1 4.9e-26 284.7 8.8e-09 239.3 7.5e-05
2005 175 504.9 9.4e-34 407.7 1.3e-20 263.2 1.8e-05 222.9 0.008
2006 179 508.3 2.6e-33 445.3 1.1e-24 226.6 0.009 244.5 0.001
2007 208 512.7 1.9e-27 436.8 2.3e-18 258.3 0.010 256.2 0.013
2008 180 462.2 1.0e-26 380.3 2.0e-16 227.3 0.010 216.2 0.034
2009 183 433.6 2.2e-22 344.2 5.8e-12 228.7 0.012 218.1 0.039
2010 114 215.0 3.4e-08 196.0 2.8e-06 133.3 0.104 125.0 0.227
2011 33 94.1 8.6e-08 84.2 2.3e-06 55.0 0.009 37.7 0.262
2012 6 16.5 0.011 12.0 0.061 6.4 0.384 2.3 0.889
Table 4

Kruskal-Wallis statistics results for the citations case.

Year df Degree (χ², p) Betweenness (χ², p) Closeness (χ², p) Eigenvector (χ², p)
1974 7 10.3 0.173 9.5 0.221 4.3 0.750 2.8 0.907
1975 4 10.3 0.035 10.8 0.029 3.5 0.485 16.6 0.002
1976 2 2.4 0.301 3.3 0.192 0.3 0.845 1.7 0.432
1977 6 4.1 0.661 5.6 0.465 1.4 0.968 5.4 0.496
1978 3 1.2 0.749 1.6 0.661 1.2 0.755 2.6 0.453
1979 5 3.8 0.582 2.1 0.836 2.2 0.821 1.0 0.963
1980 4 1.3 0.867 2.3 0.681 5.5 0.241 3.4 0.487
1981 6 20.0 0.003 20.8 0.002 7.2 0.299 12.8 0.046
1982 3 5.9 0.115 3.9 0.269 4.4 0.218 4.3 0.233
1983 9 18.1 0.034 16.4 0.059 11.9 0.216 19.6 0.020
1984 8 11.5 0.176 12.1 0.145 7.7 0.464 19.7 0.011
1985 10 17.8 0.059 16.2 0.094 12.0 0.282 13.5 0.195
1986 9 14.0 0.121 13.5 0.142 10.4 0.323 8.5 0.484
1987 12 15.1 0.233 14.9 0.250 14.8 0.255 12.2 0.429
1988 10 34.6 1.4e-04 34.1 1.8e-04 16.1 0.097 32.0 4.0e-04
1989 14 34.7 0.007 32.8 0.003 18.2 0.196 28.6 0.012
1990 18 63.2 6.1e-07 57.1 6.0e-06 39.6 0.002 51.1 5.2e-05
1991 18 35.1 0.009 32.2 0.021 23.0 0.192 36.8 0.006
1992 21 110.8 3.4e-14 100.7 2.2e-12 81.9 3.8e-09 103.0 8.4e-13
1993 25 128.6 6.5e-16 118.8 3.7e-14 87.5 7.4e-09 102.0 2.8e-11
1994 25 140.1 5.5e-18 137.0 2.0e-17 120.3 2.0e-14 133.5 8.6e-17
1995 36 126.3 5.9e-12 109.6 2.3e-09 90.5 1.4e-06 139.9 3.6e-14
1996 38 164.2 1.2e-17 149.2 4.3e-15 106.4 2.2e-08 163.4 1.7e-17
1997 46 223.4 3.9e-25 211.0 5.5e-23 144.7 4.0e-12 219.6 1.8e-24
1998 46 159.4 2.0e-14 135.6 9.0e-11 113.3 1.3e-07 182.9 3.1e-18
1999 50 243.0 3.7e-27 231.6 3.5e-25 148.8 9.5e-12 221.9 1.6e-23
2000 66 304.2 2.8e-32 272.5 6.5e-27 179.4 2.0e-12 276.5 1.4e-27
2001 76 319.0 1.6e-31 266.4 5.7e-23 182.5 9.6e-11 284.3 8.2e-26
2002 66 364.7 6.5e-43 331.2 5.8e-37 232.2 2.3e-20 286.8 2.6e-29
2003 82 465.8 5.1e-55 333.7 4.4e-32 277.9 4.0e-23 342.8 1.3e-33
2004 94 507.3 6.0e-58 345.0 2.3e-30 269.8 6.7e-19 329.2 7.5e-28
2005 109 520.4 6.5e-55 398.9 1.1e-34 359.6 1.6e-28 369.7 4.4e-30
2006 110 489.5 2.8e-49 319.4 2.7e-22 308.1 1.1e-20 373.5 2.2e-30
2007 115 472.8 6.8e-45 311.8 4.5e-20 251.2 3.7e-12 296.6 5.6e-18
2008 111 508.5 3.5e-52 317.0 1.0e-21 254.4 2.9e-13 325.9 5.4e-23
2009 147 501.5 3.0e-40 316.8 1.7e-14 315.8 2.2e-14 330.4 3.9e-16
2010 147 382.8 6.4e-23 246.8 4.8e-07 238.3 2.8e-06 289.5 2.3e-11
2011 137 318.6 1.7e-16 245.8 3.4e-08 267.3 1.9e-10 245.7 3.5e-08
2012 33 98.7 1.8e-08 71.2 1.2e-04 70.1 1.7e-04 74.5 4.9e-05
A similar tendency is observed in the case of citations (Table 4). Specifically, the null hypothesis can be rejected for authors that started to receive citations between 1988 and 2012, with exceptions in some years, again when closeness centrality is considered. However, the interval in which we cannot deduce anything about the relation between citation patterns and centrality extends to authors that started to receive citations between 1974 and 1987. Finally, similar conclusions are drawn across the different centrality measures, with no differences between authors who started to receive citations in one year or another (except for authors that started to receive citations in 1984).

Comparisons with a null model

Kruskal-Wallis using a random graph as response

In order to demonstrate the significance of the results achieved, we applied the Kruskal-Wallis test to a shuffled DBLP network: instead of considering the real edges in the DBLP network, we modelled these edges by means of a classic random network model, the Erdős-Rényi random graph G(n, p) [21]. This graph is defined by two parameters: the number of nodes in the graph (n) and the probability of drawing an edge between two arbitrary nodes (p). Our random network was obtained using n = 57416 (the number of authors present in both datasets) and p = 1.4e-4 (chosen to give a level of connectivity similar to that of the original DBLP network, although with a different, random, distribution of edges). With this test we wanted to rule out that the centralities of a random collaboration network are also related to our clusters. The results are shown in Tables 5 and 6. According to these tables, for both the publications and the citations cases, the null hypothesis cannot be rejected in almost all the years (in which authors started to publish/receive citations) and for all the centrality measures considered, which confirms the significance of our results.
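The null model can be sketched in a few lines of standard-library Python. The snippet below is only an illustration under stated assumptions: the function names are hypothetical, the graph is sampled at a small size (the naive pairwise loop is too slow for n = 57416, where a dedicated graph library would normally be used), and the only figures taken from the text are n = 57416 and p = 1.4e-4.

```python
import random
from itertools import combinations


def gnp_random_graph(n, p, seed=None):
    """Sample G(n, p): each of the n*(n-1)/2 possible edges is kept
    independently with probability p. Returns an adjacency dict of sets."""
    rng = random.Random(seed)
    adjacency = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def expected_mean_degree(n, p):
    """Expected degree of a node in G(n, p): each of the other n-1 nodes
    is a neighbour with probability p."""
    return (n - 1) * p


def degree_centrality(adjacency):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adjacency)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}


# Parameters reported in the text: about 8 co-authors per node on average.
print(f"expected mean degree: {expected_mean_degree(57416, 1.4e-4):.2f}")

# Small illustrative sample (hypothetical size, chosen so the loop stays fast).
sample = gnp_random_graph(200, 0.04, seed=42)
centrality = degree_centrality(sample)
print(f"nodes: {len(sample)}, mean degree centrality: "
      f"{sum(centrality.values()) / len(centrality):.4f}")
```

Centralities computed on such a random graph can then be fed to the same Kruskal-Wallis procedure, with the clusters left unchanged, to obtain the kind of comparison reported in Tables 5 and 6.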
Table 5

Kruskal-Wallis statistics results for the publications case (using a random graph as a response).

Year df Degree (χ², p) Betweenness (χ², p) Closeness (χ², p) Eigenvector (χ², p)
1974 7 8.2 0.312 9.6 0.212 9.5 0.222 9.6 0.214
1975 11 8.0 0.715 7.4 0.767 6.8 0.819 6.9 0.808
1976 12 6.6 0.885 6.9 0.862 9.2 0.685 9.3 0.679
1977 10 8.3 0.600 8.4 0.593 7.5 0.678 7.7 0.659
1978 14 13.7 0.474 13.3 0.505 13.6 0.479 13.3 0.502
1979 11 14.5 0.209 12.9 0.297 11.0 0.446 10.7 0.469
1980 11 7.2 0.779 8.5 0.672 9.1 0.612 9.3 0.595
1981 14 7.5 0.915 8.4 0.866 11.1 0.680 11.3 0.660
1982 16 13.1 0.663 12.5 0.707 10.8 0.821 10.8 0.823
1983 18 17.1 0.519 16.4 0.566 15.3 0.639 15.5 0.628
1984 21 21.6 0.423 22.3 0.384 24.4 0.274 24.0 0.295
1985 21 33.2 0.044 34.6 0.031 32.6 0.051 32.7 0.049
1986 21 32.4 0.054 30.5 0.082 25.5 0.224 26.0 0.207
1987 20 15.3 0.757 16.0 0.719 17.2 0.637 17.0 0.651
1988 26 23.4 0.612 22.3 0.670 22.5 0.659 22.7 0.652
1989 34 50.7 0.033 48.6 0.050 46.6 0.074 46.1 0.080
1990 25 34.1 0.106 35.9 0.073 34.5 0.098 34.4 0.099
1991 47 51.2 0.313 50.5 0.337 44.6 0.571 44.3 0.587
1992 31 33.2 0.363 32.9 0.375 30.4 0.497 30.7 0.483
1993 44 68.8 0.010 67.6 0.013 63.1 0.031 63.3 0.030
1994 51 56.1 0.289 56.4 0.280 54.9 0.329 54.6 0.338
1995 61 89.2 0.011 86.1 0.019 82.8 0.033 81.1 0.044
1996 61 76.6 0.086 75.2 0.105 76.4 0.089 76.4 0.088
1997 66 66.4 0.462 68.3 0.399 65.9 0.482 65.7 0.486
1998 78 75.7 0.551 77.8 0.485 80.2 0.410 80.1 0.414
1999 87 99.8 0.165 98.3 0.192 92.0 0.336 92.4 0.327
2000 84 96.1 0.173 97.7 0.146 97.0 0.157 96.3 0.169
2001 91 119.7 0.024 112.0 0.067 101.8 0.207 101.6 0.210
2002 107 94.2 0.807 90.8 0.869 92.2 0.846 91.4 0.860
2003 140 158.6 0.134 156.0 0.168 147.0 0.326 146.9 0.328
2004 162 165.7 0.404 161.2 0.504 158.8 0.556 159.3 0.546
2005 175 164.3 0.708 164.0 0.714 171.4 0.562 170.3 0.587
2006 179 185.7 0.350 186.0 0.344 189.5 0.282 190.3 0.268
2007 208 197.7 0.684 204.9 0.547 212.8 0.395 212.7 0.397
2008 180 240.9 0.002 241.0 0.002 231.3 0.006 232.5 0.005
2009 183 190.1 0.344 197.7 0.216 203.3 0.145 202.4 0.156
2010 114 100.6 0.810 100.3 0.817 96.1 0.886 96.2 0.885
2011 33 25.1 0.838 28.1 0.710 30.3 0.605 30.5 0.593
2012 6 5.2 0.515 4.4 0.617 3.2 0.788 3.2 0.789
Table 6

Kruskal-Wallis statistics results for the citations case (using a random graph as a response).

Year df Degree (χ², p) Betweenness (χ², p) Closeness (χ², p) Eigenvector (χ², p)
1974 7 8.4 0.302 8.1 0.328 8.1 0.326 8.2 0.311
1975 4 10.7 0.031 10.3 0.035 8.0 0.090 7.9 0.094
1976 2 0.5 0.767 0.4 0.817 0.2 0.883 0.3 0.868
1977 6 6.2 0.397 6.2 0.406 5.9 0.432 6.3 0.390
1978 3 2.9 0.401 3.0 0.389 3.1 0.373 3.1 0.375
1979 5 5.6 0.348 5.7 0.334 4.1 0.536 4.1 0.541
1980 4 3.5 0.473 3.6 0.464 3.3 0.505 3.6 0.470
1981 6 11.2 0.083 12.0 0.061 12.0 0.063 12.1 0.059
1982 3 1.9 0.603 2.1 0.553 2.6 0.455 2.6 0.465
1983 9 7.9 0.545 8.2 0.518 9.5 0.395 9.4 0.401
1984 8 3.7 0.881 4.1 0.851 3.6 0.891 3.7 0.885
1985 10 6.3 0.793 6.2 0.796 7.3 0.695 7.1 0.714
1986 9 4.5 0.875 5.4 0.802 7.0 0.634 7.5 0.589
1987 12 9.8 0.630 10.5 0.570 20.0 0.618 10.3 0.593
1988 10 7.6 0.671 7.1 0.712 7.5 0.680 7.4 0.686
1989 14 12.2 0.591 10.8 0.699 10.2 0.746 10.4 0.735
1990 18 24.4 0.142 24.0 0.155 21.1 0.275 20.7 0.293
1991 18 16.6 0.549 16.7 0.546 17.3 0.501 17.4 0.499
1992 21 32.8 0.049 30.3 0.086 26.8 0.177 26.1 0.202
1993 25 18.9 0.801 16.2 0.907 14.2 0.958 14.0 0.961
1994 25 26.6 0.377 25.5 0.435 23.8 0.533 24.3 0.504
1995 36 23.8 0.940 20.0 0.986 19.0 0.991 18.9 0.991
1996 38 45.2 0.197 43.0 0.265 42.8 0.271 42.8 0.273
1997 46 57.9 0.112 55.0 0.171 52.1 0.250 52.3 0.244
1998 46 36.6 0.839 36.7 0.836 36.2 0.849 36.6 0.839
1999 50 40.5 0.828 45.4 0.659 49.1 0.510 49.7 0.486
2000 66 86.7 0.045 83.3 0.074 77.2 0.163 77.0 0.167
2001 76 76.3 0.469 76.8 0.453 75.0 0.512 75.3 0.502
2002 66 85.5 0.053 84.3 0.064 84.5 0.062 84.4 0.063
2003 82 95.8 0.142 91.9 0.213 87.7 0.314 88.9 0.283
2004 94 85.2 0.731 91.2 0.562 101.3 0.286 101.9 0.271
2005 109 109.0 0.482 106.3 0.556 106.5 0.550 106.3 0.555
2006 110 107.9 0.539 109.4 0.497 114.7 0.360 115.1 0.352
2007 115 148.2 0.020 156.5 0.006 154.3 0.008 154.1 0.009
2008 111 129.9 0.106 130.1 0.104 129.8 0.104 128.6 0.121
2009 147 139.6 0.656 140.0 0.646 140.2 0.641 139.8 0.651
2010 147 169.1 0.102 166.5 0.129 161.0 0.203 162.2 0.185
2011 137 161.6 0.074 162.5 0.067 163.4 0.061 164.5 0.055
2012 33 30.4 0.598 32.1 0.514 37.2 0.283 36.8 0.298

Correlation measures between authors’ centrality and their publications (citations) counts

With the aim of testing the relation between authors’ centrality in the co-authorship network and their total number of publications (citations), we calculated three well-known correlation indexes: Pearson, Spearman and Kendall. Fig. 2 shows the values of the different indexes for the different centrality measures (degree, betweenness, closeness and eigenvector centrality), splitting authors according to the year of their first publication/citation. The results reveal that the correlation between centrality and number of publications (citations) is not strong (lower than 0.5 in almost all cases), with the only exception of the authors who started their activity around 1990, when considering the correlation between their publications and degree (betweenness) centrality (Fig. 2A and Fig. 2B). This justifies the need to consider scholars’ whole timelines as the sign of their research activity.
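For reference, the three indexes can be computed as in the following standard-library sketch. The function names and the sample values are hypothetical, and the Kendall variant shown is tau-b (which adjusts for ties); the text does not state which variant was used, so that choice is an assumption.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, computed by comparing every pair of observations."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + only_y_tied) * (untied + only_x_tied))
    return (concordant - discordant) / denominator


# Hypothetical authors: degree centrality versus total number of publications.
degree = [0.010, 0.014, 0.020, 0.031, 0.045]
publications = [3, 9, 8, 40, 22]
print(f"Pearson  {pearson(degree, publications):.3f}")
print(f"Spearman {spearman(degree, publications):.3f}")
print(f"Kendall  {kendall_tau_b(degree, publications):.3f}")
```

Pearson measures linear association, whereas Spearman and Kendall depend only on the ordering of the values, which is why the three indexes can differ noticeably on skewed publication and citation counts.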
Fig 2

Correlation measures between authors’ centrality and their publications (citations) count.

(a) Pearson correlation between authors’ centrality and their publications count. (b) Spearman correlation between authors’ centrality and their publications count. (c) Kendall correlation between authors’ centrality and their publications count. (d) Pearson correlation between authors’ centrality and their citations count. (e) Spearman correlation between authors’ centrality and their citations count. (f) Kendall correlation between authors’ centrality and their citations count.

Other experiments without considering time

Finally, we ran an experiment to demonstrate the importance of classifying authors according to the year of their first publication/citation when looking for their relation with centrality in the co-authorship network. Specifically, we computed the Pearson, Spearman and Kendall correlation indexes between the number of publications/citations and the degree centrality (the measure with the best results in Fig. 2) of all the authors in the dataset (independently of the length of their careers). The results in Table 7, with correlation values within the ranges of those in Fig. 2, confirm the importance of splitting authors by the year in which their careers started.
Table 7

Correlation measures between authors’ centrality and their publications (citations) counts without splitting authors per year of their first publication (citation).

Index Publications Citations
Pearson 0.2420822 0.1973301
Spearman 0.4146143 0.3929034
Kendall 0.303451 0.2848471

Discussion

The main focus of this work is on analysing the importance of collaborations in academic success over time, taking success to be the volume of publications and of citations to these publications. That is, on analysing whether the volume of collaborations of an author, together with the relevance of his collaborators, is related to his research performance over time. We made use of two different datasets: one obtained by crawling the Google Scholar platform and the other the well-known DBLP dataset. By applying a two-step methodology based on (i) clustering scholars’ timelines to explore their temporal patterns and (ii) non-parametric statistics to establish the relation between timeline and centrality, we confirmed our hypothesis that computer scientists’ centrality in the co-authorship network is related to the patterns that their publications (citations) followed over the years. Although this relation was found for both publications and citations and independently of the centrality measure, we could not guarantee its existence (i) for scholars that started to publish/receive citations recently (very short time series) or (ii) for those that started to publish before 1979 in the case of publications (1988 in the case of citations). This relation holds for most of the period considered, which confirms our initial hypothesis that collaboration affects (and is affected by) research performance over time. It is likely that the more collaborations an author has, the higher his number of publications will be and the more attention they will receive. Moreover, the relevance of these collaborations seems to play a key role in increasing the number of publications and their visibility. With respect to future work, our analysis could be extended in three different ways.
Firstly, our mechanism for matching authors across the datasets could be improved by (i) taking into account measures of lexicographic distance between strings, such as the Levenshtein distance, and (ii) applying techniques to detect ambiguous author names in scholarly data, such as the one proposed by Sun et al. [22, 23]. At the moment, authors with the same name are considered the same person; by using the aforementioned techniques, we expect to increase the number of authors present in both datasets and, therefore, the size of the sample, besides avoiding false positives in the matching mechanism. Secondly, other network measures, such as the cross-clustering coefficient and cross-transitivity [24, 25], could be considered in the social network analysis to see whether further information can be uncovered. Finally, once the time series are classified into clusters, it could be interesting to characterise those series according to the different types of continuous dynamics that they exhibit: periodic, chaotic, and noisy periodic [26-28]. To conclude, extending our study to other disciplines besides Computer Science would give us a better understanding of the dynamics of research. Once these dynamics are discovered, the next logical step should be the development of a prediction framework that allows scholars to improve their collaboration network in order to increase their temporal success.
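To make the first proposal concrete, the following is a minimal sketch of the Levenshtein distance and of a threshold-based name comparison. The helper `same_author` and its `max_distance` threshold are hypothetical illustrations, not part of the matching mechanism used in this work.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_author(name_a, name_b, max_distance=2):
    """Hypothetical rule: treat two names as one author when their
    case-insensitive edit distance does not exceed max_distance."""
    return levenshtein(name_a.lower(), name_b.lower()) <= max_distance


# Two spellings of the same name as they might appear in different databases.
print(levenshtein("Rebeca P Díaz-Redondo", "Rebeca P. Diaz-Redondo"))
print(same_author("Rebeca P Díaz-Redondo", "Rebeca P. Diaz-Redondo"))
```

A fixed threshold like this trades missed matches against false positives, so it would need to be combined with the disambiguation techniques mentioned above rather than used alone.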

Anatomy of the Google Scholar dataset.


1.  Coauthorship networks and patterns of scientific collaboration.

Authors:  M E J Newman
Journal:  Proc Natl Acad Sci U S A       Date:  2004-01-26       Impact factor: 11.205

2.  Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.

Authors:  Peter Langfelder; Bin Zhang; Steve Horvath
Journal:  Bioinformatics       Date:  2007-11-16       Impact factor: 6.937

3.  Superfamily phenomena and motifs of networks induced from time series.

Authors:  Xiaoke Xu; Jie Zhang; Michael Small
Journal:  Proc Natl Acad Sci U S A       Date:  2008-12-08       Impact factor: 11.205

4.  Multivariate recurrence network analysis for characterizing horizontal oil-water two-phase flow.

Authors:  Zhong-Ke Gao; Xin-Wang Zhang; Ning-De Jin; Norbert Marwan; Jürgen Kurths
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2013-09-13

5.  Complex network from pseudoperiodic time series: topology versus dynamics.

Authors:  J Zhang; M Small
Journal:  Phys Rev Lett       Date:  2006-06-14       Impact factor: 9.161

6.  Career on the move: geography, stratification, and scientific impact.

Authors:  Pierre Deville; Dashun Wang; Roberta Sinatra; Chaoming Song; Vincent D Blondel; Albert-László Barabási
Journal:  Sci Rep       Date:  2014-04-24       Impact factor: 4.379

7.  Driving forces of researchers mobility.

Authors:  Floriana Gargiulo; Timoteo Carletti
Journal:  Sci Rep       Date:  2014-05-07       Impact factor: 4.379
