Statistical regularities in the rank-citation profile of scientists.

Alexander M Petersen¹, H Eugene Stanley, Sauro Succi.

Abstract

Recent science of science research shows that scientific impact measures for journals and individual articles have quantifiable regularities across both time and discipline. However, little is known about the scientific impact distribution at the scale of an individual scientist. We analyze the aggregate production and impact using the rank-citation profile c_i(r) of 200 distinguished professors and 100 assistant professors. For the entire range of paper rank r, we fit each c_i(r) to a common distribution function. Since two scientists with equivalent Hirsch h-index can have significantly different c_i(r) profiles, our results demonstrate the utility of the β_i scaling parameter in conjunction with h_i for quantifying individual publication impact. We show that the total number of citations C_i tallied from a scientist's N_i papers scales as C_i ∼ h_i^{1+β_i}. Such statistical regularities in the input-output patterns of scientists can be used as benchmarks for theoretical models of career progress.

Year: 2011 PMID: 22355696 PMCID: PMC3240955 DOI: 10.1038/srep00181

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

A scientist's career path is subject to a myriad of decisions and unforeseen events, such as Nobel Prize worthy discoveries [1], that can significantly alter an individual's career trajectory. As a result, the career path can be difficult to analyze since there are potentially many factors (individual, mentor-apprentice, institutional, coauthorship, field) [2–9] to account for in the statistical analysis of scientific panel data. The rank-citation profile, c_i(r), represents the number of citations of individual i to his/her paper r, ranked in decreasing order c_i(1) ≥ c_i(2) ≥ … ≥ c_i(N), and provides a quantitative synopsis of a given scientist's publication career. Here, we analyze the rank-ordered citation distribution c(r) for 300 scientists in order to better understand patterns of success and to characterize scientific production at the individual scale using a common framework.

At all stages of the career, the review of scientific achievement (post-doctoral selection, tenure review, award and academy selection) is becoming largely based on quantitative publication impact measures. Hence, understanding quantitative patterns in production is important for developing a transparent and unbiased review system. Interestingly, we observe statistical regularities in c(r) that are remarkably robust despite the idiosyncratic details of scientific achievement and career evolution. Furthermore, empirical regularities in scientific achievement suggest that there are fundamental social forces governing career progress [10–13].

We group the 300 scientists that we analyze into three sets of 100, referred to as datasets A, B and C, so that we can analyze and compare the complete publication careers of each individual, as well as across the three groups:

[A] 100 high-profile scientists with average h-index 〈h〉 = 61 ± 21. These scientists were selected using the citation shares metric [9] to quantify cumulative career impact in the journal Physical Review Letters (PRL).
[B] 100 additional “control” scientists with average h-index 〈h〉 = 44 ± 15.
[C] 100 current assistant professors with average h-index 〈h〉 = 14 ± 7. We selected two scientists from each of the top-50 US physics departments (departments ranked according to the magazine U.S. News).

In the methods section we describe in detail the selection procedure for datasets A, B, and C and in tables S1-S6 we provide summary statistics for each career.

There are many conceivable ways to quantify the impact of a scientist's N publications. The h-index [14] is a widely acknowledged single-number measure that serves as a proxy for production and impact simultaneously. The h-index h_i of scientist i is defined by a single point on the rank-citation profile c_i(r) satisfying the condition c_i(h_i) = h_i. To address the shortcomings of the h-index, numerous remedies have been proposed in the bibliometric sciences [15]. For example, Egghe proposed the g-index, where the most cited g papers cumulate g^2 citations overall [16], and Zhang proposed the e-index, which complements the h and g indices quantitatively [17].

To justify the importance of analyzing the entire profile c(r), consider a scientist i = 1 with rank-citation profile c_1(r) ≡ [100, 50, 33, 25, 20, 16, 14, 12, 11, 10, 9…] and a scientist i = 2 with c_2(r) ≡ [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9…]. Both scientists have the same h-index value h = 10, although scientist 1 tallies 2.9 times as many citations as scientist 2 from his/her most-cited 10 papers (291 versus 100). Hence, an additional parameter β is necessary in order to distinguish these two example careers. Specifically, the β parameter quantifies the scaling slope in c(r) for the high-rank papers corresponding to small r values. In this simple illustration, β_1 ≈ 1 while β_2 ≈ 0.

In Fig. 1 we plot c(r) for 5 extremely high-impact scientists. The individuals EW, ACG, MLC, and PWA are physicists with the largest h values in our data set; BV is a prolific molecular biologist whom we include in this graphical illustration in order to demonstrate the generality of the statistical regularity we find, which likely exists across disciplines. However, citation and h-index metrics should not be compared across disciplines since baseline publication and citation rates can vary significantly between research fields [8, 9]. To demonstrate that the single point c(h) is an arbitrary point along the c(r) curve, we also plot the lines H_p(r) ≡ p r for 5 values of p = {1, 2, 5, 20, 80}. The value p ≡ 1 recovers the h-index h_1 = h proposed by Hirsch. The intersection of any given line H_p(r) with c(r) corresponds to the “generalized h-index” h_p, proposed in [18] and further analyzed in [19], with the relation h_p ≤ h_q for p > q. Since the value p ≡ 1 is chosen somewhat arbitrarily, we take an alternative approach, which is to quantify the entire c(r) profile at once (which is also equivalent to knowing the entire h_p spectrum). Surprisingly, because we find regularity in the functional form of c(r) for all 300 scientists analyzed, we can relate the relative impact of a scientist's publication career using the small set of parameters that specify the c(r) profile for the entire set of papers ranging from rank r = 1…N. Using a much smaller parameter space than the h_p spectrum, we can begin to analyze the statistical regularities in the career accomplishments of scientists.
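The two example profiles above can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `generalized_h_index` is hypothetical, and the two lists are truncated at the eleven values quoted in the text (the source profiles continue beyond the "…").

```python
def generalized_h_index(citations, p=1):
    """Largest rank r whose paper has at least p*r citations.

    p = 1 gives the Hirsch h-index; other p give the generalized h_p,
    i.e. the intersection of c(r) with the line H_p(r) = p*r.
    """
    ranked = sorted(citations, reverse=True)  # rank-citation profile c(r)
    h = 0
    for r, c in enumerate(ranked, start=1):
        if c >= p * r:
            h = r
        else:
            break  # c(r) is non-increasing and p*r grows, so we can stop
    return h


# Truncated example profiles from the text (assumption: only 11 papers each).
c1 = [100, 50, 33, 25, 20, 16, 14, 12, 11, 10, 9]
c2 = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9]

h1 = generalized_h_index(c1)
h2 = generalized_h_index(c2)
top10_ratio = sum(c1[:10]) / sum(c2[:10])

print("h-index:", h1, h2)
print("h_2:", generalized_h_index(c1, p=2), generalized_h_index(c2, p=2))
print("top-10 citation ratio:", round(top10_ratio, 2))
```

Both profiles share h = 10, yet the generalized index at p = 2 and the top-10 citation totals already separate them, which is the motivation for describing the whole profile rather than one point on it.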

Figure 1

The citation distribution of individual scientists is heavy-tailed.

We show 5 empirical rank-citation profiles c(r), each belonging to an extremely high-impact scientist (E. Witten, A. C. Gossard, M. L. Cohen, P. W. Anderson and B. Vogelstein) whose initials and h-index as of Jan. 2010 are listed in the figure legend. The hierarchical scaling pattern in c(r) for small r values indicates that the pillar contributions of top scientists are “off-the-charts” since they have no characteristic scale. Put in the framework of the citation distribution, consider the probability distribution P(c) of the citation impact c calculated for an individual's N papers. If P(c) is heavy tailed with asymptotic power-law scaling P(c) ∼ c^{−ζ}, then ζ = 1 + 1/β. Ref. [9] calculates ζ ≈ 3, corresponding to β = 1/2, using the entire set of citations for papers from six individual journals. Hence, the citation impact of stellar scientists can be significantly more skewed than the aggregate population. This statistical regularity demonstrates the utility of the β scaling exponent in characterizing the highly cited papers of a given scientist i. Interestingly, each scientist has coauthored a significant number of papers of significantly lower impact than their c(1) pillar paper. The c(r) distributions show significant variability in both the high-rank (β) and low-rank (γ) regimes. Moreover, for c(r) with similar h values, the h-index (a single point on each curve) is insufficient to adequately distinguish career profiles. The solid curves are the best-fit DGBD functions (see Eq. 3) for each corresponding c(r) over the entire rank range in each case. The intersection of c(r) with the line H_p(r) corresponds to the generalized h-index h_p; together, these intersections uniquely quantify the c(r) profile. Five H_p(r) lines are provided for reference, with p = {1, 2, 5, 20, 80}.

The aim of this analysis is not to add another level of scrutiny to the review of scientific careers, but rather, to highlight the regularities across careers and to seed further exploration into the mechanisms that underlie career success. The aim of this brand of quantitative social science is to utilize the vast amount of information available to develop an academic framework that is sustainable, efficient and fruitful. Young scientific careers are like “startup” companies that need appropriate venture funding to support the career trajectory through lows as well as highs [13].

Results

A Quantitative Model for c(r)

For each scientist i, we find that c_i(r) can be approximated by a scaling regime for small r values, followed by a truncated scaling regime for large r values. Recently a novel distribution, the discrete generalized beta distribution (DGBD), has been proposed as a model for rank profiles in the social and natural sciences that exhibit such truncated scaling behavior [20, 21]. The DGBD is defined as c_i(r) ≡ A_i r^{−β_i} (N_i + 1 − r)^{γ_i} (Eq. 3). The parameters A_i, β_i, γ_i and N_i are each defined for a given c_i(r) corresponding to an individual scientist i; however, we suppress the index i in some equations to keep the notation concise. We estimate the two scaling parameters β and γ using Mathematica software to perform a multiple linear regression of ln c(r) = ln A − β ln r + γ ln(N + 1 − r) in the base functions ln r and ln(N + 1 − r). In our fitting procedure we replace N with r1, the largest value of r for which c(r) ≥ 1 (we find that r1/N ≈ 0.84 ± 0.01 for careers in datasets A and B). Figs. 1 and 2 demonstrate the utility of the DGBD to represent c(r), for both large and small r. The regression correlation coefficient R > 0.97 for all ln c(r) profiles analyzed.
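The regression described above can be sketched in pure Python. This is an illustration under stated assumptions rather than the authors' Mathematica procedure: the helper names (`log_spaced_ranks`, `solve_3x3`, `fit_dgbd`) are hypothetical, the least-squares problem is solved through the 3×3 normal equations, and the 20 log-spaced ranks are rounded to integers and de-duplicated, which is one possible reading of the sampling step.

```python
import math


def log_spaced_ranks(r1, n_points=20):
    """Integer ranks in [1, r1], roughly equally spaced on a log scale."""
    ranks = {round(r1 ** (k / (n_points - 1))) for k in range(n_points)}
    return sorted(ranks)


def solve_3x3(m, v):
    """Solve the linear system m x = v (Gaussian elimination, partial pivoting)."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= f * a[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def fit_dgbd(citations, n_points=20):
    """Fit ln c(r) = ln A - beta ln r + gamma ln(r1 + 1 - r) by least squares."""
    ranked = sorted(citations, reverse=True)
    # r1: largest rank with at least one citation (replaces N in the fit).
    r1 = max(r for r, c in enumerate(ranked, start=1) if c >= 1)
    rows, ys = [], []
    for r in log_spaced_ranks(r1, n_points):
        rows.append([1.0, math.log(r), math.log(r1 + 1 - r)])
        ys.append(math.log(ranked[r - 1]))
    # Normal equations (X^T X) b = X^T y for the three coefficients.
    m = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    v = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(3)]
    ln_a, b_ln_r, b_ln_tail = solve_3x3(m, v)
    return {"A": math.exp(ln_a), "beta": -b_ln_r, "gamma": b_ln_tail, "r1": r1}


# Synthetic, noise-free DGBD profile (parameter values chosen for illustration).
N, A, BETA, GAMMA = 200, 500.0, 0.8, 0.6
profile = [A * r ** -BETA * (N + 1 - r) ** GAMMA for r in range(1, N + 1)]
fit = fit_dgbd(profile)
print(fit)
```

On noise-free synthetic data the fit should return the generating parameters up to rounding error; real profiles contain integer citation counts and ties, so the recovered values will be approximate.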

Figure 2

Data collapse of each c(r) along a universal curve.

A comparison of 100 rank-citation profiles c(r) demonstrates the statistical regularity in career publication output. Each scientist produces a cascade of papers of varying impact between the c(1) pillar paper down to the least-known paper c(N). (a) Zipf rank-citation profiles c(r) for 100 scientists listed in dataset [A]. For reference, we plot the average of these 100 curves and find that it scales as r^{−β} with β = 0.92 ± 0.01. The solid green line is a least-squares fit to this average curve over the range 1 ≤ r ≤ 100. We also plot the H_2(r) and H_80(r) lines for reference. (b) We re-scale the curves in panel (a), plotting c(r′) ≡ c(r)/[A (r1 + 1 − r)^γ], where we use the best-fit γ and A parameter values for each individual c(r) profile. Using the rescaled rank value r′ ≡ r^β, we show excellent data collapse onto the expected curve c(r′) = 1/r′ (see Figs. S1 and S2 for analogous plots for dataset [B] and [C] scientists). Green data points correspond to the average c(r′) value with 1σ error bars calculated using all 100 c(r′) curves separated into logarithmically spaced bins.

The DGBD proposed in [20] is an improvement over the Zipf law (also called the generalized power-law or Lotka-law [22]) model and the stretched exponential model [14] since it reproduces the varying curvature in c(r) for both small and large r. Typically, an exponential cutoff is imposed in the power-law model, and justified as a finite-size effect. The DGBD does not require this assumption, but rather introduces a second scaling exponent γ which controls the curvature in c(r) for large r values. The DGBD has been successfully used to model numerous rank-ordering profiles analyzed in [20, 21] which arise in the natural and socio-economic sciences. The relative values of the β and γ exponents are thought to capture two distinct mechanisms that contribute to the evolution of c(r) [20, 21]. Due to the data limitations in this study, we are not able to study the dynamics in c(r) through time. Each c(r) is a “snapshot” in time, and so we can only conjecture on the evolution of c(r) throughout the career. Nevertheless, we believe that there is likely a positive feedback effect between the “heavy-weight” papers and “newborn” papers, whereby the reputation of the “heavy-weight” papers can increase the exposure, and affect the perceived significance, of “newborn” papers during their infant phase. Moreover, the 2-regime power-law behavior of c(r) suggests that the reinforcement dynamics can be quantified by the scale-free parameters β and γ. The β value determines the relative change in the c(r) values for the high-rank papers, and thus it can be used to further distinguish the careers of two scientists with the same h-index. In particular, smaller β values characterize flat profiles with relatively low contrast between the high and low-rank regions of any given profile, while larger β values indicate a sharper separation between the two regions.

In Fig. 2(a) we plot c(r) for each scientist from dataset [A] as well as the average of the 100 individual curves (see Figs. S1 and S2 for analogous plots for datasets [B] and [C]). We find robust power-law scaling for 10^0 ≤ r ≤ 10^2. The scaling value calculated for other rank-size (Zipf) distributions in the social and economic sciences is typically around unity, β ≈ 1, for example in studies of word frequency [23] and city size [20, 21, 24]. Here we calculate β for each individual author and observe a distribution which is centered around characteristic values 〈β〉 = 0.83 ± 0.23 [A], 〈β〉 = 0.70 ± 0.16 [B], 〈β〉 = 0.79 ± 0.38 [C]. We calculate each β value using a multilinear least-squares regression of ln c(r) for 1 ≤ r ≤ r1 using the DGBD model defined in Eq. [3]. To properly weight the data points for better regression fit over the entire range, we use only 20 values of c(r) data points that are equally spaced on the logarithmic scale in the range r ∈ [1, r1]. We elaborate the details of this fitting technique in the methods section. We plot five empirical c(r) along with their corresponding best-fit DGBD functions in Fig. 1 to demonstrate the goodness of fit for the entire range of r.

In order to demonstrate the common functional form of the DGBD model, we collapse each c(r) along a universal scaling function c(r′) = 1/r′, by using the rescaled rank values r′ ≡ r^β defined for each curve. In Figs. 2(b), S1(b) and S2(b), we plot the quantity c(r′) ≡ c(r)/[A (r1 + 1 − r)^γ], using the best-fit γ and A parameter values for each individual c(r) profile. While the curves in Fig. 2(a) are jumbled and distributed over a large range of c(r) values, the rescaled c(r) curves in Fig. 2(b) all lie approximately along the predicted curve c(r′) = 1/r′.
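The data collapse follows in one step from the fitted form of the profile. The short derivation below is a sketch that assumes the DGBD regression form quoted earlier, with N replaced by r1 as in the fitting procedure:

```latex
\begin{align}
c(r) &= A\, r^{-\beta}\,(r_1 + 1 - r)^{\gamma}, \\
c(r') &\equiv \frac{c(r)}{A\,(r_1 + 1 - r)^{\gamma}} = r^{-\beta}, \\
r' &\equiv r^{\beta} \quad\Longrightarrow\quad c(r') = \frac{1}{r'} .
\end{align}
```

Dividing out the γ regime and the amplitude A removes the scientist-specific scale, and the change of variable r′ = r^β removes the scientist-specific slope, so every well-fit profile should fall on the same curve 1/r′.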

Using c(r) to quantify career production and impact

A main advantage of the h-index is the simplicity with which it is calculated, e.g. ISI Web of Knowledge [25] readily provides this quantity online for distinct authors. Another strength of the h-index is its stable growth with respect to changes in c(r) due to time and information-dependent factors [26]. Indeed, the h-index is a “fixed-point” of the citation profile. This time stability is evident in the observed growth rates of h for scientists. Average growth rates, calculated here as h/L, where L is the duration in years between a given author's first and most recent paper, typically lie in the range of one to three units per year (this annual growth rate corresponds to the quantity m introduced by Hirsch [14]). Annual growth rates h/L ≈ 3 correspond to exceptional scientists (for the histogram of P(h/L) see Fig. S3 and for h/L values see the SI text (Tables S1–S6)). As a result, h/L is a good predictor for future achievement along with h [27].

It is truly remarkable how a single number, h, correlates with other measures of impact. Understandably, being just a single number, the h-index cannot fully account for other factors, such as variations in citation standards and coauthorship patterns across disciplines [28–30], nor can h incorporate the full information contained in the entire c(r) profile. As a result, it is widely appreciated that the h-index can underrate the value of the best-cited papers, since once a paper transitions into the region r ≤ h, its citation record is discounted, until other less-cited papers with r > h eventually overcome the rank “barrier” r = h. Moreover, as noted in [14], the papers for which r > h do not contribute any additional credit.

Instead of choosing an arbitrary h_p as a productivity-impact indicator, we use the analytic properties of the DGBD to calculate a crossover value r*. In the methods section, we derive an exact expression for r*, which highlights the distinguished papers of a given author. To calculate r*, we use the logarithmic derivative χ(r) ≡ d ln c(r)/dr to quantify the relative change in c(r) with increasing r. We define papers as “distinguished” if they fall in the peak region r ≤ r*, whose boundary is set by comparing χ(r) with χ̄, the average value of χ(r) over the entire range of r values. This criterion selects the peak papers which are significantly more cited than their neighbors. The peak region corresponds to a “knee” in c(r) when plotted on log-linear axes. The dependence of r* and c* ≡ c(r*) on the three DGBD parameters β, γ and N is provided in the methods section. The advantage of r* is that this characteristic rank value is a comprehensive representation of the stellar papers in the high-rank scaling regime since it depends on the DGBD parameter values β, γ and N, and thus probes the entire citation profile. Fig. 3 shows a scatter plot of the “c-star” value c* and the h value calculated for each scientist and demonstrates that there is a non-trivial relation between these two single-value indices. It also shows that for scientists within a small range of c* there is a large variation in the corresponding h values, in some cases straddling across all three sets of scientists. Also, there are several c* values which significantly deviate from the trend in Fig. 3, which is plotted on log-log axes. These results reflect the fact that the h-index cannot completely incorporate the entire c(r) profile. We plot the histograms of r* and c* values in Figs. S4 and S5.

Figure 3

Limitations to the use of the h-index alone.

The h-index can be insufficient in comprehensively representing c(r). (a) The h-index does not contain any information about c(r) for r < h, and can shield a scientist's most successful accomplishments, which are the basis for much of a scientist's reputation. This is evident in the cases where c* ≫ h, in which case the h-index cannot account for the stellar impact of the most-cited papers. (b) For a given h value, prolific careers are characterized by a large β value, as it is harder to maintain large β values for large h. As a result, the β vs h parameter space can be used to identify anomalous careers and to better compare two scientists with similar h indices. We find that a third career metric C, the total number of citations to the papers of author i, can be calculated with high accuracy by the scaling relation C ∼ h^{1+β}, which we illustrate in Fig. 4(b).

To further contrast the values of c* and the h-index, we propose the “peak indicator” ratio Λ ≡ c*/h, which corrects specifically for the h-index penalty on the stellar papers in the peak region of c(r). Thus, all papers in the peak region of c(r) satisfy the condition c(r) ≥ hΛ. In an extreme example, R. P. Feynman has a peak value Λ ≈ 36, indicating that his best papers are monumental pillars with respect to his other papers which contribute to his h-index. Fig. S6 shows the histogram of Λ values, with typical values for dataset [A] scientists 〈Λ〉 ≈ 3.4 ± 3.9, and for dataset [B] scientists 〈Λ〉 ≈ 2.2 ± 1.1. This indicator can only be used to compare scientists with similar h values, since a small h can result in a large Λ.

An alternative “single number” indicator is C, an author's total number of citations, which incorporates the entire c(r) profile. However, it has been shown that √C correlates well with h [31], a result which we will demonstrate in Eq. [6] to follow directly from a c(r) with β ≈ 1. We test the aggregate properties of c(r) by calculating the aggregate number of citations C for a given profile, C ≡ Σ_{r=1}^{N} c(r) (Eq. [5]), which for power-law scaling c(r) ≈ A r^{−β} is approximated by C ≈ A Σ_{r=1}^{N′} r^{−β} = A H_{N′,β} ≈ h^{1+β} H_{3h,β} (Eq. [6]), where H_{N′,β} is the generalized harmonic number and is of order O(1) for β ≈ 1. We neglect the γ scaling regime since the low-rank papers do not significantly contribute to an author's C tally. We approximate the coefficient A in Eq. [6] using the definition c(h) ≡ h, which implies that A/h^β ≈ h. We use the value N′ ≡ 3h, so that C can be approximated by only the two parameters h and β for any given author. We justify this choice of N′ by examining the rescaled c(r/h), which we consider to be negligible beyond rank r = 3h for most scientists. In Fig. 4(a), we plot for each scientist the predicted C value versus the empirical C value, and we find excellent agreement with our theoretical prediction given by Eq. [6]. In Fig. 4(b), we plot for each scientist the total number of citations obtained when the best-fit DGBD model c(r; β, γ, A, r1) is used to approximate c(r). The excellent agreement demonstrates that the fluctuations in the residual difference between the empirical c(r) and the best-fit model cancel out on the aggregate level. Furthermore, a comparison of the quality of agreement between the theoretical C values and the empirical C values in Fig. 4(a) and (b) shows the importance of the additional γ scaling regime in the DGBD model.
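The two-parameter estimate of C can be illustrated numerically. The sketch below assumes a pure power-law profile c(r) = A r^{−β} with A fixed by c(h) = h and truncated at N′ = 3h, as described above; the function names are hypothetical and the values h = 10, β = 1 are chosen only for illustration.

```python
def generalized_harmonic(n, beta):
    """Generalized harmonic number H_{n,beta} = sum_{r=1}^{n} r^(-beta)."""
    return sum(r ** (-beta) for r in range(1, n + 1))


def predicted_total_citations(h, beta):
    """Estimate C from h and beta alone: C ~ h^(1+beta) * H_{3h,beta}."""
    n_prime = 3 * h  # ranks beyond 3h are treated as negligible
    return h ** (1 + beta) * generalized_harmonic(n_prime, beta)


# Idealized profile: pure power law with amplitude A = h^(1+beta), so c(h) = h.
h, beta = 10, 1.0
profile = [h ** (1 + beta) / r ** beta for r in range(1, 3 * h + 1)]

empirical = sum(profile)                      # direct tally of citations
predicted = predicted_total_citations(h, beta)

print("c(h) =", profile[h - 1])
print("direct sum:", round(empirical, 2), "estimate:", round(predicted, 2))
```

For this idealized profile the estimate and the direct sum coincide by construction; for an empirical profile the two differ by whatever the γ regime and the deviations from pure scaling contribute.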

Figure 4

Aggregate publication impact C.

The total number of citations C is also a comprehensive productivity-impact measure. For most best-fit DGBD model curves, the C value is preserved with high precision. This shows that the differences between a given c(r) and the corresponding best-fit DGBD model function are negligible on the macroscopic scale. (a) The exact aggregate number of citations C, calculated from c(r) using Eq. [5], can be analytically approximated by using Eq. [6], which depends only on the scientist's β and h values. (b) We justify the use of the DGBD model defined in Eq. [3] for the approximation of c(r) by comparing the aggregate citations C with the expected aggregate citations calculated from the best-fit DGBD model. Including the extra scaling parameter, as in the DGBD model, improves the agreement between the theoretical and empirical C values from (a) to (b). We plot the line y = x (dashed-green line) for visual reference.

Discussion

We use the DGBD model to provide an analytic description of c(r) over the entire range of r, and provide a deeper quantitative understanding of scientific impact arising from an author's career publication works. The DGBD model exhibits scaling behavior for both large and small r, where the scaling for small r is quantified by the exponent β, which, for many of the scientists analyzed, can be approximated using only two values of the generalized h-index h_p (see SI text). In particular, we show that for a given h value, a larger β value corresponds to a more prolific publication career, since C ∼ h^{1+β}. Many studies analyze only the high rank values of generic Zipf ranking profiles c(r), e.g. computing the scaling regime only below some rank cutoff. However, these studies cannot quantitatively relate the large observations to the small observations within the system of interest. To account for this shortcoming, our method for calculating the characteristic crossover values, including r* and r×, which we elaborate in the methods section, can be used in general to quantitatively distinguish relatively large observations and relatively small observations within the entire set of observations. Moreover, the DGBD model has been shown to have wide application in quantifying the Zipf rank profiles in various phenomena [21].

To measure the upward mobility of a scientist's career, in the SI text we address the question: given that a scientist has index h, what is her/his most likely h-index value Δt years in the future? In consideration of the bulk of c(r), and following from the regularity of c(r) for r ≈ h, we propose a model-free gap-index G(Δh) as both an estimate and a target for future achievement which can be used in the review of career advancement. The gap index G(Δh), defined as a proxy for the total number of citations a scientist needs to reach a target value h + Δh, can detect the potential for fast h-index growth by quantifying c(r) around h. This estimator differs from other estimators for the time-dependent h-index [33–35] in that G(Δh) is model independent.

Even though the productivity of scientists can vary substantially [9, 36–39], and despite the complexity of success in academia, we find remarkable statistical regularity in the functional form of c(r) for the scientists analyzed here from the physics community. Recent work in [8, 9, 40] calculates the citation distributions of papers from various disciplines and shows that proper normalization of impact measures can allow for comparison across time and discipline. Hence, it is likely that the publication careers of productive scientists in many disciplines obey the statistical regularities observed here for the set of 300 physicists.

Towards developing a model for career evolution, it is still unclear how the relative strengths of two contributing factors, (i) the extrinsic cumulative advantage effect239 versus (ii) the intrinsic role of the “sacred spark” in combination with intellectual genius [37], manifest in the parameters of the DGBD model. With little calculation, the β metric developed here, used in conjunction with h, can better answer the question, “How popular are your papers?” [41]. Since the cumulative impact and productivity of individual scientists are also found to obey statistical laws [9, 11], it is possible that the competitive nature of scientific advancement can be quantified and utilized in order to monitor career progress. Interestingly, there is strong evidence for a governing mechanism of career progress based on cumulative advantage [9, 11, 42] coupled with the inherent talent of an individual, which results in statistical regularities in the career achievements of scientists as well as professional athletes [11, 43, 44]. Hence, whenever data are available [45, 46], finding statistical regularities emerging from human endeavors is a first step towards better understanding the dynamics of human productivity.
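The precise definition of G(Δh) is given in the SI text and is not reproduced in this section. The sketch below shows one plausible reading, stated here as an assumption: the minimum number of additional citations needed so that the top h + Δh papers each have at least h + Δh citations. The function names and the truncated example profiles are illustrative only.

```python
def h_index(citations):
    """Hirsch h-index: largest rank r with c(r) >= r."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for r, c in enumerate(ranked, start=1) if c >= r)


def gap_index(citations, delta_h=1):
    """ASSUMED definition: citations still missing to reach h + delta_h.

    Sums the shortfall (target - c(r)) over the top `target` papers,
    padding with uncited papers if the author has fewer than `target`.
    """
    ranked = sorted(citations, reverse=True)
    target = h_index(ranked) + delta_h
    top = ranked[:target] + [0] * max(0, target - len(ranked))
    return sum(max(0, target - c) for c in top)


# The two illustrative careers from the introduction (truncated at 11 papers).
c1 = [100, 50, 33, 25, 20, 16, 14, 12, 11, 10, 9]
c2 = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9]

print("gap to h + 1:", gap_index(c1), gap_index(c2))
```

Under this reading the two careers with equal h = 10 sit at very different distances from h = 11, because the flat profile has many papers exactly at the threshold, which is the kind of information around r ≈ h that the text says the gap index is meant to capture.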

Methods

Selection of scientists and data collection

We use disambiguated “distinct author” data from ISI Web of Knowledge. This online database is host to comprehensive data that is well-suited for developing testable models for scientific impact [9, 32, 40] and career progress [11]. In order to approximately control for discipline-specific publication and citation factors, we analyze 300 scientists from the field of physics. We aggregate all authors who published in Physical Review Letters (PRL) over the 50-year period 1958–2008 into a common dataset. From this dataset, we rank the scientists using the citation shares metric defined in [9]. This citation shares metric divides equally the total number of citations a paper receives among the n coauthors, and also normalizes the total number of citations by a time-dependent factor to account for citation variations across time and discipline. Hence, for each scientist in the PRL database, we calculate a cumulative number of citation shares received from only their PRL publications. This tally serves as a proxy for his/her scientific impact in all journals.

The top 100 scientists according to this citation shares metric comprise dataset [A]. As a control, we also choose 100 other dataset [B] scientists, approximately randomly, from our ranked PRL list. The selection criterion for the control dataset [B] group is that an author must have published between 10 and 50 papers in PRL. This likely ensures that the total publication history, in all journals, is on the order of 100 articles for each author selected. We compare the tenured scientists in datasets A and B with 100 relatively young assistant professors in dataset [C]. To select dataset [C] scientists, we chose two assistant professors from each of the top 50 U.S. physics and astronomy departments (ranked according to the magazine U.S. News).

For privacy reasons, we provide in the SI tables only the abbreviated initials for each scientist's name (last name initial, first and middle name initial, e.g. L, FM). Upon request we can provide full names. We downloaded datasets A and B from ISI Web of Science in Jan. 2010 and dataset C from ISI Web of Science in Oct. 2010. We used the “Distinct Author Sets” function provided by ISI in order to increase the likelihood that only papers published by each given author are analyzed. On a case by case basis, we performed further author disambiguation for each author.

Statistical significance tests for the c(r) DGBD model

We test the statistical significance of the DGBD model fit using the χ^2 test between the 3-parameter best-fit DGBD model and the empirical c(r). We calculate the p-value for the χ^2 distribution with r1 − 3 degrees of freedom and find, for each data set, the number of c(r) with p-value [A], 19 [B], 22 [C] for p = 0.05, and 8 [A], 22 [B], 37 [C] for p = 0.01. The significant number of c(r) which do not pass the χ^2 test at p = 0.05 results from the fact that the DGBD is a scaling function over several orders of magnitude in both r and c(r) values, and so the residual differences between the empirical c(r) and the best-fit model are not expected to be normally distributed, since there is no characteristic scale for scaling functions such as the DGBD. Nevertheless, the fact that so many c(r) do pass the χ^2 test at such a high significance level provides evidence for the quality-of-fit of the DGBD model. For comparison, none of the c(r) pass the χ^2 test using the power-law model at the p = 0.05 significance level. In the next section, we will also compare the macroscopic agreement in the total number of citations for each scientist and the total number of citations predicted by the DGBD model for each scientist, and find excellent agreement.

Derivation of the characteristic DGBD r values

Here we use the analytic properties of the DGBD defined in Eq. [3] to calculate the special r values from the parameters β, γ and N which locate the two tail regimes of c(r), and in particular, the distinguished paper regime. The scaling features of the DGBD do not readily convey any characteristic scales which distinguish the two scaling regimes. Instead, we use the properties of ln c(r) to characterize the crossover between the high-rank and the low-rank regimes of c(r). We begin by considering c(r) under the centered rank transformation z = r − z0, where z0 = (N + 1)/2, so that c(z) = A (z0 + z)^{−β} (z0 − z)^γ in the domain z ∈ [−(z0 − 1), (z0 − 1)]. The logarithmic derivative of c(z) expresses the relative change in c(z), χ(z) ≡ d ln c(z)/dz = −(1/z0) [β/(1 + x) + γ/(1 − x)], where x = z/z0. The extreme values of χ(z) and its average value χ̄ over the domain follow from this expression. The function χ(z) takes on the value χ̄ twice, at the values of z corresponding to the two solutions of a quadratic equation in x. Converting back to rank, the smaller of the two solutions gives r*, the special rank value which distinguishes the set of excellent papers of each given author. The c-star value c* ≡ c(r*) is thus a characteristic value arising from the special analytic properties of c(r). This method for determining the crossover value r* can be applied to any general rank order profile which can be modeled by the DGBD. Furthermore, the crossover z× between the β scaling regime and the γ scaling regime is calculated from the inflection points of ln c(z), which has 2 solutions, z = z0 (√β ∓ √γ)/(√β ± √γ). Only the solution z× = z0 (√β − √γ)/(√β + √γ) lies inside the domain and is a physical solution. Transforming back to rank values, we find r× = z× + z0 = (N + 1) √β/(√β + √γ). We illustrate these special z values in Fig. 5.
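The characteristic ranks can be evaluated numerically from the DGBD form ln c(r) = ln A − β ln r + γ ln(N + 1 − r). The sketch below rests on one explicit assumption: that χ̄ is the uniform average of χ(z) over the domain z ∈ [−(z0 − 1), z0 − 1], which integrates to [ln c(N) − ln c(1)]/(N − 1) = −(β + γ) ln N/(N − 1). The source's own expression for χ̄ is not recoverable from this text, so the numbers are illustrative; all function names are hypothetical.

```python
import math


def chi(z, beta, gamma, z0):
    """Logarithmic derivative d ln c / dz of the centered DGBD profile."""
    return -beta / (z0 + z) - gamma / (z0 - z)


def chi_average(beta, gamma, n):
    """ASSUMPTION: uniform average of chi over z in [-(z0-1), z0-1]."""
    return -(beta + gamma) * math.log(n) / (n - 1)


def chi_crossings(beta, gamma, n):
    """Centered ranks z where chi(z) equals its average (two solutions)."""
    z0 = (n + 1) / 2
    k = chi_average(beta, gamma, n) * z0
    # With x = z/z0, chi(z) = chi_bar reduces to
    #   -k x^2 + (gamma - beta) x + (k + beta + gamma) = 0.
    a, b, c = -k, gamma - beta, k + beta + gamma
    disc = math.sqrt(b * b - 4 * a * c)
    xs = sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])
    return [x * z0 for x in xs]


def inflection_rank(beta, gamma, n):
    """Rank of the inflection point of ln c(z) inside the domain."""
    return (n + 1) * math.sqrt(beta) / (math.sqrt(beta) + math.sqrt(gamma))


# Average dataset [A] parameters quoted in the Fig. 5 caption.
N, BETA, GAMMA = 278, 0.83, 0.67
Z0 = (N + 1) / 2

z_low, z_high = chi_crossings(BETA, GAMMA, N)
r_star = z_low + Z0                      # boundary of the peak-paper regime
r_cross = inflection_rank(BETA, GAMMA, N)

print("r* =", round(r_star, 1), " r_x =", round(r_cross, 1))
```

With these parameters and the assumed average, the peak regime ends at a rank of roughly 30 and the crossover between the two scaling regimes sits at a rank of roughly 147; a different definition of χ̄ would shift r* but not the inflection point.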

Figure 5

Characteristic properties of the DGBD.

We graphically illustrate the derivation of the characteristic c(r) crossover values that locate the two tail regimes of c(r), in particular, the distinguished “peak” paper regime corresponding to paper ranks r ≤ r* (shaded region). The crossover between two scaling regimes suggests a complex reinforcement relation between the impact of a scientist's most famous papers and the impact of his/her other papers. (a) The c(r) plotted on log-log axes with N = 278, β = 0.83 and γ = 0.67, corresponding to the average values of the Dataset [A] scientists. The hatched magenta curve is the H_1(z) line on the log-linear scale with corresponding h-index value h = 104. The r* value for c(r) is not visibly obvious. (b) We plot on log-linear axes the centered citation profile c(z) (solid black curve) given by the symmetric rank transformation z = r − z0 in Eq. [7]. This representation better highlights the peak paper regime, but fails to highlight the power-law β scaling. (c) We plot the corresponding logarithmic derivative χ(z) of c(z) (solid black curve), which represents the relative change in c(z). The dashed red line corresponds to χ(z) = χ̄, where χ̄ is the average value of χ(z) given by Eq. [12]. The crossing values, indicated by the solid vertical green lines, are defined as the intersection of χ̄ with χ(z) given by Eq. [13]. The regime r ≤ r* corresponds to the best papers of a given author. The hatched blue line corresponds to z×, which marks the crossover between the β and γ scaling regimes.

Author Contributions

A. M. P., H. E. S., & S. S. designed research, performed research, wrote, reviewed and approved the manuscript. A. M. P. performed the numerical and statistical analysis of the data.

