Abstract
BACKGROUND: The dynamic, decentralized World Wide Web has become an essential part of scientific research and communication. Researchers create thousands of web sites every year to share software, data and services. These valuable resources tend to disappear over time. The problem has been documented in many subject areas. Our goal is to conduct a cross-disciplinary investigation of the problem and test the effectiveness of existing remedies.
Year: 2013 PMID: 24266891 PMCID: PMC3851533 DOI: 10.1186/1471-2105-14-S14-S5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Growth of scholarly online resources. Not only is the number of URL-containing articles (those with "http" in the title or abstract) published per year increasing (dotted line), but so is the percentage of published items containing URLs (solid line). According to a linear fit, the annual increase in articles was 174 (R² = 0.97). The linear trend for the percentage was an increase of 0.010% per year (R² = 0.98). Source: Thomson Reuters Web of Science.
Link decay has been studied for several years in specific subject areas.
| Field | Links Source/Type | Year(s) of URLs | N | Reference |
|---|---|---|---|---|
| Biology & Medicine | Science curriculum web links | 2000 | 515 | [ |
| | Full text of 3 dermatology journals | 1999-2004 | 1113 | [ |
| | Sample of bibliographies being published on PubMed | 2006 | 840 | [ |
| | References made in the | 2000, 2003, 2005 | 586 | [ |
| | References in 5 biomedical informatics journals. | 1999-2004 | 1049 | [ |
| | MEDLINE titles & abstracts | 1994-2006 | 10208 | [ |
| | Internet citations in 5 health care management journals from 2002-2004 | 2009-2010 | 2011 | [ |
| | MEDLINE abstracts | 1995-2007 | 7462 | [ |
| Communications | Citations appearing in research articles in 6 leading communications journals | 2000-2003 | 1600 | [ |
| Ecology | URLs appearing in the full text of 4 Ecological Society of America journals | 1997-2005 | 2100 | [ |
| Law | Samples from a collection of born-digital law- and policy-related reports and documents | 2007-2010 | 2372 | [ |
| Library/Information Science | Citations appearing in 3 leading Information Science journals | 1997-2003 | 2516 | [ |
| | Sample of citations appearing in library and information science journals | 1999-2000 | 500 | [ |
| Social Sciences | URLs appearing in the full text of 2 well-respected historical journals | 1999-2006 | 510 | [ |
| | Citations from articles in the Chinese Social Sciences Index | 1998-2007 | 44973 | [ |
| Various | Random Collection of web URLs | 1996 | 371 | [ |
| Various | Citations in 3 highly circulated journals | 2002-2003 | 672 | [ |
| Various | Supplementary information published in 6 top-cited journals | 2000, 2003 | 585 | [ |
| Various | Citations from conference articles | 1995-2003 | 1068 | [ |
| Various Collections | [ | |||
* denotes studies most similar to the current.
Figure 2. The accessibility of URLs from a particular year is closely correlated with age. The probability of being available (solid line) declines by 3.7% every year based on a linear model (R² = 0.96). The surveyed archival engines have an archival rate of about 70-80% (dotted line) following an initial ramp-up time.
Figure 3. URL presence in the archives. Percentage of URLs found in the Internet Archive (dashed line), in WebCite (dotted line), or in either archive (solid line). The Internet Archive (IA) is older and thus accounts for the majority of earlier published URLs, though WebCite's contribution grows as time goes on.
Comparison of certain statistics based on the subject of a given URL.
| Subject | Total | # Alive (%) | Median Survival with 95% CI in years |
|---|---|---|---|
| Biochemistry & Molecular Biology | 4585 | 3231 (70%) | 10.8 (9.0,11.0) |
| Biotechnology & Applied Microbiology | 2225 | 1586 (71%) | 9.0 (8.8,9.0) |
| Computer Science | 2073 | 1225 (59%) | 8.3 (7.0,9.0) |
| Biochemical Research Methods | 2023 | 1463 (72%) | 8.5 (8.5,8.6) |
| Mathematical & Computational Biology | 1661 | 1200 (72%) | 7.5 (7.5,9.0) |
| Genetics & Heredity | 1302 | 914 (70%) | 8.8 (8.8,10.0) |
| Physics | 809 | 458 (57%) | 8.0 (7.6,9.0) |
| Engineering | 703 | 419 (60%) | 7.2 (7.1,10.5) |
| Statistics & Probability | 699 | 440 (63%) | 7.6 (7.0,9.0) |
| Chemistry | 591 | 397 (67%) | 11.4 (9.0,11.9) |
| Biophysics | 432 | 270 (63%) | 10.1 (10.1,10.1) |
| Astronomy & Astrophysics | 416 | 268 (64%) | 11.3 (11.1,NA) |
| Mathematics | 406 | 254 (63%) | 10.7 (4.5,NA) |
| Zoology | 357 | 319 (89%) | 11.2 (9.6,NA) |
| Cell Biology | 353 | 242 (69%) | 8.0 (8.0,10.8) |
| Biology | 346 | 242 (70%) | 9.8 (7.3,NA) |
| Oncology | 342 | 239 (70%) | 6.9 (6.9,7.0) |
| Plant Sciences | 315 | 235 (75%) | 9.8 (8.2,NA) |
| Environmental Sciences | 304 | 190 (63%) | 8.0 (7.6,9.5) |
| Medicine | 293 | 219 (75%) | 13.3 (10.0,NA) |
Subjects are assigned to journals and not to specific papers. Note that in these models a given URL could contribute to multiple subjects, because it may appear in multiple journals, each of which could also have multiple subject areas. Where possible, specific subjects were generalized (for example, "Computer Science, Interdisciplinary Applications" became "Computer Science"). Median survival was estimated using R's survfit(). "NA" indicates that an upper 95% limit could not be computed.
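To illustrate what a median survival estimate of this kind represents, the following is a minimal Kaplan-Meier sketch in plain Python. It is not the authors' code (the source states only that R's survfit() was used), it omits the confidence interval, and the toy durations and function names are assumptions made for illustration. A URL still alive when surveyed is treated as censored.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: years from publication until the URL died or was last checked.
    observed:  True if the URL was seen dead, False if still alive (censored).
    """
    deaths = Counter(t for t, dead in zip(durations, observed) if dead)
    exits = Counter(durations)  # every URL leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_survival(durations, observed):
    """Smallest time at which the estimated survival drops to 0.5 or below.

    Returns None when the curve never reaches 0.5 (too much censoring).
    """
    for t, s in kaplan_meier(durations, observed):
        if s <= 0.5:
            return t
    return None


# Hypothetical toy data: eight URLs, four observed dead, four still alive.
toy_years = [2, 3, 3, 5, 8, 8, 9, 11]
toy_dead = [True, True, False, True, True, False, False, False]
toy_median = median_survival(toy_years, toy_dead)
```

When few failures are observed relative to censored URLs, the survival curve (or its confidence band) may never cross 0.5, in which case the corresponding quantity is undefined.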
Results of fitting a parametric survival regression using the logistic distribution to the unique URLs.
| Variable | Value | p | 5% | 95% |
|---|---|---|---|---|
| (Intercept) | 5.22 | 3.3E-30 | 4.46 | 5.97 |
| Log2(URL published) | 3.57 | 1.4E-17 | 2.88 | 4.25 |
| depth | -1.46 | 7.0E-32 | -1.66 | -1.25 |
| Log2(TimesCited + 1) | 0.25 | 2.8E-04 | 0.13 | 0.36 |
| Funding text present | 3.43 | 2.8E-11 | 2.59 | 4.28 |
| au | 4.53 | 1.5E-04 | 2.56 | 6.49 |
| be | 3.31 | 1.9E-02 | 0.99 | 5.64 |
| ca | 4.88 | 1.7E-06 | 3.20 | 6.56 |
| ch | 6.45 | 7.2E-08 | 4.48 | 8.42 |
| cn | 1.50 | 1.3E-01 | -0.13 | 3.13 |
| com | 6.02 | 2.2E-18 | 4.89 | 7.16 |
| de | 5.74 | 6.1E-16 | 4.57 | 6.91 |
| dk | 7.66 | 5.7E-07 | 5.14 | 10.18 |
| edu | 3.77 | 1.6E-13 | 2.93 | 4.61 |
| es | 3.05 | 5.4E-03 | 1.25 | 4.85 |
| fr | 3.65 | 6.6E-07 | 2.44 | 4.85 |
| gov | 5.51 | 1.2E-15 | 4.38 | 6.64 |
| il | 5.92 | 3.6E-04 | 3.19 | 8.65 |
| in | 4.78 | 2.2E-04 | 2.65 | 6.91 |
| it | 5.51 | 1.4E-08 | 3.91 | 7.11 |
| jp | 5.07 | 8.0E-09 | 3.62 | 6.51 |
| kr | -3.35 | 2.0E-02 | -5.73 | -0.97 |
| net | 7.01 | 4.2E-11 | 5.26 | 8.76 |
| nl | 6.78 | 1.1E-06 | 4.49 | 9.07 |
| org | 8.10 | 2.4E-36 | 7.04 | 9.16 |
| ru | 3.90 | 2.3E-03 | 1.80 | 6.01 |
| se | 1.71 | 2.4E-01 | -0.69 | 4.12 |
| tw | 1.64 | 1.7E-01 | -0.33 | 3.61 |
| uk | 4.49 | 4.2E-12 | 3.42 | 5.56 |
| Bioinformatics | -2.04 | 5.7E-03 | -3.25 | -0.83 |
| BMC Bioinformatics | 2.69 | 3.9E-05 | 1.62 | 3.77 |
| BMC Genomics | 0.88 | 4.7E-01 | -1.13 | 2.89 |
| Comp. Physics Comm. | -4.00 | 3.0E-05 | -5.57 | -2.42 |
| Genome Research | 0.56 | 7.1E-01 | -1.92 | 3.04 |
| Nucleic Acids Research | 1.28 | 8.6E-04 | 0.65 | 1.91 |
| PLoS ONE | -0.39 | 8.0E-01 | -2.95 | 2.18 |
| Zoological Studies | 16.42 | 2.2E-15 | 13.01 | 19.83 |
Positive numbers indicate longer median lifetimes. As in a logistic model, coefficients can be added to the intercept (after multiplying by the predictor value in the case of numeric predictors) to obtain a median lifetime. For example, the median expected lifetime for a URL published once, with depth 0, whose publishing article had 1 citation and no funding text, with domain au, and published in a journal not listed (i.e., the default) would be: (Intercept) 5.22 + Log2(1)*3.57 + 0*(-1.46) + Log2(1+1)*0.25 + 0*3.43 + 4.53 = 10 years.
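The worked example can be reproduced with a few lines of Python. This is a sketch only: the coefficient values are copied from the table, but the function and dictionary names are hypothetical, only a subset of domains and journals is included, and "URL published" is taken to be the number of times the URL was published, as the example implies.

```python
import math

# Coefficients copied from the regression table (subset of domains/journals).
INTERCEPT = 5.22
COEF_LOG2_URL_PUBLISHED = 3.57
COEF_DEPTH = -1.46
COEF_LOG2_TIMES_CITED = 0.25
COEF_FUNDING_TEXT = 3.43
DOMAIN_COEF = {"au": 4.53, "com": 6.02, "edu": 3.77, "gov": 5.51, "org": 8.10}
JOURNAL_COEF = {"BMC Bioinformatics": 2.69, "Nucleic Acids Research": 1.28}


def median_lifetime_years(times_published, depth, times_cited,
                          has_funding_text, domain=None, journal=None):
    """Sum the intercept and the applicable coefficients.

    A domain or journal of None is treated as the default (reference) level
    and contributes nothing. Unknown keys raise KeyError, since this sketch
    carries only part of the table.
    """
    total = INTERCEPT
    total += COEF_LOG2_URL_PUBLISHED * math.log2(times_published)
    total += COEF_DEPTH * depth
    total += COEF_LOG2_TIMES_CITED * math.log2(times_cited + 1)
    total += COEF_FUNDING_TEXT * (1 if has_funding_text else 0)
    if domain is not None:
        total += DOMAIN_COEF[domain]
    if journal is not None:
        total += JOURNAL_COEF[journal]
    return total


# The example from the text: published once, depth 0, 1 citation,
# no funding text, domain "au", journal at the default level.
example = median_lifetime_years(1, 0, 1, False, domain="au")
print(round(example, 2))
```

Changing a single argument, for example `depth=2`, shows how strongly each predictor shifts the estimate under this model.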
Figure 4. How important is each predictor in predicting whether a URL is available? This graph compares what portion of the overall deviance is explained uniquely by each predictor for each of the measured outcomes. A similar list of predictors (differing only in whether the first or last year a URL was published) without interaction terms was employed to construct 3 logistic regression models. The dependent variable for each of the outcomes under study (Live Web, Internet Archive and WebCite) was availability at the time of measurement. Unique deviance was calculated by dropping each term and measuring the change in explained deviance in the logistic model. Results were then expressed as a percentage of the total uniquely explained deviance for each of the 3 methods.
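The drop-one-term calculation described in this caption reduces to simple arithmetic once the models have been fitted. The sketch below assumes the explained deviance of the full model and of each reduced model is already available; the predictor names and numbers are hypothetical placeholders, not values from the study.

```python
def unique_deviance_shares(full_explained, explained_without):
    """Percentage of the total uniquely explained deviance per predictor.

    full_explained:    deviance explained by the model with all predictors.
    explained_without: {predictor: deviance explained after dropping it}.
    """
    unique = {
        term: full_explained - reduced
        for term, reduced in explained_without.items()
    }
    total_unique = sum(unique.values())
    return {term: 100.0 * u / total_unique for term, u in unique.items()}


# Hypothetical numbers for one outcome (e.g. availability on the live web).
shares = unique_deviance_shares(
    full_explained=500.0,
    explained_without={"domain": 380.0, "depth": 440.0, "year": 480.0},
)
```

Because the shares are normalized by the total uniquely explained deviance, they sum to 100% for each outcome, which is what makes the three models comparable side by side.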
Figure 5. Coverage of the scholarly URL list for each archival engine at different times. All URLs marked as alive in 2011 but missing from an archive were submitted between the 2012 and 2013 surveys. The effect of submitting the URLs is most evident in the WebCite case, though the Internet Archive also showed substantial improvement. Implementing an automated process to do this could vastly improve the retention of scholarly static web pages.
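The selection step of such an automated process, picking URLs that are alive but missing from an archive, might look like the sketch below. The record layout and names are assumptions for illustration, and the actual submission to each archive is deliberately left out because it depends on each service's own interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlStatus:
    """Hypothetical survey record for one scholarly URL."""
    url: str
    alive: bool
    in_internet_archive: bool
    in_webcite: bool


def submission_queue(records):
    """Return {archive_name: [urls]} for live URLs missing from each archive."""
    queue = {"internet_archive": [], "webcite": []}
    for record in records:
        if not record.alive:
            continue  # a dead page can no longer be captured
        if not record.in_internet_archive:
            queue["internet_archive"].append(record.url)
        if not record.in_webcite:
            queue["webcite"].append(record.url)
    return queue


# Placeholder URLs on a reserved example domain.
survey = [
    UrlStatus("http://example.org/tool", True, True, False),
    UrlStatus("http://example.org/data", True, False, False),
    UrlStatus("http://example.org/gone", False, False, False),
]
queue = submission_queue(survey)
```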