Antal van den Bosch, Toine Bogers, Maurice de Kunder.
Abstract
One of the determining factors of the quality of Web search engines is the size of their index. In addition to its influence on search result quality, the size of the indexed Web can also tell us something about which parts of the WWW are directly accessible to the everyday user. We propose a novel method of estimating the size of a Web search engine's index by extrapolating from document frequencies of words observed in a large static corpus of Web pages. In addition, we provide a unique longitudinal perspective on the size of Google's and Bing's indices over a nine-year period, from March 2006 until January 2015. We find that index size estimates of these two search engines tend to vary dramatically over time, with Google generally possessing a larger index than Bing. This result raises doubts about the reliability of previous one-off estimates of the size of the indexed Web. We find that much, if not all, of this variability can be explained by changes in the indexing and ranking infrastructure of Google and Bing. This casts further doubt on whether Web search engines can be used reliably for cross-sectional webometric studies.
Keywords: Longitudinal study; Search engine index; Webometrics
Year: 2016 PMID: 27122648 PMCID: PMC4833824 DOI: 10.1007/s11192-016-1863-z
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.238
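The extrapolation described in the abstract can be illustrated with a minimal sketch. It assumes a simple proportional model: if a word occurs in a known fraction of the documents in a static training corpus, the hit count a search engine reports for that word, divided by that fraction, gives one estimate of the index size, and the estimates for several words are then averaged. The function names and all numbers below are hypothetical, and the paper's exact procedure may differ.

```python
from statistics import mean, stdev


def estimate_index_size(hit_count, corpus_doc_freq, corpus_size):
    """Extrapolate a collection size from one word's reported hit count.

    Assumption (not taken from the paper): the word occurs in the same
    fraction of documents in the target collection as in the training corpus.
    """
    if corpus_doc_freq <= 0:
        raise ValueError("the word must occur in the training corpus")
    return hit_count * corpus_size / corpus_doc_freq


def average_estimate(hit_counts, corpus_doc_freqs, corpus_size):
    """Return the mean and sample standard deviation of per-word estimates."""
    estimates = [
        estimate_index_size(hit_counts[word], corpus_doc_freqs[word], corpus_size)
        for word in hit_counts
    ]
    spread = stdev(estimates) if len(estimates) > 1 else 0.0
    return mean(estimates), spread


# Hypothetical values, for illustration only.
corpus_size = 1_000_000
corpus_doc_freqs = {"the": 500_000, "basketball": 10_000}
hit_counts = {"the": 10_000_000, "basketball": 150_000}

avg, sd = average_estimate(hit_counts, corpus_doc_freqs, corpus_size)
print(f"estimated size: {avg:,.0f} (SD {sd:,.0f})")
```

Averaging over several words is one way to dampen the error of any single word's estimate. The standard deviation across words indicates how much the individual estimates disagree.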
Real versus estimated numbers (with standard deviations) of documents on four textual corpora, based on the DMOZ training corpus statistics: two news resources (top two) and two collections of web pages (bottom two)
| Corpus | Words per document (mean) | Words per document (median) | Number of documents | Estimate | SD | Difference (%) |
|---|---|---|---|---|---|---|
| New York Times | 837 | 794 | 1,234,426 | 2,789,696 | 1,821,823 | +126 |
| Reuters RCV1 | 295 | 229 | 453,844 | 422,271 | 409,648 | −7.0 |
| Wikipedia | 447 | 210 | 2,112,923 | 2,189,790 | 1,385,105 | +3.6 |
| DMOZ test sample | 477 | 309 | 19,966 | 19,699 | 5,839 | −1.3 |
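The "Difference (%)" column appears to be the signed relative error of the estimate with respect to the real number of documents. The short sketch below recomputes it from the document counts and estimates in the table. The formula is inferred from the table values rather than stated in this excerpt, and the function name is hypothetical.

```python
def percent_difference(actual, estimate):
    """Signed relative error of an estimate, in percent of the actual value."""
    return (estimate - actual) / actual * 100.0


# (actual number of documents, estimate), copied from the table above.
rows = {
    "New York Times": (1_234_426, 2_789_696),
    "Reuters RCV1": (453_844, 422_271),
    "Wikipedia": (2_112_923, 2_189_790),
    "DMOZ test sample": (19_966, 19_699),
}

for name, (actual, estimate) in rows.items():
    print(f"{name}: {percent_difference(actual, estimate):+.1f}%")
```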
Fig. 1 Labeled scatter plot of per-word DMOZ frequencies of occurrence and estimates of the Wikipedia test corpus. The x axis is logarithmic. The solid horizontal line represents the actual number of documents in the Wikipedia test corpus (2,112,923); the dashed horizontal line is the averaged estimate of 2,189,790. The dotted slanted line represents the log-linear regression function
Fig. 2 Estimated size of the Google and Bing indices from March 2006 to January 2015. The lines connect the unweighted running daily averages of 31 days. The colored, numbered markers at the top represent reported changes in Google and Bing’s infrastructure. The colors of the markers correspond to the color of the search engine curve they relate to; for example, red markers signal changes in Google’s infrastructure (the red curve). Events that line up with a spike are marked with an open circle; other events are marked with a times sign (×)
Fig. 3 Estimated size of the Google index from March 2006 to January 2015 for three pivot words, the, basketball, and illini, and the average estimate over all 28 words (black line). The lines connect the unweighted running daily averages of 31 days
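The curves in Figs. 2 and 3 connect unweighted running averages over 31 days. A minimal sketch of such smoothing follows. It assumes a trailing window (each point averages the current day and up to 30 preceding days) and a shorter window at the start of the series. The captions do not say whether the window is trailing or centered, so both choices are assumptions, and the function name is hypothetical.

```python
from collections import deque


def running_average(values, window=31):
    """Trailing unweighted mean over the most recent `window` values.

    Until `window` values have been seen, the mean covers only the
    values available so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque()
    total = 0.0
    smoothed = []
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()  # drop the oldest day from the sum
        smoothed.append(total / len(recent))
    return smoothed


# Hypothetical daily estimates, smoothed with a short window for readability.
daily_estimates = [1, 2, 3, 4]
print(running_average(daily_estimates, window=2))
```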