Googlewhacks for Fun and Profit
We study the number of Internet search results returned by multi-word queries as a function of the number of results returned when each word is searched for individually. We derive a model for multi-word search result counts using the total number of pages indexed by Google, applying the Zipf power law to the words-per-page distribution on the Internet and Heaps' law to unique word counts. Based on data from 351 word pairs, each with exactly one hit when searched for together, and a Zipf law coefficient determined in prior studies, we estimate the Heaps' law coefficient for the indexed World Wide Web (about 8 billion pages) to be beta = 0.52; previous estimates were based on corpora of fewer than 20,000 pages. We validate the method on a separate set of word pairs and on word triplets. We show through examples how the model can automatically gauge the relatedness of word pairs, assigning each a value we call "Strength of Associativity." Finally, we use the model to compare the index sizes of the competing search giants Yahoo and Google.
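The core idea can be sketched briefly. This is an illustrative simplification, not the speaker's exact model: under a naive independence assumption, a pair of words is a "Googlewhack" candidate when the expected number of pages containing both words is near one. The constants below (the Heaps' law prefactor k and the index size N) are placeholder assumptions; only beta = 0.52 comes from the abstract.

```python
N = 8e9  # approximate number of pages in Google's index at the time (~8 billion)

def expected_joint_hits(hits1: float, hits2: float, n_pages: float = N) -> float:
    """Expected co-occurrence count if the two words occurred on pages
    independently: N * (hits1/N) * (hits2/N) = hits1 * hits2 / N."""
    return hits1 * hits2 / n_pages

def heaps_vocabulary(n_words: float, k: float = 44.0, beta: float = 0.52) -> float:
    """Heaps' law: unique-word count V ~ k * n^beta for a corpus of n
    total words. k = 44 is a placeholder; beta = 0.52 is the estimate
    reported in the talk for the indexed web."""
    return k * n_words ** beta

# Two words with 100,000 and 80,000 individual hits have an expected
# joint count of exactly one on an 8-billion-page index:
print(expected_joint_hits(100_000, 80_000))  # → 1.0
```

In this simplified picture, a Googlewhack is evidence that hits1 * hits2 is roughly equal to N, which is why a collection of such pairs can be turned around to estimate the index size itself.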
Speaker: Jonathan Lansey