The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure, not page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web and analyze how well recent random walk algorithms draw samples that follow this distribution. In addition, we can measure the probability that a page about one broad topic will link to a page about another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
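The two link-based measurements described above can be illustrated on a toy graph. The sketch below is not the authors' procedure: it assumes every page already carries a single topic label (in the paper this would come from an automatic classifier), it approximates the random walk by a Markov chain over topics rather than over pages, and it uses the per-topic share of pages as a stand-in for the background distribution. All page names, topic names, and function names are hypothetical.

```python
from collections import Counter

def topic_link_matrix(labels, links, topics):
    """Row-normalised link counts: matrix[a][b] estimates
    Pr(link target has topic b | link source has topic a)."""
    counts = {a: Counter() for a in topics}
    for src, dst in links:
        counts[labels[src]][labels[dst]] += 1
    matrix = {}
    for a in topics:
        total = sum(counts[a].values())
        matrix[a] = {b: (counts[a][b] / total if total else 0.0) for b in topics}
    return matrix

def background_distribution(labels, topics):
    """Assumption: the background is the fraction of labelled pages per topic."""
    page_counts = Counter(labels.values())
    n = len(labels)
    return {t: page_counts[t] / n for t in topics}

def total_variation(p, q, topics):
    """Total variation distance between two topic distributions."""
    return 0.5 * sum(abs(p[t] - q[t]) for t in topics)

def topic_decay(start_topic, matrix, background, topics, max_steps):
    """Distance from the background after 1..max_steps steps of a
    topic-level walk that starts entirely on start_topic."""
    dist = {t: 1.0 if t == start_topic else 0.0 for t in topics}
    distances = []
    for _ in range(max_steps):
        dist = {b: sum(dist[a] * matrix[a][b] for a in topics) for b in topics}
        distances.append(total_variation(dist, background, topics))
    return distances

# Hypothetical toy data: four pages, two broad topics, six links.
topics = ["arts", "sports"]
labels = {"p1": "arts", "p2": "arts", "p3": "sports", "p4": "sports"}
links = [("p1", "p2"), ("p1", "p3"), ("p2", "p1"),
         ("p3", "p4"), ("p4", "p3"), ("p4", "p1")]

matrix = topic_link_matrix(labels, links, topics)
background = background_distribution(labels, topics)
distances = topic_decay("arts", matrix, background, topics, max_steps=5)

for k, d in enumerate(distances, start=1):
    print(f"step {k}: distance to background = {d:.4f}")
```

In this toy graph two thirds of the links stay within their source topic, so the distance to the background shrinks with every step. The number of steps needed for that distance to become small is one rough way to read a "topic mixing distance".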