The Structure of Broad Topics on the Web (2002)

by Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera, David M. Pennock
International World Wide Web Conference

Abstract:

The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure rather than page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random-walk algorithms to draw samples that follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to a page about another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
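
The two measurements named in the abstract, the probability that a page on one topic links to a page on another, and the rate at which topic context is lost along a random walk, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the six-page graph, the topic labels (standing in for classifier output), and the function names `topic_link_matrix` and `topic_retention` are hypothetical, and the walk is a plain uniform walk over outlinks with no random jumps. It is not the authors' experimental setup.

```python
from collections import defaultdict

# Hypothetical toy data: page -> topic label (a stand-in for classifier output).
TOPIC = {
    "a1": "arts", "a2": "arts", "a3": "arts",
    "s1": "sports", "s2": "sports",
    "c1": "computers",
}

# Hypothetical toy graph: page -> list of pages it links to.
LINKS = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "s1"],
    "a3": ["a1", "a2"],
    "s1": ["s2", "c1"],
    "s2": ["s1", "a1"],
    "c1": ["s1", "a3"],
}


def topic_link_matrix(links, topic):
    """Estimate P(target topic | source topic) from observed links."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, outs in links.items():
        for dst in outs:
            counts[topic[src]][topic[dst]] += 1
    matrix = {}
    for src_topic, row in counts.items():
        total = sum(row.values())
        matrix[src_topic] = {dst: n / total for dst, n in row.items()}
    return matrix


def topic_retention(links, topic, start, steps):
    """Exact probability, for each step 0..steps, that a uniform random
    walk from `start` is on a page with the same topic as `start`.

    Assumes every page has at least one outlink (no dead ends).
    """
    dist = {start: 1.0}  # probability distribution over pages
    retained = []
    for _ in range(steps + 1):
        retained.append(
            sum(p for page, p in dist.items() if topic[page] == topic[start])
        )
        nxt = defaultdict(float)
        for page, p in dist.items():
            outs = links[page]
            for dst in outs:
                nxt[dst] += p / len(outs)
        dist = nxt
    return retained


if __name__ == "__main__":
    for src_topic, row in topic_link_matrix(LINKS, TOPIC).items():
        print(src_topic, row)
    print(topic_retention(LINKS, TOPIC, "a1", 5))
```

In this toy graph the walk started at `a1` stays on its starting topic for one step and then begins to mix into other topics; the step at which the retained probability approaches the background share of that topic is one informal reading of the "topic mixing distance" mentioned above.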
