## The Structure of Broad Topics on the Web (2002)

Venue: International World Wide Web Conference

Citations: 50 (1 self)

### BibTeX

@INPROCEEDINGS{Chakrabarti02thestructure,

author = {Soumen Chakrabarti and Mukul M. Joshi and Kunal Punera and David M. Pennock},

title = {The Structure of Broad Topics on the Web},

booktitle = {INTERNATIONAL WORLD WIDE WEB CONFERENCE},

year = {2002},

pages = {251--262},

publisher = {ACM}

}

### Abstract

The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
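The topic-to-topic linking probability mentioned in the abstract can be illustrated with a minimal sketch. It assumes each page already carries a single hard topic label (the paper uses an automatic classifier, which this sketch does not reproduce), and the function name `topic_citation_matrix` and the toy pages are hypothetical, chosen only for illustration.

```python
def topic_citation_matrix(page_topic, links, topics):
    """Estimate Pr(target topic | source topic) from labeled hyperlinks.

    page_topic: dict mapping page id -> topic label (assumed hard labels)
    links:      iterable of (source page, target page) pairs
    topics:     list of all topic labels
    """
    counts = {i: {j: 0 for j in topics} for i in topics}
    for src, dst in links:
        # Skip links whose endpoints were never classified.
        if src in page_topic and dst in page_topic:
            counts[page_topic[src]][page_topic[dst]] += 1

    matrix = {}
    for i in topics:
        total = sum(counts[i].values())
        # A topic with no outgoing links keeps an all-zero row.
        matrix[i] = {j: (counts[i][j] / total if total else 0.0) for j in topics}
    return matrix


# Toy data (hypothetical): two topics, four pages, five links.
page_topic = {"a": "Arts", "b": "Arts", "c": "Sports", "d": "Sports"}
links = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d"), ("d", "c")]
C = topic_citation_matrix(page_topic, links, ["Arts", "Sports"])
print(C["Arts"])    # share of Arts links that stay in Arts vs. go to Sports
print(C["Sports"])
```

Each row of the result sums to 1 (or to 0 for a topic with no outgoing links), so entry (i, j) can be read as the estimated probability that a page about topic i cites a page about topic j.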

### Citations

9231 | Elements of Information Theory
- Cover, Thomas
- 1990
Citation Context ...$) \,\|\, p(D_2)) = \sum_c p_c(D_1) \log \frac{p_c(D_1)}{p_c(D_2)}$ because of two problems: it is asymmetric, and more seriously, it cannot deal with zero probabilities gracefully. The symmetric Jensen-Shannon (JS) divergence [12] also has problems with zeroes. Bar-Yossef et al. found that an undirected random walk touching about 300 physically distinct pages was adequate to collect a URL sufficiently unbiased to yield a good ...

3648 | The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Brin, Page
- 1998
Citation Context ... lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11, 14, 31] and the effect of topical clusters on Google’s PageRank algorithm [6, 28]. Link-based vs. content-based Web communities: We extend the above measurements to construct a topic citation matrix in which entry (i, j) represents the probability that a page about topic i cites a...

3423 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...m outlink v was chosen such that u and v were from different hosts (identified by name). Davison represented the text on these pages using the standard “vector space model” from Information Retrieval [33] in which each document u is represented by a vector u of suitably normalized term counts in a geometric space with an axis for each term. He measured the dot product between each pair of document vec...

2999 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999
Citation Context ...ve been very successful in recent years. Google (http://google.com) uses as a subroutine the PageRank algorithm [6, 28], which we have already reviewed in §2.1. Kleinberg proposed the HITS algorithm [20] which has many variants. Unlike PageRank, HITS does not analyze the whole Web graph, but collects a subgraph $G_q = (V_q, E_q)$ of the Web graph G in response to a specific query q. It uses a keyword sear...

2440 | Emergence of scaling in random networks
- Barabási, Albert
- 1999
Citation Context ...modeling in recent years. 1.1 Graph-theoretic models and measurements: With a few notable exceptions, most studies conducted on the Web have focused on its graph-theoretic aspects. Barabási and Albert [3] proposed a local model for social network evolution based on preferential attachment: nodes with large degree are proportionately more likely to become incident to new links. They applied it to the W...

2401 | The PageRank citation ranking: Bringing order to the web
- Page, Brin, et al.
- 1999
Citation Context ... lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11, 14, 31] and the effect of topical clusters on Google’s PageRank algorithm [6, 28]. Link-based vs. content-based Web communities: We extend the above measurements to construct a topic citation matrix in which entry (i, j) represents the probability that a page about topic i cites a...

2182 | Social Network Analysis: Methods and Applications
- Wasserman, Faust
- 1994
Citation Context ...ess of topic directories. 3.4 Topic-specific degree distributions: Several researchers have corroborated that the distribution of degrees of nodes in the Web graph (and many social networks in general [16, 34]) asymptotically follow a power law distribution [1, 7, 21]: the probability that a randomly picked node has degree i is proportional to $1/i^x$, for some constant ‘power’ x > 1. The powers x for in- a...

1923 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...ng to the clique. It is easy to see that a walk starting on the stick is doomed to enter the clique with high probability, whereas getting out from the clique to the stick will take a long, long time [26]. Unfortunately, lollipops and near-lollipops are not hard to find on the Web: http://www.amazon.com/, http://www.stadions.dk/ and http://www.chipcenter.com/ are some prominent examples. Hence we add...

1340 | On Power-law Relationships of the Internet Topology
- Faloutsos, Faloutsos, et al.
- 1999
Citation Context ...ess of topic directories. 3.4 Topic-specific degree distributions: Several researchers have corroborated that the distribution of degrees of nodes in the Web graph (and many social networks in general [16, 34]) asymptotically follow a power law distribution [1, 7, 21]: the probability that a randomly picked node has degree i is proportional to $1/i^x$, for some constant ‘power’ x > 1. The powers x for in- a...

835 | A comparison of event models for naive Bayes text classification
- McCallum, Nigam
- 1998
Citation Context ... of documents labeled with c and T is the entire vocabulary. The NB learner assumes independence between features, and estimates $\Pr(c|d) \propto \Pr(c)\,\Pr(d|c) \approx \Pr(c) \prod_{t \in d} \theta_{c,t}^{n(d,t)}$ (1). Nigam et al. provide further details [24]. 3 Background topic distribution: In this section we seek to characterize and estimate the distribution of topics on the Web, i.e., the fractions of Web pages relevant to a set ...

534 | Focused Crawling: A New Approach to Topic-specific Web Resource Discovery
- Chakrabarti, Berg, et al.
- 1999
Citation Context ... they do not approach the background distribution either. Different communities lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11, 14, 31] and the effect of topical clusters on Google’s PageRank algorithm [6, 28]. Link-based vs. content-based Web communities: We extend the above measurements to construct a topic citation matrix in which...

406 | Enhanced hypertext categorization using hyperlinks
- Chakrabarti, Dom, et al.
- 1998
Citation Context ...erms of the textual tokens appearing in d. Pages on the Web are not isolated entities, and the (estimated) topics of the neighbors N(u) of page u may lend valuable evidence to estimate the topic of u [8, 19]. Thus we need to estimate a joint distribution for $\Pr(c(u), c(N(u)))$, which is a direct application of the topic citation matrix. Enhanced focused crawling: A focused crawler is given a topic tax...

313 | Trawling the Web for emerging cyber-communities
- Kumar, Raghavan, et al.
- 1999
Citation Context ...omponents and bipartite cores would be extremely unlikely. The Web is not random, and such subgraphs abound on the Web. A small bipartite core is often an indicator of an emerging topic. Kumar et al. [21] mine tens of thousands of such bipartite cores and empirically observed that a large fraction are in fact topically coherent, but the definition of a ‘topic’ was left informal. Dense bipartite subgra...

212 | Focused crawling using context graphs
- Diligenti, Coetzee, et al.
- 2000
Citation Context ... they do not approach the background distribution either. Different communities lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11, 14, 31] and the effect of topical clusters on Google’s PageRank algorithm [6, 28]. Link-based vs. content-based Web communities: We extend the above measurements to construct a topic citation matrix in which...

168 | Self-organization and identification of web communities
- Flake, Lawrence, et al.
Citation Context ...fact topically coherent, but the definition of a ‘topic’ was left informal. Dense bipartite subgraphs are an important indicator of community formation, but they may not be the only one. Flake et al. [17] propose a general definition of a community as a subgraph whose internal link density exceeds the density of connection to nodes outside it by some margin. They use this definition to drive a crawler...

163 | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
- McCallum
- 1996
Citation Context ...out 120,000 URLs. [Footnote 1: All nodes have equal degree in a regular graph.] For the classifier we used the public domain BOW toolkit and the Rainbow naive Bayes (NB) classifier created by McCallum and others [23]. Bow and Rainbow are very fast C implementations which let us classify pages in real time as they were being crawled. Rainbow’s naive Bayes learner is given a set of training documents, each labeled ...

130 | Winners don't take all: Characterizing the competition for links on the web
- Pennock, Flake, et al.
- 2002
Citation Context ...Web graph, and the model predicted the power-law degree distribution quite accurately, except for underestimating the density of low-degree nodes. This discrepancy was later removed by Pennock et al. [30] by using a linear combination of preferential and random attachment. Random graph models materialize edges independently with a fixed probability. If the Web were a random graph, large densely connec...

128 | Topical locality in the Web
- Davison
- 2000
Citation Context ...nteraction between topics, or the radius of topical clusters, were not addressed. 1.2 Content-based locality measurements: A few studies have concentrated on textual content. Davison pioneered a study [13] over about 100,000 pages sampled from the repository of a research search engine called DiscoWeb. He collected the following kinds of pairs of Web pages: Random: Two different pages were sampled unif...

120 | The Connectivity Server: fast access to linkage information on the Web
- Bharat, Broder, et al.
- 1998
Citation Context ...his observation. How topic-biased are breadth-first crawls? Several production crawlers follow an approximate breadth-first strategy. A breadth-first crawler was used to build the Connectivity Server [5, 7] at Alta Vista. Najork and Wiener [27] report that a breadth-first crawl visits pages with high PageRank early, a valuable property for a search engine. A crawl of over 80 million pages at the NEC Res...

110 | Using reinforcement learning to spider the web efficiently
- Rennie, McCallum
- 1999
Citation Context ... they do not approach the background distribution either. Different communities lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11, 14, 31] and the effect of topical clusters on Google’s PageRank algorithm [6, 28]. Link-based vs. content-based Web communities: We extend the above measurements to construct a topic citation matrix in which...

109 | On Near-Uniform URL Sampling
- Henzinger, Heydon, et al.
- 2000
Citation Context ...he directory, and try to understand why. Topic convergence on directed walks: We also study (§4) page samples collected from ordinary random walks that only follow hyperlinks in the forward direction [18]. We discover that these ordinary walks do not lose the starting topic memory as quickly as undirected walks, and they do not approach the background distribution either. Different communities lose th...

108 | Breadth-First Search Crawling Yields High-Quality Pages
- Najork, Wiener
- 2001
Citation Context ...eadth-first crawls? Several production crawlers follow an approximate breadth-first strategy. A breadth-first crawler was used to build the Connectivity Server [5, 7] at Alta Vista. Najork and Wiener [27] report that a breadth-first crawl visits pages with high PageRank early, a valuable property for a search engine. A crawl of over 80 million pages at the NEC Research Institute broadly follows a brea...

106 | Graph structure in the web: experiments and models
- Broder, Kumar, et al.
- 2000
Citation Context ...to nodes outside it by some margin. They use this definition to drive a crawler, starting from exemplary members of a community, and verify that a coherent community graph is collected. Broder et al. [7] exposed the large-scale structure of the Web graph as having a central, strongly connected core (SCC); a subgraph with directed paths leading into the SCC, a component leading away from the SCC, and ...

90 | Generating Network Topologies That Obey Power Laws
- Palmer, Steffan
- 2000
Citation Context ...bers of links, in agreement with the Pennock et al. findings [30], and in contrast to the global in-degree distribution which is nearly a pure power law [7]. An empirical result of Palmer and Steffan [29] may help explain why we would expect to see the power law upheld by pages on specific topics. They showed through experiments that the following simple “80-20” random graph generator fits power-law d...

78 | Approximating Aggregate Queries about Web Pages via Random Walks
- Bar-Yossef, Berg, et al.
- 2000
Citation Context ... topics in the Web graph. Convergence of topic distribution on undirected random walks: Algorithms for sampling Web pages uar have been evaluated on structural properties such as degree distributions [2, 32]. Extending these techniques, we design a certain undirected random walk (i.e., assuming hyperlinks are bidirectional) to estimate the distribution of Web pages w.r.t. the Dmoz topics (§3). We start f...

72 | Accelerated focused crawling through online relevance feedback
- Chakrabarti, Punera, et al.
Citation Context ...RLs. Since the set of topics was very large and many topics had scarce training data, we pruned the Dmoz tree to a manageable frontier as described in §3.1 of our companion paper in these proceedings [10]. The pruned taxonomy had 482 leaf nodes and a total of 144,859 sample URLs. Out of these we could successfully fetch about 120,000 URLs. [Footnote 1: All nodes have equal degree in a regular graph.] For the clas...

42 | Methods for sampling pages uniformly from the world wide web
- Rusmevichientong, Pennock, et al.
- 2001
Citation Context ... topics in the Web graph. Convergence of topic distribution on undirected random walks: Algorithms for sampling Web pages uar have been evaluated on structural properties such as degree distributions [2, 32]. Extending these techniques, we design a certain undirected random walk (i.e., assuming hyperlinks are bidirectional) to estimate the distribution of Web pages w.r.t. the Dmoz topics (§3). We start f...

41 | Power-law distribution of the world wide web
- Adamic
- 2000
Citation Context ...ibutions Several researchers have corroborated that the distribution of degrees of nodes in the Web graph (and many social networks in general [16, 34]) asymptotically follow a power law distribution [1, 7, 21]: the probability that a randomly picked node has degree i is proportional to $1/i^x$, for some constant ‘power’ x > 1. The powers x for in- and out-degrees were estimated in 1999 to be about 2.1 and 2...

19 | Surfing the Web backwards
- Chakrabarti, Gibson, et al.
- 1999
Citation Context ...ial bias in the sample towards pages close to the starting point of the walk. Unfortunately, there is no easy way around this bias until and unless hyperlinks become bidirectional entities on the Web [9]. However we can assess the quality of the samples through other means. The graph is made regular by adding sufficient numbers of self-loops at each node; see §3. We use a variant of Bar-Yossef’s walk...

12 | Probabilistic combination of content and links
- Jin, Dumais
- 2001
Citation Context ...erms of the textual tokens appearing in d. Pages on the Web are not isolated entities, and the (estimated) topics of the neighbors N(u) of page u may lend valuable evidence to estimate the topic of u [8, 19]. Thus we need to estimate a joint distribution for $\Pr(c(u), c(N(u)))$, which is a direct application of the topic citation matrix. Enhanced focused crawling: A focused crawler is given a topic tax...

2 | Self-similarity in the Web. Online at http://www.almaden.ibm.com/cs/k53/fractal.ps
- Dill, Kumar, et al.
- 2001
Citation Context ...he “bow-tie” model of the Web. They also measured interesting properties like the average path lengths between connected nodes and the distribution of in- and out-degree. Followup work by Dill et al. [15] showed that subgraphs selected from the Web as per specific criteria (domain restriction, occurrence of keyword, etc.) also appear to often be bow-tie-like, although the ratio of component sizes vary ...

2 | Links tell us about lexical and semantic Web content
- Menczer
- 2001
Citation Context ... have almost nothing in common. Linked pages are more similar when the pages are from the same domain. Sibling pages are more similar than the linked pages of different domain. More recently, Menczer [25] has studied and modeled carefully the rapid decay in TFIDF similarity to a starting node as one walks away from that node. A single starting page $u_0$ may be a noisy indicator of semantic similarity wi...

1 | Random jumps in WebWalker
- Berg
- 2001
Citation Context ...he original Bar-Yossef algorithm [2], set to 0.01–0.05 throughout our experiments, i.e., with this probability at every step, we jump uar to a node visited earlier in the Sampling walk. Berg confirms [4] that this improves the stability and convergence of the Sampling walks. 3.1 Convergence: Bar-Yossef et al. showed that the samples collected by a Sampling walk have degree distributions that converge ...

1 | ATTICS: A toolkit for text classification and text mining
- Lewis
- 2000
Citation Context ...$(D_1) = \frac{1}{|D_1|} \sum_{d \in D_1} p(d)$ (2). This is a form of soft counting. The ‘hard’ analog would be to assign each d to its highest scoring class and count up the number of documents assigned to each class. Lewis [22] notes that soft counting gives better estimates than hard counting for small sample sizes. We characterize the difference between p(D1) and p(D2) as the L1 difference between the two vectors, $\sum_c |p_c$...
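
The soft-counting estimate and the L1 difference described in the last citation context can be sketched as follows. This is an illustrative sketch only: it assumes each document comes with a per-class probability vector from some classifier, and the function names and toy vectors are hypothetical rather than taken from the paper or from any toolkit.

```python
def soft_count(class_probs):
    """Soft counting: average the per-document class probability vectors."""
    n = len(class_probs)
    k = len(class_probs[0])
    return [sum(p[c] for p in class_probs) / n for c in range(k)]


def hard_count(class_probs):
    """Hard counting: assign each document to its top class, then normalize."""
    n = len(class_probs)
    k = len(class_probs[0])
    counts = [0] * k
    for p in class_probs:
        # Ties go to the lowest class index.
        counts[max(range(k), key=lambda c: p[c])] += 1
    return [x / n for x in counts]


def l1_difference(p, q):
    """L1 difference between two topic distributions over the same classes."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Toy samples (hypothetical): two documents each, three classes.
D1 = [[0.5, 0.5, 0.0], [0.25, 0.75, 0.0]]
D2 = [[0.0, 0.5, 0.5], [0.0, 0.25, 0.75]]

print(soft_count(D1), hard_count(D1))
print(l1_difference(soft_count(D1), soft_count(D2)))
```

Because the L1 difference only sums absolute gaps, it stays well defined when a class has zero probability in one sample, which is the difficulty the first citation context raises for KL divergence.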