Abstract:
The web harbors a large number of communities: groups of content creators sharing a common interest, each of which manifests itself as a set of web pages. Whereas newsgroups and commercial web directories together contain on the order of 10,000 such communities, our particular interest here is in emerging communities: those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a web crawl; we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms and the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.

1. Overview

The web has several thousand well-known, explicitly defined communities: groups of individuals who share a common interest, together with the web pages most popular amongst them. Consider for ...