## Efficient Identification of Web Communities (2000)

### Download Links

- gdit.iiit.net
- www.neci.nj.nec.com
- www.neci.nec.com
- DBLP

### Other Repositories/Bibliography

Venue: Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Citations: 231 (12 self)

### BibTeX

    @INPROCEEDINGS{Flake00efficientidentification,
      author    = {Gary William Flake and Steve Lawrence and C. Lee Giles},
      title     = {Efficient Identification of Web Communities},
      booktitle = {Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
      year      = {2000},
      pages     = {150--160},
      publisher = {ACM Press}
    }

### Abstract

We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
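The max-flow formulation described in the abstract can be illustrated with a minimal sketch. The toy graph, vertex names, and capacities below are illustrative assumptions, not the paper's data: Edmonds-Karp max flow runs from a seed source to a sink, and the source side of the residual graph is taken as the community.

```python
from collections import defaultdict, deque

def max_flow_community(edges, source, sink):
    """Edmonds-Karp max flow on a capacitated graph; returns the set of
    vertices left on the source side of the minimum cut (the 'community')."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += 0  # make sure the residual (reverse) edge exists

    def bfs_parents():
        # shortest augmenting path from source to sink in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    while (parent := bfs_parents()) is not None:
        # bottleneck capacity along the augmenting path
        v, bottleneck = sink, float("inf")
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # push the flow, updating residual capacities
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # vertices still reachable from the source form the community side of the cut
    community, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in community:
                community.add(v)
                queue.append(v)
    return community

# toy web graph: seed cluster {s, a, b} is densely linked internally,
# with one weak link out to x, which links on to the sink t
edges = [("s", "a", 3), ("a", "b", 3), ("s", "b", 3),
         ("b", "x", 1), ("x", "t", 3)]
community = max_flow_community(edges, "s", "t")  # {"s", "a", "b"}
```

The minimum cut severs the single weak edge (b, x), so only the densely linked cluster around the seed survives on the source side, matching the abstract's definition of a community.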

### Citations

10996 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979

Citation Context: ...communities obey our defining characteristic of having more edges inside the community than outside. Unfortunately, the most generic versions of balanced minimum-cut graph partitioning are NP-complete [14]. On the other hand, if the constraint on the partition sizes is removed, then the problem lends itself to many polynomial time algorithms [15]; however, under this formulation, solutions will often b...

8583 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2009

Citation Context: ...are water pipes and vertices are pipe junctions, then the maximum flow problem tells you how much water you can move from one point to another. The famous "max-flow min-cut" theorem of Ford and Fulkerson [16, 17] proves that the maximum flow of the network is identical to the minimum cut that separates s and t. Many polynomial time algorithms exist for solving the s-t maximum flow problem, and applications of the...

8198 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977

Citation Context: ...e number of edges removed). Thus, we need to augment our procedure with a method for identifying new seeds. We solve this problem with a method inspired by the Expectation Maximization (EM) algorithm [23]. The EM algorithm is a two-step process that iteratively applies estimation (the "E" step) and maximization (the "M" step). In our case, the "E" step corresponds to using the maximum flow algorithm to...
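The E/M alternation described in this excerpt could look roughly like the following sketch. This is illustrative Python, not the paper's implementation: `find_community`, `degree`, and the top-k reseeding heuristic are assumptions standing in for the paper's max-flow cut and seed-selection details.

```python
def grow_community(seeds, find_community, degree, rounds=10):
    """EM-style alternation: the E step finds the community induced by the
    current seeds; the M step promotes its best-connected members to seeds."""
    seeds = set(seeds)
    community = set(seeds)
    for _ in range(rounds):
        community = find_community(seeds)   # "E" step: e.g. a max-flow cut
        # "M" step: reseed with the highest-degree community members
        ranked = sorted(community, key=degree, reverse=True)
        new_seeds = set(ranked[: len(seeds) + 1])
        if new_seeds == seeds:              # fixed point reached
            break
        seeds = new_seeds
    return community

# toy fixture: a 3-page chain a - b - c, where "community" is just the seeds
# plus their immediate link neighbors (a stand-in for the max-flow step)
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
members = grow_community(
    {"a"},
    find_community=lambda s: s | {n for v in s for n in graph[v]},
    degree=lambda v: len(graph[v]),
)  # {"a", "b", "c"}
```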

2743 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999

Citation Context: ...r literature databases with over 150 thousand documents [10]. However, applying these methods to systems such as the web, with over 10⁹ documents, would obviously be challenging. Recently, Kleinberg [11] showed that the HITS algorithm, which is strongly related to spectral graph partitioning and methods used by the Google search engine [12], can identify "hub" and "authority" web pages. A hub site li...

1419 | Network Flows: Theory, Algorithms, and Applications
- Ahuja, Magnanti, et al.
- 1993

Citation Context: ...polynomial time algorithms exist for solving the s-t maximum flow problem, and applications of the problem include VLSI design, routing, scheduling, image segmentation, and network reliability testing [18]. The maximum flow problem is well-suited to the application of identifying web communities because, unlike the balanced and unbalanced graph partition problems, it is computationally tractable and it a...

490 | Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks
- Chakrabarti, Berg, et al.
- 1999

Citation Context: ...mmunities that appears to work well in practice. Figure 2 illustrates how our focused crawler retrieves pages and the graph that is induced by the crawl. (For another example of focused crawling, see [21].) The crawl begins with the seed web pages, shown as set (b) in the figure, and finds all pages that link to or from the seed set. Outbound links are trivially found by examining the HTML of the page. In...
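One plausible reading of the crawl-induced graph in this excerpt is sketched below. The function and node names (`SOURCE`, `SINK`) are invented for illustration, and inbound-link discovery (which the paper obtains via search engines) is omitted: seed pages are fed by a virtual source, and any link leaving the crawled set is rerouted to a single virtual sink.

```python
def build_crawl_graph(seeds, out_links, capacity=1):
    """Induce a flow graph from a fixed-depth (here depth-1) crawl: a virtual
    source feeds the seed pages, and every link that leaves the crawled set
    is rerouted to a single virtual sink vertex."""
    crawled = set(seeds)
    for s in seeds:
        crawled |= out_links.get(s, set())          # pages one link away
    edges = [("SOURCE", s, float("inf")) for s in seeds]
    for u in crawled:
        for v in out_links.get(u, set()):
            if v in crawled:
                edges.append((u, v, capacity))      # edge inside the crawl
            else:
                edges.append((u, "SINK", capacity)) # unknown site -> virtual sink
    return edges

# illustrative crawl: one seed, two discovered pages, one external link
out_links = {"s1": {"p1", "p2"}, "p1": {"s1", "ext.example"}}
crawl_edges = build_crawl_graph(["s1"], out_links)
```

Running a max-flow computation from `SOURCE` to `SINK` on such a graph then approximates community membership without ever materializing the full web graph.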

360 | Social network analysis: A handbook
- Scott
- 1991

Citation Context: ...ng circles. We summarize a small subset of these fields to give our own work the proper context. 2.1 Link Analysis. One of the earliest uses of link structure is found in the analysis of social networks [3], where network properties such as cliques, centroids, and diameters are used to analyze the collective properties of interacting agents. The fields of citation analysis [4] and bibliometrics [5] also u...

341 | Inferring web communities from link topology
- Gibson, Kleinberg, et al.
- 1998

Citation Context: ...arting point for a focused crawl. Hubs and authorities are very useful for identifying key sites related to some community and, hence, should work well as seeds to our method. HITS has also been used [13] to extract related documents; however, using HITS for enumerating all members of a community may be problematic because the communities that one is interested in may be overshadowed by a more domin...

270 | An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation
- Wu, Leahy
- 1993

Citation Context: ...pages two links away do not belong in the community due to the rapid branching factor of the web. Intuitively, our method of using a virtual sink is very similar to methods used in image segmentation [22]. In the maximum flow formulation of image segmentation, adjacent pixels are connected with edge capacities as a function of gray level similarity. All pixels on the perimeter of an image are connected...

183 | Bibliographic coupling between scientific papers
- Kessler
- 1963

Citation Context: ...agents. The fields of citation analysis [4] and bibliometrics [5] also use the citation links between works of literature to identify patterns in collections. Co-citation [6] and bibliographic coupling [7] are two of the more fundamental measures used to characterize the similarity between two documents. The first measures the number of citations in common between two documents, while the second measures...

169 | Diameter of the world wide web
- Albert, Jeong, et al.
- 1999

Citation Context: ...d web sites; (c) vertices of web sites one link away from any seed site; (d) references to sites not in (b) or (c); and (e) the virtual sink vertex. law distribution on the inbound and outbound links [20], web portal sites such as Yahoo! should be very close to the center of the web graph. Thus, by using the top-levels of a small collection of web portals as a virtual sink, the maximum flow formulation...

123 | K-means-type algorithms: a generalized convergence theorem and characterization of local optimality
- Selim, Ismail
- 1984

Citation Context: ...y the local link properties between two documents. Of course, similarity metrics such as co-citation and bibliographic coupling can be used along with classical clustering techniques, such as k-means [9], to reduce the dimensionality of the document space, thus identifying documents in a community that is centered about a cluster centroid. More radical forms of dimensionality reduction have used this...

96 | Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. Annual Meeting of the American Society for Information Science
- Larson
- 1996

Citation Context: ...uments, while the second measures the number of documents that cite both of two documents under consideration. Methods from bibliometrics have also been applied to the problem of mining web documents [8]. Bibliometrics techniques can be thought of as local in nature because they typically consider only the local link properties between two documents. Of course, similarity metrics such as co-citation...

40 | Experimental study of minimum cut algorithms
- Chekuri, Goldberg, et al.
- 1997

Citation Context: ...balanced minimum-cut graph partitioning are NP-complete [14]. On the other hand, if the constraint on the partition sizes is removed, then the problem lends itself to many polynomial time algorithms [15]; however, under this formulation, solutions will often be trivial cuts that leave one partition very small relative to the size of the original graph. Intuitively, balanced minimal cuts are hard beca...

36 | Accessibility of information on the web. Nature
- Lawrence, Giles
- 1999

Citation Context: ...uted for profit or commercial advantage and that copies bear this notice and the full citation on the first page. 16% of the web, and the union of 11 major search engines covers less than 50% of the web [2]. Moreover, search engines are often out-of-date partially due to limited crawling speeds and the low average life-span of web pages. A second dilemma for search engines resides in the balance between...

34 | Clustering and identifying temporal trends in document databases
- Popescul, Flake, et al.
- 2000

Citation Context: ...nts in a community that is centered about a cluster centroid. More radical forms of dimensionality reduction have used this basic idea to cluster literature databases with over 150 thousand documents [10]. However, applying these methods to systems such as the web, with over 10⁹ documents, would obviously be challenging. Recently, Kleinberg [11] showed that the HITS algorithm, which is strongly relat...

26 | Human performance on clustering web pages: A preliminary study
- Macskassy, Banerjee, et al.
- 1998

Citation Context: ...ds to re-populate categories with newer and more relevant sites, thus addressing the lack of recall that portals are known to have and easing the burden on humans that manually construct such portals [24]. In terms of filtering, controversial web sites such as pornography and hate sites could also be identified; since pornography accounts for approximately 2% of the web [2], this is not out of the questi...

20 | PageRank: Bringing order to the Web. Stanford Digital Libraries working paper 19970072
- Page

Citation Context: ...cuments, would obviously be challenging. Recently, Kleinberg [11] showed that the HITS algorithm, which is strongly related to spectral graph partitioning and methods used by the Google search engine [12], can identify "hub" and "authority" web pages. A hub site links to many authority sites and an authority site is linked to by many hubs. Thus, the definition of the two page types is recursive and mut...

15 | Theoretical improvements in algorithmic efficiency for network flow problems
- Edmonds, Karp
- 1972

Citation Context: ...graph that corresponds to the web is vastly larger than any single computer can store in main memory. Nevertheless, one of the simplest maximum flow algorithms, the shortest augmentation path algorithm [19], can solve the problem by examining only the portions of the graph that arise when locating shortest paths between the source and sink nodes. Thus, it should be possible to solve a maximum flow problem...

15 | A new approach to the maximum flow problem
- Goldberg, Tarjan
- 1988

Citation Context: ...THM. Most modern implementations of maximum flow algorithms rely on having access to the entire graph under consideration in order to make the flow analysis efficient. For example, the preflow-push algorithm [25] (considered the fastest in practice) often uses a topological sort of all edges in order to improve efficiency. Clearly, global access to the web graph is not practical if one wishes to calculate an exa...

11 | Co-citation in the scientific literature: A new measure of the relationship between two documents
- Small
- 1973

Citation Context: ...tive properties of interacting agents. The fields of citation analysis [4] and bibliometrics [5] also use the citation links between works of literature to identify patterns in collections. Co-citation [6] and bibliographic coupling [7] are two of the more fundamental measures used to characterize the similarity between two documents. The first measures the number of citations in common between two docum...

9 | Maximal flow through a network
- Ford, Fulkerson
- 1956

Citation Context: ...are water pipes and vertices are pipe junctions, then the maximum flow problem tells you how much water you can move from one point to another. The famous "max-flow min-cut" theorem of Ford and Fulkerson [16, 17] proves that the maximum flow of the network is identical to the minimum cut that separates s and t. Many polynomial time algorithms exist for solving the s-t maximum flow problem, and applications of the...

3 | Inktomi webmap press release
- Inktomi Corporation
- 2000

Citation Context: ...e designers that have to balance a number of conflicting goals in order to make search engines practical in the real-world. One conflict hinges on the sheer number of indexable web pages (now over 10⁹ [1]). Ideally, search engine crawlers could sample the indexable web often enough to insure that results are valid, and broadly enough to insure that all valuable documents are indexed. However, the most...

1 | Citation Indexing: Its Theory and Application in Science
- Garfield
- 1979

Citation Context: ...analysis of social networks [3], where network properties such as cliques, centroids, and diameters are used to analyze the collective properties of interacting agents. The fields of citation analysis [4] and bibliometrics [5] also use the citation links between works of literature to identify patterns in collections. Co-citation [6] and bibliographic coupling [7] are two of the more fundamental measu...