Efficient Identification of Web Communities
- In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 188 (11 self)
We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
Information retrieval on the Web
- ACM Computing Surveys, 2000
Cited by 58 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Web search results clustering in Polish: experimental evaluation of Carrot
- In IIS03, 2003
Cited by 10 (2 self)
In this paper we consider the problem of web search results clustering in the Polish language, supporting our analysis with results acquired from an experimental system named Carrot. The algorithm under consideration, Suffix Tree Clustering (STC), has been acknowledged as being very efficient when applied to English. We present conclusions from its experimental application to Polish, indicating fragile areas where the algorithm seems to fail due to specific properties of the input data. We indicate that the characteristics of the produced clusters (number, value), unlike in English, strongly depend on the pre-processing phase. We also attempt to investigate the influence of two primary STC parameters, merge threshold and minimum base cluster score, on the number and quality of results. Finally, we introduce two approaches to efficient, approximate stemming of Polish words: a quasi-stemmer and an automaton-based method.
Ontology-based Text Document Clustering
- KÜNSTLICHE INTELLIGENZ, 2002
Cited by 10 (0 self)
Text clustering typically involves clustering in a high dimensional space, which appears difficult with regard to virtually all practical settings. In addition, given a particular clustering result it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. In this paper, we propose a new approach for applying background knowledge during preprocessing in order to improve clustering results and allow for selection between results. We preprocess our input data applying ontology-based heuristics for feature selection and feature aggregation. Thus, we construct a number of alternative text representations. Based on these representations, we compute multiple clustering results using K-Means. The results
An efficient algorithm to rank Web resources
, 2000
Cited by 6 (0 self)
How to rank Web resources is critical to Web Resource Discovery (Search Engine). This paper not only points out the weakness of current approaches, but also presents in-depth analysis of the multidimensionality and subjectivity of rank algorithms. From a dynamics viewpoint, this paper abstracts a user's Web surfing action as a Markov model. Based on this model, we propose a new rank algorithm. The result of our rank algorithm, which synthesizes the relevance, authority, integrativity and novelty of each Web resource, can be computed efficiently not by iteration but through solving a group of linear equations. © 2000 Published by Elsevier Science B.V. All rights reserved.
Information Retrieval on the Web: Selected Topics
- IBM research, Tokyo Research Laboratory, IBM, 1999
Cited by 4 (0 self)
In this paper we review studies on the growth of the Internet and technologies which are useful for information search and retrieval on the Web. In the first section, we present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts and Web sites. Although the numerical figures vary, the overall trends cited by the sources are consistent and point to exponential growth during the coming decade. And Internet users are increasingly using search engines and search services to find specific information of interest. However, users are not satisfied with the performance of the current generation of search engines; the slow speed of retrieval, communication delays, and poor quality of retrieved results (e.g., noise and broken links) are commonly cited problems. The main body of our paper focuses on linear algebraic models and techniques for solving these problems. Keywords: clustering, indexing, information retrieval, Internet, late...
Document Categorization with MajorClust
- In: Proc. 12th Workshop on Information Technology and Systems, Tech. Univ. of, 2002
Cited by 3 (0 self)
This paper investigates the text categorization capabilities of two special clustering algorithms: Fuzzy k-Medoid and MAJORCLUST. Aside from quantifying the categorization performance of the mentioned algorithms, our experimental setting will also help to answer special questions related to clustering problems such as cluster number determination or cluster quality evaluation.
Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data
- In IIPWM04, 2004
Cited by 3 (0 self)
The search results clustering problem is defined as the automatic, on-line grouping of similar documents in a search hits list returned from a search engine. In this paper we present the results of an experimental evaluation of a new algorithm named Lingo. We use the Open Directory Project (ODP) as a source of high-quality narrow-topic document references and mix them into several multi-topic test sets for the algorithm. We then compare the clusters acquired from Lingo to the expected set of ODP categories mixed in the input. Finally, we discuss observations from the experiment, highlighting the algorithm's strengths and weaknesses, and conclude with research directions for the future.
Descriptive Clustering as a Method for Exploring Text Collections
, 2006
Cited by 3 (2 self)
Descriptive clustering as a method for exploring collections of text documents

