Web Document Clustering: A Feasibility Demonstration
, 1998
Cited by 279 (3 self)
Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear-time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
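The idea of clustering snippets by shared phrases can be illustrated with a minimal sketch. This is not the STC algorithm from the paper: it replaces the suffix tree with a plain dictionary of word n-grams, so it is neither incremental nor linear time. The function name, the parameters `max_len` and `min_docs`, and the sample snippets are all assumptions made for illustration.

```python
from collections import defaultdict


def shared_phrase_clusters(snippets, max_len=3, min_docs=2):
    """Map each phrase (word n-gram) to the set of snippets containing it.

    Simplified stand-in for a suffix tree: every n-gram up to max_len
    words is indexed in a dictionary instead of a tree.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by at least min_docs snippets.
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}


# Made-up snippets, for illustration only.
snippets = [
    "suffix tree clustering of web documents",
    "web documents and search engines",
    "suffix tree construction in linear time",
]
clusters = shared_phrase_clusters(snippets)
```

Each surviving phrase acts as a candidate cluster label, and its document set is the cluster; a fuller treatment would also merge clusters whose document sets overlap heavily.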
Grouper: A Dynamic Clustering Interface to Web Search Results
, 1999
Cited by 196 (2 self)
Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search engines. The NorthernLight search engine organizes its output into "custom folders" based on pre-computed document labels, but does not reveal how the folders are generated or how well they correspond to users' interests. In this paper, we introduce Grouper -- an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets. In addition, we report on the first empirical comparison of user Web search behavior on a standard ranked-list presentation versus a clustered presentation. By analyzing HuskySearch logs, we are able to demonstrate substantial differences in the number of documents f...
BitCube: A Three-Dimensional Bitmap Indexing for XML Documents
- Journal of Intelligent Information Systems
, 2001
Cited by 12 (2 self)
XML is a new standard for exchanging and representing information on the Internet. Documents can be hierarchically represented by XML elements. In this paper, we propose that an XML document collection be represented and indexed using a bitmap indexing technique. We define the similarity and popularity operations suitable for bitmap indexes. We also define statistical measurements in the BitCube: center and radius. Based on these measurements, we describe a new technique, based on bitmap indexing, to cluster XML documents. The techniques for clustering are motivated by the fact that the bitmap indexes are expected to be very sparse.
A modified fuzzy art for soft document clustering
- In: Proc. International Joint Conference on Neural Networks
, 2002
Cited by 9 (2 self)
Document clustering has become a very useful application, especially with the advent of the World Wide Web. Most existing document clustering algorithms either produce clusters of poor quality or are computationally very expensive. In this paper we propose a document-clustering algorithm, KMART, that uses an unsupervised Fuzzy Adaptive Resonance Theory (Fuzzy-ART) neural network. A modified version of Fuzzy ART is used to allow a document to belong to multiple clusters. The number of clusters is determined dynamically. Experiments are reported that compare the efficiency and execution time of our algorithm with other document-clustering algorithms such as Fuzzy c-Means. The results show that KMART is both effective and efficient.
Web Mining: Machine Learning for Web Applications
- Annual Review of Information Science and Technology
, 2004
Cited by 9 (7 self)
With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich
Rijke. Resolving person names in web people search
- In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005)
, 2005
Cited by 2 (1 self)
Disambiguating person names in a set of documents (such as a set of web pages returned in response to a person name) is a key task for the presentation of results and the automatic profiling of experts. With largely unstructured documents and an unknown number of people with the same name, the problem presents many difficulties and challenges. This chapter treats the task of person name disambiguation as a document clustering problem, where it is assumed that the documents represent particular people. This leads to the person cluster hypothesis, which states that similar documents tend to represent the same person. Single Pass Clustering, k-Means Clustering, Agglomerative Clustering and Probabilistic Latent Semantic Analysis are employed and empirically evaluated in this context. On the SemEval 2007 Web People Search task, it is shown that the person cluster hypothesis holds reasonably well and that the Single Pass Clustering and Agglomerative Clustering methods provide the best performance.
Personal Name Resolution of Web People Search
Cited by 1 (1 self)
Disambiguating personal names in a set of documents (such as a set of web pages returned in response to a person name) is a difficult and challenging task. In this paper, we explore the extent to which the “cluster hypothesis” holds for this task (i.e., that similar documents tend to represent the same person). We explore two clustering techniques, which use either (1) term-based matching (single pass clustering) or (2) semantic-based matching (Probabilistic Latent Semantic Analysis). We compare and contrast these strategies and provide strong evidence to suggest that the hypothesis holds for the former. In fact, on the new evaluation platform of the SemEval 2007 Web People Search task, we show that single pass clustering with a standard IR document representation fits well with the assumptions about the data and the task, and yields state-of-the-art performance.
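Single pass clustering with term-based matching can be sketched as follows. This is a generic textbook variant, not the authors' implementation: the bag-of-words term counts, the cosine similarity measure, the summed-count centroid, the threshold value, and the sample pages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document, in order, to its most similar cluster.

    A document joins the best-matching cluster if the similarity reaches
    the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for idx, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["centroid"].update(vec)  # centroid = summed term counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
    return [c["members"] for c in clusters]


# Made-up pages about two different people sharing one name.
pages = [
    "john smith professor of computer science",
    "john smith guitarist and songwriter",
    "professor john smith computer science lectures",
    "songwriter john smith new album guitarist",
]
groups = single_pass_cluster(pages, threshold=0.6)
```

Because each document is compared only against the clusters seen so far, the result depends on document order and on the threshold, which controls how many distinct people the method proposes.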
A Semantic Graph Model for Text Representation and Matching in Document Mining
, 2006
Cited by 1 (0 self)
The explosive growth in the number of documents produced daily necessitates the development of effective alternatives to explore, analyze, and discover knowledge from documents. Document mining research work has emerged to devise automated means to discover and analyze useful information from documents. This work has been mainly concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and utilizing that in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction. Conventional text representation methodologies consider documents as bags of words and ignore the meanings and ideas their authors want to convey. It is this
Learning Objects Clustering based on Semantic Understanding of Text
To discover knowledge from the available volumes of learning objects, the tasks to manage, analyze, search, filter, and summarize them should be automated. This requires understanding of the objects' contents. The main objective of our work is to advance the state-of-the-art techniques in learning objects mining by developing and demonstrating the use of semantic understanding as the basis of its mechanisms. The approach is based on semantic notions to represent text, and to estimate distances between the represented text contents of the objects. The representation reflects existing relations between concepts and facilitates accurate similarity judgments that result in better mining performance. Mining processes are carried out by the developed models and algorithms. In this paper, the application of the semantic understanding-based approach in clustering learning objects is presented. Experimental work is reported, and its results are presented and analyzed.

