Web Document Clustering: A Feasibility Demonstration
, 1998
Cited by 279 (3 self)
Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
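The idea of grouping snippets by the phrases they share can be illustrated with a small sketch. This is not the STC algorithm from the abstract: it enumerates word n-grams with a dictionary instead of building a suffix tree, so it is neither incremental nor linear time. The function name `shared_phrase_clusters`, the parameters `max_len` and `min_docs`, and the sample snippets are all hypothetical.

```python
from collections import defaultdict


def shared_phrase_clusters(snippets, max_len=3, min_docs=2):
    """Group snippet indices by the word n-grams (phrases) they share.

    Naive illustration only: a dictionary of n-grams stands in for the
    suffix tree, so this is NOT the STC algorithm itself.
    """
    index = defaultdict(set)  # phrase -> indices of snippets containing it
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases that occur in at least min_docs snippets.
    return {p: sorted(d) for p, d in index.items() if len(d) >= min_docs}


# Hypothetical snippets, invented for illustration.
snippets = [
    "suffix tree clustering of web documents",
    "clustering of web search results",
    "genetic algorithm for feature selection",
]
clusters = shared_phrase_clusters(snippets)
```

Each surviving phrase acts as a candidate cluster label, and a snippet may appear under several phrases, so the groups can overlap.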
Fast and Intuitive Clustering of Web Documents
- In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining
, 1997
Cited by 87 (2 self)
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing retrieval results (Cutting et al. 1992). A person browsing the clusters can discover patterns that could be overlooked in the traditional presentation. This paper describes two novel clustering methods that intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersection-based clustering methods on collections of snippets returned from Web search engines. First, we show that word-intersection clustering produces superior clusters and does so faster than standard techniques. Second, we show that our O(n log n) time phrase-intersection clustering method produces comparable clusters and does so more than two orders of magnitude faster than all methods tested.
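As a rough illustration of what intersecting the documents in a cluster means, the sketch below computes the set of words common to every document in a given cluster. It covers only that one step, not the clustering methods evaluated in the paper. The function name `cluster_signature` and the sample documents are hypothetical.

```python
def cluster_signature(cluster_docs):
    """Return the sorted words that occur in every document of the cluster."""
    word_sets = [set(doc.lower().split()) for doc in cluster_docs]
    if not word_sets:
        return []  # an empty cluster has no shared words
    return sorted(set.intersection(*word_sets))


# Hypothetical cluster of three snippets, invented for illustration.
signature = cluster_signature([
    "web document clustering methods",
    "fast clustering of web snippets",
    "clustering web search results",
])
```

A signature that shrinks to nothing when a document is added suggests the document has little in common with the rest of the cluster.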
Learning Implicit User Interest Hierarchy for Context in Personalization
- In Proc. of International Conference on Intelligent User Interface (IUI
, 2003
Cited by 32 (4 self)
To provide a more robust context for personalization, we desire to extract a continuum of general (long-term) to specific (short-term) interests of a user. Our proposed approach is to learn a user interest hierarchy (UIH) from a set of web pages visited by a user. We devise a divisive hierarchical clustering (DHC) algorithm to group words (topics) into a hierarchy where more general interests are represented by a larger set of words. Each web page can then be assigned to nodes in the hierarchy for further processing in learning and predicting interests. This approach is analogous to building a subject taxonomy for a library catalog system and assigning books to the taxonomy. Our approach does not need user involvement and learns the UIH "implicitly." Furthermore, it allows the original objects, web pages, to be assigned to multiple topics (nodes in the hierarchy). In this paper, we focus on learning the UIH from a set of visited pages. We propose a few similarity functions and dynamic threshold-finding methods, and evaluate the resulting hierarchies according to their meaningfulness and shape.
Genetic Algorithm for Feature Subset Selection with Exploitation of Feature Correlations from Continuous Wavelet Transform: a real-case Application
- International Journal of Computational Intelligence
, 2004
Cited by 5 (1 self)
A genetic algorithm (GA) based feature subset selection algorithm is proposed in which the correlation structure of the features is exploited. The subset of features is validated according to the classification performance. Features derived from the continuous wavelet transform are potentially strongly correlated. GAs that do not take the correlation structure of features into account are inefficient. The proposed algorithm forms clusters of correlated features and searches for a good candidate set of clusters. Second, a search within the clusters is performed. Different simulations of the algorithm on a real-case data set with strong correlations between features show increased classification performance. Comparison is performed with a standard GA without use of the correlation structure. Keywords—Classification, genetic algorithm, hierarchical agglomerative clustering, wavelet transform.
Automatic Query Taxonomy Generation for Information Retrieval Applications
- In OIR 2003
Cited by 2 (0 self)
It is crucial for information retrieval systems to learn more about what users search for in order to fulfill the intent of search. This paper introduces a new problem, called query taxonomy generation, which tries to organize users' queries into a hierarchical structure of topic classes. Such a query taxonomy provides a basis for the in-depth analysis of users' queries on a larger scale and can benefit many information retrieval systems. The proposed approach to this problem consists of two computational processes: hierarchical query clustering to generate a query taxonomy from scratch, and query categorization to place newly arrived queries into the taxonomy. The results of a preliminary experiment show the potential of the proposed approach in generating taxonomies for queries, which may be useful in various Web information retrieval applications.
Adaptive Abstraction for Model-Based Reinforcement Learning
, 2006
This paper presents a novel model-based reinforcement learning framework called the Adaptive Modelling and Planning System (AMPS). The challenge of a model-based reinforcement learning agent is using experience in the world to generate a model. In problems with large state and action spaces, the agent must generalise from limited experience by grouping together similar states and actions, effectively partitioning the state and action spaces into finite sets of regions. Several different abstraction approaches have been proposed in the literature, but the existing algorithms have many limitations. They generally only increase resolution, require a large amount of data before changing the abstraction, do not generalise over actions, and are computationally expensive. AMPS aims to solve these problems using a new kind of approach. AMPS splits and merges existing regions in its abstraction according to a set of heuristics. The system introduces splits using a mechanism related to supervised learning and is defined generally, allowing AMPS to leverage a wide variety of representations. The system merges existing regions when an analysis of the current plan indicates that doing so could be useful. Because several different regions may require revision at any given time, AMPS prioritises revision to best utilise whatever computational resources are available. Changes in the abstraction lead to changes in the model, requiring changes to the plan. AMPS prioritises the planning process, and when the agent has time, it replans in high-priority regions. This paper demonstrates the flexibility and strength of this approach in learning

