A maximum entropy approach to collaborative filtering in dynamic, sparse, high-dimensional domains
, 2002
Cited by 24 (6 self)
We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic, conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns, attempting to minimize the a priori probability that recommendations will cross cluster boundaries, and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
Language Models for Hierarchical Summarization
, 2003
Cited by 8 (2 self)
Hierarchies have long been used for organization, summarization, and access to information. In this dissertation we define summarization in terms of a probabilistic language model and use this definition to explore a new technique for automatically generating topic hierarchies. We use the language model to characterize the documents that will be summarized and then apply a graph-theoretic algorithm to determine the best topic words for the hierarchical summary. This work is very different from previous attempts to generate topic hierarchies because it relies on statistical analysis and language modeling to identify descriptive words for a document and organize the words in a hierarchical structure. We compare
Automatically labeling hierarchical clusters
- In Proc. of the Sixth National Conference on Digital Government Research
, 2006
Cited by 7 (0 self)
Government agencies must often quickly organize and analyze large amounts of textual information, for example comments received as part of notice and comment rulemaking. Hierarchical organization is popular because it represents information at different levels of detail and is convenient for interactive browsing. Good hierarchical clustering algorithms are available, but there are few good solutions for automatically labeling the nodes in a cluster hierarchy. This paper presents a simple algorithm that automatically assigns labels to hierarchical clusters. The algorithm evaluates candidate labels using information from the cluster, the parent cluster, and corpus statistics. A trainable threshold enables the algorithm to assign just a few high-quality labels to each cluster. Experiments with Open Directory Project (ODP) hierarchies indicate that the algorithm creates cluster labels that are similar to labels created by ODP editors.
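The abstract above does not give the scoring function, so the following is only a minimal sketch of the general idea: score each candidate label from cluster, parent-cluster, and corpus statistics, then keep the labels that clear a threshold. The function names, the product-form score, and the example fractions are all hypothetical and are not taken from the paper.

```python
def label_score(frac_cluster, frac_parent, frac_corpus):
    """Hypothetical score (an assumption, not the paper's formula).

    Each argument is the fraction of documents containing the term in
    the cluster, in the parent cluster, and in the whole corpus. The
    score is high when a term is common in the cluster but rarer in
    the parent cluster and in the corpus overall.
    """
    return frac_cluster * (1.0 - frac_parent) * (1.0 - frac_corpus)


def select_labels(candidates, threshold):
    """Return terms scoring at least `threshold`, best first.

    `candidates` maps term -> (frac_cluster, frac_parent, frac_corpus).
    In a trainable setting the threshold would be tuned on labeled
    hierarchies; here it is simply passed in.
    """
    scored = {term: label_score(*fracs) for term, fracs in candidates.items()}
    kept = [term for term, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda term: scored[term], reverse=True)


# Made-up statistics for three candidate labels.
example = {
    "tax": (0.9, 0.3, 0.05),
    "the": (1.0, 1.0, 1.0),
    "audit": (0.5, 0.2, 0.02),
}
print(select_labels(example, threshold=0.35))
```

Raising the threshold trades coverage for precision, which is one way a single tunable parameter could limit each cluster to a few labels.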
Discovering a term taxonomy from term similarities using principal component analysis
- Semantics, Web and Mining., LNAI 4289
, 2006
Cited by 2 (0 self)
We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were only known to be useful for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations between a given pair of terms in a given collection: unrelated (car-fruit), symmetrically related (car-automobile), asymmetrically related with the first term being more specific than the second (banana-fruit), and asymmetrically related in the other direction (fruit-banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, the criterion does the apparently right thing. We applied our scheme to the reconstruction of a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
A web-based novel term similarity framework for ontology learning
- In: ODBASE: Int. Conf. on Ontologies, Databases and Applications of Semantics
, 2006
Cited by 1 (0 self)
Given that pairwise similarity computations are essential in ontology learning and data mining, we propose a similarity framework that is based on a conventional Web search engine. Utilizing a Web search engine offers two main benefits. First, we can obtain the freshest content for each term, which represents the up-to-date knowledge on the term. This is particularly useful for dynamic ontology management, in that ontologies must evolve with time as new concepts or terms appear. Second, in comparison with approaches that use a fixed amount of crawled Web documents as a corpus, our method is less sensitive to the problem of data sparseness because we access as much content as possible using a search engine. At the core of our proposed methodology, we present two different measures for similarity computation: a mutual-information-based metric and a feature-based metric. Moreover, we show how the proposed metrics can be utilized for modifying existing ontologies. Finally, we compare the extracted similarity relations with semantic similarity using WordNet. Experimental results show that our method can extract topical relations between terms that are not present in conventional concept-based ontologies.
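The abstract does not spell out its mutual-information-based measure. As an illustration only, one common way to derive such a score from search-engine page counts is pointwise mutual information (PMI). The sketch below assumes the counts have already been obtained; the function name, its parameters, and the numbers are hypothetical, and no real search API is called.

```python
import math


def pmi_from_hits(hits_x, hits_y, hits_xy, total_pages):
    """Pointwise mutual information of two terms from page counts.

    hits_x, hits_y : pages containing each term alone (assumed inputs)
    hits_xy        : pages containing both terms
    total_pages    : assumed size of the search engine's index

    PMI = log2( P(x, y) / (P(x) * P(y)) ), with each probability
    estimated as a page count divided by total_pages.
    """
    if min(hits_x, hits_y, hits_xy) <= 0:
        # No co-occurrence (or an unseen term): treat as maximally unrelated.
        return float("-inf")
    return math.log2(hits_xy * total_pages / (hits_x * hits_y))


# Made-up counts: the two terms co-occur four times more often than
# independence would predict.
print(pmi_from_hits(100, 100, 40, 1000))
```

A score near zero suggests the terms co-occur about as often as chance predicts, while larger positive values suggest a topical relation.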
Long-Answer Question Answering and Rhetorical-Semantic Relations
, 2007
Cited by 1 (0 self)
Over the past decade, Question Answering (QA) has generated considerable interest and participation in the fields of Natural Language Processing and Information Retrieval. Conferences such as TREC, CLEF and DUC have examined various aspects of the QA task in the academic community. In the commercial world, major search engines from Google, Microsoft and Yahoo have integrated basic QA capabilities into their core web search. These efforts have focused largely on so-called “factoid” questions seeking a single fact, such as the birthdate of an individual or the capital city of a country. Yet in the past few years, there has been growing recognition of a broad class of “long-answer” questions which cannot be satisfactorily answered in this framework, such as those seeking a definition, explanation, or other descriptive information in response. In this thesis, we consider the problem of answering such questions, with particular focus on the contribution to be made by integrating rhetorical and semantic models. We present DefScriber, a system for answering definitional (“What is X?”), biographical (“Who is X?”) and other long-answer questions using a hybrid of goal- and data-driven methods. Our goal-driven, or top-down, approach is motivated by a set of definitional pred-
Word Clouds of Multiple Search Results
Cited by 1 (1 self)
Search engine result pages (SERPs) are known as the most expensive real estate on the planet. Most queries yield millions of organic search results, yet searchers seldom look beyond the first handful of results. To make things worse, different searchers with different query intents may issue the exact same query. An alternative to showing individual web pages summarized by snippets is to represent whole groups of results. In this paper we investigate whether we can use word clouds to summarize groups of documents, e.g. to give a preview of the next SERP, or clusters of topically related documents. We experiment with three word cloud generation methods (full-text, query-biased, and anchor-text-based clouds) and evaluate them in a user study. Our findings are as follows. First, biasing the cloud towards the query does not help test persons better distinguish the relevance and topic of the search results, but test persons prefer query-biased clouds because differences between the clouds are emphasized. Second, anchor text clouds are to be preferred over full-text clouds, since anchor text contains fewer noisy words than the full text of documents. Third, we obtain moderately positive results on the relation between the selected word clouds and the underlying search results: there is exact correspondence in 70% of the subtopic matching judgments and in 60% of the relevance assessment judgments. Our initial experiments open up new possibilities to have SERPs reflect a far larger number of results by using word clouds to summarize groups of search results.
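To make the idea of a full-text cloud concrete, here is a minimal sketch that pools the text of a group of results and keeps the most frequent non-stopword terms. It is an assumption-laden illustration rather than the paper's method: the tokenizer, the tiny stopword list, and the function name are all hypothetical, and a real cloud would also map counts to font sizes.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for"})


def cloud_terms(documents, top_k=10):
    """Return the top_k (term, count) pairs over a group of documents.

    Text is lowercased and split into alphabetic tokens, stopwords are
    dropped, and raw term frequency across the whole group determines
    which terms would appear in the cloud.
    """
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(top_k)


# Two made-up result texts standing in for one group of search results.
group = ["The cat sat", "A cat and a dog"]
print(cloud_terms(group, top_k=1))
```

Under the same assumptions, a query-biased or anchor-text variant would change only which text is fed in or how tokens are weighted, not the counting step.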
WebSim: A Novel Term Similarity Metric based on a Web Search Technology
Given that pairwise similarity computations are essential in ontology learning and data mining, we propose WebSim (Web-based term Similarity metric), whose feature extraction and similarity model are based on a conventional Web search engine. Utilizing a Web search engine offers two main benefits. First, we can obtain the freshest content for each term, which represents the up-to-date knowledge on the term. This is particularly useful for dynamic ontology management, in that ontologies must evolve with time as new concepts or terms appear. Second, in comparison with approaches that use a fixed amount of crawled Web documents as a corpus, our method is less sensitive to the problem of data sparseness because we access as much content as possible using a search engine. At the core of WebSim, we present two different methodologies for similarity computation: a mutual-information-based metric and a feature-based metric. Moreover, we show how WebSim can be utilized for modifying existing ontologies. Finally, we demonstrate the characteristics of WebSim by coupling it with WordNet. Experimental results show that WebSim can uncover topical relations between terms that are not shown in conventional concept-based ontologies.
A Maximum Entropy Approach To
- In Proceedings of Neural Information Processing Systems
, 2002
We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user's current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic, conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns, attempting to minimize the a priori probability that recommendations will cross cluster boundaries, and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework.
PageCluster: Mining Conceptual Link
- ACM Trans. Inter. Tech
, 2004
In this article, we propose a solution to the navigation problem. We first construct the link hierarchy of a Web site on the basis of user traversals on hyperlinks recorded in a Web log file. The pages on each conceptual level of the link hierarchy are then clustered on the basis of their link similarities. Finally, the clusters are used to construct a conceptual link hierarchy of the Web site that can be visualized to help users navigate the Web site. Our work also presents a new approach to building adaptive Web sites that can automatically change their organization and presentation to assist user navigation by learning from Web usage data [Perkowitz and Etzioni 1997].

