Mining the Web for Synonyms: PMI-IR Versus LSA on TOEFL
, 2001
Cited by 118 (10 self)
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
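The abstract above describes scoring word pairs by pointwise mutual information estimated from search-engine hit counts. The following is a minimal sketch of that idea, not the paper's exact scoring function: the helper names, the example words, and every count are invented for illustration, and a real system would obtain the counts from search-engine queries.

```python
import math


def pmi(hits_xy, hits_x, hits_y, total):
    """Pointwise mutual information log2(p(x, y) / (p(x) * p(y))) from raw counts."""
    if min(hits_xy, hits_x, hits_y) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    return math.log2((hits_xy * total) / (hits_x * hits_y))


def best_synonym(problem, choices, hits, cooccur, total):
    """Pick the choice whose co-occurrence with the problem word has the highest PMI."""
    return max(
        choices,
        key=lambda c: pmi(cooccur.get((problem, c), 0), hits[problem], hits.get(c, 0), total),
    )


# Hypothetical hit counts (assumptions, not data from the paper).
TOTAL_DOCS = 1_000_000
HITS = {"levied": 2000, "imposed": 40000, "believed": 90000,
        "requested": 60000, "correlated": 15000}
COOCCUR = {("levied", "imposed"): 800, ("levied", "believed"): 90,
           ("levied", "requested"): 100, ("levied", "correlated"): 10}

answer = best_synonym("levied", ["imposed", "believed", "requested", "correlated"],
                      HITS, COOCCUR, TOTAL_DOCS)
```

Because the problem word and the total are the same for every choice, the ranking depends only on the ratio of joint hits to the hits for each choice.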
Reading Tea Leaves: How Humans Interpret Topic Models
Cited by 45 (5 self)
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
Topic Identification in Dynamical Text by Complexity Pursuit
, 2003
Cited by 13 (2 self)
The problem of analysing dynamically evolving textual data has arisen within the last few years. An example of such data is the discussion appearing in Internet chat lines. In this Letter a recently introduced source separation method, termed complexity pursuit, is applied to the problem of finding topics in dynamical text and is compared against several blind separation algorithms for the problem considered. Complexity pursuit is a generalisation of projection pursuit to time series, and it is able to use both higher-order statistical measures and temporal dependency information in separating the topics. Experimental results on chat line and newsgroup data demonstrate that the minimum complexity time series do indeed correspond to meaningful topics inherent in the dynamical text data, and also suggest the applicability of the method to query-based retrieval from a temporally changing text stream.
Exploiting Information Access Patterns for Context-Based Retrieval
Cited by 7 (0 self)
In order for intelligent interfaces to provide proactive assistance, they must customize their behavior based on the user's task context. Existing systems often assess context based on a single snapshot of the user's current activities (e.g., examining the content of the document that the user is currently consulting). However, an accurate picture of the user's context may depend not only on this local information, but also on information about the user's behavior over time. This paper discusses work on a recommender system, Calvin, which learns to identify broader contexts by relating documents that tend to be accessed together. Calvin's text analysis algorithm, WordSieve, develops term vector descriptions of these contexts in real time, without needing to accumulate comprehensive statistics about an entire corpus. Calvin uses these descriptions (1) to index documents to suggest them in similar future contexts and (2) to formulate context-based queries for search engines. Results of initial experiments are encouraging for the approach's improved ability to associate documents with the research tasks in which they were consulted, compared to methods using only local information. This paper sketches the project goals, the current implementation of the system, and plans for its continued development and evaluation.
Protein association discovery in biomedical literature
- In Proceedings of the Joint Conference on Digital Libraries (JCDL 2003)
, 2003
Cited by 6 (3 self)
Protein association discovery can directly contribute toward developing protein pathways; hence it is a significant problem in bioinformatics. LUCAS (Library of User-Oriented Concepts for Access Services) was designed to automatically extract and determine associations among proteins from biomedical literature. Such a tool has notable potential to automate database construction in biomedicine, instead of relying on experts' analysis. This paper reports on the mechanisms for automatically generating clusters of proteins. A formal evaluation of the system, based on a subset of 2000 MEDLINE titles and abstracts, has been conducted against the Swiss-Prot database, in which the associations among concepts are entered manually by experts.
Support Vector Machines for Text Categorization Based on Latent Semantic Indexing
, 2001
Cited by 4 (0 self)
Text Categorization (TC) is an important component in many information organization and information management tasks. Two key issues in TC are feature coding and classifier design. This paper describes an approach to Text Categorization via Support Vector Machines (SVMs) based on Latent Semantic Indexing (LSI). Latent Semantic Indexing [1][2] is a method for selecting informative subspaces of feature spaces with the goal of obtaining a compact representation of documents. Support Vector Machines [3] are powerful machine learning systems that combine remarkable performance with an elegant theoretical framework. SVMs fit the Text Categorization task well because of the special properties of text itself. Experiments show that the LSI+SVMs framework improves clustering performance by focusing the attention of the Support Vector Machines on informative subspaces of the feature space.
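The LSI step mentioned above amounts to a truncated singular value decomposition of a term-document matrix, with each document then represented by its coordinates in the leading singular directions. The sketch below illustrates only that projection, under several assumptions: the toy matrix and its term labels are invented, plain power iteration with deflation stands in for a library SVD routine, and the SVM training stage is not shown.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplets(matrix, k, iters=200):
    """Return up to k (sigma, u, v) triplets via power iteration with deflation."""
    a = [[float(x) for x in row] for row in matrix]
    triplets = []
    for _component in range(k):
        at = transpose(a)
        v = [1.0 / (j + 1) for j in range(len(a[0]))]  # deterministic start vector
        for _step in range(iters):
            w = matvec(at, matvec(a, v))  # one multiplication by A^T A
            n = norm(w)
            if n == 0.0:
                break
            v = [x / n for x in w]
        av = matvec(a, v)
        sigma = norm(av)
        if sigma == 0.0:
            break  # nothing left to extract
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-1 component just found.
        a = [[a[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return triplets


def document_coordinates(triplets, doc_index):
    """Coordinates of one document in the reduced (latent) space."""
    return [sigma * v[doc_index] for sigma, _u, v in triplets]


def cosine(x, y):
    denom = norm(x) * norm(y)
    return 0.0 if denom == 0.0 else sum(a * b for a, b in zip(x, y)) / denom


# Toy term-document matrix (rows are terms, columns are documents); invented data.
term_doc = [
    [1, 2, 0, 0],  # "ship"
    [1, 1, 0, 0],  # "boat"
    [0, 0, 1, 1],  # "tree"
    [0, 0, 1, 1],  # "wood"
]
triplets = top_singular_triplets(term_doc, k=2)
docs = [document_coordinates(triplets, j) for j in range(4)]
```

In this toy case the first two documents end up close together in the two-dimensional latent space and far from the last two, which is the kind of compact representation a downstream classifier would be trained on.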
Predicting Who Rated What in Large-scale Datasets
, 2007
Cited by 4 (0 self)
KDD Cup 2007 focuses on movie rating behaviors. The goal of the task “Who Rated What” is to predict whether “existing” users will review “existing” movies in the future. We cast the task as a link prediction problem and address it via a simple classification approach. Compared with other applications of link prediction, our task poses two major challenges: (1) the huge size of the Netflix data; (2) the prediction target is complicated by many factors, such as a general decrease of interest in old movies and a greater tendency of Netflix users to review more movies, owing to the success of the internet DVD rental industry. We address the first challenge by “selective” subsampling and the second by effectively combining information from the review scores, movie contents and graph topology.
A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
Cited by 4 (0 self)
We consider dynamic co-occurrence data, such as author-word links in papers published in successive years of the same conference. For static co-occurrence data, researchers often seek an embedding of the entities (authors and words) into a low-dimensional Euclidean space. We generalize a recent static co-occurrence model, the CODE model of Globerson et al. (2004), to the dynamic setting: we seek coordinates for each entity at each time step. The coordinates can change with time to explain new observations, but since large changes are improbable, we can exploit data at previous and subsequent steps to find a better explanation for current observations. To make inference tractable, we show how to approximate our observation model with a Gaussian distribution, allowing the use of a Kalman filter for tractable inference. The result is the first algorithm for dynamic embedding of co-occurrence data which provides distributional information for its coordinate estimates. We demonstrate our model both on synthetic data and on author-word data from the NIPS corpus, showing that it produces intuitively reasonable embeddings. We also provide evidence for the usefulness of our model by its performance on an author-prediction task.
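The abstract above relies on a Kalman filter to track coordinates that drift slowly over time. As a rough illustration of that building block only, the sketch below filters a single scalar coordinate under a random-walk model with Gaussian noise. The function names, noise variances, and observations are assumptions made for the example; the paper's model works with multi-dimensional embeddings, a Gaussian approximation of a co-occurrence observation model, and information from subsequent steps as well, none of which is reproduced here.

```python
def kalman_step(mean, var, obs, process_var, obs_var):
    """One predict/update cycle for a scalar random-walk state."""
    # Predict: the coordinate may drift, so uncertainty grows.
    pred_mean = mean
    pred_var = var + process_var
    # Update: blend the prediction with the new observation.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (obs - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


def kalman_filter(observations, init_mean, init_var, process_var, obs_var):
    """Run the filter forward and return (mean, variance) per time step."""
    mean, var = init_mean, init_var
    estimates = []
    for obs in observations:
        mean, var = kalman_step(mean, var, obs, process_var, obs_var)
        estimates.append((mean, var))
    return estimates


# Invented noisy observations of one coordinate at successive time steps.
noisy_positions = [0.9, 1.1, 1.0, 1.3, 1.2]
estimates = kalman_filter(noisy_positions, init_mean=0.0, init_var=1.0,
                          process_var=0.01, obs_var=0.25)
```

Each estimate carries a variance alongside its mean, which corresponds to the "distributional information" the abstract refers to; a small `process_var` encodes the assumption that large changes between steps are improbable.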
Concept extraction and association from cancer literature
- In Proceedings of the fourth international workshop on Web information and data management of ACM CIKM
, 2002
Cited by 3 (2 self)
There is a large and growing body of web-accessible biomedical literature. As this body of electronic literature grows, so does the possibility that document analysis techniques can be used to automatically extract useful biomedical information from it, particularly in the discovery of key concepts dealing with genes, proteins, drugs, and diseases and of associations among these concepts. VCGS (Vocabulary Cluster Generating System) was designed to automatically extract and determine associations among tokens from a subset of the biomedical literature, namely cancer literature. Such information has notable potential to automate database construction in biomedicine, instead of relying on experts' analysis. This paper reports on the mechanisms for automatically generating clusters of tokens. A formal evaluation of the system, based on a subset of 5338 PubMed titles and abstracts, has been conducted against the Swiss-Prot database, in which the associations among concepts are entered by hand by experts.

