Results 1–10 of 197
Authoritative Sources in a Hyperlinked Environment
Journal of the ACM, 1999
"... The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and repo ..."
Cited by 2702 (10 self)
Abstract
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
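The mutually reinforcing hub/authority relationship the abstract describes reduces to a power iteration: authority scores come from in-links from good hubs, hub scores from out-links to good authorities, and the two vectors converge to the principal eigenvectors of AᵀA and AAᵀ. A minimal numpy sketch on a toy link graph (the graph and iteration count are illustrative assumptions, not from the paper):

```python
import numpy as np

def hits(adjacency, iterations=50):
    """Iterate the hub/authority updates; the vectors converge to the
    principal eigenvectors of A @ A.T and A.T @ A respectively."""
    n = adjacency.shape[0]
    hubs = np.ones(n)
    for _ in range(iterations):
        auths = adjacency.T @ hubs   # authoritative pages are pointed to by good hubs
        auths /= np.linalg.norm(auths)
        hubs = adjacency @ auths     # good hubs point to authoritative pages
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy link graph: A[i, j] == 1 means page i links to page j.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
hub_scores, authority_scores = hits(A)
```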
Latent Dirichlet Allocation
Journal of Machine Learning Research, 2003
"... We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a threelevel hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, ..."
Cited by 2350 (63 self)
Abstract
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
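The three-level generative process is compact enough to sketch directly. Below is a toy numpy version of a (smoothed) LDA document generator; all sizes and hyperparameters are illustrative assumptions, and inference, the hard part handled by variational EM in the paper, is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 3, 20, 50   # illustrative sizes, not the paper's

alpha = np.full(n_topics, 0.1)  # Dirichlet prior over per-document topic mixtures
# One word distribution per topic (a sampled stand-in for the corpus-level
# parameter the paper estimates by variational EM).
beta = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)

def generate_document():
    theta = rng.dirichlet(alpha)                          # document's topic mixture
    topics = rng.choice(n_topics, size=doc_len, p=theta)  # one topic per word slot
    return [rng.choice(vocab_size, p=beta[z]) for z in topics]

corpus = [generate_document() for _ in range(5)]
```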
Concept Decompositions for Large Sparse Text Data using Clustering
Machine Learning, 2000
"... . Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as highdimensional and sparse vectorsa few thousand dimensions and a sparsity of 95 to 99 ..."
Cited by 303 (28 self)
Abstract
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
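A minimal sketch of spherical k-means as the abstract describes it: documents are unit vectors, assignment is by cosine similarity (a dot product between unit vectors), and each concept vector is the cluster centroid renormalized to unit length. Dense numpy is used for readability; real document-term matrices of this sparsity would use scipy.sparse.

```python
import numpy as np

def spherical_kmeans(X, k, iterations=20, seed=0):
    """X: rows are document vectors normalized to unit Euclidean norm."""
    rng = np.random.default_rng(seed)
    concepts = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = (X @ concepts.T).argmax(axis=1)  # nearest concept by cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroid = members.sum(axis=0)
                concepts[j] = centroid / np.linalg.norm(centroid)  # unit-norm concept
    return concepts, labels
```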
On Clusterings: Good, Bad and Spectral
2000
"... We motivate and develop a natural bicriteria measure for assessing the quality of a clustering which avoids the drawbacks of existing measures. A simple recursive heuristic has polylogarithmic worstcase guarantees under the new measure. The main result of the paper is the analysis of a popular spe ..."
Cited by 254 (12 self)
Abstract
We motivate and develop a natural bicriteria measure for assessing the quality of a clustering which avoids the drawbacks of existing measures. A simple recursive heuristic has polylogarithmic worst-case guarantees under the new measure. The main result of the paper is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees.
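A sketch of the kind of recursive spectral step analyzed in such work: take the second eigenvector of the normalized Laplacian of a similarity graph, order the vertices along it, and pick the sweep cut with the smallest conductance. Using conductance here is an assumption standing in for the paper's bicriteria measure; this is not the paper's exact algorithm.

```python
import numpy as np

def spectral_bipartition(W):
    """W: symmetric nonnegative similarity matrix of a connected graph.
    Second eigenvector of the normalized Laplacian + best sweep cut."""
    d = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    order = np.argsort(inv_sqrt * vecs[:, 1])  # sort vertices along the Fiedler direction
    best_score, best_cut = np.inf, None
    total_volume = d.sum()
    for i in range(1, len(W)):
        left, right = order[:i], order[i:]
        cut = W[np.ix_(left, right)].sum()
        volume = min(d[left].sum(), total_volume - d[left].sum())
        if cut / volume < best_score:           # conductance of this sweep cut
            best_score, best_cut = cut / volume, set(left.tolist())
    return best_cut, best_score
```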
Locality Preserving Projections
2002
"... Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data s ..."
Cited by 209 (15 self)
Abstract
Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA), a classical linear technique that projects the data along the directions of maximal variance. When the high-dimensional data lies on a low-dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and, more crucially, is defined everywhere in the ambient space rather than just on the training data points. This is borne out by illustrative examples on some high-dimensional data sets.
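The variational problem reduces to a generalized eigenproblem on the graph Laplacian of a neighborhood graph: minimize aᵀXᵀLXa subject to aᵀXᵀDXa = 1, keeping the eigenvectors with the smallest eigenvalues. A small numpy/scipy sketch under common default choices (kNN graph, heat-kernel weights); k, t, and the ridge term are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=5, t=1.0):
    """X: n_samples x n_features. Returns an n_features x n_components map."""
    n = len(X)
    sq = cdist(X, X, "sqeuclidean")
    W = np.exp(-sq / t)                        # heat-kernel edge weights
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]   # k nearest neighbours of each point
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), idx.ravel()] = True
    W *= (mask | mask.T)                       # keep only symmetrized kNN edges
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    # Generalized eigenproblem X^T L X a = lambda X^T D X a.
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])  # small ridge for stability
    _, vecs = eigh(A, B)                       # ascending eigenvalues
    return vecs[:, :n_components]
```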
Self Organization of a Massive Document Collection
IEEE Transactions on Neural Networks
"... This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the SelfOrganizing Map (SOM) algorithm. As the feature vectors for the documents we use statistical representations of their vocabularies. The m ..."
Cited by 204 (14 self)
Abstract
This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the Self-Organizing Map (SOM) algorithm. As the feature vectors for the documents we use statistical representations of their vocabularies. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
Keywords: data mining, exploratory data analysis, knowledge discovery, large databases, parallel implementation, random projection, Self-Organizing Map (SOM), textual documents.
I. Introduction. A. From simple searches to browsing of self-organized data collections. Locating documents on the basis of keywords and simple search expressions is a c...
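The pipeline is: randomly project weighted word histograms to a modest dimension, then train a SOM on the projected vectors. A toy-scale sketch (the paper's corpus and million-unit map are far larger; every size and schedule below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: word histograms randomly projected down, then organized
# on a small 2-D SOM grid.
n_docs, vocab, dim, rows, cols = 200, 1000, 50, 10, 10
histograms = rng.poisson(0.05, size=(n_docs, vocab)).astype(float)
projection = rng.standard_normal((vocab, dim)) / np.sqrt(dim)
docs = histograms @ projection               # 500-dim in the paper; 50 here

units = rng.standard_normal((rows * cols, dim))
coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)                # decaying learning rate
    radius = max(1.0, 5.0 * (1 - epoch / 20))  # shrinking neighbourhood
    for x in docs:
        bmu = np.linalg.norm(units - x, axis=1).argmin()  # best-matching unit
        grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2 * radius ** 2))       # neighbourhood kernel
        units += lr * h[:, None] * (x - units)
```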
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
2001
"... This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMIIR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of wo ..."
Cited by 173 (12 self)
Abstract
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
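The core score is easy to state: a candidate word is ranked by how often it co-occurs near the problem word relative to how often it occurs at all. In the sketch below, hits() is a hypothetical stand-in for a search engine's result count (the paper queried a web search engine with a proximity operator); only the scoring logic is shown:

```python
def hits(query: str) -> int:
    """Hypothetical stand-in for a search engine's result count."""
    raise NotImplementedError("plug in a real hit-count source here")

def pmi_ir_score(problem: str, choice: str) -> float:
    # PMI up to a constant: p(problem NEAR choice) / (p(problem) * p(choice));
    # the hits(problem) factor is identical for every choice, so it is dropped.
    return hits(f"{problem} NEAR {choice}") / hits(choice)

def best_synonym(problem: str, choices: list[str]) -> str:
    return max(choices, key=lambda c: pmi_ir_score(problem, c))
```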
Semantic Search
International World Wide Web Conference, 2003
"... Activities such as Web Services and the Semantic Web are working to create a web of distributed machine understandable data. In this paper we present an application called Semantic Search which is built on these supporting technologies and is designed to improve traditional web searching. We provide ..."
Cited by 159 (3 self)
Abstract
Activities such as Web Services and the Semantic Web are working to create a web of distributed machine-understandable data. In this paper we present an application called Semantic Search which is built on these supporting technologies and is designed to improve traditional web searching. We provide an overview of TAP, the application framework upon which Semantic Search is built. We describe two implemented Semantic Search systems which, based on the denotation of the search query, augment traditional search results with relevant data aggregated from distributed sources. We also discuss some general issues related to searching and the Semantic Web and outline how an understanding of the semantics of the search terms can be used to provide better results.
Database-friendly Random Projections
2001
"... A classic result of Johnson and Lindenstrauss asserts that any set of n points in ddimensional Euclidean space can be embedded into kdimensional Euclidean space  where k is logarithmic in n and independent of d  so that all pairwise distances are maintained within an arbitrarily small factor. Al ..."
Cited by 158 (3 self)
Abstract
A classic result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space, where k is logarithmic in n and independent of d, so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the n points onto a random k-dimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over k random attribute partitions.
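One construction in this line of work draws the projection entries from a sparse three-valued distribution, so each reduced coordinate is just a signed, scaled sum over roughly a third of the attributes, the kind of aggregate a database computes cheaply. A numpy sketch of that sqrt(3)·{+1, 0, −1} variant (probabilities 1/6, 2/3, 1/6); the sizes in the usage example are illustrative:

```python
import numpy as np

def database_friendly_projection(X, k, seed=0):
    """Embed the rows of X (n points in d dimensions) into k dimensions.
    Two thirds of the projection entries are zero, so each output
    coordinate aggregates a random subset of the original attributes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                   size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return (X @ R) / np.sqrt(k)

X = np.random.default_rng(1).standard_normal((100, 10_000))
Y = database_friendly_projection(X, k=400)  # pairwise distances roughly preserved
```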