Results 11–20 of 92
Extracting key-substring-group features for text classification
 In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), 2006
Abstract

Cited by 11 (0 self)
In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
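The paper's linear-time suffix-tree construction is involved, but the core notion of a statistical equivalence group — substrings that occur in exactly the same documents with the same counts, and so carry identical information as features — can be illustrated with a naive quadratic-time sketch (all names here are illustrative, not the authors' code):

```python
from collections import defaultdict

def substring_groups(docs, max_len=4):
    """Group substrings that occur in exactly the same documents with the
    same counts; such substrings are statistically equivalent as features.
    Naive O(n^2) enumeration -- the paper achieves this in linear time
    with a suffix tree."""
    profile = defaultdict(lambda: defaultdict(int))  # substring -> {doc_id: count}
    for d, doc in enumerate(docs):
        for i in range(len(doc)):
            for j in range(i + 1, min(i + max_len, len(doc)) + 1):
                profile[doc[i:j]][d] += 1
    groups = defaultdict(list)  # occurrence profile -> substrings sharing it
    for s, occ in profile.items():
        groups[tuple(sorted(occ.items()))].append(s)
    return list(groups.values())

# "bc" and "bca" occur only in the first document, once each, so they
# fall into the same equivalence group
groups = substring_groups(["abcab", "abd"])
```

A feature selector would then keep one representative per important group instead of every raw substring.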
A Riemannian approach to graph embedding
 Pattern Recognition
Abstract

Cited by 10 (2 self)
In this paper, we make use of the relationship between the Laplace–Beltrami operator and the graph Laplacian, for the purposes of embedding a graph onto a Riemannian manifold. To embark on this study, we review some of the basics of Riemannian geometry and explain the relationship between the Laplace–Beltrami operator and the graph Laplacian. Using the properties of Jacobi fields, we show how to compute an edge-weight matrix in which the elements reflect the sectional curvatures associated with the geodesic paths on the manifold between nodes. For the particular case of a constant sectional curvature surface, we use the Kruskal coordinates to compute edge weights that are proportional to the geodesic distance between points. We use the resulting edge-weight matrix to embed the nodes of the graph onto a Riemannian manifold. To do this we develop a method that can be used to perform double centering on the Laplacian matrix computed from the edge weights. The embedding coordinates are given by the eigenvectors of the centred Laplacian. With the set of embedding coordinates at hand, a number of graph manipulation tasks can be performed. In this paper we are primarily interested in graph matching. We recast the graph matching problem as that of aligning pairs of manifolds subject to a geometric transformation. We show that this transformation is Procrustean in nature. We illustrate the utility of the method on image matching using the COIL database.
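The paper centres the Laplacian built from its curvature-derived edge weights; closely related, and short enough to sketch, is the classical-MDS double centering applied to squared geodesic distances, which shows the mechanics of turning an edge-weight matrix into embedding coordinates (a hedged sketch, not the authors' exact procedure):

```python
import numpy as np

def embed_from_weights(W, dim=2):
    """Embed graph nodes given a symmetric matrix W whose entries
    approximate squared geodesic distances between nodes.  Double
    centering (as in classical MDS) followed by an eigendecomposition
    yields the embedding coordinates."""
    n = W.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ W @ J                  # doubly centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # keep the largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

For exact squared Euclidean distances this recovers the original point configuration up to rotation and translation, which is the sanity check used below.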
Learning Latent Semantic Relations from Clickthrough Data for Query Suggestion
2008
Abstract

Cited by 10 (1 self)
For a given query raised by a specific user, the Query Suggestion technique aims to recommend relevant queries which potentially suit the information needs of that user. Due to the complexity of the Web structure and the ambiguity of users' inputs, most of the suggestion algorithms suffer from the problem of poor recommendation accuracy. In this paper, aiming at providing semantically relevant queries for users, we develop a novel, effective and efficient two-level query suggestion model by mining clickthrough data, in the form of two bipartite graphs (user–query and query–URL bipartite graphs) extracted from the clickthrough data. Based on this, we first propose a joint matrix factorization method which utilizes two bipartite graphs to learn the low-rank query latent feature space, and then build a query similarity graph based on the features. After that, we design an online ranking algorithm to propagate similarities on the query similarity graph, and finally recommend latent semantically relevant queries to users. Experimental analysis on the clickthrough data of a commercial search engine shows the effectiveness and the efficiency of our method.
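The shared-factor idea — one query feature matrix must explain both bipartite graphs — can be sketched with plain gradient descent on a squared-error objective (the variable names and loss are illustrative; the paper's formulation, regularisation, and online ranking stage are not reproduced here):

```python
import numpy as np

def joint_factorize(A, B, r=2, lr=0.01, iters=2000, seed=0):
    """Learn a shared low-rank query feature matrix Q from a user-query
    matrix A (users x queries) and a query-URL matrix B (queries x URLs),
    so that A ~ U Q^T and B ~ Q V^T.  Plain gradient descent on the
    summed squared reconstruction errors; a sketch of the idea only."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = B.shape[1]
    U = rng.standard_normal((m, r)) * 0.1
    Q = rng.standard_normal((n, r)) * 0.1
    V = rng.standard_normal((k, r)) * 0.1
    for _ in range(iters):
        Ea = U @ Q.T - A                   # residual on the user-query graph
        Eb = Q @ V.T - B                   # residual on the query-URL graph
        U -= lr * Ea @ Q
        V -= lr * Eb.T @ Q
        Q -= lr * (Ea.T @ U + Eb @ V)      # Q receives gradients from both
    return U, Q, V
```

Query–query similarities (e.g. cosine between rows of `Q`) would then feed the similarity graph on which the paper propagates scores.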
The Locally Weighted Bag of Words Framework for Document Representation
Abstract

Cited by 9 (1 self)
The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks.
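The local-smoothing construction can be sketched directly: place each word at a position in [0, 1], and at each sampling point take a kernel-weighted, normalised word histogram, so the document traces a curve of points in the simplex (a minimal illustration of the idea, not the authors' implementation):

```python
import numpy as np

def lowbow_curve(words, vocab, mus, sigma=0.2):
    """Represent a document as a curve in the multinomial simplex: at
    each location mu in [0, 1], words are counted with a Gaussian weight
    centred at mu, giving a locally smoothed word histogram.  A
    simplified sketch of the locally weighted bag-of-words idea."""
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(words)
    positions = (np.arange(n) + 0.5) / n          # word positions in [0, 1]
    curve = []
    for mu in mus:
        weights = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)
        hist = np.zeros(len(vocab))
        for w, wt in zip(words, weights):
            hist[idx[w]] += wt
        curve.append(hist / hist.sum())           # a point in the simplex
    return np.array(curve)
```

Sampling many values of `mu` gives the smooth curve; a single very wide kernel degenerates back to the ordinary bag of words.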
Nonextensive Information Theoretic Kernels on Measures
2009
Abstract

Cited by 9 (3 self)
Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the Jensen–Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen–Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of ...
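The quantities the abstract builds on are short to state in code; the sketch below follows the standard formulas (raising the mixture weights to the power q in the Jensen gap, as is usual in the nonextensive setting — an assumption worth checking against the paper), and recovers the Jensen–Shannon divergence at q = 1:

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis (nonextensive) entropy S_q(p); tends to the Shannon
    entropy (in nats) as q -> 1."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-12:
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def jt_q_difference(p1, p2, q, w=(0.5, 0.5)):
    """Jensen-Tsallis q-difference: the Jensen gap of S_q at the
    weighted mixture, with the weights raised to the power q.  At q = 1
    this reduces to the classical Jensen-Shannon divergence."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mix = w[0] * p1 + w[1] * p2
    return tsallis_entropy(mix, q) - (w[0] ** q * tsallis_entropy(p1, q)
                                      + w[1] ** q * tsallis_entropy(p2, q))
```

Note that for q ≠ 1 the q-difference is not a divergence in the usual sense (it need not vanish at identical arguments), which is why the paper calls it a "difference".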
Geometric representations for multiple documents
 In Proc. of SIGIR, 2010
Abstract

Cited by 8 (0 self)
Combining multiple documents to represent an information object is well-known as an effective approach for many Information Retrieval tasks. For example, passages can be combined to represent a document for retrieval, document clusters are represented using combinations of the documents they contain, and feedback documents can be combined to represent a query model. Various techniques for combination have been introduced, and among them, representation techniques based on concatenation and the arithmetic mean are frequently used. Some recent work has shown the potential of a new representation technique using the geometric mean. However, these studies lack a theoretical foundation explaining why the geometric mean should have advantages for representing multiple documents. In this paper, we show that the arithmetic mean and the geometric mean are approximations to the center of mass in certain geometries, and show empirically that the geometric mean is closer to the center. Through experiments with two IR tasks, we show the potential benefits for geometric representations, including a geometry-based pseudo-relevance feedback method that outperforms state-of-the-art techniques.
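The two combination rules the abstract compares are one-liners over a matrix whose rows are term distributions; a sketch (the smoothing constant `eps` is an implementation detail added here to avoid log 0):

```python
import numpy as np

def arithmetic_rep(P):
    """Arithmetic-mean representation of a set of term distributions
    (one distribution per row of P)."""
    return P.mean(axis=0)

def geometric_rep(P, eps=1e-10):
    """Geometric-mean representation: elementwise geometric mean, i.e.
    exp of the mean of logs, renormalised back onto the simplex."""
    g = np.exp(np.log(P + eps).mean(axis=0))
    return g / g.sum()
```

The renormalised geometric mean is the centroid under a KL-type geometry, which is the sense in which the paper argues it sits closer to the center of mass than the arithmetic mean.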
Dimensionality reduction on statistical manifolds
2009
Abstract

Cited by 8 (5 self)
This work could not have been possible without the support of many individuals, and I would be remiss if I did not take the opportunity to thank them. To start, I give the utmost thanks to my advisor, Professor Alfred Hero. He not only took me under his wing as a research assistant, but was a major contributor to my professional development. While his otherworldly knowledge base was critical towards my maturation as a researcher, his motivation, mentorship, and words of advice kept me going during difficult and stressful times. I would also like to thank Professor Raviv Raich, who has worked side-by-side with me throughout my entire research experience. Whenever I came upon a road block, I knew I could count on Raviv to have the patience and wherewithal to guide me through. My ability to progress so quickly throughout this process was due in large part to my amazing research project. I owe this entirely to Dr. William Finn and the Department of Pathology at the University of Michigan, who came to us with an idea and a lot of data. Dr. Finn was always available for discussion and insight into the process of flow cytometry, and throughout my development he has shown a genuine excitement for all of the work I have done. Without his knowledge, support, and enthusiasm, none of this work would have been completed. This work has also benefited from discussions with the remainder of my committee members. A special thanks goes to Professor Elizaveta Levina and Professor Clayton Scott for their input and support. Their level of expertise in many of the areas directly coinciding to my research topics was very beneficial, and the third-party ...
FINE: Information embedding for document classification
 in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 2008
Abstract

Cited by 8 (7 self)
The problem of document classification considers categorizing or grouping of various document types. Each document can be represented as a bag of words, which has no straightforward Euclidean representation. Relative word counts form the basis for similarity metrics among documents. Endowing the vector of term frequencies with a Euclidean metric has no obvious straightforward justification. A more appropriate assumption commonly used is that the data lies on a statistical manifold, or a manifold of probabilistic generative models. In this paper, we propose calculating a low-dimensional, information based embedding of documents into Euclidean space. One component of our approach motivated by information geometry is the Fisher information distance to define similarities between documents. The other component is the calculation of the Fisher metric over a lower dimensional statistical manifold estimated in a nonparametric fashion from the data. We demonstrate that in the classification task, this information driven embedding outperforms both a standard PCA embedding and other Euclidean embeddings of the term frequency vector. Index Terms — Manifold learning, Riemannian manifold, geodesics, text classification, information geometry
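For multinomial models, the Fisher information distance the abstract relies on has a closed form via the square-root embedding onto the sphere; a sketch of the distance computation, whose output matrix can be fed to any Euclidean embedding such as classical MDS (using this closed-form surrogate in place of the paper's nonparametric estimate is an assumption on our part):

```python
import numpy as np

def fisher_distance(p, q):
    """Fisher information distance between two multinomial distributions,
    in closed form: d(p, q) = 2 * arccos( sum_i sqrt(p_i * q_i) )."""
    c = np.clip(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))), -1.0, 1.0)
    return 2.0 * np.arccos(c)

def fisher_distance_matrix(docs):
    """Pairwise Fisher distances between term-frequency vectors (one
    distribution per element of docs)."""
    n = len(docs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = fisher_distance(docs[i], docs[j])
    return D
```

The `np.clip` guards against floating-point sums marginally above 1, which would otherwise make `arccos` return NaN.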
Sequential document visualization
 IEEE Transactions on Visualization and Computer Graphics
Abstract

Cited by 7 (2 self)
Abstract — Documents and other categorical valued time series are often characterized by the frequencies of short range sequential patterns such as n-grams. This representation converts sequential data of varying lengths to high dimensional histogram vectors which are easily modeled by standard statistical models. Unfortunately, the histogram representation ignores most of the medium and long range sequential dependencies making it unsuitable for visualizing sequential data. We present a novel framework for sequential visualization of discrete categorical time series based on the idea of local statistical modeling. The framework embeds categorical time series as smooth curves in the multinomial simplex summarizing the progression of sequential trends. We discuss several visualization techniques based on the above framework and demonstrate their usefulness for document visualization. Index Terms—Document visualization, multiresolution analysis, local fitting.
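The fixed-length histogram representation the abstract contrasts against is only a few lines; the sketch makes explicit that exactly the length-n patterns survive, which is why medium- and long-range trends are lost:

```python
from collections import Counter

def ngram_histogram(seq, n=2):
    """Frequency histogram of length-n sequential patterns: the
    fixed-length representation that discards any order information
    beyond a window of n symbols."""
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}
```

Any two sequences with the same n-gram statistics map to the same point, regardless of how the patterns are arranged, whereas the curve representation above distinguishes them.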
Information geometry, the embedding principle, and document classification
 in Proceedings of the 2nd International Symposium on Information Geometry and its Applications
, 2005
Abstract

Cited by 7 (0 self)
Abstract. High dimensional structured data such as text and images is often poorly understood and misrepresented in statistical modeling. Typical approaches to modeling such data involve, either explicitly or implicitly, arbitrary geometric assumptions. In this paper, we review a framework introduced by Lebanon and Lafferty that is based on Čencov’s theorem for obtaining a coherent geometry for data. The framework enables adaptation of popular models to the new geometry and in the context of text classification yields superior performance with respect to classification error rate on held out data. The framework demonstrates how information geometry may be applied to modeling high dimensional structured data and points at new directions for future research.
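The coherent geometry the framework assigns to count data can be stated concretely: Čencov's theorem singles out (up to scale) the Fisher information metric on the multinomial simplex, under which the square-root map is an isometry onto a sphere orthant and geodesic distances have a closed form:

```latex
% Fisher information metric on the simplex of multinomials \theta
g_\theta(u, v) \;=\; \sum_{i=1}^{n} \frac{u_i \, v_i}{\theta_i}

% The square-root map \theta \mapsto (\sqrt{\theta_1}, \ldots, \sqrt{\theta_n})
% carries this metric to the round metric on the sphere (up to a factor),
% giving the geodesic distance in closed form:
d(\theta, \theta') \;=\; 2 \arccos\!\Big( \sum_{i=1}^{n} \sqrt{\theta_i \, \theta'_i} \Big)
```

It is this closed-form distance that the reviewed framework substitutes for the Euclidean distance implicit in standard models of term-frequency vectors.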