Information geometry, the embedding principle, and document classification
- in Proceedings of the 2nd International Symposium on Information Geometry and its Applications
, 2005
Cited by 5 (0 self)
High dimensional structured data such as text and images is often poorly understood and misrepresented in statistical modeling. Typical approaches to modeling such data involve, either explicitly or implicitly, arbitrary geometric assumptions. In this paper, we review a framework introduced by Lebanon and Lafferty that is based on Čencov’s theorem for obtaining a coherent geometry for data. The framework enables adaptation of popular models to the new geometry and, in the context of text classification, yields superior performance with respect to classification error rate on held out data. The framework demonstrates how information geometry may be applied to modeling high dimensional structured data and points at new directions for future research.
FINE: Information embedding for document classification
- in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing
, 2008
Cited by 5 (5 self)
The problem of document classification considers categorizing or grouping of various document types. Each document can be represented as a bag of words, which has no straightforward Euclidean representation. Relative word counts form the basis for similarity metrics among documents. Endowing the vector of term frequencies with a Euclidean metric has no obvious justification. A more appropriate and commonly used assumption is that the data lies on a statistical manifold, or a manifold of probabilistic generative models. In this paper, we propose calculating a low-dimensional, information based embedding of documents into Euclidean space. One component of our approach, motivated by information geometry, is the use of the Fisher information distance to define similarities between documents. The other component is the calculation of the Fisher metric over a lower dimensional statistical manifold estimated in a nonparametric fashion from the data. We demonstrate that in the classification task, this information driven embedding outperforms both a standard PCA embedding and other Euclidean embeddings of the term frequency vector. Index Terms: Manifold learning, Riemannian manifold, geodesics, text classification, information geometry
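As a rough illustration of the Fisher information distance mentioned in this abstract: if each document is modelled as a multinomial distribution over the vocabulary, the geodesic distance under the Fisher metric has the closed form 2·arccos(Σ_i sqrt(p_i·q_i)). The sketch below applies that closed form to raw term counts. The function names are hypothetical, and the code does not reproduce the paper's nonparametric estimate of the metric on a lower-dimensional manifold or its embedding step.

```python
import math


def term_frequencies(counts):
    """Normalize raw term counts into a point on the multinomial simplex."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("document must contain at least one term")
    return [c / total for c in counts]


def fisher_distance(p, q):
    """Closed-form Fisher information distance between two multinomials."""
    # Bhattacharyya coefficient; clamp to [0, 1] to guard against rounding.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


# Two toy documents over a three-word vocabulary (counts are made up).
doc_a = term_frequencies([3, 1, 0])
doc_b = term_frequencies([0, 0, 4])
distance = fisher_distance(doc_a, doc_b)
```

Documents with no words in common sit at the maximum distance π, while identical term-frequency vectors sit at distance 0 (up to floating-point rounding).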
Learning Latent Semantic Relations from Clickthrough Data for Query Suggestion
, 2008
Cited by 5 (1 self)
For a given query raised by a specific user, the Query Suggestion technique aims to recommend relevant queries which potentially suit the information needs of that user. Due to the complexity of the Web structure and the ambiguity of users' inputs, most suggestion algorithms suffer from poor recommendation accuracy. In this paper, aiming at providing semantically relevant queries for users, we develop a novel, effective and efficient two-level query suggestion model by mining clickthrough data, in the form of two bipartite graphs (user-query and query-URL bipartite graphs) extracted from the clickthrough data. Based on this, we first propose a joint matrix factorization method which utilizes the two bipartite graphs to learn the low-rank query latent feature space, and then build a query similarity graph based on the features. After that, we design an online ranking algorithm to propagate similarities on the query similarity graph, and finally recommend latent semantically relevant queries to users. Experimental analysis on the clickthrough data of a commercial search engine shows the effectiveness and the efficiency of our method.
Hilbertian metrics on probability measures and their application in SVM’s
- In Proceedings of the 26th DAGM Symposium
, 2004
Cited by 5 (0 self)
In this article we investigate the field of Hilbertian metrics on probability measures. Since they are very versatile and can therefore be applied in various problems, they are of great interest in kernel methods. Quite recently Topsøe and Fuglede introduced a family of Hilbertian metrics on probability measures. We give basic properties of the Hilbertian metrics of this family and of other metrics used in the literature. Then we propose an extension of the considered metrics which incorporates structural information of the probability space into the Hilbertian metric. Finally we compare all proposed metrics in an image and text classification problem using histogram data.
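To make the idea of a Hilbertian metric on histograms concrete, the sketch below computes two well-known examples, the Hellinger distance and the square root of the Jensen-Shannon divergence, and plugs either one into a kernel of the form exp(-λ·d²), which is one common way to use such a metric in an SVM. The function names and the parameter `lam` are hypothetical, the Hellinger normalization (no factor of 1/2) is a convention chosen here, and the structural extension proposed in the paper is not implemented.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two histograms (no 1/2 normalization)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural logarithm)."""
    total = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)
        # Terms with zero mass contribute nothing (0 * log 0 is taken as 0).
        if a > 0:
            total += 0.5 * a * math.log(a / m)
        if b > 0:
            total += 0.5 * b * math.log(b / m)
    return math.sqrt(max(0.0, total))


def metric_kernel(p, q, metric, lam=1.0):
    """Kernel exp(-lam * d(p, q)^2) built from a Hilbertian metric d."""
    return math.exp(-lam * metric(p, q) ** 2)


# Two toy histograms with disjoint support (values are made up).
p = [1.0, 0.0]
q = [0.0, 1.0]
k_hellinger = metric_kernel(p, q, hellinger_distance)
k_js = metric_kernel(p, q, jensen_shannon_distance)
```

Passing the metric as an argument keeps the kernel construction separate from the choice of distance, so other members of the family can be swapped in without touching the kernel code.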
Mining Social Networks Using Heat Diffusion Processes for Marketing Candidates Selection
, 2008
Cited by 5 (1 self)
Social Network Marketing techniques employ pre-existing social networks to increase brand or product awareness through word-of-mouth promotion. A full understanding of social network marketing and of the potential candidates that can thus be marketed to certainly offers lucrative opportunities for prospective sellers. Due to the complexity of social networks, few models exist to interpret social network marketing realistically. We propose to model social network marketing using Heat Diffusion Processes. This paper presents three diffusion models, along with three algorithms for selecting the best individuals to receive marketing samples. These approaches have the following advantages, which best illustrate the properties of real-world social networks: (1) We can plan a marketing strategy sequentially in time since we include a time factor in the simulation of product adoptions; (2) The algorithm for selecting marketing candidates best represents and utilizes the clustering property of real-world social networks; and (3) The model we construct can diffuse both positive and negative comments on products or brands in order to simulate the complicated communications within social networks. Our work represents a novel approach to the analysis of social network marketing, and is the first work to propose how to defend against negative comments within social networks. Complexity analysis shows our model is also scalable to very large social networks.
The Locally Weighted Bag of Words Framework for Document Representation
Cited by 5 (1 self)
The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex, thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks.
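The local smoothing idea can be sketched as follows: for a location mu in [0, 1], each word position is weighted by a kernel centred at mu, and the weighted counts are normalized into a local word histogram; sweeping mu across the document traces a curve in the simplex. This is only an illustration under stated assumptions. The Gaussian kernel, the additive smoothing constant, and all function and parameter names are choices made here, not details taken from the paper.

```python
import math


def local_histogram(tokens, vocab, mu, sigma=0.15, smoothing=0.01):
    """Kernel-weighted word histogram around relative location mu in [0, 1]."""
    if not tokens:
        raise ValueError("document must contain at least one token")
    index = {word: i for i, word in enumerate(vocab)}
    n = len(tokens)
    # Gaussian weight for each token, based on its relative position.
    weights = [
        math.exp(-0.5 * (((j + 0.5) / n - mu) / sigma) ** 2) for j in range(n)
    ]
    z = sum(weights)
    # Additive smoothing keeps every word probability strictly positive.
    hist = [smoothing] * len(vocab)
    for token, w in zip(tokens, weights):
        hist[index[token]] += w / z
    total = sum(hist)
    return [h / total for h in hist]


def document_curve(tokens, vocab, num_points=5, sigma=0.15, smoothing=0.01):
    """Sample the smoothed curve at evenly spaced locations (num_points >= 2)."""
    return [
        local_histogram(tokens, vocab, k / (num_points - 1), sigma, smoothing)
        for k in range(num_points)
    ]


# Toy document whose topic shifts from "a" to "b" halfway through.
tokens = ["a"] * 5 + ["b"] * 5
curve = document_curve(tokens, ["a", "b"], num_points=3)
```

A plain bag of words would assign this toy document equal mass on both words, whereas the sampled curve starts near "a" and ends near "b", which is the sequential trend the representation is meant to keep.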
NHDC and PHDC: Non-propagating and propagating heat diffusion classifiers
- In Proceedings of the 12th International Conference on Neural Information Processing (ICONIP)
, 2005
Cited by 4 (1 self)
By imitating the way that heat flows in a medium with a geometric structure, we propose two novel classification algorithms, Non-propagating Heat Diffusion Classifier (NHDC) and Propagating Heat Diffusion Classifier (PHDC). In NHDC, an unlabelled data point is classified into the class that diffuses the most heat to it after one local diffusion from time 0 to a small time period, while in PHDC, an unlabelled data point is classified into the class that diffuses the most heat to it through the propagating effect of the heat flow from time 0 to time t, which means that in PHDC the heat diffuses infinitely many times from time 0 and each time period is infinitely small. In other words, we measure the similarity between an unlabelled data point and a class by the amount of heat that the point receives from the set of labelled data in the class, and then classify the point into the most similar class. Unlike the traditional method, in which the heat kernel is applied to a kernel-based classifier, we employ the heat kernel to construct the classifier directly; moreover, instead of imitating the way that the heat flows along a linear or nonlinear manifold, we let the heat flow along a graph formed by the k-nearest neighbors. An important and special feature of both NHDC and PHDC is that the kernel is not symmetric. We show theoretically that PWA (the Parzen Window Approach when the window function is a multivariate normal kernel) and KNN are actually special cases of the NHDC model, and that PHDC has the ability to approximate NHDC. Experiments show that NHDC performs better than PWA and KNN in prediction accuracy, and that PHDC performs better than NHDC.
A Class of Kernels for Sets of Vectors
- In Proceedings of the 13th European Symposium on Artificial Neural Networks
, 2005
Cited by 4 (0 self)
In some important applications such as speaker recognition or image texture classification, the data to be processed are sets of vectors.
Sequential document visualization
- IEEE Transactions on Visualization and Computer Graphics
Cited by 4 (2 self)
Documents and other categorical valued time series are often characterized by the frequencies of short range sequential patterns such as n-grams. This representation converts sequential data of varying lengths to high dimensional histogram vectors which are easily modeled by standard statistical models. Unfortunately, the histogram representation ignores most of the medium and long range sequential dependencies, making it unsuitable for visualizing sequential data. We present a novel framework for sequential visualization of discrete categorical time series based on the idea of local statistical modeling. The framework embeds categorical time series as smooth curves in the multinomial simplex summarizing the progression of sequential trends. We discuss several visualization techniques based on the above framework and demonstrate their usefulness for document visualization. Index Terms: Document visualization, multi-resolution analysis, local fitting.
Nonextensive Entropic Kernels
, 2008
Cited by 3 (1 self)
Positive definite kernels on probability measures have been recently applied in structured data classification problems. Some of these kernels are related to classic information theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon’s information theory have been proposed. This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allows weights to be assigned to the kernel arguments and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.
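For orientation, the Jensen-Tsallis q-difference can be sketched numerically. The code assumes the Tsallis entropy S_q(p) = (1 - Σ_i p_i^q) / (q - 1), with the Shannon entropy as the q = 1 limit, and assumes the q-difference is defined as S_q(Σ_j π_j p_j) - Σ_j π_j^q S_q(p_j) for weights π. Both definitions, and all function names, are assumptions made for this illustration and should be checked against the paper. The kernels built on top of the q-difference are not implemented here.

```python
import math


def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p); falls back to Shannon entropy at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def jensen_tsallis_q_difference(dists, weights, q):
    """Assumed form: S_q(sum_j w_j p_j) - sum_j w_j^q S_q(p_j)."""
    dim = len(dists[0])
    mixture = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(dim)]
    weighted = sum((w ** q) * tsallis_entropy(d, q) for w, d in zip(weights, dists))
    return tsallis_entropy(mixture, q) - weighted


# Two toy distributions with disjoint support and uniform weights.
p1 = [1.0, 0.0]
p2 = [0.0, 1.0]
js_value = jensen_tsallis_q_difference([p1, p2], [0.5, 0.5], q=1.0)
jt_value = jensen_tsallis_q_difference([p1, p2], [0.5, 0.5], q=2.0)
```

At q = 1 the expression coincides with the Jensen-Shannon divergence. Under the assumed definition, for q ≠ 1 it does not generally vanish when all arguments are equal, which is consistent with its being called a difference rather than a divergence.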

