## The Locally Weighted Bag of Words Framework for Document Representation

Citations: | 9 - 1 self |

### BibTeX

@MISC{Lebanon_thelocally,
    author = {Guy Lebanon and Yi Mao and Joshua Dillon},
    title = {The Locally Weighted Bag of Words Framework for Document Representation},
    year = {}
}

### Abstract

The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation cannot retain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex, thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation robustly captures medium- and long-range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability to various text processing tasks.
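
The abstract describes the idea only in words, so the following is a minimal sketch of what "local smoothing to embed documents as smooth curves in the multinomial simplex" could look like. Everything specific here is an assumption made for illustration rather than the authors' definition: the function name `lowbow_curve`, the Gaussian weighting, the additive smoothing constant `c`, and the choice to place word `i` at position `(i + 0.5) / n` in `[0, 1]`.

```python
import numpy as np

def lowbow_curve(word_ids, vocab_size, locations, sigma=0.1, c=0.01):
    """Hypothetical locally weighted bag of words (illustrative sketch only).

    Returns an array of shape (len(locations), vocab_size); each row is a
    point in the multinomial simplex (non-negative entries summing to 1).
    """
    word_ids = np.asarray(word_ids)
    n = len(word_ids)
    # Assumption: word i sits at the midpoint of the i-th of n equal bins in [0, 1].
    positions = (np.arange(n) + 0.5) / n
    # One-hot vector per word position, additively smoothed so it lies
    # in the interior of the simplex.
    onehot = np.zeros((n, vocab_size))
    onehot[np.arange(n), word_ids] = 1.0
    smoothed = (onehot + c) / (1.0 + c * vocab_size)
    # Assumption: Gaussian weights centered at each location, renormalized
    # over the word positions so each row of weights sums to 1.
    locations = np.asarray(locations, dtype=float)
    diffs = (locations[:, None] - positions[None, :]) / sigma
    weights = np.exp(-0.5 * diffs ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each curve point is a convex combination of simplex points.
    return weights @ smoothed

# Usage: a toy six-word document over a three-word vocabulary.
curve = lowbow_curve([0, 0, 1, 1, 2, 2], vocab_size=3,
                     locations=np.linspace(0.0, 1.0, 5))
```

With a small `sigma` each point on the curve reflects the words near that location, while a very large `sigma` makes every point approach the same smoothed histogram, which is the ordinary bag of words representation.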

### Citations

9830 | The Nature of Statistical Learning Theory - Vapnik - 1999 |

5046 | Matrix Analysis - Horn, Johnson - 1985 |

2639 | Modern Information Retrieval - Baeza-Yates, Ribeiro-Neto - 1999 |

2615 | Latent dirichlet allocation - Blei, Ng, et al. - 2003 |

796 | A Comprehensive Introduction to Differential Geometry. Publish or - Spivak - 1979 |

789 | Statistical Methods for Speech Recognition - Jelinek - 1998 |

753 | A study of smoothing methods for language models applied to information retrieval - Zhai, Lafferty |

566 | Functional Data Analysis - Ramsay, Silverman - 1997 |
Citation Context: $\ldots_j|^2 \le |\mu - \tau|\,O(K)$. 3. Modeling of Simplicial Curves. Modeling functional data such as lowbow curves is known in the statistics literature as functional data analysis (e.g., Ramsay and Dalzell, 1991; Ramsay and Silverman, 2005). Previous work in this area focused on low dimensional functional data such as one dimensional or two dimensional curves. In this section we discuss some issues concerning generative and conditional...

547 | An evaluation of statistical approaches to text categorization - Yang - 1999 |

529 | BoosTexter: A boosting-based system for text categorization - Schapire, Singer |

491 | Dynamic Programming Algorithm Optimization for Spoken Word Recognition - Sakoe, Chiba - 1978 |

479 | Rcv1: A new benchmark collection for text categorization research - Lewis, Yang, et al. - 2004 |

429 | Dynamic topic models - Blei, Lafferty - 2006 |

405 | An introduction to differentiable manifolds and Riemannian geometry - Boothby - 1975 |

324 | Neural Network Learning: Theoretical Foundations - Anthony, Bartlett - 1999 |
Citation Context: ...een the empirical risk or training error and the expected risk uniformly over a class of functions $\mathcal{L} = \{f_\alpha : \alpha \in I\}$. These bounds are expressed probabilistically and usually take the following form (Anthony and Bartlett, 1999)

$$P\Big(\sup_{\alpha \in I} \big|E_{p}(L(f_\alpha(Z))) - E_{\tilde{p}}(L(f_\alpha(Z)))\big| \ge \varepsilon\Big) \le C(\mathcal{L}, L, n, \varepsilon). \tag{14}$$

Above, Z represents any sequence of n examples, either X in the unsupervised scenario or (X, Y) in the supervised scenario and...

296 | Methods of Information Geometry - Amari, Nagaoka - 1999 |

272 | Algorithms for the assignment and transportation problems - Munkres - 1957 |
Citation Context: ...nation of dynamic programming similar to the one of Sakoe and Chiba (1978) and a variation of earth mover distance (Rubner et al., 2000) known as the Hungarian algorithm (Munkres, 1957), the minimization problem (19) over the class I described above may be computed efficiently. We conducted a series of experiments examining the benefit in introducing dynamic time warping or registr...

229 | Statistical models for text segmentation - Beeferman, Berger, et al. - 1999 |

183 | Local regression and likelihood - Loader - 1999 |

174 | ThemeRiver: visualizing thematic changes in large document collections. Visualization and Computer Graphics - Havre, Hetzler, et al. - 2002 |

147 | Harmonic Analysis on Semigroups - Berg, Christensen, et al. - 1984 |

138 | Diffusion kernels on graphs and other discrete structures - Kondor, Lafferty - 2002 |

128 | Introduction to smooth manifolds - Lee - 2003 |

92 | Le Spectre d’une Variété Riemannienne - Berger, Gauduchon, et al. - 1971 |

91 | Diffusion kernels on statistical manifolds - Lafferty, Lebanon - 2005 |
Citation Context: ...er description of the topic hierarchy of the RCV1 data set refer to Lewis et al. (2004). In our experiments we examined the classification performance of SVM with the Fisher diffusion kernel for bow (Lafferty and Lebanon, 2005) and its corresponding product version for lowbow (10) (which reverts to the kernel of Lafferty and Lebanon (2005) for σ → ∞). Our experiments validate the findings in Lafferty and Lebanon (2005) whi...

87 | A survey of smoothing techniques for ME models - Chen, Rosenfeld - 2000 |

84 | A course on empirical processes - Dudley - 1984 |
Citation Context: ...ng error but would reduce C and therefore also the bound on the expected error. The most frequent way to bound C is through the use of the covering number, which measures the size of a function class (Dudley, 1984; Anthony and Bartlett, 1999). The covering number enables several ways of determining the rate of uniform convergence C in (14); for example, see Theorems 1 and 2 in Zhang (2002). Definition 8 Let x = ...

82 | Geometrical Foundations of Asymptotic Inference - Kass, E, et al. - 1997 |

71 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators - Williamson, Smola, et al. - 1998 |

67 | Some tools for functional data analysis - Ramsay, Dalzell - 1991 |
Citation Context: ...$\sum_{j \in V} \big|[\gamma_\mu(y)]_j - [\gamma_\tau(y)]_j\big|^2 \le |\mu - \tau|\,O(K)$. 3. Modeling of Simplicial Curves. Modeling functional data such as lowbow curves is known in the statistics literature as functional data analysis (e.g., Ramsay and Dalzell, 1991; Ramsay and Silverman, 2005). Previous work in this area focused on low dimensional functional data such as one dimensional or two dimensional curves. In this section we discuss some issues concernin...

60 | Statistical decision rules and optimal inference - Čencov - 1982 |
Citation Context: ...θ(x) above is the multinomial probability associated with the parameter θ. It can be shown that the Fisher information metric is the only invariant metric under sufficient statistics transformations (Čencov, 1982; Campbell, 1986). In addition, various recent results motivate the Fisher geometry from a practical perspective (Lafferty and Lebanon, 2005). The inner product (20) defines the geometric properties o...

44 | Covering number bounds of certain regularized linear function classes - Zhang - 2002 |

31 | Visualization of text document corpus - Fortuna, Grobelnik, et al. - 2005 |

31 | The geometry of asymptotic inference - Kass - 1999 |

26 | The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms - Joachims - 2000 |

24 | Information diffusion kernels - Lafferty, Lebanon - 2002 |

23 | Tackling concept drift by temporal inductive transfer - Forman - 2006 |
Citation Context: ...cording to $y_i \sim \mathrm{Mult}(\theta_{i1}, \ldots, \theta_{iV})$ where $\theta_{ij} \propto \int_{(i-1)/N}^{i/N} [\gamma_\mu]_j \, d\mu$. The above model can also be used to describe situations in which the underlying document distribution changes with time (e.g., Forman, 2006). Lebanon and Zhao (2007) describe a local likelihood model that is essentially equivalent to the generative lowbow model described above. In contrast to the model of Blei and Lafferty (2006) the low...

22 | A kernel for time series based on global alignments - Cuturi, Vert, et al. |
Citation Context: ...n Lafferty and Lebanon (2003) in the context of Riemannian manifolds. Cuturi (2005) describes some related ideas that lead to a non-smooth multi-scale view of images. These ideas were later expanded (Cuturi et al., 2007) to consider dynamic time warping, which is highly relevant to the problem of matching two lowbow curves. Modeling functional data such as lowbow curves in statistics has been studied in the context o...

20 | Covering numbers for support vector machines - Guo, Bartlett, et al. - 2002 |

14 | Spherical subfamily models - Gous - 1999 |

13 | Learning curved multinomial subfamilies for natural language processing and information retrieval - Hall, Hofmann - 2000 |

11 | An extended Čencov characterization of the information metric - Campbell - 1986 |
Citation Context: ...the multinomial probability associated with the parameter θ. It can be shown that the Fisher information metric is the only invariant metric under sufficient statistics transformations (Čencov, 1982; Campbell, 1986). In addition, various recent results motivate the Fisher geometry from a practical perspective (Lafferty and Lebanon, 2005). The inner product (20) defines the geometric properties of distance, angl...

9 | Riemannian Geometry and Statistical Machine Learning - Lebanon - 2005 |
Citation Context: ...n of the topic hierarchy of the RCV1 data set refer to Lewis et al. (2004). In our experiments we examined the classification performance of SVM with the Fisher diffusion kernel for bow (Lafferty and Lebanon, 2005) and its corresponding product version for lowbow (10) (which reverts to the kernel of Lafferty and Lebanon (2005) for σ → ∞). Our experiments validate the findings in Lafferty and Lebanon (2005) whi...

7 | Sequential document visualization - Mao, Dillon, et al. |

2 | Learning from Structured Objects with Semigroup Kernels - Cuturi - 2005 |

2 | Local likelihood modeling of the concept drift phenomenon - Lebanon, Zhao - 2007 |

1 | Statistical Theory and Computational Aspects of Smoothing, chapter Smoothing by Local Regression: Principles and Methods - Cleveland, Loader - 1996 |
Citation Context: ...stimate obtained from the plug-in rule for the bias and variance (i.e., $\theta_{\mu j} \mapsto \hat{\theta}_{\mu j}$ in Equations (11)-(12)) is usually not recommended due to the poor estimation performance of plug-in rules (e.g., Cleveland and Loader, 1996). More sophisticated estimates exist, including adaptive estimators that may select different bandwidths or kernels at different points. An alternative approach, which we adopted in our experiments, ...
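
The citation contexts for Lafferty and Lebanon (2005) in the list above mention an SVM with the Fisher diffusion kernel for bow, without giving its form. As a rough illustration, the sketch below uses the standard Fisher geodesic distance on the multinomial simplex, $d(\theta, \eta) = 2 \arccos\big(\sum_i \sqrt{\theta_i \eta_i}\big)$, and keeps only the leading term $\exp(-d^2 / (4t))$ of a heat-kernel approximation. The function names, the diffusion time `t`, and the omission of the normalizing constant are assumptions for this example; the product version for lowbow referred to as (10) is not reproduced on this page and is not implemented here.

```python
import numpy as np

def fisher_geodesic(theta, eta):
    """Fisher geodesic distance between two points of the multinomial simplex."""
    theta = np.asarray(theta, dtype=float)
    eta = np.asarray(eta, dtype=float)
    inner = np.sum(np.sqrt(theta * eta))
    # Clip to guard against rounding slightly outside [0, 1] before arccos.
    return 2.0 * np.arccos(np.clip(inner, 0.0, 1.0))

def diffusion_kernel(theta, eta, t=0.5):
    """Leading-term approximation exp(-d^2 / (4t)); normalizing constant dropped."""
    d = fisher_geodesic(theta, eta)
    return np.exp(-d ** 2 / (4.0 * t))

# Usage: compare two smoothed word histograms.
theta = np.array([0.7, 0.2, 0.1])
eta = np.array([0.1, 0.2, 0.7])
similarity = diffusion_kernel(theta, eta, t=0.5)
```

The kernel equals 1 for identical histograms and decays as the geodesic distance grows; `t` controls how quickly, and would normally be tuned on held-out data.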