## An Information-Theoretic Framework for Semantic-Multimedia Indexing

Citations: 2 (0 self)

### BibTeX

@MISC{Magalhães_aninformation-theoretic,
  author = {João Magalhães and Stefan Rüger},
  title = {An Information-Theoretic Framework for Semantic-Multimedia Indexing},
  year = {}
}

### Abstract

To solve the problem of indexing collections with diverse text documents, image documents, or documents with both text and images, one needs to develop a model that supports heterogeneous types of documents. In this paper, we show how information theory supplies us with the tools necessary to develop a single model for text, image, and text/image retrieval. In our approach, for each possible query keyword we estimate a maximum entropy model based exclusively on continuous features that were pre-processed. The unified continuous feature space of text and visual data is constructed by using a minimum description length criterion to find the optimal feature-space representation (optimal from an information-theory point of view). We evaluate our approach in three experiments: text-only retrieval, image-only retrieval, and combined text and image retrieval.

### Citations

8603 |
Elements of information theory
- Cover, Thomas
- 1991
Citation Context ...epresenting the data space. Given the codebook of a feature space one is able to represent all samples of that feature space as a linear combination of keywords from that codebook. Information theory [9] provides us with a set of information measures that not only assess the amount of information that one single source of data contains, but also the amount of information that two (or more) sources of...
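To make these information measures concrete, here is a small plug-in estimate of entropy and mutual information for two discrete sources, computed from joint samples (the sample data are invented for illustration):

```python
import math
from collections import Counter

def entropy(pairs, index):
    """Shannon entropy H(X) of one coordinate of a list of (x, y) samples."""
    counts = Counter(p[index] for p in pairs)
    n = len(pairs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from joint samples."""
    n = len(pairs)
    joint = Counter(pairs)
    h_joint = -sum((c / n) * math.log2(c / n) for c in joint.values())
    return entropy(pairs, 0) + entropy(pairs, 1) - h_joint

# Two perfectly correlated binary sources share exactly 1 bit.
samples = [(0, 0), (1, 1)] * 50
print(mutual_information(samples))  # → 1.0
```

Independent sources would score 0 bits, which is the sense in which these measures quantify what two sources of data have in common.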

1915 |
Numerical optimization
- Nocedal, Wright
- 1999
Citation Context ...ls. It has been shown that L-BFGS is the best optimization procedure for both maximum entropy [27] and conditional random fields models [35]. For more details on the limited-memory BFGS algorithm see [31]. 6 Evaluation The presented algorithms were evaluated with a retrieval setting on the Reuters-21578 collection, on a subset of the Corel Stock Photo CDs [10] and on a subset of the TRECVID2006 develo...

1807 |
An algorithm for suffix stripping
- Porter
- 1980
Citation Context ...l standard text processing techniques [41]: stop words are first removed to eliminate redundant information, and rare words are also removed to avoid over-fitting [19]. After this, the Porter stemmer [32] reduces words to their morphological root, which we call term. Finally, we discard the term sequence information and use a bag-of-words approach. These text pre-processing techniques result in a feat...
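The preprocessing steps described here can be sketched roughly as follows; the stop-word list and suffix rules below are simplified stand-ins for a real stop list and the full Porter stemmer, not the actual rules:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # toy rules only; Porter has many more

def stem(word):
    """Crude suffix stripping, a stand-in for the Porter stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(docs, min_df=2):
    """Map each document to term counts, keeping only terms appearing in
    at least `min_df` documents (rare terms dropped to avoid over-fitting)."""
    stemmed = [[stem(w) for w in d.lower().split() if w not in STOP_WORDS]
               for d in docs]
    df = Counter(t for doc in stemmed for t in set(doc))
    vocab = {t for t, c in df.items() if c >= min_df}
    return [Counter(t for t in doc if t in vocab) for doc in stemmed]

docs = ["the cat is jumping", "a cat jumped the fence", "dogs bark"]
print(bag_of_words(docs))
```

Note how "jumping" and "jumped" collapse to the same term, while rare terms such as "fence" fall out of the vocabulary.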

1706 | Text categorization with support vector machines: Learning with many relevant features
- Joachims
- 1998
Citation Context ...ly term-weighting. Due to the high-dimensional feature space of text data most text categorization algorithms are linear models such as naïve Bayes [28], maximum entropy [30], Support Vector Machines [19], regularized linear models [46], and Linear Least Squares Fit [42]. Joachims [19] applies SVMs directly to the text terms. Text is ideal for applying SVMs without the need of a kernel function becaus...

1166 |
Information Theory, Inference and Learning Algorithms
- MacKay
- 2003
Citation Context ...F_T(d_T) and F_V(d_V). Note that I use the word “optimal” from an information theory point of view. The treatment of the model selection problem presented in this section is based on [14] and [25]. 4.1 Assessing the Data Representation Error The process of changing the original feature-space representation into the new representation with a given candidate transformation F̂ has an associated ...

1165 |
Modeling by shortest data description
- Rissanen
- 1978
Citation Context ... of information that one single source of data contains, but also the amount of information that two (or more) sources of data have in common. Thus, we employ the minimum description length criterion [33] to infer the optimal complexity M_T and M_V of each feature space transformation F_T(d_T) and F_V(d_V). Note that I use the word “optimal” from an information theory point of view. The tr...

1087 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ... (37). 5.2 Keywords as Logistic Regression Models Logistic regression is a statistical learning technique that has been applied to a great variety of fields, e.g., natural language processing [5], text classification [30], and image annotation [17]. In this section we employ a binomial logistic model to represent keywords in the multi-modal feature space. The expression of the binomial logist...
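A binomial logistic model for a single keyword can be sketched as follows; the paper fits such models with L-BFGS, but plain stochastic gradient ascent keeps this example short (learning rate, step count, and toy data are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=500):
    """Fit p(y=1|x) = sigmoid(w·x + b) by stochastic gradient ascent
    on the log-likelihood (no regularization in this sketch)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p  # gradient of the log-likelihood w.r.t. the logit
            for j in range(dim):
                w[j] += lr * err * xi[j]
            b += lr * err
    return w, b

# One feature, keyword present for high feature values (toy data).
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
print(sigmoid(w[0] * 0.9 + b) > 0.5)  # → True
```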

958 | A comparative study on feature selection in text categorization
- Yang, Pedersen
- 1997
Citation Context ...is MU(y_j, t_i) = Σ_{y_j ∈ {0;1}} Σ_{d_{T,i}} p(y_j, d_{T,i}) log( p(y_j, d_{T,i}) / (p(y_j) p(d_{T,i})) ), (25) where d_{T,i} is the number of occurrences of term t_i in document d. Yang and Pedersen [44] and Forman [13] have shown experimentally that this is one of the best criteria for feature selection. A document d is then represented by k_T text terms as the mixture of Equation (26)...
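The mutual-information criterion of Equation (25) can be sketched as a plug-in estimate over a binary term indicator and a binary class label (toy data, no smoothing):

```python
import math
from collections import Counter

def term_mi(term_presence, labels):
    """Plug-in mutual information between a binary term indicator and a
    binary class label, in the spirit of Equation (25)."""
    n = len(labels)
    joint = Counter(zip(term_presence, labels))
    p_t = Counter(term_presence)
    p_y = Counter(labels)
    mi = 0.0
    for (t, y), c in joint.items():
        p_ty = c / n
        mi += p_ty * math.log2(p_ty / ((p_t[t] / n) * (p_y[y] / n)))
    return mi

# A term that fires exactly on the positive documents carries 1 bit;
# a term independent of the label carries 0 bits.
labels  = [1, 1, 0, 0]
perfect = [1, 1, 0, 0]
useless = [1, 0, 1, 0]
print(term_mi(perfect, labels), term_mi(useless, labels))  # → 1.0 0.0
```

Ranking terms by this score and keeping the top k_T is the feature-selection step the excerpt describes.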

852 |
Relevance feedback in information retrieval
- Rocchio
- 1971
Citation Context ...omputed analytically. 5.1.1 Rocchio Classifier The Rocchio classifier was initially proposed as a relevance feedback algorithm to compute a query vector from a small set of positive and negative examples [34]. It can also be used for categorization tasks, e.g., [18]: a keyword t_w is represented as a vector β_t in the multi-modal space, and the closer a document is to this vector the higher is the similar...
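A minimal sketch of the Rocchio idea, assuming the classic positive-centroid-minus-weighted-negative-centroid form with hypothetical weights beta and gamma; documents are then ranked by cosine similarity to the keyword vector:

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(positives, negatives, beta=1.0, gamma=0.5):
    """Keyword vector = beta * positive centroid - gamma * negative centroid
    (weights are illustrative, not the paper's values)."""
    cp, cn = centroid(positives), centroid(negatives)
    return [beta * p - gamma * q for p, q in zip(cp, cn)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

pos = [[1.0, 0.0], [0.9, 0.1]]   # documents annotated with the keyword
neg = [[0.0, 1.0]]               # documents without it
keyword_vec = rocchio(pos, neg)
print(cosine([1.0, 0.0], keyword_vec) > cosine([0.0, 1.0], keyword_vec))  # → True
```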

795 | Probabilistic latent semantic indexing
- Hofmann
- 1999
Citation Context ...h-dimensional sparse feature spaces. The proposed maximum entropy framework tackles this problem by expanding the feature space in a similar spirit to Hofmann’s probabilistic Latent Semantic Indexing [15]. These single-modality based approaches are far from our initial goal but by analysing them we can see which family of models can be used to simultaneously model text, image, and multi-modal content....

762 | A comparison of event models for naive bayes text classification
- McCallum, Nigam
- 1998
Citation Context ...oving stop-words and rare words, stemming, and finally term-weighting. Due to the high-dimensional feature space of text data most text categorization algorithms are linear models such as naïve Bayes [28], maximum entropy [30], Support Vector Machines [19], regularized linear models [46], and Linear Least Squares Fit [42]. Joachims [19] applies SVMs directly to the text terms. Text is ideal for applyi...
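A naïve Bayes text classifier in the spirit of the models cited here might look like this multinomial sketch with Laplace smoothing (the toy corpus and event model are illustrative, not the exact setup of [28]):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class frequencies and per-class term frequencies."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab

def log_posterior(doc, y, model):
    """log p(y) + sum_t log p(t|y), with add-one (Laplace) smoothing."""
    class_counts, term_counts, vocab = model
    n = sum(class_counts.values())
    total = sum(term_counts[y].values())
    lp = math.log(class_counts[y] / n)
    for t in doc:
        lp += math.log((term_counts[y][t] + 1) / (total + len(vocab)))
    return lp

docs = [["goal", "match"], ["goal", "win"], ["stock", "market"]]
labels = ["sport", "sport", "finance"]
model = train_nb(docs, labels)
pred = max(set(labels), key=lambda y: log_posterior(["goal"], y, model))
print(pred)  # → sport
```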

757 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation Context ...l feature space a. For each keyword in the considered collection i. Estimate the keyword model on the training set by applying a cross-validation with 5 folds and 10 value iterations, as suggested in [20], to determine the ideal Gaussian prior variance σ_ξ² ii. Compute the relevance of each test document iii. Rank all test documents by their relevance for the considered keyword iv. Use the collection...
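The cross-validation step can be illustrated with a deliberately simple shrinkage estimator whose Gaussian prior variance is chosen by 5-fold CV; this is a stand-in for the paper's regularized keyword models, and the data and variance grid are invented:

```python
def shrunken_mean(train, prior_var):
    """Posterior mean under a zero-mean Gaussian prior with variance
    prior_var and unit noise variance: shrinks the sample mean toward 0."""
    n = len(train)
    factor = n * prior_var / (n * prior_var + 1.0)
    return factor * (sum(train) / n)

def cv_select_prior_var(data, grid, folds=5):
    """Pick the prior variance with the best held-out squared error."""
    size = len(data) // folds
    def score(prior_var):
        total = 0.0
        for k in range(folds):
            val = data[k * size:(k + 1) * size]
            train = data[:k * size] + data[(k + 1) * size:]
            mu = shrunken_mean(train, prior_var)
            total -= sum((v - mu) ** 2 for v in val)
        return total
    return max(grid, key=score)

data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.15, 1.85, 2.0]
best = cv_select_prior_var(data, [0.01, 0.1, 1.0, 10.0, 100.0])
print(best)  # → 100.0
```

Because the data mean is clearly nonzero, CV favors the weakest prior (largest variance); data centered near zero would instead reward strong shrinkage.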

701 |
The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...ances on different text collections. Their results indicate that k-Nearest Neighbour, SVMs, and LLSF are the best classifiers. Note that nearest neighbour approaches have certain characteristics (see [14]) that make them computationally too complex to handle large-scale indexing. The simplest image annotation models deploy a traditional multi-class supervised learning model and learn the class-conditi...

641 | A Re-Examination of Text Categorization Methods
- Yang, Liu
- 1999
Citation Context ...l advantage that while these approaches have no automatic mechanism to select a vocabulary size we use the minimum description length principle to select its optimal size. Yang [41], and Yang and Liu [43] have compared a number of text classification algorithms and reported their performances on different text collections. Their results indicate that k-Nearest Neighbour, SVMs, and LLSF are the best cl...

492 | An evaluation of statistical approaches to text categorization
- Yang
- 1999
Citation Context ...quares) with the crucial advantage that while these approaches have no automatic mechanism to select a vocabulary size we use the minimum description length principle to select its optimal size. Yang [41], and Yang and Liu [43] have compared a number of text classification algorithms and reported their performances on different text collections. Their results indicate that k-Nearest Neighbour, SVMs, a...

490 | On the limited memory BFGS method for large scale optimization
- Liu, Nocedal
- 1989
Citation Context ...vectors that represent approximations implicitly made in previous iterations of the algorithm. The L-BFGS algorithm (limited-memory Broyden-Fletcher-Goldfarb-Shanno) is one such algorithm, see [23] for details: “The main idea of this method is to use curvature information from only the most recent iterations to construct the Hessian approximation. Curvature information from earlier iterations, ...

447 | Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary
- Duygulu, Barnard, et al.
Citation Context ...om such sparse data. Other types of approaches are based on a translation model between keywords and images (global, tiles or regions). Inspired by automatic text translation research, Duygulu et al. [10] developed a method of annotating images with words. First, regions are created using a segmentation algorithm like normalised cuts. For each region, features are computed and then blobs are generated...
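A crude co-occurrence version of the blob-to-word translation table can be sketched as below; the actual model in [10] estimates these probabilities with EM, and the blob identifiers and words here are invented:

```python
from collections import defaultdict

def translation_table(images):
    """Estimate p(word | blob) by normalising raw blob-word co-occurrence
    counts (a simplification of the EM-based translation model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for blobs, words in images:
        for b in blobs:
            for w in words:
                counts[b][w] += 1
    return {b: {w: c / sum(ws.values()) for w, c in ws.items()}
            for b, ws in counts.items()}

# Each training image is a (blobs, keywords) pair.
images = [
    (["blob1", "blob2"], ["tiger", "grass"]),
    (["blob1"], ["tiger"]),
    (["blob2"], ["grass"]),
]
table = translation_table(images)
print(max(table["blob1"], key=table["blob1"].get))  # → tiger
```

Annotation then assigns each region's blob its most probable word.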

444 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context ...he parameters of both linear logistic models and log-linear models. It has been shown that L-BFGS is the best optimization procedure for both maximum entropy [27] and conditional random fields models [35]. For more details on the limited-memory BFGS algorithm see [31]. 6 Evaluation The presented algorithms were evaluated with a retrieval setting on the Reuters-21578 collection, on a subset of the Core...

342 | A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
- Joachims
- 1997
Citation Context ...ssifier was initially proposed as a relevance feedback algorithm to compute a query vector from a small set of positive and negative examples [34]. It can also be used for categorization tasks, e.g., [18]: a keyword t_w is represented as a vector β_t in the multi-modal space, and the closer a document is to this vector the higher is the similarity between the document and the keyword. A keyword vector...

334 | Modeling annotated data
- Blei, Jordan
- 2003
Citation Context ...spired by a hierarchical clustering/aspect model. The data are assumed to be generated by a fixed hierarchy of nodes with the leaves of the hierarchy corresponding to soft clusters. Blei and Jordan [6] propose the correspondence latent Dirichlet allocation model; a Bayesian model for capturing the relations between regions, words and latent variables. The exploitation of hierarchical structures (ei...

321 | Automatic image annotation and retrieval using cross- media relevance models
- Jeon, Lavrenko, et al.
- 2003
Citation Context ...ions across an image collection. The problem is then formulated as learning the correspondence between the discrete vocabulary of blobs and the image keywords. Following the same translation approach [11, 16, 21] have developed a series of translation models that use different models for keywords (multinomial/binomial) and images representations (hard clustered regions, soft clustered regions, tiles). Hierarc...

278 | An extensive empirical study of feature selection metrics for text classification. The Journal of machine learning research
- Forman
Citation Context ...MU(y_j, t_i) = Σ_{y_j ∈ {0;1}} Σ_{d_{T,i}} p(y_j, d_{T,i}) log( p(y_j, d_{T,i}) / (p(y_j) p(d_{T,i})) ), (25) where d_{T,i} is the number of occurrences of term t_i in document d. Yang and Pedersen [44] and Forman [13] have shown experimentally that this is one of the best criteria for feature selection. A document d is then represented by k_T text terms as the mixture of Equation (26)...

271 | Unsupervised learning of finite mixture models
- Figueiredo, Jain
Citation Context ...s several other strategies that we will describe next. 4.3.2.1 Detailed Hierarchical EM The hierarchical EM algorithm was implemented in C++ and is based on the one proposed by Figueiredo and Jain [12]: it follows the component-wise EM algorithm with embedded component elimination. The mixture fitting algorithm presents a series of strategies that avoid some of the EM algorithm’s drawbacks: sensi...
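A bare two-component, one-dimensional Gaussian mixture fitted by vanilla EM illustrates the core of such mixture fitting; the component-wise variant with component elimination in [12] adds considerably more machinery:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm(data, iters=100):
    """Vanilla EM for a 2-component 1-D Gaussian mixture (toy sketch)."""
    mu = [min(data), max(data)]        # crude initialisation
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        resp = []
        for x in data:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, pi

data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
mu, var, pi = em_gmm(data)
print(sorted(round(m, 1) for m in mu))  # → [0.0, 5.0]
```

The variance floor is one of the simplest guards against the degeneracies that the component-wise algorithm addresses more systematically.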

262 | Using maximum entropy for text classification
- Nigam, Lafferty, et al.
- 1999
Citation Context ...are words, stemming, and finally term-weighting. Due to the high-dimensional feature space of text data most text categorization algorithms are linear models such as naïve Bayes [28], maximum entropy [30], Support Vector Machines [19], regularized linear models [46], and Linear Least Squares Fit [42]. Joachims [19] applies SVMs directly to the text terms. Text is ideal for applying SVMs without the ne...

230 | A gaussian prior for smoothing maximum entropy models
- Chen, Rosenfeld
- 1999
Citation Context ... visual features, text features from different sources (ASR, OCR), and audio features. Three types of classifiers are available: logistic regression (which without regularization is known to over-fit [8]), Fisher linear discriminant, and SVMs (offering the best accuracy). The fusion of the different modalities can be done at different levels and is chosen by cross-validation for each co...

229 | A comparison of algorithms for maximum entropy parameter estimation
- Malouf
- 2002
Citation Context ...ided by Liu and Nocedal [24] to estimate the parameters of both linear logistic models and log-linear models. It has been shown that L-BFGS is the best optimization procedure for both maximum entropy [27] and conditional random fields models [35]. For more details on the limited-memory BFGS algorithm see [31]. 6 Evaluation The presented algorithms were evaluated with a retrieval setting on the Reuters...

223 | Learning the semantics of words and pictures
- Barnard, Forsyth
- 2001
Citation Context ...ds (multinomial/binomial) and images representations (hard clustered regions, soft clustered regions, tiles). Hierarchical models have also been used in image annotation such as Barnard and Forsyth’s [3] generative hierarchical aspect model inspired by a hierarchical clustering/aspect model. The data are assumed to be generated by a fixed hierarchy of nodes with the leaves of the hierarchy correspond...

206 |
Minimum Complexity Density Estimation
- Barron, Cover
- 1991
Citation Context ...on F̂, and N is the number of samples in the training dataset. Hence, the MDL criterion is designed “to achieve the best compromise between likelihood and … complexity relative to the sample size”, [4]. Finally, the optimal feature-space transformation is the one that minimizes Equation (14), which results in F̂ = argmin_F DL(F, D). (15) The MDL criterion provides an estimate of the model e...
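The compromise the MDL criterion encodes can be illustrated with a two-part code for a histogram density, DL = -log-likelihood + (M/2)·log N for M bins; the histogram family is an illustrative stand-in for the paper's feature-space transformations:

```python
import math

def description_length(samples, num_bins):
    """Two-part MDL: data cost (negative log-likelihood under a histogram
    density) plus model cost ((parameters / 2) * log N)."""
    n = len(samples)
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for s in samples:
        counts[min(int((s - lo) / width), num_bins - 1)] += 1
    nll = -sum(c * math.log(c / (n * width)) for c in counts if c)
    return nll + (num_bins / 2.0) * math.log(n)

# Uniform data has no structure to reward extra bins, so MDL selects the
# single-bin model: more bins raise the model cost without lowering the
# data cost.
samples = [i / 100 for i in range(100)]
best = min(range(1, 11), key=lambda m: description_length(samples, m))
print(best)  # → 1
```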

184 |
Multiple bernoulli relevance models for image and video annotation
- Feng, Manmatha, et al.
- 2004
Citation Context ...ions across an image collection. The problem is then formulated as learning the correspondence between the discrete vocabulary of blobs and the image keywords. Following the same translation approach [11, 16, 21] have developed a series of translation models that use different models for keywords (multinomial/binomial) and images representations (hard clustered regions, soft clustered regions, tiles). Hierarc...

180 | A model for learning the semantics of pictures
- Lavrenko, Manmatha, et al.
- 2003
Citation Context ...ions across an image collection. The problem is then formulated as learning the correspondence between the discrete vocabulary of blobs and the image keywords. Following the same translation approach [11, 16, 21] have developed a series of translation models that use different models for keywords (multinomial/binomial) and images representations (hard clustered regions, soft clustered regions, tiles). Hierarc...

155 | Image classification for content-based indexing
- Vailaya, Figueiredo, et al.
- 2001
Citation Context ... et al. [45] deployed a nonparametric distribution; Carneiro and Vasconcelos [7] a semi-parametric density estimation; Westerveld and de Vries [39] a finite-mixture of Gaussians; while Vailaya et al. [38] apply a vector quantization technique. Density based approaches are among the most successful ones. However, density distributions are not adequate for text because the density models do not get enou...

94 |
A probabilistic framework for semantic video indexing, filtering, and retrieval
- Naphade, Huang
- 2001
Citation Context ... these classifiers. Westerveld et al. [40] combine the visual model and the text model under the assumption that they are independent, thus the probabilities are simply multiplied. Naphade and Huang [29] model visual features with Gaussian Mixtures Models (GMM), audio features with Hidden Markov Models (HMM) and combine them in a Bayesian network. In multimedia documents the different modalities cont...

81 | Text categorization based on regularized linear classifiers
- Zhang, Oles
- 2001
Citation Context ...gh-dimensional feature space of text data most text categorization algorithms are linear models such as naïve Bayes [28], maximum entropy [30], Support Vector Machines [19], regularized linear models [46], and Linear Least Squares Fit [42]. Joachims [19] applies SVMs directly to the text terms. Text is ideal for applying SVMs without the need of a kernel function because data is already sparse and hig...

71 |
On the limited memory method for large scale optimization
- Liu, Nocedal
- 1989
Citation Context ...e binomial logistic regression functions that the L-BFGS algorithm evaluates on each iteration to compute the β_t regression coefficients. We use the FORTRAN implementation provided by Liu and Nocedal [24] to estimate the parameters of both linear logistic models and log-linear models. It has been shown that L-BFGS is the best optimization procedure for both maximum entropy [27] and conditional random ...

63 | Formulating semantic image annotation as a supervised learning problem
- Carneiro, Vasconcelos
- 2005
Citation Context ...ral techniques to model p(x | w) with different types of probability density distributions have been proposed: Yavlinsky et al. [45] deployed a nonparametric distribution; Carneiro and Vasconcelos [7] a semi-parametric density estimation; Westerveld and de Vries [39] a finite-mixture of Gaussians; while Vailaya et al. [38] apply a vector quantization technique. Density based approaches are among t...

60 | Automated image annotation using global features and robust nonparametric density estimation
- Yavlinsky, Schofield, et al.
- 2005
Citation Context ...p(x | w), the features data density distribution of a given keyword. Several techniques to model p(x | w) with different types of probability density distributions have been proposed: Yavlinsky et al. [45] deployed a nonparametric distribution; Carneiro and Vasconcelos [7] a semi-parametric density estimation; Westerveld and de Vries [39] a finite-mixture of Gaussians; while Vailaya et al. [38] apply a...
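The nonparametric option can be sketched as a one-dimensional Gaussian kernel density estimate of p(x | w); the feature values and bandwidth below are invented for illustration:

```python
import math

def kde(train_points, x, bandwidth=0.1):
    """Gaussian kernel density estimate of p(x | w) from the feature
    values of training images annotated with keyword w (1-D sketch)."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(train_points))
    return norm * sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2)
                      for t in train_points)

# Feature values observed for a hypothetical keyword cluster near 0.8,
# so the density is higher there than at 0.2.
sky_features = [0.75, 0.8, 0.82, 0.78, 0.81]
print(kde(sky_features, 0.8) > kde(sky_features, 0.2))  # → True
```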

59 | The semantic pathfinder: using an authoring metaphor for generic multimedia indexing
- Snoek, Worring, et al.
Citation Context ...en way because they represent the same reality. Synchronization/relation and the strategy to combine the multi-modal patterns is a key point of the Semantic pathfinder system proposed by Snoek et al. [36, 37]. Their system uses a unique feature vector that concatenates a rich set of visual features, text features from different sources (ASR, OCR), and audio features. Three types of classifiers are availab...

54 | A maximum entropy framework for part-based texture and object recognition
- Lazebnik, Schmid, et al.
- 2005
Citation Context ...ses the number of parameters (model complexity) to be estimated with the same amount of training data. Maximum entropy models have also been applied to image annotation [2, 17] and object recognition [22]. All these three approaches have specific features for each class (keywords in our case) which increases the complexity of the system. It is curious to note the large difference in precision results ...

19 | The mediamill trecvid 2006 semantic video search engine
- Snoek, Gemert, et al.
- 2006
Citation Context ...en way because they represent the same reality. Synchronization/relation and the strategy to combine the multi-modal patterns is a key point of the Semantic pathfinder system proposed by Snoek et al. [36, 37]. Their system uses a unique feature vector that concatenates a rich set of visual features, text features from different sources (ASR, OCR), and audio features. Three types of classifiers are availab...

18 | IBM research trecvid-2005 video retrieval system
- Amir, Iyengar, et al.
Citation Context ...ed to compute the visual features and to iteratively select the best classifier, the best type of fusion, and the SVMs parameter optimization are serious drawbacks of this system. IBM’s Marvel system [1] has a similar architecture with different learning algorithms to analyse the semantics of multimedia content. These two approaches offer the best performance on the TRECVID2005 conference. Both appro...

9 |
Combining information sources for video retrieval
- Westerveld, Ianeva, et al.
- 2004
Citation Context ... information about each keyword of the vocabulary. The simplest approach to multi-modal analysis is to design a classifier per modality and combine the output of these classifiers. Westerveld et al. [40] combine the visual model and the text model under the assumption that they are independent, thus the probabilities are simply multiplied. Naphade and Huang [29] model visual features with Gaussian Mi...
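Under the independence assumption, fusion reduces to a per-keyword product of the modality probabilities; a trivial sketch (keyword names and scores are invented):

```python
def fuse(p_visual, p_text):
    """Independence assumption: the fused keyword score is the product
    of the per-modality probabilities."""
    return {w: p_visual[w] * p_text[w] for w in p_visual}

p_visual = {"sky": 0.9, "car": 0.4}
p_text   = {"sky": 0.7, "car": 0.8}
print(fuse(p_visual, p_text))
```

A keyword must score well in both modalities to rank highly, which is the practical effect of multiplying the probabilities.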

8 |
Semantic annotation of multimedia using maximum entropy models
- Argillander, Iyengar, et al.
- 2005
Citation Context ...ta or of the parameters) increases the number of parameters (model complexity) to be estimated with the same amount of training data. Maximum entropy models have also been applied to image annotation [2, 17] and object recognition [22]. All these three approaches have specific features for each class (keywords in our case) which increases the complexity of the system. It is curious to note the large diff...

6 |
Experimental result analysis for a generative probabilistic image retrieval model
- Westerveld, de Vries
- 2003
Citation Context ...ility density distributions have been proposed: Yavlinsky et al. [45] deployed a nonparametric distribution; Carneiro and Vasconcelos [7] a semi-parametric density estimation; Westerveld and de Vries [39] a finite-mixture of Gaussians; while Vailaya et al. [38] apply a vector quantization technique. Density based approaches are among the most successful ones. However, density distributions are not ade...

5 | High-dimensional visual vocabularies for image retrieval
- Magalhaes, Rueger
- 2007
Citation Context ... of the computed models. 5.1 Keyword Baseline Models The first linear models that we shall present in this section are simple but effective models that can be applied in the multi-modal feature space [26]. The advantage of both the Rocchio classifier and the naïve Bayes classifier is that they can be computed analytically. 5.1.1 Rocchio Classifier The Rocchio classifier was initially proposed as a relevance feedb...