Results 1 - 10 of 485
Evaluating Color Descriptors for Object and Scene Recognition
2010
Cited by 423 (33 self)
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview of color invariant descriptors in the context of image category recognition is required. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from …).
80 million tiny images: a large dataset for non-parametric object and scene recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence
Data Clustering: 50 Years Beyond K-Means
2008
Cited by 294 (7 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
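The K-means procedure the survey centers on fits in a short sketch. This is a generic textbook illustration on toy 2-D points, not code from the paper; initialization is made deterministic for clarity, whereas practical implementations use random or k-means++ seeding:

```python
def kmeans(points, k, iters=20):
    """Plain K-means on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster.
    Initialised with the first k points for determinism."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster empties out
                centroids[j] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

# Two well-separated blobs: the centroids converge to the blob means,
# roughly (1/3, 1/3) and (31/3, 31/3).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```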
Flickr tag recommendation based on collective knowledge
In WWW '08: Proc. of the 17th International Conference on World Wide Web
2008
Cited by 224 (1 self)
Online photo services such as Flickr and Zooomr allow users to share their photos with family, friends, and the online community at large. An important facet of these services is that users manually annotate their photos using so-called tags, which describe the contents of the photo or provide additional contextual and semantic information. In this paper we investigate how we can assist users in the tagging phase. The contribution of our research is twofold. We analyse a representative snapshot of Flickr and present the results by means of a tag characterisation focussing on how users tag photos and what information is contained in the tagging. Based on this analysis, we present and evaluate tag recommendation strategies to support the user in the photo annotation task by recommending a set of tags that can be added to the photo. The results of the empirical evaluation show that we can effectively recommend relevant tags for a variety of photos with different levels of exhaustiveness of original tagging.
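A minimal co-occurrence-based recommender in the spirit of the strategies the paper evaluates can be sketched as follows. The photo corpus and all tags are invented for illustration; this is not the authors' system:

```python
from collections import Counter
from itertools import combinations

# Toy tagged-photo corpus (invented data).
photos = [
    {"beach", "sea", "sun"},
    {"beach", "sea", "sand"},
    {"beach", "sun", "holiday"},
    {"city", "night", "lights"},
]

# Count how often each ordered pair of tags co-occurs on a photo.
cooc = Counter()
for tags in photos:
    for a, b in combinations(sorted(tags), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(user_tags, n=3):
    """Aggregate co-occurrence votes from the user's tags and return
    the n most frequently co-occurring tags not already present."""
    votes = Counter()
    for t in user_tags:
        for (a, b), count in cooc.items():
            if a == t and b not in user_tags:
                votes[b] += count
    return [tag for tag, _ in votes.most_common(n)]

print(recommend({"beach"}))
```

A photo tagged only "beach" gets "sea" and "sun" (each co-occurring twice) ahead of "sand".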
Small codes and large image databases for recognition
Cited by 185 (7 self)
The Internet contains billions of images, freely available online. Methods for efficiently searching this incredibly rich resource are vital for a large number of applications. These include object recognition [2], computer graphics [11, 27], personal photo collections, and online image search tools. In this paper, our goal is to develop efficient image search and scene matching techniques that are not only fast, but also require very little memory, enabling their use on standard hardware or even on handheld devices. Our approach uses recently developed machine learning techniques to convert the Gist descriptor (a real-valued vector that describes orientation energies at different scales and orientations within an image) to a compact binary code, with a few hundred bits per image. Using our scheme, it …
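The paper learns its binary codes with machine learning; a simpler random-hyperplane (locality-sensitive hashing) scheme still illustrates the core idea of compact codes compared by Hamming distance. All vectors and image names below are invented:

```python
import random

def random_hyperplanes(dim, bits, seed=0):
    """One random Gaussian normal vector per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def binary_code(vec, planes):
    """The sign of the projection onto each hyperplane gives one bit."""
    return tuple(int(sum(v * w for v, w in zip(vec, p)) > 0) for p in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplanes(dim=4, bits=64)
db = {
    "img_a": [1.0, 0.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0, 0.0],   # nearly the same direction as img_a
    "img_c": [0.0, 0.0, 1.0, 1.0],   # roughly orthogonal to both
}
codes = {name: binary_code(vec, planes) for name, vec in db.items()}

query = binary_code([1.0, 0.05, 0.0, 0.0], planes)
ranked = sorted(db, key=lambda name: hamming(query, codes[name]))
print(ranked)  # the similar images rank ahead of the orthogonal one
```

Similar descriptors agree on most bits, so ranking by Hamming distance approximates ranking by angular similarity while storing only a few bytes per image.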
A New Baseline for Image Annotation
Cited by 138 (0 self)
Automatically assigning keywords to images is of great interest as it allows one to index, retrieve, and understand large collections of image data. Many techniques have been proposed for image annotation in the last decade that give reasonable performance on standard datasets. However, most of these works fail to compare their methods with simple baseline techniques to justify the need for complex models and subsequent training. In this work, we introduce a new baseline technique for image annotation that treats annotation as a retrieval problem. The proposed technique utilizes low-level image features and a simple combination of basic distances to find nearest neighbors of a given image. The keywords are then assigned using a greedy label transfer mechanism. The proposed baseline outperforms the current state-of-the-art methods on two standard datasets and one large Web dataset. We believe that such a baseline measure will provide a strong platform to compare and better understand future annotation techniques.
A New Approach to Cross-Modal Multimedia Retrieval
Cited by 81 (5 self)
The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
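Learning the CCA projections is beyond a snippet, but the retrieval step that follows is easy to sketch: assuming both modalities have already been mapped into a shared semantic space (the coordinates below are invented), cross-modal retrieval reduces to nearest-neighbor search across modalities:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Hypothetical: CCA (learned offline) has already projected both
# modalities into a shared 3-D semantic space.
images = {
    "sunset.jpg":  [0.9, 0.1, 0.0],
    "stadium.jpg": [0.1, 0.9, 0.1],
}
text_query = [0.8, 0.2, 0.1]   # projection of a query text about sunsets

best = max(images, key=lambda name: cosine(text_query, images[name]))
print(best)  # → sunset.jpg
```

Because both views live in one space, the same ranking works in either direction: image queries against text documents use the identical similarity.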
Semi-supervised learning in gigantic image collections
In Advances in Neural Information Processing Systems 22
2009
Cited by 76 (3 self)
With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. We combine this with a label sharing framework obtained from WordNet to propagate label information to classes lacking manual annotations. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images with 74 thousand classes.
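The eigenfunction machinery is out of scope for a snippet, but the underlying semi-supervised idea, propagating a few clean labels through a similarity graph, can be sketched on a toy chain graph (not the authors' method, and naively quadratic rather than their efficient approximation):

```python
# Toy similarity graph: node -> {neighbour: edge weight}.
graph = {
    0: {1: 1.0},           # labelled +1
    1: {0: 1.0, 2: 0.5},
    2: {1: 0.5, 3: 1.0},
    3: {2: 1.0},           # labelled -1
}
labels = {0: 1.0, 3: -1.0}   # the few "clean" labels

# Iteratively set each unlabelled node to the weighted mean of its
# neighbours' scores; labelled nodes stay clamped to their given value.
scores = {n: labels.get(n, 0.0) for n in graph}
for _ in range(100):
    for n, nbrs in graph.items():
        if n in labels:
            continue
        total = sum(nbrs.values())
        scores[n] = sum(w * scores[m] for m, w in nbrs.items()) / total

print({n: round(s, 2) for n, s in scores.items()})
# → {0: 1.0, 1: 0.5, 2: -0.5, 3: -1.0}
```

Node 1 sits closer to the positive seed and node 2 closer to the negative one, so the propagated scores interpolate smoothly between the two clamped labels.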
I2T: Image Parsing to Text Description
Cited by 52 (2 self)
In this paper, we present an image parsing to text generation (I2T) framework that generates natural language descriptions from image and video content. This framework converts the harder content-based image and video retrieval problem into an easier text search problem with potential applications in Internet search and visual data mining. The proposed I2T framework follows three steps. 1) Input images or video frames are decomposed into their constituent visual patterns through an image parsing engine, which outputs a scene as a parse graph representation, in a spirit similar to parsing sentences in speech and natural language. 2) The parse graphs are converted into a semantic representation using the Web Ontology Language (OWL) format, which is a formal and unambiguous knowledge representation. 3) A text generation engine converts the semantic representation into a semantically meaningful, human-readable and queryable text report. Success of the above framework relies on two knowledge bases. The first one is a visual knowledge base that provides top-down hypotheses for image parsing and serves as an image ontology for translating parse graphs into semantic representations. The core of the visual knowledge base is an And-Or graph representation. It entails vocabularies of visual elements including pixels, primitives, parts, objects and scenes and a stochastic image grammar specifying compositional, spatial, temporal and functional relations between visual elements. We developed a large-scale ground-truth image database and interactive image annotation software to build the And-Or graph from real-world image instances. The second knowledge base is a general knowledge base that interconnects several domain specific ontologies in the form of the Semantic Web. This knowledge base further enriches the semantic representation of visual content with domain specific information.
Finally, we demonstrate a case study in video surveillance, an end-to-end system that automatically infers video events and generates natural language descriptions of video scenes. Experiments with maritime and urban scenes indicate the feasibility of the proposed approach.
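The final parse-graph-to-text step can be caricatured with a toy template generator. The node names and template are invented, and the real system mediates through OWL and a much richer And-Or graph rather than emitting strings directly:

```python
# Toy parse graph: detected objects plus (subject, relation, object)
# triples, a drastic simplification of the paper's parse graphs.
parse_graph = {
    "objects": ["boat", "dock"],
    "relations": [("boat", "approaches", "dock")],
}

def describe(graph):
    """Turn each relation triple into a simple templated English sentence."""
    sentences = ["A {} {} the {}.".format(s, r, o)
                 for s, r, o in graph["relations"]]
    return " ".join(sentences)

print(describe(parse_graph))  # → A boat approaches the dock.
```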