Streaming Submodular Maximization: Massive Data Summarization on the Fly, 2014
Abstract

Cited by 7 (3 self)
How can one summarize a massive data set “on the fly”, i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data, i.e., we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of “representativeness” satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor 1/2 − ε approximation guarantee to the optimum solution, requiring only a single pass through the data and memory independent of the data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
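The single-pass, fixed-memory selection the abstract describes can be illustrated with a minimal threshold rule. This is a hedged sketch, not the paper's algorithm: the actual method runs several thresholds in parallel to reach the 1/2 − ε guarantee, and the toy coverage objective and all names below are invented.

```python
def stream_select(stream, k, marginal_gain, threshold):
    """Single pass over the stream: keep an element only if its marginal
    gain meets the threshold; memory holds at most k elements."""
    summary = []
    for x in stream:
        if len(summary) < k and marginal_gain(summary, x) >= threshold:
            summary.append(x)
    return summary

# Toy submodular objective: how many universe items the chosen sets cover.
def coverage(chosen):
    return len(set().union(*chosen)) if chosen else 0

def gain(summary, x):
    return coverage(summary + [x]) - coverage(summary)

stream = [{1, 2}, {2, 3}, {1}, {4, 5, 6}, {6}]
picked = stream_select(stream, k=2, marginal_gain=gain, threshold=2)
# picked == [{1, 2}, {4, 5, 6}], covering 5 universe items
```

A single fixed threshold can be badly mistuned relative to the unknown optimum, which is exactly why the paper's method tracks a geometric grid of candidate thresholds at once.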
Learning Scalable Discriminative Dictionary with Sample Relatedness
Abstract

Cited by 1 (0 self)
Attributes are widely used as mid-level descriptors of object properties in object recognition and retrieval. Mostly, such attributes are manually predefined based on domain knowledge, and their number is fixed. However, predefined attributes may fail to adapt to the properties of the data at hand, may not necessarily be discriminative, and/or may not generalize well. In this work, we propose a dictionary learning framework that flexibly adapts to the complexity of the given data set and reliably discovers the inherent discriminative mid-level binary features in the data. We use sample relatedness information to improve the generalization of the learned dictionary. We demonstrate that our framework is applicable to both object recognition and complex image retrieval tasks, even with few training examples. Moreover, the learned dictionary also helps classify novel object categories. Experimental results on the Animals with Attributes, ILSVRC2010, and PASCAL VOC2007 datasets indicate that using relatedness information leads to significant performance gains over established baselines.
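As a hedged, heavily simplified illustration of the alternating structure of dictionary learning (a sparse coding step, then a dictionary update step): the sketch below restricts each sample's code to a single atom, which degenerates to k-means, whereas the paper learns richer binary mid-level codes with relatedness regularization. All names and data are invented.

```python
def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def learn_dictionary(samples, n_atoms, iters=20):
    """Alternate between coding (assign each sample its nearest atom,
    a 1-sparse code) and updating each atom to its members' mean."""
    step = len(samples) // n_atoms
    atoms = [samples[i * step][:] for i in range(n_atoms)]  # naive init
    codes = []
    for _ in range(iters):
        # Sparse coding step: 1-sparse code = index of the nearest atom.
        codes = [min(range(n_atoms), key=lambda a: sqdist(x, atoms[a]))
                 for x in samples]
        # Dictionary update step: each atom moves to its members' mean.
        for a in range(n_atoms):
            members = [x for x, c in zip(samples, codes) if c == a]
            if members:
                atoms[a] = [sum(col) / len(members) for col in zip(*members)]
    return atoms, codes

samples = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
atoms, codes = learn_dictionary(samples, n_atoms=2)
# codes == [0, 0, 1, 1]: the two clusters receive distinct codes
```

Real discriminative dictionary learning would additionally couple the codes to class labels and, as in this paper, to sample-relatedness terms; none of that is modeled here.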
Parallel Double Greedy Submodular Maximization
Abstract

Cited by 1 (0 self)
Many machine learning problems can be reduced to the maximization of submodular functions. Although well understood in the serial setting, the parallel maximization of submodular functions remains an open area of research, with recent results [1] only addressing monotone functions. The optimal algorithm for maximizing the more general class of non-monotone submodular functions was introduced by Buchbinder et al. [2] and follows a strongly serial double-greedy logic. In this work, we propose two methods to parallelize the double-greedy algorithm. The first, coordination-free approach emphasizes speed at the cost of a weaker approximation guarantee. The second, concurrency-control approach guarantees a tight 1/2-approximation, at the quantifiable cost of additional coordination and reduced parallelism. As a consequence, we explore the trade-off space between guaranteed performance and objective optimality. We implement and evaluate both algorithms on multi-core hardware and billion-edge graphs, demonstrating both the scalability and trade-offs of each approach.
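The serial double-greedy logic the abstract refers to is compact enough to sketch. Below is the deterministic variant of Buchbinder et al.'s algorithm on a toy non-monotone objective (the objective and all names are invented for illustration).

```python
def double_greedy(ground, f):
    """Deterministic double greedy: X grows from the empty set while Y
    shrinks from the full ground set; they meet after one serial pass."""
    X, Y = set(), set(ground)
    for e in ground:
        a = f(X | {e}) - f(X)      # marginal gain of adding e to X
        b = f(Y - {e}) - f(Y)      # marginal gain of removing e from Y
        if a >= b:
            X.add(e)
        else:
            Y.discard(e)
    return X  # X == Y once the pass completes

# Toy non-monotone submodular function, maximized at |S| == 2.
def f(S):
    return len(S) * (4 - len(S))

solution = double_greedy([1, 2, 3, 4], f)
# solution == {1, 3} with f(solution) == 4
```

Note that this deterministic pass only guarantees 1/3 of the optimum; the tight 1/2 figure in the abstract refers to the randomized variant, which chooses between the two updates with a bias proportional to a and b. The strictly sequential dependence of each step on the previous X and Y is exactly what makes parallelization non-trivial.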
Fast Constrained Submodular Maximization: Personalized Data Summarization, 2016
Abstract

Cited by 1 (1 self)
Can we summarize multi-category data based on user preferences in a scalable manner? Many utility functions used for data summarization satisfy submodularity, a natural diminishing-returns property. We cast personalized data summarization as an instance of a general submodular maximization problem subject to multiple constraints. We develop the first practical FAst coNsTrained submOdular Maximization algorithm, FANTOM, with strong theoretical guarantees. FANTOM maximizes a submodular function (not necessarily monotone) subject to the intersection of a p-system and ℓ knapsack constraints. It achieves a (1 + ε)(p + 1)(2p + 2ℓ + 1)/p approximation guarantee with only O(nrp log(n)/ε) query complexity (n and r indicate the size of the ground set and the size of the largest feasible solution, respectively). We then show how we can use FANTOM for personalized data summarization. In particular, a p-system can model different aspects of data, such as categories or time stamps, from which the users choose. In addition, knapsacks encode users' constraints, including budget or time. In our set of experiments, we consider several concrete applications: movie recommendation over 11K movies, personalized image summarization with 10K images, and revenue maximization on the YouTube social network with 5000 communities. We observe that FANTOM consistently provides the highest utility against all the baselines.
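To make the constraint types concrete, here is a hedged toy sketch: a plain greedy that respects one partition constraint (a simple example of a p-system) and one knapsack budget. This is not FANTOM and carries none of its guarantees; all item names, costs, and the coverage objective are invented.

```python
def cover(S):
    """Toy utility: number of distinct tags covered by the chosen items."""
    return len(set().union(*(tags for _, _, _, tags in S))) if S else 0

def constrained_greedy(items, f, cap_per_group, budget):
    """Greedily select (id, group, cost, tags) items under a per-group
    cap (partition constraint) and a total cost budget (knapsack)."""
    S, counts, spent = [], {}, 0
    remaining = list(items)
    while True:
        best, best_gain = None, 0
        for x in remaining:
            _, group, cost, _ = x
            if counts.get(group, 0) >= cap_per_group:
                continue                  # partition (p-system) constraint
            if spent + cost > budget:
                continue                  # knapsack constraint
            g = f(S + [x]) - f(S)
            if g > best_gain:
                best, best_gain = x, g
        if best is None:
            break
        S.append(best)
        counts[best[1]] = counts.get(best[1], 0) + 1
        spent += best[2]
        remaining.remove(best)
    return S

movies = [("a", "comedy", 1, {1, 2}), ("b", "comedy", 1, {1, 2, 3}),
          ("c", "drama", 2, {4}), ("d", "drama", 3, {4, 5, 6})]
summary = constrained_greedy(movies, cover, cap_per_group=1, budget=4)
# summary picks "b" then "d", covering all 6 tags
```

In the movie-recommendation setting of the abstract, the groups would be genres or time stamps and the costs a user's budget; FANTOM layers repeated runs and careful thresholding on top of this kind of greedy to handle non-monotone objectives with a proven guarantee.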
An Empirical Study of Stochastic Variational Algorithms for the Beta Bernoulli Process
Abstract
Stochastic variational inference (SVI) is emerging as the most promising candidate for scaling inference in Bayesian probabilistic models to large datasets. However, the performance of these methods has been assessed primarily in the context of Bayesian topic models, particularly latent Dirichlet allocation (LDA). Deriving several new algorithms, and using synthetic, image, and genomic datasets, we investigate whether the understanding gleaned from LDA applies in the setting of sparse latent factor models, specifically beta process factor analysis (BPFA). We demonstrate that the big picture is consistent: using Gibbs sampling within SVI to maintain certain posterior dependencies is extremely effective. However, we find that different posterior dependencies are important in BPFA relative to LDA. In particular, approximations able to model intra-local variable dependence perform best.
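The core SVI update is easy to sketch for the simplest conjugate model. Below is a hedged, minimal example on a single Beta-Bernoulli global variable, not the paper's BPFA: each step forms an intermediate posterior as if the minibatch were replicated to the full data size, then blends it in with a decaying step size (all constants are arbitrary choices).

```python
import random

def svi_beta_bernoulli(data, a0=1.0, b0=1.0, batch_size=10,
                       steps=200, seed=0):
    """Stochastic natural-gradient updates of a Beta(a, b) posterior."""
    rng = random.Random(seed)
    N = len(data)
    a, b = a0, b0
    for t in range(1, steps + 1):
        batch = [rng.choice(data) for _ in range(batch_size)]
        ones = sum(batch)
        # Intermediate estimate: batch sufficient statistics scaled to N.
        a_hat = a0 + N * ones / batch_size
        b_hat = b0 + N * (batch_size - ones) / batch_size
        rho = (t + 10) ** -0.7  # Robbins-Monro decaying step size
        a = (1 - rho) * a + rho * a_hat
        b = (1 - rho) * b + rho * b_hat
    return a, b

data = [1] * 700 + [0] * 300            # true success rate 0.7
a, b = svi_beta_bernoulli(data)
posterior_mean = a / (a + b)            # should approach roughly 0.7
```

The structured variants the abstract studies replace parts of such an update with Gibbs steps over local variables, precisely to keep the posterior dependencies that a fully factorized approximation discards.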
Communication Communities in MOOCs
Abstract
Massive Open Online Courses (MOOCs) bring together thousands of people from different geographies and demographic backgrounds, but to date little is known about how they learn or communicate. We introduce a new content-analysed MOOC dataset and use Bayesian Non-negative Matrix Factorization (BNMF) to extract communities of learners based on the nature of their online forum posts. We see that BNMF yields a superior probabilistic generative model for online discussions when compared to other models, and that the communities it learns are differentiated by their composite students’ demographic and course performance indicators. These findings suggest that computationally efficient probabilistic generative modelling of MOOCs can reveal important insights for educational researchers and practitioners and help to develop more intelligent and responsive online learning environments.
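As a hedged illustration of the underlying factorization step, here is plain maximum-likelihood NMF with multiplicative updates, a simpler stand-in for the Bayesian NMF the paper uses; the toy post-count matrix and all names are invented.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=200, seed=0):
    """Factor nonnegative V (n x m) into W (n x k) and H (k x m) via
    Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# Toy poster-by-topic count matrix with two obvious communities.
V = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [0, 0, 4, 5]]
W, H = nmf(V, k=2)
# Assign each poster (row) to its dominant community.
communities = [max(range(2), key=lambda c: W[i][c]) for i in range(4)]
```

The Bayesian treatment in the paper additionally places priors on W and H, which lets it choose the number of communities and quantify uncertainty rather than fixing k by hand as above.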