Results 1–10 of 38
Accelerated gradient methods for stochastic optimization and online learning
 Advances in Neural Information Processing Systems 22
, 2009
Abstract

Cited by 32 (1 self)
Regularized risk minimization often involves nonsmooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization that preserves the computational simplicity and scalability of gradient methods. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods, including FOLOS, SMIDAS, and SCD. Moreover, SAGE can also be extended to online learning, resulting in a simple algorithm with the best regret bounds currently known for these problems.
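SAGE's exact accelerated updates are given in the paper; as a minimal sketch of the stochastic composite setting it targets (smooth stochastic loss plus nonsmooth ℓ1 regularizer), here is a plain stochastic proximal-gradient step with the soft-thresholding prox on a toy 1-D lasso. The data, step size, and function names are illustrative, not from the paper:

```python
import random

def soft_threshold(v, t):
    """Prox operator of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def stochastic_prox_gradient(data, lam=0.1, eta=0.05, epochs=500, seed=0):
    """Minimize (1/n) sum_i 0.5*(w*x_i - y_i)^2 + lam*|w| (1-D lasso):
    a stochastic gradient step on a sampled loss, then the l1 prox."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)
        grad = (w * x - y) * x              # gradient of the sampled loss
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

# toy data from y = 2*x; the l1 penalty shrinks the estimate below 2
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w = stochastic_prox_gradient(data)
```

SAGE adds Nesterov-style momentum terms on top of this prox-gradient skeleton to obtain its accelerated rates.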
Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation
Abstract

Cited by 31 (13 self)
Abstract. Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let alone billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP) environment. This enables HEIGEN to handle matrices more than 1000× larger than those that can be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-world graphs, including a snapshot of the Twitter social network (38 GB, 2 billion edges) and the "YahooWeb" dataset, one of the largest publicly available graphs (120 GB, 1.4 billion nodes, 6.6 billion edges).
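HEIGEN itself uses carefully engineered Lanczos-style iterations; the computational kernel such eigensolvers distribute across MapReduce is the sparse matrix-vector product. A single-machine power-iteration sketch of that kernel, on a toy adjacency list (not the authors' implementation):

```python
import math

def power_iteration(adj, n, iters=100):
    """Leading eigenvalue/eigenvector of a symmetric adjacency matrix
    given as {node: [neighbors]}. Each iteration is one sparse mat-vec,
    which is exactly the step a MapReduce eigensolver parallelizes."""
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [0.0] * n
        for i, nbrs in adj.items():        # one pass over the edges
            for j in nbrs:
                w[i] += v[j]
        lam = sum(wi * vi for wi, vi in zip(w, v))   # Rayleigh quotient
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lam, v

# triangle graph K3: the leading adjacency eigenvalue is 2
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
lam, v = power_iteration(adj, 3)   # lam converges to 2.0
```

Power iteration recovers only the top eigenpair; Lanczos methods like HEIGEN's reuse the same mat-vec to build several leading eigenpairs at once.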
A very fast method for clustering big text datasets
 In ECAI
, 2010
Abstract

Cited by 7 (4 self)
Abstract. Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploits the power of pairwise similarities between data points, due to the prohibitive cost, time- and space-wise, of operating on a similarity matrix, where the state of the art is at best quadratic in time and in space. We present an extremely fast and simple method that also uses the power of all pairwise similarities between data points, and show through experiments that it matches previous methods in clustering accuracy, and does so in linear time and space, without sampling data points or sparsifying the similarity matrix.
Evolutionary Network Analysis: A Survey
Abstract

Cited by 6 (1 self)
Evolutionary network analysis has attracted increasing interest in the literature because of the importance of different kinds of dynamic social networks, email networks, biological networks, and social streams. When a network evolves, the results of data mining algorithms such as community detection need to be correspondingly updated. Furthermore, the specific kinds of changes to the structure of the network, such as the impact on community structure or on network structural parameters such as node degrees, also need to be analyzed. Some dynamic networks have a much faster rate of edge arrival and are referred to as network streams or graph streams. The analysis of such networks is especially challenging, because it needs to be performed with an online approach, under the one-pass constraint of data streams. The incorporation of content can add further complexity to the evolution analysis process. This survey provides an overview of the vast literature on graph evolution analysis and the numerous applications that arise in different contexts.
A Comprehensive Approach to Image Spam Detection: From Server to Client Solution
 IEEE Transactions on Information Forensics and Security
, 2010
Abstract

Cited by 4 (0 self)
Abstract—Image spam is a type of email spam that embeds spam text content into graphical images to bypass traditional text-based email spam filters. To effectively detect image spam, it is desirable to leverage image content analysis technologies. However, most previous work on image spam detection focuses on filtering image spam on the client side. We propose a more desirable, comprehensive solution that embraces both server-side filtering and client-side detection to effectively mitigate image spam. On the server side, we present a nonnegative sparsity-induced similarity measure for cluster analysis of spam images, used to filter the attack activities of spammers and quickly trace back the spam sources. On the client side, we employ the principle of active learning, where the learner guides the users to label as few images as possible while maximizing the classification accuracy. The server-side filtering identifies large image clusters as suspicious spam sources; further analysis can then identify the real sources and block them from the beginning. For those spam images that survive the server-side filter, our active learner on the client side further guides the users to interactively and efficiently filter them out. Our experiments on an image spam dataset collected from the email server of our department demonstrate the efficacy of the proposed comprehensive solution. Index Terms—Active learning, clustering, image recognition, image spam, spam filtering, sparse representation.
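The client-side idea of labeling as few images as possible while maximizing accuracy is the classic active-learning loop. A generic uncertainty-sampling sketch of that loop (not the paper's specific learner; the identity "predictor" and threshold oracle below are toy stand-ins for a real classifier and a real user):

```python
def most_uncertain(pool, predict):
    """Uncertainty sampling: query the item whose predicted spam
    probability is closest to 0.5 (where the model is least sure)."""
    return min(pool, key=lambda x: abs(predict(x) - 0.5))

def active_learn(pool, oracle, predict, budget):
    """Ask the user (oracle) to label only `budget` items,
    always choosing the currently most uncertain one."""
    labeled = []
    pool = list(pool)
    for _ in range(budget):
        x = most_uncertain(pool, predict)
        pool.remove(x)
        labeled.append((x, oracle(x)))   # the user supplies this label
    return labeled

# toy 1-D spam scores in [0, 1]; items near 0.5 get queried first
pool = [0.05, 0.48, 0.95, 0.52, 0.30]
labels = active_learn(pool, oracle=lambda x: x >= 0.5,
                      predict=lambda x: x, budget=2)
```

With a budget of 2, the loop queries 0.48 and 0.52, the two borderline items; the confident extremes never cost the user a label.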
K-tree: large scale document clustering
 In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
, 2009
Multilevel Approximate Spectral Clustering
Abstract

Cited by 3 (0 self)
Abstract—Clustering is the task of finding natural groups in datasets based on measured or perceived similarity between data points. Spectral clustering is a well-known graph-theoretic approach capable of capturing nonconvex geometries of datasets. However, it generally becomes infeasible for analyzing large datasets due to its relatively high time and space complexity. In this paper, we propose Multilevel Approximate Spectral (MAS) clustering to enable efficient analysis of large datasets. By integrating a series of low-rank matrix approximations (i.e., approximations to the affinity matrix and its subspace, as well as those for the Laplacian matrix and the Laplacian subspace), MAS achieves great computational and spatial efficiency. MAS provides a general framework for fast and accurate spectral clustering, which works with any kernel, various fast sampling strategies, and different low-rank approximation algorithms. In addition, it can be easily extended for distributed computing. From a theoretical perspective, we provide a rigorous analysis of its approximation error in addition to its correctness and computational complexity. Through extensive experiments we demonstrate the superior performance of the proposed method relative to several well-known approximate spectral clustering algorithms.
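The affinity-matrix approximation MAS builds on is typified by the Nyström method: reconstruct the full n×n similarity matrix from a few sampled landmark columns. A deliberately minimal rank-1 sketch of that idea (MAS combines higher-rank approximations with sampling; the example matrix is illustrative):

```python
def nystrom_rank1(W, k):
    """Rank-1 Nystrom approximation of affinity matrix W from
    landmark column k:  W_approx = c c^T / W[k][k], where c is
    column k. Only one column of W is ever read in full."""
    n = len(W)
    c = [W[i][k] for i in range(n)]
    d = W[k][k]
    return [[c[i] * c[j] / d for j in range(n)] for i in range(n)]

# an exactly rank-1 affinity matrix W = u u^T: one landmark recovers it
u = [1.0, 2.0, 3.0]
W = [[ui * uj for uj in u] for ui in u]
W_approx = nystrom_rank1(W, 0)
```

For a genuinely rank-1 matrix the reconstruction is exact; for real affinity matrices one samples m landmarks and inverts the m×m landmark block, trading accuracy against the O(nm) cost.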
Image-quality prediction of synthetic aperture sonar imagery
 in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
Abstract

Cited by 3 (3 self)
Abstract. This work exploits several machine-learning techniques to address the problem of image-quality prediction for synthetic aperture sonar (SAS) imagery. The objective is to predict the correlation of sonar ping returns as a function of range from the sonar by using measurements of sonar-platform motion and estimates of environmental characteristics. The environmental characteristics are estimated by effectively performing unsupervised seabed segmentation, which entails extracting wavelet-based features, performing spectral clustering, and learning a variational Bayesian Gaussian mixture model. The motion measurements and environmental features are then used to learn a Gaussian process regression model so that ping correlations can be predicted. To handle issues related to the large size of the dataset considered, sparse methods and an out-of-sample extension for spectral clustering are also exploited. The approach is demonstrated on an enormous dataset of real SAS images collected in the Baltic Sea.
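The final prediction stage is Gaussian process regression: the posterior mean at a test input is a kernel-weighted combination of training targets, with weights obtained by solving against the kernel matrix. A toy 1-D, two-point sketch of that mean computation (the paper's model uses many motion and environment features plus sparse GP methods; the RBF kernel and values below are illustrative):

```python
import math

def rbf(a, b, ell=1.0):
    """Squared-exponential (RBF) kernel on scalars."""
    return math.exp(-(a - b) ** 2 / (2 * ell ** 2))

def gp_mean(xs, ys, x_star, noise=1e-6):
    """GP posterior mean  k*(x)^T (K + noise*I)^{-1} y  for exactly
    two training points; the 2x2 solve is done by Cramer's rule."""
    a = rbf(xs[0], xs[0]) + noise
    b = rbf(xs[0], xs[1])
    c = rbf(xs[1], xs[0])
    d = rbf(xs[1], xs[1]) + noise
    det = a * d - b * c
    alpha0 = (d * ys[0] - b * ys[1]) / det   # (K + noise*I)^{-1} y
    alpha1 = (a * ys[1] - c * ys[0]) / det
    return rbf(x_star, xs[0]) * alpha0 + rbf(x_star, xs[1]) * alpha1

# with near-zero noise the GP mean interpolates the training data
m = gp_mean([0.0, 1.0], [0.0, 1.0], 0.0)
```

With negligible noise the posterior mean passes through the training points; real uses solve an n×n system (or a sparse approximation, as the paper does for its large dataset).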
A sparsity-inducing formulation for evolutionary co-clustering
 In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD ’12. ACM
Abstract

Cited by 3 (0 self)
Traditional co-clustering methods identify block structures from static data matrices. However, the data matrices in many applications are dynamic; that is, they evolve smoothly over time. Consequently, the hidden block structures embedded in the matrices are also expected to vary smoothly along the temporal dimension. It is therefore desirable to encourage smoothness between the block structures identified from temporally adjacent data matrices. In this paper, we propose an evolutionary co-clustering formulation for identifying co-cluster structures from time-varying data. The proposed formulation encourages smoothness between temporally adjacent blocks by employing fused-Lasso-type regularization. Our formulation is very flexible and allows for imposing smoothness constraints over only one dimension of the data matrices, thereby enabling its applicability to a large variety of settings. The optimization problem for the proposed formulation is nonconvex, nonsmooth, and nonseparable. We develop an iterative procedure to compute the solution. Each step of the iterative procedure involves a convex, but nonsmooth and nonseparable, problem. We propose to solve this problem in its dual form, which is convex and smooth. This leads to a simple gradient descent algorithm for computing the dual optimal solution. We evaluate the proposed formulation using the Allen Developing Mouse Brain Atlas data. Results show that our formulation consistently outperforms methods without the temporal smoothness constraints.
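The dual strategy described above (a nonsmooth fused-Lasso primal whose dual is smooth with box constraints, solved by gradient descent) can be illustrated on the simplest fused-Lasso problem, 1-D signal denoising. This is not the paper's co-clustering objective, just the same dual trick on a toy instance:

```python
def fused_lasso_denoise(y, lam, eta=0.25, iters=2000):
    """Solve min_x 0.5*||x - y||^2 + lam * sum_i |x[i+1] - x[i]|
    via projected gradient on the dual: the dual variable u lives in
    the box [-lam, lam]^(n-1), the dual objective is smooth, and the
    primal solution is recovered as x = y - D^T u."""
    n = len(y)
    u = [0.0] * (n - 1)

    def Dt(u):  # (D^T u)[j] = u[j-1] - u[j], with zeros off the ends
        return [(u[j - 1] if j > 0 else 0.0) -
                (u[j] if j < n - 1 else 0.0) for j in range(n)]

    for _ in range(iters):
        x = [yj - dj for yj, dj in zip(y, Dt(u))]          # current primal
        grad = [-(x[i + 1] - x[i]) for i in range(n - 1)]  # dual grad = -Dx
        # gradient step, then projection onto the box (both smooth & cheap)
        u = [max(-lam, min(lam, ui - eta * gi)) for ui, gi in zip(u, grad)]
    return [yj - dj for yj, dj in zip(y, Dt(u))]

# a noisy two-level signal: within-block wiggles are fused flat,
# while the large jump between the blocks is preserved
x = fused_lasso_denoise([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], lam=0.5)
```

The step size 0.25 is a valid 1/L choice since the dual Hessian D Dᵀ has spectral norm at most 4; each iteration is just a difference, a clip, and a cumulative correction, which is what makes the dual route attractive.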
Quantifying Sentiment and Influence in Blogspaces
Abstract

Cited by 2 (0 self)
The weblog, or blog, has become a popular form of social media, through which authors can write posts, which can in turn generate feedback in the form of user comments. Considered in totality, a collection of blogs can thus be viewed as an informal repository of mass sentiment and opinion. An obvious task of interest is to mine this collection to obtain some gauge of public sentiment over the wide variety of topics contained therein. However, the sheer size of the so-called blogosphere, combined with the fact that the subjects of posts can vary over a practically limitless number of topics, poses serious challenges for any meaningful analysis. In particular, the fact that virtually anyone with access to the Internet can author a blog raises the issue of credibility: should some blogs be considered more influential than others, and consequently, when gauging sentiment with respect to a topic, should some blogs be weighted more heavily than others? In addition, since new posts and comments can be made on an almost constant basis, any blog analysis algorithm must be able to handle such updates efficiently. In this paper, we give a formalization of the blog model. We give formal methods of quantifying sentiment and influence with respect to a hierarchy of topics, with the specific aim of facilitating the computation of a per-topic, influence-weighted sentiment measure. Finally, as efficiency is a specific end-goal, we give upper bounds on the time required to update these values with new posts, showing that our analysis and algorithms are scalable.
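The per-topic, influence-weighted sentiment measure with efficient updates can be sketched very simply: keep a running weighted sum and total weight per topic, so each new post is a constant-time update. This is a minimal flat-topic illustration of the update idea, not the paper's formalization (which handles a topic hierarchy and influence computation):

```python
class TopicSentiment:
    """Influence-weighted sentiment for one topic, maintained
    incrementally: each new post is an O(1) update, so the score
    stays current as posts stream in."""

    def __init__(self):
        self.weighted_sum = 0.0
        self.total_weight = 0.0

    def add_post(self, sentiment, influence):
        # sentiment in [-1, 1]; influence > 0 is the posting blog's weight
        self.weighted_sum += influence * sentiment
        self.total_weight += influence

    def score(self):
        """Influence-weighted average sentiment for this topic."""
        if self.total_weight == 0.0:
            return 0.0
        return self.weighted_sum / self.total_weight

t = TopicSentiment()
t.add_post(+1.0, influence=3.0)   # influential blog, positive post
t.add_post(-1.0, influence=1.0)   # minor blog, negative post
s = t.score()                     # (3 - 1) / 4 = 0.5
```

Storing sums rather than recomputing averages is what makes the update bound constant per post; a hierarchy of topics would propagate the same two-number update up each ancestor topic.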