Results 1–10 of 163
Batch tuning strategies for statistical machine translation
In HLT-NAACL, 2012
Cited by 62 (10 self)
There has been a proliferation of recent work on SMT tuning algorithms capable of handling larger feature sets than the traditional MERT approach. We analyze a number of these algorithms in terms of their sentence-level loss functions, which motivates several new approaches, including a Structured SVM. We perform empirical comparisons of eight different tuning strategies, including MERT, in a variety of settings. Among other results, we find that a simple and efficient batch version of MIRA performs at least as well as training online, and consistently outperforms other options.
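The MIRA update at the heart of this comparison has a simple closed form. Below is a minimal Python/NumPy sketch for a binary linear classifier, not the paper's structured SMT setting: the online update moves the weights just enough to fix a margin violation (capped by an aggressiveness parameter C), and the "batch" variant computes all candidate updates from the current weights before averaging them, mirroring the batch-vs-online distinction the paper studies. All names and data here are illustrative.

```python
import numpy as np

def mira_update(w, x, y, C=1.0):
    """One MIRA (passive-aggressive) update for a binary linear classifier.

    Moves w the minimum distance needed to classify (x, y) with margin 1,
    with the step size capped by the aggressiveness parameter C.
    """
    loss = max(0.0, 1.0 - y * w.dot(x))
    if loss > 0.0:
        tau = min(C, loss / x.dot(x))
        w = w + tau * y * x
    return w

def batch_mira(X, Y, epochs=5, C=1.0):
    """Batch variant: compute every example's candidate update from the
    *current* weights, then average them, once per pass over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        candidates = [mira_update(w, x, y, C) for x, y in zip(X, Y)]
        w = np.mean(candidates, axis=0)
    return w
```

On linearly separable toy data, a few batch passes suffice to separate the classes.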
Paragon: QoS-aware scheduling for heterogeneous datacenters
In Proceedings of the Eighteenth International …, 2013
Cited by 37 (7 self)
Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between co-located workloads and the difficulty of matching applications to one of the many available hardware platforms can degrade performance, violating the quality-of-service (QoS) guarantees that many cloud workloads require. While previous work has identified the impact of heterogeneity and interference, existing solutions are computationally intensive, cannot be applied online, and do not scale beyond a few applications. We present Paragon, an online and scalable DC scheduler that is heterogeneity- and interference-aware. Paragon is derived from robust analytical methods; instead of profiling each application in detail, it leverages information the system already has about applications it has previously seen. It uses collaborative filtering techniques to quickly and accurately classify an unknown incoming workload with respect to heterogeneity and interference in multiple shared resources, by identifying similarities to previously scheduled applications. The classification allows Paragon to greedily schedule applications in a manner that minimizes interference and maximizes server utilization. Paragon scales to tens of thousands of servers with marginal scheduling overheads in terms of time or state. We evaluate Paragon with a wide range of workload scenarios, on both small- and large-scale systems, including 1,000 servers on EC2. For a 2,500-workload scenario, Paragon enforces performance guarantees for 91% of applications while significantly improving utilization. In comparison, heterogeneity-oblivious, interference-oblivious, and least-loaded schedulers provide similar guarantees for only 14%, 11%, and 3% of workloads. The differences are more striking in oversubscribed scenarios where resource efficiency is more critical.
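Paragon's classification step is a collaborative-filtering idea: a few probe measurements of a new workload are matched against previously seen workloads to predict its behavior on every platform. The sketch below substitutes a much simpler nearest-neighbor recommender for Paragon's SVD-based one; the profile matrix, probe vector, and scores are all hypothetical toy values.

```python
import numpy as np

def classify_workload(R, probe):
    """Nearest-neighbor collaborative-filtering sketch (a simplified stand-in
    for Paragon's SVD-based recommender): compare the new workload's few
    probed measurements against previously profiled workloads and adopt the
    closest workload's full performance row as the prediction."""
    known = ~np.isnan(probe)                         # platforms actually probed
    dists = np.linalg.norm(R[:, known] - probe[known], axis=1)
    nearest = int(np.argmin(dists))
    return nearest, R[nearest]

# Hypothetical profile matrix: rows = known workloads, cols = server platforms;
# entries = normalized performance scores.
R = np.array([[1.0, 0.2, 0.8],
              [0.8, 0.4, 0.6],
              [0.1, 1.0, 0.2],
              [0.2, 0.9, 0.3]])
probe = np.array([0.95, np.nan, np.nan])             # only platform 0 probed
nearest, predicted = classify_workload(R, probe)
best_platform = int(np.argmax(predicted))            # greedy platform choice
```

The predicted full row then feeds the greedy scheduling step, which places the workload on its best available platform.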
Large-scale machine learning at Twitter
In SIGMOD, 2012
Cited by 27 (8 self)
The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions that provide predictive analytics capabilities incorporating machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
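As a rough illustration of the learning core described here, the following self-contained Python sketch trains logistic-regression models with plain SGD and combines them with a probability-averaging ensemble, one model per data shard. In the deployed system these steps run inside Pig via loaders, storage functions, and UDFs; nothing below is Twitter's actual code.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=20, seed=0):
    """Train a logistic-regression classifier with plain SGD; this is the
    kind of online learner the paper wraps in a Pig storage function."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-X[i].dot(w)))   # predicted probability
            w += lr * (y[i] - p) * X[i]              # gradient step on log-loss
    return w

def ensemble_predict(models, x):
    """Ensemble prediction: average the per-model probabilities
    (conceptually, one model trained per data shard)."""
    probs = [1.0 / (1.0 + np.exp(-x.dot(w))) for w in models]
    return float(np.mean(probs))
```

Training one model per shard and averaging their outputs is what makes the approach embarrassingly parallel in a MapReduce-style setting.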
Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems
Cited by 25 (1 self)
Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization, and there has been a recent flurry of activity to parallelize these algorithms. However, due to its cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
Keywords: Recommender systems, Matrix factorization, Low-rank approximation, Parallelization.
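The CCD++ update rule is easy to state: peel one rank-one factor (u_t, v_t) out of the residual, refit it with closed-form ridge-regression coordinate updates over the observed entries, and subtract it back. Below is a serial Python/NumPy sketch of that loop; the regularization value, dimensions, and iteration counts are illustrative, and the multi-core/distributed parallelization is omitted.

```python
import numpy as np

def ccdpp(A, mask, k=2, lam=1e-3, outer=10, inner=3):
    """CCD++ sketch: factor A ~ U V^T over observed entries (mask == True),
    updating one rank-one factor (u_t, v_t) at a time against the residual."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    R = np.where(mask, A - U @ V.T, 0.0)                 # residual on observed entries
    for _ in range(outer):
        for t in range(k):
            u, v = U[:, t].copy(), V[:, t].copy()
            Rt = R + np.where(mask, np.outer(u, v), 0.0)  # add factor t back in
            for _ in range(inner):
                # Closed-form coordinate updates: each u_i / v_j is a 1-D
                # ridge-regression solution against the current residual.
                for i in range(m):
                    obs = mask[i]
                    u[i] = (Rt[i, obs] * v[obs]).sum() / ((v[obs] ** 2).sum() + lam)
                for j in range(n):
                    obs = mask[:, j]
                    v[j] = (Rt[obs, j] * u[obs]).sum() / ((u[obs] ** 2).sum() + lam)
            R = Rt - np.where(mask, np.outer(u, v), 0.0)  # subtract refit factor
            U[:, t], V[:, t] = u, v
    return U, V
```

Because each rank-one refit only touches the residual, the per-update cost is linear in the number of observed entries, which is the efficiency advantage over ALS's cubic-in-rank solves.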
Optimization with first-order surrogate functions
In Proceedings of the International Conference on Machine Learning (ICML), 2013
Cited by 23 (2 self)
In this paper, we study optimization methods consisting of iteratively minimizing surrogates of an objective function. By proposing several algorithmic variants and simple convergence analyses, we make two main contributions. First, we provide a unified viewpoint for several first-order optimization techniques such as accelerated proximal gradient, block coordinate descent, or Frank-Wolfe algorithms. Second, we introduce a new incremental scheme that experimentally matches or outperforms state-of-the-art solvers for large-scale optimization problems typically arising in machine learning.
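A concrete instance of first-order surrogate minimization is the proximal gradient (ISTA) step for the lasso: each iteration exactly minimizes the surrogate f(x_k) + <grad f(x_k), x - x_k> + (L/2)||x - x_k||^2 + lam*||x||_1, whose closed-form minimizer is a soft-thresholded gradient step. The sketch below shows that view for a least-squares f; the paper's incremental scheme is not reproduced here.

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form minimizer of (1/2)(x - z)^2 + t*|x|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(A, b, lam, L=None, iters=200):
    """Minimize (1/2)||Ax - b||^2 + lam*||x||_1 by iteratively minimizing a
    first-order surrogate: a Lipschitz quadratic upper bound on the smooth
    part plus the untouched l1 term."""
    if L is None:
        L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        # Exact minimizer of the surrogate around the current iterate:
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

With A the identity the fixed point is the elementwise soft-thresholding of b, which makes the surrogate view easy to verify by hand.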
Metric learning for large-scale image classification: Generalizing to new classes at near-zero cost
In ECCV, 2012
Cited by 23 (4 self)
We are interested in large-scale image classification and especially in the setting where images corresponding to new or existing classes are continuously added to the training set. Our goal is to devise classifiers which can incorporate such images and classes on-the-fly at (near) zero cost. We cast this problem into one of learning a metric which is shared across all classes and explore k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers. We learn metrics on the ImageNet 2010 challenge data set, which contains more than 1.2M training images of 1K classes. Surprisingly, the NCM classifier compares favorably to the more flexible k-NN classifier, and has comparable performance to linear SVMs. We also study the generalization performance, among others by using the learned metric on the ImageNet-10K dataset, and we obtain competitive performance. Finally, we explore zero-shot classification, and show how the zero-shot model can be combined very effectively with small training datasets.
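The near-zero incremental cost of the NCM classifier follows directly from its definition: a class is represented only by its mean feature vector, so adding a class is one mean computation. The sketch below uses plain Euclidean distance in place of the learned Mahalanobis metric; labels and data are toy values.

```python
import numpy as np

class NCM:
    """Nearest class mean classifier: each class is summarized by its mean
    feature vector, so adding a new class costs one mean computation."""
    def __init__(self):
        self.means = {}

    def add_class(self, label, X):
        # (Near) zero-cost class addition: just store the class mean.
        self.means[label] = np.asarray(X, dtype=float).mean(axis=0)

    def predict(self, x):
        # With a learned metric W this distance would be ||W (x - mu_c)||;
        # plain Euclidean distance stands in for it here.
        x = np.asarray(x, dtype=float)
        return min(self.means, key=lambda c: np.linalg.norm(x - self.means[c]))
```

Adding a class after deployment is a single `add_class` call; no retraining of the shared metric is needed.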
Quasar: Resource-Efficient and QoS-Aware Cluster Management
Cited by 23 (5 self)
Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability. We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization because users do not necessarily understand the workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
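The joint allocation-and-assignment step can be pictured as a greedy packing problem. The sketch below is a deliberately crude best-fit heuristic over a hypothetical one-dimensional demand/capacity model, just to show the shape of the search; Quasar's real classifier-driven, multi-resource exploration is far richer, and all field names here are invented.

```python
def greedy_assign(workloads, servers):
    """Best-fit greedy placement sketch (hypothetical data model): each
    workload dict has a 'demand'; each server a 'capacity' and a current
    'load'. Per workload, pick the server that would be left with the least
    slack, which keeps utilization high."""
    placement = {}
    for w in sorted(workloads, key=lambda w: -w["demand"]):   # big jobs first
        fits = [s for s in servers if s["capacity"] - s["load"] >= w["demand"]]
        if not fits:
            placement[w["name"]] = None        # no server can host it
            continue
        best = min(fits, key=lambda s: s["capacity"] - s["load"] - w["demand"])
        best["load"] += w["demand"]
        placement[w["name"]] = best["name"]
    return placement
```

Best-fit (rather than least-loaded) placement is what drives up utilization: it fills servers before opening new slack.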
Push-sum distributed dual averaging for convex optimization
In IEEE CDC, 2012
Cited by 20 (7 self)
In this paper we extend and analyze the distributed dual averaging algorithm [1] to handle communication delays and general stochastic consensus protocols. Assuming each network link experiences some fixed bounded delay, we show that distributed dual averaging converges and the error decays at a rate O(T^(-1/2)), where T is the number of iterations. This bound improves over [1] by a logarithmic factor in T for networks of fixed size. Finally, we extend the algorithm to the case of general non-averaging consensus protocols. We prove that the bias introduced in the optimization can be removed by a simple correction that depends on the stationary distribution of the consensus matrix.
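The push-sum primitive underlying the algorithm is compact: every node diffuses a value and a weight with the same column-stochastic matrix, and the per-node ratio converges to the network average even though the matrix is not doubly stochastic. The sketch below uses a fixed hypothetical matrix P and ignores the delays and stochastic protocols that the paper actually analyzes.

```python
import numpy as np

def push_sum(values, P, iters=100):
    """Push-sum consensus sketch: node i keeps a value x_i and a weight w_i,
    both updated with the same column-stochastic matrix P (columns sum to 1,
    so total mass is conserved). The ratio x_i / w_i converges to the
    network-wide average of the initial values."""
    x = np.asarray(values, dtype=float)
    w = np.ones_like(x)
    for _ in range(iters):
        x = P @ x
        w = P @ w
    return x / w

# Hypothetical 3-node network; each column of P sums to 1.
P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])
avg = push_sum([3.0, 6.0, 9.0], P)
```

Tracking the weight vector is exactly the correction for non-doubly-stochastic mixing: x and w both converge to the same stationary direction, so their ratio cancels the bias.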
Communication-efficient distributed dual coordinate ascent
In Advances in Neural Information Processing Systems (NIPS), 2014