BALANCED ALLOCATIONS: THE HEAVILY LOADED CASE
2006
Cited by 57 (7 self)
We investigate balls-into-bins processes allocating m balls into n bins based on the multiple-choice paradigm. In the classical single-choice variant each ball is placed into a bin selected uniformly at random. In a multiple-choice process each ball can be placed into one out of d ≥ 2 randomly selected bins. It is known that in many scenarios having more than one choice for each ball can improve the load balance significantly. Formal analyses of this phenomenon prior to this work considered mostly the lightly loaded case, that is, when m ≈ n. In this paper we present the first tight analysis in the heavily loaded case, that is, when m ≫ n rather than m ≈ n. The best previously known results for the multiple-choice processes in the heavily loaded case were obtained using majorization by the single-choice process. This yields an upper bound on the maximum load of bins of m/n + O(√(m ln n / n)) with high probability. We show, however, that the multiple-choice processes are fundamentally different from the single-choice variant in that they have “short memory.” The great consequence of this property is that the deviation of the multiple-choice processes from the optimal allocation (that is, the allocation in which each bin has either ⌊m/n⌋ or ⌈m/n⌉ balls) does not increase with the number of balls as in the case of the single-choice process. In particular, we investigate the allocation obtained by two different multiple-choice allocation schemes,
Distributed selfish load balancing
2006
Cited by 29 (1 self)
Suppose that a set of m tasks are to be shared as equally as possible amongst a set of n resources. A game-theoretic mechanism to find a suitable allocation is to associate each task with a “selfish agent”, and require each agent to select a resource, with the cost of a resource being the number of agents that select it. Agents would then be expected to migrate from overloaded to underloaded resources, until the allocation becomes balanced. Recent work has studied the question of how this can take place within a distributed setting in which agents migrate selfishly without any centralized control. In this paper we discuss a natural protocol for the agents which combines the following desirable features: it can be implemented in a strongly distributed setting, uses no central control, and has good convergence properties. For m ≫ n, the system becomes approximately balanced (an ε-Nash equilibrium) in expected time O(log log m). We show using a martingale technique that the process converges to a perfectly balanced allocation in expected time O(log log m + n^4). We also give a lower bound of Ω(max{log log m, n}) for the convergence time.
Efficient hashing with lookups in two memory accesses
In: 16th SODA, ACM-SIAM
Cited by 16 (3 self)
The study of hashing is closely related to the analysis of balls and bins. Azar et al. [1] showed that if, instead of using a single hash function, we randomly hash a ball into two bins and place it in the less loaded of the two, the maximum load on bins drops dramatically. This leads to the concept of two-way hashing, where the largest bucket contains O(log log n) balls with high probability. The hash lookup now searches both of the buckets an item hashes to. Since an item may be placed in one of two buckets, we could potentially move an item after it has been initially placed to reduce the maximum load. Using this fact, we present a simple, practical hashing scheme that maintains a maximum load of 2, with high probability, while achieving high memory utilization. In fact, with n buckets, even if the space for two items is preallocated per bucket, as may be desirable in hardware implementations, more than n items can be stored, giving a high memory utilization. Assuming truly random hash functions, we prove the following properties for our hashing scheme.
• Each lookup takes two random memory accesses, and reads at most two items per access.
• Each insert takes O(log n) time and up to log log n + O(1) moves, with high probability, and constant time in expectation.
• The scheme maintains 83.75% memory utilization, without requiring dynamic allocation during inserts.
We also analyze the tradeoff between the number of moves performed during inserts and the maximum load on a bucket. By performing at most h moves, we can maintain a maximum load of O(log log n / (h log(log log n / h))). So, even by performing one move, we achieve a better bound than by performing no moves at all.
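The baseline two-way hashing idea described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's scheme: it performs no moves after placement, it does not cap buckets at two items, and the class name `TwoWayHashTable`, the salted use of Python's built-in `hash`, and the `seed` parameter are all assumptions made for the example.

```python
import random


class TwoWayHashTable:
    """Toy two-way hash table: each key has two candidate buckets."""

    def __init__(self, num_buckets, seed=0):
        rng = random.Random(seed)
        # Two salts stand in for two independent hash functions (assumption).
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, key):
        n = len(self.buckets)
        return [hash((salt, key)) % n for salt in self.salts]

    def lookup(self, key):
        # A lookup inspects both candidate buckets and nothing else.
        return any(key in self.buckets[i] for i in self._candidates(key))

    def insert(self, key):
        if self.lookup(key):
            return
        i, j = self._candidates(key)
        # Place the key in the less loaded of its two buckets (ties go to i).
        target = i if len(self.buckets[i]) <= len(self.buckets[j]) else j
        self.buckets[target].append(key)

    def max_load(self):
        return max(len(bucket) for bucket in self.buckets)


if __name__ == "__main__":
    table = TwoWayHashTable(num_buckets=1000)
    for key in range(1000):
        table.insert(key)
    print("maximum bucket load:", table.max_load())
```

Adding moves would mean relocating an already-placed item to its other candidate bucket when an insert finds both of its buckets full; that step is deliberately left out here.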
Succinct Data Structures for Retrieval and Approximate Membership
Cited by 13 (6 self)
The retrieval problem is the problem of associating data with keys in a set. Formally, the data structure must store a function f: U → {0, 1}^r that has specified values on the elements of a given set S ⊆ U, |S| = n, but may have any value on elements outside S. All known methods (e.g., those based on perfect hash functions) induce a space overhead of Θ(n) bits over the optimum, regardless of the evaluation time. We show that for any k, query time O(k) can be achieved using space that is within a factor 1 + e^(−k) of optimal, asymptotically for large n. The time to construct the data structure is O(n), expected. If we allow logarithmic evaluation time, the additive overhead can be reduced to O(log log n) bits whp. A general reduction transfers the results on retrieval into analogous results on approximate membership, a problem traditionally addressed using Bloom filters. Thus we obtain space bounds arbitrarily close to the lower bound for this problem as well. The evaluation procedures of our data structures are extremely simple. For the results stated above we assume free access to fully random hash functions. This assumption can be justified using space o(n) to simulate full randomness on a RAM.
Balanced Allocation on Graphs
In Proc. 7th Symposium on Discrete Algorithms (SODA), 2006
Cited by 9 (2 self)
It is well known that if n balls are inserted into n bins, with high probability, the bin with maximum load contains (1 + o(1)) log n / log log n balls. Azar, Broder, Karlin, and Upfal [1] showed that instead of choosing one bin, if d ≥ 2 bins are chosen at random and the ball inserted into the least loaded of the d bins, the maximum load reduces drastically to log log n / log d + O(1). In this paper, we study the two-choice balls-and-bins process when balls are not allowed to choose any two random bins, but only bins that are connected by an edge in an underlying graph. We show that for n balls and n bins, if the graph is almost regular with degree n^ε, where ε is not too small, the previous bounds on the maximum load continue to hold. Precisely, the maximum load is
Efficient parallel processing of range queries through replicated declustering
JOURNAL OF DISTRIBUTED AND PARALLEL DATABASES
Cited by 4 (3 self)
A common technique used to minimize I/O in data-intensive applications is data declustering over parallel servers. This technique involves distributing data among several disks so as to parallelize query retrieval and thus improve performance. We focus on optimizing access to large spatial data, and the most common type of queries on such data, i.e., range queries. An optimal declustering scheme is one in which the processing for all range queries is balanced uniformly among the available disks. It has been shown that single-copy-based declustering schemes are non-optimal for range queries. In this paper, we integrate replication with parallel disk declustering for efficient processing of range queries. We note that replication is widely used in database applications for several purposes, such as load balancing, fault tolerance, and availability of data. We propose theoretical foundations for replicated declustering and propose a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering, using a limited amount of replication, and provide extensions to apply it on real data, which include arbitrary grids and a large number of disks. Our framework also provides an effective indexing scheme that enables fast identification of data of interest in parallel servers. In addition to optimal processing of single queries, we show that this framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme to other techniques for both single queries and multiple queries, on synthetic and real data sets.
Kinesis: A new approach to replica placement in distributed storage systems
ACM Transactions on Storage (TOS)
Cited by 4 (1 self)
Kinesis is a novel data placement model for distributed storage systems. It exemplifies three design principles: structure (division of servers into a few failure-isolated segments), freedom of choice (freedom to allocate the best servers to store and retrieve data based on current resource availability), and scattered distribution (independent, pseudo-random spread of replicas in the system). These design principles enable storage systems to achieve balanced utilization of storage and network resources in the presence of incremental system expansions, failures of single and shared components, and skewed distributions of data size and popularity. In turn, this ability leads to significantly reduced resource provisioning costs, good user-perceived response times, and fast, parallelized recovery from independent and correlated failures. This paper validates Kinesis through theoretical analysis, simulations, and experiments on a prototype implementation. Evaluations driven by real-world traces show that Kinesis can significantly outperform the widely used Chain replica-placement strategy in terms of resource requirements, end-to-end delay, and failure recovery.
Fractional Matching via Balls-and-Bins
Cited by 1 (0 self)
In this paper we relate the problem of finding structures related to perfect matchings in bipartite graphs to a stochastic process similar to throwing balls into bins. Given a bipartite graph with n nodes on each side, we view each node on the left as having balls that it can throw into nodes on the right (bins) to which it is adjacent. If each node on the left throws exactly one ball and each bin on the right gets exactly one ball, then the edges represented by the ball placement form a perfect matching. Further, if each thrower is allowed to throw a large but equal number of balls, and each bin on the right receives an equal number of balls, then the set of ball placements corresponds to a perfect fractional matching: a weighted subgraph on all nodes with nonnegative weights on edges so that the total weight incident at each node is 1. We show that several simple algorithms based on throwing balls into bins deliver a near-perfect fractional matching. For example, we show that by iteratively picking a random node on the left and throwing a ball into its least-loaded neighbor, the distribution of balls obtained is no worse than randomly throwing kn balls into n bins. Another algorithm is based on the d-choice load balancing of balls and bins. By picking a constant number of nodes on the left and appropriately inserting a ball into the least-loaded of their neighbors, we achieve a smoother load distribution on both sides: the maximum load is at most log log n / log d + O(1). When each vertex on the left throws k balls, we obtain an algorithm that achieves a load within k ± 1 on the right vertices.
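The first algorithm mentioned in this abstract (pick a random left node, throw a ball into its least-loaded neighbor) can be sketched roughly as below. The function name `throw_balls`, the adjacency-dictionary representation, the tie-breaking rule, and the example graph are assumptions for illustration; the abstract does not fix any of them.

```python
import random


def throw_balls(adjacency, num_bins, num_balls, seed=0):
    """Throw balls from random left nodes into least-loaded right neighbors.

    adjacency maps each left node to a list of right-node (bin) indices.
    Returns the list of bin loads after num_balls throws.
    """
    rng = random.Random(seed)
    loads = [0] * num_bins
    left_nodes = list(adjacency)
    for _ in range(num_balls):
        thrower = rng.choice(left_nodes)
        # Ties are broken by neighbor order (an arbitrary choice here).
        target = min(adjacency[thrower], key=lambda b: loads[b])
        loads[target] += 1
    return loads


if __name__ == "__main__":
    n, k = 50, 10
    # Hypothetical graph: left node i is adjacent to bins i, i+1, i+2 (mod n).
    graph = {i: [i, (i + 1) % n, (i + 2) % n] for i in range(n)}
    loads = throw_balls(graph, num_bins=n, num_balls=k * n)
    print("min load:", min(loads), "max load:", max(loads))
```

Dividing each bin's load by k gives the edge weights of an approximate fractional matching only if the per-edge throw counts are also recorded; this sketch tracks bin loads alone.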
6 Distributed data management I – Hashing
There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication hardware by distributing data randomly among the given processors, which helps to avoid hot spots, and the caching approach tries to minimize the use of communication hardware by keeping data as close to the requesting processors as possible. In this section, we will concentrate on the hashing approach.
6.1 Static hashing
The basic idea behind hashing is to use a compact function (the hash function) in order to map some space U onto some space V. Hashing is mostly used in the context of resource management. Consider, for example, the problem of distributing data with addresses in some space U = {1, ..., M} evenly among n storage units (also called nodes in the following). In the static case we assume that n is fixed and the nodes are numbered from 1 to n. If the whole address space U is occupied by data items, it is easy to achieve an even distribution of the data among the nodes: node i gets all data j with (j mod n) + 1 = i. However, if U is sparsely populated, and in addition the data allocation in U changes over time, it is more difficult to keep the data evenly distributed among the nodes. In this case, random hash functions can help. Suppose that we have a random hash function that assigns every element in U to a node in {1, ..., n} chosen uniformly at random, i.e., for every x ∈ U, every node is chosen with probability 1/n. Then for any set S ⊆ U, the expected number of elements in S that are assigned to node i is |S|/n for every i ∈ {1, ..., n}. In addition, the following result can be shown:
Theorem 6.1 For any set S ⊆ U of size m, the maximum number of elements in S placed in a single node when using a random hash function is at most