Results 1–10 of 160
Mining Generalized Association Rules
1995
Cited by 457 (7 self)

Abstract:
We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that "people who buy outerwear tend to buy shoes". This rule may hold even if rules that "people who buy jackets tend to buy shoes" and "people who buy clothes tend to buy shoes" do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these "extended transactions". However, this "Basic" algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one real-life dataset). We also present a new interes...
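
To make the "Basic" approach above concrete, here is a minimal Python sketch of the extended-transactions step, assuming the taxonomy is given as a child-to-parent mapping; the names are illustrative and the subsequent mining pass is omitted.

def ancestors(item, parent):
    """Return all taxonomy ancestors of item under the is-a hierarchy."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend_transactions(transactions, parent):
    """The "Basic" step: add every ancestor of every item to its transaction."""
    return [set(t) | {a for i in t for a in ancestors(i, parent)}
            for t in transactions]

# Example taxonomy: jackets is-a outerwear is-a clothes
parent = {"jacket": "outerwear", "outerwear": "clothes"}
txns = [{"jacket", "shoes"}, {"outerwear", "shoes"}]
for t in extend_transactions(txns, parent):
    print(sorted(t))
# ['clothes', 'jacket', 'outerwear', 'shoes']
# ['clothes', 'outerwear', 'shoes']

Any standard association-rule miner run on the extended transactions can then discover rules at every level of the hierarchy, which is exactly the inefficiency Cumulate and EstMerge address.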
Efficient Algorithms for Discovering Association Rules
1994
Cited by 206 (11 self)

Abstract:
Association rules are statements of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set W, then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. We give an improved algorithm for the problem. The method is based on careful combinatorial analysis of the information obtained in previous passes; this makes it possible to eliminate unnecessary candidate rules. Experiments on a university course enrollment database indicate that the method outperforms the previous one by a factor of 5. We also show that sampling is in general a very efficient way of finding such rules.
Keywords: association rules, covering sets, algorithms, sampling.
1 Introduction
Data mining (database mining, knowledge discovery in databases) has recently been recognized as a promising new field in the intersection of databa...
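
The rule form quoted above can be checked directly. A small sketch, assuming rows are encoded as dicts of 0/1 column values (an illustrative encoding, not the paper's representation):

def rule_confidence(rows, W, B):
    """Fraction of rows with 1 in all columns of W that also have 1 in B."""
    covered = [r for r in rows if all(r[c] == 1 for c in W)]
    if not covered:
        return 0.0
    return sum(r[B] == 1 for r in covered) / len(covered)

rows = [{"w1": 1, "w2": 1, "b": 1},
        {"w1": 1, "w2": 1, "b": 1},
        {"w1": 1, "w2": 0, "b": 0},
        {"w1": 1, "w2": 1, "b": 0}]
print(rule_confidence(rows, W=("w1", "w2"), B="b"))  # 2/3

The algorithmic difficulty, which the paper's candidate-elimination analysis targets, is not evaluating a single rule but pruning the exponential space of candidate sets W without extra database passes.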
The Power of Two Choices in Randomized Load Balancing
IEEE Transactions on Parallel and Distributed Systems, 1996
Cited by 200 (22 self)

Abstract:
Suppose that n balls are placed into n bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately log n / log log n. Suppose instead that each ball is placed sequentially into the least full of d bins chosen independently and uniformly at random. It has recently been shown that the maximum load is then only log log n / log d + O(1) with high probability. Thus giving each ball two choices instead of just one leads to an exponential improvement in the maximum load. This result demonstrates the power of two choices, and it has several applications to load balancing in distributed systems. In this thesis, we expand upon this result by examining related models and by developing techniques for stu...
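
A quick simulation makes the contrast concrete. This is an illustrative sketch, not code from the thesis:

import random

def max_load(n, d, seed=0):
    """Place n balls into n bins, each into the least full of d random bins."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100_000
print("d=1:", max_load(n, 1))  # grows like log n / log log n
print("d=2:", max_load(n, 2))  # drops to roughly log log n / log 2

Running this for increasing n shows the d=1 load creeping upward while the d=2 load barely moves, which is the exponential improvement the abstract describes.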
Learning Decision Trees using the Fourier Spectrum
1991
Cited by 187 (10 self)

Abstract:
This work gives a polynomial time algorithm for learning decision trees with respect to the uniform distribution. (This algorithm uses membership queries.) The decision tree model that is considered is an extension of the traditional boolean decision tree model that allows linear operations in each node (i.e., summation of a subset of the input variables over GF(2)). This paper shows how to learn in polynomial time any function that can be approximated (in the L2 norm) by a polynomially sparse function (i.e., a function with only polynomially many nonzero Fourier coefficients). The authors demonstrate that any function f whose L1 norm (i.e., the sum of the absolute values of the Fourier coefficients) is polynomial can be approximated by a polynomially sparse function, and prove that boolean decision trees with linear operations are a subset of this class of functions. Moreover, it is shown that the functions with polynomial L1 norm can be learned deterministically. The algorithm can a...
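
For intuition about the extended model, here is a hedged sketch of evaluating a decision tree whose internal nodes branch on a GF(2) sum (parity) of the input variables; the node encoding is invented for illustration and is not the paper's notation.

def eval_tree(node, x):
    """node is either a leaf value, or (subset, zero_branch, one_branch)."""
    if not isinstance(node, tuple):
        return node
    subset, zero_branch, one_branch = node
    parity = sum(x[i] for i in subset) % 2  # linear operation over GF(2)
    return eval_tree(one_branch if parity else zero_branch, x)

# Root branches on x0 XOR x1; one branch consults x2 alone.
tree = ([0, 1], 0, ([2], 1, 0))
print(eval_tree(tree, [1, 0, 1]))  # parity(x0,x1)=1, then parity(x2)=1 -> 0

A plain boolean decision tree is the special case where every subset is a single variable, which is why the class with linear nodes strictly contains the traditional model.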
A Comparison of Sorting Algorithms for the Connection Machine CM-2
"... We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in t ..."
Abstract

Cited by 173 (6 self)
 Add to MetaCart
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 × 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
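
The sample sort idea translates to a short sequential sketch: pick splitters from a random sample, bucket the keys, and sort each bucket. On the CM-2 the bucket sorts would run in parallel across processors; the parameter names here are illustrative.

import random
from bisect import bisect_right

def sample_sort(keys, num_buckets=4, oversample=8, seed=0):
    rng = random.Random(seed)
    # Oversampled random sample yields evenly spaced splitters.
    sample = sorted(rng.choices(keys, k=num_buckets * oversample))
    splitters = [sample[i * oversample] for i in range(1, num_buckets)]
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bisect_right(splitters, k)].append(k)
    # Sequential stand-in for the per-processor parallel bucket sorts.
    return [k for b in buckets for k in sorted(b)]

rng = random.Random(1)
data = [rng.randrange(1000) for _ in range(100)]
assert sample_sort(data) == sorted(data)

Oversampling is the key design choice: it bounds the largest bucket with high probability, which is what makes the randomized algorithm fast in practice on large data sets.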
Randomized Search Trees
Algorithmica, 1996
Cited by 137 (1 self)

Abstract:
We present a randomized strategy for maintaining balance in dynamically changing search trees that has optimal expected behavior. In particular, in the expected case a search or an update takes logarithmic time, with the update requiring fewer than two rotations. Moreover, the update time remains logarithmic even if the cost of a rotation is taken to be proportional to the size of the rotated subtree. Finger searches, splits, and joins can also be performed in optimal expected time. We show that these results continue to hold even if very little true randomness is available, i.e., if only a logarithmic number of truly random bits are available. Our approach generalizes naturally to weighted trees, where the expected time bounds for accesses and updates again match the worst-case time bounds of the best deterministic methods. We also discuss ways of implementing our randomized strategy so that no explicit balance information is maintained. Our balancing strategy and our alg...
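
The strategy described is commonly realized as a treap. A minimal sketch, assuming uniformly random real-valued priorities rather than the paper's limited-randomness variants:

import random

class Node:
    def __init__(self, key):
        self.key, self.prio = key, random.random()
        self.left = self.right = None

def insert(root, key):
    """BST insert, then rotations restore the min-heap order on priorities."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio < root.prio:        # rotate right
            l, root.left, l.right = root.left, root.left.right, root
            root = l
    else:
        root.right = insert(root.right, key)
        if root.right.prio < root.prio:       # rotate left
            r, root.right, r.left = root.right, root.right.left, root
            root = r
    return root

root = None
for k in [5, 2, 8, 1, 9, 3]:
    root = insert(root, k)

Because the tree shape depends only on the random priorities, it is distributed as a random binary search tree regardless of insertion order, which gives the expected logarithmic bounds and the expected fewer-than-two rotations per update.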
Dispersers, Deterministic Amplification, and Weak Random Sources.
1989
Cited by 94 (12 self)

Abstract:
We use a certain type of expanding bipartite graphs, called disperser graphs, to design procedures for picking highly correlated samples from a finite set, with the property that the probability of hitting any sufficiently large subset is high. These procedures require a relatively small number of random bits and are robust with respect to the quality of the random bits. Using these sampling procedures to sample random inputs of polynomial-time probabilistic algorithms, we can simulate the performance of some probabilistic algorithms with fewer random bits or with low-quality random bits. We obtain the following results: 1. The error probability of an RP or BPP algorithm that operates with a constant error bound and requires n random bits can be made exponentially small (i.e., 2^(-n)) with only (3 + ε)n random bits, as opposed to standard amplification techniques that require Ω(n²) random bits for the same task. This result is nearly optimal, since the informati...
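
For contrast, here is a sketch of the standard amplification baseline the result improves on: independent repetition, which spends k·n random bits to drive the error to 2^(-k). The function name is a placeholder; the disperser-based construction itself is not shown here.

import random

def amplify(rp_algorithm, x, k, n_bits):
    """Naive RP amplification: any accepting run certifies a 'yes' instance."""
    for _ in range(k):
        r = random.getrandbits(n_bits)  # fresh bits each run: k * n_bits total
        if rp_algorithm(x, r):
            return True
    return False  # wrong with probability at most 2**-k on 'yes' inputs

The paper's point is that the k runs need not use independent bits: feeding the algorithm the correlated samples produced by a disperser achieves comparable error reduction from only (3 + ε)n bits.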
Backwards Analysis of Randomized Geometric Algorithms
Trends in Discrete and Computational Geometry, volume 10 of Algorithms and Combinatorics, 1992
Cited by 58 (0 self)

Abstract:
The theme of this paper is a rather simple method that has proved very potent in the analysis of the expected performance of various randomized algorithms and data structures in computational geometry. The method can be described as "analyze a randomized algorithm as if it were running backwards in time, from output to input." We apply this type of analysis to a variety of algorithms, old and new, and obtain solutions with optimal or near-optimal expected performance for a plethora of problems in computational geometry, such as computing Delaunay triangulations of convex polygons, computing convex hulls of point sets in the plane or in higher dimensions, sorting, intersecting line segments, linear programming with a fixed number of variables, and others.
1 Introduction
The curious phenomenon that randomness can be used profitably in the solution of computational tasks has attracted a lot of attention from researchers in recent years. The approach has proved useful in such diverse area...
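
A toy instance of the method, independent of the paper's geometric applications: scanning a random permutation, the i-th element is a new minimum exactly when it is the smallest of the first i, and viewing the scan backwards shows each of those i elements is equally likely to be the last one, so the probability is 1/i and the expected number of minimum updates is the harmonic number H_n. A short check of this claim:

import random

def count_min_updates(n, rng):
    perm = list(range(n))
    rng.shuffle(perm)
    updates, cur = 0, float("inf")
    for x in perm:
        if x < cur:
            cur, updates = x, updates + 1
    return updates

rng = random.Random(0)
n, trials = 1000, 2000
avg = sum(count_min_updates(n, rng) for _ in range(trials)) / trials
print(avg, sum(1 / i for i in range(1, n + 1)))  # both near H_1000 ≈ 7.49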
Parallel Randomized Load Balancing
In Symposium on Theory of Computing, ACM, 1995
Cited by 56 (8 self)

Abstract:
It is well known that after placing n balls independently and uniformly at random into n bins, the fullest bin holds Θ(log n / log log n) balls with high probability. Recently, Azar et al. analyzed the following: randomly choose d bins for each ball, and then sequentially place each ball in the least full of its chosen bins [2]. They show that the fullest bin contains only log log n / log d + Θ(1) balls with high probability. We explore extensions of this result to parallel and distributed settings. Our results focus on the tradeoff between the amount of communication and the final load. Given r rounds of communication, we provide lower bounds on the maximum load of Ω((log n / log log n)^(1/r)) for a wide class of strategies. Our results extend to the case where the number of rounds is allowed to grow with n. We then demonstrate parallelizations of the sequential strategy presented in Azar et al. that achieve loads within a constant factor of the lower bound for two ...
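
As a rough illustration of the round/load tradeoff, here is a simplified synchronous protocol sketch; it is not the paper's algorithm, and the fallback rule and parameters are invented for the example. In each round every unplaced ball requests d random bins, each bin grants at most one request, and after r rounds the stragglers each commit to a single random bin.

import random

def parallel_place(n, d=2, rounds=2, seed=0):
    rng = random.Random(seed)
    load = [0] * n
    unplaced = list(range(n))
    for _ in range(rounds):
        requests = {}  # bin -> first ball to request it this round
        for ball in unplaced:
            for b in rng.sample(range(n), d):
                requests.setdefault(b, ball)
        granted = set()
        for b, ball in requests.items():
            if ball not in granted:  # each ball placed at most once
                load[b] += 1
                granted.add(ball)
        unplaced = [ball for ball in unplaced if ball not in granted]
    for _ in unplaced:  # fallback after r rounds: one random bin each
        load[rng.randrange(n)] += 1
    return max(load)

print(parallel_place(100_000))

Varying rounds in this sketch shows the qualitative effect the lower bound formalizes: fewer rounds of communication force a higher maximum load.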