Results 1  10
of
162
Mining Generalized Association Rules
, 1995
"... We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (isa hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy th ..."
Abstract

Cited by 447 (7 self)
 Add to MetaCart
We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (isa hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets isa outerwear isa clothes, we may infer a rule that "people who buy outerwear tend to buy shoes". This rule may hold even if rules that "people who buy jackets tend to buy shoes", and "people who buy clothes tend to buy shoes" do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these "extended transactions ". However, this "Basic" algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one reallife dataset). We also present a new interes...
Combining fuzzy information from multiple systems, IBM
, 1995
"... In a traditional database system, the result of a query is a set of values (those values that satisfy the query). In other data servers, such as a system with queries baaed on image content, or many text retrieval systems, the result of a query is a sorted list. For example, in the case of a system ..."
Abstract

Cited by 331 (6 self)
 Add to MetaCart
In a traditional database system, the result of a query is a set of values (those values that satisfy the query). In other data servers, such as a system with queries baaed on image content, or many text retrieval systems, the result of a query is a sorted list. For example, in the case of a system with queries based on image content, the query might aak for objects that are a particular shade of red, and the result of the query would be a sorted list of objects in the database, sorted by how well the color of the object matches that given in the query. A multimedia system must somehow synthesize both types of queries (those whose result is a set, and those whose result is a sorted list) in a consistent manner. In this paper we discuss the solution adopted by Garlic, a multimedia information system being developed at
Efficient Algorithms for Discovering Association Rules
, 1994
"... Association rules are statements of the form "for 90 % of the rows of the relation, if the row has value 1 in the columns in set W , then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method ba ..."
Abstract

Cited by 204 (11 self)
 Add to MetaCart
Association rules are statements of the form "for 90 % of the rows of the relation, if the row has value 1 in the columns in set W , then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. We give an improved algorithm for the problem. The method is based on careful combinatorial analysis of the information obtained in previous passes; this makes it possible to eliminate unnecessary candidate rules. Experiments on a university course enrollment database indicate that the method outperforms the previous one by a factor of 5. We also show that sampling is in general a very efficient way of finding such rules. Keywords: association rules, covering sets, algorithms, sampling. 1 Introduction Data mining (database mining, knowledge discovery in databases) has recently been recognized as a promising new field in the intersection of databa...
The Power of Two Choices in Randomized Load Balancing
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
"... Suppose that n balls are placed into n bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately log n log log n . Suppose instead that each ball is placed sequentially into the least full of d ..."
Abstract

Cited by 201 (23 self)
 Add to MetaCart
Suppose that n balls are placed into n bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately log n log log n . Suppose instead that each ball is placed sequentially into the least full of d bins chosen independently and uniformly at random. It has recently been shown that the maximum load is then only log log n log d +O(1) with high probability. Thus giving each ball two choices instead of just one leads to an exponential improvement in the maximum load. This result demonstrates the power of two choices, and it has several applications to load balancing in distributed systems. In this thesis, we expand upon this result by examining related models and by developing techniques for stu...
Learning Decision Trees using the Fourier Spectrum
, 1991
"... This work gives a polynomial time algorithm for learning decision trees with respect to the uniform distribution. (This algorithm uses membership queries.) The decision tree model that is considered is an extension of the traditional boolean decision tree model that allows linear operations in each ..."
Abstract

Cited by 182 (10 self)
 Add to MetaCart
This work gives a polynomial time algorithm for learning decision trees with respect to the uniform distribution. (This algorithm uses membership queries.) The decision tree model that is considered is an extension of the traditional boolean decision tree model that allows linear operations in each node (i.e., summation of a subset of the input variables over GF (2)). This paper shows how to learn in polynomial time any function that can be approximated (in norm L 2 ) by a polynomially sparse function (i.e., a function with only polynomially many nonzero Fourier coefficients). The authors demonstrate that any function f whose L 1 norm (i.e., the sum of absolute value of the Fourier coefficients) is polynomial can be approximated by a polynomially sparse function, and prove that boolean decision trees with linear operations are a subset of this class of functions. Moreover, it is shown that the functions with polynomial L 1 norm can be learned deterministically. The algorithm can a...
A Comparison of Sorting Algorithms for the Connection Machine CM2
"... We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in t ..."
Abstract

Cited by 173 (6 self)
 Add to MetaCart
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64Kprocessor CM2, our sample sort implementation can sort 32 10 6 64bit keys in 5.1 seconds, which is over 10 times faster than the CM2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
Randomized Search Trees
 ALGORITHMICA
, 1996
"... We present a randomized strategy for maintaining balance in dynamically changing search trees that has optimal expected behavior. In particular, in the expected case a search or an update takes logarithmic time, with the update requiring fewer than two rotations. Moreover, the update time remains ..."
Abstract

Cited by 139 (1 self)
 Add to MetaCart
We present a randomized strategy for maintaining balance in dynamically changing search trees that has optimal expected behavior. In particular, in the expected case a search or an update takes logarithmic time, with the update requiring fewer than two rotations. Moreover, the update time remains logarithmic, even if the cost of a rotation is taken to be proportional to the size of the rotated subtree. Finger searches and splits and joins can be performed in optimal expected time also. We show that these results continue to hold even if very little true randomness is available, i.e. if only a logarithmic number of truely random bits are available. Our approach generalizes naturally to weighted trees, where the expected time bounds for accesses and updates again match the worst case time bounds of the best deterministic methods. We also discuss ways of implementing our randomized strategy so that no explicit balance information is maintained. Our balancing strategy and our alg...
Dispersers, Deterministic Amplification, and Weak Random Sources.
, 1989
"... We use a certain type of expanding bipartite graphs, called disperser graphs, to design procedures for picking highly correlated samples from a finite set, with the property that the probability of hitting any sufficiently large subset is high. These procedures require a relatively small number of r ..."
Abstract

Cited by 93 (11 self)
 Add to MetaCart
We use a certain type of expanding bipartite graphs, called disperser graphs, to design procedures for picking highly correlated samples from a finite set, with the property that the probability of hitting any sufficiently large subset is high. These procedures require a relatively small number of random bits and are robust with respect to the quality of the random bits. Using these sampling procedures to sample random inputs of polynomial time probabilistic algorithms, we can simulate the performance of some probabilistic algorithms with less random bits or with low quality random bits. We obtain the following results: 1. The error probability of an RP or BPP algorithm that operates with a constant error bound and requires n random bits, can be made exponentially small (i.e. 2 \Gamman ), with only (3 + ffl)n random bits, as opposed to standard amplification techniques that require \Omega\Gamma n 2 ) random bits for the same task. This result is nearly optimal, since the informati...
Backwards Analysis of Randomized Geometric Algorithms
 Trends in Discrete and Computational Geometry, volume 10 of Algorithms and Combinatorics
, 1992
"... The theme of this paper is a rather simple method that has proved very potent in the analysis of the expected performance of various randomized algorithms and data structures in computational geometry. The method can be described as "analyze a randomized algorithm as if it were running backwards in ..."
Abstract

Cited by 60 (0 self)
 Add to MetaCart
The theme of this paper is a rather simple method that has proved very potent in the analysis of the expected performance of various randomized algorithms and data structures in computational geometry. The method can be described as "analyze a randomized algorithm as if it were running backwards in time, from output to input." We apply this type of analysis to a variety of algorithms, old and new, and obtain solutions with optimal or near optimal expected performance for a plethora of problems in computational geometry, such as computing Delaunay triangulations of convex polygons, computing convex hulls of point sets in the plane or in higher dimensions, sorting, intersecting line segments, linear programming with a fixed number of variables, and others. 1 Introduction The curious phenomenon that randomness can be used profitably in the solution of computational tasks has attracted a lot of attention from researchers in recent years. The approach has proved useful in such diverse area...