Beyond Market Baskets: Generalizing Association Rules To Dependence Rules
, 1998
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm’s effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
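As a rough illustration of the chi-squared test for independence mentioned in the abstract above, the sketch below computes the statistic for a single pair of items over a list of baskets, counting both presence and absence of each item. The function name `chi_squared` and the toy baskets are hypothetical, and the sketch covers only the two-item case, not the lattice search or pruning the paper describes.

```python
from itertools import product


def chi_squared(baskets, a, b):
    """Chi-squared statistic for independence of items a and b.

    `baskets` is a list of sets. All four presence/absence cells of the
    2x2 contingency table contribute to the statistic.
    """
    n = len(baskets)
    # Observed count for each (a present?, b present?) combination.
    observed = {cell: 0 for cell in product((True, False), repeat=2)}
    for basket in baskets:
        observed[(a in basket, b in basket)] += 1

    count_a = observed[(True, True)] + observed[(True, False)]
    count_b = observed[(True, True)] + observed[(False, True)]

    stat = 0.0
    for (has_a, has_b), obs in observed.items():
        p_a = count_a / n if has_a else 1 - count_a / n
        p_b = count_b / n if has_b else 1 - count_b / n
        expected = n * p_a * p_b  # count expected under independence
        if expected > 0:
            stat += (obs - expected) ** 2 / expected
    return stat


dependent = [{"tea", "milk"}, {"tea", "milk"}, set(), set()]
independent = [{"tea", "milk"}, {"tea"}, {"milk"}, set()]
print(chi_squared(dependent, "tea", "milk"))
print(chi_squared(independent, "tea", "milk"))
```

For a 2x2 table the statistic has one degree of freedom, so a value above roughly 3.84 would reject independence at the 95% level.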
Min-Wise Independent Permutations
 Journal of Computer and System Sciences
, 1998
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept – we present the solutions to some of them and we list the rest as open problems.
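The definition above can be checked by brute force for the trivially min-wise independent family F = S_n, the set of all permutations. The sketch below enumerates S_n for a small n and confirms that every element of X is the minimum with probability 1/|X|. It also checks a well-known consequence that the abstract does not state explicitly: the minima of two sets agree with probability |A ∩ B| / |A ∪ B|, which is what makes such families useful for estimating document resemblance. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def min_probabilities(n, X):
    """Pr(min pi(X) = pi(x)) for each x in X, with pi uniform over S_n."""
    counts = {x: 0 for x in X}
    total = 0
    for pi in permutations(range(n)):
        # pi[x] is the image of x; the winner has the smallest image.
        winner = min(X, key=lambda x: pi[x])
        counts[winner] += 1
        total += 1
    return {x: Fraction(c, total) for x, c in counts.items()}


def min_agreement(n, A, B):
    """Pr(min pi(A) = min pi(B)), with pi uniform over S_n."""
    agree = total = 0
    for pi in permutations(range(n)):
        if min(pi[x] for x in A) == min(pi[x] for x in B):
            agree += 1
        total += 1
    return Fraction(agree, total)


print(min_probabilities(5, {0, 2, 3}))
print(min_agreement(5, {0, 1, 2}, {1, 2, 3}))
```

Enumerating all n! permutations is only feasible for tiny n, which is why smaller families with the same (or an approximately equal) property are of interest.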
Faster algorithms for the shortest path problem
, 1990
Efficient implementations of Dijkstra's shortest path algorithm are investigated. A new data structure, called the radix heap, is proposed for use in this algorithm. On a network with n vertices, m edges, and nonnegative integer arc costs bounded by C, a one-level form of radix heap gives a time bound for Dijkstra's algorithm of O(m + n log C). A two-level form of radix heap gives a bound of O(m + n log C / log log C). A combination of a radix heap and a previously known data structure called a Fibonacci heap gives a bound of O(m + n √(log C)). The best previously known bounds are O(m + n log n) using Fibonacci heaps alone and O(m log log C) using the priority queue structure of Van Emde Boas et al. [17].
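To make the idea of a monotone bucket-based heap concrete, here is a minimal sketch of Dijkstra's algorithm on top of a simplified radix heap. It uses a common XOR-based bucketing variant with a fixed number of buckets and lazy deletion of stale entries instead of decrease-key, so it is an illustration under those assumptions and not the one-level structure as specified in the paper. The class and function names are hypothetical.

```python
class RadixHeap:
    """Monotone priority queue for non-negative integer keys.

    Keys pushed must never be smaller than the last key popped, which
    holds in Dijkstra's algorithm with non-negative arc costs.
    """

    def __init__(self, max_bits=64):
        self.buckets = [[] for _ in range(max_bits + 1)]
        self.last = 0  # key of the most recently popped item
        self.size = 0

    def _bucket(self, key):
        # Position of the highest bit in which key differs from `last`;
        # bucket 0 holds keys equal to `last`.
        return (key ^ self.last).bit_length()

    def push(self, key, value):
        assert key >= self.last, "keys must be monotone"
        self.buckets[self._bucket(key)].append((key, value))
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty RadixHeap")
        if not self.buckets[0]:
            # Redistribute the first non-empty bucket around its minimum.
            i = next(i for i, b in enumerate(self.buckets) if b)
            items, self.buckets[i] = self.buckets[i], []
            self.last = min(key for key, _ in items)
            for key, value in items:
                self.buckets[self._bucket(key)].append((key, value))
        self.size -= 1
        return self.buckets[0].pop()


def dijkstra(graph, source):
    """Shortest distances from source; graph maps u -> [(v, cost), ...]."""
    dist = {source: 0}
    heap = RadixHeap()
    heap.push(0, source)
    while heap.size:
        d, u = heap.pop()
        if d > dist[u]:
            continue  # stale entry left behind by a later improvement
        for v, cost in graph.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heap.push(nd, v)
    return dist


graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2), ("d", 7)], "b": [("d", 3)]}
print(dijkstra(graph, "a"))
```

Each item can only move to a lower-numbered bucket when it is redistributed, which is the source of the logarithmic term in the bounds for this family of structures.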
Let Sleeping Files Lie: Pattern Matching in Z-Compressed Files
, 1994
The current explosion of stored information necessitates a new model of pattern matching, that of compressed matching. In this model one tries to find all occurrences of a pattern in a compressed text in time proportional to the compressed text size, i.e., without decompressing the text. The most effective general-purpose compression algorithms are adaptive, in that the text represented by each compression symbol is determined dynamically by the data. As a result, the encoding of a substring depends on its location. Thus the same substring may "look different" every time it appears in the compressed text. In this paper we consider pattern matching without decompression in the UNIX Z-compression. This is a variant of the Lempel-Ziv adaptive compression scheme. If n is the length of the compressed text and m is the length of the pattern, our algorithms find the first pattern occurrence in time O(n + m²) or O(n log m + m). We also introduce a new criterion to measure compressed matching ...
Closest-Point Problems in Computational Geometry
, 1997
This is the preliminary version of a chapter that will appear in the Handbook on Computational Geometry, edited by J.-R. Sack and J. Urrutia. A comprehensive overview is given of algorithms and data structures for proximity problems on point sets in ℝ^D. In particular, the closest pair problem, the exact and approximate post-office problem, and the problem of constructing spanners are discussed in detail.
Contents
1 Introduction
2 The static closest pair problem
2.1 Preliminary remarks
2.2 Algorithms that are optimal in the algebraic computation tree model
2.2.1 An algorithm based on the Voronoi diagram
2.2.2 A divide-and-conquer algorithm
2.2.3 A plane sweep algorithm
2.3 A deterministic algorithm that uses indirect addressing
2.3.1 The degraded grid ...
A Perfect Hash Function Generator
gperf is a "software-tool generating-tool" designed to automate the generation of perfect hash functions. This paper describes the features, algorithms, and object-oriented design and implementation strategies incorporated in gperf. It also presents the results from an empirical comparison between gperf-generated recognizers and other popular techniques for reserved word lookup. gperf is distributed with the GNU libg++ library and is used to generate the keyword recognizers for the GNU C and GNU C++ compilers.
1 Introduction
Perfect hash functions are a time and space efficient implementation of static search sets, which are ADTs with operations like initialize, insert, and retrieve. Static search sets are common in system software applications. Typical static search sets include compiler and interpreter reserved words, assembler instruction mnemonics, and shell interpreter built-in commands. Search set elements are called keywords. Keywords are inserted into the set once, usually at c...
Spheres, Molecules, and Hidden Surface Removal
, 1996
We devise techniques to manipulate a collection of loosely interpenetrating spheres in three-dimensional space. Our study is motivated by the representation and manipulation of molecular configurations, modeled by a collection of spheres. We analyze the sphere model and point to its favorable properties that make it easier to manipulate than an arbitrary collection of spheres. For this special sphere model we present efficient algorithms for computing its union boundary and for hidden surface removal. The efficiency and practicality of our approach are demonstrated by experiments on actual molecule data.
The Snapshot Index, an I/O-Optimal Access Method for Timeslice Queries
 Information Systems, An International Journal
, 1995
We present an access method for timeslice queries that reconstructs a past state s(t) of a time-evolving collection of objects in O(log_b n + |s(t)|/b) I/Os, where |s(t)| denotes the size of the collection at time t, n is the total number of changes in the collection’s evolution, and b is the size of an I/O transfer. Changes include the addition, deletion, or attribute modification of objects; they are assumed to occur in increasing time order and always affect the most current state of the collection (thus our index supports transaction time). The space used is O(n/b), while the update processing is constant per change, i.e., independent of n. This is the first I/O-optimal access method for this problem using O(n/b) space and O(1) updating (in the expected amortized sense, due to the use of hashing). This performance is also achieved for interval intersection temporal queries. An advantage of our approach is that its performance can be tuned to match particular application needs (trading space for query time and vice versa). In addition, the Snapshot Index can naturally migrate data on a write-once optical medium while maintaining the same performance bounds.
Space Efficient Hash Tables With Worst Case Constant Access Time
 In STACS
, 2003
We generalize Cuckoo Hashing [23] to d-ary Cuckoo Hashing and show how this yields a simple hash table data structure that stores n elements in (1 + ε) n memory cells, for any constant ε > 0. Assuming uniform hashing, accessing or deleting table entries takes at most d = O(ln(1/ε)) probes and the expected amortized insertion time is constant. This is the first dictionary that has worst case constant access time and expected constant update time, works with (1 + ε) n space, and supports satellite information. Experiments indicate that d = 4 choices suffice for ε ≈ 0.03. We also describe variants of the data structure that allow the use of hash functions that can be evaluated in constant time.
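The following is a minimal sketch of a d-ary cuckoo hash table, intended only to show why lookups need at most d probes. Several choices here are assumptions rather than details from the abstract: the class name `DAryCuckooTable`, salted use of Python's built-in `hash` in place of uniform hashing, random-walk eviction on insertion, and a fixed kick limit instead of rehashing.

```python
import random


class DAryCuckooTable:
    """Each key may live in one of d candidate slots of a single array."""

    def __init__(self, capacity, d=4, seed=0, max_kicks=500):
        self.slots = [None] * capacity
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        # One salt per hash function (stand-in for uniform hashing).
        self.salts = [self.rng.getrandbits(64) for _ in range(d)]

    def _positions(self, key):
        return [hash((salt, key)) % len(self.slots) for salt in self.salts]

    def get(self, key, default=None):
        # Worst case: exactly d probes, regardless of table size.
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                self.slots[pos] = (key, value)  # overwrite existing key
                return
        entry = (key, value)
        for _ in range(self.max_kicks):
            positions = self._positions(entry[0])
            for pos in positions:
                if self.slots[pos] is None:
                    self.slots[pos] = entry
                    return
            # All d candidates are full: evict a random occupant and
            # continue by re-inserting the evicted entry.
            pos = self.rng.choice(positions)
            entry, self.slots[pos] = self.slots[pos], entry
        # Simplification: the entry held at this point is dropped; a real
        # table would rehash or grow instead of giving up.
        raise RuntimeError("insertion failed after max_kicks evictions")


table = DAryCuckooTable(capacity=200)
for k in range(100):
    table.put(k, k * k)
print(table.get(7), table.get(1000))
```

The example runs at 50% load to keep the sketch far from its failure case; the space bound of (1 + ε) n cells quoted in the abstract concerns much higher loads than this toy setup exercises.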
Min-wise independent permutations (extended abstract)
 In STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory of computing
, 1998
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of