Robust and efficient fuzzy match for online data cleaning
In SIGMOD, 2003
Cited by 130 (6 self)
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
Finding interesting associations without support pruning
In ICDE, 2000
Cited by 111 (13 self)
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
Counting Distinct Elements in a Data Stream
2002
Cited by 111 (4 self)
We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± epsilon. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs.
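As a rough illustration of what a (1 ± epsilon) distinct-count estimate can look like, here is a minimal k-minimum-values sketch. It is a generic textbook-style estimator and is not claimed to be any of the three algorithms in the paper; the names `_hash01` and `estimate_distinct`, the use of SHA-1 as the hash, and the default `k` are all assumptions made for this example.

```python
import hashlib


def _hash01(item: str) -> float:
    """Map an item to a pseudo-random point in [0, 1) (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k: int = 256) -> float:
    """Estimate the number of distinct items using only the k smallest hash values."""
    smallest = set()  # holds at most k distinct hash values
    for item in stream:
        h = _hash01(item)
        if h in smallest:
            continue  # repeated item (same hash), nothing to update
        if len(smallest) < k:
            smallest.add(h)
        else:
            largest = max(smallest)  # O(k) scan; a heap would be faster
            if h < largest:
                smallest.remove(largest)
                smallest.add(h)
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct values seen: exact count
    # If n hashes are uniform on [0, 1), the k-th smallest is about k / n.
    return (k - 1) / max(smallest)
```

The memory used is O(k) regardless of stream length, and the relative error shrinks roughly as 1/sqrt(k), which is one simple instance of the time/space tradeoff mentioned in the abstract.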
ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs
International Conference on Knowledge Discovery and Data Mining, 2002
Cited by 73 (15 self)
Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graph-represented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memory-efficient approach for approximating the complete "neighbourhood function" for a graph. For the Internet graph (268K nodes), ANF's highly-accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drill-down tasks and to repeatedly answer questions about changing data sources. To enable this drill-down, ANF employs new techniques for approximating neighbourhood-type functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF's approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk resident graphs. Finally, we present some of our results from mining large graphs using ANF.
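For orientation, the quantity that ANF approximates can be computed exactly on small graphs with one breadth-first search per node. The sketch below is that exact baseline, not ANF itself. It assumes N(h) counts ordered pairs (u, v), including u = v, with distance at most h; the function name and the adjacency-dictionary input format are hypothetical.

```python
from collections import deque


def neighbourhood_function(adj, max_h):
    """Return [N(0), ..., N(max_h)] where N(h) counts ordered pairs within distance h.

    `adj` maps every node to a list of its neighbours (assumed input format).
    """
    at_distance = [0] * (max_h + 1)  # at_distance[d] = pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_h:
                continue  # do not expand beyond the requested radius
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            at_distance[d] += 1
    # Prefix sums turn "exactly d" counts into "at most h" counts.
    result, total = [], 0
    for count in at_distance:
        total += count
        result.append(total)
    return result
```

This costs one BFS per node, which is the kind of exact computation the abstract reports as taking nearly a day on the Internet graph.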
Counting Twig Matches in a Tree
2001
Cited by 61 (2 self)
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e., small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: Set Hashing, used to estimate the number of matches of each query twiglet, and Maximal Overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML...
On approximating the depth and related problems
In Proc. 16th ACM-SIAM Sympos. Discrete Algorithms, 2005
Cited by 54 (10 self)
We study the question of finding a deepest point in an arrangement of regions, and provide a fast algorithm for this problem using random sampling, showing it sufficient to solve this problem when the deepest point is shallow. This implies, among other results, a fast algorithm for solving linear programming with violations approximately. We also use this technique to approximate the disk covering the largest number of red points, while avoiding all the blue points, given two such sets in the plane. Similar techniques imply that approximate range counting queries have roughly the same time
A Small Approximately Min-Wise Independent Family of Hash Functions
Journal of Algorithms, 1999
Cited by 48 (2 self)
In this paper we give a construction of a small approximately min-wise independent family of hash functions. The number of bits needed to represent each function is $O(\log n \cdot \log 1/\epsilon)$. This construction gives a solution to the main open problem of [2].
1 Introduction
A family of functions $H \subset [n] \to [n]$ (where $[n] = \{0, \ldots, n-1\}$) is called $\epsilon$-min-wise independent if for any $X \subset [n]$ and $x \in [n] - X$ we have $\Pr_{h \in H}[h(x) < \min h(X)] = \frac{1}{|X|+1}(1 \pm \epsilon)$. This definition can be generalized to the case when $|X|$ is restricted to be smaller than a prespecified bound $s$. Such families (restricted to the case when all functions from $H$ are permutations) were introduced and investigated in [2] and independently earlier in [6] (cf. [7]). The motivation for studying such families is to reduce the amount of randomness used by algorithms [6, 2, 3]. In particular (as pointed out in [2]) they have immediate application to efficient detection of similar documents in large...
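The similar-document application mentioned above rests on a standard fact: under an exactly min-wise independent family, two sets A and B share the same minimum hash with probability |A ∩ B| / |A ∪ B|. The sketch below illustrates that estimator. It does not implement the paper's small family; salted BLAKE2b digests are used here as an assumed stand-in for random hash functions, and the function names and `num_hashes` default are hypothetical.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Return one minimum hash value per salted hash function.

    `items` must be a non-empty collection; each salt selects a different
    (assumed near-random) hash function.
    """
    signature = []
    for i in range(num_hashes):
        salt = i.to_bytes(8, "big")
        signature.append(
            min(
                hashlib.blake2b(
                    str(x).encode("utf-8"), digest_size=8, salt=salt
                ).digest()
                for x in items
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two minima agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

With an ε-min-wise family the agreement probability is only guaranteed up to a (1 ± ε) factor, which is why the size of the family and the achievable ε are traded against each other.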
Towards scaling fully personalized PageRank
In Proceedings of the 3rd Workshop on Algorithms and Models for the Web-Graph (WAW), 2004
Cited by 45 (2 self)
Personalized PageRank expresses backlink-based page quality around user-selected pages in a similar way as PageRank expresses quality over the entire Web. Existing personalized PageRank algorithms can however serve on-line queries only for a restricted choice of page selection. In this paper we achieve full personalization by a novel algorithm that computes a compact database of simulated random walks; this database can serve arbitrary personal choices of small subsets of web pages. We prove that for a fixed error probability, the size of our database is linear in the number of web pages. We justify our estimation approach by asymptotic worst-case lower bounds; we show that exact personalized PageRank values can only be obtained from a database of quadratic size.
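The idea of estimating personalized PageRank from simulated random walks can be sketched with a plain Monte Carlo estimator: start many walks at the chosen page, stop each walk with a fixed probability at every step, and record where it ends. This is only an illustrative sketch under assumptions, not the paper's database construction; the function name, the restart probability of 0.15, the seed, and the convention of stopping at dangling pages are all hypothetical choices.

```python
import random
from collections import Counter


def personalized_pagerank(adj, source, num_walks=10000, restart=0.15, seed=0):
    """Estimate personalized PageRank of `source` from walk endpoints.

    `adj` maps each page to a list of pages it links to (assumed format).
    """
    rng = random.Random(seed)
    endpoints = Counter()
    for _ in range(num_walks):
        node = source
        # At each step the walk stops with probability `restart`.
        while rng.random() >= restart:
            neighbours = adj.get(node)
            if not neighbours:
                break  # dangling page: end the walk here (assumed convention)
            node = rng.choice(neighbours)
        endpoints[node] += 1
    return {page: count / num_walks for page, count in endpoints.items()}
```

On a two-page cycle with restart probability c, a walk from page 0 ends at page 0 with probability c / (1 - (1 - c)^2), which is about 0.54 for c = 0.15 and gives a quick sanity check for the estimator.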
New Streaming Algorithms for Fast Detection of Superspreaders
In Proceedings of Network and Distributed System Security Symposium (NDSS), 2005
Cited by 45 (2 self)
High-speed monitoring of Internet traffic is an important and challenging problem, with applications to real-time attack detection and mitigation, traffic engineering, etc. However, packet-level monitoring requires fast streaming algorithms that use very little memory and little communication among collaborating network monitoring points. In this paper, we consider the problem of detecting superspreaders, which are sources that connect to a large number of distinct destinations. We propose new streaming algorithms for detecting superspreaders and prove guarantees on their accuracy and memory requirements. We also show experimental results on real network traces. Our algorithms are substantially more efficient (both theoretically and experimentally) than previous approaches. We also extend our algorithms to identify superspreaders in a distributed setting, with sliding windows, and when deletions are allowed in the stream (which lets us identify sources that make a large number of failed connections to distinct destinations). More generally, our algorithms are applicable to any problem that can be formulated as follows: given a stream of (x, y) pairs, find all the x’s that are paired with a large number of distinct y’s. We call this the heavy distinct-hitters problem. There are many network security applications of this general problem. This paper discusses these applications and, for concreteness, focuses on the superspreader problem.
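To make the heavy distinct-hitters formulation concrete, the following is an exact, memory-hungry baseline that simply remembers every distinct (x, y) pair. It is not one of the paper's streaming algorithms, which are designed to avoid exactly this memory cost; the function name and the `threshold` parameter are assumptions for illustration.

```python
from collections import defaultdict


def heavy_distinct_hitters(pairs, threshold):
    """Return every x paired with at least `threshold` distinct y values.

    Exact baseline: memory grows with the number of distinct (x, y) pairs.
    """
    seen = defaultdict(set)  # x -> set of distinct y values observed with x
    for x, y in pairs:
        seen[x].add(y)
    return {x for x, ys in seen.items() if len(ys) >= threshold}
```

In the superspreader setting, x would be a source address and y a destination address, so repeated connections to the same destination do not raise a source's count.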
A Fully Dynamic Algorithm for Maintaining the Transitive Closure
In Proc. 31st ACM Symposium on Theory of Computing (STOC'99), 1999
Cited by 31 (1 self)
This paper presents an efficient fully dynamic graph algorithm for maintaining the transitive closure of a directed graph. The algorithm updates the adjacency matrix of the transitive closure with each update to the graph. Hence, each reachability query of the form "Is there a directed path from i to j?" can be answered in O(1) time. The algorithm is randomized; it is correct when answering yes, but has O(1/n^c) probability of error when answering no, for any constant c. In acyclic graphs, worst case update time is O(n^2). In general graphs, update time is O(n^(2+alpha)), where alpha = min {.26, maximum size of a strongly connected component}. The space complexity of the algorithm is O(n^2).
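The O(1) query time follows from storing the closure as a matrix, so a reachability query is a single lookup. The sketch below builds that matrix once with the classical static Warshall procedure. It is not the paper's randomized dynamic update algorithm; the function name is hypothetical, and treating every node as reaching itself is an assumed convention.

```python
def transitive_closure(n, edges):
    """Return an n x n boolean matrix with reach[i][j] True iff j is reachable from i.

    Nodes are 0..n-1; each node is treated as reaching itself (assumed convention).
    """
    reach = [[i == j for j in range(n)] for i in range(n)]
    for i, j in edges:
        reach[i][j] = True
    # Warshall's algorithm: allow node k as an intermediate stop, for each k in turn.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach
```

Recomputing this from scratch after each edge change costs O(n^3) time, which is the cost a fully dynamic algorithm aims to beat, while the O(n^2) space for the matrix matches the space bound stated in the abstract.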

