Results 11-20 of 2,170
Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching
2002
Cited by 557 (12 self)
Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the ‘accuracy’ of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.
The Capacity of Low-Density Parity-Check Codes Under Message-Passing Decoding
2001
Cited by 547 (9 self)
In this paper, we present a general method for determining the capacity of low-density parity-check (LDPC) codes under message-passing decoding when used over any binary-input memoryless channel with discrete or continuous output alphabets. Transmitting at rates below this capacity, a randomly chosen element of the given ensemble will achieve an arbitrarily small target probability of error with a probability that approaches one exponentially fast in the length of the code. (By concatenating with an appropriate outer code one can achieve a probability of error that approaches zero exponentially fast in the length of the code with arbitrarily small loss in rate.) Conversely, transmitting at rates above this capacity the probability of error is bounded away from zero by a strictly positive constant which is independent of the length of the code and of the number of iterations performed. Our results are based on the observation that the concentration of the performance of the decoder around its average performance, as observed by Luby et al. [1] in the case of a binary-symmetric channel and a binary message-passing algorithm, is a general phenomenon. For the particularly important case of belief-propagation decoders, we provide an effective algorithm to determine the corresponding capacity to any desired degree of accuracy. The ideas presented in this paper are broadly applicable and extensions of the general method to low-density parity-check codes over larger alphabets, turbo codes, and other concatenated coding schemes are outlined.
Gossip-Based Computation of Aggregate Information
2003
Cited by 439 (2 self)
[...] between computers, and a resulting paradigm shift from centralized to highly distributed systems. With massive scale also comes massive instability, as node and link failures become the norm rather than the exception. For such highly volatile systems, decentralized gossip-based protocols are emerging as an approach to maintaining simplicity and scalability while achieving fault-tolerant information dissemination.
A Polynomial-Time Approximation Algorithm for the Permanent of a Matrix with Non-Negative Entries
Journal of the ACM, 2004
Cited by 424 (24 self)
We present a polynomial-time randomized algorithm for estimating the permanent of an arbitrary n × n matrix with nonnegative entries. This algorithm (technically a “fully-polynomial randomized approximation scheme”) computes an approximation that is, with high probability, within arbitrarily small specified relative error of the true value of the permanent. Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical
An improved data stream summary: The Count-Min sketch and its applications
J. Algorithms, 2004
Cited by 407 (44 self)
We introduce a new sublinear space data structure, the Count-Min Sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε² to 1/ε in factor.
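A point query on such a sketch can be illustrated with a small Python example. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `CountMinSketch`, the parameter names `epsilon` and `delta`, and the use of salted BLAKE2b in place of a pairwise-independent hash family are all choices made here for illustration. The table uses ⌈e/ε⌉ columns and ⌈ln(1/δ)⌉ rows, the standard sizing for this structure.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative Count-Min sketch for non-negative counts (names are hypothetical)."""

    def __init__(self, epsilon, delta):
        # Standard sizing: width controls the additive error, depth the failure probability.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # Assumption: a salted cryptographic hash stands in for one
        # pairwise-independent hash function per row.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            salt=row.to_bytes(8, "little"),
            digest_size=8,
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Point query: the minimum over rows never underestimates the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for word in ["x"] * 40 + ["y"] * 5 + ["z"]:
    sketch.add(word)
print(sketch.estimate("x"))  # at least 40; collisions can only inflate it
```

Taking the minimum across rows is what makes the estimate one-sided: every row's counter includes the item's true count plus whatever collided with it, so the smallest counter is the least contaminated.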
Approximate Frequency Counts over Data Streams
VLDB, 2002
Cited by 392 (1 self)
We present algorithms for computing frequency counts exceeding a user-specified threshold over data streams. Our algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. Our algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring. We can also handle streams of variable sized sets of items exemplified by a sequence of market basket transactions at a retail store. For such streams, we describe an optimized implementation to compute frequent itemsets in a single pass.
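The idea of bounded-error frequency counting over singleton items can be sketched in Python. The code below follows the bucket-based scheme commonly known as lossy counting; treating it as representative of the algorithms the abstract refers to is an assumption, and the function names `lossy_count` and `frequent_items` and the parameters `epsilon` and `support` are hypothetical names chosen for this example.

```python
import math


def lossy_count(stream, epsilon):
    """Single pass over `stream`; stored counts undercount by at most epsilon * n."""
    width = math.ceil(1 / epsilon)  # bucket width
    counts = {}                     # item -> [frequency, max possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = (n + width - 1) // width  # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            stale = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for key in stale:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Report items whose stored frequency is at least (support - epsilon) * n."""
    return {k: f for k, (f, d) in counts.items() if f >= (support - epsilon) * n}


stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "f"] * 10
counts, n = lossy_count(stream, epsilon=0.1)
print(frequent_items(counts, n, support=0.4, epsilon=0.1))  # {'a': 50}
```

In this run the rare items are evicted at every bucket boundary, so only "a" is ever held in memory for long, which is the source of the small memory footprint.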
Survey of clustering data mining techniques
2002
Cited by 371 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Efficient erasure correcting codes
IEEE Transactions on Information Theory, 2001
Cited by 352 (27 self)
We introduce a simple erasure recovery algorithm for codes derived from cascades of sparse bipartite graphs and analyze the algorithm by analyzing a corresponding discrete-time random process. As a result, we obtain a simple criterion involving the fractions of nodes of different degrees on both sides of the graph which is necessary and sufficient for the decoding process to finish successfully with high probability. By carefully designing these graphs we can construct for any given rate R and any given real number ε a family of linear codes of rate R which can be encoded in time proportional to ln(1/ε) times their block length n. Furthermore, a codeword can be recovered with high probability from a portion of its entries of length (1 + ε)Rn or more. The recovery algorithm also runs in time proportional to n ln(1/ε). Our algorithms have been implemented and work well in practice; various implementation issues are discussed. Index Terms: Erasure channel, large deviation analysis, low-density parity-check codes.
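Erasure recovery on a single sparse bipartite graph can be illustrated with a small peeling decoder in Python. This is a simplified sketch and not the authors' implementation: it assumes one layer rather than a cascade, assumes all check values are already known, and rescans every check on each pass instead of using the bookkeeping needed for linear running time. The function name `peel_decode` and its argument layout are hypothetical.

```python
def peel_decode(checks, check_values, known):
    """Recover erased message bits on one bipartite layer.

    checks:       list of lists; checks[j] holds the message indices adjacent
                  to check node j
    check_values: check_values[j] is the XOR of the message bits in checks[j]
    known:        dict mapping message index -> bit for the unerased positions
    """
    known = dict(known)  # do not mutate the caller's dictionary
    progress = True
    while progress:
        progress = False
        for neighbours, value in zip(checks, check_values):
            missing = [i for i in neighbours if i not in known]
            if len(missing) == 1:
                # Exactly one unknown neighbour: it equals the check value
                # XORed with all known neighbours.
                acc = value
                for i in neighbours:
                    if i in known:
                        acc ^= known[i]
                known[missing[0]] = acc
                progress = True
    return known


# Message bits 1, 0, 1, 1 with positions 1..3 erased.
checks = [[0, 1], [1, 2], [2, 3]]
check_values = [1, 1, 0]
print(peel_decode(checks, check_values, {0: 1}))  # {0: 1, 1: 0, 2: 1, 3: 1}
```

Decoding stalls when every remaining check has two or more unknown neighbours, which is why the degree fractions on both sides of the graph determine whether the process finishes.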