Results 1 - 10 of 49
An Optimal Algorithm for the Distinct Elements Problem
Cited by 67 (7 self)
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, ..., n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε^{-2} + log(n)) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
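The remark that the 2/3 success probability "can be amplified by independent repetition" is usually realized by taking the median of several independent runs. The sketch below illustrates only that step. The names `amplify` and `make_noisy_estimator` are hypothetical, and the noisy estimator is a stand-in for a real streaming algorithm, not the algorithm of the paper.

```python
import random
import statistics


def amplify(run_once, repetitions):
    """Return the median of `repetitions` independent runs of `run_once`.

    If each run is accurate with probability 2/3, the median is inaccurate
    only when at least half of the runs fail, which becomes exponentially
    unlikely as `repetitions` grows.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be positive")
    return statistics.median(run_once() for _ in range(repetitions))


def make_noisy_estimator(true_count, eps, rng):
    """Hypothetical stand-in for a streaming estimator (an assumption).

    Each call is within (1 ± eps) of `true_count` with probability 2/3
    and is off by a factor of 10 otherwise.
    """
    def run_once():
        if rng.random() < 2 / 3:
            return true_count * (1 + rng.uniform(-eps, eps))
        return true_count * rng.choice([0.1, 10.0])
    return run_once


if __name__ == "__main__":
    rng = random.Random(0)
    estimator = make_noisy_estimator(true_count=1000, eps=0.1, rng=rng)
    print("single run:", estimator())
    print("median of 201 runs:", amplify(estimator, 201))
```

An odd number of repetitions keeps the median equal to one of the actual outputs rather than an average of two.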
Order statistics and estimating cardinalities of massive data sets
 2005 International Conference on Analysis of Algorithms, volume AD of DMTCS Proceedings, pages 157–166. Discrete Mathematics and Theoretical Computer Science
, 2005
Cited by 30 (3 self)
A new class of algorithms to estimate the cardinality of very large multisets using constant memory and doing only one pass on the data is introduced here. It is based on order statistics rather than on bit patterns in binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of 1/√M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple internal loop, which gives them an advantage in terms of processing speed. For instance, a memory of only 12 kB and only a few seconds are sufficient to process a multiset with several million elements and to build an estimate with accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experiments on real internet traffic.
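To make the idea of estimating cardinality from order statistics concrete, here is a minimal sketch of a k-minimum-values estimator: hash every element to a pseudo-uniform number in [0, 1), keep the k smallest distinct hash values, and estimate the cardinality as (k - 1) divided by the k-th smallest. This is a generic textbook variant and is not claimed to be one of the three estimator families analyzed in the paper. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import heapq


def hash_to_unit(item):
    """Map an item to a pseudo-uniform float in [0, 1) (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, k=256):
    """Estimate the number of distinct items from the k-th smallest hash value."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # hash values currently in the heap, so duplicates are ignored
    for item in stream:
        h = hash_to_unit(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Push the new hash and pop the largest retained hash in one step.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes seen: the count is exact
    return (k - 1) / -heap[0]    # -heap[0] is the k-th smallest hash value


if __name__ == "__main__":
    stream = [i % 20000 for i in range(60000)]  # 20000 distinct values, each seen 3 times
    print(estimate_cardinality(stream, k=1024))
```

Memory depends only on k, not on the length of the stream, and repeated elements never change the state because they hash to the same value.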
Probabilistic aggregation for data dissemination in VANETs
 In Proceedings of the fourth ACM international
, 2004
Cited by 28 (4 self)
We propose an algorithm for the hierarchical aggregation of observations in dissemination-based, distributed traffic information systems. Instead of carrying specific values (e.g., the number of free parking places in a given area), our aggregates contain a modified Flajolet-Martin sketch as a probabilistic approximation. The main advantage of this approach is that the aggregates are duplicate insensitive. This overcomes two central problems of existing aggregation schemes for VANET applications. First, when multiple aggregates of observations for the same area are available, it is possible to combine them into an aggregate containing all information from the original aggregates. This is fundamentally different from existing approaches, where typically one of the aggregates is selected for further use while the rest are discarded. Second, any observation or aggregate can be included in higher-level aggregates, regardless of whether it has already been added, directly or indirectly. As a result of these characteristics, the quality of the aggregates is high, while their construction is very flexible. We demonstrate these traits of our approach by a simulation study.
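The duplicate-insensitive combination described above follows from how a Flajolet-Martin sketch is built: each item sets one bit per bitmap, and two sketches are merged with a bitwise OR, so adding an observation twice changes nothing. The sketch below shows a plain Flajolet-Martin sketch with several bitmaps, not the modified variant used in the paper. The class name, the salted SHA-256 hashing, and the default of 32 bitmaps are assumptions.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def _rho(item, salt, bits=32):
    """Index of the least-significant 1 bit of a salted hash of `item`."""
    digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")
    if value == 0:
        return bits - 1
    return (value & -value).bit_length() - 1


def _lowest_zero(bitmap):
    """Index of the lowest bit that is still 0."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


class FMSketch:
    def __init__(self, num_bitmaps=32):
        self.bitmaps = [0] * num_bitmaps

    def add(self, item):
        # Each bitmap uses its own salt, i.e. an independent hash function.
        for i in range(len(self.bitmaps)):
            self.bitmaps[i] |= 1 << _rho(item, i)

    def merge(self, other):
        """Bitwise OR of two sketches: idempotent, commutative, associative."""
        if len(self.bitmaps) != len(other.bitmaps):
            raise ValueError("sketches must have the same number of bitmaps")
        merged = FMSketch(len(self.bitmaps))
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        mean_r = sum(_lowest_zero(b) for b in self.bitmaps) / len(self.bitmaps)
        return 2 ** mean_r / PHI
```

Because the merge is a bitwise OR, merging overlapping aggregates gives exactly the sketch that would have been built from the union of their observations.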
Scalable diversified ranking on large graphs
 In ICDM
, 2011
Cited by 13 (3 self)
Enhancing diversity in ranking on graphs has been identified as an important retrieval and mining task. Nevertheless, many existing diversified ranking algorithms do not scale to large graphs because of their high time or space complexity. In this paper, we propose a scalable algorithm to find the top-K diversified ranking list on graphs. The key idea of our algorithm is to first compute the PageRank of the nodes of the graph, and then run a carefully designed vertex selection algorithm to find the top-K diversified ranking list. Specifically, we first present a new diversified ranking measure that captures both relevance and diversity. Second, we prove the submodularity of the proposed measure. Third, we propose an efficient greedy algorithm, with time and space complexity linear in the size of the graph, that achieves near-optimal diversified ranking. Finally, we evaluate the proposed method through extensive experiments on four real networks. The experimental results indicate that the proposed method outperforms existing diversified ranking algorithms in both the diversity of the ranking and the efficiency of the algorithm.
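The greedy selection step can be illustrated with a small example. The abstract does not state the ranking measure, so the sketch below substitutes a hypothetical objective: the total relevance of the selected nodes and their neighbours. This objective is monotone and submodular, but it is an assumption rather than the measure proposed in the paper. The relevance scores stand in for precomputed PageRank values, and this naive loop re-evaluates the objective for every candidate, so it does not achieve the linear running time claimed for the paper's algorithm.

```python
def covered_relevance(selected, graph, relevance):
    """Hypothetical objective: total relevance of selected nodes plus their neighbours."""
    covered = set(selected)
    for node in selected:
        covered |= graph[node]
    return sum(relevance[n] for n in covered)


def greedy_diversified_top_k(graph, relevance, k):
    """Repeatedly pick the node with the largest marginal gain in the objective."""
    selected = []
    remaining = sorted(graph)  # sorted so that ties are broken deterministically
    for _ in range(min(k, len(remaining))):
        base = covered_relevance(selected, graph, relevance)
        best = max(
            remaining,
            key=lambda v: covered_relevance(selected + [v], graph, relevance) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy graph: a star (a-b, a-c) and a separate edge (d-e).
    graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": {"e"}, "e": {"d"}}
    # Integer scores standing in for PageRank values (illustrative only).
    relevance = {"a": 30, "b": 25, "c": 20, "d": 15, "e": 10}
    print(greedy_diversified_top_k(graph, relevance, 2))
```

In the toy graph, ranking by relevance alone would return `a` and `b`, which sit in the same neighbourhood. The greedy rule instead follows `a` with a node from the other component, because `b` adds nothing that `a` does not already cover.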
Impala: A modern, open-source SQL engine for Hadoop
 In Proc. CIDR’15
, 2015
Cited by 12 (0 self)
Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.
Information network or social network? The structure of the Twitter follow graph
 In Proc. Int. Conf. World Wide Web
Cited by 11 (2 self)
In this paper, we provide a characterization of the topological features of the Twitter follow graph, analyzing properties such as degree distributions, connected components, shortest path lengths, clustering coefficients, and degree assortativity. For each of these properties, we compare and contrast with available data from other social networks. These analyses provide a set of authoritative statistics that the community can reference. In addition, we use these data to investigate an often-posed question: Is Twitter a social network or an information network? The “follow” relationship in Twitter is primarily about information consumption, yet many follows are built on social ties. Not surprisingly, we find that the Twitter follow graph exhibits structural characteristics of both an information network and a social network. Going beyond descriptive characterizations, we hypothesize that from an individual user’s perspective, Twitter starts off more like an information network, but evolves to behave more like a social network. We provide preliminary evidence that may serve as a formal model of how a hybrid network like Twitter evolves.
A Survey of Distributed Data Aggregation Algorithms
 University of Minho
, 2011
Cited by 10 (1 self)
Distributed data aggregation is an important task, allowing the decentralized determination of meaningful global properties that can then be used to direct the execution of other applications. These values are obtained by the distributed computation of functions like count, sum and average. Example applications include determining the network size, total storage capacity, average load, majorities and many others. In the last decade, many different approaches have been proposed, with different trade-offs in terms of accuracy, reliability, message and time complexity. Due to the considerable number and variety of aggregation algorithms, it can be difficult and time consuming to determine which techniques are most appropriate in specific settings, which justifies a survey to aid in this task. This work reviews the state of the art on distributed data aggregation algorithms, providing three main contributions. First, it formally defines the concept of aggregation, characterizing the different types of aggregation functions. Second, it succinctly describes the main aggregation techniques, organizing them in a taxonomy. Finally, it provides some guidelines toward the selection and use of the most relevant techniques, summarizing their principal characteristics.
Extrema Propagation: Fast Distributed Estimation of Sums and Network Sizes
Cited by 10 (4 self)
Aggregation of data values plays an important role in distributed computations, in particular over peer-to-peer and sensor networks, as it can provide a summary of some global system property and direct the actions of self-adaptive distributed algorithms. Examples include using estimates of the network size to dimension distributed hash tables or estimates of the average system load to direct load balancing. Distributed aggregation using non-idempotent functions, like sums, is not trivial as it is not easy to prevent a given value from being accounted for multiple times; this is especially the case if no centralized algorithms or global identifiers can be used. This paper introduces a novel technique, Extrema Propagation, for distributed estimation of the sum of positive real numbers. It is more expressive than previous approaches as it encompasses summing naturals and counting. As an important special case, we show how it can be applied to network size estimation. The technique relies on the exchange of duplicate-insensitive messages and can be applied in flood and/or epidemic settings, where multipath routing occurs; it is tolerant of message loss; it is fast, as the number of message exchange steps can be made just slightly above the theoretical minimum; and it is fully distributed, with no single point of failure and the result produced at every node.
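One way to see how a sum can be estimated from duplicate-insensitive messages is the following sketch. Each node draws K exponential random numbers whose rate equals its own value, nodes exchange vectors and keep the pointwise minimum, and the sum is estimated from the resulting vector. The minimum of independent exponentials is exponential with rate equal to the sum of the rates, so (K - 1) divided by the sum of the K minima is an unbiased estimate of the total. The function names, K = 2000, and the sequential merge that stands in for network propagation are assumptions, not details taken from the paper.

```python
import random


def initial_vector(value, k, rng):
    """K exponential samples with rate `value` (the node's positive contribution)."""
    if value <= 0:
        raise ValueError("value must be positive")
    return [rng.expovariate(value) for _ in range(k)]


def merge(a, b):
    """Pointwise minimum: idempotent, so receiving a vector twice has no effect."""
    return [min(x, y) for x, y in zip(a, b)]


def estimate_sum(vector):
    """Each entry is Exp(sum of values), so (K - 1) / sum(entries) estimates the sum."""
    return (len(vector) - 1) / sum(vector)


if __name__ == "__main__":
    rng = random.Random(42)
    values = [1.0] * 100  # every node contributes 1, so the sum is the network size
    vectors = [initial_vector(v, 2000, rng) for v in values]
    result = vectors[0]
    for vec in vectors[1:]:  # stands in for flooding or gossip between nodes
        result = merge(result, vec)
    print(estimate_sum(result))
```

Setting every value to 1 turns the same procedure into a network size estimator, matching the special case mentioned in the abstract.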
When Distributed Computation is Communication Expensive
Cited by 9 (4 self)
We consider a number of fundamental statistical and graph problems in the message-passing model, where we have k machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the k data sets. The communication is point-to-point, and the goal is to minimize the total communication among the k machines. This model captures all point-to-point distributed computational models with respect to minimizing communication costs. Our analysis shows that exact computation of many statistical and graph problems in this distributed setting requires a prohibitively large amount of communication, and often one cannot improve upon the communication of the simple protocol in which all machines send their data to a centralized server. Thus, in order to obtain protocols that are communication-efficient, one has to allow approximation, or investigate the distribution or layout of the data sets.