Data Streams: Algorithms and Applications, 2005
Cited by 404 (24 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
Robust and fast similarity search for moving object trajectories. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005
Cited by 93 (14 self)
An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR), which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods.
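The abstract above does not spell out how EDR is computed. The following is a minimal Python sketch under our own assumptions: trajectories are sequences of 2-D points, two points "match" when both coordinates differ by at most a threshold `eps`, and a match costs 0 while a mismatch, insertion, or deletion costs 1. The function name `edr` and the parameter `eps` are illustrative and are not taken from the listing; consult the paper for the authors' exact definition.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def edr(r: Sequence[Point], s: Sequence[Point], eps: float) -> int:
    """Edit-distance-style score between two 2-D trajectories.

    Assumption: points match when both coordinates are within eps.
    """
    n, m = len(r), len(s)
    # prev[j] holds the score between the first i-1 points of r
    # and the first j points of s.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            match = (
                abs(r[i - 1][0] - s[j - 1][0]) <= eps
                and abs(r[i - 1][1] - s[j - 1][1]) <= eps
            )
            subcost = 0 if match else 1
            curr[j] = min(
                prev[j - 1] + subcost,  # match or substitute
                prev[j] + 1,            # skip a point of r
                curr[j - 1] + 1,        # skip a point of s
            )
        prev = curr
    return prev[m]
```

Because every mismatch costs the same regardless of how far apart the points are, a single wild outlier adds at most 1 to the score in this sketch, which is the kind of noise robustness the abstract describes.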
Low-Distortion Embeddings of Finite Metric Spaces. In Handbook of Discrete and Computational Geometry, 2004
Cited by 48 (0 self)
Introduction. An n-point metric space (X, D) can be represented by an n × n table specifying the distances. Such tables arise in many diverse areas. For example, consider the following scenario in microbiology: X is a collection of bacterial strains, and for every two strains, one is given their dissimilarity (computed, say, by comparing their DNA). It is difficult to see any structure in a large table of numbers, and so we would like to represent a given metric space in a more comprehensible way. For example, it would be very nice if we could assign to each x ∈ X a point f(x) in the plane in such a way that D(x, y) equals the Euclidean distance of f(x) and f(y). Such a representation would allow us to see the structure of the metric space: tight clusters, isolated points, and so on. Another advantage would be that the metric would now be represented by only 2n real numbers, the coordinates of the n points in the plane, instead of (n choose 2) numbers as before. Moreover, many quantities concern
Nonembeddability theorems via Fourier analysis
Cited by 43 (11 self)
Various new nonembeddability results (mainly into L1) are proved via Fourier analysis. In particular, it is shown that the Edit Distance on {0, 1}^d has L1 distortion (log d)^{1/2−o(1)}. We also give new lower bounds on the L1 distortion of flat tori, quotients of the discrete hypercube under group actions, and the transportation cost (Earthmover) metric.
Traffic aggregation for malware detection, 2008
Cited by 39 (3 self)
Stealthy malware, such as botnets and spyware, is hard to detect because its activities are subtle and do not disrupt the network, in contrast to DoS attacks and aggressive worms. Stealthy malware, however, does communicate to exfiltrate data to the attacker, to receive the attacker’s commands, or to carry out those commands. Moreover, since malware rarely infiltrates only a single host in a large enterprise, these communications should emerge from multiple hosts within coarse temporal proximity to one another. In this paper, we describe a system called TĀMD (pronounced “tamed”) with which an enterprise can identify candidate groups of infected computers within its network. TĀMD accomplishes this by finding new communication “aggregates” involving multiple internal hosts, i.e., communication flows that share common characteristics. We describe characteristics for defining aggregates (including flows that communicate with the same external network, that share similar payload, and/or that involve internal hosts with similar software platforms) and justify their use in finding infected hosts. We also detail efficient algorithms employed by TĀMD for identifying such aggregates, and demonstrate a particular configuration of TĀMD that identifies new infections for multiple bot and spyware examples, within traces of traffic recorded at the edge of a university network. This is achieved even when the number of infected hosts comprises only about 0.0097% of all internal hosts in the network.
Improved lower bounds for embeddings into L1. SIAM J. Comput., 2009
Cited by 32 (5 self)
We improve upon recent lower bounds on the minimum distortion of embedding certain finite metric spaces into L1. In particular, we show that for every n ≥ 1, there is an n-point metric space of negative type that requires a distortion of Ω(log log n) for such an embedding, implying the same lower bound on the integrality gap of a well-known semidefinite programming relaxation for sparsest cut. This result builds upon and improves the recent lower bound of (log log n)^{1/6−o(1)} due to Khot and Vishnoi [The unique games conjecture, integrality gap for cut problems and the embeddability of negative type metrics into ℓ1, in Proceedings of the 46th Annual IEEE Symposium
Approximating edit distance efficiently. In Proc. FOCS 2004, 2004
Cited by 27 (5 self)
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than ℓ, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the k vs. (kn)^{2/3} gap problem, using a constant size sketch. A more involved algorithm solves the stronger k vs. ℓ gap problem, where ℓ can be as small as O(k^2), still with a constant sketch, but works only for strings that are mildly “nonrepetitive”. Finally, we develop an n^{3/7}-approximation quasilinear-time algorithm for edit distance, improving the previous best factor of n^{3/4} [5]; if the input strings are assumed to be nonrepetitive, then the approximation factor can be strengthened to n^{1/3}.
Lower bounds for embedding edit distance into normed spaces. In Proc. SODA 2003, 2003
Cited by 26 (2 self)
MIT; S. Raskhodnikova, MIT. 1 Introduction. The edit distance (also called Levenshtein metric) between two strings is the minimum number of operations (insertions, deletions and character substitutions) needed to transform one string into another. This distance is of key importance in computational biology, as well as text processing and other areas. Algorithms for problems involving this metric have been extensively investigated. In particular, the quadratic-time dynamic programming algorithm for computing the edit distance between two strings is one of the most investigated and used algorithms in computational biology. Recently, a new approach to problems involving edit distance has been proposed. Its basic component is construction of a mapping f (called an embedding), which maps any string s into a vector f(s) ∈ …
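The dynamic programming algorithm mentioned in this abstract can be sketched as follows. This is a textbook-style illustration in Python rather than code from the paper; the function name `edit_distance` is ours, and unit cost is assumed for each insertion, deletion, and substitution.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming.

    Runs in O(len(a) * len(b)) time and O(len(b)) space.
    """
    # prev[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` returns 3: two substitutions and one insertion.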
Minimum common string partition problem: Hardness and approximations. In Proc. ISAAC 2004, volume 3341 of LNCS, 2004
Cited by 23 (5 self)
String comparison is a fundamental problem in computer science, with applications in areas such as computational biology, text processing or compression. In this paper we address the minimum common string partition problem, a string comparison problem with tight connection to the problem of sorting by reversals with duplicates, a key problem in genome rearrangement. A partition of a string A is a sequence P = (P1, P2, ..., Pm) of strings, called the blocks, whose concatenation is equal to A. Given a partition P of a string A and a partition Q of a string B, we say that the pair ⟨P, Q⟩ is a common partition of A and B if Q is a permutation of P. The minimum common string partition problem (MCSP) is to find a common partition of two strings A and B with the minimum number of blocks. The restricted version of MCSP where each letter occurs at most k times in each input string is denoted by k-MCSP. In this paper, we show that 2-MCSP (and therefore MCSP) is NP-hard and, moreover, even APX-hard. We describe a 1.1037-approximation for 2-MCSP and a linear time 4-approximation algorithm for 3-MCSP. We are not aware of any better approximations.
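The definitions in this abstract can be made concrete with a small Python sketch. It is an illustration only: the function names are ours, the brute-force search is exponential and only suitable for very short strings, and it is not the approximation algorithm from the paper.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional


def is_common_partition(p: List[str], q: List[str], a: str, b: str) -> bool:
    """True if p partitions a, q partitions b, and q is a permutation of p."""
    return "".join(p) == a and "".join(q) == b and Counter(p) == Counter(q)


def _can_tile(b: str, blocks: Counter) -> bool:
    """Backtracking check: can b be written as a concatenation that uses
    every remaining block exactly once? Assumes total block length == len(b)."""
    if not b:
        return True
    for blk in list(blocks):
        if blocks[blk] > 0 and b.startswith(blk):
            blocks[blk] -= 1
            ok = _can_tile(b[len(blk):], blocks)
            blocks[blk] += 1
            if ok:
                return True
    return False


def min_common_partition_size(a: str, b: str) -> Optional[int]:
    """Smallest number of blocks in a common partition, or None if the
    strings do not contain the same letters with the same multiplicities."""
    if Counter(a) != Counter(b):
        return None
    n = len(a)
    if n == 0:
        return 0
    for k in range(1, n + 1):
        # Choose k-1 cut positions inside a, giving k non-empty blocks.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0,) + cuts + (n,)
            blocks = Counter(a[i:j] for i, j in zip(bounds, bounds[1:]))
            if _can_tile(b, blocks):
                return k
    return None
```

For A = "abcab" and B = "ababc", the blocks ("abc", "ab") and ("ab", "abc") form a common partition, and no single-block partition exists, so the minimum is 2.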
Low distortion embeddings for edit distance. In Proceedings of the Symposium on Theory of Computing, 2005
Cited by 22 (1 self)
We show that {0, 1}^d endowed with edit distance embeds into ℓ1 with distortion 2^{O(√(log d log log d))}. We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, and nearest neighbor search. For all these problems, we improve upon previous bounds.