Probabilistic discovery of time series motifs
, 2003
Cited by 92 (19 self)
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
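The abstract does not spell out the algorithm, so the following is only a minimal sketch of one biosequence-style technique that fits the description: random projection with collision counting. It assumes the time series has already been cut into subsequences and discretized into equal-length symbolic words; the function name `random_projection_motif` and its parameters are hypothetical, not taken from the paper.

```python
import random
from collections import defaultdict
from itertools import combinations


def random_projection_motif(words, mask_size, iterations, seed=0):
    """Return the pair of word indices that collides most often, and its count.

    Assumption: `words` are equal-length symbolic strings obtained by
    discretizing subsequences of a time series (the discretization step
    itself is not shown here).
    """
    rng = random.Random(seed)
    length = len(words[0])
    collisions = defaultdict(int)
    for _ in range(iterations):
        # Pick a random subset of positions; all other positions are
        # ignored in this round, so noise there cannot prevent a match.
        mask = sorted(rng.sample(range(length), mask_size))
        buckets = defaultdict(list)
        for index, word in enumerate(words):
            key = "".join(word[p] for p in mask)
            buckets[key].append(index)
        # Every pair of words sharing a bucket counts as one collision.
        for bucket in buckets.values():
            for pair in combinations(bucket, 2):
                collisions[pair] += 1
    if not collisions:
        return None, 0
    best = max(collisions, key=collisions.get)
    return best, collisions[best]


words = ["abcabc", "cabcab", "abcabb", "bcabca"]
best_pair, count = random_projection_motif(words, mask_size=2, iterations=30)
```

Because each round yields a usable collision table, the loop can be stopped at any point and the current best pair reported, which is the sense in which such a procedure behaves like an anytime algorithm.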
Downloading textual hidden web content through keyword queries
- In JCDL
, 2005
Cited by 37 (1 self)
An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only “entry point” to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
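The iterative scheme described above (one new query per iteration, chosen from what has been downloaded so far) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the site is simulated by an in-memory dictionary, `search` and `crawl` are hypothetical names, and the selection rule (most frequent unused term) is a simple heuristic rather than the policies proposed in the paper.

```python
from collections import Counter


def search(site, keyword):
    """Simulated query interface: ids of documents that contain `keyword`."""
    return {doc_id for doc_id, text in site.items() if keyword in text.split()}


def crawl(site, seed, max_queries):
    """Issue a different query in every iteration and collect the results.

    Heuristic (an assumption, not the paper's policy): the next query is the
    term with the highest document frequency among downloaded documents that
    has not been issued yet.
    """
    downloaded = {}
    issued = []
    query = seed
    for _ in range(max_queries):
        issued.append(query)
        for doc_id in search(site, query):
            downloaded[doc_id] = site[doc_id]
        counts = Counter(
            word for text in downloaded.values() for word in set(text.split())
        )
        candidates = [(n, w) for w, n in counts.items() if w not in issued]
        if not candidates:
            break  # no unseen vocabulary left to query with
        # Highest document frequency first; ties broken alphabetically.
        query = min(candidates, key=lambda c: (-c[0], c[1]))[1]
    return downloaded, issued


site = {
    1: "apple banana",
    2: "banana cherry",
    3: "cherry date",
    4: "elder fig",
}
downloaded, issued = crawl(site, seed="apple", max_queries=10)
```

In this toy site, document 4 shares no vocabulary with the others, so no query derived from downloaded pages ever reaches it. That shows why the choice of queries determines how much of a site a crawler can cover.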
Image Compression by Linear Splines over Adaptive Triangulations
Cited by 22 (3 self)
This paper proposes a new method for image compression. The method is based on the approximation of an image, regarded as a function, by a linear spline over an adapted triangulation, D(Y), which is the Delaunay triangulation of a small set Y of significant pixels. The linear spline minimizes the distance to the image, measured by the mean square error, among all linear splines over D(Y). The significant pixels in Y are selected by an adaptive thinning algorithm, which recursively removes less significant pixels in a greedy way, using a sophisticated criterion for measuring the significance of a pixel. The proposed compression method combines the approximation scheme with a customized scattered data coding scheme. We demonstrate that our compression method outperforms JPEG2000 on two geometric images and performs competitively with JPEG2000 on three popular test cases of real images.
Minimization of Sequential Transducers
- Lecture Notes in Computer Science
Cited by 22 (9 self)
We present an algorithm for minimizing sequential transducers. This algorithm is shown to be efficient, since in the case of acyclic transducers it operates in $O(|E| + |V| + (|E| - |V| + |F|) \cdot (|P_{max}| + 1))$ steps, where E is the set of edges of the given transducer, V the set of its vertices, F the set of final states, and $P_{max}$ the longest of the greatest common prefixes of the output paths leaving each state of the transducer. It can be applied to a larger class of transducers which includes subsequential transducers.
1 Introduction
Finite automata and transducers are used in many efficient programs. They allow to produce in a very easy way lexical analyzers for complex languages. In some applications as in Natural Language Processing the involved finite-state machines can contain several hundreds of thousands of states. Reducing the size of these graphs without losing their recognition properties is then crucial. This problem has been solved in the case of deterministic autom...
Pruning policies for two-tiered inverted index with correctness guarantee
- In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
, 2007
Cited by 19 (0 self)
The Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast amounts of query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we may notice a significant degradation in the result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
Parameterized complexity of generalized vertex cover problems
- In Proc. 9th WADS, volume 3608 of LNCS
, 2005
Cited by 18 (2 self)
Abstract. Important generalizations of the Vertex Cover problem
Adaptive Server Selection for Large Scale Interactive Online Games
- ACM Int’l Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV)
, 2004
Cited by 16 (0 self)
In this paper, we present a novel distributed algorithm that dynamically selects game servers for a group of game clients participating in large scale interactive online games. The goal of server selection is to minimize server resource usage while satisfying the real-time delay constraint. We develop a synchronization delay model for interactive games and formulate the server selection problem, and prove that the considered problem is NP-hard. The proposed algorithm, called zoom-in-zoom-out, is adaptive to session dynamics (e.g. clients join and leave) and lets the clients select appropriate servers in a distributed manner such that the number of servers used by the game session is minimized. Using simulation, we present the performance of the proposed algorithm and show that it is simple yet effective in achieving its design goal. In particular, we show that the performance of our algorithm is comparable to that of a greedy selection algorithm, which requires global information and excessive computation.
Competitive collaborative learning
- In Proceedings of the 18th Annual Conference on Learning Theory (COLT)
, 2005
Cited by 16 (3 self)
Abstract. We develop algorithms for a community of users to make decisions about selecting products or resources, in a model characterized by two key features:
– The quality of the products or resources may vary over time.
– Some of the users in the system may be dishonest, manipulating their actions in a Byzantine manner to achieve other goals.
We formulate such learning tasks as an algorithmic problem based on the multi-armed bandit problem, but with a set of users (as opposed to a single user), of whom a constant fraction are honest and are partitioned into coalitions such that the users in a coalition perceive the same expected quality if they sample the same resource at the same time. Our main result exhibits an algorithm for this problem which converges in polylogarithmic time to a state in which the average regret (per honest user) is an arbitrarily small constant.
I/O-efficient batched union-find and its applications to terrain analysis
- In Proc. 22nd Annual Symposium on Computational Geometry
, 2006
Cited by 14 (8 self)
Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N union and find operations, where each union operation joins two distinct sets, our algorithm uses $O(\mathrm{SORT}(N)) = O\left(\frac{N}{B} \log_{M/B} \frac{N}{B}\right)$ I/Os, where M is the memory size and B is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses $O(\mathrm{SORT}(N) + \mathrm{MST}(N))$ I/Os, where MST(N) is the number of I/Os needed to compute the minimum spanning tree of a graph with N edges. We also describe a simple and practical $O(\mathrm{SORT}(N) \log(N/M))$-I/O algorithm for this problem, which we have implemented. We are interested in the union-find problem because of its applications in terrain analysis. A terrain can be abstracted as a height function defined over $\mathbb{R}^2$, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amount of elevation data is being generated that is too large to fit in memory, thus I/O-efficient algorithms are needed to process this data efficiently. In this paper, we study two terrain-analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first $O(\mathrm{SORT}(N))$-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with N vertices. Finally, we report some preliminary experimental results, showing that our algorithms give order-of-magnitude improvement over previous methods on large data sets that do not fit in memory.
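For readers unfamiliar with the problem, the following sketch shows what a batched union-find computation produces. It uses the classical in-memory structure (union by size with path compression), which is not the I/O-efficient algorithm of the paper; the function name `process_batch` and the tuple encoding of operations are assumptions made for this example.

```python
def process_batch(n, operations):
    """Answer a batch of union/find operations over elements 0..n-1.

    Classical in-memory union-find, shown only to illustrate the problem;
    it makes random memory accesses and is therefore not I/O-efficient.
    Operations are ("union", a, b) or ("find", a); the result lists the
    set representative returned by each find, in order.
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    results = []
    for op in operations:
        if op[0] == "union":
            ra, rb = find(op[1]), find(op[2])
            if ra != rb:
                # Union by size: attach the smaller tree under the larger.
                if size[ra] < size[rb]:
                    ra, rb = rb, ra
                parent[rb] = ra
                size[ra] += size[rb]
        else:
            results.append(find(op[1]))
    return results


ops = [
    ("union", 0, 1), ("union", 2, 3),
    ("find", 0), ("find", 1), ("find", 2),
    ("union", 1, 3),
    ("find", 0), ("find", 2),
]
answers = process_batch(4, ops)
```

Two elements belong to the same set exactly when their finds return the same representative, so only equality between answers is meaningful, not the particular representative chosen.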
Syntactic analysis by local grammars and automata: an efficient algorithm
- In Proceedings of the International Conference on Computational Lexicography (COMPLEX 94)
, 1994
"... address: ..."

