An O(ND) Difference Algorithm and Its Variations
- Algorithmica, 1986
Cited by 212 (4 self)
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D²) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D²) time variation.
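The greedy search over edit-graph diagonals described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's own code: the function name `shortest_edit_distance` and the array layout are assumptions made here, and the sketch returns only the size D of a shortest insert/delete script, without the linear-space refinement or the suffix-tree variation.

```python
def shortest_edit_distance(a, b):
    """Return D, the number of insertions plus deletions in a shortest
    edit script turning sequence a into sequence b.

    Hypothetical helper written for illustration only.
    """
    n, m = len(a), len(b)
    max_d = n + m
    offset = max_d
    # v[offset + k] holds the furthest x reached so far on diagonal k = x - y.
    v = [0] * (2 * max_d + 2)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # Extend from the neighbouring diagonal that has reached further.
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]        # vertical move: insert from b
            else:
                x = v[offset + k - 1] + 1    # horizontal move: delete from a
            y = x - k
            # Follow the "snake" of matching elements at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[offset + k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

Because a longest common subsequence and a shortest edit script are dual, the LCS length can be recovered from the result as (N - D) / 2, where N = len(a) + len(b).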
A content-driven reputation system for the Wikipedia
- In Proceedings of the 16th International World Wide Web Conference, 2007
Cited by 168 (11 self)
On-line forums for the collaborative creation of bodies of information are a phenomenon of rising importance; the Wikipedia is one of the best-known examples. The open nature of such forums could benefit from a notion of reputation for its authors. Author reputation could be used to flag new contributions from low-reputation authors, and it could be used to allow only authors with good reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia would also provide an incentive to give high-quality contributions. We present in this paper a novel type of content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits and text additions they perform to Wikipedia articles are long-lived, and they lose reputation when their changes are undone in short order. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, and of being undone.
X-Diff: An Effective Change Detection Algorithm for XML Documents
Cited by 132 (0 self)
XML has become the de facto standard format for web publishing and data transportation. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work in change detection on XML, or other hierarchically structured documents, used an ordered tree model, in which left-to-right order among siblings is important and it can affect the change result. This paper argues that an unordered model (only ancestor relationships are significant) is more suitable for most database applications. Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate. This paper proposes X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. The algorithm is analyzed and compared with XyDiff [CAM02], a published XML diff algorithm. An experimental evaluation on both algorithms is provided.
Identifying Syntactic Differences Between Two Programs
- Software - Practice and Experience, 1991
Cited by 119 (0 self)
This paper is organized into five sections, as follows. The internal form of a program, which is a variant of a parse tree, is discussed in the next section. Then the tree-matching algorithm and the synchronous pretty-printing technique are described. Experience with the comparator for the C language and some performance measurements are also presented. The last section discusses related work and concludes this paper.
Clickstream Clustering Using Weighted Longest Common Subsequences
- In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001
Cited by 78 (4 self)
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, the knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on web logs of www.sulekha.com to illustrate the techniques.
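As a rough illustration of the building block this abstract relies on, the sketch below computes a plain (unweighted) longest common subsequence between two clickstreams and turns it into a similarity score. The function names, the page labels, and the normalization by the longer stream are all assumptions made for this example; the time-spent weighting that the paper describes is deliberately left out, so this is a baseline and not the paper's measure.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b,
    using the standard dynamic program with two rows of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer stream is an
    assumption of this sketch, not something the abstract specifies."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Hypothetical clickstreams: ordered lists of visited page labels.
user_1 = ["home", "news", "sports", "home"]
user_2 = ["home", "sports", "home"]
print(lcs_similarity(user_1, user_2))
```

A clustering step could then operate on the pairwise similarity matrix produced by `lcs_similarity`.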
MAPO: Mining API Usages from Open Source Repositories
- In MSR ’06: Proceedings of the 2006 international workshop on Mining software repositories, 2006
Cited by 70 (4 self)
To improve software productivity, when constructing new software systems, developers often reuse existing class libraries or frameworks by invoking their APIs. Those APIs, however, are often complex and not well documented, posing barriers for developers to use them in new client code. To get familiar with how those APIs are used, developers may search the Web using a general search engine to find relevant documents or code examples. Developers can also use a source code search engine to search open source repositories for source files that use the same APIs. Nevertheless, the number of returned source files is often large. It is difficult for developers to learn API usages from a large number of returned results. In order to help developers understand API usages and write API client code more effectively, we have developed an API usage mining framework and its supporting tool called MAPO (for Mining API usages from Open source repositories). Given a query that describes a method, class, or package for an API, MAPO leverages the existing source code search engines to gather relevant source files and conducts data mining. The mining leads to a short list of frequent API usages for developers to inspect. MAPO currently consists of five components: a code search engine, a source code analyzer, a sequence preprocessor, a frequent sequence miner, and a frequent sequence postprocessor. We have examined the effectiveness of MAPO using a set of various queries. The preliminary results show that the framework is practical for providing informative and succinct API usage patterns.
The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web
- 1997
Cited by 62 (3 self)
The AT&T Internet Difference Engine (AIDE) is a system that finds and displays changes to pages on the World Wide Web. The system consists of several components, including a web crawler that detects changes, an archive of past versions of pages, a tool called HtmlDiff to highlight changes between versions of a page, and a graphical interface to view the relationship between pages over time. This paper describes AIDE, with an emphasis on the evolution of the system and experiences with it. It also raises some sociological and legal issues.
Matching Planar Maps
Cited by 54 (14 self)
The subject of this paper is algorithms for measuring the similarity of patterns of line segments in the plane, a standard problem in, e.g., computer vision and geographic information systems. More precisely, we will define feasible distance measures that reflect how close a given pattern H is to some part of a larger pattern G. These distance measures are generalizations of the well-known Fréchet distance for curves. We will first give an efficient algorithm for the case that H is a polygonal curve and G is a geometric graph. Then, slightly relaxing the definition of distance measure, we will give an algorithm for the general case where both H and G are geometric graphs.
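For readers unfamiliar with the Fréchet distance that these measures generalize, the following sketch computes its discrete variant between two polygonal curves given as lists of vertices. Note the swap: the discrete variant only considers couplings of vertices and is used here as a simpler stand-in, so it is neither the continuous Fréchet distance nor the curve-to-graph measure the paper develops. The function name `discrete_frechet` is an assumption of this example.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def discrete_frechet(p, q):
    """Discrete Fréchet distance between non-empty polylines p and q,
    each a list of coordinate tuples. Illustrative helper only."""
    n, m = len(p), len(q)
    # ca[i][j] = smallest achievable "leash length" for a monotone
    # coupling of the prefixes p[0..i] and q[0..j].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]


# Two parallel horizontal segments one unit apart.
curve_h = [(0, 0), (1, 0), (2, 0)]
curve_g = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(curve_h, curve_g))
```

The table takes O(nm) time and space for curves with n and m vertices.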
Sparse dynamic programming I: Linear cost functions
- J. Assoc. Comp. Mach., 1992
Cited by 53 (3 self)
We consider dynamic programming solutions to a number of different recurrences for sequence comparison and for RNA secondary structure prediction. These recurrences are defined over a number of points that is quadratic in the input size; however only a sparse set matters for the result. We give efficient algorithms for these problems when the weight functions used in the recurrences are taken to be linear. Our algorithms reduce the best known bounds by a factor almost linear in the density of the problems: when the problems are sparse this results in a substantial speed-up.
Introduction: Sparsity is a phenomenon that has long been exploited for efficient algorithms. For instance, most of the best known graph algorithms take time bounded by a function of the number of actual edges in the graph, rather than the maximum possible number of edges. The algorithms we study in this paper perform various kinds of sequence analysis, which are typically solved by dynamic programming in a matrix indexed by positions in the input sequences. Only two such problems are already known to be solved by algorithms taking advantage of
Incremental String Comparison
- SIAM Journal on Computing, 1995
Cited by 49 (2 self)
The problem of comparing two sequences A and B to determine their LCS or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show with a series of applications that this algorithm is indeed more powerful than its non-incremental counterpart by solving the applications with greater asymptotic ef...