Results 1 – 10 of 17
On the Evolution of Clusters of Near-Duplicate Web Pages
In 1st Latin American Web Congress, 2003
Abstract

Cited by 54 (3 self)
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
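The abstract reports results but not the detection method; a common technique for large-scale near-duplicate detection is shingle-based resemblance, in the spirit of the 1997 study this paper extends. The sketch below is illustrative only: the word-level shingles, the choice k = 4, and the function names are assumptions, and production systems compare small sketches of hashed shingles rather than full shingle sets.

```python
def shingles(text, k=4):
    """Return the set of word-level k-shingles (contiguous k-grams).

    If the document has fewer than k words, a single short shingle is
    produced so the set is never empty.
    """
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets; values near 1.0
    indicate near-duplicate documents."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

A crawler would then cluster pages whose pairwise resemblance exceeds some fixed threshold.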
Modality in Dialogue: Planning, Pragmatics and Computation
1998
Abstract

Cited by 36 (9 self)
Natural language generation (NLG) is first and foremost a reasoning task. In this reasoning, a system plans a communicative act that will signal key facts about the domain to the hearer. In generating action descriptions, this reasoning draws on characterizations both of the causal properties of the domain and the states of knowledge of the participants in the conversation. This dissertation shows how such characterizations can be specified declaratively and accessed efficiently in NLG. The heart of this dissertation is a study of logical statements about knowledge and action in modal logic. By investigating the proof theory of modal logic from a logic programming point of view, I show how many kinds of modal statements can be seen as straightforward instructions for computationally manageable search, just as Prolog clauses can. These modal statements provide sufficient expressive resources for an NLG system to represent the effects of actions in the world or to model an addressee whose knowledge in some respects exceeds and in other respects falls short of its own. To illustrate the use of such statements, I describe how the SPUD sentence planner exploits a modal knowledge base to ...
Local Search Genetic Algorithm for Optimization of Highly Reliable Communications Networks
IEEE Transactions on Evolutionary Computation, 1997
Abstract

Cited by 35 (4 self)
This paper presents a genetic algorithm (GA) with specialized encoding, initialization and local search genetic operators to optimize communication network topologies. This NP-hard problem is often highly constrained, so that random initialization and standard genetic operators usually generate infeasible network architectures. Compounding this infeasibility issue is that the fitness function involves calculating the all-terminal reliability of the network, a calculation which is computationally expensive. Therefore, it is imperative that the search balance the need to thoroughly explore the boundary between feasible and infeasible networks with the need to calculate fitness on only the most promising candidate networks. The algorithm results are compared to optimum results found by branch and bound, and also to GA results without local search operators, on a suite of 79 test problems. This GA strategy of employing bounds, simple heuristic checks and problem-specific rep...
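As a rough illustration of the strategy described above (a GA whose offspring are improved by a local-search step), here is a generic loop. Everything problem-specific is a placeholder assumption: the paper's network encoding, specialized operators, and expensive all-terminal reliability fitness are all elided, and the operator signatures are invented for the sketch.

```python
import random

def genetic_search(init_pop, fitness, crossover, mutate, local_search,
                   generations=100):
    """Generic elitist GA loop with a local-search (memetic) step.

    `fitness` scores an individual; `crossover`, `mutate`, `local_search`
    are problem-specific callables supplied by the user. The top half of
    each generation survives unchanged, so the best individual is never lost.
    """
    pop = [local_search(ind) for ind in init_pop]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: len(pop) // 2]
        children = []
        while len(elite) + len(children) < len(pop):
            a, b = random.sample(elite, 2)
            child = mutate(crossover(a, b))
            children.append(local_search(child))  # repair/improve offspring
        pop = elite + children
    return max(pop, key=fitness)
```

In the paper's setting, `local_search` would also repair infeasible topologies before the costly reliability evaluation is attempted.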
Matching Algorithms within a Duplicate Detection System
Bulletin of the Technical Committee on Data Engineering, 2000
Abstract

Cited by 20 (0 self)
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases (such as what happens in data warehousing, where records from multiple data sources are integrated into a single source of information), among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pairwise record matching algorithm must have in order to yield a successful duplicate detection system.
Simple generation of static single-assignment form
In Proceedings of the 9th International Conference on Compiler Construction, 2000
Abstract

Cited by 14 (0 self)
© Springer-Verlag.
Abstract. The static single-assignment (SSA) form of a program provides data flow information in a form which makes some compiler optimizations easy to perform. In this paper we present a new, simple method for converting to SSA form, which produces correct solutions for non-reducible control-flow graphs, and produces minimal solutions for reducible ones. Our timing results show that, despite its simplicity, our algorithm is competitive with more established techniques.
1 Introduction
The static single-assignment (SSA) form is a program representation in which variables are split into "instances." Every new assignment to a variable (or, more generally, every new definition of a variable) results in a new instance. The variable instances are numbered so that each use of a variable may be easily linked back to a single definition point. Figure 1 gives an example of SSA form for some straight-line code. As its name suggests, SSA only reflects static properties; in the example, V1's value is a dynamic property, but the static property that all instances labelled V1 refer to the same value will still hold.
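For straight-line code, the renaming described above can be sketched in a few lines. This toy version, whose statement representation is an assumed simplification and not the paper's algorithm, only numbers instances and links uses to the latest definition; it omits the phi-function placement that becomes necessary once control flow branches.

```python
def to_ssa(statements):
    """Rename straight-line three-address code into SSA form.

    `statements` is a list of (target, used_variables) pairs; each
    assignment creates a fresh numbered instance (v -> v1, v2, ...) and
    each use is linked to the latest instance. Variables read before any
    assignment get instance 0. No control flow, hence no phi-functions.
    """
    version = {}  # variable name -> number of its current instance
    ssa = []
    for target, used in statements:
        uses = [f"{v}{version.get(v, 0)}" for v in used]
        version[target] = version.get(target, 0) + 1
        ssa.append((f"{target}{version[target]}", uses))
    return ssa
```

For example, `v = x; v = v + y; z = v` becomes `v1 = x0; v2 = v1 + y0; z1 = v2`, and every use now has exactly one reaching definition.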
Lower Bounds for the Union-Find and the Split-Find Problem on Pointer Machines
1989
Abstract

Cited by 11 (0 self)
A well-known result of Tarjan (cf. [15]) states that for all n and m ≥ n there exists a sequence of n − 1 Union and m Find operations that needs at least Ω(m·α(m, n)) execution steps on a pointer machine that satisfies the separation condition. In [1, 16] the bound was extended to Ω(n + m·α(m, n)) for all m and n. In this paper we prove that this bound holds on a general pointer machine without the separation condition, and we prove that the same bound holds for the Split-Find problem as well.
New Techniques for the Union-Find Problem
Utrecht University, 1989
Abstract

Cited by 6 (0 self)
A well-known result of Tarjan (cf. [10]) states that a program of up to n UNION and m FIND instructions can be executed in O(n + m·α(m, n)) time on a collection of n elements, where α(m, n) denotes the functional inverse of Ackermann's function. In this paper we develop a new approach to the problem and prove that the time for the k-th FIND can be limited to O(α(k, n)) worst case, while the total cost for the program of UNION's and m FIND's remains bounded by O(n + m·α(m, n)). The technique is part of a family of algorithms that can achieve various tradeoffs in cost for the individual instructions. The new algorithm is important in all set-manipulation problems that require frequent FIND's. Because α(m, n) is O(1) in all practical cases, the new algorithm guarantees that FIND's are essentially O(1) worst case, within the optimal bound for the UNION-FIND problem as a whole. The algorithm ...
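For context, the classic disjoint-set forest with union by rank and path compression attains the O(n + m·α(m, n)) total bound mentioned above, although an individual FIND can still cost Θ(log n); bounding the cost of each single FIND is exactly what the paper's new technique addresses. The sketch below is the standard textbook baseline, not the paper's algorithm.

```python
class UnionFind:
    """Disjoint-set forest with union by rank and path compression.

    A sequence of n-1 unions and m finds runs in O(n + m * alpha(m, n))
    total time, alpha being the functional inverse of Ackermann's function.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```

Two elements are in the same set exactly when their `find` roots coincide.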
Amortization, Lazy Evaluation, and Persistence: Lists with Catenation via Lazy Linking
Pages 646–654 of: IEEE Symposium on Foundations of Computer Science, 1995
Abstract

Cited by 6 (1 self)
Amortization has been underutilized in the design of persistent data structures, largely because traditional accounting schemes break down in a persistent setting. Such schemes depend on saving "credits" for future use, but a persistent data structure may have multiple "futures", each competing for the same credits. We describe how lazy evaluation can often remedy this problem, yielding persistent data structures with good amortized efficiency. In fact, such data structures can be implemented purely functionally in any functional language supporting lazy evaluation. As an example of this technique, we present a purely functional (and therefore persistent) implementation of lists that simultaneously support catenation and all other usual list primitives in constant amortized time. This data structure is much simpler than the only existing data structure with comparable bounds, the recently discovered catenable lists of Kaplan and Tarjan, which support all operations in constant worstca...
An Adaptive and Efficient Algorithm for Detecting Approximately Duplicate Database Records
2000
Abstract

Cited by 3 (1 self)
The integration of information is an important area of research in databases. By combining multiple information sources, a more complete and more accurate view of the world is attained, and additional knowledge gained. This is a non-trivial task, however. Often there are many sources which contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Furthermore, one source may not have the exact information that another source contains. Some of the information may be different, due to data entry errors for example, or may be missing altogether. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is typically manipulated into a form which is useful for other tasks, such as data mining. This paper addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit distance is used as a domain-independent method to recognize pairs of approximate duplicates.
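The paper applies a Smith-Waterman-style dynamic program; as a simplified stand-in, the sketch below computes plain unit-cost edit distance (Levenshtein) and flags a pair as an approximate duplicate when the normalized distance falls under a threshold. The threshold value and function names are assumptions, and the real system's alignment scoring differs.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning
    string a into string b, by row-at-a-time dynamic programming."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # substitute or match
        prev = cur
    return prev[len(b)]

def is_approx_duplicate(a, b, threshold=0.2):
    """Flag two field values as approximate duplicates when their edit
    distance, normalized by the longer length, is at most `threshold`
    (an illustrative cutoff, not the paper's)."""
    if not a and not b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold
```

Run pairwise over candidate records, this distinguishes small typographical variations from genuinely different entities.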
Efficient Propagators for Global Constraints
Abstract

Cited by 2 (0 self)
We study in this thesis three well-known global constraints. The AllDifferent constraint restricts a set of variables to be assigned distinct values. The global cardinality constraint (GCC) ensures that a value v is assigned to at least lv variables and to at most uv variables among a set of given variables, where lv and uv are non-negative integers such that lv ≤ uv. The InterDistance constraint ensures that all variables among a set of variables x1, ..., xn are pairwise distant by at least p, i.e., |xi − xj| ≥ p for all i ≠ j. The AllDifferent constraint, the GCC, and the InterDistance constraint are largely used in scheduling problems. For instance, in scheduling problems where tasks with unit processing time compete for a single ...
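A feasibility test for AllDifferent can be phrased as maximum bipartite matching between variables and values, which is also the core of Régin's classic propagator; the augmenting-path sketch below (function names are illustrative) checks feasibility only and performs none of the domain pruning a real propagator would do.

```python
def all_different_feasible(domains):
    """Decide whether variables with the given value domains can take
    pairwise distinct values, via augmenting-path bipartite matching.

    `domains` is a list of sets; variable i may take any value in
    domains[i]. Feasible iff a matching covers every variable.
    """
    match = {}  # value -> index of the variable currently holding it

    def augment(var, visited):
        for val in sorted(domains[var]):
            if val in visited:
                continue
            visited.add(val)
            # Take a free value, or evict a holder that can re-match.
            if val not in match or augment(match[val], visited):
                match[val] = var
                return True
        return False

    return all(augment(v, set()) for v in range(len(domains)))
```

Three variables sharing the two-value domain {1, 2} fail this test, which is the matching-based restatement of Hall's theorem.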