Results 1-10 of 32
On the Evolution of Clusters of Near-Duplicate Web Pages
 IN 1ST LATIN AMERICAN WEB CONGRESS
, 2003
Cited by 57 (3 self)
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
Local Search Genetic Algorithm for Optimization of Highly Reliable Communications Networks
 IEEE Transactions on Evolutionary Computation
, 1997
Cited by 37 (5 self)
This paper presents a genetic algorithm (GA) with specialized encoding, initialization and local search genetic operators to optimize communication network topologies. This NP-hard problem is often highly constrained so that random initialization and standard genetic operators usually generate infeasible network architectures. Compounding this infeasibility issue is that the fitness function involves calculating the all-terminal reliability of the network, a calculation which is computationally expensive. Therefore, it is imperative that the search balances the need to thoroughly explore the boundary between feasible and infeasible networks, along with calculating fitness on only the most promising candidate networks. The algorithm results are compared to optimum results found by branch and bound and also to GA results without local search operators on a suite of 79 test problems. This GA strategy of employing bounds, simple heuristic checks and problem-specific rep...
Modality in Dialogue: Planning, Pragmatics and Computation
, 1998
Cited by 36 (9 self)
Natural language generation (NLG) is first and foremost a reasoning task. In this reasoning, a system plans a communicative act that will signal key facts about the domain to the hearer. In generating action descriptions, this reasoning draws on characterizations both of the causal properties of the domain and the states of knowledge of the participants in the conversation. This dissertation shows how such characterizations can be specified declaratively and accessed efficiently in NLG. The heart of this dissertation is a study of logical statements about knowledge and action in modal logic. By investigating the proof theory of modal logic from a logic programming point of view, I show how many kinds of modal statements can be seen as straightforward instructions for computationally manageable search, just as Prolog clauses can. These modal statements provide sufficient expressive resources for an NLG system to represent the effects of actions in the world or to model an addressee whose knowledge in some respects exceeds and in other respects falls short of its own. To illustrate the use of such statements, I describe how the SPUD sentence planner exploits a modal knowledge base to
Matching Algorithms within a Duplicate Detection System
 Bulletin of the Technical Committee on Data Engineering
, 2000
Cited by 20 (0 self)
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pairwise record matching algorithm must have in order to have a successful duplicate detection system.
Simple generation of static single-assignment form
 In Proceedings of the 9th International Conference on Compiler Construction
, 2000
Cited by 14 (0 self)
The static single-assignment (SSA) form of a program provides data flow information in a form which makes some compiler optimizations easy to perform. In this paper we present a new, simple method for converting to SSA form, which produces correct solutions for nonreducible control-flow graphs, and produces minimal solutions for reducible ones. Our timing results show that, despite its simplicity, our algorithm is competitive with more established techniques.
1 Introduction
The static single-assignment (SSA) form is a program representation in which variables are split into "instances." Every new assignment to a variable, or more generally every new definition of a variable, results in a new instance. The variable instances are numbered so that each use of a variable may be easily linked back to a single definition point. Figure 1 gives an example of SSA form for some straight-line code. As its name suggests, SSA only reflects static properties; in the example, V1's value is a dynamic property, but the static property that all instances labelled V1 refer to the same value will still hold.
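The instance numbering described in this introduction can be sketched for straight-line code only. The snippet below is an illustration, not the conversion method of the paper: it handles no control flow and therefore no phi-functions. The function name `to_ssa` and the `(target, operands)` statement encoding are assumptions made for the example.

```python
def to_ssa(statements):
    """Rename variables in straight-line code so every assignment
    defines a fresh numbered instance (hypothetical helper).

    statements: list of (target, [operand names]) pairs, in program order.
    Returns the same list with names replaced by numbered instances.
    """
    version = {}   # variable name -> number of its latest instance
    renamed = []
    for target, operands in statements:
        # Each use refers to the most recent definition; a variable that
        # is read before any assignment is treated as input instance 0.
        uses = [f"{v}{version.setdefault(v, 0)}" for v in operands]
        # Each definition creates the next instance of the target.
        version[target] = version.get(target, 0) + 1
        renamed.append((f"{target}{version[target]}", uses))
    return renamed


# V = a + b; W = V; V = V + W   (operators omitted, only names matter here)
program = [("V", ["a", "b"]), ("W", ["V"]), ("V", ["V", "W"])]
ssa = to_ssa(program)
```

Appending the number directly to the name mirrors the V1 notation in the text; a real implementation would need a separator or a pair to avoid clashes with names that already end in digits.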
Lower Bounds for the Union-Find and the Split-Find Problem on Pointer Machines
, 1989
Cited by 11 (0 self)
A well-known result of Tarjan (cf. [15]) states that for all $n$ and $m \geq n$ there exists a sequence of $n - 1$ Union and $m$ Find operations that needs at least $\Omega(m \cdot \alpha(m, n))$ execution steps on a pointer machine that satisfies the separation condition. In [1, 16] the bound was extended to $\Omega(n + m \cdot \alpha(m, n))$ for all $m$ and $n$. In this paper we prove that this bound holds on a general pointer machine without the separation condition and we prove that the same bound holds for the Split-Find problem as well.
Amortization, Lazy Evaluation, and Persistence: Lists with Catenation via Lazy Linking
 Pages 646-654 of: IEEE Symposium on Foundations of Computer Science
, 1995
Cited by 6 (1 self)
Amortization has been underutilized in the design of persistent data structures, largely because traditional accounting schemes break down in a persistent setting. Such schemes depend on saving "credits" for future use, but a persistent data structure may have multiple "futures", each competing for the same credits. We describe how lazy evaluation can often remedy this problem, yielding persistent data structures with good amortized efficiency. In fact, such data structures can be implemented purely functionally in any functional language supporting lazy evaluation. As an example of this technique, we present a purely functional (and therefore persistent) implementation of lists that simultaneously support catenation and all other usual list primitives in constant amortized time. This data structure is much simpler than the only existing data structure with comparable bounds, the recently discovered catenable lists of Kaplan and Tarjan, which support all operations in constant worstca...
New Techniques for the Union-Find Problem
 UTRECHT UNIVERSITY
, 1989
Cited by 6 (0 self)
A well-known result of Tarjan (cf. [10]) states that a program of up to $n$ UNION and $m$ FIND instructions can be executed in $O(n + m \cdot \alpha(m, n))$ time on a collection of $n$ elements, where $\alpha(m, n)$ denotes the functional inverse of Ackermann's function. In this paper we develop a new approach to the problem and prove that the time for the $k$-th FIND can be limited to $O(\alpha(k, n))$ worst case, while the total cost for the program of UNION's and $m$ FIND's remains bounded by $O(n + m \cdot \alpha(m, n))$. The technique is part of a family of algorithms that can achieve various trade-offs in cost for the individual instructions. The new algorithm is important in all set-manipulation problems that require frequent FIND's. Because $\alpha(m, n)$ is $O(1)$ in all practical cases, the new algorithms guarantee that FIND's are essentially $O(1)$ worst case, within the optimal bound for the UNION-FIND problem as a whole. The algorithm
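For orientation, the UNION and FIND instructions can be sketched with the standard textbook structure (union by rank plus path compression), which is the classical structure behind the amortized bound attributed to Tarjan above. This is not the new algorithm of the paper: it gives no worst-case guarantee for an individual FIND. The class and method names are assumptions made for the example.

```python
class DisjointSets:
    """Textbook union-find over elements 0..n-1 (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n           # upper bound on the height of each tree

    def find(self, x):
        # First pass: walk up to the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited node at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        # Merge the sets containing a and b; return False if already merged.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


sets = DisjointSets(5)
sets.union(0, 1)
sets.union(1, 2)
same = sets.find(0) == sets.find(2)      # 0 and 2 are now in one set
apart = sets.find(3) != sets.find(0)     # 3 was never merged
```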
A survey of analysis techniques for discrete algorithms
, 1977
Cited by 6 (0 self)
This survey includes an introduction to the concepts of problem complexity, analysis of algorithms to find bounds on complexity, average-case behavior, and approximation algorithms. The major techniques used in analysis of algorithms are reviewed and examples of the use of these methods are presented. A brief explanation of the problem classes P and NP, as well as the class of NP-complete problems, is also presented.
A genetic algorithm approach to optimal topological design of all terminal networks
 Intelligent Engineering Systems Through Artificial Neural Networks
, 1995
Cited by 4 (1 self)
In the design of communication networks, one of the fundamental considerations is the reliability and availability of communication paths between all terminals. Together, these form the network system reliability. The other important aspect is the layout of paths to minimize cost while meeting a reliability criterion. In this paper, a new heuristic search algorithm based on Genetic Algorithms (GA) is presented to optimize the design of large scale network topologies subject to a reliability constraint. The search works with an improved Monte Carlo simulation technique to estimate the system reliability of a network topology.
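As a rough illustration of estimating all-terminal reliability by simulation, the sketch below uses plain (crude) Monte Carlo sampling. It is not the improved technique the abstract refers to. It assumes every link works independently with the same probability `p`, that nodes are numbered 0..n-1 with n at least 1, and all function names are hypothetical.

```python
import random


def is_connected(n, edges):
    """Return True if the undirected graph on nodes 0..n-1 is connected."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Depth-first search from node 0.
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def estimate_reliability(n, edges, p, trials=10000, seed=0):
    """Estimate the probability that all n terminals can communicate
    when each link is independently operational with probability p."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    connected = 0
    for _ in range(trials):
        # Sample one network state: keep each link with probability p.
        up = [e for e in edges if rng.random() < p]
        if is_connected(n, up):
            connected += 1
    return connected / trials


# Triangle network: the exact value is p**3 + 3 * p**2 * (1 - p).
triangle = [(0, 1), (1, 2), (0, 2)]
estimate = estimate_reliability(3, triangle, p=0.9)
```

In a GA of the kind described, an estimator like this would be the expensive part of the fitness evaluation, which is why the number of trials is a parameter.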