On the Evolution of Clusters of Near-Duplicate Web Pages
- IN 1ST LATIN AMERICAN WEB CONGRESS
, 2003
Cited by 41 (3 self)
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
Modality in Dialogue: Planning, Pragmatics and Computation
, 1998
Cited by 32 (9 self)
Natural language generation (NLG) is first and foremost a reasoning task. In this reasoning, a system plans a communicative act that will signal key facts about the domain to the hearer. In generating action descriptions, this reasoning draws on characterizations both of the causal properties of the domain and the states of knowledge of the participants in the conversation. This dissertation shows how such characterizations can be specified declaratively and accessed efficiently in NLG. The heart of this dissertation is a study of logical statements about knowledge and action in modal logic. By investigating the proof-theory of modal logic from a logic programming point of view, I show how many kinds of modal statements can be seen as straightforward instructions for computationally manageable search, just as Prolog clauses can. These modal statements provide sufficient expressive resources for an NLG system to represent the effects of actions in the world or to model an addressee whose knowledge in some respects exceeds and in other respects falls short of its own. To illustrate the use of such statements, I describe how the SPUD sentence planner exploits a modal knowledge base to
Local Search Genetic Algorithm for Optimization of Highly Reliable Communications Networks
- IEEE Transactions on Evolutionary Computation
, 1997
Cited by 27 (3 self)
This paper presents a genetic algorithm (GA) with specialized encoding, initialization and local search genetic operators to optimize communication network topologies. This NP-hard problem is often highly constrained, so that random initialization and standard genetic operators usually generate infeasible network architectures. Compounding this infeasibility issue, the fitness function involves calculating the all-terminal reliability of the network, a calculation which is computationally expensive. Therefore, it is imperative that the search balance the need to thoroughly explore the boundary between feasible and infeasible networks against the need to calculate fitness on only the most promising candidate networks. The algorithm results are compared to optimum results found by branch and bound and also to GA results without local search operators on a suite of 79 test problems. This GA strategy of employing bounds, simple heuristic checks and problem specific rep...
Matching Algorithms within a Duplicate Detection System
- Bulletin of the Technical Committee on Data Engineering
, 2000
Cited by 16 (0 self)
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pair-wise record matching algorithm must have in order to have a successful duplicate detection system.
Simple generation of static single-assignment form
- In Proceedings of the 9th International Conference on Compiler Construction
, 2000
Cited by 12 (0 self)
The static single-assignment (SSA) form of a program provides data flow information in a form which makes some compiler optimizations easy to perform. In this paper we present a new, simple method for converting to SSA form, which produces correct solutions for nonreducible control-flow graphs, and produces minimal solutions for reducible ones. Our timing results show that, despite its simplicity, our algorithm is competitive with more established techniques.
1 Introduction
The static single-assignment (SSA) form is a program representation in which variables are split into "instances." Every new assignment to a variable (or, more generally, every new definition of a variable) results in a new instance. The variable instances are numbered so that each use of a variable may be easily linked back to a single definition point. Figure 1 gives an example of SSA form for some straight-line code. As its name suggests, SSA only reflects static properties; in the example, V1's value is a dynamic property, but the static property that all instances labelled V1 refer to the same value will still hold.
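The instance numbering described in this excerpt can be illustrated for straight-line code only. The following is a minimal sketch, not the paper's conversion method: it ignores control flow (and therefore phi-functions), and the function name `to_ssa`, the `(target, expression)` input format, and the sample statements are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def to_ssa(statements):
    """Rename variables in straight-line code so every assignment
    defines a fresh numbered instance.

    statements: list of (target, expression) pairs, expression as a string.
    Hypothetical input format chosen for this sketch.
    """
    version = defaultdict(int)           # latest instance number per variable
    identifier = re.compile(r"[A-Za-z_]\w*")
    result = []
    for target, expr in statements:
        def rename(match):
            name = match.group(0)
            # A use refers to the most recent instance; variables that were
            # never assigned (inputs) keep their original name.
            return f"{name}{version[name]}" if version[name] else name

        new_expr = identifier.sub(rename, expr)
        version[target] += 1             # the assignment creates a new instance
        result.append((f"{target}{version[target]}", new_expr))
    return result


program = [("V", "4"), ("X", "V + 5"), ("V", "6"), ("Y", "V + 7")]
for target, expr in to_ssa(program):
    print(f"{target} = {expr}")
```

Each use is rewritten before the target is renumbered, so a statement such as `V = V + 1` reads the old instance and defines the next one. Appending a number to the name can collide with existing names that already end in digits; a real implementation would keep the instance number separate.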
Lower Bounds for the Union-Find and the Split-Find Problem on Pointer Machines
, 1989
Cited by 10 (0 self)
A well-known result of Tarjan (cf. [15]) states that for all n and m ≥ n there exists a sequence of n - 1 Union and m Find operations that needs at least Ω(m·α(m, n)) execution steps on a pointer machine that satisfies the separation condition. In [1, 16] the bound was extended to Ω(n + m·α(m, n)) for all m and n. In this paper we prove that this bound holds on a general pointer machine without the separation condition and we prove that the same bound holds for the Split-Find problem as well.
Amortization, Lazy Evaluation, and Persistence: Lists with Catenation via Lazy Linking
- Pages 646--654 of: IEEE Symposium on Foundations of Computer Science
, 1995
Cited by 6 (1 self)
Amortization has been underutilized in the design of persistent data structures, largely because traditional accounting schemes break down in a persistent setting. Such schemes depend on saving "credits" for future use, but a persistent data structure may have multiple "futures", each competing for the same credits. We describe how lazy evaluation can often remedy this problem, yielding persistent data structures with good amortized efficiency. In fact, such data structures can be implemented purely functionally in any functional language supporting lazy evaluation. As an example of this technique, we present a purely functional (and therefore persistent) implementation of lists that simultaneously support catenation and all other usual list primitives in constant amortized time. This data structure is much simpler than the only existing data structure with comparable bounds, the recently discovered catenable lists of Kaplan and Tarjan, which support all operations in constant worst-ca...
New Techniques for the Union-Find Problem
- UTRECHT UNIVERSITY
, 1989
Cited by 5 (0 self)
A well-known result of Tarjan (cf. [10]) states that a program of up to n UNION and m FIND instructions can be executed in O(n + m·α(m, n)) time on a collection of n elements, where α(m, n) denotes the functional inverse of Ackermann's function. In this paper we develop a new approach to the problem and prove that the time for the k-th FIND can be limited to O(α(k, n)) worst case, while the total cost for the program of UNIONs and m FINDs remains bounded by O(n + m·α(m, n)). The technique is part of a family of algorithms that can achieve various trade-offs in cost for the individual instructions. The new algorithm is important in all set-manipulation problems that require frequent FINDs. Because α(m, n) is O(1) in all practical cases, the new algorithm guarantees that FINDs are essentially O(1) worst case, within the optimal bound for the UNION-FIND problem as a whole. The algorithm
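For readers unfamiliar with the UNION and FIND instructions discussed in this abstract, the following is a minimal sketch of the classic disjoint-set forest with union by rank and path compression, which attains the amortized bound attributed to Tarjan above. It is not the new algorithm of this paper, which bounds the worst case of individual FINDs; the class and method names are assumptions made for illustration.

```python
class UnionFind:
    """Disjoint-set forest over the elements 0 .. n-1 (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.rank = [0] * n           # upper bound on the height of each tree

    def find(self, x):
        # First pass: walk up to the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every node on the path
        # directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra


uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.find(0) == uf.find(1))  # same set
print(uf.find(1) == uf.find(3))  # different sets
```

In this classic scheme a single FIND can still take logarithmic time in the worst case; only the total over a sequence is bounded by the inverse-Ackermann term.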
An Adaptive and Efficient Algorithm for Detecting Approximately Duplicate Database Records
, 2000
Cited by 2 (1 self)
The integration of information is an important area of research in databases. By combining multiple information sources, a more complete and more accurate view of the world is attained, and additional knowledge gained. This is a non-trivial task, however. Often there are many sources which contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Furthermore, one source may not have the exact information that another source contains. Some of the information may be different, due to data entry errors for example, or may be missing altogether. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is typically manipulated into a form which is useful for other tasks, such as data mining. This paper addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit-distance is used as a domain-independent method to recognize pairs of approximate duplicates.
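The Smith-Waterman step named in this abstract can be sketched as follows. Smith-Waterman is usually stated as maximizing a local alignment score rather than minimizing a distance, and the sketch follows that formulation. The scoring values (match +2, mismatch -1, gap -1), the function name, and the sample strings are assumptions made for illustration, not parameters taken from the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1];
    # row 0 and column 0 stay at zero.
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a fresh alignment here
                h[i - 1][j - 1] + pair, # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,      # gap in b
                h[i][j - 1] + gap,      # gap in a
            )
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("abcd", "abxd"))
```

A record matcher built on this score would typically normalize it by the length of the shorter string and compare the result against a threshold; both choices would need tuning on real data.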
Efficient Propagators for Global Constraints
Cited by 2 (0 self)
Thesis by Claude-Guy Quimper. We study in this thesis three well-known global constraints. The All-Different constraint restricts a set of variables to be assigned to distinct values. The global cardinality constraint (GCC) ensures that a value v is assigned to at least l_v variables and to at most u_v variables among a set of given variables, where l_v and u_v are non-negative integers such that l_v ≤ u_v. The Inter-Distance constraint ensures that all variables, among a set of variables x_1, ..., x_n, are pairwise distant from p, i.e. |x_i − x_j| ≥ p for all i ≠ j. The All-Different constraint, the GCC, and the Inter-Distance constraint are widely used in scheduling problems. For instance, in scheduling problems where tasks with unit processing time compete for a single

