Results 1–10 of 16
Robust web extraction: an approach based on a probabilistic tree-edit model
In SIGMOD
Cited by 8 (2 self)
Abstract:
On script-generated web sites, many documents share a common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts, and thus the tree structure, evolve over time, causing wrappers to break repeatedly and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as the application of a series of edit operations: deleting nodes, inserting nodes, and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently, in quadratic time in the size of the trees, making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree-edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of the resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
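The stochastic evolution process this abstract describes can be sketched in a few lines. This is a toy illustration, not the paper's actual model: the tuple-based tree encoding, the fixed per-node probabilities, and the label set are assumptions made here, and node insertion is omitted for brevity.

```python
import random

def evolve(tree, p_del=0.1, p_sub=0.2, labels=("div", "span", "td"), rng=random):
    """tree is (label, [children]); returns a LIST of edited trees,
    because deleting a node splices its children into its parent."""
    label, children = tree
    new_children = []
    for child in children:
        new_children.extend(evolve(child, p_del, p_sub, labels, rng))
    r = rng.random()
    if r < p_del:
        return new_children                 # delete this node, keep children
    if r < p_del + p_sub:
        label = rng.choice(labels)          # substitute this node's label
    return [(label, new_children)]

# Repeated snapshots of the same page can then be simulated by applying
# evolve() to a source tree several times.
src = ("html", [("body", [("div", []), ("div", [("span", [])])])])
print(evolve(src))
```

Estimating the probability that one observed tree evolved into another under such a model is the quadratic-time computation the abstract refers to.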
Learning probabilistic models of tree edit distance
2008
Cited by 7 (4 self)
Abstract:
Nowadays, there is a growing interest in machine learning and pattern recognition for tree-structured data. Trees provide a suitable structural representation for complex tasks such as web information extraction, RNA secondary-structure prediction, computer music, or conversion of semi-structured data (e.g. XML documents). Many applications in these domains require the calculation of similarities over pairs of trees. In this context, the tree edit distance (ED) has been the subject of investigation for many years with the aim of improving its computational efficiency. However, used in its classical form, the tree ED needs a priori fixed edit costs, which are often difficult to tune and leave little room for tackling complex problems. In this paper, to overcome this drawback, we focus on the automatic learning of a non-parametric stochastic tree ED. More precisely, we are interested in two kinds of probabilistic approaches. The first builds a generative model of the tree ED from a joint distribution over the edit operations, while the second works from a conditional distribution, thus providing a discriminative model. To tackle …
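The generative (joint) variant mentioned above can be illustrated, for strings, with a memoryless stochastic transducer in the style of Ristad and Yianilos, whose forward pass sums the joint probability p(x, y) over all edit scripts. The tiny alphabet and the hand-picked edit probabilities below are illustrative assumptions, not learned values; they are chosen to sum to 1 and to favour identical substitutions.

```python
ALPHABET = "ab"
P_END = 0.2                                   # probability of stopping
P_DEL = {c: 0.05 for c in ALPHABET}           # delete a symbol of x
P_INS = {c: 0.05 for c in ALPHABET}           # insert a symbol of y
P_SUB = {(c, d): (0.25 if c == d else 0.05)   # rewrite c as d
         for c in ALPHABET for d in ALPHABET}
# Sanity: 0.2 + 2*0.05 + 2*0.05 + 2*0.25 + 2*0.05 = 1.0

def joint_prob(x, y):
    """Forward pass: f[i][j] is the probability mass of all edit
    scripts that generate the prefixes x[:i] and y[:j]."""
    f = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    f[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0:
                f[i][j] += P_DEL[x[i - 1]] * f[i - 1][j]
            if j > 0:
                f[i][j] += P_INS[y[j - 1]] * f[i][j - 1]
            if i > 0 and j > 0:
                f[i][j] += P_SUB[(x[i - 1], y[j - 1])] * f[i - 1][j - 1]
    return f[len(x)][len(y)] * P_END
```

Learning would adjust P_DEL, P_INS and P_SUB (e.g. by EM) to maximize the likelihood of training pairs; the discriminative variant instead models the conditional distribution of y given x.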
Melody recognition with learned edit distances
2008
Cited by 6 (3 self)
Abstract:
In a music recognition task, the classification of a new melody is often achieved by looking for the closest piece in a set of already-known prototypes. The definition of a relevant similarity measure then becomes a crucial point. So far, the edit-distance approach with a priori fixed operation costs has been one of the most widely used to accomplish the task. In this paper, the application of a probabilistic learning model to both string and tree edit distances is proposed and compared to a genetic-algorithm cost-fitting approach. The results show that both learning models outperform fixed-cost systems, and that the probabilistic approach consistently describes the underlying melodic similarity model.
Learning Metrics between Tree Structured Data: Application to Image Recognition
Cited by 5 (3 self)
Abstract:
The problem of learning metrics between structured data (strings, trees or graphs) has been the subject of various recent papers. With regard to the specific case of trees, some approaches have focused on learning the edit probabilities required to compute a so-called stochastic tree edit distance. However, to reduce the algorithmic and learning constraints, the deletion and insertion operations are performed on entire subtrees rather than on single nodes. In this article, we aim to fill this gap by learning a more general stochastic tree edit distance in which node deletions and insertions are allowed. Our approach is based on an adaptation of the EM optimization algorithm to learn the parameters of a tree model. We propose an original experimental approach in which images are given a tree-structured representation and our learned metric is then used in an image recognition task. Comparisons with a non-learned tree edit distance confirm the effectiveness of our approach.
Finding Cognate Groups using Phylogenies
Cited by 3 (1 self)
Abstract:
A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying the complex weighted automata created during inference, to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.
A discriminative model of stochastic edit distance in the form of a conditional transducer. Grammatical Inference: Algorithms and Applications 4201
2006
Cited by 2 (0 self)
Abstract:
Many real-world applications, such as spell-checking or DNA analysis, use the Levenshtein edit distance to compute similarities between strings. In practice, the costs of the primitive edit operations (insertion, deletion and substitution of symbols) are generally hand-tuned. In this paper, we propose an algorithm to learn these costs. The underlying model is a probabilistic transducer, computed using grammatical inference techniques, that allows us to learn both the structure and the probabilities of the model. Beyond the fact that the learned transducers are neither deterministic nor stochastic in the standard terminology, they are conditional, and thus independent of the distribution of the input strings. Finally, we show through experiments that our method allows us to design cost functions that depend on the string context in which the edit operations are used. In other words, we obtain a kind of context-sensitive edit distance.
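As a point of reference for what such a method tunes, here is the classical Levenshtein recurrence with the primitive operation costs exposed as parameters; a learning algorithm of the kind described above replaces these hand-tuned functions with learned ones. The function names and the default unit costs are assumptions of this sketch.

```python
def edit_distance(x, y,
                  cost_del=lambda a: 1.0,
                  cost_ins=lambda b: 1.0,
                  cost_sub=lambda a, b: 0.0 if a == b else 1.0):
    """Levenshtein distance with pluggable primitive-operation costs."""
    d = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):            # delete all of x
        d[i][0] = d[i - 1][0] + cost_del(x[i - 1])
    for j in range(1, len(y) + 1):            # insert all of y
        d[0][j] = d[0][j - 1] + cost_ins(y[j - 1])
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(d[i - 1][j] + cost_del(x[i - 1]),
                          d[i][j - 1] + cost_ins(y[j - 1]),
                          d[i - 1][j - 1] + cost_sub(x[i - 1], y[j - 1]))
    return d[len(x)][len(y)]
```

With unit costs this gives the familiar result `edit_distance("kitten", "sitting") == 3`; context-sensitive cost functions, as in the abstract, would additionally inspect the surrounding symbols.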
SEDiL: Software for Edit Distance Learning
Cited by 2 (0 self)
Abstract:
In this paper, we present SEDiL, a Software for Edit Distance Learning. SEDiL is an innovative prototype implementation grouping together most of the state-of-the-art methods [1–4] that aim to automatically learn the parameters of string and tree edit distances. This work was funded by the French ANR Marmota project, the Pascal Network of …
Sequences Classification by Least General Generalisations
2010
Abstract:
In this paper, we present a general framework for supervised classification. This framework provides methods like boosting and only needs the definition of a generalisation operator called lgg. For sequence classification tasks, lgg is a learner that uses only positive examples. We show that grammatical inference has already defined such learners for automata classes such as reversible automata or k-TSS automata. We then propose a generalisation algorithm for the class of balls of words. Finally, we show through experiments that our method efficiently solves sequence classification tasks.
A Sum-over-Paths Extension of Edit Distances …
2009
Abstract:
This work introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences. Each alignment ℘, with total cost C(℘), is assigned a probability of occurrence P(℘) = exp[−θC(℘)]/Z, where Z is a normalization factor. Therefore, good alignments (having a low cost) are favoured over bad alignments (having a high cost). The expected cost, ∑_{℘∈P} C(℘) exp[−θC(℘)]/Z, computed over all possible alignments ℘ ∈ P, defines the SoP edit distance. When θ → ∞, only the best alignments matter and the measure reduces to the standard edit distance. The rationale behind this definition is the following: for some applications, two sequences sharing many good alignments should be considered more similar than two sequences having only a single good, optimal, alignment in common. In other words, suboptimal alignments should also be taken into account. Forward/backward recurrences allowing the expected cost to be computed efficiently are developed. Virtually any Viterbi-like sequence comparison algorithm computed on a lattice can be generalized in the same way; for instance, a SoP longest common subsequence is also developed. Pattern clustering and classification tasks performed on four data sets show that the new measures usually outperform the standard ones and, in any case, never perform worse, at the cost of tuning the parameter θ.
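The definitions above can be checked directly on short strings by enumerating every alignment; the paper's forward/backward recurrences compute the same quantity efficiently on the edit lattice. Unit edit costs and the brute-force enumeration are assumptions of this sketch.

```python
import math

def alignment_costs(x, y):
    """Total cost C(p) of every alignment p of x and y, under unit
    insertion/deletion/substitution costs (0 for a match).
    Exponential enumeration: fine for short strings only."""
    if not x and not y:
        return [0.0]
    out = []
    if x:                                     # delete x[0]
        out += [1.0 + c for c in alignment_costs(x[1:], y)]
    if y:                                     # insert y[0]
        out += [1.0 + c for c in alignment_costs(x, y[1:])]
    if x and y:                               # substitute (free if match)
        sub = 0.0 if x[0] == y[0] else 1.0
        out += [sub + c for c in alignment_costs(x[1:], y[1:])]
    return out

def sop_edit_distance(x, y, theta):
    """Expected cost  sum_p C(p) exp(-theta*C(p)) / Z  over all alignments."""
    costs = alignment_costs(x, y)
    z = sum(math.exp(-theta * c) for c in costs)
    return sum(c * math.exp(-theta * c) for c in costs) / z
```

With unit costs, `sop_edit_distance("cat", "cart", theta)` approaches the standard edit distance of 1 as θ grows, while smaller θ also credits suboptimal alignments and yields a larger expected cost.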