Robust web extraction: an approach based on a probabilistic tree-edit model
- In SIGMOD
Cited by 6 (2 self)
On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently (in quadratic time in the size of the trees), making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
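The three edit operations named in this abstract (deleting nodes, inserting nodes, substituting labels) can be illustrated on a small labelled tree. The sketch below is illustrative only: the `Node` class and the helper names are hypothetical, the semantics chosen for deletion and insertion are the usual tree-edit-distance conventions rather than anything confirmed by the paper, and neither the stochastic choice of operations nor the quadratic-time probability estimate is modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labelled ordered tree node (hypothetical stand-in for an HTML element)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def substitute(node: Node, new_label: str) -> None:
    """Substitute the label of a node, leaving its children untouched."""
    node.label = new_label


def delete_child(parent: Node, index: int) -> None:
    """Delete parent.children[index]; its children take its place, in order."""
    removed = parent.children[index]
    parent.children[index:index + 1] = removed.children


def insert_child(parent: Node, label: str, start: int, end: int) -> None:
    """Insert a new node under parent that adopts parent.children[start:end]."""
    new_node = Node(label, parent.children[start:end])
    parent.children[start:end] = [new_node]


def to_string(node: Node) -> str:
    """Render the tree in a compact bracketed form, e.g. body(div(h1 p) ul)."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(to_string(c) for c in node.children) + ")"


# A page whose structure drifts between two snapshots.
page = Node("body", [Node("div", [Node("h1"), Node("p")]), Node("ul")])
delete_child(page, 0)                # body(h1 p ul)
insert_child(page, "section", 0, 2)  # body(section(h1 p) ul)
substitute(page.children[1], "ol")   # body(section(h1 p) ol)
print(to_string(page))
```

A wrapper written as a fixed path such as `body/div/h1` would break after the first edit, which is the maintenance problem the abstract describes.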
Bisimulation Minimisation for Weighted Tree Automata
, 2007
Cited by 6 (5 self)
We generalise existing forward and backward bisimulation minimisation algorithms for tree automata to weighted tree automata. The obtained algorithms work for all semirings and retain the time complexity of their unweighted variants for all additively cancellative semirings. On all other semirings the time complexity is slightly higher (linear instead of logarithmic in the number of states). We discuss implementations of these algorithms on a typical task in natural language processing.
Extended Multi Bottom-Up Tree Transducers
Cited by 5 (2 self)
Extended multi bottom-up tree transducers are defined and investigated. They are an extension of multi bottom-up tree transducers by arbitrary, not just shallow, left-hand sides of rules; this includes rules that do not consume input. It is shown that such transducers can compute any transformation that is computed by a linear extended top-down tree transducer. Moreover, the classical composition results for bottom-up tree transducers are generalized to extended multi bottom-up tree transducers. Finally, a characterization in terms of extended top-down tree transducers is presented.
Learning for Semantic Parsing and Natural Language Generation Using Statistical Machine Translation Techniques
, 2007
Fluency Constraints for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices
Cited by 3 (2 self)
A novel and robust approach to improving statistical machine translation fluency is developed within a minimum Bayes-risk decoding framework. By segmenting translation lattices according to confidence measures over the maximum likelihood translation hypothesis we are able to focus on regions with potential translation errors. Hypothesis space constraints based on monolingual coverage are applied to the low confidence regions to improve overall translation fluency.
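For readers unfamiliar with minimum Bayes-risk decoding, the general idea can be sketched over a plain N-best list: pick the hypothesis with the lowest expected loss under the model's posterior, rather than the single most probable one. This is a simplified illustration, not the lattice segmentation method of the paper. The function names, the word-level edit distance used as the loss, and the toy posteriors are all assumptions made for the example.

```python
from typing import Callable, Dict


def word_error_loss(hyp: str, ref: str) -> float:
    """Word-level Levenshtein distance between two sentences (assumed loss)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # delete a word of hyp
                           cur[j - 1] + 1,             # insert a word of ref
                           prev[j - 1] + (hw != rw)))  # match or substitute
        prev = cur
    return float(prev[-1])


def mbr_decode(posteriors: Dict[str, float],
               loss: Callable[[str, str], float] = word_error_loss) -> str:
    """Return the hypothesis with minimum expected loss under the posteriors."""
    def expected_loss(candidate: str) -> float:
        return sum(p * loss(candidate, other) for other, p in posteriors.items())
    return min(posteriors, key=expected_loss)


# Toy N-best list with made-up posterior probabilities.
nbest = {"the cat sat": 0.4, "a cat sat down": 0.3, "a cat sat": 0.3}
print(max(nbest, key=nbest.get))  # most probable hypothesis
print(mbr_decode(nbest))          # minimum Bayes-risk hypothesis
```

In this toy list the most probable hypothesis is "the cat sat", but "a cat sat" has the lower expected loss (0.7 against 0.9) because it is close to both of the other candidates, so the two decision rules disagree.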
Pure and o-substitution
, 2006
Cited by 2 (1 self)
The basic properties of distributivity and deletion of pure and o-substitution are investigated. The obtained results are applied to show preservation of recognizability in a number of surprising cases. It is proved that linear and recognizable tree series are closed under o-substitution provided that the underlying semiring is commutative, continuous, and additively idempotent. It is known that, in general, pure substitution does not preserve recognizability (not even for linear target tree series), but it is shown that recognizable linear probability distributions (represented as tree series) are closed under pure substitution.
Myhill-Nerode theorem for recognizable tree series -- revisited
- Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing, chapter 6
, 1999
Cited by 2 (1 self)
In this contribution the Myhill-Nerode congruence relation on tree series is reviewed and a more detailed analysis of its properties is presented. It is shown that, if a tree series is deterministically recognizable over a zero-divisor free and commutative semiring, then the Myhill-Nerode congruence relation has finite index. By [Borchardt: Myhill-Nerode Theorem for Recognizable Tree Series. LNCS 2710. Springer 2003] the converse holds for commutative semifields, but not in general. In the second part, a slightly adapted version of the Myhill-Nerode congruence relation is defined and a characterization is obtained for all-accepting weighted tree automata over multiplicatively cancellative and commutative semirings.
A Tree Transducer Model for Synchronous Tree-Adjoining Grammars
Cited by 2 (0 self)
A characterization of the expressive power of synchronous tree-adjoining grammars (STAGs) in terms of tree transducers (or equivalently, synchronous tree substitution grammars) is developed. Essentially, a STAG corresponds to an extended tree transducer that uses explicit substitution in both the input and output. This characterization allows the easy integration of STAG into toolkits for extended tree transducers. Moreover, the applicability of the characterization to several representational and algorithmic problems is demonstrated.
Efficient processing of underspecified discourse representations
- In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT) – Short Papers
, 2008
Cited by 2 (1 self)
Underspecification-based algorithms for processing partially disambiguated discourse structure must cope with extremely high numbers of readings. Based on previous work on dominance graphs and weighted tree grammars, we provide the first possibility for computing an underspecified discourse description and a best discourse representation efficiently enough to process even the longest discourses in the RST Discourse Treebank.

