Information Extraction in Structured Documents using Tree Automata Induction
, 2002
Cited by 18 (5 self)
Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work for IE from structured documents formatted in HTML or XML uses techniques for IE from strings, such as grammar and automata induction. However, such documents have a tree structure. Hence it is natural to investigate methods that are able to recognise and exploit this tree structure. We do this by exploring the use of tree automata for IE in structured documents. Experimental results on benchmark data sets show that our approach compares favorably with previous approaches.
Information extraction from web documents based on local unranked tree automaton inference
- In (IJCAI-2003
, 2003
Cited by 9 (1 self)
Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.
Learning Rational Stochastic Tree Languages
Cited by 4 (1 self)
We consider the problem of learning stochastic tree languages, i.e. probability distributions over a set of trees T(F), from a sample of trees independently drawn according to an unknown target P. We consider the case where the target is a rational stochastic tree language, i.e. it can be computed by a rational tree series or, equivalently, by a multiplicity tree automaton. In this paper, we provide two contributions. First, we show that rational tree series admit a canonical representation with parameters that can be efficiently estimated from samples. Then, we give an inference algorithm that identifies the class of rational stochastic tree languages in the limit with probability one.
Learning multiplicity tree automata
- In: Proceedings of the 8th International Colloquium on Grammatical Inference (ICGI’06). Volume 4201 of LNCS
, 2006
Cited by 3 (1 self)
In this paper, we present a theoretical approach to the problem of learning multiplicity tree automata. These automata allow one to define functions that compute a number for each tree. They can be seen as a strict generalization of stochastic tree automata, since they allow functions to be defined over any field K. A multiplicity automaton admits a support, which is a nondeterministic automaton. From a grammatical inference point of view, this paper presents a contribution that is original due to the combination of two important aspects. This is the first time, as far as we know, that a learning method focuses on nondeterministic tree automata that compute functions over a field. The algorithm proposed in this paper works in Angluin's exact learning model, where a learner is allowed to use membership and equivalence queries. We show that this algorithm runs in time polynomial in the size of the representation.
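To make "functions which compute a number for each tree" concrete, here is a minimal sketch of how a multiplicity tree automaton evaluates a tree. It is not the learning algorithm from the paper. The alphabet (leaves `a` and `b`, binary symbol `f`), the two-dimensional state space, the field Q, and the names `LEAF`, `T`, `FINAL`, `mu` and `value` are all assumptions chosen for illustration. This particular automaton counts the leaves labelled `a`.

```python
from fractions import Fraction

# Hypothetical toy automaton over the field Q with a 2-dimensional state space.
# Each leaf symbol is mapped to a vector (1, number of 'a' leaves seen).
LEAF = {
    "a": [Fraction(1), Fraction(1)],
    "b": [Fraction(1), Fraction(0)],
}

# Binary symbol "f" is mapped to a bilinear map, stored as a tensor:
# T[i][j][k] is the coefficient of x[i] * y[j] in output component k.
T = [
    [[1, 0], [0, 1]],  # x0*y0 feeds output 0, x0*y1 feeds output 1
    [[0, 1], [0, 0]],  # x1*y0 feeds output 1, x1*y1 is unused
]

# Final weight vector: read off the second component.
FINAL = [Fraction(0), Fraction(1)]


def mu(tree):
    """Bottom-up evaluation: return the state vector of a tree."""
    if isinstance(tree, str):
        return LEAF[tree]
    _, left, right = tree
    x, y = mu(left), mu(right)
    return [
        sum(x[i] * y[j] * T[i][j][k] for i in range(2) for j in range(2))
        for k in range(2)
    ]


def value(tree):
    """Number assigned to the tree: final vector dotted with its state vector."""
    return sum(w * v for w, v in zip(FINAL, mu(tree)))


# The tree f(f(a, b), a) has two leaves labelled 'a'.
example = ("f", ("f", "a", "b"), "a")
count = value(example)
```

Because the map attached to `f` is bilinear rather than a transition table, the same machinery covers weights in any field, which is the generalization the abstract refers to.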
A Comparison of PCFG Models
, 2000
Cited by 2 (1 self)
In this paper, we compare three different approaches to building a probabilistic context-free grammar for natural language parsing from a treebank corpus: (1) a model that simply extracts the rules contained in the corpus and counts the number of occurrences of each rule; (2) a model that also stores information about the parent node's category; and (3) a model that estimates the probabilities according to a generalized k-gram scheme with k = 3. The last one allows for faster parsing and decreases the perplexity of test samples.
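The first of the three models, rule extraction with occurrence counts, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the tree encoding (a `(label, children)` tuple with plain strings as words), the toy treebank, and the function names `rules` and `estimate_pcfg` are assumptions. Probabilities are taken as relative frequencies among rules sharing the same left-hand side.

```python
from collections import Counter

# Assumed encoding: a tree is (label, [children]); a word is a plain string.


def rules(tree):
    """Yield one (lhs, rhs) pair for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in treebank for rule in rules(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical two-sentence treebank.
toy_treebank = [
    ("S", [("NP", ["she"]), ("VP", [("V", ["sleeps"])])]),
    ("S", [("NP", ["he"]), ("VP", [("V", ["sees"]), ("NP", ["her"])])]),
]

pcfg = estimate_pcfg(toy_treebank)
```

The second model in the abstract could be approximated in this sketch by yielding `(parent_label, label)` as the left-hand side instead of `label` alone; that variant is likewise an assumption about the encoding, not a detail given in the source.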
Information Extraction from Structured Documents using k-testable Tree Automaton Inference
Cited by 2 (1 self)
Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. This paper explores methods that exploit the tree structure of the documents. In particular, our method infers a k-testable tree automaton from a small set of annotated examples and explores various ways to generalize the inferred automaton. Experimental results on the benchmark data sets show that our approach compares favorably to the previous approaches.
Tree k-Grammar Models for Natural Language Modelling and Parsing
Cited by 1 (1 self)
In this paper, we compare three different approaches to building a probabilistic context-free grammar for natural language parsing from a treebank corpus: (1) a model that simply extracts the rules contained in the corpus and counts the number of occurrences of each rule; (2) a model that also stores information about the parent node’s category; and (3) a model that estimates the probabilities according to a generalized k-gram scheme for trees with k = 3. The last model allows for faster parsing and considerably decreases the perplexity of test samples.

