Results 11 - 20
of
58
Design of an End-to-End Method to Extract Information From Tables
- International Journal Document Analysis Research
"... This paper plans an end-to-end method for extracting information from tables embedded in documents; input format is ASCII, to which any richer format can be converted, preserving all textual and much of the layout information. We start by defining table. Then we describe the steps involved in extrac ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
This paper plans an end-to-end method for extracting information from tables embedded in documents; input format is ASCII, to which any richer format can be converted, preserving all textual and much of the layout information. We start by defining table. Then we describe the steps involved in extracting information from tables and analyse table-related research to: place the contribution of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, where there is a higher interaction between the different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally we explore how the actual interpretation of the table not only allows inferring the accuracy of the overall extraction process but also contributes to actually improving its quality. In order to do so, we believe interpretation has to consider context specific knowledge; we explore how the addition of this knowledge can be made in a plug-in/out manner, such that the overall method will maintain its operability in different contexts.
Applying conditional random fields to Japanese morphological analysis
- In Proc. of EMNLP
, 2004
"... This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We show how CRFs can be applied to situations where word boundary ambiguity exists. CRFs offer a solution to the long-standing problems in corpus-based or statistical Japanese morphological analysis. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. We experiment CRFs on the standard testbed corpus used for Japanese morphological analysis, and evaluate our results using the same experimental dataset as the HMMs and MEMMs previously reported in this task. Our results confirm that CRFs not only solve the long-standing problems but also improve the performance over HMMs and MEMMs. 1
NER Systems that Suit User's Preferences: Adjusting the RecallPrecision Trade-off for Entity Extraction
- In HLT/NAACL
, 2006
"... We describe a method based on “tweaking” an existing learned sequential classifier to change the recall-precision tradeoff, guided by a user-provided performance criterion. This method is evaluated on the task of recognizing personal names in email and newswire text, and proves to be both simple and ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
We describe a method based on “tweaking” an existing learned sequential classifier to change the recall-precision tradeoff, guided by a user-provided performance criterion. This method is evaluated on the task of recognizing personal names in email and newswire text, and proves to be both simple and effective. 1
Extracting Web data using instance-based learning
- In WISE-05
, 2005
"... Abstract. This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Abstract. This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates and pages of the same template usually can be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It outperforms the state-of-the-art existing systems significantly. 1
Rapid Development of Hindi Named Entity Recognition Using Conditional Random Fields and Feature Induction
- Special issue of ACM Transactions on Asian
, 2004
"... This paper describes our application of Conditional Random Fields (CRFs) with feature induction to a Hindi named entity recognition task. With only five days development time and little knowledge of this language, we automatically discover relevant features by providing a large array of lexical test ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
This paper describes our application of Conditional Random Fields (CRFs) with feature induction to a Hindi named entity recognition task. With only five days development time and little knowledge of this language, we automatically discover relevant features by providing a large array of lexical tests and using feature induction to automatically construct the features that most increase conditional likelihood. In an e#ort to reduce overfitting, we use a combination of a Gaussian prior and early-stopping based on the results of 10-fold cross validation
Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition
- Proc. of the 9th Ann. Intl. Conf. on Comput. Biol. (RECOMB
, 2005
"... Abstract. Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. A conditional graphical model, i.e. segmentation conditional random fields (SCRFs), is proposed to solve the problem. In contrast to traditional graphical models su ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
Abstract. Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. A conditional graphical model, i.e. segmentation conditional random fields (SCRFs), is proposed to solve the problem. In contrast to traditional graphical models such as hidden markov model (HMM), SCRFs follow a discriminative approach. It has the flexibility to include overlapping or long-range interaction features over the whole sequence, as well as global optimally solutions for the parameters. On the other hand, the segmentation setting in SCRFs makes its graphical structures intuitively similar to the protein 3-D structures and more importantly, provides a framework to model the long-range interactions directly. Our model is applied to predict the parallel β-helix fold, an important fold in bacterial infection of plants and binding of antigens. The cross-family validation shows that SCRFs not only can score all known βhelices higher than non β-helices in Protein Data Bank, but also demonstrate more success in locating each rung in the known β-helix proteins than BetaWrap, a state-of-the-art algorithm for predicting β-helix fold, and HMMER, a general motif detection algorithm based on HMM. Applying our prediction model to Uniprot database, we hypothesize previously unknown β-helices. 1
Tableseer: Automatic table metadata extraction and searching in digital libraries
- In Technical Report
, 2007
"... Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the limitation of the existing ranking schemes make table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables from documents, extracts tables metadata, indexes and ranks tables, and provides a userfriendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of the table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm – TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, T ableSeer eliminates the burden of manually extract table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
Semi-supervised learning for natural language
- MASTER’S THESIS, MIT
, 2005
"... Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available “for free ” in large quantities. Unlabeled data has shown p ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available “for free ” in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using
iASA: Learning to Annotate the Semantic Web
"... With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotations. Unfortunately, manual annotation is tedious, time consuming and error-prone. In this paper, we propose a tool, called iASA, that ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotations. Unfortunately, manual annotation is tedious, time consuming and error-prone. In this paper, we propose a tool, called iASA, that learns to automatically annotate web documents according to an ontology. iASA is based on the combination of information extraction (specifically, the Similarity-based Rule Learner—SRL) and machine learning techniques. Using linguistic knowledge and optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning efficiently reduces the search space by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotation it proposes. Moreover, our annotation algorithm exploits machine learning methods to correctly select instances and to predict missing instances. Finally, iASA provides an explanation component that explains the nature of the learner and annotator to the user. Explanations can greatly help users understand the rule induction and annotation process, so that they can focus on correcting rules and annotations quickly. Experimental results show that iASA can reach high accuracy quickly.
The importance of the difference in text types to keyword extraction: Evaluating a mechanism
- 7th International Conference on Internet Computing 2006 (ICOMP 2006), Las Vegas
, 2006
"... Abstract- Information exists in every aspect of our life. The expansion of the web has helped to this direction. The web feeds us with enormous information and the widespread use of computers and other hardware appliances has lead us to a state where we have a lot of information in our hands, but ma ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
Abstract- Information exists in every aspect of our life. The expansion of the web has helped to this direction. The web feeds us with enormous information and the widespread use of computers and other hardware appliances has lead us to a state where we have a lot of information in our hands, but many times it is useless. People are not able to find information that they really need but already own. How many times have you tried to find a specific article that you have, or a specific mail that you received, or even an SMS from someone saying something specific. For this reason many information retrieval techniques have been proposed and many information extraction mechanisms have been created. In this paper we will provide the experimental evaluation of a keyword extraction mechanism and how we treat different types of text (news articles, publications, e-mails). This keyword extraction mechanism is a part of a complete system that includes information retrieval, information extraction, categorization and publication of information to a personalized portal.

