Learning to extract proteins and their interactions from medline abstracts. Available from http://www.cs.utexas.edu/users/ml/publication/ie.html (2002)

by R Bunescu, R Ge, R J Kate, E M Marcotte, R J Mooney, A K Ramani, Y W Wong

Results 1 - 9 of 9

Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods

by William W. Cohen - Proceedings of KDD 2004, 2004
Cited by 56 (5 self)
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries; more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
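The segment-level idea in this abstract can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: it runs a Viterbi-style search over segments of up to `max_len` words and scores each candidate entity segment by its similarity to a dictionary entry. The scoring function, its hand-set weights, the use of `difflib` for similarity, and the example dictionary are all assumptions. The sketch also ignores dependencies between adjacent segment labels, which a trained model would normally include.

```python
from difflib import SequenceMatcher


def dictionary_similarity(candidate, dictionary):
    """Best string similarity between a candidate name and any dictionary entry."""
    return max(
        SequenceMatcher(None, candidate.lower(), entry.lower()).ratio()
        for entry in dictionary
    )


def segment_score(words, label, dictionary):
    """Hypothetical hand-set score; a real system would learn feature weights."""
    if label == "O":
        # Non-entity segments are restricted to single words with a neutral score.
        return 0.0 if len(words) == 1 else float("-inf")
    # Entity segments are scored as a whole, so entity-level features such as
    # dictionary similarity can be applied to the entire candidate name.
    similarity = dictionary_similarity(" ".join(words), dictionary)
    return 2.0 * similarity - 1.0  # positive only when similarity exceeds 0.5


def semi_markov_segment(tokens, dictionary, max_len=4):
    """Return the highest-scoring segmentation as (text, label) pairs."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for tokens[:i]
    back = [None] * (n + 1)           # back[i]: (start, label) of last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in ("O", "ENTITY"):
                score = best[start] + segment_score(
                    tokens[start:end], label, dictionary
                )
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Follow the back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((" ".join(tokens[start:end]), label))
        end = start
    return segments[::-1]


if __name__ == "__main__":
    example = "talk by William Cohen today".split()
    print(semi_markov_segment(example, ["William W. Cohen"]))
```

Because the score is computed per segment, a two-word candidate can be compared to the dictionary as a unit, which a word-by-word classifier cannot do directly.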

Extracting Web data using instance-based learning

by Yanhong Zhai, Bing Liu - In WISE-05, 2005
Cited by 12 (1 self)
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates and pages of the same template usually can be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It outperforms the state-of-the-art existing systems significantly.

Hierarchical Text Categorization and Its Application to Bioinformatics

by Svetlana Kiritchenko, 2005
Cited by 7 (0 self)
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification that makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework of converting a conventional “flat” learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicate that the proposed approach significantly outperforms the corresponding “flat” as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number
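One way to picture consistency over a DAG class hierarchy is sketched below in Python. The abstract does not define the term, so the sketch assumes that a label set is consistent when it contains every ancestor of each label it includes. The `parents` mapping and the category names are hypothetical.

```python
def ancestors(label, parents):
    """All ancestors of a label in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(label, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def make_consistent(labels, parents):
    """Extend a label set with every ancestor of every label in it."""
    closed = set(labels)
    for label in labels:
        closed |= ancestors(label, parents)
    return closed


def is_consistent(labels, parents):
    """True if the label set already contains all of its ancestors."""
    return make_consistent(labels, parents) == set(labels)


if __name__ == "__main__":
    # Hypothetical hierarchy; "receptor kinase" has two parents (a DAG, not a tree).
    parents = {
        "kinase": ["enzyme"],
        "enzyme": ["protein"],
        "receptor": ["protein"],
        "receptor kinase": ["kinase", "receptor"],
    }
    print(sorted(make_consistent({"receptor kinase"}, parents)))
    print(is_consistent({"kinase"}, parents))
```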

Learning to extract gene-protein names from weakly-labeled text

by Richard C. Wang, Anthony Tomasic, Robert E. Frederking, Isaac Simmons, William W. Cohen - In preparation, 2006
Cited by 4 (2 self)
Training a named entity recognizer (NER) has always been a difficult task due to the effort required to generate a significant amount of annotated training data. In this paper, we reduce or eliminate the effort required to create training data by automatically converting other sources of data into annotated training data. The performance of this approach is tested on a gene-protein name extractor by using the mouse and fly data obtained from the BioCreAtIvE challenge. Results show that our methods are effective and that our trained NER system outperforms all of our baseline results.

Analyzing gene relationships for Down syndrome with labeled transition graphs

by Neha Rungta, Hyrum Carroll, Eric G Mercer, All J. Roper, Mark Clement, Quinn Snell - in Proceedings of Formal Methods in Computer Aided Design (FMCAD), 2007
Cited by 3 (3 self)
The relationship between changes in gene expression and physical characteristics associated with Down syndrome is not well understood. Chromosome 21 genes interact with non-chromosome 21 genes to produce Down syndrome characteristics. This indirect influence, however, is difficult to empirically define due to the number, size, and complexity of the involved gene regulatory networks. This work links chromosome 21 genes to non-chromosome 21 genes known to interact in a Down syndrome phenotype through a reachability analysis of labeled transition graphs extracted from published gene regulatory network databases. The analysis provides new relations in a recently discovered link between a specific gene and Down syndrome phenotype. This type of formal analysis helps scientists direct empirical studies to unravel chromosome 21 gene interactions with the hope for therapeutic intervention.
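Reachability over a labeled transition graph can be illustrated with a short breadth-first search in Python. This is a generic sketch and not the authors' tool: the graph encoding (`node -> [(label, target), ...]`), the edge labels, and the placeholder gene names are assumptions and carry no biological meaning.

```python
from collections import deque


def reachable(graph, sources):
    """Map each node reachable from any source to one shortest labeled path.

    graph maps a node to a list of (label, target) edges; each path is a list
    of (node, label, target) steps starting at a source.
    """
    paths = {source: [] for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for label, target in graph.get(node, []):
            if target not in paths:  # first visit gives a shortest path
                paths[target] = paths[node] + [(node, label, target)]
                queue.append(target)
    return paths


if __name__ == "__main__":
    # Placeholder network: names and interactions are invented for illustration.
    graph = {
        "chr21_gene_A": [("activates", "gene_B")],
        "gene_B": [("represses", "gene_C")],
        "gene_D": [("activates", "gene_C")],
    }
    for node, path in reachable(graph, ["chr21_gene_A"]).items():
        print(node, path)
```

Keeping the edge labels on the returned path preserves the kind of interaction at each step, which is what distinguishes this from plain graph reachability.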

ProtChew: Automatic Extraction of Protein Names from

by Amund Tveit, Rune Sætre, Astrid Lægreid, Tonje Strømmen Steigedal - In Proceedings of the International Workshop on Biomedical Data Engineering (BMDE 2005, in conjunction with ICDE 2005), 2005
Cited by 2 (2 self)
With the increasing amount of biomedical literature, there is a need for automatic extraction of information to support biomedical researchers. Due to incomplete biomedical information databases, the extraction is not straightforward using dictionaries, and several approaches using contextual rules and machine learning have previously been proposed. Our work is inspired by the previous approaches, but is novel in the sense that it is fully automatic and doesn’t rely on expert-tagged corpora. The main ideas are 1) unigram tagging of corpora using known protein names for training examples for the protein name extraction classifier and 2) tight positive and negative examples by having protein-related words as negative examples and protein names/synonyms as positive examples. We present preliminary results on Medline abstracts about gastrin; further work will be on testing the approach on BioCreative benchmark data sets.

Using natural language processing and the gene ontology to populate a structured pathway database

by David Dehoney, Rachel Harte, Yan Lu, Daniel Chin - IEEE CSB’03 Poster paper
Cited by 1 (0 self)
Reading literature is one of the most time-consuming tasks a busy scientist has to contend with. As the volume of literature continues to grow, there is a need to sort through this information in a more efficient manner. Mapping the pathways of genes and proteins of interest is one goal that requires frequent reference to the literature. Pathway databases can help here, and scientists currently have a choice between buying access to externally curated pathway databases or building their own in-house. However, such databases are either expensive to license or slow to populate manually. Building upon easily available, open-source tools, we have developed a pipeline to automate the collection, structuring and storage of gene and protein interaction data from the literature. As a team of both biologists and computer scientists, we integrated our natural language processing (NLP) software with the Gene Ontology (GO) to collect and translate unstructured text data into structured interaction data. For NLP we used a machine learning approach with a rule induction program, RAPIER

Construction of Gene Correlation Networks and Text Classification via Biomedical Literature Mining

by George Potamias, Despoina Antonakaki, Ros Kanterakis
Automatic extraction of information from biomedical texts has become a necessity given the massive and growing amount of relevant scientific literature. A special feature that makes this task more challenging is the over-abundance and heterogeneity of the relevant gene/protein terminology. In this paper we introduce a novel term-identification process and propose an effective data structure based on TRIE trees. It enables the storage of millions of biomedical terms and reflects their semantic relations in a compressed and memory-efficient way. Gene-Gene and Gene-Disease correlations are induced using the entropic Mutual Information Measure. Moreover, we introduce a novel text classification process that utilizes the term-identification process and a novel similarity matching metric. The induced correlation networks reveal valuable biomedical information. Text classification results exhibit high accuracy figures in the range of 90 to 97.5%, indicating the reliability of the whole approach.
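A trie-backed term lookup of the kind this abstract mentions might look like the Python sketch below. The abstract does not describe the authors' data structure in detail, so the design here is an assumption: the trie is keyed on whitespace-separated tokens, terms are matched longest-first, and the term identifiers (`T1`, `T2`) are hypothetical.

```python
_END = object()  # sentinel key marking the end of a stored term


class TermTrie:
    """Token-level trie for storing and spotting multi-word terms."""

    def __init__(self):
        self.root = {}

    def add(self, term, term_id):
        """Insert a term; terms that share leading tokens share trie nodes."""
        node = self.root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[_END] = term_id

    def find_terms(self, text):
        """Scan left to right and return the longest stored term at each position."""
        tokens = text.lower().split()
        found = []
        i = 0
        while i < len(tokens):
            node, match = self.root, None
            for j in range(i, len(tokens)):
                node = node.get(tokens[j])
                if node is None:
                    break
                if _END in node:
                    match = (j + 1, node[_END])  # remember the longest match so far
            if match:
                end, term_id = match
                found.append((" ".join(tokens[i:end]), term_id))
                i = end
            else:
                i += 1
        return found


if __name__ == "__main__":
    trie = TermTrie()
    trie.add("tumor necrosis factor", "T1")
    trie.add("tumor", "T2")
    print(trie.find_terms("the tumor necrosis factor gene and tumor growth"))
```

Shared prefixes are stored once, which is the source of the compression the abstract refers to; a production system would also need tokenization that handles punctuation and spelling variants.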

Semi-Markov Models for Named Entity Recognition

by Sunita Sarawagi, William W. Cohen, Zhenzhen Kou
We described semi-Markov models, which relax the usual Markov assumptions made in hidden Markov models. Semi-Markov models classify segments of adjacent words, rather than single words. We proposed two training strategies for semi-Markov models: discriminative training and generative training. Importantly, features for semi-Markov models can measure properties of segments, and transitions within a segment can be non-Markovian. This formalism can incorporate information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. We applied semi-Markov models to named entity recognition (NER) problems and in experiments of