Collective Information Extraction with Relational Markov Networks
, 2004
Cited by 52 (4 self)
Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks (a generalization of CRFs), which can represent arbitrary dependencies between extractions. This allows for “collective information extraction” that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
Extracting Synonymous Gene and Protein Terms From Biological Literature
, 2003
Cited by 34 (0 self)
Motivation: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance.
Effective adaptation of a hidden Markov model-based named entity recognizer for biomedical domain
- In: Proceedings of NLP in Biomedicine, ACL
, 2003
Cited by 23 (4 self)
In this paper, we explore how to adapt a general Hidden Markov Model-based named entity recognizer effectively to the biomedical domain. We integrate various features, including simple deterministic features, morphological features, POS features and semantic trigger features, to capture various kinds of evidence, especially for biomedical named entities, and evaluate their contributions. We also present a simple algorithm to solve the abbreviation problem and a rule-based method to deal with the cascaded phenomena in the biomedical domain. Our experiments on GENIA V3.0 and GENIA V1.1 achieve F-measures of 66.1 and 62.5, respectively, which outperform the previous best published results by 8.1 F-measure when using the same training and testing data.
Text Mining with Information Extraction
- AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases
, 2002
Cited by 21 (0 self)
The popularity of the Web and the large number of documents available in electronic form have motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other
Exploring the boundaries: Gene and protein identification in biomedical text
- In Proceedings of the BioCreative Workshop
, 2004
Cited by 15 (5 self)
Background: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools. Methods: We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts. Results: This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the “open” evaluation and a precision of 0.78 and recall of 0.85 in the “closed” evaluation. Conclusions: Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches. Background The explosion of information in the biomedical domain and particularly in genetics has highlighted the need for automated text information extraction techniques. MEDLINE, the primary research database serving the biomedical community, currently contains over 14 million abstracts, with 60,000 new abstracts appearing each month. There is also an impressive number of molecular biological databases covering an
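The precision and recall figures quoted in this entry can be combined into a single balanced F-measure using the standard harmonic-mean formula. The sketch below is an illustration only: the function name `f_measure` and the `reported` dictionary are assumptions introduced here, and the resulting scores are derived from the abstract's numbers rather than quoted from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both inputs are zero, so the score is defined as zero
    return (1 + beta ** 2) * precision * recall / denominator


# (precision, recall) pairs as stated in the abstract above.
reported = {"open": (0.83, 0.84), "closed": (0.78, 0.85)}

# Derived values, not figures reported by the authors.
f1_scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Under these inputs the balanced F-measure comes out at roughly 0.835 for the open evaluation and 0.813 for the closed one.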
A supervised learning approach to acronym identification
- In 8th Canadian Conference on Artificial Intelligence (AI’2005) (LNAI 3501)
, 2005
Cited by 15 (1 self)
This paper addresses the task of finding acronym-definition pairs in text. Most of the previous work on the topic is about systems that involve manually generated rules or regular expressions. In this paper, we present a supervised learning approach to the acronym identification task. Our approach reduces the search space of the supervised learning system by putting some weak constraints on the kinds of acronym-definition pairs that can be identified. We obtain results comparable to hand-crafted systems that use stronger constraints. We describe our method for reducing the search space, the features used by our supervised learning system, and our experiments with various learning schemes.
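The idea of pruning the search space with weak constraints before a learned classifier scores the candidates can be sketched as follows. The abstract does not say which constraints the authors use, so everything here is hypothetical: the function name `candidate_pairs`, the parenthesis pattern, the length limits, and the first-letter rule are assumptions chosen only to show the general shape of such a candidate generator.

```python
import re


def candidate_pairs(sentence: str, max_extra_words: int = 2) -> list[tuple[str, str]]:
    """Generate (acronym, definition) candidates under assumed weak constraints.

    Hypothetical constraints, not taken from the paper:
      * the acronym is a short token in parentheses (2 to 10 characters);
      * the definition is a run of words ending right before the parenthesis;
      * the definition has at most len(acronym) + max_extra_words words;
      * the definition's first word starts with the acronym's first letter.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", sentence):
        acronym = match.group(1)
        preceding = sentence[:match.start()].split()
        max_len = min(len(preceding), len(acronym) + max_extra_words)
        for n in range(1, max_len + 1):
            window = preceding[-n:]
            if window[0][0].lower() == acronym[0].lower():
                pairs.append((acronym, " ".join(window)))
    return pairs


example = "We use conditional random fields (CRFs) for tagging."
print(candidate_pairs(example))
```

In a supervised setup like the one described, each surviving candidate would then be turned into a feature vector and passed to a classifier; that step is omitted here because the abstract does not specify the features.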
A system for identifying named entities in biomedical text: How results from two evaluations reflect on both the system and the evaluations
- In: The 2004 BioLink meeting at ISMB
, 2005
Relational Markov Networks for Collective Information Extraction
- In: Relational Learning and Its Connection to Other Fields (SRL-2004)
, 2004
Cited by 13 (0 self)
Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks, which can represent arbitrary dependencies between extractions. This allows for “collective information extraction” that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
Distribution of information in biomedical abstracts and full-text publications
- BIOINFORMATICS
, 2004
Cited by 13 (0 self)
Motivation: Full-text documents potentially hold more information than their abstracts, but require more resources for processing. We investigated the added value of full text over abstracts in terms of information content and occurrences of gene symbol—gene name combinations that can resolve gene-symbol ambiguity. Results: We analyzed a set of 3902 biomedical full-text articles. Different keyword measures indicate that information density is highest in abstracts, but that the information coverage in full texts is much greater than in abstracts. Analysis of five different standard sections of articles shows that the highest information coverage is located in the results section. Still, 30–40 % of the information mentioned in each section is unique to that section. Only 30 % of the gene symbols in the abstract are accompanied by their corresponding names, and a further 8 % of the gene names are found in the full text. In the full text, only 18 % of the gene symbols are accompanied by their gene names.
Sarad: A simple and robust abbreviation dictionary
- Bioinformatics
, 2004
Cited by 10 (1 self)
Motivation: Due to recent interest in the use of textual material to augment traditional experiments it has become necessary to automatically cluster, classify, and filter natural language information. Results: The Simple and Robust Abbreviation Dictionary (SaRAD) provides an easy to implement, high performance tool for the construction of a biomedical symbol dictionary. The algorithms, applied to the MEDLINE document set, result in a high quality dictionary and toolset to disambiguate abbreviation symbols automatically. Availability: The SaRAD dictionary is available as a web based demonstration, and in pseudo-code form. Contact:

