Parsing biomedical literature
- In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-05), Jeju Island, Korea, 2005
- Cited by 35 (2 self)
We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1, 2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB training: part-of-speech tags, dictionary collocations, and named entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically adapted parser achieves a 14.2% reduction in error. With oracle knowledge of named entities, this error reduction improves to 21.2%.
Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction
- Proceedings of the 14th International Conference on Inductive Logic Programming (ILP, 2004
- Cited by 21 (6 self)
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. Our research has focused on Information Extraction (IE), a task that typically involves many more negative examples than positive examples. IE is the process of finding facts in unstructured text, such as biomedical journals, and putting those facts in an organized system. In particular, we have focused on learning to recognize instances of the protein-localization relationship in Medline abstracts. We view the problem as a machine-learning task: given positive and negative extractions from a training corpus of abstracts, learn a logical theory that performs well on a held-aside testing set. A common way to measure performance in these domains is to use precision and recall instead of simply using accuracy. We propose Gleaner, a randomized search method that collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an "at least N of these M clauses" thresholding method to combine the selected clauses. We compare Gleaner to ensembles of standard Aleph theories and find that Gleaner produces comparable test-set results in a fraction of the training time needed for ensembles.
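The "at least N of these M clauses" rule and the precision/recall measures mentioned above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: clauses are modelled as plain Python predicates rather than learned first-order clauses, and the clause tests, field names, and toy examples are hypothetical.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Hypothetical stand-in: a "clause" is any boolean test on one candidate extraction.
Clause = Callable[[Dict], bool]


def at_least_n_of_m(clauses: Sequence[Clause], n: int, example: Dict) -> bool:
    """Label the example positive when at least n of the M clauses cover it."""
    return sum(1 for clause in clauses if clause(example)) >= n


def precision_recall(predicted: Iterable[bool], actual: Iterable[bool]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Three toy clauses (M = 3); the features they test are invented for this sketch.
clauses = [
    lambda ex: "localizes" in ex["sentence"],
    lambda ex: ex["distance"] <= 5,
    lambda ex: ex["protein_is_noun"],
]

# (candidate extraction, true label) pairs; all values are made up.
examples = [
    ({"sentence": "proteinA localizes to the membrane", "distance": 3, "protein_is_noun": True}, True),
    ({"sentence": "proteinB was studied near the nucleus", "distance": 4, "protein_is_noun": True}, False),
    ({"sentence": "the protein localizes elsewhere", "distance": 9, "protein_is_noun": False}, False),
    ({"sentence": "proteinC is found in the cytoplasm", "distance": 2, "protein_is_noun": True}, True),
]

labels = [label for _, label in examples]
for n in (1, 2, 3):
    predictions = [at_least_n_of_m(clauses, n, ex) for ex, _ in examples]
    print(n, precision_recall(predictions, labels))
```

Raising N makes the combined classifier stricter, so sweeping N from 1 to M traces points that trade recall for precision on this toy data.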
BioThesaurus: a web-based thesaurus of protein and gene names
- Bioinformatics, 2006
- doi:10.1093/bioinformatics/bti749
Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation
- BMC Bioinformatics, 2005
- Cited by 10 (3 self)
Automatically generating gene summaries from biomedical literature
- In Proceedings of Pacific Symposium on Biocomputing, 2006
- Cited by 9 (4 self)
Biologists often need to find information about genes whose function is not described in the genome databases. Currently they must search disparate biomedical literature to locate relevant articles, and spend considerable effort reading the retrieved articles in order to locate the most relevant knowledge about the gene. We describe our software, the first that automatically generates gene summaries from biomedical literature. We present a two-stage summarization method, which involves first retrieving relevant articles and then extracting the most informative sentences from the retrieved articles to generate a structured gene summary. The generated summary explicitly covers multiple aspects of a gene, such as the sequence information, mutant phenotypes, and molecular interaction with other genes. We propose several heuristic approaches to improve the accuracy in both stages. The proposed methods are evaluated using 10 randomly chosen genes from FlyBase and a subset of Medline abstracts about Drosophila. The results show that the precision of the top selected sentences in the 6 aspects is typically about 50-70%, and the generated summaries are quite informative, indicating that our approaches are effective in automatically summarizing literature information about genes. The generated summaries not only are directly useful to biologists but also serve as useful entry points to enable them to quickly digest the retrieved literature articles.
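The two-stage idea (retrieve relevant articles, then pick informative sentences per aspect) can be sketched as below. This is not the authors' system: the keyword lists, the whole-word gene match, and the keyword-count scoring are all assumptions chosen to keep the example small, and only three of the aspects are shown.

```python
import re
from typing import Dict, List, Set

# Hypothetical aspect keywords; the abstract names aspects but not how they are detected.
ASPECT_KEYWORDS: Dict[str, Set[str]] = {
    "sequence information": {"sequence", "encodes", "domain"},
    "mutant phenotypes": {"mutant", "phenotype", "lethal"},
    "molecular interaction": {"interacts", "binds", "regulates"},
}


def retrieve(gene: str, abstracts: List[str]) -> List[str]:
    """Stage 1: keep abstracts that mention the gene name as a whole word."""
    pattern = re.compile(rf"\b{re.escape(gene)}\b", re.IGNORECASE)
    return [text for text in abstracts if pattern.search(text)]


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(gene: str, abstracts: List[str]) -> Dict[str, str]:
    """Stage 2: per aspect, keep the sentence with the most aspect-keyword hits."""
    sentences = [s for text in retrieve(gene, abstracts) for s in split_sentences(text)]
    summary: Dict[str, str] = {}
    for aspect, keywords in ASPECT_KEYWORDS.items():
        best, best_score = "", 0
        for sentence in sentences:
            tokens = set(re.findall(r"[a-z]+", sentence.lower()))
            score = len(tokens & keywords)
            if score > best_score:
                best, best_score = sentence, score
        if best:  # aspects with no matching sentence are left out
            summary[aspect] = best
    return summary


# Toy corpus with invented gene names.
corpus = [
    "The geneX mutant shows a lethal phenotype. Unrelated sentence here.",
    "GeneX interacts with geneY and binds DNA.",
    "Nothing about the target.",
]
for aspect, sentence in summarize("geneX", corpus).items():
    print(f"{aspect}: {sentence}")
```

Keeping the stages as separate functions mirrors the structure described in the abstract and lets either stage be replaced by a stronger heuristic without touching the other.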
An Ontology-based Approach to Support Text Mining and Information Retrieval in the Biological Domain
- Cited by 8 (5 self)
This paper describes an ontology-based approach that helps biologists annotate their documents and facilitates their information retrieval task. Our approach, based on semantic web technologies, relies on formalised ontologies, semantic annotations of scientific articles and knowledge extraction from texts. We propose a method/system for the generation of ontology-based semantic annotations (MeatAnnot) and a system allowing biologists to draw advanced inferences on these annotations (MeatSearch). This approach was proposed to support biologists working on DNA microarray experiments in the validation and the interpretation of their results, but it can probably be extended to other massive analyses of biological events (as provided by proteomics, metabolomics…).
An application of text categorization methods to gene ontology annotation
- Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2005
- Cited by 8 (1 self)
This paper describes an application of IR and text categorization methods to a highly practical problem in biomedicine, specifically, Gene Ontology (GO) annotation. GO annotation is a major activity in most model organism database projects and annotates gene functions using a controlled vocabulary. As a first step toward automatic GO annotation, we aim to assign GO domain codes given a specific gene and an article in which the gene appears, which is one of the task challenges at the TREC 2004 Genomics Track. We approached the task with careful consideration of the specialized terminology and paid special attention to dealing with various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extracted the words around the gene occurrences and used them to represent the gene for GO domain code annotation. As a classifier, we adopted a variant of k-Nearest Neighbor (kNN) with supervised term weighting schemes to improve the performance, making our method among the top-performing systems in the TREC official evaluation. Moreover, we demonstrate that the proposed framework applies successfully to another task of the Genomics Track, with results comparable to the best-performing system.
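A rough sketch of kNN classification over context words with a supervised term weight follows. The abstract does not say which weighting scheme or kNN variant was used, so the weight below (the largest share of a term's training documents that falls in a single class) and the similarity-weighted vote are assumptions; the token lists are invented, and the labels BP, MF, and CC stand for the three GO domains.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Doc = List[str]  # context words extracted around gene occurrences


def term_weights(train: List[Tuple[Doc, str]]) -> Dict[str, float]:
    """Assumed supervised weight: max over classes of P(class | term)."""
    per_term: Dict[str, Counter] = defaultdict(Counter)
    for tokens, label in train:
        for term in set(tokens):
            per_term[term][label] += 1
    return {t: max(c.values()) / sum(c.values()) for t, c in per_term.items()}


def vectorize(tokens: Doc, weights: Dict[str, float]) -> Dict[str, float]:
    """Term frequency times supervised weight; unseen terms are dropped."""
    counts = Counter(tokens)
    return {t: n * weights[t] for t, n in counts.items() if t in weights}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(tokens: Doc, train: List[Tuple[Doc, str]],
                weights: Dict[str, float], k: int = 3) -> str:
    """Similarity-weighted vote among the k most similar training documents."""
    query = vectorize(tokens, weights)
    scored = sorted(((cosine(query, vectorize(doc, weights)), label)
                     for doc, label in train), reverse=True)
    votes: Counter = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy training set; "protein" appears in every class, so it gets a low weight.
train = [
    (["kinase", "activity", "binding", "protein"], "MF"),
    (["binding", "atp", "activity", "protein"], "MF"),
    (["membrane", "nucleus", "localization", "protein"], "CC"),
    (["nucleus", "membrane", "complex", "protein"], "CC"),
    (["signaling", "pathway", "development", "protein"], "BP"),
    (["development", "pathway", "regulation", "protein"], "BP"),
]
weights = term_weights(train)
print(knn_predict(["membrane", "nucleus", "complex", "protein"], train, weights))
```

The point of the supervised weight is that a term spread evenly across classes contributes little to the similarity, while a class-specific term dominates it.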
Gleaner: Creating Ensembles of First-order Clauses to Improve Recall-Precision Curves
- Machine Learning, 2006
- Cited by 6 (5 self)
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. A common way to measure performance in these domains is to use precision and recall instead of simply using accuracy. The goal of our research is to find new approaches within ILP particularly suited for large, highly-skewed domains. We propose Gleaner, a randomized search method that collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an “at least L of these K clauses” thresholding method to combine sets of selected clauses. Our research focuses on Multi-Slot Information Extraction (IE), a task that typically involves many more negative examples than positive examples. We cast this problem as a relational domain, using two large testbeds involving the extraction of important relations from the abstracts of biomedical journal articles. We compare Gleaner to ensembles of standard theories learned by Aleph, finding that Gleaner produces comparable test-set results in a fraction of the training time.
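Collecting clauses "from a broad spectrum of points along the recall dimension" can be pictured as binning candidate clauses by their recall and keeping one per bin. The sketch below is only an illustration of that idea: the equal-width bins, the choice to keep the highest-precision clause in each bin, and the scored candidates themselves are assumptions, not details taken from the abstract.

```python
from typing import Dict, List, NamedTuple


class ScoredClause(NamedTuple):
    name: str         # hypothetical identifier for a candidate clause
    precision: float  # measured on training data
    recall: float     # measured on training data


def collect_by_recall_bin(candidates: List[ScoredClause], bins: int) -> Dict[int, ScoredClause]:
    """Keep, for each equal-width recall bin, the candidate with the highest precision."""
    kept: Dict[int, ScoredClause] = {}
    for clause in candidates:
        # recall == 1.0 would index one past the end, so clamp to the last bin
        index = min(int(clause.recall * bins), bins - 1)
        current = kept.get(index)
        if current is None or clause.precision > current.precision:
            kept[index] = clause
    return kept


# Invented candidates, as if produced by a randomized clause search.
candidates = [
    ScoredClause("a", 0.9, 0.05),
    ScoredClause("b", 0.8, 0.10),
    ScoredClause("c", 0.6, 0.30),
    ScoredClause("d", 0.5, 0.45),
    ScoredClause("e", 0.4, 0.80),
    ScoredClause("f", 0.2, 1.00),
]
for index, clause in sorted(collect_by_recall_bin(candidates, bins=4).items()):
    print(index, clause.name, clause.precision, clause.recall)
```

Keeping one clause per recall band, instead of the single best clause overall, is what gives the later thresholding step a pool that spans the recall-precision curve.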
Learning to Extract Genic Interactions using Gleaner
- Proceedings of the Learning Language in Logic 2005 Workshop at the International Conference on Machine Learning, 2005
- Cited by 5 (2 self)
We explore the application of Gleaner, an Inductive Logic Programming approach to learning in highly-skewed domains, to the Learning Language in Logic 2005 biomedical information-extraction challenge task. We create and describe a large number of background knowledge predicates suited for this task. We find that Gleaner outperforms standard Aleph theories with respect to recall and that additional linguistic background knowledge improves recall.
Mining semantically related terms from biomedical literature
- ACM Transactions on Asian Language Information Processing (TALIP
- Cited by 4 (2 self)
Discovering links and relationships is one of the main challenges in biomedical research, as scientists are interested in uncovering entities that have similar functions, take part in the same processes, or are coregulated. This article discusses the extraction of such semantically related entities (represented by domain terms) from biomedical literature. The method combines various text-based aspects, such as lexical, syntactic, and contextual similarities between terms. Lexical similarities are based on the degree to which terms share word constituents. Syntactic similarities rely on expressions (such as term enumerations and conjunctions) in which a sequence of terms appears as a single syntactic unit. Finally, contextual similarities are based on automatic discovery of relevant contexts shared among terms. The approach is evaluated using the Genia resources, and the results of experiments are presented. Lexical and syntactic links have shown high precision and low recall, while contextual similarities have resulted in significantly higher recall with moderate precision. By combining the three metrics, we achieved F-measures of 68% for semantically related terms and 37% for highly related entities.
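As a small illustration of a lexical similarity based on shared word constituents, of combining several similarity scores, and of the F-measure used for evaluation, consider the sketch below. The Dice coefficient over word sets and the equal-weight linear combination are assumptions (the abstract gives neither formula); the F-measure is the standard balanced harmonic mean of precision and recall.

```python
from typing import Sequence


def lexical_similarity(term_a: str, term_b: str) -> float:
    """Dice coefficient over the sets of words making up each term (assumed measure)."""
    words_a = set(term_a.lower().split())
    words_b = set(term_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def combine(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted linear combination of similarity scores (assumed combination rule)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two terms sharing two of their word constituents.
lexical = lexical_similarity("nuclear receptor", "orphan nuclear receptor")
print(lexical)

# Hypothetical syntactic and contextual scores for the same term pair.
print(combine([lexical, 0.0, 0.5], [1 / 3, 1 / 3, 1 / 3]))

# High precision with low recall still yields a modest F-measure.
print(f_measure(0.9, 0.2))
```

Because the F-measure is a harmonic mean, a metric with high precision but low recall scores poorly on its own, which is consistent with combining it with a higher-recall metric.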

