Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
, 2004
Cited by 55 (7 self)
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
ProbLog: a probabilistic Prolog and its application in link discovery
- In Proceedings of 20th International Joint Conference on Artificial Intelligence
, 2007
Cited by 51 (11 self)
We introduce ProbLog, a probabilistic extension of Prolog. A ProbLog program defines a distribution over logic programs by specifying for each clause the probability that it belongs to a randomly sampled program, and these probabilities are mutually independent. The semantics of ProbLog is then defined by the success probability of a query, which corresponds to the probability that the query succeeds in a randomly sampled program. The key contribution of this paper is the introduction of an effective solver for computing success probabilities. It essentially combines SLD-resolution with methods for computing the probability of Boolean formulae. Our implementation further employs an approximation algorithm that combines iterative deepening with binary decision diagrams. We report on experiments in the context of discovering links in real biological networks, a demonstration of the practical usefulness of the approach.
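The success-probability semantics described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's solver: it enumerates every possible world by brute force instead of using SLD-resolution and binary decision diagrams, and the graph, its probabilities, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical probabilistic facts: each directed edge is present
# independently with the given probability.
EDGES = {("a", "b"): 0.8, ("b", "c"): 0.6, ("a", "c"): 0.5}


def reachable(edges, src, dst):
    """Return True if dst can be reached from src using the given edges."""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append(v)
    return False


def success_probability(prob_edges, src, dst):
    """Sum the probabilities of all sampled programs in which the query succeeds."""
    items = list(prob_edges.items())
    total = 0.0
    # Each mask selects one sampled program (a subset of the edges).
    for mask in product([False, True], repeat=len(items)):
        weight = 1.0
        world = []
        for keep, (edge, p) in zip(mask, items):
            weight *= p if keep else 1.0 - p
            if keep:
                world.append(edge)
        if reachable(world, src, dst):
            total += weight
    return total


result = success_probability(EDGES, "a", "c")
print(round(result, 4))
```

Enumeration is exponential in the number of probabilistic facts, which is the cost an effective solver of the kind the abstract describes is meant to avoid.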
Mining knowledge from text using information extraction
- SIGKDD Explorations
, 2005
Cited by 24 (0 self)
An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
Extracting phenotypic information from the literature via natural language processing
- Medinfo 2004
Cited by 14 (3 self)
In recent years, the amount of biomedical knowledge has been increasing exponentially. Several Natural Language Processing (NLP) systems have been developed to help researchers extract, encode and organize new information automatically from textual literature or narrative reports. Some of these systems focus on extracting biological entities or molecular interactions while others retrieve and encode clinical information. To exploit gene functions in the postgenome era, it is necessary to extract phenotypic information automatically from the literature as well. However, few NLP projects have focused on this. We present the development of a system called BioMedLEE that extracts a broad variety of phenotypic information from the biomedical literature. The system was developed by adapting MedLEE, an existing clinical information extraction NLP engine. A feasibility evaluation study of BioMedLEE was performed using 300 randomly chosen journal titles. Results showed that experts achieved an average precision rate of 65.4% (95% CI: [58.0%, 72.8%]) and a recall rate of 73.0% (95% CI: [66.2%, 80.0%]). BioMedLEE had 64.0% precision and 77.1% recall, respectively, according to expert agreements. Keywords: natural language processing, text mining, data mining, phenotypic information extraction
Mining Concept Profiles with the Vector Model or Where on Earth are Diseases being Studied?
- In: Proceedings of Text Mining Workshop. Third SIAM International Conference on Data Mining
, 2003
Cited by 11 (3 self)
In this research we study the value of concept exploration, a function offered in our text mining prototype. This function, implemented using the vector space model, allows one to build a profile for a given concept. This profile is derived from the text collection being mined. The function may be used to build profiles for concepts that are as complex or as simple as the user desires. In this paper, we apply this function towards studying trends in disease research. Profiles are built with diseases as concepts and by mining the MEDLINE database. Disease research trends are compared with disease prevalence trends. The study indicates that text mining may offer a useful option for current efforts at estimating global epidemiological data. More generally, this research demonstrates the application of text mining and concept exploration.
Learning to Extract Proteins and their Interactions from Medline Abstracts
- In: ICML-2003 Workshop on Machine Learning in Bioinformatics
, 2003
Cited by 10 (0 self)
We present results from a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
Extraction of gene-disease relations from Medline using domain dictionaries and machine learning
- Proc. PSB 2006
, 2006
Cited by 10 (0 self)
We describe a system that extracts disease-gene relations from MedLine. We constructed a dictionary for disease and gene names from six public databases and extracted relation candidates by dictionary matching. Since dictionary matching produces a large number of false positives, we developed a method of machine learning-based named entity recognition (NER) to filter out false recognitions of disease/gene names. We found that the performance of relation extraction is heavily dependent upon the performance of NER filtering and that the filtering improves the precision of relation extraction by 26.7% at the cost of a small reduction in recall.
Using annotations from controlled vocabularies to find meaningful associations
- In Proc. 4th Int. Workshop on Data Integration in the Life Sciences
, 2007
Cited by 5 (2 self)
This paper presents the LSLink (or Life Science Link) methodology that provides users with a set of tools to explore the rich Web of interconnected and annotated objects in multiple repositories, and to identify meaningful associations. Consider a physical link between objects in two repositories, where each of the objects is annotated with controlled vocabulary (CV) terms from two ontologies. Using a set of LSLink instances generated from a background dataset of knowledge, we identify associations between pairs of CV terms that are potentially significant and may lead to new knowledge. We develop an approach based on the logarithm of the odds (LOD) to determine a confidence and support in the associations between pairs of CV terms. Using a case study of Entrez Gene objects annotated with GO terms linked to PubMed objects annotated with MeSH terms, we describe a user validation and analysis task to explore potentially significant associations.
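The abstract does not give the exact LOD definition the authors use, so the following is only a rough sketch of one generic formulation: a smoothed log odds ratio over a 2x2 table of linked object pairs, with the raw co-annotation count standing in for support. The counts, the smoothing constant, and the function name are all assumptions made for illustration.

```python
import math


def log_odds(both, only_first, only_second, neither, smoothing=0.5):
    """Smoothed natural-log odds ratio for a 2x2 co-annotation table.

    both:        linked pairs annotated with both CV terms
    only_first:  pairs carrying only the first term
    only_second: pairs carrying only the second term
    neither:     pairs carrying neither term
    """
    a = both + smoothing
    b = only_first + smoothing
    c = only_second + smoothing
    d = neither + smoothing
    return math.log((a * d) / (b * c))


# Hypothetical counts for one pair of CV terms (e.g. a GO term and a MeSH term).
counts = {"both": 30, "only_first": 10, "only_second": 20, "neither": 940}

score = log_odds(**counts)   # positive when the terms co-occur more than chance
support = counts["both"]     # number of linked pairs backing the association
print(round(score, 3), support)
```

A score near zero suggests the two terms are annotated independently; ranking term pairs by score while requiring a minimum support is one plausible way to shortlist associations for user validation.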
Dragon TF Association Miner: A system for exploring transcription factor associations through text-mining
- Web Server issue
, 2004
Cited by 4 (0 self)
We present Dragon TF Association Miner (DTFAM), a system for text-mining of PubMed documents for potential functional association of transcription factors (TFs) with terms from Gene Ontology (GO) and with diseases. DTFAM has been trained and tested in the selection of relevant documents on a manually curated dataset containing >3000 PubMed abstracts relevant to transcription control. On our test data the system achieves sensitivity of 80% with specificity of 82%. DTFAM provides comprehensive tabular and graphical reports linking terms to relevant sets of documents. These documents are color-coded for easier inspection. DTFAM complements the existing biological resources by collecting, assessing, extracting and presenting associations that can reveal some of the not so easily observable connections among the entities found, which could explain the functions of TFs and help decipher parts of gene transcriptional regulatory networks. DTFAM summarizes information from a large volume of documents, saving time and making analysis simpler for individual users. DTFAM is freely available for academic and non-profit users at
GenesTrace: phenomic knowledge discovery via structured terminology
- Pac. Symp. Biocomput
, 2005
Cited by 4 (1 self)
The era of applied genomic medicine is quickly approaching accompanied by the increasing availability of detailed genetic information. Understanding the genetic etiology behind complex, multi-gene diseases remains an important challenge. In order to uncover the putative genetic etiology of complex diseases, we designed a method that explores the relationships between two major terminological and ontological resources: the Unified Medical Language System (UMLS) and the Gene Ontology (GO). The UMLS has a mainly clinical emphasis; Gene Ontology has become the standard for biological annotations of genes and gene products. Using statistical and semantic relationships within and between the two resources, we are able to infer relationships between disease concepts in the UMLS and gene products annotated using GO and its associated databases. We validated our inferences by comparing them to the known gene-disease relationships, as defined in the Online Mendelian Inheritance in Man’s morbidmap (OMIM). The proof-of-concept methods presented here are unique in that they bypass the ambiguity of the direct extraction of gene or disease term from MEDLINE. Additionally, our methods provide direct links to clinically significant diseases through established terminologies or ontologies. The preliminary results presented here indicate the potential utility of exploiting the existing, manually curated relationships in biomedical resources as a tool for the discovery of potentially valuable new gene-disease relationships. The GenesTrace system may be accessed at the following URL:

