Results 1 - 10 of 126
Inter-species normalization of gene mentions with GNAT
- Bioinformatics, Vol. 24 (ECCB 2008), pages i126–i132
, 2008
"... Motivation: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers ..."
Abstract
-
Cited by 59 (7 self)
- Add to MetaCart
Motivation: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. Results: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. Availability: A web-frontend is available at
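As a quick consistency check (using the standard F1 definition, which the abstract itself does not spell out): F1 = 2PR/(P + R) = (2 × 0.908 × 0.738)/(0.908 + 0.738) ≈ 0.814, matching the reported 81.4%; the same formula also reproduces the precision/recall/F figures quoted for GENO below.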
High-performance gene name normalization with GENO
- Bioinformatics 25: 815–821. Available: http://www.ncbi.nlm.nih.gov/pubmed/19188193
, 2009
"... Motivation: The recognition and normalization of textual mentions of gene and protein names is both particularly important and challenging. Its importance lies in the fact that they constitute the crucial conceptual entities in biomedicine. Their recognition and normalization remains a challenging t ..."
Abstract
-
Cited by 39 (1 self)
- Add to MetaCart
(Show Context)
Motivation: The recognition and normalization of textual mentions of gene and protein names is both particularly important and challenging. Its importance lies in the fact that they constitute the crucial conceptual entities in biomedicine. Their recognition and normalization remains a challenging task because of widespread gene name ambiguities within species, across species, with common English words, and with medical sublanguage terms. Results: We present GENO, a highly competitive system for gene name normalization, which obtains an F-measure performance of 86.4% (precision: 87.8%, recall: 85.0%) on the BIOCREATIVE-II test set, thus being on a par with the best system on that task. Our system tackles the complex gene normalization problem by employing a carefully crafted suite of symbolic and statistical methods, and by fully relying on publicly available software and data resources, including extensive background knowledge based on semantic profiling. A major goal of our work is to present GENO’s architecture in a lucid and perspicuous way to pave the way to full reproducibility of our results. Availability: GENO, including its underlying resources, will be available from www.julielab.de. It is also currently deployed in the SEMEDICO search engine at www.semedico.org. Contact:
The GNAT library for local and remote gene mention normalization
, 2011
"... Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for b ..."
Abstract
-
Cited by 20 (4 self)
- Add to MetaCart
Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the GNAT Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of GNAT achieves a Tap-20 score of 0.1987. Availability: The library and web services are implemented in Java and the sources are available from
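The abstract positions GNAT as a reusable component for larger text-mining pipelines. Below is a minimal Java sketch of what such an integration point might look like; the interface, record and method names (GeneMentionNormalizer, GeneMention, normalize, loadNormalizer) are hypothetical placeholders for illustration, not GNAT's actual API.

// Hypothetical integration sketch; the names below are NOT GNAT's real API.
import java.util.List;

interface GeneMentionNormalizer {
    // Return recognized gene mentions mapped to Entrez Gene identifiers.
    List<GeneMention> normalize(String documentText);
}

record GeneMention(String surfaceForm, int begin, int end, String entrezGeneId) {}

public class GeneNormalizationPipelineSketch {
    public static void main(String[] args) {
        GeneMentionNormalizer normalizer = loadNormalizer();
        String text = "p53 regulates apoptosis in human cells.";
        for (GeneMention m : normalizer.normalize(text)) {
            System.out.printf("%s [%d..%d] -> EntrezGene:%s%n",
                    m.surfaceForm(), m.begin(), m.end(), m.entrezGeneId());
        }
    }

    // Placeholder factory; a real pipeline would construct and configure the
    // library's normalizer (dictionaries, species filters, etc.) at this point.
    private static GeneMentionNormalizer loadNormalizer() {
        return documentText -> List.of(new GeneMention("p53", 0, 3, "7157"));
    }
}

The point is only the shape of the contract: document text in, (mention, offsets, database identifier) tuples out, which is what downstream indexing or event-extraction components consume.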
Static Relations: a Piece in the Biomedical Information Extraction Puzzle
, 2009
"... We propose a static relation extraction task to complement biomedical information extraction approaches. We argue that static relations such as part-whole are implicitly involved in many common extraction settings, define a task setting making them explicit, and discuss their integration into previo ..."
Abstract
-
Cited by 20 (11 self)
- Add to MetaCart
(Show Context)
We propose a static relation extraction task to complement biomedical information extraction approaches. We argue that static relations such as part-whole are implicitly involved in many common extraction settings, define a task setting making them explicit, and discuss their integration into previously proposed tasks and extraction methods. We further identify a specific static relation extraction task motivated by the BioNLP’09 shared task on event extraction, introduce an annotated corpus for the task, and demonstrate the feasibility of the task by experiments showing that the defined relations can be reliably extracted. The task setting and corpus can serve to support several forms of domain information extraction.
Incorporating GENETAG-style annotation to GENIA corpus
- In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, BioNLP ’09. Association for Computational Linguistics
, 2009
"... ..."
(Show Context)
HIDE: An integrated system for health information de-identification
- In CBMS
, 2008
"... While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any identifiable information. A considerable amount of research in data privacy community has been devoted to formalizing the notion of identi ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
(Show Context)
While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any identifiable information. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing techniques for anonymization, but it has focused exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques without taking advantage of the research developments in the data privacy community. This paper attempts to fill the above gaps and presents a prototype system for de-identifying health information including both structured and unstructured data. It deploys a conditional random fields based technique for extracting identifying attributes from unstructured data and a k-anonymization based technique for de-identifying the data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.
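As a rough illustration of the k-anonymization step described above, the sketch below generalizes two assumed quasi-identifiers (age into decade bins, ZIP codes into 3-digit prefixes) and suppresses any group smaller than k. The record fields and generalization rules are assumptions made for this example and are not taken from the HIDE system.

// Illustrative k-anonymization by generalization and suppression.
// Fields and generalization rules are assumed; this is not HIDE's implementation.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymizationSketch {
    record PatientRecord(int age, String zip, String diagnosis) {}

    // Generalize quasi-identifiers: age to a decade bin, ZIP to its 3-digit prefix.
    static String generalizedKey(PatientRecord r) {
        int decade = (r.age() / 10) * 10;
        String zipPrefix = r.zip().length() >= 3 ? r.zip().substring(0, 3) : r.zip();
        return decade + "-" + (decade + 9) + "|" + zipPrefix + "**";
    }

    // Release only records whose generalized group contains at least k members.
    static List<String> anonymize(List<PatientRecord> records, int k) {
        Map<String, List<PatientRecord>> groups = new HashMap<>();
        for (PatientRecord r : records) {
            groups.computeIfAbsent(generalizedKey(r), key -> new ArrayList<>()).add(r);
        }
        List<String> released = new ArrayList<>();
        for (Map.Entry<String, List<PatientRecord>> group : groups.entrySet()) {
            if (group.getValue().size() >= k) {          // suppress groups smaller than k
                for (PatientRecord r : group.getValue()) {
                    released.add(group.getKey() + "|" + r.diagnosis());
                }
            }
        }
        return released;
    }

    public static void main(String[] args) {
        List<PatientRecord> records = List.of(
                new PatientRecord(34, "30332", "asthma"),
                new PatientRecord(37, "30339", "flu"),
                new PatientRecord(62, "10027", "diabetes"));
        anonymize(records, 2).forEach(System.out::println);
    }
}

With k = 2, the two thirty-something records sharing the 303** ZIP prefix are released in generalized form, while the single remaining record falls below the group-size threshold and is suppressed.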
Scaling up biomedical event extraction to the entire PubMed
- In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing
, 2010
"... We present the first full-scale event extraction experiment covering the titles and abstracts of all PubMed citations. Extraction is performed using a pipeline composed of state-of-the-art methods: the BANNER named entity recognizer, the McClosky-Charniak domain-adapted parser, and the Turku Event E ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
We present the first full-scale event extraction experiment covering the titles and abstracts of all PubMed citations. Extraction is performed using a pipeline composed of state-of-the-art methods: the BANNER named entity recognizer, the McClosky-Charniak domain-adapted parser, and the Turku Event Extraction System. We analyze the statistical properties of the resulting dataset and present evaluations of the core event extraction as well as negation and speculation detection components of the system. Further, we study in detail the set of extracted events relevant to the apoptosis pathway to gain insight into the biological relevance of the result. The dataset, consisting of 19.2 million occurrences of 4.5 million unique events, is freely available for use in research at
BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events
- Bioinformatics
, 2012
"... Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological eve ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research. Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative. Availability: The BioContext pipeline is available for download (under the BSD license) at
Overview of the Entity Relations (REL) supporting task of BioNLP Shared Task 2011
"... This paper presents the Entity Relations (REL) task, a supporting task of the BioNLP Shared Task 2011. The task concerns the extraction of two types of part-of relations between a gene/protein and an associated entity. Four teams submitted final results for the REL task, with the highest-performing ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
(Show Context)
This paper presents the Entity Relations (REL) task, a supporting task of the BioNLP Shared Task 2011. The task concerns the extraction of two types of part-of relations between a gene/protein and an associated entity. Four teams submitted final results for the REL task, with the highest-performing system achieving 57.7% F-score. While experiments suggest use of the data can help improve event extraction performance, the task data has so far received only limited use in support of event extraction. The REL task continues as an open challenge, with all resources available from the shared task website.
What the papers say: text mining for genomics and systems biology
- Hum Genomics 5(1): 17–29
, 2010
"... Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining — the automated extraction of information from (electronically) published sources — could potentially fulfil an important role — but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a