Results 1 - 6 of 6
Rule-based protein term identification with help from automatic species tagging
- In Proceedings of CICLING 2007, 2007
Cited by 7 (3 self)
In biomedical articles, terms often refer to different protein entities. For example, an arbitrary occurrence of term p53 might denote thousands of proteins across a number of species. A human annotator is able to resolve this ambiguity relatively easily, by looking at its context and if necessary, by searching an appropriate protein database. However, this phenomenon may cause much trouble to a text mining system, which does not understand human languages and hence can not identify the correct protein that the term refers to. In this paper, we present a Term Identification system which automatically assigns unique identifiers, as found in a protein database, to ambiguous protein mentions in texts. Unlike other solutions described in literature, which only work on gene/protein mentions on a specific model organism, our system is able to tackle protein mentions across many species, by integrating a machine-learning based species tagger. We have compared the performance of our automatic system to that of human annotators, with very promising results.
Comparing usability of matching techniques for normalising biomedical named entities
- In Pac Symp Biocomput, 2008
Cited by 5 (3 self)
String matching plays an important role in biomedical Term Normalisation, the task of linking mentions of biomedical entities to identifiers in reference databases. This paper evaluates exact, rule-based and various string-similarity-based matching techniques. The matchers are compared in two ways: first, we measure precision and recall against a gold-standard dataset and second, we integrate the matchers into a curation tool and measure gains in curation speed when they were used to assist a curator in normalising protein and tissue entities. The evaluation shows that a rule-based matcher works better on the gold-standard data, while a string-similarity based system and exact string matcher win out on improving curation efficiency.
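The three matcher families named in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the lexicon, its identifiers, the normalisation rule, and the 0.8 threshold are all hypothetical choices made for illustration, and the similarity measure is simply the one in Python's standard `difflib` module.

```python
import difflib
import re

# Hypothetical mini-lexicon (identifier -> known names), invented for illustration.
LEXICON = {
    "ID:0001": ["p53", "tumor protein p53"],
    "ID:0002": ["interleukin-2", "IL2"],
}


def normalise(text):
    """Assumed rule: lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def exact_match(mention):
    """Identifiers whose name list contains the mention verbatim."""
    return sorted(i for i, names in LEXICON.items() if mention in names)


def rule_match(mention):
    """Identifiers with a name equal to the mention after normalisation."""
    key = normalise(mention)
    return sorted(
        i for i, names in LEXICON.items()
        if any(normalise(n) == key for n in names)
    )


def similarity_match(mention, threshold=0.8):
    """Identifiers with a name whose difflib ratio reaches the threshold."""
    key = normalise(mention)
    hits = []
    for ident, names in LEXICON.items():
        best = max(
            difflib.SequenceMatcher(None, key, normalise(n)).ratio()
            for n in names
        )
        if best >= threshold:
            hits.append(ident)
    return sorted(hits)
```

In this sketch each matcher is more permissive than the previous one: the exact matcher misses spelling variants such as "IL-2", the rule-based matcher recovers them through normalisation, and the similarity matcher also tolerates misspellings at the cost of possible false positives.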
GeneRIF quality assurance as summary revision
- In Pacific Symposium on Biocomputing, 2007
Cited by 1 (1 self)
Like the primary scientific literature, GeneRIFs exhibit both growth and obsolescence. NLM’s control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolete data: GeneRIFs are removed from the database when they are found to be of low quality. However, the rapid and extensive growth of Entrez Gene makes manual location of low-quality GeneRIFs problematic. This paper presents a system that takes advantage of the summary-like quality of GeneRIFs to detect low-quality GeneRIFs via a summary revision approach, achieving precision of 89% and recall of 77%. Aspects of the system have been adopted by NLM as a quality assurance mechanism.
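For comparison with results reported as a single score, the precision and recall quoted above can be combined into a balanced F-measure. This figure is derived here from the two reported numbers, assuming the standard harmonic-mean definition; it is not stated in the abstract.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.89 \times 0.77}{0.89 + 0.77}
    = \frac{1.3706}{1.66}
    \approx 0.826
```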
Research Interests, Accomplishments, and Goals Haw-ren Fang
- 2010
practical algorithms. My research specializations include linear and nonlinear methods for machine learning and data mining, matrix factorization and analysis, numerical optimization, and computer games. Machine learning and data mining: Hypergraph-based multilevel matrix approximation for text mining. In text mining, a collection of documents is often pre-processed to form a sparse term-document matrix. In Latent Semantic Indexing (LSI), this is followed by a computation of a low-rank approximation to the data matrix, in order to filter out noise and redundancy due to word usage. The computation of these low-rank approximations by factorization algorithms can be time-consuming when the data set is large. A multilevel framework based on hypergraph coarsening is presented which exploits the hypergraph that is canonically associated with the sparse term-document matrix representing the data. The main goal of this technique is to reduce the cost of the matrix approximation without sacrificing accuracy. Because coarsening by multilevel hypergraph techniques is a form of clustering, the proposed approach can be regarded as a hybrid of factorization-based LSI and clustering-based
A Joint Model for Normalizing Gene and Organism Mentions in Text
The aim of gene mention normalization is to propose an appropriate canonical name, or an identifier from a popular database, for a gene or a gene product mentioned in a given piece of text. The task has attracted a lot of research attention for several organisms under the assumption that both the mention boundaries and the target organism are known. Here we extend the task to also recognizing whether the gene mention is valid and to finding the organism it is from. We solve this extended task using a joint model for gene and organism name normalization which allows for instances from different organisms to share features, thus achieving sizable performance gains with different learning methods: Naïve Bayes, Maximum Entropy, Perceptron and MIRA, as well as averaged versions of the last two. The evaluation results for our joint classifier show an F1 score of over 97%, which demonstrates the potential of the approach.

