Results 1 - 10 of 16
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
, 2004
Cited by 55 (7 self)
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
Using string-kernels for learning semantic parsers
- In Proc. of COLING/ACL-06
, 2006
Cited by 38 (10 self)
We present a new approach for mapping natural language sentences to their formal meaning representations using string-kernel-based classifiers. Our system learns these classifiers for every production in the formal language grammar. Meaning representations for novel natural language sentences are obtained by finding the most probable semantic parse using these string classifiers. Our experiments on two real-world data sets show that this approach compares favorably to other existing systems and is particularly robust to noise.
Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases
, 2002
Cited by 33 (2 self)
The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field's domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.
Text Mining with Information Extraction
- AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases
, 2002
Cited by 21 (0 self)
The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other
Two Approaches to Handling Noisy Variation in Text Mining
- In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning
, 2002
Cited by 19 (2 self)
Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to "hardening" noisy databases by identifying duplicate records, and (2) mining "soft" association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
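As a rough illustration of the partial matching this abstract mentions, the sketch below counts how many records contain an item under a soft, edit-distance-based match. It is not the paper's algorithm: the function names (`edit_distance`, `soft_match`, `soft_support`), the normalization by the longer string's length, and the 0.8 threshold are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def soft_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as the same item if their normalized similarity
    reaches the threshold (hypothetical normalization and default)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    similarity = 1.0 - edit_distance(a, b) / longest
    return similarity >= threshold


def soft_support(item: str, records: list[set[str]], threshold: float = 0.8) -> int:
    """Count the records that contain at least one entry softly matching item."""
    return sum(
        any(soft_match(item, entry, threshold) for entry in record)
        for record in records
    )
```

With exact matching, a misspelled entry such as "Microsft" would not count toward the support of "Microsoft"; under the soft match above it does, which is the kind of regularity that noise would otherwise hide.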
RNA secondary structure comparison: exact analysis of the Zhang-Shasha tree edit algorithm
, 2003
Cited by 19 (5 self)
We are interested in RNA secondary structure comparison, using an approach that represents these structures as labeled ordered trees. Depending on the problem considered, this tree representation can be rough (considering only the structural patterns), or refined until an exact coding of the structure is obtained. After some preliminary definitions and a description of the Zhang-Shasha [ZS89] tree edit algorithm, which is both the reference for comparing ordered labeled trees and the starting point of our work, this article presents an exact analysis of its complexity. This work also aims at a better understanding of the algorithm's parameters, so that it can be modified more easily, without changing its time complexity, to take into account biological constraints that occur when comparing RNA secondary structures.
Learning to Extract Proteins and their Interactions from Medline Abstracts
- In: ICML-2003 Workshop on Machine Learning in Bioinformatics
, 2003
Cited by 10 (0 self)
We present results from a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
Measuring Author Contributions to the Wikipedia
, 2008
Cited by 9 (2 self)
We consider the problem of measuring user contributions to versioned, collaborative bodies of information, such as wikis. Measuring the contributions of individual authors can be used to divide revenue, to recognize merit, to award status promotions, and to choose the order of authors when citing the content. In the context of the Wikipedia, previous works on author contribution estimation have focused on two criteria: the total text created, and the total number of edits performed. We show that neither of these criteria works well: both techniques are vulnerable to manipulation, and the total-text criterion fails to reward people who polish or re-arrange the content. We consider and compare various alternative criteria that take into account the quality of a contribution, in addition to the quantity, and we analyze how the criteria differ in the way they rank authors according to their contributions. As an outcome of this study, we propose to adopt total edit longevity as a measure of author contribution. Edit longevity is resistant to simple attacks, since edits are counted towards an author’s contribution only if other authors accept the contribution. Edit longevity equally rewards people who create content, and people who rearrange or polish the content. Finally, edit longevity distinguishes the people who contribute little (who have contribution close to zero) from spammers or vandals, whose contribution quickly grows negative.
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
, 2004
Cited by 6 (0 self)
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus-
The Varlet Analyst: Employing Imperfect Knowledge in Database Reverse Engineering Tools
- 3rd International Workshop on Intelligent Software Engineering (WISE-3)
, 2000
Cited by 3 (2 self)
Emerging key technologies like World Wide Web, object-orientation, and... In this paper, we present a flexible tool that aims to support the reengineer in these reverse engineering activities. Unlike other tools, our approach does not force the reengineer to follow a strict process or to enter only consistent information. On the contrary, our tool adopts the mental model of its user and deals with imperfect information (uncertainty and contradiction) explicitly.

