A Literature-Based Method For Assessing The Functional Coherence Of A Gene Group
, 2003
Cited by 25 (2 self)
Motivation: Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes are functionally related. The method requires only a corpus of documents and an index connecting the documents to genes.
A Statistical Model for Scientific Readability
- In Proc. of CIKM
, 2001
Cited by 20 (0 self)
This paper presents a new method of using statistical models to estimate the reading difficulty of Web pages. Language Models are used to represent the content typically associated with different readability levels. Reading level classifiers are created as linear combinations of a language model and surface linguistic features. Experiments show that this new method is more accurate than the widely used Flesch-Kincaid readability formula.
Keywords: Readability, Flesch-Kincaid, Unigram Language Model, EM.
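For reference, the Flesch-Kincaid baseline named in this abstract can be sketched as below. This is the standard published grade-level formula, not the paper's language-model classifier. The function name and the choice to take pre-computed counts as input (rather than counting syllables from raw text) are assumptions made for illustration.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from pre-computed text statistics.

    The caller supplies the counts; syllable counting is deliberately
    left out because it needs a dictionary or a heuristic.
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Example: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
print(round(grade, 2))
```

The formula depends only on sentence length and syllable density, which is why a content-aware language model can plausibly do better on short or list-like Web text.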
Recognition of Cursive Roman Handwriting - Past, Present and Future
- In Proc. 7th Int. Conf. on Document Analysis and Recognition
, 2003
Cited by 16 (6 self)
This paper reviews the state of the art in off-line Roman cursive handwriting recognition. The input provided to an off-line handwriting recognition system is an image of a digit, a word, or - more generally - some text, and the system produces, as output, an ASCII transcription of the input. This task involves a number of processing steps, some of which are quite difficult. Typically, preprocessing, normalization, feature extraction, classification, and postprocessing operations are required. We'll survey the state of the art, analyze recent trends, and try to identify challenges for future research in this field.
Measures and Applications of Lexical Distributional Similarity
, 2003
Cited by 14 (0 self)
This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, statistics, Information Retrieval (IR) and Information Theory. Our aim is to investigate the properties which make a good measure of lexical distributional similarity. We start by introducing the concept of lexical distributional similarity. We discuss potential applications, which can be roughly divided into distributional or language modelling applications and semantic applications, and methods of evaluation (Chapter 2). We look at existing measures of distributional similarity and carry out an empirical comparison of fifteen of these measures, paying particular attention to the effects of word frequency (Chapter 3). We propose a new general framework for distributional similarity based on the context of lexical substitutability, which we measure using the IR concepts of precision and recall. This framework allows us to investigate the key factors in similarity of asymmetry, the relative influence of different contexts and the extent to which words share a context (Chapter 4). Finally, we consider the application of distributional similarity in language modelling (Chapter 5) and as a predictor of semantic similarity using human judgements of similarity and a spelling correction task (Chapter 6).
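The precision/recall view of distributional similarity mentioned in this abstract can be illustrated with a deliberately simplified sketch. It treats each word's contexts as an unweighted set of neighbouring words, which is an assumption made here; the thesis's actual framework and feature weighting are not given in the abstract. The function names and the toy sentences are hypothetical.

```python
def cooccurrence_features(sentences, window=1):
    """Map each word to the set of words seen within `window` positions."""
    features = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            contexts = features.setdefault(word, set())
            contexts.update(sentence[j] for j in range(lo, hi) if j != i)
    return features


def precision_recall(features, w1, w2):
    """Precision: share of w1's contexts also seen with w2.
    Recall: share of w2's contexts also seen with w1."""
    f1 = features.get(w1, set())
    f2 = features.get(w2, set())
    shared = f1 & f2
    precision = len(shared) / len(f1) if f1 else 0.0
    recall = len(shared) / len(f2) if f2 else 0.0
    return precision, recall


sentences = [
    ["drink", "tea", "now"],
    ["drink", "coffee", "now"],
    ["brew", "coffee", "slowly"],
]
features = cooccurrence_features(sentences)
print(precision_recall(features, "tea", "coffee"))
print(precision_recall(features, "coffee", "tea"))
```

Swapping the two words swaps precision and recall, which is one simple way to see the asymmetry the abstract refers to.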
Generic soft pattern models for definitional question answering
- Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
, 2005
Cited by 13 (3 self)
This paper explores probabilistic lexico-syntactic pattern matching, also known as soft pattern matching. While previous methods in soft pattern matching are ad hoc in computing the degree of match, we propose two formal matching models: one based on bigrams and the other on the Profile Hidden Markov Model (PHMM). Both models provide a theoretically sound method to model pattern matching as a probabilistic process that generates token sequences. We demonstrate the effectiveness of these models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform state-of-the-art manually constructed patterns. A critical difference between the two models is that the PHMM technique handles language variations more effectively but requires more training data to converge. We believe that both models can be extended to other areas where lexico-syntactic pattern matching can be applied.
TIJAH at INEX 2004: Modeling Phrases and Relevance Feedback
- In Proceedings of the 3rd INEX Workshop, LNCS 3493
, 2005
Cited by 10 (7 self)
This paper discusses our participation in INEX using the TIJAH XML-IR system. We have enriched the TIJAH system, which follows a standard layered database architecture, with several new features. An extensible conceptual level processing unit has been added to the system. The algebra on the logical level and the implementation on the physical level have been extended to support phrase search and structural relevance feedback. The conceptual processing unit is capable of rewriting NEXI content-only and content-and-structure queries into the internal form, based on the retrieval model parameter specification, that is either predefined or based on relevance feedback. Relevance feedback parameters are produced based on the data fusion of result element score values and sizes, and relevance assessments. The introduction of new operators supporting phrase search in score region algebra on the logical level is discussed in the paper, as well as their implementation on the physical level using the pre-post numbering scheme. The framework for structural relevance feedback is also explained in the paper. We conclude with a preliminary analysis of the system performance based on INEX 2004 runs.
Measuring Sparseness Of Noisy Signals
- 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003)
, 2003
Cited by 10 (0 self)
In this paper sparseness measures are reviewed, extended and compared. Special attention is paid to measuring sparseness of noisy data. We review and extend several definitions and measures for sparseness, including the # , # norms. A measure based on order statistics is also proposed. The concept of sparseness is extended to the case where a signal has a dominant value other than zero. The sparseness measures can be easily modified to correspond to this new definition. Eight different measures are compared in three examples. It turns out that different measures may give completely opposite results if the distribution does not have a unique mode at zero. In conclusion, we suggest that the kurtosis should be avoided as a sparseness measure and recommend tanh-functions for measuring noisy sparseness.
Content-based Access to Spoken Audio
- IEEE Signal Processing Magazine
, 2005
Cited by 9 (1 self)
This article describes approaches to content-based access to spoken audio with a qualitative and tutorial emphasis. We describe how the analysis, retrieval and delivery phases contribute to making spoken audio content more accessible, and we outline a number of outstanding research issues. We also discuss the main application domains and try to identify important issues for future developments. The structure of the article is based on a general system architecture for content-based access, which is depicted in Figure 1. Although the tasks within each processing stage may appear unconnected, the interdependencies and the sequence with which they take place vary
A Bayesian Interpretation of Interpolated Kneser-Ney
, 2006
Cited by 8 (2 self)
Interpolated Kneser-Ney is one of the best smoothing methods for n-gram language models. Previous explanations for its superiority have been based on intuitive and empirical justifications of specific properties of the method. We propose a novel interpretation of interpolated Kneser-Ney as approximate inference in a hierarchical Bayesian model consisting of Pitman-Yor processes. As opposed to past explanations, our interpretation can recover exactly the formulation of interpolated Kneser-Ney, and performs better than interpolated Kneser-Ney when a better inference procedure is used.
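As background for this entry, a minimal sketch of the classical interpolated Kneser-Ney estimate for bigrams is given below. It shows the smoothing method the paper reinterprets, not the paper's hierarchical Pitman-Yor model. The single fixed discount of 0.75, the function name, and the toy corpus are assumptions chosen for illustration.

```python
from collections import Counter


def train_kn_bigram(tokens, discount=0.75):
    """Return prob(w, u) = P(w | u) under interpolated Kneser-Ney (bigrams)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()  # c(u): how often u occurs as a context
    followers = {}             # distinct words seen after u
    preceders = {}             # distinct words seen before w
    for (u, w), c in bigrams.items():
        context_count[u] += c
        followers.setdefault(u, set()).add(w)
        preceders.setdefault(w, set()).add(u)
    n_bigram_types = len(bigrams)

    def prob(w, u):
        # Continuation probability: fraction of bigram types ending in w.
        p_cont = len(preceders.get(w, ())) / n_bigram_types
        if context_count[u] == 0:
            return p_cont  # unseen context: fall back to continuation prob
        discounted = max(bigrams[(u, w)] - discount, 0.0) / context_count[u]
        # Mass removed by discounting is redistributed via p_cont.
        backoff_weight = discount * len(followers[u]) / context_count[u]
        return discounted + backoff_weight * p_cont

    return prob


tokens = "the cat sat on the mat the cat ate".split()
prob = train_kn_bigram(tokens)
print(prob("cat", "the"), prob("mat", "the"))
print(sum(prob(w, "the") for w in set(tokens)))  # should be close to 1
```

The lower-order distribution counts how many distinct contexts a word follows rather than how often it occurs, which is the property usually cited as the reason the method works well.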
Detecting Wikipedia Vandalism with Active Learning and Statistical Language Models
, 2010
Cited by 8 (0 self)
This paper proposes an active learning approach using language model statistics to detect Wikipedia vandalism. Wikipedia is a popular and influential collaborative information system. The collaborative nature of authoring, as well as the high visibility of its content, have exposed Wikipedia articles to vandalism. Vandalism is defined as malicious editing intended to compromise the integrity of the content of articles. Extensive manual efforts are being made to combat vandalism and an automated approach to alleviate the laborious process is needed. This paper builds statistical language models, constructing distributions of words from the revision history of Wikipedia articles. As vandalism often involves the use of unexpected words to draw attention, the fitness (or lack thereof) of a new edit when compared with language models built from previous versions may well indicate that an edit is a vandalism instance. In addition, the paper adopts an active learning model to solve the problem of noisy and incomplete labeling of Wikipedia vandalism. The Wikipedia domain with its revision histories offers a novel context in which to explore the potential of language models in characterizing author intention. As the experimental results presented in the paper demonstrate, these models hold promise for vandalism detection.
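The core idea of this abstract, scoring a new edit by how poorly it fits a language model built from earlier revisions, might be sketched as follows. This is an illustrative unigram model with add-one smoothing, chosen here as an assumption; the paper's actual models, features, and active learning procedure are not specified in the abstract and are not reproduced. All names and example strings are hypothetical.

```python
import math
from collections import Counter


def build_unigram_model(revisions):
    """Count lower-cased, whitespace-separated words over past revisions."""
    counts = Counter(w for text in revisions for w in text.lower().split())
    return counts, sum(counts.values())


def surprise(edit_text, model):
    """Average negative log-probability per word of the edit.

    Add-one smoothing reserves one extra vocabulary slot for unseen
    words. Higher values mean the edit fits the history less well.
    """
    counts, total = model
    vocab_size = len(counts) + 1
    words = edit_text.lower().split()
    if not words:
        return 0.0
    nll = -sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)
    return nll / len(words)


history = ["The river flows north", "The river flows north to the sea"]
model = build_unigram_model(history)
print(surprise("the river flows", model))
print(surprise("lol u suck", model))
```

A score like this would only be one input to a classifier; a threshold or learned model would still be needed to decide which edits to flag.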

