Improving Text Classification by Shrinkage in a Hierarchy of Classes
, 1998
Cited by 203 (5 self)
When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples.
On Differential Variability of Expression Ratios: Improving . . .
- Journal of Computational Biology
, 2001
Cited by 119 (3 self)
We consider the problem of inferring fold changes in gene expression from cDNA microarray data. Standard procedures focus on the ratio of measured fluorescent intensities at each spot on the microarray, but to do so is to ignore the fact that the variation of such ratios is not constant. Estimates of gene expression changes are derived within a simple hierarchical model that accounts for measurement error and fluctuations in absolute gene expression levels. Significant gene expression changes are identified by deriving the posterior odds of change within a similar model. The methods are tested via simulation and are applied to a panel of Escherichia coli microarrays.
Information Extraction with HMMs and Shrinkage
- In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction
, 1999
Cited by 111 (4 self)
Hidden Markov models (HMMs) are a powerful probabilistic tool for modeling time series data, and have been applied with success to many language-related tasks such as part of speech tagging, speech recognition, text segmentation and topic detection. This paper describes the application of HMMs to another language-related task, information extraction: the problem of locating textual sub-segments that answer a particular information need. In our work, the HMM state transition probabilities and word emission probabilities are learned from labeled training data. As in many machine learning problems, however, the lack of sufficient labeled training data hinders the reliability of the model. The key contribution of this paper is the use of a statistical technique called "shrinkage" that significantly improves parameter estimation of the HMM emission probabilities in the face of sparse training data. In experiments on seminar announcements and Reuters acquisitions articles, shrinkage is shown to r...
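As a rough illustration of the shrinkage idea mentioned in this abstract, the sketch below smooths a sparse per-state word distribution by mixing it with a distribution pooled over more data and with a uniform distribution. The function names, the toy word lists, and the hand-picked mixture weights are all assumptions made for this example; the abstract does not say how the weights are chosen.

```python
from collections import Counter


def ml_estimate(words, vocab):
    """Maximum-likelihood word distribution from a list of observed words."""
    counts = Counter(words)
    total = len(words)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def shrink(state_words, parent_words, vocab, weights=(0.5, 0.3, 0.2)):
    """Mix a state's own estimate with a pooled "parent" estimate and a uniform one.

    `weights` are fixed by hand here (an assumption of this sketch) and must sum to 1.
    """
    lam_state, lam_parent, lam_uniform = weights
    own = ml_estimate(state_words, vocab)
    parent = ml_estimate(parent_words, vocab)
    uniform = 1.0 / len(vocab)
    return {
        w: lam_state * own[w] + lam_parent * parent[w] + lam_uniform * uniform
        for w in vocab
    }


# Hypothetical toy data: one state has seen only three words.
vocab = ["room", "hall", "pm", "speaker"]
location_state = ["room", "room", "hall"]
all_states = location_state + ["pm", "pm", "speaker", "room"]  # pooled data

emission = shrink(location_state, all_states, vocab)
```

Words the sparse state never emitted ("pm", "speaker") receive non-zero probability from the pooled and uniform components, while the state's own evidence still dominates.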
Making the most of statistical analyses: Improving interpretation and presentation
- American Journal of Political Science
, 2000
Cited by 108 (18 self)
Social scientists rarely take full advantage of the information available in their statistical results. As a consequence, they miss opportunities to present quantities that are of greatest substantive interest for their research and express the appropriate degree of certainty about these quantities. In this article, we offer an approach, built on the technique of statistical simulation, to extract the currently overlooked information from any statistical method and to interpret and present it in a reader-friendly manner. Using this technique requires some expertise,
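A minimal sketch of the simulation approach this abstract describes: draw plausible parameter values from their estimated sampling distribution, compute the quantity of interest for each draw, and summarize the draws. The logit model, the coefficient values, and the standard errors below are hypothetical, and the coefficients are drawn independently for brevity, which ignores any covariance between the estimates.

```python
import math
import random
import statistics


def simulate_probability(beta_hat, se, x, draws=10000, seed=0):
    """Simulate the predicted probability of a logit model at covariate values x.

    beta_hat, se: hypothetical coefficient estimates and standard errors.
    Each coefficient is drawn independently (an assumption of this sketch).
    Returns the mean and an approximate 95% interval of the simulated values.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        beta = [rng.gauss(b, s) for b, s in zip(beta_hat, se)]
        linear = sum(b * xi for b, xi in zip(beta, x))
        sims.append(1.0 / (1.0 + math.exp(-linear)))
    sims.sort()
    lower = sims[int(0.025 * draws)]
    upper = sims[int(0.975 * draws) - 1]
    return statistics.mean(sims), lower, upper


# Intercept and one covariate; x[0] = 1.0 is the constant term.
mean_p, lower_p, upper_p = simulate_probability(
    beta_hat=[-1.0, 0.8], se=[0.3, 0.2], x=[1.0, 2.0]
)
```

The result is a probability with an interval around it, which is usually easier to read than a raw coefficient and its standard error.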
Mixed membership stochastic block models for relational data with application to protein-protein interactions
- In Proceedings of the International Biometrics Society Annual Meeting
, 2006
Cited by 97 (22 self)
We develop a model for examining data that consists of pairwise measurements, for example, presence or absence of links between pairs of objects. Examples include protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. We introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) and a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
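To make the combination of a global blockmodel and local mixed membership more concrete, here is a small generative sketch. It is one plausible way to write such a model and is not taken from the abstract: the Dirichlet prior, the per-pair role draws, and all parameter values are assumptions. It only generates data and does not implement the variational inference the abstract mentions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_network(n_nodes, alpha, block, seed=0):
    """Generate a directed 0/1 adjacency matrix.

    alpha: Dirichlet parameter for per-node membership vectors (local part).
    block: block[k][l] is the link probability from group k to group l
           (global part).
    """
    rng = random.Random(seed)
    membership = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    links = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            k = sample_categorical(membership[i], rng)  # role i takes toward j
            l = sample_categorical(membership[j], rng)  # role j takes toward i
            links[i][j] = 1 if rng.random() < block[k][l] else 0
    return membership, links


membership, links = generate_network(
    n_nodes=6, alpha=[0.5, 0.5], block=[[0.9, 0.05], [0.05, 0.9]]
)
```

Because each node draws a fresh role for every pair, a node can behave like a member of different groups in different interactions, which a single hard group assignment cannot express.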
Bayesian measures of model complexity and fit
- Journal of the Royal Statistical Society, Series B
, 2002
Cited by 76 (1 self)
[Read before The Royal Statistical Society at a meeting organized by the Research
Bayesian Compressive Sensing
, 2007
Cited by 60 (10 self)
The data of interest are assumed to be represented as N-dimensional real vectors, and these vectors are compressible in some linear basis B, implying that the signal can be reconstructed accurately using only a small number M ≪ N of basis-function coefficients associated with B. Compressive sensing is a framework whereby one does not measure one of the aforementioned N-dimensional signals directly, but rather a set of related measurements, with the new measurements a linear combination of the original underlying N-dimensional signal. The number of required compressive-sensing measurements is typically much smaller than N, offering the potential to simplify the sensing system. Let f denote the unknown underlying N-dimensional signal, and g a vector of compressive-sensing measurements, then one may approximate f accurately by utilizing knowledge of the (under-determined) linear relationship between f and g, in addition to knowledge of the fact that f is compressible in B. In this paper we employ a Bayesian formalism for estimating the underlying signal f based on compressive-sensing measurements g. The proposed framework has the following properties: (i) in addition to estimating the underlying signal f, "error bars" are also estimated, these giving a measure of confidence in the inverted signal; (ii) using knowledge of the error bars, a principled means is provided for determining when a sufficient
Building Domain-Specific Search Engines with Machine Learning Techniques
, 1999
Cited by 58 (6 self)
Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with the general, Web-wide search engines. For example, www.campsearch.com allows complex queries by age group, size, location and cost over summer camps. Unfortunately, these domain-specific search engines are difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that automates efficient spidering, populating topic hierarchies, and identifying informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers. It already contains over 33,000 papers and is publicly available at www.cora.jprc.com.
On Block Updating in Markov Random Field Models For . . .
- Scandinavian Journal of Statistics
, 2002
Cited by 42 (7 self)
Gaussian Markov random field (GMRF) models are commonly used to model spatial correlation in disease mapping applications. For Bayesian inference by MCMC, so far mainly single-site updating algorithms have been considered. However, convergence and mixing properties of such algorithms can be extremely poor due to strong dependencies of parameters in the posterior distribution. In this paper, we propose various block sampling algorithms in order to improve the MCMC performance. The methodology is rather general, allows for non-standard full conditionals, and can be applied in a modular fashion in a large number of different scenarios. For illustration we consider three different applications: two formulations for spatial modelling of a single disease (with and without additional unstructured parameters respectively), and one formulation for the joint analysis of two diseases. The results indicate that the largest benefits are obtained if parameters and the corresponding hyperparameter are updated jointly in one large block. Implementation of such block algorithms is relatively easy using methods for fast sampling of Gaussian Markov random fields (Rue 2001). By comparison, Monte Carlo estimates based on single-site updating can be rather misleading, even for very long runs. Our results may have wider relevance for efficient MCMC simulation in hierarchical models with Markov random field components.
Using Unlabeled Data to Improve Text Classification
, 2001
Cited by 41 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
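The basic approach in this abstract, EM over labeled and unlabeled documents under a parametric generative model, can be sketched as follows. Using a multinomial naive Bayes model with add-one smoothing is an assumption of this example, as are the function names and the toy documents; the refinements the abstract discusses (sub-topic structure, hierarchies, active learning) are not shown.

```python
import math
from collections import defaultdict


def train(docs, weights, classes, vocab):
    """M-step: priors and word probabilities from (possibly fractional) labels."""
    prior, word_prob = {}, {}
    total_weight = sum(sum(w.values()) for w in weights)
    for c in classes:
        class_weight = sum(w[c] for w in weights)
        prior[c] = (class_weight + 1.0) / (total_weight + len(classes))
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for word in doc:
                counts[word] += w[c]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        word_prob[c] = {word: (counts[word] + 1.0) / denom for word in vocab}
    return prior, word_prob


def posterior(doc, prior, word_prob, classes):
    """E-step for one document: P(class | doc) under naive Bayes."""
    log_scores = {
        c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in doc)
        for c in classes
    }
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def em_naive_bayes(labeled, unlabeled, classes, iterations=10):
    """Initialize from labeled data, then alternate E- and M-steps on all data."""
    docs_l = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    vocab = sorted({w for d in docs_l + unlabeled for w in d})
    prior, word_prob = train(docs_l, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(d, prior, word_prob, classes) for d in unlabeled]
        prior, word_prob = train(docs_l + unlabeled, hard + soft, classes, vocab)
    return prior, word_prob


# Hypothetical toy corpus: "striker" and "ballot" never appear in labeled data.
classes = ["sports", "politics"]
labeled = [(["goal", "match"], "sports"), (["vote", "senate"], "politics")]
unlabeled = [["goal", "striker"], ["match", "striker"],
             ["vote", "ballot"], ["senate", "ballot"]]

prior, word_prob = em_naive_bayes(labeled, unlabeled, classes)
prediction = posterior(["striker"], prior, word_prob, classes)
```

In this toy corpus the word "striker" only co-occurs with sports words in unlabeled documents, so EM lets the classifier associate it with the sports class even though no labeled example contains it.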

