Results 1–10 of 18
On the optimality of the simple Bayesian classifier under zero-one loss
 MACHINE LEARNING
, 1997
"... The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containin ..."
Abstract

Cited by 805 (27 self)
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
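The abstract's claim that the classifier can be zero-one optimal even when independence fails can be checked directly on a toy conjunction. The sketch below is not from the paper; the attribute count and target concept are arbitrary illustrative choices. It fits maximum-likelihood naive Bayes estimates on the full instance space for the concept x1 ∧ x2 ∧ x3 (with two irrelevant attributes) and verifies that every instance is classified correctly, even though the attributes are clearly dependent given the class.

```python
from itertools import product

def concept(x):
    # hypothetical target: conjunction x1 AND x2 AND x3 (x4, x5 irrelevant)
    return int(x[0] and x[1] and x[2])

# enumerate the full instance space (uniform distribution)
X = list(product([0, 1], repeat=5))
y = [concept(x) for x in X]

def nb_fit(X, y):
    # maximum-likelihood naive Bayes estimates from the full population
    prior = {c: sum(1 for t in y if t == c) / len(y) for c in (0, 1)}
    cond = {}  # cond[(i, v, c)] = P(x_i = v | class = c)
    for c in (0, 1):
        Xc = [x for x, t in zip(X, y) if t == c]
        for i in range(len(X[0])):
            for v in (0, 1):
                cond[(i, v, c)] = sum(1 for x in Xc if x[i] == v) / len(Xc)
    return prior, cond

def nb_predict(x, prior, cond):
    score = {}
    for c in (0, 1):
        s = prior[c]
        for i, v in enumerate(x):
            s *= cond[(i, v, c)]  # independence assumption, violated here
        score[c] = s
    return max(score, key=score.get)

prior, cond = nb_fit(X, y)
errors = sum(nb_predict(x, prior, cond) != t for x, t in zip(X, y))
print(errors)  # 0: zero-one optimal despite the violated independence assumption
```

The probability estimates themselves are wrong (the negative class's product badly misestimates P(x | C=0)), but the argmax over classes is still correct everywhere, which is exactly the quadratic-loss vs. zero-one-loss distinction the abstract draws.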
Selection of relevant features and examples in machine learning
 ARTIFICIAL INTELLIGENCE
, 1997
"... In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been mad ..."
Abstract

Cited by 590 (2 self)
In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.
Average-Case Analysis of a Nearest Neighbor Algorithm
 PROCEEDINGS OF THE THIRTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (PP. 889–894), CHAMBÉRY
, 1993
"... In this paper we present an averagecase analysis of the nearest neighbor algorithm, a simple induction method that has been studied by many researchers. Our analysis assumes a conjunctive target concept, noisefree Boolean attributes, and a uniform distribution over the instance space. We calculate ..."
Abstract

Cited by 42 (4 self)
In this paper we present an average-case analysis of the nearest neighbor algorithm, a simple induction method that has been studied by many researchers. Our analysis assumes a conjunctive target concept, noise-free Boolean attributes, and a uniform distribution over the instance space. We calculate the probability that the algorithm will encounter a test instance that is distance d from the prototype of the concept, along with the probability that the nearest stored training case is distance e from this test instance. From this we compute the probability of correct classification as a function of the number of observed training cases, the number of relevant attributes, and the number of irrelevant attributes. We also explore the behavioral implications of the analysis by presenting predicted learning curves for artificial domains, and give experimental results on these domains as a check on our reasoning.
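The setting described in this abstract is easy to simulate. The sketch below is an illustrative Monte Carlo reconstruction, not the paper's closed-form analysis; the attribute counts and sample sizes are arbitrary choices. It estimates 1-NN accuracy under a conjunctive target with noise-free Boolean attributes and a uniform instance distribution, for several training-set sizes.

```python
import random

random.seed(0)
N_ATTRS, N_RELEVANT = 8, 3   # 3 relevant and 5 irrelevant Boolean attributes

def concept(x):
    # conjunctive target: all relevant attributes must be 1
    return all(x[:N_RELEVANT])

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def rand_instance():
    # uniform distribution over the instance space
    return tuple(random.randint(0, 1) for _ in range(N_ATTRS))

def nn_accuracy(n_train, n_test=300, trials=20):
    correct = 0
    for _ in range(trials):
        train = [rand_instance() for _ in range(n_train)]
        for _ in range(n_test):
            x = rand_instance()
            # 1-NN under Hamming distance, noise-free labels
            nearest = min(train, key=lambda t: hamming(t, x))
            correct += concept(nearest) == concept(x)
    return correct / (n_test * trials)

for m in (3, 20, 150):
    print(m, round(nn_accuracy(m), 3))
```

Accuracy climbs with the number of stored cases, tracing out the kind of learning curve the paper predicts analytically from the distance distributions.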
Induction over the unexplained: Using overly-general domain theories to aid concept learning
, 1993
"... This paper describes and evaluates an approach to combining empirical and explanationbased learning called Induction Over the Unexplained (IOU). IOU is intended for learning concepts that can be partially explained by an overlygeneral domain theory. An eclectic evaluation of the method is presented ..."
Abstract

Cited by 15 (0 self)
This paper describes and evaluates an approach to combining empirical and explanation-based learning called Induction Over the Unexplained (IOU). IOU is intended for learning concepts that can be partially explained by an overly-general domain theory. An eclectic evaluation of the method is presented which includes results from all three major approaches: empirical, theoretical, and psychological. Empirical results show that IOU is effective at refining overly-general domain theories and that it learns more accurate concepts from fewer examples than a purely empirical approach. The application of theoretical results from PAC learnability theory explains why IOU requires fewer examples. IOU is also shown to be able to model psychological data demonstrating the effect of background knowledge on human learning.
Characterizing Rational versus Exponential Learning Curves
 In Computational Learning Theory: Second European Conference. EuroCOLT’95
, 1995
"... . We consider the standard problem of learning a concept from random examples. Here a learning curve can be defined to be the expected error of a learner's hypotheses as a function of training sample size. Haussler, Littlestone and Warmuth have shown that, in the distribution free setting, the ..."
Abstract

Cited by 7 (0 self)
We consider the standard problem of learning a concept from random examples. Here a learning curve can be defined to be the expected error of a learner's hypotheses as a function of training sample size. Haussler, Littlestone and Warmuth have shown that, in the distribution-free setting, the smallest expected error a learner can achieve in the worst case over a concept class C converges rationally to zero error (i.e., Θ(1/t) for training sample size t). However, recently Cohn and Tesauro have demonstrated how exponential convergence can often be observed in experimental settings (i.e., average error decreasing as e^(−Θ(t))). By addressing a simple nonuniformity in the original analysis, this paper shows how the dichotomy between rational and exponential worst-case learning curves can be recovered in the distribution-free theory. These results support the experimental findings of Cohn and Tesauro: for finite concept classes, any consistent learner achieves exponent...
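The finite-class side of the dichotomy can be illustrated with a small simulation. The sketch below is a toy illustration, not the paper's construction: a consistent learner picks the smallest threshold concept consistent with the sample over a 10-point domain. Because the class is finite (the nearest distinct concept is a fixed distance 1/10 away), the expected error decays exponentially in the sample size t, rather than as Θ(1/t) as it would for thresholds on a continuous domain.

```python
import random

random.seed(1)
DOMAIN = range(10)   # finite instance space, uniform distribution
TARGET = 5           # target threshold concept: c(x) = (x < 5)

def consistent_threshold(sample):
    # any consistent learner will do; take the smallest consistent threshold
    for j in range(len(DOMAIN) + 1):
        if all((x < j) == y for x, y in sample):
            return j

def expected_error(t, trials=2000):
    total = 0.0
    for _ in range(trials):
        sample = []
        for _ in range(t):
            x = random.choice(DOMAIN)
            sample.append((x, x < TARGET))   # noise-free labels
        j = consistent_threshold(sample)
        # error = probability the hypothesis disagrees with the target
        total += sum((x < j) != (x < TARGET) for x in DOMAIN) / len(DOMAIN)
    return total / trials

for t in (2, 10, 40):
    print(t, round(expected_error(t), 4))
```

The error drops far faster than 1/t would allow: once the boundary point has been sampled, which happens with probability 1 − (9/10)^t, the smallest consistent threshold is exactly the target.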
A complete and tight average-case analysis of learning monomials
 IN PROC. 16TH INT'L SYMPOS. ON THEORETICAL ASPECTS OF COMPUTER SCIENCE, STACS'99
, 1999
"... We advocate to analyze the average complexity of learning problems. An appropriate framework for this purpose is introduced. Based on it we consider the problem of learning monomials and the special case of learning monotone monomials in the limit and for online predictions in two variants: from ..."
Abstract

Cited by 6 (0 self)
We advocate analyzing the average complexity of learning problems. An appropriate framework for this purpose is introduced. Based on it, we consider the problem of learning monomials and the special case of learning monotone monomials in the limit and for online predictions in two variants: from positive data only, and from positive and negative examples. The well-known Wholist algorithm is completely analyzed, in particular its average-case behavior with respect to the class of binomial distributions. We consider different complexity measures: the number of mind changes, the number of prediction errors, and the total learning time. Tight bounds are obtained, implying that worst-case bounds are too pessimistic. On average, learning can be achieved exponentially faster. Furthermore, we study a new learning model, stochastic finite learning, in which, in contrast to PAC learning, some information about the underlying distribution is given and the goal is to find a correct (not only approximately correct) hypothesis. We develop techniques to obtain good bounds for stochastic finite learning from a precise average-case analysis of strategies for learning in the limit and illustrate our approach for the case of learning monomials.
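For the monotone case learned from positive data only, the Wholist algorithm named in the abstract is short to state: start with the monomial over all variables (the "whole list") and, on each positive example, delete every variable the example sets to 0. The sketch below is one minimal rendering with hypothetical toy data; the mind-change counter follows the abstract's complexity measure.

```python
def wholist_monotone(positive_examples, n):
    # start with the monomial containing every variable ("whole list")
    hyp = set(range(n))
    mind_changes = 0
    for x in positive_examples:
        # a variable set to 0 in a positive example cannot be in the monomial
        pruned = {i for i in hyp if x[i] == 1}
        if pruned != hyp:
            mind_changes += 1
            hyp = pruned
    return hyp, mind_changes

# hypothetical target monotone monomial: x0 AND x2 over 4 variables
pos = [(1, 0, 1, 1), (1, 1, 1, 0), (1, 0, 1, 0)]
hyp, changes = wholist_monotone(pos, 4)
print(sorted(hyp), changes)  # [0, 2] 2
```

The hypothesis only ever shrinks, and it shrinks at most n times, which is why the interesting quantities are average-case: under a fixed distribution, most of the shrinking happens within the first few positive examples.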
A Probabilistic Framework for Memory-Based Reasoning
 Manuscript, in review
, 1998
"... In this paper, we propose a probabilistic framework for MemoryBased Reasoning (MBR). The framework allows us to clarify the technical merits and limitations of several recently published MBR methods and to design new variants. The proposed computational framework consists of three components: a spe ..."
Abstract

Cited by 6 (0 self)
In this paper, we propose a probabilistic framework for Memory-Based Reasoning (MBR). The framework allows us to clarify the technical merits and limitations of several recently published MBR methods and to design new variants. The proposed computational framework consists of three components: a specification language to define an adaptive notion of relevant context for a query; mechanisms for retrieving this context; and local learning procedures that are used to induce the desired action from this context. Based on the framework we derive several analytical and empirical results that shed light on MBR algorithms. We introduce the notion of an MBR transform, and discuss its utility for learning algorithms. We also provide several perspectives on memory-based reasoning from a multidisciplinary point of view. 1 Introduction Reasoning can be broadly defined as the task of deciding what action to perform in a particular state or in response to a given query. Actions can range from admit...
Bias-Variance Decomposition of Zero-One Loss in Average-Case Model
 Proc. AMAI
, 2002
"... this paper, we also consider that Fig.1(a) and (b) are regarded as the same case, and Fig.1(c) and (d) are regarded the same case. In other words, we consider the gross variance V gross = V u +V b . For preparation of discussion, we show Theorem 5 ..."
Abstract

Cited by 2 (0 self)
this paper, we also consider that Fig. 1(a) and (b) are regarded as the same case, and that Fig. 1(c) and (d) are regarded as the same case. In other words, we consider the gross variance V_gross = V_u + V_b. In preparation for the discussion, we show Theorem 5
Annotated Bibliography on Research Methodologies
, 1994
"... This paper suggests that science is organized in research programmes: a structure including a hard core that is not questioned and auxiliary hypotheses that guard it from negative evidence. Acknowledging that no data can confirm or refute a theory, scientists should adhere to some normative rules wh ..."
Abstract

Cited by 1 (0 self)
This paper suggests that science is organized in research programmes: a structure including a hard core that is not questioned and auxiliary hypotheses that guard it from negative evidence. Acknowledging that no data can confirm or refute a theory, scientists should adhere to some normative rules when revising auxiliary hypotheses.
(Habermas, 1971) This book provides a critique of positivism through a study of the historical development of ideas that led to contemporary positivism. The book does not argue against science, but against "scientism": the view that equates all knowledge with science.
(Toulmin, 1972) This book argues for the necessity of bringing philosophy and science together for a reappraisal of epistemology and methodology. Philosophy is to be a historical, empirical, and pragmatic enterprise that should focus on issues such as conceptual change in the sciences and human thought.
(Weimer, 1979) This book develops a metatheory of science in which positivism and logical empiricism are called justificationism, and opinions such as those of Popper and Kuhn are termed non-justificationism. The book criticizes justificationist theories of science and uses the metatheory to explain the differences between the positions of different contemporary non-justificationists.
(Knorr-Cetina, 1981) This book provides a constructivist view of science. It uses several metaphors of a scientist to study the way in which social processes make up for the lack of any rational way of advancing science.
(Bunge, 1983) A part of an eight-book treatise on philosophy, this book provides a systems-science perspective on epistemology and research methodology. It presents a serious study of scientific realism.
Average Case Analysis of Learning k-CNF concepts
 In Proceedings of the Ninth International Workshop on Machine Learning
, 1992
"... We present an approach to modeling the average case behavior of an algorithm for learning Conjunctive Normal Form (CNF, i.e., conjunctions of disjunctions). Our motivation is to predict the expected error of the learning algorithm as a function of the number of training examples from a known distrib ..."
Abstract

Cited by 1 (1 self)
We present an approach to modeling the average case behavior of an algorithm for learning Conjunctive Normal Form (CNF, i.e., conjunctions of disjunctions). Our motivation is to predict the expected error of the learning algorithm as a function of the number of training examples from a known distribution. We extend the basic model to address issues that arise if the data contain attribute noise. We show how the analysis can lead to insight into the behavior of the algorithm and the factors that affect the error. We make certain independence assumptions during the derivation of the average case model and we demonstrate that the predictions of the model account for a large percentage of the variation in error when these assumptions are violated.
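The abstract does not spell out the learner, but the standard elimination algorithm for k-CNF (in the style of Valiant's) is easy to sketch: start with the conjunction of all clauses of at most k literals and delete every clause falsified by a positive example. The sketch below is illustrative only; the toy 2-CNF target is a hypothetical choice, and the learner here sees the complete set of positive instances rather than a random sample.

```python
from itertools import combinations, product

def all_clauses(n, k):
    # every disjunction of at most k literals over n variables
    lits = [(i, pol) for i in range(n) for pol in (True, False)]
    clauses = set()
    for size in range(1, k + 1):
        for combo in combinations(lits, size):
            if len({i for i, _ in combo}) == size:   # distinct variables only
                clauses.add(frozenset(combo))
    return clauses

def satisfies(clause, x):
    # literal (i, True) wants x[i] == 1; (i, False) wants x[i] == 0
    return any((x[i] == 1) == pol for i, pol in clause)

def learn_kcnf(positives, n, k):
    # elimination learner: keep only clauses every positive example satisfies
    hyp = all_clauses(n, k)
    for x in positives:
        hyp = {c for c in hyp if satisfies(c, x)}
    return hyp

def predict(hyp, x):
    return all(satisfies(c, x) for c in hyp)

# hypothetical toy target 2-CNF: (x0 OR x1) AND (NOT x2 OR x1)
def target(x):
    return bool((x[0] or x[1]) and ((not x[2]) or x[1]))

n, k = 3, 2
positives = [x for x in product((0, 1), repeat=n) if target(x)]
hyp = learn_kcnf(positives, n, k)
errors = sum(predict(hyp, x) != target(x)
             for x in product((0, 1), repeat=n))
print(errors)  # 0: given all positives, the hypothesis matches the target
```

Because the hypothesis always retains the target's own clauses and accepts every positive example by construction, its error comes only from positive instances not yet seen, which is the quantity an average-case analysis like this paper's tracks as a function of sample size.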