Results 1  10
of
79
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
 Machine Learning
, 1996
"... . We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions gene ..."
Abstract

Cited by 172 (16 self)
 Add to MetaCart
. We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KLdivergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs, can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in humanmachine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
Using and combining predictors that specialize
 In 29th STOC
, 1997
"... Abstract. We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called “experts. ” These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only loga ..."
Abstract

Cited by 92 (13 self)
 Add to MetaCart
Abstract. We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called “experts. ” These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and “switching ” experts. 1
Predicting Nearly as Well as the Best Pruning of a Decision Tree
 Machine Learning
, 1995
"... . Many algorithms for inferring a decision tree from data involve a twophase process: First, a very large decision tree is grown which typically ends up "overfitting" the data. To reduce overfitting, in the second phase, the tree is pruned using one of a number of available methods. The ..."
Abstract

Cited by 69 (5 self)
 Add to MetaCart
. Many algorithms for inferring a decision tree from data involve a twophase process: First, a very large decision tree is grown which typically ends up "overfitting" the data. To reduce overfitting, in the second phase, the tree is pruned using one of a number of available methods. The final tree is then output and used for classification on test data. In this paper, we suggest an alternative approach to the pruning phase. Using a given unpruned decision tree, we present a new method of making predictions on test data, and we prove that our algorithm's performance will not be "much worse" (in a precise technical sense) than the predictions made by the best reasonably small pruning of the given decision tree. Thus, our procedure is guaranteed to be competitive (in terms of the quality of its predictions) with any pruning algorithm. We prove that our procedure is very efficient and highly robust. Our method can be viewed as a synthesis of two previously studied techniques. First, we ...
The consistency of the BIC Markov order estimator.
"... . The Bayesian Information Criterion (BIC) estimates the order of a Markov chain (with finite alphabet A) from observation of a sample path x 1 ; x 2 ; : : : ; x n , as that value k = k that minimizes the sum of the negative logarithm of the kth order maximum likelihood and the penalty term jAj ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
. The Bayesian Information Criterion (BIC) estimates the order of a Markov chain (with finite alphabet A) from observation of a sample path x 1 ; x 2 ; : : : ; x n , as that value k = k that minimizes the sum of the negative logarithm of the kth order maximum likelihood and the penalty term jAj k (jAj\Gamma1) 2 log n: We show that k equals the correct order of the chain, eventually almost surely as n ! 1, thereby strengthening earlier consistency results that assumed an apriori bound on the order. A key tool is a strong ratiotypicality result for Markov sample paths. We also show that the Bayesian estimator or minimum description length estimator, of which the BIC estimator is an approximation, fails to be consistent for the uniformly distributed i.i.d. process. AMS 1991 subject classification: Primary 62F12, 62M05; Secondary 62F13, 60J10 Key words and phrases: Bayesian Information Criterion, order estimation, ratiotypicality, Markov chains. 1 Supported in part by a joint N...
Efficient Bayesian Parameter Estimation in Large Discrete Domains
 Advances in Neural Information Processing Systems
, 1999
"... In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assum ..."
Abstract

Cited by 30 (1 self)
 Add to MetaCart
In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assumption that the observed outcomes constitute only a small subset of the possible outcomes. We show how to efficiently perform exact inference with this form of hierarchical prior and compare our method to standard approaches and demonstrate its merits. Category: Algorithms and Architectures Presentation preference: none This paper was not submitted elsewhere nor will be submitted during NIPS review period. 1 Introduction One of the most important problems in statistical inference is multinomialestimation: Given a past history of observations independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in mor...
Formal grammar and information theory: Together again?
 PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY
, 2000
"... In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and i ..."
Abstract

Cited by 28 (0 self)
 Add to MetaCart
In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and informationtheoretic principles in the analysis of natural language, and early formallanguage theory provided another strong link between information theory and linguistics. Nevertheless, in most research on language and computation, grammatical and informationtheoretic approaches had moved far apart. Today, after many years on the defensive, the informationtheoretic approach has gained new strength and achieved practical successes in speech recognition, information retrieval, and, increasingly, in language analysis and machine translation. The exponential increase in the speed and storage capacity of computers is the proximate cause of these engineering successes, allowing the automatic estimation of the parameters of probabilistic models of language by counting occurrences of linguistic events in very large bodies of text and speech. However, I will argue that informationtheoretic and computational ideas are also playing an increasing role in the scientific understanding of language, and will help bring together formallinguistic and informationtheoretic perspectives.
An Efficient Extension to Mixture Techniques for Prediction and Decision Trees
 Machine Learning
, 1999
"... We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "nodebased" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edgeb ..."
Abstract

Cited by 27 (4 self)
 Add to MetaCart
We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "nodebased" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edgebased prunings. The method includes an online weightallocation algorithm that can be used for prediction, compression and classification. Although the set of edgebased prunings of a given tree is much larger than that of nodebased prunings, our algorithm has similar space and time complexity to that of previous mixture algorithms for trees. Using the general online framework of Freund & Schapire (1997), we prove that our algorithm maintains correctly the mixture weights for edgebased prunings with any bounded loss function. We also give a similar algorithm for the logarithmic loss function with a corresponding weightallocation algorithm. Finally, we describe experiments comparing nodebased and edgebased mixture models for estimating the probability of the next word in English text, which show the advantages of edgebased models. Keywords: mixture models, decision and prediction trees, online learning, statistical language modeling 1.
Recommender Systems Using Linear Classifiers
 Journal of Machine Learning Research
, 2002
"... Recommender systems use historical data on user preferences and other available data on users (for example, demographics) and items (for example, taxonomy) to predict items a new user might like. Applications of these methods include recommending items for purchase and personalizing the browsing ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
Recommender systems use historical data on user preferences and other available data on users (for example, demographics) and items (for example, taxonomy) to predict items a new user might like. Applications of these methods include recommending items for purchase and personalizing the browsing experience on a website. Collaborative filtering methods have focused on using just the history of user preferences to make the recommendations.
Performance analysis and modeling of errors and losses over 802.11b LANs for highbitrate realtime multimedia
, 2003
"... Inherent errorresilient nature of multimediacultim renders two highlevel options for wireless multimedia applicedia design. One option is to employ (semi) reliable wireless Medium Acium Control (MAC)funcLjjq in cLj"OwL;C with the traditional User Datagram Protocm (UDP). The other option is t ..."
Abstract

Cited by 20 (13 self)
 Add to MetaCart
Inherent errorresilient nature of multimediacultim renders two highlevel options for wireless multimedia applicedia design. One option is to employ (semi) reliable wireless Medium Acium Control (MAC)funcLjjq in cLj"OwL;C with the traditional User Datagram Protocm (UDP). The other option is to employ a lessreliable MAC and transport layerprotocq stac that passescssesLMO pacesL to theapplicTL;C layer,whic cc,LCqOTTL acLcLC a "higher throughput". This "higher throughput"traffic however,cwev cwever many "useless"c`useles pacsele In this paper, we address key questions regarding the viability of the above two options for the support of highbitrate wireless multimediaappliciaLOM over 802.11b LANs. First, we study the level of throughput improvements realized by the lessreliableprotocl stac at 2, 5.5 and 11 Mbps data rates using acngL measurements thatmimic realistic home or business settings.Secngs we analyze and model the error patterns within the "higher throughput"chroughp pacoug to evaluate their potentialimpac on multimediaapplicdiaLMM Third, wecLCOMM the amount of overhead that is needed at the applicL;MC layer toacOjjC different levels of lostandcLMqwUjL; MCUUTL recMqwU for the two (reliable and lessreliable) protoce stac sce)LkU MajorcorLO"UkL of our studyincyLM` (1) EitherprotocTkM`L;M is viable at 2 Mbps while neither of them is viable at 11 Mbps underrealistic settings; and (2) Some benefits of the "higher throughput" cthrough pacrou c be realized at 5.5 Mbps whencenLMjk with a joint erasureerrorprotecer algorithm at the applicOwL; layer.
Adaptive Mixtures of Probabilistic Transducers
 Neural Computation
, 1996
"... We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations i ..."
Abstract

Cited by 19 (3 self)
 Add to MetaCart
We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best transducer from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer. 1 Introduction Supervised learning of probabilistic mappings between temporal sequences is an important goal of natural data analysis and classification with a broad range of applications, including handwriting and speech recognition, natural language processing and biological sequence analysis. Research efforts in supervised learning of probabilistic mappings have been almost exclusively focused on estimating the parameters of a predefined model. For example, Giles et al. (1992) used a...