Results 1 – 10 of 12
Athena: Mining-based interactive management of text databases
 International Conference on Extending Database Technology
, 2000
Abstract

Cited by 32 (2 self)
Abstract. We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, underweighting long documents, and overweighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
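The Lidstone-vs-Laplace enhancement mentioned in the abstract amounts to a one-line change in the smoothed word-probability estimate. A minimal sketch, assuming a multinomial Naive Bayes setting; the function name and toy numbers are illustrative, not from the paper:

```python
def lidstone_prob(count, total, vocab_size, lam=0.1):
    """Smoothed estimate of P(word | class).

    lam = 1.0 recovers Laplace's law of succession; 0 < lam < 1 gives
    Lidstone's law, which the abstract reports as more accurate for text.
    """
    return (count + lam) / (total + lam * vocab_size)

# A word seen 3 times among 10 class tokens, with a 5-word vocabulary:
laplace = lidstone_prob(3, 10, 5, lam=1.0)   # (3 + 1)   / (10 + 5)
lidstone = lidstone_prob(3, 10, 5, lam=0.1)  # (3 + 0.1) / (10 + 0.5)
```

With small lam, rare words are pulled toward zero less aggressively than under Laplace smoothing, which tends to suit the sparse counts typical of text.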
On supervised selection of Bayesian networks
 In UAI99
, 1999
Abstract

Cited by 18 (6 self)
Given a set of possible models (e.g., Bayesian network structures) and a data sample, in the unsupervised model selection problem the task is to choose the most accurate model with respect to the domain joint probability distribution. In contrast to this, in supervised model selection it is a priori known that the chosen model will be used in the future for prediction tasks involving more "focused" predictive distributions. Although focused predictive distributions can be produced from the joint probability distribution by marginalization, in practice the best model in the unsupervised sense does not necessarily perform well in supervised domains. In particular, the standard marginal likelihood score is a criterion for the unsupervised task, and, although frequently used for supervised model selection also, does not perform well in such tasks. In this paper we study the performance of the marginal likelihood score empirically in supervised Bayesian network selection tasks by using a large number of publicly available classification data sets, and compare the results to those obtained by alternative model selection criteria, including empirical cross-validation methods, an approximation of a supervised marginal likelihood measure, and a supervised version of Dawid's prequential (predictive sequential) principle. The results demonstrate that the marginal likelihood score does not perform well for supervised model selection, while the best results are obtained by using Dawid's prequential approach.
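Dawid's prequential principle can be illustrated with a short sketch: a model is scored by how well it sequentially predicts each observation given only the data seen so far. The interface below is hypothetical, and the example uses a simple Laplace-smoothed coin predictor rather than a Bayesian network:

```python
import math

def prequential_log_score(stream, predict_given_past):
    """Accumulate log P(x_t | x_1..x_{t-1}) over a data stream.

    `predict_given_past(past, x)` stands in for any model's one-step
    predictive probability (e.g. a query against a Bayesian network).
    """
    total, past = 0.0, []
    for x in stream:
        total += math.log(predict_given_past(past, x))
        past.append(x)
    return total

# Toy binary predictor using Laplace's rule of succession:
coin = lambda past, x: (past.count(x) + 1) / (len(past) + 2)
score = prequential_log_score([1, 1, 0], coin)  # log(1/2 * 2/3 * 1/4)
```

Because each observation is predicted before being incorporated, the score rewards models that generalize rather than merely fit, which is the property the abstract finds valuable for supervised selection.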
Supervised model-based visualization of high-dimensional data
, 2000
Abstract

Cited by 18 (9 self)
When high-dimensional data vectors are visualized on a two- or three-dimensional display, the goal is that two vectors close to each other in the multidimensional space should also be close to each other in the low-dimensional space. Traditionally, closeness is defined in terms of some standard geometric distance measure, such as the Euclidean distance, based on a more or less straightforward comparison between the contents of the data vectors. However, such distances do not generally reflect properly the properties of complex problem domains, where changing one bit in a vector may completely change the relevance of the vector. What is more, in real-world situations the similarity of two vectors is not a universal property: even if two vectors can be regarded as similar from one point of view, from another point of view they may appear quite dissimilar. In order to capture these requirements for building a pragmatic and flexible similarity measure, we propose a data visualization scheme where the similarity of two vectors is determined indirectly by using a formal model of the problem domain; in our case, a Bayesian network model. In this scheme, two vectors are considered similar if they lead to similar predictions, when given as input to a Bayesian network model. The scheme is supervised in the sense that different perspectives can be taken into account by using different predictive distributions, i.e., by changing what is to be predicted. In addition, the modeling framework can also be used for validating the rationality of the resulting visualization. This model-based visualization scheme has been implemented and tested on real-world domains with encouraging results.
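The core idea — two vectors are similar when the model's predictions for them are similar — can be sketched as a distance between predictive distributions. The function names and the choice of symmetrized KL divergence are illustrative assumptions, not the paper's exact construction:

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

def model_based_distance(predict, x, y):
    """`predict` stands in for any model mapping a data vector to a
    predictive distribution (e.g. a Bayesian network query); two
    vectors are close exactly when those distributions are close."""
    return symmetric_kl(predict(x), predict(y))

# Toy model: predicts class probabilities from the first feature only,
# so vectors differing elsewhere are identical from this perspective.
predict = lambda v: [0.9, 0.1] if v[0] > 0 else [0.2, 0.8]
d_same = model_based_distance(predict, [1, 5], [1, -3])  # 0.0
d_diff = model_based_distance(predict, [1, 5], [-1, 5])  # > 0
```

Note how the toy model encodes the "perspective" the abstract describes: swapping in a predictor of a different target variable changes which vectors count as similar.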
Segmented regression estimators for massive data sets
 In Second SIAM International Conference on Data Mining
, 2002
Abstract

Cited by 12 (6 self)
We describe two methodologies for obtaining segmented regression estimators from massive training data sets. The first methodology, called Linear Regression Tree (LRT), is used for continuous response variables, and the second and complementary methodology, called Naive Bayes Tree (NBT), is used for categorical response variables. These are implemented in the IBM ProbE™ (Probabilistic Estimation) data mining engine, which is an object-oriented framework for building classes of segmented predictive models from massive training data sets. Based on this methodology, an application called ATMSE™ for direct-mail targeted marketing has been developed jointly with Fingerhut Business Intelligence [1].
Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Abstract

Cited by 5 (0 self)
We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (ˆfCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that ˆfCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources.
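The discriminative criterion being approximated — conditional log-likelihood — differs from the generative log-likelihood only in normalizing each term over the classes. A minimal sketch, where the dict-based interface is an assumption for illustration:

```python
import math

def conditional_log_likelihood(joint_probs, labels):
    """Sum of log P(c_i | x_i), computed from the joint probabilities
    P(c, x_i) that a generative classifier assigns to each class.
    Contrast with the plain log-likelihood, sum of log P(c_i, x_i):
    this score normalizes each example's term over all classes, so it
    rewards models that discriminate between classes."""
    return sum(math.log(probs[c] / sum(probs.values()))
               for probs, c in zip(joint_probs, labels))

# Two examples, two classes; the model favors the true class in both.
score = conditional_log_likelihood(
    [{"a": 0.40, "b": 0.10}, {"a": 0.05, "b": 0.45}],
    ["a", "b"],
)  # log(0.4/0.5) + log(0.45/0.5)
```

The per-example normalizing sum is what breaks decomposability over the network structure, which is why the paper devises a factorized approximation.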
Using Bayesian Networks For Visualizing High-Dimensional Data
, 1999
Abstract

Cited by 4 (2 self)
A Bayesian (belief) network is a representation of a probability distribution over a set of random variables. One of the main advantages of this model family is that it offers a theoretically solid machine learning framework for constructing accurate domain models from sample data efficiently and reliably. As the parameters of a Bayesian network have a precise semantic interpretation, the learned models can be used for data mining purposes, i.e., for examining regularities found in the data. In addition to this type of direct examination of the model, we suggest that the learned Bayesian networks can also be used for indirect data mining purposes through a visualization scheme which can be used for producing 2D or 3D representations of high-dimensional problem domains. Our visualization scheme is based on the predictive distributions produced by the Bayesian network model, which means that the resulting visualizations can also be used as a post-processing tool for visual inspection of ...
Unsupervised Bayesian Visualization of High-Dimensional Data
, 2000
Abstract

Cited by 3 (0 self)
We propose a data reduction method based on a probabilistic similarity framework where two vectors are considered similar if they lead to similar predictions. We show how this type of a probabilistic similarity metric can be defined both in a supervised and unsupervised manner. As a concrete application of the suggested multidimensional scaling scheme, we describe how the method can be used for producing visual images of high-dimensional data, and give several examples of visualizations obtained by using the suggested scheme with probabilistic Bayesian network models.
1. INTRODUCTION
Multidimensional scaling (see, e.g., [3, 2]) is a data compression or data reduction task where the goal is to replace the original high-dimensional data vectors with much shorter vectors, while losing as little information as possible. Intuitively speaking, it can be argued that a pragmatically sensible data reduction scheme is such that two vectors close to each other in the original multidimensional s...
Building and Maintaining Web Taxonomies
, 2002
Abstract

Cited by 3 (0 self)
A recognized problem for internet commerce is the task of building a product taxonomy from web pages, without access to corporate databases, and then populating a database with link information about service, spare parts, reviews, product specifications, product family, etc. A key precursor for this task is the ability to build, in an unsupervised manner, classification hierarchies of potentially useful web pages.
An Unsupervised Bayesian Distance Measure
 Proceedings of the Fifth European Workshop on Case-Based Reasoning (EWCBR’2000), LNAI 1898
, 2000
Abstract

Cited by 2 (0 self)
We introduce a distance measure based on the idea that two vectors are considered similar if they lead to similar predictive probability distributions. The suggested approach avoids the scaling problem inherent to many alternative techniques, as the method automatically transforms the original attribute space to a probability space where all the numbers lie between 0 and 1. The method is also flexible in the sense that it allows different attribute types (discrete or continuous) in the same consistent framework. To study the validity of the suggested measure, we ran a series of experiments with publicly available data sets. The empirical results demonstrate that the unsupervised distance measure is sensible in the sense that it can be used for discovering the hidden clustering structure of the data.
1 Introduction
Machine learning techniques usually aim at compressing available sample data into more compact representations called models. These models can then be used for ...
Identification of toxicologically predictive gene sets using cDNA microarrays
, 2001
Abstract
We have developed an approach to classify toxicants based upon their influence on profiles of mRNA transcripts. Changes in liver gene expression were examined after exposure of mice to 24 model treatments that fall into five well-studied toxicological categories: peroxisome proliferators, aryl hydrocarbon receptor agonists, noncoplanar polychlorinated biphenyls, inflammatory agents, and hypoxia-inducing agents. Analysis of 1200 transcripts using both a correlation-based approach and a probabilistic approach resulted in a classification accuracy of between 50 and 70%. However, with the use of a forward parameter selection scheme, a diagnostic set of 12 transcripts was identified that provided an estimated 100% predictive accuracy based on leave-one-out cross-validation. Expansion of this approach to additional chemicals of regulatory concern could serve as an important screening step in a new era of toxicological testing.
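Leave-one-out cross-validation, the accuracy estimate used above, is simple to state in code. The 1-nearest-neighbour classifier here is only a stand-in for illustration; the study's actual classifiers were correlation-based and probabilistic:

```python
def loo_accuracy(xs, ys, fit_predict):
    """Leave-one-out cross-validation: train on all but one example,
    predict the held-out one, repeat for every example, and report the
    fraction predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if fit_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Stand-in classifier: 1-nearest neighbour on a single expression value.
one_nn = lambda tx, ty, x: ty[min(range(len(tx)), key=lambda j: abs(tx[j] - x))]
acc = loo_accuracy([0.0, 0.1, 1.0, 1.1], [0, 0, 1, 1], one_nn)  # 1.0
```

With only 24 treatments available, holding out a single example at a time makes the most of the data, though the abstract's "estimated 100%" hedging is apt: LOO estimates on small samples carry high variance.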