Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
 In SIGMOD Conference
, 2001
Abstract

Cited by 351 (47 self)
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
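The step of combining the learners' predictions can be illustrated with a minimal sketch. It assumes each base learner returns a confidence score per mediated-schema label and that the combiner is a plain weighted sum; the learner names, labels, and weights below are hypothetical, and the abstract does not say how the actual meta-learner is trained.

```python
from collections import defaultdict


def combine_predictions(learner_scores, learner_weights):
    """Weighted-sum combination of per-learner label scores.

    learner_scores:  {learner_name: {label: confidence}}
    learner_weights: {learner_name: weight}  (assumed given, not learned here)
    Returns the best label and the combined score of every label.
    """
    combined = defaultdict(float)
    for learner, scores in learner_scores.items():
        weight = learner_weights.get(learner, 0.0)
        for label, score in scores.items():
            combined[label] += weight * score
    best = max(combined, key=combined.get)
    return best, dict(combined)


# Hypothetical scores for one source-schema element.
scores = {
    "name_learner": {"ADDRESS": 0.7, "PHONE": 0.3},
    "content_learner": {"ADDRESS": 0.4, "PHONE": 0.6},
}
weights = {"name_learner": 0.5, "content_learner": 0.5}
best_label, combined = combine_predictions(scores, weights)
```

With equal weights the two learners disagree only mildly, and the combined score favors the label with the higher total confidence.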
Mining the Network Value of Customers
 In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining
, 2002
Abstract

Cited by 330 (11 self)
One of the major applications of data mining is in helping companies determine which potential customers to market to. If the expected profit from a customer is greater than the cost of marketing to her, the marketing action for that customer is executed. So far, work in this area has considered only the intrinsic value of the customer (i.e., the expected profit from sales to her). We propose to model also the customer's network value: the expected profit from sales to other customers she may influence to buy, the customers those may influence, and so on recursively. Instead of viewing a market as a set of independent entities, we view it as a social network and model it as a Markov random field. We show the advantages of this approach using a social network mined from a collaborative filtering database. Marketing that exploits the network value of customers, also known as viral marketing, can be extremely effective, but is still a black art. Our work can be viewed as a step towards providing a more solid foundation for it, taking advantage of the availability of large relevant databases. Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications - data mining
Relational Learning with Statistical Predicate Invention: Better Models for Hypertext
 Machine Learning
, 2001
Abstract

Cited by 68 (0 self)
We present a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner. This approach is well suited to learning in hypertext domains because its statistical component allows it to characterize text in terms of word frequencies, whereas its relational component is able to describe how neighboring documents are related to each other by hyperlinks that connect them. We evaluate our approach by applying it to tasks that involve learning definitions for (i) classes of pages, (ii) particular relations that exist between pairs of pages, and (iii) locating a particular class of information in the internal structure of pages. Our experiments demonstrate that this new approach is able to learn more accurate classifiers than either of its constituent methods alone. Keywords: Relational Learning, Text Categorization, Predicate Invention, Naive Bayes
Tree induction vs. logistic regression: A learning-curve analysis
 CEDER WORKING PAPER #IS0102, STERN SCHOOL OF BUSINESS
, 2001
Abstract

Cited by 64 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Mining of concurrent text and time series
 In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining Workshop on Text Mining
, 2000
Enhanced Word Clustering for Hierarchical Text Classification
, 2002
Abstract

Cited by 44 (1 self)
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm achieves higher classification accuracy especially at lower number of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from Dmoz Open Directory.
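For readers unfamiliar with the Jensen-Shannon divergence named in this abstract, the following is a minimal sketch of its simplest form: two discrete distributions with equal weights, measured in bits. The clustering criterion in the paper may use a weighted generalization over many distributions; that is not shown here, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is positive, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # no overlapping support
```

Identical distributions give a divergence of zero, and distributions with disjoint support give the maximum of one bit.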
Robust Feature Selection by Mutual Information Distributions
 Proceedings of the 18th International Conference on Uncertainty in Artificial Intelligence (UAI-2002)
, 2002
Abstract

Cited by 28 (6 self)
Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported. Asymptotic approximations of the distribution are proposed. The results are applied to the problem of selecting features for incremental learning and classification of the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that allows one to efficiently extend the above methods to incomplete samples in an easy and effective way.
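As a point of reference, the empirical mutual information that this abstract treats as the traditional baseline can be sketched as a plug-in estimate from a table of joint counts. This is only the descriptive quantity, in nats; it is not the Bayesian distribution of mutual information the paper develops, and the function name is illustrative.

```python
import math


def mutual_information(counts):
    """Empirical (plug-in) mutual information, in nats, from a 2-D table of joint counts."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing
                mi += (n_ij / n) * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return mi


independent = mutual_information([[1, 1], [1, 1]])
dependent = mutual_information([[1, 0], [0, 1]])
```

A uniform table yields zero, while a table in which one variable determines the other yields log 2 nats for two equally likely values.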
Locally Weighted Naive Bayes
 Proceedings of the Conference on Uncertainty in Artificial Intelligence
, 2003
Abstract

Cited by 22 (0 self)
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weakness, attribute independence, and improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.
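The idea of learning a local model at prediction time can be sketched as follows. This is an illustrative reading only, not the paper's algorithm: it assumes categorical attributes, Hamming distance, the k nearest training instances, an inverse-distance weight of 1 / (1 + d), and Laplace smoothing, none of which are stated in the abstract.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of attribute positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def lwnb_predict(train, query, k=3):
    """Fit a weighted naive Bayes on the k nearest neighbours of `query` and classify it.

    train: list of (attribute_tuple, label) pairs with categorical attributes.
    """
    labels = sorted({label for _, label in train})
    # Number of distinct values per attribute, used for Laplace smoothing.
    n_values = [len({attrs[i] for attrs, _ in train}) for i in range(len(query))]
    neighbours = sorted(train, key=lambda inst: hamming(inst[0], query))[:k]

    class_w = defaultdict(float)  # weighted class counts
    attr_w = defaultdict(float)   # weighted counts keyed by (label, attribute index, value)
    for attrs, label in neighbours:
        w = 1.0 / (1.0 + hamming(attrs, query))  # assumed weighting kernel
        class_w[label] += w
        for i, value in enumerate(attrs):
            attr_w[(label, i, value)] += w

    total = sum(class_w.values())
    best_label, best_score = None, -1.0
    for label in labels:
        score = (class_w[label] + 1.0) / (total + len(labels))
        for i, value in enumerate(query):
            score *= (attr_w[(label, i, value)] + 1.0) / (class_w[label] + n_values[i])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]
prediction = lwnb_predict(train, ("rainy", "cool"), k=3)
```

Because the model is rebuilt from the neighbourhood of each query, the independence assumption only has to hold approximately within that neighbourhood rather than across the whole training set.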
Information Theoretic Feature Clustering for Text Classification
 JOURNAL OF MACHINE LEARNING RESEARCH (JMLR), SPECIAL
, 2002
Abstract

Cited by 17 (0 self)
High dimensionality of text can become a severe deterrent in applying complex learners like Support Vector Machines to the task of text classification. Word clustering is a powerful alternative to feature selection for reducing the dimensionality of text. In this paper we propose a new information-theoretic divisive algorithm for word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature resulting in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then
Naive Bayes for regression
 Machine Learning
, 2000
Abstract

Cited by 14 (0 self)
Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces "model trees" (decision trees with linear regression functions at the leaves). Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.
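The kernel density estimators mentioned in this abstract can be illustrated with a minimal one-dimensional sketch. It assumes a Gaussian kernel and a fixed, user-supplied bandwidth; the abstract does not specify the kernel or how the bandwidth is chosen, and this block shows only the density estimate, not the full regression method.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Gaussian kernel density estimate at x from one-dimensional samples."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Normalisation so that the estimate integrates to one.
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


# Hypothetical target values observed in training data.
targets = [1.0, 1.2, 0.8, 3.0]
near_cluster = gaussian_kde(targets, 1.0, bandwidth=0.5)
far_from_data = gaussian_kde(targets, 5.0, bandwidth=0.5)
```

The estimate is higher near the cluster of observed values than far away from them, which is what lets a density over a numeric target stand in for the class probabilities of the categorical case.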