Results 1 - 10
of
23
Wrappers For Performance Enhancement And Oblivious Decision Graphs
, 1995
"... In this doctoral dissertation, we study three basic problems in machine learning and two new hypothesis spaces with corresponding learning algorithms. The problems we investigate are: accuracy estimation, feature subset selection, and parameter tuning. The latter two problems are related and are stu ..."
Abstract
-
Cited by 94 (6 self)
- Add to MetaCart
In this doctoral dissertation, we study three basic problems in machine learning and two new hypothesis spaces with corresponding learning algorithms. The problems we investigate are: accuracy estimation, feature subset selection, and parameter tuning. The latter two problems are related and are studied under the wrapper approach. The hypothesis spaces we investigate are: decision tables with a default majority rule (DTMs) and oblivious read-once decision graphs (OODGs).
A Survey of Methods for Scaling Up Inductive Algorithms
- Data Mining and Knowledge Discovery
, 1999
"... . One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule ..."
Abstract
-
Cited by 74 (10 self)
- Add to MetaCart
. One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research. Keywords: scaling up, inductive learning, decision trees, rule learning 1. Introduction The knowledge discovery and data...
Scaling Up: Distributed Machine Learning with Cooperation
- In Proceedings of the Thirteenth National Conference on Artificial Intelligence
, 1996
"... Machine-learning methods are becoming increasingly popular for automated data analysis. However, standard methods do not scale up to massive scientific and business data sets without expensive hardware. This paper investigates a practical alternative for scaling up: the use of distributed proce ..."
Abstract
-
Cited by 46 (2 self)
- Add to MetaCart
Machine-learning methods are becoming increasingly popular for automated data analysis. However, standard methods do not scale up to massive scientific and business data sets without expensive hardware. This paper investigates a practical alternative for scaling up: the use of distributed processing to take advantage of the often dormant PCs and workstations available on local networks. Each workstation runs a common rule-learning program on a subset of the data. We first show that for commonly used ruleevaluation criteria, a simple form of cooperation can guarantee that a rule will look good to the set of cooperating learners if and only if it would look good to a single learner operating with the entire data set. We then show how such a system can further capitalize on different perspectives by sharing learned knowledge for significant reduction in search effort. We demonstrate the power of the method by learning from a massive data set taken from the domain of cel...
An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning
, 1996
"... Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Som ..."
Abstract
-
Cited by 42 (8 self)
- Add to MetaCart
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of mul...
KDD for Science Data Analysis: Issues and Examples
- In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining
, 1996
"... The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabi ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabilities to focus on tasks in which machines do not even remotely approach humans---namely, creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific datasets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis. keywords: Applications in Science, Data Analysis, overview article, large databases, automated analysis, scietific data sets, scientific discovery. To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining...
Machine Learning Techniques for the Computer Security Domain of Anomaly Detection
, 2000
"... : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : xv 1 ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : xv 1
A Survey of Methods for Scaling Up Inductive Learning Algorithms
, 1997
"... : Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By collecting, categorizing, and summarizing past work on scaling up inductive learning algorithms, this paper se ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
: Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By collecting, categorizing, and summarizing past work on scaling up inductive learning algorithms, this paper serves to establish a common ground for researchers addressing the challenge. We begin with a discussion of important, but often tacit, issues related to scaling up learning algorithms. We highlight similarities among methods by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent methods, drawing on specific examples from the published literature. Finally, we use the preceding analysis to suggest how one should proceed when dealing with a large problem, and where future research efforts should be focused. Primary contact: Foster Provost NYNEX Science and Technology, 400 Westchester Avenue, White Plains, NY 10604 em...
A Comparison Of Genetic Algorithms And Other Machine Learning Systems On A Complex Classification Task From Common Disease Research
, 1995
"... A COMPARISON OF GENETIC ALGORITHMS AND OTHER MACHINE LEARNING SYSTEMS ON A COMPLEX CLASSIFICATION TASK FROM COMMON DISEASE RESEARCH by Clare Bates Congdon Co-Chairs: John H. Holland, John E. Laird The thesis project is an investigation of some well-known machine learning systems and evaluates their ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
A COMPARISON OF GENETIC ALGORITHMS AND OTHER MACHINE LEARNING SYSTEMS ON A COMPLEX CLASSIFICATION TASK FROM COMMON DISEASE RESEARCH by Clare Bates Congdon Co-Chairs: John H. Holland, John E. Laird The thesis project is an investigation of some well-known machine learning systems and evaluates their utility when applied to a classification task from the field of human genetics. This common-disease research task, an inquiry into genetic and biochemical factors and their association with a family history of coronary artery disease (CAD), is more complex than many pursued in machine learning research, due to interactions and the inherent noise in the dataset. The task also differs from most pursued in machine learning research because there is a desire to explain the dataset with a small number of rules, even at the expense of accuracy, so that they will be more accessible to medical researchers who are unaccustomed to dealing with disjunctive explanations of data. Furthermore, there is as...
Distributed Data Mining: Scaling up and beyond
- In Advances in Distributed and Parallel Knowledge Discovery
, 1999
"... In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a mor ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.

