From data mining to knowledge discovery in databases
- AI Magazine, 1996
Cited by 215 (0 self)
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field. Across a wide variety of fields, data are
Knowledge Discovery and Data Mining: Towards a Unifying Framework
1996
Cited by 108 (0 self)
This paper presents a first step towards a unifying framework for Knowledge Discovery in Databases. We describe links between data mining, knowledge discovery, and other related fields. We then define the KDD process and basic data mining algorithms, discuss application issues and conclude with an analysis of challenges facing practitioners in the field.
Keywords: Knowledge Discovery in Databases (KDD), data mining, overview article, large databases, automated analysis, issues and challenges in data mining.
To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, August 2-4, 1996, AAAI Press. http://wwwaig. jpl.nasa.gov/kdd96
Authors: Usama Fayyad (Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, fayyad@microsoft.com); Gregory Piatetsky-Shapiro (GTE Laboratories, MS 44, Waltham, MA 02154, USA, gps@gte.com); Padhraic Smyth (Information and Computer S...)
Distance-Based Outliers: Algorithms and Applications
2000
Cited by 104 (0 self)
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB- (Distance-Based) outliers. Specifically, we show that: (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., k ≥ 5); and (ii) outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of O(kN^2), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we ...
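To make the quadratic cost concrete, here is a minimal sketch of a naive nested-loop detector. It is not the paper's algorithm: the truncated abstract does not spell out the definition, so the sketch assumes the commonly cited DB(p, D) form, in which an object is an outlier if at least a fraction p of the objects lie farther than distance D from it. The function name `db_outliers` and its parameters are hypothetical.

```python
import math


def db_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers using a naive nested loop.

    Assumed definition (not stated in the abstract above): a point is an
    outlier if at least a fraction p of all points lie farther than d from it.
    Each distance costs O(k) for k-dimensional points, and there are up to
    N * N of them, giving O(k * N^2) overall.
    """
    n = len(points)
    min_far = math.ceil(p * n)   # far points needed to qualify as an outlier
    max_near = n - min_far       # most points (self included) allowed within d
    outliers = []
    for i, a in enumerate(points):
        near = 0
        for b in points:
            if math.dist(a, b) <= d:
                near += 1
                if near > max_near:
                    break        # too many close neighbours: not an outlier
        else:
            outliers.append(i)   # inner loop finished without breaking
    return outliers


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
result = db_outliers(sample, p=0.75, d=2.0)
```

The early `break` stops scanning as soon as a point is known not to be an outlier, which helps in practice but does not change the worst-case bound.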
A Survey of Methods for Scaling Up Inductive Algorithms
- Data Mining and Knowledge Discovery, 1999
Cited by 74 (10 self)
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
Keywords: scaling up, inductive learning, decision trees, rule learning
1. Introduction The knowledge discovery and data...
The Computational Support of Scientific Discovery
2000
Cited by 21 (2 self)
In this paper, we review AI research on computational discovery and its recent application to the discovery of new scientific knowledge. We characterize five historical stages of the scientific discovery process, which we use as an organizational framework in describing applications. We also identify five distinct steps during which developers or users can influence the behavior of a computational discovery system. Rather than criticizing such intervention, as done in the past, we recommend it as the preferred approach to using discovery software. As evidence for the advantages of such human-computer cooperation, we report seven examples of novel, computer-aided discoveries that have appeared in the scientific literature. We consider briefly the role that humans played in each case, then examine one such interaction in more detail. We close by recommending that future systems provide more explicit support for human intervention in the discovery process.
Running head: Computational Sci...
Parcel: Feature Subset Selection in Variable Cost Domains
1998
Cited by 20 (1 self)
The vast majority of classification systems are designed with a single set of features, and optimised to a single specified cost. However, in examples such as medical and financial risk modelling, costs are known to vary subsequent to system design. In this paper, we present a design method for feature selection in the presence of varying costs. Starting from the Wilcoxon nonparametric statistic for the performance of a classification system, we introduce a concept called the maximum realisable receiver operating characteristic (MRROC), and prove a related theorem. A novel criterion for feature selection, based on the area under the MRROC curve, is then introduced. This leads to a framework which we call Parcel. This has the flexibility to use different combinations of features at different operating points on the resulting MRROC curve. Empirical support for each stage in our approach is provided by experiments on real world problems, with Parcel achieving superior results.
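As background for the Wilcoxon statistic mentioned above, the following minimal sketch computes it for a single scoring classifier, where it equals the area under that classifier's ROC curve. It does not construct the MRROC or implement Parcel; the function name `wilcoxon_auc` and the sample scores are illustrative assumptions.

```python
def wilcoxon_auc(pos_scores, neg_scores):
    """Wilcoxon (Mann-Whitney) estimate of the area under the ROC curve.

    It is the fraction of (positive, negative) pairs in which the positive
    example receives the higher score, with ties counted as half a win.
    """
    wins = 0.0
    for pos in pos_scores:
        for neg in neg_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for three positive and three negative cases.
auc = wilcoxon_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

Because the statistic depends only on how the scores rank the examples, it does not change when misclassification costs change, which is one reason it suits settings where costs vary after design.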
The Computer-Aided Discovery of Scientific Knowledge
- In Proceedings of the First International Conference on Discovery Science, 1998
Cited by 19 (2 self)
In this paper, we review AI research on computational discovery and its recent application to the discovery of new scientific knowledge. We characterize five historical stages of the scientific discovery process, which we use as an organizational framework in describing applications. We also identify five distinct steps during which developers or users can influence the behavior of a computational discovery system. Rather than criticizing such intervention, as done in the past, we recommend it as the preferred approach to using discovery software. As evidence for the advantages of such human-computer cooperation, we report seven examples of novel, computer-aided discoveries that have appeared in the scientific literature, along with the role that humans played in each case. We close by recommending that future systems provide more explicit support for human intervention in the discovery process.
1 Introduction The process of scientific discovery has long been viewed as the pinnacle ...
A Survey of Methods for Scaling Up Inductive Learning Algorithms
1997
Cited by 15 (1 self)
Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By collecting, categorizing, and summarizing past work on scaling up inductive learning algorithms, this paper serves to establish a common ground for researchers addressing the challenge. We begin with a discussion of important, but often tacit, issues related to scaling up learning algorithms. We highlight similarities among methods by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent methods, drawing on specific examples from the published literature. Finally, we use the preceding analysis to suggest how one should proceed when dealing with a large problem, and where future research efforts should be focused.
Primary contact: Foster Provost, NYNEX Science and Technology, 400 Westchester Avenue, White Plains, NY 10604 em...
Automated Discovery of Active Motifs in Three Dimensional Molecules
- In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997
Cited by 15 (4 self)
In this paper we present a method for discovering approximately common motifs (also known as active motifs) in three dimensional (3D) molecules. Each node in a molecule is represented by a 3D point in the Euclidean Space and each edge is represented by an undirected line segment connecting two nodes in the molecule. Motifs are rigid substructures which may occur in a molecule after allowing for an arbitrary number of rotations and translations as well as a small number (specified by the user) of node insert/delete operations in the motifs or the molecule. (We call this "approximate occurrence.") The proposed method combines the geometric hashing technique and block detection algorithms for undirected graphs. To demonstrate the utility of our algorithms, we discuss their applications to classifying three families of molecules pertaining to antibacterial sulfa drugs, anti-anxiety agents (benzodiazepines) and antiadrenergic agents (β receptors). Experimental results i...
Mining for Causes of Cancer: Machine Learning Experiments at Various Levels of Detail
- In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), Menlo Park, CA, 1997
Cited by 9 (4 self)
This paper presents, from a methodological point of view, first results of an interdisciplinary project in scientific data mining. We analyze data about the carcinogenicity of chemicals derived from the carcinogenesis bioassay program, a long-term research study performed by the US National Institute of Environmental Health Sciences. The database contains detailed descriptions of 6823 tests performed with more than 330 compounds and animals of different species, strains and sexes. The chemical structures are described at the atom and bond level, and in terms of various relevant structural properties. The goal of this paper is to investigate the effects that various levels of detail and amounts of information have on the resulting hypotheses, both quantitatively and qualitatively. We apply relational and propositional machine learning algorithms to learning problems formulated as regression or as classification tasks. In addition, these experiments have been conducted with two learning ...

