Data Mining: A Database Perspective.
- in Proc. Int. Conf. Data Mining
, 1998
Cited by 5 (0 self)
Data mining on large databases has been a major concern in the research community, due to the difficulty of analyzing huge volumes of data using only traditional OLAP tools. This sort of process demands a lot of computational power, memory and disk I/O, which can only be provided by parallel computers. We present a discussion of how database technology can be integrated with data mining techniques. Finally, we also point out several advantages of addressing data-consuming activities through a tight integration of a parallel database server and data mining techniques.
1 Introduction
Data mining techniques have increasingly been studied [7, 9, 21], especially in their application to real-world databases. One typical problem is that databases tend to be very large, and these techniques often repeatedly scan the entire set. Sampling has been used for a long time, but with it subtle differences among sets of objects become less evident. This work means to provide an overview of some important data mining...
PKDD'98 Tutorial on Scalable, High-Performance Data Mining with Parallel Processing
- In Proceedings of the Principles and Practice of Knowledge Discovery in Databases (PKDD’98)
, 1998
Cited by 2 (0 self)
Contents
1 Introduction
2 Overview of 7 different approaches for speeding up data mining in large databases
3 An overview of parallel processing for data mining
4 Parallel rule induction
5 Parallel Instance-Based Learning
6 Parallel Genetic Algorithms
7 Parallel Neural Networks
8 Conclusions
Introduction
Problem: how to perform efficient data mining in very large databases.
Natural solution: parallelism.
Performance issues: any sequential data mining algorithm is O(N); parallelism reduces this lower bound to O(N/p) (N = number of tuples, p = number of processors).
Cost-benefit issues: many data warehouses are already implemented on cost-effective parallel database servers.
2 Overview of 7 different approaches for speeding up data mining in large databases
Data-Oriented Approaches:
(1) Sampling (reduces number of tuples)
(2) Attribute selection (reduces number of attributes)
(3) Discretization (reduces number of values of attributes, which in
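The O(N/p) argument in the outline above can be illustrated with a minimal sketch: N tuples are split into p near-equal partitions, each partition is scanned independently, and the partial results are combined. The function names (`partition`, `count_matches`, `parallel_count`) are hypothetical and not taken from the tutorial. Threads are used only to keep the sketch self-contained; in CPython they will not give a real speedup for CPU-bound scans.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(tuples, p):
    """Split tuples into p contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(tuples), p)
    chunks, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(tuples[start:start + size])
        start += size
    return chunks


def count_matches(chunk, predicate):
    """Sequential scan of one partition: about N/p tuples per worker."""
    return sum(1 for t in chunk if predicate(t))


def parallel_count(tuples, p, predicate):
    """Scan all partitions concurrently and add up the partial counts."""
    chunks = partition(tuples, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        partial = list(pool.map(lambda c: count_matches(c, predicate), chunks))
    return sum(partial)
```

Each worker touches roughly N/p tuples, which is the source of the bound quoted in the outline; the final sum over p partial counts is the only combining step.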
User Interactivity in Very Large Scale Data Mining
Cited by 2 (1 self)
Knowledge discovery is widely considered to be an interactive and iterative process. The data mining phase of KDD is, on the other hand, often assumed to be an indivisible step. We argue that user interaction during discovery runs is a central element in very large scale data mining. We describe a system architecture centered around a common, persistently stored search space. This gives good performance while simultaneously allowing for high user interactivity. Furthermore, the architecture supports an extremely robust and modular implementation of the data mining process. As a concrete instantiation, we discuss the use of this architecture in the ESPRIT KESO project (Knowledge Extraction for Statistical Offices).
1 Introduction
Data Mining, or Knowledge Discovery in Databases (KDD), is a rapidly developing field whose goal is to develop methods and systems for finding useful, interesting, and novel information in data. While many applications in KDD continue to be of a medium siz...
A Fuzzy Beam-Search Rule Induction Algorithm
- Principles of Data Mining and Knowledge Discovery (Proc. 3rd European Conf. - PKDD-99). Lecture Notes in Artificial Intelligence 1704
, 1999
Cited by 1 (1 self)
This paper proposes a fuzzy beam search rule induction algorithm for the classification task. The use of fuzzy logic and fuzzy sets not only provides us with a powerful, flexible approach to cope with uncertainty, but also allows us to express the discovered rules in a representation more intuitive and comprehensible for the user, by using linguistic terms (such as low, medium, high) rather than continuous, numeric values in rule conditions. The proposed algorithm is evaluated on two public-domain data sets.
1 Introduction
This paper addresses the classification task. In this task the goal is to discover a relationship between a goal attribute, whose value is to be predicted, and a set of predicting attributes. The system discovers this relationship by using known-class examples, and the discovered relationship is then used to predict the goal-attribute value (or the class) of unknown-class examples. There are numerous rule induction algorithms for the classification task. However, ...
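To make the idea of linguistic terms in rule conditions concrete, here is a minimal sketch of how a numeric value might be mapped to membership degrees for low, medium and high. The piecewise-linear membership shapes, the 0 to 100 attribute range, the use of min as the fuzzy AND, and the names `fuzzify` and `rule_degree` are all assumptions made for illustration; the abstract does not say which membership functions the paper uses.

```python
def fuzzify(x, lo=0.0, hi=100.0):
    """Map a numeric value to membership degrees in three linguistic terms.

    Assumed shapes: 'low' falls linearly from lo to the midpoint, 'high'
    rises linearly from the midpoint to hi, 'medium' is a triangle that
    peaks at the midpoint.
    """
    mid = (lo + hi) / 2.0
    half = mid - lo  # half-width of the attribute range
    return {
        "low": max(0.0, min(1.0, (mid - x) / half)),
        "medium": max(0.0, 1.0 - abs(x - mid) / half),
        "high": max(0.0, min(1.0, (x - mid) / half)),
    }


def rule_degree(example, conditions):
    """Degree to which an example satisfies a conjunction of fuzzy conditions.

    `conditions` is a list of (attribute, term) pairs; min is used as the
    fuzzy AND, which is one common choice among several.
    """
    return min(fuzzify(example[attr])[term] for attr, term in conditions)


# Hypothetical rule antecedent: IF age is low AND income is high
example = {"age": 25.0, "income": 75.0}
degree = rule_degree(example, [("age", "low"), ("income", "high")])
```

A rule written this way reads as "age is low" rather than "age < 37.5", which is the comprehensibility benefit the abstract describes.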
Data Mining: From Statistics to Inductive Logic Programming
Many different approaches exist in the field of data mining, for instance using inductive logic programming or statistics. In this paper we combine these two paradigms, applying them to the following type of classification problem in data mining. We are given a database of insurance clients. We can consider it as a sample of the population (clients and potential clients). Our task is to partition the population into homogeneous classes w.r.t. causing car accidents. A homogeneous class is a set of people that cannot be divided into subclasses with different risks of causing accidents. A definite program clause can be used to define a class in the sample and hence also a class in the population. A client is in the class if he satisfies the conditions given in the body. Whether this client has caused an accident gives the truth value of the head. We give an algorithm, using a refinement operator and confidence intervals, which can find clauses that correspond to an appropriate partition. 1 I...
Using SQL primitives and parallel DB servers to speed up knowledge discovery in large relational databases.
- Proceedings EMCSR '96
, 1996
Efficiency is crucial in KDD (Knowledge Discovery in Databases), due to the huge amount of data stored in commercial databases. We argue that high efficiency in KDD can be achieved by combining two approaches, namely mapping KDD functionality onto standard DBMS operations and executing KDD tasks on a parallel SQL server. We propose generic KDD primitives which underlie the candidate-rule evaluation procedures of many KDD algorithms, and we evaluate the speedup achieved by a parallel SQL server when executing a decision-tree learner algorithm implemented via these primitives.
1 Introduction
Our approach to Knowledge Discovery in Databases (KDD) is based on Machine Learning (ML) algorithms. However, sequential versions of most ML algorithms are impractical (i.e. take too long to run) on very large data sets. For instance, Catlett [1991] estimated that sequential C4.5 would take several months to learn from 1,000,000 examples using the state-of-the-art hardware of that time. Provost & Aro...
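As a rough illustration of mapping KDD functionality onto standard DBMS operations, the sketch below pushes the class-count computation that a decision-tree learner needs for evaluating a candidate split into a single SQL GROUP BY query. The choice of this particular primitive, the table `clients`, its columns, and the helper `class_counts` are assumptions for illustration, not the primitives defined in the paper. An in-memory SQLite database stands in for the parallel SQL server.

```python
import sqlite3


def class_counts(conn, table, attribute, class_column):
    """Count tuples per (attribute value, class) pair inside the DBMS.

    Identifiers are interpolated directly, so this sketch assumes they come
    from trusted code rather than user input.
    """
    query = (
        f'SELECT "{attribute}", "{class_column}", COUNT(*) '
        f'FROM "{table}" GROUP BY "{attribute}", "{class_column}"'
    )
    return {(value, cls): n for value, cls, n in conn.execute(query)}


# Hypothetical toy data: one predicting attribute and one goal attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (age_band TEXT, risk TEXT)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("young", "high"), ("young", "high"), ("young", "low"), ("senior", "low")],
)
counts = class_counts(conn, "clients", "age_band", "risk")
```

Only the small table of counts leaves the database; the scan over the tuples stays on the server, which is where a parallel SQL engine could distribute the work.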
Integrating KDD algorithms and RDBMS code
In this paper we outline the design of an RDBMS that will provide the user with traditional query capabilities as well as KDD queries. Our approach is not just another system which adds KDD capabilities; this design aims to integrate these KDD capabilities into the RDBMS core. The approach also defines a generic engine of Data Mining algorithms that allows easy enhancement of system capabilities as a new algorithm is implemented.
1 Introduction
Most of the KDD systems that have been implemented up to the present moment apply just one particular methodology or implement a particular algorithm (rough sets [9], attribute-induction [5], apriori [1, 2]). When designing this architecture we wanted a system that integrates data mining capabilities within the RDBMS. We wanted the system to be extensible, that is, we wanted to build a system in which adding new algorithms would be easy. This goal is achieved by dividing KDD algorithms into basic operations that will be implemented as particul...
Parallel Sequence Mining on Shared-Memory Machines
- Journal of Parallel and Distributed Computing
, 2000
We present pSPADE, a parallel algorithm for fast discovery of frequent sequences in large databases. pSPADE decomposes the original search space into smaller suffix-based classes. Each class can be solved in main memory using efficient search techniques and simple join operations. Further, each class can be solved independently on each processor, requiring no synchronization. However, dynamic inter-class and intra-class load balancing must be exploited to ensure that each processor gets an equal amount of work. Experiments on a 12-processor SGI Origin 2000 shared-memory system show good speedup and scaleup results.
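The decomposition into suffix-based classes can be sketched as follows: sequences that share the same last item fall into one class, and each class can then be handed to a processor as an independent unit of work. This is a simplified illustration, not pSPADE itself. Sequences are assumed to be plain tuples of items, class size is used as a stand-in for work, and the static greedy assignment replaces the dynamic load balancing the abstract calls for. The names `suffix_classes` and `assign_greedy` are hypothetical.

```python
from collections import defaultdict


def suffix_classes(sequences, k=1):
    """Group sequences by their length-k suffix."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[tuple(seq[-k:])].append(seq)
    return dict(classes)


def assign_greedy(classes, p):
    """Assign whole classes to p processors, largest class first.

    Each class goes to the currently least-loaded processor. Load is
    approximated by the number of sequences in the class (an assumption).
    """
    loads = [0] * p
    assignment = [[] for _ in range(p)]
    for suffix, members in sorted(classes.items(), key=lambda kv: -len(kv[1])):
        target = loads.index(min(loads))
        assignment[target].append(suffix)
        loads[target] += len(members)
    return assignment, loads


# Hypothetical 2-sequences over items A to D.
sequences = [("A", "B"), ("C", "B"), ("A", "C"), ("B", "A"), ("D", "B")]
classes = suffix_classes(sequences)
assignment, loads = assign_greedy(classes, p=2)
```

Because no class is split across processors, the workers need no synchronization while they process their classes; the uneven loads in even this toy example hint at why the abstract stresses dynamic balancing.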

