Efficient Algorithms for Mining Outliers from Large Data Sets
Cited by 170 (1 self)
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
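The ranking described in this abstract can be illustrated with a minimal Python sketch. It follows only the simple quadratic, nested-loop idea (compare every point with every other point) and is not the paper's partition-based algorithm. The function names, the Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor, excluding itself.

    Assumes 1 <= k <= len(points) - 1 and Euclidean distance.
    """
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    # The k smallest distances, in ascending order; the last one is the k-th.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th-nearest-neighbor distance."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    return [i for _, i in heapq.nlargest(n, scores)]


# Hypothetical data: four points in a tight cluster and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = top_n_outliers(points, k=2, n=1)
```

With these sample points, the isolated point at index 4 has by far the largest distance to its second nearest neighbor, so it is the single reported outlier. The sketch costs quadratic time in the number of points, which is the kind of cost the pruning of whole partitions is meant to avoid.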
BOAT -- Optimistic Decision Tree Construction
, 1999
Cited by 97 (1 self)
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any differen...
RainForest - a Framework for Fast Decision Tree Construction of Large Datasets
- In VLDB
, 1998
Cited by 85 (8 self)
Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework for decision tree classifiers that separates the scalability aspects of algorithms for constructing a decision tree from the central features that determine the quality of the tree. This generic algorithm is easy to instantiate with specific algorithms from the literature (including C4.5, CART,
XRules: An Effective Structural Classifier for XML Data
, 2003
Cited by 63 (6 self)
XML documents have recently become ubiquitous because of their varied applicability in a number of applications. Classification is an important problem in the data mining domain, but current classification methods for XML documents use IR-based methods in which each document is treated as a bag of words. Such techniques ignore a significant amount of information hidden inside the documents. In this paper we discuss the problem of rule-based classification of XML data by using frequent discriminatory substructures within XML documents. Such a technique is more capable of finding the classification characteristics of documents. In addition, the technique can also be extended to cost-sensitive classification. We show the effectiveness of the method with respect to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semi-structured data.
A Framework for Measuring Changes in Data Characteristics
- In PODS
, 1999
Cited by 44 (1 self)
A data mining algorithm builds a model that captures interesting aspects of the underlying data. We develop a framework for quantifying the difference, called the deviation, between two datasets in terms of the models they induce. Our framework covers a wide variety of models including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-squared metric as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is meaningful (i.e., whether the underlying datasets have statistically significant differences in their characteristics), and discuss several practical applications.
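As a rough illustration of comparing two datasets through the models they induce, the Python sketch below treats each model as a set of itemset supports and sums the absolute differences in support. This particular difference function, the function names, and the toy transactions are assumptions for illustration only; the abstract does not give the framework's actual definition of deviation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def deviation(d1, d2, itemsets):
    """Assumed deviation: sum of absolute support differences over itemsets."""
    return sum(abs(support(d1, s) - support(d2, s)) for s in itemsets)


# Hypothetical transaction datasets and itemsets of interest.
d1 = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
d2 = [{"a"}, {"a"}, {"b"}, {"c"}]
itemsets = [{"a"}, {"b"}, {"a", "b"}]

dev_same = deviation(d1, d1, itemsets)   # identical datasets
dev_diff = deviation(d1, d2, itemsets)   # differing datasets
```

Identical datasets give a deviation of zero under this measure, and the value grows as the supports diverge. Deciding whether a nonzero value is statistically meaningful is the separate question the abstract raises.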
A Novel Evolutionary Data Mining Algorithm With Applications to Churn Prediction
, 2003
Cited by 27 (4 self)
Classification is an important topic in data mining research. Given a set of data records, each of which belongs to one of a number of predefined classes, the classification problem is concerned with the discovery of classification rules that allow records with unknown class membership to be correctly classified. Many algorithms have been developed to mine large data sets for classification models, and they have been shown to be very effective. However, many of them are not designed to determine the likelihood of each classification made, so they are not readily applicable to problems such as churn prediction. In such an application, the goal is not only to predict whether a subscriber will switch from one carrier to another, but also to predict the likelihood of the subscriber's doing so. A carrier can then choose to provide special personalized offers and services to those subscribers who are predicted to be more likely to churn. Given its importance, we propose a new data mining algorithm, called data mining by evolutionary learning (DMEL), to handle classification problems in which the accuracy of each prediction made has to be estimated. In performing its tasks, DMEL searches through the possible rule space using an evolutionary approach that has the following characteristics: 1) the evolutionary process begins with the generation of an initial set of first-order rules (i.e., rules with one conjunct/condition) using a probabilistic induction technique and, based on these rules, rules of higher order (two or more conjuncts) are obtained iteratively; 2) when identifying interesting rules, an objective interestingness measure is used; 3) the fitness of a ch...
Cancer classification using gene expression data
- Information Systems
, 2003
Cited by 20 (0 self)
The classification of different tumor types is of great importance in cancer diagnosis and drug discovery. However, most previous cancer classification studies are clinical-based and have limited diagnostic ability. Cancer classification using gene expression data is known to contain the keys for addressing the fundamental problems relating to cancer diagnosis and drug discovery. The recent advent of DNA microarray technique has made simultaneous monitoring of thousands of gene expressions possible. With this abundance of gene expression data, researchers have started to explore the possibilities of cancer classification using gene expression data. Quite a number of methods have been proposed in recent years with promising results. But there are still a lot of issues which need to be addressed and understood. In order to gain deep insight into the cancer classification problem, it is necessary to take a closer look at the problem, the proposed solutions and the related issues all together. In this survey paper, we present a comprehensive overview of various proposed cancer classification methods and evaluate them based on their computation time, classification accuracy and ability to reveal biologically meaningful gene information. We also introduce and evaluate various proposed gene selection methods which we believe should be an integral preprocessing step for cancer classification. In order to obtain a full picture of cancer classification, we also discuss several issues related to cancer classification, including the biological significance vs. statistical significance of a cancer classifier, the asymmetrical classification errors for cancer classifiers, and the gene contamination problem.
Data Mining and the Web: Past, Present and Future
- In ACM Workshop on Web Information and Data Management (WIDM)
, 1999
Cited by 15 (0 self)
In this paper, we begin by reviewing popular data mining techniques like association rules, classification, clustering and outlier detection. We provide a brief description of each technique as well as efficient algorithms for implementing the technique. We then discuss algorithms for discovering Web, hypertext and hyperlink structure that have been proposed by researchers in recent years. The key difference between these algorithms and earlier data mining algorithms is that the former take hyperlink information into account. Finally, we conclude by listing research issues that still remain to be addressed in the area of Web mining.
Hierarchical decision tree induction in distributed genomic databases
- IEEE Trans. Knowl. Data Eng
Cited by 9 (2 self)
Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms such as federated and peer-to-peer databases are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the vision that these databases are soon to contain large amounts of genomic data, characterized by its high dimensionality. Current decision tree algorithms would require high communication bandwidth when executed on such data, which is not likely to exist in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree as a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99%. Scalability tests show that the algorithm scales well with the size of the dataset, the number of SNPs, and the size of the distributed system.
User Defined Aggregates in Object-Relational Systems
- In Proceedings of the 16th International Conference on Data Engineering
, 2000
Cited by 8 (2 self)
User-defined aggregates are essential in many advanced database applications, particularly in expressing data mining functions, but they find little support in current systems, including Object-Relational databases. Three serious limitations of current systems are (i) the inability to introduce new aggregates (e.g., by coding them in a procedural language as originally proposed in SQL3), (ii) the inability to return partial results during the computation (e.g., to support online aggregation), and (iii) the inability to use aggregates in recursive queries (e.g., to express Bill of Materials and optimized graph searches). In this paper, we present a unified solution to these problems, which realizes the original SQL3 proposal for user-defined aggregates (UDAs) and adds significant improvements in terms of expressive power and ease of use: our SQL-AG system also supports online aggregation, monotonic aggregation, and a high-level aggregate definition language named SADL. We focu...
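To make the idea of coding a new aggregate in a procedural language concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not SQL-AG or SADL, and it addresses only limitation (i), not online aggregation or recursive queries. The aggregate name `geomean`, the class `GeometricMean`, and the table `t` are hypothetical names chosen for illustration.

```python
import math
import sqlite3


class GeometricMean:
    """Hypothetical user-defined aggregate: geometric mean of positive values."""

    def __init__(self):
        # Called once per group: initialize the aggregate state.
        self.log_sum = 0.0
        self.count = 0

    def step(self, value):
        # Called once per row in the group: fold the value into the state.
        if value is not None:
            self.log_sum += math.log(value)
            self.count += 1

    def finalize(self):
        # Called at the end of the group: produce the final result.
        return math.exp(self.log_sum / self.count) if self.count else None


conn = sqlite3.connect(":memory:")
# Register the class as a one-argument SQL aggregate named "geomean".
conn.create_aggregate("geomean", 1, GeometricMean)
conn.execute("CREATE TABLE t (g TEXT, x REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [("a", 1.0), ("a", 4.0), ("b", 9.0)]
)
rows = conn.execute(
    "SELECT g, geomean(x) FROM t GROUP BY g ORDER BY g"
).fetchall()
conn.close()
```

The initialize, per-row step, and finalize phases shown here are a common shape for user-defined aggregates. Returning partial results while the scan is still running would need a mechanism beyond this interface.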

