Results 1 - 10 of 40
Sampling Large Databases for Association Rules
1996
Cited by 330 (4 self)
Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
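The sample-then-verify idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the brute-force level-wise miner, the `max_size` cap and the `lowered_support` parameter are all assumptions made for illustration, the verification pass here counts over the whole database rather than only the remainder, and the check for itemsets missed by the sample (the case the abstract handles with a second pass) is omitted.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search; returns {itemset: support as a fraction}."""
    n = len(transactions)
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    found = {}
    size = 1
    while level and size <= max_size:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: k / n for c, k in counts.items() if k / n >= min_support}
        found.update(frequent)
        # Join frequent sets to form candidates that are one item larger.
        level = list({a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == size + 1})
        size += 1
    return found


def sample_then_verify(transactions, min_support, sample_size,
                       lowered_support, seed=0):
    """Mine a random sample, then confirm candidates in one full pass."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    # A lowered threshold on the sample makes missed itemsets less likely.
    candidates = frequent_itemsets(sample, lowered_support)
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:  # the single pass over the database
        for c in counts:
            if c <= t:
                counts[c] += 1
    n = len(transactions)
    # Supports are exact because they are counted on the database itself.
    return {c: k / n for c, k in counts.items() if k / n >= min_support}
```

Transactions are expected to be a list of `frozenset` objects. Any itemset returned has an exact support value; what the sample can get wrong is only which candidates are considered at all.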
CACTUS - Clustering Categorical Data Using Summaries
1999
Cited by 71 (0 self)
Clustering is an important data mining problem. Most of the earlier work on clustering focussed on numeric attributes which have a natural ordering on their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received some attention. However, previous algorithms do not give a formal description of the clusters they discover and some of them assume that the user post-processes the output of the algorithm to identify the final clusters. In this paper, we introduce a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes. We then describe a very fast summarization-based algorithm called CACTUS that discovers exactly such clusters in the data. CACTUS has two important characteristics. First, the algorithm requires only two scans of the dataset, and hence is very fast and scalable. Our experiments on a variety of datasets show that CACTUS outperforms previous work by a factor of 3 to 10. Second, CACTUS can find clusters in subsets of all attributes and can thus perform a subspace clustering of the data. This feature is important if clusters do not span all attributes, a likely scenario if the number of attributes is very large. In a thorough experimental evaluation, we study the performance of CACTUS on real and synthetic datasets.
Knowledge Discovery from Telecommunication Network Alarm Databases
1996
Cited by 46 (8 self)
A telecommunication network produces daily large amounts of alarm data. The data contains hidden valuable knowledge about the behavior of the network. This knowledge can be used in filtering redundant alarms, locating problems in the network, and possibly in predicting severe faults. We describe the TASA (Telecommunication Network Alarm Sequence Analyzer) system for discovering and browsing knowledge from large alarm databases. The system is built on the basis of viewing knowledge discovery as an interactive and iterative process, containing data collection, pattern discovery, rule postprocessing, etc. The system uses a novel framework for locating frequently occurring episodes from sequential data. The TASA system offers a variety of selection and ordering criteria for episodes, and supports iterative retrieval from the discovered knowledge. This means that a large part of the iterative nature of the KDD process can be replaced by iteration in the rule postprocessing stage. The user i...
A Framework for Measuring Changes in Data Characteristics
In PODS, 1999
Cited by 44 (1 self)
A data mining algorithm builds a model that captures interesting aspects of the underlying data. We develop a framework for quantifying the difference, called the deviation, between two datasets in terms of the models they induce. Our framework covers a wide variety of models including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-squared metric as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is meaningful (i.e., whether the underlying datasets have statistically significant differences in their characteristics), and discuss several practical applications.
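As a rough illustration of comparing two datasets through the models they induce, the sketch below measures how far apart two transaction sets are over a fixed collection of itemsets. This is an assumed simplification, not the paper's definition: the function names and the choice of summing absolute support differences are hypothetical, and the statistical significance test mentioned in the abstract is left out.

```python
def support(dataset, itemset):
    """Fraction of transactions in `dataset` that contain `itemset`."""
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def deviation(d1, d2, itemsets):
    """Sum of absolute support differences over the given itemsets."""
    return sum(abs(support(d1, s) - support(d2, s)) for s in itemsets)


# Two small transaction datasets and the itemsets they are compared on.
d1 = [frozenset("ab"), frozenset("a"), frozenset("b"), frozenset("ab")]
d2 = [frozenset("a"), frozenset("a"), frozenset("b"), frozenset("c")]
regions = [frozenset("a"), frozenset("b"), frozenset("ab")]
print(deviation(d1, d2, regions))
```

A dataset compared with itself has deviation zero under this measure; larger values indicate that the chosen itemsets occur at noticeably different rates.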
On private scalar product computation for privacy-preserving data mining
In Proceedings of the 7th Annual International Conference in Information Security and Cryptology, 2004
Cited by 40 (4 self)
In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of the private scalar product protocols, one of which was proposed in a leading data mining conference, are insecure. We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets. Keywords: Privacy-preserving data mining, private scalar product protocol, vertically partitioned frequent pattern mining
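To make the idea of a scalar product over homomorphic encryption concrete, here is a toy sketch. It assumes Paillier as the additively homomorphic scheme (the abstract does not name one), uses two small Mersenne primes that offer no real security, and the function names and message flow are hypothetical rather than taken from the paper. Alice holds vector x and the private key; Bob holds vector y of non-negative integers; they end up with additive shares of the scalar product.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair; these primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n, m):
    """Enc(m) = (1 + n)^m * r^n mod n^2, using (1 + n)^m = 1 + m*n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def bob_respond(n, enc_x, y):
    """Bob combines Alice's ciphertexts with his vector y and a random share."""
    n2 = n * n
    share_b = secrets.randbelow(n)
    acc = encrypt(n, -share_b)
    for c, yi in zip(enc_x, y):
        # Raising a ciphertext to y_i multiplies the hidden plaintext by y_i;
        # multiplying ciphertexts adds the hidden plaintexts.
        acc = acc * pow(c, yi, n2) % n2
    return acc, share_b


n, priv = keygen()
x, y = [1, 2, 3], [4, 5, 6]
enc_x = [encrypt(n, v) for v in x]           # Alice -> Bob
response, share_b = bob_respond(n, enc_x, y)  # Bob -> Alice
share_a = decrypt(n, priv, response)
print((share_a + share_b) % n)
```

Neither party sees the other's vector in the clear: Bob works only on ciphertexts, and Alice decrypts a value masked by Bob's random share.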
Subgroup Discovery with CN2-SD
Journal of Machine Learning Research, 2004
Cited by 34 (7 self)
The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The paper presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on 23 UCI data sets shows substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under ROC curve, when compared with the CN2 algorithm. Application of CN2-SD to a large traffic accident data set confirms these findings.
The Representation Race - Preprocessing for Handling Time Phenomena
In Ramon Lopez de Mantaras and Enric Plaza, editors, Machine Learning: ECML 2000, Lecture Notes in Artificial Intelligence, 2000
Cited by 13 (4 self)
Designing the representation languages for the input, LE, and output, LH, of a learning algorithm is the hardest task within machine learning applications. This paper emphasizes the importance of constructing an appropriate representation LE for knowledge discovery applications using the example of time related phenomena. Given the same raw data -- most frequently a database with time-stamped data -- rather different representations have to be produced for the learning methods that handle time. In this paper, a set of learning tasks dealing with time is given together with the input required by learning methods which solve the tasks. Transformations from raw data to the desired representation are illustrated by three case studies. 1 Introduction Designing the representation languages for the input and output of a learning algorithm is the hardest task within machine learning applications. The "no free lunch theorem" actually implies that if a hard learning task becomes e...
Applying Data Mining Techniques in Text Analysis
1997
Cited by 12 (1 self)
A number of recent data mining techniques have been targeted especially for the analysis of sequential data. Traditional examples of sequential data involve telecommunication alarms, WWW log files, user action registration for HCI studies, or any other series of events consisting of an event type and a time of occurrence. Text can also be seen as sequential data, in many respects similar to the data collected by sensors, or other observation systems. Traditionally, texts have been analysed using various information retrieval related methods, such as full-text analysis, and natural language processing. However, only few examples of data mining in text, particularly in full text, are available. In this paper we show that general data mining methods are applicable to text analysis tasks under certain conditions. Moreover, we present a general framework for text mining. The framework follows the general KDD process, thus containing steps from preprocessing to the utilization of the results. The data mining method that we apply is based on generalized episodes and episode rules. We consider preprocessing of the text to be essential in text mining: by shifting the focus in the preprocessing phase, data mining can be used to obtain results for various purposes. We give concrete examples of how to preprocess texts based on the intended use of the discovered results and how to balance preprocessing with postprocessing. We also present example applications including search for key words, key phrases and other co-occurring words, e.g. collocations and generalized concordances. These applications are both common and relevant tasks in information retrieval and natural language processing. We also present results from real-life data experiments to show that our approach is applicable in practice.
Using condensed representations for interactive association rule mining
In Proc. Principles and Practice of Knowledge Discovery in Databases PKDD’02, volume 2431 of LNAI, 2002
Cited by 10 (5 self)
Association rule mining is a popular data mining task. It has an interactive and iterative nature, i.e., the user has to refine his mining queries until he is satisfied with the discovered patterns. To support such an interactive process, we propose to optimize sequences of queries by means of a cache that stores information from previous queries. Unlike related works, we use condensed representations like free and closed itemsets for both data mining and caching. This results in a much more efficient mining technique in highly correlated data and a much smaller cache than in previous approaches.
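The notion of a condensed representation can be illustrated with closed itemsets, under the usual definition that an itemset is closed when no proper superset has the same support. The brute-force sketch below is an assumption-laden illustration, not the paper's mining or caching technique: all function names are hypothetical, free itemsets are not covered, and the enumeration is only practical for tiny examples.

```python
from itertools import combinations


def all_supports(transactions, min_count):
    """Brute force: absolute support of every itemset meeting min_count."""
    items = sorted(set().union(*transactions))
    supports = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            count = sum(1 for t in transactions if s <= t)
            if count >= min_count:
                supports[s] = count
    return supports


def closed_itemsets(supports):
    """Keep itemsets that have no proper superset with the same support."""
    return {s: c for s, c in supports.items()
            if not any(s < t and c == ct for t, ct in supports.items())}


def support_from_closed(closed, itemset):
    """Recover an itemset's support from the closed sets alone."""
    return max((c for s, c in closed.items() if itemset <= s), default=0)


db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("a")]
supports = all_supports(db, min_count=1)
closed = closed_itemsets(supports)
print(len(supports), len(closed))
```

On this small example seven frequent itemsets condense to three closed ones, and the support of each of the seven can still be recovered from the three, which is why a cache built on closed itemsets can be much smaller.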
Learning Temporal Rules from State Sequences
2001
"... In this paper we consider the problem of learning ..."

