Mining Intrusion Detection Alarms for Actionable Knowledge
- In The 8th ACM International Conference on Knowledge Discovery and Data Mining, 2002
Cited by 46 (1 self)
In response to attacks against enterprise networks, administrators increasingly deploy intrusion detection systems. These systems monitor hosts, networks, and other resources for signs of security violations. The use of intrusion detection has given rise to another difficult problem, namely the handling of a generally large number of alarms. In this paper, we mine historical alarms to learn how future alarms can be handled more efficiently. First, we investigate episode rules with respect to their suitability in this approach. We report the difficulties encountered and the unexpected insights gained. In addition, we introduce a new conceptual clustering technique, and use it in extensive experiments with real-world data to show that intrusion detection alarms can be handled efficiently by using previously mined knowledge.
Efficient discovery of error-tolerant frequent itemsets in high dimensions
- In SIGKDD 2001, 2001
Cited by 44 (0 self)
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
Feature selection for clustering
- In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000
Cited by 34 (5 self)
Clustering is an important data mining task. Data mining often concerns large and high-dimensional data but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Different features affect clusters differently, some are important for clusters while others may hinder the clustering task. An efficient way of handling it is by selecting a subset of important features. It helps in finding clusters efficiently, understanding the data better and reducing data size for efficient storage, collection and processing. The task of finding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data where class information is available. For unsupervised data, without class information, often principal components (PCs) are used, but PCs still require all features and they may be difficult to understand. Our approach: first features are ranked according to their importance on clustering and then a subset of important features are selected. For large data we use a scalable method using sampling. Empirical evaluation shows the effectiveness and scalability of our approach for benchmark and synthetic data sets.
Finding localized associations in market basket data
- Knowledge and Data Engineering, 2002
Cited by 25 (0 self)
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very different from localized segments. In such cases, it is desirable to design algorithms which are effective in discovering localized associations, because they expose a customer pattern which is more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.
Electricity based external similarity of categorical attributes
- In PAKDD 2003, 2003
Cited by 22 (6 self)
Similarity or distance measures are fundamental and critical properties for data mining tools. Categorical attributes abound in databases. The Car Make, Gender, Occupation, etc. fields in an automobile insurance database are very informative. Sadly, categorical data is not easily amenable to similarity computations. A domain expert might manually specify some or all of the similarity relationships, but this is error-prone and not feasible for attributes with large domains, nor is it useful for cross-attribute similarities, such as between Gender and Occupation. External similarity functions define a similarity between, say, Car Makes by looking at how they co-occur with the other categorical attributes. We exploit a rich duality between random walks on graphs and electrical circuits to develop REP, an external similarity function. REP is theoretically grounded while the only prior work was ad-hoc. The usefulness of REP is shown in two experiments. First, we cluster categorical attribute values showing improved inferred relationships. Second, we use REP effectively as a nearest neighbour classifier.
Automatic categorization of query results
- SIGMOD Conf, 2004
Cited by 22 (0 self)
Exploratory ad-hoc queries could return too many answers, a phenomenon commonly referred to as “information overload”. In this paper, we propose to automatically categorize the results of SQL queries to address this problem. We dynamically generate a labeled, hierarchical category structure: a user can determine whether a category is relevant simply by examining its label, and can then explore just the relevant categories and ignore the remaining ones, thereby reducing information overload. We first develop analytical models to estimate the information overload faced by a user for a given exploration. Based on those models, we formulate the categorization problem as a cost optimization problem and develop heuristic algorithms to compute the min-cost categorization.
A Survey on Wavelet Applications in Data Mining
- 2003
Cited by 21 (1 self)
Recently there has been significant development in the use of wavelet methods in various data mining processes. However, no comprehensive survey has been available on the topic. The goal of this paper is to fill the void. First, the paper presents a high-level data-mining framework that reduces the overall process into smaller components. Then applications of wavelets for each component are reviewed. The paper concludes by discussing the impact of wavelets on data mining research and outlining potential future research directions and applications.
Entropy-Based Criterion in Categorical Clustering
- Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Cited by 19 (2 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
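As an illustration of what an entropy-type heterogeneity measure can look like, the sketch below scores a clustering of categorical rows by the cluster-size-weighted sum of per-attribute entropies. This is one common form of such a criterion and treats attributes independently; whether it matches the paper's exact criterion is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def expected_entropy(rows, labels):
    """Cluster-size-weighted sum of per-attribute entropies, in bits.

    Lower is better: 0.0 means every cluster is perfectly homogeneous.
    Illustrative form only; attributes are treated independently.
    """
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    total = 0.0
    for members in clusters.values():
        cluster_entropy = 0.0
        for column in zip(*members):  # one attribute at a time
            for count in Counter(column).values():
                p = count / len(members)
                cluster_entropy -= p * math.log2(p)
        total += len(members) / n * cluster_entropy
    return total


rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
pure = expected_entropy(rows, [0, 0, 1, 1])   # identical rows grouped together
mixed = expected_entropy(rows, [0, 1, 0, 1])  # each cluster mixes both kinds
```

On this toy data the homogeneous grouping scores lower than the mixed one, which is the behaviour a clustering criterion of this type is meant to reward.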
CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data
- In: Proc of KDD’02, 2002
Cited by 18 (2 self)
This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume. Starting from a heuristic method of increasing the height-to-width ratio of the cluster histogram, we develop a novel algorithm, CLOPE, which is very fast and scalable, while being quite effective. We demonstrate the performance of our algorithm on two real-world datasets, and compare CLOPE with the state-of-the-art algorithms.
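The "height-to-width ratio of the cluster histogram" can be illustrated with a small sketch. It assumes the histogram counts how often each item occurs in a cluster, that width is the number of distinct items, and that height is total occurrences divided by width. These definitions and the function names are assumptions made for illustration; the abstract names the ratio but does not define it, and this is not the CLOPE algorithm itself.

```python
from collections import Counter


def histogram_shape(cluster):
    """Return (size, width, height) of a cluster's item histogram."""
    counts = Counter(item for transaction in cluster for item in transaction)
    size = sum(counts.values())  # total item occurrences in the cluster
    width = len(counts)          # number of distinct items
    height = size / width        # average occurrences per distinct item
    return size, width, height


def height_to_width(cluster):
    """Ratio a heuristic of this kind would try to increase."""
    _, width, height = histogram_shape(cluster)
    return height / width


tight = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]  # transactions share items
loose = [{"a", "b"}, {"c", "d"}, {"e", "f", "g"}]  # transactions share nothing
```

Both toy clusters contain seven item occurrences, but the one whose transactions overlap has a narrower, taller histogram and therefore a larger ratio.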
Scalable clustering of categorical data
- In EDBT, 2004
Cited by 17 (4 self)
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
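For readers unfamiliar with how an IB-style distance between clusters can be computed, the sketch below uses the form commonly given for agglomerative Information Bottleneck: the information lost by merging two clusters is their combined probability mass times the weighted Jensen-Shannon divergence of their conditional distributions. That LIMBO's tuple distance takes exactly this form is an assumption, since the abstract does not state it, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(weight_a, dist_a, weight_b, dist_b):
    """Information loss (bits) from merging two clusters.

    weight_*: probability mass of each cluster
    dist_*:   each cluster's conditional distribution over attribute values
    """
    total = weight_a + weight_b
    share_a, share_b = weight_a / total, weight_b / total
    merged = [share_a * x + share_b * y for x, y in zip(dist_a, dist_b)]
    js = share_a * kl_divergence(dist_a, merged) + share_b * kl_divergence(dist_b, merged)
    return total * js


same = merge_cost(0.25, [0.5, 0.5], 0.25, [0.5, 0.5])      # identical profiles
disjoint = merge_cost(0.5, [1.0, 0.0], 0.5, [0.0, 1.0])    # nothing in common
```

Merging clusters with identical value distributions loses no information, while merging clusters with disjoint distributions loses the most, so a hierarchical algorithm built on this cost would merge the cheapest pair first.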

