## Detecting anomalous records in categorical datasets (2007)


Venue: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Citations: 18 (2 self)

### BibTeX

@INPROCEEDINGS{Das07detectinganomalous,
  author = {Kaustav Das},
  title = {Detecting anomalous records in categorical datasets},
  booktitle = {Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining},
  year = {2007},
  pages = {220--229},
  publisher = {ACM Press}
}

### Abstract

We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are 'abnormal'. Quite often we have access to data that consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalousness based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to detect records with attribute values that are rare. Sometimes, just detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies, and propose an approach of comparing against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, and that it performs better on semi-synthetic as well as real-world datasets.
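The idea of comparing a record against marginal distributions of attribute subsets can be illustrated with a small sketch. It is not the authors' implementation: it only looks at attribute pairs, it uses the ratio r = P(A,B) / (P(A)P(B)) that the citation context for Barlow on this page describes, and the function names (`fit_counts`, `r_value`, `score`) and the add-one smoothing are assumptions made for illustration. A low r means that two values rarely occur together even though each is common on its own, whereas a value that is merely rare does not drive r down.

```python
from collections import Counter
from itertools import combinations


def fit_counts(records):
    """Count single-attribute values and attribute-pair values in training data."""
    n_attrs = len(records[0])
    single = [Counter() for _ in range(n_attrs)]
    pair = {ij: Counter() for ij in combinations(range(n_attrs), 2)}
    for rec in records:
        for i, value in enumerate(rec):
            single[i][value] += 1
        for (i, j) in pair:
            pair[(i, j)][(rec[i], rec[j])] += 1
    return single, pair, len(records)


def r_value(single, pair, n, i, j, a, b):
    """Ratio r = P(A,B) / (P(A) P(B)) for attribute i = a and attribute j = b.

    Each probability uses add-one smoothing, (count + 1) / (n + 2).
    The smoothing is an assumption of this sketch, chosen so that unseen
    combinations do not produce a zero numerator or denominator.
    """
    p_ab = (pair[(i, j)][(a, b)] + 1) / (n + 2)
    p_a = (single[i][a] + 1) / (n + 2)
    p_b = (single[j][b] + 1) / (n + 2)
    return p_ab / (p_a * p_b)


def score(model, record):
    """Smallest r over all attribute pairs; lower means more anomalous."""
    single, pair, n = model
    return min(
        r_value(single, pair, n, i, j, record[i], record[j])
        for (i, j) in pair
    )


# Toy data: the two attributes are perfectly correlated, plus one record
# whose values are rare but consistent with each other.
train = [("a", "x")] * 50 + [("b", "y")] * 50 + [("c", "z")]
model = fit_counts(train)

common = score(model, ("a", "x"))      # frequent, consistent combination
rare = score(model, ("c", "z"))        # rare values, but seen together
mismatch = score(model, ("a", "y"))    # common values, never seen together
```

In this toy data only the mismatched record scores below 1, while the rare-but-consistent record scores well above it, which mirrors the abstract's point that rare attribute values alone need not be treated as anomalies.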

### Citations

2455 | Mining association rules between sets of items in large databases
- Agrawal, Imielinski, et al.
- 1993
Citation Context: ...thods Applied to Categorical Datasets 2.1.1 Association Rule Based Approaches The task of association rule mining has received considerable attention, especially in the case of market basket analysis [3]. An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. Given a database of records (or transactions) D, where each record T ∈ D is a set of items, X ⇒ Y expresses ...

339 | Detecting intrusions using system calls: alternative data models
- Warrender, Forrest, et al.
- 1999
Citation Context: ...s proposed by Denning [28]. Traditional anomaly detection approaches build models of normal data and detect deviations from the normal model in observed data. A survey of these techniques is given in [31]. One approach is to use sequence analysis to determine anomalies. A method of modeling normal sequences using look ahead pairs and contiguous sequences is presented in [16], and a statistical method ...

326 | Data mining approaches for intrusion detection
- Lee
- 1998
Citation Context: ...ling normal sequences using look ahead pairs and contiguous sequences is presented in [16], and a statistical method to determine frequent sequences in intrusion data is presented in [15]. Lee et al. [21] uses a decision tree model over normal data and Ghosh et al. [13] uses a neural network to obtain the model. Eskin [12] uses a probability distribution model from the training data to determine anoma...

295 | Intrusion Detection Using Sequences of System Calls
- Hofmeyr, Forrest, et al.
- 1998
Citation Context: ...ese techniques is given in [31]. One approach is to use sequence analysis to determine anomalies. A method of modeling normal sequences using look ahead pairs and contiguous sequences is presented in [16], and a statistical method to determine frequent sequences in intrusion data is presented in [15]. Lee et al. [21] uses a decision tree model over normal data and Ghosh et al. [13] uses a neural netwo...

227 | Unsupervised learning
- Barlow
- 1989
Citation Context: ...ually exclusive subsets having up to k attributes, and calculate the corresponding r-values. A ratio of the form $r = \frac{P(A,B)}{P(A)P(B)}$ (1) has been proposed as a measure of suspicious coincidence by Barlow [6]. It states that two candidate fragments A and B should be combined into a composite object AB if the probability of their joint appearance P(A,B) is much higher than the probability expected in c...

174 | A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data
- Eskin, Arnold, et al.
- 2002
Citation Context: ... from the training data to determine anomalous data. They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anomalies in a dataset is used in [22] and [11]. One-class SVMs [23, 14] and Genetic Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctu...

172 | A spatial scan statistic
- Kulldorff
- 1997
Citation Context: ...tion is also commonly applied in time series data to detect unusual fluctuations compared to past data points [4, 18, 7, 26]. Another area of considerable recent interest is spatial anomaly detection [19]. The methods described so far apply to real valued data or work in a supervised setting when we have labeled training data. We now describe methods that apply to the problem of interest, i.e., on cat...

122 | Cached sufficient statistics for efficient machine learning with large datasets
- Moore, Lee
- 1998
Citation Context: ...) and C(bt) are greater than this bound. 3.1.6 Using AD Trees for computing counts The required counts are conjunctive counting queries on the dataset, and can be efficiently queried using an AD Tree [24]. The AD Tree building algorithm scans the dataset once, and precomputes information needed to answer every possible query in time independent of the number of records. The parameter leaflist size can...

118 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context: ...ed such as by zip code or area. It can also be computed dynamically similar to spatial scan. Apart from this, we can also use the association based dissimilarity measures such as methods presented in [20, 17] for grouping records. Another possible improvement is the way we deal with real valued attributes. Since we deal with actual probability values P rather than probability densities p, all the real val...

102 | Anomaly detection over noisy data using learned probability distributions
- Eskin
- 2000
Citation Context: ...determine frequent sequences in intrusion data is presented in [15]. Lee et al. [21] uses a decision tree model over normal data and Ghosh et al. [13] uses a neural network to obtain the model. Eskin [12] uses a probability distribution model from the training data to determine anomalous data. They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anoma...

98 | A Study in using Neural Networks for Anomaly and Misuse Detection
- Ghosh, Schwartzbard
- 1999
Citation Context: ...ces is presented in [16], and a statistical method to determine frequent sequences in intrusion data is presented in [15]. Lee et al. [21] uses a decision tree model over normal data and Ghosh et al. [13] uses a neural network to obtain the model. Eskin [12] uses a probability distribution model from the training data to determine anomalous data. They use a mixture model to explain the presence of ano...

93 | Finding Surprising Patterns in a Time Series Database in Linear Time and Space
- Keogh
- 2002
Citation Context: ...c Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past data points [4, 18, 7, 26]. Another area of considerable recent interest is spatial anomaly detection [19]. The methods described so far apply to real valued data or work in a supervised setting when we have labeled training d...

41 | Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning
- Moore, Wong
- 2003
Citation Context: ...s emails [9] and disease outbreak detection [32]. Any good structure and parameter learning algorithm is appropriate to learn the model. For our experiments, we used the optimal reinsertion algorithm [25] to learn the structure, and then did a maximum likelihood estimation of the network parameters. Once the model is built, to test any record we find its complete record likelihood given the probabilit...

35 | Bayesian network anomaly pattern detection for disease outbreaks
- Wong, Moore, et al.
- 2003
Citation Context: ...and efficient learning and inference techniques. Bayes Net have been used for detecting anomalies in network intrusion detection [2, 33], detecting malicious emails [9] and disease outbreak detection [32]. Any good structure and parameter learning algorithm is appropriate to learn the model. For our experiments, we used the optimal reinsertion algorithm [25] to learn the structure, and then did a maxi...

30 | Mining motifs in massive time series databases
- Patel, Keogh, et al.
- 2002
Citation Context: ...c Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past data points [4, 18, 7, 26]. Another area of considerable recent interest is spatial anomaly detection [19]. The methods described so far apply to real valued data or work in a supervised setting when we have labeled training d...

28 | A statistically based system for prioritizing information exploration under uncertainty
- Helman, Bhangoo
- 1997
Citation Context: ... A method of modeling normal sequences using look ahead pairs and contiguous sequences is presented in [16], and a statistical method to determine frequent sequences in intrusion data is presented in [15]. Lee et al. [21] uses a decision tree model over normal data and Ghosh et al. [13] uses a neural network to obtain the model. Eskin [12] uses a probability distribution model from the training data t...

28 | Unsupervised Anomaly Detection in Network Intrusion Detection Using Clustering
- Leung, Leckie
- 2005
Citation Context: ...ion model from the training data to determine anomalous data. They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anomalies in a dataset is used in [22] and [11]. One-class SVMs [23, 14] and Genetic Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusu...

25 | Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks [Internet
- Wong, Moore, et al.
- 2002
Citation Context: ... algorithms. Balderas et al. [5] mine hidden association rules, or rules that are not common, but confident. Such rules are assumed to represent the rare anomaly class. WSARE developed by Wong et al. [30] also uses rules to identify anomalies. But in this case, the rules are learnt from a historical dataset, and are applied on a collection of records from the current time interval, to detect unusual c...

22 | One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses
- Heller, Svore, et al.
- 2003
Citation Context: ...ta to determine anomalous data. They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anomalies in a dataset is used in [22] and [11]. One-class SVMs [23, 14] and Genetic Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past d...

13 | A machine learning approach to anomaly detection
- Chan, Mahoney, et al.
- 2003
Citation Context: ...vents that an attribute of T takes some particular values. Association rule mining is commonly used in the analysis of market-basket data, where the target of mining is not predetermined. Chan et al. [8] have developed a rule learning method LERAD to detect anomalies. They consider rules of the form X ⇒ Y, where X and Y are mutually exclusive subsets of attributes taking on particular values. They s...

11 | Probabilistic networks with undirected links for anomaly detection
- Ye, Xu, et al.
Citation Context: ...es for categorical data because of its parsimonious use of parameters, and efficient learning and inference techniques. Bayes Net have been used for detecting anomalies in network intrusion detection [2, 33], detecting malicious emails [9] and disease outbreak detection [32]. Any good structure and parameter learning algorithm is appropriate to learn the model. For our experiments, we used the optimal re...

9 | Using bayesian networks for detecting anomalies in Internet services
- Bronstein, Das, et al.
- 1993
Citation Context: ...es for categorical data because of its parsimonious use of parameters, and efficient learning and inference techniques. Bayes Net have been used for detecting anomalies in network intrusion detection [2, 33], detecting malicious emails [9] and disease outbreak detection [32]. Any good structure and parameter learning algorithm is appropriate to learn the model. For our experiments, we used the optimal re...

8 | An oscillatory neural network model of sparse distributed memory and novelty detection
- Borisyuk, Denham, et al.
- 2000
Citation Context: ...c Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past data points [4, 18, 7, 26]. Another area of considerable recent interest is spatial anomaly detection [19]. The methods described so far apply to real valued data or work in a supervised setting when we have labeled training d...

7 | Internet security: malicious e-mails detection and protection
- Dong-Her, Hsiu-Sen, et al.
- 2004
Citation Context: ...ts parsimonious use of parameters, and efficient learning and inference techniques. Bayes Net have been used for detecting anomalies in network intrusion detection [2, 33], detecting malicious emails [9] and disease outbreak detection [32]. Any good structure and parameter learning algorithm is appropriate to learn the model. For our experiments, we used the optimal reinsertion algorithm [25] to lear...

7 | Probabilistic principles in unsupervised learning of visual structure: human data and a model
- Hiles, P, et al.
- 2002
Citation Context: ...) is much higher than the probability expected in case of statistical independence P(A)P(B). It has also been used to investigate unsupervised learning of complex visual stimuli by human subjects [10]. Here large values of r are interesting as it signifies a suspicious coincidence of the events co-occurring. We are interested in exactly the opposite situation, where low r values signify that the e...

7 | An association-based dissimilarity measure for categorical data
- Ho
- 2005
Citation Context: ...ed such as by zip code or area. It can also be computed dynamically similar to spatial scan. Apart from this, we can also use the association based dissimilarity measures such as methods presented in [20, 17] for grouping records. Another possible improvement is the way we deal with real valued attributes. Since we deal with actual probability values P rather than probability densities p, all the real val...

7 | Improving one-class SVM for anomaly detection
- Li, Huang, et al.
- 2003
Citation Context: ...ta to determine anomalous data. They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anomalies in a dataset is used in [22] and [11]. One-class SVMs [23, 14] and Genetic Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past d...

6 | A machine learning framework for network anomaly detection using SVM and GA
- Shon
- 2005
Citation Context: ...They use a mixture model to explain the presence of anomalies. A clustering based approach to detecting anomalies in a dataset is used in [22] and [11]. One-class SVMs [23, 14] and Genetic Algorithms [29] have also been used to classify anomalies in this context. Anomaly detection is also commonly applied in time series data to detect unusual fluctuations compared to past data points [4, 18, 7, 26]. A...

4 | Scalable and practical probability density estimators for scientific anomaly detection
- Pelleg
- 2004
Citation Context: ... anomalous. For multivariate categorical data, dependency trees and bayesian networks are common representations of a probability density model. Dependency trees have been used to detect anomalies in [27]. We choose a bayesian network as the standard model against which we compare our algorithm. Hence, we give an overview of this method next. 2.1.3 Anomaly Detection Using Bayes Network A Bayesian netw...

4 | Working sets past and present
- PJ
Citation Context: ... a discussion of possible extensions of the current work. 2. RELATED WORK Anomaly detection applied to network intrusion detection has been an active area of research since it was proposed by Denning [28]. Traditional anomaly detection approaches build models of normal data and detect deviations from the normal model in observed data. A survey of these techniques is given in [31]. One approach is to u...

4 | Summary of biosurveillance-relevant technologies
- Moore, Cooper, et al.
- 2003

3 | Discovering hidden association rules
- Balderas, Berzal, et al.
- 2005
Citation Context: ... disadvantage of this method is that it learns a very small subset of all the possible rules. We have used this method as one of the baseline methods for comparison of our algorithms. Balderas et al. [5] mine hidden association rules, or rules that are not common, but confident. Such rules are assumed to represent the rare anomaly class. WSARE developed by Wong et al. [30] also uses rules to identify...

1 | Summary of biosurveillance-relevant
- Moore, Cooper, et al.