## A Linear Method for Deviation Detection in Large Databases (1996)

### Download Links

- [rakesh.agrawal-family.com]
- [www.almaden.ibm.com]
- [www.rakesh.agrawal-family.com]
- [140.115.82.191]
- DBLP

### Other Repositories/Bibliography

Citations: 85 (1 self)

### BibTeX

@INPROCEEDINGS{Arning96alinear,
  author = {Andreas Arning and Rakesh Agrawal and Prabhakar Raghavan},
  title = {A Linear Method for Deviation Detection in Large Databases},
  booktitle = {},
  year = {1996},
  pages = {164--169}
}

### Abstract

We describe the problem of finding deviations in large databases. Normally, explicit information outside the data, such as integrity constraints or predefined patterns, is used for deviation detection. In contrast, we approach the problem from the inside of the data, using the implicit redundancy of the data. We give a formal description of the problem and present a linear algorithm for detecting deviations. Our solution simulates a mechanism familiar to human beings: after seeing a series of similar data, an element disturbing the series is considered an exception. We also present experimental results from the application of this algorithm to real-life datasets, showing its effectiveness.

Index Terms: Data Mining, Knowledge Discovery, Deviation, Exception, Error

Introduction. The importance of detecting deviations (or exceptions) in data has been recognized in the fields of Databases and Machine Learning for a long time. Deviations have often been viewed as outliers, or errors, or nois...
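The mechanism described in the abstract (an element disturbing a series of similar data is considered an exception) can be sketched as follows: remove each element in turn and flag the one whose removal most reduces a dissimilarity measure of the whole set. The paper's actual dissimilarity function is not given in this excerpt, so plain variance is used here purely as an assumed stand-in.

```python
def dissimilarity(xs):
    """Assumed stand-in dissimilarity: variance (0.0 for fewer than 2 elements)."""
    if len(xs) < 2:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def most_deviating(xs):
    """Index whose removal yields the largest drop in dissimilarity, or None."""
    base = dissimilarity(xs)
    best_i, best_drop = None, 0.0
    for i in range(len(xs)):
        rest = xs[:i] + xs[i + 1:]
        drop = base - dissimilarity(rest)
        if drop > best_drop:
            best_i, best_drop = i, drop
    return best_i

readings = [3, 4, 3, 5, 4, 90, 4, 3]
print(most_deviating(readings))  # -> 5 (the element 90)
```

Note that this naive version recomputes the dissimilarity once per candidate (quadratic overall); the linear bound claimed in the title requires a dissimilarity function that can be updated incrementally during a single scan.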

### Citations

10995 | Computers and Intractability: A Guide to the Theory of NP-Completeness - Garey, Johnson - 1979
Citation Context: ...ing for examining the elements that yields the optimal solution for every C and D. Indeed, there exist C and D for which the problem is NP-hard by a reduction from Maximum Independent Set in a graph (Garey & Johnson 1979). This motivates the definition of the sequential exception problem. Sequential Exception. Given: • a set of items I (and thus its power set P(I)); • a sequence S of n subsets I_1, I_2, ..., I_n...
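The "Sequential Exception" setting quoted above scans the sequence once and flags the element whose arrival most disturbs what has been seen so far. A minimal one-pass sketch, treating each element as a single numeric reading and using running variance as an assumed stand-in for the paper's dissimilarity function D:

```python
def sequential_exception(xs):
    """One linear pass: flag the index whose arrival causes the largest
    jump in running variance (an assumed stand-in dissimilarity)."""
    n = s = sq = 0.0
    prev_d = 0.0
    best_j, best_jump = None, 0.0
    for j, x in enumerate(xs):
        n, s, sq = n + 1, s + x, sq + x * x   # running count, sum, sum of squares
        d = sq / n - (s / n) ** 2 if n > 1 else 0.0  # running variance
        if d - prev_d > best_jump:
            best_jump, best_j = d - prev_d, j
        prev_d = d
    return best_j

print(sequential_exception([3, 4, 3, 5, 4, 90, 4, 3]))  # -> 5
```

Because only running sums are maintained, the scan is O(n); the dependence on scan order is exactly what distinguishes the sequential problem from the NP-hard unordered variant cited above.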

3382 | Induction of Decision Trees - Quinlan - 1986
Citation Context: ...all amount of noise (e.g. (Aha, Kibler, & Albert 1991)). Additionally, some work has been done to determine the impact of erroneous examples on the learning results: there is experimental work as in (Quinlan 1986) and quantitative theoretical work as in (Angluin & Laird 1988), extending the valuable work of (Valiant 1984). Rather than being considered outliers that need to be tolerated during the main-line pr...

2694 | Fast Algorithms for Mining Association Rules - Agrawal, Srikant - 1994
Citation Context: ...ami 1993) that presents a method for discovering frequent cooccurrences of attribute values in large data sets. But the goal of that algorithm is quite different, and the performance of this system ((Agrawal & Srikant 1994)) is a consequence of ignoring early the seldom occurring values -- thus the possible exceptions. Problem Description. We begin by giving a formal definition of the deviation detection problem. Exact ...

643 | Knowledge Acquisition via Incremental Conceptual Clustering - Fisher - 1987
Citation Context: ...o isolate small minorities. Moreover, these methods generally require the existence of a metrical distance function between data elements. Such is the case with even refined clustering algorithms as (Fisher 1987) (Hanson & Bauer 1989) (Rumelhart & Zipser 1985). Michalski (Michalski & Stepp 1983) replaces the usual Euclidian distance measure between data elements by enriching the measure with conceptual simil...

581 | Applied Multivariate Statistical Analysis - Johnson, Wichern - 1998
Citation Context: ...arning for a long time. Deviations have been often viewed as outliers, or errors, or noise in data. There has been work in Statistics on identifying outliers (e.g. (Hoaglin, Mosteller, & Tukey 1983) (Johnson 1992)). There has been work in extending learning algorithms to cope with a small amount of noise (e.g. (Aha, Kibler, & Albert 1991)). Additionally, some work has been done to determine the impact of erro...

282 | Database mining: A performance perspective - Agrawal, Imielinski, et al. - 1993

248 | Learning from Observation: Conceptual Clustering - Michalski, Stepp - 1983
Citation Context: ... existence of a metrical distance function between data elements. Such is the case with even refined clustering algorithms as (Fisher 1987) (Hanson & Bauer 1989) (Rumelhart & Zipser 1985). Michalski (Michalski & Stepp 1983) replaces the usual Euclidian distance measure between data elements by enriching the measure with conceptual similarity, but still requires a symmetrical distance measure and imposes restrictive con...

147 | Understanding Robust and Exploratory Data Analysis - Hoaglin, Mosteller, et al. - 1983

107 | Information and Kolmogorov Complexity - Grunwald, Vitanyi
Citation Context: ... degree to which a data element causes the "dissimilarity" of the data set to increase. This function does not have to fulfill the conditions to be metrical. The literature on Kolmogorov complexity (Li & Vitanyi 1991) is relevant to the problem of deviation detection. We are looking for the subset of data that leads to the greatest reduction in Kolmogorov complexity for the amount of data discarded. Our problem i...

89 | Stochastic Complexity in Statistical Inquiry. World Scientific - Rissanen - 1989
Citation Context: ...g for the subset of data that leads to the greatest reduction in Kolmogorov complexity for the amount of data discarded. Our problem is also related to the Minimum Description Length (MDL) principle (Rissanen 1989). The MDL principle states that the best model for encoding data is the one that minimizes the sum of the cost of describing the data in terms of the model and the cost of describing the model. Below...
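The MDL trade-off described in the excerpt above can be made concrete with a toy encoder: model the data as one typical value plus an explicit list of exceptions, and charge bits for the model, the residuals, and each exception. All bit costs below are invented for illustration and are not taken from Rissanen.

```python
import math

def cost_bits(value):
    """Invented cost of writing an integer: ~log2(|v| + 1) plus one flag bit."""
    return math.log2(abs(value) + 1) + 1

def description_length(data, exceptions):
    """Model = one typical value; exceptions are stored verbatim with an index."""
    typical = [x for i, x in enumerate(data) if i not in exceptions]
    model = round(sum(typical) / len(typical))        # cost of the model itself
    model_cost = cost_bits(model)
    residual_cost = sum(cost_bits(x - model)          # data described via model
                        for i, x in enumerate(data) if i not in exceptions)
    exception_cost = sum(cost_bits(data[i]) + math.log2(len(data))
                         for i in exceptions)         # value plus its position
    return model_cost + residual_cost + exception_cost

data = [4, 4, 3, 5, 4, 90, 4, 4, 5, 3]
print(description_length(data, set()) > description_length(data, {5}))  # -> True
```

Declaring 90 an exception shortens the total description even though the exception itself must be paid for, which is the MDL sense in which deviations are the elements whose removal most shortens the encoding of the rest.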

38 | Using the New DB2: IBM's Object-Relational Database System - Chamberlin - 1996
Citation Context: ...s play the leading part in this paper: the one and only purpose of our proposed method is to discover them. Usually, explicit information sources residing outside the data like integrity constraints (Chamberlin 1996) or predefined error/non-error patterns (Val 1995) are used for deviation detection. We approach the problem from the inside of the data, using the implicit redundancy in the data to detect deviation...

30 | Instance-based learning algorithms. Machine Learning 6 - Aha, Kibler, et al. - 1991

24 | Conceptual clustering, categorization, and polymorphy. Machine Learning 3(4):343–372 - Hanson, Bauer - 1989
Citation Context: ...l minorities. Moreover, these methods generally require the existence of a metrical distance function between data elements. Such is the case with even refined clustering algorithms as (Fisher 1987) (Hanson & Bauer 1989) (Rumelhart & Zipser 1985). Michalski (Michalski & Stepp 1983) replaces the usual Euclidian distance measure between data elements by enriching the measure with conceptual similarity, but still requi...

16 | Feature discovery by competitive learning. Cognitive Science 9:75–112 - Rumelhart, Zipser - 1985

11 | Learning from noisy examples. Machine Learning 2(4):343–370 - Angluin, Laird - 1988
Citation Context: .... Additionally, some work has been done to determine the impact of erroneous examples on the learning results: there is experimental work as in (Quinlan 1986) and quantitative theoretical work as in (Angluin & Laird 1988), extending the valuable work of (Valiant 1984). Rather than being considered outliers that need to be tolerated during the main-line processing, deviations play the leading part in this paper: the o...

5 | Feature discovery by competitive learning. Cognitive Science - Rumelhart, Zipser - 1985
Citation Context: ..., these methods generally require the existence of a metrical distance function between data elements. Such is the case with even refined clustering algorithms as (Fisher 1987) (Hanson & Bauer 1989) (Rumelhart & Zipser 1985). Michalski (Michalski & Stepp 1983) replaces the usual Euclidian distance measure between data elements by enriching the measure with conceptual similarity, but still requires a symmetrical distance...

1 | Fehlersuche in großen Datenmengen unter Verwendung der in den Daten vorhandenen Redundanz (Error detection in large data sets using the redundancy present in the data). PhD dissertation, Universität Osnabrück, Fachbereich Sprach- und Literaturwissenschaft - Arning - 1995
Citation Context: ...ted from the string pattern that covers all elements: M_S(I_j) := 1/3 × (c − w + 2), with c being the total number of characters and w being the number of needed wildcards. For details, see (Arning 1995). The auxiliary function M_S(I_j) computes a user-customizable maximum value for the elements of I_j in order to find those elements that particularly increase this maximum. In the definition above,...
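The string-pattern measure quoted above (c characters, w wildcards) can be illustrated by building the single wildcard pattern that covers all equal-length elements. The grouping of the garbled formula is reconstructed here as (c - w + 2)/3 and may differ from Arning (1995); c is taken as the length of the covering pattern, which the excerpt leaves ambiguous.

```python
def covering_pattern(strings):
    """Per position: keep the character if all elements agree, else a wildcard."""
    return "".join(chars[0] if len(set(chars)) == 1 else "?"
                   for chars in zip(*strings))

def m_s(strings):
    pattern = covering_pattern(strings)
    c = len(pattern)        # total number of characters (assumed: pattern length)
    w = pattern.count("?")  # number of needed wildcards
    return (c - w + 2) / 3  # reconstructed grouping of the garbled formula

print(covering_pattern(["cat", "car", "can"]))  # -> ca?
print(m_s(["cat", "car", "can"]))               # (3 - 1 + 2) / 3
```

An element that forces many extra wildcards into the covering pattern lowers this measure, which is how seldom-occurring string values surface as candidate exceptions.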

1 | Integrity Product Overview - Inc - 1995