Results 1 - 10 of 13
Outlier Detection Using Replicator Neural Networks
- In Proc. of the Fifth Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'02), 2002
"... We consider the problem of finding outliers in large multivariate databases. Outlier detection can be applied during the data cleansing process of data mining to identify problems with the data itself, and to fraud detection where groups of outliers are often of particular interest. ..."
Abstract
-
Cited by 51 (1 self)
We consider the problem of finding outliers in large multivariate databases. Outlier detection can be applied during the data cleansing process of data mining to identify problems with the data itself, and in fraud detection, where groups of outliers are often of particular interest.
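For illustration, a minimal sketch of the replicator idea named in the title: train a network to reproduce its own input and use the per-record reconstruction error as the outlier score. The use of scikit-learn's MLPRegressor, the layer sizes, and the scoring function below are assumptions made for this sketch, not the authors' exact architecture.

# Sketch: outlier scoring with a replicator (autoencoder-style) network.
# Layer sizes and the choice of MLPRegressor are illustrative, not the
# setup from the paper.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def replicator_outlier_scores(X, hidden=(32, 8, 32), seed=0):
    Xs = StandardScaler().fit_transform(X)
    net = MLPRegressor(hidden_layer_sizes=hidden, activation="tanh",
                       max_iter=2000, random_state=seed)
    net.fit(Xs, Xs)                              # target equals input
    recon = net.predict(Xs)
    return np.mean((Xs - recon) ** 2, axis=1)    # reconstruction error per record

# Records with the largest scores are the outlier candidates, e.g.:
# candidates = np.argsort(replicator_outlier_scores(X))[::-1][:10]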
A Comparative Study of RNN for Outlier Detection in Data Mining
- In ICDM, 2002
"... We have proposed replicator neural networks (RNNs) as an outlier detecting algorithm [15]. Here we compare RNN for outlier detection with three other methods using both publicly available statistical datasets (generally small) and data mining datasets (generally much larger and generally real data). ..."
Abstract
-
Cited by 22 (0 self)
We have proposed replicator neural networks (RNNs) as an outlier detection algorithm [15]. Here we compare RNNs for outlier detection with three other methods, using both publicly available statistical datasets (generally small) and data mining datasets (generally much larger and typically real-world data). The smaller datasets provide insights into the relative strengths and weaknesses of RNNs against the compared methods. The larger datasets particularly test scalability and practicality of application. This paper also develops a methodology for comparing outlier detectors and provides performance benchmarks against which new outlier detection methods can be assessed.
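As a rough illustration of how such a comparison can be scored on labelled benchmark data (the paper defines its own protocol), one common metric counts how many known outliers a detector places among its top-k ranked records:

# Generic evaluation helper: how many known outliers fall into the top-k
# records ranked by an outlier score. Not necessarily the paper's protocol.
import numpy as np

def outliers_in_top_k(scores, labels, k):
    # scores: numpy array, higher means more anomalous
    # labels: numpy array, 1 marks a known outlier
    top_k = np.argsort(scores)[::-1][:k]
    return int(np.sum(labels[top_k]))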
M-Kernel Merging: Towards Density Estimation over Data Streams
- In Proc. of DASFAA, 2003
"... Density estimation is a costly operation for computing distribution information of data sets underlying many important data mining applications, such as clustering and biased sampling. However, traditional density estimation methods are inapplicable for streaming data, which are continuous arriving ..."
Abstract
-
Cited by 13 (1 self)
Density estimation is a costly operation for computing distribution information of the data sets underlying many important data mining applications, such as clustering and biased sampling. However, traditional density estimation methods are inapplicable to streaming data, which arrive continuously and in large volumes, because those methods require linear storage and quadratic computation. This shortcoming limits the application of many existing, effective algorithms to data streams, for which mining is both an urgent need for applications and a challenge for research. In this paper, the problem of computing density functions over data streams is examined, and a novel method is developed that enables density estimation for large volumes of data in linear time and fixed-size memory, without loss of accuracy. The method is based on M-Kernel merging, which intelligently determines the limited set of kernel functions to be maintained. The application of the new method to different streaming data models is discussed, and the results of intensive experiments are presented. The analytical and empirical results show that this density estimation algorithm for data streams can compute density functions on demand at any time, with high accuracy, for different streaming data models.
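A simplified sketch of the merging idea behind such a summary, for one-dimensional data: keep at most m weighted kernels and, whenever the budget is exceeded, merge the two closest ones. The merge rule and the fixed bandwidth below are deliberate simplifications; the paper's actual M-Kernel bookkeeping is more elaborate.

# Simplified streaming density estimator with a fixed kernel budget.
# Kernels are (mean, weight) pairs; the Gaussian bandwidth h is fixed here,
# which simplifies the M-Kernel scheme described above.
import math

class StreamingKDE:
    def __init__(self, max_kernels=100, bandwidth=1.0):
        self.m = max_kernels
        self.h = bandwidth
        self.kernels = []                        # list of (mean, weight)

    def insert(self, x):
        self.kernels.append((float(x), 1.0))
        if len(self.kernels) > self.m:
            self._merge_closest_pair()

    def _merge_closest_pair(self):
        ks = sorted(self.kernels)                # 1-d: closest pair is adjacent
        i = min(range(len(ks) - 1), key=lambda j: ks[j + 1][0] - ks[j][0])
        (m1, w1), (m2, w2) = ks[i], ks[i + 1]
        ks[i:i + 2] = [((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)]
        self.kernels = ks

    def density(self, x):
        total = sum(w for _, w in self.kernels)
        norm = total * self.h * math.sqrt(2.0 * math.pi)
        return sum(w * math.exp(-0.5 * ((x - mu) / self.h) ** 2)
                   for mu, w in self.kernels) / norm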
Maintaining Nonparametric Estimators over Data Streams
- In Proc. of BTW, March 2-4, Karlsruhe, Germany, 2005
"... Abstract: An effective processing and analysis of data streams is of utmost impor-tance for a plethora of emerging applications like network monitoring, traffic manage-ment, and financial tickers. In addition to the management of transient and potentially unbounded streams, their analysis with advan ..."
Abstract
-
Cited by 5 (4 self)
An effective processing and analysis of data streams is of utmost importance for a plethora of emerging applications like network monitoring, traffic management, and financial tickers. In addition to the management of transient and potentially unbounded streams, their analysis with advanced data mining techniques has been identified as a research challenge. A well-established class of mining techniques is based on nonparametric statistics, where nonparametric density estimation in particular is among the essential building blocks. In this paper, we examine the maintenance of nonparametric estimators over data streams. We present a tailored framework that incrementally maintains a nonparametric estimator over a data stream while consuming only a fixed amount of memory. Our framework is memory-adaptive and therefore supports a fundamental requirement for an operator within a data stream management system. As an example, we apply our framework to the selectivity estimation of range queries, which is a popular use case for statistical estimators. After providing an analysis of the processing cost, we report the results of experimental comparisons on both synthetic and real-world data streams. Our results demonstrate the accuracy of the estimators derived from our framework.
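The range-query use case mentioned above reduces to integrating whatever density estimate the framework maintains over the query interval. A small sketch, under the assumption that the maintained estimator is available as a callable f_hat:

# Selectivity of a range query [low, high] from a 1-d density estimate,
# approximated by the trapezoidal rule. f_hat is any callable density;
# the step count is an illustrative choice.
def range_selectivity(f_hat, low, high, steps=200):
    width = (high - low) / steps
    xs = [low + i * width for i in range(steps + 1)]
    ys = [f_hat(x) for x in xs]
    return sum((ys[i] + ys[i + 1]) / 2.0 * width for i in range(steps))

# Multiplying the selectivity by the number of tuples seen so far gives the
# expected result size of the range query.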
Outlier detection based on the distribution of distances between data points
- Informatica
"... Abstract. A novel approach to outlier detection on the ground of the properties of distribution of distances between multidimensional points is presented. The basic idea is to evaluate the outlier factor for each data point. The factor is used to rank the dataset objects regarding their degree of be ..."
Abstract
-
Cited by 4 (0 self)
A novel approach to outlier detection, based on the properties of the distribution of distances between multidimensional points, is presented. The basic idea is to evaluate an outlier factor for each data point. The factor is used to rank the dataset objects according to their degree of being an outlier; selecting the points with the minimal factor values then identifies the outliers. The main advantages of the approach are: (1) no parameter choice is necessary for outlier detection; (2) detection does not depend on clustering algorithms. To demonstrate the quality of the outlier detection, experiments were performed on widely used datasets. A comparison with some popular detection methods shows the superiority of our approach. Key words: outlier detection, high-dimensional data, distribution of distances.
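The paper defines its own outlier factor from the distribution of pairwise distances; purely as a stand-in illustration of distance-based scoring, the snippet below computes the mean distance of each point to all others.

# Illustration only: a generic distance-based score (mean distance of each
# point to every other point). This is NOT the factor defined in the paper,
# which is derived from properties of the distance distribution itself.
import numpy as np

def mean_distance_scores(X):
    diff = X[:, None, :] - X[None, :, :]         # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distances
    return dist.sum(axis=1) / (len(X) - 1)       # average distance to the rest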
Wavelet Density Estimators over Data Streams
- In Proc. of SAC, 2005
"... Many scientific and commercial applications rely on an immediate processing of transient data streams. In addition to processing queries over streams, their continuative analysis has received more attention recently. Due to specific characteristics of data streams, common analysis techniques known f ..."
Abstract
-
Cited by 4 (3 self)
Many scientific and commercial applications rely on an immediate processing of transient data streams. In addition to processing queries over streams, their continuous analysis has recently received more attention. Due to the specific characteristics of data streams, common analysis techniques known from the area of data mining are not directly applicable to streams. One of the core operations in data analysis is density estimation, which is used for capturing an unknown distribution in various analysis tasks. Modern density estimators are based either on kernel functions or on wavelets, where the wavelet-based ones profit from their ability to identify discontinuities as well as local oscillations. In this paper, we present a new approach to computing wavelet density estimators over data streams. Our estimators allow continuous updates upon arrival of new data and provide accurate analytical results while consuming only a constant amount of memory. Moreover, our estimators are adaptive with respect to memory as well as CPU usage, i.e., we may change the memory size as well as the computing overhead at runtime. An experimental evaluation proves the feasibility of our approach and shows the superiority of wavelet density estimators over their kernel-based counterparts.
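A rough sketch of the wavelet idea on a bounded one-dimensional domain: histogram the data, take a Haar transform, keep only the k largest coefficients as the memory budget, and reconstruct. The bin count, the budget, and the batch (non-streaming) formulation are simplifying assumptions relative to the paper.

# Simplified (non-streaming) Haar wavelet density estimator: the number of
# bins must be a power of two, and only the `keep` largest coefficients are
# retained, which plays the role of the memory budget. Parameters are
# illustrative.
import numpy as np

def haar_forward(v):
    v, out, n = v.astype(float).copy(), np.empty(len(v)), len(v)
    while n > 1:
        half = n // 2
        avg = (v[0:n:2] + v[1:n:2]) / np.sqrt(2.0)
        dif = (v[0:n:2] - v[1:n:2]) / np.sqrt(2.0)
        v[:half], out[half:n] = avg, dif
        n = half
    out[0] = v[0]
    return out

def haar_inverse(c):
    c, n = c.astype(float).copy(), 1
    while n < len(c):
        avg, dif = c[:n].copy(), c[n:2 * n].copy()
        c[0:2 * n:2] = (avg + dif) / np.sqrt(2.0)
        c[1:2 * n:2] = (avg - dif) / np.sqrt(2.0)
        n *= 2
    return c

def wavelet_density(data, low, high, bins=256, keep=32):
    hist, edges = np.histogram(data, bins=bins, range=(low, high), density=True)
    coeffs = haar_forward(hist)
    coeffs[np.argsort(np.abs(coeffs))[:-keep]] = 0.0   # drop all but k largest
    return np.clip(haar_inverse(coeffs), 0.0, None), edges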
AUDIO: An integrity auditing framework of outlier-mining-as-a-service systems
- In ECML/PKDD, 2012
"... Abstract. Spurred by developments such as cloud computing, there has been considerable recent interest in the data-mining-as-a-service paradigm. Users lacking in expertise or computational resources can outsource their data and mining needs to a third-party service provider (server). Out-sourcing, h ..."
Abstract
-
Cited by 3 (2 self)
Spurred by developments such as cloud computing, there has been considerable recent interest in the data-mining-as-a-service paradigm. Users lacking expertise or computational resources can outsource their data and mining needs to a third-party service provider (server). Outsourcing, however, raises issues about result integrity: how can the data owner verify that the mining results returned by the server are correct? In this paper, we present AUDIO, an integrity auditing framework for the specific task of distance-based outlier mining outsourcing. It provides efficient and practical verification approaches to check both completeness and correctness of the mining results. The key idea of our approach is to insert a small number of artificial tuples into the outsourced data; the artificial tuples produce artificial outliers and non-outliers that do not exist in the original dataset. The server's answer is verified by analyzing the presence of the artificial outliers and non-outliers, obtaining a probabilistic guarantee of correctness and completeness of the mining result. Our empirical results show the effectiveness and efficiency of our method.
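A minimal sketch of the verification step described above, assuming the owner kept the identifiers of the planted tuples; the probabilistic analysis of how many tuples to plant is given in the paper.

# Verification against planted tuples: any missing artificial outlier proves
# incompleteness, any reported artificial non-outlier proves incorrectness.
# Identifiers are assumed to be hashable tuple ids chosen by the data owner.
def verify_result(returned_outliers, planted_outliers, planted_nonoutliers):
    missed = planted_outliers - set(returned_outliers)
    wrong = planted_nonoutliers & set(returned_outliers)
    return len(missed) == 0, len(wrong) == 0     # (complete?, correct?)

# Passing both checks yields only a probabilistic guarantee that grows with
# the number of planted tuples.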
Data Mining Techniques for Geospatial Applications
"... This paper, prepared for a committee of the Computer Science and Telecommunications Board, should not be cited as a National Research Council report ..."
Abstract
-
Cited by 2 (0 self)
This paper, prepared for a committee of the Computer Science and Telecommunications Board, should not be cited as a National Research Council report
Towards Kernel Density Estimation over Streaming Data
"... A variety of real-world applications heavily relies on the analysis of transient data streams. Due to the rigid processing requirements of data streams, common analysis techniques as known from data mining are not applicable. A fundamental building block of many data mining and analysis approaches i ..."
Abstract
- Add to MetaCart
(Show Context)
A variety of real-world applications rely heavily on the analysis of transient data streams. Due to the rigid processing requirements of data streams, common analysis techniques known from data mining are not directly applicable. A fundamental building block of many data mining and analysis approaches is density estimation. It provides a well-defined estimate of a continuous data distribution, a fact that makes its adaptation to data streams desirable. A convenient method for density estimation utilizes kernels; however, its computational complexity collides with the processing requirements of data streams. In this work, we present a new approach to this problem that combines linear processing cost with a constant amount of allocated memory. We even support dynamic memory adaptation to changing system resources. Our kernel density estimators over streaming data are related to M-Kernels, a previously proposed technique, but substantially improve on them in terms of accuracy as well as processing time. The results of an experimental study with synthetic and real-world data streams substantiate the efficiency and effectiveness of our approach as well as its superiority over M-Kernels with respect to estimation quality and processing time.
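One aspect mentioned above, dynamic memory adaptation, can be illustrated as repeatedly merging the closest pair of weighted kernels until the summary fits a new, smaller budget; the merge rule below is a generic simplification, not the authors' exact scheme.

# Shrink a 1-d kernel summary (list of (mean, weight) pairs) to a new budget
# by merging closest pairs; a generic illustration of memory adaptation.
def shrink_to_budget(kernels, budget):
    ks = sorted(kernels)                         # closest pair is adjacent in 1-d
    while len(ks) > budget and len(ks) > 1:
        i = min(range(len(ks) - 1), key=lambda j: ks[j + 1][0] - ks[j][0])
        (m1, w1), (m2, w2) = ks[i], ks[i + 1]
        ks[i:i + 2] = [((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)]
    return ks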
Density Estimation Over Data Stream
"... Density estimation is an important but costly operation for applications that need to know the distribution of a data set. Moreover, when the data comes as a stream, traditional density estimation methods cannot cope with it efficiently. In this paper, we examined the problem of computing density fu ..."
Abstract
- Add to MetaCart
Density estimation is an important but costly operation for applications that need to know the distribution of a data set. Moreover, when the data arrive as a stream, traditional density estimation methods cannot cope with them efficiently. In this paper, we examine the problem of computing density functions over data streams and develop a novel method to solve it. Our algorithm uses a new concept, the M-Kernel, and has the following characteristics: (1) the running time is linear in the data size; (2) the whole computation is kept within a limited amount of memory; (3) its accuracy is comparable to that of traditional methods; (4) a usable density model is available at any time during processing; (5) it is flexible and suits different stream models. Analytical and experimental results show the efficiency of the proposed algorithm.