Results 1–10 of 32
A vertical distance-based outlier detection method with local pruning
In CAINE, 2004
Abstract

Cited by 7 (0 self)
“One person’s noise is another person’s signal.” Outlier detection is used to clean up datasets and also to discover useful anomalies, such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, agricultural pest infestations, etc. Thus, outlier detection is critically important in the information-based society. This paper focuses on finding outliers in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng’s distance-based outlier definition; second, a vertical data structure, instead of traditional horizontal structures, is adopted to further facilitate efficient outlier detection. We tested our methods against the National Hockey League dataset and show an order-of-magnitude speed improvement over contemporary distance-based outlier detection approaches.
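For reference, Knorr and Ng's original distance-based outlier definition (which this paper revises) can be sketched as follows: a point is a DB(p, D)-outlier if at least a fraction p of the remaining points lie farther than distance D from it. The naive quadratic scan below is only an illustration of the definition, not the paper's vertical, pruned method:

```python
from math import dist

def db_outliers(points, p, D):
    """Knorr and Ng's DB(p, D)-outliers: a point is an outlier if at
    least fraction p of the other points lie farther than distance D
    from it. Illustrative O(n^2) scan, not the paper's vertical method."""
    n = len(points)
    outliers = []
    for i, x in enumerate(points):
        far = sum(1 for j, y in enumerate(points) if j != i and dist(x, y) > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers

# A tight cluster plus one distant point: only the distant point qualifies.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, D=3.0))  # -> [4]
```

The paper's contribution lies in avoiding exactly this O(n²) cost via a vertical data layout and local pruning.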
PINE: Podium Incremental Neighbor Evaluator for Classifying Spatial Data
In Symposium on Applied Computing (SAC’03), 2003
Abstract

Cited by 6 (2 self)
Given a set of training data, nearest neighbor classification predicts the class value for an unknown tuple X by searching the training set for the k nearest neighbors to X and then classifying X according to the most frequent class among the k neighbors. Each of the k nearest neighbors casts an equal vote for the class of X. In this paper, we propose a new algorithm, Podium Incremental Neighbor Evaluator (PINE), in which nearest neighbors are weighted for voting. A metric called HOBBit is used as the distance metric, and a data structure, the P-tree, is used for efficient implementation of the PINE algorithm on spatial data. Our experiments show that by using a Gaussian podium function, PINE outperforms the k-nearest neighbor (KNN) method in terms of classification accuracy for spatial data. In addition, in the PINE algorithm, all the instances are potential neighbors, so the value of k need not be pre-specified as in KNN methods. By assigning high weights to the nearest neighbors and low (even zero) weights to other neighbors, high classification accuracy can be achieved.
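The podium-weighted voting idea can be sketched in a few lines: every training instance votes for its class with a weight given by a Gaussian podium function of its distance, so no fixed k is required. This sketch substitutes plain Euclidean distance for the paper's HOBBit metric and omits the P-tree machinery entirely:

```python
import math
from collections import defaultdict

def pine_predict(train, x, sigma=1.0):
    """Podium-weighted voting: every training instance votes for its
    class, weighted by a Gaussian function of its distance to x, so
    no fixed k is needed. Euclidean distance stands in for HOBBit."""
    votes = defaultdict(float)
    for features, label in train:
        d2 = sum((a - b) ** 2 for a, b in zip(features, x))
        votes[label] += math.exp(-d2 / (2 * sigma ** 2))
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
print(pine_predict(train, (0.3, 0.2)))  # -> A
```

Setting the weight to zero beyond some radius recovers the "low (even zero) weights to other neighbors" behavior mentioned above.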
Arabic text categorization using kNN algorithm
Abstract

Cited by 6 (0 self)
Many algorithms have been applied to the problem of text categorization. Most of the work in this area has been carried out for English text; very little research has been carried out for Arabic text. In this project we have implemented the k-Nearest Neighbor (kNN) algorithm, which, along with Support Vector Machines (SVMs), is known to be one of the top-performing classifiers for English text. However, the nature of Arabic text is different from that of English text, and the preprocessing of Arabic text is different and somewhat more challenging. For the problem of keyword extraction and reduction, we have implemented a method to extract keywords based on the Document Frequency (DF) threshold. The results show that kNN is applicable to Arabic text; we reached micro-average precision and recall scores of 0.95, using a dataset of 621 Arabic text documents belonging to 6 different categories for training and testing.
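Document-frequency thresholding, the feature-reduction step described above, is simple to state: keep only terms appearing in at least a minimum number of documents. A minimal sketch, using naive whitespace tokenization rather than the paper's Arabic-specific preprocessing (stemming, stop-word removal):

```python
from collections import Counter

def df_select(docs, min_df):
    """Document Frequency (DF) thresholding: keep only terms that
    occur in at least min_df documents. Tokenization is naive
    whitespace splitting; Arabic preprocessing is not reproduced."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each doc at most once
    return {term for term, count in df.items() if count >= min_df}

docs = ["data mining methods", "text mining for arabic", "arabic text categorization"]
print(sorted(df_select(docs, min_df=2)))  # -> ['arabic', 'mining', 'text']
```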
Extensions of the k nearest neighbour methods for classification problems
In Proc. of the 26th IASTED International Conference on Artificial Intelligence and Applications (CD Proceedings, ISBN 9780889867109), 2008
Abstract

Cited by 3 (2 self)
The k Nearest Neighbour (kNN) method is a widely used technique which has found several applications in clustering and classification. In this paper, we focus on classification problems and propose modifications of the nearest neighbour method that exploit information from the structure of a dataset. The results of our experiments using datasets from the UCI repository demonstrate that the resulting classifiers generally perform better than classic kNN and are more reliable, without being significantly slower.
Efficient OLAP operations for spatial data using Peano trees
In DMKD ’03, 2003
Abstract

Cited by 2 (0 self)
Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, and digital sky survey data, efficient OLAP for spatial data is in great demand. In this paper, we build a new data warehouse structure, the PDcube. With the PDcube, OLAP operations and queries can be implemented efficiently. All of this is accomplished through the fast logical operations of Peano trees (P-trees). One of the P-tree variations, the Predicate P-tree, is used to reduce data accesses by filtering out “bit holes” consisting of consecutive 0’s. Experiments show that OLAP operations can be executed much faster than with traditional OLAP methods.
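The core of the "fast logical operations" mentioned above is that a predicate over the tuples becomes a bit vector (bit i set when tuple i satisfies the predicate), so predicates combine with a single bitwise AND and qualifying tuples are counted by counting set bits. A toy sketch with Python integers standing in for compressed Peano trees:

```python
def combine(pred_a, pred_b):
    """Conjoin two predicate bit vectors with one bitwise AND; a
    Python int stands in for a compressed Predicate P-tree, so long
    runs of 0s ("bit holes") are not actually skipped here."""
    return pred_a & pred_b

# bit i set <=> tuple i satisfies the predicate (hypothetical predicates)
in_region    = 0b10110101
above_thresh = 0b11010001
both = combine(in_region, above_thresh)
print(bin(both))             # -> 0b10010001
print(bin(both).count("1"))  # -> 3 tuples satisfy both predicates
```

In the actual P-tree representation the AND and the count operate on the compressed tree, which is where the skipping of zero runs pays off.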
Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets
In CATA 2005, 2005
Abstract

Cited by 2 (2 self)
In this paper, we introduce the vertical set square distance (VSSD) technique, which is designed to efficiently and scalably measure the total variation of a set about a fixed point in large datasets. The set can be any projected subspace of any vector space, including oblique subspaces (not just dimensional subspaces). VSSD can determine the closeness of a point to a set of points in a dataset, which can be very useful for classification, clustering, and outlier detection tasks. The technique employs a vertical data structure called the Predicate-tree (P-tree). Performance evaluations based on both synthetic and real-world datasets show that VSSD is fast and accurate and scales well to very large datasets, compared to similar techniques utilizing horizontal record-based data structures.
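The reason total variation about an arbitrary point can be computed cheaply is the standard expansion Σ‖x − a‖² = Σ‖x‖² − 2a·Σx + n‖a‖²: the data-dependent sums Σx and Σ‖x‖² are computed once, after which each query point costs only O(d). A sketch of that decomposition (the paper obtains these sums via P-tree operations; plain Python sums stand in here):

```python
def precompute(X):
    """One pass over the data; afterwards the total variation about
    any point a is O(d) per query."""
    d = len(X[0])
    sum_x = [sum(x[j] for x in X) for j in range(d)]   # vector sum
    sum_sq = sum(v * v for x in X for v in x)          # sum of squared norms
    return len(X), sum_x, sum_sq

def total_variation(pre, a):
    """Sum of ||x - a||^2 = sum||x||^2 - 2 a.sum(x) + n ||a||^2."""
    n, sum_x, sum_sq = pre
    return (sum_sq
            - 2 * sum(ai * si for ai, si in zip(a, sum_x))
            + n * sum(ai * ai for ai in a))

X = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
pre = precompute(X)
print(total_variation(pre, (0.0, 0.0)))  # -> 91.0 (= 1+4+9+16+25+36)
```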
Vertical Set Squared Distance Based Clustering without Prior Knowledge of K
In Proceedings of the 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE ’05), 2005
Abstract

Cited by 2 (1 self)
Clustering is the automated identification of groups of objects based on similarity. Two major research issues in clustering are scalability and the requirement of domain knowledge to determine input parameters. Most approaches suggest the use of sampling to address the issue of scalability. However, sampling does not guarantee the best solution and can cause a significant loss in accuracy. Most approaches also require domain knowledge, trial-and-error techniques, or exhaustive searching to determine the required input parameters. In this paper we introduce a new clustering technique based on the set squared distance. Cluster membership is determined by the set squared distance to the respective cluster. As with the mean for k-means and the median for k-medoids, the cluster is represented by the entire cluster of points for each evaluation of membership. The set squared distance for all n items can be computed efficiently in O(n) using a vertical data structure and a few precomputed values. A special ordering of the set squared distance is used to break the data into the “natural” clusters, in contrast to the known k required for k-means or k-medoids partition clustering. Superior results are observed when the new clustering technique is compared with classical k-means clustering. To demonstrate cluster quality and the resolution of the unknown k, datasets with known classes, such as the iris data, the uci_kdd network intrusion data, and synthetic data, are used. The scalability of the proposed technique is demonstrated using a large RSI dataset. Keywords: Vertical Set Square Distance, P-trees, Clustering.
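The membership rule described above, with the entire cluster rather than a single representative point standing for itself, can be sketched as follows. This toy version computes the mean squared distance directly; the paper's O(n) evaluation via precomputed vertical sums, and its ordering-based discovery of the natural clusters, are not reproduced:

```python
def set_sq_dist(cluster, a):
    """Mean squared distance from point a to every point of a cluster:
    the whole cluster, not just a centroid, represents itself."""
    return sum(sum((xj - aj) ** 2 for xj, aj in zip(x, a))
               for x in cluster) / len(cluster)

def assign(clusters, a):
    """Assign a to the cluster with the smallest set squared distance."""
    return min(range(len(clusters)), key=lambda i: set_sq_dist(clusters[i], a))

clusters = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0), (10.0, 10.0)]]
print(assign(clusters, (2.0, 2.0)))  # -> 0
print(assign(clusters, (8.0, 8.0)))  # -> 1
```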
Lazy Classifiers Using P-trees
, 2002
Abstract

Cited by 2 (0 self)
Lazy classifiers store all of the training samples and do not build a classifier until a new sample needs to be classified. They differ from eager classifiers, such as decision tree induction, which build a general model (such as a decision tree) before receiving new samples. K-nearest neighbor (KNN) classification is a typical lazy classifier. Given a set of training data, a k-nearest neighbor classifier predicts the class value for an unknown tuple X by searching the training set for the k nearest neighbors to X and then assigning to X the most common class among its k nearest neighbors. Lazy classifiers are faster than eager classifiers at training time but slower at prediction time, since all computation is delayed to that time. In this paper, we introduce approaches to the efficient construction of lazy classifiers using a data structure, the Peano Count Tree (P-tree). The P-tree is a lossless and compressed representation of the original data that records count information to facilitate efficient data mining. With the P-tree structure, we introduce two classifiers: the P-tree-based k-nearest neighbor classifier (PKNN) and the Podium Incremental Neighbor Evaluator (PINE). Performance analysis shows that our algorithms outperform classical KNN methods.
Discernibility Concept in Classification Problems
Abstract

Cited by 1 (1 self)
The main idea behind this project is that the pattern classification process can be enhanced by taking into account the geometry of class structure in datasets of interest. In contrast to previous work in the literature, this research not only develops a measure of discernibility of individual patterns but also consistently applies it to various stages of the classification process. Applications of the discernibility concept cover a wide range of issues, from preprocessing to the actual classification and beyond. Specifically, we apply it for: (a) finding feature subsets of similar classification quality (applicable in diverse ensembles), (b) feature selection, (c) data reduction, (d) the reject option, and (e) enhancing the kNN classifier. Also, a number of auxiliary algorithms and measures are developed to facilitate the proposed methodology. Experiments have been carried out using datasets from the University of California at Irvine (UCI) repository. The experiments provide numerical evidence that the proposed approach does improve the performance of various classifiers. This, together with its simplicity, renders it a novel,