Efficient and Effective Clustering Methods for Spatial Data Mining
, 1994
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.
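The abstract only states that CLARANS is based on randomized search. The sketch below illustrates one common description of that idea: a k-medoids search that repeatedly tries a random swap of one medoid for one non-medoid, moves when the swap lowers the total cost, and restarts from a fresh random state several times. The function names, the 1-D absolute distance, and the parameters num_local and max_neighbor are assumptions made for illustration, not details taken from the abstract.

```python
import random

def total_cost(points, medoids):
    # Sum of distances from each point to its nearest medoid
    # (1-D absolute distance, chosen only to keep the sketch short).
    return sum(min(abs(p - points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local=5, max_neighbor=50, seed=0):
    # num_local: number of random restarts (assumed parameter name).
    # max_neighbor: failed random swaps tolerated before declaring
    # the current medoid set a local optimum (assumed parameter name).
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        current = rng.sample(range(n), k)
        cost = total_cost(points, current)
        tries = 0
        while tries < max_neighbor:
            # A random neighbor differs from the current set in one medoid.
            i = rng.randrange(k)
            candidate = rng.choice([j for j in range(n) if j not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < cost:
                current, cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if cost < best_cost:
            best, best_cost = sorted(current), cost
    return best, best_cost

points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
medoids, cost = clarans(points, k=2)
```

On this toy input the two clusters are well separated, so the search is expected to settle on one medoid per cluster; larger inputs would need a real distance function and tuned parameters.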
Efficient similarity search in sequence databases
, 1994
We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the first few Fourier coefficients, we use R*-trees to index the sequences and efficiently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
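The Parseval argument can be checked numerically. The sketch below uses a DFT normalized by 1/sqrt(n), under which the Euclidean distance between two sequences equals the distance between their transforms; keeping only the first k coefficients drops non-negative terms from that sum, so the truncated distance can never exceed the true one. The helper names dft, dist, and features are illustrative assumptions, and the naive O(n^2) transform stands in for a real FFT.

```python
import cmath
import math

def dft(x):
    # Naive DFT with 1/sqrt(n) normalization, so distances are preserved.
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def dist(a, b):
    # Euclidean distance; works for real and complex sequences.
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def features(x, k):
    # Keep only the first k Fourier coefficients as the index key.
    return dft(x)[:k]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 4, 4, 6, 6, 8, 9]
full = dist(x, y)                               # time-domain distance
same = dist(dft(x), dft(y))                     # equal up to rounding
lower = dist(features(x, 2), features(y, 2))    # never exceeds full
```

Because the truncated distance is a lower bound, a range query answered in the reduced space may return extra candidates to verify but should not miss a true match.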
Scalable Algorithms for Association Mining
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2000
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. In this paper we present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented, which quickly identify all the long frequent itemsets, and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining ...
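The abstract does not say which database layouts are compared. As an illustration of why layout matters, the sketch below assumes a vertical layout, where each item maps to the set of transaction ids containing it, so the support of an itemset is the size of an intersection. The function names and the toy transactions are hypothetical.

```python
from functools import reduce

def vertical_layout(transactions):
    # Map each item to the set of transaction ids (tids) that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support(itemset, tidsets):
    # Support of a non-empty itemset = number of tids shared by all its items.
    sets = [tidsets.get(item, set()) for item in itemset]
    return len(reduce(set.intersection, sets))

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
tidsets = vertical_layout(transactions)
count_ac = support({"a", "c"}, tidsets)
```

With this layout, extending an itemset by one item needs only one more intersection instead of another pass over the transactions.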
Outlier detection for high dimensional data
, 2001
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
Efficient data mining for path traversal patterns
 IEEE Transactions on Knowledge and Data Engineering
, 1998
In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling, and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted. Index Terms: Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.
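The first step, converting a log sequence into maximal forward references, can be sketched as follows. This is one plausible reading of the description above rather than the authors' exact procedure: the current path grows on forward moves, a revisit of a page already on the path counts as a backward move, and the path is emitted only when a backward move directly follows a forward one. The function name and the sample session are illustrative.

```python
def maximal_forward_references(session):
    # session: ordered page identifiers visited by one user.
    path, result, moving_forward = [], [], False
    for page in session:
        if page in path:
            # Backward move: emit the path if we had just been moving forward,
            # then cut the path back to the revisited page.
            if moving_forward:
                result.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            # Forward move: extend the current path.
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        result.append(list(path))
    return result

refs = maximal_forward_references("ABCDCBEGHGWAOUOV")
readable = ["".join(p) for p in refs]
```

The second step would then count how often each consecutive subsequence occurs across all emitted references, which is where the hashing and pruning techniques apply.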
E-Commerce Recommendation Applications
, 2001
Recommender systems are being used by an ever-increasing number of E-commerce sites to help consumers find products to purchase. What started as a novelty has turned into a serious business tool. Recommender systems use product knowledge, either hand-coded knowledge provided by experts or "mined" knowledge learned from the behavior of consumers, to guide consumers through the often-overwhelming task of locating products they will like. In this article we present an explanation of how recommender systems are related to some traditional database analysis techniques. We examine how recommender systems help E-commerce sites increase sales and analyze the recommender systems at six market-leading sites. Based on these examples, we create a taxonomy of recommender systems, including the inputs required from the consumers, the additional knowledge required from the database, the ways the recommendations are presented to consumers, the technologies used to create the recommendations, and t...
Private queries in location based services: anonymizers are not necessary
 In SIGMOD
, 2008
Mobile devices equipped with positioning capabilities (e.g., GPS) can ask location-dependent queries to Location Based Services (LBS). To protect privacy, the user location must not be disclosed. Existing solutions utilize a trusted anonymizer between the users and the LBS. This approach has several drawbacks: (i) All users must trust the third party anonymizer, which is a single point of attack. (ii) A large number of cooperating, trustworthy users is needed. (iii) Privacy is guaranteed only for a single snapshot of user locations; users are not protected against correlation attacks (e.g., history of user movement). We propose a novel framework to support private location-dependent queries, based on the theoretical work on Private Information Retrieval (PIR). Our framework does not require a trusted third party, since privacy is achieved via cryptographic techniques. Compared to existing work, our approach achieves stronger privacy for snapshots of user locations; moreover, it is the first to provide provable privacy guarantees against correlation attacks. We use our framework to implement approximate and exact algorithms for nearest-neighbor search. We optimize query execution by employing data mining techniques, which identify redundant computations. Contrary to common belief, the experimental results suggest that PIR approaches incur reasonable overhead and are applicable in practice.
Finding Intensional Knowledge of Distance-Based Outliers
 In VLDB
, 1999
Existing studies on outliers focus only on the identification aspect; none provides any intensional knowledge of the outliers, by which we mean a description or an explanation of why an identified outlier is exceptional. For many applications, a description or explanation is at least as vital to the user as the identification aspect. Specifically, intensional knowledge helps the user to: (i) evaluate the validity of the identified outliers, and (ii) improve one's understanding of the data. The two main issues addressed in this paper are: what kinds of intensional knowledge to provide, and how to optimize the computation of such knowledge. With respect to the first issue, we propose finding strongest and weak outliers and their corresponding structural intensional knowledge. With respect to the second issue, we first present a naive and a semi-naive algorithm. Then, by means of what we call path and semilattice sharing of I/O processing, we develop two optimized approaches. We provi...
HARMONY: Efficiently mining the best rules for classification
 In Proc. of SDM
, 2005
Many studies have shown that rule-based classification algorithms perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the classification rules in a coarse-grained manner. They usually use heuristic methods to prune the search space, and select the rules based on the sequential database covering paradigm. Thus, the so-mined rules may not be the globally best rules for some instances in the training database. To make matters worse, these algorithms fail to fully exploit some more effective search space pruning methods in order to scale to large databases. In this paper we propose a new classifier, HARMONY, which directly mines the final set of classification rules. HARMONY uses an instance-centric rule-generation approach in the sense that it can assure that, for each training instance, one of the highest-confidence rules covering this instance is included in the result set, which helps a lot in achieving high classification accuracy. By introducing several novel search strategies and pruning methods into the traditional frequent itemset mining framework, HARMONY also has high efficiency and good scalability. Our thorough performance study with some large text and categorical databases has shown that HARMONY outperforms many well-known classifiers in terms of both accuracy and efficiency, and scales well w.r.t. the database size.
The discrete basis problem
, 2005
We consider the Discrete Basis Problem, which can be described as follows: given a collection of Boolean vectors, find a collection of k Boolean basis vectors such that the original vectors can be represented using disjunctions of these basis vectors. We show that the decision version of this problem is NP-complete and that the optimization version cannot be approximated within any finite ratio. We also study two variations of this problem, where the Boolean basis vectors must be mutually orthogonal. We show that the other variation is closely related with the well-known Metric k-median Problem in Boolean space. To solve these problems, two algorithms will be presented. One is designed for the variations mentioned above, and it is solely based on solving the k-median problem, while another is a heuristic intended to solve the general Discrete Basis Problem. We will also study the results of extensive experiments made with these two algorithms with both synthetic and real-world data. The results are twofold: with the synthetic data, the algorithms did rather well, but with the real-world data the results were not as good.
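To make the problem statement concrete, the sketch below shows what "represented using disjunctions of basis vectors" means for a small example: each data vector is approximated by the element-wise OR of the basis vectors it selects, and the quality of a candidate basis is the number of mismatched bits. The function names, the 0/1 list encoding, and the usage matrix are assumptions for illustration; the sketch evaluates a given basis and does not search for one.

```python
def reconstruct(basis, usage):
    # Element-wise OR of the basis vectors whose usage flag is set.
    dim = len(basis[0])
    return [
        int(any(basis[j][d] for j in range(len(basis)) if usage[j]))
        for d in range(dim)
    ]

def reconstruction_error(data, basis, usages):
    # Number of bit positions where data and its reconstruction disagree.
    return sum(
        a != b
        for row, usage in zip(data, usages)
        for a, b in zip(row, reconstruct(basis, usage))
    )

basis = [[1, 1, 0, 0], [0, 0, 1, 1]]
data = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0]]
usages = [[1, 0], [1, 1], [0, 1]]   # which basis vectors each row selects
error = reconstruction_error(data, basis, usages)
```

Here the first two rows are reproduced exactly, while the third row differs from the second basis vector in a single bit.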