Results 1–10 of 29
A Fast Fixed-Point Algorithm for Independent Component Analysis of Complex Valued Signals
, 2000
Cited by 85 (1 self)
Separation of complex valued signals is a frequently arising problem in signal processing. For example, separation of convolutively mixed source signals involves computations on complex valued signals. In this article it is assumed that the original, complex valued source signals are mutually statistically independent, and the problem is solved by the independent component analysis (ICA) model. ICA is a statistical method for transforming an observed multidimensional random vector into components that are mutually as independent as possible. In this article, a fast fixed-point type algorithm that is capable of separating complex valued, linearly mixed source signals is presented and its computational efficiency is shown by simulations. Also, the local consistency of the estimator given by the algorithm is proved.
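The fixed-point update at the heart of such an algorithm can be sketched for a single component. The whitening step, the nonlinearity g(y) = 1/(a + y), and all names below are illustrative assumptions for a one-unit iteration of this general kind, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def whiten(x):
    """Zero-mean and whiten complex data x of shape (n_mixtures, n_samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = (x @ x.conj().T) / x.shape[1]
    d, E = np.linalg.eigh(cov)          # Hermitian eigendecomposition
    return (E @ np.diag(d ** -0.5) @ E.conj().T) @ x

def complex_fastica_one_unit(z, n_iter=100, a=0.1):
    """One-unit fixed-point iteration on whitened complex data z (a sketch).

    Update: w <- E{x (w^H x)* g(|w^H x|^2)}
                 - E{g(|w^H x|^2) + |w^H x|^2 g'(|w^H x|^2)} w,
    followed by renormalization; g(y) = 1/(a + y) is an illustrative choice.
    """
    n = z.shape[0]
    w = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w.conj() @ z                      # projections, shape (n_samples,)
        ay = np.abs(y) ** 2
        g, gp = 1.0 / (a + ay), -1.0 / (a + ay) ** 2
        w = (z * (y.conj() * g)).mean(axis=1) - (g + ay * gp).mean() * w
        w /= np.linalg.norm(w)
    return w
```

On whitened mixtures of non-Gaussian complex sources (e.g. QPSK symbols), `w.conj() @ z` recovers one source up to a phase factor.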
Exploiting Hierarchical Domain Structure to Compute Similarity
 ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2003
Electricity-based external similarity of categorical attributes
 In PAKDD 2003
, 2003
Cited by 31 (10 self)
Abstract. Similarity or distance measures are fundamental and critical properties for data mining tools. Categorical attributes abound in databases. The Car Make, Gender, Occupation, etc. fields in an automobile insurance database are very informative. Sadly, categorical data is not easily amenable to similarity computations. A domain expert might manually specify some or all of the similarity relationships, but this is error-prone and not feasible for attributes with large domains, nor is it useful for cross-attribute similarities, such as between Gender and Occupation. External similarity functions define a similarity between, say, Car Makes by looking at how they co-occur with the other categorical attributes. We exploit a rich duality between random walks on graphs and electrical circuits to develop REP, an external similarity function. REP is theoretically grounded, while the only prior work was ad hoc. The usefulness of REP is shown in two experiments. First, we cluster categorical attribute values, showing improved inferred relationships. Second, we use REP effectively as a nearest-neighbour classifier.
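The random-walk/electrical-circuit duality the abstract invokes can be illustrated with the standard effective-resistance computation on a weighted graph (a textbook construction, not REP itself): treating edge weights as conductances, the effective resistance between two nodes falls out of the Moore-Penrose pseudoinverse of the graph Laplacian.

```python
import numpy as np

def effective_resistance(W):
    """Effective resistance between all node pairs of a weighted graph.

    W is a symmetric nonnegative matrix of edge weights (conductances).
    With Lp the pseudoinverse of the Laplacian L = D - W,
    R[i, j] = Lp[i, i] + Lp[j, j] - 2 * Lp[i, j].
    """
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp
```

For a path graph 0-1-2 with unit conductances, the resistance between the endpoints is 2 (two unit resistors in series). Categorical values that co-occur with many shared neighbours end up electrically close, which is the intuition behind an external similarity function.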
The IGrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space
 In Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining
, 2000
Cited by 30 (5 self)
The similarity search and indexing problem is well known to be a difficult one for high dimensional applications. Most indexing structures show a rapid degradation with increasing dimensionality, which leads to an access of the entire database for each query. Furthermore, recent research results show that in high dimensional space, even the concept of similarity may not be very meaningful. In this paper, we propose the IGrid index, a method for similarity indexing which uses a distance function whose meaningfulness is retained with increasing dimensionality. In addition, this technique shows performance which is unique among all known index structures: the percentage of data accessed is inversely proportional to the overall data dimensionality. Thus, this technique relies on the dimensionality being high in order to provide performance-efficient similarity results. The IGrid index can also support a special kind of query which we refer to as projected range queries, a query which is increasingly relevant for very high dimensional data mining applications.
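The flavour of such a dimensionality-robust measure can be sketched: only dimensions in which two points fall into the same equi-depth grid range contribute, each in proportion to their closeness within that range. This is a loose reconstruction for illustration, not the paper's exact definition; the bucket layout and exponent are assumptions.

```python
import numpy as np

def grid_proximity(x, y, bin_edges, p=2):
    """Proximity between points x and y under per-dimension grids (sketch).

    bin_edges[i] holds dimension i's boundary values (length k+1 for k
    ranges). A dimension contributes only when x and y share a range, and
    then contributes (1 - |x_i - y_i| / range_width) ** p.
    """
    total = 0.0
    for xi, yi, edges in zip(x, y, bin_edges):
        bx = np.clip(np.searchsorted(edges, xi, side="right") - 1, 0, len(edges) - 2)
        by = np.clip(np.searchsorted(edges, yi, side="right") - 1, 0, len(edges) - 2)
        if bx == by:
            width = edges[bx + 1] - edges[bx]
            total += (1.0 - abs(xi - yi) / width) ** p
    return total ** (1.0 / p)
```

Identical points score `len(x) ** (1/p)`, and as dimensionality grows the expected number of shared ranges grows with it, which is one way a proximity measure can stay meaningful in high dimensions.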
SimFusion: measuring similarity using unified relationship matrix
 In SIGIR
, 2005
Cited by 24 (2 self)
In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and thus can improve the quality of information applications that require combining information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
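The iterative reinforcement described above can be sketched generically: with a row-normalized relationship matrix L over all objects, similarities propagate by S ← L S Lᵀ with the diagonal pinned to 1. The matrix construction and normalization below are illustrative assumptions about the URM, not the paper's exact update.

```python
import numpy as np

def simfusion_like(adj, n_iter=20):
    """Iterative similarity over a unified relationship matrix (sketch).

    adj: nonnegative relation weights over all objects (web pages, queries,
    ...) stacked into one square matrix. Rows are normalized to sum to 1,
    then similarities are repeatedly reinforced through the relationships.
    """
    L = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
    S = np.eye(adj.shape[0])
    for _ in range(n_iter):
        S = L @ S @ L.T
        np.fill_diagonal(S, 1.0)  # every object is maximally similar to itself
    return S
```

The effect is that two queries which relate to the same page gain similarity even if they share no direct relationship, which is how iterating over the URM counters data sparseness.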
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
 In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga
, 2002
Cited by 23 (2 self)
Data mining has in recent years emerged as an interesting area in the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns occurring in the data. We discuss briefly some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
Coupled Clustering: A Method for Detecting Structural Correspondence
 Journal of Machine Learning Research
, 2002
Cited by 20 (3 self)
This paper proposes a new paradigm and a computational framework for revealing equivalencies (analogies) between substructures of distinct composite systems that are initially represented by unstructured data sets. For this purpose, we introduce and investigate a variant of traditional data clustering, termed coupled clustering, which outputs a configuration of corresponding subsets of two such representative sets. We apply our method to synthetic as well as textual data. Its achievements in detecting topical correspondences between textual corpora are evaluated through comparison to performance of human experts.
Context-Based Similarity Measures for Categorical Databases
 In PKDD
Cited by 19 (1 self)
Similarity between complex data objects is one of the central notions in data mining. We propose certain similarity (or distance) measures between various components of a 0/1 relation. We define measures between attributes, between rows, and between subrelations of the database. They find important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals more subtle relationships between components, something that is usually missing in simpler measures. Our problem of finding distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring less than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient, and produces results with intuitive appeal.
Indirect Association: Mining Higher Order Dependencies in Data
 IN PRINCIPLES OF DATA MINING AND KNOWLEDGE DISCOVERY
, 2000
Cited by 19 (1 self)
This paper introduces a novel pattern called indirect association and examines its utility in various application domains. Existing algorithms for mining associations, such as Apriori, will only discover itemsets that have support above a userdefined threshold. Any itemsets with support below the minimum support requirement are filtered out. We believe
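The pattern can be made concrete: items a and b form an indirect association through a mediator m when a and b rarely co-occur but each co-occurs frequently with m. The thresholds and the restriction to single-item mediators below are simplifying assumptions for illustration, not the paper's exact definition.

```python
from itertools import combinations

def indirect_associations(transactions, t_pair=0.1, t_med=0.4):
    """Find (a, b, mediator) triples in a list of transaction sets.

    sup({a, b}) must be below t_pair (the pair is infrequent together),
    while sup({a, m}) and sup({b, m}) are both at least t_med. The
    threshold values and single-item mediators are illustrative choices.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    results = []
    for a, b in combinations(items, 2):
        if sup({a, b}) < t_pair:
            for m in items:
                if m not in (a, b) and sup({a, m}) >= t_med and sup({b, m}) >= t_med:
                    results.append((a, b, m))
    return results
```

Note that a support-threshold miner like Apriori would discard the pair {a, b} outright, which is exactly the information an indirect association recovers.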
Efficient Progressive Sampling for Association Rules
, 2002
Cited by 18 (5 self)
In data mining, sampling has often been suggested as an effective tool to reduce the size of the dataset operated upon, at some cost to accuracy. However, this loss of accuracy is often difficult to measure and characterize, since the exact nature of the learning curve (accuracy vs. sample size) is parameter and data dependent, i.e., we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes that seems to work extremely well in practice and is able to converge to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.
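The general shape of a progressive sampler can be sketched: grow the sample geometrically and stop once the frequent itemsets found on consecutive samples agree closely. The frequent-singleton miner and the Jaccard-style agreement test below are simplifying assumptions, not the paper's equivalence-class refinement.

```python
import random

def frequent_items(sample, minsup):
    """Frequent single items in a list of transaction sets (stand-in miner)."""
    counts = {}
    for t in sample:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c / len(sample) >= minsup}

def progressive_sample(transactions, minsup=0.2, start=100, factor=2, agree=0.95):
    """Double the sample until consecutive frequent-itemset sets agree."""
    rng = random.Random(0)
    size, prev = start, None
    while size < len(transactions):
        sample = rng.sample(transactions, size)
        cur = frequent_items(sample, minsup)
        # stop when the Jaccard overlap of consecutive results is high enough
        if prev is not None and prev and len(prev & cur) / len(prev | cur) >= agree:
            return size, cur
        prev, size = cur, size * factor
    return len(transactions), frequent_items(transactions, minsup)
```

On data where the frequent set stabilizes early, the loop returns a sample far smaller than the full dataset, which is the point of progressive sampling.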