Results 1 - 10 of 17
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
, 2003
Cited by 84 (4 self)
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
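The nested-loop idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the outlier score (distance to the k-th nearest neighbour), the function name top_outliers, and its parameters are hypothetical choices made for the example.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n_outliers=1, seed=0):
    """Return (score, index) pairs for the n_outliers points whose distance
    to their k-th nearest neighbour is largest (assumed outlier score)."""
    # Scan candidate neighbours in random order so that close neighbours
    # tend to be found early and most points can be pruned quickly.
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)

    cutoff = 0.0  # score of the weakest outlier currently kept
    top = []      # min-heap of (score, index)

    for i in range(len(points)):
        nearest = []  # negated distances: max-heap of the k smallest so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the running k-th neighbour distance can only
            # shrink, so once it is below the cutoff the point is dropped.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

The worst case is still quadratic. The saving comes from the inner loop ending early for non-outliers, which is the effect the abstract attributes to random ordering plus pruning.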
A Survey of Kernels for Structured Data
Cited by 84 (3 self)
Kernel methods in general and support vector machines in particular have been successful in various learning tasks on data represented in a single table. Much 'real-world' data, however, is structured: it has no natural representation in a single table. Usually, to apply kernel methods to 'real-world' data, extensive pre-processing is performed to embed the data into a real vector space and thus in a single table. This survey describes several approaches of defining positive definite kernels on structured instances directly.
Logical Hidden Markov Models
- Journal of Artificial Intelligence Research
, 2006
Cited by 33 (10 self)
Logical hidden Markov models (LOHMMs) upgrade traditional hidden Markov models to deal with sequences of structured symbols in the form of logical atoms, rather than flat characters. This note formally introduces LOHMMs and presents solutions to the three central inference problems for LOHMMs: evaluation, most likely hidden state sequence and parameter estimation. The resulting representation and algorithms are experimentally evaluated on problems from the domain of bioinformatics.
Fisher kernels for logical sequences
- In Proc. of 15th European Conference on Machine Learning (ECML-04)
, 2004
Cited by 11 (5 self)
Abstract. One approach to improve the accuracy of classifications based on generative models is to combine them with successful discriminative algorithms. Fisher kernels were developed to combine generative models with a currently very popular class of learning algorithms, kernel methods. Empirically, the combination of hidden Markov models with support vector machines has shown promising results. So far, however, Fisher kernels have only been considered for sequences over flat alphabets. This is mostly due to the lack of a method for computing the gradient of a generative model over structured sequences. In this paper, we show how to compute the gradient of logical hidden Markov models, which allow for the modelling of logical sequences, i.e., sequences over an alphabet of logical atoms. Experiments show a considerable improvement over results achieved without Fisher kernels for logical sequences.
Frequent Subgraph Mining in Outerplanar Graphs
- Proc. 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining
, 2006
Cited by 10 (4 self)
In recent years there has been an increased interest in frequent pattern discovery in large databases of graph-structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to various heuristic strategies and restrictions of the search space, but have not identified a practically relevant tractable graph class beyond trees. In this paper, we consider the class of outerplanar graphs, a strict generalization of trees, develop a frequent subgraph mining algorithm for outerplanar graphs, and show that it works in incremental polynomial time for the practically relevant subclass of well-behaved outerplanar graphs, i.e., those that have only polynomially many simple cycles. We evaluate the algorithm empirically on chemo- and bioinformatics applications.
Declarative kernels
, 2004
Cited by 4 (1 self)
We introduce a declarative approach to kernel design based on background knowledge expressed in the form of logic programs. The theoretical foundation of declarative kernels is mereotopology, a general theory for studying parts and wholes and for defining topological relations among parts. Declarative kernels can be used to specify a broad class of kernels over relational data and represent a step towards bridging statistical learning and inductive logic programming. The flexibility and the effectiveness of declarative kernels are demonstrated in a number of real-world problems.
Kernel-based distances for relational learning
- In Proceedings of the Workshop on Multi-Relational Data Mining at KDD-2004
, 2004
Cited by 4 (0 self)
Abstract. In this paper we present a novel and general framework for kernel-based learning over relational schemata. We exploit the notion of foreign keys to perform the leap from a flat attribute-value representation to a structured representation that underlies relational learning. We define a new attribute type, which we call instance-set, that builds on the notion of foreign keys. It is shown that this more database-oriented approach enables intuitive modeling of relational problems. We also define some kernel functions over relational schemata and adapt them so that they are used as a basis for a relational instance-based learning algorithm. We check the performance of our algorithm on a number of well-known relational benchmark datasets.
L.: Inductive databases in the relational model: The data as the bridge
- In: KDID
Cited by 2 (0 self)
Abstract. We present a new and comprehensive approach to inductive databases in the relational model. The main contribution is a new inductive query language extending SQL, with the goal of supporting the whole knowledge discovery process, from pre-processing via data mining to post-processing. A prototype system supporting the query language was developed in the SINDBAD (structured inductive database development) project. Setting aside models and focusing on distance-based and instance-based methods, closure can easily be achieved. An example scenario from the area of gene expression data analysis demonstrates the power and simplicity of the concept. We hope that this preliminary work will help to bring the fundamental issues, such as the integration of various pattern domains and data mining techniques, to the attention of the inductive database community.
A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with
Cited by 2 (1 self)
Abstract. Clustering is an essential data mining task with various types of applications. Traditional clustering algorithms are based on a vector space model representation. A relational database system often contains multi-relational information spread across multiple relations (tables). In order to cluster such data, one would have to restrict the analysis to a single representation, or construct a feature space comprising all possible representations from the data stored in multiple tables. In this paper, we present a data summarization approach, borrowed from information retrieval theory, to clustering in a multi-relational environment. We find that the data summarization technique can be used here to capture the typical high volume of multiple instances and numerous forms of patterns. Our experiments demonstrate a technique to cluster data in a multi-relational environment and show the evaluation results on the mutagenesis dataset. In addition, the effect of varying the number of features considered in clustering on the classification performance is also evaluated.
M.: Distance-based learning over extended relational algebra structures
- In: Proceedings of the 15th International Conference of Inductive Logic Programming. (2005)
Cited by 2 (0 self)
Abstract. In (Kalousis et al., 2005) we presented a novel unifying framework for relational distance-based learning where learning examples are stored in a relational database. This approach is based on concepts from relational algebra and exploits the notion of foreign key associations to define a new attribute of type set. We defined several relational distances whose building blocks are distances between tuples of relations and distances between sets. In this paper we extend this relational algebra representation language such that it allows for modeling of lists of complex objects (relational instances in our case). We define a new type of foreign key association which, in addition to attributes of type set, gives rise to a new attribute of type list. We extend the well-known alignment-based edit distance measure on lists to fit within our framework. Our extended distance-based learning algorithm is tested on a protein fingerprint classification dataset for which promising results are reported.
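For orientation, an alignment-based edit distance over lists of arbitrary objects might look like the sketch below. It is the textbook dynamic program with a pluggable element distance, not the paper's definition: the name list_edit_distance, the elem_dist callback, and the constant gap_cost are assumptions made for illustration.

```python
def list_edit_distance(xs, ys, elem_dist, gap_cost=1.0):
    """Minimum-cost alignment of two lists.

    elem_dist(a, b) is the cost of aligning element a with element b;
    gap_cost is the (assumed constant) cost of leaving an element unaligned.
    """
    m, n = len(xs), len(ys)
    # prev[j] holds the cost of aligning the first i-1 items of xs
    # with the first j items of ys.
    prev = [j * gap_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * gap_cost] + [0.0] * n
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j] + gap_cost,      # xs[i-1] left unaligned
                curr[j - 1] + gap_cost,  # ys[j-1] left unaligned
                prev[j - 1] + elem_dist(xs[i - 1], ys[j - 1]),  # aligned pair
            )
        prev = curr
    return prev[n]
```

With a 0/1 element distance this reduces to the classic Levenshtein distance. In a relational setting, elem_dist would be whatever distance is defined between the complex objects stored in the lists.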

