Results 1 - 10 of 47
An efficient algorithm for discovering frequent subgraphs
- IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 68 (5 self)
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used, because the transaction framework that these algorithms assume cannot effectively model the datasets in these domains. An alternative way of modeling the objects in these datasets is to represent them as graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset. Index Terms: Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.
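The notion of a subgraph that "occurs frequently over the entire set of graphs" can be illustrated with a minimal sketch. This is not the FSG algorithm itself; it only counts the support of single labeled edges, which is roughly the first level of a level-wise miner. The function name `frequent_edges`, the graph encoding, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Return labeled edges that occur in at least min_support graphs.

    Each graph is a (vertex_labels, edges) pair: vertex_labels maps a vertex id
    to its label, and edges is an iterable of (u, v, edge_label) triples.
    """
    support = Counter()
    for vertex_labels, edges in graphs:
        seen = set()
        for u, v, edge_label in edges:
            # Sort the endpoint labels so an undirected edge has one canonical form.
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            seen.add((a, edge_label, b))
        # A pattern counts once per graph transaction, however often it occurs inside it.
        support.update(seen)
    return {pattern: n for pattern, n in support.items() if n >= min_support}

# Hypothetical toy dataset of three small labeled graphs.
graphs = [
    ({0: "C", 1: "O", 2: "C"}, [(0, 1, "single"), (0, 2, "single")]),
    ({0: "C", 1: "O"}, [(0, 1, "double")]),
    ({0: "O", 1: "C", 2: "N"}, [(0, 1, "single"), (1, 2, "single")]),
]
result = frequent_edges(graphs, min_support=2)
```

Here only the single C-O edge appears in two of the three graphs, so it is the only pattern that survives a support threshold of 2. Larger subgraphs would additionally require candidate generation and isomorphism checks, which this sketch leaves out.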
Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds
- In Proceedings of ICDM’03, 2003
Cited by 65 (3 self)
In this paper we study the problem of classifying chemical compound datasets. We present a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of our approach is that during classification model construction, all relevant sub-structures are available, allowing the classifier to intelligently select the most discriminating ones. Computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35% on average.
Statistical relational learning for link prediction
- In Proceedings of the Workshop on Learning Statistical Models from Relational Data at IJCAI-2003, 2003
Cited by 48 (5 self)
Link prediction is a complex, inherently relational task. Be it in the domain of scientific citations, social networks or hypertext links, the underlying data are extremely noisy and the characteristics useful for prediction are not readily available in a “flat” file format, but rather involve complex relationships among objects. In this paper, we propose the application of our methodology for Statistical Relational Learning to building link prediction models. We propose an integrated approach to building regression models from data stored in relational databases in which potential predictors are generated by structured search of the space of queries to the database, and then tested for inclusion in a logistic regression. We present experimental results for the task of predicting citations made in scientific literature using relational data taken from CiteSeer. These data include the citation graph, authorship and publication venues of papers, as well as their word content.
Understanding the crucial role of attribute interaction in data mining
- Artif. Intel. Rev, 2001
Cited by 39 (11 self)
This is a review paper whose goal is to significantly improve our understanding of the crucial role of attribute interaction in data mining. The main contributions of this paper are as follows. Firstly, we show that the concept of attribute interaction has a crucial role across different kinds of problems in data mining, such as attribute construction, coping with small disjuncts, induction of first-order logic rules, detection of Simpson’s paradox, and finding several types of interesting rules. Hence, a better understanding of attribute interaction can lead to a better understanding of the relationship between these kinds of problems, which are usually studied separately from each other. Secondly, we draw attention to the fact that most rule induction algorithms are based on a greedy search which does not cope well with the problem of attribute interaction, and point out some alternative kinds of rule discovery methods which tend to cope better with this problem. Thirdly, we discuss several algorithms and methods for discovering interesting knowledge that, implicitly or explicitly, are based on the concept of attribute interaction.
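The claim that greedy search copes poorly with attribute interaction can be made concrete with the standard XOR example, sketched below. This example is an illustration chosen here, not one taken from the paper, and the helper names `entropy` and `information_gain` are assumptions. Each attribute on its own has zero information gain, so a greedy one-attribute-at-a-time learner sees no reason to pick either, yet the two attributes together determine the class exactly.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attributes):
    """Entropy reduction from splitting on the joint value of the given attributes."""
    groups = {}
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attributes)
        groups.setdefault(key, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Class label is the XOR of two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_a = information_gain(rows, labels, [0])      # attribute 0 alone
gain_b = information_gain(rows, labels, [1])      # attribute 1 alone
gain_ab = information_gain(rows, labels, [0, 1])  # both attributes jointly
```

The single-attribute gains come out as 0 bits while the joint gain is the full 1 bit of class entropy, which is the pattern a greedy attribute-by-attribute search cannot see.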
Statistical Relational Learning for Document Mining
- 2003
Cited by 35 (5 self)
A major obstacle to fully integrated deployment of statistical learners is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. In this paper, we introduce an integrated approach to building regression models from data stored in relational databases. Potential features are generated by structured search of the space of queries to the database, and then tested for inclusion in a logistic regression. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. This data includes word counts in the document, frequently cited authors or papers, co-citations, publication venues of cited papers, word co-occurrences, and word counts in cited or citing documents. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. Our classification task also serves as a "where to publish?" conference/journal recommendation task.
Comparative Evaluation of Approaches to Propositionalization
- 2003
Cited by 33 (2 self)
Propositionalization has already been shown to be a promising approach for robustly and effectively handling relational data sets for knowledge discovery. In this paper, we compare up-to-date methods for propositionalization from two main groups: logic-oriented and database-oriented techniques. Experiments using several learning tasks (both ILP benchmarks and tasks from recent international data mining competitions) show that both groups have their specific advantages. While logic-oriented methods can handle complex background knowledge and provide expressive first-order models, database-oriented methods can be more efficient, especially on larger data sets. Obtained accuracies vary such that a combination of the features produced by both groups seems a further valuable venture.
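As a rough illustration of what a database-oriented propositionalization step might look like, the sketch below flattens a one-to-many relation into a single table by aggregation. It does not reproduce any of the specific methods compared in the paper; the `customers` and `orders` tables, the column names, and the choice of aggregates are all hypothetical.

```python
from collections import defaultdict

def propositionalize(customers, orders):
    """Flatten a one-to-many customer -> orders relation into one row per customer."""
    by_customer = defaultdict(list)
    for customer_id, amount in orders:
        by_customer[customer_id].append(amount)
    table = []
    for customer_id, age in customers:
        amounts = by_customer.get(customer_id, [])
        table.append({
            "customer_id": customer_id,
            "age": age,
            # Aggregates summarize the related rows as fixed-width features.
            "order_count": len(amounts),
            "order_total": sum(amounts),
            "order_max": max(amounts, default=0),
        })
    return table

# Hypothetical data: (customer_id, age) and (customer_id, amount).
customers = [(1, 34), (2, 51), (3, 27)]
orders = [(1, 20), (1, 35), (2, 10)]
flat = propositionalize(customers, orders)
```

The resulting rows can be handed to any learner that expects a single table. A logic-oriented method would instead generate features from first-order clauses over the relations.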
Learning Statistical Models from Relational Data
- 2001
Cited by 33 (6 self)
This workshop is the second in a series of workshops held in conjunction with AAAI and IJCAI. The first workshop was held in July, 2000 at AAAI. Notes from that workshop are available at
Lattice-Search Runtime Distributions May Be Heavy-Tailed
- In Proceedings of the 12th International Conference on Inductive Logic Programming, 2002
Cited by 23 (2 self)
Recent empirical studies show that runtime distributions of backtrack procedures for solving hard combinatorial problems often have intriguing properties. Unlike standard distributions (such as the normal), such distributions decay more slowly than exponentially and have "heavy tails". Procedures characterized by heavy-tailed runtime distributions exhibit large variability in efficiency, but a very straightforward method called rapid randomized restarts has been designed to substantially improve their average performance. We show on two experimental domains that heavy-tailed phenomena can be observed in ILP, namely in the search for a clause in the subsumption lattice. We also reformulate the technique of rapid randomized restarts to make it applicable in ILP and show that it can reduce the average search time.
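The general idea of restarting a randomized search after a fixed cutoff can be sketched as follows. This is a generic illustration, not the ILP reformulation described in the paper; `search_with_restarts`, `toy_attempt`, and the toy cost model (half the runs need 1 step, the rest need 10,000) are assumptions.

```python
import random

def search_with_restarts(attempt, cutoff, max_restarts, seed=0):
    """Run attempt(rng, cutoff) repeatedly until it succeeds or the restarts run out.

    attempt must return (solution, steps_used); solution is None when the cutoff was hit.
    Returns (solution, total_steps, number_of_failed_runs).
    """
    rng = random.Random(seed)
    total_steps = 0
    for restart in range(max_restarts):
        solution, steps = attempt(rng, cutoff)
        total_steps += steps
        if solution is not None:
            return solution, total_steps, restart
    return None, total_steps, max_restarts

def toy_attempt(rng, cutoff):
    # Hypothetical heavy-tailed search: some runs finish in 1 step, others need 10_000.
    needed = 1 if rng.random() < 0.5 else 10_000
    if needed <= cutoff:
        return "found", needed
    # The run is abandoned at the cutoff, so only cutoff steps are spent.
    return None, cutoff

solution, total_steps, failed_runs = search_with_restarts(
    toy_attempt, cutoff=5, max_restarts=200
)
```

With a small cutoff, each unlucky run costs only 5 steps before it is abandoned, so the total cost stays far below the 10,000 steps a single unlucky run would take without restarts.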
Experiments in Predicting Biodegradability
- Applied Artificial Intelligence, 1999
Cited by 22 (8 self)
We present a novel application of inductive logic programming (ILP) in the area of quantitative structure-activity relationships (QSARs). The activity we want to predict is the biodegradability of chemical compounds in water. In particular, the target variable is the half-life in water for aerobic aqueous biodegradation. Structural descriptions of chemicals in terms of atoms and bonds are derived from the chemicals' SMILES encodings. Definitions of substructures are used as background knowledge. Predicting biodegradability is essentially a regression problem, but we also consider a discretized version of the target variable. We thus employ a number of relational classification and regression methods on the relational representation and compare these to propositional methods applied to different propositionalisations of the problem. Some expert comments on the induced theories are also given.
Towards Structural Logistic Regression: Combining Relational and Statistical Learning
- 2002
Cited by 22 (5 self)
Inductive logic programming (ILP) techniques are useful for analyzing data in multi-table relational databases. Learned rules can potentially discover relationships that are not obvious in "flattened" data. Statistical learners, on the other hand, are generally not constructed to search relational data; they expect to be presented with a single table containing a set of feature candidates. However, statistical learners often yield more accurate models than the logical forms of ILP, and can better handle certain types of data, such as counts. We propose a new approach which integrates structure navigation from ILP with regression modeling. Our approach propositionalizes the first-order rules at each step of ILP's relational structure search, generating features for potential inclusion in a regression model. Ideally, feature generation by ILP and feature selection by stepwise regression should be integrated into a single loop. Preliminary results for scientific literature classification are presented using a relational form of the data extracted by ResearchIndex (formerly CiteSeer). We use FOIL and logistic regression as our ILP and statistical components (decoupled at this stage). Word counts and citation-based features learned with FOIL are modeled together by logistic regression. The combination often significantly improves performance when high-precision classification is desired.

