Results 1  10
of
120
Frequent SubStructureBased Approaches for Classifying Chemical Compounds
 In Proceedings of ICDM’03
, 2003
"... In this paper we study the problem of classifying chemical compound datasets. We present a substructurebased classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topologi ..."
Abstract

Cited by 140 (6 self)
 Add to MetaCart
In this paper we study the problem of classifying chemical compound datasets. We present a substructurebased classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the dataset. The advantage of our approach is that during classification model construction, all relevant substructures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35%, on the average.
Finding frequent patterns in a large sparse graph
 SIAM Data Mining Conference
, 2004
"... This paper presents two algorithms based on the horizontal and vertical pattern discovery paradigms that find the connected subgraphs that have a sufficient number of edgedisjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods to determine ..."
Abstract

Cited by 130 (4 self)
 Add to MetaCart
(Show Context)
This paper presents two algorithms based on the horizontal and vertical pattern discovery paradigms that find the connected subgraphs that have a sufficient number of edgedisjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods to determine the number of the edgedisjoint embeddings of a subgraph that are based on approximate and exact maximum independent set computations and use it to prune infrequent subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 100,000 vertices, and significantly outperform a previously developed algorithm.
Patterns of influence in a recommendation network,”
 Proc. 10th PacificAsia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD),
, 2006
"... Abstract. Information cascades are phenomena whereby individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, a ..."
Abstract

Cited by 105 (14 self)
 Add to MetaCart
(Show Context)
Abstract. Information cascades are phenomena whereby individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, and in particular study the patterns of cascading recommendations that arise in large social networks. We investigate a large persontoperson recommendation network, consisting of four million people who made sixteen million recommendations on half a million products. Such a dataset allows to pose a number of fundamental questions: What cascades arise frequently in real life? What features distinguish them? We enumerate and count cascade subgraphs on large directed graphs; as one component of this, we develop a novel efficient heuristic based on graph isomorphism testing that scales to large datasets. We discover novel patterns: the distribution of cascade sizes and depths follows a power law. Generally, cascades tend to be shallow, but occasional large bursts of propagation can occur. Cascade subgraphs are mainly treelike, but we observe variability in connectivity and branching across recommendations for different types of products.
Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications
 in IEEE Transaction on Knowledge and Data Engineering
, 2005
"... Mining frequent trees is very useful in domains like bioinformatics, Web mining, mining semistructured data, etc. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TREEMINER,a novel algorithm to discover all frequent subtrees in a ..."
Abstract

Cited by 70 (0 self)
 Add to MetaCart
(Show Context)
Mining frequent trees is very useful in domains like bioinformatics, Web mining, mining semistructured data, etc. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TREEMINER,a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scopelist. We contrast TREEMINER with a pattern matching tree mining algorithm (PATTERNMATCHER), and we also compare it with TREEMINERD, which counts only distinct occurrences of a pattern. We conduct detailed experiments to test the performance and scalability of these methods. We also use tree mining to analyze RNA structure and phylogenetics data sets from bioinformatics domain.
Frequent Subtree Mining  An Overview
, 2005
"... Mining frequent subtrees from databases of labeled trees is a new research field that has many practical applications in areas such as computer networks, Web mining, bioinformatics, XML document mining, etc. These applications share a requirement for the more expressive power of labeled trees to ..."
Abstract

Cited by 52 (3 self)
 Add to MetaCart
Mining frequent subtrees from databases of labeled trees is a new research field that has many practical applications in areas such as computer networks, Web mining, bioinformatics, XML document mining, etc. These applications share a requirement for the more expressive power of labeled trees to capture the complex relations among data entities. Although frequent subtree mining is a more difficult task than frequent itemset mining, most existing frequent subtree mining algorithms borrow techniques from the relatively mature association rule mining area. This paper provides an overview of a broad range of tree mining algorithms. We focus on the common theoretical foundations of the current frequent subtree mining algorithms and their relationship with their counterparts in frequent itemset mining. When comparing the algorithms, we categorize them according to their problem definitions and the techniques employed for solving various subtasks of the subtree mining problem. In addition, we also present a thorough performance study for a representative family of algorithms.
Comparison of descriptor spaces for chemical compound retrieval and classification
 International Conference in Datamining. (ICDM
, 2006
"... Abstract. In recent years the development of computational techniques that build models to correctly assign chemical compounds to various classes or to retrieve potential druglike compounds has been an active area of research. Many of the bestperforming techniques for these tasks utilize a descript ..."
Abstract

Cited by 42 (5 self)
 Add to MetaCart
(Show Context)
Abstract. In recent years the development of computational techniques that build models to correctly assign chemical compounds to various classes or to retrieve potential druglike compounds has been an active area of research. Many of the bestperforming techniques for these tasks utilize a descriptorbased representation of the compound that captures various aspects of the underlying molecular graph’s topology. In this paper we compare five different set of descriptors that are currently used for chemical compound classification. We also introduce four different descriptors derived from all connected fragments present in the molecular graphs primarily for the purpose of comparing them to the currently used descriptor spaces and analyzing what properties of descriptor spaces are helpful in providing effective representation for molecular graphs. In addition, we introduce an extension to existing vectorbased kernel functions to take into account the length of the fragments present in the descriptors. We experimentally evaluate the performance of the previously introduced and the new descriptors in the context of SVMbased classification and rankedretrieval on 28 classification and retrieval problems derived from 18 datasets. Our experiments show that for both of these tasks, two of the four descriptors introduced in this paper along with the extended connectivity fingerprint based descriptors consistently and statistically outperform previously developed schemes based on the widely used fingerprint and Maccs keysbased descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.
Discovering Frequent Geometric Subgraphs
 In IEEE Intl. Conference on Data Mining ’02
, 2002
"... As data mining techniques are being increasingly applied to nontraditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirement of these domains. An alternate way of modeling the objects in these data sets, is to use a graph to model the ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
(Show Context)
As data mining techniques are being increasingly applied to nontraditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirement of these domains. An alternate way of modeling the objects in these data sets, is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally e#cient algorithm for finding frequent geometric subgraphs in a large collection of geometric graphs. Our algorithm is able to discover geometric subgraphs that can be rotation, scaling and translation invariant, and it can accommodate inherent errors on the coordinates of the vertices. We evaluated the performance of the algorithm using a large database of over 20,000 real two dimensional chemical structures, and our experimental results show that our algorithms requires relatively little time, can accommodate low support values, and scales linearly on the number of transactions.
A Survey of Frequent Subgraph Mining Algorithms
 THE KNOWLEDGE ENGINEERING REVIEW
, 2004
"... Graph mining is an important research area within the domain of data mining. The field of study concentrates on the identification of frequent subgraphs within graph data sets. The research goals are directed at: (i) effective mechanisms for generating candidate subgraphs (without generating duplica ..."
Abstract

Cited by 29 (1 self)
 Add to MetaCart
Graph mining is an important research area within the domain of data mining. The field of study concentrates on the identification of frequent subgraphs within graph data sets. The research goals are directed at: (i) effective mechanisms for generating candidate subgraphs (without generating duplicates) and (ii) how best to process the generated candidate subgraphs so as to identify the desired frequent subgraphs in a way that is computationally efficient and procedurally effective. This paper presents a survey of current research in the field of frequent subgraph mining, and proposed solutions to address the main research issues.
Maximal Biclique Subgraphs and Closed Pattern Pairs of the Adjacency Matrix: A Onetoone Correspondence and Mining Algorithms
, 2007
"... Maximal biclique (also known as complete bipartite) subgraphs can model many applications in web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponentially large with respect ..."
Abstract

Cited by 24 (8 self)
 Add to MetaCart
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponentially large with respect to the vertex number when the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without selfloops, we prove that: (i) the number of closed patterns in the adjacency matrix of G is even; (ii) the number of the closed patterns is precisely double the number of maximal biclique subgraphs of G; and (iii) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which are algorithms extensively studied in the data mining field. However, this direct use of existing algorithms causes a duplicated enumeration. To achieve high efficiency, we propose an O(mn) time delay algorithm for a nonduplicated enumeration, in particular for enumerating those maximal bicliques with a large size, where m and n are the number of edges and vertices of the graph respectively. We evaluate the high efficiency of our algorithm by comparing it to stateoftheart algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a reallife protein interaction network. In this paper, we also prove that if selfloops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even; but the maximal bicliques are exactly the same as those of the graph after removing all the selfloops.