Finding frequent patterns in a large sparse graph
- SIAM Data Mining Conference, 2004
- Cited by 45 (3 self)
This paper presents two algorithms based on the horizontal and vertical pattern discovery paradigms that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods, based on approximate and exact maximum independent set computations, to determine the number of edge-disjoint embeddings of a subgraph, and use this count to prune infrequent subgraphs. Experimental evaluation on real datasets from various domains shows that both algorithms achieve good performance, scale well to sparse input graphs with more than 100,000 vertices, and significantly outperform a previously developed algorithm.
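As a rough illustration of how an independent set computation can count edge-disjoint embeddings, the sketch below builds an overlap graph (one vertex per embedding, an edge between two embeddings that share a graph edge) and then finds an independent set in it. The function names, the minimum-degree greedy heuristic, and the toy embeddings are assumptions made for this example; the abstract does not say which approximate or exact methods the paper actually uses.

```python
from itertools import combinations

def overlap_graph(embeddings):
    """Adjacency sets: embeddings i and j are adjacent if they share a graph edge."""
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if embeddings[i] & embeddings[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def greedy_independent_set(adj):
    """Approximate MIS (hypothetical heuristic): take a minimum-degree vertex,
    drop it and its neighbours, repeat. Gives a lower bound on the true size."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    chosen = []
    while remaining:
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for ns in remaining.values():
            ns -= removed
    return chosen

def exact_independent_set_size(adj):
    """Exact MIS size by exhaustive search; only feasible for a few embeddings."""
    nodes = list(adj)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if all(b not in adj[a] for a, b in combinations(subset, 2)):
                return k
    return 0

def edge(u, v):
    """Undirected edge as an order-independent key."""
    return frozenset((u, v))

# Toy data: four embeddings of a two-edge path pattern in some larger graph.
embeddings = [
    {edge(1, 2), edge(2, 3)},
    {edge(2, 3), edge(3, 4)},
    {edge(3, 4), edge(4, 5)},
    {edge(5, 6), edge(6, 7)},
]
adj = overlap_graph(embeddings)
approx_support = len(greedy_independent_set(adj))
exact_support = exact_independent_set_size(adj)
```

A pattern would then be pruned when its support falls below a chosen frequency threshold. Because the greedy value never exceeds the exact one, pruning on it can discard a pattern that is in fact frequent, which is presumably why both kinds of computation are of interest.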
Graph database indexing using structured graph decomposition
- In ICDE, 2007
- Cited by 19 (2 self)
We introduce a novel method of indexing graph databases in order to facilitate subgraph isomorphism and similarity queries. The index comprises two major data structures. The primary structure is a directed acyclic graph which contains a node for each of the unique, induced subgraphs of the database graphs. The secondary structure is a hash table which cross-indexes each subgraph for fast isomorphic lookup. In order to create a hash key independent of isomorphism, we utilize a code-based canonical representation of adjacency matrices, which we have further refined to improve computation speed. We validate the concept by demonstrating its effectiveness in answering queries for two practical datasets. Our experiments show that for subgraph isomorphism queries, our method outperforms existing methods by more than an order of magnitude.
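To make the idea of an isomorphism-independent hash key concrete, here is a minimal brute-force sketch: it tries every vertex ordering of a small unlabeled graph and keeps the lexicographically largest upper-triangle adjacency code. The function name `canonical_code` and the choice of the largest code are assumptions for illustration only; the refined code the abstract mentions is not described there, and this exhaustive version is practical only for very small graphs.

```python
from itertools import permutations

def canonical_code(n, edges):
    """Return a tuple that is identical for all isomorphic copies of the graph.

    n     -- number of vertices, labelled 0..n-1
    edges -- iterable of undirected (u, v) pairs
    """
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):
        # Read the upper triangle of the adjacency matrix under this ordering.
        code = tuple(
            1 if frozenset((perm[i], perm[j])) in edge_set else 0
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or code > best:
            best = code
    return best

# Two drawings of the same 3-vertex path, plus a triangle.
path_a = canonical_code(3, [(0, 1), (1, 2)])
path_b = canonical_code(3, [(0, 2), (2, 1)])
triangle = canonical_code(3, [(0, 1), (1, 2), (0, 2)])

# The code can serve directly as a dictionary (hash table) key.
index = {path_a: "path", triangle: "triangle"}
lookup = index[path_b]
```

Since isomorphic graphs produce the same tuple, a lookup by code finds a stored subgraph regardless of how the query's vertices are numbered.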
Unordered Tree Mining with Applications to Phylogeny
- In Proc. of the 20th International Conference on Data Engineering, 2004
- Cited by 14 (2 self)
Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data, such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. In this paper we present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree, our algorithm finds all interesting cousin pairs of ...
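The cousin-pair definition above can be sketched in a few lines. The code below is only an illustration under stated assumptions: the tree is given as a hypothetical child-to-parent dictionary, only pairs in the same generation are reported (both nodes exactly k levels below their lowest common ancestor), and `max_levels` is an invented cutoff. The truncated abstract does not say what makes a pair "interesting", so no such filter is applied.

```python
from itertools import combinations

def ancestors(parent, node):
    """Return [parent, grandparent, ...] of node, ending at the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def cousin_pairs(parent, max_levels):
    """List (u, v, k) where u and v share their k-th ancestor and no closer one,
    for 1 <= k <= max_levels (k=1: same parent, k=2: same grandparent, ...)."""
    nodes = sorted(set(parent) | set(parent.values()))
    pairs = []
    for u, v in combinations(nodes, 2):
        au, av = ancestors(parent, u), ancestors(parent, v)
        for k in range(1, min(len(au), len(av), max_levels) + 1):
            if au[k - 1] == av[k - 1]:
                pairs.append((u, v, k))
                break
    return pairs

# Toy tree: r has children a and b; a has children c and d; b has child e.
parent = {"a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
siblings_only = cousin_pairs(parent, max_levels=1)
up_to_grandparent = cousin_pairs(parent, max_levels=2)
```

Because the trees are unordered, each pair is reported once with its nodes in sorted order rather than in any left-to-right order.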
DNA Sequence Classification via an Expectation Maximization Algorithm and Neural Networks: A Case Study
- IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2001
- Cited by 7 (1 self)
This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether or not S is an E. coli promoter. We use an expectation-maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. The EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 binding site and the -10 binding site as well as the spacer between the -10 binding site and the transcriptional start site, our algorithm deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information contents and represent the features using an orthogonal encoding method. We then feed the features to a neural network for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.
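Orthogonal encoding, as the term is commonly used for neural network inputs, maps each nucleotide to its own unit vector so that no two bases are closer to each other than any other two. The sketch below assumes a four-letter alphabet in the order A, C, G, T and a flat concatenated output; the function name, the ordering and the layout are illustrative choices, not details taken from the paper.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering; any fixed order works

def orthogonal_encode(seq):
    """Encode a DNA string as concatenated 4-element unit vectors (one per base)."""
    vector = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol in sequence: {base!r}")
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector

# TATAAT is the textbook consensus of the -10 binding site.
minus10_input = orthogonal_encode("TATAAT")
short_example = orthogonal_encode("AC")
```

A six-base site therefore becomes a 24-element input vector with exactly one 1 in each group of four.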
Parallel Algorithms for Mining Frequent Structural Motifs in Scientific Data
- In ACM International Conference on Supercomputing (ICS), 2004
- Cited by 6 (1 self)
Discovery of important substructures from molecules is an important data mining problem. The basic motivation is that the structure of a molecule has a role to play in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules. Recently, we have developed a general-purpose suite of algorithms, the MotifMiner Toolkit, that can mine for structural motifs in a wide range of biomolecular datasets. While the algorithms have proven to be extremely useful in their ability to identify novel substructures, the algorithms themselves are quite time-consuming. There are two reasons for this: i) inherently the algorithm suffers from the curse of subgraph isomorphism; and ii) handling noise effects (e.g. protein structure data) results in a significant slowdown. To address this problem, in this paper we propose parallelization strategies in a cluster environment for the above algorithms. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we are able to obtain good speedup on moderate-sized clusters.
Implicit Enumeration of Patterns
- Knowledge Discovery in Inductive Databases, 3rd International Workshop, KDID 2004, 2004
- Cited by 2 (2 self)
Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented.
A new approach to protein structure mining and alignment
- In 4th Workshop on Data Mining in Bioinformatics, 2004
- Cited by 2 (0 self)
One of the largest areas of bioinformatic and data mining research has been in the protein domain. These efforts have included protein structure prediction, folding pathway prediction, sequence alignment, ab initio simulation, structure alignment, substructure detection and many others. Substructure detection is generally defined as the mining of a molecule’s 3D structure in order to find interesting/frequent domains. Sequence alignment involves determining the similarity of two (or more) protein molecules based on how well their amino acid sequences “match.” There are potential pitfalls when trying to solve both of these problems, however. In the case of substructure mining, focusing solely on structural information can lead to the discovery of biologically irrelevant substructures. With sequence alignment, the alignment results can vary greatly, depending on the substitution matrix used. In this paper we describe a method that combines the benefits of both substructure mining and sequence alignment in an attempt to determine the similarity between protein molecules. In the absence of biological information, our work will quickly and efficiently mine a protein molecule in order to determine frequent local structures. With the addition of biological sequence information, however, our algorithm provides a way to align proteins with similar local structures and sequence, yielding a global alignment between molecules. We present a novel structure mining/alignment algorithm as well as some additional work into a new clustering metric for amino acids based on several different physico-chemical properties. This metric is used with our alignment algorithm in order to provide a mechanism for globally aligning protein molecules.
Discovering spatial relationships between approximately equivalent patterns
- In The 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD), 2004
alpha-Surface and Its Application to Mining Protein Data
- Cited by 1 (0 self)
Given a finite set of points in three-dimensional Euclidean space R^3, the subset that forms its surface could be different when observed at different levels of detail. In this paper, we introduce a notion called α-surface. We present an algorithm that extracts the α-surface from a finite set of points in R^3. We apply the algorithm to extracting the α-surfaces of proteins and discover patterns from these surface structures, using the pattern discovery algorithm we developed earlier. We then use these patterns to classify the proteins. Experimental results show the good performance of the proposed approach.
unknown title
Motivation: Here we present two approaches to solving the substructure similarity problem. The first approach is a parallel version of a previously developed substructure mining algorithm. The second is an algorithm that operates on the protein domain and uses a protein’s amino acid sequence in addition to substructures to determine similarity.

