gSpan: Graph-Based Substructure Pattern Mining (2002)

by Xifeng Yan, Jiawei Han

Results 1 - 10 of 650

Mining sequential patterns by pattern-growth: The PrefixSpan approach

by Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu - IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 206 (10 self)
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE [29] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
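
The projection idea in the abstract above can be illustrated with a minimal sketch. It assumes each sequence is a plain list of single items (the paper also handles itemset elements), and the function name `prefixspan` and its parameters are hypothetical, not taken from the authors' implementation. Projected databases are kept as (sequence index, offset) pointers rather than copied suffixes, in the spirit of the pseudoprojection the abstract mentions.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern_tuple: support} for all frequent sequential patterns.

    Assumption: every sequence is a list of single, sortable items.
    """
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence_index, start_offset) pairs: the suffix of
        # each sequence that follows the first match of the current prefix.
        counts = defaultdict(int)
        for seq_idx, start in projected:
            # Count each item at most once per sequence.
            for item in set(sequences[seq_idx][start:]):
                counts[item] += 1

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue  # only locally frequent items extend the prefix
            pattern = prefix + (item,)
            results[pattern] = support

            # Build the next projected database by advancing each pointer
            # past the first occurrence of the item in the current suffix.
            next_projected = []
            for seq_idx, start in projected:
                try:
                    pos = sequences[seq_idx].index(item, start)
                except ValueError:
                    continue  # item does not occur in this suffix
                next_projected.append((seq_idx, pos + 1))
            grow(pattern, next_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    db = [list("abc"), list("acb"), list("ab")]
    for pattern, support in sorted(prefixspan(db, min_support=2).items()):
        print("".join(pattern), support)
```

Because each recursive call only scans the suffixes that already contain the current prefix, the search never generates a candidate that is absent from the data, which is the contrast with candidate generation-and-test that the abstract draws.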

Graph Indexing: A Frequent Structure-based Approach

by Xifeng Yan, Philip S. Yu, Jiawei Han, 2004
Cited by 201 (25 self)
Graphs have become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructures as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of the index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has a 10 times smaller index size, but achieves 3-10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well.

Efficient Mining of Frequent Subgraph in the Presence of Isomorphism

by Jun Huan, Wei Wang, Jan Prins
Cited by 194 (23 self)
Frequent subgraph mining is an active research topic in the data mining community. A graph is a general model to represent data and has been used in many domains like cheminformatics and bioinformatics. Mining patterns from graph databases is challenging since graph related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees, which have been studied extensively. In this paper, we propose a novel frequent subgraph mining algorithm: FFSM, which employs a vertical search scheme within an algebraic graphical framework we have developed to reduce the number of redundant candidates proposed. Our empirical study on synthetic and real datasets demonstrates that FFSM achieves a substantial performance gain over the current state-of-the-art subgraph mining algorithm gSpan.

Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds

by Mukund Deshpande, Michihiro Kuramochi, George Karypis - In Proceedings of ICDM’03, 2003
Cited by 140 (6 self)
In this paper we study the problem of classifying chemical compound datasets. We present a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of our approach is that during classification model construction, all relevant sub-structures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35%, on the average.

Preserving privacy in social networks against neighborhood attacks

by Bin Zhou, Jian Pei - In IEEE 24th International Conference on Data Engineering (ICDE 2008), 2008
Cited by 134 (4 self)
Recently, as more and more social network data has been published in one way or another, preserving privacy in publishing social network data has become an important concern. With some local knowledge about individuals in a social network, an adversary may attack the privacy of some victims easily. Unfortunately, most of the previous studies on privacy preservation can deal with relational data only, and cannot be applied to social network data. In this paper, we take an initiative towards preserving privacy in social network data. We identify an essential type of privacy attack: neighborhood attacks. If an adversary has some knowledge about the neighbors of a target victim and the relationship among the neighbors, the victim may be re-identified from a social network even if the victim’s identity is preserved using the conventional anonymization techniques. We show that the problem is challenging, and present a practical solution to battle neighborhood attacks. The empirical study indicates that anonymized social networks generated by our method can still be used to answer aggregate network queries with high accuracy.

Citation Context

...e backward edges. The vertices in the graph are encoded v0 to v3 according to the pre-order of the corresponding DFS-trees. To solve the uniqueness problem, a minimum DFS code notation is proposed in [19]. For any connected graph G, let T be a DFS-tree of G. Then, an edge is always listed as (vi, vj) such that i < j. A linear order ≺ on the edges in G can be defined as follows. Given edges e = (vi, vj...

Graph mining: laws, generators, and algorithms

by Deepayan Chakrabarti, Christos Faloutsos - ACM Computing Surveys (CSUR), 2006
Cited by 132 (7 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M:N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.

Finding frequent patterns in a large sparse graph

by Michihiro Kuramochi, George Karypis - SIAM Data Mining Conference, 2004
Cited by 130 (4 self)
This paper presents two algorithms based on the horizontal and vertical pattern discovery paradigms that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods to determine the number of edge-disjoint embeddings of a subgraph, based on approximate and exact maximum independent set computations, and use that number to prune infrequent subgraphs. Experimental evaluation on real datasets from various domains shows that both algorithms achieve good performance, scale well to sparse input graphs with more than 100,000 vertices, and significantly outperform a previously developed algorithm.

Citation Context

..., and help in determining the similarity between graphs [54, 23, 42, 59, 9, 49, 13, 60, 66]. Within the context of graphs, the most widely used definition of a pattern is that of a connected subgraph [8, 68, 32, 29, 69, 30, 44] and is the definition that we will use in this paper. However, different pattern definitions have been proposed as well [32]. There are two distinct problem formulations for frequent pattern mining i...

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations

by U Kang, Charalampos E. Tsourakakis, Christos Faloutsos - IEEE International Conference on Data Mining, 2009
Cited by 128 (26 self)
In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; hadoop
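
The claim that many graph mining operations reduce to a repeated matrix-vector multiplication can be made concrete with a small in-memory sketch. This is not the PEGASUS or HADOOP implementation: the function `gim_v` and the three user-supplied operations named `combine2`, `combine_all`, and `assign` are assumptions chosen here for illustration, and the matrix is held as a plain list of nonzero entries.

```python
def gim_v(edges, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step over a sparse matrix.

    edges: iterable of (i, j, m_ij) nonzero entries of the matrix.
    vector: list of current values v_j.
    combine2(m_ij, v_j): combines one matrix entry with one vector entry.
    combine_all(i, xs): reduces all partial results for row i.
    assign(old, new): decides the updated value for row i.
    """
    partials = [[] for _ in vector]
    for i, j, m_ij in edges:
        partials[i].append(combine2(m_ij, vector[j]))
    return [
        assign(vector[i], combine_all(i, partials[i]))
        for i in range(len(vector))
    ]


def connected_components(num_nodes, undirected_edges):
    """Label each node with the smallest node id in its component."""
    # Store both directions so the adjacency matrix is symmetric.
    edges = []
    for a, b in undirected_edges:
        edges.append((a, b, 1))
        edges.append((b, a, 1))

    labels = list(range(num_nodes))
    while True:
        new_labels = gim_v(
            edges,
            labels,
            combine2=lambda m, v: v,                          # pass neighbor label
            combine_all=lambda i, xs: min(xs) if xs else i,   # smallest neighbor label
            assign=lambda old, new: min(old, new),            # keep the smaller label
        )
        if new_labels == labels:  # fixed point reached
            return labels
        labels = new_labels


if __name__ == "__main__":
    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

With `combine2` as multiplication, `combine_all` as a sum, and `assign` returning the new value, the same `gim_v` step is an ordinary sparse matrix-vector product, which is the sense in which the step is "generalized" in this sketch.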

Citation Context

...: There are a huge number of graph mining algorithms, computing communities (e.g., [3], DENGRAPH [4], METIS [5]), subgraph discovery (e.g., GraphSig [6], [7], [8], [9], gPrune [10], gApprox [11], gSpan [12], Subdue [13], HSIGRAM/VSIGRAM [14], ADI [15], CSV [16]), finding important nodes (e.g., PageRank [17] and HITS [18]), computing the number of triangles [19], [20], computing the diameter [21], topic ...

Top 10 algorithms in data mining

by Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Dan Steinberg, 2007
Cited by 126 (2 self)
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification,

Citation Context

...tly. AprioriSMP uses this principle [45]. 5) using richer expressions than itemset: Many algorithms have been proposed for sequences, tree and graphs to enable mining from more complex data structure [73, 35]. 6) closed itemsets: A frequent itemset is closed if it is not included in any other frequent itemsets. Thus, once the closed itemsets are found, all the frequent itemsets can be derived from them. L...

An efficient algorithm for discovering frequent subgraphs

by Michihiro Kuramochi, George Karypis - IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 120 (7 self)
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset. Index Terms: Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.

Citation Context

... focused on approximate algorithms [23], [28], [35], [46] that use various heuristics to prune the search space. However, a number of exact algorithms have been developed [5], [10], [17], [24], [25], [45] that guarantee to find all subgraphs that satisfy certain minimum support or other constraints. Probably the most well-known heuristic-based approach is the SUBDUE system, originally developed in 199...
