Results 1-10 of 53
Graphs-at-a-time: Query Language and Access Methods for Graph Databases
, 2008
Cited by 70 (0 self)
With the prevalence of graph data in a variety of domains, there is an increasing need for a language to query and manipulate graphs with heterogeneous attributes and structures. We propose a query language for graph databases that supports arbitrary attributes on nodes, edges, and graphs. In this language, graphs are the basic unit of information and each query manipulates one or more collections of graphs. To allow for flexible compositions of graph structures, we extend the notion of formal languages from strings to the graph domain. We present a graph algebra extended from the relational algebra in which the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs. Then, we investigate access methods of the selection operator. Pattern matching over large graphs is challenging due to the NP-completeness of subgraph isomorphism. We address this by a combination of techniques: use of neighborhood subgraphs and profiles, joint reduction of the search space, and optimization of the search order. Experimental results on real and synthetic large graphs demonstrate that our graph-specific optimizations outperform an SQL-based implementation by orders of magnitude.
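The abstract mentions neighborhood profiles as one way to shrink the search space before pattern matching. A minimal sketch of that idea follows, assuming a profile is simply the multiset of labels in a node's radius-1 neighborhood; the abstract does not define profiles, so this definition and the function names `profile` and `candidates` are illustrative assumptions rather than the paper's method.

```python
from collections import Counter

def profile(adj, labels, node):
    """Multiset of labels of `node` and its direct neighbors (assumed profile definition)."""
    return Counter([labels[node]] + [labels[n] for n in adj[node]])

def candidates(q_adj, q_labels, g_adj, g_labels):
    """For each query node, keep the data nodes that could possibly match it.

    A data node survives only if its label equals the query node's label and
    its profile contains the query node's profile as a sub-multiset.
    """
    result = {}
    for q in q_adj:
        q_prof = profile(q_adj, q_labels, q)
        result[q] = {
            g for g in g_adj
            if g_labels[g] == q_labels[q]
            # Counter subtraction drops non-positive counts, so an empty
            # result means every query label is covered in the data profile.
            and not (q_prof - profile(g_adj, g_labels, g))
        }
    return result

# Data graph: a triangle 1-2-3 plus an extra A-labeled node 4 attached to 2.
g_adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
g_labels = {1: "A", 2: "B", 3: "C", 4: "A"}

# Query graph: a triangle with labels A, B, C.
q_adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
q_labels = {"a": "A", "b": "B", "c": "C"}

pruned = candidates(q_adj, q_labels, g_adj, g_labels)
```

Node 4 carries the right label for query node `a` but has no C-labeled neighbor, so the profile check discards it before any isomorphism search starts.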
Large-Scale Malware Indexing Using Function-Call Graphs
Cited by 55 (0 self)
A major challenge of the anti-virus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such a determination based on malware’s function-call graphs, a structural representation known to be less susceptible to the instruction-level obfuscations commonly employed by malware writers to evade detection by AV software. Because each malware program is represented as a graph, the problem of searching for the most similar malware program in a database to a given malware sample is cast into a nearest-neighbor search problem in a graph database. To speed
Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism
Cited by 50 (9 self)
Graphs are widely used to model complicated data semantics in many applications. In this paper, we aim to develop efficient techniques to retrieve graphs containing a given query graph from a large set of graphs. Because testing subgraph isomorphism is NP-hard in general, most of the existing techniques are based on the filtering-and-verification framework to reduce the precise computation costs; consequently, various novel feature-based indexes have been developed. While the existing techniques work well for small query graphs, the verification phase becomes a bottleneck when the query graph size increases. Motivated by this, in this paper we first propose a novel and efficient algorithm for testing subgraph isomorphism, QuickSI. Second, we develop a new feature-based index technique to accommodate QuickSI in the filtering phase. Our extensive experiments on real and synthetic data demonstrate the efficiency and scalability of the proposed techniques, which significantly improve on the existing techniques.
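The filtering-and-verification framework described above can be sketched in a few lines. This is an illustration only: the filter here is a plain label-count containment check and the verifier is a naive backtracking search, neither of which is QuickSI or the paper's feature-based index. The graph encoding (a `(labels, edges)` pair) and all function names are assumptions made for the example.

```python
from collections import Counter

def passes_filter(query, graph):
    """Cheap necessary condition: the graph has at least as many nodes of each label."""
    q_counts = Counter(query[0].values())
    g_counts = Counter(graph[0].values())
    return not (q_counts - g_counts)

def contains(query, graph):
    """Naive backtracking test for label-preserving (non-induced) subgraph isomorphism."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    q_adj = {n: set() for n in q_labels}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    q_nodes = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            return True
        q = q_nodes[len(mapping)]
        for g in g_labels:
            if g in mapping.values() or g_labels[g] != q_labels[q]:
                continue
            # Every query edge from q to an already-mapped node must exist in the graph.
            if all(frozenset((g, mapping[n])) in g_edge_set
                   for n in q_adj[q] if n in mapping):
                mapping[q] = g
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})

def search(query, database):
    """Filter first, then run the expensive verification only on surviving candidates."""
    survivors = [name for name, g in database.items() if passes_filter(query, g)]
    return [name for name in survivors if contains(query, database[name])]

database = {
    "g1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),  # triangle
    "g2": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),          # path
    "g3": ({1: "A", 2: "B"}, [(1, 2)]),                          # too few labels
}
query = ({"x": "A", "y": "B", "z": "C"}, [("x", "y"), ("y", "z"), ("x", "z")])
answers = search(query, database)
```

In this toy database, `g3` is discarded by the filter alone, while `g2` is a false positive that only the verification step rejects, which is the cost the abstract identifies as the bottleneck for larger queries.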
A novel spectral coding in a large graph database
 In Proceedings of the International Conference on Extending Database Technology
, 2008
Cited by 34 (2 self)
Retrieving related graphs containing a query graph from a large graph database is a key issue in many graph-based applications, such as drug discovery and structural pattern recognition. Because subgraph isomorphism is an NP-complete problem [4], we have to employ a filter-and-verification framework to speed up the search, that is, use an effective and efficient pruning strategy to first filter out as many false positives (graphs that cannot be in the results) as possible, then validate the remaining candidates by subgraph isomorphism checking. In this paper, we propose a novel filtering method, a spectral encoding method called GCoding. Specifically, we assign a signature to each vertex based on its local structures. Then, we generate a spectral graph code by combining all vertex signatures in a graph. Based on spectral graph codes, we derive a necessary condition for subgraph isomorphism. Then we propose two pruning rules for the subgraph search problem, and prove that they satisfy the no-false-negative requirement (no dismissal in answers). Since graph codes are in numerical space, we take advantage of this and conduct efficient filtering over graph codes. Extensive experiments show that GCoding outperforms existing counterpart methods.
On Graph Query Optimization in Large Networks
Cited by 34 (3 self)
The dramatic proliferation of sophisticated networks has resulted in a growing need for supporting effective querying and mining methods over such large-scale graph-structured data. At the core of many advanced network operations lies a common and critical graph query primitive: how to search graph structures efficiently within a large network? Unfortunately, the graph query is hard due to the NP-complete nature of subgraph isomorphism. It becomes even more challenging when the network examined is large and diverse. In this paper, we present a high-performance graph indexing mechanism, SPath, to address the graph query problem on large networks. SPath leverages decomposed shortest paths around vertex neighborhoods as basic indexing units, which prove to be both effective in graph search space pruning and highly scalable in index construction and deployment. Via SPath, a graph query is processed and optimized beyond the traditional vertex-at-a-time fashion to a more efficient path-at-a-time way: the query is first decomposed into a set of shortest paths, among which a subset of candidates with good selectivity is picked by a query plan optimizer; candidate paths are then joined together to help recover the query graph and finalize the graph query processing. We evaluate SPath against the state-of-the-art GraphQL on both real and synthetic data sets. Our experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method in addressing graph queries on large networks.
Comparing Stars: On Approximating Graph Edit Distance
, 2009
Cited by 29 (0 self)
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision, etc. Unfortunately, the problem of graph edit distance computation is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of our bounds in filtering and searching of graphs.
Neighborhood based fast graph search in large networks
 in SIGMOD
, 2011
Cited by 26 (1 self)
Complex social and information network search is becoming important in a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search for the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that could avoid costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity match is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large net
GADDI: Distance index based subgraph matching in biological networks
 In Proceedings of the 12th International Conference on Extending Database Technology (EDBT ’09)
, 2009
Cited by 25 (2 self)
Currently, a huge amount of biological data can be naturally represented by graphs, e.g., protein interaction networks, gene regulatory networks, etc. Indexing large graphs is an urgent research problem of great practical importance. The main challenge is size. Each graph may contain thousands of vertices (or more). Most of the previous work focuses on indexing a set of small or medium-sized database graphs (with only tens of vertices) and finding whether a query graph occurs in any of these. In this paper, we are interested in finding all the matches of a query graph in a given large graph of thousands of vertices, which is a very important task in many biological applications. This increases the complexity significantly. We propose a novel distance measurement which reintroduces the idea of frequent substructures in a single large graph. We devise a novel structure-distance-based approach (GADDI) to efficiently find matches of the query graph. GADDI is further optimized by the use of a dynamic matching scheme to minimize redundant calculations. Last but not least, a number of real and synthetic data sets are used to evaluate the efficiency and scalability of our proposed method.
GBASE: A Scalable and General Graph Management System
Cited by 24 (5 self)
Graphs appear in numerous applications including cyber-security, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are commonplace. How to store such large graphs efficiently? What are the core operations/queries on those graphs? How to answer the graph queries quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MAPREDUCE/HADOOP. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space and accelerates queries. We ran numerous experiments on real graphs, spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable and nimble, with significant savings in space and time.
Towards proximity pattern mining in large graphs
 In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. SIGMOD ’10. ACM
, 2010
Cited by 13 (3 self)
Mining graph patterns in large networks is critical to a variety of applications such as malware detection and biological module discovery. However, frequent subgraphs are often ineffective at capturing the associations existing in these applications, due to the complexity of isomorphism testing and the inelastic pattern definition. In this paper, we introduce the proximity pattern, which is a significant departure from the traditional concept of frequent subgraphs. Defined as a set of labels that co-occur in neighborhoods, a proximity pattern blurs the boundary between itemset and structure. It relaxes the rigid structure constraint of frequent subgraphs, while introducing connectivity to frequent itemsets. Therefore, it can benefit from both: efficient mining in itemsets and structure proximity from graphs. We developed two models to define proximity patterns. The second one, called Normalized Probabilistic Association (NmPA), is able to transform a complex graph mining problem into a simplified probabilistic itemset mining problem, which can be solved efficiently by a modified FP-tree algorithm, called pFP. NmPA and pFP are evaluated on real-life social and intrusion networks. Empirical results show that the approach not only finds interesting patterns that are ignored by the existing approaches, but also achieves high performance for finding proximity patterns in large-scale graphs.
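The notion of a label set that co-occurs in neighborhoods can be made concrete with a deliberately naive count. The sketch below assumes a neighborhood means a vertex plus its direct neighbors and measures the fraction of vertices whose neighborhood contains every label in the pattern. It is not NmPA or pFP, whose definitions the abstract does not give; `neighborhood_labels` and `support` are hypothetical names.

```python
def neighborhood_labels(adj, labels, node):
    """Set of labels found on `node` and its direct neighbors (assumed neighborhood)."""
    return {labels[node]} | {labels[n] for n in adj[node]}

def support(adj, labels, pattern):
    """Fraction of vertices whose neighborhood contains every label in `pattern`."""
    pattern = set(pattern)
    hits = sum(1 for v in adj if pattern <= neighborhood_labels(adj, labels, v))
    return hits / len(adj)

# Small labeled graph: a star centered on node 1, plus an isolated node 4.
adj = {1: [2, 3], 2: [1], 3: [1], 4: []}
labels = {1: "a", 2: "b", 3: "c", 4: "a"}

ab = support(adj, labels, {"a", "b"})  # neighborhoods of nodes 1 and 2 qualify
bc = support(adj, labels, {"b", "c"})  # only the neighborhood of node 1 qualifies
```

Labels `b` and `c` are never adjacent to each other, yet they still co-occur around node 1, which illustrates how a label-set pattern relaxes the edge-by-edge structure that a frequent subgraph would require.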