Results 1  10
of
71
Substructure similarity search in graph databases
 In SIGMOD
, 2005
"... Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, cheminformatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact mat ..."
Abstract

Cited by 85 (6 self)
 Add to MetaCart
(Show Context)
Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, cheminformatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact matching is often too restrictive, similarity search of complex structures becomes a vital operation that must be supported efficiently. In this paper, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed missing features, our structural filtering algorithm, called Grafil, can filter many graphs without performing pairwise similarity computations. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy for filtering. By examining the effect of different feature selection mechanisms, we develop a multifilter composition strategy, where each filter uses a distinct and complementary subset of the features. We identify the criteria to form effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly. Moreover, the concept presented in Grafil can be applied to searching approximate nonconsecutive sequences, trees, and other complicated structures as well. 1.
Summarization System Evaluation Revisited: NGram Graphs
 ACM TRANSACTIONS ON SPEECH AND LANGUAGE PROCESSING
, 2008
"... This article presents a novel automatic method (AutoSummENG) for the evaluation of summarization systems, based on comparing the character ngram graphs representation of the extracted summaries and a number of model summaries. The presented approach is language neutral, due to its statistical natur ..."
Abstract

Cited by 36 (16 self)
 Add to MetaCart
This article presents a novel automatic method (AutoSummENG) for the evaluation of summarization systems, based on comparing the character ngram graphs representation of the extracted summaries and a number of model summaries. The presented approach is language neutral, due to its statistical nature, and appears to hold a level of evaluation performance that matches and even exceeds other contemporary evaluation methods. Within this study, we measure the effectiveness of different representation methods, namely, word and character ngram graph and histogram, different ngram neighborhood indication methods as well as different comparison methods between the supplied representations. A theory for the a priori determination of the methods ’ parameters along with supporting experiments concludes the study to provide a complete alternative to existing methods concerning the automatic summary system evaluation process.
A Binary Linear Programming Formulation of the Graph Edit Distance
"... A binary linear programming formulation of the graph edit distance for unweighted, undirected graphs with vertex attributes is derived and applied to a graph recognition problem. A general formulation for editing graphs is used to derive a graph edit distance that is proven to be a metric provided t ..."
Abstract

Cited by 25 (3 self)
 Add to MetaCart
(Show Context)
A binary linear programming formulation of the graph edit distance for unweighted, undirected graphs with vertex attributes is derived and applied to a graph recognition problem. A general formulation for editing graphs is used to derive a graph edit distance that is proven to be a metric provided the cost function for individual edit operations is a metric. Then, a binary linear program is developed for computing this graph edit distance, and polynomial time methods for determining upper and lower bounds on the solution of the binary program are derived by applying solution methods for standard linear programming and the assignment problem. A recognition problem of comparing a sample input graph to a database of known prototype graphs in the context of a chemical information system is presented as an application of the new method. The costs associated with various edit operations are chosen by using a minimum normalized variance criterion applied to pairwise distances between nearest neighbors in the database of prototypes. The new metric is shown to perform quite well in comparison to existing metrics when applied to a database of chemical graphs.
Comparing Stars: On Approximating Graph Edit Distance
, 2009
"... Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer visio ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
(Show Context)
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision etc. Unfortunately, the problem of graph edit distance computation is NPHard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms AppFull and AppSub are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Result shows that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.
Correlation Search in Graph Databases
 KDD'07
, 2007
"... Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, the research of correlation mining from graph databases is still lacking despite the fact that graph data, especially in various scientific domains, ..."
Abstract

Cited by 20 (6 self)
 Add to MetaCart
Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, the research of correlation mining from graph databases is still lacking despite the fact that graph data, especially in various scientific domains, proliferate in recent years. In this paper, we propose a new problem of correlation mining from graph databases, called Correlated Graph Search (CGS). CGS adopts Pearson’s correlation coefficient as a correlation measure to take into consideration the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derive two necessary conditions which set bounds on the occurrence probability of a candidate in the database. With this result, we design an efficient algorithm that operates on a much smaller projected database and thus we are able to obtain a significantly smaller set of candidates. To further improve the efficiency, we develop three heuristic rules and apply them on the candidate set to further reduce the search space. Our extensive experiments demonstrate the effectiveness of our method on candidate reduction. The results also justify the efficiency of our algorithm in mining correlations from large real and synthetic datasets.
Featurebased similarity search in graph structures
 ACM TODS
"... Similarity search of complex structures is an important operation in graphrelated applications since exact matching is often too restrictive. In this article, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation r ..."
Abstract

Cited by 16 (0 self)
 Add to MetaCart
(Show Context)
Similarity search of complex structures is an important operation in graphrelated applications since exact matching is often too restrictive. In this article, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed feature misses, our structural filtering algorithm can filter graphs without performing pairwise similarity computation. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy that could maximize the filtering capability. We prove that the complexity of optimal feature set selection is Ω(2 m) in the worst case, where m is the number of features for selection. In practice, we identify several criteria to build effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly within a multifilter composition framework. The proposed featurebased filtering concept can be generalized and applied to searching approximate nonconsecutive sequences, trees, and other structured data as well. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems – Query processThis is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version.
Searching Substructures with Superimposed Distance
"... Efficient indexing techniques have been developed for the exact and approximate substructure search in large scale graph databases. Unfortunately, the retrieval problem of structures with categorical or geometric distance constraints is not solved yet. In this paper, we develop a method called PIS ( ..."
Abstract

Cited by 16 (0 self)
 Add to MetaCart
(Show Context)
Efficient indexing techniques have been developed for the exact and approximate substructure search in large scale graph databases. Unfortunately, the retrieval problem of structures with categorical or geometric distance constraints is not solved yet. In this paper, we develop a method called PIS (Partitionbased Graph Index and Search) to support similarity search on substructures with superimposed distance constraints. PIS selects discriminative fragments in a query graph and uses an index to prune the graphs that violate the distance constraints. We identify a criterion to distinguish the selectivity of fragments in multiple graphs and develop a partition method to obtain a set of highly selective fragments, which is able to improve the pruning performance. Experimental results show that PIS is effective in processing real graph queries.
Extraction and Search of Chemical Formulae in Text Documents on the Web
 Proceedings of the 16th International World Wide Web Conference (WWW 2007
, 2007
"... Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of key ..."
Abstract

Cited by 13 (5 self)
 Add to MetaCart
(Show Context)
Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of keywords during searching results in two problems for this domain: a) if the author searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like “He ” return all documents where Helium is mentioned as well as documents where the pronoun “he” occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical
SociaLite: Datalog Extensions for Efficient Social Network Analysis
"... Abstract — With the rise of social networks, largescale graph analysis becomes increasingly important. Because SQL lacks the expressiveness and performance needed for graph algorithms, lowerlevel, generalpurpose languages are often used instead. For greater ease of use and efficiency, we propose ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
(Show Context)
Abstract — With the rise of social networks, largescale graph analysis becomes increasingly important. Because SQL lacks the expressiveness and performance needed for graph algorithms, lowerlevel, generalpurpose languages are often used instead. For greater ease of use and efficiency, we propose SociaLite, a highlevel graph query language based on Datalog. As a logic programming language, Datalog allows many graph algorithms to be expressed succinctly. However, its performance has not been competitive when compared to lowlevel languages. With SociaLite, users can provide highlevel hints on the data layout and evaluation order; they can also define recursive aggregate functions which, as long as they are meet operations, can be evaluated incrementally and efficiently. We evaluated SociaLite by running eight graph algorithms (shortest paths, PageRank, hubs and authorities, mutual neighbors, connected components, triangles, clustering coefficients, and betweenness centrality) on two reallife social graphs, LiveJournal and Last.fm. The optimizations proposed in this paper speed up almost all the algorithms by 3 to 22 times. SociaLite even outperforms typical Java implementations by an average of 50 % for the graph algorithms tested. When compared to highly optimized Java implementations, SociaLite programs are an order of magnitude more succinct and easier to write. Its performance is competitive, giving up only 16 % for the largest benchmark. Most importantly, being a query language, SociaLite enables many more users who are not proficient in software engineering to make social network queries easily and efficiently. I.