Correlation Search in Graph Databases
KDD'07, 2007
Cited by 11 (4 self)
Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, research on correlation mining from graph databases is still lacking, despite the fact that graph data, especially in various scientific domains, have proliferated in recent years. In this paper, we propose a new problem of correlation mining from graph databases, called Correlated Graph Search (CGS). CGS adopts Pearson’s correlation coefficient as a correlation measure to take into consideration the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derive two necessary conditions which set bounds on the occurrence probability of a candidate in the database. With this result, we design an efficient algorithm that operates on a much smaller projected database and thus we are able to obtain a significantly smaller set of candidates. To further improve the efficiency, we develop three heuristic rules and apply them on the candidate set to further reduce the search space. Our extensive experiments demonstrate the effectiveness of our method on candidate reduction. The results also justify the efficiency of our algorithm in mining correlations from large real and synthetic datasets.
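As a minimal sketch of the measure named in this abstract (not the paper's search algorithm, bounds, or heuristic rules), Pearson's correlation coefficient between two patterns can be computed from 0/1 occurrence indicators, one per graph in the database. The function name `occurrence_correlation` and the toy vectors are hypothetical, and the sketch assumes subgraph containment has already been decided for every graph.

```python
from math import sqrt


def occurrence_correlation(occ_a, occ_b):
    """Pearson correlation of two 0/1 occurrence vectors over the same database.

    occ_a[i] is 1 if pattern A occurs in graph i, else 0 (likewise occ_b).
    For binary indicators this reduces to
        (P(A,B) - P(A) P(B)) / sqrt(P(A)(1 - P(A)) P(B)(1 - P(B))).
    """
    if len(occ_a) != len(occ_b) or not occ_a:
        raise ValueError("occurrence vectors must be non-empty and equally long")
    n = len(occ_a)
    p_a = sum(occ_a) / n                                   # occurrence probability of A
    p_b = sum(occ_b) / n                                   # occurrence probability of B
    p_ab = sum(a * b for a, b in zip(occ_a, occ_b)) / n    # joint occurrence probability
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        # A pattern that occurs in no graph or in every graph has zero variance.
        raise ValueError("correlation is undefined for a constant occurrence vector")
    return (p_ab - p_a * p_b) / denom


# Hypothetical database of four graphs.
print(occurrence_correlation([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(occurrence_correlation([1, 1, 0, 0], [0, 0, 1, 1]))  # -1.0
```

Working with occurrence probabilities rather than raw vectors makes it visible why bounds on a candidate's occurrence probability, as the abstract describes, can prune candidates before any correlation is computed.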
Low-Entropy Set Selection
Cited by 2 (1 self)
Most pattern discovery algorithms easily generate very large numbers of patterns, making the results impossible to understand and hard to use. Recently, the problem of instead selecting a small subset of informative patterns from a large collection of patterns has attracted a lot of interest. In this paper we present a succinct way of representing data on the basis of itemsets that identify strong interactions. This new approach, LESS, provides a more powerful and more general technique for data description than existing approaches. Low-entropy sets consider the data symmetrically and as such identify strong interactions between attributes, not just between items that are present. Selection of these patterns is executed through the MDL criterion. This results in only a handful of sets that together form a compact lossless description of the data. By using entropy-based elements for the data description, we can successfully apply the maximum likelihood principle to locally cover the data optimally. Further, it allows for a fast, natural and well-performing heuristic. Based on these approaches we present two algorithms that provide high-quality descriptions of the data in terms of strongly interacting variables. Experiments on these methods show that high-quality results are mined: very small pattern sets are returned that are easily interpretable and understandable descriptions of the data, and can be straightforwardly visualized. Swap randomization experiments and high compression ratios show that they capture the structure of the data well.
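The idea of a low-entropy set can be illustrated with a small sketch: compute the joint entropy of a chosen attribute set over binary rows. A low value means the attributes take few distinct value combinations, whether the items are present or absent. The function `joint_entropy` and the toy rows are assumptions for illustration only; this is not the LESS selection procedure or its MDL criterion.

```python
from collections import Counter
from math import log2


def joint_entropy(rows, attrs):
    """Empirical joint entropy (in bits) of the attributes `attrs` over `rows`.

    rows  : list of equal-length tuples of 0/1 values (one tuple per transaction)
    attrs : indices of the attributes that form the candidate set
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    # Count how often each combination of values occurs on the chosen attributes.
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical data: attributes 0 and 1 always agree, attribute 2 varies freely.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(joint_entropy(rows, [0, 1]))  # 1.0 bit: only two combinations occur
print(joint_entropy(rows, [0, 2]))  # 2.0 bits: all four combinations occur
```

Note that the set {0, 1} scores low even though half of its rows contain neither item, which is the symmetric treatment of presence and absence that the abstract describes.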
Discovering Spatial Interaction Patterns
2007
Cited by 1 (1 self)
tutorial article, which has been submitted for publication in a journal or for consideration by the commissioning organization. The report represents the ideas of its author, and should not be taken as the official views of the School or the University. Any discussion of the content of the report should be sent to the author, at the address shown on the cover.
An Information-Theoretic Approach to Quantitative Association Rule Mining
Quantitative Association Rule (QAR) mining has been recognized as an influential research problem over the last decade due to the popularity of quantitative databases and the usefulness of association rules in real life. Unlike Boolean Association Rules (BARs), which only consider boolean attributes, QARs consist of quantitative attributes which contain much richer information than the boolean attributes. However, the combination of these quantitative attributes and their value intervals always gives rise to the generation of an explosively large number of itemsets, thereby severely degrading the mining efficiency. In this paper, we propose an information-theoretic approach to avoid unrewarding combinations of both the attributes and their value intervals being generated in the mining process. We study the mutual information between the attributes in a quantitative database and devise a normalization on the mutual information to make it applicable in the context of QAR mining. To indicate the strong informative relationships among the
Network-wide Information Correlation and Exploration (NICE): Framework, Applications, and Experience
Scalable event detection and troubleshooting capabilities are critical for ensuring high levels of network reliability and performance. Although network operations systems are typically well designed for dealing with hard network outages (e.g., link failures), detecting and analyzing chronic conditions, particularly those associated with short-term performance impairments, remains challenging. Detecting and troubleshooting such conditions typically requires detailed analysis of data collected from different monitoring tools to obtain a comprehensive view of network events. This is typically performed manually, making it an imperfect, time-consuming and costly process. The ability to perform correlations is a fundamental yet powerful building block when it comes to analyzing multiple data series collectively. We present a novel framework, NICE (Network-wide Information Correlation and Exploration), that scalably analyzes network-wide statistical event correlations. The core components of NICE include a flexible infrastructure for pair-wise correlation testing as well as tools for subsequent analysis of resulting correlation patterns and automatic drill-down for surprising correlations. On top of our core NICE infrastructure, we have prototyped two exciting applications: (i) for troubleshooting known problems, and (ii) for discovering undesirable modes of network operation that may traditionally have been flying under the operations team’s radar, yet potentially impacting customers. We evaluate the accuracy of NICE by examining several data streams from a tier-1 ISP backbone network. We also present case studies that demonstrate the efficacy of our toolkit by revealing surprising correlations for the same tier-1 network. The NICE methodology and algorithms promise to be of immense use to network operators in analyzing network behavior and identifying anomalous network conditions.
On Mining Statistically Significant Attribute Association Information
Knowledge of the association information between the attributes in a data set provides insight into the underlying structure of the data and explains the relationships (independence, synergy, redundancy) between the attributes. Complex models learnt computationally from the data are more interpretable to a human analyst when such interdependencies are known. In this paper, we focus on mining two types of association information among the attributes: correlation information and interaction information, which capture multivariate dependencies between the data attributes. Identifying the statistically significant attribute associations is a computationally challenging task: the number of possible associations increases exponentially, and many associations contain redundant information when a number of correlated attributes are present. In this paper, we explore efficient data mining methods to discover non-redundant attribute sets that contain significant association information indicating the presence of informative patterns in the data.
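To make the synergy and redundancy vocabulary concrete, here is a hedged sketch of three-way interaction information computed from empirical entropies. The abstract does not give its definitions, so the sign convention used here, I(X;Y;Z) = I(X;Y|Z) - I(X;Y) with positive values read as synergy and negative values as redundancy, is an assumption, as are the function names and the toy data. No significance testing is attempted.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical joint entropy (bits) of the attribute indices `attrs`."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def interaction_information(rows, x, y, z):
    """I(X;Y|Z) - I(X;Y), expanded into joint entropies.

    Assumed convention: > 0 suggests synergy, < 0 redundancy,
    0 no three-way interaction.
    """
    pairs = entropy(rows, [x, y]) + entropy(rows, [x, z]) + entropy(rows, [y, z])
    singles = entropy(rows, [x]) + entropy(rows, [y]) + entropy(rows, [z])
    return pairs - entropy(rows, [x, y, z]) - singles


# Hypothetical data sets over three binary attributes.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]  # Z = X xor Y
copy_rows = [(0, 0, 0), (1, 1, 1)]                          # X = Y = Z

print(interaction_information(xor_rows, 0, 1, 2))   # 1.0  (synergy)
print(interaction_information(copy_rows, 0, 1, 2))  # -1.0 (redundancy)
```

The XOR case shows why pairwise measures alone can miss structure: every pair of attributes is independent, yet the three together are fully determined.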
Correlation-Based Methods for Biological Data Cleaning
2007
Data overload, combined with widespread use of automated large-scale analysis and mining, results in a rapid depreciation of the world’s data quality. Data cleaning is an emerging domain that aims at improving data quality through the detection and elimination of data artifacts. These data artifacts comprise errors, discrepancies, redundancies, ambiguities, and incompleteness that hamper the efficacy of analysis or data mining.
Despite its importance, data cleaning remains neglected in certain knowledge-driven domains. One such example is bioinformatics: biological data are often used uncritically, without considering the errors or noise contained within, and research on both the “causes” of data artifacts and the corresponding data cleaning remedies is lacking. In this thesis, we conduct an in-depth study of what constitutes data artifacts in real-world biological databases. To the best of our knowledge, this is the first complete investigation of the data quality factors in biological data. The result of our study indicates that the biological data quality problem is by nature multifactorial and requires a number of different data cleaning approaches. While some existing data cleaning methods are directly applicable to certain artifacts, others, such as annotation errors and multiple duplicate relations, have not been studied. This provides the inspiration for us to devise new data cleaning methods.
Current data cleaning approaches derive observations of data artifacts from the values of independent attributes and records. On the other hand, the correlation patterns between the attributes provide additional information about the relationships among the entities embedded within a data set. In this thesis, we exploit the correlations between data entities to identify data artifacts that existing data cleaning methods fall short of addressing. We propose three novel data cleaning methods for detecting outliers and duplicates, and further apply them to real-world biological data as proofs of concept.
Traditional outlier detection approaches rely on the rarity of the target attribute or records. While rarity may be a good measure for class outliers, for attribute outliers rarity may not equate to abnormality. The ODDS (Outlier Detection from Data Subspaces) method utilizes deviating correlation patterns for the identification of common yet abnormal attributes. Experimental validation shows that it can achieve an accuracy of up to 88%.
The ODDS method is further extended to XODDS, an outlier detection method for semi-structured data models such as XML, which is rapidly emerging as a new standard for data representation and exchange on the World Wide Web (WWW). In XODDS, we leverage the hierarchical structure of XML to provide additional context information, enabling knowledge-based data cleaning. Experimental validation shows that the contextual information in XODDS elevates both the efficiency and the effectiveness of detecting outliers.
Traditional duplicate detection methods regard the duplicate relation as a boolean property. Moreover, different types of duplicates exist, some of which cannot be trivially merged. Our third contribution, the correlation-based duplicate detection method, induces rules from associations between attributes in order to identify different types of duplicates.
Correlation-based methods aimed at resolving data cleaning problems are conceptually new. This thesis demonstrates they are effective in addressing some data artifacts that cannot be tackled by existing data cleaning techniques, with evidence of practical applications to real-world biological databases.
Effective Ranking of XML Keyword Search Results (Extended Version)
The popularity of XML has exacerbated the need for an easy-to-use, high-precision query interface for XML data. When traditional document-oriented keyword search techniques do not suffice, natural language interfaces and keyword search techniques that take advantage of XML structure make it very easy for ordinary users to query XML databases. Unfortunately, current approaches to processing these queries rely heavily on heuristics that are intuitively appealing but ultimately ad hoc. These approaches often retrieve false positive answers, overlook correct answers, and cannot rank answers appropriately. To address these problems for data-centric XML, we propose coherency ranking, a domain- and database design-independent ranking method for XML keyword queries that is based on an extension of the concepts of data dependencies and mutual information. With coherency ranking, the results of a keyword query are invariant under schema reorganization. We analyze the way in which previous approaches to XML keyword search approximate coherency ranking, and present efficient algorithms to process queries and rank their answers using coherency ranking. Our empirical evaluation with two real-world XML data sets shows that coherency ranking has better precision and recall and provides better ranking than all previous approaches. Coherency ranking can also be used

