Learning classifiers from distributed, semantically heterogeneous, autonomous data sources
2004
Cited by 7 (3 self)
Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition.
However, we observe that most of the learning algorithms use only certain statistics computed from data in the process of generating the hypothesis that they output and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact in that the classifiers produced by them are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location) and they compare favorably to their centralized counterparts in terms of time and communication complexity.
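The exactness claim above can be illustrated with a small sketch. It assumes a Naive Bayes learner, whose sufficient statistics are simple counts, and horizontally partitioned data; the function names and toy records are hypothetical and are not taken from the thesis or from AirlDM.

```python
from collections import Counter


def local_counts(rows):
    """Compute Naive Bayes sufficient statistics at one data source.

    Each row is a tuple of attribute values followed by a class label.
    Returns class counts and (attribute index, value, class) counts.
    """
    class_counts = Counter()
    joint_counts = Counter()
    for *attrs, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            joint_counts[(i, value, label)] += 1
    return class_counts, joint_counts


def merge(stats):
    """Combine statistics from several sources by adding their counts."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in stats:
        total_class.update(class_counts)   # Counter.update adds counts
        total_joint.update(joint_counts)
    return total_class, total_joint


# Hypothetical toy records held at two separate sites.
site_a = [("sunny", "hot", "no"), ("rainy", "mild", "yes")]
site_b = [("sunny", "mild", "yes"), ("rainy", "hot", "no"), ("sunny", "hot", "no")]

# Only the counts travel to the learner, never the raw records.
distributed = merge([local_counts(site_a), local_counts(site_b)])
centralized = local_counts(site_a + site_b)
```

Because the merged counts equal the counts computed over the pooled data, any classifier derived from them is the same in both settings, and each site ships only its counts rather than its records.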
To deal with the semantic heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. We show how these constraints can be used to define mappings and conversion functions needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective. This is then used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data.
The work described above contributed to the design and implementation of AirlDM, a collection of data-source-independent machine learning algorithms built on sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.
Differential Association Rule Mining for the Study of Protein-Protein Interaction Networks
In Proceedings 4th Workshop on Data Mining in Bioinformatics at SIGKDD, 2004
Cited by 4 (2 self)
Protein-protein interactions are of great interest to biologists. A variety of high-throughput techniques have been devised, each of which leads to a separate definition of an interaction network. The concept of differential association rule mining is introduced to study the annotations of proteins in the context of one or more interaction networks. Differences among items across edges of a network are explicitly targeted. As a second step we identify differences between networks that are separately defined on the same set of nodes. The technique of differential association rule mining is applied to the comparison of protein annotations within an interaction network and between different interaction networks. In both cases we were able to find rules that explain known properties of protein interaction networks as well as rules that show promise for advanced study.
FAT-miner: Mining frequent attribute trees
In: SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing, 2007
Cited by 3 (2 self)
Data that can conceptually be viewed as tree structures abounds in domains such as bio-informatics, web logs, XML databases and multi-relational databases. Besides structural information such as nodes and edges, tree-structured data also often contains attributes that represent properties of nodes. Current algorithms for finding frequent patterns in structured data do not take these attributes into account, and hence potentially useful information is neglected. We present FAT-miner, an algorithm for frequent pattern discovery in tree-structured data with attributes. To illustrate the applicability of FAT-miner, we use it to explore the properties of good and bad loans in a well-known multi-relational financial database.
Experiments with MRDTL – a multi-relational decision tree learning algorithm
University of Alberta, 2002
Cited by 3 (0 self)
www.cs.iastate.edu/~honavar/aigroup.html
We describe experiments with an implementation of the multi-relational decision tree learning (MRDTL) algorithm for induction of decision trees from relational databases using an approach proposed by Knobbe et al. [1999a]. Our results show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations, including Progol [Muggleton, 1995], FOIL [Quinlan, 1993], and Tilde [Blockeel, 1998]. Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, could be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets such as those used in the KDD Cup 2001 data mining competition [Cheng et al., 2002].
UNIC: UNique Item Counts for Association Rule Mining in Relational Data
2004
Cited by 2 (1 self)
Association rule mining (ARM) can be generalized to relational data by using joined relations as a basis. We demonstrate that such an approach typically results in an overwhelming number of rules that reflect nothing but trivial properties of the data. Worse, even rules that appear interesting may be due to combinations of rather general statistical properties of the data. We introduce an ARM algorithm, UNIC, that systematically excludes any influence of correlations among items that represent the same real-world quantity but belong to different entities. The concept of UNIC is to base the ARM algorithm on items that are unique to only one entity instance of a joined relation. This strategy is highly effective at eliminating undesired contributions to rule metrics like support and confidence, while achieving most pruning even before frequent item sets are computed.
ReMauve: A Relational Model Tree Learner
Cited by 1 (0 self)
Model trees are a special case of regression trees in which linear regression models are predicted in the leaves. Little attention has been paid to model trees in relational learning, mainly because the task of learning linear regression equations in this context involves dealing with non-determinacy of predictive attributes. Whereas existing approaches handle this non-determinacy issue either by selecting a single value or by aggregating over all values, in this paper we present a model tree learning system that tries to combine both.
ALGORITHMS FOR NON-PARAMETRIC CLASSIFIERS IN MULTI-RELATIONAL DATA MINING
2006
Over the last decades, due to the advances in information technologies, both the industrial and scientific communities have acquired large volumes of data in digital form. Most of these data sets are stored using relational databases consisting of multiple tables and associations. Moreover, the data used in fields such as bio-informatics and computational biology, as well as HTML and XML documents, are relational in nature. However, most of the existing approaches to knowledge discovery in databases assume that the data are stored in a single table. Therefore, new algorithms are needed in order to exploit the relational information provided in these data sets. This thesis proposes two novel solutions to the task of supervised classification in relational domains, based on traditional non-parametric classifiers and built upon relational algebra. The first approach is based on Kernel Density Estimation, and the second technique is based on Gaussian Mixture Models. Both techniques are evaluated using three real-world relational data sets, drawn from the fields of organic chemistry, medicine and genetics.
Discretization Numerical Data for Relational Data with One-to-Many Relations
Problem statement: Handling numerical data stored in a relational database differs from handling numerical data stored in a single table, because an individual record can occur multiple times in the non-target table (one-to-many association) and relations between tables are non-determinate. Numbers in Multi-Relational Data Mining (MRDM) were often discretized after considering the schema of the relational database. This study examined the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers. Approach: Different alternatives for dealing with continuous attributes in MRDM were considered in this study, namely equal-width (EWD), equal-height (EH), equal-weight (EWG) and entropy-based (EB). The discretization procedures considered in this study included algorithms that did not depend on the multi-relational structure of the data and also algorithms that are sensitive to this structure. A new method of discretization, called the entropy-instance-based (EIB) discretization method, was implemented and evaluated with respect to C4.5 on two well-known multi-relational databases: the Mutagenesis dataset and the Hepatitis dataset for Discovery Challenge PKDD 2005. Results: When the number of bins, b, is large (b = 8), the entropy-instance-based discretization method produced better data summarization results compared to the other discretization methods in the Mutagenesis dataset. In
Complex Patterns in Streams (COMPASS) Open Competition Project NWO
2009
In recent years there has been a growing interest in the study and analysis of data streams. Typical examples of such streams include Internet traffic data and continuous sensor readings. Traditional data mining approaches are not suitable for mining such streams, because they assume static data stored in a database, whereas streams are continuous, high-speed, and unbounded. Therefore, streams must be analyzed as they are produced, and high-quality online results need to be guaranteed. Until now, most pattern mining techniques either focus on non-streaming data or consider only very simple patterns, such as identifying the hot items from one stream or constantly maintaining the frequencies in a window sliding over the stream. The challenge we set in this project is to extend the existing state-of-the-art techniques in two orthogonal directions: on the one hand, the mining of more complex patterns in streams, such as sequential patterns and evolving graph patterns, and on the other hand, more natural stream support measures that take into account the temporal nature of most data streams. The developed techniques will be tested on real-life data, such as social network data and the World-Wide Web. In addition to those datasets, the project will have access to the data streams generated by a sensor network mounted on a large bridge in The Netherlands.
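As a point of reference for the "very simple patterns" mentioned above, the following is a minimal sketch of maintaining item frequencies in a fixed-size window sliding over a stream. The class name, window size and example stream are assumptions made for illustration; this is the baseline the project aims to go beyond, not a technique from the project itself.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep exact item frequencies over the last `size` stream items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()    # items currently inside the window
        self.counts = Counter()  # frequency of each item in the window

    def add(self, item):
        """Consume one stream item and evict the oldest if the window is full."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]  # keep memory bounded by the window size

    def frequency(self, item):
        """Return how often `item` occurs in the current window."""
        return self.counts[item]


# Hypothetical stream of five symbols with a window of three items.
w = SlidingWindowCounter(3)
for symbol in "aabca":
    w.add(symbol)
```

Each update costs constant time and memory stays proportional to the window size, which is what makes this kind of measure feasible on unbounded streams.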

