Results 1 - 6 of 6
Classification Spanning Correlated Data Streams
Abstract

Cited by 2 (0 self)
In many applications, classifiers need to be built based on multiple related data streams. For example, stock streams and news streams are related, where the classification patterns may involve features from both streams. Thus, instead of mining a single isolated stream, we need to examine multiple related data streams in order to find such patterns and build an accurate classifier. Other examples of related streams include traffic reports and car accidents, sensor readings of different types or at different locations, etc. In this paper, we consider the classification problem defined over a sliding-window join of several input data streams. As the data streams arrive at a fast pace and the many-to-many join relationship inflates the data arrival rate even more, it is impractical to compute the join and then rebuild the classifier each time the window slides forward. We present an efficient algorithm to build a Naïve Bayesian classifier in this context. Our method does not need to perform the join operations but still builds exactly the same classifier as if it were built on the joined result. It examines each input tuple only twice, independent of the number of tuples it joins with in other streams, and is therefore able to keep pace with fast-arriving data streams in the presence of many-to-many join relationships. The experiments confirmed that our classification algorithm is more efficient than conventional methods while maintaining good classification accuracy.
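The factoring idea behind this abstract can be illustrated with a toy sketch. Because a Naïve Bayes model only needs per-class and per-(feature, class) counts, those counts over a join can be accumulated from per-stream tallies weighted by join fan-out, without ever materializing the joined tuples. The streams, field names, and single-pass structure below are hypothetical illustrations, not the paper's actual two-pass algorithm:

```python
from collections import defaultdict

# Toy streams (hypothetical): the class label and one feature live in
# stream A, another feature lives in stream B; tuples join on "key".
A = [{"key": 1, "cls": "up", "fa": "hi"},
     {"key": 1, "cls": "down", "fa": "lo"},
     {"key": 2, "cls": "up", "fa": "hi"}]
B = [{"key": 1, "fb": "x"}, {"key": 1, "fb": "y"}, {"key": 2, "fb": "x"}]

def nb_counts_via_join(A, B):
    """Reference method: materialize the join, then count."""
    cls_cnt, fa_cnt, fb_cnt = defaultdict(int), defaultdict(int), defaultdict(int)
    for a in A:
        for b in B:
            if a["key"] == b["key"]:
                cls_cnt[a["cls"]] += 1
                fa_cnt[(a["fa"], a["cls"])] += 1
                fb_cnt[(b["fb"], a["cls"])] += 1
    return cls_cnt, fa_cnt, fb_cnt

def nb_counts_without_join(A, B):
    """Join-free method: weight each tuple by its join fan-out."""
    # Per-key tallies gathered in one pass over each stream.
    b_per_key = defaultdict(int)
    fb_per_key = defaultdict(lambda: defaultdict(int))
    for b in B:
        b_per_key[b["key"]] += 1
        fb_per_key[b["key"]][b["fb"]] += 1
    a_cls_per_key = defaultdict(lambda: defaultdict(int))
    for a in A:
        a_cls_per_key[a["key"]][a["cls"]] += 1

    cls_cnt, fa_cnt, fb_cnt = defaultdict(int), defaultdict(int), defaultdict(int)
    # Each A-tuple is counted once, weighted by how many B-tuples it joins.
    for a in A:
        w = b_per_key[a["key"]]
        cls_cnt[a["cls"]] += w
        fa_cnt[(a["fa"], a["cls"])] += w
    # Each B-feature is weighted by the class mix of matching A-tuples.
    for k, fbs in fb_per_key.items():
        for fb, n in fbs.items():
            for c, m in a_cls_per_key[k].items():
                fb_cnt[(fb, c)] += n * m
    return cls_cnt, fa_cnt, fb_cnt
```

Both routes produce identical sufficient statistics, so a Naïve Bayes classifier trained from the join-free counts is exactly the classifier that would be built on the joined result.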
Distribution-free Bounds for Relational Classification
Abstract

Cited by 1 (1 self)
Statistical Relational Learning (SRL) is a subarea of Machine Learning that addresses the problem of performing statistical inference on data that is correlated, rather than independently and identically distributed (i.i.d.) as is generally assumed. For the traditional i.i.d. setting, distribution-free bounds exist, such as the Hoeffding bound, which provide confidence bounds on the generalization error of a classification algorithm given its hold-out error on a sample of size N. Bounds of this form do not currently exist for the types of interactions in the data that are considered by relational classification algorithms. In this paper we extend the Hoeffding bounds to the relational setting. In particular, we derive distribution-free bounds for certain classes of data generation models that do not produce i.i.d. data and are based on the types of interactions considered by the relational classification algorithms developed in SRL. We conduct empirical studies on synthetic and real data which show that these data generation models are indeed realistic and that the derived bounds are tight enough for practical use.
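For concreteness, the i.i.d. Hoeffding bound referred to in this abstract is easy to compute: with probability at least 1 - delta, the true generalization error lies within epsilon of the hold-out error on N examples. A minimal sketch (the function name is mine):

```python
import math

def hoeffding_epsilon(n, delta, value_range=1.0):
    """Two-sided Hoeffding deviation: with probability >= 1 - delta, the
    empirical mean of n independent observations lying in an interval of
    width value_range is within epsilon of the true mean."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# e.g. a hold-out set of N = 10,000 examples at 95% confidence gives
# epsilon of roughly 0.0136, i.e. about 1.4 percentage points.
eps = hoeffding_epsilon(10000, 0.05)
```

Note the characteristic 1/sqrt(N) rate: quadrupling the hold-out set halves the bound, and nothing about the data distribution is assumed beyond independence and boundedness.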
Test Set Bounds for Relational Data that Vary with Strength of Dependence
Abstract
A large portion of the data collected in application domains such as online social networking, finance, and biomedicine is relational in nature. A subfield of Machine Learning, namely Statistical Relational Learning (SRL), is concerned with performing statistical inference on relational data. A defining property of relational data that separates it from independently and identically distributed (i.i.d.) data is the existence of correlations between individual data points. A major portion of the theory developed in machine learning assumes the data is i.i.d. In this paper we develop theory for the relational setting. In particular, we derive distribution-free bounds on the generalization error of a classifier for the relational setting, where the class of data generation models we consider is inspired by the types of joint distributions represented by the relational classification models developed in the SRL community. A key aspect of the bound we derive is that its tightness is a function of the strength of dependence between related data points, with the bound reducing to the standard Hoeffding or McDiarmid inequality when there is no dependence. To the best of our knowledge, this is the first bound for relational data whose tightness varies with the strength of dependence. Moreover, the bound provides insight into the computation of effective sample size, an important notion introduced by Jensen and Neville (2002).
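The effective-sample-size notion mentioned at the end of this abstract can be illustrated with the classic Kish-style design-effect formula for clustered data. This is a standard textbook illustration of the general idea, not the bound derived in the paper:

```python
import math

def effective_sample_size(n, group_size, rho):
    """Kish-style effective sample size: n points arranged in groups of
    `group_size` related points with within-group correlation rho in [0, 1].
    rho = 0 recovers n itself (i.i.d. case); rho = 1 collapses each group
    to a single effective observation."""
    return n / (1.0 + (group_size - 1) * rho)

def hoeffding_epsilon(n_eff, delta):
    """Plugging the effective sample size into the standard two-sided
    Hoeffding deviation yields a bound that loosens as dependence grows."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_eff))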
COMPUTATIONAL TECHNIQUES FOR INFERRING REGULATORY NETWORKS
Abstract
To Mom, for making this dream possible, Ian, for supporting and sharing it, and Lillian, for making it all worthwhile. In this era where healthcare is one of the world’s largest and fastest-growing industries, there is great interest in understanding what is happening within our cells and organs at the molecular level. Fortunately, innovations and improvements in technology continue to spur the quantity and types of high-throughput biological data (data from processes where large numbers of samples can be measured by a system at once) that can be measured. Additionally, abundant information from many years of detailed research can be found in annotated or computationally extracted databases. These data sets, especially when combined, have great potential for novel discoveries that can lead to advances in biology and medicine. The main focus of this thesis is the investigation of machine learning techniques for inferring gene regulatory networks from the combination of high-throughput time-series gene expression array data and other data sources. A gene regulatory network is a collection ...
Research on Statistical Relational Learning
Abstract
This paper presents an overview of the research on learning statistical models of relational data being carried out at the University of Washington. Our work falls into five main directions: learning models of social networks; learning models of sequential relational processes; scaling up statistical relational learning to massive data sources; learning for knowledge integration; and learning programs in procedural languages. We describe some of the common themes and research issues arising from this work.
Editor:??
Abstract
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. In this paper we present a method that can semi-automatically enhance a wide class of existing learning algorithms so they can learn from such high-speed data streams in real time. The method works by sampling just enough data from the data stream to make each decision required by the learning process. The method is applicable to essentially any induction algorithm based on discrete search. After its application, the algorithm: learns from data streams in an incremental, anytime fashion; runs in time independent of the amount of data seen, while making decisions that are essentially identical to those that would be made from infinite data; uses a constant amount of RAM no matter how much data it sees; and adjusts its learned models in a fine-grained manner as the data-generating process changes over time. We evaluate our method by using it to produce two systems: the VFDT system for learning decision trees from massive data streams, and CVFDT, an enhanced version of VFDT that keeps the trees it learns up-to-date as the data-generating process changes over time. We evaluate these learners with extensive studies on synthetic data sets, and by mining the continuous stream of Web access data from the whole University of Washington main campus.
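The "just enough data per decision" mechanism can be sketched concretely: in VFDT, a node splits as soon as the gap between the best and second-best attribute's heuristic score exceeds a Hoeffding bound, so the choice made on a finite sample almost surely matches the one that would be made on infinite data. A minimal sketch (function and parameter names are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Deviation epsilon such that, after n observations of a quantity
    with the given range, the true mean is within epsilon of the observed
    mean with probability >= 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """VFDT-style decision: split once the observed gap between the best
    and second-best attribute's score (e.g. information gain) exceeds the
    Hoeffding bound, i.e. the ranking is unlikely to change with more data."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

Because the bound shrinks as 1/sqrt(n), a node that has not yet seen enough examples simply waits, which is what makes the learner anytime and keeps its per-example cost independent of the total amount of data seen.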