Results 1–10 of 18
Subgroup Discovery with CN2-SD
 Journal of Machine Learning Research
, 2004
Abstract

Cited by 52 (10 self)
discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The paper presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on 23 UCI data sets shows a substantial reduction in the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under the ROC curve, when compared with the CN2 algorithm. Application of CN2-SD to a large traffic accident data set confirms these findings.
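The abstract above notes that CN2-SD modifies CN2's search heuristic; the measure commonly used for this trade-off between coverage and statistical unusualness is weighted relative accuracy. A minimal sketch of the (unweighted) measure, with invented example counts:

```python
def wracc(n_cond, n_cond_class, n_total, n_class):
    """Weighted relative accuracy of a rule Cond -> Class.

    WRAcc = p(Cond) * (p(Class|Cond) - p(Class)):
    rule coverage times the gain in class probability over the default.
    """
    if n_cond == 0:
        return 0.0
    p_cond = n_cond / n_total
    p_class_given_cond = n_cond_class / n_cond
    p_class = n_class / n_total
    return p_cond * (p_class_given_cond - p_class)

# A rule covering 40 of 200 examples, 30 of them positive,
# where 80 of the 200 examples are positive overall:
# WRAcc = 0.2 * (0.75 - 0.4) = 0.07
score = wracc(40, 30, 200, 80)
```

A rule scores well only if it is both large (high coverage) and unusual (class distribution inside the subgroup differs from the population), which is exactly the balance the abstract describes.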
Comparative Evaluation of Approaches to Propositionalization
, 2003
Abstract

Cited by 38 (2 self)
Propositionalization has already been shown to be a promising approach for robustly and effectively handling relational data sets for knowledge discovery. In this paper, we compare up-to-date methods for propositionalization from two main groups: logic-oriented and database-oriented techniques. Experiments using several learning tasks, both ILP benchmarks and tasks from recent international data mining competitions, show that both groups have their specific advantages. While logic-oriented methods can handle complex background knowledge and provide expressive first-order models, database-oriented methods can be more efficient, especially on larger data sets. The obtained accuracies vary in such a way that combining the features produced by both groups appears to be a further valuable venture.
Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction
 Proceedings of the 14th International Conference on Inductive Logic Programming (ILP)
, 2004
Abstract

Cited by 24 (8 self)
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. Our research has focused on Information Extraction (IE), a task that typically involves many more negative examples than positive examples. IE is the process of finding facts in unstructured text, such as biomedical journals, and putting those facts in an organized system. In particular, we have focused on learning to recognize instances of the protein-localization relationship in Medline abstracts. We view the problem as a machine-learning task: given positive and negative extractions from a training corpus of abstracts, learn a logical theory that performs well on a held-aside testing set. A common way to measure performance in these domains is to use precision and recall instead of simply using accuracy. We propose Gleaner, a randomized search method which collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an "at least N of these M clauses" thresholding method to combine the selected clauses. We compare Gleaner to ensembles of standard Aleph theories and find that Gleaner produces comparable test-set results in a fraction of the training time needed for ensembles.
Naive Bayesian Classification of Structured Data
, 2003
Abstract

Cited by 21 (0 self)
In this paper we present 1BC and 1BC2, two systems that perform naive Bayesian classification of structured individuals. The approach of 1BC is to project the individuals along first-order features. These features are built from the individual using structural predicates referring to related objects (e.g. atoms within molecules), and properties applying to the individual or one or several of its related objects (e.g. a bond between two atoms). We describe an individual in terms of elementary features consisting of zero or more structural predicates and one property; these features are treated as conditionally independent in the spirit of the naive Bayes assumption. 1BC2 represents an alternative first-order upgrade to the naive Bayesian classifier by considering probability distributions over structured objects (e.g., a molecule as a set of atoms), and estimating those distributions from the probabilities of its elements (which are assumed to be independent). We present a unifying view on both systems in which 1BC works in language space, and 1BC2 works in individual space. We also present a new, efficient recursive algorithm improving upon the original propositionalisation approach of 1BC. Both systems have been implemented in the context of the first-order descriptive learner Tertius, and we investigate the differences between the two systems both in computational terms and on artificially generated data. Finally, we describe a range of experiments on ILP benchmark data sets demonstrating the viability of our approach.
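The core idea behind 1BC2, estimating the probability of a structured object from the probabilities of its elements assumed independent, can be sketched as follows; the classes, element probabilities, and molecule are invented toy values, not from the paper:

```python
from math import log

def log_likelihood(elements, element_logprobs):
    """Log-probability of a structured object (e.g. a molecule as a
    multiset of atom types) under the assumption that its elements
    are drawn independently, in the spirit of 1BC2."""
    return sum(element_logprobs[e] for e in elements)

def classify(elements, class_priors, per_class_logprobs):
    """Pick the class maximising log P(class) + log P(elements | class)."""
    return max(
        class_priors,
        key=lambda c: log(class_priors[c])
        + log_likelihood(elements, per_class_logprobs[c]),
    )

# Hypothetical toy model: two classes with different atom distributions.
priors = {"active": 0.5, "inactive": 0.5}
logp = {
    "active":   {"c": log(0.5), "o": log(0.3), "n": log(0.2)},
    "inactive": {"c": log(0.8), "o": log(0.1), "n": log(0.1)},
}
molecule = ["c", "o", "o", "n"]
label = classify(molecule, priors, logp)
```

Working in log space avoids underflow when objects contain many elements; the independence assumption is exactly the naive Bayes simplification the abstract describes, lifted from attribute values to the elements of a structured individual.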
Knowledge-Based Sampling for Subgroup Discovery
 Local Pattern Detection. Volume 3539 of Lecture Notes in Computer Science
, 2005
Abstract

Cited by 6 (2 self)
Subgroup discovery aims at finding interesting subsets of a classified example set that deviate from the overall distribution. The search is guided by a so-called utility function, trading the size of subsets (coverage) against their statistical unusualness. By choosing the utility function accordingly, subgroup discovery is well suited to finding interesting rules with much smaller coverage and bias than is possible with standard classifier induction algorithms. Smaller subsets can be considered local patterns, but this work uses yet another definition: according to it, global patterns consist of all patterns reflecting the prior knowledge available to a learner, including all previously found patterns.
Evolutionary fuzzy rule induction process for subgroup discovery: A case study in marketing
 Transactions on Fuzzy Systems
, 2007
Abstract

Cited by 3 (0 self)
This paper presents a genetic fuzzy system for the data mining task of subgroup discovery, the subgroup discovery iterative genetic algorithm (SDIGA), which obtains fuzzy rules for subgroup discovery in disjunctive normal form. This kind of fuzzy rule allows us to represent knowledge about patterns of interest in an explanatory and understandable form that can be used by the expert. Experimental evaluation of the algorithm and a comparison with other subgroup discovery algorithms show the validity of the proposal. SDIGA is applied to a marketing problem studied at the University of Mondragón, Spain, in which it is necessary to automatically extract relevant and interesting information that helps to improve fair planning policies. The application of SDIGA to this problem allows us to obtain novel and valuable knowledge for experts. Index Terms: data mining, descriptive induction, evolutionary algorithms, genetic fuzzy systems, subgroup discovery.
Sequential Data Mining: A Comparative Case Study in Development of Atherosclerosis Risk Factors
Abstract

Cited by 1 (0 self)
Sequential data represent an important source of potentially new medical knowledge. However, this type of data is rarely provided in a format suitable for immediate application of conventional mining algorithms. This paper summarizes and compares three different sequential mining approaches based, respectively, on windowing, episode rules, and inductive logic programming. Windowing is one of the essential methods of data preprocessing. Episode rules represent general sequential mining, while inductive logic programming extracts first-order features whose structure is determined by background knowledge. The three approaches are demonstrated and evaluated on a case study, STULONG, a longitudinal preventive study of atherosclerosis in which the data consist of a series of long-term observations recording the development of risk factors and associated conditions. The intention is to identify frequent sequential/temporal patterns. Possible relations between the patterns and the onset of any of the observed cardiovascular diseases are also studied. Index Terms: anachronism, episode rules, inductive logic programming, temporal pattern, trend analysis, windowing.
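Windowing as a preprocessing step can be sketched in a few lines: a fixed-size window slides over the time-ordered observation sequence, yielding flat subsequences that conventional, non-sequential mining algorithms can treat as independent examples. The risk-factor labels below are illustrative, not taken from STULONG:

```python
def windows(events, size, step=1):
    """Slide a fixed-size window over a time-ordered event sequence,
    yielding overlapping subsequences of length `size`."""
    for start in range(0, len(events) - size + 1, step):
        yield events[start:start + size]

# One patient's hypothetical series of risk-factor observations:
series = ["normal_bp", "high_bp", "high_bp", "high_chol", "smoking"]
result = list(windows(series, size=3))
# -> [['normal_bp', 'high_bp', 'high_bp'],
#     ['high_bp', 'high_bp', 'high_chol'],
#     ['high_bp', 'high_chol', 'smoking']]
```

The window size bounds how far apart two observations may be and still co-occur in one example, which is the main modelling decision this kind of preprocessing imposes.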
Leveraging Network Effects for Predictive Modelling in Customer Relationship Management
Abstract

Cited by 1 (0 self)
Predictive modelling and classification problems are important analytical tasks in Customer Relationship Management (CRM). CRM analysts typically do not have information about how customers interact with each other. Phone carriers are an exception: these companies accumulate huge amounts of telephone calling records providing information not only about the usage behaviour of a single customer, but also about how customers interact with each other. In this paper, we do not measure network effects; instead, we analyze techniques to improve classification tasks in CRM by leveraging network effects. In contrast to traditional classification algorithms, we try to take into account information about a customer's communication network neighbors in order to better predict usage behavior. The presumption in our experiment is that a customer's SMS (Short Message Service) usage also depends on the SMS usage of his social network. However, analysing huge amounts of call detail data which exhibit a graph structure poses new challenges for predictive modelling. In our work, we focus on ways to improve predictive modelling and classification by leveraging data about the social network of a customer. We describe the results of an experiment using real-world data from a cell phone provider and benchmark the results against traditional approaches.
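One way to leverage network neighbors, in the spirit described above, is to augment each customer's own usage with aggregates over their call-graph neighbourhood. A minimal sketch with invented feature names and toy data (the paper's actual features are not specified in the abstract):

```python
def neighbour_usage_features(customer, graph, usage):
    """Build a feature dict combining a customer's own SMS usage with
    aggregates over their call-graph neighbours."""
    neighbours = graph.get(customer, [])
    neighbour_usage = [usage[n] for n in neighbours]
    return {
        "own_sms": usage[customer],
        "n_neighbours": len(neighbours),
        "mean_neighbour_sms": (
            sum(neighbour_usage) / len(neighbour_usage)
            if neighbour_usage else 0.0
        ),
    }

# Toy call graph: which customers exchange calls/SMS with which others.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
usage = {"a": 10, "b": 40, "c": 20}
features = neighbour_usage_features("a", graph, usage)
# -> {'own_sms': 10, 'n_neighbours': 2, 'mean_neighbour_sms': 30.0}
```

Feature vectors of this shape can then be fed to any traditional classifier, which is how network information enters an otherwise standard predictive-modelling pipeline.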
Efficient Conversion of a Multi-Relational Database into a Single Relation
Abstract
... ILP impractical. For example, ILP algorithms often require the user to delimit the representation language with complicated declarations. Adapting existing tools to users' demands for simple operation is a (perhaps surprisingly) difficult task, as witnessed for example by the recent failure of a project aimed at integrating the well-known ILP program Progol [11] into the Clementine software package [3]. When knowledge-discovery practitioners face the task of finding relationships spread across several relations of a database, they typically join the required relations into a single table, for example by means of a database query, which opens the way to applying the wide range of single-relation data mining tools. Because the size of the resulting table generally grows very quickly with the number of joined relations, the database query used usually includes restricting conditions based on the user's intuition. With this approach, however, the dimensions of the resulting relation usually cannot be estimated in advance, nor is it easy to determine whether the chosen type of join loses some important part of the data. ...
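The join-based conversion described above, flattening several relations into one table, can be sketched with an in-memory SQLite database; the table and column names are illustrative, not from the paper:

```python
import sqlite3

# Two related tables are flattened into a single relation via a join,
# at the cost of the result growing with every additional join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE purchase(customer_id INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'Brno'), (2, 'Praha');
    INSERT INTO purchase VALUES (1, 'book'), (1, 'pen'), (2, 'book');
""")
rows = con.execute("""
    SELECT c.id, c.city, p.item
    FROM customer c JOIN purchase p ON p.customer_id = c.id
""").fetchall()
# One row per (customer, purchase) pair; with each further one-to-many
# join the row count multiplies, which is the blow-up the text warns about.
```

This is why, as the text notes, such queries are usually augmented with restricting conditions, and why the size of the result is hard to predict in advance.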