## Distribution-based aggregation for relational learning with identifier attributes (2004)

Venue: Machine Learning

Citations: 33 (10 self)

### BibTeX

@INPROCEEDINGS{Perlich04distribution-basedaggregation,
author = {Claudia Perlich and Foster Provost},
title = {Distribution-based aggregation for relational learning with identifier attributes},
booktitle = {Machine Learning},
year = {2004},
pages = {2006}
}

### Abstract

Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of related entities, from which predictive information must be captured. This paper focuses on aggregation from categorical attributes that can take many values (e.g., object identifiers). We present a novel aggregation method, implemented as part of the relational learning system ACORA, that combines the use of vector distance and meta-data about the class-conditional distributions of attribute values. We provide a theoretical foundation for this approach by deriving a “relational fixed-effect” model within a Bayesian framework, and discuss the implications of identifier aggregation for the expressive power of the induced model. One advantage of using identifier attributes is the circumvention of limitations caused either by missing/unobserved object properties or by independence assumptions. Finally, we show empirically that the novel aggregators can generalize in the presence of identifier (and other high-dimensional) attributes, and we also explore the limits of the methods' applicability.
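
The idea of combining class-conditional distributions with a vector distance can be illustrated with a minimal sketch. This is not ACORA's implementation: the function names, the choice of cosine similarity, and the toy data are all assumptions made for illustration. The sketch estimates one reference distribution of identifier values per class and turns each bag into two numeric features, one similarity per class.

```python
from collections import Counter
import math


def class_conditional_distributions(bags, labels):
    """Estimate, per class, the relative frequency of each identifier value."""
    counts = {}
    for bag, label in zip(bags, labels):
        counts.setdefault(label, Counter()).update(bag)
    dists = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        dists[label] = {value: n / total for value, n in counter.items()}
    return dists


def cosine_to_reference(bag, reference):
    """Cosine similarity between a bag's count vector and a reference distribution."""
    case = Counter(bag)
    dot = sum(n * reference.get(value, 0.0) for value, n in case.items())
    norm_case = math.sqrt(sum(n * n for n in case.values()))
    norm_ref = math.sqrt(sum(p * p for p in reference.values()))
    if norm_case == 0 or norm_ref == 0:
        return 0.0
    return dot / (norm_case * norm_ref)


# Hypothetical toy data: each bag holds the identifiers linked to one target case.
bags = [["b1", "b2"], ["b1", "b1", "b3"], ["b4"], ["b4", "b5"]]
labels = [1, 1, 0, 0]

refs = class_conditional_distributions(bags, labels)
# One (similarity to class 1, similarity to class 0) pair per target case.
features = [
    (cosine_to_reference(bag, refs[1]), cosine_to_reference(bag, refs[0]))
    for bag in bags
]
```

The resulting pairs could be passed to any conventional learner. In a realistic setting the reference distributions would be estimated on training cases only, so that a case's own bag does not leak into its features.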

### Citations

4931 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...A estimates a classification model and makes predictions. Feature selection, model estimation, and prediction use conventional approaches including logistic regression, the decision tree learner C4.5 =-=[43]-=-, and naive Bayes (using the WEKA package [46]), and are not discussed further in this paper. 3.1 Aggregation using Distributional Meta-Data The result of the join (on CID) of the two tables in our ex... |

2963 |
Data mining : practical machine learning tools and techniques
- Witten, Hall, et al.
- 2011
Citation Context ...redictions. Feature selection, model estimation, and prediction use conventional approaches including logistic regression, the decision tree learner C4.5 [43], and naive Bayes (using the WEKA package =-=[46]-=-), and are not discussed further in this paper. 3.1 Aggregation using Distributional Meta-Data The result of the join (on CID) of the two tables in our example database (step 7 in the pseudocode) is p... |

670 |
A theory and methodology of inductive learning
- Michalski
- 1983
Citation Context ...on aggregates like MODE and MAX). This might prove to be an interesting starting point for theoretical work on the expressiveness of relational models. Traditional work on constructive induction (CI) =-=[32]-=- stressed the importance of the relationship between induction and representation and the intertwined search for a good representation. CI focused initially on the capability of “formulating new descr... |

455 | Inductive logic programming: Theory and methods
- Muggleton, De Raedt
- 1994
Citation Context ...erators. Furthermore, statistical relational model estimation typically treats aggregation as a preprocessing step that is independent of the model estimation process. In Inductive Logic Programming (=-=[35]-=-) and logic-based propositionalization, aggregation of one-to-many relationships is achieved through existential quantification and is part of the active search through the model space. Propositionali... |

435 | The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms - Bradley - 1997 |

384 | Enhanced hypertext categorization using hyperlinks
- Chakrabarti, Dom, et al.
- 1998
Citation Context ...em. More specifically, is the collection of training data itself part of the background knowledge that will be available for prediction? This view is often appropriate for networked domains ([29],[9],=-=[5]-=-). 4 Formal Analysis and Implications We suggested distance-based aggregates to address a particular problem: the aggregation of categorical variables of high cardinality. The empirical results in Sec... |

327 | Mining the network value of customers
- Domingos, Richardson
- 2001
Citation Context ...roblem. More specifically, is the collection of training data itself part of the background knowledge that will be available for prediction? This view is often appropriate for networked domains ([29],=-=[9]-=-,[5]). 4 Formal Analysis and Implications We suggested distance-based aggregates to address a particular problem: the aggregation of categorical variables of high cardinality. The empirical results in... |

277 | Meta-analysis in clinical trials
- DerSimonian, Laird
- 1986
Citation Context ...get case. A similar distinction has been made in traditional statistical estimation. 4.2 A Relational Fixed-Effect Model Statistical estimation contrasts random-effect models from fixed-effect models =-=[8]-=-. In a random-effect model, model parameters are not assumed to be constant but instead to be drawn from different distributions for different observations. Estimating one distribution for each bag co... |

210 | FOIL: A Midterm Report
- Quinlan, Cameron-Jones
- 1993
Citation Context ... identifier attributes, they also have no information except for the few attributes in EBook and IPO. To illustrate, we compare (on the IPO domain) ACORA to four logic-based relational learners (FOIL =-=[44]-=-, TILDE [2], Lime [31], and Progol [34]). Since ILP systems typically (with the exception of TILDE) only predict the class, not the probability of class membership, we compare in Table 12 the accuracy... |

188 | Probabilistic frame-based systems
- Koller, Pfeffer
- 1998
Citation Context ...with stronger relation skew). 5.5 Comparison to Other Relational Learners We do not report a comprehensive study comparing ACORA to a wide variety of statistical relational modeling approaches (e.g., =-=[21]-=- [36] [16] [42]). This paper focuses on novel aggregation methods; ACORA is a vehicle for applying and studying these methods. We conjecture that these new aggregators ought to improve other relationa... |

169 | Automating the construction of internet portals with machine learning
- McCallum, Nigam, et al.
Citation Context ... converts a domain explicitly into a graph representation and finds related information using breadth-first search for graph traversal. As an example to illustrate this process we use the CORA domain =-=[30]-=-, a bibliographic database of machine learning papers (see Section 7). CORA comprises three tables: Paper, Author and Citation, as shown in Figure 4. We do not use the text of the papers, only the cit... |

164 | Adaptive fraud detection
- Fawcett, Provost
- 1997
Citation Context ...g. Naive Bayes stores class-conditional likelihoods for each attribute. In fraud detection, distributions of normal activity have been stored, to produce variables indicating deviations from the norm =-=[10]-=-. Aggregates like the mean and the standard deviation of related numeric values also summarize the underlying distribution; under the assumption of normality those two aggregates fully describe the di... |

122 | Real world performance of association rule algorithms
- Zheng, Kohavi, et al.
- 2001
Citation Context ... { if (rand()<0.25){$num=10000+rand()*99000;} else{$num=rand()*100000} } $num=int $num; print REL "n$i,n$num\n"; $c=$c-1; } $i=$i+1; } close TAR; close REL; Customer Behavior (KDD): Blue Martini =-=[49]-=- published, together with the data for the KDDCUP 2000, three additional customer data sets to evaluate the performance of association rule algorithms. We use the BMS-WebView-1 set of 59600 transactio... |

113 | A survey of kernels for structured data
- Gärtner
- 2003
Citation Context ...h bag) and a second step aggregates all distances through MIN. The recent convergence of relational learning and kernel methods has produced a variety of kernels for structured data, see for instance =-=[12]-=-. Structured kernels estimate distances between complex objects and are typically tailored towards a particular domain. This distance estimation also involves aggregation and often uses sums. Statisti... |

110 | Hypothesis-driven constructive induction in AQ17-HCI: A method and experiments
- Wnek, Michalski
- 1994
Citation Context ...n the capability of “formulating new descriptors” from a given set of original attributes using general or domain-specific constructive operators like AND, OR, MINUS, DIVIDE, etc. Wnek and Michalski =-=[47]-=- extended the definition of CI to include any change in the representation space while still focusing on propositional reformulations. Under the new definition, propositionalization and aggregation ca... |

105 | Probabilistic classification and clustering in relational data
- Taskar, Segal, et al.
- 2001
Citation Context ...re across the 7 model predictions. The figure compares ACORA to prior published results using Probabilistic Relational Models (PRM, [21]) based on both text and relational information (as reported by =-=[45]-=-), and a Simple Relational Classifier (SRC, [29]) that assumes strong autocorrelation in the class labels (specifically, assuming that documents from a particular field will dominantly cite previously... |

103 |
Propositionalization approaches to relational data mining
- Kramer, Lavrac, et al.
- 2001
Citation Context ...tion space while still focusing on propositional reformulations. Under the new definition, propositionalization and aggregation can be seen as CI for relational domains as pointed out by [18, 33] and =-=[22]-=- for logic-based approaches. 7 Conclusion We have presented novel aggregation techniques for relational classification, which estimate class-conditional distributions to construct discrimi... |

95 | Linkage and autocorrelation cause feature selection bias in relational learning
- Jensen, Neville
- 2002
Citation Context ...aggregation can extend the expressive power significantly as shown empirically in Section 5. Identifiers have other interesting properties. They may often be the cause of relational auto-correlation (=-=[17]-=-). Because a customer bought the first part of the trilogy, he now wants to read how the story continues. Given such a concept, we expect to see auto-correlation between customers that are linked thro... |

82 | A simple relational classifier
- Macskassy, Provost
- 2003
Citation Context ...lar problem. More specifically, is the collection of training data itself part of the background knowledge that will be available for prediction? This view is often appropriate for networked domains (=-=[29]-=-,[9],[5]). 4 Formal Analysis and Implications We suggested distance-based aggregates to address a particular problem: the aggregation of categorical variables of high cardinality. The empirical result... |

69 |
Top-down induction of first-order logical decision trees
- Blockeel, De Raedt
- 1998
Citation Context ... N, and a particular value $v$ of attribute $T_{ji}$, the value of $B^f_{T_{ij}}$ at position $O(v)$ is equal to the number of occurrences $c_v$ of value $v$ in the bag: $B^f_{T_{ij}}[O(v)] = c_v$ (4). For example, $B^{C2}_{TYPE}$ = =-=[2, 1]-=- for $R_{TYPE}(C2, 1)$ = 〈Non-Fiction, Non-Fiction, Fiction〉, under the order O(Non-Fiction)=1, O(Fiction)=2. We will use the term case vector to mean this vector representation of the bag of values relat... |
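
The case-vector construction quoted in this context can be sketched in a few lines. The function name `case_vector` and the dictionary encoding of the order O are assumptions for illustration; positions are 1-based as in the quoted text, and values absent from the order are simply ignored.

```python
from collections import Counter


def case_vector(bag, order):
    """Return the count vector of a bag; order maps each value v to its 1-based position O(v)."""
    counts = Counter(bag)
    vector = [0] * len(order)
    for value, position in order.items():
        vector[position - 1] = counts.get(value, 0)
    return vector


# The example from the quoted passage.
order = {"Non-Fiction": 1, "Fiction": 2}
bag = ["Non-Fiction", "Non-Fiction", "Fiction"]
print(case_vector(bag, order))  # [2, 1]
```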

68 | Relational learning with statistical predicate invention: Better models for hypertext
- Craven, Slattery
Citation Context ...automatically. Besides special purpose methods (e.g., recency and frequency for direct marketing) only a few new aggregation-based feature construction methods have been proposed. Craven and Slattery =-=[7]-=- use Naive Bayes in combination with FOIL to construct features for hypertext classification. Perlich and Provost [39] use vector distances and class-conditional distributions for noisy relational dom... |

63 | A polynomial approach to the constructive induction of structural knowledge
- Kietz, Morik
- 1994
Citation Context ...he representation space while still focusing on propositional reformulations. Under the new definition, propositionalization and aggregation can be seen as CI for relational domains as pointed out by =-=[18, 33]-=- and [22] for logic-based approaches. 7 Conclusion We have presented novel aggregation techniques for relational classification, which estimate class-conditional distributions to construct discrimi... |

63 | Tree induction vs. logistic regression: a learning-curve analysis - Perlich, Provost, et al. - 2003 |

60 | Simple estimators for relational Bayesian classifiers
- Neville, Jensen, et al.
- 2003
Citation Context ...stronger relation skew). 5.5 Comparison to Other Relational Learners We do not report a comprehensive study comparing ACORA to a wide variety of statistical relational modeling approaches (e.g., [21] =-=[36]-=- [16] [42]). This paper focuses on novel aggregation methods; ACORA is a vehicle for applying and studying these methods. We conjecture that these new aggregators ought to improve other relational lea... |

54 | Aggregation-based feature invention and relational concept classes
- Perlich, Provost
- 2003
Citation Context ...d for relational feature construction, based on this analysis, including novel aggregation operators. To our knowledge, this is the first relational aggregation approach that can be applied generally to categorical attributes with high cardinality. (Footnote 1: This paper is an extension of the second half of =-=[39]-=-.) 3. It draws an analogy to the statistical distinction between random- and fixed-effect modeling,... |

46 |
Transformation-based learning using multirelational aggregation
- Krogel, Wrobel
- 2001
Citation Context ...l attribute with high cardinality poses a problem for aggregation. This has been recognized implicitly in prior work (see Section 6), but rarely addressed explicitly. Some relational learning systems =-=[24]-=- only consider attributes with cardinality of less than n, typically below 50; Woznica et al. [48] define standard attributes excluding keys, and many ILP systems require the explicit identification o... |

43 |
Characterizing the applicability of classification algorithms using meta-level learning
- Brazdil, Gama, et al.
- 1994
Citation Context ... once), unconditional prior of class 1, and average bag size. ... such as inherent discriminability, the number of features, the skew of the marginal class distribution (the class “prior”), and others =-=[4]-=-,[40]. Relational domains have additional characteristics; particularly important in our case are two: the skew in the relationship distribution and the average size of bags of related values. We alre... |

30 | Data mining in social networks
- Jensen, Neville
- 2002
Citation Context ...ger relation skew). 5.5 Comparison to Other Relational Learners We do not report a comprehensive study comparing ACORA to a wide variety of statistical relational modeling approaches (e.g., [21] [36] =-=[16]-=- [42]). This paper focuses on novel aggregation methods; ACORA is a vehicle for applying and studying these methods. We conjecture that these new aggregators ought to improve other relational learners... |

28 | Propositionalisation and aggregates
- Knobbe, Haas, et al.
Citation Context ...es in order to allow the application of standard statistical modeling techniques. The potential advantages of such a transformation, or “propositionalization,” approach have been discussed previously =-=[20]-=-. Aggregation of bags of values plays an important role in the transformation process, but only two types of automated aggregation are regularly used: (1) simple aggregates, such as the arithmetic ave... |

25 | Restructuring databases for knowledge discovery by consolidation and link formation
- Goldberg, Senator
- 1995
Citation Context ...ither in terms of parameters or increasing complexity) space of many possible solutions. Although aggregation has been identified as a fundamental problem for relational learning from real-world data =-=[14]-=-, machine learning research has considered only a limited set of aggregation operators. Furthermore, statistical relational model estimation typically treats aggregation as a preprocessing step that i... |

23 | Towards structural logistic regression: Combining relational and statistical learning
- Popescul, Ungar, et al.
- 2002
Citation Context ...elation skew). 5.5 Comparison to Other Relational Learners We do not report a comprehensive study comparing ACORA to a wide variety of statistical relational modeling approaches (e.g., [21] [36] [16] =-=[42]-=-). This paper focuses on novel aggregation methods; ACORA is a vehicle for applying and studying these methods. We conjecture that these new aggregators ought to improve other relational learners as w... |

22 |
Distance based approaches to relational learning and clustering
- Kirsten, Wrobel, et al.
- 2001
Citation Context ... choice of aggregation operator can have a much stronger impact on the resultant model’s generalization performance than the choice of the model induction method. Distance-based relational approaches =-=[19]-=- use simple aggregates such as MIN to aggregate distances between two bags of values. A first step estimates the distances between all possible pairs of objects (one element from each bag) and a secon... |
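
The two-step scheme described in this context (pairwise distances between elements of two bags, then aggregation through MIN) can be sketched as follows. The helper names and the absolute-difference distance are assumptions for illustration; the cited systems define their own distance functions.

```python
from itertools import product


def min_pairwise_distance(bag_a, bag_b, distance):
    """Step 1: distance for every cross-bag pair. Step 2: aggregate with MIN."""
    if not bag_a or not bag_b:
        raise ValueError("both bags must be non-empty")
    return min(distance(a, b) for a, b in product(bag_a, bag_b))


def abs_diff(a, b):
    """Hypothetical element-level distance for numeric values."""
    return abs(a - b)


# Pairs: |1-4|=3, |1-9|=8, |5-4|=1, |5-9|=4, so the MIN aggregate is 1.0.
result = min_pairwise_distance([1.0, 5.0], [4.0, 9.0], abs_diff)
```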

22 | CProgol4.4: a tutorial introduction
- Muggleton, Firth
- 2001
Citation Context ...no information except for the few attributes in EBook and IPO. To illustrate, we compare (on the IPO domain) ACORA to four logic-based relational learners (FOIL [44], TILDE [2], Lime [31], and Progol =-=[34]-=-). Since ILP systems typically (with the exception of TILDE) only predict the class, not the probability of class membership, we compare in Table 12 the accuracy as a function of training size. We als... |

21 | Naive bayesian classification of structured data
- Flach, Lachiche
Citation Context ...likelihood as the distance function, the relational fixed-effect model can be given a theoretical foundation within a general relational Bayesian framework very similar to that of Flach and Lachiche (=-=[11]-=-,[26]). In a relational context, a target object tt is not only described by its attributes, but it also has an identifier (CID in our example) that maps into bags of related objects from different ba... |

20 |
1BC2: a true first-order Bayesian classifier
- Lachiche, Flach
- 2003
Citation Context ...ihood as the distance function, the relational fixed-effect model can be given a theoretical foundation within a general relational Bayesian framework very similar to that of Flach and Lachiche ([11],=-=[26]-=-). In a relational context, a target object tt is not only described by its attributes, but it also has an identifier (CID in our example) that maps into bags of related objects from different backgro... |

13 | Statistical Relational Learning: Four Claims and a Survey
- Neville, Jensen, et al.
- 2003
Citation Context ...nces between complex objects and are typically tailored towards a particular domain. This distance estimation also involves aggregation and often uses sums. Statistical relational learning approaches =-=[37]-=- [15] include network models as well as upgrades of propositional models (e.g., Probabilistic Relational Models [21], Relational Bayesian Classifier [36], Relational Probability Trees [16]). They typi... |

12 |
Discovering Knowledge from Relational Data Extracted from Business News
- Bernstein, Clearwater, et al.
- 2002
Citation Context ... C(c, ĉ) where c is a vector of binary class labels and ĉ a vector of probabilities of class membership, we define the relational probability estimation task as finding a mapping $M^* : (T^f_t, RDB) \to$ =-=[0, 1]-=- from instances in $T_t$ (along with all information in RDB), subject to minimizing the cost in expectation over any possible set of target cases, $T^*_t$: $M^* = \arg\min_M E_{T^*_t}[C(T^*_{tc}, M(T^*_t, RDB))$... |

9 |
Facets of aggregation approaches to propositionalization
- Krogel, Wrobel
- 2003
Citation Context ... high cardinality: ACORA constructs COUNTS for the 10 most common values (an extended MODE) and counts for all values if the number of distinct values is at most 50 as suggested by Krogel and Wrobel =-=[25]-=-. ACORA generally includes an attribute representing the bag size as well as all original attributes from the target table. Feature construction: Table 6 summarizes the different aggregation methods. ... |
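
The count-based features described in this context can be sketched as below. This is an illustrative reading of the quoted rule, not ACORA's code: the function name, the parameter names `top_k` and `max_distinct`, and the either/or treatment of the two cases are assumptions. Each row holds one count per selected value, followed by the bag size.

```python
from collections import Counter


def count_features(bags, top_k=10, max_distinct=50):
    """Build per-bag count features plus a trailing bag-size feature."""
    totals = Counter(value for bag in bags for value in bag)
    if len(totals) <= max_distinct:
        # Few distinct values: keep a count column for every value.
        values = sorted(totals)
    else:
        # High cardinality: keep only the top_k most common values.
        values = [value for value, _ in totals.most_common(top_k)]
    rows = []
    for bag in bags:
        counts = Counter(bag)
        rows.append([counts.get(value, 0) for value in values] + [len(bag)])
    return values, rows


values, rows = count_features([["a", "b", "a"], ["c"]])
# values == ["a", "b", "c"]; rows == [[2, 1, 0, 3], [0, 0, 1, 1]]
```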

7 | Induction in first order logic from noisy training samples and fixed sample sizes
- McCreath
- 1999
Citation Context ..., they also have no information except for the few attributes in EBook and IPO. To illustrate, we compare (on the IPO domain) ACORA to four logic-based relational learners (FOIL [44], TILDE [2], Lime =-=[31]-=-, and Progol [34]). Since ILP systems typically (with the exception of TILDE) only predict the class, not the probability of class membership, we compare in Table 12 the accuracy as a function of trai... |

6 |
Comparative evaluation of approaches to propositionalization
- Železný, Flach, et al.
- 2003
Citation Context ...ancial task and a customer-classification problem (ECML 1998 discovery challenge) in comparison to Progol and Dinus [27], a logic-based propositionalization approach. Similar work by Krogel et al. =-=[23]-=- presents an empirical comparison of Boolean and numeric aggregation in propositionalization approaches across multiple domains, including synthetic and domains with low noise; however their results a... |

4 | Kernel-based distances for relational learning
- Woznica, Kalousis, et al.
- 2004
Citation Context ...itly in prior work (see Section 6), but rarely addressed explicitly. Some relational learning systems [24] only consider attributes with cardinality of less than n, typically below 50; Woznica et al. =-=[48]-=- define standard attributes excluding keys, and many ILP systems require the explicit identification of categorical values to be considered for equality tests, leaving the selection to the user. 2.1 D... |

3 | Tailoring representations to different requirements
- Morik
- 1999
Citation Context ...he representation space while still focusing on propositional reformulations. Under the new definition, propositionalization and aggregation can be seen as CI for relational domains as pointed out by =-=[18, 33]-=- and [22] for logic-based approaches. 7 Conclusion We have presented novel aggregation techniques for relational classification, which estimate class-conditional distributions to construct discrimi... |

2 |
Naive Bayesian classifier with ILP-R
- Pompe, Kononenko
- 1995
Citation Context ...merical attributes and MODE and COUNTS for categorical attributes with few possible values) or aggregate by creating Boolean features (e.g., Structural Logistic Regression [42], Naive Bayes with ILP =-=[41]-=-). Krogel and Wrobel [24] and Knobbe et al. [20] were to our knowledge the first to suggest the combination of such numerical aggregates and FOL clauses to propositionalize relational problems automat... |

1 |
New techniques for studying set languages, bag languages and aggregate functions
- Libkin, Wong
- 1994
Citation Context ...ited cardinality. Theoretical work outside of relational modeling investigates the extension of relational algebra [38] through aggregation; however it does not suggest new operators. Libkin and Wong =-=[28]-=- analyze the expressive power of relational languages with bag aggregates, based on a count operator and Boolean comparison (sufficient to express the common aggregates like MODE and MAX). This might ... |

1 |
Extending relational algebra and relational calculus with set-valued attributes and aggregate functions
- Özsoyoğlu, Özsoyoğlu, et al.
- 1987
Citation Context ...elated to our analysis in Section 4.2 but apply it only to normal attributes with limited cardinality. Theoretical work outside of relational modeling investigates the extension of relational algebra =-=[38]-=- through aggregation; however it does not suggest new operators. Libkin and Wong [28] analyze the expressive power of relational languages with bag aggregates, based on a count operator and Boolean co... |