Results 1–10 of 29
Empirical Bayes Screening for Multi-Item Associations
, 2001
Abstract

Cited by 56 (0 self)
This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each item set divided by a baseline frequency computed as if items occurred independently. The focus is on obtaining reliable estimates of this measure of interestingness for all item sets, even item sets with relatively low frequencies. For example, in a medical database of patient histories, unusual item sets including the item "patient death" (or other serious adverse event) might hopefully be flagged with as few as 5 or 10 occurrences of the item set, it being unacceptable to require that item sets occur in as many as 0.1% of millions of patient reports before the data mining algorithm detects a signal. Similar considerations apply in fraud detection applications. Thus we abandon the requirement that interesting item sets must contain a re...
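The interestingness measure the abstract describes, observed item-set frequency divided by the frequency expected if items occurred independently, can be sketched as follows. This is a minimal illustration of the ratio itself (the paper's contribution is the empirical Bayes shrinkage applied on top of it); the function name and toy data are hypothetical.

```python
def relative_frequency(transactions, itemset):
    """Observed frequency of an itemset divided by the baseline frequency
    expected under independence of its items. Illustrative sketch only;
    the paper additionally shrinks these ratios via empirical Bayes."""
    n = len(transactions)
    observed = sum(1 for t in transactions if itemset <= t) / n
    baseline = 1.0
    for item in itemset:
        baseline *= sum(1 for t in transactions if item in t) / n
    return observed / baseline if baseline > 0 else float("inf")

baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}]
print(relative_frequency(baskets, {"a", "b"}))  # 0.4 / (0.6 * 0.6) ≈ 1.11
```

Ratios well above 1 flag item sets occurring more often than independence predicts; the shrinkage step keeps low-count sets (the 5-to-10-occurrence case above) from producing unreliable extremes.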
Squashing Flat Files Flatter
, 1999
Abstract

Cited by 48 (3 self)
A feature of data mining that distinguishes it from "classical" machine learning (ML) and statistical modeling (SM) is scale. The community seems to agree on this yet progress to this point has been limited. We present a methodology that addresses scale in a novel fashion that has the potential for revolutionizing the field. While the methodology applies most directly to flat (row by column) data sets we believe that it can be adapted to other representations. Our approach to the problem is not to scale up individual ML and SM methods. Rather we prefer to leverage the entire collection of existing methods by scaling down the data set. We call the method squashing. Our method demonstrably outperforms random sampling and a theoretical argument suggests how and why it works well. Squashing consists of three modular steps: grouping, momentizing, and generating (GMG). These three steps describe the squashing pipeline whereby the original (very large data set) is sectioned off into mutual...
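The grouping, momentizing, and generating (GMG) pipeline can be sketched in miniature on one-dimensional data. This is a deliberately crude toy, assuming sort-based grouping and keeping only the first moment per group; real squashing matches higher-order moments so that the weighted pseudo-data reproduce the original data's statistics. All names here are hypothetical.

```python
import math

def squash(rows, n_groups):
    """Toy GMG pipeline: partition sorted rows into groups (grouping),
    keep each group's mean (momentizing, first moment only), and emit
    one pseudo-point per group weighted by its size (generating)."""
    rows = sorted(rows)
    size = math.ceil(len(rows) / n_groups)
    squashed = []
    for i in range(0, len(rows), size):
        group = rows[i:i + size]
        mean = sum(group) / len(group)
        squashed.append((mean, len(group)))   # (pseudo-point, weight)
    return squashed

print(squash([1, 2, 3, 10, 11, 12], 2))  # [(2.0, 3), (11.0, 3)]
```

Any weight-aware ML or SM method can then be run on the handful of weighted pseudo-points instead of the full data set, which is the sense in which existing methods are leveraged rather than rewritten.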
Exponential Language Models, Logistic Regression, and Semantic Coherence
 In Proceedings of the NIST/DARPA Speech Transcription Workshop
, 2000
Abstract

Cited by 7 (4 self)
In this paper, we modify the traditional trigram model by using utterance-level semantic coherence features in an exponential model. The semantic coherence features are collected by measuring the correlations among content-word pairs occurring in sentences of two corpora, the real corpus and a corpus generated by the baseline trigram model. The measure we use for estimating the semantic association of content word pairs is Yule's Q statistic. For our preliminary analysis, we have further simplified the modeling task by extracting a small set of statistics from the sentence-based Q statistics and applying them as features to the exponential model. We also simplified the process of obtaining the MLE solutions of the exponential models by approximating it with a logistic regression model. We account for the uncertainty in the estimates of Q by constructing confidence intervals. The new model results in a slight reduction in test-set perplexity. We also discuss and compare alternative mea...
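Yule's Q, the association measure named above, is a standard statistic on a 2×2 co-occurrence table and is straightforward to compute. A minimal sketch (the table-cell naming is the conventional one, not taken from the paper):

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 co-occurrence table of a word pair:
    a = both words present, b = only the first, c = only the second,
    d = neither. Q = (ad - bc) / (ad + bc), ranging from -1 to +1,
    with 0 indicating independence."""
    return (a * d - b * c) / (a * d + b * c)

print(yules_q(40, 10, 10, 40))  # strong positive association, ≈ 0.88
```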
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Abstract

Cited by 6 (0 self)
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s ∗ for a dataset, such that the number of itemsets with support at least s ∗ represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. Our methodology hinges on a Poisson approximation to ...
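The comparison the abstract describes, observed itemset counts against a random dataset with matched item frequencies, can be illustrated for pairs using a Poisson approximation to each pair's support count. This is a simplified sketch under stated assumptions (independent items, pairs only), not the paper's exact procedure; the function names are hypothetical.

```python
import math
from itertools import combinations

def expected_frequent_pairs(item_probs, n_trans, s):
    """Expected number of 2-itemsets with support count >= s in a random
    dataset of n_trans transactions where items occur independently with
    the given marginal probabilities. Each pair's support count is
    approximated as Poisson with mean n_trans * p_i * p_j."""
    def poisson_tail(lam, k):  # P[X >= k] for X ~ Poisson(lam)
        cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
        return 1.0 - cdf
    return sum(poisson_tail(n_trans * p * q, s)
               for p, q in combinations(item_probs, 2))
```

A threshold s ∗ could then be chosen as the smallest s at which the observed number of frequent itemsets substantially exceeds this expectation.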
Association rule discovery with unbalanced class
 in Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (AI03), Lecture Notes in Artificial Intelligence
, 2003
Abstract

Cited by 5 (4 self)
There are many methods for finding association rules in very large data. However, it is well known that most general association rule discovery methods find too many rules, which include a lot of uninteresting rules. Furthermore, the performance of many such algorithms deteriorates when the minimum support is low. They fail to find many interesting rules even when support is low, particularly in the case of significantly unbalanced classes. In this paper we present an algorithm which finds association rules based on a set of new interestingness criteria. The algorithm is applied to a real-world health data set and successfully identifies groups of patients with high risk of adverse reaction to certain drugs. A statistically guided method of selecting appropriate features has also been developed. Initial results have shown that the proposed algorithm can find interesting patterns from data sets with unbalanced class distributions without performance loss.
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Abstract

Cited by 5 (0 self)
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
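One ingredient of the subset-based condition, checking whether an itemset's frequency exceeds what any split into two subsets would predict under independence, can be sketched with point estimates. This is a simplified illustration only: the paper applies proper statistical tests rather than the raw comparison below, and the function name is hypothetical.

```python
from itertools import combinations

def productive(transactions, itemset):
    """Point-estimate check that an itemset's support exceeds the
    independence prediction of every binary partition into two subsets.
    A raw sketch of one self-sufficiency requirement; the paper replaces
    this direct comparison with statistical tests."""
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n
    whole = sup(itemset)
    for r in range(1, len(itemset)):
        for left in combinations(itemset, r):
            right = itemset - set(left)
            if whole <= sup(set(left)) * sup(right):
                return False
    return True
```

An itemset failing this check has a frequency already implied by its parts, which is the sense in which it "should be expected" and can be screened out.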
Defining interestingness for association rules
 In Int. journal of information theories and applications
Abstract

Cited by 4 (0 self)
Abstract: Interestingness in Association Rules has been a major topic of research in the past decade. The reason is that the strength of association rules, i.e. its ability to discover ALL patterns given some thresholds on support and confidence, is also its weakness. Indeed, a typical association rules analysis on real data often results in hundreds or thousands of patterns, creating a data mining problem of the second order. In other words, it is not straightforward to determine which of those rules are interesting for the end-user. This paper provides an overview of some existing measures of interestingness and we will comment on their properties. In general, interestingness measures can be divided into objective and subjective measures. Objective measures tend to express interestingness by means of statistical or mathematical criteria, whereas subjective measures of interestingness aim at capturing more practical criteria that should be taken into account, such as unexpectedness or actionability of rules. This paper focuses only on objective measures of interestingness.
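Three of the most common objective measures of the kind such surveys cover can be computed directly from rule supports. A minimal sketch (the helper name is hypothetical; supports are transaction fractions):

```python
def objective_measures(sup_a, sup_b, sup_ab):
    """Common objective interestingness measures for a rule A -> B:
    confidence = P(B|A); lift > 1 signals positive association;
    leverage is the difference from what independence would predict."""
    confidence = sup_ab / sup_a
    lift = sup_ab / (sup_a * sup_b)
    leverage = sup_ab - sup_a * sup_b
    return {"confidence": confidence, "lift": lift, "leverage": leverage}

print(objective_measures(0.4, 0.5, 0.3))
# confidence 0.75, lift 1.5, leverage 0.1
```

Ranking the hundreds or thousands of discovered rules by such measures is the standard way to address the "second-order" mining problem the abstract describes.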
Empirical Bayes Screening for Link Analysis
 In Proc. Text Mining and Link Analysis Workshop at the 18th IJCAI Conference
, 2003
Abstract

Cited by 1 (1 self)
The domain of link analysis has recently reignited interest among researchers due to its applicability to new areas such as intelligence analysis (for example, identifying cliques of suspicious people), large scale social network analysis and genomics.
Bayesian Data Analysis for Data Mining
 In Handbook of Data Mining
, 2002
Abstract

Cited by 1 (0 self)
Introduction The Bayesian approach to data analysis computes conditional probability distributions of quantities of interest (such as future observables) given the observed data. Bayesian analyses usually begin with a full probability model (a joint probability distribution for all the observable and unobservable quantities under study) and then use Bayes' theorem (Bayes, 1763) to compute the requisite conditional probability distributions (called posterior distributions). The theorem itself is innocuous enough. In its simplest form, if Q denotes a quantity of interest and D denotes data, the theorem states: P(Q|D) = P(D|Q) × P(Q) / P(D). This theorem prescribes the basis for statistical learning in the probabilistic framework. With p(Q) regarded as a probabilistic statement of prior knowledge about Q before obtaining the data D, p(Q|D) becomes a revised probabilistic statement of our knowledge about Q in the light of the data (Bernardo and Smith, 1994, p.2). The marginal lik...
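The theorem as stated, P(Q|D) = P(D|Q) × P(Q) / P(D), is easy to exercise numerically for a discrete set of hypotheses, with P(D) obtained by summing over them. A minimal sketch (names and numbers are illustrative only):

```python
def posterior(prior, likelihoods):
    """Bayes' theorem P(q|D) = P(D|q) * P(q) / P(D) over discrete
    hypotheses q, with the evidence P(D) computed by total probability."""
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    return [p * l / evidence for p, l in zip(prior, likelihoods)]

# Two hypotheses with equal prior; the data are twice as likely
# under the second, so its posterior probability doubles relative
# to the first: ≈ [1/3, 2/3].
print(posterior([0.5, 0.5], [0.2, 0.4]))
```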
Empirical Bayesian data mining for discovering patterns in post-marketing drug safety
 In Proceedings of KDD 2003
, 2003
Abstract

Cited by 1 (0 self)
Because of practical limits in characterizing the safety profiles of therapeutic products prior to marketing, manufacturers and regulatory agencies perform post-marketing surveillance based on the collection of adverse reaction reports ("pharmacovigilance"). The resulting databases, while rich in real-world information, are notoriously difficult to analyze using traditional techniques. Each report may involve multiple medicines, symptoms, and demographic factors, and there is no easily linked information on drug exposure in the reporting population. KDD techniques, such as association finding, are well-matched to the problem, but are difficult for medical staff to apply and interpret. To deploy KDD effectively for pharmacovigilance, Lincoln Technologies and GlaxoSmithKline collaborated to create a web-based safety data mining environment. The analytical core is a high-performance implementation of the MGPS (Multi-Item Gamma Poisson Shrinker) algorithm described previously by DuMouchel and Pregibon, with several significant extensions and enhancements. The environment offers an interface for specifying data mining runs, a batch execution facility, tabular and graphical methods for exploring associations, and drill-down to case details. Substantial work was involved in preparing the raw adverse event data for mining, including harmonization of drug names and removal of duplicate reports. The environment can be used to explore both drug-event and multi-way associations (interactions, syndromes). It has been used to study age/gender effects, to predict the safety profiles of proposed combination drugs, and to separate contributions of individual drugs to safety problems in polytherapy situations.