Results 1–10 of 33
Empirical Bayes Screening for Multi-Item Associations
, 2001
Abstract

Cited by 56 (0 self)
This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each item set divided by a baseline frequency computed as if items occurred independently. The focus is on obtaining reliable estimates of this measure of interestingness for all item sets, even item sets with relatively low frequencies. For example, in a medical database of patient histories, unusual item sets including the item "patient death" (or another serious adverse event) should ideally be flagged with as few as 5 or 10 occurrences of the item set; it is unacceptable to require that item sets occur in as many as 0.1% of millions of patient reports before the data mining algorithm detects a signal. Similar considerations apply in fraud detection applications. Thus we abandon the requirement that interesting item sets must contain a re...
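The interestingness measure this abstract describes, observed item-set frequency divided by an independence baseline, can be sketched as a plain lift computation (without the paper's empirical Bayes shrinkage; the function name is illustrative):

```python
from itertools import combinations

def lift_scores(transactions, size=2):
    """Observed/expected frequency ratio for item sets of a given size,
    where the expected (baseline) frequency assumes the member items
    occur independently of one another."""
    n = len(transactions)
    item_freq = {}
    for t in transactions:
        for item in set(t):
            item_freq[item] = item_freq.get(item, 0) + 1
    set_freq = {}
    for t in transactions:
        for combo in combinations(sorted(set(t)), size):
            set_freq[combo] = set_freq.get(combo, 0) + 1
    scores = {}
    for combo, count in set_freq.items():
        observed = count / n
        baseline = 1.0
        for item in combo:
            baseline *= item_freq[item] / n  # independence baseline
        scores[combo] = observed / baseline
    return scores
```

The paper's contribution is precisely that raw ratios like these are unstable at low counts; it shrinks them toward a prior rather than reporting them directly.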
Squashing Flat Files Flatter
, 1999
Abstract

Cited by 48 (3 self)
A feature of data mining that distinguishes it from "classical" machine learning (ML) and statistical modeling (SM) is scale. The community seems to agree on this yet progress to this point has been limited. We present a methodology that addresses scale in a novel fashion that has the potential for revolutionizing the field. While the methodology applies most directly to flat (row by column) data sets we believe that it can be adapted to other representations. Our approach to the problem is not to scale up individual ML and SM methods. Rather we prefer to leverage the entire collection of existing methods by scaling down the data set. We call the method squashing. Our method demonstrably outperforms random sampling and a theoretical argument suggests how and why it works well. Squashing consists of three modular steps: grouping, momentizing, and generating (GMG). These three steps describe the squashing pipeline whereby the original (very large data set) is sectioned off into mutual...
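A minimal illustration of the "scaling down" idea, assuming a first-moment-only variant (the actual GMG pipeline also matches higher-order moments within each group; grouping by sort order here is only a stand-in):

```python
def squash_first_moment(rows, n_groups=2):
    """Toy 'squashing' sketch: partition the rows into groups and replace
    each group with a single weighted pseudo-point (its mean), which
    preserves within-group first moments.  The full grouping/momentizing/
    generating method also matches higher-order moments."""
    rows = sorted(rows)                   # crude grouping: by sort order
    size = -(-len(rows) // n_groups)      # ceiling division
    pseudo = []
    for i in range(0, len(rows), size):
        group = rows[i:i + size]
        mean = [sum(col) / len(group) for col in zip(*group)]
        pseudo.append((mean, len(group)))  # (pseudo-point, weight)
    return pseudo
```

Any method that accepts case weights can then be run on the few pseudo-points instead of the full data set.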
Exponential Language Models, Logistic Regression, and Semantic Coherence
 In Proceedings of the NIST/DARPA Speech Transcription Workshop
, 2000
Abstract

Cited by 7 (4 self)
In this paper, we modify the traditional trigram model by using utterance-level semantic coherence features in an exponential model. The semantic coherence features are collected by measuring the correlations among content-word pairs occurring in sentences of two corpora: the real corpus and a corpus generated by the baseline trigram model. The measure we use for estimating the semantic association of content-word pairs is Yule's Q statistic. For our preliminary analysis, we have further simplified the modeling task by extracting a small set of statistics from the sentence-based Q statistics and applying them as features to the exponential model. We also simplified the process of obtaining the MLE solutions of the exponential models by approximating it with a logistic regression model. We account for the uncertainty in the estimates of Q by constructing confidence intervals. The new model results in a slight reduction in test-set perplexity. We also discuss and compare alternative mea...
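Yule's Q, the association measure named in this abstract, is a simple function of a 2x2 co-occurrence table [[a, b], [c, d]] (a = both words present, b = first only, c = second only, d = neither):

```python
def yules_q(a, b, c, d):
    """Yule's Q association measure for a 2x2 contingency table
    [[a, b], [c, d]]: Q = (ad - bc) / (ad + bc), ranging over [-1, 1],
    with 0 indicating independence."""
    return (a * d - b * c) / (a * d + b * c)
```

The confidence intervals the abstract mentions would be built around this point estimate; they are not shown here.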
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Abstract

Cited by 6 (0 self)
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s∗ for a dataset, such that the number of itemsets with support at least s∗ represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. Our methodology hinges on a Poisson approximation to the ...
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Abstract

Cited by 5 (0 self)
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency is to be expected once the frequencies of the itemsets on which they depend are known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
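A rough sketch of the kind of screen this abstract describes, for the simplest case of a pair itemset whose frequency might be explained by its two singleton subsets. This uses a normal approximation rather than the paper's statistical tests, and the names are illustrative:

```python
import math

def independence_z(n, count_ab, count_a, count_b):
    """z-score of the observed joint count of {A, B} against the count
    expected if A and B occurred independently (binomial normal
    approximation).  A pair whose z-score is near zero is 'explained by
    its subsets' and would not be flagged as self-sufficient."""
    p = (count_a / n) * (count_b / n)   # joint probability under independence
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (count_ab - expected) / sd
```

The full method also tests against supersets and controls for multiple testing; this sketch covers only the subset direction.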
Association rule discovery with unbalanced class
In Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (AI03), Lecture Notes in Artificial Intelligence
, 2003
Abstract

Cited by 5 (4 self)
There are many methods for finding association rules in very large data. However, it is well known that most general association rule discovery methods find too many rules, including many uninteresting ones. Furthermore, the performance of many such algorithms deteriorates when the minimum support is low, and they fail to find many interesting low-support rules, particularly in the case of significantly unbalanced classes. In this paper we present an algorithm which finds association rules based on a set of new interestingness criteria. The algorithm is applied to a real-world health data set and successfully identifies groups of patients with high risk of adverse reaction to certain drugs. A statistically guided method of selecting appropriate features has also been developed. Initial results have shown that the proposed algorithm can find interesting patterns from data sets with unbalanced class distributions without performance loss.
Evaluation of statistical association measures for the automatic signal generation in pharmacovigilance
 IEEE Transactions on Information Technology in Biomedicine
, 2005
Abstract

Cited by 4 (0 self)
Abstract—Pharmacovigilance aims at detecting the adverse effects of marketed drugs. It is generally based on the spontaneous reporting of events thought to be the adverse effects of drugs. Spontaneous Reporting Systems (SRSs) supply huge databases that pharmacovigilance experts cannot exhaustively exploit without data mining tools. Data mining methods, i.e., statistical association measures in conjunction with signal generation criteria, have been proposed in the literature, but there is no consensus regarding their applicability and efficiency, especially since such methods are difficult to evaluate on the basis of actual data. The objective of this paper is to evaluate association measures on simulated datasets obtained with SRS modeling. We compared association measures using the percentage of false positive signals among a given number of the most highly ranked drug–event combinations according to the values of the association measures. Considering 150 drugs and 100 adverse events, these percentages of false positives, among the 500 most highly ranked drug–event couples, vary from 1.1% to 53.4% (averages over 1000 simulated datasets). As the measures led to very different results, we could identify which measures appeared to be the most relevant for pharmacovigilance. Index Terms—Adverse drug reaction reporting systems, association measures, computer simulation, information systems, validation studies.
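One widely used association measure of this kind is the proportional reporting ratio (PRR); whether it is among the measures this particular study evaluates is not stated here, but it illustrates the 2x2 drug–event table on which such measures are computed:

```python
def prr(a, b, c, d):
    """Proportional Reporting Ratio, a standard pharmacovigilance
    association measure: the rate of the event among reports mentioning
    the drug of interest, divided by its rate among all other reports.
    a = drug & event, b = drug & other events,
    c = other drugs & event, d = other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))
```

A signal generation criterion then thresholds this value (often together with a chi-square statistic and a minimum report count).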
Defining interestingness for association rules
 In Int. journal of information theories and applications
Abstract

Cited by 4 (0 self)
Abstract: Interestingness in association rules has been a major topic of research in the past decade. The reason is that the strength of association rules, i.e., their ability to discover ALL patterns given some thresholds on support and confidence, is also their weakness. Indeed, a typical association rule analysis on real data often results in hundreds or thousands of patterns, creating a data mining problem of the second order. In other words, it is not straightforward to determine which of those rules are interesting to the end user. This paper provides an overview of some existing measures of interestingness and comments on their properties. In general, interestingness measures can be divided into objective and subjective measures. Objective measures tend to express interestingness by means of statistical or mathematical criteria, whereas subjective measures of interestingness aim at capturing more practical criteria that should be taken into account, such as the unexpectedness or actionability of rules. This paper focuses only on objective measures of interestingness.
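Two of the standard objective measures surveyed in papers of this kind, confidence and lift, can be computed directly from rule counts; a minimal sketch for a rule X -> Y:

```python
def rule_measures(n, count_xy, count_x, count_y):
    """Two common objective interestingness measures for a rule X -> Y:
    confidence = P(Y | X), and lift = P(X, Y) / (P(X) * P(Y)),
    where lift > 1 means X and Y co-occur more than independence predicts."""
    support = count_xy / n
    confidence = count_xy / count_x
    lift = support / ((count_x / n) * (count_y / n))
    return confidence, lift
```

Measures such as these are "objective" in the paper's sense: they depend only on the counts, not on what the end user finds actionable.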
Adverse Drug Effect Detection
Abstract

Cited by 1 (0 self)
Abstract—Large collections of electronic patient records provide abundant but under-explored information on the real-world use of medicines. Although they are maintained for patient administration, they provide a broad range of clinical information for data analysis. One growing interest is drug safety signal detection from these longitudinal observational data. In this paper, we propose two novel algorithms—a likelihood ratio model and a Bayesian network model—for adverse drug effect discovery. Although the performance of these two algorithms is comparable to that of the state-of-the-art algorithm, the Bayesian confidence propagation neural network, the combination of the three works better due to their diversity in solutions. Since the actual adverse drug effects in a given dataset cannot be absolutely determined, we make use of the simulated OMOP dataset, constructed with predefined adverse drug effects, to evaluate our methods. Experimental results show the usefulness of the proposed pattern discovery method on the simulated OMOP dataset, improving on the standard baseline algorithm—chi-square—by 23.83%. Index Terms—adverse drug effect, correlation, BCPNN, likelihood ratio, Bayesian network.
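The chi-square baseline mentioned in this abstract, applied to a 2x2 drug–event table, can be sketched as follows (the closed form for a 2x2 table, without continuity correction):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]:
    chi2 = n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)),
    where n is the total count.  Larger values indicate stronger
    departure from drug-event independence."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

The paper's contribution lies in what it layers on top of such a baseline; this is only the reference measure it compares against.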
Spontaneous Reporting System Modelling for Data Mining Methods Evaluation in Pharmacovigilance
Abstract

Cited by 1 (1 self)
Pharmacovigilance aims at detecting adverse effects of marketed drugs. It is based on the spontaneous reporting of events that are suspected to be adverse effects of drugs. The Spontaneous Reporting System (SRS) supplies huge databases that pharmacovigilance experts cannot exhaustively exploit without data mining tools. Data mining methods have been proposed in the literature, but none of them has achieved consensus in terms of applicability and efficiency, largely because of the difficulty of evaluating the methods on real data. In this context, the aim of this paper is to propose SRS modelling in order to simulate realistic data that permit a more complete evaluation and comparison of the methods, with the perspective of helping to define surveillance strategies. Indeed, as the status of the drug–event relations is known in the simulated dataset, the signals generated by the data mining methods can be labelled as "true" or "false". The spontaneous reporting process is viewed as a Poisson process depending on the drug exposure frequency, the delay since the drug's launch, the adverse event's background incidence and seriousness, and a reporting probability. This reporting probability, quantitatively unknown, is derived from the qualitative knowledge found in the literature and expressed by experts. This knowledge is represented and exploited by means of a fuzzy characterisation of variables and a set of fuzzy rules. Simulated data are described, and two Bayesian data mining methods are applied to illustrate the kind of information on method performance that can be derived from the SRS modelling and from the data simulation.
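The Poisson reporting process described in this abstract might look like the following toy simulator. Parameter names are illustrative, and the fuzzy-rule-derived reporting probability is replaced by a fixed number:

```python
import random

def simulate_reports(exposure, background_rate, reporting_prob, rr=1.0):
    """Toy spontaneous-reporting model: the number of reports for one
    drug-event pair is Poisson with mean
        exposure * background event rate * relative risk * reporting prob.
    Sampled by counting unit-rate exponential inter-arrival times that
    fit within the mean (stdlib only, no numpy/scipy)."""
    mean = exposure * background_rate * rr * reporting_prob
    count, total = 0, random.expovariate(1.0)
    while total < mean:
        count += 1
        total += random.expovariate(1.0)
    return count
```

Generating such counts for every drug–event pair, with rr > 1 only for the planted "true" associations, yields a dataset on which signal detection methods can be scored against known ground truth.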