Results 11-20 of 33
Data Mining At The Interface Of Computer Science And Statistics
, 2001
Cited by 7 (0 self)
This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention both in the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications. Keywords: Data mining, statistics, pattern recognition, transaction data, correlation.
On the Existence and Significance of Data Preprocessing Biases in Web-Usage Mining
 INFORMS Journal on Computing
, 2003
Cited by 5 (1 self)
The literature on web-usage mining is replete with data preprocessing techniques, which correspond to many closely related problem formulations. We survey data-preprocessing techniques for session-level pattern discovery and compare three of these techniques in the context of understanding session-level purchase behavior on the web. Using real data collected from 20,000 users' browsing behavior over a period of six months, four different models (linear regressions, logistic regressions, neural networks, and classification trees) are built based on data preprocessed using three different techniques. The results demonstrate that the three approaches result in radically different conclusions and provide initial evidence that a data preprocessing bias exists, the effect of which can be significant.
A Theory Of Empirical Spatial Knowledge Supporting Rough Set Based Knowledge Discovery in Geographic Databases
 University of Otago
, 1998
Bayesian Analysis of Massive Datasets Via Particle Filters
Cited by 5 (1 self)
Markov Chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need to produce statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques.
Mining the News: Trends, Associations, and Deviations
 COMPUTACIÓN Y SISTEMAS
, 2001
Cited by 3 (0 self)
News reports are an important source of information about society. Their analysis makes it possible to understand society's current interests and to measure the social importance of many events. In this paper, we use the analysis of news as a means to explore society's interests. We present a text mining technique that uncovers trends, discovers associations, and detects deviations from news notes. The method uses simple statistical representations of the news reports (frequencies and probability distributions of topics) and statistical measures (the average or median, the standard deviation, and the correlation coefficient) for analysis and discovery of useful information. We illustrate the method with some results obtained from preliminary experiments and discuss their main implications.
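The statistical representations named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample data, the function names, and the one-standard-deviation threshold are all assumptions made for illustration. It computes a topic probability distribution per period, the correlation coefficient between two topic series, and the periods in which a topic deviates from its mean.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical data: topics mentioned in news notes, grouped by day.
daily_topics = [
    ["economy", "elections", "economy"],
    ["economy", "sports"],
    ["elections", "economy", "elections", "sports"],
    ["economy", "sports", "sports"],
]

def topic_distribution(topics):
    """Relative frequency (probability) of each topic in one period."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

def series(topic, periods):
    """Probability of a topic in each period (0.0 when it is absent)."""
    return [topic_distribution(p).get(topic, 0.0) for p in periods]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def deviations(xs, threshold=1.0):
    """Indices of periods lying more than `threshold` (assumed value)
    standard deviations away from the mean of the series."""
    mu, sigma = mean(xs), pstdev(xs)
    return [i for i, x in enumerate(xs) if sigma and abs(x - mu) > threshold * sigma]

economy = series("economy", daily_topics)
elections = series("elections", daily_topics)
print("correlation:", correlation(economy, elections))
print("deviating days for 'economy':", deviations(economy))
```

A strongly positive or negative coefficient would suggest an association between two topics, while the flagged indices mark candidate deviations for closer inspection.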
Knowledge discovery and data mining in databases
 Handbook of Software Engineering and Knowledge Engineering Fundamentals, World Scientific Publishing Co., Singapore
, 2001
Credit Risk Assessment using Statistical and Machine Learning: Basic Methodology and Risk Modeling Applications
, 1997
Cited by 2 (0 self)
Risk assessment of financial intermediaries is an area of renewed interest due to the financial crises of the 1980s and 90s. An accurate estimation of risk, and its use in corporate or global financial risk models, could be translated into a more efficient use of resources. One important ingredient to accomplish this goal is to find accurate predictors of individual risk in the credit portfolios of institutions. In this context we make a comparative analysis of different statistical and machine learning modeling methods of classification on a mortgage loan dataset with the motivation to understand their limitations and potential. We introduce a specific modeling methodology based on the study of error curves. Using state-of-the-art modeling techniques we built more than 9,000 models as part of the study. The results show that CART decision-tree models provide the best estimation for default, with an average 8.31% error rate for a training sample of 2,000 records. As a result of the error curve analysis for this model we conclude that if more data were available, approximately 22,000 records, a potential 7.32% error rate could be achieved. Neural Networks provided the second best results with an average error of 11.00%. The K-Nearest Neighbor algorithm had an average error rate of 14.95%. These results outperformed the standard Probit algorithm, which attained an average error rate of 15.13%.
Data Mining and Knowledge Discovery in the Geographical Domain
Cited by 2 (0 self)
Introduction. Despite enormous efforts in quantification, our understanding of many of the Earth's systems remains non-axiomatic; the systems are 'open' and consequently it is not possible to deduce all outcomes from known laws. Science must therefore adopt a manner that encourages the creation or uncovering of new knowledge (Baker, 1999; Takatsuka & Gahegan, 2001). For this reason alone, and completely uncoupled from concerns about increasing data volume, it is vital that knowledge discovery methods can be brought successfully to bear across the geosciences. The focus of this position paper is on the relationship of data mining and knowledge discovery to the different approaches used for scientific inference. It is via an understanding of this relationship that we can categorize the kinds of knowledge that can be discovered or learned, and gain an understanding of the roles played in the knowledge discovery process by the domain expert and the computational tools used. Many differe
A Statistical Approach to the Discovery of Ephemeral Associations among News Topics
 H.C. Mayr et al. (Eds.): DEXA 2001, LNCS 2113
, 2001
Cited by 1 (0 self)
News reports are an important source of information about society.
A Semi-Automatic Approach for Confounding-Aware Subgroup Discovery
 International Journal on Artificial Intelligence Tools, © World Scientific Publishing Company
This paper presents a semi-automatic approach for confounding-aware subgroup discovery. Confounding essentially disturbs the measured effect of an association between variables due to the influence of other parameters that were not considered. The proposed method is embedded into a general subgroup discovery approach, and provides the means for detecting potentially confounded subgroup patterns, other unconfounded relations, and/or patterns that are affected by effect modification. Since there is no purely automatic test for confounding, the discovered relations are presented to the user in a semi-automatic approach. Furthermore, we utilize (causal) domain knowledge for improving the results of the algorithm, since confounding is itself a causal concept. The applicability and benefit of the presented technique are illustrated by real-world examples from a case study in the medical domain.
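How confounding can disturb a measured effect is easiest to see with a small stratified comparison. The sketch below is a textbook stratification check, not the algorithm proposed in the paper. The counts, the variable names, and the tolerance are hypothetical. It compares the crude risk difference between exposed and unexposed groups with the risk difference inside each stratum of a third variable.

```python
# Hypothetical counts: (stratum, exposed, outcome) -> number of cases.
records = {
    ("young", True, True): 5,    ("young", True, False): 95,
    ("young", False, True): 20,  ("young", False, False): 380,
    ("old", True, True): 160,    ("old", True, False): 240,
    ("old", False, True): 40,    ("old", False, False): 60,
}

def risk(counts, exposed, stratum=None):
    """Share of positive outcomes among exposed (or unexposed) cases,
    optionally restricted to a single stratum."""
    cases = total = 0
    for (s, e, outcome), n in counts.items():
        if e == exposed and (stratum is None or s == stratum):
            total += n
            if outcome:
                cases += n
    return cases / total

def risk_difference(counts, stratum=None):
    """Risk among the exposed minus risk among the unexposed."""
    return risk(counts, True, stratum) - risk(counts, False, stratum)

def looks_confounded(crude, per_stratum, tolerance=0.05):
    """Flag the association when the crude effect differs from every
    stratum-specific effect by more than `tolerance` (assumed value)."""
    return all(abs(crude - d) > tolerance for d in per_stratum.values())

crude = risk_difference(records)
per_stratum = {s: risk_difference(records, s) for s in ("young", "old")}

print("crude risk difference:", round(crude, 3))
print("per-stratum risk differences:", per_stratum)
print("potentially confounded:", looks_confounded(crude, per_stratum))
```

In this made-up data the crude comparison shows an apparent effect that vanishes within each age stratum, which is the pattern a user would want flagged for review. Deciding whether the stratifying variable is really a confounder still requires causal domain knowledge, as the abstract notes.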