Results 1 - 10 of 122
How many clusters? Which clustering method? Answers via model-based cluster analysis
- THE COMPUTER JOURNAL
, 1998
Neural Networks and Statistical Models
, 1994
Cited by 82 (1 self)
There has been much publicity about the ability of artificial neural networks to learn and generalize. In fact, the most commonly used artificial neural networks, called multilayer perceptrons, are nothing more than nonlinear regression and discriminant models that can be implemented with standard statistical software. This paper explains what neural networks are, translates neural network jargon into statistical jargon, and shows the relationships between neural networks and statistical models such as generalized linear models, maximum redundancy analysis, projection pursuit, and cluster analysis.
Automated support for classifying software failure reports
- In ICSE
, 2003
Cited by 64 (1 self)
This paper proposes automated support for classifying reported software failures in order to facilitate prioritizing them and diagnosing their causes. A classification strategy is presented that involves the use of supervised and unsupervised pattern classification and multivariate visualization. These techniques are applied to profiles of failed executions in order to group together failures with the same or similar causes. The resulting classification is then used to assess the frequency and severity of failures caused by particular defects and to help diagnose those defects. The results of applying the proposed classification strategy to failures of three large subject programs are reported. These results indicate that the strategy can be effective.
Characterizing the Scalability of a Large Web-Based Shopping System
- ACM Transactions on Internet Technology (TOIT)
, 2001
Cited by 60 (2 self)
This article presents an analysis of five days of workload data from a large Web-based shopping system. The multitier environment of this Web-based shopping system includes Web servers, application servers, database servers, and an assortment of load-balancing and firewall appliances. We characterize user requests and sessions and determine their impact on system performance and scalability. The purpose of our study is to assess scalability and support capacity planning exercises for the multitier system. We find that horizontal scalability is not always an adequate mechanism for supporting increased workloads and that personalization and robots can have a significant impact on system scalability.
Finding Intensional Knowledge of Distance-Based Outliers
- In VLDB
, 1999
Cited by 53 (1 self)
Existing studies on outliers focus only on the identification aspect; none provides any intensional knowledge of the outliers---by which we mean a description or an explanation of why an identified outlier is exceptional. For many applications, a description or explanation is at least as vital to the user as the identification aspect. Specifically, intensional knowledge helps the user to: (i) evaluate the validity of the identified outliers, and (ii) improve one's understanding of the data. The two main issues addressed in this paper are: what kinds of intensional knowledge to provide, and how to optimize the computation of such knowledge. With respect to the first issue, we propose finding strongest and weak outliers and their corresponding structural intensional knowledge. With respect to the second issue, we first present a naive and a semi-naive algorithm. Then, by means of what we call path and semi-lattice sharing of I/O processing, we develop two optimized approaches. We provi...
Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions
, 1988
Cited by 52 (3 self)
Efficiently answering decision support queries is an important problem. Most of the work in this direction has been in the context of the data cube. Queries are efficiently answered by pre-computing large parts of the cube. Besides having large space requirements, such pre-computation requires that the hierarchy along each dimension be fixed (hence dimensions are categorical or prediscretized). Queries that take advantage of pre-computation can thus only drill down or roll up along this fixed hierarchy. Another disadvantage of existing pre-computation techniques is that the target measure, along with the aggregation function of interest, is fixed for each cube. Queries over more than one target measure or using different aggregation functions would require pre-computing larger data cubes. In this paper, we propose a new compressed representation of the data cube that (a) drastically reduces storage requirements, (b) does not require the discretization hierarchy along each query dimension to be fixed beforehand, and (c) treats each dimension as a potential target measure and supports multiple aggregation functions without additional storage costs. The tradeoff is approximate, yet relatively accurate, answers to queries. We outline mechanisms to reduce the error in the approximation. Our performance evaluation indicates that our compression technique effectively addresses the limitations of existing approaches.
Algorithms for model-based Gaussian hierarchical clustering
- SIAM Journal on Scientific Computing
, 1998
Cited by 43 (11 self)
Funded by the Office of Naval Research under contracts N00014-96-1-0192 and N00014-96-1-
40 “What does European institutional integration tell us about trade integration?” by
, 2005
Cited by 40 (1 self)
publications will feature a motif taken from the €50 banknote. This paper can be downloaded without charge from the ECB’s website
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 35 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
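For orientation, the iterative refinement that this abstract describes can be sketched with a plain, in-memory EM fit of a one-dimensional Gaussian mixture. This is not the scalable method the paper presents; it is the baseline whose cost the abstract points to, since every iteration touches every record. The function names `gaussian_pdf` and `em_gmm_1d`, the quantile-based initialisation, and the fixed iteration count are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k, iterations=50, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain (non-scalable) EM.

    Returns (weights, means, variances). Illustrative sketch only.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed initialisation: place the means at evenly spaced quantiles.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: one full scan computing each record's soft cluster membership.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([p / total for p in joint])

        # M-step: re-estimate each component from its weighted records.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)

    return weights, means, variances


if __name__ == "__main__":
    import random

    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(200)]
    sample += [random.gauss(10.0, 1.0) for _ in range(200)]
    print(em_gmm_1d(sample, k=2))
```

Unlike a K-Means assignment, the E-step keeps a probability for every cluster per record, which is what lets the M-step estimate a full statistical model (weight, mean, variance) for each component.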
A Unified Notion of Outliers: Properties and Computation
- In Proc. of the International Conference on Knowledge Discovery and Data Mining
, 1997
Cited by 35 (0 self)
As said in signal processing, "One person's noise is another person's signal." For many applications, such as the exploration of satellite or medical images, and the monitoring of criminal activities in electronic commerce, identifying exceptions can often lead to the discovery of truly unexpected knowledge. In this paper, we study an intuitive notion of outliers. A key contribution of this paper is to show how the proposed notion of outliers unifies or generalizes many existing notions of outliers provided by discordancy tests for standard statistical distributions. Thus, a unified outlier detection system can replace a whole spectrum of statistical discordancy tests with a single module detecting only the kinds of outliers proposed. A second contribution of this paper is the development of an approach to find all outliers in a dataset. The structure underlying this approach resembles a data cube, which has the advantage of facilitating integration with the many OLAP and data mining s...

