Capturing, indexing, clustering, and retrieving system history
- In SOSP, 2005
- Cited by 65 (5 self)
Keywords: system performance, Bayesian networks, information retrieval, problem signatures. We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ...
Sensing meets mobile social networks: The design, implementation and evaluation of the CenceMe application
- In Proceedings of the International Conference on Embedded Networked Sensor Systems (SenSys), 2008
- Cited by 61 (9 self)
We present the design, implementation, evaluation, and user experiences of the CenceMe application, which represents the first system that combines the inference of the presence of individuals using off-the-shelf, sensor-enabled mobile phones with sharing of this information through social networking applications such as Facebook and MySpace. We discuss the system challenges for the development of software on the Nokia N95 mobile phone. We present the design and tradeoffs of split-level classification, whereby personal sensing presence (e.g., walking, in conversation, at the gym) is derived from classifiers which execute in part on the phones and in part on the backend servers to achieve scalable inference. We report performance measurements that characterize the computational requirements of the software and the energy consumption of the CenceMe phone client. We validate the system through a user study where twenty-two people, including undergraduates, graduates and faculty, used CenceMe continuously over a three-week period in a campus town. From this user study we learn how the system performs in a production environment and what uses people find for a personal sensing system.
Mining E-mail Content for Author Identification Forensics
- SIGMOD RECORD, 2001
- Cited by 59 (1 self)
We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors for the case of both aggregated e-mail topics as well as across different e-mail topics. An extended set of e-mail document features including structural characteristics and linguistic patterns were derived and, together with a Support Vector Machine learning algorithm, were used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
Failure diagnosis using decision trees
- In Proceedings of the International Conference on Autonomic Computing (ICAC), 2004
- Cited by 56 (1 self)
We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay, and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We discuss some results in applying simplified decision trees on eBay’s production site for several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our contributions include the statistical learning approach, the adaptation of decision trees to the context of failure diagnosis, and the deployment and evaluation of our tools on a high-volume production service.
An Empirical Comparison of Supervised Learning Algorithms
- In Proc. 23rd Intl. Conf. Machine Learning (ICML’06), 2006
- Cited by 55 (3 self)
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive Bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.
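As a rough illustration of the isotonic-regression calibration mentioned in this abstract, the following is a minimal sketch using the pool-adjacent-violators procedure. It is not the authors' code: the function name `isotonic_calibrate`, the scores, and the labels are all hypothetical, chosen only to show how raw classifier scores can be mapped to non-decreasing probability estimates.

```python
def isotonic_calibrate(scores, labels):
    """Fit a non-decreasing step function from score to probability.

    Uses pool-adjacent-violators on the 0/1 labels sorted by score.
    Returns a list of (largest score in block, calibrated probability).
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [sum of labels, count, largest score]
    for score, label in pairs:
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


# Hypothetical uncalibrated scores and true labels (illustration only).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]
steps = isotonic_calibrate(scores, labels)
for upper, prob in steps:
    print(f"score <= {upper}: p = {prob}")
```

The out-of-order labels at scores 0.2 and 0.3 are pooled into a single block with probability 0.5, which is what keeps the fitted map monotone.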
Structural semantic interconnections: a knowledge-based approach to word sense disambiguation
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
- Cited by 52 (14 self)
In this paper we describe the SSI algorithm, a structural pattern matching algorithm for WSD. The algorithm has been applied to the gloss disambiguation task of Senseval-3.
Learning to Attach Semantic Metadata to Web Services
- In Proc. Int. Semantic Web Conf., 2003
- Cited by 51 (10 self)
Emerging Web standards promise a network of heterogeneous yet interoperable Web Services. Web Services would greatly simplify the development of many kinds of data integration and knowledge management applications. Unfortunately, this vision requires that services describe themselves with large amounts of semantic metadata "glue". We explore a variety of machine learning techniques to semiautomatically create such metadata.
Flow Clustering Using Machine Learning Techniques
- In PAM, 2004
- Cited by 48 (0 self)
Packet header traces are widely used in network analysis. Header traces are the aggregate of traffic from many concurrent applications. We present a methodology, based on machine learning, that can break the trace down into clusters of traffic where each cluster has different traffic characteristics. Typical clusters include bulk transfer, single and multiple transactions and interactive traffic, amongst others. The paper includes a description of the methodology, a visualisation of the attribute statistics that aids in recognising cluster types and a discussion of the stability and effectiveness of the methodology.
Discriminative frequent pattern analysis for effective classification
- In ICDE, 2007
- Cited by 47 (13 self)
The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification, and provide solid reasons supporting this methodology. It was well known that feature combinations (patterns) could capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent pattern-based classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12% in UCI datasets) using the so-selected discriminative frequent patterns.
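The link this abstract draws between pattern frequency and information gain can be illustrated with a small sketch. It computes the standard information gain of a binary "pattern present" feature from class counts. The function names and all counts below are hypothetical and are not taken from the paper. They only show why a very rare pattern, even a perfectly pure one, contributes little gain.

```python
import math


def entropy(pos, neg):
    """Binary class entropy in bits for the given class counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(with_pos, with_neg, without_pos, without_neg):
    """Information gain of splitting on whether a pattern occurs.

    with_*    : class counts among examples that contain the pattern
    without_* : class counts among examples that do not
    """
    n_with = with_pos + with_neg
    n_without = without_pos + without_neg
    n = n_with + n_without
    before = entropy(with_pos + without_pos, with_neg + without_neg)
    after = ((n_with / n) * entropy(with_pos, with_neg)
             + (n_without / n) * entropy(without_pos, without_neg))
    return before - after


# Hypothetical dataset: 100 examples, 50 positive and 50 negative.
rare_gain = information_gain(2, 0, 48, 50)        # support 2, perfectly pure
frequent_gain = information_gain(35, 5, 15, 45)   # support 40, mostly positive
print(f"rare pattern gain:     {rare_gain:.3f}")
print(f"frequent pattern gain: {frequent_gain:.3f}")
```

Under these made-up counts the rare pattern scores far lower than the frequent one, which is the intuition behind using a minimum support threshold to filter out patterns with limited predictive power.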
Diverse Ensembles for Active Learning
- In Proceedings of 21st International Conference on Machine Learning (ICML-2004), 2004
- Cited by 47 (7 self)
Query by Committee is an effective approach to selective sampling in which disagreement amongst an ensemble of hypotheses is used to select data for labeling. Query by Bagging and Query by Boosting are two practical implementations of this approach that use Bagging and Boosting, respectively, to build the committees. For effective active learning, it is critical that the committee be made up of consistent hypotheses that are very different from each other. Decorate is a recently developed method that directly constructs such diverse committees using artificial training data. This paper introduces Active-Decorate, which uses Decorate committees to select good training examples.
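The selection step of Query by Committee described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: vote entropy is used as the disagreement measure (one common choice, not necessarily the one used by Active-Decorate), and the committee of threshold classifiers and the unlabeled pool are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_query(committee, unlabeled):
    """Return the unlabeled example on which the committee disagrees most."""
    return max(unlabeled,
               key=lambda x: vote_entropy([h(x) for h in committee]))


# Hypothetical committee: three threshold classifiers on a 1-D feature.
committee = [lambda x, t=t: int(x > t) for t in (0.3, 0.5, 0.7)]
unlabeled = [0.1, 0.4, 0.9]
query = select_query(committee, unlabeled)
print("example to label next:", query)
```

Only the example at 0.4 splits the committee's votes, so it is the one chosen for labeling. A committee whose members always agree would give every example zero entropy, which is why the abstract stresses diversity among the hypotheses.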

