Editorial: special issue on learning from imbalanced data sets
- SIGKDD Explor. Newsl, 2004
Cited by 216 (5 self)
The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research.
Exploratory Under-Sampling for Class-Imbalance Learning
Cited by 97 (5 self)
Under-sampling is a class-imbalance learning method which uses only a subset of major class examples and thus is very efficient. The main deficiency is that many major class examples are ignored. We propose two algorithms to overcome the deficiency. EasyEnsemble samples several subsets from the major class, trains a learner using each of them, and combines the outputs of those learners. BalanceCascade is similar to EasyEnsemble except that it removes correctly classified major class examples of trained learners from further consideration. Experiments show that both of the proposed algorithms have better AUC scores than many existing class-imbalance learning methods. Moreover, they have approximately the same training time as that of under-sampling, which trains significantly faster than other methods.
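The EasyEnsemble procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the one-dimensional threshold learner `train_stump`, the function name `easy_ensemble`, the choice of subsets equal in size to the minority class, and score averaging as the combination rule are all assumptions made for the example.

```python
import random
from statistics import mean


def train_stump(pos, neg):
    """Hypothetical 1-D base learner: threshold midway between the class means."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    threshold = (mu_pos + mu_neg) / 2
    sign = 1 if mu_pos >= mu_neg else -1
    # The score is positive when x lies on the minority side of the threshold.
    return lambda x: sign * (x - threshold)


def easy_ensemble(minority, majority, n_subsets=5, seed=0):
    """Train one learner per random majority subset and average their scores."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_subsets):
        # Assumption: each subset has as many examples as the minority class.
        subset = rng.sample(majority, len(minority))
        learners.append(train_stump(minority, subset))
    # Assumption: outputs are combined by averaging the learners' scores.
    return lambda x: mean(f(x) for f in learners) > 0


# Illustrative, made-up data: 3 minority examples against 100 majority examples.
minority = [9.0, 10.0, 11.0]
majority = [float(i % 5) for i in range(100)]
predict = easy_ensemble(minority, majority)
print(predict(10.0), predict(1.0))
```

Each learner sees a balanced training set, yet across the subsets far more of the major class is used than in a single round of under-sampling.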
Activity recognition of assembly tasks using body-worn microphones and accelerometers
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006
Cited by 72 (13 self)
In order to provide relevant information to mobile users, such as workers engaging in the manual tasks of maintenance and assembly, a wearable computer requires information about the user’s specific activities. This work focuses on the recognition of activities that are characterized by a hand motion and an accompanying sound. Suitable activities can be found in assembly and maintenance work. Here, we provide an initial exploration into the problem domain of continuous activity recognition using on-body sensing. We use a mock “wood workshop” assembly task to ground our investigation. We describe a method for the continuous recognition of activities (sawing, hammering, filing, drilling, grinding, sanding, opening a drawer, tightening a vise, and turning a screwdriver) using microphones and 3-axis accelerometers mounted at two positions on the user’s arms. Potentially “interesting” activities are segmented from continuous streams of data using an analysis of the sound intensity detected at the two different locations. Activity classification is then performed on these detected segments using linear discriminant analysis (LDA) on the sound channel and hidden Markov
Cost Curves: An Improved Method for Visualizing Classifier Performance
- MACH LEARN, 2006
Cited by 64 (7 self)
This paper introduces cost curves, a graphical technique for visualizing the performance (error rate or expected cost) of 2-class classifiers over the full range of possible class distributions and misclassification costs. Cost curves are shown to be superior to ROC curves for visualizing classifier performance for most purposes. This is because they visually support several crucial types of performance assessment that cannot be done easily with ROC curves, such as showing confidence intervals on a classifier’s performance, and visualizing the statistical significance of the difference in performance of two classifiers. A software tool supporting all the cost curve analysis described in this paper is available from the authors.
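The abstract does not spell out how a cost curve is computed. The sketch below assumes the usual cost-curve definitions: the x-axis is the probability-cost value, which folds the positive-class prior and the two misclassification costs into one number in [0, 1], and the y-axis is the normalized expected cost. The function and parameter names are illustrative assumptions, not the interface of the authors' software tool.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """x-axis: share of the worst-case expected cost attributable to positives."""
    weighted_pos = p_pos * cost_fn
    weighted_neg = (1 - p_pos) * cost_fp
    return weighted_pos / (weighted_pos + weighted_neg)


def normalized_expected_cost(fnr, fpr, pc_pos):
    """y-axis: a straight line from (0, fpr) to (1, fnr) for a fixed classifier."""
    return fnr * pc_pos + fpr * (1 - pc_pos)


# Illustrative, made-up classifier: 20% false negatives, 10% false positives.
for pc in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(pc, normalized_expected_cost(fnr=0.2, fpr=0.1, pc_pos=pc))
```

Under these definitions a single ROC point becomes a line, so comparing two classifiers at a given operating condition reduces to seeing which line is lower at that x-value.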
Facts or friends?: distinguishing informational and conversational questions in social Q&A sites
- In CHI, 2009
Cited by 62 (3 self)
Tens of thousands of questions are asked and answered every day on social question and answer (Q&A) Web sites such as Yahoo Answers. While these sites generate an enormous volume of searchable data, the problem of determining which questions and answers are archival quality has grown. One major component of this problem is the prevalence of conversational questions, identified both by Q&A sites and academic literature as questions that are intended simply to start discussion. For example, a conversational question such as “do you believe in evolution?” might successfully engage users in discussion, but probably will not yield a useful web page for users searching for information about evolution. Using data from three popular Q&A sites, we confirm that humans can reliably distinguish between these conversational questions and other informational questions, and present evidence that conversational questions typically have much lower potential archival value than informational questions. Further, we explore the use of machine learning techniques to automatically classify questions as conversational or informational, learning in the process about categorical, linguistic, and social differences between different question types. Our algorithms approach human performance, attaining 89.7% classification accuracy in our experiments.
Author Keywords: Q&A, online community, machine learning.
Object trajectory-based activity classification and recognition using hidden Markov models
- IEEE Trans. Image Process
Cited by 59 (1 self)
Motion trajectories provide rich spatiotemporal information about an object’s activity. This paper presents novel classification algorithms for recognizing object activity using object motion trajectory. In the proposed classification system, trajectories are segmented at points of change in curvature, and the subtrajectories are represented by their principal component analysis (PCA) coefficients. We first present a framework to robustly estimate the multivariate probability density function based on PCA coefficients of the subtrajectories using Gaussian mixture models (GMMs). We show that GMM-based modeling alone cannot capture the temporal relations and ordering between underlying entities. To address this issue, we use hidden Markov models (HMMs) with a data-driven design in terms of number of states and topology (e.g., left-right versus ergodic). Experiments using a database of over 5700 complex trajectories (obtained from UCI-KDD data archives and Columbia University Multimedia Group) subdivided into 85 different classes demonstrate the superiority of our proposed HMM-based scheme using PCA coefficients of subtrajectories in comparison with other techniques in the literature.
Index Terms: Activity recognition, Gaussian mixture models (GMMs), hidden Markov models (HMMs), trajectory modeling.
Hybrid rule-extraction from support vector machines
- in Proc. of IEEE conference on cybernetics and intelligent systems, 2004
Cited by 56 (0 self)
Support vector machines (SVMs) have shown superior performance compared to other machine learning techniques, especially in classification problems. Yet one limitation of SVMs is the lack of an explanation capability, which is crucial in some applications, e.g. in the medical and security domains. In this paper, a novel approach for eclectic rule-extraction from support vector machines is presented. This approach utilizes the knowledge acquired by the SVM and represented in its support vectors as well as the parameters associated with them. The approach includes three stages: training, propositional rule-extraction and rule quality evaluation. Results from four different experiments have demonstrated the value of the approach for extracting comprehensible rules of high accuracy and fidelity.
Quality and Complexity Measures for Data Linkage and Deduplication
- Studies in Computational Intelligence (SCI), 2007
Cited by 47 (15 self)
Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
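A small worked example may help show why accuracy over record pairs can be deceptive. The numbers below are made up for illustration and do not come from the chapter: two data sets of 1,000 records each yield 1,000,000 candidate pairs, of which at most 1,000 can be true matches, so a linkage method that declares every pair a non-match still scores very high accuracy. The helper name `pair_space_metrics` is an assumption.

```python
def pair_space_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall computed over record-pair comparisons."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention for this sketch: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical linkage of two 1,000-record data sets with 1,000 true matches.
total_pairs = 1000 * 1000
true_matches = 1000

# A trivial method that labels every pair as a non-match.
accuracy, precision, recall = pair_space_metrics(
    tp=0, fp=0, fn=true_matches, tn=total_pairs - true_matches
)
print(accuracy, precision, recall)
```

The true negatives dominate the pair space, so accuracy stays near 1 even though no match is found; precision and recall do not count true negatives and expose the failure.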
A Study of Parts-Based Object Class Detection Using Complete Graphs
- INT J COMPUT VIS, 2009
Cited by 44 (2 self)
Object detection is one of the key components in modern computer vision systems. While the detection of a specific rigid object under changing viewpoints was considered hard just a few years ago, current research strives to detect and recognize classes of non-rigid, articulated objects. Hampered by the omnipresent confusing information due to clutter and occlusion, the focus has shifted from holistic approaches for object detection to representations of individual object parts linked by structural information, along with richer contextual descriptions of object configurations. Along this line of research, we present a practicable and expandable probabilistic framework for parts-based object class representation, enabling the detection of rigid and articulated object classes in arbitrary views. We investigate learning of this representation from labelled training images and infer globally optimal solutions to the contextual MAP-detection problem, using A*-search with a novel lower bound as an admissible heuristic. An assessment of the inference performance of Belief-Propagation and Tree-Reweighted Belief Propagation is obtained as a by-product. The generality of our approach is demonstrated on four different datasets utilizing domain-dependent information cues.