Toward integrating feature selection algorithms for classification and clustering
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2005
Cited by 267 (21 self)
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Variable Selection Using SVM-based Criteria
, 2003
Cited by 125 (3 self)
We propose new methods to evaluate variable subset relevance with a view to variable selection.
A survey of evolutionary algorithms for data mining and knowledge discovery
- In: A. Ghosh, and S. Tsutsui (Eds.) Advances in Evolutionary Computation
, 2002
Cited by 123 (3 self)
This chapter discusses the use of evolutionary algorithms, particularly genetic algorithms and genetic programming, in data mining and knowledge discovery. We focus on the data mining task of classification. In addition, we discuss some preprocessing and postprocessing steps of the knowledge discovery process, focusing on attribute selection and pruning of an ensemble of classifiers. We show how the requirements of data mining and knowledge discovery influence the design of evolutionary algorithms. In particular, we discuss how individual representation, genetic operators and fitness functions have to be adapted for extracting high-level knowledge from data.
Simultaneous feature selection and clustering using mixture models
- IEEE TRANS. PATTERN ANAL. MACH. INTELL
, 2004
Cited by 122 (1 self)
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
Sentiment analysis in multiple languages
- ACM Transactions on Information Systems
, 2008
Cited by 99 (6 self)
The Internet is frequently used as a medium for exchange of information and opinions, as well as propaganda dissemination. In this study, the use of sentiment analysis methodologies is proposed for classification of web forum opinions in multiple languages. The utility of stylistic and syntactic features is evaluated for sentiment classification of English and Arabic content. Specific feature extraction components are integrated to account for the linguistic characteristics of Arabic. The Entropy Weighted Genetic Algorithm (EWGA) is also developed, which is a hybridized genetic algorithm that incorporates the information gain heuristic for feature selection. EWGA is designed to improve performance and get a better assessment of the key features. The proposed features and techniques are evaluated on a benchmark movie review data set and U.S. and Middle Eastern web forum postings. The experimental results using EWGA with SVM indicate high performance levels, with accuracy over 95% on the benchmark data set and over 93% for both the U.S. and Middle Eastern forums. Stylistic features significantly enhanced performance across all test beds, while EWGA also outperformed other feature selection methods, indicating the utility of these features and techniques for document-level classification of sentiments.
Hierarchical Text Categorization Using Neural Networks
- Information Retrieval
, 2002
Cited by 99 (0 self)
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with an optimized version of the traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers, are provided. The results show that the use of the hierarchical structure improves text categorization performance with respect to an equivalent flat model. The optimized Rocchio algorithm achieves a performance comparable with that of the hierarchical neural networks.
Feature Selection in Unsupervised Learning via Evolutionary Search
- In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
Cited by 79 (4 self)
Feature subset selection is an important problem in knowledge discovery, not only for the insight gained from determining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. In this paper we consider the problem of feature selection for unsupervised learning. A number of heuristic criteria can be used to estimate the quality of clusters built from a given feature subset. Rather than combining such criteria, we use ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multidimensional objective space. Each evolved solution represents a feature subset and a number of clusters; a standard K-means algorithm is applied to form the given number of clusters based on the selected features. Preliminary results on both real and synthetic data show promise in finding Pareto-optimal solutions through which we can identify the significant features and the correct number of clusters.
Hybrid Genetic Algorithms for Feature Selection
- IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 1424–1437
, 2007
Cited by 71 (1 self)
This paper proposes a novel hybrid genetic algorithm for feature selection. Local search operations are devised and embedded in hybrid GAs to fine-tune the search. The operations are parameterized in terms of their fine-tuning power, and their effectiveness and timing requirements are analyzed and compared. The hybridization technique produces two desirable effects: a significant improvement in the final performance and the acquisition of subset-size control. The hybrid GAs showed better convergence properties compared to the classical GAs. A method of performing rigorous timing analysis was developed, in order to compare the timing requirement of the conventional and the proposed algorithms. Experiments performed with various standard data sets revealed that the proposed hybrid GA is superior to both a simple GA and sequential search algorithms. Index Terms: feature selection, hybrid genetic algorithm, sequential search algorithm, local search operation, atomic operation, multistart algorithm.
Lightweight Agents for Intrusion Detection
- Journal of Systems and Software
, 2003
Cited by 71 (7 self)
This paper focuses on intrusion detection and countermeasures with respect to widely used operating systems and networks. The design and architecture of an intrusion detection system built from distributed agents is proposed to implement an intelligent system on which data mining can be performed to provide global, temporal views of an entire networked system. A starting point for agent intelligence in our system is the research into the use of machine learning over system call traces from the privileged sendmail program on UNIX. We use a rule learning algorithm to classify the system call traces for intrusion detection purposes and show the results.
Constructive Neural Network Learning Algorithms for Pattern Classification
, 2000
Cited by 60 (15 self)
Constructive learning algorithms offer an attractive approach for the incremental construction of near-minimal neural-network architectures for pattern classification. They help overcome the need for ad hoc and often inappropriate choices of network topology in algorithms that search for suitable weights in a priori fixed network architectures. Several such algorithms are proposed in the literature and shown to converge to zero classification errors (under certain assumptions) on tasks that involve learning a binary to binary mapping (i.e., classification problems involving binary-valued input attributes and two output categories). We present two constructive learning algorithms MPyramid-real and MTiling-real that extend the pyramid and tiling algorithms, respectively, for learning real to M-ary mappings (i.e., classification problems involving real-valued input attributes and multiple output classes). We prove the convergence of these algorithms and empirically demonstrate their applicability to practical pattern classification problems. Additionally, we show how the incorporation of a local pruning step can eliminate several redundant neurons from MTiling-real networks.