Results 1 - 10
of
145
A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data
- Applications of Data Mining in Computer Security
, 2002
"... Abstract Most current intrusion detection systems employ signature-based methods or data mining-based methods which rely on labeled training data. This training data is typically expensive to produce. We present a new geometric framework for unsupervised anomaly detection, which are algorithms that ..."
Abstract
-
Cited by 130 (8 self)
- Add to MetaCart
Abstract Most current intrusion detection systems employ signature-based methods or data mining-based methods which rely on labeled training data. This training data is typically expensive to produce. We present a new geometric framework for unsupervised anomaly detection, which are algorithms that are designed to process unlabeled data. In our framework, data elements are mapped to a feature space which is typically a vector space! d. Anomalies are detected by determining which points lies in sparse
Combining Pairwise Sequence Similarity and Support Vector Machines for Remote Protein Homology Detection
- J. Comput. Biol
, 2002
"... One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose func ..."
Abstract
-
Cited by 116 (12 self)
- Add to MetaCart
One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVMbased protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as SmithWaterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
Mismatch String Kernels for SVM Protein Classification
"... We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mi ..."
Abstract
-
Cited by 100 (14 self)
- Add to MetaCart
We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
Mismatch string kernels for discriminative protein classification
- Bioinformatics
, 2004
"... Motivation Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training an ..."
Abstract
-
Cited by 90 (7 self)
- Add to MetaCart
Motivation Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP data sets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highestweighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability SVM software is publically available at
Fast Kernels for String and Tree Matching
, 2004
"... Introduction Many problems in machine learning require a data classification algorithm to work with a set of discrete objects. Common examples include biological sequence analysis where data is represented as strings (Durbin et al., 1998) and Natural Language Processing (NLP) where the data is give ..."
Abstract
-
Cited by 60 (5 self)
- Add to MetaCart
Introduction Many problems in machine learning require a data classification algorithm to work with a set of discrete objects. Common examples include biological sequence analysis where data is represented as strings (Durbin et al., 1998) and Natural Language Processing (NLP) where the data is given in the form of a string combined with a parse tree (Collins and Du#y, 2001) or an annotated sequence (Altun et al., 2003). In order to apply kernel methods one defines a measure of similarity between discrete structures via a feature map # : X F. Here X is the set of discrete structures (eg. the set of all parse trees of a language) and F is a Hilbert space. Since #(x) F we can define a kernel by evaluating the scalar products k(x, x # ) = ##(x), #(x # )# (1.1) where x, x # X. The success of a kernel method employing k depends both on the faithful representation of discrete data and an e#cient means of computing k. Recent research e#ort has focussed on defining meaningful ker
Discriminative frequent pattern analysis for effective classification
- In ICDE
, 2007
"... The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification, and provide solid reasons ..."
Abstract
-
Cited by 47 (13 self)
- Add to MetaCart
The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification, and provide solid reasons supporting this methodology. It was well known that feature combinations (patterns) could capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent pattern-based classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12 % in UCI datasets) using the so-selected discriminative frequent patterns. 1.
Graph-Driven Features Extraction From Microarray Data
, 2002
"... Gene function prediction from microarray data is a first step toward better understanding the machinery of the cell from relatively cheap and easy-to-produce data. In this paper we investigate whether the knowledge of many metabolic pathways and their catalyzing enzymes accumulated over the years ca ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Gene function prediction from microarray data is a first step toward better understanding the machinery of the cell from relatively cheap and easy-to-produce data. In this paper we investigate whether the knowledge of many metabolic pathways and their catalyzing enzymes accumulated over the years can help improve the performance of classifiers for this problem.
Kernel-Based Learning of Hierarchical Multilabel Classification Models
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Mark ..."
Abstract
-
Cited by 33 (5 self)
- Add to MetaCart
We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show
Using Semantic Role to Improve Question Answering
- In Proceedings of EMNLP 2007
, 2007
"... Shallow semantic parsing, the automatic identification and labeling of sentential constituents, has recently received much attention. Our work examines whether semantic role information is beneficial to question answering. We introduce a general framework for answer extraction which exploits semanti ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
Shallow semantic parsing, the automatic identification and labeling of sentential constituents, has recently received much attention. Our work examines whether semantic role information is beneficial to question answering. We introduce a general framework for answer extraction which exploits semantic role annotations in the FrameNet paradigm. We view semantic role assignment as an optimization problem in a bipartite graph and answer extraction as an instance of graph matching. Experimental results on the TREC datasets demonstrate improvements over state-of-the-art models. 1
Measuring and Detecting Fast-Flux Service Networks
"... We present the first empirical study of fast-flux service networks (FFSNs), a newly emerging and still not widelyknown phenomenon in the Internet. FFSNs employ DNS to establish a proxy network on compromised machines through which illegal online services can be hosted with very high availability. Th ..."
Abstract
-
Cited by 28 (4 self)
- Add to MetaCart
We present the first empirical study of fast-flux service networks (FFSNs), a newly emerging and still not widelyknown phenomenon in the Internet. FFSNs employ DNS to establish a proxy network on compromised machines through which illegal online services can be hosted with very high availability. Through our measurements we show that the threat which FFSNs pose is significant: FFSNs occur on a worldwide scale and already host a substantial percentage of online scams. Based on analysis of the principles of FFSNs, we develop a metric with which FFSNs can be effectively detected. Considering our detection technique we also discuss possible mitigation strategies. 1

