Results 1 - 10
of
155
BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection
"... Botnets are now the key platform for many Internet attacks, such as spam, distributed denial-of-service (DDoS), identity theft, and phishing. Most of the current botnet detection approaches work only on specific botnet command and control (C&C) protocols (e.g., IRC) and structures (e.g., central ..."
Abstract
-
Cited by 200 (14 self)
- Add to MetaCart
(Show Context)
Botnets are now the key platform for many Internet attacks, such as spam, distributed denial-of-service (DDoS), identity theft, and phishing. Most of the current botnet detection approaches work only on specific botnet command and control (C&C) protocols (e.g., IRC) and structures (e.g., centralized), and can become ineffective as botnets change their C&C techniques. In this paper, we present a general detection framework that is independent of botnet C&C protocol and structure, and requires no a priori knowledge of botnets (such as captured bot binaries and hence the botnet signatures, and C&C server names/addresses). We start from the definition and essential properties of botnets. We define a botnet as a coordinated group of malware instances that are controlled via C&C communication channels. The essential properties of a botnet are that the bots communicate with some C&C servers/peers, perform malicious activities, and do so in a similar or correlated way. Accordingly, our detection framework clusters similar communication traffic and similar malicious traffic, and performs cross cluster correlation to identify the hosts that share both similar communication patterns and similar malicious activity patterns. These hosts are thus bots in the monitored network. We have implemented our BotMiner prototype system and evaluated it using many real network traces. The results show that it can detect real-world botnets (IRC-based, HTTP-based, and P2P botnets including Nugache and Storm worm), and has a very low false positive rate. 1
Learning to detect and classify malicious executables in the wild
- Journal of Machine Learning Research
, 2006
"... We describe the use of machine learning and data mining to detect and classify malicious executables as they appear in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more t ..."
Abstract
-
Cited by 90 (1 self)
- Add to MetaCart
(Show Context)
We describe the use of machine learning and data mining to detect and classify malicious executables as they appear in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results suggest that our methodology will scale to larger collections of executables. We also evaluated how well the methods classified executables based on the function of their payload, such as opening a backdoor and mass-mailing. Areas under the ROC curve for detecting payload function were in the neighborhood of 0.9, which were smaller than those for the detection task. However, we attribute this drop in performance to fewer training examples and to the challenge of obtaining properly labeled examples, rather than to a failing of the methodology or to some inherent difficulty of the classification task. Finally, we applied detectors to 291 malicious executables discovered after we gathered our original collection, and boosted decision trees achieved a true-positive rate of 0.98 for a desired false-positive rate of 0.05. This result is particularly important, for it suggests that our methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
Learning to Detect Malicious Executables in the Wild
, 2004
"... In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 millio ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the roc curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
Automatic Analysis of Malware Behavior using Machine Learning
, 2009
"... Malicious software—so called malware—poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in form of computer viruses, Internet worms and ..."
Abstract
-
Cited by 51 (4 self)
- Add to MetaCart
Malicious software—so called malware—poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software. In this article, we propose a framework for automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both, clustering and classification, we propose an incremental approach for behavior-based analysis, capable to process the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing an accurate discovery and discrimination of novel malware variants.
Malware Phylogeny Generation using Permutations of Code
- JOURNAL IN COMPUTER VIROLOGY
, 2005
"... Malicious programs, such as viruses and worms, are frequently related to previous programs through evolutionary relationships. Discovering those relationships and constructing a phylogeny model is expected to be helpful for analyzing new malware and for establishing a principled naming scheme. Mat ..."
Abstract
-
Cited by 43 (6 self)
- Add to MetaCart
Malicious programs, such as viruses and worms, are frequently related to previous programs through evolutionary relationships. Discovering those relationships and constructing a phylogeny model is expected to be helpful for analyzing new malware and for establishing a principled naming scheme. Matching permutations of code may help build better models in cases where malware evolution does not keep things in the same order. We describe method for constructing phylogeny models that uses features called n-perms to match possibly permuted code. An experiment was performed to compare the relative effectiveness of vector similarity measures using n-perms and n-grams when comparing permuted variants of programs. The similarity measures using n-perms maintained a greater separation between the similarity scores of permuted families of specimens versus unrelated specimens. A subsequent study using a tree generated through suggests that phylogeny models based on may help forensic analysts investigate new specimens, and assist in reconciling malware naming inconsistencies.
Hunting for metamorphic engines
- J COMPUT VIROL
, 2006
"... In this paper, we analyze several metamorphic virus generators. We define a similarity index and use it to precisely quantify the degree of metamorphism that each generator produces. Then we present a detector based on hidden Markov models and we consider a simpler detection method based on our simi ..."
Abstract
-
Cited by 32 (4 self)
- Add to MetaCart
(Show Context)
In this paper, we analyze several metamorphic virus generators. We define a similarity index and use it to precisely quantify the degree of metamorphism that each generator produces. Then we present a detector based on hidden Markov models and we consider a simpler detection method based on our similarity index. Both of these techniques detect all of the metamorphic viruses in our test set with extremely high accuracy. In addition, we show that popular commercial virus scanners do not detect the highly metamorphic virus variants in our test set.
Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning
"... Extracting useful knowledge from large network datasets has become a fundamental challenge in many domains, from scientific literature to social networks and the web. We introduce Apolo, a system that uses a mixed-initiative approach— combining visualization, rich user interaction and machine learni ..."
Abstract
-
Cited by 27 (7 self)
- Add to MetaCart
Extracting useful knowledge from large network datasets has become a fundamental challenge in many domains, from scientific literature to social networks and the web. We introduce Apolo, a system that uses a mixed-initiative approach— combining visualization, rich user interaction and machine learning—to guide the user to incrementally and interactively explore large network data and make sense of it. Apolo engages the user in bottom-up sensemaking to gradually build up an understanding over time by starting small, rather than starting big and drilling down. Apolo also helps users find relevant information by specifying exemplars, and then using a machine learning method called Belief Propagation to infer which other nodes may be of interest. We evaluated Apolo with twelve participants in a between-subjects study, with the task being to find relevant new papers to update an existing survey paper. Using expert judges, participants using Apolo found significantly more relevant papers. Subjective feedback of Apolo was also very positive.
MET: An Experimental System for Malicious Email Tracking
- In Proceedings of the New Security Paradigms Workshop (NSPW
, 2002
"... Despite the use of state of the art methods to protect against malicious progr~.ms, they continue to threaten and dam-age computer systems around the world. In this paper we present MET, the Malicious Emall Tracking system, de-signed to automatically report statistics on the flow behavior of malicio ..."
Abstract
-
Cited by 26 (8 self)
- Add to MetaCart
(Show Context)
Despite the use of state of the art methods to protect against malicious progr~.ms, they continue to threaten and dam-age computer systems around the world. In this paper we present MET, the Malicious Emall Tracking system, de-signed to automatically report statistics on the flow behavior of malicious software delivered via email attachments both at a local and global level. MET can help reduce the spread of malicious software worldwide, especially self-replicating viruses, as well as provide further insight toward minimiz-ing damage caused by malicious programs in the future. In addition, the system can help system administrators detect all of the points of entry of a malicious email into a net-work. The core of MET's operation is a database of statis-tics about the trajectory of email attachments in and out of a network system, and the culling together of these statistics across networks to present a global view of the spread of the malicious software. From a statistical perspective sampling only a small amount of traffic (for example,.1%) of a very large email stream is sufficient to detect suspicious or other-wise new emall viruses that may be undetected by standard signature-based scanners. Therefore, relatively few MET installations would be necessary to gather sufficient data in order to provide broad protection services. Small scale simu-lations are presented to demonstrate MET in operation and suggests how detection of new virus propagations via flow statistics can be automated.
McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables
"... In this work, we propose Malware Collection Booster (McBoost), a fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches. Given a large collection of binaries that may contain both hitherto unknown malware and benign ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
(Show Context)
In this work, we propose Malware Collection Booster (McBoost), a fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches. Given a large collection of binaries that may contain both hitherto unknown malware and benign executables, McBoost reduces the overall time of analysis by classifying and filtering out the least suspicious binaries and passing only the most suspicious ones to a detailed binary analysis process for signature extraction. The McBoost framework consists of a classifier specialized in detecting whether an executable is packed or not, a universal unpacker based on dynamic binary analysis, and a classifier specialized in distinguishing between malicious or benign code. We developed a proof-of-concept version of McBoost and evaluated it on 5,586 malware and 2,258 benign programs. McBoost has an accuracy of 87.3%, and an Area Under the ROC curve (AUC) equal to 0.977. Our evaluation also shows that McBoost reduces the overall time of analysis to only a fraction (e.g., 13.4%) of the computation time that would otherwise be required to analyze large sets of mixed malicious and benign executables. 1
IMDS: Intelligent Malware Detection System
, 2007
"... The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based antivirus systems fail to detect polymorphic and new, previously unseen malicious executables. In this paper, resting on the analysis of Windows API execution sequences called ..."
Abstract
-
Cited by 22 (10 self)
- Add to MetaCart
The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based antivirus systems fail to detect polymorphic and new, previously unseen malicious executables. In this paper, resting on the analysis of Windows API execution sequences called by PE files, we develop the Intelligent Malware Detection System (IMDS) using Objective-Oriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA Fast FP-Growth algorithm is adapted to efficiently generate OOA rules for classification. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of King-Soft Corporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular anti-virus software such as Norton AntiVirus and McAfee VirusScan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques.