Internet Traffic Classification Demystified: The Myths, Caveats and Best Practices
- In Proc. ACM CoNEXT, 2008
Cited by 20 (1 self)
Recent research on Internet traffic classification algorithms has yielded a flurry of proposed approaches for distinguishing types of traffic, but no systematic comparison of the various algorithms. This fragmented approach to traffic classification research leaves the operational community with no basis for consensus on what approach to use when, and how to interpret results. In this work we critically revisit traffic classification by conducting a thorough evaluation of three classification approaches, based on transport layer ports, host behavior, and flow features. A strength of our work is the broad range of data against which we test the three classification approaches: seven traces with payload collected in Japan, Korea, and the US. The diverse geographic locations, link characteristics and application traffic mix in these data allowed us to evaluate the approaches under a wide variety of conditions. We analyze the advantages and limitations of each approach, evaluate methods to overcome the limitations, and extract insights and recommendations for both the study and practical application of traffic classification. We make our software, classifiers, and data available for researchers interested in validating or extending this work.
Offline/Online Traffic Classification Using Semi-Supervised Learning
- Perform. Eval., 2007
Cited by 16 (4 self)
Identifying and categorizing network traffic by application type is challenging because of the continued evolution of applications, especially of those with a desire to be undetectable. The diminished effectiveness of port-based identification and the overheads of deep packet inspection approaches motivate us to classify traffic by exploiting distinctive flow characteristics of applications when they communicate on a network. In this paper, we explore this latter approach and propose a semi-supervised classification method that can accommodate both known and unknown applications. To the best of our knowledge, this is the first work to use semi-supervised learning techniques for the traffic classification problem. Our approach allows classifiers to be designed from training data that consists of only a few labeled and many unlabeled flows. We consider pragmatic classification issues such as longevity of classifiers and the need for retraining of classifiers. Our performance evaluation using empirical Internet traffic traces that span a 6-month period shows that: 1) high flow and byte classification accuracy (i.e., greater than 90%) can be achieved using training data that consists of a small number of labeled and a large number of unlabeled flows; 2) the presence of “mice” and “elephant” flows in the Internet complicates the design of classifiers, especially of those with high byte accuracy, and necessitates the use of weighted sampling techniques to obtain training flows; and 3) retraining of classifiers is necessary only when there are non-transient changes in the network usage characteristics. As a proof of concept, we implement prototype offline and real-time classification systems to demonstrate the feasibility of our approach.
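The abstract above does not spell out how a few labeled flows and many unlabeled flows are combined. One plausible sketch, assumed here for illustration and not taken from the paper, is to cluster all flows first and then name each cluster by a majority vote of the labeled flows it contains, leaving clusters with no labeled flows as "unknown". The function names and the "unknown" marker are hypothetical, and the clustering step itself is assumed to have been done already.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, labels):
    """Map each cluster to the majority label of its labeled flows.

    cluster_ids[i] is the cluster assigned to flow i (by any clustering method).
    labels[i] is the application label of flow i, or None if it is unlabeled.
    Clusters that contain no labeled flow are mapped to "unknown".
    """
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cid][label] += 1

    mapping = {}
    for cid in set(cluster_ids):
        if votes[cid]:
            mapping[cid] = votes[cid].most_common(1)[0][0]
        else:
            mapping[cid] = "unknown"
    return mapping


def classify_flows(cluster_ids, mapping):
    """Assign every flow the label of the cluster it belongs to."""
    return [mapping[cid] for cid in cluster_ids]
```

In this sketch, only a handful of flows per cluster need labels, and a cluster that stays "unknown" is a natural place for a new or previously unseen application to show up.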
A Generic Language for Application-Specific Flow Sampling
Cited by 8 (0 self)
Flow records gathered by routers provide valuable coarse-granularity traffic information for several measurement-related network applications. However, due to high volumes of traffic, flow records need to be sampled before they are gathered. Current techniques for producing sampled flow records are either focused on selecting flows from which statistical estimates of traffic volume can be inferred, or have simplistic models for applications. Such sampled flow records are not suitable for many applications with more specific needs, such as ones that make decisions across flows. As a first step towards tailoring the sampling algorithm to an application’s needs, we design a generic language in which any particular application can express the classes of traffic of its interest. Our evaluation investigates the expressive power of our language, and whether flow records have sufficient information to enable sampling of records of relevance to applications. We use templates written in our custom language to instrument sampling tailored to three different applications—BLINC, Snort, and Bro. Our study, based on month-long datasets gathered at two different network locations, shows that by learning local traffic characteristics we can sample relevant flow records near-optimally with low false negatives in diverse applications.
A FLOW BASED APPROACH FOR SSH TRAFFIC DETECTION
Cited by 4 (2 self)
The basic objective of this work is to assess the utility of two supervised learning algorithms, AdaBoost and RIPPER, for classifying SSH traffic from log files without using features such as payload, IP addresses and source/destination ports. Pre-processing is applied to the traffic data to express it as traffic flows. Results of 10-fold cross validation for each learning algorithm indicate that a detection rate of 99% and a false positive rate of 0.7% can be achieved using RIPPER. Moreover, promising preliminary results were obtained when RIPPER was employed to identify which service was running over SSH. Thus, it is possible to detect SSH traffic with high accuracy without using features such as payload, IP addresses and source/destination ports, which is particularly useful when generic, scalable solutions are required.
A Machine Learning Approach for Efficient Traffic Classification
- In Proceedings of the IEEE MASCOTS, 2007
Cited by 4 (3 self)
Online traffic classification continues to be of long-term interest to the networking community. It serves as the input for practical solutions such as network monitoring, quality-of-service and intrusion detection. In this paper we present a machine-learning approach that accurately classifies Internet traffic using a C4.5 decision tree. Accuracy is not our only concern; the latency and throughput are also of extreme importance. Without inspecting packet payload, our method can identify traffic of different types of applications with 99.8% total accuracy, by collecting 12 features at the start of the flows.
Byte Me: The Case for Byte Accuracy in Traffic Classification
- In SIGMETRICS’07 MineNet Workshop, 2007
Cited by 2 (1 self)
Numerous network traffic classification approaches have recently been proposed. In general, these approaches have focused on correctly identifying a high percentage of total flows. However, on the Internet a small number of “elephant” flows contribute a significant amount of the traffic volume. In addition, some application types like Peer-to-Peer (P2P) and FTP contribute more elephant flows than other application types like Chat. In this opinion piece, we discuss how evaluating a classifier on flow accuracy alone can bias the classification results. By not giving special attention to these traffic classes and their elephant flows in the evaluation of traffic classification approaches, we might obtain significantly different performance when these approaches are deployed in operational networks for typical traffic classification tasks such as traffic shaping. We argue that byte accuracy must also be used when evaluating the accuracy of traffic classification algorithms.
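The gap between flow accuracy and byte accuracy described above can be illustrated with a small sketch. The `Flow` record, the function names, and the sample data are hypothetical and chosen only for illustration: nine small flows are classified correctly and one large "elephant" flow is misclassified.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    true_label: str       # ground-truth application
    predicted_label: str  # classifier output
    num_bytes: int        # bytes carried by the flow


def flow_accuracy(flows):
    """Fraction of flows whose predicted label matches the true label."""
    if not flows:
        return 0.0
    correct = sum(1 for f in flows if f.true_label == f.predicted_label)
    return correct / len(flows)


def byte_accuracy(flows):
    """Fraction of bytes that belong to correctly classified flows."""
    total = sum(f.num_bytes for f in flows)
    if total == 0:
        return 0.0
    correct = sum(f.num_bytes for f in flows
                  if f.true_label == f.predicted_label)
    return correct / total


# Hypothetical sample: nine 1,000-byte chat flows classified correctly,
# one 991,000-byte P2P flow misclassified as web traffic.
SAMPLE_FLOWS = [Flow("chat", "chat", 1_000) for _ in range(9)]
SAMPLE_FLOWS.append(Flow("p2p", "web", 991_000))
```

On this made-up sample the classifier gets 9 of 10 flows right but only 9,000 of 1,000,000 bytes, so a result that looks strong by flow count would perform poorly for a volume-sensitive task such as traffic shaping.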
Issues and Future Directions in Traffic Classification
Cited by 2 (0 self)
Traffic classification technology has increased in relevance this decade, as it is now used in the definition and implementation of mechanisms for service differentiation, network design and engineering, security, accounting, advertising, and research. Over the past 10 years the research community and the networking industry have investigated, proposed and developed several classification approaches. While traffic classification techniques are improving in accuracy and efficiency, the continued proliferation of different Internet application behaviors, in addition to growing incentives to disguise some applications to avoid filtering or blocking, are among the reasons that traffic classification remains one of many open problems in Internet research. In this article we review recent achievements and discuss future directions in traffic classification, along with their trade-offs in applicability, reliability, and privacy. We outline the persistently unsolved challenges in the field over the last decade, and suggest several strategies for tackling these challenges to promote progress in the science of Internet traffic classification.
Self-Learning Peer-to-Peer Traffic Classifier
Cited by 1 (0 self)
The popularity of a new generation of smart peer-to-peer applications has resulted in several new challenges for accurately classifying network traffic. In this paper, we propose a novel 2-stage p2p traffic classifier, called Self Learning Traffic Classifier (SLTC), that can accurately identify p2p traffic in high speed networks. The first stage separates p2p traffic from the rest of the network traffic, and the second stage automatically extracts application payload signatures to accurately identify the p2p application that generated the p2p flow. For the first stage, we propose a fast, light-weight algorithm called Time Correlation Metric (TCM) that exploits the temporal correlation of flows to clearly separate peer-to-peer (p2p) traffic from the rest of the traffic. Using real network traces from tier-1 ISPs that are located in different continents, we show that the detection rate of TCM is consistently above 95% while always keeping the false positives at 0%. For the second stage, we use the LASER signature extraction algorithm [17] to accurately identify signatures of several known and unknown p2p protocols with a very small false positive rate (< 1%). Using our prototype on tier-1 ISP traces, we demonstrate that SLTC automatically learns signatures for more than 95% of both known and unknown traffic within 3 minutes.
Per Flow Packet Sampling for High-Speed Network Monitoring
Cited by 1 (1 self)
We present a per-flow packet sampling method that enables the real-time classification of high-speed network traffic. Our method, based upon the partial sampling of each flow (i.e., performing sampling at only early stages in each flow’s lifetime), provides a sufficient reduction in total traffic (e.g., a factor of five in packets, a factor of ten in bytes) as to allow practical implementations at one Gigabit/s, and, using limited hardware assistance, ten Gigabit/s.
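The idea of sampling only the early stage of each flow can be sketched as keeping the first few packets per flow and dropping the rest. This is a minimal illustration under stated assumptions, not the authors' implementation: the cutoff `max_per_flow`, the use of an opaque flow key (for example a 5-tuple), and the absence of flow expiry are all choices made here for brevity.

```python
from collections import defaultdict


def sample_early_packets(packets, max_per_flow=5):
    """Keep only the first `max_per_flow` packets seen for each flow.

    `packets` is an iterable of (flow_key, size_in_bytes) pairs in arrival
    order; `flow_key` is any hashable flow identifier. The cutoff value is
    an assumption for illustration, not a figure from the abstract.
    """
    seen = defaultdict(int)  # packets observed so far, per flow
    sampled = []
    for flow_key, size in packets:
        if seen[flow_key] < max_per_flow:
            seen[flow_key] += 1
            sampled.append((flow_key, size))
    return sampled
```

For example, a trace with ten packets from one flow and two from another yields seven sampled packets with the default cutoff: long flows are truncated while short flows pass through whole.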
On Measuring the Similarity of Network Hosts: Pitfalls, New Metrics, and Empirical Analyses
Cited by 1 (1 self)
As the scope and scale of network data grow, security practitioners and network operators are increasingly turning to automated data analysis methods to extract meaningful information. Underpinning these methods are distance metrics that represent the similarity between two values or objects. In this paper, we argue that many of the obvious distance metrics used to measure behavioral similarity among network hosts fail to capture the semantic meaning imbued by network protocols. Furthermore, they also tend to ignore the long-term temporal structure of the objects being measured. To explore the role of these semantic and temporal characteristics, we develop a new behavioral distance metric for network hosts and compare its performance to a metric that ignores such information. Specifically, we propose semantically meaningful metrics for common data types found within network data, show how these metrics can be combined to treat network data as a unified metric space, and describe a temporal sequencing algorithm that captures long-term causal relationships. In doing so, we bring to light several challenges inherent in defining behavioral metrics for network data, and put forth a new way of approaching network data analysis problems. Our proposed metric is empirically evaluated on a dataset of over 30 million network flows, with results that underscore the utility of a holistic approach to network data analysis.

