Anomaly Detection and Traffic Analysis

We approach this research from two perspectives. On the one hand, we analyze traffic in order to solve current problems in network communications and to improve
network performance. Solid knowledge of statistics, machine learning and data mining is required
to dig into real communications and infer what is happening there. On the other hand, network communications present high-complexity scenarios that challenge
the state of the art in analysis methodologies and therefore contribute to improving them and to a better understanding of their application.

Traffic Analysis

From the perspective of solving communication network problems, our traffic analysis work covers several challenging research aspects:

    • Data Preprocessing and Transformation

Success in solving problems from a data analysis perspective depends fundamentally on appropriate data preprocessing.
This aspect can be expressed by the following questions: How do we represent the scenario under analysis? Which features are important for solving our problem?
Hence, this is the stage that requires the most effort and time in the whole data analysis chain. In our specific application field the question becomes: How do we represent
network traffic?
The answer is necessarily tied to the problem we are trying to solve, i.e. what is the purpose of the analysis? Below we introduce some common traffic analysis applications.

Data preprocessing and transformation involves different strategies to summarize and gather relevant information. Some examples:
statistics, frequency analysis, data aggregation, meta-data analysis, entropy transformations, time series, univariate and multivariate analysis, etc.

Some tools that we currently use in this area are: Wireshark, Corsaro, tcpdump, Perl scripting, Python scripting, SiLK.
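
As an illustration of the entropy transformations and data aggregation mentioned above, the following minimal Python sketch groups packets into fixed time windows and computes the Shannon entropy of one header feature per window. The record layout (the "ts" and "dst_port" keys), the 60-second window and the sample packets are assumptions made for this example only; they do not describe our actual processing chain.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over all observed symbols.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_per_window(packets, window_seconds=60.0, feature="dst_port"):
    """Aggregate packets into fixed time windows; return {window_index: entropy}."""
    windows = defaultdict(list)
    for pkt in packets:
        windows[int(pkt["ts"] // window_seconds)].append(pkt[feature])
    return {w: shannon_entropy(vals) for w, vals in sorted(windows.items())}


# Hypothetical records: timestamp in seconds and destination port.
packets = [
    {"ts": 1.0, "dst_port": 80},
    {"ts": 5.0, "dst_port": 80},
    {"ts": 20.0, "dst_port": 443},
    {"ts": 30.0, "dst_port": 443},
    {"ts": 61.0, "dst_port": 22},
    {"ts": 62.0, "dst_port": 23},
    {"ts": 63.0, "dst_port": 25},
    {"ts": 64.0, "dst_port": 8080},
]

result = entropy_per_window(packets)
for window, entropy in result.items():
    print(f"window {window}: {entropy:.2f} bits")
```

In this toy data the second window spreads its packets over four different ports, so its entropy is higher than that of the first window, which uses only two. A sudden rise of this kind in destination-port entropy is one signal that is often associated with port scanning.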

    • Data and Meta-Data Visualization

Many indices, results, meta-data and graphs can be obtained from analyzing network traffic. An important part of the work is to find the best techniques to present such
outcomes in the most understandable way, avoiding misleading representations, facilitating a correct interpretation of the scenarios and supporting further inferences and conclusions.

    • Outlier Detection

One of the principal goals of traffic analysis is to detect and identify unusual events and anomalies that do not match what we consider “normal” traffic (in most contexts, “normal” means “legitimate”). This kind of study is called outlier analysis. We perform outlier analysis mainly for the following tasks:

      • The detection and identification of network attacks, e.g. scanners, hacking activities, viruses that propagate over the network, etc.
      • The detection and identification of failures and misconfigurations, and their sources.
      • The detection of covert channels. A covert channel is a mechanism for sending and receiving information between machines without alerting any firewalls or IDSs on the network. In our research we focus on covert channels embedded in IP and TCP header fields as well as covert timing channels.

Tools: MATLAB, Octave, RapidMiner, scripting (Python, Perl, C++)
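
To make the idea of outlier analysis concrete, here is a minimal Python sketch that flags unusual values in a series of per-window packet counts using the modified z-score, which is based on the median and the median absolute deviation (MAD). This is one simple, generic technique chosen for illustration and is not necessarily the method applied in our research; the function name, the 3.5 threshold and the sample counts are assumptions.

```python
import statistics


def mad_outliers(series, threshold=3.5):
    """Return the indices whose modified z-score exceeds `threshold`."""
    med = statistics.median(series)
    mad = statistics.median([abs(x - med) for x in series])
    if mad == 0:
        return []  # no spread at all: the score is undefined, report nothing
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # when the data are roughly normally distributed.
    return [
        i for i, x in enumerate(series)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical packet counts per time window; index 6 contains a burst.
counts = [120, 115, 130, 125, 118, 122, 900, 119, 127, 121]
flagged = mad_outliers(counts)
print(flagged)
```

The median and the MAD are used instead of the mean and standard deviation because a single large burst barely moves them, so the burst cannot hide itself by inflating the baseline.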

    • Pattern Recognition

If outlier detection is one side of the coin, pattern recognition is the other; they are complementary goals. In outlier detection we look for the anomaly, whereas in pattern recognition we want to define or shape what is normal. Therefore, we look for profiles, patterns, characteristics and/or models that represent traffic in the most distinctive and least ambiguous way. We pursue several aims:

      • Classification and characterization of Internet traffic.
      • Classification and characterization of Darkspace traffic (a.k.a. Internet Background Radiation).
      • Traffic monitoring and the shape of the Internet.

Tools: MATLAB, Octave, RapidMiner, scripting (Python, Perl, C++)
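
As a small illustration of building profiles that describe traffic classes, the Python sketch below computes one centroid per class from labeled flow features and assigns a new flow to the closest profile (a nearest-centroid classifier). The two features (mean packet size in bytes, mean inter-arrival time in milliseconds), the class labels and all sample values are hypothetical, and the technique is a deliberately simple stand-in for the richer models used in practice.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def fit_profiles(labeled_flows):
    """Map {label: [feature vectors]} to {label: centroid}."""
    return {label: centroid(rows) for label, rows in labeled_flows.items()}


def classify(flow, profiles):
    """Return the label whose profile is closest in Euclidean distance."""
    return min(profiles, key=lambda label: math.dist(flow, profiles[label]))


# Hypothetical training flows: (mean packet size, mean inter-arrival time).
training = {
    "web": [(800.0, 50.0), (900.0, 40.0), (850.0, 60.0)],
    "voip": [(160.0, 20.0), (170.0, 20.0), (150.0, 20.0)],
}

profiles = fit_profiles(training)
print(classify((200.0, 22.0), profiles))
print(classify((780.0, 45.0), profiles))
```

With real data the features would normally be normalized first, since otherwise the feature with the largest numeric range dominates the distance.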

Machine Learning

Our research is not only focused on applying statistics, machine learning and data mining tools to problems in anomaly detection and traffic classification;
we also aim to advance the state of the art of machine learning and data mining by exploring and studying existing theories and approaches in network communication scenarios. The complexity
and variability of network traffic data offer an excellent and demanding proving ground for any data analyst and machine learning expert.

In this respect, our research contributes to the study of supervised and unsupervised (clustering) classification and learning from different perspectives:

  • Dimensionality reduction and feature selection.
  • Comparison and analysis of proximity measures and similarity metrics.
  • Study of classification criteria and their effects on classification or clustering results.
  • Comparison of classification algorithms and approaches, and identification of their strengths in applied scenarios.
  • Study and comparison of validation techniques.
  • Design and implementation of test-beds, tools and evaluation environments for machine learning and data mining problems.
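
To show why the comparison of proximity measures matters, the following Python sketch applies Euclidean distance and cosine distance to the same toy vectors and obtains a different nearest neighbor with each. The vectors, their names and the helper functions are invented for this example.

```python
import math


def euclidean(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.dist(u, v)


def cosine_distance(u, v):
    """1 - cosine similarity; depends only on direction, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def nearest(query, candidates, measure):
    """Name of the candidate vector closest to `query` under `measure`."""
    return min(candidates, key=lambda name: measure(query, candidates[name]))


# Hypothetical feature vectors.
query = (1.0, 1.0)
candidates = {
    "a": (10.0, 10.0),  # same direction as the query, much larger magnitude
    "b": (2.0, 1.0),    # similar magnitude, different direction
}

by_euclidean = nearest(query, candidates, euclidean)
by_cosine = nearest(query, candidates, cosine_distance)
print(by_euclidean, by_cosine)
```

Euclidean distance selects "b" because it is close in magnitude, while cosine distance selects "a" because it points in the same direction. A clustering or classification built on top of either measure would therefore group the same data differently.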