MalPaCA Seq+ (repository)
Summary |
---|
🔍 An updated version of the MalPaCA algorithm, which builds a behavioral profile of a piece of software from its network flows to represent its actual capabilities. |
The MalPaCA algorithm is a novel, unsupervised clustering algorithm that builds a behavioral profile of a piece of software from its network flows to represent its actual capabilities. It takes one or more pcap files as input and then:
- splits them into uni-directional connections
- extracts four sequential features from each connection, namely the packet sizes (bytes), inter-arrival times (ms), source ports and destination ports
- computes, for each feature, the pairwise distance between all connections and stores the results in a separate distance matrix per feature
- combines the distance matrices using a simple weighted average, where all features have equal weights
- inputs the final distance matrix into the HDBSCAN clustering algorithm
- post-processes the final clusters and exports them as .csv files and as temporal heatmaps
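The first steps of this pipeline can be sketched in plain Python. This is only an illustration, not the repository's implementation: the packet tuple layout, the `(src_ip, dst_ip)` connection key, the leading `0.0` in the inter-arrival times and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical packet record (not the repository's data model):
# (timestamp_ms, src_ip, dst_ip, src_port, dst_port, size_bytes)


def split_connections(packets):
    """Group packets into uni-directional connections.

    Assumption: a connection is identified by (src_ip, dst_ip), so traffic
    from A to B and from B to A ends up in two separate connections.
    """
    connections = defaultdict(list)
    for packet in sorted(packets, key=lambda p: p[0]):
        connections[(packet[1], packet[2])].append(packet)
    return dict(connections)


def extract_features(connection):
    """Return the four sequential features of one connection."""
    times = [p[0] for p in connection]
    return {
        "sizes": [p[5] for p in connection],
        # Assumption: the first packet gets an inter-arrival time of 0.
        "gaps_ms": [0.0] + [t2 - t1 for t1, t2 in zip(times, times[1:])],
        "src_ports": [p[3] for p in connection],
        "dst_ports": [p[4] for p in connection],
    }


def combine_matrices(matrices, weights=None):
    """Weighted average of square distance matrices (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(matrices)] * len(matrices)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


packets = [
    (0.0, "10.0.0.1", "10.0.0.2", 5000, 80, 60),
    (10.0, "10.0.0.2", "10.0.0.1", 80, 5000, 1500),
    (25.0, "10.0.0.1", "10.0.0.2", 5000, 80, 40),
]
connections = split_connections(packets)
features = {key: extract_features(conn) for key, conn in connections.items()}
```

The combined matrix is what would then be handed to the clustering step, for example via the `hdbscan` package's `HDBSCAN(metric="precomputed")`, assuming that package is the one installed by the environment file.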
Compared with the original version, “MalPaCA Seq+” contains a number of improvements that either facilitate research into the impact of different sequence lengths on the clustering performance or make “MalPaCA” a more viable tool for cybersecurity research in general. In particular:
- The time needed for the pairwise distance computation (the third step above) was greatly reduced by switching the distance algorithms to versions that support the Numba JIT compiler.
- The clustering error metric introduced in the original article was further improved by automatically generating graphs that show the presumed correct and incorrect elements of a cluster.
- In addition to the temporal heatmaps, “MalPaCA” can now generate a variety of other graphs, such as:
  - transition graphs that reveal how different segments of the same connection are clustered differently in subsequent experiments
  - graphs detailing the make-up of each cluster in terms of label or application category, if such information is provided through prior network analysis with NFStream.
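To illustrate the Numba point above, here is a minimal sketch of a pairwise distance routine whose inner loop can be JIT-compiled. It is not the repository's code: Euclidean distance over equal-length sequences is used purely as a stand-in, since this README does not state which distance functions MalPaCA applies, and the function names are hypothetical. The sketch falls back to plain Python when Numba is not installed.

```python
import math

try:
    import numpy as np
    from numba import njit  # compiles the decorated function to machine code

    def to_seq(values):
        """Numba works best on typed NumPy arrays."""
        return np.asarray(values, dtype=np.float64)

except ImportError:  # fallback so the sketch still runs without Numba

    def njit(func):
        return func

    def to_seq(values):
        return [float(v) for v in values]


@njit
def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences.

    Stand-in distance for illustration only.
    """
    total = 0.0
    for i in range(len(a)):
        diff = a[i] - b[i]
        total += diff * diff
    return math.sqrt(total)


def pairwise(sequences):
    """Symmetric distance matrix; only the upper triangle is computed."""
    seqs = [to_seq(s) for s in sequences]
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = float(euclidean(seqs[i], seqs[j]))
            matrix[i][j] = matrix[j][i] = distance
    return matrix


distances = pairwise([[0, 0], [3, 4], [0, 0]])
```

The explicit index loop is deliberate: this style is slow in the plain interpreter but is the kind of code that Numba compiles efficiently.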
Features
With “MalPaCA Seq+”, the user can:
- run the upgraded MalPaCA algorithm on one or multiple pcap files
- run five different experiments to answer foundational questions about the influence of the sequence length on clustering performance, such as:
  - Experiment 1 - What sequence length taken from the start of a connection leads to the best clustering results?
  - Experiment 2 - Is there a difference in the clustering results depending on which part of a connection is being selected?
  - Experiment 3 - What is the effect of taking packets from the end of a connection and of skipping some packets?
  - Experiment 4 - What is the effect of breaking up one connection into multiple smaller connections of equal length?
  - Experiment 5 - What is the effect of defining behavior according to Netflow v5?
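The sequence selections behind Experiments 1 to 4 can be pictured as simple slicing operations on a connection's packet sequence. The helpers below are a hypothetical sketch, not the repository's implementation; their names, and the choice to drop an incomplete trailing chunk when splitting, are assumptions.

```python
def first_n(seq, n):
    """Experiment 1: the first n packets of a connection."""
    return seq[:n]


def window(seq, start, n):
    """Experiments 2 and 3: n packets starting at an offset.

    The same slice also models skipping the first `start` packets.
    """
    return seq[start:start + n]


def last_n(seq, n):
    """Experiment 3: the last n packets of a connection."""
    return seq[max(len(seq) - n, 0):]


def split_equal(seq, n):
    """Experiment 4: break a connection into chunks of exactly n packets.

    Assumption: a trailing remainder shorter than n is dropped.
    """
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]


packet_sizes = list(range(10))  # placeholder values standing in for packet sizes
start_part = first_n(packet_sizes, 3)
middle_part = window(packet_sizes, 4, 3)
end_part = last_n(packet_sizes, 3)
chunks = split_equal(packet_sizes, 4)
```

Each selected subsequence would then go through the same feature extraction and distance computation as a full connection.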
Tools
Purpose | Name |
---|---|
Programming language | Python |
Dependency manager | Anaconda |
Version control system | Git |
Clustering algorithm | HDBSCAN |
Graph library | Matplotlib |
Installation Process
To import this project and resolve its dependencies, you are assumed to have already installed Anaconda, Python and an IDE such as PyCharm, and to be running Windows.
Re-create the original MalPaCA environment from the environment.yml file with this command:
conda env create -f environment.yml
Activate the new environment:
conda activate MalPaCA
Lastly, check that the new environment was installed correctly:
conda env list
Contributors
The original author of “MalPaCA” was Azqa Nadeem and the original source code can be found here.
License
The original “MalPaCA” framework was published under the MIT license, which can be found in the LICENSE file.
If you use MalPaCA in a scientific work, consider citing the following paper:
@article{nadeembeyond,
title={Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families},
author={Nadeem, Azqa and Hammerschmidt, Christian and Ga{\~n}{\'a}n, Carlos H and Verwer, Sicco},
journal={Malware Analysis Using Artificial Intelligence and Deep Learning},
pages={381},
publisher={Springer}
}
References
The clustering result image in the logo was taken from the HDBSCAN website.