0% found this document useful (0 votes)
34 views2 pages

eBPF-based ML Intrusion Detection System

This document discusses using eBPF to develop a flow-based network intrusion detection system (IDS) based on machine learning. It proposes using a decision tree model entirely within the eBPF kernel to classify packets as malicious or not based on the network flow context. An evaluation shows the eBPF implementation achieves over a 20% performance increase compared to the same solution as a userspace program, processing on average over 20,000 more packets per second due to avoiding data copying between kernel and userspace.

Uploaded by

Ak
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views2 pages

eBPF-based ML Intrusion Detection System

This document discusses using eBPF to develop a flow-based network intrusion detection system (IDS) based on machine learning. It proposes using a decision tree model entirely within the eBPF kernel to classify packets as malicious or not based on the network flow context. An evaluation shows the eBPF implementation achieves over a 20% performance increase compared to the same solution as a userspace program, processing on average over 20,000 more packets per second due to avoiding data copying between kernel and userspace.

Uploaded by

Ak
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

A flow-based IDS using Machine Learning in eBPF

Maximilian Bachl, Joachim Fabini, Tanja Zseby


Technische Universität Wien
[Link]@[Link]

Abstract—eBPF is a new technology which allows dynamically A drawback, however, is that because eBPF is verified, it
loading pieces of code into the Linux kernel. It can greatly speed can only use certain data structures. For example, an eBPF
up networking since it enables the kernel to process certain program cannot use normal C arrays since they allow out-
packets without the involvement of a userspace program. So far
arXiv:2102.09980v3 [[Link]] 4 Mar 2022

eBPF has been used for simple packet filtering applications such of-bounds accesses. For example, in an array of length 10,
as firewalls or Denial of Service protection. We show that it C would allow accessing the 15th element even though the
is possible to develop a flow-based network intrusion detection array only has 10 elements. Thus, eBPF programs make
system based on machine learning entirely in eBPF. Our solution use of special data structures which are safe. However, this
uses a decision tree and decides for each packet whether it can potentially be a performance penalty since checking the
is malicious or not, considering the entire previous context of
the network flow. We achieve a performance increase of over bounds of an array each time it is accessed requires extra work
20% compared to the same solution implemented as a userspace to be done by the CPU.
program. One alternative to using eBPF is using kernel modules.
However, a drawback of kernel modules is that they usually
I. I NTRODUCTION
cannot be verified for stability and that they have to be
A. eBPF compiled for a specific kernel version. Moreover, developing
eBPF is a technology which makes the Linux kernel pro- kernel modules is not straightforward and often it is not
grammable by enabling the injection of pieces of code at many possible to extend certain functionality in the kernel with a
locations of the kernel code. eBPF can be dynamically injected kernel module without changing the kernel itself. Changing
during runtime and is verified to make sure that it cannot the whole kernel itself makes it necessary to recompile the
crash and cannot get caught in infinite loops. However, this kernel, which is cumbersome.
verification is only possible for programs that are not turing-
complete. Thus eBPF programs cannot contain features such B. eBPF for a Machine Learning (ML)-based Intrusion De-
as loops of arbitrary length but instead loops must always tection System (IDS)
have a maximum number of iterations. Also, backward jumps
in the code are generally not allowed. This means eBPF can Some researchers have already investigated eBPF based
only be used to implement algorithms which do not require IDSs or thought about using ML in eBPF. [2] propose to use
turing-completeness. eBPF programs are usually written in C AI for detecting performance anomalies with eBPF. However,
and are first compiled to eBPF bytecode. Upon injection into they only propose a concept and do not provide an evaluation
the kernel, this eBPF bytecode is verified and dynamically of the benefit of their approach. [3], [4] and also [5] use eBPF
compiled to native code. to develop solutions against Denial-of-Service attacks. They
eBPF is especially suitable for packet processing: When a do not use ML.
packet arrives at a network interface, certain actions can be One challenge of using eBPF for ML is that it is not turing-
performed such as dropping the packet. This is useful for complete. However, ML algorithms such as decision trees
programs such as firewalls [1] or tcpdump, which records (DTs) or neural networks (NNs) do not require loops in the im-
packets according to certain filters. For example, if only plementation and thus don’t require turing-completeness and
packets coming from port 80 should be recorded, tcpdump can thus be implemented in eBPF. We decide to use DTs since
will compile an eBPF program which encodes this and will tree-based methods are a simple and effective ML method for
load it into the kernel. The kernel will then drop all packets IDSs [6]. As mentioned in the previous sections, eBPF data
which don’t match the filter and only pass the correct ones structures need some additional processing compared to the
to tcpdump. The alternative would be that tcpdump receives classic ones built into C. A question we want to answer in this
every packet and filters them itself. The drawback of this is that research is thus the following: Is eBPF faster than a solution
then each packet has to be passed from the kernel to tcpdump, implemented as a normal program in userspace? The userspace
which involves copying the whole packet in memory and also program has the disadvantage that all packets have to be passed
other computation steps. Thus, passing packets between the between the kernel and the program which is slow. The eBPF
kernel and programs should be avoided if possible because of program has the disadvantage that it makes use of potentially
performance reasons. eBPF allows one to do that. slower data structures. Thus, it is interesting to understand
Because eBPF bytecode is compiled to native code, it whether in practice eBPF can be faster even for complex
should generally be as fast as any other code in the kernel. programs which make use of data structures frequently.
II. E VALUATION socket and are processed by the IDS. The IDS is implemented
We envision an approach which keeps track of each network either as a classic userspace program or by using eBPF and run
flow and analyzes each packet in the context of the previous after one another (not at the same time). Both implementations
packets of the flow. For example, certain attacks could be are run for 10 s. Instead of deploying the IDS on an end host,
detected only when the fourth packet of the network flow it can also be deployed on a router or switch, which runs a
containing the attack arrives. eBPF has built-in hash tables recent Linux version and thus can run eBPF.
which allow storing information for each flow, which we use TABLE I: Maximum number of packets each implementation
for this purpose. As the key we use the classic five tuple can process (mean and standard deviation). Results averaged
of protocol type, source and destination IP and source and over 10 runs each.
destination port.
As features we use the source and destination port, the Userspace eBPF
protocol identifier (UDP, TCP, ICMP etc.), the packet length, Mean SD Mean SD

the time since the last packet of the flow and the direction packets/s 125 420 2627 152 274 1201
of the packet (from sender to receiver or vice versa) because
these features have proven useful for network traffic analysis Table I shows that the userspace implementation inspects
[6]. Additionally, we include the mean of the packet size, the fewer packets per second (125420) than the eBPF implemen-
time since the last packet and the direction for all packets tation (152274). The eBPF implementation is thus more than
received in the flow so far. Furthermore, also the mean absolute 20% faster than the userspace one.
deviation is computed for these three features, since the
standard deviation cannot be computed due to the absence of III. D ISCUSSION
advanced arithmetic operations in eBPF such as the square root An interesting consideration is that for complex ML models,
operation. A problem is that eBPF doesn’t allow floating point at some point the overhead from using eBPF’s special data
operations but only integers. We thus implement the decision structures becomes larger than the benefit that eBPF confers.
tree using fixed point arithmetic using 64 bit signed integers. For example, interesting future work would be to test whether
We use 16 bits for the part after the fixed point. random forests (RFs) or deep NNs can also achieve a perfor-
We use the popular CIC-IDS-2017 dataset [7]. We train mance advantage when implemented in eBPF. In this work,
the decision tree using scikit-learn with a maximum depth however, we could show that for a simple decision tree model,
of 10 and a maximum number of leaves of 1000 using a eBPF provides a significant performance advantage.
train/test split of 2:1, which achieves an accuracy of 99% on
R EFERENCES
the testing dataset after training. This accuracy is comparable
[1] F. Risso and M. Tumolo, “Towards a faster Iptables in eBPF,”
to the accuracy achieved in related work [8].
2018.
To enable reproducibility and to encourage further experi- [2] I. Ben-Yair, P. Rogovoy, and N. Zaidenberg, “AI & eBPF based
mentation by other researchers, we make all the source code performance anomaly detection system,” in SYSTOR, ACM, May
and other materials of this work publicly available 1 . 2019.
We implement the same IDS in userspace and also in eBPF [3] H. M. Demoulin, I. Pedisich, N. Vasilakis, V. Liu, B. T. Loo, and
L. T. X. Phan, “Detecting Asymmetric Application-layer Denial-
and use the previously trained DT. The code is completely
of-Service Attacks In-Flight with FineLame,” USENIX, 2019.
equivalent except that data structures are different because, [4] H. van Wieren, “Signature-Based DDoS Attack Mitigation: Au-
as mentioned previously, many normal data structures are tomated Generating Rules for Extended Berkeley Packet Filter
not usable in eBPF. Also, some data structures that eBPF and Express Data Path,” 2019.
has, such as hash maps, are not available in a normal C [5] Y. Choe, J.-S. Shin, S. Lee, and J. Kim, “eBPF/XDP Based
Network Traffic Visualization and DoS Mitigation for Intelligent
userspace program by default and thus we use a simple hash
Service Protection,” in Advances in Internet, Data and Web
map implementation from the Linux kernel for the userspace Technologies, Springer, 2020.
version. [6] F. Iglesias, D. C. Ferreira, G. Vormayr, M. Bachl, and T. Zseby,
We emulate a network using Linux network name spaces. “NTARC: A Data Model for the Systematic Review of Network
For this, we have one server and one client, connected via a Traffic Analysis Research,” Applied Sciences, Jan. 2020.
[7] I. Sharafaldin, A. Habibi Lashkari, and A. A. Ghorbani, “Toward
switch and configure the links to have no delay in addition
Generating a New Intrusion Detection Dataset and Intrusion
to the delay caused by the Linux kernel for forwarding the Traffic Characterization,” in ICISSP, SCITEPRESS, 2018.
packets. Furthermore, we set the maximum link speed to be [8] A. Hartl, M. Bachl, J. Fabini, and T. Zseby, “Explainability
unlimited. The server and the client are connected using iPerf. and Adversarial Robustness for RNNs,” in BigDataService 2020,
Since the link speed is unlimited, iPerf will send as fast as IEEE, Apr. 2020.
possible. The maximum speed is only limited by how fast the
computer can process the packets. The IDS is deployed by
opening a raw socket on the network interface of the server. All
packets passing through the server thus pass through the raw
1 [Link]

Common questions

Powered by AI

eBPF improves packet processing performance by allowing certain packet actions, such as filtering, to occur within the Linux kernel itself, avoiding the overhead of passing packets between the kernel and userspace programs. This involves copying packets in memory, which is inefficient. By processing packets in-kernel with eBPF, eBPF programs can achieve performance typically as fast as native kernel code, providing a more efficient alternative to userspace, as demonstrated by a 20% increase in packet processing speed in comparisons with equivalent userspace solutions .

eBPF's verification imposes limitations such as the inability to use certain data structures like normal C arrays due to possible out-of-bounds accesses. eBPF programs also cannot contain features requiring turing-completeness, like loops of arbitrary length, affecting the implementation of complex algorithms. These constraints mean eBPF is less suitable for programs requiring advanced data structures or operations, potentially offsetting its performance benefits for overly complex models like deep neural networks, where the overhead from using eBPF's safer but slower data structures may outweigh the in-kernel processing advantage .

The use of eBPF's special data structures presents trade-offs between performance and safety. While these structures are designed to be safe and prevent issues like out-of-bounds access—unlike traditional C arrays—this safety typically comes with performance penalties due to necessary bound-checking operations. However, when using eBPF for applications like packet processing in IDS, the performance loss is often offset by the gains from avoiding kernel-userspace context switching. Thus, while eBPF structures might incur additional processing overhead, they contribute to overall system robustness, highlighting a trade-off where safety is prioritized at some cost to raw performance .

Decision trees are suitable for implementing a Machine Learning-based IDS in eBPF because they do not require complex loop constructs or turing-completeness, which are restricted in eBPF. eBPF's capabilities allow for integer-based decision-making without loops of arbitrary lengths, which aligns well with the operation of decision trees. Additionally, decision trees are effective for IDS applications and can be efficiently implemented using eBPF's data structures, such as hash maps, achieving performance improvements over userspace implementations .

Using the CIC-IDS-2017 dataset for training the decision tree in the eBPF-based IDS significantly impacts the system's effectiveness by providing a comprehensive and up-to-date representation of various network attacks. This dataset allows for a detailed training regime, ensuring that the IDS achieves high accuracy—99% on the test dataset—comparable to benchmarks in related studies. The presence of diverse attack patterns in the dataset enhances the decision tree's ability to detect a wide range of anomalies, thereby improving the IDS's precision and reliability in real-world applications .

Deploying an eBPF-based IDS on existing network infrastructure like routers and switches presents practical advantages, assuming these devices run recent Linux versions supporting eBPF. This approach leverages in-kernel processing capabilities for real-time intrusion detection, minimizing latency by keeping processing within the kernel. It avoids the need for additional hardware dedicated to monitoring, thereby reducing costs. However, network operators must consider compatibility with existing systems, as older infrastructure may not support necessary eBPF functionalities without kernel upgrades, thus involving potential operational disruptions .

eBPF enables effective network flow tracking by utilizing in-kernel hash tables to store metadata for each flow, allowing real-time analysis against pre-determined criteria. Using the classic five-tuple (protocol type, source and destination IP, and source and destination port), the IDS tracks packet context across a flow, enabling detection of complex patterns such as attacks that manifest only in certain sequences. This capability facilitates a comprehensive evaluation of network flow characteristics in real-time, which is critical for precisely identifying suspicious activities as they develop .

Implementing a flow-based IDS in eBPF offers significant advantages over userspace programs chiefly due to in-kernel packet processing that reduces the latency caused by transferring packets between the kernel and userspace. In eBPF, packet inspection can harness efficient in-kernel data structures like hash tables, allowing real-time analysis and responses while maintaining context across the traffic flow. Such efficiency is evidenced by the system's ability to inspect up to 152,274 packets per second—over 20% more than userspace programs—without the overhead of memory copying between kernel and userspace processes .

The inability of eBPF to perform floating-point operations poses challenges in situations where precision is required, such as calculating averages or deviations. This limitation was addressed by using fixed-point arithmetic with 64-bit signed integers, dedicating 16 bits for the fractional part. This approach allows for necessary calculations while adhering to eBPF restrictions. For instance, in the IDS decision tree implementation, mean absolute deviation was used instead of standard deviation, avoiding advanced arithmetic operations like square root .

The decision tree's architecture, with a maximum depth of 10 and up to 1000 leaves, contributes to the IDS's performance by balancing complexity and efficiency. These parameters ensure that the decision tree is sufficiently robust to capture intricate decision boundaries while remaining computationally manageable within eBPF's constraints. A deeper or more complex tree could impose excessive computational overhead, negating eBPF's performance benefits, while a simpler tree might underfit the data. This well-chosen architecture enables the IDS to achieve 99% accuracy on the testing dataset, effectively identifying network threats without overwhelming the kernel's processing capabilities .

You might also like