WEBVTT

00:00.000 --> 00:10.200
My name is Alfredo Cardigliano, I work for ntop.

00:10.200 --> 00:15.840
And today I want to talk about SuperNICs, which is this new family of network adapters, actually

00:15.840 --> 00:25.000
it's not really new, but they have invented this new buzzword for this family of adapters.

00:25.000 --> 00:30.680
So my question is, are those adapters really game changers?

00:30.680 --> 00:38.360
So are they really introducing, at least from the performance point of view, are they

00:38.360 --> 00:43.280
introducing a huge improvement with respect to other adapters or not?

00:43.280 --> 00:47.440
Of course, I will focus on the field of network monitoring, traffic analysis, because we

00:47.440 --> 00:51.080
do network monitoring at ntop.

00:51.080 --> 00:54.200
Let's recap what is a NIC, a SmartNIC, and a SuperNIC.

00:54.200 --> 01:00.760
So a NIC is a standard adapter that you have in a standard server, so they usually provide

01:00.760 --> 01:07.080
just data transmission, so Rx and Tx, and over the years, they introduced a few features

01:07.080 --> 01:15.480
to improve the performance, like RSS, receive-side scaling, a feature

01:15.480 --> 01:21.920
for load balancing that leverages multiple cores, and other features like filtering, etc.

01:21.920 --> 01:29.320
And they are available on most adapters today, especially on Intel, Mellanox.

01:29.320 --> 01:37.640
And then a new family of adapters has been introduced, which is the SmartNICs, so SmartNICs

01:37.640 --> 01:43.240
are more advanced than standard NICs, they are usually specialized in some network activities

01:43.240 --> 01:48.200
like packet capture for traffic monitoring.

01:48.200 --> 01:54.760
Most of the time they are based on FPGAs, and they provide enhanced features like packet

01:54.760 --> 02:02.480
parsing, load balancing, like RSS but more advanced, packet filtering with more rules,

02:02.480 --> 02:07.760
more flexible, and some programmability, for instance, if you have a Napatech adapter, you

02:07.760 --> 02:13.200
can use NTPL, for instance, to program the FPGA to do something.

02:13.200 --> 02:17.520
And then, for instance, they usually optimize data transfer, so they don't transfer single

02:17.520 --> 02:22.320
packets on the bus, but they transfer chunks of data.

02:22.320 --> 02:27.080
Then they introduced this new family of SuperNICs.

02:27.080 --> 02:34.560
So they are supposed to be more advanced than SmartNICs, and they have a lot of nice features

02:34.560 --> 02:40.520
like hardware accelerators for specific tasks like networking, of course, encryption,

02:40.520 --> 02:47.000
compression, storage, so they are directly connected to NVMe drives, or to GPUs,

02:47.000 --> 02:51.000
to accelerate AI.

02:51.000 --> 03:00.880
They usually also have a number of CPUs, like on NVIDIA adapters, BlueField, there is an Arm

03:00.880 --> 03:10.040
CPU, and the same for others like the AMD Pensando, and a lot of other marketing stuff that

03:10.040 --> 03:14.520
changes depending on the vendor.

03:14.520 --> 03:18.680
So let's focus on the BlueField-3, which is probably the latest one.

03:18.680 --> 03:29.240
This includes an Arm CPU onboard with 16 cores, DDR5 memory, up to 400 gigabit connectivity, PCIe Gen5,

03:29.240 --> 03:35.520
many other accelerators, all those mentioned on the previous slide, and this is programmable using

03:35.520 --> 03:39.280
an SDK, which is DOCA.

03:39.280 --> 03:46.880
And this exposes network interfaces using a ConnectX interface, and also internally

03:46.880 --> 03:51.040
it is using the ConnectX-7.

03:51.040 --> 03:55.680
You can use this adapter in NIC mode, which is as a standard NIC.

03:55.680 --> 04:01.920
So you just see on your host a ConnectX interface, and you use it as a standard network

04:01.920 --> 04:06.200
interface, or you can use this adapter in DPU mode.

04:06.200 --> 04:12.560
It means you can use the Arm CPU, which is embedded into the adapter, to process the traffic in

04:12.560 --> 04:14.040
the adapter.

04:14.040 --> 04:18.880
There is also the hardware side, which is the accelerated part, with a bunch of accelerators

04:18.880 --> 04:24.960
for networking, or for other stuff, like compression, encryption, etc.

04:24.960 --> 04:29.640
And they expose the interface using ConnectX.

04:29.640 --> 04:35.880
So with an SDK, you can program the adapter, running your software on the Arm CPU, and process

04:35.880 --> 04:39.200
the traffic.

04:39.200 --> 04:43.240
So I said that this adapter is programmable using DOCA.

04:43.240 --> 04:49.960
DOCA is a bunch of libraries that can be used to program the accelerators on the adapter,

04:49.960 --> 04:55.520
and it is divided into libraries, one per accelerator.

04:55.520 --> 05:00.960
For instance, if you want to accelerate networking, you use DOCA Flow.

05:00.960 --> 05:10.400
And this is supported both on BlueField, if you want to use the DPU, and some of the components

05:10.400 --> 05:18.040
are also available on standard, plain ConnectX adapters.

05:18.040 --> 05:24.960
And this DOCA Flow SDK provides APIs to build pipelines for processing traffic.

05:24.960 --> 05:29.560
So you can build pipes that do something, I will show you what.

05:29.560 --> 05:34.120
And then you can chain those pipes into a pipeline.

05:34.120 --> 05:39.640
So this is an example, for instance, you can receive traffic that goes into a root pipe,

05:39.640 --> 05:44.720
and then from the root pipe, you send the packets to other pipes.

05:44.720 --> 05:50.680
For instance, you can add a pipe A, which is matching this traffic.

05:50.680 --> 05:58.600
If the packets match some entries in this pipe, then this traffic is sent to another pipe

05:58.600 --> 06:03.480
and if it's not matching any entries, to another pipe C.

06:03.480 --> 06:10.520
And this traffic then can be sent to a physical interface or to an application using RSS.

06:10.520 --> 06:18.920
So each pipe consists of a few entries that you can program into the pipe to match the traffic.

06:18.920 --> 06:26.840
Then if the packet matches an entry, you can run some actions, like you can monitor,

06:26.840 --> 06:32.760
so count, or meter, that is rate-limiting the traffic, you can modify the packets,

06:32.760 --> 06:40.760
or you can forward the packets to other pipes, or again, network interfaces, applications, etc.
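
NOTE
Editorial sketch, not part of the talk: a toy Python model of the match/action pipes just described.
Pipe, Entry and run_pipeline are hypothetical names chosen for illustration; this is not the DOCA Flow API.
A miss falls through to the pipe's miss target, and a name that is not a pipe is treated as a terminal.
```python
# Toy model only: Pipe, Entry and run_pipeline are hypothetical, not DOCA Flow APIs.
from dataclasses import dataclass, field
from typing import Callable
@dataclass
class Entry:
    match: Callable[[dict], bool]      # predicate over parsed packet fields
    forward: str                       # next pipe name, or a terminal target
@dataclass
class Pipe:
    name: str
    miss: str                          # where packets go when no entry matches
    entries: list = field(default_factory=list)
    hits: int = 0                      # "monitor" action: count matched packets
    def process(self, pkt: dict) -> str:
        for entry in self.entries:
            if entry.match(pkt):
                self.hits += 1
                return entry.forward
        return self.miss
def run_pipeline(pipes: dict, pkt: dict, start: str = "root") -> str:
    target = start
    while target in pipes:             # stop at a terminal such as "port0" or "rss"
        target = pipes[target].process(pkt)
    return target
pipes = {
    "root": Pipe("root", miss="drop", entries=[Entry(lambda p: p["l3"] == "ipv4", "A")]),
    "A": Pipe("A", miss="C", entries=[Entry(lambda p: p["dport"] == 443, "B")]),
    "B": Pipe("B", miss="port0"),
    "C": Pipe("C", miss="rss"),
}
print(run_pipeline(pipes, {"l3": "ipv4", "dport": 443}))  # hit in A, goes via B
print(run_pipeline(pipes, {"l3": "ipv4", "dport": 53}))   # miss in A, goes via C
```
In this model a hit in pipe A ends at a physical port via pipe B, and a miss ends at an RSS queue via pipe C.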

06:40.760 --> 06:45.400
So the question is, are SuperNICs better than SmartNICs, from our point of view,

06:45.400 --> 06:49.000
so for network monitoring.

06:49.000 --> 06:53.400
Okay, probably this is the old version of the slides, anyway.

06:53.400 --> 06:59.400
So let's focus on the network monitoring use case.

06:59.400 --> 07:03.160
In network monitoring over the years, we have improved the performance of the applications

07:03.160 --> 07:07.960
processing traffic with several techniques, like kernel bypass,

07:07.960 --> 07:13.320
for instance, PF_RING or DPDK, implementing application-level DMA to accelerate the packet

07:13.320 --> 07:19.240
capture, or by load balancing the traffic to multiple cores using RSS, for instance,

07:19.240 --> 07:22.680
or using other offloads.

07:22.680 --> 07:28.920
But the link speed keeps growing, and the CPU is not growing as fast as the link speed,

07:28.920 --> 07:33.800
so we need to invent new techniques for accelerating the traffic analysis.

07:36.520 --> 07:40.520
A typical activity that a network monitoring application is doing

07:40.520 --> 07:46.680
is keeping the status of the network communications, usually in a flow table,

07:46.680 --> 07:51.720
which is a data structure in memory where for each flow you keep the status of the flows

07:51.720 --> 07:57.480
that can include counters, statistics, like packets and bytes, but also layer 7 information,

07:57.480 --> 08:04.120
for instance, if you run deep packet inspection, you keep information like the HTTP URL,

08:04.120 --> 08:07.880
or the SIP VoIP caller, or other information.
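
NOTE
Editorial sketch, not part of the talk: one possible shape for the in-memory flow table just described.
FlowKey, FlowState and update are hypothetical names; the field layout is an assumption for illustration.
```python
# Illustrative only: FlowKey, FlowState and update are hypothetical names.
from dataclasses import dataclass, field
from typing import NamedTuple
class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
@dataclass
class FlowState:
    packets: int = 0
    byte_count: int = 0
    l7: dict = field(default_factory=dict)   # e.g. HTTP URL or SIP caller from DPI
flow_table: dict = {}
def update(key: FlowKey, length: int) -> FlowState:
    # Create the flow on its first packet, then bump the counters.
    state = flow_table.setdefault(key, FlowState())
    state.packets += 1
    state.byte_count += length
    return state
key = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6)
update(key, 60)
update(key, 1500).l7["http_url"] = "/index.html"
print(flow_table[key])
```
The counters are the part that hardware can maintain; the l7 dictionary is the part that only software DPI can fill in.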

08:13.000 --> 08:19.800
Now, so we were looking for a way to further accelerate the application performance,

08:19.800 --> 08:23.320
by accelerating this flow table.

08:23.320 --> 08:29.160
So, Napatech, for instance, or other SmartNICs, actually,

08:29.160 --> 08:32.280
Napatech is the only one, probably, today that supports this,

08:33.480 --> 08:36.600
introduced the idea of the flow manager.

08:36.600 --> 08:41.000
So, a way to keep track of all network communications in hardware, in the adapter.

08:45.000 --> 08:49.320
Of course, you cannot offload everything, because if you're running DPI,

08:49.320 --> 08:53.480
you need to still inspect the packet payload in software.

08:53.480 --> 09:00.760
So, you need to accelerate the software with the combination of hardware and software.

09:00.760 --> 09:05.960
So, you usually keep the status of the flows in hardware, but at the beginning of the

09:05.960 --> 09:11.080
communication, you inspect all the single packets in software and then you offload the flow.

09:12.840 --> 09:17.400
So, how it works. So, you capture the packet in the application, you parse the packet

09:18.360 --> 09:23.800
so you get the flow key, and if you have DPI, you run DPI on the payload,

09:23.800 --> 09:29.400
and when it's time to offload this flow, that can be at the first packet, if you don't have

09:29.400 --> 09:35.160
DPI or something similar, or it can be after a few packets, once DPI is done,

09:35.160 --> 09:37.800
you create a new entry and you assign it to the hardware.
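
NOTE
Editorial sketch, not part of the talk: the offload decision described above, in plain Python.
HwTable stands in for the adapter's flow table and DPI_PACKETS is an assumed threshold; no real driver API is used.
```python
# Sketch under assumptions: HwTable, on_packet and DPI_PACKETS are hypothetical.
class HwTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = set()
    def add(self, key) -> bool:
        if len(self.entries) >= self.capacity:
            return False               # table full: flow stays in software
        self.entries.add(key)
        return True
DPI_PACKETS = 3                        # assumed number of packets DPI needs
def on_packet(key, hw: HwTable, seen: dict, use_dpi: bool) -> str:
    if key in hw.entries:
        return "hardware"              # offloaded flows no longer reach the application
    seen[key] = seen.get(key, 0) + 1   # software inspects this packet
    dpi_done = (not use_dpi) or seen[key] >= DPI_PACKETS
    if dpi_done and hw.add(key):
        return "offloaded"
    return "software"
hw, seen = HwTable(capacity=2), {}
key = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)
print([on_packet(key, hw, seen, use_dpi=True) for _ in range(5)])
```
Without DPI the flow is offloaded on its first packet; with DPI it is offloaded once the assumed packet threshold is reached.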

09:37.800 --> 09:45.400
Then you periodically check the statistics in the adapter if you want, for instance,

09:45.400 --> 09:55.560
to export packets and bytes for the flow. So, our question was, can we use BlueField,

09:55.560 --> 10:02.520
and this DOCA Flow to do the same, to offload stateful packet processing to the adapter?

10:03.960 --> 10:10.840
And it seems that DOCA Flow CT, which is an extension of DOCA Flow, is a good fit for this,

10:10.840 --> 10:16.200
because it's actually a connection tracker used for routing, for instance, in the adapter.

10:17.240 --> 10:22.920
So, we decided to build this proof of concept to test DOCA Flow, the features and the performance,

10:22.920 --> 10:29.640
to see if it's really working as expected. So, DOCA Flow CT, in essence, provides all the ingredients

10:29.640 --> 10:35.800
we need: it provides the ability to create entries in the pipe that we've seen before,

10:35.880 --> 10:41.720
using the five-tuple, and so the ability to add an entry, remove an entry, update an entry,

10:41.720 --> 10:47.320
get the statistics, when needed, and handle the flow aging. So, when a flow expires,

10:47.320 --> 10:54.040
you can automatically remove it from the adapter. And this is available both when you run on the

10:54.040 --> 11:01.480
BlueField DPU, or when you are using the BlueField as a network adapter, or even

11:01.480 --> 11:12.440
on standard ConnectX adapters. Then we were faced with a few problems, let's say, so the documentation

11:12.440 --> 11:18.040
is, let's say, missing some details, also the online documentation. So, it was not really that

11:18.040 --> 11:25.720
simple to implement such a proof of concept, especially also because the API changes from

11:25.800 --> 11:30.760
version to version, and from adapter to adapter. So, they add or remove features on different

11:30.760 --> 11:37.720
adapter models. Even the examples didn't help too much because they are designed to be

11:37.720 --> 11:43.960
really small examples, like capturing one or two packets, without taking care of releasing the mbufs,

11:43.960 --> 11:52.440
for instance, and stuff like that. So, I expected to see something working to run just a proof of

11:52.440 --> 11:59.640
concept, but I had to rewrite it from scratch. Even configuring adapters, it's sometimes

11:59.640 --> 12:04.360
complicated, you follow the instructions in the documentation, but it doesn't work, so you have to figure

12:04.360 --> 12:09.720
out how it works. But, when everything works, you have a lot of fun because it's really flexible,

12:09.720 --> 12:16.200
and, so, we came up with this tool, which is [tool name unclear], available on GitHub.

12:17.080 --> 12:22.600
Basically, okay, there is a link at the end of the presentation with the

12:22.600 --> 12:28.760
source code if you want to check it. This is implementing the flow offload that I described

12:28.760 --> 12:34.360
using DOCA Flow CT. It's also implementing a software flow table, which is synchronized with

12:34.360 --> 12:40.520
the flow table on the adapter for several reasons. For instance, of course, you want to keep track

12:40.600 --> 12:47.000
of the flows offloaded to get the statistics. Actually, there are also other reasons,

12:47.000 --> 12:53.560
more on the implementation side, but I don't want to bother you with this, but you also need a shadow

12:53.560 --> 12:58.200
flow table because you want to keep the DPI information that you cannot store in the adapter.

12:59.480 --> 13:05.400
This tool is also able to export the flows periodically, and it can also work inline, so you can

13:05.400 --> 13:12.600
configure the adapter, there is an option, to forward the traffic. Of course, it provides

13:12.600 --> 13:17.400
analysis statistics because it is mainly a benchmarking tool for us, because we wanted to

13:17.400 --> 13:22.920
prove that this adapter is fast enough for what we have to do. So, this is the pipeline

13:23.480 --> 13:29.080
implemented by this application. This small application is a one file application, so the example

13:29.080 --> 13:34.760
that I was looking for, in essence, which is a good example if you want to start using the adapter

13:34.760 --> 13:42.600
for doing this, so you can take it and extend it. So, in essence, the main pipe

13:43.640 --> 13:49.160
created by this tool is this connection tracking pipe, which is keeping track of all the flows.

13:49.160 --> 13:56.760
So, when a packet coming in hits an entry, it is just discarded if you are passive, or

13:56.840 --> 14:03.720
it is forwarded to the egress port if it is inline. If there is no hit, this packet is sent

14:03.720 --> 14:11.080
through an RSS pipe to the application using an RSS queue. So, the application can run DPI on this

14:11.080 --> 14:18.120
and when it is done with DPI, it adds an entry in the connection tracker to offload this flow. So, you will not

14:18.440 --> 14:27.000
see other packets for the same flow. This is the test bed that we used. So, we have a NIC using our

14:27.000 --> 14:33.480
pfsend to send line-rate traffic, and then on the other side we have this BlueField with [tool name unclear],

14:33.480 --> 14:39.880
and we ran this both on the DPU and on the host side, where we are using the BlueField in NIC mode, as a

14:39.880 --> 14:48.280
standard ConnectX adapter. So, the results: with this adapter, in our tests, the maximum we have been

14:48.280 --> 14:54.440
able to configure is 2 million entries in the connection tracker. So, it is able to keep up to 2 million

14:54.440 --> 15:02.360
flows. Our traffic generator in this specific test bed was able to generate 100 gigabit

15:02.360 --> 15:10.840
with [unclear]-byte packets, or [unclear] gigabit with 60-byte packets. So, we ran the test

15:10.840 --> 15:17.160
to measure the performance of the adapter and this adapter is able to forward all the traffic

15:17.800 --> 15:23.160
at full rate in the pipe, in the hardware path. So, no bottleneck there. Of course, if you

15:24.440 --> 15:28.920
follow the slow path, so if the packet is sent to the application, because at least the first

15:29.080 --> 15:35.560
packet you have to process in the application, then you have to offload it. Then of course, the

15:35.560 --> 15:44.840
Arm CPU or the Intel CPU becomes the bottleneck, and we have seen that you are able to create

15:45.640 --> 15:55.240
up to 2.5 million flows per second in the adapter. So, when you hit this creation rate

15:56.200 --> 16:02.920
so when you have more than 2.5 million new flows per second, of course,

16:02.920 --> 16:10.840
you hit some bottleneck and you start dropping packets. Also, if you exceed 2 million flows

16:10.840 --> 16:20.520
total, everything goes to the CPU and you start dropping packets. So, not everything, but what is exceeding

16:21.480 --> 16:28.200
2 million. So, last slide, SmartNIC versus SuperNIC. If you compare SmartNIC with

16:28.200 --> 16:35.080
SuperNIC for this specific case, on SmartNIC you can offload up to 140 million flows if you consider

16:35.080 --> 16:42.200
Napatech, and there is a flow creation rate which is 1.5 million flows per second using one

16:42.200 --> 16:47.720
stream; if you use multiple streams this increases to 3 million, and you can run selected actions.

16:47.720 --> 16:54.760
So, what is possible using NTPL. On SuperNIC, on BlueField, you can offload up to 2 million

16:54.760 --> 17:01.000
flows. The creation rate is more or less what you have on SmartNICs, and here we have,

17:01.000 --> 17:07.000
let's say, full programmability. So, the main difference here is the capacity of the flow table

17:07.000 --> 17:14.840
which is 2 million versus 140 million flows. Again, this is the limit I hit

17:14.840 --> 17:19.400
configuring the adapter, probably there is a way to increase this limit, I don't know.

17:21.800 --> 17:29.560
So, conclusion, the pros of SuperNICs are for sure high programmability, high performance, and the ability to

17:29.560 --> 17:35.320
offload the whole application to the CPU cores. So, you can really run everything in the adapter without

17:35.320 --> 17:41.960
using the server. The cons are the number of sessions, which is lower with respect to SmartNICs,

17:41.960 --> 17:49.560
and, as I said, the configuration is not really simple, you have to really struggle on that.

17:52.280 --> 17:57.240
Also, if the pipeline is not enough for keeping all the sessions, you go through the

17:57.240 --> 18:04.280
slow path, which can be a bottleneck, and the main issue for me is that this is really tied to

18:04.280 --> 18:08.120
DPDK. So, you have to use DPDK to use this, you cannot use other frameworks.

18:08.920 --> 18:13.480
The source code is available on GitHub. So, if you want to take a look, use it,

18:13.480 --> 18:29.960
or contribute, you are welcome. Thank you. Any questions?

18:29.960 --> 18:46.360
Hi, thank you for the presentation. So, on the slide where you talked about the experience

18:46.360 --> 18:51.960
of trying to understand and learn from the examples, I feel the same way, we went through the

18:51.960 --> 18:57.800
same thing. So, could you elaborate a little bit more how you found out how to use the APIs,

18:57.800 --> 19:04.600
how you ended up, what was the process to actually find out how to use and make the example work?

19:04.600 --> 19:11.800
Yeah, so with the examples you have 90% of the work done, but the 10% is the

19:11.800 --> 19:18.680
part that you cannot find in the documentation. So, at the end, we ended up, because we have a

19:18.760 --> 19:26.840
contact at NVIDIA who told us how to do that fine-tuning and how to use that specific, you know.

19:33.560 --> 19:39.080
Yeah, it's crazy. Unfortunately, it's like that. Thank you.

19:39.800 --> 19:45.000
Yeah, one more question, then I think we have a lot of time.

19:49.000 --> 19:54.440
Yeah, thanks for your presentation. So, did you manage to get your hands on a BlueField-4

19:54.440 --> 19:59.400
already, and do you have some experience there? Unfortunately not. We have the 3 in our lab.

20:00.600 --> 20:01.080
Thanks.

20:09.960 --> 20:22.600
The pipeline looks a lot like P4 match-action. Do you know if there are any plans

20:22.600 --> 20:28.600
to support that via P4 instead of something proprietary?

20:28.680 --> 20:32.600
Yes, to be honest, I don't know.

20:34.840 --> 20:40.120
Thank you very much for being here.