WEBVTT

00:00.000 --> 00:11.520
Next session is up. We are staying with Ceph for a little bit, but slowly moving inside

00:11.520 --> 00:26.240
waste. Yeah. Give it up. Hi. Thank you. My name is Sachin Prabhu and I work with IBM and

00:26.240 --> 00:32.080
what we are discussing here is one of the major problems we had when implementing the SMB service

00:32.080 --> 00:38.320
for Ceph. So a quick introduction: I'm part of IBM, having come from Red Hat after the acquisition.

00:40.400 --> 00:46.480
I'm part of the Ceph team and we work on the SMB service as described before, and we also have

00:46.480 --> 00:54.960
a GitHub page for our standalone projects, which we do as part of the team. Some of the stuff

00:55.040 --> 00:59.600
I'm talking about here is part of the Ceph repo, but some of the bits I use for testing are

00:59.600 --> 01:09.200
hosted at this location. So quickly, the Ceph SMB service: this is an SMB manager module. We

01:09.200 --> 01:16.720
export a CephFS volume over SMB. To do this, we use Samba in a container, which is part of the

01:16.720 --> 01:20.240
Samba container project. This is one of the projects which are hosted on our GitHub page.

01:21.040 --> 01:31.680
And we use a new Samba VFS module, vfs_ceph_new. Now before we get to the problem,

01:31.680 --> 01:36.320
I need to talk a bit about the forking model in Samba. Now every time you have a new connection

01:36.320 --> 01:42.560
coming in to Samba, a new client connection coming in, we fork a new process. The UID/GID of the

01:42.560 --> 01:48.560
process is switched to that of the authenticated user. Now there are many reasons for doing this and some of

01:49.120 --> 01:54.320
the main reasons are portability: it makes it easier to write code which runs on several platforms.

01:55.920 --> 02:00.800
We do not have to keep switching the UID/GID of the process; we just switch

02:00.800 --> 02:05.920
the UID/GID at authentication time and just let it run. And robustness: if one of the

02:05.920 --> 02:13.120
client connections dies, it does not take the whole server down with it. But this forking model also

02:13.120 --> 02:18.320
leads us to the problem we describe here. Now imagine we have a large number of simultaneous

02:18.560 --> 02:24.560
clients connecting to the Samba server. Each connection leads to a new process. Each

02:24.560 --> 02:30.400
process then has to connect to the backend CephFS volume. It uses the libcephfs library.

02:31.200 --> 02:38.640
Now each libcephfs connection has its own metadata and data cache. Now once you have an

02:38.640 --> 02:44.400
I/O which starts on each of these connections, the cache keeps growing. Eventually, it leads to

02:44.400 --> 02:55.200
memory depletion, causing the server to die. So to reproduce this problem, the reproducer is part

02:55.200 --> 02:59.840
of another of our projects, which is the sit-test-cases project. It is a simple Python

02:59.840 --> 03:06.800
script which uses the smbprotocol Python module, and all it does is open up multiple threads.

03:06.800 --> 03:12.160
Each thread opens up a new client connection to the Samba server. We then open and close

03:12.240 --> 03:19.120
multiple files and perform I/O on them. And this was on a test

03:19.120 --> 03:26.800
Ceph cluster with three nodes; a fourth node was used to run the client tests. And what we noticed

03:26.800 --> 03:32.560
is that after 100 simultaneous connections, we could bring down the server because of this memory pressure.

03:34.320 --> 03:40.240
So the solution which we propose is the libcephfs proxy. Now this has just been

03:43.040 --> 03:49.440
added to the Ceph repo, the upstream Ceph repo. We also have a design document available in the Ceph

03:49.440 --> 03:57.360
repo at that location. Now the main objective of this particular project is to avoid independent

03:57.360 --> 04:03.520
connections, each with its own cache, for each client connection. So in this particular case, with the proxy

04:03.520 --> 04:10.160
enabled, we ran a test on the same test cluster in which we were able to simulate

04:10.240 --> 04:17.600
1000 simultaneous connections. The proxy solution itself has two parts:

04:17.600 --> 04:24.160
one is the libcephfsd daemon process. And the second is a proxy library. Now this proxy library

04:25.680 --> 04:31.920
sits in the same location where your libcephfs library sits. So clients are linked to the

04:31.920 --> 04:35.440
proxy library instead of the actual libcephfs library.

04:35.760 --> 04:43.760
The daemon itself connects to the CephFS volume using the libcephfs library. And what it does is

04:43.760 --> 04:56.240
it centralizes all the requests. It listens on a UNIX socket and the clients use this UNIX socket to

04:56.240 --> 05:02.240
connect to the daemon. All the requests are funneled through the daemon process. And what we do is

05:02.240 --> 05:13.520
we end up limiting the cache to this particular process itself. So with the libcephfs proxy library

05:13.520 --> 05:22.080
we provide a subset of the low-level libcephfs API calls. And as mentioned, it is used in place of

05:22.080 --> 05:28.000
libcephfs today. So in this case, there is no caching done on the client itself. All it does is

05:28.080 --> 05:34.640
forward a request which is coming in through to the daemon process over the UNIX socket.

05:36.560 --> 05:41.760
Clients with the same configuration share the same connection. Every time you have

05:41.760 --> 05:48.240
a new client connection coming in with a different configuration, a new connection to the CephFS volume

05:48.240 --> 05:54.160
is created. Now because the clients can mount different sub-directories within the volume,

05:54.160 --> 06:00.960
that means some calls require special handling. And these are getcwd and chdir.

06:06.320 --> 06:14.400
Finally, for testing, our colleagues in QE decided to test this. The QE team

06:14.400 --> 06:21.040
used SPECstorage to perform this test. The tests were done

06:21.040 --> 06:28.240
on a cluster which had CTDB enabled. But all the testing was done on a single Samba server.

06:28.240 --> 06:31.520
So we did not actually use the cluster. We were testing against a single Samba server.

06:32.560 --> 06:37.520
The mount was using a CIFS kernel mount, and those are the product versions we used.

06:38.240 --> 06:44.320
Now, sorry, there are two different workloads we used here. One is the software

06:44.320 --> 06:52.160
build which simulates a make on a software project. This is very metadata heavy. And the second

06:52.160 --> 07:00.160
is video data acquisition, where we simulate reading data from a streaming device like a camera

07:00.160 --> 07:06.000
and writing it to a single file. And as expected, we have higher latency, and it is higher for

07:06.960 --> 07:11.280
the build process, the process which requires a lot more metadata calls.

07:11.520 --> 07:23.120
And also the throughput decreases. So this is something we are still working on. Now for future plans,

07:23.120 --> 07:28.960
we are planning to reintroduce the metadata cache on the client end, because a service like

07:28.960 --> 07:36.080
Samba is very metadata heavy. So we think that the performance can be improved by adding a metadata

07:36.080 --> 07:43.680
cache on the client end. However, this is blocked right now, because we have invalidation

07:43.680 --> 07:47.920
calls which are available from Ceph, but these are asynchronous. They are just kept in a queue

07:47.920 --> 07:55.440
and it is invalidated on the Ceph end, which opens a window for data corruption. So we are in talks

07:55.440 --> 08:02.320
with the Ceph developers to have synchronous invalidation callbacks added to

08:03.040 --> 08:08.720
Ceph. So once that is in there, we would be able to implement a metadata cache and hopefully

08:08.720 --> 08:16.080
improve performance. We are also considering other options for the connection between the

08:16.080 --> 08:23.600
proxy library and the daemon. Just last week, we tested using shared memory and a mutex for serialization.

08:23.600 --> 08:29.520
However, the performance gains we noticed were quite marginal. So it wasn't too good. So

08:30.480 --> 08:36.720
we are still considering other ideas, but that's still under development right now.

08:37.680 --> 08:43.840
And finally, we only support those low-level APIs which are used by the vfs_ceph_new module.

08:44.400 --> 08:50.000
So going forward, we expect to add more of these low-level API calls.

08:53.280 --> 08:56.000
Yep, that's it. Thank you very much.

08:59.520 --> 09:04.480
Oh, yeah, any questions please?