WEBVTT

00:00.000 --> 00:13.640
OK, our next talk is by Stéphane Graber about Incus and how he managed to enable it to run

00:13.640 --> 00:14.640
OCI containers.

00:14.640 --> 00:19.360
All right, hello everyone.

00:19.360 --> 00:25.960
So, as Christian just said, I'm Stéphane Graber, I'm the project leader for Linux Containers,

00:25.960 --> 00:27.720
one of the maintainers of Incus.

00:27.720 --> 00:32.920
I'm also the owner of my own company doing consulting stuff, and CTO of

00:32.920 --> 00:37.800
FuturFusion, which is another company doing large-scale enterprise Incus work.

00:37.800 --> 00:42.000
Today we're going to be talking specifically about OCI, so application containers within

00:42.000 --> 00:48.520
Incus in this case, and kind of what we've done and why and how.

00:48.520 --> 00:55.040
So, to kick things off, just very briefly, what's Incus, because maybe some of you don't

00:55.040 --> 00:56.040
know that.

00:56.040 --> 01:00.840
So, Incus is a system container and virtual machine manager, and nowadays also application

01:00.840 --> 01:01.840
container manager.

01:01.840 --> 01:08.160
It's all image-based, based on a REST API, it's got a pretty simple CLI, it's got support

01:08.160 --> 01:11.280
for most of the stuff you would normally expect, so you can use snapshots, backups,

01:11.280 --> 01:14.880
a bunch of different networking and storage options.

01:14.880 --> 01:21.840
It also has a small web UI, you can use projects to segment things, so you can actually

01:21.840 --> 01:26.960
with external authentication and authorization, you can turn it into a multi-tenant environment,

01:26.960 --> 01:34.320
it can be clustered up to 50, 100-ish servers, so you can run it at reasonably large scale,

01:34.320 --> 01:41.120
it supports distributed storage with Ceph, but we're also adding LINSTOR now, and

01:41.120 --> 01:46.560
we support shared block storage for LVM, and then on the network side we use OVN for all of

01:46.560 --> 01:51.160
the software-defined networking bits, that's also an option, but physical networking works

01:51.160 --> 01:52.160
just as well.

01:52.160 --> 01:57.200
I mentioned it integrates with external things: it can use OpenID Connect for

01:57.200 --> 02:03.360
authentication, it can use OpenFGA for fine-grained authorization. There are a number of web interfaces,

02:03.360 --> 02:07.920
this is kind of the one we usually go with when someone wants to see it. Personally I tend to

02:07.920 --> 02:15.320
always use the CLI, so the demo afterwards is going to be CLI, and yeah, so it's a reasonably

02:15.320 --> 02:20.500
active project these days, we've had 130-ish contributors last year, all written in

02:20.500 --> 02:25.140
Go, all open source, Apache 2, and on GitHub.

02:25.140 --> 02:32.820
So why support application containers anyway? From all the way at the beginning of the project,

02:32.820 --> 02:37.420
back when it was still LXD under Canonical, we only focused on system

02:37.420 --> 02:43.720
containers, so running full Linux distros on this thing, and yeah, it didn't even do

02:43.720 --> 02:48.920
VMs, it was really just containers, just full Linux distros, and that worked pretty well,

02:48.920 --> 02:54.000
it was very useful to a bunch of folks, and our thought at the time was, well, there's

02:54.000 --> 02:57.320
libvirt, people are going to be using that for VMs, and they're going to do both on the same

02:57.320 --> 03:02.400
system, and we were expecting libvirt to become more and more user-friendly, get

03:02.400 --> 03:07.200
an API and that kind of stuff over time, so that everyone would be happy.

03:07.200 --> 03:13.480
We noticed that this never really happened, libvirt kind of remained where it was,

03:13.480 --> 03:18.680
it became an implementation detail used by OpenStack and others, instead of turning into something

03:18.680 --> 03:22.360
that regular users really enjoy using.

03:22.360 --> 03:26.960
So after a few years of kind of waiting to see what happened, we were like, okay, fine,

03:26.960 --> 03:32.400
we'll just bite the bullet and we'll add VM support to, it was LXD at the time.

03:32.400 --> 03:35.720
And that worked pretty well for us, it was actually reasonably easy, I think we've

03:35.720 --> 03:40.200
got it working in just a few months, we then spent a couple of years to support

03:40.200 --> 03:45.040
a lot of features, but that was pretty easy, and we had pretty much the same situation with

03:45.040 --> 03:46.560
application containers.

03:46.560 --> 03:51.680
Obviously, Docker has been a big thing for a while, people have been using it, they've

03:51.680 --> 03:56.920
been using it alongside Incus, they've been using it inside of Incus containers, both

03:56.920 --> 04:04.280
kind of work, but Docker alongside Incus has a bit of a tendency to break networking,

04:04.280 --> 04:08.440
because it assumes it owns everything, and so it injects a bunch of firewall rules that

04:08.440 --> 04:10.440
then block everything else on the system.

04:10.440 --> 04:13.680
You need to go and mess with that, we've got documentation on how to fix it, but it's kind

04:13.680 --> 04:15.560
of annoying.

04:15.560 --> 04:19.800
You also end up having your own network and storage, both in Incus and in Docker, and

04:19.800 --> 04:22.560
it gets a bit annoying integrating things together.

04:22.560 --> 04:27.640
If you instead go with Docker inside of Incus, that works well enough, but now Incus

04:27.640 --> 04:32.120
generally running unprivileged containers with higher security actually interferes with some

04:32.120 --> 04:35.840
of the Docker images and the way they run things.

04:35.840 --> 04:40.240
The number of storage options was quite limited in that scenario, and networking was still a

04:40.240 --> 04:43.480
bit of a mess, because now you've got a network inside of a container, if you want to

04:43.480 --> 04:47.920
integrate with something outside of it, it can get a bit messy.

04:47.920 --> 04:51.720
So that was kind of the state of things, but we've seen a lot of people have use for

04:51.720 --> 04:58.960
application containers, whether it's for IoT stuff, basically all of the IoT bridges, for

04:58.960 --> 05:03.640
like Zigbee, Z-Wave, whatever, they're all shipped as Docker containers these days, like all

05:03.640 --> 05:14.600
of the Home Assistant components are shipped as OCI images on

05:14.600 --> 05:18.440
Docker Hub, and people are consuming that instead.

05:18.440 --> 05:24.600
So it's just a bit of a weird fit to try and manually repackage

05:24.600 --> 05:28.640
those things to run them on top of Incus, and otherwise you were doing nested Docker, that

05:28.640 --> 05:33.200
was always a bit dodgy, and more and more applications are effectively officially

05:33.200 --> 05:38.520
shipped as a Docker/OCI image these days.

05:38.520 --> 05:42.120
So there was a bit of a need for that, we noticed that like we don't want to start competing with

05:42.120 --> 05:47.120
Kubernetes or something that's not our intention, but a lot of people just need a few containers

05:47.120 --> 05:50.840
running, they don't start scaling them up and down constantly, so it kind of made

05:50.840 --> 05:53.200
sense to add that for us.

05:53.200 --> 05:57.160
Another reason to do it: it ended up being quite easy and fun, so that's always a good

05:57.160 --> 05:59.440
justification for doing something.

05:59.440 --> 06:01.400
Now, how does it work?

06:01.400 --> 06:06.320
Well, what we do is actually reasonably simple, because there are good tools out there

06:06.320 --> 06:08.080
that simplify a lot of that.

06:08.080 --> 06:13.200
So we need, obviously, to interact with a registry, so we use skopeo for that, then

06:13.200 --> 06:17.520
we need to fetch that from the registry, again, skopeo does that for us, then we need

06:17.520 --> 06:22.960
to go and turn that into a viable root filesystem, so we're using umoci

06:22.960 --> 06:28.040
for that, which effectively looks at the layers and squashes everything together.

06:28.040 --> 06:31.360
And once we've got that, we turn it into a normal Incus image, we load the image

06:31.360 --> 06:35.800
into Incus, we create a normal container from it, and at that stage, the only thing that's

06:35.800 --> 06:41.120
different from a regular system container, is that we also process all of the OCI config and

06:41.120 --> 06:44.320
metadata, so we look at the environment variables, we look at the extra mounts, we look at

06:44.320 --> 06:49.920
all that stuff, and the entry point, and we effectively put all of that in place within

06:49.920 --> 06:53.400
the final config, and then the container starts up.

06:53.400 --> 06:57.240
One common misconception is that we effectively have Incus driving Docker or something, it's

06:57.240 --> 07:03.480
not the case, we turn the OCI image into effectively an Incus image, and we run it through

07:03.480 --> 07:06.200
our normal container runtime, which is LXC.

07:06.200 --> 07:11.560
We don't use runc or any of those at all. In this case, we use the exact same runtime,

07:11.560 --> 07:14.920
whether it's a system container or an application container.

07:14.920 --> 07:21.000
And then, yeah, start the container, and it just works, basically.

07:21.000 --> 07:30.240
So, time for a quick demo, on the FOSDEM Wi-Fi, which is always fun, and so here

07:30.240 --> 07:35.080
I've got an empty Incus project, and the first thing we need to do is actually, so for

07:35.080 --> 07:39.640
our images, it comes pre-configured with our image server. We could, in theory, pre-configure

07:39.640 --> 07:42.920
Docker Hub because it's the most common one, but there are many other registries, so

07:42.920 --> 07:43.920
we just don't do it.

07:43.920 --> 07:49.160
So, you need to actually add your registry, so in this case, for Docker Hub, you

07:49.160 --> 07:55.280
can do that, and then say the protocol is OCI, and once you've done that, now you can

07:55.280 --> 08:03.160
do docker:nginx and my-nginx, and I've cached the image already, downloaded that, so we

08:03.160 --> 08:09.720
don't need to enjoy the Wi-Fi too much, and effectively it just launched it.

08:09.720 --> 08:15.480
So at that point, hey, I've got a container running, I can go and check that we do have

08:15.480 --> 08:22.080
nginx actually running on this thing, which we do, and if we go look at the config, for those

08:22.080 --> 08:27.680
who are used to normal Incus containers or VMs, usually the config is really empty at the

08:27.680 --> 08:33.440
beginning, it just has some image info and some volatile info, that's different for OCI

08:33.440 --> 08:37.480
containers, as you can see, just specifically for that one.

08:37.480 --> 08:41.560
The environment variables that are defined in the OCI image get automatically added to our

08:41.560 --> 08:46.480
config, and so that's there. Once that's done, it works a bit differently than what you're

08:46.480 --> 08:51.680
used to with Docker, because with Docker, it's, I don't know, maybe there's some magic stuff

08:51.680 --> 08:56.520
I don't know how to do, but it's not trivial to go and reconfigure things in place, whereas

08:56.520 --> 09:00.320
with Incus it is, like you can add additional mounts and stuff while the thing is

09:00.320 --> 09:05.600
running, you can change the amount of CPU and memory while it's running, you can add GPUs

09:05.600 --> 09:08.880
while it's running, and if you want to change the environment, you don't need to delete it,

09:08.880 --> 09:13.640
you can just change the environment, restart it, and you're done, so that makes it quite

09:13.640 --> 09:20.000
a bit easier. For my personal use case at home, which is mostly running a whole bunch of IoT

09:20.000 --> 09:25.120
home automation type stuff, I can run those things and basically, if I ever need to reconfigure

09:25.120 --> 09:29.080
where the MQTT endpoint is or something, I can just go change the environment, restart the

09:29.080 --> 09:35.560
thing, and I'm done. It also uses normal Incus networks, storage, all of those features,

09:35.560 --> 09:42.240
so obviously if you're running like a mix of containers, like system

09:42.240 --> 09:46.400
containers and VMs, now you can do those alongside it and it just all fits nicely, it's

09:46.400 --> 09:50.680
on the same network, you can put the same firewall policies and stuff between them, it goes

09:50.680 --> 09:54.480
on the same storage as Incus, if you're running a production cluster with redundant

09:54.480 --> 09:59.560
storage, then now you've got redundant storage on those too, so it just fits

09:59.560 --> 10:04.960
really nicely and the actual amount of code and effort to do this was pretty minimal.

10:04.960 --> 10:09.920
We did have to do a bit of extra work afterwards, because for example, Incus had zero

10:09.920 --> 10:15.200
need for an auto-restart policy, because we were running either VMs or system containers

10:15.200 --> 10:19.880
and those usually don't die, like if QEMU crashes, you probably have bigger

10:19.880 --> 10:26.120
problems, and if PID 1, like systemd in a container, crashes, you probably also have

10:26.120 --> 10:31.600
bigger problems, so we never needed a restart policy, but obviously with application containers,

10:31.600 --> 10:35.840
it's pretty common that if a service wants to just restart itself to reload, it just

10:35.840 --> 10:41.240
exits, the container dies and starts back up, so we've had to add auto-restart. The other

10:41.240 --> 10:46.680
thing we didn't need to do in Incus before, because we're running full distros or full operating

10:46.680 --> 10:51.680
systems, they usually have a network management tool of some kind that does DHCP for

10:51.680 --> 10:57.880
network config, that didn't exist here, so we actually needed to write a tiny DHCP client,

10:57.880 --> 11:01.720
which runs when the container starts, gets a lease, stays in the background and does

11:01.720 --> 11:07.080
renewals, but it also means that you can literally bridge those OCI containers directly

11:07.080 --> 11:11.120
on your physical network, and it will just grab an IP from DHCP nice and easy, you

11:11.120 --> 11:22.720
don't need to mess with static IPs or anything. So now we get to the kind of what's

11:22.720 --> 11:30.760
coming up next, I mean for my personal use we're done, it works, but there are always

11:30.760 --> 11:36.200
things we can do better, currently I don't love the fact that we shell out to umoci

11:36.200 --> 11:40.800
and skopeo, because both of them are Go codebases, we are a Go codebase, we should be

11:40.800 --> 11:46.240
able to just use the right logic and not need distributions to ship a separate tool,

11:46.240 --> 11:50.440
so that's something that we'd like to do, I know that for umoci it's pretty

11:50.440 --> 11:55.400
easy to do, also the umoci creator and maintainer is also an Incus maintainer, so if

11:55.400 --> 12:00.800
we need changes there, nice and easy. skopeo is a bit worse from what I've seen, it's

12:00.800 --> 12:05.840
not particularly well split into parts, it's not really designed to be included in other

12:05.840 --> 12:10.400
code bases, so we might need to look at what we do there, there's a bunch of discussions

12:10.400 --> 12:16.200
around handling of private registries, which we currently don't do, around how do we handle

12:16.200 --> 12:21.360
the authentication and all of that, like obviously we're a REST API with

12:21.360 --> 12:26.280
a client-server type of design, so depending on the authentication, if it's just username

12:26.280 --> 12:29.880
and password, we can pass that through the request very easily, if it's something more

12:29.880 --> 12:34.160
complex where you need to get a temporary use token and stuff, that gets a bit more

12:34.160 --> 12:38.880
complicated, so we're looking at the best options to handle that in a way that's mostly

12:38.880 --> 12:43.360
natural and easy for those used to dealing with that on private registries in Docker and

12:43.360 --> 12:46.000
other tools.

12:46.000 --> 12:50.320
Something that's going to be a bigger piece of work, but for us kind of completes the set,

12:50.320 --> 12:57.880
is allowing running those as VMs as well, so for our normal images, if you do incus

12:57.880 --> 13:02.760
launch images:ubuntu/24.04, you get a container, if you do --vm, you get the same

13:02.760 --> 13:08.160
thing as a VM, we want the same experience for OCI images, so that if you launch them

13:08.160 --> 13:12.800
as they are, you get a container, if you do --vm, you get a very thin VM layer with

13:12.800 --> 13:18.880
the container image running inside it, so that's kind of a Kata-like design, with probably

13:18.880 --> 13:23.680
a very, very similar design of QEMU, virtiofs, we've got all of those things already

13:23.680 --> 13:28.280
for our VMs, it's just a matter of putting the right bits in the right places. And the last

13:28.280 --> 13:36.520
thing is potentially handling layers, but the reality so far is that 99% of the images

13:36.520 --> 13:40.840
we've looked at, they're so small, or they're actually squashed together into a single

13:40.840 --> 13:45.400
layer anyway, so we can manage, we've not really seen a good use case for that, the one big

13:45.400 --> 13:52.120
use case would be people doing AI/ML type stuff with the massive NVIDIA type base layer,

13:52.120 --> 13:57.520
there it would be a bit annoying to run three containers and have in theory 99% of the image

13:57.520 --> 14:03.720
being shared, but having them duplicated, but then for us to start supporting layers throughout

14:03.720 --> 14:08.640
all of our image management logic, volumes, across like a dozen different storage

14:08.640 --> 14:13.040
backends, across a cluster and all of that, it's not trivial, so it's going to be a matter

14:13.040 --> 14:19.920
of balancing the need for that, so far, it's like basically if you have that need, might

14:19.920 --> 14:26.840
as well keep using Docker. And that's basically it for me, if you want to play with

14:26.840 --> 14:34.160
it online, we've got the online demo that lets you play with Incus containers, VMs, and unless

14:34.160 --> 14:39.360
the IP address has changed and I need to update the firewall, normally also OCI images from

14:39.360 --> 14:45.800
Docker Hub, so yeah, that's a good way to effectively get a VM on an Incus cluster that

14:45.800 --> 14:50.320
has nested VM support and that has Incus installed, so you get to play with it for

14:50.320 --> 14:53.720
a bit, and we've got a few minutes for questions.

14:53.720 --> 15:06.400
I'm going to steal one question: for the layering problem, couldn't you

15:06.400 --> 15:11.200
do this similar to what systemd is doing with system extensions and configuration extensions,

15:11.200 --> 15:16.560
that you essentially have images that you compose using overlay, for example?

15:16.560 --> 15:20.440
That's probably how we would do it, yeah, you would want to download the layers and then

15:20.440 --> 15:24.080
do an overlayfs, and that's not necessarily that difficult

15:24.080 --> 15:25.080
for us.

15:25.080 --> 15:28.720
The part that's more difficult for me is that right now in our image store and all of

15:28.720 --> 15:33.640
our internal tracking, we've got an image as a single object, now with layers, we're

15:33.640 --> 15:37.840
going to have to track potentially 20 different objects for an image and keep track of

15:37.840 --> 15:43.320
like who's using what, and so when we do replication of images within the cluster, we

15:43.320 --> 15:47.200
can't just have the layer replicated to one host and the next layer on the other

15:47.200 --> 15:50.400
host, because then it's on the wrong machine, so it's all of that tracking logic that's

15:50.400 --> 15:54.200
kind of tricky for us, more than the actual assembling the thing at the end, because

15:54.200 --> 15:59.680
yeah, assembling the thing is setting up overlayfs, which we've not done inside

15:59.680 --> 16:03.720
of Incus, but we've done it before in LXC, we're pretty familiar with that process, that's

16:03.720 --> 16:04.720
not too difficult.

16:04.720 --> 16:10.480
It's mostly all of the keeping track of usage, when can you expire something, all of

16:10.480 --> 16:13.880
that stuff, which is actually more complex.

16:14.880 --> 16:20.880
Thanks, Stéphane, very good talk, I was just wondering about something that you mentioned,

16:20.880 --> 16:25.880
that you can add, for example, mounts in the container, while the container is running,

16:25.880 --> 16:31.880
and you also mentioned that this is also applicable with GPUs, and I think Christian presented

16:31.880 --> 16:38.680
how the GPUs worked a few years ago, but I was wondering, is it possible to also hot swap

16:38.680 --> 16:41.880
to remove GPUs or external devices?

16:41.880 --> 16:47.520
Yeah, exactly, so we can do hot plug, obviously for mounts, we use some weird tricks to

16:47.520 --> 16:53.480
propagate the mounts into the container, we can do that kind of stuff. For GPUs,

16:53.480 --> 16:58.680
we can do it, because we do kind of the same thing, effectively bind-mount the character

16:58.680 --> 17:02.680
devices that are needed in, and we can remove them too.

17:02.680 --> 17:06.080
Technically, it's a bit dodgy, because if you remove a GPU that's currently in

17:06.080 --> 17:10.080
use, we don't really keep track of that, and we don't know what process to start killing

17:10.080 --> 17:14.680
or anything, so removing a GPU that's currently in use will likely let them still use it

17:14.680 --> 17:20.880
until the next application tries and then it's gone, but that works just fine, and it's

17:20.880 --> 17:24.760
been something we've worked quite a bit on in Incus, so that hot plug works for just about

17:24.760 --> 17:30.080
everything, both on containers and on VMs, like on VMs, you can also do GPU hot plug, it

17:30.080 --> 17:34.800
will do PCI hot plug in, PCI hot plug out, now you might get a kernel panic if you've

17:34.800 --> 17:40.800
not correctly cleared the usage inside the VM when we're yanking it out, but we do support

17:40.800 --> 17:44.640
that, and for VMs we also support CPU hot plug and hot remove, and that works

17:44.640 --> 17:52.480
surprisingly well with the right ACPI events, thanks, sensor.

17:52.480 --> 17:56.480
Is this already integrated into the patched LXD UI?

17:56.480 --> 18:02.400
Yeah, so, well, kind of, we do have detection for those, so they will show up as containers

18:02.400 --> 18:09.840
in there, launching them is a bit trickier because it doesn't know about all of the potential

18:09.840 --> 18:15.360
remotes, so all of the potential hubs, I think it's possible using the YAML option,

18:15.360 --> 18:18.960
otherwise at least anything that you launch previously, you should be able to select the

18:18.960 --> 18:24.400
cached image and create more of that, I know we did some work on the Terraform side so that

18:24.400 --> 18:30.400
the Terraform provider now handles OCI just fine, but I think the UI could do with a bit of an improvement

18:30.480 --> 18:34.560
for like saying I want an OCI and it's going to ask you like what registry what's the name

18:34.560 --> 18:40.080
because with skopeo we can't easily go and list all images on a registry, whereas

18:40.080 --> 18:46.240
we can for distros, so there's just a bit of a gap there, and I think that's it, we're starting

18:46.240 --> 18:50.560
for someone else shortly, thank you.

