WEBVTT

00:00.000 --> 00:09.760
The next talk, and the last one in the devroom: Martin Luther's going to tell us everything

00:09.760 --> 00:14.320
about measured boot and the representation of the container as well as us.

00:14.320 --> 00:25.480
Hello, I'm happy to be here and talk about what we're doing in the Confidential

00:25.560 --> 00:34.040
Containers project, leveraging, I think, a lot of prior art from systemd, or the whole systemd

00:34.040 --> 00:41.000
ecosystem. The idea was to present a bit of what we're doing there, what tools we are using,

00:41.000 --> 00:53.560
and maybe we can extrapolate this kind of architecture also for other similar projects.

00:53.640 --> 00:58.680
I work at Microsoft in the Azure Core Linux organization and, as mentioned, I'm a

00:58.680 --> 01:03.880
contributor and maintainer of some projects in the confidential container space.

01:06.280 --> 01:16.040
So first, context: I think it's maybe not obvious why containers and pods, and what does

01:16.040 --> 01:22.840
it have to do with boot integrity. Essentially what we have to do when we want to have

01:22.840 --> 01:30.760
confidential containers, or use confidential computing in a container context, we have to

01:31.480 --> 01:37.960
put those containers into virtual machines because today the technology for confidential computing

01:37.960 --> 01:48.520
is mostly using virtualization boundaries. And in the Kubernetes container realm,

01:49.080 --> 01:58.280
the atomic units that we deal with are called pods. Those are co-located processes that run

01:58.280 --> 02:04.520
in a sandbox, and they share a few resources, their namespaces, but usually they run without strong

02:04.520 --> 02:15.480
virtualization isolation on a node, next to each other. What Confidential Containers has to

02:15.480 --> 02:24.280
do is wrap those pods into VMs. So when we look at the typical container launch,

02:24.280 --> 02:31.320
we don't have to go through the whole state machine here, but I think it's obvious that

02:31.320 --> 02:40.440
there's a lot of complexity going on in terms of the user that is writing their Kubernetes spec

02:40.440 --> 02:45.640
which is like a declarative manifest of which containers are supposed to run, which commands

02:45.640 --> 02:52.760
are to be executed, environment variables, all that. They send it to an API server, and the API

02:52.760 --> 03:04.120
server dispatches it to a kubelet, which is a Kubernetes component, and then there's the container

03:04.200 --> 03:10.760
runtime. It's called the container runtime interface, so today Kubernetes is not really built on Docker

03:10.760 --> 03:19.000
anymore; there's an abstraction, and containerd is one of the implementers of this container

03:19.000 --> 03:29.000
runtime interface. And then finally containerd translates those specs into OCI

03:29.000 --> 03:38.440
runtime calls, and runc essentially launches the processes that the user intended to run in

03:38.440 --> 03:50.680
a sandbox. For confidential containers the picture still looks roughly the same, but instead of

03:50.680 --> 04:00.760
having runc launch the processes, we have a step in between that's called the

04:00.760 --> 04:12.600
Kata shim, which will launch the virtual machine and talk via RPC to an agent in the

04:12.600 --> 04:26.760
confidential VM and perform the same jobs that runc would do in a usual setup. And additionally,

04:26.760 --> 04:33.880
for confidentiality there's a ceremony involved: you need an attestation service, you need

04:33.880 --> 04:40.040
hardware evidence from the confidential VM so I won't go into much detail here how this works but

04:40.040 --> 04:47.480
essentially there's more ceremony required to launch a confidential

04:47.480 --> 04:56.840
pod than there is in a Kubernetes setup. So you need to do a secure key release, and you need to

04:56.840 --> 05:04.440
verify the hardware evidence and all that has to be part or has to be bundled in the trusted

05:04.440 --> 05:13.000
computing base of the confidential VM. And if you look at the confidential VM components,

05:13.000 --> 05:21.000
there's a part that is dynamic, that is where the pod processes run, but there are also the static

05:21.000 --> 05:27.560
components I was just talking about that are facilitating the key release that are launching

05:27.720 --> 05:33.880
the containers, that are jailing the containers and assigning resources. They even do cgroups

05:33.880 --> 05:42.280
in the VM to make sure that there's isolation between the containers in a pod.

05:46.280 --> 05:52.600
Yeah, and in this context we will focus on the static components, not on the runtime,

05:52.600 --> 05:59.240
because it's a different piece; it's more complex to measure the runtime. But for the static

05:59.240 --> 06:12.040
components we could leverage measured boot because the system that is launching those containers

06:12.040 --> 06:18.280
the Linux system it's really just a utility VM that has to be very small for performance reasons

06:18.280 --> 06:25.080
but also because you want to have the TCB ideally very small because you have to trust all those

06:25.080 --> 06:36.040
components in there. And we have, yeah, the components I mentioned before, to facilitate

06:36.040 --> 06:41.160
attestation; there's a few, they're called guest components. And we have the agent, so the

06:41.160 --> 06:51.240
runtime that is launching the processes. One key architectural property of confidential

06:51.240 --> 06:55.880
computing is that it mandates integrity so it doesn't make sense to talk about confidentiality

06:55.880 --> 07:01.960
when you cannot guarantee what you're running, and this is often a big stumbling block

07:01.960 --> 07:08.840
when people look into confidential containers: they are surprised that they have to deal

07:08.920 --> 07:17.000
with this, that they first have to get integrity right. And yeah, if you want to trust the sandbox,

07:17.000 --> 07:22.200
that means we have to trust the guest OS, and all of this has to be measured: the

07:22.200 --> 07:32.600
guest operating system and firmware components. There are options to do this; one of the

07:32.600 --> 07:38.520
options that historically was used was to package firmware, kernel, and the Kata agent as PID 1

07:40.040 --> 07:47.000
without any rootfs. This is rather simple, and it's charming because it's simple, because

07:47.000 --> 07:52.840
the measurements can be precalculated rather easily, and we have the confidential computing

07:52.840 --> 07:59.640
architectures from AMD for example they really like this because they don't have runtime registers

08:00.600 --> 08:06.520
really, where you can measure at runtime. So you have a kind of safe state that is

08:06.520 --> 08:14.440
measured at VM launch, and this safe state can be precalculated, and this is why this was the initial

08:14.440 --> 08:21.480
architecture. Problems arise when we see that we now have to deal with discrete

08:21.480 --> 08:27.640
processes like those guest components. There are three of those for architectural reasons; they

08:27.640 --> 08:35.800
communicate via RPC locally. And so the Kata agent started to become a process manager,

08:35.800 --> 08:41.640
and all the problems with logs and everything started appearing: how do we restart

08:41.640 --> 08:54.760
those, how are they orchestrated? So we tried to have a discussion about systemd, how we can

08:55.320 --> 09:01.560
introduce systemd into this picture to manage those components, because it also enables

09:01.560 --> 09:08.360
some use cases so for example we need to have like a one-off job that consumes configuration before

09:08.360 --> 09:13.560
the agent is even started. If it's PID 1, then the agent also needs to take over this job,

09:13.560 --> 09:20.600
et cetera. And this really simplifies the agent code, and simplicity is really key in those confidential

09:21.160 --> 09:31.800
systems, because you can audit them more easily, and it doesn't help

09:31.800 --> 09:41.400
if the code becomes convoluted. And we can leverage systemd's measured boot facilities. I think

09:41.400 --> 09:48.440
two weeks ago, so not just now but weeks ago probably, we got this merged. So we now have Kata

09:49.320 --> 10:00.600
using systemd as an init system, and we are able to use all the niceties that we get

10:00.600 --> 10:10.040
from systemd in terms of PCR measurements. And one of those tools that has been very helpful for

10:10.920 --> 10:19.480
us has been mkosi, or whatever you want to call it; we call it "em-cozy" for the time being, maybe.

10:21.160 --> 10:26.280
I know, yeah. But I would really encourage everyone to look at this tool. I think the motivation

10:26.280 --> 10:31.720
initially was that the project was using Packer. Packer changed the license

10:33.800 --> 10:39.640
to one that was not compatible with CNCF anymore, so it was kind of an external push

10:40.040 --> 10:46.760
to change this and we scrambled a bit and we luckily stumbled over this project and it turned

10:46.760 --> 10:52.440
out to be really, really great for our use case. It's declarative: instead of running a bunch of

10:52.440 --> 11:00.920
bash scripts in a VM, you can describe what you want, and it's again easily auditable.

11:02.280 --> 11:09.480
You can build images rootless, and it supports various distributions. We use Fedora at the moment, but

11:09.640 --> 11:19.240
we also tried other distributions, and we also have, for example, architectures like s390x

11:19.800 --> 11:28.840
that are using mkosi for those images. It also turned out that it's really fast, not just locally;

11:28.840 --> 11:36.440
locally it's really, really fast because it caches a lot, but also on the CI we build images in the

11:36.440 --> 11:44.120
realm of three minutes on GitHub Actions. This allows us, for example, to run smoke tests

11:45.640 --> 11:50.680
against those images that we build in PRs; before, there had been a very long E2E test suite.

11:51.480 --> 11:59.000
And this really shortened debug cycles, and it's a great improvement for what we were struggling with.

11:59.080 --> 12:10.440
We also have the setup that the rootfs in our case is really immutable:

12:11.800 --> 12:17.640
we don't have to change it, really. The VM, once booted, won't be rebooted; there won't be any changes

12:17.640 --> 12:26.200
persisted on the VM, really. So what we can essentially do is just use existing dm-verity mechanisms.

12:26.200 --> 12:32.280
And in terms of config, I think it's really more or less just this, so we didn't

12:32.280 --> 12:41.560
have to do anything more and for me at least this was great to see like engineers from actually

12:41.560 --> 12:47.880
implemented this, and I was searching for where the actual magic was happening, but apparently

12:48.360 --> 12:59.480
it's just that easy at this point. Maybe a small detour: why we talk about TPMs and CVMs.
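
NOTE
A minimal sketch of the idea behind the dm-verity mechanism mentioned above, assuming a simplified binary SHA-256 hash tree. The real on-disk format differs (salt, many digests per hash block), and the names root_hash and BLOCK are hypothetical. It only illustrates why a single trusted root hash is enough to detect any change to an immutable rootfs.
```python
import hashlib
BLOCK = 4096  # assumed data block size for this sketch
def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()
def root_hash(image: bytes, block: int = BLOCK) -> bytes:
    # Leaf level: one digest per data block of the image.
    level = [_h(image[i:i + block]) for i in range(0, len(image), block)] or [_h(b"")]
    # Hash pairs of digests upward until a single root digest remains.
    while len(level) > 1:
        level = [_h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
    return level[0]
image = bytes(3 * BLOCK)            # stand-in for a read-only rootfs image
trusted = root_hash(image)          # would be computed at build time and measured
tampered = bytearray(image)
tampered[5000] ^= 1                 # flip a single bit in the second block
print(root_hash(image) == trusted)            # the unmodified image verifies
print(root_hash(bytes(tampered)) == trusted)  # the modified image does not
```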

13:03.880 --> 13:17.560
So we can use vTPMs in confidential VMs to some degree today, because the CVMs have some

13:17.560 --> 13:27.720
kind of privilege-level architecture when they don't have runtime registers. It's

13:29.880 --> 13:37.960
called VMPL for SEV-SNP. That's a piece of memory that is protected from the guest OS, which

13:37.960 --> 13:45.640
runs at a lesser privilege level, and from the host, because it's a confidential VM. And then

13:46.520 --> 13:52.040
you can bootstrap kind of a playground, or you can extend the root of trust into a playground where

13:52.040 --> 13:58.040
the vTPM is running and exposes vTPM APIs to the guest OS, and then you can do runtime

13:58.040 --> 14:04.360
measurements, and you also don't have to use all the bespoke interfaces from Intel or

14:04.360 --> 14:11.400
AMD to record your measurements; you can just use existing userland. This is implemented in

14:11.400 --> 14:20.920
paravisors, or secure virtual... I don't know, SVSM it's called. It's like an early boot component,

14:20.920 --> 14:29.000
firmware, that is performing those tasks. One example is OpenHCL from Microsoft, the other is

14:29.240 --> 14:41.160
Coconut SVSM, and they're both open source. And yeah, I mean, the chart is maybe not really

14:41.160 --> 14:48.760
legible, but the idea is that the root of trust is really linked to the attestation key of the vTPM

14:49.000 --> 14:59.560
and the hardware root of trust of the TEE. And we are using the vTPM then for measured boot. So

14:59.560 --> 15:09.880
there's a nice chart with the PCR registry published by the UAPI Group, and if you look at it

15:09.960 --> 15:20.680
you see it's pretty crowded already. What we use currently is PCR 11 for UKIs,

15:21.560 --> 15:28.680
and we reuse PCR 8 for our configuration measurements, because it's used by

15:28.680 --> 15:41.560
GRUB, and GRUB we don't use at the moment. We can calculate those PCRs offline, or with a software

15:41.560 --> 15:53.000
TPM, that's also possible. We also use systemd-analyze a lot, because in our case launch latency

15:53.080 --> 16:00.040
is really critical because those VMs they need to be instrumented or they need to be launched by

16:00.040 --> 16:07.800
the cloud provider, for example, and then you want to have a fast container, or time to boot, essentially.

16:08.520 --> 16:16.120
And so we would go in and see where we have to fix problems. So for profiling,

16:16.200 --> 16:25.160
systemd-analyze was really helpful. We are also looking at encrypted workspaces at the moment

16:25.160 --> 16:29.640
to use this so when playing around with this we saw that this is probably also something that

16:29.640 --> 16:38.840
is rather easy to do. You just have another partition that can auto-grow,

16:38.840 --> 16:49.720
and you just add this piece of code to repart, and then you get an ephemeral encrypted

16:49.720 --> 16:55.320
workspace. I think we probably have to ponder a bit about the security implications of this,

16:55.320 --> 17:02.360
but from the building perspective it's again super easy with mkosi and systemd at the moment.

17:03.080 --> 17:10.120
Nice thing. Yeah, there's a few future ideas: we want to maybe look into sysext, because some

17:11.160 --> 17:16.920
users might have an opinionated base OS, like they want to do it on RHEL, so we have to kind of

17:16.920 --> 17:21.000
plug the CoCo ecosystem on top of it and also have it particularly measured.

17:23.080 --> 17:30.600
It would also be very nice to have non-TPM support somehow in systemd. Maybe this is something

17:31.000 --> 17:37.880
that would be helpful, because at the moment we have to interact a lot with low-level

17:39.800 --> 17:43.960
TEE hardware in bespoke ways, and it would be nice to have an abstraction like

17:44.680 --> 17:49.240
libtss; maybe not libtss, but on the same abstraction level.

17:51.720 --> 17:59.560
Yeah, and summing up: using tools from systemd and the ecosystem really gave us a great boost to

17:59.880 --> 18:05.720
iterate and quickly build on this immutable utility VM and I think this can also be

18:05.720 --> 18:12.520
used in other contexts where you have a really static OS image and you run an application on it.

18:12.520 --> 18:18.360
I would definitely, in the future, look at this as a template to go ahead for measured boot.

18:19.720 --> 18:22.520
And that's it from my side. Thanks for listening.

18:29.880 --> 18:44.920
At runtime, you mean? Yeah, okay: is it possible for the container to contribute something to the

18:44.920 --> 18:55.640
attestation? No, at the moment no. I left out the container part, so what we measure is pretty

18:55.640 --> 19:02.200
much an abstract notion of the container workload, but at runtime we currently don't do this,

19:02.200 --> 19:10.040
because we don't have the guarantee that every TEE has runtime registers. And I think this

19:10.040 --> 19:16.280
requirement will be coming in the near future, because rather than starting

19:16.280 --> 19:21.720
nginx containers, you run long-running machine learning processes, and those want to have

19:21.800 --> 19:25.080
continuous measurements, so we have to introduce runtime measurements.

19:27.800 --> 19:33.720
another question two questions first one is the base in the right there one

19:35.320 --> 19:40.120
building our basic goals it means you don't have to be with the usable image you only have

19:40.120 --> 19:43.960
you still have to trust the build environment yeah basically because you can take the same

19:44.600 --> 19:52.760
Yeah, I mean, so the question was whether they are reproducible. They are not reproducible, so

19:52.760 --> 19:58.120
we're rebuilding an image like a golden image if you will we're publishing the measurements

19:58.120 --> 20:04.120
for this but it's not reproducible and this is like

20:04.120 --> 20:13.720
given it's for some cases yeah but then yeah there was a completely different from what

20:13.720 --> 20:18.920
produced the video and chaos I just yeah depends on the file system the package management

20:18.920 --> 20:23.160
has to be changed definitely a week old right but before we go we don't have to be in

20:23.160 --> 20:27.560
yeah you have to we have to clean that yeah I have to finish an episode of all package

20:28.520 --> 20:34.600
But it's possible, yeah. I mean, it's definitely something that would make sense, but it's

20:34.600 --> 20:39.720
out of scope for the time being, because it's deemed not trivial.

20:46.280 --> 20:47.800
yeah but next we just learned

20:57.560 --> 21:09.320
No, I mean, so the question was: do we depend only on PCR 11? No, we basically attach

21:09.320 --> 21:17.800
the whole PCR quote, the whole TPM quote, to the evidence, and the users can then pick

21:17.800 --> 21:23.400
what they want. We maybe even want to put the log into the evidence, so

21:24.360 --> 21:29.400
it's more transparent and you can replay the log. But at the moment you get

21:29.400 --> 21:35.720
the whole quote; it collects, I think, all measurements from the PCRs.
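
NOTE
A minimal sketch of what replaying the log means, assuming a SHA-256 PCR bank that starts at all zeros and a plain list of event digests. The event names are hypothetical and no real event-log format is parsed. Each measurement is folded in as new = SHA256(old || digest), so a verifier can recompute the value and compare it with the PCR value reported in the quote.
```python
import hashlib
def extend(pcr: bytes, digest: bytes) -> bytes:
    # TPM-style extend: the new value depends on the old value and the measurement.
    return hashlib.sha256(pcr + digest).digest()
def replay(event_digests) -> bytes:
    pcr = bytes(32)  # assumed initial PCR value: 32 zero bytes
    for digest in event_digests:
        pcr = extend(pcr, digest)
    return pcr
# Hypothetical measured components, in boot order.
log = [hashlib.sha256(name).digest() for name in (b"kernel", b"initrd", b"cmdline")]
quoted = replay(log)  # stand-in for the PCR value found in a TPM quote
print(replay(log) == quoted)        # a faithful log reproduces the quoted value
print(replay(log[::-1]) == quoted)  # a reordered log does not
```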

21:41.720 --> 21:50.920
You use the reference values that are published. I mean, ideally users would build the

21:50.920 --> 21:56.680
system themselves, on their premises, and then take the measurements

21:56.680 --> 22:00.280
according to the environment, because they might be different, for example, from cloud to cloud. So

22:00.280 --> 22:07.480
you would get different measurements from early boot on AWS and on Azure; it's not one-to-one the

22:07.480 --> 22:16.920
same yeah I mean you can do it to a certain degree for PCR 11 I think you can do this but yeah

22:16.920 --> 22:35.800
in general yeah yeah but I mean that's more or less like like you have yeah so the if you're

22:35.800 --> 22:42.200
only... right, so the argument was: if you only use PCR 11, it won't give you many guarantees

22:43.080 --> 22:48.200
here. But the point is that at the moment we're not really in the business of

22:49.320 --> 22:54.840
deciding what reference values users want to use; we more or less want to just move everything

22:54.840 --> 23:02.840
that we can get to the validator side, so they can make educated decisions based on their setup.
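
NOTE
A minimal sketch of the validator-side decision described above, assuming the evidence carries all quoted PCR values and the user chooses which ones to compare against their own reference values. The function name mismatches and all values are hypothetical placeholders, not real measurements.
```python
def mismatches(quoted: dict, reference: dict, selected: list) -> list:
    # Return the selected PCR indices that are missing or differ from the reference.
    return [i for i in selected if i not in reference or quoted.get(i) != reference[i]]
quoted = {0: "fw-cloud-a", 8: "config-v1", 11: "uki-v1"}     # from the evidence
reference = {0: "fw-cloud-b", 8: "config-v1", 11: "uki-v1"}  # published or self-built
print(mismatches(quoted, reference, [8, 11]))     # policy that ignores early boot
print(mismatches(quoted, reference, [0, 8, 11]))  # policy that also pins firmware
```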

23:02.840 --> 23:16.600
that would be for the moment my level of education so you know this is I mean obviously

23:16.600 --> 23:22.360
it's exciting, but you mentioned using systemd-analyze to look at... yeah, did you find anything

23:22.360 --> 23:29.960
interesting that maybe it's strange because I want to make our fans faster so I mean that yeah

23:29.960 --> 23:37.480
So, whether we found anything when using systemd-analyze that makes the startup faster? I mean, yes,

23:37.480 --> 23:47.560
but it was our own stuff. Sometimes clouds have weird ways of setting those

23:47.560 --> 23:56.840
setups up, and there's cloud-init, and we cannot use cloud-init. So this is where we

23:56.840 --> 24:03.240
had to find out when we have to call home to the provisioning server, etc., those kind of

24:03.240 --> 24:14.600
respect what more you're talking to an agent we can say today explain that you have to decide

24:14.600 --> 24:21.320
for the hardware we have to sort the vTPM for the this going to take encryption we know

24:21.320 --> 24:30.360
the laboratories for it is in the case and we have kind of plans to eliminate it okay so the question

24:30.360 --> 24:37.160
was, I think the question was: do we have to rely on the vTPM for this encryption? Yeah.

24:38.120 --> 24:44.120
so I don't know the context really but I mean it's for unenlightened guests I think this

24:46.680 --> 24:51.480
like unenlightened means in a confidential virtual machine without

24:54.200 --> 24:59.960
any addition that the virtual machine would know that it's running in a confidential VM so that's

24:59.960 --> 25:05.640
why I think the vTPM is convenient. I don't think this requirement is present in general.

25:06.280 --> 25:12.360
So for us it's just convenient because we don't have to write code. I mean, if the vTPM

25:12.360 --> 25:15.640
is present, then we can use, as I said, a few config files and get measured boot.

25:24.520 --> 25:27.080
yeah I don't know the context really so that's why

25:35.640 --> 25:40.200
systems going and they just somehow need to get the system the key to the

25:40.200 --> 25:45.640
vTPM and that's where you use the vTPM must have an innovation somewhere else you just

25:45.640 --> 25:52.440
get the vTPM. Yeah, it's a bit too long to go into, but I think we're out of time, so again thanks to the

25:52.440 --> 25:55.640
speaker. Thank you.