WEBVTT

00:00.000 --> 00:10.000
Yes, thank you, Kenneth. You can hear me, okay? Back there? All right, perfect. So I'm

00:10.000 --> 00:14.880
JP, some of you may know me from Twitter or Mastodon or whatever, and I'm going to talk

00:14.880 --> 00:20.240
about the programming models that we have in ROCm. Now, I somehow missed how many minutes

00:20.240 --> 00:25.680
I have, and I have way too many slides probably, so let's keep it up. First off, AMD has

00:25.680 --> 00:29.920
too many compilers; that's a complaint that I hear at times. There's two compilers.

00:29.920 --> 00:34.480
For Instinct, there's ROCm, and for EPYC, there's AOCC. Don't confuse them.

00:34.480 --> 00:38.960
Anything else is for development and research or whatever. So ROCm, if you want to

00:38.960 --> 00:44.720
do GPU offloading; AOCC, if you're targeting the EPYC processors, okay? Two compilers, not too many.

00:46.480 --> 00:54.480
Okay, ROCm, what is ROCm? ROCm is our software stack, right? So we go from

00:54.640 --> 00:59.440
deployment tools and low-level stuff at the bottom all the way up to actual applications and

00:59.440 --> 01:04.640
benchmarks. And there are libraries in between. We have support for different operating systems.

01:04.640 --> 01:12.400
We're bringing in more support for operating systems. But today, I will be mostly looking at this

01:12.400 --> 01:19.840
part of the stack. So these are the programming models, which are HIP, OpenMP, and to some extent,

01:19.920 --> 01:24.480
OpenCL, and then sitting on top of that is the library stack, right? So this is like rocBLAS

01:24.480 --> 01:31.200
and rocFFT or something. Okay, so thinking about the programming models,

01:32.880 --> 01:39.360
there is HIP. HIP is our grid language. So this is basically, when you're used to doing CUDA,

01:39.360 --> 01:44.000
you would, for AMD GPUs, you would do HIP. So it's a grid language, and I'm going to explain more

01:44.320 --> 01:51.200
what I mean by that. And HIP is meant for giving you maximum control of how you program your

01:51.200 --> 01:58.000
kernels, right? But it's, in a sense, vendor specific, albeit you can run HIP applications on

01:58.000 --> 02:05.680
NVIDIA hardware. So keep that in mind. Then there's OpenMP. Now OpenMP traditionally would use

02:05.680 --> 02:10.720
a fork-join model of thinking, even though it has transitioned more to a task-based idea,

02:11.360 --> 02:15.600
but it's standardized. So there's a standard, and we implement that standard. So it's not vendor

02:15.600 --> 02:21.920
specific. That's also important to keep in mind. And then there's OpenCL, which I tend to

02:21.920 --> 02:26.720
think of also as a grid language, because you actually have more control than with OpenMP,

02:26.720 --> 02:31.680
but it's also standardized. It's actually standardized by the Khronos Group. And so you can rely

02:31.680 --> 02:39.600
on having a standardized thing. So when a vendor implements the OpenCL specification, you can

02:39.680 --> 02:47.840
potentially port to that vendor. Now the other thing is, when you think

02:47.840 --> 02:55.600
about programming models, you also, or at least I also, think about the languages that you can

02:55.600 --> 02:59.200
use these programming models from, right? Because that's important. The programming model doesn't

02:59.200 --> 03:05.040
help you at all if you're programming in the wrong language, right? So this is why I also have

03:05.040 --> 03:10.560
the programming languages here. So in ROCm we support C++, C, and Fortran. So basically

03:10.560 --> 03:16.560
you can mix and match here. And then I also tend to think of what libraries do I get, because

03:16.560 --> 03:21.360
library support is also important since you don't want to redo all the work necessarily yourself,

03:21.360 --> 03:27.760
right? And so I have the ROCm libs, kind of an umbrella term here, and then we're

03:27.760 --> 03:32.960
going to look at stdpar, and also hipfort, which I'm going to touch on briefly later in this talk.

03:33.520 --> 03:39.520
All right, so this matrix, we're going to see further along, and I will highlight which parts

03:39.520 --> 03:48.400
of the stack I'm going to show you parts of. But first off, let's look at the models and the

03:48.400 --> 03:55.040
languages. They actually all go through the same LLVM compiler backend. So that gives you, if we

03:55.040 --> 04:00.240
improve some part of the compiler, that typically carries over to all programming models and all languages,

04:00.560 --> 04:06.160
right? So that's also an important part. And if you download ROCm right now,

04:06.160 --> 04:11.840
that's not entirely true, because our Fortran compiler is based on an older version of LLVM,

04:11.840 --> 04:19.040
because it's still based on classic Flang, but we're moving towards this. And you will get a link

04:19.040 --> 04:25.760
where you can actually get access to this stack later on in the talk. Okay, so as I said, I'm going to

04:26.320 --> 04:30.320
show this matrix more. So I will have examples for HIP, OpenMP, C++, Fortran,

04:31.040 --> 04:36.160
stdpar, ROCm libs, and I actually think hipfort too. I may have missed making that

04:36.800 --> 04:44.640
thing visible. Thank you. So first, let's look at HIP and C++. So getting back

04:44.640 --> 04:49.040
to the grid programming languages, what do I mean by that? So when you think

04:49.040 --> 04:55.360
about the grid programming model, then it's how you have to program a kernel and map basically

04:55.760 --> 05:04.640
your problem to the GPU, right? Or how do you execute your program on the data on the GPU?

05:04.640 --> 05:09.600
And this is what it looks like. So you have a grid, you have a grid of blocks and in those blocks,

05:09.600 --> 05:14.320
you have warps. And I know that's an NVIDIA term, but we have similar things like

05:14.320 --> 05:17.680
wavefronts and whatever, but people are more used to this terminology, right?

05:18.560 --> 05:24.160
Now one distinction I'm going to make: whenever I speak of a lane, I actually mean what would

05:24.320 --> 05:30.720
be called a thread, because I prefer the term lane in this regard, right? That's just a distinction:

05:30.720 --> 05:38.880
whenever I say lane, think of a CUDA thread, potentially. So when you do HIP grid programming,

05:38.880 --> 05:44.800
what do you do? You write a kernel. So here's an example. You put a __global__ specifier here,

05:44.800 --> 05:50.880
and then we're going to have a running example of SAXPY, because that's basically nice and

05:50.880 --> 05:56.080
easy enough to put on the slides, right? You're nodding, okay? That's right. So in HIP you would do

05:56.080 --> 06:04.080
SAXPY, you get some floats, okay? That's great. Then you compute your specific lane ID in the whole grid

06:04.720 --> 06:10.240
using this. You potentially want to check if you're outside of your data set. And if you're not,

06:10.240 --> 06:17.200
you're going to do the computation, right? And so here you see that all of these things are just

06:17.280 --> 06:24.960
identifying your specific lane, your specific work item, in this grid to do the computation.
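
NOTE A minimal host-only sketch of the lane-ID arithmetic described above, not the code from the slides.
The function name saxpy_grid_emulated and the parameter block_dim are hypothetical, chosen for illustration.
It emulates in plain C++ what each lane of a grid kernel computes. A real HIP kernel has no such loops:
every lane runs the body once and reads the built-ins blockIdx.x, blockDim.x and threadIdx.x instead.
```cpp
#include <cassert>
#include <cstddef>
#include <vector>
// Host-only emulation of the grid indexing used by a SAXPY kernel: y = a * x + y.
// The outer loop stands in for the blocks of the grid, the inner loop for the lanes of one block.
void saxpy_grid_emulated(float a, const std::vector<float>& x, std::vector<float>& y,
                         std::size_t block_dim) {
    assert(block_dim > 0 && y.size() >= x.size());
    const std::size_t n = x.size();
    // Round up so that every element is covered even if n is not a multiple of block_dim.
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;
    for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Global lane ID, the host equivalent of blockIdx.x * blockDim.x + threadIdx.x.
            const std::size_t i = block_idx * block_dim + thread_idx;
            // Lanes that fall outside the data set do nothing.
            if (i < n) {
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```
Picking a block_dim that does not divide the array length shows why the bounds check matters:
the last block then contains lanes whose ID lies past the end of the data.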

06:26.080 --> 06:31.040
So for people who are used to programming with CUDA, this should not be a surprise to anyone, right?

06:31.040 --> 06:38.720
That's just how you do it. And it's also just what you do in HIP, right? So that's the GPU part,

06:38.720 --> 06:45.600
and then how do you get that onto the GPU? You actually say, okay, I need some pointers,

06:45.680 --> 06:51.280
I need some memory, I need to copy some data, and then I'm going to execute the kernel on the GPU.

06:51.280 --> 06:56.240
And we also have the triple-chevron syntax here for actually launching the kernel. You can put

06:56.240 --> 07:00.400
that onto a stream, so you can have multiple streams in flight at the same time. So that's very much

07:00.400 --> 07:07.120
what you would expect from CUDA. Speaking of CUDA, let's say you have a CUDA application, right?

07:07.120 --> 07:12.320
And you want to use that on an AMD GPU, what are you going to do? So we have something that works

07:12.960 --> 07:20.400
sometimes, that's called HIPIFY. I know it is not perfect, I know that, so you know,

07:20.400 --> 07:24.720
but it's there and it gets you some way towards actually being able to run stuff on an

07:24.720 --> 07:30.720
AMD GPU. So that would then give you a HIP application to a certain extent. The neat thing about

07:31.760 --> 07:37.680
HIPIFY and HIP, I think, is that you can actually do incremental porting, because you

07:37.680 --> 07:43.680
can set HIP to compile for NVIDIA. So you can replace parts of the application with just

07:43.680 --> 07:49.520
calls to the HIP runtime and actual HIP things. And then at compile time say, oh by the way,

07:49.520 --> 07:53.360
I want to run this on NVIDIA, and you would still be able to execute the same thing on NVIDIA.

07:53.360 --> 07:57.920
And you can mix and match this while you're actually doing the porting. And I think that's neat.

07:59.200 --> 08:03.120
Now there's two versions of HIPIFY. Of course, again there's too many versions, right?

08:03.840 --> 08:08.640
There's a text-based translation; that's more for when you have a single file and you want to do this

08:08.640 --> 08:14.320
easily, in a sense. You would use the text-based version, and then there's a compiler-based version

08:14.320 --> 08:19.520
that's more elaborate, but you also have to make sure that your application already compiles

08:19.520 --> 08:24.720
with the whole thing, right? Because you have to give all the include paths, all the definitions,

08:24.720 --> 08:29.280
everything you need to actually compile that thing, just for doing the HIP translation.

08:30.080 --> 08:39.200
So that's something to keep in mind. Okay. Now OpenMP. I have to say I am actually more on the

08:39.200 --> 08:44.240
OpenMP team, so maybe I'm going to show you more OpenMP than other people would show you, but anyway.

08:44.240 --> 08:51.440
So OpenMP, C++, and Fortran are next. First, OpenMP fundamentals. OpenMP, I said,

08:51.440 --> 08:57.440
fork-join model. What does that mean? That means that theoretically you would do, okay, you start

08:57.440 --> 09:02.240
sequentially and then you fan out into a parallel region and then you basically synchronize and

09:02.240 --> 09:09.280
get back to this. Yes, HIP basically does the same, right? You start sequentially, then you fan out and

09:09.280 --> 09:15.280
whatever. The difference is that in HIP you are the person who decides how to map all the data to all the

09:15.280 --> 09:22.800
lanes. In OpenMP it's the compiler. So it's not you. It's in a sense easier. You don't get the full

09:22.800 --> 09:29.440
control. That's right. But you also do not have to do all the work of, oh, how do I, you know,

09:29.440 --> 09:36.400
which lane is going to write to which data items during the kernel, right? So that's, I think, a positive,

09:37.360 --> 09:45.920
albeit not suitable for every problem you have. Okay. So OpenMP and C++. SAXPY again.

09:45.920 --> 09:51.760
So again we have the same signature. Oh boy. Thank you. Ten minutes. We have the loop that does the

09:51.760 --> 09:57.920
computation, and then we basically do pragma omp target teams distribute parallel for. We map some data

09:59.280 --> 10:05.920
to the device, because we do not need to transfer the result back for the x part, and then we map

10:05.920 --> 10:12.320
tofrom for y. So we bring the data of y to the device and back. Right? So that's SAXPY running on the

10:12.320 --> 10:18.320
GPU with OpenMP and C++. And that makes it actually possible to have a main function that

10:18.400 --> 10:24.080
has the data, does the computation, and brings everything to the GPU. So that's working. That's the

10:24.080 --> 10:30.960
whole example basically. Of course I omitted some things for brevity here. So, you know, I'm not doing

10:30.960 --> 10:35.360
initialization here or printing out the values, but basically this is a fully functional GPU

10:35.360 --> 10:41.680
offloading SAXPY implementation. So that's great. I think that's great. The same actually also

10:41.680 --> 10:46.000
applies to Fortran. So, I'm not a Fortran person, by the way. Okay. So I copy-pasted this, basically,

10:46.560 --> 10:52.400
and let's see. So we have some reals, we have some integers, we have a do loop. And then we say omp target

10:52.400 --> 10:57.360
teams distribute parallel do. Again we do the mapping, and we get a functioning GPU-offloading

10:57.360 --> 11:00.960
Fortran program that does the SAXPY on the GPU. And I think that's neat.
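
NOTE A minimal sketch of the C++ OpenMP SAXPY described above, not the slide's exact code.
The function name saxpy and its parameter names are assumptions made for illustration.
x is only read on the device, so it is mapped with to; y is read and written, so it is mapped with tofrom.
```cpp
#include <cassert>
// SAXPY with OpenMP target offloading: y = a * x + y for n elements.
void saxpy(float a, const float* x, float* y, int n) {
    assert(x != nullptr && y != nullptr && n >= 0);
    // map(to: ...) copies x to the device only; map(tofrom: ...) copies y there and back.
    // The scalars a and n are passed to the device by value by default.
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```
Unlike a grid kernel, there is no index computation here: the compiler and runtime decide which lane handles which iteration.
If the file is built without OpenMP support, the pragma is ignored and the same loop simply runs on the host.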

11:03.600 --> 11:11.680
All right. C++ stdpar and ROCm libs. By the way, anybody here who prefers to write

11:11.680 --> 11:19.920
pure C++ and really dislikes OpenMP for the pragmas? Yes. Okay. Okay. I assumed there

11:19.920 --> 11:26.560
were people here in the audience that would feel that way. So SAXPY. Let's do a std::transform here.

11:26.560 --> 11:33.120
That's kind of nice C++, potentially. How do you bring that to the GPU? Well, you could

11:33.120 --> 11:39.840
simply add another execution policy here, which is std::execution::par_unseq, give the compiler

11:40.000 --> 11:45.440
a flag, and it will offload that transform to the GPU using some HIP magic. Okay. So if you prefer

11:45.440 --> 11:56.720
to stay within C++, that might be a way for you to go. ROCm libraries. Maybe you don't want

11:56.720 --> 12:04.800
to actually write the kernels yourself, right? You can use the libraries. And so here

12:04.800 --> 12:10.400
what you would do is use rocBLAS, for example. You would create a handle. Then you still need

12:10.400 --> 12:14.720
to do some memory allocations here and memory transfers. So you would need to interact with HIP

12:14.720 --> 12:22.320
in that sense, doing hipMalloc and hipMemcpy. But then basically you make rocBLAS able to

12:22.320 --> 12:28.160
access a pointer on the host. We're using that just for the alpha value, which is a scalar.

12:28.160 --> 12:34.720
So we don't bother transferring that to the GPU ourselves. Then we call into the rocBLAS API

12:34.960 --> 12:40.560
for doing this SAXPY. And we're going to copy the result back to the host. And basically

12:40.560 --> 12:50.000
all the actual GPU work happens here. And we have to destroy the handle again. So

12:50.000 --> 12:57.600
that's SAXPY when you're using rocBLAS. Okay. Now I don't have the preview here. Let's see.

12:57.600 --> 13:05.600
I think now I'm talking more about Fortran. And that's the next-generation Fortran compiler.

13:06.320 --> 13:12.320
Yeah. Great. So that's the journey of our team. So the next-generation Fortran compiler:

13:14.240 --> 13:21.840
in 2017, upstream LLVM started a project to come up with the next-generation Fortran compiler.

13:21.840 --> 13:28.720
So they started implementing the base language features. Then at some point they realized,

13:28.720 --> 13:34.400
okay, we are far enough when it comes to base language features. Let's do some OpenMP host-side

13:35.120 --> 13:45.280
support. And then what AMD does is, for two years, two and a half years now, I think,

13:45.280 --> 13:51.920
we are actively contributing OpenMP GPU support upstream for OpenMP target offloading

13:51.920 --> 13:56.400
in upstream LLVM's Fortran compiler. And we also do that downstream.

13:58.720 --> 14:05.760
So there was a recent blog post about the journey of some people from our

14:05.760 --> 14:12.160
apps team with the next-generation Fortran compiler. And you can find the actual article.

14:12.240 --> 14:16.640
There was a blog post. You can find that at that link. They go through what they had to do,

14:16.640 --> 14:23.760
what worked, and what didn't. And I think it's worth a read. Everybody's done taking photos?

14:25.760 --> 14:32.720
Okay. And one more. Okay. All right. So these code examples are taken from the blog post.

14:32.720 --> 14:37.840
Okay. So again, I'm not a Fortran person. The example is a Jacobi solver. So we have,

14:38.240 --> 14:45.360
okay. I think I can make that. So we have Jacobi. We have a module. So we have some type here.

14:45.360 --> 14:51.920
And then we continue on the next slide. There are not too many surprises here. So we

14:51.920 --> 14:57.040
simply allocate the components. And then here we actually map the components to the GPU

14:57.040 --> 15:03.760
doing the OpenMP target enter data. And then we have some more code that's removed for brevity.

15:03.840 --> 15:12.560
And then in the actual run Jacobi, we have a do loop. Excuse me. And we call some routines.

15:12.560 --> 15:17.440
And I'm going to show you two routines from that example. So update and norm.

15:17.440 --> 15:25.680
And I'm only going to show this because the target annotation here, that brings us,

15:25.680 --> 15:30.480
brings this code to the GPU. And we have collapse here. So we can collapse the two loops.

15:30.560 --> 15:34.160
And that's an important feature when you're going to the GPU because you want to increase the

15:34.160 --> 15:38.160
iteration space so you can map these things better to the teams that you're running on the GPUs.

15:39.680 --> 15:46.240
And the other part here is the norm, because we have the collapse here too, but we also have

15:46.240 --> 15:52.000
the reduction. And we want to reduce on the GPU too. And so we support that, of course; we actually

15:52.000 --> 15:57.440
have a pretty fast implementation for the reduction on the GPUs, both in Flang, so

15:57.440 --> 16:02.480
in Fortran, and in C++. So that's something to keep in mind as well.
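
NOTE A minimal C++ sketch of the collapse and reduction clauses discussed above.
It is not the Fortran code from the blog post; the function name sum_of_squares and the row-major layout are assumptions.
collapse(2) merges both loops into one iteration space of nx * ny iterations, and reduction(+ : sum)
combines the partial sums of all teams and threads into a single value.
```cpp
#include <cassert>
// Sum of squares over an nx-by-ny grid stored row-major in one contiguous array.
double sum_of_squares(const double* u, int nx, int ny) {
    assert(u != nullptr && nx >= 0 && ny >= 0);
    double sum = 0.0;
    // The grid is only read on the device (to); sum must come back to the host (tofrom).
    #pragma omp target teams distribute parallel for collapse(2) reduction(+ : sum) map(to: u[0:nx * ny]) map(tofrom: sum)
    for (int j = 0; j < ny; ++j) {
        for (int i = 0; i < nx; ++i) {
            const double v = u[j * nx + i];
            sum += v * v;
        }
    }
    return sum;
}
```
Collapsing matters most when a single loop level has too few iterations to keep all teams busy;
the merged space gives the runtime more iterations to distribute.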

16:03.920 --> 16:11.680
And here, that's where I actually wanted to get to. So we now make preview versions of

16:11.680 --> 16:15.920
ROCm available for download through the Infinity Hub that have the next-generation

16:15.920 --> 16:24.400
Fortran compiler. So this is kind of a preview release. So you can download it, you can install it,

16:24.560 --> 16:31.040
you can use it. And we are happy if you do that to evaluate it and open tickets against any of

16:31.040 --> 16:37.760
the components in our GitHub. So you can download ROCm here in the preview build and do your

16:37.760 --> 16:47.200
experimentation. So in addition to, or sitting on top of, the programming models and the libraries,

16:47.200 --> 16:52.480
we also have other frameworks that we also test internally to make sure that we hit the performance

16:52.560 --> 16:56.640
everyone wants. And you can make use of that. So, for example, one to name is Kokkos.

16:57.840 --> 17:01.760
Sometimes, at least if you're using the OpenMP target backend from Kokkos,

17:01.760 --> 17:11.440
compile times are kind of a bit, sometimes, a little too long. But we enable the use of these frameworks

17:12.160 --> 17:20.240
and test internally that, you know, you get the performance you want. So wrapping up,

17:20.320 --> 17:25.520
AMD ROCm is high performance, open source, and portable. High performance: it powers

17:25.520 --> 17:32.080
some of the TOP500 list leaders. We have solutions for HPC and AI. We have compilers,

17:32.080 --> 17:37.440
libraries and frameworks as part of the stack. It's open source and we are committed to the open

17:37.440 --> 17:42.880
ecosystem. So we contribute a lot of work upstream, for LLVM at least. That's where I work, so

17:42.880 --> 17:47.600
that's where I'm most familiar with. We're active in community engagement and we're driving

17:47.600 --> 17:53.200
development both in implementation and in standardization. So we are also a member for example

17:53.200 --> 18:00.560
of the OpenMP ARB and we contribute there for the standardization efforts. And it's portable. So

18:00.560 --> 18:09.760
in the sense that we think portable models should be what everybody uses. So you can easily switch

18:09.760 --> 18:18.480
between vendors. And I think that's a good solution also for the evolving landscape of accelerators.

18:19.200 --> 18:27.760
It's not necessarily just tied to GPUs. And of course, ROCm or our products are also supported

18:27.760 --> 18:33.760
through these third-party libraries, as I mentioned. So if you're on Kokkos, you can more or less

18:33.760 --> 18:40.720
easily move to our products or GPUs. And with that, I think I have to show you this slide.

18:40.720 --> 18:43.920
And I thank you very much and I'm happy to take questions.

18:43.920 --> 19:06.880
Any questions? Can I tell you whether there's NPU support in ROCm?

19:06.880 --> 19:15.680
Yeah. I don't know. Sorry. I would love to have the answer myself, but I don't know.

19:15.680 --> 19:33.840
So the question is, LLVM's backend has been more opinionated towards CPUs, and what challenges

19:33.840 --> 19:45.200
we face. I'm not necessarily a GPU person, but I believe that some of the optimization passes

19:45.200 --> 19:51.360
do not necessarily assume some of the address-space difficulties that we face. And we've stumbled

19:51.360 --> 19:56.160
over address-space problems every now and then that we needed to fix, because there were assumptions

19:56.160 --> 20:03.120
in the actual optimizations or in the codegen that we had to fix for targets that use more than one

20:03.120 --> 20:14.240
address space. Yeah. Thank you.

20:14.240 --> 20:28.560
Yes. So whether the Xilinx FPGAs will be supported in ROCm at some point, I don't know. Sorry.

20:33.120 --> 20:51.280
Okay. Okay. Also the problems.

20:51.280 --> 20:58.960
Yeah. Yeah. So the comment was that there's one programming model missing, which is

20:58.960 --> 21:05.040
SYCL, and I'm not sure if there's... so these are the officially supported programming models.

21:05.040 --> 21:09.280
Right? And I think there's a distinction to be made because we actually have more stuff in ROCm

21:09.280 --> 21:14.880
that you can use, but the question is whether it's officially supported. Because we inherit

21:14.880 --> 21:22.880
a bunch of upstream stuff, but we don't necessarily test all of it, but yeah. So on AMD GPUs,

21:22.880 --> 21:28.160
you can go the route of SYCL through things like AdaptiveCpp, for example, which was formerly

21:28.240 --> 21:33.200
known as hipSYCL, and so you would get that too. But I don't think it's officially

21:33.200 --> 21:38.000
part of the ROCm stack. Good comment though.

21:51.280 --> 21:57.360
What's that? So I mentioned that HIP does support NVIDIA; what about Intel?

21:57.840 --> 22:02.800
I don't know. Yeah, I'm sorry. It's a very short answer, but I don't know.

22:03.840 --> 22:06.800
Currently, it doesn't. For sure. Right? Okay.

22:12.800 --> 22:18.240
Is around, when you're talking about credibility, the most things we have is what's going to be

22:18.240 --> 22:24.320
intermediate by R, where does it lead across parts? My understanding is that AMD has now,

22:24.320 --> 22:28.720
you've got a portability to code low, but when you're actually targeting, when you have multiple

22:28.720 --> 22:34.560
targets in your cluster, then I handle a dozen of this review in my university data set,

22:35.360 --> 22:39.040
re-outwork, and I can't do any easiest on the way to the right amount.

22:39.040 --> 22:43.200
I've already planned to fix that, and something else fear me in the long-standing sound.

22:43.200 --> 22:46.080
Yeah. Do you agree with the different parts that we have?

22:46.400 --> 22:54.800
Yeah. So the question was whether there are any plans for an intermediate representation

22:54.800 --> 23:00.880
that we would be able to do JIT compilation, more or less, to other targets, because right now

23:00.880 --> 23:06.080
you would basically always have to compile to a specific ISA, and that kind of limits portability.

23:06.720 --> 23:12.480
So we are currently landing, or we recently landed, patches upstream that would allow us to compile

23:13.120 --> 23:18.960
for generic targets, so you would say gfx10 generic, and that would give you access to

23:18.960 --> 23:24.800
all of the gfx10 series GPUs. You may still want to compile to gfx1030,

23:25.440 --> 23:29.440
for example, that's the most specific one, the ones you care about most,

23:30.480 --> 23:36.560
so that you get the best performance on the target you care about most, but you could still

23:36.640 --> 23:47.040
target all the other GPUs, too. For the SPIR-V question, I'm not completely sure what exactly the plans are,

23:47.040 --> 23:53.680
but of course we are looking into ways to solve exactly that problem, because it is a problem,

23:53.680 --> 23:56.240
even internally. Thank you.

24:06.560 --> 24:21.520
I did not quite catch that question. Maybe we can take it offline.

