WEBVTT

00:00.000 --> 00:21.000
Thank you. So currently I'm working on a new approach to callee-saved registers in LLVM.

00:21.000 --> 00:27.000
I find it pretty interesting, so I'm going to tell you what it is.

00:28.000 --> 00:31.000
So this work is still ongoing.

00:31.000 --> 00:40.000
There is, like, a proof-of-concept branch which people can try and see how it works.

00:40.000 --> 00:48.000
So now I'm in the process of breaking it up into smaller commits and upstreaming them,

00:48.000 --> 00:58.000
one by one. Let's first quickly recall what callee-saved registers are.

00:58.000 --> 01:12.000
So they must be preserved by a function, meaning that the value of the register on entry has to be the same as the value on exit.

01:12.000 --> 01:24.000
So for the caller it means that a call instruction does not modify any callee-saved register.

01:24.000 --> 01:34.000
So the caller can use it for other things, and for the callee it means that if the callee wants to use a callee-saved register,

01:34.000 --> 01:38.000
it has to make sure it does not clobber it.

01:38.000 --> 01:45.000
Usually by spilling; sometimes it can copy the register somewhere else.

01:45.000 --> 01:54.000
And which registers are callee-saved is specified by the calling convention.

01:54.000 --> 02:00.000
So let's go through a minimal example but it's still interesting.

02:00.000 --> 02:07.000
So we have this tiny function which calls foo.

02:07.000 --> 02:14.000
And this is the assembly that we get for RISC-V.

02:14.000 --> 02:21.000
At least this was what I got a couple of years ago.

02:21.000 --> 02:28.000
Right, so.

02:28.000 --> 02:33.000
We have a return address register which is modified by the call to foo.

02:33.000 --> 02:37.000
And since the return address is callee-saved,

02:37.000 --> 02:43.000
you have to, the caller has to preserve it.

02:43.000 --> 02:51.000
So first it stores the return address to the stack, because the call to foo could overwrite it otherwise.

02:51.000 --> 02:56.000
Then at the exit we restore it.

02:56.000 --> 03:04.000
Okay, then the arguments are passed in a0.

03:04.000 --> 03:11.000
So x would be passed to the callee in a0, and the value of x is used in x plus two.

03:11.000 --> 03:14.000
The expression.

03:14.000 --> 03:22.000
So since the call to foo will

03:22.000 --> 03:27.000
clobber a0,

03:27.000 --> 03:33.000
we first move it.

03:33.000 --> 03:40.000
Yeah, we move a0 to s0 because it is callee-saved.

03:40.000 --> 03:42.000
And then we use it here.

03:42.000 --> 03:45.000
So we have used s0.

03:45.000 --> 03:52.000
So it means we have to store it and then later load it back.

03:52.000 --> 03:56.000
Well, this is five instructions.

03:56.000 --> 04:00.000
But we can actually do this in four.

04:00.000 --> 04:04.000
So this is the assembly with my patch.

04:04.000 --> 04:09.000
Same version of LLVM, just with my patch enabled.

04:09.000 --> 04:14.000
So here we store a0 directly.

04:14.000 --> 04:19.000
And then before we use it we load it back.

04:19.000 --> 04:24.000
So we don't need the move.

04:24.000 --> 04:27.000
So with this example,

04:27.000 --> 04:36.000
let's step through the LLVM pipeline and see if we can do something about this copy.

04:36.000 --> 04:46.000
Right after the front end we have the following LLVM IR, which is pretty dumb.

04:46.000 --> 04:51.000
We have this first.

04:51.000 --> 04:53.000
We allocate some space.

04:53.000 --> 04:58.000
Then we store to the stack and almost immediately reload.

04:58.000 --> 05:03.000
So the point of the front end is simply to lower, like, a high-level

05:03.000 --> 05:08.000
language to LLVM IR.

05:08.000 --> 05:17.000
Then a lot of middle-end passes happen, the optimizations.

05:17.000 --> 05:22.000
So these are mostly target independent.

05:22.000 --> 05:30.000
And they get rid of these unnecessary instructions.

05:30.000 --> 05:33.000
And also canonicalize the IR.

05:33.000 --> 05:36.000
Even though we can't really see much

05:36.000 --> 05:42.000
in this example, except getting rid of these unnecessary instructions.

05:42.000 --> 05:45.000
Okay, then the next part is really interesting.

05:45.000 --> 05:47.000
It's called instruction selection.

05:47.000 --> 05:53.000
So after all the machine-independent optimizations,

05:54.000 --> 06:00.000
this stage will lower LLVM IR to Machine IR,

06:00.000 --> 06:08.000
which is kind of target-specific instructions.

06:08.000 --> 06:13.000
So the code is still in SSA form.

06:13.000 --> 06:19.000
And it uses infinitely many virtual registers.

06:20.000 --> 06:26.000
Except for the calling convention.

06:26.000 --> 06:35.000
So instruction selection also takes care of parts of the calling convention.

06:35.000 --> 06:43.000
So it declares all the incoming arguments as live-ins.

06:43.000 --> 06:50.000
Then it copies them, in the second instruction, to virtual registers.

06:50.000 --> 06:57.000
Then at the exit it copies the return values to wherever they are supposed to be

06:57.000 --> 06:59.000
by the calling convention.

06:59.000 --> 07:12.000
And if foo had any arguments, we would pass the arguments to foo also at this stage.

07:12.000 --> 07:24.000
All right, so then many more machine passes happen.

07:24.000 --> 07:30.000
And we arrive at the register allocator.

07:30.000 --> 07:38.000
The register allocator will assign virtual registers to physical ones.

07:38.000 --> 07:45.000
And while doing so, it tries to choose assignments which make copies identities.

07:45.000 --> 07:48.000
So it tries to get rid of the copies.

07:48.000 --> 07:54.000
So for example, this copy near the end.

07:54.000 --> 07:59.000
The register allocator assigns the virtual register to x10.

07:59.000 --> 08:02.000
So we don't need this copy anymore.

08:02.000 --> 08:07.000
But we cannot get rid of the first copy.

08:07.000 --> 08:20.000
Because x10, sorry, because the live range of %0 includes the call instruction,

08:20.000 --> 08:24.000
which clobbers x10.

08:24.000 --> 08:30.000
So we have interference and we choose x8

08:30.000 --> 08:37.000
for virtual register %0.

08:37.000 --> 08:46.000
Next, after we allocate the registers, we know which registers are used by the function.

08:46.000 --> 08:56.000
So the prologue/epilogue inserter needs to save all the callee-saved registers.

08:56.000 --> 09:08.000
So in this case, we have to store x1, which is the return address register, and s0, x8.

09:08.000 --> 09:10.000
And then reload them back.

09:10.000 --> 09:13.000
And so, after that, we emit the assembly.

09:13.000 --> 09:19.000
And there is this copy instruction, which we don't really want.

09:19.000 --> 09:29.000
So the original motivation for this work was shrink wrapping.

09:29.000 --> 09:38.000
In this example, we only need to save some registers and restore them if we take the branch.

09:38.000 --> 09:45.000
Because if you don't, then nothing clobbers anything.

09:45.000 --> 09:50.000
And we don't need any prologue or epilogue.

09:50.000 --> 09:56.000
But LLVM still emits the whole prologue at the function entry.

09:56.000 --> 10:03.000
So it is executed even though we might not take the branch.

10:03.000 --> 10:18.000
So with my patch, what we would like to happen is that the prologue code is moved inside the branch.

10:18.000 --> 10:21.000
So how can we do that?

10:21.000 --> 10:28.000
Okay, I have read some... well, when I saw this, I said, okay,

10:28.000 --> 10:31.000
maybe there are some papers on shrink wrapping.

10:31.000 --> 10:34.000
And there are, from the nineties.

10:34.000 --> 10:41.000
And as I read them, it seemed like a lot of work and pretty complicated.

10:41.000 --> 10:59.000
And it also seems like what they are basically trying to do is to choose a good place to save and restore

10:59.000 --> 11:01.000
each callee-saved register.

11:01.000 --> 11:14.000
Well, how is this different from just choosing a good spill for any other register?

11:14.000 --> 11:26.000
So basically what I'm saying is the register allocator is supposed to be good at choosing a good place to insert a store to the stack and a restore.

11:26.000 --> 11:31.000
And this problem is not really different from any other spill.

11:31.000 --> 11:38.000
So can we make the register allocator do all this work for us?

11:38.000 --> 11:46.000
And so I tried this kind of hack.

11:46.000 --> 11:59.000
Right after instruction selection, we declare every callee-saved register as live-in.

11:59.000 --> 12:05.000
Then we copy each physical callee-saved register into a virtual one.

12:05.000 --> 12:14.000
And then at each return instruction, we copy from the corresponding virtual register back to physical register.

12:14.000 --> 12:22.000
And make the return instruction use this physical register implicitly.

12:22.000 --> 12:33.000
So basically this hack says that the value of the register on entry has to be the same as on exit.

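NOTE
A minimal sketch of the trick described in the cues above, added for readers of the transcript and not taken from the speaker's patch. It assumes a hypothetical toy IR where each instruction is an (opcode, dst, src) tuple; the names CALLEE_SAVED, add_csr_copies and the %csrN virtual registers are illustrative only and are not LLVM APIs.
```python
# Assumed subset of callee-saved registers, for illustration only.
CALLEE_SAVED = ["x8", "x9"]
def add_csr_copies(body):
    """Copy each callee-saved register to a virtual register on entry and back before every return."""
    # Entry: one COPY per callee-saved register into a fresh virtual register.
    out = [("COPY", f"%csr{i}", reg) for i, reg in enumerate(CALLEE_SAVED)]
    for inst in body:
        if inst[0] == "RET":
            # Exit: copy every virtual register back into its physical register.
            out += [("COPY", reg, f"%csr{i}") for i, reg in enumerate(CALLEE_SAVED)]
        out.append(inst)
    return out
body = [("CALL", None, "foo"), ("RET", None, None)]
result = add_csr_copies(body)
```
In this toy model the register allocator would then be free to keep each %csrN in its original register or spill it wherever that is cheapest.
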
12:33.000 --> 12:43.000
So somehow this worked.

12:43.000 --> 12:57.000
And actually this approach is supposed to simplify the register allocator, because right now in the register allocator

12:57.000 --> 13:10.000
there is some logic which says that if you want to use a callee-saved register for the first time, there is some cost.

13:10.000 --> 13:12.000
Then you would have to save it and restore it.

13:12.000 --> 13:18.000
So with this approach we don't really need this logic anymore.

13:18.000 --> 13:26.000
So we can get rid of that, even though I haven't tried it yet.

13:26.000 --> 13:31.000
So when I first implemented this trick,

13:31.000 --> 13:35.000
I saw degradations on, like, every benchmark.

13:35.000 --> 13:39.000
And then I started investigating why this happens.

13:39.000 --> 13:49.000
And basically it's because now we sort of exercise the previous passes in new ways.

13:49.000 --> 13:55.000
Now we have created many very long live ranges.

13:55.000 --> 14:04.000
And so things like machine loop-invariant code motion and the register allocator

14:04.000 --> 14:10.000
and some other passes, they just needed some improvements.

14:10.000 --> 14:19.000
So I did a couple of tweaks and I got good results.

14:19.000 --> 14:28.000
So this is the result for part of integer SPEC.

14:28.000 --> 14:36.000
And this is just the dynamic instruction count.

14:36.000 --> 14:42.000
And I did this measurement on the train workload.

14:42.000 --> 14:50.000
So we got almost, yeah, eight point five percent on gcc, which is good.

14:50.000 --> 15:02.000
So I also measured improvements on other benchmarks, like FP benchmarks, but no one has verified

15:02.000 --> 15:09.000
them yet, so I'm kind of hesitant to post them here.

15:09.000 --> 15:18.000
But then I discovered that CFI is not so easy now.

15:18.000 --> 15:21.000
Before this work I didn't even know what CFI is.

15:21.000 --> 15:28.000
So, as I understand CFI:

15:28.000 --> 15:38.000
So there is a requirement on a program that at any point we should be able to restore the values of all the callee-saved registers.

15:38.000 --> 15:41.000
And CFI instructions,

15:41.000 --> 15:47.000
well, they're not really instructions, they are more like assembler directives,

15:47.000 --> 15:56.000
which encode basically where the callee-saved registers are at any important point.

15:56.000 --> 16:01.000
So before, we had a dedicated pass to emit the prologue and epilogue.

16:01.000 --> 16:06.000
So when we come to this pass, we know: okay, I'm saving this callee-saved register,

16:06.000 --> 16:09.000
I will emit CFI accordingly.

16:09.000 --> 16:18.000
But what we did here is we basically said to the back-end optimizations:

16:18.000 --> 16:23.000
Do whatever you want but preserve this register somehow.

16:23.000 --> 16:28.000
And now we don't really know where this callee-saved register is saved.

16:28.000 --> 16:39.000
So you have to now instead calculate where CFI should be.

16:39.000 --> 16:43.000
We can do so.

16:43.000 --> 16:49.000
So this is an example with one register, to hopefully make it clear.

16:49.000 --> 16:56.000
So let's look at how we can track what happens to x10.

16:56.000 --> 17:01.000
So we know that on return the value is in x10.

17:01.000 --> 17:07.000
And then we just use reaching definitions analysis to see which definitions reach it.

17:07.000 --> 17:12.000
For x10, in this case it is just this one load instruction.

17:12.000 --> 17:21.000
Then we say: okay, which definitions, which instructions write to this stack slot zero?

17:21.000 --> 17:24.000
Well it's these two instructions.

17:24.000 --> 17:35.000
And then we continue tracking backwards until we come to, like, the entry of the function.

17:35.000 --> 17:43.000
And by doing this backtracking, at each point of the program we basically know where the callee-saved register is.

17:43.000 --> 17:49.000
So we can emit CFI accordingly.

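NOTE
A rough sketch of the backward tracking described in the cues above, added for readers of the transcript. It is a simplification for straight-line code only; the approach in the talk uses reaching definitions analysis over the whole function. The tuple format (opcode, dst, src), the function track_locations, the name slot0 and the choice of register x8 are all assumptions made for illustration.
```python
def track_locations(insts, reg):
    """For each instruction index, return where reg's entry value lives just before that instruction runs."""
    loc = reg  # on return, the entry value must be back in reg
    where = [None] * len(insts)
    for i in range(len(insts) - 1, -1, -1):
        op, dst, src = insts[i]
        if dst == loc and op in ("load", "store", "copy"):
            loc = src  # before this instruction the value lived in src
        where[i] = loc
    return where
insts = [
    ("store", "slot0", "x8"),  # spill x8 to stack slot 0
    ("li", "x8", None),        # x8 is reused for something else
    ("load", "x8", "slot0"),   # reload the saved value
    ("ret", None, None),
]
locations = track_locations(insts, "x8")
```
Each change of location between neighbouring entries of the result marks a point where a CFI directive would be needed in this toy model.
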
17:49.000 --> 17:54.000
Yeah so as I said this is still like far from completed.

17:54.000 --> 17:56.000
It's still in progress.

17:56.000 --> 18:03.000
And many people help with reviews and suggestions and PRs.

18:03.000 --> 18:13.000
And thank you for listening.

18:13.000 --> 18:18.000
We have questions.

18:18.000 --> 18:19.000
Yes?

18:19.000 --> 18:24.000
So I understand you have changed how callee-saved registers are handled,

18:24.000 --> 18:30.000
to simplify some parts of the code, because you don't have the special treatment of callee-saved registers.

18:30.000 --> 18:34.000
And it can produce better code if you adjust some other passes.

18:34.000 --> 18:35.000
Thank you very much.

18:35.000 --> 18:36.000
Our X-Tem.

18:36.000 --> 18:38.000
So these are net wins.

18:38.000 --> 18:42.000
But I expect there could be some drawbacks.

18:42.000 --> 18:44.000
For example, compilation time.

18:44.000 --> 18:46.000
And perhaps binary size.

18:46.000 --> 18:49.000
If you include the CFI, that can be more complicated.

18:49.000 --> 18:53.000
Have you measured the drawbacks as well?

18:53.000 --> 18:57.000
So can you repeat the last part of the question?

18:57.000 --> 18:58.000
No.

18:58.000 --> 18:59.000
If you repeat it.

18:59.000 --> 19:00.000
The microphone.

19:00.000 --> 19:01.000
Ah.

19:01.000 --> 19:02.000
I'm supposed to repeat.

19:02.000 --> 19:03.000
Yes.

19:03.000 --> 19:10.000
The question is if there are some impacts on binary size and compilation time.

19:10.000 --> 19:11.000
Yes.

19:11.000 --> 19:16.000
The CFI sections increase by some factor.

19:16.000 --> 19:21.000
Which maybe could be reduced.

19:21.000 --> 19:22.000
I don't know.

19:22.000 --> 19:25.000
So far I just focused on, like, emitting it correctly.

19:25.000 --> 19:30.000
And the compilation time, yes.

19:30.000 --> 19:32.000
For example, recently I looked.

19:32.000 --> 19:35.000
I had to modify reaching definition analysis as well.

19:35.000 --> 19:39.000
For example, because it couldn't track stack slots before.

19:39.000 --> 19:42.000
And like after my modification.

19:42.000 --> 19:47.000
memory usage increased by 60%.

19:47.000 --> 19:50.000
But I'm working on it.

19:51.000 --> 19:56.000
There is still work to do and we don't know how much impact.

19:56.000 --> 19:57.000
Yeah.

19:57.000 --> 19:59.000
We're not having the end.

19:59.000 --> 20:00.000
Yeah.

20:04.000 --> 20:08.000
And what is the CFI information used for? Is it for exceptions?

20:08.000 --> 20:09.000
Yes.

20:09.000 --> 20:11.000
Exception handling.

20:11.000 --> 20:17.000
So the question was, what is CFI information used for?

20:17.000 --> 20:22.000
It's used for exception handling.

20:22.000 --> 20:27.000
It's used by, like, the operating system.

20:27.000 --> 20:29.000
Unwind.

20:29.000 --> 20:31.000
Backtraces.

20:31.000 --> 20:32.000
Maybe other things.

20:32.000 --> 20:34.000
Like, I only know about CFI.

20:34.000 --> 20:36.000
And then you can do this.

20:36.000 --> 20:37.000
So that's all I know.

20:37.000 --> 20:40.000
For backtraces, also in the debugger.

20:40.000 --> 20:41.000
Yeah.

20:41.000 --> 20:42.000
The debugger?

20:42.000 --> 20:43.000
Yeah.

20:43.000 --> 20:46.000
But with the debugger there is another complication.

20:46.000 --> 20:49.000
Because the debugger, like, it needs to know where the

20:49.000 --> 20:50.000
prologue or epilogue ends.

20:50.000 --> 20:53.000
Because when you single step.

20:53.000 --> 20:55.000
So that's simply.

20:55.000 --> 20:58.000
You basically need to know.

20:58.000 --> 21:01.000
if you are inside the prologue or epilogue or not.

21:01.000 --> 21:05.000
So now the prologue and epilogue are, like, in many different places.

21:05.000 --> 21:08.000
So that has to be fixed somehow.

21:08.000 --> 21:09.000
Yeah.

21:09.000 --> 21:13.000
Is this specific to any one back end?

21:13.000 --> 21:14.000
Yeah.

21:14.000 --> 21:17.000
So far I have only tried it on RISC-V.

21:17.000 --> 21:22.000
But I think in principle it could work on other targets too.

21:22.000 --> 21:27.000
But the back-end authors have to do that work.

21:27.000 --> 21:31.000
And well, hopefully we will like this part.

21:31.000 --> 21:34.000
Or try to make it target independent.

21:34.000 --> 21:39.000
It's just, for now, to simplify things, I just work with one target.

21:39.000 --> 21:40.000
Yeah.

21:44.000 --> 21:47.000
Yes?

21:47.000 --> 21:50.000
Do you know if other compilers do something similar?

21:56.000 --> 21:58.000
I don't know.

21:58.000 --> 22:00.000
Other compilers, e.g. GCC.

22:00.000 --> 22:05.000
Oh, the question is if other compilers do something similar.

22:05.000 --> 22:08.000
Not that I know of.

22:08.000 --> 22:16.000
All right.

22:16.000 --> 22:17.000
If there are no more questions...

22:17.000 --> 22:18.000
Thank you very much.

22:18.000 --> 22:27.000
Thank you.
