WEBVTT

00:00.000 --> 00:25.000
Next up is a talk giving an update on incremental LTO support in GCC.

00:25.000 --> 00:33.000
So, to start from the top: the main patches for incremental LTO have finally been

00:33.000 --> 00:38.000
upstreamed into GCC, so it will be in the next version. So what does that actually mean?

00:38.000 --> 00:44.000
First, let's start with a basic explanation of LTO.

00:44.000 --> 00:52.000
Standard compilation is done in parallel: we just compile the individual sources directly into object files

00:52.000 --> 00:54.000
and then link them together.

00:54.000 --> 00:59.000
It's nice, simple, fast.

00:59.000 --> 01:05.000
The problem with that is that if we have some function foo in source one, for example,

01:05.000 --> 01:11.000
source two doesn't know anything about the contents of foo, so we cannot

01:11.000 --> 01:14.000
inline it or do optimizations based on that.

01:14.000 --> 01:21.000
So that's why we do link-time optimization, which adds another part.

01:21.000 --> 01:33.000
Inside the linking process, the linker calls our plugin. The first part

01:33.000 --> 01:40.000
is similar to the initial compilation, except that we don't produce the instructions;

01:40.000 --> 01:49.000
instead we produce the intermediate representation, GCC's internal form,

01:49.000 --> 01:55.000
which then goes into the whole-program analysis. That doesn't do the optimizations themselves,

01:55.000 --> 02:05.000
but it makes the decisions and creates the partitions on which the work will be done in the local transformations.

02:06.000 --> 02:12.000
In this example, the function foo is propagated, and the whole-program analysis decides that

02:12.000 --> 02:25.000
partition one will create the instructions for foo, and partition two will be given a copy

02:25.000 --> 02:30.000
of the function foo, so it can be, for example, inlined into something else.

02:30.000 --> 02:40.000
This is nice: it creates faster and smaller binaries. But the major problem is that it's slow.

02:40.000 --> 02:47.000
For example, when I compile on my machine, the whole-program analysis, which is single-threaded,

02:47.000 --> 02:55.000
takes forty-four seconds, and the local transformation part took 240

02:55.000 --> 03:03.000
seconds on 16 threads. So this is a problem, especially if we are making some small changes,

03:03.000 --> 03:11.000
for example, if we wanted to change foo. Normally, using makefiles,

03:11.000 --> 03:22.000
we would recompile just that one file and reuse the previous results for the other files,

03:23.000 --> 03:32.000
but here we cannot. The compilation itself is fast, just several seconds,

03:32.000 --> 03:43.000
but the whole-program analysis is problematic because it can influence any partition that follows.

03:44.000 --> 03:51.000
So we cannot just do something naive based on timestamps.

03:51.000 --> 03:59.000
So what incremental LTO does is add a cache for the local transformations.

03:59.000 --> 04:09.000
So it compares the produced partitions, and if it finds a partition that is already in our cache,

04:09.000 --> 04:17.000
it just returns the previous result, and we save time instead of doing the hard part.
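
NOTE
Editor's sketch, not GCC's implementation: the caching idea described above, as a toy Python model.
Assumptions: the cache key is a SHA-256 hash of the partition bytes, the cache is an in-memory dict
rather than a directory, and PartitionCache and toy_ltrans are hypothetical names.
```python
import hashlib
from typing import Callable, Dict
#
class PartitionCache:
    """Hypothetical in-memory stand-in for the on-disk cache directory."""
    def __init__(self) -> None:
        self.entries: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0
    def compile(self, partition_ir: bytes, ltrans: Callable[[bytes], bytes]) -> bytes:
        # Assumption: key the cache on the exact bytes of the partition.
        key = hashlib.sha256(partition_ir).hexdigest()
        cached = self.entries.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: do the expensive local transformation and remember the result.
        self.misses += 1
        result = ltrans(partition_ir)
        self.entries[key] = result
        return result
#
def toy_ltrans(ir: bytes) -> bytes:
    # Toy "local transformation": any deterministic function of the input will do.
    return ir.upper()
#
cache = PartitionCache()
first = cache.compile(b"partition one: foo", toy_ltrans)
again = cache.compile(b"partition one: foo", toy_ltrans)   # identical bytes: cache hit
other = cache.compile(b"partition one: foo!", toy_ltrans)  # one byte differs: cache miss
```
A single changed byte changes the key, which is why the rest of the talk is about keeping partitions identical.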

04:17.000 --> 04:25.000
This is a nice idea, but it's useless if we don't actually produce the same partitions twice.

04:25.000 --> 04:33.000
And if part of foo is always propagated into all partitions

04:33.000 --> 04:45.000
and we change, for example, even just one byte, then a naive comparison between those partitions will not match.

04:45.000 --> 05:00.000
So the major part of the incremental LTO project was figuring out where the partitions diverge and how we can avoid it.

05:00.000 --> 05:09.000
There are two major categories. The first one is what I call global counters.

05:09.000 --> 05:22.000
I'll give the example of line numbers, which essentially count how many lines there are before a given line.

05:22.000 --> 05:36.000
So the problematic part is that if we add a new line at the start of the file, the number of every single line after it is incremented by one.

05:36.000 --> 05:44.000
So even if the function foo is identical, the line numbers are not.

05:45.000 --> 06:00.000
If we propagate the line numbers, for example for warnings and errors, then we don't have exactly the same partitions, and we cannot just find them in the cache.

06:00.000 --> 06:15.000
So specifically, line numbers are good as an example, because they are easy to understand, but problematic to solve, because these line numbers actually have meaning to the user.

06:15.000 --> 06:29.000
So we cannot just change them. But in other instances of these counters, we don't actually care about the specific numbers,

06:29.000 --> 06:37.000
but we care just about some of their properties. So for example, whether they are unique or whether they have some specific ordering.

06:37.000 --> 06:49.000
So we can replace them with something that has those properties but does not depend on the exact numbers that come before it.
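
NOTE
Editor's sketch, not GCC code: why a global counter diverges and what a replacement can look like.
Assumption: a truncated SHA-256 of the symbol name stands in for "something unique that does not
depend on what came before"; global_counter_ids and stable_ids are hypothetical helper names.
```python
import hashlib
from typing import Dict, List
#
def global_counter_ids(names: List[str]) -> Dict[str, int]:
    # Each ID depends on how many symbols were numbered before this one.
    return {name: index for index, name in enumerate(names)}
#
def stable_ids(names: List[str]) -> Dict[str, str]:
    # Each ID depends only on the symbol itself (unique barring hash collisions).
    return {name: hashlib.sha256(name.encode()).hexdigest()[:12] for name in names}
#
before = ["foo", "bar"]
after = ["helper", "foo", "bar"]  # one new symbol added in front
# The counter-based ID of foo shifts even though foo itself did not change.
counter_changed = global_counter_ids(before)["foo"] != global_counter_ids(after)["foo"]
# The content-derived ID of foo stays the same.
stable_changed = stable_ids(before)["foo"] != stable_ids(after)["foo"]
```
This only covers the uniqueness property; a counter that must also preserve some ordering needs a different replacement.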

06:49.000 --> 06:58.000
The second category is that we often propagate information that is actually unused in the given partition.

06:58.000 --> 07:06.000
For example, in this case, we have function foo that we call from two places with arguments two and three.

07:06.000 --> 07:25.000
Interprocedural constant propagation realizes that the function is called just with those two arguments and attaches that information to the function, so that it can be better optimised with this knowledge.

07:25.000 --> 07:43.000
But if we create a copy of foo that is intended to be inlined, we already know, or at least can find, better limits on what those arguments can actually be at the place it is inlined into.

07:43.000 --> 07:58.000
So that information is actually unused in the second partition, where foo will be inlined, so in those cases we can just delete it because we don't use it anyway.

07:58.000 --> 08:20.000
The most problematic part was the case where we have some debug function that is used in many places and takes the line number as an argument. The line number is again problematic, because the function is used almost everywhere, so it's inlined almost everywhere.

08:20.000 --> 08:32.000
So if we change just one line before some of those debug function calls, it's propagated into almost every partition, and it was one of the worst cases.

08:32.000 --> 08:55.000
So those are the categories. Now, how does it look currently in GCC 15? This is my benchmark of sorts: compiling the compiler.

08:55.000 --> 09:08.000
These are all real patches from trunk, so they are what I use as representatives of a small change.

09:08.000 --> 09:23.000
I have two values, with debug info and without debug info. I originally hoped that I would manage to upstream a patch for debug info, but it will probably be in the next version.

09:23.000 --> 09:50.000
How to read this graph: for example, in this first debug info example, we recompile forty-five partitions out of the 128, so we do roughly only one third of the work that we would otherwise have to do.

09:50.000 --> 10:08.000
So this is the time saved out of all the time that we would otherwise spend in the local transformations.

10:08.000 --> 10:20.000
In future, most of the difference between with debug info and without debug info shouldn't be there.

10:20.000 --> 10:40.000
I already have some patches there, but they have some problems: they add some additional information that we later have to delete in the linker, so we have to add some linker support to remove that extra information.

10:40.000 --> 10:54.000
I've identified most of the divergences, so in future versions it will probably be better, but I cannot predict that.

10:54.000 --> 11:12.000
So what are the relevant flags? The main flag that you need to use is -flto-incremental=, which points to the directory in which the cache will be stored.

11:12.000 --> 11:30.000
The directory must already exist. It uses file locks, so you can use it in parallel in your existing build system; it doesn't matter that multiple GCC instances will access the cache.

11:30.000 --> 11:57.000
The other two flags are a bit of an optimization. -flto-incremental results in an identical binary whether it is used or not, but the other two flags will partition the symbols differently.

11:57.000 --> 12:25.000
-flto-partition=cache should keep symbols from one partition together more, so it's beneficial for the case with debug info. I haven't measured that much difference without debug info.

12:25.000 --> 12:34.000
So it's good for now, but I'm sure it will be necessary in the future.

12:34.000 --> 12:43.000
The second parameter is the number of partitions. If you have a large project, it might be better to increase the number of partitions.

12:43.000 --> 13:09.000
First, you can parallelize it better. Second, if there is one partition that has one divergence and you split into 256, it will be two partitions and the divergence will be in only one of those, so there will also be less work.
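
NOTE
Editor's sketch: one way to assemble the options mentioned in the talk for a link step.
Assumptions: the option spellings follow the GCC 15 documentation as the editor understands it
(check your compiler's manual); incremental_lto_args is a hypothetical helper, and 256 is only
the illustrative partition count from the talk.
```python
import tempfile
from pathlib import Path
from typing import List
#
def incremental_lto_args(cache_dir: Path, partitions: int = 256) -> List[str]:
    # The talk says the cache directory must already exist, so create it first.
    cache_dir.mkdir(parents=True, exist_ok=True)
    return [
        "-flto",
        f"-flto-incremental={cache_dir}",        # where the cache is stored
        "-flto-partition=cache",                 # partitioning meant to keep partitions stable
        f"--param=lto-partitions={partitions}",  # more partitions: smaller units to redo
    ]
#
# Usage: build the argument list against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    args = incremental_lto_args(Path(tmp) / "lto-cache", partitions=256)
```
In a real build the directory would be persistent and per project, for the reason given later in the Q&A.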

13:09.000 --> 13:24.000
So thank you for your attention. Are there any questions?

13:24.000 --> 13:35.000
Okay, yes. When you produce the .o files with link-time optimization, is it a totally different format, or is it just extra information?

13:35.000 --> 13:45.000
It's still an ELF file, but it doesn't actually have the contents; it just contains the symbols.

13:45.000 --> 14:07.000
It is possible for them to contain both the intermediate representation and the instructions, those are the fat LTO objects, but I don't think those are used much anymore.

14:07.000 --> 14:16.000
Do you think there would be any possibility in the future of this being the default when using LTO?

14:16.000 --> 14:24.000
Well, the question was whether incremental LTO can be the default in future.

14:24.000 --> 14:28.000
The problem is that you have to specify where the cache actually is.

14:29.000 --> 14:55.000
You could, but if you use the cache from multiple projects, there's a limit on how big the cache can be, so as not to waste disk space. If you use it from multiple projects, you will just replace the cache from one project with another, and that will not be useful.

14:56.000 --> 15:12.000
I have a question. With normal linking, when you don't use LTO, there are some programs that need to patch object files before they get linked.

15:12.000 --> 15:39.000
[unclear], for example, does that. So with LTO, some people are starting to think about using a linker plugin, the default linker plugin that the linker supports, to actually intercept the objects resulting from the LTO linker plugin before the linker actually links them.

15:39.000 --> 15:46.000
That's the only way, right? So with this new scheme, would that still be possible, or not anymore?

15:46.000 --> 15:57.000
So the question is whether we can still intercept the results of this caching before they are linked together.

15:57.000 --> 16:08.000
No, my question is: with normal LTO, there is a point in the linking process where the linker actually has all the object files, which have been processed by the LTO plugin.

16:08.000 --> 16:16.000
Yeah, I think I understand it, but I don't know how to explain it in a second.

16:16.000 --> 16:23.000
So from the point of view of the linker, it shouldn't be any different.

16:23.000 --> 16:29.000
Okay, so even if it's incremental at some point, the linker in one single run will get all the objects.

16:29.000 --> 16:41.000
I mean, the linker gets the objects, gives them to our lto-wrapper, and then gets back other object files.

16:41.000 --> 16:49.000
And the cache is implemented inside lto-wrapper. So from the point of view of the linker, nothing is different.

16:49.000 --> 16:53.000
The linker is not incremental, just the LTO compilation.

16:53.000 --> 16:57.000
Okay, what actually is a partition?

16:57.000 --> 16:59.000
Very quickly.

16:59.000 --> 17:16.000
Essentially, we want to parallelize our work, so we separate it into multiple parts, which are similar to standard compilation units,

17:17.000 --> 17:24.000
because otherwise we would have to do all the optimizations single-threaded, which would be slow.

17:24.000 --> 17:30.000
We also have that option, but I don't think it's used anymore, it was the original LTO.

17:30.000 --> 17:33.000
Is it specific to incremental LTO? It didn't exist before?

17:33.000 --> 17:43.000
No, it exists already in standard LTO, because we use the partitions so we can parallelize over them.

17:46.000 --> 17:50.000
I'm sorry, I'm sorry.

17:50.000 --> 17:54.000
We are out of time.

17:54.000 --> 17:56.000
Thank you.

17:56.000 --> 17:58.000
Thank you.