WEBVTT

00:00.000 --> 00:10.000
Well, in the back seat, you probably can't hear me.

00:10.000 --> 00:12.000
Anyway, I'm Roman.

00:12.000 --> 00:18.000
I'm going to talk about Project Lilipot.

00:18.000 --> 00:25.000
What is about a little bit of the history and what we're trying to do?

00:25.000 --> 00:33.000
It's mostly about reducing the memory footprint of Java applications.

00:33.000 --> 00:50.000
And specifically, it's about what we're trying to do is by reducing the size of object headers in the Java VM from currently 12 bytes down to 4 bytes.

00:51.000 --> 01:03.000
As you probably know, in typical Java applications, you have like tens or hundreds of thousands of objects and objects tends to be relatively small in many workloads.

01:03.000 --> 01:13.000
So this adds up to typically about 20 to 30 percent reduction in heap usage.

01:13.000 --> 01:23.000
Your mileage may vary. If you do the math, currently the minimum size of an object in the hotspot VM is 16 bytes.

01:23.000 --> 01:30.000
If we reduce that to like 4 bytes, this doesn't really happen because we have object alignment of 8 bytes.

01:30.000 --> 01:39.000
So the minimum object size would be 8 bytes with that. So the maximum savings that you can get out of this is like 50 percent.

01:39.000 --> 01:47.000
The minimum savings would be close to zero if we haven't heap full of really huge arrays or something like this, then you would have no savings.

01:47.000 --> 01:55.000
Most workloads tend to be in the middle of this around 20 to 30 percent heap reductions.

01:55.000 --> 02:08.000
And we can see that in, I run a, it's a well known benchmark that simulates a web shop, hey, simulates Amazon.

02:08.000 --> 02:16.000
And I run a steady workload on it and measure the heap size after periodic full cheese.

02:16.000 --> 02:27.000
And with the current JVM, we get around like almost 1,200 megabytes of heap usage after GC.

02:27.000 --> 02:37.000
With little put, we get much less is around 900. It's about 22 percent savings and heap usage.

02:37.000 --> 02:47.000
And saving heap memory is the primary effect of this. There's also secondary effects because this lowers the pressure on GC.

02:47.000 --> 02:55.000
This improves cash locality and so on. So this often also results in improved performance.

02:55.000 --> 03:06.000
In this same workload, we have this benchmark produces latency measure and a throughput measure.

03:06.000 --> 03:12.000
So the left two bars are the throughput measure.

03:12.000 --> 03:24.000
With baseline current JVM and little put. And this means that we fully get about 8 percent improved performance in this particular workload.

03:24.000 --> 03:31.000
The right-hand bars, the two bars on the right-hand side, this is the latency measure, the critical jobs.

03:31.000 --> 03:38.000
This improved even more by about 11 to 12 percent.

03:38.000 --> 03:46.000
So how did that start? So this is a bit of a rundown of the history of linear put.

03:46.000 --> 03:57.000
Now that all back in 2020, I believe at first them, I was just talking about a new big new change that I implemented in the Shenandoah GC.

03:57.000 --> 04:07.000
Back then, the Shenandoah GC used to require an extra word per object to store the forwarding pointer.

04:07.000 --> 04:16.000
And at this first them, I talked about a big change where I removed this requirement. I removed this whole word per object requirement there.

04:16.000 --> 04:26.000
And I had a chat with Andrew Haley in a taxi I believe. And he said, that's great, but could we do this for all it?

04:26.000 --> 04:38.000
Could we do this for the whole one-time? Because we used like 16 bytes of memory for each object header, but other JVM's like GCJ, they only use alt bytes.

04:38.000 --> 04:44.000
So why don't we do this in the notebook JVM? And well, how hard can it be?

04:44.000 --> 04:53.000
Right? It's good that nobody tells you how hard this could be. You would never start projects like this if you know how hard this would be in the end.

04:54.000 --> 05:09.000
So this is how it started in at first them five years ago. It took another year to actually start it because I had other things to do and more work to do on Shenandoah, so the first commit happened in March 2021.

05:09.000 --> 05:22.000
We did lots of prototyping, Thomas Stufe started to work on class pointer improvements. He will give a talk later about those improvements and how we did this for Lillipot.

05:22.000 --> 05:38.000
So I don't want to talk about this here in 2022. We realized after a bit of prototyping that we that the way that synchronization works in hotspot JVM doesn't really fit what we want to do in Lillipot.

05:38.000 --> 05:45.000
That is because synchronization, so this is what happens when you do a synchronized statement and try to need to synchronize up.

05:45.000 --> 05:57.000
This was very racy on the object header because it's changing lock, it's changing the lock bits in the object header concurrently, basically with the running application.

05:57.000 --> 06:09.000
And this didn't really work well with Lillipot. So we decided to change this the way the locking works in the hotspot JVM.

06:09.000 --> 06:14.000
And this is again the case of how hard can it be.

06:14.000 --> 06:26.000
As a kind of funny side story, there used to be a file in this code where it says as the top do not ever talk unless you really know what you're doing.

06:26.000 --> 06:33.000
And even then don't touch it. And we basically ripped it out and replaced it with something new.

06:33.000 --> 06:43.000
So this took a whole year to the prototype was quite quick at one month of work to get something working.

06:43.000 --> 06:51.000
And then it took a whole year to get the PR reviewed and reviewed again and change things and file another PR.

06:51.000 --> 07:00.000
And it was probably like 600 review commands and a lot of work on the side of Oracle and others to test it and review it.

07:00.000 --> 07:06.000
And we finally finished it in May 23 and it went into JDK 21.

07:06.000 --> 07:18.000
Even then it wasn't really finished because we also realized that the first implementation didn't have to put recursive locking and we realized that some workloads.

07:18.000 --> 07:27.000
This is quite important for performance. So from Oracle implemented recursive locking in this new stack.

07:27.000 --> 07:36.000
This went in JDK 23 and this new lightweight locking became the default locking mode in JDK 23.

07:36.000 --> 07:38.000
So this was good progress.

07:38.000 --> 07:44.000
Even then it was not really finished because there was this was only the lightweight locking side.

07:44.000 --> 07:48.000
And there's also the kind of heavyweight locking that uses monitors.

07:48.000 --> 07:59.000
And this still uses the object header to, yeah, for some purposes and that kind of work, but we wanted that to go away.

07:59.000 --> 08:01.000
So,

08:01.000 --> 08:08.000
Excellent and other thought that Oracle implemented the new object monitor table and this went in August 2024.

08:08.000 --> 08:18.000
So, 2023, what happened? We finished lightweight locking. We first proposed the version of the jet 450 for project lily port.

08:18.000 --> 08:20.000
This is what we call lily port 1.

08:20.000 --> 08:35.000
In 2024, the recursive lightweight locking and object monitor table landed and we finally integrated the first jet for this project into JDK 24, which is about to ship in March.

08:35.000 --> 08:44.000
In 2025, this is where we are now. We are prototyping what we call lily port 2, which is the actual four byte errors.

08:44.000 --> 08:56.000
And, yeah, later this year we get the EA of JDK 24, where we ship the lily port 1 basically. So, what's in it?

08:56.000 --> 09:11.000
To the current state of object layout in hotspot looks like more or less like this, though we have a header at the top, which is, so each of those rectangles here is one word, which means eight bytes, right?

09:11.000 --> 09:20.000
So, the first word of each object is what we call the object header or Mark word for some.

09:20.000 --> 09:30.000
Then we have a class pointer, which identifies, which kind of class the object has, it's useful for things like virtual cores etc.

09:30.000 --> 09:40.000
And then starts to field or the array element. For arrays, we also store the array length in this 32 bit block there. And then we have the array elements.

09:41.000 --> 09:53.000
We want to, so for lily port 1, this is what is going into JDK 24. We improve this to only take eight bytes per object. So, we merge the class pointer into the object header.

09:53.000 --> 10:03.000
So, we only have the single header here. For arrays, we still have the array length, but it shifts up a little bit and makes some free space, which we can use for the array elements.

10:03.000 --> 10:12.000
And the long-term plan is to reduce that to long-term means like, a couple of more releases.

10:12.000 --> 10:24.000
To reduce the header to just four bytes, make some more room there for actual fields or to store the array length on the other side.

10:25.000 --> 10:45.000
Let's look at the detail. So, what is really in this object header. So, this one, the, the other sketch, schema here is showing what is in those bits in the object header. So, we have two log bits here, which is used for locking is the name implies there's an unused bit here.

10:45.000 --> 11:02.000
There's GCH. This is where we track the edge of an object. This is four bytes. Then we have another unused bit here. We have 31 bits that is used for the identity hash code. And then we have some more unused bits here.

11:02.000 --> 11:10.000
And in the second word, I showed you in the slide earlier, we have a compressed class pointer which takes 32 bits.

11:10.000 --> 11:19.000
The one of the main insights of project Lily put is that most objects never get identity hash code.

11:19.000 --> 11:26.000
And most objects never get locked. So, this is really a tiny percentage of objects that are usually locked at all.

11:26.000 --> 11:35.000
So, we try not to push, push the burden for these features on every object. And instead, let only those objects that really need it pay for it.

11:35.000 --> 11:47.000
There's another little twist when the object is locked or when the object is GCH forward, which I will get to later.

11:47.000 --> 12:02.000
This whole header is replaced by a pointer into native memory, which basically points to the log object or the stack location where the, the, the monitor is.

12:03.000 --> 12:09.000
And this is the reason why we, we couldn't deal with lightweight locking so, so well.

12:09.000 --> 12:16.000
So, for Lily put one, we wanted to change this and you, you have seen all those unused bits that we make use of those unused bits.

12:16.000 --> 12:22.000
We, we packed this a little bit better. We still have the log bits. We have a bit that, in the same forwarding, which is such a detail.

12:22.000 --> 12:32.000
Still, we have the object H here. We reserve four bits for the project Valhalla because those guys told us, well, we probably going to meet those four bits.

12:32.000 --> 12:40.000
So, we have those bits here. We still have the hash code bits here. And we have a compressed class pointer in the upper 22 bits.

12:41.000 --> 12:52.000
For Lily put to the plan is or the current implementation is that we packed this even better. We only want to have two log bits still the self forwarding with those bits are all the same.

12:52.000 --> 13:02.000
We only have two bits to support the identity hash code. And then we have an even smaller compressed class pointer up there and only nineteen bits.

13:03.000 --> 13:12.000
So, how are we doing this? Thomas is going to talk about the class pointer bits. I want to talk a little bit about the identity hash code.

13:12.000 --> 13:19.000
So, how do we get from 31 bits for the identity hash code down to just two bits for the identity hash code?

13:19.000 --> 13:35.000
Just as a quick recap what this is doing. So, the identity hash code the contract is if two objects are the same object, then the identity hash code for one and the other reference must also be the same.

13:35.000 --> 13:50.000
And this is about object identity not about object equals. So, this is an important distinction. But identity hash code is used as a default implementation for object equals. So, this is used more often than you probably think.

13:50.000 --> 13:59.000
The inverse is not necessarily true. So, if the identity hash code is different doesn't mean that the objects must.

13:59.000 --> 14:12.000
Could be sorry, the inverse is not true. Anyway, the identity hash code must remain stable throughout the object lifetime which is an important property must not change.

14:12.000 --> 14:19.000
So, I want to shoot for great entropy. And yeah, as I said, the default implementation for object identity hash code.

14:19.000 --> 14:30.000
So, the current situation is how it's implemented as we have 31 bits in every object that is reserved for storing the object, the identity hash code.

14:30.000 --> 14:47.000
The system identity hash code generates basically a random number and stores it there. And whenever this identity hash code is called again, we read that number that we just stored there and then return it.

14:47.000 --> 14:57.000
How can we avoid using those 31 bits? So, the first knife idea is to always calculate a hash code that is based on the object address.

14:57.000 --> 15:10.000
The problem with that is that GCs can move the object around the address is changing. This is this violets one of the important properties that the identity hash code must stay the same.

15:11.000 --> 15:19.000
The other idea allocate a slot for the identity hash code on demand only when we need it.

15:19.000 --> 15:26.000
The problem is that usually the heap is filled consequently which means there is no space to do this.

15:26.000 --> 15:32.000
So, how can we solve this? We basically combine the two ideas.

15:33.000 --> 15:47.000
I get to this in the second. I first wanted to describe what those bits are doing. So, we only have two bits now and the meaning of those bits is the first bits says the object has been hashed or not hashed.

15:47.000 --> 15:56.000
In the second bits says the object is expanded or not expanded. And from there we basically have three states that we are using.

15:56.000 --> 16:06.000
We are going from the first state. The object is not hashed and not expanded. This is the default when the object is freshly allocated.

16:06.000 --> 16:20.000
The first time we call system identity hash code this state transitions to the second one the hash not expanded state which means we calculate the identity hash code based on the address.

16:20.000 --> 16:24.000
As long as the object does not move we can keep doing this.

16:24.000 --> 16:33.000
So, if we subsequent calls to identity hash code will recalculate the identity hash code based on the object address.

16:33.000 --> 16:47.000
So, what happens as soon as the gc moves the object around and it observes an object that is in this particular state.

16:47.000 --> 16:55.000
Recalculate the identity hash code once again based on the original object address. Then it copies the object.

16:55.000 --> 17:02.000
Allocates and x was lot for what we are doing this allocates and x was lot for storing the identity hash code.

17:02.000 --> 17:10.000
Sticks the identity hash code there and then every subsequent call for identity hash code will return the hash code that we just stored there.

17:10.000 --> 17:19.000
And we move the state to this hash and exponent states. So, we know what we are doing.

17:19.000 --> 17:25.000
And as soon as we are on this state every subsequent call will simply read this identity hash code slot.

17:25.000 --> 17:37.000
This where do we store this? We can, so the, if the field layout of the hotspot class loader can figure out a good place to do it.

17:37.000 --> 17:41.000
Sometimes there are gaps in the field layout and we can use those gaps.

17:41.000 --> 17:46.000
Sometimes there is an alignment gap at the end of an object. We can use the alignment gap.

17:46.000 --> 17:54.000
And if there is no gaps available, then we attach it at the end of the object and allocate an x for slot basically.

17:54.000 --> 18:03.000
So, often we don't even need to extend the object for real. We can reuse some free space at this there.

18:03.000 --> 18:08.000
So, this is the result of this is the only required to bits per object.

18:08.000 --> 18:12.000
And we allocate the space only when we really need it.

18:12.000 --> 18:21.000
As I said, majority of objects never need any identity hash code, so we save a lot of bits here.

18:21.000 --> 18:30.000
Question is, do I have enough time? I will quickly go over the other one.

18:31.000 --> 18:38.000
Need to make this quick. She see forwarding. Current situation is when it she see needs to copy an object.

18:38.000 --> 18:41.000
Then we need to store the new location somewhere.

18:41.000 --> 18:47.000
And this happens by usually by storing in the header itself.

18:47.000 --> 18:58.000
We flip those bits here to say one one, which means there's forwarding pointer and all the rest of the header would be interpreted as a pointer into the new location of the object.

18:59.000 --> 19:01.000
There are two cases.

19:01.000 --> 19:06.000
One is copying compaction, the other one is sliding compaction.

19:06.000 --> 19:14.000
Copying compaction is relatively easy because we copied the object to a fresh region that is basically empty.

19:14.000 --> 19:19.000
And we retain the old copy as long as we need it for updating the references.

19:19.000 --> 19:31.000
Which means we don't, the problem is that when we store the forwarding pointer here, it all writes the class pointer information that we have here.

19:31.000 --> 19:35.000
And this class pointer information is crucial for operation of GC.

19:35.000 --> 19:41.000
So it's still quite easy because we just copied the object and we can find the class pointer here.

19:41.000 --> 19:47.000
And this is relatively easy to solve and need to be careful, but it's not that too big deal.

19:47.000 --> 19:53.000
So this is how it works. I skipped this because I'm out of time.

19:53.000 --> 20:00.000
As I said, it's relatively easy to implement because we have a copy of the object and there's no problem with reading the class pointer from the new copy.

20:00.000 --> 20:08.000
Sliding compaction is a bit more tricky because what's sliding compaction does it doesn't copy the object to a new empty space.

20:08.000 --> 20:12.000
It basically, if you imagine the heap, it's a big stack of objects.

20:12.000 --> 20:20.000
It throws out all the unused dead object here and it slides all the life objects to the bottom of the heap.

20:20.000 --> 20:26.000
And the way it does is it first finds all the life objects in the marking phase.

20:26.000 --> 20:31.000
In the second phase, it calculates the forwarding addresses for each of the objects.

20:31.000 --> 20:37.000
And it stores those addresses in the object headers, at which point we would lose the class pointer information.

20:37.000 --> 20:43.000
And then updates or the references and then copies the objects to the new location, basically sliding them down.

20:43.000 --> 20:49.000
And we must not lose the class pointer information. This is the really difficult part here.

20:49.000 --> 20:54.000
But we also don't have a copy from which we could read this class pointer information.

20:54.000 --> 21:01.000
And we only have nine bits here for actually storing the forwarding pointer information.

21:01.000 --> 21:20.000
So what we can do, the idea is that those nine bits in the object header that we have is enough to address a 512 words of a range in the heap, which is kind of 248 bytes or 4 kilobytes even.

21:20.000 --> 21:26.000
So the idea is let's divide the heap into blocks of 412 words each.

21:26.000 --> 21:31.000
And then use a side table with one entry for each block.

21:31.000 --> 21:39.000
And each entry basically stores the forwarding address for objects in that block.

21:39.000 --> 21:45.000
And then we can use those nine bits in the header to find the forwarding address for the object.

21:45.000 --> 21:52.000
So the basically means look at the block in which this object resides, find the forwarding base of that block.

21:52.000 --> 21:57.000
At the bits that we have shifted by three and then you get the forwarding address.

21:57.000 --> 22:05.000
This is a kind of a forwarding table, but it has the advantage that it has a fixed size.

22:05.000 --> 22:10.000
It's basically a fixed size 1, 512 of the heap size.

22:10.000 --> 22:16.000
It's not really that big. We have basically no performance loss using this approach.

22:16.000 --> 22:30.000
We have a potential to reduce the size even more by using the bits efficiently or by extending the block size by one bit or so.

22:30.000 --> 22:34.000
And eliminates the eight terror by the limit of a little put one.

22:34.000 --> 22:38.000
So this is the advantage. I need to make it short because I'm running out of time.

22:38.000 --> 22:45.000
And putting it all together we have tiny class pointers which is going to talk about the compact identity hash code improved forwarding table.

22:45.000 --> 22:52.000
And we still have four bits basically currently unused, but we'll be used by Wahala.

22:52.000 --> 22:58.000
Which is great. And the best thing about all of this is that you can basically forget it because you don't need to remember all this.

22:58.000 --> 23:05.000
You only need to enable this feature by using this flag and you will get memory savings for free.

23:05.000 --> 23:13.000
In the future we intend to make this hopefully make this the default in one and only had a layout in hotspot JVM.

23:13.000 --> 23:18.000
And that's it. Do we have time for questions?

23:18.000 --> 23:31.000
Question.

23:31.000 --> 23:47.000
So the question was if all objects using identity hash code do we use more memory or what was.

23:47.000 --> 23:49.000
The performance.

23:49.000 --> 23:54.000
Yeah. Okay.

23:54.000 --> 23:59.000
I did some experiments. I could not see any performance degradation.

23:59.000 --> 24:06.000
I did experiments similar to what you are describing.

24:06.000 --> 24:09.000
We haven't done the full experiment yet.

24:09.000 --> 24:16.000
So I expected as long as the object doesn't move, we are recalculating the object identity hash code.

24:16.000 --> 24:25.000
And we believe that this would be expected to be much even a bit faster than reading the identity hash code for memory.

24:25.000 --> 24:31.000
Because memory access can depending on if it's in the L1 cache or if it's in the RAM, right?

24:31.000 --> 24:37.000
You can have a lot of cycles just to get this stuff into the L1 cache.

24:37.000 --> 24:42.000
The identity hash code calculation is really designed to be very fast.

24:42.000 --> 24:45.000
So this should not take many cycles. Yes?

24:45.000 --> 25:06.000
The question was if we have gaps in the layout and we have smaller gaps,

25:06.000 --> 25:10.000
smaller than 30 to bits, can we split the identity hash code?

25:10.000 --> 25:16.000
No, we currently don't do this.

25:28.000 --> 25:30.000
Yeah. I guess it would be possible.

25:30.000 --> 25:37.000
So I guess adding the complexity of making this decision is probably more expensive than I think.

25:37.000 --> 25:38.000
Yeah.

25:38.000 --> 25:41.000
So I've got a single question here.

25:41.000 --> 25:44.000
So if we have this, we're right there.

25:44.000 --> 25:45.000
Yeah.

25:45.000 --> 25:46.000
Yeah.

25:51.000 --> 25:54.000
There could be a field, like an infield or something.

25:55.000 --> 25:59.000
No.

25:59.000 --> 26:01.000
Typically four bytes.

26:01.000 --> 26:03.000
Yeah.

26:03.000 --> 26:09.000
Unless you have four byte references, so this typically fits in there.

26:09.000 --> 26:14.000
There might be a gap there, right there.

26:14.000 --> 26:16.000
Exactly.

26:16.000 --> 26:17.000
Yeah.

26:17.000 --> 26:19.000
Correct.

26:19.000 --> 26:29.000
Okay.

26:29.000 --> 26:31.000
Any more questions?

26:31.000 --> 26:32.000
No.

26:32.000 --> 26:33.000
Thank you.

26:33.000 --> 26:35.000
Thank you.