WEBVTT

00:00.000 --> 00:13.400
So, good evening everyone. I am Ashutosh Mehra and I am a principal software engineer

00:13.400 --> 00:16.800
with Red Hat. Today I am going to talk about Project

00:16.800 --> 00:21.800
Leyden, and we will look at what we have done so far in this project and what are some

00:21.800 --> 00:25.080
of the things that are coming up in this project.

00:25.080 --> 00:33.240
So, I have been involved with Project Leyden for more than a year. Let us do a quick recap

00:33.240 --> 00:38.040
of what this project is about. So, the goal of the Leyden project is to improve the startup

00:38.040 --> 00:43.280
time, warm-up time and footprint of your Java application, and it aims to achieve this

00:43.280 --> 00:48.920
by shifting computation. So, this is not a new idea, and the JVM has been using this technique

00:48.920 --> 00:56.200
in various ways to improve its performance. So, you can shift computation forward

00:56.200 --> 01:02.480
in time or backward in time. So, GC, you know, is an example of shifting

01:02.480 --> 01:07.280
computation forward in time: your object becomes dead at some point in time, but you delay

01:07.280 --> 01:10.320
collecting that memory.

01:10.320 --> 01:18.360
The CDS archive allows the JVM to pre-parse the class files and store them in a format which

01:18.440 --> 01:22.600
is easier for the JVM to consume. So, that is an example of shifting computation backward

01:22.600 --> 01:30.480
in time. So, instead of doing this processing at runtime, you are doing it at archive

01:30.480 --> 01:35.360
generation time. So far in this project, most of the

01:35.360 --> 01:41.360
work has been on improving the startup time and warm-up time. So, before we look

01:41.360 --> 01:48.120
at what we have done so far, let us understand why the JVM is slow to start.

01:48.120 --> 01:53.800
This is because the JVM tends to do things lazily; it performs

01:53.800 --> 02:01.280
computations on demand, as required. And the reason for this behavior is that it makes the Java

02:01.280 --> 02:06.080
platform highly dynamic, which is good for Java developers: they get flexibility and

02:06.080 --> 02:11.720
extensibility. And it also benefits the JVM in multiple ways. So, in case your application

02:11.720 --> 02:17.920
behavior changes at runtime, it can adapt to it quickly. It knows the exact hardware

02:18.040 --> 02:24.400
and CPU that it is running on, so it can exploit their capabilities. And obviously, it

02:24.400 --> 02:29.000
is doing only the work which is needed to run the business logic. So, there is no wasted

02:29.000 --> 02:33.520
effort here. But all of these benefits come at the expense of stretching your startup

02:33.520 --> 02:42.520
time. But is that how things have to be? If you look at it from the Java developer's

02:42.520 --> 02:48.440
point of view, once your application is written, most of the classes and their interconnections

02:48.440 --> 02:54.480
are established. And the JVM is repeating the activity of loading and linking those classes

02:54.480 --> 03:03.480
in every run, and they do not change from one run to another. So, herein lies

03:03.480 --> 03:10.080
the opportunity: we can shift some of the computation that the JVM is doing in every run

03:10.080 --> 03:17.360
and do it ahead of time. So, let us look at this in a bit more detail.

03:17.360 --> 03:25.680
Let us say you have a Hello World Java program and you want to execute it. You compile it using

03:25.680 --> 03:32.080
your Java compiler, which generates your class file; it is a binary file. And the class

03:32.080 --> 03:41.680
file not only describes the business logic of your class,

03:41.680 --> 03:45.840
but it also describes the connections that this class is going to have

03:45.840 --> 03:52.560
with other classes in the JDK. For instance, in this example of Hello World, it is using

03:52.560 --> 03:58.260
a JDK class called System. It is accessing its static variable called out, which

03:58.340 --> 04:02.580
happens to be of type PrintStream, and it is calling the println method in the PrintStream

04:02.580 --> 04:10.260
class, passing some constant string. And the main method, you can see there, is accepting arguments

04:10.260 --> 04:16.980
which are Java strings. So, as we write the Java code, we are also describing how this

04:16.980 --> 04:25.300
class depends on other classes in the system. And all of this information is represented symbolically

04:25.380 --> 04:33.860
in the class file. And when the JVM wants to execute the bytecodes, it needs to transform

04:33.860 --> 04:41.700
this symbolic information into concrete addresses in its address space. And this process is called

04:41.700 --> 04:52.500
resolution. As part of the resolution, if the JVM finds that the required class is not

04:52.580 --> 04:59.380
in its address space, it may load it from the disk. It would also have to link that class and

04:59.380 --> 05:05.780
do the initialization. And as part of that, it is going to run more bytecodes and do the resolution

05:05.780 --> 05:12.340
again. So, this cycle repeats itself. So, the phases of loading, linking and resolution

05:12.340 --> 05:18.980
feed into each other. And that results in creating a complex graph of classes.
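
NOTE
Editor's sketch, not taken from the speaker's slides: a Hello World class of the
kind described above, with comments marking which parts end up as symbolic
references in the class file. The greeting() helper is a hypothetical addition
so that the string can be checked without capturing standard output.
```java
public class HelloWorld {
    // Hypothetical helper; the literal below becomes a constant pool entry.
    static String greeting() {
        return "Hello, World";
    }
    public static void main(String[] args) { // args: an array of java.lang.String
        // System  : symbolic class reference to java.lang.System
        // out     : symbolic field reference, a static field of type java.io.PrintStream
        // println : symbolic method reference into java.io.PrintStream
        System.out.println(greeting());
    }
}
```
Running javap -c -v on the compiled class lists these constant pool entries.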

05:19.540 --> 05:27.540
And this process is repeated by the JVM on every run. So, let us understand this in terms of the

05:27.540 --> 05:34.020
metadata model that the JVM creates during these activities. When the JVM gets the class file

05:34.020 --> 05:38.820
from the class loader, it creates an internal representation, which is called InstanceKlass

05:38.820 --> 05:45.060
here. In fact, all the structures here are from the JVM point of view. So, the InstanceKlass

05:45.140 --> 05:50.900
is the central data structure here. And it has pointers to other structures, which represent

05:51.700 --> 05:57.460
different information in the class file, like your constant pool and the methods. So, if you are familiar

05:57.460 --> 06:04.260
with the CDS archive, which I just mentioned on my previous slide, this is the representation

06:04.260 --> 06:11.700
that gets captured in the CDS archive. And when the JVM uses the CDS archive, it gets this

06:12.020 --> 06:18.660
pre-connected graph of data structures. And that saves some of the startup cost.

06:21.060 --> 06:27.060
So, once the loading is done, we need to link this class: we need to establish connections

06:27.060 --> 06:32.740
with the other classes. And that happens on demand. So, when the interpreter starts executing the

06:32.820 --> 06:40.820
bytecodes, it comes across symbolic representations of classes, methods, fields

06:41.540 --> 06:49.060
and call sites. And then it asks the VM to do the resolution. As part of the resolution,

06:49.060 --> 06:54.420
the VM also stores the resolution information in certain data structures. For example, the

06:54.500 --> 07:01.940
constant pool cache and other structures hanging off it. And these structures act

07:01.940 --> 07:07.780
like a resolution cache for the interpreter. Once these structures are filled up, the interpreter

07:07.780 --> 07:15.620
doesn't need to ask the VM to do the resolution again. So, let us look at how these structures

07:15.620 --> 07:23.780
are filled up. Let us take an example of field resolution. So, suppose your program is accessing

07:23.860 --> 07:30.340
a couple of static fields, bar and foo. And this would be represented

07:30.340 --> 07:35.780
in your class file using the putstatic bytecode. And the operand to the putstatic bytecode is an

07:35.780 --> 07:42.820
index into the constant pool, representing the field to be accessed. Now, when the JVM

07:42.820 --> 07:48.820
loads this class, it rewrites the bytecodes in such a way that it replaces the constant pool

07:48.900 --> 07:55.940
index with an index into a structure called resolved field entries. So, before resolution, this is

07:55.940 --> 08:02.180
what the connections look like. So, you have a putstatic bytecode.

08:02.180 --> 08:08.820
It is referring to slots in the resolved field entries array. And you will notice

08:08.820 --> 08:16.580
that the resolved field entry structure is empty. The only information it has is the index

08:16.580 --> 08:22.020
into the constant pool, representing the field. So, when the interpreter starts executing the

08:22.020 --> 08:28.100
bytecodes, it comes across putstatic and it observes that the corresponding

08:28.100 --> 08:32.740
resolved field entry structure is empty. So, it asks the VM to do the resolution.

08:33.780 --> 08:39.620
So, the VM will refer to the constant pool entry. From it, it figures out

08:39.620 --> 08:44.900
the class that contains the field and the name of the

08:45.140 --> 08:51.620
field. And then it has to locate that class in its address space. If the

08:51.620 --> 08:57.780
class is not present, then it has to load it, link it and, you know, follow all the steps.

09:00.580 --> 09:07.940
In either case, it will locate the address of the InstanceKlass, and it will find the

09:07.940 --> 09:12.340
offset of the field in that class and put that information in the

09:12.340 --> 09:17.220
resolved field entry structure. And now it can jump back to the interpreter. The interpreter can

09:17.220 --> 09:22.980
refer to the resolved field entry structure. It has the required information to locate

09:22.980 --> 09:27.060
the address of the field in the address space, and it can access it and continue execution.
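
NOTE
Editor's sketch of the lazy field resolution just described. This is a toy model
in plain Java, not HotSpot code: FieldEntry, CONSTANT_POOL, FIELD_OFFSETS and the
offset value are all hypothetical stand-ins chosen for illustration.
```java
import java.util.HashMap;
import java.util.Map;
public class FieldResolutionSketch {
    // Stand-in for a resolved field entry: before resolution it only
    // knows the constant pool index of the field it refers to.
    static final class FieldEntry {
        final int constantPoolIndex;
        String holderClass;   // filled in by resolution
        int fieldOffset = -1; // filled in by resolution
        FieldEntry(int constantPoolIndex) { this.constantPoolIndex = constantPoolIndex; }
        boolean isResolved() { return holderClass != null; }
    }
    // Hypothetical constant pool: index to "Holder.field" symbolic reference.
    static final Map<Integer, String> CONSTANT_POOL = new HashMap<>();
    // Hypothetical layout of loaded classes: "Holder.field" to field offset.
    static final Map<String, Integer> FIELD_OFFSETS = new HashMap<>();
    static int slowPathCalls = 0;
    // The "VM" side: turn the symbolic reference into concrete information.
    static void resolve(FieldEntry entry) {
        slowPathCalls++;
        String symbol = CONSTANT_POOL.get(entry.constantPoolIndex);
        entry.holderClass = symbol.substring(0, symbol.indexOf('.'));
        entry.fieldOffset = FIELD_OFFSETS.get(symbol);
    }
    // The "interpreter" side: call into the VM only while the entry is empty.
    static int offsetFor(FieldEntry entry) {
        if (!entry.isResolved()) {
            resolve(entry);
        }
        return entry.fieldOffset;
    }
    public static void main(String[] args) {
        CONSTANT_POOL.put(7, "Holder.bar");
        FIELD_OFFSETS.put("Holder.bar", 112);
        FieldEntry entry = new FieldEntry(7);
        System.out.println(offsetFor(entry) + " " + offsetFor(entry) + " slow-path calls: " + slowPathCalls);
    }
}
```
The second lookup is served from the filled entry, which is the role the talk
assigns to the resolution cache.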

09:29.300 --> 09:34.740
Here is another example. This time, we are trying to load a string constant.

09:35.540 --> 09:41.700
This is represented in the bytecode as the ldc bytecode, and again the operand to the ldc bytecode

09:41.860 --> 09:49.860
is an index into the constant pool, referring to the string. These ldc bytecodes will then

09:49.860 --> 09:56.100
be rewritten by the JVM at runtime. And it will replace the constant pool index with

09:56.100 --> 10:05.140
an index into the resolved references. So, resolved references is nothing but an array of

10:05.140 --> 10:13.700
java.lang.Object. And as before, when the interpreter comes across an ldc bytecode, it fetches

10:13.700 --> 10:20.020
the corresponding object from the resolved references array. It finds that it

10:20.020 --> 10:26.180
is null. That means it is not yet resolved. So, it has to call the VM to do the

10:26.180 --> 10:31.940
resolution. The VM will create a java.lang.String object corresponding to the string

10:32.020 --> 10:38.580
that needs to be accessed. It will intern that string and put a reference to it in the java.lang.Object

10:38.580 --> 10:42.660
array. And from there onwards, the interpreter can continue execution.
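
NOTE
Editor's sketch: the interning step mentioned above is observable from plain
Java, because equal string literals refer to the same java.lang.String object.
The class and method names here are made up for the example.
```java
public class StringConstantSketch {
    static String first()  { return "leyden"; } // string literal, loaded from the constant pool
    static String second() { return "leyden"; } // same literal at another site
    public static void main(String[] args) {
        // Both literals refer to the same interned String object.
        System.out.println(first() == second());
        // A String created at run time is a distinct object until it is interned.
        String built = new String("leyden");
        System.out.println(built == first());
        System.out.println(built.intern() == first());
    }
}
```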

10:44.820 --> 10:50.020
The resolution of other entities, like methods, happens in the same way, just that the

10:50.020 --> 10:57.380
data structures that are populated are different. So, there is a lot of back and forth happening

10:57.540 --> 11:04.020
here between the interpreter and the VM during the initial execution of the bytecodes,

11:04.020 --> 11:13.060
until the resolution information has been completely filled up. So, at the end of

11:13.060 --> 11:21.300
this whole process, what the JVM has done is create this big graph of data structures, or

11:21.380 --> 11:31.060
metadata objects, as we call them in HotSpot jargon. And that describes not just

11:31.060 --> 11:37.780
the class file, but also its connections with the other classes

11:37.780 --> 11:45.220
at runtime. And this is done in every run of the Java application. So, there is a fair bit of

11:45.220 --> 11:52.740
repetition in the JVM's activity during these phases to reach this stage. And we can remove

11:52.740 --> 12:00.980
this repetition if we can somehow store this structure as is on disk and then ask the JVM

12:00.980 --> 12:08.580
to map it, so that it gets this pre-built graph of data structures.

12:09.460 --> 12:16.100
And this is similar to what the CDS archive is doing right now, but we are going one step further.

12:16.100 --> 12:21.700
We are talking about storing the classes in a linked state in the archive.

12:26.500 --> 12:37.940
So, that brings us to JEP 483. This JEP is about

12:37.940 --> 12:44.820
ahead-of-time class loading and linking. Ahead-of-time class

12:44.820 --> 12:50.020
loading is already provided in some form by the current implementation of CDS archive. So,

12:50.020 --> 12:56.020
we have built on top of that existing technology, and we have added the capability to

12:56.020 --> 13:02.660
store the classes in a linked state. And the way we achieve this is

13:03.380 --> 13:09.540
by doing a training run. So, the goal of the training run here is to touch as many classes

13:09.540 --> 13:12.980
as possible that the application would touch in a production run, an actual run.

13:14.580 --> 13:20.500
And the information captured in the training run is then stored in a configuration file at the

13:20.500 --> 13:26.660
end of the training run. The JVM then reads this configuration file and emits a file called the

13:26.820 --> 13:31.460
AOT cache. And this is the file that contains all your classes in a pre-linked state.

13:32.180 --> 13:37.700
So, the JVM can map this AOT cache and can adopt these structures from the AOT cache.

13:39.300 --> 13:46.420
And this adoption happens even before main is called. So, by the time the

13:46.420 --> 13:52.820
application starts executing, your classes are already in a linked state. And you save

13:52.980 --> 14:02.980
a significant amount of time at runtime in the startup phase. So, it is worth mentioning

14:02.980 --> 14:09.940
here that the JVM still retains the ability to load and link classes

14:09.940 --> 14:16.580
at runtime. So, in case your production run touches a class which is not in the AOT cache,

14:16.580 --> 14:22.740
the JVM can still load and link the class at runtime. So, in a sense, the AOT cache is not

14:23.140 --> 14:31.780
putting a closed-world constraint on your application. So, let us talk about a couple of challenges

14:31.780 --> 14:36.900
we faced. The first one comes from user-defined loaders. So, currently there is no mechanism in

14:36.900 --> 14:43.620
the JVM to carry the identity of user-defined loaders across runs. And secondly, they do not

14:43.620 --> 14:48.420
have a well-defined behavior. So, that makes it difficult for the JVM to support user-defined

14:48.500 --> 14:53.060
loaders in the current scheme of things. Right now, we are only supporting the built-in loaders:

14:53.060 --> 14:59.860
your boot loader, platform loader and system loader. The second challenge is

15:01.700 --> 15:08.820
in running class initializers ahead of time. So, class initializers are very

15:08.820 --> 15:14.660
messy in nature: they can run arbitrary Java code. So, they are not pure;

15:14.740 --> 15:18.660
they can have side effects. And they can also pull in dependencies from the environment.

15:20.100 --> 15:25.780
And the environment can vary from the training run to the production run. So,

15:25.780 --> 15:32.420
what we have done is, we have identified a very small set of JDK classes

15:32.420 --> 15:36.100
which are safe to initialize. And only those classes are initialized ahead of time.
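
NOTE
Editor's sketch of why class initializers are hard to run ahead of time. The
Config class is hypothetical: its static initializer reads the environment and
has a visible side effect, so a value captured in a training run could be wrong
in a production run.
```java
public class InitializerSketch {
    // Hypothetical class whose static initializer depends on the environment.
    static final class Config {
        static final String USER_DIR;
        static final long LOADED_AT_NANOS;
        static {
            USER_DIR = System.getProperty("user.dir"); // can differ between machines and runs
            LOADED_AT_NANOS = System.nanoTime();       // differs on every run
            System.out.println("Config initialized");  // visible side effect
        }
    }
    public static void main(String[] args) {
        System.out.println("main started");
        // The first use of Config triggers its static initializer here, not earlier.
        System.out.println(Config.USER_DIR);
    }
}
```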

15:37.140 --> 15:42.820
And that is also done not from the performance point of view, but it came out as a necessity

15:42.820 --> 15:51.300
when we tried to do the pre-linking of invokedynamic bytecodes. So, so far, all the other classes

15:52.580 --> 15:58.420
that are pre-linked in the AOT cache still need to be initialized

15:58.420 --> 16:07.220
at runtime. So, I talked about what we have delivered so far. Let us quickly look at

16:08.020 --> 16:13.460
what is in the pipeline. So, now that we have the concept of a training run and an AOT cache,

16:14.420 --> 16:19.940
we also want to capture the profiling information in the AOT cache. And we also want to do

16:21.140 --> 16:27.780
compilations ahead of time and store them in the cache. So, what kind of benefits can we achieve from

16:27.780 --> 16:32.900
these activities? Let us understand the profiling and execution model a bit,

16:33.860 --> 16:39.140
what the JVM is doing right now. So, there are three players in this picture.

16:39.140 --> 16:44.580
There is an interpreter; there is the C1 compiler, which is a fast compiler, but it generates

16:44.580 --> 16:50.980
inferior code; and the C2 compiler, which is an optimizing compiler and generates

16:50.980 --> 16:56.580
the most performant code. The JVM employs a tiered compilation policy. That means

16:56.660 --> 17:05.460
methods start executing in the interpreter and, as the application is executing, some of the methods which are

17:05.460 --> 17:11.140
hot, which the JVM thinks are worth compiling, are compiled by

17:11.140 --> 17:18.820
the C1 compiler. And out of these methods, some are very hot and performance-critical, and they

17:18.820 --> 17:24.820
get compiled by the C2 compiler. And these transitions happen based on the invocation count of the

17:25.140 --> 17:31.060
method. So, the invocation count acts like a filter to identify the performance-critical

17:31.060 --> 17:34.900
methods in the application, and they get compiled by the C2 compiler.

17:40.420 --> 17:46.820
So, talking about the C2 compiler: to be able to do powerful optimizations and produce

17:46.820 --> 17:53.860
the most optimal code, it needs insights about the runtime behavior of the method, and this

17:53.940 --> 18:08.100
information is captured by profiling the method execution. So, this

18:08.100 --> 18:13.780
profiling is actually done by the C1 profiled code, and it can also be done by the interpreter in

18:13.780 --> 18:21.300
some cases. So, the profiling information that is captured by this activity can tell

18:21.620 --> 18:27.300
us whether a particular branch has been taken or not, how many times

18:27.300 --> 18:32.020
a particular branch has been taken, what the type of the receiver object is at the

18:32.020 --> 18:38.260
call site, and the types of the return values and the arguments. And all of this information allows

18:38.260 --> 18:45.940
C2 to do something called speculative compilation. What it means is C2 assumes that the

18:46.020 --> 18:52.340
method is going to behave in a certain way, and it optimizes the compiled code based on

18:52.340 --> 19:00.740
that behavior. And in case the assumption fails in a future execution of the method, then the JVM will

19:00.740 --> 19:08.740
deoptimize the code. So, for instance, C2 can omit a code path if the profiling

19:08.820 --> 19:14.020
information tells it that this particular code path has never been taken so far.

19:16.580 --> 19:23.300
Instead of that code path, what C2 does is it puts a deoptimization trap, and the purpose of

19:23.300 --> 19:30.820
that deopt trap is to act like a safety net. So, in case, in future runs,

19:30.820 --> 19:37.140
control does reach that code path, then the deopt trap will cause the execution

19:37.140 --> 19:44.500
of the method to continue in the interpreter. So, we looked at how the

19:44.500 --> 19:51.700
method execution transitions from the interpreter to C1 profiled code to C2 fully optimized

19:51.700 --> 19:58.900
code. And this transition takes its own time, and that is why you have a warm-up phase in your application.
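
NOTE
Editor's sketch of the speculation and deopt trap idea described earlier, written
as ordinary Java rather than real compiler output. It assumes a profile that never
saw a negative argument; slowAbs, speculativeAbs and deoptCount are hypothetical
names, and the counter merely stands in for a deoptimization event.
```java
public class DeoptSketch {
    static int deoptCount = 0;
    // General version: handles every case, like code running in the interpreter.
    static int slowAbs(int x) {
        return x < 0 ? -x : x;
    }
    // Specialized version: the negative branch is replaced by a "trap"
    // that falls back to the general version.
    static int speculativeAbs(int x) {
        if (x < 0) {
            deoptCount++;       // stand-in for hitting a deoptimization trap
            return slowAbs(x);  // execution continues in the general version
        }
        return x; // fast path kept by the speculation
    }
    public static void main(String[] args) {
        System.out.println(speculativeAbs(5));
        System.out.println(speculativeAbs(-3));
        System.out.println("deopts: " + deoptCount);
    }
}
```
The result stays correct when the assumption fails; only the fast path is lost.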

20:00.020 --> 20:06.740
This is the phase where you will observe a lot of compilation activity, until the performance-critical

20:06.740 --> 20:11.700
methods get compiled by the C2 compiler.

20:12.500 --> 20:20.260
And at that stage, your application is performing at its peak level. And after that, you will observe

20:20.260 --> 20:27.540
that the compilation activity goes down significantly. So, we can now understand what we stand to

20:27.540 --> 20:32.900
gain by storing the profiling information. You know, if you

20:32.980 --> 20:37.860
capture the profiling information in the training run, store it in the AOT cache and make it

20:37.860 --> 20:44.500
available to the JVM at startup itself in the production run, then it can actually

20:44.500 --> 20:51.140
come out of the interpreter quickly and do the transition to the C2 compiled code

20:51.140 --> 20:55.860
much more quickly. So, it has the potential of reducing the warm-up phase of your application.

20:56.260 --> 21:04.260
Also, if we store the compiled code generated during the training run in the AOT cache

21:04.260 --> 21:14.260
and make it available to the JVM during the startup phase itself, so that it is able to use it

21:14.260 --> 21:19.620
as early as possible, then you are saving a significant amount of CPU cycles which would have otherwise

21:19.780 --> 21:26.100
been spent in compilation at runtime. And these freed CPU cycles

21:26.100 --> 21:31.780
can then be used by the application threads to start up quickly. So, both of these techniques

21:31.780 --> 21:36.580
have the potential to improve your startup time and warm-up time.

21:39.460 --> 21:45.460
As with the ahead-of-time loading and linking feature, the JVM still retains the ability

21:46.020 --> 21:52.820
to do just-in-time compilations. And this is very much needed because the behavior of the application

21:52.820 --> 22:00.740
in the production run can differ from the training run. If that happens, your AOT profile data

22:01.300 --> 22:07.780
may be invalid, or the assumptions that were made during AOT compilation may not

22:07.780 --> 22:14.900
hold true. So, in such scenarios, the JVM still needs to deoptimize the code, and

22:14.980 --> 22:25.460
you will still want to recompile the methods back to the C1 or C2 level. And lastly, apart from

22:25.460 --> 22:32.420
compiling the Java methods, the JVM also generates code for its interpreter. It generates small

22:32.980 --> 22:38.900
snippets of code, which we call stubs and blobs, which help the compiled methods to do some

22:38.980 --> 22:44.580
runtime activities like throwing exceptions, handling exceptions and deoptimization.

22:45.380 --> 22:51.780
And then there are adapters, which allow the JVM to transition from the interpreter to compiled

22:51.780 --> 22:59.140
code and vice versa. And all of these are generated in the startup phase.

23:01.060 --> 23:07.460
So, although this code generation is very fast, it still takes a small bit

23:07.540 --> 23:13.780
of your startup time. And as we reduce the startup time using other optimizations,

23:15.380 --> 23:22.500
the relative contribution of this code generation will grow. So, we are also working on

23:22.500 --> 23:30.660
storing this one-time generated code in the AOT cache. That is all good, but when

23:30.740 --> 23:37.220
does that arrive? So, as I mentioned, the ahead-of-time class loading and

23:37.220 --> 23:42.660
linking feature is already available in JDK 24. It has already been delivered in the mainline.

23:43.380 --> 23:50.100
The other two pieces, ahead-of-time method profiling and code compilation, are expected to be

23:50.100 --> 23:57.140
in JDK 25. But you do not have to wait until these releases; there is a Leyden EA build

23:57.220 --> 24:02.740
available. And for those who are adventurous enough to build their own JDK,

24:03.940 --> 24:09.140
you can compile from the Leyden repo. There is a premain branch which has all these features.

24:11.060 --> 24:17.460
So, yeah, give this a try, try running your application and see what kind of improvements you can get.

24:18.740 --> 24:23.140
That completes my talk. Thank you.