WEBVTT

00:00.000 --> 00:12.560
Hello, everyone, and welcome to my talk about the foreign function interface, also known

00:12.560 --> 00:14.560
as Project Panama.

00:14.560 --> 00:21.560
I'm Martin Doerr, and I've implemented the foreign function interface for the Power architecture,

00:21.560 --> 00:25.480
also known as PowerPC.

00:25.480 --> 00:29.320
Obviously, I'm on the SapMachine team.

00:29.320 --> 00:34.800
SapMachine is our OpenJDK distribution.

00:34.800 --> 00:36.600
But we don't only make distributions.

00:36.600 --> 00:40.560
We are also one of the biggest contributors to OpenJDK.

00:40.560 --> 00:44.720
It was already mentioned by Mark this morning.

00:44.720 --> 00:48.760
So it's good to have you here, and let's immediately get started

00:48.800 --> 00:51.960
with an example.

00:51.960 --> 00:54.280
So let's assume we have a native library.

00:54.280 --> 01:02.440
It's called time.so, and it contains a function, get time, with some signature.

01:02.440 --> 01:10.600
It takes a C string and a length, and now the task is to call this function from Java.

01:10.600 --> 01:14.440
So of course, we can traditionally do this with JNI.

01:14.440 --> 01:20.960
So by writing a C layer, and then we can call it from Java, but let's skip that

01:20.960 --> 01:26.400
and jump directly to the foreign function interface.

01:26.400 --> 01:29.000
So we can now do everything directly from Java.

01:29.000 --> 01:33.000
There's no need for an extra C layer anymore.

01:33.000 --> 01:38.640
And by the way, the term downcall means we're calling from Java to native.

01:38.640 --> 01:41.240
So this direction.

01:41.280 --> 01:43.600
So what does this example code do?

01:43.600 --> 01:49.880
In the static initializer, we load time.so, the library.

01:49.880 --> 01:55.360
And then we need to set up a couple of other things before we can use it.

01:55.360 --> 02:02.120
So first of all, we describe the array with the size and the type.

02:02.120 --> 02:09.280
So in Java, the C char array is represented by a Java byte array.

02:09.280 --> 02:15.640
And we also need to describe the signature of the get time function.

02:15.640 --> 02:19.840
Then we will need the address of the function.

02:19.840 --> 02:27.800
We use a symbol lookup for that, which gives us a memory segment.

02:27.800 --> 02:34.080
So the address is represented as a memory segment.

02:34.080 --> 02:38.720
Then we can use the native linker to create a downcall handle.

02:38.720 --> 02:44.000
I'll explain that in a minute.

02:44.000 --> 02:47.280
So once we have set all of this up, it's pretty easy to use.

02:47.280 --> 02:54.400
We can just use a thread-local arena.

02:54.400 --> 03:00.200
And then create or allocate the array.

03:00.200 --> 03:02.760
And then pass it to the function.

03:02.760 --> 03:13.160
The get time method handle invokes the get time function in the native library.

03:13.160 --> 03:19.080
And at the end, we only need to convert the results, which were written into the array,

03:19.080 --> 03:25.400
back to a Java string, which we can print out.

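NOTE
Editor's sketch, not the speaker's slide code (which is not part of this transcript):
a minimal downcall following the steps just described, assuming JDK 22 or later on a
64-bit platform. Because time.so is not available here, strlen from the C standard
library stands in for get time; the class and method names are made up for illustration.
```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
public class DowncallSketch {
    // The platform's native linker, as discussed in the talk.
    private static final Linker LINKER = Linker.nativeLinker();
    // Look up the function address (a MemorySegment) and describe the C signature
    // size_t strlen(const char*), assuming size_t maps to a Java long.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    static long nativeStrlen(String s) {
        // The confined arena frees the native memory when the try block ends.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s); // NUL-terminated UTF-8 copy
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
    public static void main(String[] args) {
        System.out.println(nativeStrlen("Panama"));
    }
}
```
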
03:25.400 --> 03:30.840
So I promised to tell more about the native linker. What does the native linker do?

03:30.840 --> 03:35.720
It finds us a linker for the platform we are using.

03:35.720 --> 03:45.240
There are 10 different linkers implemented within the JDK for 10 64-bit platforms.

03:45.240 --> 03:52.160
On x86, we have one linker, which is used for Linux and Mac.

03:52.160 --> 04:00.120
On Arm64, there are in total three linkers with different calling conventions.

04:00.120 --> 04:06.560
And also on Power, which I have worked on, we have three linkers.

04:06.560 --> 04:11.360
Power processors can run in both little- and big-endian mode.

04:11.360 --> 04:19.520
And modern Linux uses little-endian, which comes with ABI version 2.

04:19.520 --> 04:23.560
So it's a newer calling convention.

04:23.600 --> 04:31.000
Older Linux versions still use big-endian and the old ABI.

04:31.000 --> 04:36.800
That platform is already out of support, actually, but we're still keeping it alive,

04:36.800 --> 04:40.680
because the ABI is very similar to AIX.

04:40.680 --> 04:43.800
And as you can see, AIX is still supported.

04:43.800 --> 04:48.600
It's IBM's traditional Unix.

04:48.600 --> 04:56.000
And in addition to that, we also support RISC-V and s390 on Linux.

04:56.000 --> 05:02.440
So quite a few native linkers were implemented, but what about other platforms?

05:02.440 --> 05:07.800
Can we run it on a 32 bit machine, for example?

05:07.800 --> 05:08.800
Yes, we can.

05:08.800 --> 05:13.000
Other platforms are supported via libffi.

05:13.000 --> 05:16.640
But that means we need a separate library, and we have to call through it,

05:16.640 --> 05:21.400
and it makes it, of course, a bit slower.

05:21.400 --> 05:25.120
So once we have the native linker, we can use it to create a downcall handle.

05:25.120 --> 05:29.240
And this is my most complicated slide.

05:29.240 --> 05:34.640
There are a lot of things to do for that. Besides some checks and caching,

05:34.640 --> 05:42.200
the downcall handle classifies the arguments and the return types.

05:42.200 --> 05:48.440
For example, structures can be passed in registers, or by reference,

05:48.440 --> 05:53.720
and some platforms have special rules for homogeneous float

05:53.720 --> 05:56.640
aggregates, a word that is terrible to pronounce,

05:56.640 --> 05:58.280
and it was even worse to implement.

06:01.640 --> 06:05.640
It's only on some platforms, and then there are simpler types

06:05.640 --> 06:10.240
like pointer, integer, and float.

06:10.240 --> 06:18.000
Once we have classified them, the downcall handle can create bindings.

06:18.000 --> 06:24.560
Bindings are a recipe for processing the arguments

06:24.560 --> 06:32.000
and the return types, and they are implemented by operators.

06:32.000 --> 06:34.920
It's modeled as a stack-based interpreter,

06:34.920 --> 06:37.720
a bit similar to Java bytecode.

06:37.720 --> 06:39.880
And I have some examples on this slide.

06:39.880 --> 06:47.360
For example, the dup operator just duplicates the top of the stack.

06:47.360 --> 06:54.000
And then there's a buffer load that loads a value from a memory segment.

06:54.000 --> 06:59.360
And we have the VM store that writes a value into a register

06:59.360 --> 07:01.720
or into a stack slot.

07:01.720 --> 07:04.920
There are many more, but I can't go into all the details.

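NOTE
Editor's sketch of the idea only: a toy stack-based interpreter with three operators named
after the ones mentioned above (dup, buffer load, VM store). These are NOT the JDK's internal
binding classes; the records, the register names r3 and r4, and the map that simulates
registers are all hypothetical. The recipe splits a 16-byte struct into two register values.
Assumes JDK 22 or later.
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
public class BindingSketch {
    // Toy operators, loosely modeled on the ones named in the talk.
    sealed interface Op permits Dup, BufferLoad, VmStore {}
    record Dup() implements Op {}
    record BufferLoad(long offset) implements Op {}
    record VmStore(String register) implements Op {}
    // Interprets the recipe for one argument and returns the simulated register contents.
    static Map<String, Long> run(List<Op> recipe, MemorySegment argument) {
        Deque<Object> stack = new ArrayDeque<>();
        Map<String, Long> registers = new LinkedHashMap<>();
        stack.push(argument);
        for (Op op : recipe) {
            switch (op) {
                // Duplicate the top of the stack.
                case Dup d -> stack.push(stack.peek());
                // Pop a segment and push the 64-bit value found at the given offset.
                case BufferLoad l -> stack.push(((MemorySegment) stack.pop()).get(ValueLayout.JAVA_LONG, l.offset()));
                // Pop a value and "move" it into the named (simulated) register.
                case VmStore s -> registers.put(s.register(), (Long) stack.pop());
            }
        }
        return registers;
    }
    static Map<String, Long> demo() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment struct = arena.allocate(16, 8);
            struct.set(ValueLayout.JAVA_LONG, 0, 11L);
            struct.set(ValueLayout.JAVA_LONG, 8, 22L);
            return run(List.of(new Dup(), new BufferLoad(0), new VmStore("r3"),
                    new BufferLoad(8), new VmStore("r4")), struct);
        }
    }
}
```
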
07:04.920 --> 07:10.360
It's not only similar to Java bytecode.

07:10.360 --> 07:15.800
It also gets compiled or translated to Java bytecode.

07:15.800 --> 07:21.200
There's a binding specializer, and that generates,

07:21.200 --> 07:25.440
as I said, Java bytecode from the bindings.

07:25.440 --> 07:29.640
It can be switched off by a property, by default, it's enabled.

07:29.640 --> 07:33.120
It can also be dumped, so you can see what bytecode

07:33.120 --> 07:37.520
got generated.

07:37.520 --> 07:44.160
And it's a bit slow when you have the first invocation.

07:44.160 --> 07:46.720
But after some time, if you call it very frequently,

07:46.720 --> 07:49.640
the generated bytecode will also get just-in-time compiled.

07:49.640 --> 07:54.440
And then after some time, it will be fast.

07:54.440 --> 07:56.720
But that's not all.

07:56.720 --> 07:59.240
We still need the backend.

07:59.240 --> 08:01.520
The backend is called via JNI.

08:01.520 --> 08:04.280
There's a makeDowncallStub.

08:04.280 --> 08:07.080
And by the way, we're still using JNI.

08:07.080 --> 08:16.800
So it's still needed, at least inside the JVM or the JDK.

08:16.800 --> 08:21.160
Yeah, we need it at least to generate code

08:21.160 --> 08:23.200
for the foreign function interface.

08:27.320 --> 08:31.480
So next slide, we can trace what the backend does.

08:32.160 --> 08:34.600
At least if we have a debug build,

08:34.600 --> 08:38.440
we can use this log flag.

08:38.440 --> 08:42.440
And if we have an hsdis library also in place,

08:42.440 --> 08:46.040
this allows us to see the disassembled code.

08:46.040 --> 08:52.240
So the backend generates binary code into the code cache.

08:52.240 --> 08:54.680
And with the logging, we can see that disassembled.

08:54.680 --> 08:56.560
It's large enough to read, I think.

09:01.560 --> 09:03.640
Pardon?

09:03.640 --> 09:06.280
What does this code do?

09:06.280 --> 09:09.240
Also, this is from x86.

09:09.240 --> 09:11.960
In this case, the first instruction

09:11.960 --> 09:14.560
creates a stack frame.

09:14.560 --> 09:20.080
Then we have a thread state transition to native.

09:20.080 --> 09:24.040
Before this point, the garbage collector

09:24.040 --> 09:27.320
perhaps needs to synchronize with our thread.

09:27.320 --> 09:31.880
Because obviously, a Java thread can access or modify

09:31.880 --> 09:38.080
Java objects, and the garbage collector can move things around.

09:38.080 --> 09:40.360
So that needs to be coordinated.

09:40.360 --> 09:43.720
And after this thread state transition,

09:43.720 --> 09:46.560
the garbage collector threads know

09:46.560 --> 09:49.760
that we are in native, in a native function.

09:49.760 --> 09:52.640
And we will no longer access Java objects,

09:52.640 --> 09:56.240
so that they can do what they want with the Java heap.

09:59.640 --> 10:02.800
So the next step is to set up the arguments.

10:02.800 --> 10:06.920
So in this case, all arguments are passed in registers.

10:06.920 --> 10:10.080
We don't need any stack slots.

10:10.080 --> 10:15.720
And then here, this is the call to the native function.

10:15.720 --> 10:18.680
And once that is done, we need to change the state

10:18.680 --> 10:21.880
back to in Java.

10:21.880 --> 10:24.080
And this direction is a bit more complicated.

10:24.080 --> 10:30.040
It includes a safepoint check and also a memory barrier.

10:30.040 --> 10:33.560
Yeah, the memory barrier may be a performance bottleneck,

10:33.560 --> 10:37.760
especially if you call this function very often.

10:37.760 --> 10:41.880
And you have a large system with many sockets,

10:41.880 --> 10:44.760
or at least many chips, because the memory barrier

10:44.760 --> 10:46.720
requires inter-chip communication,

10:46.720 --> 10:51.400
and that may increase the latency.

10:51.400 --> 10:54.360
There are several ways to get rid of the memory barrier.

10:54.360 --> 10:59.080
So by the way, it's a lock add instruction on x86.

10:59.080 --> 11:01.560
And we can use the system memory barrier instead.

11:05.720 --> 11:06.560
So hopefully not me.

11:06.560 --> 11:16.720
Yeah, the system memory barrier makes this code cheaper,

11:16.720 --> 11:20.400
because the lock add instruction will be omitted.

11:20.400 --> 11:24.680
But on the other hand, it makes other things slower. For example,

11:24.680 --> 11:28.040
especially thread handshakes will be slower with that.

11:28.040 --> 11:32.240
So it's not always beneficial.

11:32.240 --> 11:36.440
But it is possible to get rid of all those state transitions.

11:36.440 --> 11:39.200
Some of you may know critical JNI natives

11:39.200 --> 11:42.160
from older JDK releases.

11:42.160 --> 11:44.800
And that is now also possible with the foreign function

11:44.800 --> 11:46.480
interface.

11:46.480 --> 11:51.440
You can use it by a linker option.

11:51.440 --> 11:55.160
So this looks like this.

11:55.160 --> 11:56.920
We just need to add the linker option

11:56.920 --> 12:01.640
to the downcall handle, Linker.Option.critical,

12:01.640 --> 12:04.160
and with the Boolean argument here,

12:04.160 --> 12:09.880
we specify that the function is allowed to access the Java

12:09.880 --> 12:15.040
heap directly, and that makes things even simpler.

12:15.040 --> 12:21.080
So after that, we can just use a normal byte array.

12:21.080 --> 12:26.160
And we can pass that as a memory segment,

12:26.160 --> 12:29.840
and finally, we can convert the array to a Java string,

12:29.840 --> 12:32.400
which we can print, so it's pretty simple now.

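NOTE
Editor's sketch, not the slide code: the critical linker option with heap access enabled,
assuming JDK 22 or later on a 64-bit platform. strlen again stands in for the talk's
get time function because it is short-running and never blocks; the class and method
names are illustrative.
```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;
public class CriticalSketch {
    private static final Linker LINKER = Linker.nativeLinker();
    // critical(true): no thread state transition, and heap segments may be passed as pointers.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS),
            Linker.Option.critical(true));
    static long strlenOnHeap(String s) {
        // A plain Java byte array with an explicit NUL terminator; no arena is needed.
        byte[] bytes = (s + "\0").getBytes(StandardCharsets.US_ASCII);
        try {
            return (long) STRLEN.invokeExact(MemorySegment.ofArray(bytes));
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
    public static void main(String[] args) {
        System.out.println(strlenOnHeap("hi"));
    }
}
```
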
12:34.200 --> 12:36.120
There's a warning on this slide.

12:36.120 --> 12:40.560
Do not use it for anything which may block.

12:40.560 --> 12:45.360
The thread state transitions were there for a reason.

12:45.360 --> 12:47.320
If you now have anything which blocks,

12:47.320 --> 12:50.880
then you will also block the GCs. We have

12:50.880 --> 12:52.560
GC developers in the room here,

12:52.560 --> 12:55.760
and they will not be happy if you block that.

12:55.760 --> 12:57.400
So don't do this.

12:57.400 --> 13:01.360
Maybe it would be an idea for April 1st and file a machine.

13:04.240 --> 13:09.040
All right, so what happens if we do this with the back end?

13:09.040 --> 13:13.840
You can see the generated code is pretty simple now.

13:13.840 --> 13:17.040
There's nothing slow in it anymore.

13:17.040 --> 13:22.000
You can also see the address computation is now there.

13:22.000 --> 13:27.040
We compute the start of the array address,

13:27.040 --> 13:30.400
and the native function will directly write to that.

13:35.120 --> 13:37.040
So there are more linker options.

13:37.040 --> 13:38.880
So the next one is on this slide.

13:38.880 --> 13:42.720
It is possible to figure out the errno,

13:42.720 --> 13:46.560
and the linker option is called captureCallState.

13:49.320 --> 13:52.880
If we use that, we need to pass an additional memory segment

13:52.880 --> 13:54.280
for the captured state.

13:56.680 --> 14:02.080
And I'm using a var handle here

14:02.880 --> 14:08.880
to extract the errno field from the structure after the call.

14:11.600 --> 14:16.240
So this is also a good feature that we can figure out

14:16.240 --> 14:19.000
the errno after a native function call.

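NOTE
Editor's sketch, not the slide code: capturing errno with the captureCallState linker option,
assuming JDK 22 or later on a POSIX system. Closing the invalid file descriptor -1 is used
here only because it reliably fails and sets errno; the class and method names are illustrative.
```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.StructLayout;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.VarHandle;
public class ErrnoSketch {
    private static final Linker LINKER = Linker.nativeLinker();
    // Layout of the structure that receives the captured state.
    private static final StructLayout CAPTURE_LAYOUT = Linker.Option.captureStateLayout();
    // Var handle that reads the errno field out of that structure.
    private static final VarHandle ERRNO =
            CAPTURE_LAYOUT.varHandle(MemoryLayout.PathElement.groupElement("errno"));
    // int close(int fd), linked with the option that saves errno right after the call.
    private static final MethodHandle CLOSE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("close").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT),
            Linker.Option.captureCallState("errno"));
    // Returns {return value of close(-1), captured errno}.
    static int[] closeInvalidFd() {
        try (Arena arena = Arena.ofConfined()) {
            // The additional memory segment for the captured state is the first argument.
            MemorySegment state = arena.allocate(CAPTURE_LAYOUT);
            int rc = (int) CLOSE.invokeExact(state, -1);
            int errno = (int) ERRNO.get(state, 0L);
            return new int[] {rc, errno};
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```
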
14:20.480 --> 14:25.280
But that adds some code to the generated code,

14:25.280 --> 14:30.480
where, directly after the native function call,

14:30.480 --> 14:34.560
we see another function call to capture state,

14:34.560 --> 14:36.400
so that adds some overhead.

14:39.760 --> 14:42.880
So my example was a bit simple.

14:42.880 --> 14:48.960
It only used an array, but we can also pass more complicated

14:48.960 --> 14:51.760
structures like an IDF.

14:51.760 --> 14:55.800
So IDF has nothing to do with Israeli defense forces.

14:55.800 --> 14:58.200
It stands for int, double, float.

15:00.720 --> 15:04.760
And some platforms' structure layout rules require padding,

15:04.760 --> 15:06.720
which you can see here and here.

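NOTE
Editor's sketch, not the slide code: one way an int/double/float struct with explicit padding
could be described, assuming JDK 22 or later and the layout rules of Linux on 64-bit platforms.
The field names i, d and f and the class name are assumptions.
```java
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemoryLayout.PathElement;
import java.lang.foreign.StructLayout;
import java.lang.foreign.ValueLayout;
public class IdfLayoutSketch {
    // struct IDF { int i; double d; float f; } with the padding a typical 64-bit ABI inserts.
    static final StructLayout IDF = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("i"),
            MemoryLayout.paddingLayout(4),      // aligns the double to 8 bytes
            ValueLayout.JAVA_DOUBLE.withName("d"),
            ValueLayout.JAVA_FLOAT.withName("f"),
            MemoryLayout.paddingLayout(4))      // rounds the struct size up to a multiple of 8
            .withName("IDF");
    public static void main(String[] args) {
        System.out.println("size = " + IDF.byteSize());
        System.out.println("offset of d = " + IDF.byteOffset(PathElement.groupElement("d")));
        System.out.println("offset of f = " + IDF.byteOffset(PathElement.groupElement("f")));
    }
}
```
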
15:11.200 --> 15:14.360
You probably would not write this manually.

15:14.360 --> 15:16.440
There is a tool to generate such code.

15:16.440 --> 15:18.840
It's called jextract.

15:18.840 --> 15:21.680
jextract is based on LLVM.

15:21.680 --> 15:26.160
And it parses C header files and generates Java code.

15:27.120 --> 15:31.440
So is this code now platform independent?

15:31.440 --> 15:35.920
Not always. You can also see it in the comment here.

15:35.920 --> 15:39.960
AIX, for example, has slightly different layout rules.

15:39.960 --> 15:43.400
It doesn't use padding here by default.

15:43.400 --> 15:45.520
But fortunately, there's a hack.

15:45.520 --> 15:50.360
We can build the native library on AIX with pragma

15:50.360 --> 15:52.080
align(natural).

15:52.080 --> 15:55.960
And that will make the layout compatible with Linux.

15:55.960 --> 15:59.640
If we don't do this, we have to modify the Java code

15:59.640 --> 16:01.080
for AIX.

16:01.080 --> 16:02.120
It's a bit unfortunate.

16:05.840 --> 16:09.200
So I have talked 16 minutes about downcalls.

16:09.200 --> 16:12.120
And now I have to get to upcalls.

16:12.120 --> 16:14.120
But I will make it short.

16:14.120 --> 16:16.360
What is an upcall?

16:16.360 --> 16:17.960
An upcall is the other direction.

16:17.960 --> 16:22.080
It is a call from native to Java.

16:22.080 --> 16:29.600
And the upcall stub provides us a memory segment,

16:29.600 --> 16:34.040
which can be passed to a native function,

16:34.040 --> 16:36.960
and serve as a function pointer for a callback.

16:36.960 --> 16:45.360
So the native function can call back to Java, which is pretty nice.

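NOTE
Editor's sketch, not from the talk: an upcall stub passed as a function pointer, assuming
JDK 22 or later on a 64-bit platform. The C library's qsort calls back into a Java comparator;
the class and method names are illustrative.
```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
public class UpcallSketch {
    // Java comparator that qsort calls back into (the upcall target).
    static int compare(MemorySegment a, MemorySegment b) {
        return Integer.compare(a.get(ValueLayout.JAVA_INT, 0), b.get(ValueLayout.JAVA_INT, 0));
    }
    static int[] sort(int[] values) {
        Linker linker = Linker.nativeLinker();
        try (Arena arena = Arena.ofConfined()) {
            // void qsort(void* base, size_t count, size_t size, int (*cmp)(const void*, const void*))
            MethodHandle qsort = linker.downcallHandle(
                    linker.defaultLookup().find("qsort").orElseThrow(),
                    FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG,
                            ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
            // The pointers handed to the comparator point to ints, so give them a target layout.
            FunctionDescriptor cmpDesc = FunctionDescriptor.of(ValueLayout.JAVA_INT,
                    ValueLayout.ADDRESS.withTargetLayout(ValueLayout.JAVA_INT),
                    ValueLayout.ADDRESS.withTargetLayout(ValueLayout.JAVA_INT));
            MethodHandle cmp = MethodHandles.lookup().findStatic(UpcallSketch.class, "compare",
                    MethodType.methodType(int.class, MemorySegment.class, MemorySegment.class));
            // The upcall stub is a memory segment that serves as a native function pointer.
            MemorySegment stub = linker.upcallStub(cmp, cmpDesc, arena);
            MemorySegment array = arena.allocateFrom(ValueLayout.JAVA_INT, values);
            qsort.invokeExact(array, (long) values.length, ValueLayout.JAVA_INT.byteSize(), stub);
            return array.toArray(ValueLayout.JAVA_INT);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```
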
16:45.360 --> 16:48.080
It generates code a bit similar to the downcall handle,

16:48.080 --> 16:50.120
which I'm not showing here.

16:50.120 --> 16:51.680
But it's more complex.

16:51.680 --> 16:54.800
It needs to save more registers.

16:54.800 --> 16:58.640
And it requires additional C calls for resource management.

16:58.640 --> 17:01.240
So upcalls are always slower than downcalls.

17:05.360 --> 17:08.360
So let's sum it up.

17:08.360 --> 17:11.720
There are several ways to call a native function.

17:11.720 --> 17:16.720
The traditional way is JNI.

17:16.720 --> 17:20.680
And we have critical functions available.

17:20.680 --> 17:23.080
We enable it with the switch

17:23.080 --> 17:24.440
CriticalJNINatives.

17:24.440 --> 17:28.360
But it's only available until JDK 17.

17:28.360 --> 17:31.600
It was removed later.

17:31.600 --> 17:34.080
And with the new foreign function interface,

17:34.080 --> 17:37.920
we also have critical functions available since JDK

17:37.920 --> 17:38.440
22.

17:38.440 --> 17:42.400
As a linker option, as I already showed.

17:42.400 --> 17:46.200
So what can we do for JDK 21?

17:46.200 --> 17:49.040
There's one more option.

17:49.040 --> 17:50.120
It's project Nalim.

17:50.120 --> 17:55.160
And the author is also here in the room.

17:55.160 --> 17:57.760
Yeah, it is only for critical functions.

17:57.760 --> 18:00.360
It is based on JVMCI,

18:00.360 --> 18:04.200
the Java Virtual Machine Compiler Interface.

18:04.200 --> 18:06.120
But it's not supported on all platforms.

18:06.120 --> 18:08.320
And it's not supported by the OpenJDK.

18:08.320 --> 18:09.880
So use it at your own risk.

18:09.880 --> 18:17.800
So I think I have a bit more time to talk about

18:17.800 --> 18:19.320
the power-specific things.

18:19.320 --> 18:23.440
I have mentioned that I have ported it to Power.

18:23.440 --> 18:26.880
I have one extra slide for that.

18:26.880 --> 18:32.800
So Power has some special properties.

18:32.800 --> 18:34.520
The ABI has some special properties,

18:34.520 --> 18:37.640
which were not implemented before.

18:37.640 --> 18:40.640
For example, integers need to be passed as long,

18:40.640 --> 18:48.640
so we need sign extension. Floats

18:48.640 --> 18:54.640
need an instruction to convert them to double format,

18:54.640 --> 18:58.640
because float values are represented

18:58.640 --> 19:03.000
in double format when they are passed in registers.

19:03.000 --> 19:05.720
Then the HFA is extremely complicated.

19:05.720 --> 19:12.080
The homogeneous float aggregate: some parts are typically passed

19:12.080 --> 19:13.880
in floating point registers.

19:13.880 --> 19:17.240
And once we have all floating point registers used,

19:17.240 --> 19:20.040
the rest gets passed in general purpose registers.

19:20.040 --> 19:22.760
And if there are not enough free general-purpose registers,

19:22.760 --> 19:24.600
we also use stack slots.

19:24.600 --> 19:30.720
And the float registers always use double format.

19:30.720 --> 19:32.560
But the other parts may be compressed

19:32.560 --> 19:34.640
to 32 bits.

19:34.640 --> 19:38.240
And there are cases in which we need to use both

19:38.240 --> 19:41.520
a register and a stack slot for the same value.

19:41.520 --> 19:46.120
So that was terrible to implement.

19:46.120 --> 19:50.760
As I mentioned, we have three different ABIs on Power.

19:50.760 --> 20:02.000
Big-endian needed some extra code.

20:02.080 --> 20:07.360
We had to shift the values into the right position.

20:07.360 --> 20:11.680
And we had to extend the binding operators

20:11.680 --> 20:16.920
and all that, so that was some extra work to be done.

20:16.920 --> 20:21.760
So I'm already done now.

20:21.760 --> 20:24.240
Panama can be used free of charge.

20:24.240 --> 20:26.040
It's open source.

20:26.040 --> 20:28.360
I'm sure some politicians will like the statement.

20:28.360 --> 20:33.800
Don't forget to grab a SapMachine sticker.

20:33.800 --> 20:36.520
And I think it's time for questions now.

20:36.520 --> 20:46.800
APPLAUSE

20:46.800 --> 20:49.240
Yeah?

20:49.240 --> 20:54.240
Can those native blocks be inlined into the caller as well?

20:54.240 --> 20:57.480
The question is if these native blocks can be inlined.

20:57.480 --> 21:02.240
No, they can't. Only the parts which are in Java,

21:02.240 --> 21:03.640
so the generated bytecode.

21:03.640 --> 21:06.880
That can be inlined, but not the blocks.

21:11.880 --> 21:12.880
Any more questions?

21:16.880 --> 21:18.360
So I think we're done.

21:18.360 --> 21:21.320
And if any more questions come into your mind,

21:21.320 --> 21:22.640
you can find me around here.

21:22.640 --> 21:25.560
I'll be here for the rest of the day.

21:25.640 --> 21:29.520
So the next talk will also be about the foreign function

21:29.520 --> 21:31.920
interface, so stay tuned.

21:31.920 --> 21:36.920
APPLAUSE

