WEBVTT

00:00.000 --> 00:16.560
Hello everyone, I'm Eric. I'm an engineer on the Android performance team, and I'd

00:16.560 --> 00:20.360
like to talk to you a little bit about things we did for Chromium and Android to make

00:20.360 --> 00:28.320
it nice and fast. To give you a little sneak peek, this is what the graph looks like

00:28.320 --> 00:33.640
for Speedometer performance on Android in Chromium over roughly the last two years.

00:33.640 --> 00:38.760
It's a graph I'm quite proud of. It's not just me that accomplished this, of course; it's

00:38.760 --> 00:46.960
a large group of people that helped us get here, but yeah, we basically doubled web performance

00:46.960 --> 00:55.000
measured as Speedometer score on Android through a lot of work.

00:55.000 --> 01:01.440
To briefly take a step back: how many of you know what Speedometer

01:01.440 --> 01:02.440
is?

01:02.440 --> 01:10.560
That's a good number. It's really a web interaction benchmark: it tries to measure how fast

01:10.560 --> 01:15.520
a browser responds when users interact with a website.

01:15.520 --> 01:21.320
It does that using synthetic workloads. What you can see here a lot is a to-do application

01:21.320 --> 01:28.600
where we add items to a to-do list and then remove them, and we measure how long that takes.

01:28.600 --> 01:30.240
Why does this matter?

01:30.240 --> 01:35.000
Speedometer improvements mean interactions on the web get faster, but they also

01:35.000 --> 01:37.080
mean that page loads get faster.

01:37.080 --> 01:44.560
And this is an example of loading a Google Doc in Chrome on Android two years ago

01:44.560 --> 01:51.480
and now. You can see that two years ago that was almost 50% slower, and it's

01:51.480 --> 01:55.680
all due to these kinds of improvements.

01:55.680 --> 02:02.240
So with all that bait out of the way, let me say thank you to a lot of people that helped

02:02.240 --> 02:07.840
us get here and tell you what I want to talk about here today.

02:07.840 --> 02:12.720
First, I'd like to break down this timeline: where

02:12.720 --> 02:18.120
does each of these improvements come from, and how did we get to 2x?

02:18.120 --> 02:24.640
Then I'd like to dive deep into a rather nerdy, geeky area: the build

02:24.640 --> 02:32.240
optimizations that we added.

02:32.240 --> 02:37.920
And then I'd like to talk a little bit about the tooling that enables us to understand this kind

02:37.920 --> 02:42.480
of browser performance at the lower level, close to the hardware, and how we can identify

02:42.480 --> 02:45.120
where there are optimization opportunities.

02:45.120 --> 02:51.760
So it's very much a view from a browser developer close to the barebones hardware here,

02:51.760 --> 02:57.600
which might not be quite the same to a web developer, but some of the tooling, some of the

02:57.600 --> 03:04.200
insights, some of the processes might actually also be quite relevant for a web developer.

03:04.200 --> 03:09.320
So yeah, Speedometer. Lastly, I also want to talk a little bit about what we are

03:09.320 --> 03:15.320
doing to get workloads in the lab closer to real interactions and

03:15.320 --> 03:21.240
page loads on real websites.

03:21.240 --> 03:28.480
So yeah, this timeline, a lot of the improvements here came out of these three areas.

03:28.480 --> 03:36.320
First, improvements to Chrome's build to make it faster to execute on modern Android hardware.

03:36.320 --> 03:41.760
Second, there were a bunch of improvements to the rendering and JavaScript engines in Chrome

03:41.760 --> 03:48.440
that contributed mostly the other half of these improvements.

03:48.440 --> 03:56.720
And lastly, we had to work quite closely with other OEMs and Android partners to make

03:56.720 --> 04:04.320
sure that browsing is actually scheduled correctly on the hardware.

04:04.320 --> 04:11.120
So digging into the build first, the main thing that enabled us to make a change here was

04:11.120 --> 04:17.000
the fact that we split Chrome's build into two halves for Android.

04:17.000 --> 04:21.160
Previously, we were shipping the same APK, the same binary to all Android devices.

04:21.160 --> 04:33.960
As you might know, Android devices range from a $100 phone to a $2,000 flip phone, and

04:33.960 --> 04:36.840
these devices perform very differently.

04:36.840 --> 04:42.640
For the low end, it's really important that we ship an APK, a build, that's really small

04:42.640 --> 04:48.800
in size and has low memory overhead, because these devices ship with poor flash, a very

04:48.800 --> 04:56.320
small amount of flash, and a lot less memory than the high-end phones.

04:56.320 --> 05:03.800
So for these more economy-style phones, we can't really ship a very well-optimized

05:03.800 --> 05:08.720
Chrome binary because a lot of these optimizations that we can land into Chrome, they increase

05:08.720 --> 05:13.040
our binary size, they increase the memory footprint.

05:13.040 --> 05:20.040
So splitting Chrome's build into these two halves and shipping one APK to lower end phones

05:20.040 --> 05:26.760
and another APK to more premium phones made a lot of these improvements possible.

05:26.760 --> 05:29.200
What are we doing differently for this high-end build?

05:29.200 --> 05:32.960
The first thing we're doing differently is that we don't optimize for size in the compiler,

05:32.960 --> 05:35.440
we optimize for speed instead.

05:35.440 --> 05:40.640
We also don't have to target 32-bit anymore. We can say: this build, we are

05:40.640 --> 05:45.440
only going to ship it to 64-bit devices.

05:45.440 --> 05:49.840
That's actually also something that increases the memory footprint slightly.

05:49.840 --> 05:53.480
64 bit means every pointer becomes double the size.

05:53.480 --> 05:59.680
We can make some tweaks to that, we can use pointer compression, for example, to reduce

05:59.680 --> 06:05.520
pointer sizes in V8, in the garbage collection heaps, the JavaScript heaps, and the Blink heaps

06:05.520 --> 06:11.760
potentially, but overall there's still going to be a memory impact.

06:11.760 --> 06:16.400
Even on low end phones where we have 64 bit capability, we wouldn't ship a 64 bit

06:16.400 --> 06:23.200
build, we would ship only a 32-bit build because of the memory impact.

06:23.200 --> 06:29.040
And finally, what we can also now start to do is profile-guided optimization, PGO. This

06:29.040 --> 06:36.160
is basically a mechanism that runs workloads in the lab on Chrome and figures out what

06:36.160 --> 06:43.680
part of the code is hot, which code is less hot, and which code is cold, and

06:43.680 --> 06:49.120
applies different optimizations to hot and cold code. We'll get into more details

06:49.120 --> 06:52.800
of what that actually means later.

06:52.800 --> 06:56.160
So initially, we just enabled all of these things.

06:56.160 --> 07:02.480
Second, we then dug deep into how we can improve on that.

07:02.480 --> 07:09.600
We switched the generation of these profile-guided optimizations to use different profiling data.

07:09.600 --> 07:17.800
Previously, we were kind of reusing profiles that were already present for 64-bit Arm Mac devices.

07:17.800 --> 07:23.400
Now we are switching to 64 bit profiles collected on actual Android phones.

07:23.400 --> 07:27.160
And that improves performance.

07:27.160 --> 07:34.440
We also made sure that the PGO profiles that were used in the binary build later are profiles

07:34.440 --> 07:38.320
that were very recently computed.

07:38.320 --> 07:44.920
If you let a lot of time pass between generating the profile and doing your build, these profiles

07:44.920 --> 07:52.040
become stale and might not actually apply the correct optimizations to the correct code.

07:52.040 --> 07:58.200
Another thing that we then discovered later is that we can increase inlining in the compiler

07:58.200 --> 08:09.320
so that the compiler kind of prefers to pull in more code into inline functions.

08:09.320 --> 08:16.040
This regresses binary size, but it's actually beneficial on modern hardware.

08:16.040 --> 08:22.440
And lastly, we also added improvements to the order file.

08:22.440 --> 08:29.320
That's a Chrome-specific compiler feature, or more like a linker feature, that tries to arrange functions

08:29.320 --> 08:33.760
across the binary in a sensible order.

08:33.760 --> 08:41.440
That's now based on Speedometer, and that helps a lot too.

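NOTE
Editor's sketch, not part of the talk and not Chromium's actual order-file tooling. One simple way to picture arranging functions "in a sensible order" is to place them by first use in a recorded trace, so that code which runs close together in time sits close together in the binary. The name orderFunctions, the trace, and the function list are all hypothetical.
```typescript
// Hypothetical sketch: order functions by first use in a sampled call trace,
// so functions that run close together in time end up adjacent in the layout.
function orderFunctions(trace: string[], allFunctions: string[]): string[] {
  const seen = new Set<string>();
  const order: string[] = [];
  // Functions observed in the trace come first, in order of first execution.
  for (const fn of trace) {
    if (!seen.has(fn)) {
      seen.add(fn);
      order.push(fn);
    }
  }
  // Functions never observed (cold code) are appended at the end.
  for (const fn of allFunctions) {
    if (!seen.has(fn)) {
      seen.add(fn);
      order.push(fn);
    }
  }
  return order;
}
const layout = orderFunctions(
  ["parse", "layout", "parse", "paint"],
  ["debugDump", "paint", "parse", "layout"],
);
console.log(layout.join(" ")); // parse layout paint debugDump
```
A real order file would be derived from profiling a workload such as Speedometer; this first-use ordering is only the simplest possible stand-in.
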
08:41.440 --> 08:44.320
So that's all build.

08:44.320 --> 08:49.200
Beyond build, I mentioned there were a lot of improvements in the Chromium engine itself,

08:49.200 --> 08:56.080
like the Blink rendering engine. For example, there were a bunch of small improvements

08:56.080 --> 09:02.000
that were landed across the engine. A little bit here, a little bit there, and it

09:02.000 --> 09:03.000
adds up.

09:03.240 --> 09:14.440
With even 11%, 13%-plus improvements, just by adding up lots of tiny little things over a year,

09:14.440 --> 09:17.000
two years, that adds up.

09:17.000 --> 09:21.560
But there were also a couple of bigger changes that were landed.

09:21.560 --> 09:30.200
The first one called out here is an improved parser that makes it faster to parse HTML when

09:30.280 --> 09:37.160
it is inserted dynamically via the innerHTML attribute.

09:37.160 --> 09:41.960
Again, something that we didn't ship on Android before, because of binary size.

09:41.960 --> 09:48.200
Adding this extra parser means we regress low-end devices.

09:48.200 --> 09:53.400
V8 also added a new baseline compiler here. That's basically a tier that sits in between

09:53.400 --> 10:01.160
the really quick-to-generate code of the Ignition interpreter in V8

10:01.160 --> 10:08.600
and the next-level-up compiler that crunches out really well-optimized code using a

10:08.600 --> 10:10.360
JIT compiler.

10:10.360 --> 10:16.360
Sparkplug is a baseline compiler that is really quick to spit out somewhat better code.

10:17.320 --> 10:24.360
And adding that into V8 improves speedometer but also improves spatial significantly.

10:26.840 --> 10:29.240
The last thing to call out here is garbage collection.

10:29.240 --> 10:31.240
I think there are more opportunities in that space.

10:31.240 --> 10:36.840
But in the last few releases last year, we landed a few improvements to make sure that

10:36.840 --> 10:39.000
garbage collection happens at a better moment in time,

10:39.800 --> 10:45.800
rather than triggering in moments when it will negatively affect Speedometer scores or

10:45.800 --> 10:46.840
interactions on pages.

10:50.920 --> 10:53.160
The last area is scheduling in the operating system.

10:53.160 --> 11:01.560
So it turns out that if you don't tweak your kernel to prioritize the right threads,

11:02.520 --> 11:04.840
your performance suffers. Who would have guessed?

11:05.480 --> 11:11.880
This is an active area for us. I think we landed a couple of initial wins here with some of the

11:11.880 --> 11:17.400
OEMs and Android, but Android is really fragmented in this area. A lot of OEMs use very

11:17.400 --> 11:26.440
different scheduling heuristics and policies in their platforms, and making sure that web browsing

11:26.440 --> 11:30.600
is effectively prioritized there is a very tricky topic.

11:30.760 --> 11:39.400
All right, so to dig a little bit deeper into this build thing, I wanted to show you a little bit

11:39.400 --> 11:46.440
of data. This is data that we collect during Speedometer on an Android device.

11:46.440 --> 11:53.000
In this case, this is a Pixel 8 device. This was before we landed all these PGO improvements.

11:53.880 --> 12:04.360
And what you can see here is that in Speedometer, the execution in the CPU is often

12:04.360 --> 12:10.120
stalled in the front end of the CPU. The front end of the CPU is kind of the piece in the CPU

12:10.120 --> 12:15.160
that tries to fetch instructions from the memory to then pass them on into the execution units

12:15.160 --> 12:19.880
in the back end to execute. So what this means really is that the front end is having trouble

12:20.760 --> 12:29.800
identifying where to fetch the instructions from. Stalling here means we have to wait for

12:29.800 --> 12:34.440
the data to come in so that we can actually take these instructions and put them into the back end.

12:35.960 --> 12:41.480
And typically what that again means is that you are waiting for memory, you're waiting for

12:41.480 --> 12:44.360
these instructions to come out of your cache hierarchy or out of your DRAM.

12:45.320 --> 12:53.960
And what we discovered is that in Chrome, before all of these PMU, sorry, before all of these

12:53.960 --> 13:00.760
profile-guided optimizations, we were seeing that a lot of stalls were due to branches in some form or

13:00.760 --> 13:05.640
another: cache misses that happened because we were mispredicting branches.

13:07.480 --> 13:13.560
And Speedometer is a workload that is very different from many other workloads that CPU engineers

13:13.560 --> 13:21.240
typically utilize to benchmark their CPUs. A very quick statistic here is on branches.

13:22.520 --> 13:27.880
In Speedometer, about 20% of the instructions are branches. That is, every fifth instruction

13:27.880 --> 13:35.080
is a conditional branch that wants to go somewhere. In workloads like Geekbench,

13:35.640 --> 13:42.760
something that compiler engineers or even CPU engineers would be more familiar with,

13:43.000 --> 13:51.960
that number is about half. Given this many branches, it's really important that

13:53.720 --> 14:01.800
these branches are predicted correctly in the CPU. When you mispredict a branch on an Arm CPU,

14:02.680 --> 14:06.840
you end up paying not only for the mis-predict itself, right? Like you mis-predict the branch,

14:06.840 --> 14:11.400
that means you have to roll back all your execution to the beginning of this branch,

14:12.280 --> 14:18.040
throw out all the instructions that you executed in a predicted fashion, and then restart

14:18.040 --> 14:24.520
instruction execution there. But when you did this mispredict, what you also did is you mispredicted

14:24.520 --> 14:29.800
what memory to fetch into your cache hierarchy to load your instructions from.

14:30.600 --> 14:35.000
So you end up polluting your cache hierarchy with the wrong instructions and you end up not having

14:35.080 --> 14:41.240
the correct instructions in the cache hierarchy, which means that again you increase the time

14:41.240 --> 14:51.640
needed to fetch data from memory in the front end. The way that you solve this is by making

14:51.640 --> 15:01.880
sure that, in the code of your application, you lay out your branches

15:01.960 --> 15:07.160
at the assembly level in such a way that the fall-through path

15:07.160 --> 15:14.280
through a branch is the one that is executed most often. That's what profile-guided optimization

15:14.280 --> 15:21.800
attempts to do. It tracks which path in each branch is taken, while executing a workload.

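NOTE
Editor's sketch, not part of the talk and not how a real compiler implements PGO. It only illustrates the decision described here: given counts of how often each side of a branch ran during profiling, make the more frequent side the fall-through path. BranchProfile, chooseFallThrough, fallThroughRate, and the counts are hypothetical.
```typescript
// Hypothetical profile entry: how often each side of one branch was executed.
interface BranchProfile {
  thenCount: number;
  elseCount: number;
}
// Pick the more frequently executed side as the fall-through (not-taken) path.
function chooseFallThrough(p: BranchProfile): "then" | "else" {
  return p.thenCount >= p.elseCount ? "then" : "else";
}
// Fraction of executions that fall through with the chosen layout.
function fallThroughRate(p: BranchProfile): number {
  const total = p.thenCount + p.elseCount;
  return total === 0 ? 0 : Math.max(p.thenCount, p.elseCount) / total;
}
const profile: BranchProfile = { thenCount: 20, elseCount: 80 };
console.log(chooseFallThrough(profile), fallThroughRate(profile)); // else 0.8
```
With these illustrative counts, the layout lets 80% of executions run straight through without taking the branch.
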
15:22.680 --> 15:26.840
It then sees that maybe 80% of the time you have to go this way and 20% of the time you have to

15:26.840 --> 15:32.760
go that way. So it then takes that information and during compilation, it will make sure that

15:32.760 --> 15:39.160
the 80% branch is actually the path that falls through the branch. So in 80% of cases this branch

15:39.160 --> 15:48.600
does not have to be taken, you just fall through. This helps CPUs enormously. What also helps is to have a

15:48.600 --> 15:59.880
CPU that has larger branch predictors. Another reason

15:59.880 --> 16:05.560
for making sure that the fall-through path is the most prominent one, the

16:05.560 --> 16:10.520
one that is executed most of the time, is that when the CPU predicts that a branch

16:10.520 --> 16:17.160
will not be taken, it doesn't have to take up any space in the branch predictor's memory,

16:17.240 --> 16:24.760
in the branch predictor's caches. Branch predictor caches only store the taken branches,

16:24.760 --> 16:40.760
the branches that you have to follow. So jumping forward in both software and in hardware

16:40.760 --> 16:48.440
one generation to a CPU that is a little bit better in Pixel 9 now, and to Chrome that uses PGO

16:48.440 --> 16:54.440
optimizations, you can see that this bottleneck in the CPU moves. We are no longer as much front-end

16:54.440 --> 16:59.480
bound; instead we are now back-end bound. That means the instruction bottleneck, fetching

16:59.480 --> 17:06.840
the instructions and doing correct branch prediction, is a lot more optimized now. But what you can

17:06.840 --> 17:13.720
also still see is that the instructions per cycle, so the efficiency of executing this workload

17:13.720 --> 17:22.120
in the CPU is still lower when the front-end stalls are higher. It's still very important for us

17:22.120 --> 17:38.840
to continue optimizing in this area for the front end. So I mentioned the order file earlier,

17:38.840 --> 17:45.640
and the quick call-out here is that the order file improves on this in a separate way.

17:46.840 --> 17:51.240
The order file tries to make sure that, across different

17:51.240 --> 17:57.720
functions, we are pulling them together in memory, into a more

17:57.720 --> 18:06.680
contiguous space, when these functions are often executed in temporal proximity. So if

18:06.680 --> 18:10.280
function B is often executed after function A, you would want to make sure that function B is close to

18:10.280 --> 18:21.160
function A in the binary. This helps reduce pressure on the TLB. Again, improving a different

18:21.160 --> 18:30.600
bottleneck, but also still in the front end. And of course, we also now need to look into the

18:30.600 --> 18:36.040
back end stall a little bit more, what's causing all these back end stalls. I think our early

18:36.040 --> 18:43.880
insights here are also that these back end stalls are bound on cache hierarchy lookups in these CPUs.

18:44.760 --> 18:49.400
Most of the time when we are stalling in the back end, it's actually because we have to go beyond

18:50.280 --> 18:58.040
the L3 cache in the CPU, so beyond the caches in the CPU, into the caches outside the CPU or the DRAM itself.

18:59.880 --> 19:04.120
This means we need to understand why the cache hierarchy doesn't work for the back end

19:05.240 --> 19:12.920
accesses either. I guess there is likely something there that again scatters memory accesses in a way

19:12.920 --> 19:25.640
that the CPU finds hard to predict. So I wanted to take a stab at explaining a little bit

19:26.920 --> 19:32.440
how we might go about doing that, understanding why data accesses are scattered as well.

19:32.440 --> 19:38.760
And give you a little bit of an insight into what tooling we used to get these insights. I'll just

19:38.840 --> 19:48.520
show it to you. Most of these insights come from profiling, and I know many of you are probably

19:48.520 --> 19:55.720
familiar with DevTools in Chrome and its profiling options. As Chrome engineers we use another tool

19:56.600 --> 20:01.560
that gives us a little bit more insights into the browser's inner workings and that's

20:01.560 --> 20:08.760
performance tracing based on Perfetto. Perfetto gives us an attribution of all the execution

20:08.760 --> 20:17.000
in Chrome to browser tasks and also allows us to combine that information with system profiling

20:17.000 --> 20:26.200
data like scheduling information or other system counters. You might be familiar with the

20:26.200 --> 20:31.160
performance.mark and performance.measure APIs on the web as well. These allow us to

20:31.160 --> 20:38.040
annotate these workloads with user journey information. For Speedometer, what we did is we added

20:38.040 --> 20:44.200
instrumentation to annotate different sub-tests in speedometer to be able to then break down

20:45.320 --> 20:50.040
whether these all behave the same way, or whether there are specific tests in Speedometer that have different

20:50.040 --> 21:02.120
bottlenecks than others and we can then also bring in additional data into the traces on

21:02.120 --> 21:10.280
top of that. For example, we can bring in CPU PMU counters, performance counters with which the CPU

21:10.280 --> 21:15.080
tells us how many stall cycles there are at this moment in time, how many instructions

21:15.080 --> 21:24.440
did I execute, and how many cycles did that take me. We can also bring in call stack samples

21:25.560 --> 21:33.960
from Chrome and from JavaScript. Together all of this should allow us to effectively find

21:33.960 --> 21:41.240
functions in JavaScript or functions in the browser that have a very poor instruction throughput.

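NOTE
Editor's sketch, not part of the talk and not Perfetto's API. It shows the arithmetic behind "poor instruction throughput": divide instructions by cycles per function to get instructions per cycle (IPC), then look at the lowest values first. CounterSample, ipc, rankByIpc, the function names, and the counter values are all made up for illustration.
```typescript
// Hypothetical per-function counter samples (instructions retired, CPU cycles).
interface CounterSample {
  name: string;
  instructions: number;
  cycles: number;
}
// Instructions per cycle; lower values mean the CPU spent more time stalled.
function ipc(s: CounterSample): number {
  return s.cycles === 0 ? 0 : s.instructions / s.cycles;
}
// Return samples sorted from lowest to highest IPC without mutating the input.
function rankByIpc(samples: CounterSample[]): CounterSample[] {
  return [...samples].sort((a, b) => ipc(a) - ipc(b));
}
const ranked = rankByIpc([
  { name: "renderList", instructions: 4000, cycles: 2000 },
  { name: "lookupStyle", instructions: 1000, cycles: 4000 },
  { name: "appendItem", instructions: 3000, cycles: 3000 },
]);
console.log(ranked.map((s) => `${s.name}: ${ipc(s)}`).join(", "));
// lookupStyle: 0.25, appendItem: 1, renderList: 2
```
In this made-up data, lookupStyle retires only a quarter of an instruction per cycle, so it would be the first candidate to inspect for cache misses.
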
21:41.240 --> 21:47.560
For example, because they often miss in caches. So this is really tooling that allows

21:47.560 --> 21:53.000
us to go down from the big picture of "I'm executing Speedometer, it takes me 30 seconds" to

21:54.200 --> 22:01.640
this particular function this particular sub-test is executing instructions really really

22:01.720 --> 22:14.040
slowly, and you had better look into why. So if you want to go even further you can also

22:15.320 --> 22:20.360
let me skip this slide actually. If you want to go even further you can go down a level further

22:20.360 --> 22:27.000
and try to understand within these functions which instructions in these functions cause bottlenecks.

22:27.640 --> 22:32.760
So we can add in data from low-level sources on Arm chips that are called ETM and SPE.

22:33.480 --> 22:40.360
ETM is a tracing mechanism on Arm CPUs that allows you to get the whole instruction stream

22:40.360 --> 22:45.960
basically, and identify ranges in the instruction stream where instructions were taking a very

22:47.000 --> 22:55.640
long time or where there were hiccups. SPE is a

22:56.280 --> 23:02.920
statistical tool to do something very similar. It samples loads or branches, and it can

23:02.920 --> 23:07.800
try to identify branches that often miss in the cache or branches that were often mispredicted.

23:09.800 --> 23:18.680
Or loads that were often missing in caches. So for example in this screenshot there's a load here

23:19.400 --> 23:27.080
that took thousands of cycles to fulfill and that's because you had to go down to DRAM to

23:27.080 --> 23:33.320
fulfill it. Now with all the symbolization data we can actually go and understand which load is

23:33.320 --> 23:39.080
that, which instruction is that, and where in the JavaScript or in the page resources this is happening.

23:39.320 --> 23:56.280
And yeah. Speedometer is a good workload for us to do all of this with. But Speedometer is

23:56.280 --> 24:01.560
quite a synthetic workload and it really only helps us look at a small piece of what a

24:01.560 --> 24:09.240
browser has to do. And for that I've got a nice diagram that shows you which areas of the browser

24:09.240 --> 24:16.680
are exercised by Speedometer. Speedometer is this benchmark here, and I've contrasted this with

24:16.680 --> 24:22.200
page load and with scrolling here for now. In page load we see that there are a bunch of other

24:22.200 --> 24:29.240
browser components that are exercised. Page load also affects other pieces in the browser. For example,

24:29.320 --> 24:34.840
we have to utilize the network. We have to prepare requests, send them out, and get responses back.

24:35.560 --> 24:43.000
We have to parse a lot more of that content. We also have to do a lot more rendering during

24:43.000 --> 24:47.560
page load. There's a lot more rasterization of new resources than there is in Speedometer.

24:48.440 --> 24:53.240
And some pieces in Speedometer, while they are exercised, don't actually affect the score

24:53.240 --> 24:56.920
that is ultimately computed in Speedometer. So they are less relevant there as well.

24:57.320 --> 25:04.040
So overall for us that means that when we are talking about this kind of low-level data

25:04.680 --> 25:10.280
the low-level optimizations that we can attack together with partners and OEMs

25:11.080 --> 25:18.440
there's a lot of the browser that we need to have better coverage for in the lab.

25:20.200 --> 25:26.280
Chromium in the past has mainly focused on using field data to optimize for these use cases.

25:26.280 --> 25:34.520
That's why you see all these Web Vitals metrics like INP and FCP and LCP have so much prominence

25:34.520 --> 25:43.160
in the browser world. But when you want to look at instruction level bottlenecks you cannot use

25:44.280 --> 25:51.160
you can't get this data from the field. So we need a good workload to approximate page load

25:51.160 --> 25:58.200
and scrolling in the lab. And for that we set out to create one because really we didn't have

25:58.200 --> 26:04.680
one that was up to date in Chromium. We call it LoadLine, and Gemini helped us generate a

26:04.680 --> 26:15.240
logo for it. To talk a little bit about how we approach this problem we wanted to make sure

26:15.240 --> 26:22.520
that this was a maintainable workload. A benchmark like this in the past when Chromium has attempted

26:22.520 --> 26:30.280
this it often became irrelevant because it was very hard to update the pages that were part of

26:30.280 --> 26:39.480
the benchmark or the metrics used. So we made the choice to focus on a small workload that we could

26:39.480 --> 26:45.800
maintain and that we are happy to update in the future. So we limited ourselves to only five sites

26:46.600 --> 26:53.000
and chose those based on product needs but also based on performance characteristics of those sites.

26:53.000 --> 26:58.680
We wanted a little bit of coverage of both fast websites and slow websites and websites that

26:58.680 --> 27:04.840
exercise the JavaScript engine very heavily and websites that exercise the layout engine a lot more

27:04.840 --> 27:12.280
heavily and so on. We did this by analyzing a bunch of more popular websites and then selecting

27:12.280 --> 27:17.720
ones that had different characteristics in a way to maximize coverage across a bunch of

27:17.720 --> 27:26.440
dimensions. We also noticed that many of these metrics that we use in the field, like FCP or LCP,

27:26.840 --> 27:31.080
they don't work that well when you try to apply them only to five websites in the lab.

27:31.160 --> 27:43.880
So instead we utilized the fact that we've chosen only five sites by building site specific

27:43.880 --> 27:49.320
metrics that utilize some knowledge of, well, in this case it actually matters when this element is

27:49.320 --> 28:00.760
shown or when this element becomes interactable. Another aspect of utilizing this custom

28:00.760 --> 28:05.000
instrumentation custom metrics is that we can build metrics that actually behave well

28:05.960 --> 28:11.880
in terms of statistical properties in the lab. If you have metrics that are bimodal, for example,

28:11.880 --> 28:20.360
that's really problematic for a lab benchmark because it introduces a lot of noise. We also have to make

28:20.360 --> 28:25.240
sure that when we measure today and we measure tomorrow, we get somewhat consistent results.

28:26.120 --> 28:30.920
So while we were using real websites we had to make sure that they kind of stay fixed in one

28:30.920 --> 28:37.720
point in time, and for that we are using a tool called Web Page Replay, or WPR, that takes a recording

28:37.720 --> 28:44.680
of a website and then later replays that. It's not perfect, right, but what we are really looking for

28:44.680 --> 28:48.520
here is a workload that we can utilize that is reasonably relevant.

28:49.080 --> 29:01.160
To give you an example this is a page load in one of our pages where we are looking at LCP

29:02.120 --> 29:10.600
and right before LCP begins or right before LCP is finished we have a very long script execution

29:12.120 --> 29:16.680
and the browser is nondeterministic: sometimes that script execution happens before that paint,

29:16.760 --> 29:21.000
sometimes it happens after the paint. This creates a bimodality in the metric.

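NOTE
Editor's sketch, not part of the talk and not the benchmark's actual metric code. It is a toy model of the effect described here: if a long script sometimes runs before the paint and sometimes after it, a paint-based metric clusters around two different values. metricMs and the durations (1000 ms paint, 400 ms script) are hypothetical.
```typescript
// Hypothetical model: the metric is the paint time, plus the script duration
// whenever the long script happens to run before the paint.
function metricMs(paintMs: number, scriptMs: number, scriptFirst: boolean): number {
  return scriptFirst ? paintMs + scriptMs : paintMs;
}
// Alternate the ordering across runs to mimic the nondeterministic scheduling.
const runs: number[] = [];
for (let i = 0; i < 6; i++) {
  runs.push(metricMs(1000, 400, i % 2 === 0));
}
// Collapse the runs to their distinct values to expose the two modes.
const distinct = Array.from(new Set(runs)).sort((a, b) => a - b);
console.log(distinct.join(" ")); // 1000 1400
```
Averaging such runs would report a value that no single run ever produced, which is one reason a bimodal metric is awkward in a lab benchmark.
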
29:21.720 --> 29:30.520
So you see a little bit of a bump around an earlier page load time and a little

29:30.520 --> 29:36.680
bit of a bump at a later page load time. That's something we can't have, so instead we had to choose

29:36.680 --> 29:46.200
a better moment in time for the metric for this page. Another example: LCP doesn't always

29:46.280 --> 29:52.600
track a moment that's actually really relevant. This is a page load of a CNN page. It turns out

29:52.600 --> 29:59.160
that LCP happens when this image is shown but from an end user perspective that's not really

29:59.160 --> 30:03.320
relevant. That's only the image; there's no text. I can't really look at this page yet.

30:04.360 --> 30:08.520
I also can't really scroll it yet. I can't interact with any items on the page yet.

30:08.520 --> 30:12.520
So instead we built a metric for this particular page that waits for

30:12.520 --> 30:19.800
these main pieces of the content to be loaded and for that content to start to be interactable.

30:21.240 --> 30:26.680
That's kind of how we approach these things and then you end up with a set of pages.

30:27.720 --> 30:34.280
For us, this is the initial set for the first version of this benchmark that we're

30:34.360 --> 30:44.360
utilizing. We have one configuration for phones. That's mainly striving to create good coverage

30:44.360 --> 30:53.400
over a wide variety of websites, but website types that are somewhat popular

30:53.400 --> 31:01.800
on the wider, grand scale of things. So you have fast pages like Wikipedia. You have pages that

31:01.880 --> 31:09.000
are really slow, like a news article. You have pages that are more on the average side, like a product

31:09.000 --> 31:15.480
page on Amazon. And for each of those pages we develop metrics that either wait for a piece of

31:15.480 --> 31:20.440
the content to be ready or maybe in some cases actually we can use LCP because it does the right

31:20.440 --> 31:28.280
thing for this page. We also looked at tablets; tablets are up and coming.

31:28.840 --> 31:34.680
On tablets the use cases that are relevant for browser performance are a little bit different.

31:34.680 --> 31:42.680
There's a lot more focus on productivity and challenging things. Also from a competitive standpoint.

31:44.440 --> 31:49.880
And so we chose a slightly different set of websites here, skewed towards larger content,

31:50.520 --> 31:52.120
more challenging content for the browser.

31:52.120 --> 32:02.280
This at the moment is an internal benchmark for Chromium engineers, really. It's also built primarily

32:02.280 --> 32:09.560
for Android not so much for desktop and it only covers the fundamental browser performance. It doesn't

32:09.560 --> 32:14.680
cover absolutely every browser feature. It doesn't cover the networking part very well in the browser

32:14.680 --> 32:21.160
given that we are replaying network responses as opposed to using a live server that behaves

32:21.240 --> 32:28.040
in quite different ways. So there's a lot that you have to take into account when

32:28.040 --> 32:36.600
utilizing this, but it is available and it's quite easy to run. So give it a try if you're

32:36.600 --> 32:42.600
interested. With that let me open it up for questions.

32:52.120 --> 33:00.280
Hi, I was just curious if you could explain why the CNN page didn't show the text initially.

33:00.280 --> 33:06.920
Is it that they lazy-load the text itself, or do they have some font problem, or why the metric doesn't

33:06.920 --> 33:11.640
work for CNN? I don't understand why the metric doesn't work, but how come the image appears

33:11.640 --> 33:19.400
before the text. I don't know. It's something that a CNN engineer might want to look at.

33:19.880 --> 33:23.960
Okay, I want to look at one of the quick questions. Do you have any recommendations for

33:23.960 --> 33:29.880
framework developers to increase the chances of branch predictions being correct?

33:29.880 --> 33:34.760
Like, there are sort of mechanisms in C++ where you can kind of annotate to guide the CPU, but

33:34.760 --> 33:39.400
it's not going to look like that in JavaScript. Are there some kind of precedents or certain

33:39.400 --> 33:44.040
patterns? It's a very good question and the way that I would answer it is that that's really a

33:44.120 --> 33:50.680
problem for the JavaScript engine to deal with probably. I mean, yes, you can try to reduce branches

33:50.680 --> 33:56.040
even in JavaScript code, right? You might want to avoid conditions on the hot path, if you can.

33:56.840 --> 34:03.480
But primarily, the JavaScript engine should take care of all of this for you because

34:04.120 --> 34:09.800
the JavaScript engine, it doesn't have profile-guided optimization as we use it for native code,

34:10.520 --> 34:15.480
but it does have all the runtime JIT information. So it effectively builds up the same

34:16.440 --> 34:21.000
data: which branches are often taken, which branches are not often taken,

34:21.000 --> 34:25.320
which functions are hot, which functions are cold. It has all of this data so that it's able to

34:25.320 --> 34:33.160
optimize very hot code in higher compiler tiers. So it should already be doing these optimizations

34:33.160 --> 34:39.320
to a degree to make sure that the hot path in a branch is the one that is the fall through.

34:39.480 --> 34:45.560
For example, we have to remove branches from the hot path, but that doesn't really include

34:45.560 --> 34:51.880
the time spent executing a function before it is optimized

34:51.880 --> 35:00.920
by the JIT. I think I know why the image arrives first: because they're optimized for LCP.

35:02.520 --> 35:08.440
Yes, it's a good idea to take away all the content for LCP, right? So yeah, it's another aspect

35:08.520 --> 35:11.560
of these metrics, that developers start to game them.

35:15.080 --> 35:19.320
I do have a question. I've seen you mention both speedometer 2 and 3.

35:19.960 --> 35:25.720
Is that because you waited for the 3 to be released before doing the work on that or how did you

35:25.720 --> 35:30.600
work with speedometer while it was being developed? I worked on that, that's why it's so interesting.

35:31.240 --> 35:36.520
Yes, I have some data that is from speedometer 3. I have some data that is from speedometer 2.

35:36.600 --> 35:42.440
That's mainly because over this time period, at the beginning of this graph, we only had speedometer

35:42.440 --> 35:47.400
2 available to us. At the end towards the end of this graph, we had speedometer 3 available.

35:47.400 --> 35:53.880
So we switched eventually to tracking the newer benchmark. Speedometer 3 is not significantly

35:53.880 --> 35:58.840
different from speedometer 2 from the low-level perspective of what works well on the CPU versus

35:58.840 --> 36:04.200
what doesn't. So a lot of the workloads in speedometer 3 are the same ones as in speedometer 2,

36:04.280 --> 36:08.920
just slightly updated from the framework perspective.

36:10.120 --> 36:16.120
And the newer workloads are probably a little bit more stressful for a device overall.

36:17.000 --> 36:24.360
It's slightly larger workloads. Maybe a little bit more GPU work. So there's a little bit more

36:24.360 --> 36:31.240
there for us to look at now. But yeah, overall they don't look too different.

36:34.200 --> 36:46.840
For LoadLine, you talked about these custom metrics. Can you say a little bit more about

36:46.840 --> 36:55.480
what they are and if or how that might scale for more than the five sites.

36:55.480 --> 37:02.040
Yeah, it doesn't scale to more sites. That is very clear to us. At the moment, they really

37:02.040 --> 37:06.120
wait for specific elements on the page. For the pages where we need the custom metric,

37:06.120 --> 37:10.760
like for CNN, we make sure that the headline element is there too. On some other pages,

37:10.760 --> 37:15.160
we might interact with an element via JavaScript and then measure the time taken up to that point.

37:15.160 --> 37:19.960
For example, we wait until the menu icon appears and we'll click on the menu icon and we'll

37:19.960 --> 37:25.960
wait until the menu appears. That's a proxy for us to be able to say that actually the

37:25.960 --> 37:31.640
content is all there and you can interact with it. It's not like you've shown the content,

37:31.640 --> 37:36.120
but the big JavaScript that has to make the button interactive hasn't run yet. So you can't

37:36.120 --> 37:45.640
actually do anything on the page. That's not something we can, at least not without a lot of

37:45.640 --> 38:08.520
work, generalize to any website. How many runs of tests do you do on a single website in order

38:08.520 --> 38:20.760
to get stable results? About a hundred. But there's work in progress to try and reduce that.

38:22.680 --> 38:27.960
The goal that we set out for initially was to be able to detect like a 1% difference in score

38:27.960 --> 38:35.240
and time taken to render a website in one hour. Running this benchmark a hundred times on each

38:35.240 --> 38:42.120
of these pages takes about an hour at the moment. We are roughly there for this goal,

38:42.120 --> 38:49.240
I think there's opportunity for us to optimize this. Most of the reason why it takes so long

38:49.240 --> 38:56.440
to run is that, even though the page load maybe takes about 500 milliseconds for a medium site,

38:57.480 --> 39:04.360
we have to tear down the browser and bring it all back up in between each iteration to make

39:04.360 --> 39:09.560
sure that things like caching and process creation, etc., is all taken into account correctly.

39:11.080 --> 39:19.240
But there's work underway to try and identify how much of that can be emulated rather than having

39:19.320 --> 39:22.440
to re-initialize everything from scratch.

39:38.040 --> 39:44.920
Do you have tests related to late CSS loading? Because sometimes the CSS comes in

39:45.400 --> 39:50.280
and after that the whole website has to be re-styled.

39:51.960 --> 39:55.960
I think on an e-commerce website that can cause quite some issues.

39:56.520 --> 40:02.680
It's a good point. I don't think we've covered that particular case. At least not from what I've seen.

40:02.680 --> 40:08.760
It's possible that one of these sites does have that characteristic, but I'm not sure.

40:15.720 --> 40:21.720
All right. Thank you very much.

40:23.240 --> 40:23.720
Thank you.

