It would be better if VM comparisons included JSC rather than, or in addition to, V8. JSC tends to outperform V8 so if you find a pathology in V8 it’s just not so surprising. It would be more interesting if you found a pathology in JSC.
I think that the use of small benchmarks obscures what’s going on. The VM is trying to win in the average. It’s like a professional gambler. Observing that the VM did something dumb for a program is like observing that a professional gambler lost a bet. That’s not interesting. In a game of chance, even a really great strategy will have its outliers.
I think that to understand the quality of a VM you have to throw millions of lines of code at it and see if the optimizing JIT can consistently produce speedups, or at least produce speedups more often than not using some aggregate metric. As someone who studies the behavior of JSC on million-line code bases, I can tell you that a pretty good outcome is if only a small number of functions experience an “upside down” effect from optimization and end up running slower over time.
Finally, the whole search for a methodology to pinpoint warmup is broken. It’s pure brain damage. VMs need to be fast even for small programs that don’t have a chance to warm up. Startup time is absolutely important. So it’s a methodological antipattern to even try to find the warmup.
The questions worth asking are:
- for some program, how long does it take to run that program. Start to finish. No ignoring warmup.
- how long does it take to run some very long program or the average running time of a small program averaged over many iterations
- some percentile of behavior, like the 99th, to get an average of the janky behavior.
Ideally you measure all of those things and include both short running and long running programs.
This tells you how good a VM is.
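To make those three measurements concrete, here is a minimal sketch of a harness that reports all of them from one run. The names `measure`, `percentile`, and `workload` are made up for illustration and are not from JSC or any benchmark suite; it assumes a runtime with a global `performance.now()` (browsers, recent node.js).

```javascript
// Hypothetical harness; all names here are illustrative assumptions.

// Nearest-rank percentile over an ascending-sorted array of samples.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Runs `workload` `iterations` times and keeps every sample,
// including the cold early ones. Nothing is discarded as "warmup".
function measure(workload, iterations) {
  const samples = [];
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    workload();
    samples.push(performance.now() - t0);
  }
  const total = performance.now() - start; // start to finish
  const sorted = [...samples].sort((a, b) => a - b);
  return {
    total,                                                     // whole-run time
    mean: samples.reduce((a, b) => a + b, 0) / samples.length, // average iteration
    p99: percentile(sorted, 99),                               // the janky tail
    first: samples[0],                                         // coldest iteration
  };
}
```

Calling something like `measure(() => JSON.parse("[1,2,3]"), 1000)` gives one object with the total, the mean, and the tail side by side, so slow early iterations count against the VM instead of being filtered out.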
If you’re doing math or methodology to identify the warmup point then you’re effectively biasing your experiment to forgive VMs for bad behavior so long as that bad behavior happens early. Nothing could be sillier. Users care about the perf of their VMs at startup not just in steady state.
Anyway, that’s the way I like to do optimizations in JSC.
> - for some program, how long does it take to run that program. Start to finish. No ignoring warmup.
This methodology likely comes from Java, which has long-running server applications. "How long does it run" is often "until someone hits ^C". Here, startup cost can be slow as long as the peak performance is fine. It's accepted that the first minute or two of the server are slow, but that's small compared to the month or so that the server will be running for.
> This tells you how good a VM is.
I think papers like this approach it from the wrong angle. I don't care about the VM's theoretical peak performance. I care about being able to measure and track performance in a reliable way. Put simply, I'm fine with bad codegen as long as I can consistently measure it. Feel free to improve it, but adding something that sometimes gives me good codegen, unreliably, is much more frustrating than bad codegen. But this seems to be the way the VMs are going, with things like probabilistic profiling.
If I refactor my code and replace for(let i = 0; i < L.length; i++) with for(const i of L), what's the cost? Will performance go up or down? We don't have tools or metrics to handle that right now. How can I ensure my codegen is good and won't regress?
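For what it's worth, the crude version of that check today is a hand-rolled comparison along these lines. This is only a sketch: `sumIndexed`, `sumForOf`, and `timeIt` are made-up names, the array size and run count are arbitrary, and the timings will vary between runs and engine versions. Note that for...of yields the elements rather than the indices, so the loop body has to change as well.

```javascript
// Hypothetical micro-comparison; names and sizes are illustrative assumptions.

function sumIndexed(L) {
  let s = 0;
  for (let i = 0; i < L.length; i++) s += L[i];
  return s;
}

function sumForOf(L) {
  let s = 0;
  for (const x of L) s += x; // x is the element, not the index
  return s;
}

// Times `runs` calls of fn(arg), start to finish, and returns the
// accumulated result so the engine cannot discard the work.
function timeIt(fn, arg, runs) {
  const start = performance.now();
  let sink = 0;
  for (let r = 0; r < runs; r++) sink += fn(arg);
  return { ms: performance.now() - start, sink };
}

const data = Array.from({ length: 100000 }, (_, i) => i);
const indexed = timeIt(sumIndexed, data, 200);
const forOf = timeIt(sumForOf, data, 200);
console.log(`indexed: ${indexed.ms.toFixed(2)} ms, for-of: ${forOf.ms.toFixed(2)} ms`);
```

A single pair of numbers like this says nothing about whether the result is stable, which is the point: there is no way to pin the outcome down or be warned when it regresses.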
I work on a particularly demanding website in my free time ( https://noclip.website/#smg/AstroGalaxy , unfortunately won't run in WebKit due to missing WebGL 2 ), and performance varies drastically from Chrome release to release, and I do extensive testing with node.js to make sure that I'm getting good codegen.
I know that the warmup skipping comes from Java. It was a mistake there. Saying that it’s because Java is for servers is a lame excuse and may be getting it backwards - maybe Java only succeeded on servers because all the tuning ignored warmup.
I hear ya that having tools would be great - but the best speedups do come about from probabilistic methods so it would be weird to rely on whatever a profiler told you.
This blog post, and your reply, both touch on how difficult it is to measure performance. But then in the same reply where you point out that there are many different valid ways you could measure performance, you also make the broad claim that "JSC tends to outperform V8". What are you basing that on?
These days we use JetStream 2 (our design) and Speedometer 2 (collaborative design between WK and Chromium folks) as the main big benchmarks, but they’re not the only things we measure and tune.
V8 used to have their own JS benchmark, Octane, but they retired it at about the same time as we beat them on it. So JSC is fast enough to make other people retire their benchmarks.