
Analysis v0.9.26

We investigated the benchmark scores of polkadot version 0.9.26 on a recent CPU (Intel i7-12700) with an NVMe drive. All possible variations of the optimization options explained here were tested. This resulted in 36 different builds which, together with the official polkadot binary and docker image, are listed below:

nb_build arch toolchain codegen lto_ldd profile
10 none stable False False release
11 none stable False True release
12 none stable True False release
13 none stable True True release
14 skylake stable False False release
15 skylake stable False True release
16 skylake stable True False release
17 skylake stable True True release
18 alderlake stable False False release
19 alderlake stable False True release
20 alderlake stable True False release
21 alderlake stable True True release
22 none nightly False False release
23 none nightly False True release
24 none nightly True False release
25 none nightly True True release
26 skylake nightly False False release
27 skylake nightly False True release
28 skylake nightly True False release
29 skylake nightly True True release
30 alderlake nightly False False release
31 alderlake nightly False True release
32 alderlake nightly True False release
33 alderlake nightly True True release
34 none stable False False production
35 none stable True False production
36 skylake stable False False production
37 skylake stable True False production
38 alderlake stable False False production
39 alderlake stable True False production
40 none nightly False False production
41 none nightly True False production
42 skylake nightly False False production
43 skylake nightly True False production
44 alderlake nightly False False production
45 alderlake nightly True False production
official none nightly True False release
docker none nightly True False release

Rust versions used were stable=1.62.1 (e092d0b6b 2022-07-16) and nightly=1.65.0-nightly (d394408fb 2022-08-07).

For each build, we repeated the following benchmark 20 times:

polkadot benchmark machine --disk-duration 30

We checked that total CPU utilization before and after each test was negligible (< 1%) to make sure that the benchmark was not disturbed by competing CPU tasks.
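
To automate the 20 repetitions, a small wrapper script can run the benchmark and store the raw output for later analysis. The sketch below is one possible way to do this in Python; the file names and the psutil idle check are our own additions, not part of the polkadot tooling.

import subprocess
import time

import psutil  # third-party library; used here only to confirm the machine is idle

RUNS = 20
CMD = ["polkadot", "benchmark", "machine", "--disk-duration", "30"]

for run in range(RUNS):
    # Check that no competing tasks are using the CPU before the run starts.
    busy = psutil.cpu_percent(interval=1.0)
    if busy > 1.0:
        print(f"warning: CPU utilization is {busy:.1f}% before run {run}")

    # Run the benchmark and keep its raw output for later analysis.
    result = subprocess.run(CMD, capture_output=True, text=True, check=True)
    with open(f"benchmark_run_{run:02d}.txt", "w") as out:
        out.write(result.stdout)

    time.sleep(5)  # short pause between runs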

You can repeat these experiments (on your machine) by using the source files on our GitHub page.
Starting with polkadot version 0.9.27, there is a new benchmark for timing the execution speed of an extrinsic. We will include this score in future analyses.

All comparisons

Below we plot the scores for each build as a box plot. The red line indicates the median, and the box spans from the first to the third quartile of the scores. Outliers are indicated with circles.
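
As an aside, a box plot of this kind can be produced with matplotlib along the following lines. This is only an illustrative sketch with placeholder data, not the script behind the figures.

import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: 20 scores per build (the real scores come from the benchmark runs).
rng = np.random.default_rng(0)
scores_per_build = [rng.normal(loc=1.0, scale=0.02, size=20) for _ in range(5)]
build_names = ["10", "11", "12", "13", "14"]

fig, ax = plt.subplots()
# Red median line, boxes spanning the first to the third quartile, circles for outliers.
ax.boxplot(scores_per_build, medianprops=dict(color="red"), flierprops=dict(marker="o"))
ax.set_xticks(range(1, len(build_names) + 1), build_names)
ax.set_xlabel("build")
ax.set_ylabel("score")
plt.show()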

We first compare the CPU scores. Clear differences are visible, but there is no single winning build. We investigate good build options below.

CPU scores

Regarding disk scores, all builds behave very similarly except the docker image, which is worse for random write.

Disk scores

For the memory score, the situation is similar to random write: only the docker image underperformed.

Memory scores

Preliminary conclusions

A few conclusions are easy to draw:

  • The official polkadot binary is somewhat optimized but there is a lot of room for improvement.
  • The official docker image is about as fast as the official polkadot binary, except for random write and copy, where it incurs a relative penalty of 6% and 13%, respectively.
  • Compiling for a specific CPU can improve the CPU scores by about 6%.
  • Compiling without any optimization options at all gives worse CPU scores.
  • For the disk and memory scores, the differences were not statistically significant: in the box plots above, the median of one build is almost always included in the boxes of the other builds. (A sketch of this overlap check follows the list.)
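
The overlap check from the last point can be made concrete with a few lines of Python. The samples below are placeholders, not the measured scores:

import numpy as np

# Placeholder score samples (20 runs each) for two hypothetical builds.
rng = np.random.default_rng(1)
build_a = rng.normal(loc=1.000, scale=0.01, size=20)
build_b = rng.normal(loc=1.002, scale=0.01, size=20)

# Overlap criterion: is the median of one build inside the
# first-to-third-quartile box of the other build?
median_a = np.median(build_a)
q1_b, q3_b = np.percentile(build_b, [25, 75])
print(q1_b <= median_a <= q3_b)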

Finding good optimization options

The optimization options had a clear influence on the CPU scores (but not on the disk and memory scores). Let us therefore plot both CPU scores against each other and label a few interesting builds.

CPU scores

Determining a winning build is not so easy:

  • There are two CPU scores. Ideally, we want to find the best build for both scores, but from the figure we see that this is not possible: number 45 wins in SR25519-Verify, while number 33 wins BLAKE2-256.
  • As a compromise, we look for a build that has good scores in both CPU benchmarks. One way would be to simply add the two scores, but they have different scales. A general practice for dealing with more than one objective is to look for solutions (here, builds) on the Pareto front: a build is Pareto efficient if no other build is better in both SR25519-Verify and BLAKE2-256. In our case, the Pareto efficient builds are numbers 45, 32, and 33; they can be found in the top right corner of the figure. (A small sketch of this computation follows the list.)
  • Two builds, like 39 and 45, that have very similar scores are not that different in practice. The fact that one is better than the other is probably due to chance. (This can be made more precise by estimating the error on the score which is related to the width of the box plots in the figures above. We will not do this for now.) We therefore also include builds that are very close to the Pareto efficient ones.
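
To make the Pareto criterion concrete, the efficient builds can be extracted from the per-build median scores with a few lines of Python, as in the minimal sketch below; the score values are placeholders, not the measured results.

# Median scores per build as (SR25519-Verify, BLAKE2-256); the numbers are placeholders.
scores = {
    "10": (0.95, 0.98),
    "32": (1.02, 1.05),
    "33": (1.01, 1.06),
    "45": (1.03, 1.04),
}

def dominates(a, b):
    """a dominates b if a is at least as good in both scores and strictly better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

# A build is Pareto efficient if no other build dominates it.
pareto = [name for name, own in scores.items()
          if not any(dominates(other, own) for other in scores.values())]
print(sorted(pareto))  # with these placeholder values: ['32', '33', '45']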

In the figure, we labeled the builds that can all be considered very good since they are Pareto efficient or very close to it. This gives us the following winning optimization options:

nb_build arch toolchain codegen lto_ldd profile
17 skylake stable True True release
20 alderlake stable True False release
21 alderlake stable True True release
32 alderlake nightly True False release
33 alderlake nightly True True release
39 alderlake stable True False production
45 alderlake nightly True False production

We see some clear patterns:

  • Compiling for the most specific CPU architecture (alderlake for our node) is clearly a good idea.
  • Enabling the codegen option (that is, compiling with codegen-units=1) is always good.
  • For the release profile, compiling with LTO is not needed but not harmful either. Note that the production profile is always compiled with LTO.
  • Choosing stable or nightly toolchain does not seem to matter that much.

For completeness, we check the good candidate builds again. Except for one bad outlier in build number 33, there is not much more information here: all builds have roughly the same spread. (A build with a high median score but a large spread would not be desirable because of its unpredictable performance.)

CPU scores

Winning optimization options

Based on the above analysis, we recommend choosing build option 32. This corresponds to

rustup override set nightly
export RUSTFLAGS="-C target-cpu=native -C codegen-units=1"
cargo build --release --target=x86_64-unknown-linux-gnu --locked -Z unstable-options

These options result in very little spread and almost the best CPU scores. The build time is only 15 minutes on a modern machine, compared to almost double that for LTO builds. (An interesting mathematical property is that build 32 is better than the convex combinations of builds 45 and 33, i.e., it lies above the line segment connecting them in the score plot.)

The option target-cpu=native selects the best CPU optimizations for the CPU that runs the compiler. If you want to compile for a different CPU, you need to specify its architecture explicitly (for example, -C target-cpu=skylake); running rustc --print target-cpus lists the supported names.
For a different CPU (Intel or AMD), the optimal build could also be different.