Math Crypto

Analysis v0.9.27

We investigated the benchmark scores on a recent CPU (Intel i7-12700) with an NVMe drive for polkadot version 0.9.27. Compared to our previous analysis, there were a few changes to the compilation procedure:

  • Options are specified in the Cargo.toml file (as an overwritten production profile).
  • We can reliably use lto=off, thin, or fat.
  • Added opt-level=2 or 3 as option.
  • The codegen-units option is now 1 or 16.
  • No more production or release profile, since we have modified our own profile.
  • Only the native architecture is compiled (here, alderlake) or no architecture.
The maximum code optimization is opt-level=3. As suggested here, lowering this value to opt-level=2 might produce better results. This is counterintuitive but, as we will see, it does sometimes lead to better results. Similarly, lto=fat, while expensive to build, does not always lead to the fastest code; see here.
The same analysis explained on this page was also performed on a Ryzen 5 3600. Even though this is a less powerful processor, compiling yourself brings tremendous improvements (20% and 10%, resp.) for the SR25519-Verify score and the timing of the Remark Extrinsic. See the notebook for details.

The result is 50 different builds (including the official polkadot binary and docker image), listed below:

nb_build arch toolchain codegen-units lto opt-level
0 none stable 1 off 2
1 none stable 1 off 3
10 none stable 16 thin 2
11 none stable 16 thin 3
12 alderlake stable 1 off 2
13 alderlake stable 1 off 3
14 alderlake stable 1 fat 2
15 alderlake stable 1 fat 3
16 alderlake stable 1 thin 2
17 alderlake stable 1 thin 3
18 alderlake stable 16 off 2
19 alderlake stable 16 off 3
2 none stable 1 fat 2
20 alderlake stable 16 fat 2
21 alderlake stable 16 fat 3
22 alderlake stable 16 thin 2
23 alderlake stable 16 thin 3
24 none nightly 1 off 2
25 none nightly 1 off 3
26 none nightly 1 fat 2
27 none nightly 1 fat 3
28 none nightly 1 thin 2
29 none nightly 1 thin 3
3 none stable 1 fat 3
30 none nightly 16 off 2
31 none nightly 16 off 3
32 none nightly 16 fat 2
33 none nightly 16 fat 3
34 none nightly 16 thin 2
35 none nightly 16 thin 3
36 alderlake nightly 1 off 2
37 alderlake nightly 1 off 3
38 alderlake nightly 1 fat 2
39 alderlake nightly 1 fat 3
4 none stable 1 thin 2
40 alderlake nightly 1 thin 2
41 alderlake nightly 1 thin 3
42 alderlake nightly 16 off 2
43 alderlake nightly 16 off 3
44 alderlake nightly 16 fat 2
45 alderlake nightly 16 fat 3
46 alderlake nightly 16 thin 2
47 alderlake nightly 16 thin 3
5 none stable 1 thin 3
6 none stable 16 off 2
7 none stable 16 off 3
8 none stable 16 fat 2
9 none stable 16 fat 3
official none nightly 16 thin-local 3
docker none nightly 16 thin-local 3
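The 48 numbered builds in the table are the full combination of the five options. As a minimal sketch (not the authors' build script; the constant and function names are hypothetical), the numbering can be reproduced by enumerating the Cartesian product, assuming the toolchain varies slowest and opt-level fastest, which is what the table suggests:

```python
from itertools import product

# Option values taken from the table above; the ordering is inferred from the numbering.
TOOLCHAINS = ["stable", "nightly"]
ARCHS = ["none", "alderlake"]
CODEGEN_UNITS = [1, 16]
LTOS = ["off", "fat", "thin"]
OPT_LEVELS = [2, 3]


def build_matrix():
    """Return one dict per build, numbered in enumeration order."""
    combos = product(TOOLCHAINS, ARCHS, CODEGEN_UNITS, LTOS, OPT_LEVELS)
    return [
        {
            "nb_build": i,
            "arch": arch,
            "toolchain": toolchain,
            "codegen-units": cgu,
            "lto": lto,
            "opt-level": opt,
        }
        for i, (toolchain, arch, cgu, lto, opt) in enumerate(combos)
    ]


builds = build_matrix()
```

Adding the official binary and the docker image to these 48 combinations gives the 50 builds that were benchmarked.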

Rust versions used were stable=1.62.1 and nightly=1.65.0-nightly (2befdefdd 2022-08-06).

For each build, we repeated the following benchmark 20 times:

polkadot benchmark machine --disk-duration 30

In addition, compared to the previous analysis, we added the execution speed of the remark extrinsic as extra score:

polkadot benchmark extrinsic --pallet system --extrinsic remark --chain polkadot-dev

This test was repeated 4 times (since it already has its own set of repetitions with each call).

We checked that total CPU utilization before and after each test was negligible (< 1%), to make sure that the benchmark was not disturbed by competing CPU tasks.

You can repeat these experiments (on your machine) by using the source files on our GitHub page.

All comparisons

Below we plot the scores for each build in a box plot. The red line indicates the median and the box spans from the first to the third quartile of the scores. Outliers are indicated with circles.
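To make the box plot elements concrete, here is a small sketch of the statistics behind one box. It assumes the common convention that a point is an outlier when it lies more than 1.5 times the interquartile range outside the box; the article does not state which rule its plots use. The function name and the sample scores are made up for illustration.

```python
import statistics


def box_stats(scores):
    """Median, quartiles and outliers of one build's repeated scores."""
    # 'inclusive' interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier rule
    outliers = [s for s in scores if s < low or s > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical scores for a single build (not measured data).
sample = [1000, 1002, 1001, 1003, 1002, 1001, 1050]
stats = box_stats(sample)
```

In this made-up sample, the single run at 1050 would be drawn as a circle, while the box itself stays narrow.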

We first compare CPU scores. There are clear differences visible but there is no one winning build. We will investigate good build options below.

CPU scores

Regarding disk scores, all builds behave very similarly except the docker image, which is worse for random write.

Disk scores

For the memory score, the situation is similar to random write (only docker underperforms).

Memory scores

Finally, the extrinsic timing is similar to the CPU scores, except that lower is better here!

Timing extrinsic

Preliminary conclusions

Same conclusions as for v0.9.26:

  • Optimization has little impact on disk and memory scores (except for docker).
  • Docker is still penalized here for copy and rnd write.
  • Optimizing has a potentially large influence on CPU scores and on the timing of an extrinsic.

New information:

  • There are 8 builds (30, 6, 31, 7, 19, 18, 43, 42) that have very bad CPU scores.

Maybe surprisingly, the worst builds do not all simply have opt-level = 2. It seems that codegen-units=16 and lto=off is the reason, regardless of opt-level.

nb_build arch toolchain codegen-units lto opt-level
18 alderlake stable 16 off 2
19 alderlake stable 16 off 3
30 none nightly 16 off 2
31 none nightly 16 off 3
42 alderlake nightly 16 off 2
43 alderlake nightly 16 off 3
6 none stable 16 off 2
7 none stable 16 off 3
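As a quick cross-check of this observation, the sketch below regenerates the build numbering (assuming the same enumeration order as the build table: toolchain slowest, opt-level fastest) and selects every build with codegen-units=16 and lto=off. The variable names are hypothetical.

```python
from itertools import product

# Same option order as the build table: toolchain, arch, codegen-units, lto, opt-level.
options = product(
    ["stable", "nightly"],
    ["none", "alderlake"],
    [1, 16],
    ["off", "fat", "thin"],
    [2, 3],
)

# Keep the build numbers whose options match the suspected bad combination.
suspect = sorted(
    nb
    for nb, (_toolchain, _arch, cgu, lto, _opt) in enumerate(options)
    if cgu == 16 and lto == "off"
)
```

The selection yields exactly the eight builds listed above, for both toolchains, both architectures, and both opt-levels.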

Finding good optimization options

We will now find build options that have good performance for the two CPU scores and the time for an extrinsic.

Since there is not one build that wins in all three scores, we identify the Pareto efficient builds. With only two scores, the Pareto front can be determined by hand on a scatter plot. Since our test has three scores, we compute these points algorithmically instead; details are in the Python notebook. Due to statistical errors on the scores, we also find all builds that are close to these Pareto efficient builds. To that end, we define a box around each score with width equal to its statistical error. A build is included if its box overlaps with that of a Pareto efficient build. Again, details are in the Python notebook.
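The idea can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the box around a score is taken to be centered on it, and the error values are invented. The scores are the seven builds from the summary table further down this page (BLAKE2-256 and SR25519-Verify: higher is better; Extr-Remark: lower is better).

```python
# (BLAKE2-256, SR25519-Verify, Extr-Remark) from the summary table on this page.
SCORES = {
    "15": (1400, 1010.2, 54660),
    "21": (1400, 1020.1, 55021.5),
    "38": (1400, 1020, 54510.5),
    "40": (1410, 1005, 54947),
    "45": (1400, 1022.9, 54285.5),
    "official": (1340, 1001.4, 55277.5),
    "docker": (1340, 1002.6, 55372),
}
# Hypothetical box widths per score, identical for every build (not measured).
ERRORS = {name: (10.0, 5.0, 300.0) for name in SCORES}


def to_maximize(score):
    """Negate the timing so that higher is better in every dimension."""
    return (score[0], score[1], -score[2])


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    pts = {name: to_maximize(s) for name, s in scores.items()}
    return {
        name
        for name, p in pts.items()
        if not any(dominates(q, p) for other, q in pts.items() if other != name)
    }


def boxes_overlap(a, b, err_a, err_b):
    """Boxes of width err centered on each score overlap in every dimension."""
    return all(abs(x - y) <= (ea + eb) / 2 for x, y, ea, eb in zip(a, b, err_a, err_b))


def near_front(scores, errors):
    front = pareto_front(scores)
    return {
        name
        for name in scores
        if any(boxes_overlap(scores[name], scores[f], errors[name], errors[f]) for f in front)
    }


front = pareto_front(SCORES)
close = near_front(SCORES, ERRORS)
```

On this small subset, builds 40 and 45 are Pareto efficient, and with the invented errors build 38 is additionally picked up as close to build 45. The full computation over all builds, with the measured errors, is in the notebook.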

This gives us the following winning optimization options:

nb_build arch toolchain codegen-units lto opt-level
15 alderlake stable 1 fat 3
17 alderlake stable 1 thin 3
21 alderlake stable 16 fat 3
38 alderlake nightly 1 fat 2
40 alderlake nightly 1 thin 2
41 alderlake nightly 1 thin 3
45 alderlake nightly 16 fat 3
47 alderlake nightly 16 thin 3

and corresponding box plots:

CPU scores

Interesting: build 47 is best for BLAKE2 but very bad for Extrinsic and Verify. It is even worse than the official polkadot binary! The optimization options do not predict this bad behavior: lto=thin and codegen-units=16 could actually be good since it can use lto across the 16 codegen units. Build 45 switches this to lto=fat with dramatically better performance!

What are the winning builds?

  • Surprisingly, opt-level=3 is not always needed. However, some lto is required.
  • In fact, build 38 (opt-level=2) has excellent SR25519-Verify scores with zero variance and very good Extrinsic score.
  • Build 45 is best for SR25519-Verify and Extrinsic, and third best for BLAKE2-256. Except for its high variance, build 45 would be a clear winner. It uses codegen-units=16, which is a little surprising.
  • Builds 15, 21, 38, and 40 are good mixes between both scores.

Throwing away builds that do not improve upon the official binary, we have 15, 21, 38, 40, 45 as the good builds. They all build with target-cpu=native.

nb_build BLAKE2-256 relative diff (%) SR25519-Verify relative diff (%) Extr-Remark relative diff (%) toolchain codegen-units lto opt-level
15 1400 4.5 1010.2 0.9 54660 -1.1 stable 1 fat 3
21 1400 4.5 1020.1 1.9 55021.5 -0.5 stable 16 fat 3
38 1400 4.5 1020 1.9 54510.5 -1.4 nightly 1 fat 2
40 1410 5.2 1005 0.4 54947 -0.6 nightly 1 thin 2
45 1400 4.5 1022.9 2.1 54285.5 -1.8 nightly 16 fat 3
official 1340 0 1001.4 0 55277.5 0 nightly ? ? 3
docker 1340 0 1002.6 0.1 55372 0.2 nightly ? ? 3
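The relative diff columns are consistent with measuring each score against the official binary, i.e. 100 * (score - official) / official, rounded to one decimal. A short sketch that recomputes a few rows (the dictionary layout and function name are only for illustration):

```python
# Scores copied from the table above.
OFFICIAL = {"blake2": 1340, "sr25519": 1001.4, "remark": 55277.5}
BUILDS = {
    "38": {"blake2": 1400, "sr25519": 1020, "remark": 54510.5},
    "45": {"blake2": 1400, "sr25519": 1022.9, "remark": 54285.5},
    "docker": {"blake2": 1340, "sr25519": 1002.6, "remark": 55372},
}


def relative_diff(score, reference):
    """Percentage difference of score with respect to reference."""
    return round(100 * (score - reference) / reference, 1)


table = {
    name: {key: relative_diff(value, OFFICIAL[key]) for key, value in scores.items()}
    for name, scores in BUILDS.items()
}
```

Note the sign convention: for BLAKE2-256 and SR25519-Verify a positive value is an improvement, whereas for Extr-Remark (a timing) a negative value is the improvement.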

Winning optimization options

Based on the above analysis, we subjectively choose 38 and 45 as best builds. To compile polkadot with them, you need to modify the production profile in the Cargo.toml. For build 38:

[profile.production]
inherits = "release"
codegen-units = 1
lto = "fat"
opt-level = 2

and for build 45:

[profile.production]
inherits = "release"
codegen-units = 16
lto = "fat"
opt-level = 3

Afterwards, you build as usual with the commands

rustup override set nightly
export RUSTFLAGS="-C target-cpu=native"
cargo build --profile=production --target=x86_64-unknown-linux-gnu --locked -Z unstable-options

Please see our convenient Python script if instead you want to build several binaries and specify options in a simpler way.
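If you only want to generate the profile section for a given set of options, a few lines of Python are enough. This is a hypothetical helper for illustration, not the script mentioned above; it only renders the text that goes into Cargo.toml.

```python
def render_profile(codegen_units, lto, opt_level, name="production"):
    """Render a Cargo.toml profile section for the given build options."""
    return "\n".join(
        [
            f"[profile.{name}]",
            'inherits = "release"',
            f"codegen-units = {codegen_units}",
            f'lto = "{lto}"',
            f"opt-level = {opt_level}",
        ]
    )


# Options of build 38 and build 45 from the tables on this page.
profile_38 = render_profile(codegen_units=1, lto="fat", opt_level=2)
profile_45 = render_profile(codegen_units=16, lto="fat", opt_level=3)
```

Printing profile_38 or profile_45 reproduces the two profile sections shown above.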

The option target-cpu=native selects the best optimizations for the CPU that runs the compiler. If you want to compile for a different CPU, you need to specify its architecture explicitly (for example, target-cpu=alderlake for the CPU used here); rustc --print target-cpus lists the accepted names.
For a different CPU (Intel or AMD), the optimal build could be different.