Use faster Rng in RandomIdGenerator (0%-6% performance improvement) #1106

hdost · 2023-06-09T08:41:48Z

Represents at minimum an 11%-22% improvement in relevant benchmarks.

Fixes #808

Changes

Swapped the Rng from the thread-local ThreadRng to Pcg64Mcg for a significant speed improvement.
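
As a rough illustration of the pattern this change touches, the sketch below keeps one small non-cryptographic generator per thread and draws trace and span IDs from it. It is not the SDK's code: `SplitMix64`, `next_trace_id`, and `next_span_id` are hypothetical names, and SplitMix64 only stands in for a crate-provided generator such as `Pcg64Mcg` so that the example needs nothing beyond the standard library.

```rust
use std::cell::RefCell;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Stand-in for a fast, non-cryptographic generator (assumption: the real
/// change uses a generator from the `rand` ecosystem, not this type).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    /// Seed from the randomly initialized keys that `RandomState` carries.
    fn from_std_entropy() -> Self {
        let seed = RandomState::new().build_hasher().finish();
        SplitMix64 { state: seed }
    }

    /// One SplitMix64 step: advance the state, then scramble it.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

thread_local! {
    // One generator per thread, so no locking is needed on the hot path.
    static CURRENT_RNG: RefCell<SplitMix64> = RefCell::new(SplitMix64::from_std_entropy());
}

/// Hypothetical helper: a 128-bit trace ID built from two 64-bit draws.
fn next_trace_id() -> u128 {
    CURRENT_RNG.with(|rng| {
        let mut rng = rng.borrow_mut();
        let high = rng.next_u64() as u128;
        let low = rng.next_u64() as u128;
        (high << 64) | low
    })
}

/// Hypothetical helper: a 64-bit span ID from a single draw.
fn next_span_id() -> u64 {
    CURRENT_RNG.with(|rng| rng.borrow_mut().next_u64())
}
```

Calling `next_trace_id()` followed by `next_span_id()` consumes 192 bits from the per-thread generator, which is the cost the benchmarks in this PR are measuring.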

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

Benchmarks

at 05:47:15 ➜  cargo bench -p opentelemetry_sdk --bench trace -- --baseline main
   Compiling rand_pcg v0.3.1
   Compiling opentelemetry_sdk v0.19.0 (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/opentelemetry-sdk)
    Finished bench [optimized + debuginfo] target(s) in 6.57s
     Running benches/trace.rs (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/target/release/deps/trace-eadd52583edeb932)
Gnuplot not found, using plotters backend
Benchmarking EvictedHashMap/insert 1: Collecting 100 samples in estimated 5.0004 s (52
EvictedHashMap/insert 1 time:   [95.338 ns 96.448 ns 97.487 ns]
                        change: [-20.671% -19.590% -18.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking EvictedHashMap/insert 5: Collecting 100 samples in estimated 5.0003 s (17
EvictedHashMap/insert 5 time:   [290.93 ns 294.97 ns 298.92 ns]
                        change: [-22.131% -21.194% -20.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking EvictedHashMap/insert 10: Collecting 100 samples in estimated 5.0011 s (8
EvictedHashMap/insert 10
                        time:   [605.80 ns 613.05 ns 620.85 ns]
                        change: [-22.299% -21.526% -20.657%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking EvictedHashMap/insert 20: Collecting 100 samples in estimated 5.0006 s (3
EvictedHashMap/insert 20
                        time:   [1.6257 µs 1.6457 µs 1.6656 µs]
                        change: [-14.434% -13.636% -12.633%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span/always-sample: Collecting 100 samples in estimated 5.0001
start-end-span/always-sample
                        time:   [432.54 ns 433.88 ns 435.57 ns]
                        change: [-17.432% -17.063% -16.682%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking start-end-span/never-sample: Collecting 100 samples in estimated 5.0007 s
start-end-span/never-sample
                        time:   [131.14 ns 131.92 ns 132.81 ns]
                        change: [-24.020% -22.987% -22.032%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking start-end-span-4-attrs/always-sample: Collecting 100 samples in estimated
start-end-span-4-attrs/always-sample
                        time:   [1.1646 µs 1.1815 µs 1.1950 µs]
                        change: [-13.003% -11.021% -8.8554%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking start-end-span-4-attrs/never-sample: Collecting 100 samples in estimated
start-end-span-4-attrs/never-sample
                        time:   [176.38 ns 177.05 ns 177.79 ns]
                        change: [-18.366% -18.037% -17.711%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span-8-attrs/always-sample: Collecting 100 samples in estimated
start-end-span-8-attrs/always-sample
                        time:   [1.7350 µs 1.7416 µs 1.7484 µs]
                        change: [-13.517% -11.840% -9.9778%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Benchmarking start-end-span-8-attrs/never-sample: Collecting 100 samples in estimated
start-end-span-8-attrs/never-sample
                        time:   [217.97 ns 219.36 ns 221.25 ns]
                        change: [-18.194% -17.571% -16.894%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

Benchmarking start-end-span-all-attr-types/always-sample: Collecting 100 samples in es
start-end-span-all-attr-types/always-sample
                        time:   [1.3941 µs 1.4064 µs 1.4171 µs]
                        change: [-14.897% -14.451% -13.978%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
Benchmarking start-end-span-all-attr-types/never-sample: Collecting 100 samples in est
start-end-span-all-attr-types/never-sample
                        time:   [186.94 ns 188.96 ns 190.93 ns]
                        change: [-20.653% -19.603% -18.410%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span-all-attr-types-2x/always-sample: Collecting 100 samples in
start-end-span-all-attr-types-2x/always-sample
                        time:   [2.1476 µs 2.1576 µs 2.1675 µs]
                        change: [-12.057% -11.575% -11.138%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking start-end-span-all-attr-types-2x/never-sample: Collecting 100 samples in
start-end-span-all-attr-types-2x/never-sample
                        time:   [242.89 ns 244.19 ns 245.73 ns]
                        change: [-18.510% -18.000% -17.417%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

codecov · 2023-06-09T08:58:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Files	Coverage Δ
opentelemetry-sdk/src/trace/id_generator/mod.rs	`100.0% <100.0%> (ø)`


hdost · 2023-06-09T09:01:41Z

😓 I need to re-run the baseline. I realize my original baseline was run on battery, and I believe that may have produced the drastic improvements.

djc · 2023-06-09T09:20:29Z

Great to see some movement on this! Would be good to redo your measurements with a consistent environment. 😄

Questions:

* Do we need a CSPRNG here, or is something insecure sufficient for our needs here?
* If we don't need a CSPRNG, we should probably just use SmallRng?
* If we do need a CSPRNG, how does rand_pcg compare to ThreadRng? IMO in the absence of strong motivation I would feel better sticking to rand's default algorithm selection.
* If rand_pcg provides a substantial advantage, can we use Pcg64Mcg on 64-bit architectures and Pcg32 on 32-bit?

cc @shaun-cox

hdost · 2023-06-09T10:48:56Z

Great to see some movement on this! Would be good to redo your measurements with a consistent environment. 😄

Questions:
* Do we need a CSPRNG here, or is something insecure sufficient for our needs here?

I think something insecure is fine for our case.

* If we don't need a CSPRNG, we should probably just use `SmallRng`?

From reading the documentation from the random team, it seems that SmallRng is not portable, which I think we may want to avoid for better universality.

* If we do need a CSPRNG, how does rand_pcg compare to `ThreadRng`? IMO in the absence of strong motivation I would feel better sticking to rand's default algorithm selection.

* If rand_pcg provides a substantial advantage, can we use `Pcg64Mcg` on 64-bit architectures and `Pcg32` on 32-bit?

I think we could just opt for the Pcg32.

hdost · 2023-06-09T10:53:40Z

Still seeing improvements, but a bit more variable (2%-12%).
What is a bit curious from my perspective is the regression in the unrelated EvictedHashMap benchmark, because it doesn't even use the TraceIdGenerator.

I ran the baseline a few times to normalize the temperature of my laptop, and I did the same with the modification.
The results below are for Pcg32 running on my Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz:

at 12:45:24 ➜  cargo bench -p opentelemetry_sdk --bench trace -- --baseline main
    Finished bench [optimized + debuginfo] target(s) in 0.13s
     Running benches/trace.rs (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/target/release/deps/trace-eadd52583edeb932)
Gnuplot not found, using plotters backend
EvictedHashMap/insert 1 time:   [94.327 ns 95.895 ns 97.622 ns]
                        change: [+6.8380% +9.7777% +13.308%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
EvictedHashMap/insert 5 time:   [269.19 ns 271.21 ns 273.44 ns]
                        change: [-9.7396% -8.1467% -6.6378%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
EvictedHashMap/insert 10
                        time:   [562.49 ns 568.25 ns 574.97 ns]
                        change: [-11.395% -9.4004% -7.0972%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
EvictedHashMap/insert 20
                        time:   [1.4855 µs 1.5105 µs 1.5351 µs]
                        change: [-3.6807% -1.3490% +1.1990%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

start-end-span/always-sample
                        time:   [437.20 ns 439.62 ns 442.82 ns]
                        change: [-3.6715% -2.9462% -2.3529%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high severe
start-end-span/never-sample
                        time:   [129.74 ns 131.71 ns 133.63 ns]
                        change: [-12.133% -10.492% -8.8450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

start-end-span-4-attrs/always-sample
                        time:   [1.1549 µs 1.1620 µs 1.1693 µs]
                        change: [-2.2722% -0.8020% +1.0240%] (p = 0.39 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
start-end-span-4-attrs/never-sample
                        time:   [162.44 ns 163.67 ns 165.07 ns]
                        change: [-17.960% -16.329% -14.688%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

start-end-span-8-attrs/always-sample
                        time:   [1.7150 µs 1.7223 µs 1.7299 µs]
                        change: [-3.2110% -1.9454% -0.5185%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
start-end-span-8-attrs/never-sample
                        time:   [209.97 ns 214.02 ns 217.95 ns]
                        change: [-10.995% -9.1955% -7.3323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

start-end-span-all-attr-types/always-sample
                        time:   [1.4226 µs 1.4318 µs 1.4410 µs]
                        change: [+0.5991% +1.6015% +2.5197%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
start-end-span-all-attr-types/never-sample
                        time:   [174.47 ns 176.15 ns 178.10 ns]
                        change: [-14.156% -12.944% -11.760%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

start-end-span-all-attr-types-2x/always-sample
                        time:   [2.0565 µs 2.0719 µs 2.0903 µs]
                        change: [-3.9848% -2.7611% -1.3589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
start-end-span-all-attr-types-2x/never-sample
                        time:   [223.16 ns 224.39 ns 225.64 ns]
                        change: [-13.450% -12.519% -11.496%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

djc · 2023-06-09T11:09:57Z

If rand_pcg provides a substantial advantage, can we use Pcg64Mcg on 64-bit architectures and Pcg32 on 32-bit?

I think we could just opt for the Pcg32.

If I'm reading the docs right, using Pcg32 on 64-bit architectures would be bad for performance.
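
One way the per-architecture selection asked about above could be expressed is with conditional compilation on the pointer width. The sketch below is only an illustration under assumptions: `Gen64`, `Gen32`, and `FastRng` are hypothetical names, and the two generators are simple stand-ins (SplitMix64 and xorshift32), not `Pcg64Mcg` or `Pcg32`, so that the example builds with the standard library alone.

```rust
/// Stand-in 64-bit generator (SplitMix64); hypothetical, not `Pcg64Mcg`.
#[allow(dead_code)]
pub struct Gen64(u64);

#[allow(dead_code)]
impl Gen64 {
    pub fn new(seed: u64) -> Self {
        Gen64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Stand-in 32-bit generator (xorshift32); hypothetical, not `Pcg32`.
#[allow(dead_code)]
pub struct Gen32(u32);

#[allow(dead_code)]
impl Gen32 {
    pub fn new(seed: u64) -> Self {
        // Fold the seed to 32 bits; xorshift must not start from zero.
        let folded = (seed as u32) ^ ((seed >> 32) as u32);
        Gen32(if folded == 0 { 0x9E37_79B9 } else { folded })
    }

    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }

    /// Two 32-bit steps are needed for every 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        let high = self.next_u32() as u64;
        let low = self.next_u32() as u64;
        (high << 32) | low
    }
}

// Pick the generator whose native word size matches the target.
#[cfg(target_pointer_width = "64")]
pub type FastRng = Gen64;
#[cfg(not(target_pointer_width = "64"))]
pub type FastRng = Gen32;
```

Callers would use only `FastRng::new(seed)` and `next_u64()`, so the choice of algorithm stays a compile-time detail.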

djc · 2023-06-09T11:12:32Z

From reading the documentation from the random team, it seems that SmallRng is not portable, which I think we may want to avoid for better universality.

What do you mean by portable, and universality? I don't see these terms in the documentation at https://docs.rs/rand/latest/rand/rngs/struct.SmallRng.html.

hdost · 2023-06-09T12:56:26Z

From reading the documentation from the random team, it seems that SmallRng is not portable, which I think we may want to avoid for better universality.

What do you mean by portable, and universality? I don't see these terms in the documentation at https://docs.rs/rand/latest/rand/rngs/struct.SmallRng.html.

There are some details in rust-random/rand#1285, but reading into it, I don't think this is an issue for us since we don't plan to use ::from_seed().

shaun-cox · 2023-06-09T13:36:37Z

On the consistent environment for benchmarking topic: I use the following and find it works well.

export "RUSTFLAGS=-C force-frame-pointers=yes -C target-cpu=native"
taskset -c 2,4 cargo bench -p opentelemetry_sdk --bench trace -- --save-baseline main start-end-span/
taskset -c 2,4 cargo bench -p opentelemetry_sdk --bench trace -- --profile-time 20 start-end-span/

taskset configures the scheduler to only use those numbered processors for the run. The numbers themselves don't particularly matter, I think, just that they are the same when comparing a later run with --baseline main instead of --save-baseline main.
The reason I use two processors for taskset instead of one is due to the nature of the always-sample benchmarks... the span processor used creates another thread for the receiver of the SpanDatas sent by the main thread being benchmarked which generates those data.

The other observation I have on this random generation of TraceId and SpanId is that I suspect we don't need to generate a full 128 bits of randomness every time we need a new TraceId. Instead, I suspect we could generate 64-bits when the Tracer is returned from the TracerProvider, and then generate the other 64-bits in build_with_context. IOW, precompute the top-half of all TraceIds returned from a given Tracer when the Tracer itself is constructed.

Furthermore, couldn't that latter 64 bits be used as the SpanId too? IOW, when a new Span has no parent and is the root of a new Trace, its SpanId will just be the lower 64 bits of the new TraceId.

Today, we fetch 192 bits of entropy in this codepath, but with above suggestion, we'd only need to fetch 64 bits.
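
The suggestion above can be sketched roughly as follows. This is not the SDK's code: `Tracer`, `new_root_ids`, and the closure-based randomness source are hypothetical, chosen so the example is deterministic and depends only on the standard library.

```rust
/// Hypothetical tracer that fixes the upper 64 bits of every trace ID
/// once, at construction time.
struct Tracer {
    trace_id_high: u64,
}

impl Tracer {
    /// Draw the shared upper half a single time when the tracer is created.
    fn new(next_u64: &mut impl FnMut() -> u64) -> Self {
        Tracer {
            trace_id_high: next_u64(),
        }
    }

    /// IDs for a new root span: one 64-bit draw supplies both the lower
    /// half of the trace ID and the span ID.
    /// Note: an all-zero ID is invalid, so a real implementation would
    /// need to redraw when the generator returns zero.
    fn new_root_ids(&self, next_u64: &mut impl FnMut() -> u64) -> (u128, u64) {
        let low = next_u64();
        let trace_id = ((self.trace_id_high as u128) << 64) | low as u128;
        (trace_id, low)
    }
}
```

With this shape, each root span costs a single 64-bit draw instead of the 192 bits fetched today, at the price of every trace from one `Tracer` sharing its upper half.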

hdost · 2023-06-11T20:56:14Z

So I did a few tests; it looks like 2-7%. I'll post the full results when I'm back at my computer.

For the entropy, I like the idea, but the one question I have is whether this sharing of the first 64 bits will result in a skew in sampling.

djc · 2023-07-03T11:54:19Z

@hdost ping, can you still drive this forward?

hdost · 2023-07-06T08:25:11Z

@hdost ping, can you still drive this forward?

Yes sorry, was on vacation 🏄‍♂️

hdost · 2023-07-06T08:26:24Z

So I did a few tests; it looks like 2-7%. I'll post the full results when I'm back at my computer.

For the entropy, I like the idea, but the one question I have is whether this sharing of the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

shaun-cox · 2023-07-10T15:13:14Z

So I did a few tests; it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether this sharing of the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.

I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413
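
To make the skew question concrete, here is a minimal sketch of a ratio-based sampling decision that reads only the lower 64 bits of the trace ID. It is an assumption for illustration, not a statement of how any particular sampler is implemented; `should_sample` is a hypothetical name. Under that assumption, a fixed upper half cannot influence the outcome.

```rust
/// Hypothetical ratio-based decision that looks only at the lower 64 bits
/// of the trace ID (assumption: the upper 64 bits are ignored entirely).
fn should_sample(trace_id: u128, ratio: f64) -> bool {
    let ratio = ratio.clamp(0.0, 1.0);
    // Threshold in a 63-bit space, so a ratio of 1.0 accepts every value.
    let upper_bound = (ratio * (1u64 << 63) as f64) as u64;
    // Truncating to u64 keeps only the lower 64 bits.
    let low = trace_id as u64;
    (low >> 1) < upper_bound
}
```

Two trace IDs that differ only in their upper 64 bits always receive the same decision here, so sharing those bits across a `Tracer` would not change the sampled fraction as long as the lower half stays uniformly random.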

cijothomas · 2023-10-20T17:10:12Z

So I did a few tests; it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether this sharing of the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.

I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413

open-telemetry/opentelemetry-specification#3411 Related.

hdost · 2023-10-23T19:59:30Z

So I did a few tests; it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether this sharing of the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.
I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413

open-telemetry/opentelemetry-specification#3411 Related.

So then, for a follow-up change, we can look at adding one of the changes mentioned before: keeping pre-generated most significant bits and only randomly generating the lower bits.

djc

Nice!

djc · 2023-10-24T08:44:23Z

opentelemetry-sdk/Cargo.toml

@@ -21,7 +21,7 @@ futures-util = { version = "0.3.17", default-features = false, features = ["std"
 once_cell = "1.10"
 ordered-float = "4.0"
 percent-encoding = { version = "2.0", optional = true }
-rand = { version = "0.8", default-features = false, features = ["std", "std_rng"], optional = true }
+rand = { version = "0.8", default-features = false, features = ["std", "std_rng","small_rng"], optional = true }


Nit: add a space before "small_rng", please.

hdost requested a review from a team June 9, 2023 08:41

This was referenced Jun 9, 2023

Proposal: Alter RNG for TraceId and SpanId #808

Closed

WIP - Benchmarks #816

Closed

hdost marked this pull request as draft June 9, 2023 09:04

hdost force-pushed the feat/808-switch-to-small-rng branch from da365a2 to 54d9fcd Compare June 9, 2023 10:20

Move to SmallRng from ThreadRng

bb71c6c

SmallRng provides 0-6% improvement in Traces. Relates open-telemetry#808

hdost force-pushed the feat/808-switch-to-small-rng branch from 54d9fcd to bb71c6c Compare October 18, 2023 08:11

hdost changed the title ~~Use faster Rng in RandomIdGenerator (11%-22% performance improvement)~~ Use faster Rng in RandomIdGenerator (0%-6% performance improvement) Oct 18, 2023

hdost marked this pull request as ready for review October 18, 2023 08:12

shaun-cox approved these changes Oct 20, 2023

djc approved these changes Oct 24, 2023

hdost mentioned this pull request Nov 12, 2023

Reducing Impact of random generation on span creation. #1367

Open

hdost merged commit 5fc4101 into open-telemetry:main Nov 12, 2023
13 checks passed

hdost deleted the feat/808-switch-to-small-rng branch November 12, 2023 15:48
