Use faster Rng in RandomIdGenerator (0%-6% performance improvement) #1106

Merged: hdost merged 1 commit into open-telemetry:main on Nov 12, 2023

Conversation

hdost
Contributor

@hdost hdost commented Jun 9, 2023

Represents, at minimum, an 11%-22% improvement in the relevant benchmarks.

Fixes #808

Changes

Swapped Rng from ThreadLocal to Pcg64Mcg for significant speed improvement.
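
For readers unfamiliar with the pattern, here is a minimal sketch of a per-thread, non-cryptographic generator used for ID generation. Everything in it is an assumption for illustration: `FastRng` is a hypothetical std-only stand-in (SplitMix64), not Pcg64Mcg or anything from the `rand` ecosystem, and `new_trace_id` / `new_span_id` are made-up helper names rather than the SDK's API.

```rust
use std::cell::RefCell;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical stand-in for a small, fast, non-cryptographic generator.
/// The algorithm here is SplitMix64, used only to keep the sketch std-only.
struct FastRng {
    state: u64,
}

impl FastRng {
    /// Seed from std's randomly keyed hasher, a std-only substitute for
    /// asking the OS for entropy.
    fn from_entropy() -> Self {
        let seed = RandomState::new().build_hasher().finish();
        FastRng { state: seed }
    }

    /// Advance the state and return the next 64-bit output.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

thread_local! {
    /// One generator per thread, so the hot path needs no locking.
    static CURRENT_RNG: RefCell<FastRng> = RefCell::new(FastRng::from_entropy());
}

/// Build a 128-bit trace ID from two 64-bit draws.
fn new_trace_id() -> u128 {
    CURRENT_RNG.with(|rng| {
        let mut rng = rng.borrow_mut();
        let hi = rng.next_u64() as u128;
        let lo = rng.next_u64() as u128;
        (hi << 64) | lo
    })
}

/// Build a 64-bit span ID from a single draw.
fn new_span_id() -> u64 {
    CURRENT_RNG.with(|rng| rng.borrow_mut().next_u64())
}
```

Calling `new_trace_id()` and `new_span_id()` from any thread lazily creates that thread's generator on first use. The sketch does not guard against an all-zero ID, which a real generator would need to consider.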

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

Benchmarks

at 05:47:15 ➜  cargo bench -p opentelemetry_sdk --bench trace -- --baseline main
   Compiling rand_pcg v0.3.1
   Compiling opentelemetry_sdk v0.19.0 (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/opentelemetry-sdk)
    Finished bench [optimized + debuginfo] target(s) in 6.57s
     Running benches/trace.rs (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/target/release/deps/trace-eadd52583edeb932)
Gnuplot not found, using plotters backend
Benchmarking EvictedHashMap/insert 1: Collecting 100 samples in estimated 5.0004 s (52
EvictedHashMap/insert 1 time:   [95.338 ns 96.448 ns 97.487 ns]
                        change: [-20.671% -19.590% -18.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking EvictedHashMap/insert 5: Collecting 100 samples in estimated 5.0003 s (17
EvictedHashMap/insert 5 time:   [290.93 ns 294.97 ns 298.92 ns]
                        change: [-22.131% -21.194% -20.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking EvictedHashMap/insert 10: Collecting 100 samples in estimated 5.0011 s (8
EvictedHashMap/insert 10
                        time:   [605.80 ns 613.05 ns 620.85 ns]
                        change: [-22.299% -21.526% -20.657%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking EvictedHashMap/insert 20: Collecting 100 samples in estimated 5.0006 s (3
EvictedHashMap/insert 20
                        time:   [1.6257 µs 1.6457 µs 1.6656 µs]
                        change: [-14.434% -13.636% -12.633%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span/always-sample: Collecting 100 samples in estimated 5.0001
start-end-span/always-sample
                        time:   [432.54 ns 433.88 ns 435.57 ns]
                        change: [-17.432% -17.063% -16.682%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking start-end-span/never-sample: Collecting 100 samples in estimated 5.0007 s
start-end-span/never-sample
                        time:   [131.14 ns 131.92 ns 132.81 ns]
                        change: [-24.020% -22.987% -22.032%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking start-end-span-4-attrs/always-sample: Collecting 100 samples in estimated
start-end-span-4-attrs/always-sample
                        time:   [1.1646 µs 1.1815 µs 1.1950 µs]
                        change: [-13.003% -11.021% -8.8554%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking start-end-span-4-attrs/never-sample: Collecting 100 samples in estimated
start-end-span-4-attrs/never-sample
                        time:   [176.38 ns 177.05 ns 177.79 ns]
                        change: [-18.366% -18.037% -17.711%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span-8-attrs/always-sample: Collecting 100 samples in estimated
start-end-span-8-attrs/always-sample
                        time:   [1.7350 µs 1.7416 µs 1.7484 µs]
                        change: [-13.517% -11.840% -9.9778%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Benchmarking start-end-span-8-attrs/never-sample: Collecting 100 samples in estimated
start-end-span-8-attrs/never-sample
                        time:   [217.97 ns 219.36 ns 221.25 ns]
                        change: [-18.194% -17.571% -16.894%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

Benchmarking start-end-span-all-attr-types/always-sample: Collecting 100 samples in es
start-end-span-all-attr-types/always-sample
                        time:   [1.3941 µs 1.4064 µs 1.4171 µs]
                        change: [-14.897% -14.451% -13.978%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
Benchmarking start-end-span-all-attr-types/never-sample: Collecting 100 samples in est
start-end-span-all-attr-types/never-sample
                        time:   [186.94 ns 188.96 ns 190.93 ns]
                        change: [-20.653% -19.603% -18.410%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking start-end-span-all-attr-types-2x/always-sample: Collecting 100 samples in
start-end-span-all-attr-types-2x/always-sample
                        time:   [2.1476 µs 2.1576 µs 2.1675 µs]
                        change: [-12.057% -11.575% -11.138%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking start-end-span-all-attr-types-2x/never-sample: Collecting 100 samples in
start-end-span-all-attr-types-2x/never-sample
                        time:   [242.89 ns 244.19 ns 245.73 ns]
                        change: [-18.510% -18.000% -17.417%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

@hdost hdost requested a review from a team June 9, 2023 08:41
@codecov

codecov bot commented Jun 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
opentelemetry-sdk/src/trace/id_generator/mod.rs 100.0% <100.0%> (ø)


@hdost
Contributor Author

hdost commented Jun 9, 2023

😓 I need to re-run the baseline. I realized my original baseline was run on battery, and I believe that may explain the drastic improvements.

@hdost hdost marked this pull request as draft June 9, 2023 09:04
@djc
Contributor

djc commented Jun 9, 2023

Great to see some movement on this! Would be good to redo your measurements with a consistent environment. 😄

Questions:

  • Do we need a CSPRNG here, or is something insecure sufficient for our needs here?
  • If we don't need a CSPRNG, we should probably just use SmallRng?
  • If we do need a CSPRNG, how does rand_pcg compare to ThreadRng? IMO in the absence of strong motivation I would feel better sticking to rand's default algorithm selection.
  • If rand_pcg provides a substantial advantage, can we use Pcg64Mcg on 64-bit architectures and Pcg32 on 32-bit?
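
To make the last question concrete, here is a rough sketch of picking a generator by pointer width at compile time. The two `IdRng` bodies are hypothetical std-only stand-ins (SplitMix64 and xorshift32), not the real Pcg64Mcg or Pcg32, and `new_span_id` is a made-up helper; only the `cfg(target_pointer_width = ...)` selection is the point.

```rust
/// Hypothetical stand-in for a 64-bit-oriented generator such as Pcg64Mcg.
/// The body is SplitMix64, chosen only to keep the sketch std-only.
#[cfg(target_pointer_width = "64")]
struct IdRng {
    state: u64,
}

#[cfg(target_pointer_width = "64")]
impl IdRng {
    fn new(seed: u64) -> Self {
        IdRng { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for a 32-bit-oriented generator such as Pcg32.
/// The body is xorshift32; two 32-bit outputs are joined per 64-bit value.
#[cfg(not(target_pointer_width = "64"))]
struct IdRng {
    state: u32,
}

#[cfg(not(target_pointer_width = "64"))]
impl IdRng {
    fn new(seed: u64) -> Self {
        // xorshift32 must not start from zero, so force the lowest bit on.
        IdRng { state: (seed as u32) | 1 }
    }

    fn next_u32(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }

    fn next_u64(&mut self) -> u64 {
        let hi = self.next_u32() as u64;
        let lo = self.next_u32() as u64;
        (hi << 32) | lo
    }
}

/// Callers only see `IdRng`, whichever implementation was compiled in.
fn new_span_id(rng: &mut IdRng) -> u64 {
    rng.next_u64()
}
```

The calling code stays identical on both targets, so the choice of algorithm becomes a build-time detail rather than something the ID generator has to branch on.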

cc @shaun-cox

@hdost hdost force-pushed the feat/808-switch-to-small-rng branch from da365a2 to 54d9fcd Compare June 9, 2023 10:20
@hdost
Contributor Author

hdost commented Jun 9, 2023

Great to see some movement on this! Would be good to redo your measurements with a consistent environment. 😄

Questions:

* Do we need a CSPRNG here, or is something insecure sufficient for our needs here?

I think something insecure is fine for our case.

* If we don't need a CSPRNG, we should probably just use `SmallRng`?

From reading the rand team's documentation, it seems SmallRng is not portable, which I think we may want to avoid for better universality.

* If we do need a CSPRNG, how does rand_pcg compare to `ThreadRng`? IMO in the absence of strong motivation I would feel better sticking to rand's default algorithm selection.

* If rand_pcg provides a substantial advantage, can we use `Pcg64Mcg` on 64-bit architectures and `Pcg32` on 32-bit?

I think we could just opt for the Pcg32.

@hdost
Contributor Author

hdost commented Jun 9, 2023

Still seeing improvements, but they are a bit more variable (2%-12%).
What is a bit curious from my perspective is the regression in the unrelated EvictedHashMap benchmark, because it doesn't even use the TraceIdGenerator.

I ran the baseline a few times to normalize the temperature of my laptop, and I did the same with the modification.
The results below are for Pcg32 running on my Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz:

at 12:45:24 ➜  cargo bench -p opentelemetry_sdk --bench trace -- --baseline main
    Finished bench [optimized + debuginfo] target(s) in 0.13s
     Running benches/trace.rs (/home/h.dost/projects/github.com/hdost/opentelemetry-rust/target/release/deps/trace-eadd52583edeb932)
Gnuplot not found, using plotters backend
EvictedHashMap/insert 1 time:   [94.327 ns 95.895 ns 97.622 ns]
                        change: [+6.8380% +9.7777% +13.308%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
EvictedHashMap/insert 5 time:   [269.19 ns 271.21 ns 273.44 ns]
                        change: [-9.7396% -8.1467% -6.6378%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
EvictedHashMap/insert 10
                        time:   [562.49 ns 568.25 ns 574.97 ns]
                        change: [-11.395% -9.4004% -7.0972%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
EvictedHashMap/insert 20
                        time:   [1.4855 µs 1.5105 µs 1.5351 µs]
                        change: [-3.6807% -1.3490% +1.1990%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

start-end-span/always-sample
                        time:   [437.20 ns 439.62 ns 442.82 ns]
                        change: [-3.6715% -2.9462% -2.3529%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high severe
start-end-span/never-sample
                        time:   [129.74 ns 131.71 ns 133.63 ns]
                        change: [-12.133% -10.492% -8.8450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

start-end-span-4-attrs/always-sample
                        time:   [1.1549 µs 1.1620 µs 1.1693 µs]
                        change: [-2.2722% -0.8020% +1.0240%] (p = 0.39 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
start-end-span-4-attrs/never-sample
                        time:   [162.44 ns 163.67 ns 165.07 ns]
                        change: [-17.960% -16.329% -14.688%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

start-end-span-8-attrs/always-sample
                        time:   [1.7150 µs 1.7223 µs 1.7299 µs]
                        change: [-3.2110% -1.9454% -0.5185%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
start-end-span-8-attrs/never-sample
                        time:   [209.97 ns 214.02 ns 217.95 ns]
                        change: [-10.995% -9.1955% -7.3323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

start-end-span-all-attr-types/always-sample
                        time:   [1.4226 µs 1.4318 µs 1.4410 µs]
                        change: [+0.5991% +1.6015% +2.5197%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
start-end-span-all-attr-types/never-sample
                        time:   [174.47 ns 176.15 ns 178.10 ns]
                        change: [-14.156% -12.944% -11.760%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

start-end-span-all-attr-types-2x/always-sample
                        time:   [2.0565 µs 2.0719 µs 2.0903 µs]
                        change: [-3.9848% -2.7611% -1.3589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
start-end-span-all-attr-types-2x/never-sample
                        time:   [223.16 ns 224.39 ns 225.64 ns]
                        change: [-13.450% -12.519% -11.496%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

@djc
Contributor

djc commented Jun 9, 2023

  • If rand_pcg provides a substantial advantage, can we use Pcg64Mcg on 64-bit architectures and Pcg32 on 32-bit?

I think we could just opt for the Pcg32.

If I'm reading the docs right, using Pcg32 on 64-bit architectures would be bad for performance.

@djc
Contributor

djc commented Jun 9, 2023

From reading the rand team's documentation, it seems SmallRng is not portable, which I think we may want to avoid for better universality.

What do you mean by portable, and universality? I don't see these terms in the documentation at https://docs.rs/rand/latest/rand/rngs/struct.SmallRng.html.

@hdost
Contributor Author

hdost commented Jun 9, 2023

From reading the rand team's documentation, it seems SmallRng is not portable, which I think we may want to avoid for better universality.

What do you mean by portable, and universality? I don't see these terms in the documentation at https://docs.rs/rand/latest/rand/rngs/struct.SmallRng.html.

There are some details in rust-random/rand#1285, but reading into it, I don't think this is an issue for us since we don't plan to use ::from_seed().

@shaun-cox
Contributor

On the consistent environment for benchmarking topic: I use the following and find it works well.

export "RUSTFLAGS=-C force-frame-pointers=yes -C target-cpu=native"
taskset -c 2,4 cargo bench -p opentelemetry_sdk --bench trace -- --save-baseline main start-end-span/
taskset -c 2,4 cargo bench -p opentelemetry_sdk --bench trace -- --profile-time 20 start-end-span/

taskset configures the scheduler to only use those numbered processors for the run. The numbers themselves don't particularly matter, I think, just that they are the same when comparing a later run with --baseline main instead of --save-baseline main.
The reason I use two processors for taskset instead of one is due to the nature of the always-sample benchmarks... the span processor used creates another thread for the receiver of the SpanDatas sent by the main thread being benchmarked which generates those data.

The other observation I have on this random generation of TraceId and SpanId is that I suspect we don't need to generate a full 128 bits of randomness every time we need a new TraceId. Instead, I suspect we could generate 64-bits when the Tracer is returned from the TracerProvider, and then generate the other 64-bits in build_with_context. IOW, precompute the top-half of all TraceIds returned from a given Tracer when the Tracer itself is constructed.

Furthermore, couldn't those latter 64 bits be used as the SpanId too? IOW, when a new Span has no parent and is the root of a new Trace, its SpanId will just be the lower 64 bits of the new TraceId.

Today, we fetch 192 bits of entropy in this codepath, but with above suggestion, we'd only need to fetch 64 bits.
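
A rough sketch of that idea, under stated assumptions: `SketchTracer` and `new_root_ids` are hypothetical names (not the SDK's `Tracer` or `build_with_context`), and the random source is passed in as a closure so the example stays std-only.

```rust
/// Hypothetical tracer that fixes the top 64 bits of every trace ID it
/// hands out at construction time.
struct SketchTracer {
    trace_id_high: u64,
}

impl SketchTracer {
    /// Draw the shared high half once, when the tracer is created.
    fn new(mut next_u64: impl FnMut() -> u64) -> Self {
        SketchTracer {
            trace_id_high: next_u64(),
        }
    }

    /// For a root span, draw only 64 fresh bits. They become the low half
    /// of the trace ID and are reused as the span ID.
    /// Note: this sketch does not guard against an all-zero ID.
    fn new_root_ids(&self, mut next_u64: impl FnMut() -> u64) -> (u128, u64) {
        let low = next_u64();
        let trace_id = ((self.trace_id_high as u128) << 64) | (low as u128);
        (trace_id, low)
    }
}
```

With this shape, each root span costs one 64-bit draw instead of three, at the price of every trace from the same tracer sharing its upper 64 bits.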

@hdost
Contributor Author

hdost commented Jun 11, 2023

So I did a few tests, and it looks like 2-7%. I'll post the full results when I'm back at my computer.

For the entropy, I like the idea, but the one question I have is whether sharing the first 64 bits will result in a skew in sampling.

@djc
Contributor

djc commented Jul 3, 2023

@hdost ping, can you still drive this forward?

@hdost
Contributor Author

hdost commented Jul 6, 2023

@hdost ping, can you still drive this forward?

Yes sorry, was on vacation 🏄‍♂️

@hdost
Contributor Author

hdost commented Jul 6, 2023

So I did a few tests, and it looks like 2-7%. I'll post the full results when I'm back at my computer.

For the entropy, I like the idea, but the one question I have is whether sharing the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

@shaun-cox
Contributor

So I did a few tests, and it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether sharing the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.
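
One way to picture why a shared high half need not skew sampling: a ratio sampler that looks only at the low 64 bits makes the same decision regardless of what the upper bits are. The sketch below is illustrative only. `should_sample` is a hypothetical function, and whether a given SDK sampler reads only the low bits is an assumption here; a sampler that hashes the whole ID or reads the high half would behave differently.

```rust
/// Hypothetical ratio sampler that decides using only the low 64 bits
/// of the trace ID.
fn should_sample(trace_id: u128, ratio: f64) -> bool {
    // Map the ratio onto the 63-bit range [0, 2^63].
    let upper_bound = (ratio.clamp(0.0, 1.0) * (1u64 << 63) as f64) as u64;
    // Truncating to u64 keeps only the low 64 bits of the trace ID.
    let low = trace_id as u64;
    // Drop one bit so the value is compared in the same 63-bit range.
    (low >> 1) < upper_bound
}
```

Two trace IDs that differ only in their upper 64 bits get identical decisions from this function, so uniformly random low bits are enough for a uniform sampling rate.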

I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413

SmallRng provides 0-6% improvement in Traces.

Relates open-telemetry#808
@hdost hdost force-pushed the feat/808-switch-to-small-rng branch from 54d9fcd to bb71c6c Compare October 18, 2023 08:11
@hdost hdost changed the title Use faster Rng in RandomIdGenerator (11%-22% performance improvement) Use faster Rng in RandomIdGenerator (0%-6% performance improvement) Oct 18, 2023
@hdost hdost marked this pull request as ready for review October 18, 2023 08:12
@cijothomas
Member

So I did a few tests, and it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether sharing the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.

I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413

open-telemetry/opentelemetry-specification#3411 Related.

@hdost
Contributor Author

hdost commented Oct 23, 2023

So I did a few tests, and it looks like 2-7%. I'll post the full results when I'm back at my computer.
For the entropy, I like the idea, but the one question I have is whether sharing the first 64 bits will result in a skew in sampling.

@shaun-cox any thoughts on this ?

I don't readily see any issues with sampling skew, but I'm not an expert. 64-bits of randomness in the trace id would seem to provide enough to make uniform sampling decisions.
I'll also reference this sdk issue which I came across while researching: open-telemetry/opentelemetry-specification#1413

open-telemetry/opentelemetry-specification#3411 Related.

So then, for a follow-up change, we can look at adding one of the changes mentioned before: pre-generating the most significant bits and randomly generating only the lower bits.

Contributor

@djc djc left a comment

Nice!

@@ -21,7 +21,7 @@ futures-util = { version = "0.3.17", default-features = false, features = ["std"
once_cell = "1.10"
ordered-float = "4.0"
percent-encoding = { version = "2.0", optional = true }
-rand = { version = "0.8", default-features = false, features = ["std", "std_rng"], optional = true }
+rand = { version = "0.8", default-features = false, features = ["std", "std_rng","small_rng"], optional = true }
Contributor

Nit: add a space before "small_rng", please.

@hdost hdost merged commit 5fc4101 into open-telemetry:main Nov 12, 2023
13 checks passed
@hdost hdost deleted the feat/808-switch-to-small-rng branch November 12, 2023 15:48
Development

Successfully merging this pull request may close these issues.

Proposal: Alter RNG for TraceId and SpanId
4 participants