Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BatchedMesh: Update Example, some optimization #27202

Merged
merged 8 commits into from
Nov 19, 2023

Conversation

gkjohnson
Copy link
Collaborator

@gkjohnson gkjohnson commented Nov 17, 2023

Related issue: #22376, #27168 (comment)

Description

  • Updates BatchedMesh demo so it can display up to 20,000 geometries
  • Updates BatchedMesh demo to provide a toggle for per-object frustum culling
  • Updates the calculation of the texture dimensions so it uses a smaller, non-power-of-two texture
  • Update BatchedMesh so that the "onBeforeRender" multi-draw buffer regeneration only occurs when visibility changes or a new geometry is added.

OnBeforeRender CPU Performance Metrics

Both perObjectFrustumCulled and sortObjects have performance implications when a lot of meshes are used. With 20,000 geometries on my 2021 M1 Pro takes this long:

perObjectFrustumCulled = true perObjectFrustumCulled = false
sortObjects = true ~10ms ~6ms
sortObjects = false ~4.75ms ~0ms

If both sortObjects and perObjectFrustumCulled are false and visibility has been toggled (meaning the multi draw buffer needs to be regenerated) it takes ~0.6ms.

Overdraw Metrics

And here are some performance numbers with the same fields toggled when the camera is zoomed and we have a lot of overdraw happening without sorted elements.

perObjectFrustumCulled = true perObjectFrustumCulled = false
sortObjects = true ~105fps ~77fps
sortObjects = false ~88fps ~88fps

I was a bit surprised by the dip when only sortObjects === true but it's possible that the rendering is a bit vertex bound and the GPU time + the CPU time pushes us over the frame boundary.

image

When the camera is zoomed out, though, having both sortObjects and frustumCulled take more time and the framerate is lower. It's possible we should default these settings to false?

image

cc @mrdoob

Copy link

github-actions bot commented Nov 17, 2023

📦 Bundle size

Full ESM build, minified and gzipped.

Filesize dev Filesize PR Diff
668.2 kB (165.8 kB) 668.4 kB (165.9 kB) +215 B

🌳 Bundle size after tree-shaking

Minimal build including a renderer, camera, empty scene, and dependencies.

Filesize dev Filesize PR Diff
449.4 kB (108.9 kB) 449.4 kB (108.9 kB) +0 B

@gkjohnson
Copy link
Collaborator Author

In terms of improving the performance of both of these functions - @N8python might have a radix sort implementation that could be used to improve the sort time.

And this would be the user's responsibility but sorting and frustum culling could be performed asynchronously over multiple frames to amortize the time. I think we can save looking into enabling something this for when we see a problem, though.

@gkjohnson gkjohnson requested a review from mrdoob November 17, 2023 11:18
@gkjohnson gkjohnson added this to the r159 milestone Nov 18, 2023
@sciecode
Copy link
Contributor

sciecode commented Nov 18, 2023

In terms of improving the performance of both of these functions - @N8python might have a radix sort implementation that could be used to improve the sort time.

I highly discourage from using a naive LSD (least significant digit) radix-sort implementation. This is one of those cases where the theoretical time complexity of the algorithm doesn't translate well to the real-world. LSD radix-sort suffers severely from cache locality incoherence, as soon as the underlying array stop fitting in cache (L2/L3), performance tanks drastically. There's also another consideration, it's not an adaptive algorithm, so it performs much worse than even regular array.sort() on partially ordered data.

A while back I've tried to come up with a 32-bit hybrid sort ( radix MSD / insertion ) that solves both of these problems. It is adaptive ( performs better on partially-ordered data ), stable ( retains original element order on ties ) and cache efficient ( reduces cache misses ).

I've re-written it to accomodate for some regular use cases. Follows an example API:

// ...
const tmp = new Array( array.length );
radixSort( array, {
  reverse: true, // sorts in decreasing order - default: false
  aux: tmp, // optional auxiliar array of same length, prevents internal re-allocation over multiple calls
  get: ( el ) => el.value // optional getter - allows sorting array of objects
});

Follows two tables containing the performance tests results, with mean execution time & performance ratio against native sort(). I've also included an optimized version of the linked radixLSD implementation.

random data - results

(N) 1K 5K 10K 50K 100K 500K 1M
sort 0.190ms 1.159ms 2.592ms 16.533ms 37.794ms 279.764ms 669.860ms
radixLSD 0.031ms 6.166 0.183ms 6.320 0.449ms 5.771 2.644ms 6.253 9.285ms 4.070 333.907ms 0.838 799.146ms 0.838
hybridMSD 0.022ms 8.746 0.211ms 5.493 0.556ms 4.659 2.308ms 7.164 5.255ms 7.192 97.870ms 2.859 263.316ms 2.544

partially ordered ( 99% ) - results

(N) 1K 5K 10K 50K 100K 500K 1M
sort 0.026ms 0.125ms 0.254ms 1.468ms 3.044ms 17.422ms 33.489ms
radixLSD 0.033ms 0.767 0.199ms 0.630 0.449ms 0.566 2.484ms 0.591 6.468ms 0.471 277.133ms 0.063 419.517ms 0.080
hybridMSD 0.017ms 1.491 0.084ms 1.501 0.460ms 0.553 1.437ms 1.022 2.826ms 1.077 13.425ms 1.298 24.021ms 1.394

If it's considered useful, feel free use and modify it, or let me know so I can make the required changes. I believe there are a couple of small optimizations that can still be made. I'll update the linked code if changes are made.

@gkjohnson
Copy link
Collaborator Author

gkjohnson commented Nov 19, 2023

Thanks for the insight!

random data - results
sort | 669.860ms
hybridMSD | 263.316ms

partially ordered ( 99% )
sort | 33.489ms
hybridMSD | 24.021ms

I hadn't thought about this but it makes sense. Right now all the elements are resorted from the original insertion order but the most common case would be retaining the sorted order between frames. If there's a simple way to start from the previously sorted order then it could speed things up quite a bit.

If it's considered useful, feel free use and modify it, or let me know so I can make the required changes. I believe there are a couple of small optimizations that can still be made. I'll update the linked code if changes are made.

I'll give this a try in another PR - I think that for now things are good enough for common cases but there's clearly room for improvement in more complex situations.

@Mugen87 Mugen87 merged commit a90b693 into mrdoob:dev Nov 19, 2023
11 checks passed
@gkjohnson gkjohnson deleted the batched-example branch November 19, 2023 11:30
AdaRoseCannon pushed a commit to AdaRoseCannon/three.js that referenced this pull request Jan 15, 2024
* visible -> visibility

* Remove comment

* Improve scale of transform texture

* Skip onBeforeRender if possible

* Set visibility changed to false

* Make sure visibility change flag is toggled on geometry change

* Set on geometry change instead of geometry add
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants