
column/table level change detection #11120

Draft
wants to merge 2 commits into main

Conversation

@re0312 (Contributor) commented Dec 28, 2023

Objective

Accelerate change detection by redundantly storing change ticks at the table and column level, so queries can skip whole tables or columns that contain no changed data (see the linked issue at the bottom of this page).

Solution

  • Introduce an AtomicU32 to track the last mutable access tick for tables and columns (a sketch follows this list).
  • Assign the responsibility of updating this information to the set_table and set_archetype methods.
  • Implement the archetype_filter_fetch method on QueryFilter to filter out tables or columns that have not been mutably accessed since the system last ran.
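
A minimal sketch of the idea, with illustrative names rather than this PR's exact code (Bevy's real Tick uses wrapping comparison; a plain integer compare is used here for simplicity):

use std::sync::atomic::{AtomicU32, Ordering};

// Each table/column carries the tick of its last mutable access.
struct Column {
    // ... component data elided ...
    last_mutable_access_tick: AtomicU32,
}

impl Column {
    // Called from set_table / set_archetype when a query with mutable
    // access binds to this column.
    fn update_last_mutable_access_tick(&self, this_run: u32) {
        self.last_mutable_access_tick
            .fetch_max(this_run, Ordering::AcqRel);
    }

    // Backs archetype_filter_fetch: the whole column can be skipped when
    // nothing has mutated it since the system last ran.
    fn possibly_changed_since(&self, last_run: u32) -> bool {
        self.last_mutable_access_tick.load(Ordering::Relaxed) > last_run
    }
}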

Notes

I have reservations about the current implementation: it is entangled with multiple data flows, which makes the code more fragile and potentially expensive. I would highly appreciate it if someone could propose an alternative implementation with fewer complexities and improved safety.

Performance

Intel 13400KF, NV 4070 Ti

[benchmark results image]

To comprehensively evaluate the change detection mechanism, additional benchmarking across various scenarios is still necessary.

Regressions:

[regression benchmark image]

The regression in spawn/insert is attributed to the introduction of an atomic operation per table per component.
The iter_fragmented_* benchmarks iterate over only a few entities per table, which makes maintaining the tick comparatively costly.
query.get_mut/get_component/get_unchecked(entity) is expensive because it sets the archetype (and updates its tick) for each entity retrieved.

It is worth mentioning that many_foxes frame time regressed from 3.06 ms to 4.21 ms. Yellow is main, red is this PR:

[frame time comparison image]

The hotspots are animation_player and propagate_transforms; their parallel use of get_unchecked for each entity appears to cause heavy cache-line invalidation across CPU cores.

[profiler screenshots]
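
To illustrate why that pattern is costly (a standalone sketch, not code from this PR): several threads performing an atomic read-modify-write per "entity" on one shared tick force the owning cache line to ping-pong between cores.

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // One shared tick, as in the per-entity get_unchecked path.
    let tick = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let tick = Arc::clone(&tick);
            thread::spawn(move || {
                // Every iteration is an atomic RMW on the same cache line.
                for t in 0..1_000_000u32 {
                    tick.fetch_max(t, Ordering::AcqRel);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("final tick: {}", tick.load(Ordering::Relaxed));
}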


@alice-i-cecile alice-i-cecile added C-Feature A new feature, making something new possible A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! labels Dec 28, 2023
@alice-i-cecile (Member)

Okay, really neat stuff. Thanks for trying this out and presenting those benchmark results! Initial thoughts:

  1. I think / hope we should be able to do this without the introduction of atomics. As you've seen, they introduce a meaningful performance overhead on all operations. The scheduler should already guarantee that only one system ever has mutable access to each archetype-component (I think this is what you called a table) at once. However, we will still need to be cautious in how we implement parallel iteration over queries directly. (A sketch of this non-atomic alternative follows the list.)
  2. Firmly agree on the need for more benchmarks for change detection before merging something in this space. I want to have a clear sense of the performance tradeoffs involved, and the
  3. We should consider making change detection configurable (Change Detection as Components / Opt-in or opt-out of Change Detection #4882) before merging this work: I suspect the correct trade-offs will vary substantially by component type.
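
A sketch of the non-atomic alternative from point 1, assuming the scheduler's exclusivity guarantee holds (hypothetical code; the safety invariant must be upheld externally):

use std::cell::UnsafeCell;

// A plain u32 tick; `Sync` is only sound because the scheduler promises
// that at most one system mutably accesses this archetype-component at a time.
struct ColumnTick(UnsafeCell<u32>);

// SAFETY: relies entirely on the scheduler-enforced exclusivity above.
unsafe impl Sync for ColumnTick {}

impl ColumnTick {
    // SAFETY: caller must hold scheduler-granted mutable access.
    unsafe fn bump(&self, tick: u32) {
        let slot = self.0.get();
        *slot = (*slot).max(tick);
    }

    fn get(&self) -> u32 {
        // SAFETY: in this sketch we assume no write is concurrent with
        // this read; the scheduler orders conflicting systems.
        unsafe { *self.0.get() }
    }
}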

@JMS55 (Contributor) commented Dec 28, 2023

Definitely excited to have this, as it opens up an opportunity for my reactive UI crate to skip regenerating the UI when a query has no changed data at the archetype/column level.

@re0312 (Contributor, Author) commented Dec 29, 2023

  1. I think / hope we should be able to do this without the introduction of atomics. As you've seen, they introduce a meaningful performance overhead on all operations. The scheduler should already guarantee that only one system ever has mutable access to each archetype-component (I think this is what you called a table) at once. However, we will still need to be cautious in how we implement parallel iteration over queries directly.

The current impl of this PR tracks both tables (built from their owned table-storage columns) and every column (both table storage and sparse-set storage). The scheduler cannot guarantee that only one system has mutable access to a table and a sparse-set column simultaneously. If we avoid atomics, the viable option would be to implement column-level change detection only for table-storage columns. (Of course, I think avoiding atomics is the right way to go, as it would greatly reduce overhead.)

  1. Firmly agree on the need for more benchmarks for change detection before merging something in this space. I want to have a clear sense of the performance tradeoffs involved, and the

Yeah, I think I can add more benchmarks related to change detection in another PR.

Comment on lines +547 to +558
pub fn read_last_mutable_access_tick(&self) -> Tick {
    Tick::new(
        self.last_mutable_access_tick
            .load(std::sync::atomic::Ordering::Relaxed),
    )
}

pub fn update_last_mutable_access_tick(&self, tick: Tick) {
    // self.last_mutable_access_tick
    //     .store(tick.get(), std::sync::atomic::Ordering::Relaxed);
    self.last_mutable_access_tick
        .fetch_max(tick.get(), std::sync::atomic::Ordering::AcqRel);
}
A Contributor commented:

Could you explain the choice for memory ordering? Naively, I would have expected the fetch_max to be Release and the load to be Acquire, so that we cannot miss changes. However I'm not confident in my understanding of atomic ordering.
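
For reference, the pairing the question describes would look like this (illustrative only; whether it is required depends on whether any non-atomic data is synchronized through this tick):

use std::sync::atomic::{AtomicU32, Ordering};

// Writer: Release makes the writer's prior writes visible to any reader
// that observes this value with Acquire.
fn update_last_mutable_access_tick(tick: &AtomicU32, this_run: u32) {
    tick.fetch_max(this_run, Ordering::Release);
}

// Reader: pairs with the Release store, so observing the tick also
// observes the writes that happened before it.
fn read_last_mutable_access_tick(tick: &AtomicU32) -> u32 {
    tick.load(Ordering::Acquire)
}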

@re0312 (Contributor, Author) replied:

I haven't deeply considered this aspect yet (so please correct me if I am wrong). My thought was that there might not be any other data flow that depends on the access tick, so a relaxed load would be acceptable. Bevy's scheduler may run systems that access the same table in parallel, but we cannot guarantee when each system will load or store the corresponding tick. When two systems write to the same tick concurrently, both store operations must be observed (hence fetch_max); for the load, even if a change is missed it is not a big deal, since our focus is on comparing the loaded value to the system's last_change_tick.

@hymm (Contributor) commented Jan 3, 2024

The only time we need to update change ticks in parallel is when we're using par_iter. Could we special case that scenario? Maybe use something like https://docs.rs/thread_local/1.1.7/thread_local/ and take the max of all the ticks after the par_iter is done and use that to update the archetype ticks.

Also adding branching inside the iterator is very bad for vectorization. You probably want to somehow update the matched archetypes. Probably needs benchmarking to be sure.
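
A sketch of that thread-local scheme using the thread_local crate (std threads stand in for par_iter; names are illustrative):

use std::cell::Cell;
use std::sync::Arc;
use thread_local::ThreadLocal;

fn main() {
    let local_max: Arc<ThreadLocal<Cell<u32>>> = Arc::new(ThreadLocal::new());

    // Stand-in for par_iter: each worker bumps only its own thread-local
    // cell, so the hot path has no cross-core atomic traffic.
    let handles: Vec<_> = (0..4u32)
        .map(|this_run| {
            let local_max = Arc::clone(&local_max);
            std::thread::spawn(move || {
                let cell = local_max.get_or(|| Cell::new(0));
                cell.set(cell.get().max(this_run));
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // After the parallel section, fold the per-thread maxima and apply
    // the result to the archetype tick in a single write.
    let max_tick = Arc::try_unwrap(local_max)
        .ok()
        .unwrap()
        .into_iter()
        .map(|cell| cell.get())
        .max()
        .unwrap_or(0);
    println!("archetype tick would be bumped to {max_tick}");
}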

@re0312 (Contributor, Author) commented Jan 3, 2024

The only time we need to update change ticks in parallel is when we're using par_iter. Could we special case that scenario? Maybe use something like https://docs.rs/thread_local/1.1.7/thread_local/ and take the max of all the ticks after the par_iter is done and use that to update the archetype ticks.

Also adding branching inside the iterator is very bad for vectorization. You probably want to somehow update the matched archetypes. Probably needs benchmarking to be sure.

Yeah, this impl is just an attempt to identify a method that balances ergonomics and performance, and it still needs more benchmarks. My current thoughts are:

  1. Archetype/table level change detection is not easy to implement in Bevy: an archetype is updated in parallel not only when using par_iter but also across system runs (multiple systems can run in parallel on the same archetype).
  2. It is not worth updating the archetype/table change tick with the same precision as Mut, i.e. on every deref_mut of each column/archetype.
  3. Similar to flecs, we could update the archetype/column level tick on its first mutable access in a system run; for example, Query<&mut T>.iter_mut().for_each would update the corresponding archetype tick even if the loop body does nothing (see the sketch below).
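
A sketch of option 3, with illustrative names rather than Bevy's actual types:

use std::sync::atomic::{AtomicU32, Ordering};

struct Column {
    last_mutable_access_tick: AtomicU32,
}

struct IterMut<'w> {
    _column: &'w Column,
    // ... iteration cursor elided ...
}

impl<'w> IterMut<'w> {
    // Constructing the mutable iterator pessimistically marks the whole
    // column as changed for this run, flecs-style: the tick advances even
    // if the caller's loop body never writes anything.
    fn new(column: &'w Column, this_run: u32) -> Self {
        column
            .last_mutable_access_tick
            .fetch_max(this_run, Ordering::AcqRel);
        IterMut { _column: column }
    }
}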

Successfully merging this pull request may close these issues.

Accelerate change detection by redundantly storing change ticks