Allow archetypes to clear and/or update existing components #3381

emilk · 2023-09-20T12:59:31Z

Backstory

provide empty() method on Rust archetype implementations to log empty #3279
Decide if archetypes logging APIs should log non-provided components as empty-lists #3341

What should the following code do:

log("points", Points3D::new(positions).with_colors(…));
…
log("points", Points3D::new(new_positions));

Should the new positions inherit the colors of the previous log call?

That would be quite non-intuitive.

Related to this: how does one update just the colors of a point cloud?

Design

We've all converged on the following API:

// We do NOT implement `Default` for archetypes.
// users need to explcitly use `clear` or `update`
struct Points3D {
   positions: Optional<Vec<Vec3>>,
   color: Optional<Vec<Color>>,
   radii: Optional<Vec<f32>>,
}

impl Points3D {
    /// Clear all components of this archetype to the empty list.
    fn clear() -> Self {
        Self {
            positions: Some(vec![]),
            color: Some(vec![]),
            radii: Some(vec![]),
        }
    }
    
    /// Create a new `Point3d` point cloud with some positions,
    /// and with emtpy lists for all other archetypes.
    fn new(positions: Vec<Vec3>) -> Self {
        Self {
            positions: Some(positions),
            ..Self::clear()
        }
    }
    
    /// An empty `Point3d` point cloud, with `None` for each component.
    fn update() -> Self {
        Self {
            positions: None,
            color: None,
            radii: None,
        }
    }
}

fn main() {
    // Log a point cloud:
    log("points", Points3D::new(…).with_colors(…));
    
    log("points", Clear{}); // will clear all components of all archetypes
    log("points", Points3D::clear()); // will clear this archetype of the entity
    log("points", Points3D::new(positions).with_colors(…)); // overwrites everything
    log("points", Points3D::update().with_colors(…)); // overwrites only colors

    // This is also possible, though a somewhat weird way to write it:
    log("points", Points3D::clear().with_colors(…)); // overwrites everything
}

Python:

log("points", Clear()); # will clear all components of all archetypes
log("points", Points3D.clear()); # will clear this archetype of the entity
log("points", Points3D(positions, colors=[…], radii=None); # overwrites everything, even radii
log("points", Points3D.update(colors=[…], radii=None)); # overwrites only colors and radii

Necessary changes

All components are Options, even the required ones
When querying, an empty lists should be treated the same as a missing components
In the UI we should probably hide empty components in the streams view

Component lists implications

A component list (e.g. radii) can have one of the four special values:

null/missing
0 length ("empty")
1 length ("splat")
N length (1<N)

null components aren't logged, but empty components are.
When querying the store, missing components are treated the same as empty components.

Related issues:

Visualize density graph of empty batches differently #7117

The text was updated successfully, but these errors were encountered:

emilk · 2023-09-25T08:49:04Z

If this proves to be more complex than anticipated, then we'll push it to 0.10

emilk · 2023-09-25T11:30:01Z

Needs a follow-up where we hide empty components from the streams view

emilk · 2023-09-27T05:38:09Z

We need to think through how num_instances would work for an update. We could take the max-length of all its components, but that won't work if you're only logging splats. Perhaps we don't need num_instances for updates; it's really only used for the primary (e.g. the indicator).

jleibs · 2023-10-10T18:18:47Z

This API should also handle specifying the InstanceKeys for the collection that's being updated.

teh-cmc · 2024-03-15T09:55:47Z

Instance keys and num_instances are not a problem anymore (or won't be, soon).

jleibs · 2024-07-16T12:51:56Z

@teh-cmc we were talking through this with respect to Transforms I realized this approach has the potential to cause some annoying performance issues for chunks.

In particular, every unused optional component will still require growing the offset array for each of the optional components by one element (in many cases, just a repeated 0). This is similar to the overhead of growing sparse unions (growing each child array), and strictly worse than the overhead of sparse unions (just growing one child array).

This suggests to me we're going to need some kind of chunk-specific optimization.

This also has an annoying ripple-effect on datafusion / dataframe UI of showing a bunch of empty / cleared (not nulled) columns all over the place.

### What * Closes #6909 * First steps on the road to #3381 * Creates a new issue: #7117 Best reviewed ignoring whitespace changes. The serializers for `Transform3D` and `LeafTransform3D` will now always write every component, effectively clearing all previously logged values. `Transform3D` uses `Option<C>` for each component, treating it as a single-element list component list where `None` is empty. If we ever want to support updating some components using the transform (as outlined in #3381) we would need to turn this into a tristate, i.e. something like `NoneOrZeroOrOne<C>`. `LeafTransform3D` uses `Option<Vec<C>>` for its components, but has no interface yet for logging an update. Both `Transform3D` and `LeafTransform3D` has a new `::clear()` constructor in Rust, that replaced both `::new()` and `::default()`. This is for added clarity, and to make the transition easier if we ever add an `::update()` constructor, as per #3381 ![image](https://github.com/user-attachments/assets/588224b8-daec-463d-b238-8263f578ae19) ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested the web demo (if applicable): * Using examples from latest `main` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7111?manifest_url=https://app.rerun.io/version/main/examples_manifest.json) * Using full set of examples from `nightly` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7111?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG * [x] If applicable, add a new check to the [release checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)! * [x] If have noted any breaking changes to the log API in `CHANGELOG.md` and the migration guide - [PR Build Summary](https://build.rerun.io/pr/7111) - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html) To run all checks from `main`, comment on the PR with `@rerun-bot full-check`. --------- Co-authored-by: Andreas Reich <andreas@rerun.io>

teh-cmc · 2024-12-18T16:55:58Z

Some notes on why I think (parts of) this is needed in order to unblock tagged components.

Context: partial updates in a tagged environment

We need an ergonomic way to work with raw components (as opposed to archetypes) that still allows for proper tagging.

As of today, as soon as we re-enable tags, all user-level logging code (including our own code that does a bunch of runtime writes for all things re_blueprint_*!) that mixes logging archetypes and logging components will turn into undefined behavior (which is why we disabled tagging entirely for 0.21 in the first place).

E.g.:

rr.set_time_sequence("frame", 42)
rr.log("points", rr.Points3D([[0, 0, 0], [1, 1, 1]], colors=[255, 0, 0]))

rr.set_time_sequence("frame", 42)
rr.log("points", [rr.components.ColorBatch([0, 0, 255])])

rr.set_time_sequence("frame", 43)
rr.log("points", [rr.components.ColorBatch([0, 255, 0])])

# Undefined behaviour, there are three columns in the database now:
# * `points@Points3D:Position3D#positions`
# * `points@Points3D:Color#colors`
# * `points@Color`
#
# What colors should the visualizer use for frame #42? `points@Points3D:Color#colors`, `points@Color`, something else entirely?
# What colors should the visualizer use for frame #43? `points@Points3D:Color#colors`, `points@Color`, something else entirely?
# What colors should the visualizer use for frame #100? `points@Points3D:Color#colors`, `points@Color`, something else entirely?

The correct thing to do, as of today, is something like this:

rr.set_time_sequence("frame", 42)
rr.log("points", rr.Points3D([[0, 0, 0], [1, 1, 1]], colors=[255, 0, 0]))

rr.set_time_sequence("frame", 42)
rr.log("points", [rr.components.ColorBatch([0, 0, 255]).with_descriptor_overrides(
    archetype_name = "rerun.archetypes.Points3D",
    archetype_field_name = "colors"
)])

rr.set_time_sequence("frame", 43)
rr.log("points", [rr.components.ColorBatch([0, 255, 0]).with_descriptor_overrides(
    archetype_name = "rerun.archetypes.Points3D",
    archetype_field_name = "colors"
)])

# There are only two columns in the database:
# * `points@Points3D:Position3D#positions`
# * `points@Points3D:Color#colors`
#
# The visualizer knows what to pick.

which obviously is a pretty hard thing to sell to end users as-is 🫠.

Path forward 1: make it ergonomic

The first possible path forward is very related to what is already proposed in this issue: make these things actually ergonomic!

rr.set_time_sequence("frame", 42)
rr.log("points", rr.Points3D([[0, 0, 0], [1, 1, 1]], colors=[255, 0, 0]))

rr.set_time_sequence("frame", 42)
rr.log("points", [rr.Points3D.update(colors=[0, 0, 255])]) # gets properly tagged

rr.set_time_sequence("frame", 43)
rr.log("points", [rr.Points3D.update(colors=[0, 255, 0])]) # gets properly tagged

# There are only two columns in the database:
# * `points@Points3D:Position3D#positions`
# * `points@Points3D:Color#colors`
#
# The visualizer knows what to pick.

Upsides:

Ergonomic.
Fixes the issue.
We will eventually need something like this in any case, so that users can work ergonomically with partial updates and/or tags.
Everything else mentioned in this ticket.

Downsides:

It's a big revamp of all logging APIs.
Everything else mentioned in this ticket.

Also keep in mind that we don't necessarily need all the features of the original proposal, just enough to get tags rolling (whether that is at all possible, I don't know).

Path forward 2: new query semantics

The other path forward is to introduce new query semantics. One set of semantics that comes fairly naturally to mind is "most-specific at timestamp wins" (whether it's actually a good idea in practice, who knows).

I.e. with the code from earlier:

rr.set_time_sequence("frame", 42)
rr.log("points", rr.Points3D([[0, 0, 0], [1, 1, 1]], colors=[255, 0, 0]))

rr.set_time_sequence("frame", 42)
rr.log("points", [rr.components.ColorBatch([0, 0, 255])])

rr.set_time_sequence("frame", 43)
rr.log("points", [rr.components.ColorBatch([0, 255, 0])])

# There are three columns in the database now:
# * `points@Points3D:Position3D#positions`
# * `points@Points3D:Color#colors`
# * `points@Color`
#
# What colors should the visualizer use for frame #42? `points@Points3D:Color#colors`, because it is the most specific one available at this timestamp.
# What colors should the visualizer use for frame #43? `points@Color`, because it is the most specific one available at this timestamp.
# What colors should the visualizer use for frame #100? `points@Color`, because it is the most specific one available at the closest timestamp.

Upsides:

Fixes the issue.
Does not even require touching user code! Maybe even helps us maintain backwards compat for old untagged RRD files a bit longer?
We will eventually need some set of new query semantics in any case, in order for tagged and untagged data to coexist in a peaceful and useful way.

Downsides:

Getting these semantics right will likely not be an easy feat.
Getting these semantics wrong can quickly turn into a rabbit hole of unforeseen issues and never-ending refactorings.

Conclusion

Ultimately, I do not think this is an either/or situation: we will need both better logging APIs and improved query semantics, eventually.

It's mostly a matter of which should we get to first in order to unblock tagged components the fastest? The former seems more approachable to me, especially if we restrain ourselves to only whatever subset is needed for tagging and partial updates (if that's at all possible) -- but I can be convinced otherwise.
(It should also be trivial to find many many more upsides and downsides for both of those, I haven't had any time to think it through really.)

jleibs · 2024-12-18T20:34:02Z

Thanks for the write up! The motivation for .update() makes sense as a mechanism of controlling tag-application.

The concept of "most-specific at timestamp wins" is already how I was imagining things would work, but I agree it's still nice to give users a way to avoid the ambiguity in the first place.

nikolausWest · 2024-12-19T12:49:28Z

My suggestion is to call it .components() rather than update() to be more explicit about what is happening. Update sounds like it's mutating the archetype

emilk added enhancement New feature or request ⛃ re_datastore affects the datastore itself codegen/idl labels Sep 20, 2023

This was referenced Sep 20, 2023

Decide if archetypes logging APIs should log non-provided components as empty-lists #3341

Closed

provide empty() method on Rust archetype implementations to log empty #3279

Closed

emilk added this to the 0.9 - Codegen, Type Oriented Api milestone Sep 20, 2023

emilk modified the milestones: 0.9 - Codegen, Type Oriented Api, 0.10 Polish Sep 25, 2023

emilk mentioned this issue Sep 30, 2023

New decorator/context manager to catch exceptions when not in strict mode. #3531

Merged

3 tasks

Wumpf mentioned this issue Oct 4, 2023

Automatically derive nullable if not attr.rerun.component_required #2373

Closed

emilk removed this from the 0.10 Polish (non-blocking) milestone Oct 23, 2023

emilk added this to the Triage milestone Nov 6, 2023

Wumpf mentioned this issue Feb 26, 2024

Define a mechanism to log data as explicitly invalid #5293

Open

teh-cmc added the 🪵 Log & send APIs Affects the user-facing API for all languages label Mar 15, 2024

emilk mentioned this issue Jun 19, 2024

Proposal: avoid empty arrays #6602

Open

Wumpf mentioned this issue Jul 16, 2024

Make Transform3D always log all its components #6909

Closed

emilk mentioned this issue Aug 8, 2024

Always log all components of Transform3D #7111

Merged

6 tasks

emilk mentioned this issue Oct 7, 2024

Convert components to arrow early in all log APIs #7620

Closed

emilk mentioned this issue Nov 4, 2024

Eager/early serialize of components to arrow in Rust and C++ #7245

Open

9 tasks

teh-cmc mentioned this issue Dec 18, 2024

Reenable tagging post-0.21 release #8500

Merged

teh-cmc mentioned this issue Jan 6, 2025

New partial updates logging APIs #8580

Open

Wumpf added a commit that referenced this issue Jan 23, 2025

Cleanup mentions of #3381

6bc19f3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow archetypes to clear and/or update existing components #3381

Allow archetypes to clear and/or update existing components #3381

emilk commented Sep 20, 2023 •

edited

Loading

emilk commented Sep 25, 2023

emilk commented Sep 25, 2023

emilk commented Sep 27, 2023

jleibs commented Oct 10, 2023

teh-cmc commented Mar 15, 2024

jleibs commented Jul 16, 2024

teh-cmc commented Dec 18, 2024

jleibs commented Dec 18, 2024

nikolausWest commented Dec 19, 2024

Allow archetypes to clear and/or update existing components #3381

Allow archetypes to clear and/or update existing components #3381

Comments

emilk commented Sep 20, 2023 • edited Loading

Backstory

Design

Necessary changes

Component lists implications

Related issues:

emilk commented Sep 25, 2023

emilk commented Sep 25, 2023

emilk commented Sep 27, 2023

jleibs commented Oct 10, 2023

teh-cmc commented Mar 15, 2024

jleibs commented Jul 16, 2024

teh-cmc commented Dec 18, 2024

Context: partial updates in a tagged environment

Path forward 1: make it ergonomic

Path forward 2: new query semantics

Conclusion

jleibs commented Dec 18, 2024

nikolausWest commented Dec 19, 2024

emilk commented Sep 20, 2023 •

edited

Loading