More uniform sampling #2440

Draft
thorstenhater wants to merge 26 commits into master
Conversation

@thorstenhater (Contributor) commented Feb 6, 2025

Intro

This PR explores a few directions for simplifying the sampling interface from the user perspective.
It's not the only way to approach the problems and possibly not the best.

Grievances

There are several stumbling blocks from that perspective that make the use of samplers unergonomic and hard to communicate.

Metadata

Metadata is delivered without type information as std::any and must be cast. Missing a const has consequences. The type must be determined by knowing the metadata type associated with the probe set in the recipe. Samplers and probes aren't connected directly to each other, but via a handler that doesn't carry (or allow retrieval of) more information. There are currently four types: mlocation, [mcable], point_info, and [point_info].
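
For illustration, a sketch of the metadata handling this implies (assuming Arbor's sampler callback shape as discussed later in this thread; details may vary):

#include <cstddef>

#include <arbor/cable_cell.hpp>
#include <arbor/sampling.hpp>
#include <arbor/util/any_ptr.hpp>

// Recovering metadata means matching the exact type, const included.
void metadata_sampler(arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    if (auto* loc = arb::util::any_cast<const arb::mlocation*>(pm.meta)) {
        // scalar probe placed at a single location
    }
    else if (auto* cables = arb::util::any_cast<const arb::mcable_list*>(pm.meta)) {
        // whole-cell probe: one cable per payload column
    }
    // any_cast<arb::mlocation*>(pm.meta) -- const dropped -- returns nullptr.
}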

Scalar and Vector probes

Full-cell probes generate samples with a vector payload [double] and vector metadata, while point probes generate what's essentially a vector of scalar (payload/metadata) probes.

More any

Each entry in the payload needs to be cast from any to a scalar or a vector (of doubles). Again, a missing const is fatal.
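
A sketch of what this looks like in practice (again assuming the callback shape described later in this thread; exact type names may differ):

#include <cstddef>

#include <arbor/cable_cell.hpp>
#include <arbor/sampling.hpp>
#include <arbor/util/any_ptr.hpp>

void payload_sampler(arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    for (std::size_t i = 0; i < n; ++i) {
        // Point probe: one double per record; the const is mandatory.
        if (auto* v = arb::util::any_cast<const double*>(recs[i].data)) {
            // *v is the value at recs[i].time
        }
        // Whole-cell probe: a range of doubles per record instead.
        else if (auto* r = arb::util::any_cast<const arb::cable_sample_range*>(recs[i].data)) {
            // r->first .. r->second is one row of values at recs[i].time
        }
    }
}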

Summary Effect

The user is required to distinguish four metadata types and apply the correct casts, while also applying matching casts to the payload. The information is disaggregated over simulation and recipe code, so using existing recipes comes at an extra cost. Moreover, simulations are somewhat brittle with respect to changes in the probe definitions, which require a full review of the attached samplers. Although the metadata type obviously implies the payload type, the user is required to track this information.

In Python this might seem less problematic due to dynamic typing, yet differentiating between lists and scalars leads to different code, which makes simulation scripts reliant on dynamic type info (isinstance and its ilk).

Possible Remedies and Status

A list of ideas to reduce churn and verbosity.

  • Link metadata and payload type at the type level, reifying the 'vector metadata implies vector payload' relation. Leads to fewer unconstrained any_casts. Seemed too little of an effect to implement.
  • Always make metadata and payload a vector. That's what I decided on here; it's the most radical approach. It removes one any_cast (per entry!) completely. I also removed the differentiation between locations and cables, which I am less fond of. There's also the question whether [any=mlocation|mcable|point_info] or any=[mlocation]|... is preferable; neither really expresses what's going on, namely a homogeneous vector of unknown type.
  • Make multi-location probes produce vector samplers, too. Again, included in the PR, since it seems more intuitive.
  • Allow retrieval of samples by (gid, 'tag') (aka cell_address_type). Decided against it, as it doesn't help with the core (in my view) problems; users still need to track probe <-> sampler relations.
  • Change the metadata type from any to variant<...>. Didn't add this -- for now -- since it requires one header to have all the metadata types. Although the set of possible metadata isn't really open, it's currently possible to change one type without rebuilding all of Arbor. On the positive side, variant is more type-safe and transparent to the user. Doesn't help with Python, though.

As implemented, C++ and Python code for sampling is now significantly less busy, see probe-demo.cpp. The backend code is also somewhat more straightforward. Some conceptual feedback is needed before continuing.

EDIT: A sensible compromise might be this (not literally, but conceptually, as these are function arguments in Arbor):

struct samples {
  size_t width;      // count of metadata entries and value columns
  size_t n_samples;  // count of sample times and value rows
  std::variant<mcable*, mlocation*, point_info*> meta; // could be std::any
  double* values;    // payload data, size = n_samples * width
  double* time;      // sample times, size = n_samples
};
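
A variant would also let consumers enumerate the alternatives exhaustively; a hypothetical sketch of walking this table via std::visit:

#include <cstddef>
#include <variant>

// consume() is a hypothetical helper over the struct sketched above.
void consume(const samples& s) {
    std::visit([&](auto* meta) {
        for (std::size_t row = 0; row < s.n_samples; ++row) {
            for (std::size_t col = 0; col < s.width; ++col) {
                // s.values[row*s.width + col] is the value at s.time[row]
                // for the location described by meta[col].
            }
        }
    }, s.meta);
}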

@halfflat (Contributor) commented Feb 7, 2025

Hi Thorsten,

I strongly believe throwing away the genericity of the sample_record is a bad idea; not everything we have now is best represented by a vector of doubles, and certainly not everything we might conceivably represent in the future. It also prevents us from using the type system to ensure the concordance of probe types and sample consumers. What is the motivation for discarding this?

The simple sampler etc. and cable cell-specific sampling data representations can be changed if there's a need, but reducing a relatively small amount of complexity in representation through the removal of a large amount of expressive power is a bad trade-off.

@thorstenhater (Author) commented:

Hi @halfflat,

I have been doing a lot of model building / collaboration lately, and in terms of practical end-user experience, the sample extraction is powerful but annoying to actually use. This concerns cable cells almost exclusively.

There are four types of metadata (mlocation, [mcable], point_info, [point_info]) and two types of payload (double, [double]). In C++ we therefore need to either know the type of the probe or rely on RTTI in some capacity. This is made worse by the fact that information is split over recipe and simulation. Also, locsets with multiple entries produce multiple streams, but the *_cell probes generate one stream with a vector of entries. That seems inconsistent. The performance is secondary, but paying an any_cast plus a dereference per sampled value seems expensive.

The Python interface is disjoint from the C++ API. Here we deal with tables of samples at the end of simulation::run.
The difference between 'vector' and 'scalar' probes persists, requiring users to deal with lists and scalar-valued metadata.

The way interpolation works has bitten me and others before.

Overall, my experience is that sampling is quite complex, more powerful than needed in the cases I have come across, and thus hard to teach. On the implementation side there's a lot of code dedicated to maintaining the complexity. This has led me to believe that some simplification is required, and this is my concrete plan for it:

  1. Turn all values into vectors (in C++ these become ranges; Python will always produce tables)
  2. Collapse scalar and vector probes (optionally also use the same metadata)
  3. Decide on double as the one true payload

Net effect would be

  1. two types of metadata: [mcable], [point_info]
  2. one type of payload: [double]

thus significantly less casting in (C++) user code and only ever the need to deal with one table per probe in Python. At the same time, the implementation code becomes simpler (currently slightly net negative, but I haven't started removing scalar probes yet). One might argue that performance might be improved, too, since we have fewer casts and less data duplication (the time column will occur only once per locset probe). All at the cost of a second pointer, however, which would be paid by all probes, not only cable probes. We could work around that by having a double* instead of a range and extracting the size from the metadata list instead. Another choice would be to use a vector of variants for the metadata, but that also feels off, since the type of the elements is homogeneous.
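
To make this concrete, a sketch of what a cable sampler might reduce to under the plan (hypothetical; it assumes metadata is always [mcable] and the payload a range of doubles):

#include <cstddef>

#include <arbor/cable_cell.hpp>
#include <arbor/sampling.hpp>
#include <arbor/util/any_ptr.hpp>

void uniform_sampler(arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    // Only one metadata shape remains for cable probes:
    auto* cables = arb::util::any_cast<const arb::mcable_list*>(pm.meta);
    for (std::size_t i = 0; i < n; ++i) {
        // ... and only one payload shape: a row of doubles, one per cable.
        auto* row = arb::util::any_cast<const arb::cable_sample_range*>(recs[i].data);
        // row->first[j] is the value at recs[i].time for (*cables)[j].
    }
}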

Next, I want to allow switching off interpolation and add a few canned samplers in addition to simple_sampler, like streaming out to (HDF5) files. (Also, I was thinking about removing any in favour of variant, as the system isn't really open, but I think that's just churn without payoff.)

So far, I am just playing with these ideas to see what the effect on production code looks like. I am very interested in your (and other people's) input here, since this would impact a lot of user code. I am still convinced, though, that a simpler interface with fewer divergent paths is a good thing here. My core question would be where and when the power offered by this setup would be used.

@thorstenhater (Author) commented:

Also, since I am now done with the usage code in C++, you could take a look at probe-demo.cpp and see what you think from the user's point of view, @halfflat.

@thorstenhater changed the title from 'Py/stream lined probing' to 'More uniform sampling' on Feb 12, 2025
@halfflat (Contributor) commented:

Just on a few of the points raised in the description and your notes above, before I look more thoroughly at the probe-demo code.

  • The necessity of alignment between the expected metadata type or sample data type and those provided by the simulation should, in my opinion, be regarded as a feature rather than a curse: it's there, instead of a more lax alternative, to ensure that there is semantic alignment between the probe specifications and the sampler handles.

    For example, the metadata for describing a mechanism state variable is much more complex than that for a membrane voltage probe. If they were both represented by the same type admitting a more-or-less arbitrary representation (e.g. a JSON representation), the type system wouldn't be there to stop a sampler which expects state data being misapplied to a probe which provides voltage data. Similarly, differences in sample data representation present a semantic promise enforced through the type system, and moreover would be necessary in any case for hypothetical probes which return more structured data.

  • There is type data in the metadata and sample data returns, but it's only useful for verifying that the type you are getting is the type you expect. What type you should expect is a property of the API and expressed in the documentation; I acknowledge that errors are caught at run-time rather than link-time, but as far as I can tell there is no other way to express this through a generic C++ interface. The run-time checks, though, do prevent the class of 'dynamic typing' errors that a user would otherwise potentially face, as in the previous point.

  • any_cast on an any_ptr is light-weight: it simply compares the type-id and returns nullptr or the actual pointer accordingly. The type comparison is cheap for the expected true case, typically coming down to a pointer equality test (this can give a false negative in some cases with dynamically loaded objects and then take longer, if I understand correctly).

    If the overhead of any_cast on each sample value is a concern, we could change the sampling callback function type to std::function<void (probe_metadata, std::size_t, const time_type* time, any_ptr data)> (see the sketch after this list), though it could make the casting a little less intuitive: n values of type double would either be represented by e.g. a const pointer to n const pointer values (const double* const*), or we shuffle things in the cell group to remove the second indirection and just wrap a const double*.

  • The historical reason for returning point-samples with a locset address separately is because we were sort of forced into it by our morphological algebra — it's easier to allow a general locset than to assert that the locset resolves to a single point. But I still feel it's a useful convention: sampling over a probe with basis locset that resolves to two actual locations x and y is much more like sampling over two probes whose bases resolve to x and y respectively. In each case there is an invocation of the sampling callback per concrete location.

  • The sample value types and metadata types shouldn't be regarded as a closed set if we're still in favour of Bring Your Own Cell Group. Which I still am, to be honest.

  • Decoupling sampling plumbing from probe descriptions allows for the same recipe to be used with a variety or a subset of sampling schemes. It allows certain sorts of recipes (i.e. not in their full generality) to be represented in a 'flat' way, e.g. by text files. The same model of cells and connectivity can be used then in different contexts. It also allows for sampling to be changed at run time, e.g. by sampling more stuff more often if the user code sees something interesting.
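
For concreteness, the flattened callback floated in the second bullet might look like this (a sketch only, not current Arbor API; it assumes the cell group removes the second indirection):

#include <cstddef>

#include <arbor/common_types.hpp>
#include <arbor/sampling.hpp>
#include <arbor/util/any_ptr.hpp>

// One any_ptr cast per callback invocation instead of one per sample record.
void flat_sampler(arb::probe_metadata pm, std::size_t n,
                  const arb::time_type* time, arb::util::any_ptr data) {
    if (auto* v = arb::util::any_cast<const double*>(data)) {
        for (std::size_t i = 0; i < n; ++i) {
            // v[i] is the sample taken at time[i].
        }
    }
}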

I think there are two issues that are at the root of the irritations:

  1. Having an open set of sample types is awkward to accommodate in Python, especially because for performance reasons we don't support sampler callbacks into python code. Instead we have to somehow generically accumulate them for presentation at the end.

  2. Generic handling of probe results, as in probe-demo and in the python sample handling, is a poor fit for a variety of sample types.

On the second point, I think the generic use case, where user code really doesn't care about the semantic content of the probed data but just wants to present it in a generic way, is the less important case when compared to the situation where the user code expects particular sorts of probed data for the purposes of further analysis, where a generic representation is less safe than a type-checked representation.

But that doesn't mean the generic case shouldn't be better supported. A possible approach is to include in the probe metadata a (possibly empty) function that maps the type-abstracted sample records to e.g. a JSON or flatbuffer (plus flatbuffer schema) representation that in turn can be parsed and processed in a generic way by the user code.
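
A hypothetical sketch of that shape (names invented for illustration, not an existing Arbor API):

#include <cstddef>
#include <functional>
#include <string>

#include <arbor/sampling.hpp>

// Hypothetical: metadata could carry an optional translator that renders the
// type-abstracted records into a generic representation (JSON here).
struct generic_view {
    std::function<std::string(const arb::sample_record*, std::size_t)> to_json; // may be empty
};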

For the first point, I have been wondering if we could adopt a Python async approach: the Python lib sets up callbacks that push sample data into queues which are drained asynchronously and which block forward simulation progress until the Python user-side code pulls out enough sample data. It would avoid, potentially, the issue of keeping all the sample data in memory before any processing or I/O can deal with it.

@thorstenhater (Author) commented Feb 13, 2025

The performance / memory concerns around py::simulation::run giving all data in one chunk aren't on the list of grievances, but very much on my mind. I don't think that injecting async is going to help, as it would require an async runtime and runs into the coloured function issue. It also doesn't make the overhead between Python and C++ disappear. My current recommendation to users is to loop py::simulation::run if it becomes an issue.

I will think through your points more carefully, but I want to stress the central point that I hid among the details: This is about the user interface, because an elegant design doesn't help if it's annoying to use. The Rust folks use 'ergonomic', which fits the idea really well. I am also not the first person to think about sampling from this point of view, see #1929

@halfflat (Contributor) commented:

I guess I can't really see how the alternative is less frustrating than the current design (in broad strokes; details of course can be improved, and simple_sampler is also a mess). Squishing everything into the one data representation throws away flexibility and type-safety — we're swapping one problem for another. If you can verify the type, you know the data structure and can work with it. If you have to infer the structure and meaning of the data from a generic representation, you have to do a lot more runtime work to avoid semantic errors.

But also, and this is something I really don't see, what is so awful about calling any_cast?

@halfflat (Contributor) commented:

Just to expand on the first part of the reply: if we represent a set of mlocations as mcables, something that gets mcables but expects a set of mlocations would then have to verify that the mcables describe a point set if it wants to ensure that it hasn't hooked itself up to the wrong probe. Just returning the correct type avoids this problem in the first place.
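
For instance, a consumer handed cables but expecting points would need a check along these lines (a sketch using mcable's prox_pos/dist_pos members):

#include <algorithm>

#include <arbor/morph/primitives.hpp>

// A cable describes a point exactly when it is degenerate.
bool is_point_set(const arb::mcable_list& cables) {
    return std::all_of(cables.begin(), cables.end(),
                       [](const arb::mcable& c) { return c.prox_pos == c.dist_pos; });
}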

@thorstenhater (Author) commented:

* The necessity of alignment between the expected metadata type or sample data type and those provided by the simulation should, in my opinion, be regarded as a feature rather than a curse: it's there, instead of a more lax alternative, to ensure that there is semantic alignment between the probe specifications and the sampler handles.

Sure, that isn't up for debate and was not relaxed. The core change is to acknowledge that most often things come in bunches, so instead of one value, there's now a row of values. What I wanted to point out is that there's an injective relation metadata -> value_type that's only fixed in the documentation (and checked at runtime).

  For example, the metadata for describing a mechanism state variable is much more complex than that for a membrane voltage probe. If they were both represented by the same type admitting a more-or-less arbitrary representation (e.g. a JSON representation), the type system wouldn't be there to stop a sampler which expects state data being misapplied to a probe which provides voltage data. Similarly, differences in sample data representation present a semantic promise enforced through the type system, and moreover would be necessary in any case for hypothetical probes which return more structured data.

Agreed, that wasn't really changed, except for me overeagerly flattening locations and cables, which I am unhappy with, see above. Expect that to be reverted. Metadata is still defined by the probe type.

I am less sure about the value type, though. I am not really fond of building for generality without the need, and currently none of our probes return more than a value or an array thereof.

* The historical reason for returning point-samples with a locset address separately is because we were sort of forced into it by our morphological algebra — it's easier to allow a general locset than to assert that the locset resolves to a single point. But I still feel it's a useful convention: sampling over a probe with basis locset that resolves to two actual locations _x_ and _y_ is much more like sampling over two probes whose bases resolve to _x_ and _y_ respectively. In each case there is an invocation of the sampling callback per concrete location.

I fail to see how a user could or should know that; I would rather protect them from such details.

* The sample value types and metadata types shouldn't be regarded as a closed set if we're still in favour of Bring Your Own Cell Group. Which I still am, to be honest.

Me too, but I am lacking the time and use case.

* Decoupling sampling plumbing from probe descriptions allows for the same recipe to be used with a variety or a subset of sampling schemes. It allows certain sorts of recipes (i.e. not in their full generality) to be represented in a 'flat' way, e.g. by text files. The same model of cells and connectivity can be used then in different contexts. It also allows for sampling to be changed at run time, e.g. by sampling more stuff more often if the user code sees something interesting.

True, I am still in favour of that idea. However, let's make it easy on our users (easier at least).

@halfflat (Contributor) commented Feb 13, 2025

On the point of probes defined over multisets: it's not a matter of protecting users from some weird behaviour; it's still there because it is consistent with sampling more than one point defined over singletons. The same semantics, no matter the breakdown of multiple probes over multiple points.

Just to be explicit: if I attach a sampler to two probes A and B of the same type, and A is defined over 3 points and B is defined over 5 points, the sampler will be called 8 times, once for each point. It has the same semantics as if it were defined over a single probe AB with 8 points in its locset.

I agree that it could be represented as a vector of values — I just don't really see it as an advantage over the status quo.

@thorstenhater (Author) commented:

Things are getting a bit more coherent to me.

struct samples_type {
  size_t width;      // count of metadata entries and value columns
  size_t n_samples;  // count of sample times and value rows
  std::any meta;     // really std::variant<mcable*, mlocation*, point_info*> for now, but might be extended
  std::any values;   // payload data, size = n_samples * width, really only double* for now
  double* time;      // sample times, size = n_samples
};

and samplers can receive this struct or its components

// Example
struct ARB_SYMBOL_VISIBLE cable_probe_ion_current_density {
    using metadata_type = mlocation;
    using value_type = double;
    locset locations;
    std::string ion;
};

Then, consumers can write this in the majority of cases

// callback
void sampler(const samples_type& samples) {
    using probe = cable_probe_ion_current_density;
    auto* meta = any_cast<const probe::metadata_type*>(samples.meta);
    auto* values = any_cast<const probe::value_type*>(samples.values);
    // iterate over the n_samples x width table
    for (size_t ix = 0; ix < samples.n_samples; ++ix) {
        for (size_t iy = 0; iy < samples.width; ++iy) {
            // values[ix*samples.width + iy] is the value at time[ix] for meta[iy]
        }
    }
}

If a sampler is attached to multiple, differently typed probes, that still needs handling, but that's no different from before. It also doesn't happen to Python users, who use one_probe exclusively.

That would be addressing my central worries in C++ land:

  • remove the scalar / vector distinction and acknowledge all data is a table. This makes things way easier to explain.
  • there's some help for the user to handle the casting.
  • casting happens once per sampler call, not once per sample.

Python would get the same uniformity with respect to tables vs. lists of tables, which fixes the most problematic points there.

And yes, simple_sampler has been found lacking, but I don't want to fix it here. Similarly, expect non-interpolating cell probes soon.

@halfflat (Contributor) commented:

I like having the type map (as implemented with the typedefs) from probe address type to metadata and value type. I'd use any_ptr in preference to std::any in the blob passed to the sampler because it is harder to do something inefficient with any_ptr, and further makes the ownership pattern clearer.

Less enamoured with the width field and the table representation. Not everything will naturally fit into a table of doubles, so I don't think it should be baked into the sampler callback API, even if that sort of works for the current state of cable cells. If table representations are good for Python users, that's fine, but can we do the translation in the Python shim? If it's really important to have a generic data representation, I'd much prefer to have a translation function provided with the metadata that can convert the natural representation into the generic one, be it a table, or JSON, or something else even.

@thorstenhater (Author) commented:

So, yes, not everything will fit a table of doubles, but notice that I relaxed that to essentially be a table of runtime-determined type. What do you foresee would not fit? Structured data is covered, as long as it is homogeneous in time, either as AoS or SoA. Heterogeneous data seems wildly exotic to me in the general context of Arbor. Scalar data is covered by producing a 1 x n_samples 'table'. Note that I ascribe significant value to keeping C++ and Python closely aligned; it's never going to be 100%, but we can try.

@halfflat (Contributor) commented:

Ah, I did misunderstand what you meant there with the table, sorry! Even heterogeneous-in-time data could be represented this way, although I agree that that is kinda mad.

I think the table terminology threw me off, to be honest, but I believe I get it now.

@thorstenhater (Author) commented:

I would have used DataFrame, but not everybody is used to Python/Pandas/Polars.

- Set metadata for *_cell probes to mcable and mlocation for the rest
- Add typedefs to probe types
- Completely re-work simple sampler making a simpler_sampler ;)
- Foundation for meta -> value type inference
- Tests are good on _Mac_