More uniform sampling #2440
base: master
Conversation
Hi Thorsten, I strongly believe throwing away the genericity of the sample_record is a bad idea; not everything we have now is best represented by a vector of doubles, and certainly not everything we might conceivably represent in the future. It also prevents us from using the type system to ensure the concordance of probe types and sample consumers. What is the motivation for discarding this? The simple sampler etc. and cable cell-specific sampling data representations can be changed if there's a need, but reducing a relatively small amount of complexity in representation through the removal of a large amount of expressive power is a bad trade-off.
Hi @halfflat, I have been doing a lot of model building / collaboration lately, and in terms of practical end-user experience:

- There are four types of metadata (`mlocation`, `[mcable]`, `point_info`, `[point_info]`) and two types of payload (`double`, `[double]`). In C++ we therefore need to either know the type of probe or rely on RTTI in some capacity. This is made
- The Python interface is disjoint from the C++ API. Here we deal with tables of samples at the end of
- The way interpolation works has bitten me and others before.

Overall, my experience is that sampling is quite complex, more powerful than needed in the cases I have come across,
Net effect would be
thus significantly less casting in (C++) user code and only ever the need to deal with one table per probe in Python. Next, I want to allow switching off interpolation and add a few canned samplers in addition to

So far, I am just playing with these ideas to see what the effect on production code looks like.
Also, since I am now done with the usage code in C++, you could take a look at
Just on a few of the points raised in the description and your notes above, before I look more thoroughly at the probe-demo code.
I think there are two issues that are at the root of the irritations:
On the second point, I think the generic use case, where user code really doesn't care about the semantic content of the probed data but just wants to present it in a generic way, is the less important case compared with the situation where the user code expects particular sorts of probed data for the purposes of further analysis, where a generic representation is less safe than a type-checked representation. But that doesn't mean the generic case shouldn't be better supported. A possible approach is to include in the probe metadata a (possibly empty) function that maps the type-abstracted sample records to e.g. a JSON or flatbuffer (plus flatbuffer schema) representation that in turn can be parsed and processed in a generic way by the user code.

For the first point, I have been wondering if we could adopt a Python async approach: the Python lib sets up callbacks that push sample data into queues which are drained asynchronously and which block forward simulation progress until the Python user-side code pulls out enough sample data. It would avoid, potentially, the issue of keeping all the sample data in memory before any processing or I/O can deal with it.
The performance / memory concerns of

I will think through your points more carefully, but I want to stress the central point that I hid among the details: this is about the user interface, because an elegant design doesn't help if it's annoying to use. The Rust folks use 'ergonomic', which fits the idea really well. I am also not the first person to think about sampling from this point of view, see #1929.
I guess I can't really see how the current design (in broad strokes; details of course can be improved, and simple_sampler also is a mess) is less frustrating than the alternative. Squishing everything into the one data representation throws away flexibility and type-safety — we're swapping one problem for another. If you can verify the type, you know the data structure and can work with it. If you have to infer the structure and meaning of the data from a generic representation, you have to do a lot more runtime work to avoid semantic errors. But also, and this is something I really don't see, what is so awful about calling
Just to expand on the first part of the reply: if we represent a set of
Sure, that isn't up for debate and was not relaxed. The core change is to acknowledge that most often things come in bunches, so instead of one value, there's now a row of values. What I wanted to point out is that there's an injective
Agreed, that wasn't really changed, except for me overeagerly flattening locations and cables, which I am unhappy with, see above. Expect that to be reverted. Metadata is still defined by the probe type. I am less sure about the value type, though. I am not really fond of building for generality without the need, and currently none of our probes return more than a value or an array thereof.
I fail to see how a user could or should know that; I'd rather protect them from such details.
Me too, but I am lacking the time and use case.
True, I am still in favour of that idea. However, let's make it easy on our users (easier at least).
On the point of probes defined over multisets: it's not a matter of protecting users from some weird behaviour; it's there because it is consistent with sampling more than one point defined over singletons: the same semantics, no matter the breakdown of multiple probes over multiple points. Just to be explicit: if I attach a sampler to two probes A and B of the same type, and A is defined over 3 points and B is defined over 5 points, the sampler will be called 8 times, once for each point. It has the same semantics as if it were defined over a single probe AB with 8 points in its locset. I agree that it could be represented as a vector of values — I just don't really see it as an advantage over the status quo.
Things are getting a bit more coherent to me.

```cpp
struct samples_type {
    size_t width;      // count for metadata and value columns
    size_t n_samples;  // count of sample times and value rows
    std::any meta;     // really std::variant<mcable*, mlocation*, point_info*> for now, but might be extended
    std::any values;   // payload data, size = n_samples * width, really only double* for now
    double* time;      // sample times, size = n_samples
};
```

and samplers can receive this:

```cpp
// Example
struct ARB_SYMBOL_VISIBLE cable_probe_ion_current_density {
    using metadata_type = mlocation;
    using value_type    = double;
    locset locations;
    std::string ion;
};
```

Then, consumers can write this in the majority of cases:

```cpp
// callback
void sampler(const samples_type& samples) {
    using probe = cable_probe_ion_current_density;
    auto* meta   = any_cast<const probe::metadata_type*>(samples.meta);
    auto* values = any_cast<const probe::value_type*>(samples.values);
    // iterate over width x n_samples table
    // ...
}
// if a sampler is attached to multiple, differently typed probes,
// that still needs handling, but that's no different
```

That would address my central worries in C++ land:
Python would get the same uniformity wrt tables vs. lists of tables, which fixes most of the problematic points there. And yes,
I like having the type map (as implemented with the typedefs) from probe address type to metadata and value type. I'd use any_ptr in preference to std::any in the blob passed to the sampler because it is harder to do something inefficient with any_ptr, and it further makes the ownership pattern clearer. Less enamoured with the
So, yes, not everything will fit a table of
Ah, I did misunderstand what you meant there with the table, sorry! Even heterogeneous-in-time data could be represented this way, although I agree that that is kinda mad. I think the table terminology threw me off, to be honest, but I believe I get it now.
I would have used DataFrame, but not everybody is used to Python/Pandas/Polars.
- Set metadata for `*_cell` probes to `mcable` and `mlocation` for the rest
- Add typedefs to probe types
- Completely re-work simple sampler making a `simpler_sampler` ;)
- Foundation for meta -> value type inference
- Tests are good on _Mac_
Intro
This PR explores a few directions for simplifying the sampling interface from the user perspective.
It's not the only way to approach the problems and possibly not the best.
Grievances
There are several stumbling blocks from that perspective that make the use of samplers unergonomic and hard to communicate.
Metadata

Metadata is delivered without type information as `std::any` and must be cast. Missing a `const` has consequences. The type must be determined by knowing the metadata type associated with the probe set in the recipe. Samplers and probes aren't connected directly to each other, but via a handler that doesn't carry (or allow retrieval of) more information. There are currently four types: `mlocation`, `[mcable]`, `point_info`, and `[point_info]`.

Scalar and Vector probes

Full cell probes generate samples with a vector payload `[double]` and vector metadata, while point probes generate what's essentially a vector of scalar (payload/metadata) probes.

More `any`

Each entry in the payload needs to be cast from `any` to a scalar or vector (of doubles). Again, a missing `const` is fatal.

Summary Effect

The user is required to distinguish four metadata types and apply the correct casts, while also applying matching casts to the payload. The information is disaggregated over simulation and recipe code, so using existing recipes comes at an extra cost. Moreover, simulations are somewhat brittle towards changes in the probe definitions, which require a full review of the attached samplers. Despite the obvious implication of the metadata type for the payload type, the user is required to track this information.

In Python this might seem less problematic due to dynamic typing, yet differentiating between lists and scalars leads to different code, which makes simulation scripts reliant on dynamic type info (`isinstance` and ilk).

Possible Remedies and Status
A list of ideas to reduce churn and verbosity:

- `any_cast`s. Seemed too little of an effect to implement.
- `any_cast` (per entry!) completely. I also removed the differentiation between location and cables, which I am less fond of. There's also the question whether `[any=mlocation|mcable|point_info]` or `any=[mlocation]|...` is preferable; neither really expresses what's going on, namely a homogeneous vector of unknown type.
- `(gid, 'tag')` (aka `cell_address_type`). Decided against it, as it doesn't help with the core (in my view) problems; users still need to track probe <-> sampler relations.
- `any` to `variant<...>`. Didn't add this -- for now -- since it requires one header to have all the metadata types. Although the set of possible metadata isn't really open, it's possible to change one type without rebuilding all of Arbor. On the positive side, it's more typesafe and transparent to the user. Doesn't help with Python, though.

As implemented, C++ and Python code for sampling is now significantly less busy, see `probe-demo.cpp`. The backend code is also somewhat more straightforward. Some conceptual feedback is needed before continuing.

EDIT: A sensible compromise might be this (not literally, but conceptually, as these are function arguments in Arbor).