-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ntuple] remove use of collection writer in importer #16478
[ntuple] remove use of collection writer in importer #16478
Conversation
93edd9d
to
8854bb7
Compare
Test Results 13 files 13 suites 2d 21h 33m 9s ⏱️ For more details on these failures, see this check. Results for commit b7890cf. ♻️ This comment has been updated with latest results. |
343af90
to
c8671d8
Compare
9843e15
to
2cc15c4
Compare
2cc15c4
to
2b4453e
Compare
model->AddField(std::move(collectionField)); | ||
model->Freeze(); | ||
|
||
auto v = std::static_pointer_cast<std::vector<MyStruct>>(model->GetDefaultEntry().GetPtr<void>("myCollection")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires the alignment inside MyStruct
and the one assigned within the collection field to be exactly the same. Do we have a way to cross check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, through RRecordField::GetOffsets()
. I added a check in the test.
std::vector<std::unique_ptr<RFieldBase>> leafFields; | ||
leafFields.emplace_back(std::make_unique<RField<int>>("myInt")); | ||
leafFields.emplace_back(std::make_unique<RField<float>>("myFloat")); | ||
auto recordField = std::make_unique<RRecordField>("_0", leafFields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any requirement on the name (first argument)? In particular does it 'have to' start with _underscore
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the convention for the composed types in RNTuple. The sub fields (sub types) have the names _0
, _1
, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User code will stray from that. Does it matter? If it does we need code to enforce the convention. If it does not matter what spelling is used here, then we are good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reading code will in fact accept any field name. Probably better though to enforce the convention. Edit: it's actually fine the way it is. Through projections, we can have vector subfields with names other than _0
, and that's also the test that it works.
tree/ntuple/v7/test/ntuple_show.cxx
Outdated
int myInt; | ||
float myFloat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int myInt; | |
float myFloat; | |
short myShort; | |
float myFloat; |
To improve slightly the "quality" of the offset
test (i.e. otherwise with alignment calculation or with 'just' adding sizes you get the same result).
"But" now this is a bad example (user should be recommend to compactify their data structure), so related question, do we have a tutorial for this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the test accordingly.
I don't think that we want to advertise using untyped records or collections to general users. It is a mechanism that allows handling data outside the C++ defined EDMs, such as leaf count arrays or perhaps in the future 3rd party, non-C++ writers. For regular use cases, we want to encourage users to have proper dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't advertise it, this means that user that write non-compile scrips and don't bother with the "complexity" of generating a dictionary will take that same struct and create individual top level column for each data members. Is that what we want them to do (your answer seem to imply that the answer is yes).
On another note, when reading "[using untyped records] is a mechanism that allows handling data ... such as leaf count arrays". Does this means/implies that we only support 'leaf count array that are within the same (untyped) record"? [This might be fine but would be a departure from TTree
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my last sentence was confusing. I meant: we want to encourage users to build their EDMs with proper classes (interpreted or compiled, doesn't matter).
I think I don't understand your second paragraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my last sentence was confusing. I meant: we want to encourage users to build their EDMs with proper classes (interpreted or compiled, doesn't matter).
So the only case where the untyped record could plays a role is the case compiled but no dictionary, right?
I think I don't understand your second paragraph.
I was trying to express my confusion on when we want this feature to be used and whether it implied the removal of a TTree leaflist
feature or not. I.e. if a user has this:
int count;
struct A {
int *values; // is controled by [count]
float something;
};
struct B {
float *energy; // is controled by [count]
bool isValid;
};
the following is valid/working (besides typos):
tree->Branch("count", &count);
tree->Branch("aval", &a_value, "values[count]/I:something/F");
tree->Branch("bval", &b_value, "energy[count]/F:isValid/O");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, as far as I can see we don't support this feature of the leaf list. (I wasn't aware of it.) I think we should add support but here we probably need to duplicate the count field (which may end up as deduplicated page on disk, but not in memory). I think this is unrelated to this PR though, we never did support this feature.
Currently, the only use for untyped collections and records is importing leaf lists and count leaf arrays. Users could use it for compiled code without a dictionary. But that would be cumbersome to code and I don't think that's a good idea. I would prefer TClass
knowing about the classes that are being written to an RNTuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I cannot get this TTree feature to work. I am trying with
std::unique_ptr<TFile> file(TFile::Open(fileGuard.GetPath().c_str(), "RECREATE"));
auto tree = std::make_unique<TTree>("tree", "");
Int_t count = 1;
struct {
Int_t values[1]; // actual size is controlled by [count]
float foo = 2.0;
} leafList;
leafList.values[0] = 1;
tree->Branch("count", &count);
tree->Branch("leaflist", &leafList, "values[count]/I:foo/F");
tree->Fill();
tree->Write();
And I get
'TLeafI::GetLen' reading 'Leaf counter is greater than maximum! leaf: 'values' len: 1 max: 0'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My example has a thinko, the variable size array needs to be last:
auto tree = std::make_unique<TTree>("tree", "");
Int_t count = 1;
struct {
float foo = 2.0;
Int_t values[1]; // actual size is controlled by [count]
} leafList;
leafList.values[0] = 1;
tree->Branch("count", &count);
tree->Branch("leaflist", &leafList, "foo/F:values[count]/I");
tree->Fill();
but exactly 'that' is somewhat rare. What is very common is a slight variation:
auto tree = std::make_unique<TTree>("tree", "");
static constexpr int kMaxPart = 20;
struct {
int count = 0;
float px[kMaxPart];
float py[kMaxPart];
} leafList;
tree->Branch("leaflist_count", &count);
tree->Branch("leaflist_px", &leafList.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist_py", &leafList.py[0], "py.[leaflist_count]/F");
tree->Fill();
where the 'unfortunate' part is the 'hand_coded/non_enforced' prefix with underscore.
and we have also seen the next kind:
static constexpr int kMaxPart = 20;
int count = 0;
struct {
float px[kMaxPart];
float py[kMaxPart];
} leafList, leafList2;
tree->Branch("count", &count);
tree->Branch("leaflist_px", &leafList.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist_py", &leafList.py[0], "py.[leaflist_count]/F");
tree->Branch("leaflist2_px", &leafList2.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist2_py", &leafList2.py[0], "py.[leaflist_count]/F");
where in practice leafList2
might be the same type as leafList
or a variation with similar arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we need to look into these cases one by one for the importer.
I don't think it should be part of this PR though. I created a separate issue: #16583
Is there anything that still blocks this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the PR is fine. This thread is a side bar about something we need to decide whether we are missing or need to document (eg. how we recommend to migrate).
Co-authored-by: Florine de Geus <git@fwdg.nl>
Co-authored-by: Florine de Geus <git@fwdg.nl>
Co-authored-by: Florine de Geus <git@fwdg.nl>
Co-authored-by: Florine de Geus <git@fwdg.nl>
Co-authored-by: Jonas Hahnfeld <hahnjo@hahnjo.de>
3a7c14e
to
b7890cf
Compare
In preparation of removing the collection writer, remove its use in the RNTuple importer. Instead we back the projected leaf count arrays by untyped collections (field
RSequenceCollectionField
) of untyped records (RRecordField
). That also streamlines and clarifies the use of untyped collections for the on-disk format: while previously the untyped collection leaf fields were directly under the collection parent field, they are now attached to an intermediate untyped record field. This way, untyped collections behave exactly like other sequence collections.