Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ntuple] remove use of collection writer in importer #16478

Merged

Conversation

jblomer
Copy link
Contributor

@jblomer jblomer commented Sep 19, 2024

In preparation of removing the collection writer, remove its use in the RNTuple importer. Instead we back the projected leaf count arrays by untyped collections (field RSequenceCollectionField) of untyped records (RRecordField). That also streamlines and clarifies the use of untyped collections for the on-disk format: while previously the untyped collection leaf fields were directly under the collection parent field, they are now attached to an intermediate untyped record field. This way, untyped collections behave exactly like other sequence collections.

@jblomer jblomer force-pushed the ntuple-import-without-collection-writer branch from 93edd9d to 8854bb7 Compare September 19, 2024 14:30
@jblomer jblomer self-assigned this Sep 19, 2024
Copy link

github-actions bot commented Sep 19, 2024

Test Results

    13 files      13 suites   2d 21h 33m 9s ⏱️
 2 705 tests  2 704 ✅ 0 💤 1 ❌
33 064 runs  33 063 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit b7890cf.

♻️ This comment has been updated with latest results.

tree/ntuple/v7/doc/specifications.md Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RField/RFieldSequenceContainer.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/src/RNTupleDescriptor.cxx Outdated Show resolved Hide resolved
tree/ntuple/v7/src/RNTupleDescriptor.cxx Outdated Show resolved Hide resolved
tree/ntupleutil/v7/inc/ROOT/RNTupleImporter.hxx Outdated Show resolved Hide resolved
tree/ntupleutil/v7/inc/ROOT/RNTupleImporter.hxx Outdated Show resolved Hide resolved
@jblomer jblomer force-pushed the ntuple-import-without-collection-writer branch from 343af90 to c8671d8 Compare September 23, 2024 13:42
@jblomer jblomer force-pushed the ntuple-import-without-collection-writer branch from 9843e15 to 2cc15c4 Compare September 24, 2024 10:00
@jblomer jblomer requested a review from hahnjo September 24, 2024 13:28
@jblomer jblomer force-pushed the ntuple-import-without-collection-writer branch from 2cc15c4 to 2b4453e Compare September 27, 2024 12:38
@jblomer jblomer requested a review from pcanal September 27, 2024 12:40
model->AddField(std::move(collectionField));
model->Freeze();

auto v = std::static_pointer_cast<std::vector<MyStruct>>(model->GetDefaultEntry().GetPtr<void>("myCollection"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires the alignment inside MyStruct and the one assigned within the collection field to be exactly the same. Do we have a way to cross check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, through RRecordField::GetOffsets(). I added a check in the test.

@jblomer jblomer requested a review from pcanal September 30, 2024 02:40
std::vector<std::unique_ptr<RFieldBase>> leafFields;
leafFields.emplace_back(std::make_unique<RField<int>>("myInt"));
leafFields.emplace_back(std::make_unique<RField<float>>("myFloat"));
auto recordField = std::make_unique<RRecordField>("_0", leafFields);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any requirement on the name (first argument)? In particular does it 'have to' start with _underscore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the convention for the composed types in RNTuple. The sub fields (sub types) have the names _0, _1, ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User code will stray from that. Does it matter? If it does we need code to enforce the convention. If it does not matter what spelling is used here, then we are good to go.

Copy link
Contributor Author

@jblomer jblomer Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reading code will in fact accept any field name. Probably better though to enforce the convention. Edit: it's actually fine the way it is. Through projections, we can have vector subfields with names other than _0, and that's also the test that it works.

Comment on lines 344 to 345
int myInt;
float myFloat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int myInt;
float myFloat;
short myShort;
float myFloat;

To improve slightly the "quality" of the offset test (i.e. otherwise with alignment calculation or with 'just' adding sizes you get the same result).

"But" now this is a bad example (user should be recommend to compactify their data structure), so related question, do we have a tutorial for this feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the test accordingly.

I don't think that we want to advertise using untyped records or collections to general users. It is a mechanism that allows handling data outside the C++ defined EDMs, such as leaf count arrays or perhaps in the future 3rd party, non-C++ writers. For regular use cases, we want to encourage users to have proper dictionaries.

Copy link
Member

@pcanal pcanal Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't advertise it, this means that user that write non-compile scrips and don't bother with the "complexity" of generating a dictionary will take that same struct and create individual top level column for each data members. Is that what we want them to do (your answer seem to imply that the answer is yes).

On another note, when reading "[using untyped records] is a mechanism that allows handling data ... such as leaf count arrays". Does this means/implies that we only support 'leaf count array that are within the same (untyped) record"? [This might be fine but would be a departure from TTree]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my last sentence was confusing. I meant: we want to encourage users to build their EDMs with proper classes (interpreted or compiled, doesn't matter).

I think I don't understand your second paragraph.

Copy link
Member

@pcanal pcanal Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my last sentence was confusing. I meant: we want to encourage users to build their EDMs with proper classes (interpreted or compiled, doesn't matter).

So the only case where the untyped record could plays a role is the case compiled but no dictionary, right?

I think I don't understand your second paragraph.

I was trying to express my confusion on when we want this feature to be used and whether it implied the removal of a TTree leaflist feature or not. I.e. if a user has this:

int count;
struct A {
    int *values; // is controled by [count]
    float something;
};
struct B {
  float *energy;  // is controled by [count]
  bool isValid;
};

the following is valid/working (besides typos):

tree->Branch("count", &count);
tree->Branch("aval", &a_value, "values[count]/I:something/F");
tree->Branch("bval", &b_value, "energy[count]/F:isValid/O");

Copy link
Contributor Author

@jblomer jblomer Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, as far as I can see we don't support this feature of the leaf list. (I wasn't aware of it.) I think we should add support but here we probably need to duplicate the count field (which may end up as deduplicated page on disk, but not in memory). I think this is unrelated to this PR though, we never did support this feature.

Currently, the only use for untyped collections and records is importing leaf lists and count leaf arrays. Users could use it for compiled code without a dictionary. But that would be cumbersome to code and I don't think that's a good idea. I would prefer TClass knowing about the classes that are being written to an RNTuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I cannot get this TTree feature to work. I am trying with

std::unique_ptr<TFile> file(TFile::Open(fileGuard.GetPath().c_str(), "RECREATE"));
auto tree = std::make_unique<TTree>("tree", "");

Int_t count = 1;
struct {
   Int_t values[1]; // actual size is controlled by [count]
   float foo = 2.0;
} leafList;
leafList.values[0] = 1;
tree->Branch("count", &count);
tree->Branch("leaflist", &leafList, "values[count]/I:foo/F");

tree->Fill();
tree->Write();

And I get

'TLeafI::GetLen' reading 'Leaf counter is greater than maximum!  leaf: 'values' len: 1 max: 0'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My example has a thinko, the variable size array needs to be last:

auto tree = std::make_unique<TTree>("tree", "");

Int_t count = 1;
struct {
   float foo = 2.0;
   Int_t values[1]; // actual size is controlled by [count]
} leafList;
leafList.values[0] = 1;
tree->Branch("count", &count);
tree->Branch("leaflist", &leafList, "foo/F:values[count]/I");

tree->Fill();

but exactly 'that' is somewhat rare. What is very common is a slight variation:

auto tree = std::make_unique<TTree>("tree", "");
static constexpr int kMaxPart = 20;
 
struct {
   int count = 0;
   float px[kMaxPart];
   float py[kMaxPart];
} leafList;

tree->Branch("leaflist_count", &count);
tree->Branch("leaflist_px", &leafList.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist_py", &leafList.py[0], "py.[leaflist_count]/F");

tree->Fill();

where the 'unfortunate' part is the 'hand_coded/non_enforced' prefix with underscore.

and we have also seen the next kind:

static constexpr int kMaxPart = 20;
 
int count = 0;
struct {
   float px[kMaxPart];
   float py[kMaxPart];
} leafList, leafList2;

tree->Branch("count", &count);
tree->Branch("leaflist_px", &leafList.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist_py", &leafList.py[0], "py.[leaflist_count]/F");
tree->Branch("leaflist2_px", &leafList2.px[0], "px.[leaflist_count]/F");
tree->Branch("leaflist2_py", &leafList2.py[0], "py.[leaflist_count]/F");

where in practice leafList2 might be the same type as leafList or a variation with similar arrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we need to look into these cases one by one for the importer.

I don't think it should be part of this PR though. I created a separate issue: #16583

Is there anything that still blocks this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the PR is fine. This thread is a side bar about something we need to decide whether we are missing or need to document (eg. how we recommend to migrate).

@jblomer jblomer requested a review from pcanal October 1, 2024 07:25
@jblomer jblomer merged commit 9d03b32 into root-project:master Oct 2, 2024
17 of 20 checks passed
@jblomer jblomer deleted the ntuple-import-without-collection-writer branch October 2, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants