-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unexpected behaviour of generate_subcatalogs
#241
Fix unexpected behaviour of generate_subcatalogs
#241
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #241 +/- ##
===========================================
+ Coverage 93.73% 93.91% +0.18%
===========================================
Files 30 32 +2
Lines 3749 3963 +214
===========================================
+ Hits 3514 3722 +208
- Misses 235 241 +6
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good - thank you! A couple of thoughts:
The way this works now, the generate_subcatalogs
recurses down to the child catalogs and then crawls up it's parent tree to cache parent catalog IDs. This ends up iterating over the catalog multiple times - both down the tree with for child in self.get_children():
and during the while loop you introduce. Can you structure it in such a way that it reduced the iterations? I think passing forward the subcat_id_to_cat
as an optional parameter might do the trick.
Also, could you provide one or more unit tests that would cover the case mentioned in #240? Might be worth trying to test any edge cases that may arise around catalog merging - e.g. what happens if two subcatalogs have the same ID? This is often the case for subcatalogs that are based on date - so a subcatalog of day "15" might exist for a subcatalog of month "1", but also for month "2". I'm not sure how this code will behave in this scenario.
3b66b84
to
5651517
Compare
Thanks a lot for having a look! I have followed your suggestion and removed the redundant iteration over the catalog by passing over the parent catalog IDs. I have changed the subsequent loop over items slightly: I use the list of parent IDs to check whether subcatalogs need to be added, but I crawl down the template levels using the subcatalogs' I will add unit tests with something like #240 and other cases! |
I have added unit tests for the following edge cases:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, well tested. Just a couple of questions on a bit of code and I think this is good to go!
pystac/catalog.py
Outdated
id_iter = reversed(parent_ids) | ||
if all(['{}'.format(id) == next(id_iter, None) | ||
for id in reversed(item_parts.values())]): | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The '{}'.format(id)
here is redundant, and could just be id
, yeah?
Is the reversing to try and reduce iterations as the likely case that the more root-leaning parent IDs would match? If so the iteration of the reverse
between the id_iter and item_parts would cancel this out I think - and I believe this is equivalent to the forward-iterating case, so perhaps the reverse's can be dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The '{}'.format(id) here is redundant, and could just be id, yeah?
Actually the values in the dictionary returned by layout_template.get_template_values(item)
are not necessarily strings (they have the actual property type), so they need to be converted in order to compare to the IDs in the catalog structure. I have used the same syntax as further below (line 551 in the original version of the code) for consistency.
Is the reversing to try and reduce iterations as the likely case that the more root-leaning parent IDs would match? If so the iteration of the reverse between the id_iter and item_parts would cancel this out I think - and I believe this is equivalent to the forward-iterating case, so perhaps the reverse's can be dropped?
Here I want to to check whether the template sub-catalog structure matches the actual one, but on the item-leaning side. Reversing allows me to align the current and the desired structures using the item position as a reference. Suppose the items are in the catalog /my-catalog/my-collection/2020/12/01
and the template is ${year}/${month}/${day}
(i.e. parent_ids = ['my-catalog', 'my-collection', '2020', '12', '01']
and item_parts.values() = ['2020', '12', '01']
): by reversing both the structure and the template I can verify that they match on the outermost side, thus allowing me to skip to the next element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the values in the dictionary returned by layout_template.get_template_values(item) are not necessarily strings
Ah! Of course. My mistake.
Here I want to to check whether the template sub-catalog...
Ok, that makes sense now. The parent IDs can be longer than the item_parts, and you're only exhausting the item parts - that was the part I was missing. Would you mind adding some comments to the effect of the comment above, as it'll make things a bit easier to parse for future readers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, sorry for the cryptic code :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Fixes #240