Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unexpected behaviour of generate_subcatalogs #241

Merged
merged 6 commits into from
Dec 9, 2020

Conversation

fnattino
Copy link
Contributor

Fixes #240

@codecov-io
Copy link

codecov-io commented Nov 19, 2020

Codecov Report

Merging #241 (5651517) into develop (2a0c850) will increase coverage by 0.18%.
The diff coverage is 87.50%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #241      +/-   ##
===========================================
+ Coverage    93.73%   93.91%   +0.18%     
===========================================
  Files           30       32       +2     
  Lines         3749     3963     +214     
===========================================
+ Hits          3514     3722     +208     
- Misses         235      241       +6     
Impacted Files Coverage Δ
pystac/catalog.py 95.58% <87.50%> (-0.26%) ⬇️
pystac/__init__.py 100.00% <0.00%> (ø)
pystac/extensions/scientific.py 97.36% <0.00%> (ø)
pystac/extensions/sat.py 98.24% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a0c850...5651517. Read the comment docs.

Copy link
Member

@lossyrob lossyrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good - thank you! A couple of thoughts:

The way this works now, the generate_subcatalogs recurses down to the child catalogs and then crawls up it's parent tree to cache parent catalog IDs. This ends up iterating over the catalog multiple times - both down the tree with for child in self.get_children(): and during the while loop you introduce. Can you structure it in such a way that it reduced the iterations? I think passing forward the subcat_id_to_cat as an optional parameter might do the trick.

Also, could you provide one or more unit tests that would cover the case mentioned in #240? Might be worth trying to test any edge cases that may arise around catalog merging - e.g. what happens if two subcatalogs have the same ID? This is often the case for subcatalogs that are based on date - so a subcatalog of day "15" might exist for a subcatalog of month "1", but also for month "2". I'm not sure how this code will behave in this scenario.

@fnattino fnattino force-pushed the fix/generate_subcatalogs branch from 3b66b84 to 5651517 Compare November 23, 2020 21:12
@fnattino
Copy link
Contributor Author

Thanks a lot for having a look! I have followed your suggestion and removed the redundant iteration over the catalog by passing over the parent catalog IDs. I have changed the subsequent loop over items slightly: I use the list of parent IDs to check whether subcatalogs need to be added, but I crawl down the template levels using the subcatalogs' get_child method. I think this should solve most of the problems with the same ID appearing in different sub-catalog levels, as this should naturally follow the catalog's branches.

I will add unit tests with something like #240 and other cases!

@fnattino
Copy link
Contributor Author

I have added unit tests for the following edge cases:

  • if the sub-catalog structure already match the template, generate_subcatalogs should do nothing (or, in other words, it should be possible to call generate_subcatalogs multiple times without side effects);
  • A consistent sub-catalog structure should be obtained if items are added to a catalog at multiple stages and generate_subcatalogs is called every times items are added (case of Unexpected behaviour of generate_subcatalogs #240);
  • generate_subcatalogs should behave for catalog structures where the same sub-catalog ID appears in different catalog branches (as in the month/day example above);
  • generate_subcatalogs should also work for structures where the same sub-catalog ID appears at different levels in the same branch, as for the template ${property1}/${property2} where both property1 and property2 can assume the same value. I am thinking e.g. to the eo:row and eo:column properties in the AWS Landsat catalog..

@fnattino fnattino requested a review from lossyrob November 24, 2020 20:34
Copy link
Member

@lossyrob lossyrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, well tested. Just a couple of questions on a bit of code and I think this is good to go!

Comment on lines 557 to 560
id_iter = reversed(parent_ids)
if all(['{}'.format(id) == next(id_iter, None)
for id in reversed(item_parts.values())]):
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The '{}'.format(id) here is redundant, and could just be id, yeah?

Is the reversing to try and reduce iterations as the likely case that the more root-leaning parent IDs would match? If so the iteration of the reverse between the id_iter and item_parts would cancel this out I think - and I believe this is equivalent to the forward-iterating case, so perhaps the reverse's can be dropped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The '{}'.format(id) here is redundant, and could just be id, yeah?

Actually the values in the dictionary returned by layout_template.get_template_values(item) are not necessarily strings (they have the actual property type), so they need to be converted in order to compare to the IDs in the catalog structure. I have used the same syntax as further below (line 551 in the original version of the code) for consistency.

Is the reversing to try and reduce iterations as the likely case that the more root-leaning parent IDs would match? If so the iteration of the reverse between the id_iter and item_parts would cancel this out I think - and I believe this is equivalent to the forward-iterating case, so perhaps the reverse's can be dropped?

Here I want to to check whether the template sub-catalog structure matches the actual one, but on the item-leaning side. Reversing allows me to align the current and the desired structures using the item position as a reference. Suppose the items are in the catalog /my-catalog/my-collection/2020/12/01 and the template is ${year}/${month}/${day} (i.e. parent_ids = ['my-catalog', 'my-collection', '2020', '12', '01'] and item_parts.values() = ['2020', '12', '01']): by reversing both the structure and the template I can verify that they match on the outermost side, thus allowing me to skip to the next element.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the values in the dictionary returned by layout_template.get_template_values(item) are not necessarily strings

Ah! Of course. My mistake.

Here I want to to check whether the template sub-catalog...

Ok, that makes sense now. The parent IDs can be longer than the item_parts, and you're only exhausting the item parts - that was the part I was missing. Would you mind adding some comments to the effect of the comment above, as it'll make things a bit easier to parse for future readers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, sorry for the cryptic code :)

@fnattino fnattino requested a review from lossyrob December 7, 2020 09:40
Copy link
Member

@lossyrob lossyrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected behaviour of generate_subcatalogs
3 participants