Fix unexpected behaviour of `generate_subcatalogs` #241

fnattino · 2020-11-19T08:44:26Z

Fixes #240

codecov-io · 2020-11-19T08:46:30Z

Codecov Report

Merging #241 (5651517) into develop (2a0c850) will increase coverage by 0.18%.
The diff coverage is 87.50%.

@@             Coverage Diff             @@
##           develop     #241      +/-   ##
===========================================
+ Coverage    93.73%   93.91%   +0.18%     
===========================================
  Files           30       32       +2     
  Lines         3749     3963     +214     
===========================================
+ Hits          3514     3722     +208     
- Misses         235      241       +6

Impacted Files	Coverage Δ
pystac/catalog.py	`95.58% <87.50%> (-0.26%)`	⬇️
pystac/__init__.py	`100.00% <0.00%> (ø)`
pystac/extensions/scientific.py	`97.36% <0.00%> (ø)`
pystac/extensions/sat.py	`98.24% <0.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

lossyrob

This is looking good - thank you! A couple of thoughts:

The way this works now, generate_subcatalogs recurses down to the child catalogs and then crawls up its parent tree to cache parent catalog IDs. This ends up iterating over the catalog multiple times: both down the tree with `for child in self.get_children():` and during the while loop you introduce. Can you structure it in such a way that it reduces the iterations? I think passing forward the subcat_id_to_cat as an optional parameter might do the trick.

Also, could you provide one or more unit tests that would cover the case mentioned in #240? Might be worth trying to test any edge cases that may arise around catalog merging - e.g. what happens if two subcatalogs have the same ID? This is often the case for subcatalogs that are based on date - so a subcatalog of day "15" might exist for a subcatalog of month "1", but also for month "2". I'm not sure how this code will behave in this scenario.

fnattino · 2020-11-23T21:43:18Z

Thanks a lot for having a look! I have followed your suggestion and removed the redundant iteration over the catalog by passing over the parent catalog IDs. I have changed the subsequent loop over items slightly: I use the list of parent IDs to check whether subcatalogs need to be added, but I crawl down the template levels using the subcatalogs' get_child method. I think this should solve most of the problems with the same ID appearing in different sub-catalog levels, as this should naturally follow the catalog's branches.
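The idea of passing the parent IDs forward can be illustrated with a minimal sketch. It uses a toy `Node` class and a `walk` helper, both hypothetical and not part of pystac, and only shows how a single downward pass can carry the ancestor IDs along without crawling back up the tree:

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """Toy stand-in for a catalog (hypothetical, not a pystac class)."""

    def __init__(self, node_id: str) -> None:
        self.id = node_id
        self.children: Dict[str, "Node"] = {}

    def add_child(self, child: "Node") -> "Node":
        self.children[child.id] = child
        return child


def walk(node: Node,
         parent_ids: Optional[List[str]] = None) -> List[Tuple[str, List[str]]]:
    """Visit each node once, pairing its ID with the IDs of its ancestors."""
    parent_ids = [] if parent_ids is None else parent_ids
    visited = [(node.id, parent_ids)]
    for child in node.children.values():
        # A new list is passed down so sibling branches do not share state.
        visited.extend(walk(child, parent_ids + [node.id]))
    return visited


# Month/day layout where the day ID "15" appears under two different months.
root = Node("root")
root.add_child(Node("1")).add_child(Node("15"))
root.add_child(Node("2")).add_child(Node("15"))
paths = walk(root)
```

Because each node receives its own ancestor list, the two "15" nodes stay distinguishable by their branch even though they share an ID.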

I will add unit tests with something like #240 and other cases!

fnattino · 2020-11-24T20:33:21Z

I have added unit tests for the following edge cases:

- If the sub-catalog structure already matches the template, generate_subcatalogs should do nothing (in other words, it should be possible to call generate_subcatalogs multiple times without side effects).
- A consistent sub-catalog structure should be obtained if items are added to a catalog at multiple stages and generate_subcatalogs is called every time items are added (the case described in #240).
- generate_subcatalogs should behave correctly for catalog structures where the same sub-catalog ID appears in different catalog branches (as in the month/day example above).
- generate_subcatalogs should also work for structures where the same sub-catalog ID appears at different levels in the same branch, as for the template ${property1}/${property2} where both property1 and property2 can assume the same value. I am thinking, e.g., of the eo:row and eo:column properties in the AWS Landsat catalog.
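The edge cases listed above can be sketched on a toy model. This is not the pystac implementation or its test suite: catalogs are plain dictionaries, and `new_catalog` and this `generate_subcatalogs` function are hypothetical stand-ins that only mimic the behaviour discussed in this thread.

```python
import copy
from typing import Any, Dict, List, Optional


def new_catalog(cat_id: str) -> Dict[str, Any]:
    """Toy catalog: an ID, direct items, and child catalogs keyed by ID."""
    return {"id": cat_id, "items": [], "children": {}}


def generate_subcatalogs(cat: Dict[str, Any], keys: List[str],
                         parent_ids: Optional[List[str]] = None) -> None:
    """Move items into sub-catalogs named after the values of `keys`.

    Assumes `keys` is non-empty and every item defines all of them.
    """
    parent_ids = list(parent_ids or []) + [cat["id"]]
    # Handle existing children first, passing the ancestor IDs forward.
    for child in cat["children"].values():
        generate_subcatalogs(child, keys, parent_ids)

    kept = []
    for item in cat["items"]:
        parts = [str(item[k]) for k in keys]
        # Item is already in place if the trailing ancestors match the template.
        if parent_ids[-len(parts):] == parts:
            kept.append(item)
            continue
        node = cat
        for part in parts:
            node = node["children"].setdefault(part, new_catalog(part))
        node["items"].append(item)
    cat["items"] = kept


root = new_catalog("root")
root["items"] = [{"id": "a", "year": 2020, "month": 1},
                 {"id": "b", "year": 2020, "month": 2}]
generate_subcatalogs(root, ["year", "month"])
snapshot = copy.deepcopy(root)

# Calling it again should leave the structure untouched.
generate_subcatalogs(root, ["year", "month"])
unchanged = root == snapshot

# Items added at a later stage should land in the existing branches.
root["items"].append({"id": "c", "year": 2020, "month": 1})
generate_subcatalogs(root, ["year", "month"])
january_ids = [i["id"] for i in root["children"]["2020"]["children"]["1"]["items"]]
```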

lossyrob

Nice work, well tested. Just a couple of questions on a bit of code and I think this is good to go!

lossyrob · 2020-12-01T02:30:18Z

pystac/catalog.py

+            id_iter = reversed(parent_ids)
+            if all(['{}'.format(id) == next(id_iter, None)
+                    for id in reversed(item_parts.values())]):
+                continue


The '{}'.format(id) here is redundant, and could just be id, yeah?

Is the reversing meant to reduce iterations, on the assumption that the more root-leaning parent IDs are likely to match? If so, reversing both id_iter and item_parts would cancel this out, I think. I believe this is equivalent to the forward-iterating case, so perhaps the reversed calls can be dropped?

fnattino

> The '{}'.format(id) here is redundant, and could just be id, yeah?

Actually the values in the dictionary returned by layout_template.get_template_values(item) are not necessarily strings (they have the actual property type), so they need to be converted in order to compare to the IDs in the catalog structure. I have used the same syntax as further below (line 551 in the original version of the code) for consistency.
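A small illustration of why the conversion matters, using made-up template values (the dictionary below is a hypothetical example, not output from pystac):

```python
# Hypothetical template values that keep their native (integer) types.
item_parts = {"year": 2020, "month": 12}
# Catalog IDs are strings.
parent_ids = ["2020", "12"]

# Comparing the raw values to the IDs fails because 2020 != "2020".
raw_match = list(item_parts.values()) == parent_ids

# Formatting each value as a string first makes the comparison meaningful.
str_match = ['{}'.format(v) for v in item_parts.values()] == parent_ids
```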

> Is the reversing meant to reduce iterations, on the assumption that the more root-leaning parent IDs are likely to match? If so, reversing both id_iter and item_parts would cancel this out, I think. I believe this is equivalent to the forward-iterating case, so perhaps the reversed calls can be dropped?

Here I want to check whether the template sub-catalog structure matches the actual one, but on the item-leaning side. Reversing allows me to align the current and the desired structures using the item position as a reference. Suppose the items are in the catalog /my-catalog/my-collection/2020/12/01 and the template is ${year}/${month}/${day} (i.e. parent_ids = ['my-catalog', 'my-collection', '2020', '12', '01'] and item_parts.values() = ['2020', '12', '01']): by reversing both the structure and the template I can verify that they match on the item-leaning side, which allows me to skip to the next item.
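The alignment described here can be tried in isolation. The helper name `already_matches` is hypothetical; the comparison itself follows the snippet under review:

```python
from typing import Any, List


def already_matches(parent_ids: List[str], item_values: List[Any]) -> bool:
    """Check whether the trailing parent IDs equal the template values."""
    id_iter = reversed(parent_ids)
    # Walk both sequences from the item-leaning end. Only the template values
    # are exhausted, so extra root-leaning parent IDs are ignored.
    return all('{}'.format(v) == next(id_iter, None)
               for v in reversed(item_values))


in_place = already_matches(
    ['my-catalog', 'my-collection', '2020', '12', '01'], ['2020', '12', '01'])
not_yet = already_matches(
    ['my-catalog', 'my-collection'], ['2020', '12', '01'])
too_short = already_matches(['12', '01'], ['2020', '12', '01'])
```

Here `in_place` is true, while `not_yet` and `too_short` are false: in the last case the parent IDs run out before the template values do, so `next(id_iter, None)` yields `None` and the comparison fails.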

lossyrob

> Actually the values in the dictionary returned by layout_template.get_template_values(item) are not necessarily strings

Ah! Of course. My mistake.

> Here I want to check whether the template sub-catalog...

Ok, that makes sense now. The parent IDs can be longer than the item_parts, and you're only exhausting the item parts - that was the part I was missing. Would you mind adding some comments to the effect of the comment above, as it'll make things a bit easier to parse for future readers?

fnattino

Good point, sorry for the cryptic code :)

lossyrob

Thank you!

initialize subcatalog dictionary in generate_subcatalogs

d1b561b

lossyrob requested changes Nov 20, 2020

fnattino added 2 commits November 23, 2020 21:59

fix generate_subcatalogs for edge-cases

2015bf5

fix formatting

5651517

fnattino force-pushed the fix/generate_subcatalogs branch from 3b66b84 to 5651517 Compare November 23, 2020 21:12

fnattino added 2 commits November 24, 2020 20:21

convert template parts to string

10e8f79

add unit tests

985cc55

fnattino requested a review from lossyrob November 24, 2020 20:34

lossyrob requested changes Dec 1, 2020

add comments to clarify the reverse

3ab4095

fnattino requested a review from lossyrob December 7, 2020 09:40

lossyrob approved these changes Dec 9, 2020

lossyrob merged commit 0630c4f into stac-utils:develop Dec 9, 2020

scottyhq mentioned this pull request Jul 21, 2021

Unable to control subfolders with LayoutTemplate #480

Closed
