Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance documentation of checksums easyconfig parameter #104

Merged
merged 4 commits into from
Dec 19, 2024

Conversation

Flamefire
Copy link
Contributor

@Flamefire Flamefire commented Jan 20, 2023

Add more examples of which formats are possible.
Also contains not-yet implemented or broken features!

This is related to easybuilders/easybuild-framework#4142, easybuilders/easybuild-framework#4177, easybuilders/easybuild-framework#4150, easybuilders/easybuild-framework#4159, easybuilders/easybuild-framework#4164

We have https://github.com/easybuilders/easybuild-framework/blob/e3681ae53628400096f3e29d58e787bf1173f27b/test/framework/type_checking.py#L184-L225

and https://github.com/easybuilders/easybuild-framework/blob/e3681ae53628400096f3e29d58e787bf1173f27b/test/framework/type_checking.py#L709-L719

which tests various formats for checksums.

We have code in get_checksum_for which supports checksums being a dict instead of a list: https://github.com/easybuilders/easybuild-framework/blob/e3681ae53628400096f3e29d58e787bf1173f27b/easybuild/framework/easyblock.py#L369

However this is not supported by the type-checking code and hence to_checksums will be called which would iterate over the dicts keys mistaking them for checksums!!! https://github.com/easybuilders/easybuild-framework/blob/e3681ae53628400096f3e29d58e787bf1173f27b/easybuild/framework/easyconfig/types.py#L475

Furthermore this is made more difficult by the YEB files which use a YAML format where tuples are not supported. E.g. a test file has this:

checksums: [[
    'be662daa971a640e40be5c804d9d7d10',  # default [MD5]
    '44332000aa33b99ad1e00cbd1a7da769220d74647060a10e807b916d73ea27bc',  # default (SHA256)
    ['adler32', '0x998410035'],
    ['crc32', '0x1553842328'],
    ['md5', 'be662daa971a640e40be5c804d9d7d10'],
    ['sha1', 'f618096c52244539d0e89867405f573fdb0b55b0'],
    ['size', 273],
]]

This is a checksum entry for a single file. Obviously the inner-most 2-element lists should be converted to tuples and the outer-most list should be a list. But it isn't clear what to do with the 2nd list: Are those alternative checksums where only one needs to match or additional checksums that all need to match? Both cases should be supported.

I would argue that an additional level can be used: checksums: [[['mainchecksum', 'altchecksum']]] The type conversion code can deduce that after the 2nd level of lists only tuples may be specified as a list (i.e. an AND) inside a list (already an AND) doesn't make sense, so the 2nd level list is an AND consisting only of a single element, which is redundant but ok.

And finally specifying None in a dict currently yields the same error as not specifying it, see easybuilders/easybuild-framework#4142

So things to decide:

  • Do we allow checksums to be a dict? This would make the base check (len(checksums) = len(srcs+patches)) impossible. So I'd keep it a list.
  • What do we allow as values of dicts? Should we support the logical AND/OR semantic possible or only strings/type-value-tuples?
  • I'd always disallow putting a dict anywhere inside a dict, that's what I did in Fix to_checksums with None values in dicts and recursion easybuild-framework#4159 as I see no reason why you'd want that. Or is there any?
  • How do we handle None? In a tuple it doesn't make sense, so I'd disallow it there. For a dict should we differ between having a key-None entry or not having any entry especially related to --enforce-checksums?

I see 2 ways here:

  1. dicts can only contain checksums/type-checksum-tuples or None
  2. allow full power, so that the value of a dict entry is handled the same as a "top-level entry" except it disallows more dicts.

As for the missing-key case: I'd handle it the same as not specifying it: Error when enforce_checksums is active else treat as matched. I'd treat it as an error such that typos can be detected as otherwise e.g. 'src_X86_64.tgz' might be used instead of 'src_x86_64.tgz' looking valid

Port of easybuilders/easybuild#853 to the new markdown format

docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
@Flamefire
Copy link
Contributor Author

@smoors Thanks for the review. Besides the stylistic issues I wanted to put the general handling up for discussion, see the 2nd part of the description (starting at "So things to decide:")

So those questions should be answered first, maybe in the next confcall (which I can't attend unfortunately)

boegel
boegel previously requested changes Aug 2, 2023
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
docs/writing-easyconfig-files.md Outdated Show resolved Hide resolved
@Flamefire
Copy link
Contributor Author

@boegel Thanks for the review, however I'd like a decision first on "things to decide:" in the OP.

Also are you sure about e.g. #104 (comment) ? 0123456789...abcdef isn't a valid checksum either and using 'oldchecksum', 'newchecksum' is IMO a good example on the reason/situation where you may want to use that format.

@boegel boegel changed the title Enhance documentation of checksums EC parameter enhance documentation of checksums easyconfig parameter Sep 13, 2023
@Flamefire Flamefire marked this pull request as ready for review November 16, 2023 08:27
@Flamefire Flamefire requested review from boegel and smoors November 16, 2023 08:27
@Flamefire
Copy link
Contributor Author

I updated this with the review comments applied especially as currently not being able to use alternative checksums in dict entries came up, so this should be resolved timely. Decisions I basically assumed and documented which I'd like to outline:

  • The checksums parameter cannot be a dict, but must be a list such that the number of checksums can be easily checked against the number of files
  • Dict entries are basically the same as specifying the value directly except of course they are selected based on the filename
  • So list, tuples and lists of tuples are allowed as entries in a dict, as is "None"
  • dicts in dicts are not allowed

Missing from documentation and possibly implementation:
Having None in a tuple or list should be an error as it is either superflous or makes the alternatives always pass

Flamefire and others added 4 commits July 12, 2024 14:13
Add more examples of which formats are possible.
Also contains not-yet implemented or broken features.
Co-authored-by: Sam Moors <smoors@users.noreply.github.com>
@Micket
Copy link
Contributor

Micket commented Dec 15, 2024

Sorry for being late to the party here, and i don't want to complicate/stir things up, but i was just curious about

The checksums parameter cannot be a dict, but must be a list such that the number of checksums can be easily checked against the number of files

I'm just wondering what the keys would b in this top-level dict? I would have assumes it would be the filenames, so that length check would still hold? If not, what would the keys be? Are we expecting duplicates because i can't see how wouldn't be horribly broken.

If i had to pick a setup a way to set it up from scratch i would had just done

checksums = {
    'source1.tar.gz': '1234...abcd',                         # 99.9% of cases
    'source_multiple_checksums.tar.gz': ('xxxxx', 'yyyyy'),  # 0.05% of cases
    'flaky_source.tar.gz': None,                             # 0.05% of cases
    'oh.my.god.please.no': {'crc32', '0x1553842328'}, # or something like that
}

I'm also very confused by
https://github.com/easybuilders/easybuild-framework/pull/4164/files#diff-d5f983bd8247adfc28121c0b495d55888bf123382e9e2c4ca350a03347f95649R224
in both the original and updated code; as it reads on line 212

each item can be a list of checksums for a single file, which can be of different types...

clearly indicating that it's a single file, but then

{'foo.txt': sha256_checksum1, 'bar.txt': sha256_checksum2, 'baz.txt': ('size', filesize)},

two different file names?

I don't get it.

@Flamefire
Copy link
Contributor Author

I'm just wondering what the keys would b in this top-level dict? I would have assumes it would be the filenames, so that length check would still hold? If not, what would the keys be? Are we expecting duplicates because i can't see how wouldn't be horribly broken.

The dict-thing is for stuff like java_x86.tar + java_arch64.tar where the name is based on a template.

I would too prefer a top-level dict. The length check (len(checksums.keys()) would work mostly, until it doesn't for e.g. the above case where there are alternative names for the same file. We could maybe make special cases for that but I guess it would complicate stuff.

It also might be hard to get the code working for the old-style (using a list) and the new style (dict) without making it too complicated, although I guess it is doable.
Here I just wanted to formalize how it is currently working and make that way more consistent.

I'm also very confused by

each item can be a list of checksums for a single file, which can be of different types...
...
two different file names?

Maybe not "a single file" but "a single source/patch entry" as the dict you quoted is a single "file" just with different names. We could make that example more clear I guess by using the (resolved) template names for e.g. java

@Micket
Copy link
Contributor

Micket commented Dec 15, 2024

The dict-thing is for stuff like java_x86.tar + java_arch64.tar where the name is based on a template.

ah yes i forgot about those.

Are we doing a whole lot of assuming that the checksums equal the sources + patches throughout the code, or is it just EasyBlock.check_checksums_for method?

Because that already does support the checksums.json variant, which is basically just an externally maintained flat top-level dict.

If one flattened the checksums to just deal with files, e.g.

checksums = {
    'openjdk-17.1.2.3-x86.sh': 'xxxxx',
    'openjdk-17.1.2.3-power.sh': 'yyyyy',
    'openjdk-17.1.2.3-arm.sh': 'zzzzzz',
}

one could just treat this the same as the external json, and maybe we could add a little something extra to warn on unused checksum entries (where we simply whitelist the rare cases like Java in the CI)

Checking the code just now, i spot:

        # Single source should be re-wrapped as a list, and checksums with it
        if isinstance(sources, dict):
            sources = [sources]

so kind of already supported (as long as you only have one source entry)

Here I just wanted to formalize how it is currently working and make that way more consistent.

Fully agree

@Flamefire
Copy link
Contributor Author

Are we doing a whole lot of assuming that the checksums equal the sources + patches throughout the code, or is it just EasyBlock.check_checksums_for method?

Seems like the rest is able to handle the dict even though documentation/dcostrings says otherwise, e.g. get_checksum_for is documented as "checksums: a list or tuple of checksums (or None)" and also receives an index, but seemingly handles dicts

If one flattened the checksums to just deal with files

Can there be any cases where 2 different sources might have the same name? Maybe with crates or git checkouts? But I guess the download_filename feature could fix that rare case.

one could just treat this the same as the external json, and maybe we could add a little something extra to warn on unused checksum entries (where we simply whitelist the rare cases like Java in the CI)

I'm not a fan of more hard-coded exceptions...
How could we detect unused entries generically? E.g. for:

sources = ['source1.tar', 'mysource2.tar']
checksums = {'source1.tar': cs1, 'source2.tar': cs2}

The 2nd is unused due to a typo.
Or harder:

checksums = {
    'openjdk-17.1.2.3-x86.sh': 'xxxxx',
    'openjdk-17.1.2.3-Power.sh': 'yyyyy', # Case error
    'openjdk-17.1.2.3-arm.sh': 'zzzzzz',
}

Maybe we can change check_checksums_for to try and get each checksum and even iterate templates we know.

But in both cases (whitelist, templates) we may need to update that every now and then to handle some contributions

Checking the code just now, i spot:

        # Single source should be re-wrapped as a list, and checksums with it
        if isinstance(sources, dict):
            sources = [sources]

so kind of already supported (as long as you only have one source entry)

That is for sources, the checksum is checked for string_type but I guess it can easily extended to a dict too, yes.

So in summary I think it does make sense to have/allow a top-level dict for checksums to get rid of the extra [] we have in 99% of the cases. Besides the work to implement this properly the only minor downside I see is the unclear order: Should we sort the dict entries by name or by index in the sources+patches? Currently that question does not exists ;-)
So we would have:

  • if checksums is a list, we fetch the corresponding index and treat this as-if that was passed to a function
  • if now/or prior checksums is a dict we fetch the corresponding item and error if not found
  • verify_checksum then does not need to handle the dict case anymore and would error if a dict was found

Sounds easy so far. Improving inject_checksums might be much harder though in presence of templates. Not sure how we handle this now.

And: Would we update the existing ECs?

@Micket
Copy link
Contributor

Micket commented Dec 16, 2024

Can there be any cases where 2 different sources might have the same name?

That would cause problems even if we didn't check checksums. I have encountered this scenario with some extensions that were just v1.2.3.tar.gz from github and they too had to be fixed with download_filename.
It would also break for anyone using the checksum.json, since that is already supposed to be work, I don't think this would add any new condition.

I'm not a fan of more hard-coded exceptions...

Yes i agree, though I don't believe we are checking for unused Java checksums either when using the dict-thing, so the hard case you mention isn't covered today either. Not to excuse it, but I suspect it's pretty narly to check given how core the architecture thing is (and we need to account for that not all architectures are supported in all commercial tools). That would be a pure CI thing, so nothing we need to block EB 5.0 over.

  • if checksums is a list, we fetch the corresponding index and treat this as-if that was passed to a function
  • if now/or prior checksums is a dict we fetch the corresponding item and error if not found
  • verify_checksum then does not need to handle the dict case anymore and would error if a dict was found

Yes that all sounds correct to me.

And: Would we update the existing ECs?

I would say we just start making --inject-checksums use this as the default.

@Flamefire
Copy link
Contributor Author

Yes i agree, though I don't believe we are checking for unused Java checksums either when using the dict-thing, so the hard case you mention isn't covered today either.

Currently we check len(sources + patches) == len(checksums). With a dict instead we have len(sources + patches) <= len(checksums), so we can at least easily catch clearly missing checksums. But with those multi-arch stuff (and similar) we won't know if there are too many or missing checksums

Not to excuse it, but I suspect it's pretty narly to check given how core the architecture thing is (and we need to account for that not all architectures are supported in all commercial tools). That would be a pure CI thing, so nothing we need to block EB 5.0 over.

Then we should just exclude any software that uses %(arch)s in its sources from the == check (and only use <=).
Later we can enhance this by "deduplicating" checksum keys ignoring parts that can be arch values. So no big deal.

I would say we just start making --inject-checksums use this as the default.

Should do, yes.

So how shall we proceed? Merge this and the related PRs first and then implement the full dict support? I'd say both are orthogonal as the open PRs just bring documentation and implementation in line and fix corner cases and inconsistencies. Support for top-level dicts can still be added as it is mostly a duty of get_checksum_for to handle that (which it seemingly already does)

@Micket
Copy link
Contributor

Micket commented Dec 18, 2024

So how shall we proceed?

Yes it's just so many PRs here i need some help keeping track of what to merge and when. We are 100% focusing on the 5.0 beta release now.

I think this documentation goes in, and easybuilders/easybuild-framework#4578 to start?

If you have duplicate PRs for develop, i think you can just close those to help me out, since we will just deal with the merge with develop later, no more PRs there unless there is a very strong reason.

@Flamefire
Copy link
Contributor Author

Flamefire commented Dec 18, 2024

If you have duplicate PRs for develop, i think you can just close those to help me out, since we will just deal with the merge with develop later, no more PRs there unless there is a very strong reason.

I'd suggest the order:

  1. This PR
  2. Fix the checksum type check easybuild-framework#4578
  3. Fix to_checksums with None values in dicts and recursion easybuild-framework#4579
  4. Allow nesting values in checksum dicts easybuild-framework#4711

The 2nd would need a rebase after the first is merged to avoid the duplicate commit that is required to test the changes depending on that other PR.

I closed the PRs to develop with a reference to the new ones

Copy link
Contributor

@Micket Micket left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but i would expect we probably want to restructure a bit if we manage to make the type(checksums) == dict a reality.

@Micket
Copy link
Contributor

Micket commented Dec 18, 2024

someone else needs to merge this, i don't have permission to

@ocaisa ocaisa merged commit 14f8074 into easybuilders:main Dec 19, 2024
4 checks passed
@Flamefire Flamefire deleted the checksums branch December 19, 2024 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants