Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for license expressions #707

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

cdce8p
Copy link

@cdce8p cdce8p commented Nov 24, 2024

https://peps.python.org/pep-0639/#add-string-value-to-license-key
https://peps.python.org/pep-0639/#add-license-expression-field

Partially vendor packaging 24.2 to include the license validation added in pypa/packaging#828.

Until Metadata version 2.4 is supported, the license expression will be backfilled to the License field. Hatch does the same atm.

Work on #692

@mgorny
Copy link
Contributor

mgorny commented Jan 4, 2025

Do we have actually need to have the validation/canonicalization done unconditionally? Creating a cyclic dependency between flit_core and packaging would really suck for Gentoo, and unless the non-canonicalized metadata is completely invalid and breaks everything, we'd rather see fewer moving parts for an end user who just need to install the package locally, rather than publish it on PyPI.

@cdce8p cdce8p force-pushed the license-expression branch from cbcbb87 to 95a8099 Compare January 4, 2025 11:44
@cdce8p
Copy link
Author

cdce8p commented Jan 4, 2025

Do we have actually need to have the validation/canonicalization done unconditionally?

From the PEP:

Build and publishing tools SHOULD check that the License-Expression field contains a valid SPDX expression, including the validity of the particular license identifiers (as defined above). Tools MAY halt execution and raise an error when an invalid expression is found. If tools choose to validate the SPDX expression, they also SHOULD store a case-normalized version of the License-Expression field using the reference case for each SPDX license identifier and uppercase for the AND, OR and WITH keywords.

Hatchling uses packaging for the validation as well and setuptools will likely do the same. So I don't think it would make sense to skip it here.

@mgorny
Copy link
Contributor

mgorny commented Jan 4, 2025

The difference is that hatchling and setuptools already depend on packaging, while flit_core does not, while packaging uses flit_core as a build system to provide a clean bootstrap path. So adding packaging use to flit_core introduces a cyclic dependency here and makes unvendored bootstrap impossible.

@mgorny
Copy link
Contributor

mgorny commented Jan 4, 2025

Also, I dare say that "build and publishing tools" are frontends, not backends. Though admittedly, it's not the first time where people are adding frontend work to backends, or pushing maintainer's work onto end users.

@cdce8p
Copy link
Author

cdce8p commented Jan 4, 2025

So adding packaging use to flit_core introduces a cyclic dependency here and makes unvendored bootstrap impossible.

tomli also uses flit_core and is vendored here. How is that handled in Gentoo?

@mgorny
Copy link
Contributor

mgorny commented Jan 4, 2025

tomli is special-cased, and it is needed only for Python 3.10.

@cdce8p
Copy link
Author

cdce8p commented Jan 4, 2025

tomli is special-cased, and it is needed only for Python 3.10.

Yeah, but in that case you could probably special case packaging as well. Tbh it's not my decision to make, I've made my point that I believe validation would be useful. Someone else should decide how to proceed here.

@takluyver
Copy link
Member

Thanks for looking into this!

Flit does have a role in bootstrapping the packaging ecosystem, and as such it's useful to avoid dependencies, even vendored ones (since Linux distros in particular prefer to unvendor such things). We made an exception for tomli, since using TOML was standardised before it was in the stdlib, but I'd rather not make a pattern of this.

At the same time, I do think it makes sense to validate this before producing a package. It's frustrating in some ways that we use the same pathways to make packages for publication as for immediate installation, but that's where we are, and there are upsides to this as well.

I'd like to consider doing an independent implementation of SPDX validation & case normalisation. Besides preserving the zero-dependency status, this also means the ecosystem has two versions used in earnest to compare, rather than everyone doing the same thing. In a similar vein, Flit implements its own parsing of the PEP 621 [project] table, construction of the metadata file and wheel file structure, even though these are libraries that could provide these. I also like that when there's a bug, it's just a bug to fix, not an incompatibility with dependency X version Y.

It looks like this is tractable for license expressionss - at a glance, packaging has done it in about 150 lines, not counting the data of license names. @mgorny I take it you'd be happy with that? @cdce8p are you interested in working on that, or shall I have a go?

@cdce8p cdce8p force-pushed the license-expression branch from 1da4db3 to aa16d4d Compare January 5, 2025 16:41
@cdce8p
Copy link
Author

cdce8p commented Jan 5, 2025

I'd like to consider doing an independent implementation of SPDX validation & case normalisation. Besides preserving the zero-dependency status, this also means the ecosystem has two versions used in earnest to compare, rather than everyone doing the same thing. [...]

I understand your argument. Just skeptical a custom implementation wouldn't look quite similar to the existing one in packaging. The main advantage IMO of using the vendored version would be that they likely thought of any special case and if there is a bug, it's fixed upstream for everyone.

It looks like this is tractable for license expressionss - at a glance, packaging has done it in about 150 lines, not counting the data of license names. @mgorny I take it you'd be happy with that? @cdce8p are you interested in working on that, or shall I have a go?

I'm certainly interested in moving forward on the PEP 639 support. (Not sure I would be the best to write the custom validation though.) Maybe an option would be to defer the validation for a moment and circle back to it when all other things are in place, i.e. before moving to metadata version 2.4?

I removed the vendored packaging dependency here and replaced it with TODO comments. So this PR stands on its own now. Similarly #705 for license-files. Would it be an option for you if you look at these two first?

If not, what path would you prefer?

@mgorny
Copy link
Contributor

mgorny commented Jan 5, 2025

@mgorny I take it you'd be happy with that?

Yeah, that would work for me. Thanks!

@takluyver
Copy link
Member

As I see it, PEP 639 is metadata 2.4 - I don't see any reason to separate adding support for license expressions and emitting metadata version 2.4.

It looks like validation in the spec is only a SHOULD, not MUST, for build tools. So we could postpone that for now and just accept whatever people give us, leaving validation to PyPI on upload. I don't really like this idea, though, because if we add validation at some point in the future, previously accepted input could suddenly be rejected.

However, Flit has also always been a tool that aims to cover the 90% of simple, common use cases, not 100% of everything people might want. And I strongly suspect that 99% of the projects people use Flit for are released under a single license, with no need for AND, OR or WITH. So how about we start by validating what SPDX calls a 'simple license expression', i.e. either a license name from the list, or a LicenseRef-blah with permitted characters?

In this case, unlike others where Flit has made simplifying assumptions, I expect that we'll add support for the full license expression spec in the future - possibly quite soon. But breaking it up like this seems like it could be a useful way to move forward in smaller steps. When we add compound expressions, more possible inputs will be allowed rather than fewer.

@cdce8p
Copy link
Author

cdce8p commented Jan 6, 2025

As I see it, PEP 639 is metadata 2.4 - I don't see any reason to separate adding support for license expressions and emitting metadata version 2.4.

It's technically possible to backport the PEP 639 data to version 2.3, e.g. by backfilling License as I've done here. Hatchling used that approach in the beginning while support in twine for 2.4 was still missing.

I do agree though, that the goal should be to move to 2.4 rather quickly. Splitting it up into #705 and this one just made the most sense to me. The required changes for 2.4 will be fairly strait forward afterwards. Anyway, I made sure both PRs to continue to stand on its own so that at no point invalid metadata gets emitted.

It looks like validation in the spec is only a SHOULD, not MUST, for build tools. So we could postpone that for now and just accept whatever people give us, leaving validation to PyPI on upload.

I was just proposing to defer it from this PR specifically to make the review easier. Once it is merged, I can take a look at a MVP for the validation part.

So how about we start by validating what SPDX calls a 'simple license expression', i.e. either a license name from the list, or a LicenseRef-blah with permitted characters?

That might work if we bail out once we encounter any special characters. Could be a good first step.

@takluyver
Copy link
Member

Thanks! If you're happy with that idea, let's go with validating a single license for this PR. I think we can include the optional + suffix (meaning 'or later version').

I'd probably bump the metadata version to 2.4 before releasing, but I like the idea of ensuring that the emitted metadata is always valid, even before a release. 👍

@cdce8p
Copy link
Author

cdce8p commented Jan 6, 2025

Thanks! If you're happy with that idea, let's go with validating a single license for this PR. I think we can include the optional + suffix (meaning 'or later version').

Not sure that makes much sense tbh. Sure I could add a single check for say MIT License (SPDX identifier is just MIT) but anything more than that would require a list of all valid identifiers even without being able to handle AND / OR / WITH. I'd suggest to do that in a followup instead.

I'd probably bump the metadata version to 2.4 before releasing, but I like the idea of ensuring that the emitted metadata is always valid, even before a release. 👍

Getting to 2.4 by then is my goal as well.

@takluyver
Copy link
Member

Sorry, I meant an expression specifying one license, as discussed, not literally just one option. I do expect that to mean including the list of valid licenses in some form. If you don't want to do this, though, I'll try to get to it.

@mgorny
Copy link
Contributor

mgorny commented Jan 6, 2025

I don't think having a hardcoded list of valid licenses is a good idea. While new licenses aren't added often, this would imply that you'll need to keep updating the list and making new releases whenever that happens, and people will have to >= dep on very specific flit_core version providing the support for a given license. Hatchling and setuptools have already done something similar for trove classifiers, and it's a nightmare for downstreams.

@cdce8p
Copy link
Author

cdce8p commented Jan 6, 2025

Sorry, I meant an expression specifying one license, as discussed, not literally just one option. I do expect that to mean including the list of valid licenses in some form. If you don't want to do this, though, I'll try to get to it.

Makes sense now. I'd still suggest to do that in a followup though. That would also provide a good opportunity to discuss different approaches. The PR here is basically ready on its own. Will start working on it this week.

If you've some spare time, I'd appreciate if you could do a first pass through the change here and in #705 so we can get the fundamentals right.

I don't think having a hardcoded list of valid licenses is a good idea. While new licenses aren't added often, this would imply that you'll need to keep updating the list and making new releases whenever that happens, and people will have to >= dep on very specific flit_core version providing the support for a given license. Hatchling and setuptools have already done something similar for trove classifiers, and it's a nightmare for downstreams.

SPDX identifier are only ever deprecated and never removed. Thus the only thing missing would be support for new ones. As you've said yourself though, hardcoding the list is the current approach taken by most tools. Downloading it on demand might work but is also prone to errors so I'd recommend against it.

@takluyver
Copy link
Member

The hardcoded list doesn't seem like a major issue, to be honest. The vast majority of open source projects use a few familiar licenses which will have been on the list for years. We should make sure to have an 'escape hatch' for that one project blazing the trail for a new not-yet-included license, but I don't think it's going to be needed very often.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants