-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to a generated license-classifications.yml
based on ScanCode's License DB
#64
Conversation
b546063
to
fc8f779
Compare
dc22349
to
ab86897
Compare
- name: "permissive" | ||
- name: "proprietary-free" | ||
- name: "public-domain" | ||
- name: "source-available" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does ScanCode have 2-3 description for each of the categories? Maybe ask @pombredanne.
Would be nice if we could add a category description as even I struggle to fully understand the source-available and unstated-license categories
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.
So I'm wondering why we can't do as proposed here and simply use LDBcollector's classification, which at least does not rely on a single source / "opinion" only, and also nicely prefixes (at least some) category names with a hint about what the category semantically is about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.
I believe 'ort-config' is about providing a real world example, providing amongst others the following value
- Users can get started quickly.
- Users can see how a real world setup could look like.
- It allows building and sharing a vision of a good config we consider reasonable.
- Users can see how curations / package configurations and policy rules work together
- Users can instantly run all stages: analyzer, scanner, evaluator, advisor, reporter and
see the report with the violations
Regarding point 5: When looking at the report of an arbitrary scan, it will be cluttered with a bunch of UNHANDLED_LICENSE issues. So, it's quite obfuscated how effort / issues is there to fix, as classifying the licenses "maybe" lead to further issues. Having a lot of licenses classified, which this PR does, improves that situation quite a bit, the report created by default will be more meaningful.
In terms of getting the user started quickly, I believe it makes sense to stick to the de-facto default scanner ScanCode for the following reasons:
- A setup for other scanners would look similar, so it does not illustrate really new cases.
As it's similar. it's easy for the user to create if the ort-config is understood. - Adding categories from mutliple scanners adds complexity. It makes it harder to understand and iterate on the configuration.
- Categories from mutliple scanners may overlap semantically
- To have the setup kind of complete, policy rules for all the categories have to be written.
Who's gonna do that?
So, at the very minimum, I believe adding categories beyond ScanCode should be considered out of scope of this PR.
I'd vote against adding other categories due to how I see the effort / benefit ratio.
import com.fasterxml.jackson.databind.PropertyNamingStrategies | ||
import com.fasterxml.jackson.databind.json.JsonMapper | ||
import com.fasterxml.jackson.module.kotlin.KotlinFeature | ||
import com.fasterxml.jackson.module.kotlin.KotlinModule | ||
import com.fasterxml.jackson.module.kotlin.readValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we're trying to avoid Jackson in ORT Core (also see oss-review-toolkit/ort#3904), would you agree that it also makes sense to rather use kotlinx-serialization here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could make sense yes. However, Jackson has been used in that module already before, so I went for local consistency. I don't want to put in the effort to refactor this PR to use 'kotlinx-serialization' now. I wouldn't object if someone did it on top, after my PR is merged.
private fun downloadLicenseDetailsBatched( | ||
licenses: Collection<License>, | ||
threadCount: Int = 60 | ||
): Map<License, LicenseDetails> { | ||
val queue = ConcurrentLinkedQueue(licenses) | ||
val result = ConcurrentHashMap<License, LicenseDetails>() | ||
|
||
val threads = List(threadCount) { _ -> | ||
Thread { | ||
var license = queue.poll() | ||
|
||
while (license != null) { | ||
result[license] = downloadLicenseDetails(license) | ||
license = queue.poll() | ||
} | ||
} | ||
} | ||
|
||
val executionTimeMillis = measureTimeMillis { | ||
threads.run { | ||
forEach { it.start() } | ||
forEach { it.join() } | ||
} | ||
} | ||
|
||
LOGGER.quiet("Downloaded ${result.size} files in ${executionTimeMillis / 1000f} seconds on $threadCount threads.") | ||
|
||
return result.toMap() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using the download
plugin instead to safe some custom code, similar to like we already do in the spdx-utils
module?
In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download
plugin already does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using the
download
plugin instead to safe some custom code, similar to like we already do in thespdx-utils
module?
I decided against it as it leads to a coupling with Gradle tasks, and seemed in the end a bit more complex.
This way the code is easily readable even for people who are not into Gradle tasks.
I found it nicer not to add a dependency for such small thing as parallelizing a download and instead have full control over parallelization.
In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download plugin already does that.
This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way the code is easily readable even for people who are not into Gradle tasks.
Actually, I found it harder to read (and to check for correctness) than a Download
-task.
This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?
Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices. But I'm curious about what other @oss-review-toolkit/kotlin-devs say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I found it harder to read (and to check for correctness) than a Download-task.
I believe that's subjective.
Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices.
In general I believe if there can be good reasons to not follow best practices, e.g. there's an exception to almost every rule. So, I believe using the fact that something is best practices as rationale why an exception wrt to the best practices should not be made, does not make much sense.
Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.
So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I found it harder to read (and to check for correctness) than a Download-task.
I believe that's subjective.
Sure it is. Just like your statement before, that the code would be easier to read.
So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?
Because it's actually only being used within Gradle build files, and thus using Gradle infrastructure, which has its own rules. And I currently don't see the code ever being used elsewhere.
Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.
Why? If you have concrete plans to use the code elsewhere later, please share them so we can make more provident reviews.
tools/curations/buildSrc/src/main/kotlin/ScanCodeLicenseDbClassifications.kt
Outdated
Show resolved
Hide resolved
Dismissing my review in favor of others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I think that this comment should still be addressed:
#64 (comment)
This revision has the fix for serializing license classifications, see oss-review-toolkit/ort#5882. Signed-off-by: Frank Viernau <frank_viernau@epam.com>
The task downloads the "ScanCode License DB" [1] data from [2], derives ORT license classifications from that data and finally writes that result to `license-classifcations.yml` in the root directory of the code repository. The data isn't modified except for the following: 1. Exceptions are filtered out. 2. License IDs which do not refer to specific licenses, like the ones with `is_unknown == true` or `is_generic == true`, are re-assigned to separate categories. [1] https://github.com/nexB/scancode-licensedb [2] https://scancode-licensedb.aboutcode.org Signed-off-by: Frank Viernau <frank_viernau@epam.com>
Update the file by running [1] which generates license classifications based on [2]. That license data corresponds to ScanCode version `31.1.1`, see also the bottom of [2]. Note: For all licenses which are assigned to a category which is new in `license-classifications.yml` the `UNHANDLED_LICENSE` violations will continue to be flagged. `UNHANDLED_LICENSE` violations for all licenses which are now assigned to a non-offending category, like e.g. "permissive", dissappear. [1] `./gradlew generateLicenseClassifications` [2] https://scancode-licensedb.aboutcode.org/ Signed-off-by: Frank Viernau <frank_viernau@epam.com>
Signed-off-by: Frank Viernau <frank_viernau@epam.com>
ab86897
to
be6d816
Compare
I've addressed that comment (just now). |
Add a Gradle task which generates
license-classifications.yml
from ScanCode's license data hosted ataboutcode.org
and make use of the generated file.Fixes #59.
Note: As a follow up, policy rules will need to be extended to handle new license categories.
For now, the newly added categories will still be flagged by the evaluator rule
UNHANDLED_LICENSE
.