Switch to a generated `license-classifications.yml` based on ScanCode's License DB #64

fviernau · 2022-09-29T07:59:25Z

Add a Gradle task which generates license-classifications.yml from ScanCode's license data hosted at aboutcode.org and make use of the generated file.

Fixes #59.

Note: As a follow up, policy rules will need to be extended to handle new license categories.
For now, the newly added categories will still be flagged by the evaluator rule UNHANDLED_LICENSE.

tsteenbe · 2022-10-05T08:44:57Z

license-classifications.yml

+- name: "permissive"
+- name: "proprietary-free"
+- name: "public-domain"
+- name: "source-available"


Does ScanCode have 2-3 description for each of the categories? Maybe ask @pombredanne.
Would be nice if we could add a category description as even I struggle to fully understand the source-available and unstated-license categories

BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.

So I'm wondering why we can't do as proposed here and simply use LDBcollector's classification, which at least does not rely on a single source / "opinion" only, and also nicely prefixes (at least some) category names with a hint about what the category semantically is about.

BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.

I believe 'ort-config' is about providing a real world example, providing amongst others the following value

Users can get started quickly.

Users can see how a real world setup could look like.

It allows building and sharing a vision of a good config we consider reasonable.

Users can see how curations / package configurations and policy rules work together

Users can instantly run all stages: analyzer, scanner, evaluator, advisor, reporter and
see the report with the violations

Regarding point 5: When looking at the report of an arbitrary scan, it will be cluttered with a bunch of UNHANDLED_LICENSE issues. So, it's quite obfuscated how effort / issues is there to fix, as classifying the licenses "maybe" lead to further issues. Having a lot of licenses classified, which this PR does, improves that situation quite a bit, the report created by default will be more meaningful.

In terms of getting the user started quickly, I believe it makes sense to stick to the de-facto default scanner ScanCode for the following reasons:

A setup for other scanners would look similar, so it does not illustrate really new cases.
As it's similar. it's easy for the user to create if the ort-config is understood.

Adding categories from mutliple scanners adds complexity. It makes it harder to understand and iterate on the configuration.

Categories from mutliple scanners may overlap semantically

To have the setup kind of complete, policy rules for all the categories have to be written.
Who's gonna do that?

So, at the very minimum, I believe adding categories beyond ScanCode should be considered out of scope of this PR.
I'd vote against adding other categories due to how I see the effort / benefit ratio.

sschuberth · 2022-10-07T10:44:18Z

tools/curations/buildSrc/src/main/kotlin/ScanCodeLicenseDbClassifications.kt

+import com.fasterxml.jackson.databind.PropertyNamingStrategies
+import com.fasterxml.jackson.databind.json.JsonMapper
+import com.fasterxml.jackson.module.kotlin.KotlinFeature
+import com.fasterxml.jackson.module.kotlin.KotlinModule
+import com.fasterxml.jackson.module.kotlin.readValue


Given that we're trying to avoid Jackson in ORT Core (also see oss-review-toolkit/ort#3904), would you agree that it also makes sense to rather use kotlinx-serialization here?

Could make sense yes. However, Jackson has been used in that module already before, so I went for local consistency. I don't want to put in the effort to refactor this PR to use 'kotlinx-serialization' now. I wouldn't object if someone did it on top, after my PR is merged.

sschuberth · 2022-10-07T10:55:43Z

tools/curations/buildSrc/src/main/kotlin/ScanCodeLicenseDbClassifications.kt

+private fun downloadLicenseDetailsBatched(
+    licenses: Collection<License>,
+    threadCount: Int = 60
+): Map<License, LicenseDetails> {
+    val queue = ConcurrentLinkedQueue(licenses)
+    val result = ConcurrentHashMap<License, LicenseDetails>()
+
+    val threads = List(threadCount) { _ ->
+        Thread {
+            var license = queue.poll()
+
+            while (license != null) {
+                result[license] = downloadLicenseDetails(license)
+                license = queue.poll()
+            }
+        }
+    }
+
+    val executionTimeMillis = measureTimeMillis {
+        threads.run {
+            forEach { it.start() }
+            forEach { it.join() }
+        }
+    }
+
+    LOGGER.quiet("Downloaded ${result.size} files in ${executionTimeMillis / 1000f} seconds on $threadCount threads.")
+
+    return result.toMap()
+}


Have you considered using the download plugin instead to safe some custom code, similar to like we already do in the spdx-utils module?

In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download plugin already does that.

Have you considered using the download plugin instead to safe some custom code, similar to like we already do in the spdx-utils module?

I decided against it as it leads to a coupling with Gradle tasks, and seemed in the end a bit more complex.
This way the code is easily readable even for people who are not into Gradle tasks.
I found it nicer not to add a dependency for such small thing as parallelizing a download and instead have full control over parallelization.

In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download plugin already does that.

This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?

This way the code is easily readable even for people who are not into Gradle tasks.

Actually, I found it harder to read (and to check for correctness) than a Download-task.

This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?

Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices. But I'm curious about what other @oss-review-toolkit/kotlin-devs say.

Actually, I found it harder to read (and to check for correctness) than a Download-task.

I believe that's subjective.

Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices.

In general I believe if there can be good reasons to not follow best practices, e.g. there's an exception to almost every rule. So, I believe using the fact that something is best practices as rationale why an exception wrt to the best practices should not be made, does not make much sense.

Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.
So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?

Actually, I found it harder to read (and to check for correctness) than a Download-task.

I believe that's subjective.

Sure it is. Just like your statement before, that the code would be easier to read.

So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?

Because it's actually only being used within Gradle build files, and thus using Gradle infrastructure, which has its own rules. And I currently don't see the code ever being used elsewhere.

Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.

Why? If you have concrete plans to use the code elsewhere later, please share them so we can make more provident reviews.

tools/curations/buildSrc/src/main/kotlin/ScanCodeLicenseDbClassifications.kt

Dismissing my review in favor of others.

MarcelBochtler

LGTM, but I think that this comment should still be addressed:
#64 (comment)

This revision has the fix for serializing license classifications, see oss-review-toolkit/ort#5882. Signed-off-by: Frank Viernau <frank_viernau@epam.com>

The task downloads the "ScanCode License DB" [1] data from [2], derives ORT license classifications from that data and finally writes that result to `license-classifcations.yml` in the root directory of the code repository. The data isn't modified except for the following: 1. Exceptions are filtered out. 2. License IDs which do not refer to specific licenses, like the ones with `is_unknown == true` or `is_generic == true`, are re-assigned to separate categories. [1] https://github.com/nexB/scancode-licensedb [2] https://scancode-licensedb.aboutcode.org Signed-off-by: Frank Viernau <frank_viernau@epam.com>

Update the file by running [1] which generates license classifications based on [2]. That license data corresponds to ScanCode version `31.1.1`, see also the bottom of [2]. Note: For all licenses which are assigned to a category which is new in `license-classifications.yml` the `UNHANDLED_LICENSE` violations will continue to be flagged. `UNHANDLED_LICENSE` violations for all licenses which are now assigned to a non-offending category, like e.g. "permissive", dissappear. [1] `./gradlew generateLicenseClassifications` [2] https://scancode-licensedb.aboutcode.org/ Signed-off-by: Frank Viernau <frank_viernau@epam.com>

Signed-off-by: Frank Viernau <frank_viernau@epam.com>

fviernau · 2022-10-10T09:40:00Z

LGTM, but I think that this comment should still be addressed:
#64 (comment)

I've addressed that comment (just now).

fviernau force-pushed the generate-license-classificiations branch 5 times, most recently from b546063 to fc8f779 Compare September 29, 2022 11:37

fviernau marked this pull request as ready for review September 29, 2022 11:37

fviernau requested review from a team as code owners September 29, 2022 11:37

fviernau force-pushed the generate-license-classificiations branch 2 times, most recently from dc22349 to ab86897 Compare September 30, 2022 07:26

tsteenbe reviewed Oct 5, 2022

View reviewed changes

fviernau requested a review from sschuberth October 6, 2022 20:48

sschuberth previously requested changes Oct 7, 2022

View reviewed changes

sschuberth requested a review from a team October 10, 2022 08:46

MarcelBochtler reviewed Oct 10, 2022

View reviewed changes

fviernau added 4 commits October 10, 2022 11:29

Gradle: Upgrade to a current ORT version

53f38f9

This revision has the fix for serializing license classifications, see oss-review-toolkit/ort#5882. Signed-off-by: Frank Viernau <frank_viernau@epam.com>

evaluator-rules: Add a TODO to remind about updating the rules

be6d816

Signed-off-by: Frank Viernau <frank_viernau@epam.com>

fviernau force-pushed the generate-license-classificiations branch from ab86897 to be6d816 Compare October 10, 2022 09:39

MarcelBochtler approved these changes Oct 10, 2022

View reviewed changes

fviernau merged commit 96cb835 into main Oct 10, 2022

fviernau deleted the generate-license-classificiations branch October 10, 2022 09:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to a generated `license-classifications.yml` based on ScanCode's License DB #64

Switch to a generated `license-classifications.yml` based on ScanCode's License DB #64

fviernau commented Sep 29, 2022

tsteenbe Oct 5, 2022

sschuberth Oct 5, 2022

fviernau Oct 6, 2022 •

edited

Loading

sschuberth Oct 7, 2022

fviernau Oct 7, 2022 •

edited

Loading

sschuberth Oct 7, 2022

fviernau Oct 7, 2022 •

edited

Loading

sschuberth Oct 7, 2022

fviernau Oct 10, 2022 •

edited

Loading

sschuberth Oct 10, 2022

MarcelBochtler left a comment

fviernau commented Oct 10, 2022 •

edited

Loading

Switch to a generated license-classifications.yml based on ScanCode's License DB #64

Switch to a generated license-classifications.yml based on ScanCode's License DB #64

Conversation

fviernau commented Sep 29, 2022

tsteenbe Oct 5, 2022

Choose a reason for hiding this comment

sschuberth Oct 5, 2022

Choose a reason for hiding this comment

fviernau Oct 6, 2022 • edited Loading

Choose a reason for hiding this comment

sschuberth Oct 7, 2022

Choose a reason for hiding this comment

fviernau Oct 7, 2022 • edited Loading

Choose a reason for hiding this comment

sschuberth Oct 7, 2022

Choose a reason for hiding this comment

fviernau Oct 7, 2022 • edited Loading

Choose a reason for hiding this comment

sschuberth Oct 7, 2022

Choose a reason for hiding this comment

fviernau Oct 10, 2022 • edited Loading

Choose a reason for hiding this comment

sschuberth Oct 10, 2022

Choose a reason for hiding this comment

MarcelBochtler left a comment

Choose a reason for hiding this comment

fviernau commented Oct 10, 2022 • edited Loading

Switch to a generated `license-classifications.yml` based on ScanCode's License DB #64

Switch to a generated `license-classifications.yml` based on ScanCode's License DB #64

fviernau Oct 6, 2022 •

edited

Loading

fviernau Oct 7, 2022 •

edited

Loading

fviernau Oct 7, 2022 •

edited

Loading

fviernau Oct 10, 2022 •

edited

Loading

fviernau commented Oct 10, 2022 •

edited

Loading