From 768a4d0d4e502ffac14224738052956fc222c96c Mon Sep 17 00:00:00 2001 From: Zhenghui Wang Date: Mon, 6 Jan 2020 22:58:18 -0800 Subject: [PATCH] Update go-license-tool (#553) * update go-license-tool README * fix typo * address comments --- .../testing/go-license-tools/README.md | 138 +++++++++++++++--- .../patch_additional_license_info.py | 69 +++++++++ 2 files changed, 187 insertions(+), 20 deletions(-) create mode 100644 py/kubeflow/testing/go-license-tools/patch_additional_license_info.py diff --git a/py/kubeflow/testing/go-license-tools/README.md b/py/kubeflow/testing/go-license-tools/README.md index 8730f04f1b9..9f492c783c2 100644 --- a/py/kubeflow/testing/go-license-tools/README.md +++ b/py/kubeflow/testing/go-license-tools/README.md @@ -6,7 +6,7 @@ When we release go library images (can be considered as redistributing third party binary). We need to do the following to be compliant: -* Put license declarations in the image for licences of all dependencies and transistive dependencies. +* Put license declarations in the image for licenses of all dependencies and transitive dependencies. * Mirror source code in the image for code with MPL, EPL, GPL or CDDL licenses. It's not an easy task to get license of all (transitive) dependencies of a go @@ -14,33 +14,131 @@ library. Thus, we need these tools to automate this task. ## How to get all dependencies with license and source code? -1. Install CLI tools here: `python setup.py install` -1. Collect dependencies + transitive dependencies in a go library. Put them together in a text file called `dep.txt`. Format: each line has a library name. The library name should be a valid golang import module name. +### I. Setup +Download this folder to your local folder namely `` and install it: +``` +$ python /setup.py install +``` - Example ways to get it: - * argo uses gopkg for package management. 
It has a [Gopkg.lock file](https://github.com/argoproj/argo/blob/master/Gopkg.lock) - with all of its dependencies and transitive dependencies. All the name fields in this file is what we need. You can run `parse-toml-dep` to parse it. - * minio uses [official go modules](https://blog.golang.org/using-go-modules), there's a [go.mod file](https://github.com/minio/minio/blob/master/go.mod) describing its direct dependencies. Run command `go list -m all` to get final versions that will be used in a build for all direct and indirect dependencies, [reference](https://github.com/golang/go/wiki/Modules#daily-workflow). Parse its output to make a file we need. +### II. Get all dependency repositories +1. Collect dependencies and transitive dependencies in a Go library into a text file called `dep.txt`, where each line is a valid golang import module name. For example: + ``` + ...... + cloud.google.com/go + github.com/BurntSushi/toml + github.com/beorn7/perks + github.com/bmatcuk/doublestar + ...... + ``` - Reminder: don't forget to put the library itself into `dep.txt`. -1. Run `get-github-repo` to resolve github repos of golang imports. Not all -imports can be figured out by my script, needs manual help for <2% of libraries. + Typical ways to get it: + * `gopkg` for package management. A project that uses `gopkg` has a [Gopkg.lock file](https://github.com/argoproj/argo/blob/master/Gopkg.lock) + with all of its dependencies and transitive dependencies. The name fields in this file are what we need. You can run `parse-toml-dep.py` to parse it. + * A project that uses [official go modules](https://blog.golang.org/using-go-modules) has a [go.mod file](https://github.com/minio/minio/blob/master/go.mod) describing its direct dependencies. Run the command - For a library we cannot resolve, manually put it in `dep-repo-mapping.manual.csv`, so the tool knows how to find its github repo the next time.
+ ```$ go list -m all | cut -d ' ' -f 1 > dep.txt``` - Defaults to read dependencies from `dep.txt` and writes to `repo.txt`. -1. Run `get-github-license-info` to crawl github license info of these libraries. (Not all repos have github recognizable license, needs manual help for <2% of libraries) + to get the final versions that will be used in a build for all direct and indirect dependencies ([reference](https://github.com/golang/go/wiki/Modules#daily-workflow)). - Defaults to read repos from `repo.txt` and writes to `license-info.csv`. You - need to configure github personal access token because it sends a lot of - requests to github. Follow instructions in `get-github-license-info -h`. + **Reminder:** don't forget to put your library itself into `dep.txt`. +2. Run `$ python /get_github_repo.py` to resolve the github repositories of golang imports. Not all imports can be resolved by the script; fewer than 2% of libraries need manual help. For example, you may see output like this: + ``` + ...... + Successfully resolved github repo for 89 dependencies and saved to repo.txt. Failed to resolve 3 dependencies. + We failed to resolve the following dependencies: + gomodules.xyz/jsonpatch/v2 + honnef.co/go/tools + ml_metadata + ``` - For repos that fails to fetch license, it's usually because their github repo + For a library we cannot resolve, manually put it in `dep-repo-mapping.manual.csv`, so the tool knows how to find its github repo in the future. For example, the corresponding `dep-repo-mapping.manual.csv` for the example above is + ``` + gomodules.xyz/jsonpatch/v2,gomodules/jsonpatch + honnef.co/go/tools,dominikh/go-tools + ml_metadata,google/ml-metadata + ``` + 3. Rerun the command to resolve all repositories this time: + ``` + $ python /get_github_repo.py + + ...... + Successfully resolved github repo for 92 dependencies and saved to repo.txt. Failed to resolve 0 dependencies. + ``` + +### III. Get all license URLs and types + +1.
Crawl github license info of these libraries via the following command to produce the `license_info.csv` file. (Not all repositories have a license that github recognizes; fewer than 2% of libraries need manual help.) + ``` + $ python /third_party/cli/get_github_license_info.py --github-api-token-file= + ...... + Fetching license for google/ml-metadata + Fetching license for kubernetes-sigs/controller-runtime + Fetching license for kubernetes-sigs/testing_frameworks + Fetching license for kubernetes-sigs/yaml + Fetched github license info, 91 succeeded, 0 failed. + ``` + You have to create a `` in order to access github repositories, because the tool sends a lot of requests to github. Follow instructions in `get-github-license-info -h`. + + For repositories whose license cannot be fetched, it's usually because their github repo doesn't have a github understandable license file. Check its readme and update correct info into `license-info.csv`. (Usually, use its README file which mentions license.) -1. Edit license info file. Manually check the license file for all repos with a license categorized as "Other" by github. Figure out their true license names. -1. Run `concatenate-license` to crawl full text license files for all dependencies and concat them into one file. + +2. Fill in missing license information. If you open `license_info.csv`, you can see that some license types are marked as `Other`. We have to update them to the right license types.
First we need to grep all these unknown license URLs: + ``` + $ cat license_info.csv | grep Other | cut -d ',' -f 2 + + GoogleCloudPlatform/gcloud-golang,https://github.com/googleapis/google-cloud-go/blob/master/LICENSE,Other,https://raw.githubusercontent.com/googleapis/google-cloud-go/master/LICENSE + ghodss/yaml,https://github.com/ghodss/yaml/blob/master/LICENSE,Other,https://raw.githubusercontent.com/ghodss/yaml/master/LICENSE + gogo/protobuf,https://github.com/gogo/protobuf/blob/master/LICENSE,Other,https://raw.githubusercontent.com/gogo/protobuf/master/LICENSE + ...... + ``` + + Now we can open these licenses all at once in Chrome via a plugin called [OpenList](https://chrome.google.com/webstore/detail/openlist/nkpjembldfckmdchbdiclhfedcngbgnl?hl=en). + + After checking the license content one by one, we can create `additional_license_info.csv` to record the right license types. The content of `additional_license_info.csv` looks like this: + ``` + https://github.com/googleapis/google-cloud-go/blob/master/LICENSE,Apache License 2.0 + https://github.com/ghodss/yaml/blob/master/LICENSE,MIT + https://github.com/gogo/protobuf/blob/master/LICENSE,BSD 3-Clause "New" or "Revised" License + ...... + ``` + + Finally, we can apply the license types recorded in `additional_license_info.csv` to `license_info.csv` to get the final list of licenses with types. + + ``` + $ python patch_additional_license_info.py + ``` + + +3. Run `concatenate-license` to crawl full text license files for all dependencies and concatenate them into one file. Defaults to read license info from `license-info.csv`. Writes to `license.txt`. Put `license.txt` to `third_party/library/license.txt` where it is read when building docker images. -1. Manually update a list of dependencies that requires source code, put it into `third_party/library/repo-MPL.txt`. +4. Manually update the list of dependencies that require source code and put it into `third_party/library/repo-MPL.txt`.
+
+## Add CI tests for license information.
+It is considered best practice to continuously test whether the right license information is present in the `license.txt` file for every new commit in your code repository, so that it is always safe to deliver a new image from the source code.
+
+For example, you can add the following tests to your CI pipeline.
+
+1. Check if `dep.txt` is up to date and force the license information to be updated in the same PR.
+
+    Suppose your repository uses standard Go modules and `dep.txt` is checked in. The test shell script can simply be
+    ```
+    go list -m all | cut -d ' ' -f 1 > /tmp/generated_dep.txt
+
+    if ! diff /tmp/generated_dep.txt dep.txt; then
+      echo "Please update the license file for changed dependencies."
+      exit 1
+    fi
+    ```
+
+2. Check if the final `license.txt` is up to date. The test shell script can be
+    ```
+    python3 concatenate_license.py --output=/tmp/generated_license.txt
+
+    if ! diff /tmp/generated_license.txt license.txt; then
+      echo "Please regenerate third_party/license.txt."
+      exit 1
+    fi
+    ```
diff --git a/py/kubeflow/testing/go-license-tools/patch_additional_license_info.py b/py/kubeflow/testing/go-license-tools/patch_additional_license_info.py
new file mode 100644
index 00000000000..4a9da1d1cf8
--- /dev/null
+++ b/py/kubeflow/testing/go-license-tools/patch_additional_license_info.py
@@ -0,0 +1,69 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+parser = argparse.ArgumentParser(
+    description='Patch "Other" license types in the license info CSV file.'
+)
+parser.add_argument(
+    'license_info_file',
+    nargs='?',
+    default='license_info.csv',
+    help='CSV file with license info fetched from github using '
+    'get-github-license-info CLI tool. (default: %(default)s)',
+)
+parser.add_argument(
+    'additional_license_info_file',
+    nargs='?',
+    default='additional_license_info.csv',
+    help='CSV file with license info. Each line is in the form '
+    ',. (default: %(default)s)',
+)
+args = parser.parse_args()
+
+
+def main():
+    mapping = {}
+    with open(args.additional_license_info_file, 'r') as f:
+        for line in f:
+            parts = line.strip().split(',')
+            assert len(parts) == 2
+            [license_url, license_type] = parts
+            mapping[license_url] = license_type
+
+    newlines = []
+    with open(args.license_info_file, 'r') as f:
+        for line in f:
+            parts = line.strip().split(',')
+            _, license_url, license_type, *_ = parts
+            if license_type == 'Other':
+                if license_url not in mapping:
+                    raise ValueError(
+                        'Unknown license type: '
+                        'please add the right license type for {} in file {}'
+                        .format(license_url, args.additional_license_info_file)
+                    )
+                parts[2] = mapping[license_url]
+                print(
+                    'Update license {} to type {}'.format(
+                        license_url, mapping[license_url]
+                    )
+                )
+            newlines.append(','.join(parts))
+
+    with open(args.license_info_file, 'w') as f:
+        for line in newlines:
+            print(line, file=f)
+
+
+if __name__ == "__main__":
+    main()
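
Review note: the patching step in `patch_additional_license_info.py` parses its arguments at import time and splits rows with `str.split(',')`, which makes the logic hard to exercise in a unit test. A minimal sketch of the same step as a pure function is given here, assuming the CSV layouts shown in the README (license URL in the second column and license type in the third column of `license_info.csv`, and `url,type` rows in `additional_license_info.csv`). The names `patch_license_rows` and `read_rows` are hypothetical and not part of this commit, and the standard `csv` module is swapped in for the manual split.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


def patch_license_rows(license_rows, additional_rows):
    """Return copies of license_rows with every 'Other' type replaced.

    additional_rows holds [license_url, license_type] pairs; blank rows
    in either input are skipped.
    """
    mapping = {row[0]: row[1] for row in additional_rows if row}
    patched = []
    for row in license_rows:
        if not row:
            continue
        row = list(row)  # copy so the caller's data is not modified
        license_url, license_type = row[1], row[2]
        if license_type == 'Other':
            if license_url not in mapping:
                raise ValueError(
                    'No license type recorded for {}'.format(license_url)
                )
            row[2] = mapping[license_url]
        patched.append(row)
    return patched


if __name__ == '__main__':
    # Illustrative rows taken from the README example in this patch.
    license_csv = (
        'ghodss/yaml,https://github.com/ghodss/yaml/blob/master/LICENSE,'
        'Other,https://raw.githubusercontent.com/ghodss/yaml/master/LICENSE\n'
    )
    additional_csv = 'https://github.com/ghodss/yaml/blob/master/LICENSE,MIT\n'
    for row in patch_license_rows(read_rows(license_csv),
                                  read_rows(additional_csv)):
        print(','.join(row))
```

Keeping the file I/O outside the function means the same logic could back both the CLI entry point and a CI test, though whether that refactor is worth it for a script of this size is a judgment call for the author.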