Add tool and guide for getting python license information (kubeflow#569)

* add python license tool and scripts
* update readme
* remove unused imports
jlewi · Jan 22, 2020 · 59bdeae
1 parent b58ed35
commit 59bdeae
Showing 2 changed files with 127 additions and 0 deletions.
diff --git a/py/kubeflow/testing/python-license-tools/README.md b/py/kubeflow/testing/python-license-tools/README.md
@@ -0,0 +1,39 @@
+# CLI tools to fetch license info for Python libraries
+
+This doc aims to show how to get third party library license information for Kubeflow Python applications.
+
+As a prerequisite, please read the [Go license tools guide](https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/go-license-tools/README.md) to learn why third party library license compliance is important and how it is accomplished for Go applications. This doc differs from the Go guide mainly in how dependencies are discovered (using `pipenv`) and how source repositories are looked up (from PyPI).
+
+## How to get all dependencies with license and source code?
+
+### I. Setup
+Download the Python files in both the go-license-tools folder and this folder.
+
+### II. Get all dependency repositories
+1. Figure out all Python dependencies with `pipenv`. Your application doesn't have to use `pipenv` to manage its dependencies; we only use it here to find the transitive dependencies:
+
+    - Install [pipenv](https://pypi.org/project/pipenv/).
+    - Run `pipenv install ...` to install all your direct dependencies.
+    - Run `pipenv lock` to generate the lock file `Pipfile.lock`, a JSON file that records all transitive dependencies.
+
+2. Get Github source repositories for all the dependencies by running this script.
+    ```
+    python3 pipfile_to_github_repo.py
+    ```
+    This script parses the `Pipfile.lock` and looks up the source repositories registered on PyPI. It should generate a file named `repo.txt` whose content looks like this:
+    ```
+    AzureAD/azure-activedirectory-library-for-python
+    tkem/cachetools
+    certifi: None
+    cffi: None
+    ......
+    ```
+    Each line above is a Github repository name for a package. Unfortunately, not all packages list source repository information on PyPI. We use `<package_name>: None` to denote such packages. In this example, `certifi` and `cffi` lack source repository information, so we have to search for their source repositories manually and edit the entries.
+
+3. Manually edit source repository information. Once we find all the package source repositories, we need to update `repo.txt`. For example, we can replace
+line `certifi: None` with its Github source repository `certifi/python-certifi`.
+
+    However, we can't update `repo.txt` for `cffi` directly, because it is not hosted on Github. We have to remember to update its license URI and type in the final `license_info.csv`, which is produced in the next step.
+
+### III. Get all license URLs and types
+This step is the same as [the one](https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/go-license-tools/README.md#iii-get-all-license-urls-and-types) described in the Go license tools guide, but you have to manually add the license information for code not hosted on Github.
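The manual-edit step in section II of the README above depends on spotting every `<package_name>: None` line in `repo.txt`. The following is a minimal sketch of that check, assuming the output format shown in the README example. `find_unresolved` is a hypothetical helper and is not part of the tools added in this commit.

```python
def find_unresolved(lines):
    """Return package names from lines of the form '<package_name>: None'."""
    suffix = ': None'
    unresolved = []
    for line in lines:
        line = line.strip()
        if line.endswith(suffix):
            # Keep only the package name that precedes ': None'.
            unresolved.append(line[:-len(suffix)])
    return unresolved


# Sample content copied from the README example.
sample = [
    'AzureAD/azure-activedirectory-library-for-python',
    'tkem/cachetools',
    'certifi: None',
    'cffi: None',
]
print(find_unresolved(sample))  # prints ['certifi', 'cffi']
```

To check a real file, the same helper could be called with the open file object, for example `find_unresolved(open('repo.txt'))`.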
diff --git a/py/kubeflow/testing/python-license-tools/pipfile_to_github_repo.py b/py/kubeflow/testing/python-license-tools/pipfile_to_github_repo.py
@@ -0,0 +1,88 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+
+import argparse
+import json
+import requests
+from bs4 import BeautifulSoup as Soup
+
+parser = argparse.ArgumentParser(
+    description='JSON format piplock file maintained by pipenv.'
+)
+parser.add_argument(
+    'pip_lock_path',
+    nargs='?',
+    default='Pipfile.lock',
+    help='JSON format pip dependency lock file.'
+)
+parser.add_argument(
+    '-o',
+    '--output',
+    dest='output_file',
+    nargs='?',
+    default='repo.txt',
+    help=
+    'Output file, where each line is a Github <owner>/<repo> or "<pkg>: None". (default: %(default)s)',
+)
+
+GITHUB_HTTPS = 'https://github.com/'
+GITHUB_HTTP = 'http://github.com/'
+
+args = parser.parse_args()
+
+
+def get_github_repo_name(url):
+  if url.startswith(GITHUB_HTTPS):
+    url = url[len(GITHUB_HTTPS):]
+  if url.startswith(GITHUB_HTTP):
+    url = url[len(GITHUB_HTTP):]
+  if url.endswith('/'):
+    url = url[:-1]
+  return url
+
+
+def main():
+  lockfile = None
+  with open(args.pip_lock_path, 'r') as f:
+    lockfile = json.loads(f.read())
+
+  deps = {}
+  pkgs = lockfile.get('default')
+  for pkg in pkgs:
+    deps[pkg] = pkgs[pkg].get('version').strip('==')
+
+  repositories = {}
+  for pkg in deps:
+    pypi_url = 'https://pypi.org/project/{}/{}/'.format(pkg, deps[pkg])
+    response = requests.get(pypi_url)
+    assert response.ok, 'it failed with {} {}'.format(
+        response.status_code, response.reason
+    )
+
+    soup = Soup(response.text, features="html.parser")
+    for link in soup.find_all('a'):
+      href = link.get('href')
+      if href is not None and (href.startswith(GITHUB_HTTP) or
+                               href.startswith(GITHUB_HTTPS)):
+        text = str(link)
+        if text.find('Homepage') >= 0 or text.find('Code') >= 0:
+          repositories[pkg] = get_github_repo_name(href)
+          break
+    else:
+      repositories[pkg] = pkg + ': None'
+
+  with open(args.output_file, 'w') as out:
+    for pkg in repositories:
+      print(repositories[pkg], file=out)
+
+
+if __name__ == '__main__':
+  main()
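The script above finds repositories by scraping the HTML of each PyPI project page. As an alternative sketch (not what this commit does), PyPI also serves package metadata as JSON at `https://pypi.org/pypi/<name>/<version>/json`, where `info.project_urls` maps link labels to URLs. The helper below, `repo_from_project_urls`, is hypothetical: it makes no network calls and assumes the caller has already decoded that JSON and passes in the `project_urls` dict.

```python
from urllib.parse import urlparse


def repo_from_project_urls(project_urls):
    """Return '<owner>/<repo>' for the first GitHub URL found, else None.

    project_urls is assumed to be the dict under info.project_urls in
    PyPI's JSON metadata; it may be None for some packages.
    """
    for url in (project_urls or {}).values():
        parsed = urlparse(url)
        if parsed.netloc.lower() not in ('github.com', 'www.github.com'):
            continue
        # Drop empty segments caused by leading or trailing slashes.
        parts = [p for p in parsed.path.split('/') if p]
        if len(parts) >= 2:
            repo = parts[1]
            if repo.endswith('.git'):
                repo = repo[:-len('.git')]
            return parts[0] + '/' + repo
    return None


print(repo_from_project_urls({'Homepage': 'https://github.com/tkem/cachetools/'}))
```

Unlike the committed script, which only accepts links whose markup mentions `Homepage` or `Code`, this sketch accepts any GitHub URL in the mapping and trims it to the owner and repository segments. Whether that looser match is desirable depends on how the packages label their links.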