Add tool and guide for getting python license information (kubeflow#569)
* add python license tool and scripts

* update readme

* remove unused imports
zhenghuiwang authored and k8s-ci-robot committed Jan 22, 2020
1 parent b58ed35 commit 59bdeae
Showing 2 changed files with 127 additions and 0 deletions.
39 changes: 39 additions & 0 deletions py/kubeflow/testing/python-license-tools/README.md
@@ -0,0 +1,39 @@
# CLI tools to fetch license info for Python libraries

This doc shows how to get third-party library license information for Kubeflow Python applications.

As a prerequisite, please read the [Go license tools guide](https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/go-license-tools/README.md) to learn why third-party library license compliance is important and how it is accomplished for Go applications. This doc differs from the Go guide mainly in how dependencies are discovered (with `pipenv`) and how source repositories are looked up (from PyPI).

## How to get all dependencies with license and source code?

### I. Setup
Download the Python files from both the `go-license-tools` folder and this folder.

### II. Get all dependency repositories
1. Figure out all Python dependencies with `pipenv`. Your application doesn't have to use `pipenv` to manage its dependencies, but we use it here to find the transitive dependencies:

- Install [pipenv](https://pypi.org/project/pipenv/).
- Run `pipenv install ...` to install all your direct dependencies.
- Run `pipenv lock` to generate the lock file `Pipfile.lock`, a JSON file that pins all direct and transitive dependencies.
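
To check what the lock file contains before moving on, the pinned packages can be listed with a few lines of Python. This is only a minimal sketch: `pinned_versions` and `SAMPLE_LOCK` are hypothetical names, the sample versions are made up for illustration, and the code assumes the usual `Pipfile.lock` layout with a top-level `default` section whose entries carry a `version` such as `==1.2.3`.

```python
import json


def pinned_versions(lock_text):
    """Return {package: version} for the `default` section of a Pipfile.lock."""
    lock = json.loads(lock_text)
    pinned = {}
    for name, spec in lock.get("default", {}).items():
        # pipenv records versions as "==1.2.3"; entries installed from VCS
        # or a local path may have no "version" key at all.
        pinned[name] = spec.get("version", "").lstrip("=")
    return pinned


# Hypothetical lock file content, trimmed to the fields used here.
SAMPLE_LOCK = """
{
  "_meta": {"pipfile-spec": 6},
  "default": {
    "cachetools": {"version": "==4.0.0"},
    "certifi": {"version": "==2019.11.28"}
  },
  "develop": {}
}
"""

for name, version in sorted(pinned_versions(SAMPLE_LOCK).items()):
    print(name, version)
```

For a real project, pass the text of your own `Pipfile.lock` (for example `open('Pipfile.lock').read()`) instead of the sample string.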

2. Get the GitHub source repositories for all the dependencies by running this script:
```
python3 pipfile_to_github_repo.py
```
This script parses `Pipfile.lock` and looks up the source repositories registered on PyPI. It generates a file named `repo.txt` whose content looks like this:
```
AzureAD/azure-activedirectory-library-for-python
tkem/cachetools
certifi: None
cffi: None
......
```
Each line above is the GitHub repository name for one package. Unfortunately, not all packages list their source repository on PyPI; such packages are written as `<package_name>: None`. In this example, `certifi` and `cffi` lack source repository information, so we have to search for their repositories manually and edit the file.
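
To see which packages still need a manual lookup, the placeholder lines can be filtered out of `repo.txt`. The following is a rough sketch, not part of the tools in this folder; `unresolved_packages` is a hypothetical helper, and it assumes the `<package_name>: None` placeholder format described above.

```python
def unresolved_packages(repo_lines):
    """Return package names whose line is a `<package_name>: None` placeholder."""
    suffix = ": None"
    missing = []
    for line in repo_lines:
        line = line.strip()
        if line.endswith(suffix):
            missing.append(line[:-len(suffix)])
    return missing


# Sample lines copied from the example repo.txt content.
SAMPLE = [
    "AzureAD/azure-activedirectory-library-for-python",
    "tkem/cachetools",
    "certifi: None",
    "cffi: None",
]

print(unresolved_packages(SAMPLE))  # prints ['certifi', 'cffi']
```

With a real file, pass the open file object (for example `open('repo.txt')`) instead of `SAMPLE`.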

3. Manually edit source repository information. Once we find the missing package source repositories, we need to update `repo.txt`. For example, we can replace the line `certifi: None` with its GitHub source repository `certifi/python-certifi`.

However, `repo.txt` cannot be updated this way for `cffi`, because it is not hosted on GitHub. Remember to update its license URI and type manually in the final `license_info.csv`, which is produced in the next step.
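
If there are many placeholders, the manual edits can be kept in a small mapping and applied in one pass. This is an illustrative sketch under the same placeholder-format assumption; `apply_overrides` and `OVERRIDES` are hypothetical names, and the only mapping shown is the `certifi` example from this guide.

```python
def apply_overrides(repo_lines, overrides):
    """Replace `<package_name>: None` lines with a known GitHub repository."""
    suffix = ": None"
    updated = []
    for line in repo_lines:
        line = line.strip()
        if line.endswith(suffix):
            pkg = line[:-len(suffix)]
            # Keep the placeholder when no GitHub repository is known,
            # e.g. for packages that are not hosted on GitHub.
            line = overrides.get(pkg, line)
        updated.append(line)
    return updated


# Manual findings; only certifi has a GitHub repository in this example.
OVERRIDES = {"certifi": "certifi/python-certifi"}

lines = ["tkem/cachetools", "certifi: None", "cffi: None"]
for line in apply_overrides(lines, OVERRIDES):
    print(line)
```

Packages without an entry in the mapping, such as `cffi` here, keep their placeholder line, which serves as a reminder to fix their entries in `license_info.csv` by hand.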

### III. Get all license URLs and types
This step is the same as [the one](https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/go-license-tools/README.md#iii-get-all-license-urls-and-types) described in the Go license tools guide, except that you have to manually add the license information for code not hosted on GitHub.
88 changes: 88 additions & 0 deletions py/kubeflow/testing/python-license-tools/pipfile_to_github_repo.py
@@ -0,0 +1,88 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import requests
from bs4 import BeautifulSoup as Soup

parser = argparse.ArgumentParser(
    description='Look up GitHub repositories for packages in a Pipfile.lock.'
)
parser.add_argument(
    'pip_lock_path',
    nargs='?',
    default='Pipfile.lock',
    help='JSON format pip dependency lock file. (default: %(default)s)'
)
# No nargs='?' here: a bare `-o` without a value would set output_file to None.
parser.add_argument(
    '-o',
    '--output',
    dest='output_file',
    default='repo.txt',
    help=(
        'Output file, where each line is <owner>/<repo> or '
        '"<pkg>: None". (default: %(default)s)'
    ),
)

GITHUB_HTTPS = 'https://github.com/'
GITHUB_HTTP = 'http://github.com/'

args = parser.parse_args()


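def get_github_repo_name(url):
    """Strip the GitHub host prefix and any trailing slash from a repo URL."""
    if url.startswith(GITHUB_HTTPS):
        url = url[len(GITHUB_HTTPS):]
    elif url.startswith(GITHUB_HTTP):
        url = url[len(GITHUB_HTTP):]
    # rstrip also handles an empty string, where url[-1] would raise IndexError.
    return url.rstrip('/')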


def main():
    with open(args.pip_lock_path, 'r') as f:
        lockfile = json.load(f)

    # Map each package in the `default` section to its pinned version.
    deps = {}
    pkgs = lockfile.get('default', {})
    for pkg in pkgs:
        # Versions are recorded as '==1.2.3'; drop the leading '=' characters.
        deps[pkg] = pkgs[pkg].get('version').lstrip('=')

    repositories = {}
    for pkg in deps:
        pypi_url = 'https://pypi.org/project/{}/{}/'.format(pkg, deps[pkg])
        response = requests.get(pypi_url)
        assert response.ok, 'request to {} failed with {} {}'.format(
            pypi_url, response.status_code, response.reason
        )

        # Scan the project page for a GitHub link labeled Homepage or Code.
        soup = Soup(response.text, features='html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is not None and (href.startswith(GITHUB_HTTP) or
                                     href.startswith(GITHUB_HTTPS)):
                text = str(link)
                if text.find('Homepage') >= 0 or text.find('Code') >= 0:
                    repositories[pkg] = get_github_repo_name(href)
                    break
        else:
            # for/else: reached only when the loop found no matching link.
            repositories[pkg] = pkg + ': None'

    with open(args.output_file, 'w') as out:
        for pkg in repositories:
            print(repositories[pkg], file=out)


if __name__ == '__main__':
    main()
