-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] enhance model_uploader workflow to support MIT-licensed models from huggingface #388
Changes from all commits
e70d699
dcea54e
607336c
7adfbf3
7d18378
a5c6627
46b99d7
efbc533
3e9871d
20d4fc8
337ffa9
a007301
e997594
08540b2
98794e0
70f0713
fe0471a
db49002
4200890
34a18d6
5f6e714
5da6ae9
4f6af15
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,14 @@ on: | |
required: true | ||
type: string | ||
default: "huggingface" | ||
model_license: | ||
description: "Model license (e.g. Apache-2.0)" | ||
required: true | ||
type: choice | ||
default: "Apache-2.0" | ||
options: | ||
- "Apache-2.0" | ||
- "MIT" | ||
model_id: | ||
description: "Model ID for auto-tracing and uploading (e.g. sentence-transformers/msmarco-distilbert-base-tas-b)" | ||
required: true | ||
|
@@ -49,7 +57,10 @@ on: | |
options: | ||
- "NO" | ||
- "YES" | ||
|
||
additional_license_url: | ||
description: "(Optional) Additional license url of the huggingface MIT model." | ||
required: false | ||
type: string | ||
|
||
jobs: | ||
# Step 2: Initiate workflow variable | ||
|
@@ -73,6 +84,7 @@ jobs: | |
embedding_dimension=${{ github.event.inputs.embedding_dimension }} | ||
pooling_mode=${{ github.event.inputs.pooling_mode }} | ||
model_description="${{ github.event.inputs.model_description }}" | ||
additional_license_url="${{ github.event.inputs.additional_license_url }}" | ||
|
||
workflow_info=" | ||
============= Workflow Details ============== | ||
|
@@ -83,11 +95,13 @@ jobs: | |
|
||
========= Workflow Input Information ========= | ||
- Model ID: ${{ github.event.inputs.model_id }} | ||
- Model License ${{ github.event.inputs.model_license }} | ||
- Model Version: ${{ github.event.inputs.model_version }} | ||
- Tracing Format: ${{ github.event.inputs.tracing_format }} | ||
- Embedding Dimension: ${embedding_dimension:-N/A} | ||
- Pooling Mode: ${pooling_mode:-N/A} | ||
- Model Description: ${model_description:-N/A} | ||
- Additional License Url: ${additional_license_url:-N/A} | ||
|
||
======== Workflow Output Information ========= | ||
- Embedding Verification: Passed" | ||
|
@@ -96,16 +110,35 @@ jobs: | |
echo "${workflow_info@E}" >> $GITHUB_OUTPUT | ||
echo "EOF" >> $GITHUB_OUTPUT | ||
echo "${workflow_info@E}" | ||
- name: Check Additional license url | ||
id: check_additional_license_url | ||
run: | | ||
if [[ "${{ github.event.inputs.model_license }}" == "MIT" ]] | ||
then | ||
echo "Uploading MIT licensed model" | ||
if [[ "${{ github.event.inputs.additional_license_url }}" == "" ]] | ||
then | ||
echo "missing MIT license url" | ||
exit 1 | ||
fi | ||
if [[ "${{ github.event.inputs.model_source }}" == "" ]] | ||
then | ||
echo "only support MIT models from huggingface now" | ||
exit 1 | ||
fi | ||
fi | ||
- name: Initiate license_line | ||
id: init_license_line | ||
run: | | ||
echo "verified=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT | ||
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 :warning:" >> $GITHUB_OUTPUT | ||
echo "verified_apache=:white_check_mark: — It is verified that this model is licensed under Apache 2.0" >> $GITHUB_OUTPUT | ||
echo "verified_mit=:white_check_mark: — It is verified that this model is licensed under MIT, and we have provided copyright statements" >> $GITHUB_OUTPUT | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For apache 2.0 licensed models, will this make any issue? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've done a regression test using Apache-2 models and it still works well https://github.com/zhichao-aws/opensearch-py-ml/actions/runs/8843322311/job/24283396611 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For apache 2.0, I'm seeing:
I was thinking may be we can make this bit more dynamic? Like in |
||
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 or MIT with enough materials :warning:" >> $GITHUB_OUTPUT | ||
outputs: | ||
model_folder: ${{ steps.init_folders.outputs.model_folder }} | ||
sentence_transformer_folder: ${{ steps.init_folders.outputs.sentence_transformer_folder }} | ||
workflow_info: ${{ steps.init_workflow_info.outputs.workflow_info }} | ||
verified_license_line: ${{ steps.init_license_line.outputs.verified }} | ||
verified_apache_license_line: ${{ steps.init_license_line.outputs.verified_apache }} | ||
verified_mit_license_line: ${{ steps.init_license_line.outputs.verified_mit }} | ||
unverified_license_line: ${{ steps.init_license_line.outputs.unverified }} | ||
|
||
# Step 3: Check if the model already exists in the model hub | ||
|
@@ -175,11 +208,13 @@ jobs: | |
- name: Export Arguments | ||
run: | | ||
echo "MODEL_ID=${{ github.event.inputs.model_id }}" >> $GITHUB_ENV | ||
echo "MODEL_LICENSE=${{ github.event.inputs.model_license }}" >> $GITHUB_ENV | ||
echo "MODEL_VERSION=${{ github.event.inputs.model_version }}" >> $GITHUB_ENV | ||
echo "TRACING_FORMAT=${{ github.event.inputs.tracing_format }}" >> $GITHUB_ENV | ||
echo "EMBEDDING_DIMENSION=${{ github.event.inputs.embedding_dimension }}" >> $GITHUB_ENV | ||
echo "POOLING_MODE=${{ github.event.inputs.pooling_mode }}" >> $GITHUB_ENV | ||
echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV | ||
echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV | ||
echo "ADDITIONAL_LICENSE_URL=${{ github.event.inputs.additional_license_url }}" >> $GITHUB_ENV | ||
- name: Autotracing ${{ matrix.cluster }} secured=${{ matrix.secured }} version=${{matrix.entry.opensearch_version}} | ||
run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} trace" | ||
- name: Limit Model Size to 2GB | ||
|
@@ -195,10 +230,15 @@ jobs: | |
- name: License Verification | ||
id: license_verification | ||
run: | | ||
apache_verified=$(<trace_output/apache_verified.txt) | ||
if [[ $apache_verified == "True" ]] | ||
license_verified=$(<trace_output/license_verified.txt) | ||
if [[ $license_verified == "True" ]] | ||
then | ||
echo "license_line=${{ needs.init-workflow-var.outputs.verified_license_line }}" >> $GITHUB_OUTPUT | ||
if [[ "${{ github.event.inputs.model_license }}" == "Apache-2.0" ]] | ||
then | ||
echo "license_line=${{ needs.init-workflow-var.outputs.verified_apache_license_line }}" >> $GITHUB_OUTPUT | ||
else | ||
echo "license_line=${{ needs.init-workflow-var.outputs.verified_mit_license_line }}" >> $GITHUB_OUTPUT | ||
fi | ||
echo "license_info=Automatically Verified" >> $GITHUB_OUTPUT | ||
else | ||
echo "license_line=${{ needs.init-workflow-var.outputs.unverified_license_line }}" >> $GITHUB_OUTPUT | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,7 +15,7 @@ | |
import subprocess | ||
import time | ||
from pathlib import Path | ||
from typing import List | ||
from typing import List, Optional | ||
from zipfile import ZipFile | ||
|
||
import matplotlib.pyplot as plt | ||
|
@@ -38,6 +38,7 @@ | |
) | ||
|
||
LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE" | ||
THIRD_PARTY_FILE_NAME = "THIRD-PARTY" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you please check with the open source engineer that if naming the attribution file as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The name |
||
|
||
|
||
class SentenceTransformerModel: | ||
|
@@ -657,6 +658,22 @@ def _add_apache_license_to_model_zip_file(self, model_zip_file_path: str): | |
with ZipFile(str(model_zip_file_path), "a") as zipObj: | ||
zipObj.writestr("LICENSE", r.content) | ||
|
||
def _add_third_party_copyrights_statements_to_model_zip_file( | ||
self, third_party_copyrights_statements: str, model_zip_file_path: str | ||
): | ||
""" | ||
Add Statements text for non Apache-2.0 licensed third party model. Add it to the model zip file at model_zip_file_path | ||
|
||
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact. | ||
:type third_party_copyrights_statements: string | ||
:param model_zip_file_path: Path to the model zip file | ||
:type model_zip_file_path: string | ||
:return: no return value expected | ||
:rtype: None | ||
""" | ||
with ZipFile(str(model_zip_file_path), "a") as zipObj: | ||
zipObj.writestr(THIRD_PARTY_FILE_NAME, third_party_copyrights_statements) | ||
|
||
def zip_model( | ||
self, | ||
model_path: str = None, | ||
|
@@ -771,6 +788,7 @@ def save_as_pt( | |
model_output_path: str = None, | ||
zip_file_name: str = None, | ||
add_apache_license: bool = False, | ||
third_party_copyrights_statements: Optional[str] = None, | ||
) -> str: | ||
""" | ||
Download sentence transformer model directly from huggingface, convert model to torch script format, | ||
|
@@ -802,10 +820,16 @@ def save_as_pt( | |
:param add_apache_license: | ||
Optional, whether to add a Apache-2.0 license file to model zip file | ||
:type add_apache_license: string | ||
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact. | ||
:type third_party_copyrights_statements: string | ||
:return: model zip file path. The file path where the zip file is being saved | ||
:rtype: string | ||
""" | ||
|
||
if add_apache_license == True and third_party_copyrights_statements is not None: | ||
assert ( | ||
False | ||
), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it." | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My understanding is, we will still distribute the artifacts under apache 2.0 license but for MIT licenses models, we also need to add extra attribution file to deliberately mention about the contributors name. Am I missing anything? |
||
model = SentenceTransformer(model_id) | ||
|
||
if model_name is None: | ||
|
@@ -870,6 +894,10 @@ def save_as_pt( | |
) | ||
if add_apache_license: | ||
self._add_apache_license_to_model_zip_file(zip_file_path) | ||
if third_party_copyrights_statements is not None: | ||
self._add_third_party_copyrights_statements_to_model_zip_file( | ||
third_party_copyrights_statements, zip_file_path | ||
) | ||
|
||
self.torch_script_zip_file_path = zip_file_path | ||
print("zip file is saved to ", zip_file_path, "\n") | ||
|
@@ -883,6 +911,7 @@ def save_as_onnx( | |
model_output_path: str = None, | ||
zip_file_name: str = None, | ||
add_apache_license: bool = False, | ||
third_party_copyrights_statements: Optional[str] = None, | ||
) -> str: | ||
""" | ||
Download sentence transformer model directly from huggingface, convert model to onnx format, | ||
|
@@ -911,10 +940,16 @@ def save_as_onnx( | |
:param add_apache_license: | ||
Optional, whether to add a Apache-2.0 license file to model zip file | ||
:type add_apache_license: string | ||
:param third_party_copyrights_statements: Statements text for non Apache-2.0 licensed third party model. Should be put in the final artifact. | ||
:type third_party_copyrights_statements: string | ||
:return: model zip file path. The file path where the zip file is being saved | ||
:rtype: string | ||
""" | ||
|
||
if add_apache_license == True and third_party_copyrights_statements is not None: | ||
assert ( | ||
False | ||
), "When the model is from third party under non Apache-2.0 license, we can not add Apache-2.0 license for it." | ||
model = SentenceTransformer(model_id) | ||
|
||
if model_name is None: | ||
|
@@ -968,6 +1003,10 @@ def save_as_onnx( | |
) | ||
if add_apache_license: | ||
self._add_apache_license_to_model_zip_file(zip_file_path) | ||
if third_party_copyrights_statements is not None: | ||
self._add_third_party_copyrights_statements_to_model_zip_file( | ||
third_party_copyrights_statements, zip_file_path | ||
) | ||
|
||
self.onnx_zip_file_path = zip_file_path | ||
print("zip file is saved to ", zip_file_path, "\n") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we provide some other source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the third party statements can be wrong. So we need to check whether they are matched before approve the release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I tried to mean is let's say
github.event.inputs.model_source
== "XYZ" then this condition will pass, right?