[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

Open: wants to merge 4 commits into base: master

dacorvo (Contributor) commented Oct 22, 2024

Issue #4307

Description

This PR creates Hugging Face's PyTorch DLC for training on neuron-v2 devices (Trainium).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-deep-learning-containers-ci bot added the build, huggingface, and Size:S labels Oct 22, 2024
@dacorvo dacorvo marked this pull request as ready for review October 22, 2024 07:12
@dacorvo dacorvo requested review from a team as code owners October 22, 2024 07:12
"71670": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='A vulnerability in the PyTorch's torch.distributed.rpc framework, specifically in versions prior to 2.2.2, allows for remote code execution (RCE). The framework, which is used in distributed training scenarios, does not properly verify the functions being called during RPC (Remote Procedure Call) operations. This oversight permits attackers to execute arbitrary commands by leveraging built-in Python functions such as eval during multi-cpu RPC communication. The vulnerability arises from the lack of restriction on function calls when a worker node serializes and sends a PythonUDF (User Defined Function) to the master node, which then deserializes and executes the function without validation. This flaw can be exploited to compromise master nodes initiating distributed training, potentially leading to the theft of sensitive AI-related data.'",
"71671": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='PyTorch before v2.2.0 was discovered to contain a heap buffer overflow vulnerability in the component /runtime/vararg_functions.cpp. This vulnerability allows attackers to cause a Denial of Service (DoS) via a crafted input.'",
"71672": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='Pytorch before version v2.2.0 was discovered to contain a use-after-free vulnerability in torch/csrc/jit/mobile/interpreter.cpp.'",
"71064": "Affected versions of Requests, when making requests through a Requests `Session`, if the first request is made with `verify=False` to disable cert verification, all subsequent requests to the same host will continue to ignore cert verification regardless of changes to the value of `verify`. This behavior will continue for the lifecycle of the connection in the connection pool. Requests 2.32.0 fixes the issue, but versions 2.32.0 and 2.32.1 were yanked due to conflicts with CVE-2024-35195 mitigation."
Contributor

Missing a ',' at the end of this line. From the CodeBuild logs:

Traceback (most recent call last):
  File "src/main.py", line 132, in <module>
    main()
  File "src/main.py", line 128, in main
    image_builder(buildspec_file, image_types, device_types)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 370, in image_builder
    pushed_images += process_images(parent_images, "Parent/Independent", buildspec_path=buildspec)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 434, in process_images
    build_images(common_stage_image_list, make_dummy_boto_client=True)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 581, in build_images
    FORMATTER.progress(THREADS)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/output.py", line 103, in progress
    output[i] += "." * 10 + constants.STATUS_MESSAGE[futures[image].result()]
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image.py", line 164, in build
    self.update_pre_build_configuration()
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/common_stage_image.py", line 54, in update_pre_build_configuration
    generate_safety_report_for_image(
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 383, in generate_safety_report_for_image
    ignore_dict = get_safety_ignore_dict(
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 324, in get_safety_ignore_dict
    get_safety_ignore_dict_from_image_specific_safety_allowlists(image_uri)
  File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 265, in get_safety_ignore_dict_from_image_specific_safety_allowlists
    ignore_dict_from_image_specific_allowlist = json.load(f)
  File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 8 column 5 (char 2593)
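The decoder reports the position where it expected the comma, which is the first non-whitespace character of the next entry. So "line 8 column 5" most likely means the comma is missing at the end of line 7 of the allowlist. A quick local parse can catch this before CodeBuild does. The sketch below uses only Python's standard json module; `check_allowlist` and the sample entries are hypothetical and not part of the repository's tooling.

```python
import json
from typing import Optional


def check_allowlist(text: str) -> Optional[str]:
    """Return None if `text` parses as JSON, else a 'line L column C: message' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # JSONDecodeError carries the 1-based line/column of the failure point.
        return f"line {err.lineno} column {err.colno}: {err.msg}"
    return None


# Hypothetical two-entry allowlist where the first entry lacks its trailing comma.
broken = '{\n    "71064": "requests advisory"\n    "73655": "gevent advisory"\n}\n'
fixed = '{\n    "71064": "requests advisory",\n    "73655": "gevent advisory"\n}\n'

print(check_allowlist(broken))  # error points at the start of the second entry
print(check_allowlist(fixed))
```

As in the CI log, the error lands on the line after the one that is actually missing the comma.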

dacorvo (Contributor, Author) commented Oct 24, 2024

@Captainia a Python vulnerability was detected in gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know whether it can be ignored.
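Reading the report: the installed gevent 24.2.1 falls inside the advisory's spec '<24.10.1', so the image stays flagged until gevent is at least 24.10.1. Below is a minimal sketch of that comparison, assuming purely numeric dotted versions. Real tools typically parse versions per PEP 440 (for instance with the `packaging` library), which also handles pre-releases and local versions that this simplification ignores. The helper names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a purely numeric dotted version such as '24.2.1' into integers."""
    return tuple(int(part) for part in version.split("."))


def matches_less_than(installed: str, upper_bound: str) -> bool:
    """True if `installed` satisfies a '<upper_bound' spec (numeric versions only)."""
    return parse_version(installed) < parse_version(upper_bound)


installed, bound = "24.2.1", "24.10.1"
# Component-wise integer comparison: 2 < 10, so 24.2.1 is inside '<24.10.1'.
print(matches_less_than(installed, bound))
# Naive string comparison gets it backwards, since "2" > "1" character by character.
print(installed < bound)
```

The second print shows why versions must be compared component by component rather than as plain strings.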

Captainia (Contributor) left a comment

There is one new critical vulnerability in the image:

{"apparmor": [{"description": "It was discovered that the AppArmor policy compiler incorrectly generated looser restrictions than expected for rules allowing mount operations. A local attacker could possibly use this to bypass AppArmor restrictions in applications where some mount operations were permitted.", "vulnerability_id": "CVE-2016-1585", "name": "CVE-2016-1585", "package_name": "apparmor", "package_details": {"file_path": null, "name": "apparmor", "package_manager": "OS", "version": "2.13.3", "release": "7ubuntu5.3build2"}, "remediation": {"recommendation": {"text": "None Provided"}}, "cvss_v3_score": 9.8, "cvss_v30_score": 9.8, "cvss_v31_score": 0.0, "cvss_v2_score": 0.0, "cvss_v3_severity": "CRITICAL", "source_url": "https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-1585.html", "source": "UBUNTU_CVE", "severity": "CRITICAL", "status": "ACTIVE", "title": "CVE-2016-1585 - apparmor", "reason_to_ignore": "N/A"}]}

Shall we add it to the apt-get update && apt-get upgrade step in the Dockerfile?

dacorvo (Contributor, Author) commented Oct 24, 2024


It is already installed in the Dockerfile.
https://github.com/dacorvo/deep-learning-containers/blob/fb7e3ca6a491cae112edd53440a1761d67595d3b/huggingface/pytorch/training/docker/2.1/py3/sdk2.20.0/Dockerfile.neuronx#L40

Captainia (Contributor) commented

It seems this vulnerability is not exploitable in a Docker environment, but that is worth confirming; then we can add it to the ignore list.
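If the gevent finding is confirmed to be non-exploitable, the ignore list could be edited programmatically rather than by hand, which would also rule out the missing-comma failure reported earlier in this thread. A minimal sketch, assuming the image-specific allowlist is a flat JSON object mapping vulnerability ids to justification strings, as the entries quoted in the earlier review suggest; `add_ignore_entry` and the sample justification texts are hypothetical.

```python
import json


def add_ignore_entry(allowlist_text: str, vuln_id: str, reason: str) -> str:
    """Return the allowlist text with one more `vuln_id -> reason` entry.

    Assumes a flat JSON object mapping Safety vulnerability ids (strings)
    to free-text justifications.
    """
    entries = json.loads(allowlist_text)  # raises if the file is already malformed
    if vuln_id in entries:
        raise ValueError(f"vulnerability {vuln_id} is already in the allowlist")
    entries[vuln_id] = reason
    # json.dumps writes every delimiter itself, so no comma can go missing;
    # ensure_ascii=False keeps non-ASCII characters in advisories unescaped.
    return json.dumps(entries, indent=4, ensure_ascii=False) + "\n"


current = '{\n    "71064": "[Package: requests] example justification"\n}\n'
updated = add_ignore_entry(current, "73655", "[Package: gevent] hypothetical justification")
print(updated)
```

One trade-off: re-serializing rewrites the layout of the whole file, so a real change might prefer to match the allowlist's existing formatting by hand.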

Captainia (Contributor) commented

Could you update gevent the same way as in this comment on #4367?
