
Error in Docker based python environment creation on AML #19500

Closed
mohitzsh opened this issue Jun 25, 2021 · 4 comments
Labels
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Machine Learning Compute
  • ML-Compute: AreaPath
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
  • Service Attention: Workflow: This issue is responsible by Azure service team.

Comments

mohitzsh commented Jun 25, 2021

  • Package Name: azureml-sdk
  • Package Version: 1.31.0
  • Operating System: Ubuntu 18.04
  • Python Version: 3.7.10

Describe the bug
When submitting a job to an AML compute target with a Docker-based Python environment, the "azureml-logs/55_azureml-execution" step fails with the following error:

AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
	FailedContainerStart: Unable to start docker container
	err: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

	Reason: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

	Info: Failed to setup runtime for job execution: Job environment preparation failed on 10.0.0.5 with err exit status

To Reproduce
Steps to reproduce the behavior:

# Save this in current directory as train.py
import torch
print(torch.cuda.is_available())
# Submission script (run from the same directory as train.py):
from azureml.core import Experiment, ScriptRunConfig, Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DockerConfiguration

workspace = Workspace.from_config()  # assumes a config.json for the target workspace

env = Environment(name="env")
conda_deps = CondaDependencies()

conda_deps.add_channel("pytorch")
conda_deps.add_conda_package("pytorch==1.7.1=py3.7_cuda10.1.243_cudnn7.6.3_0")
conda_deps.add_conda_package("cudatoolkit==10.1.243")
conda_deps.add_conda_package("pip")
conda_deps.add_conda_package("python==3.7.0")
conda_deps.add_pip_package("scikit-learn~=0.24")
conda_deps.add_pip_package("transformers>=4.4.2")

env.python.conda_dependencies = conda_deps
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
docker_config = DockerConfiguration(use_docker=True)

compute_target = workspace.compute_targets["compute-target-name"] # Standard_NC6s_v3

src = ScriptRunConfig(source_directory=".",
                      script='train.py', 
                      compute_target=compute_target,
                      environment=env, 
                      docker_runtime_config=docker_config)
experiment = Experiment(workspace=workspace, name="test-exp")
experiment.submit(src)
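
For monitoring, the run handle returned by submit can be kept instead of the bare experiment.submit(src) call above. A minimal sketch using standard azureml-core calls (wait_for_completion and get_details), included only to show how the failing azureml-logs/55_azureml-execution output can be streamed from the SDK:

run = experiment.submit(src)
run.wait_for_completion(show_output=True, raise_on_error=False)  # streams setup/driver logs such as azureml-logs/55_azureml-execution
print(run.get_details().get("error"))  # error payload of a failed run (e.g. FailedStartingContainer), None otherwise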

Expected behavior
The experiment should run and print True to stdout.

Screenshots
NA
Additional context
Here are some logs:
20_image_build_log.txt
55_azureml-execution-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
65_job_prep-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
75_job_post-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt

@ghost added the needs-triage, customer-reported, and question labels on Jun 25, 2021.
@chlowell added the ML-Compute and Service Attention labels and removed the needs-triage label on Jun 25, 2021.
ghost commented Jun 25, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @brendalee.


ghost commented Jun 25, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.


azureml-github commented Jun 25, 2021 via email

@Karishma-Tiwari-MSFT

Thanks for bringing this to our attention. We will now close this issue, as it has been handed off to DataCompute/Runtime/Compute.
If there are further questions regarding this matter, please tag me in a comment. I will reopen it and we will gladly continue the discussion.

@github-actions bot locked and limited conversation to collaborators on Apr 11, 2023.