
Error in Docker based python environment creation on AML #19500

Closed
mohitzsh opened this issue Jun 25, 2021 · 4 comments
Labels
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Machine Learning Compute
  • ML-Compute: AreaPath
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
  • Service Attention: Workflow: This issue is responsible by Azure service team.

Comments

mohitzsh commented Jun 25, 2021

  • Package Name: azureml-sdk
  • Package Version: 1.31.0
  • Operating System: Ubuntu 18.04
  • Python Version: 3.7.10

Describe the bug
When submitting a job to an AML compute target with a Docker-based Python environment, the "azureml-logs/55_azureml-execution" step fails with the following error:

AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
	FailedContainerStart: Unable to start docker container
	err: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

	Reason: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

	Info: Failed to setup runtime for job execution: Job environment preparation failed on 10.0.0.5 with err exit status

To Reproduce
Steps to reproduce the behavior:

# Save this in current directory as train.py
import torch
print(torch.cuda.is_available())
# Submission script (run from the same directory as train.py):
from azureml.core import Experiment, ScriptRunConfig, Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DockerConfiguration

workspace = Workspace.from_config()  # assumes a config.json for the target workspace

env = Environment(name="env")
conda_deps = CondaDependencies()

conda_deps.add_channel("pytorch")
conda_deps.add_conda_package("pytorch==1.7.1=py3.7_cuda10.1.243_cudnn7.6.3_0")
conda_deps.add_conda_package("cudatoolkit==10.1.243")
conda_deps.add_conda_package("pip")
conda_deps.add_conda_package("python==3.7.0")
conda_deps.add_pip_package("scikit-learn~=0.24")
conda_deps.add_pip_package("transformers>=4.4.2")

env.python.conda_dependencies = conda_deps
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
docker_config = DockerConfiguration(use_docker=True)

compute_target = workspace.compute_targets["compute-target-name"] # Standard_NC6s_v3

src = ScriptRunConfig(source_directory=".",
                      script='train.py', 
                      compute_target=compute_target,
                      environment=env, 
                      docker_runtime_config=docker_config)
experiment = Experiment(workspace=workspace, name="test-exp")
experiment.submit(src)
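
For monitoring, the run handle returned by submit can be kept instead of the bare experiment.submit(src) call above. A minimal sketch using standard azureml-core calls (wait_for_completion and get_details), included only to show how the failing azureml-logs/55_azureml-execution output can be streamed from the SDK:

run = experiment.submit(src)
run.wait_for_completion(show_output=True, raise_on_error=False)  # streams setup/driver logs such as azureml-logs/55_azureml-execution
print(run.get_details().get("error"))  # error payload of a failed run (e.g. FailedStartingContainer), None otherwise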

Expected behavior
The experiment should run and print True to stdout.

Screenshots
NA
Additional context
Here are some logs:
20_image_build_log.txt
55_azureml-execution-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
65_job_prep-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
75_job_post-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt

@ghost added the needs-triage, customer-reported, and question labels on Jun 25, 2021.
@chlowell added the ML-Compute and Service Attention labels and removed the needs-triage label on Jun 25, 2021.
ghost commented Jun 25, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @brendalee.


ghost commented Jun 25, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.


azureml-github commented Jun 25, 2021 via email

@Karishma-Tiwari-MSFT

Thanks for bringing this to our attention. We will now close this issue, as it has been handed off to DataCompute/Runtime/Compute.
If there are further questions regarding this matter, please tag me in a comment. I will reopen it and we will gladly continue the discussion.

@github-actions bot locked and limited conversation to collaborators on Apr 11, 2023.