Error in Docker based python environment creation on AML #19500
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @brendalee.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.
Handed off to DataCompute/Runtime/Compute
From: msftbot[bot]
Sent: Friday, June 25, 2021 4:36 PM
To: Azure/azure-sdk-for-python
Cc: azureml-github; Mention
Subject: Re: [Azure/azure-sdk-for-python] Error in Docker based python environment creation on AML (#19500)
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.
Issue Details
* Package Name: azureml-sdk
* Package Version: 1.31.0
* Operating System: Ubuntu 18.04
* Python Version: 3.7.10
Describe the bug
When submitting a job to an AML compute target with a Docker-based Python environment, the step "azureml-logs/55_azureml-execution" fails with the following error:
AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
FailedContainerStart: Unable to start docker container
err: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Reason: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Info: Failed to setup runtime for job execution: Job environment preparation failed on 10.0.0.5 with err exit status
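When triaging a failure like the one above, the informative part is the nvidia-container-cli message buried inside the Docker daemon response. A minimal sketch, assuming the log text has the same shape as the excerpt in this report (the function name and the regular expression are illustrative and not part of any Azure ML tooling), pulls out the failing operation and its detail:

```python
import re
from typing import Optional

# Illustrative pattern for the "nvidia-container-cli: <kind> error: <detail>"
# fragment seen in this report's log excerpt. Other log shapes are not handled.
_NVIDIA_CLI_ERROR = re.compile(
    r"nvidia-container-cli: (?P<kind>[\w ]+?) error: "
    r"(?P<detail>.+?)(?:: unknown)?\.?\s*$",
    re.MULTILINE,
)


def extract_nvidia_cli_error(log_text: str) -> Optional[dict]:
    """Return the first nvidia-container-cli error in log_text, or None."""
    match = _NVIDIA_CLI_ERROR.search(log_text)
    if match is None:
        return None
    return {"kind": match.group("kind"), "detail": match.group("detail")}


# Shortened copy of the failing line from the report.
sample = (
    "docker: Error response from daemon: OCI runtime create failed: "
    "container_linux.go:370: starting container process caused: "
    "process_linux.go:459: container init caused: Running hook #1:: "
    "error running hook: exit status 1, stdout: , stderr: "
    "nvidia-container-cli: mount error: file creation failed: "
    "/mnt/docker/overlay2/1b3ad7a17baaf26506a5654fedaf3d30bbe78b73b5364a272e01787fd9b864ba"
    "/merged/run/nvidia-persistenced/socket: no such device or address: unknown."
)
print(extract_nvidia_cli_error(sample))
```

Here the extracted detail points at the nvidia-persistenced socket mount, which suggests the problem lies in the node's GPU container runtime rather than in the conda environment definition.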
To Reproduce
Steps to reproduce the behavior:
# Save this in the current directory as train.py
import torch
print(torch.cuda.is_available())

# Submission script, run separately from train.py
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DockerConfiguration

# Assumption: the workspace is loaded from a local config.json
workspace = Workspace.from_config()

# Conda environment with a CUDA 10.1 build of PyTorch on an Azure ML GPU base image
env = Environment(name="env")
conda_deps = CondaDependencies()
conda_deps.add_channel("pytorch")
conda_deps.add_conda_package("pytorch==1.7.1=py3.7_cuda10.1.243_cudnn7.6.3_0")
conda_deps.add_conda_package("cudatoolkit==10.1.243")
conda_deps.add_conda_package("pip")
conda_deps.add_conda_package("python==3.7.0")
conda_deps.add_pip_package("scikit-learn~=0.24")
conda_deps.add_pip_package("transformers>=4.4.2")
env.python.conda_dependencies = conda_deps
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"
docker_config = DockerConfiguration(use_docker=True)

compute_target = workspace.compute_targets["compute-target-name"]  # Standard_NC6s_v3
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
    environment=env,
    docker_runtime_config=docker_config,
)
experiment = Experiment(workspace=workspace, name="test-exp")
experiment.submit(src)
Expected behavior
The experiment should run and print True to stdout.
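The reported failure appears to occur before train.py ever starts, so a bare True or False gives little to go on once the container does come up. A minimal sketch of a more verbose train.py, assuming only the public torch attributes shown (the function name cuda_report is hypothetical), reports what the process can actually see:

```python
import json
import os


def cuda_report() -> dict:
    """Collect basic GPU visibility facts; tolerates torch being absent."""
    report = {"cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # Environment build problem rather than a GPU problem.
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    print(json.dumps(cuda_report(), indent=2))
```

Separating "torch missing", "CPU-only build", and "no device visible" makes it easier to tell an environment build problem from a container runtime problem.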
Screenshots
NA
Additional context
Here are some logs:
20_image_build_log.txt<https://github.com/Azure/azure-sdk-for-python/files/6718528/20_image_build_log.txt>
55_azureml-execution-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt<https://github.com/Azure/azure-sdk-for-python/files/6718530/55_azureml-execution-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt>
65_job_prep-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt<https://github.com/Azure/azure-sdk-for-python/files/6718532/65_job_prep-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt>
75_job_post-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt<https://github.com/Azure/azure-sdk-for-python/files/6718533/75_job_post-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt>
Author: mohit-sh
Assignees: -
Labels: ML-Compute, Machine Learning Compute, Service Attention, customer-reported, question
Milestone: -
Thanks for bringing this to our attention. We will now close this issue as this has been handed off to DataCompute/Runtime/Compute.
Describe the bug
When submitting a job to an AML compute target with a Docker-based Python environment, the step "azureml-logs/55_azureml-execution" fails with the following error:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The experiment should run and print True to the stdout.
Screenshots
NA
Additional context
Here are some logs:
20_image_build_log.txt
55_azureml-execution-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
65_job_prep-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt
75_job_post-tvmps_c0b9a69eacd833c39d52bd1c08966426f895191550edefd1affad1331377c5a4_d.txt