Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Albert does not run locally due to FSx data source #43

Open
Cpruce opened this issue Oct 26, 2020 · 0 comments
Open

Albert does not run locally due to FSx data source #43

Cpruce opened this issue Oct 26, 2020 · 0 comments

Comments

@Cpruce
Copy link

Cpruce commented Oct 26, 2020

Error:

root@6eca14cacd82:/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp# ./launch.sh
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2020-10-26 13:42:31.332706: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Parameter distribution will be renamed to {'mpi': {'enabled': True, 'processes_per_host': 8, 'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'}} in SageMaker Python SDK v2.
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
DATA_SOURCE {'FileSystemDataSource': {'FileSystemId': 'fs-0d3f0dfc5f6428d70', 'FileSystemType': 'FSxLustre', 'DirectoryPath': '/fsx', 'FileSystemAccessMode': 'rw'}}
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp/albert/launch_sagemaker.py", line 63, in <module>
    security_group_ids=sm_args.security_group_ids,
  File "/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp/common/sagemaker_utils.py", line 97, in launch_sagemaker_job
    estimator_hvd.fit(fsx_input)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/tensorflow/estimator.py", line 483, in fit
    fit_super()
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/tensorflow/estimator.py", line 462, in fit_super
    super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/estimator.py", line 496, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/estimator.py", line 1091, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/session.py", line 590, in train
    self.sagemaker_client.create_training_job(**train_request)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/local/local_session.py", line 102, in create_training_job
    training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/local/entities.py", line 80, in start
    "Need channel['DataSource'] to have ['S3DataSource'] or ['FileDataSource']"
ValueError: Need channel['DataSource'] to have ['S3DataSource'] or ['FileDataSource']

This is due to the FileSystemDataSource not being supported by _LocalTrainingJob

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant