Separate data uploading from job submission for DS2 cloud training and add support for multiple shards uploading. #205

Merged (3 commits), Aug 15, 2017

@xinghai-sun xinghai-sun commented Aug 15, 2017

Resolve #205

  • Separate data uploading from training job submission for DS2 cloud training.
  • Add support for packing and uploading multiple shards.
  • Update cloud/README.md


```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}
```
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are jjjjjkknew paths in PaddleCloud filesystem.
Contributor

jjjjjkknew -> new ?

Contributor Author

Done.
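
For readers following the `OUT_MANIFESTS` description in this thread, here is a minimal sketch of what rewriting `audio_filepath` values could look like. It is not the PR's implementation: the function name `rewrite_manifest_lines` and the flat `<cloud_data_dir>/<basename>` layout are assumptions made for illustration, and the actual upload script may encode shard locations differently.

```python
import json
import os


def rewrite_manifest_lines(lines, cloud_data_dir):
    """Return manifest lines whose audio_filepath points into cloud_data_dir.

    Hypothetical helper: assumes one JSON object per line and a flat
    cloud layout in which each file keeps only its base name.
    """
    prefix = cloud_data_dir.rstrip("/")
    rewritten = []
    for line in lines:
        entry = json.loads(line)
        filename = os.path.basename(entry["audio_filepath"])
        # Cloud paths are assumed to be POSIX-style, so join with "/".
        entry["audio_filepath"] = prefix + "/" + filename
        rewritten.append(json.dumps(entry))
    return rewritten
```

Note that keeping only the base name would collide if two local directories contained files with the same name; a real layout would need to account for that.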


## Step-2 Configure computation resource
You have to take this step only once, when it is your first time to do the cloud training. Later on, the data is persistent on the cloud filesystem and is reusable for multple jobs.
Contributor

multple -> multiple

Contributor Author

Done.

MEAN_STD_FILE="../mean_std.npz"
# Configure output path in PaddleCloud filesystem
CLOUD_DATA_DIR="/pfs/dlnel/home/sunxinghai@baidu.com/deepspeech2/data"
TRAIN_MANIFEST="cloud/cloud.manifest.test"
Contributor

'cloud.manifest.test' -> 'cloud.manifest.train'?

Contributor Author

Done.

"--dev_manifest_path",
default="../datasets/manifest.dev",
"--in_manifest_paths",
default=["../datasets/manifest.test", "../datasets/manifest.dev"],
Contributor

Is the default here intentionally not set to /manifest.train?

Contributor Author

Fixed. It was not intentional; I set it temporarily while debugging and forgot to change it back.
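
As an illustration of the point raised in this thread, a list-valued option such as `--in_manifest_paths` can be declared with `nargs="+"` so that several manifests are passed separated by whitespace. This is only a sketch: the `build_parser` wrapper, the `manifest.train` default, the `--out_manifest_paths` defaults, and the default of 10 for `--num_shards` are assumptions, not the values committed in the PR.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical data-upload script."""
    parser = argparse.ArgumentParser(
        description="Upload manifests and audio shards (illustrative only).")
    parser.add_argument(
        "--in_manifest_paths",
        default=["../datasets/manifest.train", "../datasets/manifest.dev"],
        type=str,
        nargs="+",  # one or more whitespace-separated paths
        help="Local input manifests.")
    parser.add_argument(
        "--out_manifest_paths",
        # Assumed defaults, chosen only to mirror the input list.
        default=["./cloud.manifest.train", "./cloud.manifest.dev"],
        type=str,
        nargs="+",
        help="Local paths for the rewritten output manifests.")
    parser.add_argument(
        "--num_shards",
        default=10,  # assumed default
        type=int,
        help="Number of shards to pack the data into.")
    return parser
```

With this declaration, `--in_manifest_paths a.manifest b.manifest` yields a two-element list, so input and output manifests can be paired by position.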

pcloud_cp(args.vocab_file, cloud_vocab_file)
pcloud_cp(args.mean_std_file, cloud_mean_file)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, 10)
Contributor

10 -> args.num_shards

Contributor Author

Done.
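
To show why the shard count is better passed as a parameter than hard-coded, here is a rough sketch of splitting a list of manifest entries into shards. The helper `split_into_shards` is hypothetical and is not taken from the PR; how `upload_data` actually partitions and packs the files is not shown in this conversation.

```python
def split_into_shards(items, num_shards):
    """Split items into at most num_shards contiguous, near-equal chunks.

    Hypothetical helper: earlier shards receive one extra item when the
    items do not divide evenly, and empty shards are not emitted.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    items = list(items)
    base, extra = divmod(len(items), num_shards)
    shards = []
    start = 0
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        if size == 0:
            break  # fewer items than shards: stop early
        shards.append(items[start:start + size])
        start += size
    return shards
```

Each resulting chunk could then be packed and uploaded independently, with the chunk count controlled by something like `args.num_shards`.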

@xinghai-sun xinghai-sun merged commit 69ebc58 into PaddlePaddle:develop Aug 15, 2017
@xinghai-sun xinghai-sun deleted the cloud_shards branch August 15, 2017 14:28