Merge pull request #205 from xinghai-sun/cloud_shards
Separate data uploading from job submission for DS2 cloud training and add support for multiple shards uploading.
Showing 7 changed files with 173 additions and 226 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,81 +1,63 @@ | ||
# Run DS2 on PaddleCloud | ||
# Train DeepSpeech2 on PaddleCloud | ||
|
||
>Note: | ||
>Make sure [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `models/deep_speech_2/cloud/` | ||
>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `deep_speech_2/cloud/` | ||
## Step-1 Configure data set | ||
## Step 1: Upload Data | ||
|
||
Configure your input data and output path in pcloud_submit.sh: | ||
Given several input manifests, `pcloud_upload_data.sh` packs and uploads all the audio files they reference to the PaddleCloud filesystem, and generates corresponding manifest files with updated cloud paths. | ||
|
||
- `TRAIN_MANIFEST`: Absolute path of the training data manifest file in the local filesystem. This file has the format shown below: | ||
Please modify the following arguments in `pcloud_upload_data.sh`: | ||
|
||
- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files listing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter. | ||
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths. | ||
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have permission to write to it. | ||
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. A smaller `NUM_SHARDS` requires more temporary local disk space for packing data. | ||
|
||
By running: | ||
|
||
``` | ||
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text | ||
": "mister quilter is the ..."} | ||
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text | ||
": "nor is mister ..."} | ||
sh pcloud_upload_data.sh | ||
``` | ||
all the audio files will be uploaded to the PaddleCloud filesystem, and the modified manifest files will be written to the paths in `OUT_MANIFESTS`. | ||
|
||
- `TEST_MANIFEST`: Absolute path of the test data manifest file in the local filesystem. This file has the same format as `TRAIN_MANIFEST`. | ||
- `VOCAB_FILE`: Absolute path of the vocabulary file in the local filesystem. | ||
- `MEAN_STD_FILE`: Absolute path of the normalizer's statistics file in the local filesystem. | ||
- `CLOUD_DATA_DIR:` Absolute path in PaddleCloud filesystem. We will upload local train data to this directory. | ||
- `CLOUD_MODEL_DIR`: Absolute path in PaddleCloud filesystem. PaddleCloud trainer will save model to this directory. | ||
You only need to take this step once, the first time you run cloud training. Afterwards, the data persists on the cloud filesystem and can be reused for further job submissions. | ||
|
||
>Note: Upload will be skipped if target file has existed in `CLOUD_DATA_DIR`. | ||
## Step 2: Configure Training | ||
|
||
## Step-2 Configure computation resource | ||
Configure the following cloud training arguments in `pcloud_submit.sh`: | ||
|
||
Configure computation resource in pcloud_submit.sh: | ||
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Note that the `audio_filepath` values should point to the cloud filesystem, like those generated by `pcloud_upload_data.sh`. | ||
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation. | ||
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have permission to write to it. | ||
- `BATCH_SIZE`: Training batch size for a single node. | ||
- `NUM_GPU`: Number of GPUs allocated for a single node. | ||
- `NUM_NODE`: Number of nodes (machines) allocated for this job. | ||
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes. | ||
|
||
``` | ||
# Configure computation resource and submit job to PaddleCloud | ||
paddlecloud submit \ | ||
-image wanghaoshuang/pcloud_ds2:latest \ | ||
-jobname ${JOB_NAME} \ | ||
-cpu 4 \ | ||
-gpu 4 \ | ||
-memory 10Gi \ | ||
-parallelism 1 \ | ||
-pscpu 1 \ | ||
-pservers 1 \ | ||
-psmemory 10Gi \ | ||
-passes 1 \ | ||
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \ | ||
${DS2_PATH} | ||
``` | ||
For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务) | ||
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as you would for local training. | ||
|
||
## Step-3 Configure algorithm options | ||
Configure algorithm options in pcloud_train.sh: | ||
``` | ||
python train.py \ | ||
--use_gpu=1 \ | ||
--trainer_count=4 \ | ||
--batch_size=256 \ | ||
--mean_std_filepath=$MEAN_STD_FILE \ | ||
--train_manifest_path='./local.train.manifest' \ | ||
--dev_manifest_path='./local.test.manifest' \ | ||
--vocab_filepath=$VOCAB_PATH \ | ||
--output_model_dir=${MODEL_PATH} | ||
``` | ||
You can get more information about algorithm options by running the following command: | ||
``` | ||
cd .. | ||
python train.py --help | ||
``` | ||
By running: | ||
|
||
## Step-4 Submit job | ||
``` | ||
$ sh pcloud_submit.sh | ||
sh pcloud_submit.sh | ||
``` | ||
you submit a training job to PaddleCloud. The job name is printed when the submission is done. | ||
|
||
|
||
## Step 3: Get Job Logs | ||
|
||
Run this to list all the jobs you have submitted, as well as their running status: | ||
|
||
## Step-5 Get logs | ||
``` | ||
$ paddlecloud logs -n 10000 deepspeech20170727130129 | ||
paddlecloud get jobs | ||
``` | ||
For more information, please refer to [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) or get help by running the following command: | ||
|
||
Run this to print the corresponding job's logs: | ||
``` | ||
paddlecloud --help | ||
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME | ||
``` | ||
|
||
## More Help | ||
|
||
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务). |
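
The README change above says that the upload step packs audio files into `NUM_SHARDS` tar parts and rewrites each `audio_filepath` to a cloud path. The following is a minimal sketch of that idea, not the actual `upload_data.py`: the round-robin assignment, the `part-%05d.tar` naming, and the `tar:<tar path>#<file name>` path scheme are all assumptions made for illustration, as is the function name `shard_manifest`.

```python
import json


def shard_manifest(entries, num_shards, cloud_data_dir):
    """Assign manifest entries to shards round-robin and rewrite their paths.

    Hypothetical layout: entry i goes into part-<i % num_shards>.tar and its
    audio_filepath becomes "tar:<cloud tar path>#<original file name>".
    """
    shards = [[] for _ in range(num_shards)]
    for i, entry in enumerate(entries):
        shard_id = i % num_shards
        tar_path = "%s/part-%05d.tar" % (cloud_data_dir.rstrip("/"), shard_id)
        filename = entry["audio_filepath"].split("/")[-1]
        # Copy the entry so the local manifest entries stay untouched.
        updated = dict(entry, audio_filepath="tar:%s#%s" % (tar_path, filename))
        shards[shard_id].append(updated)
    return shards


# Illustrative entries only; real manifests carry real paths and transcripts.
entries = [
    {"audio_filepath": "/data/a/%04d.flac" % i, "duration": 1.0, "text": "..."}
    for i in range(5)
]
shards = shard_manifest(entries, num_shards=2, cloud_data_dir="/pfs/demo/data")
for shard in shards:
    for entry in shard:
        print(json.dumps(entry))
```

Under this scheme each shard holds roughly `len(entries) / num_shards` files, which is consistent with the README's note that fewer shards means larger tar files and therefore more temporary local disk space while packing.
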
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,48 +1,27 @@ | ||
# Configure input data set in local filesystem | ||
TRAIN_MANIFEST="../datasets/manifest.train" | ||
DEV_MANIFEST="../datasets/manifest.dev" | ||
VOCAB_FILE="../datasets/vocab/eng_vocab.txt" | ||
MEAN_STD_FILE="../mean_std.npz" | ||
# Configure output path in PaddleCloud filesystem | ||
CLOUD_DATA_DIR="/pfs/dlnel/home/sunxinghai@baidu.com/deepspeech2/data" | ||
CLOUD_MODEL_DIR="/pfs/dlnel/home/sunxinghai@baidu.com/deepspeech2/model" | ||
# Configure cloud resources | ||
NUM_CPU=8 | ||
TRAIN_MANIFEST="cloud/cloud.manifest.train" | ||
DEV_MANIFEST="cloud/cloud.manifest.dev" | ||
CLOUD_MODEL_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/model" | ||
BATCH_SIZE=256 | ||
NUM_GPU=8 | ||
NUM_NODE=1 | ||
MEMORY="10Gi" | ||
IS_LOCAL="True" | ||
|
||
# Pack and upload local data to PaddleCloud filesystem | ||
python upload_data.py \ | ||
--train_manifest_path=${TRAIN_MANIFEST} \ | ||
--dev_manifest_path=${DEV_MANIFEST} \ | ||
--vocab_file=${VOCAB_FILE} \ | ||
--mean_std_file=${MEAN_STD_FILE} \ | ||
--cloud_data_path=${CLOUD_DATA_DIR} | ||
if [ $? -ne 0 ] | ||
then | ||
echo "upload data failed!" | ||
exit 1 | ||
fi | ||
|
||
# Submit job to PaddleCloud | ||
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S` | ||
DS2_PATH=${PWD%/*} | ||
cp -f pcloud_train.sh ${DS2_PATH} | ||
|
||
paddlecloud submit \ | ||
-image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \ | ||
-jobname ${JOB_NAME} \ | ||
-cpu ${NUM_CPU} \ | ||
-cpu ${NUM_GPU} \ | ||
-gpu ${NUM_GPU} \ | ||
-memory ${MEMORY} \ | ||
-memory 64Gi \ | ||
-parallelism ${NUM_NODE} \ | ||
-pscpu 1 \ | ||
-pservers 1 \ | ||
-psmemory ${MEMORY} \ | ||
-psmemory 64Gi \ | ||
-passes 1 \ | ||
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR} ${NUM_CPU} ${NUM_GPU} ${IS_LOCAL}" \ | ||
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \ | ||
${DS2_PATH} | ||
|
||
rm ${DS2_PATH}/pcloud_train.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,36 +1,24 @@ | ||
DATA_PATH=$1 | ||
MODEL_PATH=$2 | ||
NUM_CPU=$3 | ||
TRAIN_MANIFEST=$1 | ||
DEV_MANIFEST=$2 | ||
MODEL_PATH=$3 | ||
NUM_GPU=$4 | ||
IS_LOCAL=$5 | ||
BATCH_SIZE=$5 | ||
IS_LOCAL=$6 | ||
|
||
TRAIN_MANI=${DATA_PATH}/cloud.train.manifest | ||
DEV_MANI=${DATA_PATH}/cloud.dev.manifest | ||
TRAIN_TAR=${DATA_PATH}/cloud.train.tar | ||
DEV_TAR=${DATA_PATH}/cloud.dev.tar | ||
VOCAB_PATH=${DATA_PATH}/vocab.txt | ||
MEAN_STD_FILE=${DATA_PATH}/mean_std.npz | ||
|
||
# split train data for each pcloud node | ||
python ./cloud/split_data.py \ | ||
--in_manifest_path=${TRAIN_MANI} \ | ||
--data_tar_path=${TRAIN_TAR} \ | ||
--out_manifest_path='/local.train.manifest' | ||
--in_manifest_path=${TRAIN_MANIFEST} \ | ||
--out_manifest_path='/local.manifest.train' | ||
|
||
# split dev data for each pcloud node | ||
python ./cloud/split_data.py \ | ||
--in_manifest_path=${DEV_MANI} \ | ||
--data_tar_path=${DEV_TAR} \ | ||
--out_manifest_path='/local.dev.manifest' | ||
--in_manifest_path=${DEV_MANIFEST} \ | ||
--out_manifest_path='/local.manifest.dev' | ||
|
||
# run train | ||
python train.py \ | ||
--batch_size=$BATCH_SIZE \ | ||
--use_gpu=1 \ | ||
--trainer_count=${NUM_GPU} \ | ||
--num_threads_data=${NUM_CPU} \ | ||
--num_threads_data=${NUM_GPU} \ | ||
--is_local=${IS_LOCAL} \ | ||
--mean_std_filepath=${MEAN_STD_FILE} \ | ||
--train_manifest_path='/local.train.manifest' \ | ||
--dev_manifest_path='/local.dev.manifest' \ | ||
--vocab_filepath=${VOCAB_PATH} \ | ||
--output_model_dir=${MODEL_PATH} | ||
--train_manifest_path='/local.manifest.train' \ | ||
--dev_manifest_path='/local.manifest.dev' \ | ||
--output_model_dir=${MODEL_PATH} \ |
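
After this change, `pcloud_train.sh` calls `split_data.py` with only an input manifest and an output manifest path, so each node ends up with its own local manifest. The diff does not show how `split_data.py` divides the lines; the sketch below assumes a simple interleaved split and takes the node index and node count as plain parameters (`node_id`, `num_nodes`), which are hypothetical names rather than anything the real script is known to use.

```python
def split_for_node(lines, node_id, num_nodes):
    """Return the manifest lines that node `node_id` (0-based) should train on.

    Assumed policy: line i belongs to node i % num_nodes, so the nodes get
    disjoint, near-equal portions that together cover the whole manifest.
    """
    return [line for i, line in enumerate(lines) if i % num_nodes == node_id]


# Illustrative manifest lines; a real manifest has one JSON object per line.
manifest_lines = ['{"audio_filepath": "utt%d.flac"}' % i for i in range(7)]
parts = [split_for_node(manifest_lines, n, 3) for n in range(3)]
for node_id, part in enumerate(parts):
    print(node_id, len(part))
```

An interleaved split keeps the per-node portions balanced even when the manifest is sorted (for example by duration), which a contiguous block split would not.
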
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
IN_MANIFESTS="../datasets/manifest.train ../datasets/manifest.dev ../datasets/manifest.test" | ||
OUT_MANIFESTS="./cloud.manifest.train ./cloud.manifest.dev ./cloud.manifest.test" | ||
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech" | ||
NUM_SHARDS=50 | ||
|
||
python upload_data.py \ | ||
--in_manifest_paths ${IN_MANIFESTS} \ | ||
--out_manifest_paths ${OUT_MANIFESTS} \ | ||
--cloud_data_dir ${CLOUD_DATA_DIR} \ | ||
--num_shards ${NUM_SHARDS} | ||
|
||
if [ $? -ne 0 ] | ||
then | ||
echo "Upload Data Failed!" | ||
exit 1 | ||
fi | ||
echo "All Done." |
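
The script above passes `${IN_MANIFESTS}` and `${OUT_MANIFESTS}` unquoted, so the shell splits each whitespace-delimited value into separate arguments. One way a Python script could accept them is `argparse` with `nargs="+"`. This is a sketch under that assumption, not the real `upload_data.py`: the flag names are taken from the script above, while the length check, the default of 50, and the helper name `parse_args` are illustrative.

```python
import argparse


def parse_args(argv):
    """Parse multi-manifest upload arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of multi-manifest arguments.")
    # nargs="+" collects one or more whitespace-separated values into a list.
    parser.add_argument("--in_manifest_paths", nargs="+", required=True)
    parser.add_argument("--out_manifest_paths", nargs="+", required=True)
    parser.add_argument("--cloud_data_dir", required=True)
    parser.add_argument("--num_shards", type=int, default=50)
    args = parser.parse_args(argv)
    # Each input manifest needs exactly one output manifest.
    if len(args.in_manifest_paths) != len(args.out_manifest_paths):
        parser.error("in_manifest_paths and out_manifest_paths must have the same length")
    return args


args = parse_args(
    "--in_manifest_paths a.train a.dev "
    "--out_manifest_paths cloud.a.train cloud.a.dev "
    "--cloud_data_dir /pfs/demo/data --num_shards 10".split()
)
print(list(zip(args.in_manifest_paths, args.out_manifest_paths)))
```

Pairing inputs and outputs by position is why the order of paths in `IN_MANIFESTS` and `OUT_MANIFESTS` matters in this sketch.
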