forked from PaddlePaddle/Paddle
Commit
add some docs, requirements.txt, update setup.py and bug fix (PaddleP…
…addle#3) * add some docs and requirements.txt * bug fix
lilong12
authored
Dec 18, 2019
1 parent
b680968
commit 36cc759
Showing 24 changed files with 319 additions and 812 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
*.pyc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
# Base64-Format Image Preprocessing

## Introduction

In production workloads, training images are often stored base64-encoded. Each line of a training data file holds the base64 data of one image and that image's label, usually separated by a tab character ('\t').

The list of all training data files is usually recorded in a separate file, so the training dataset directory is laid out as follows:

```shell
dataset
|-- file_list.txt
|-- dataset.part1
|-- dataset.part2
... ....
`-- dataset.part10
```

Here, file_list.txt lists the training data files, one file per line. For the example above, file_list.txt contains:

```shell
dataset.part1
dataset.part2
...
dataset.part10
```

Each line of a data file holds the base64 representation of one image, followed by the image's label after a tab.

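As a minimal sketch of the line format described above, the following shows how one line could be split into image bytes and a label. The helper name `parse_line` and the sample payload are hypothetical and are not part of the tool.

```python
import base64


def parse_line(line):
    """Split one data-file line into raw image bytes and its label string."""
    # Split on the last tab; base64 text itself never contains a tab.
    b64_image, label = line.rstrip("\n").rsplit("\t", 1)
    return base64.b64decode(b64_image), label


# Build a sample line from placeholder bytes (not a real image).
sample = base64.b64encode(b"fake-image-bytes").decode("ascii") + "\t7\n"
image_bytes, label = parse_line(sample)
```

The label is returned as a string; converting it to an integer is left to the caller, since the source does not specify the label type.
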
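For distributed training, every GPU must process the same number of images, and the training data usually needs a global shuffle before training.

This document describes the Base64-format image preprocessing tool, which globally shuffles the training data and splits it evenly into multiple data files, one per GPU used in training. When the total number of training images is not divisible by the number of GPUs, the dataset is typically padded with images chosen at random from the training set, so that the total becomes an integer multiple of the number of GPUs.
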
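The shuffle, pad, and split steps described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in tools/process_base64_files.py: the function name `shuffle_pad_split` and its `seed` parameter are hypothetical, and all lines are assumed to fit in memory.

```python
import random


def shuffle_pad_split(lines, nranks, seed=None):
    """Globally shuffle lines, pad to a multiple of nranks, and split evenly."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)

    remainder = len(lines) % nranks
    if remainder:
        # Pad with samples drawn at random from the dataset itself.
        padding = [rng.choice(lines) for _ in range(nranks - remainder)]
        lines.extend(padding)

    per_rank = len(lines) // nranks
    return [lines[i * per_rank:(i + 1) * per_rank] for i in range(nranks)]


# 10 samples on 4 GPUs: 2 samples are added as padding, giving 3 per part.
samples = [f"img{i}\t{i % 3}" for i in range(10)]
parts = shuffle_pad_split(samples, nranks=4, seed=0)
```

Each element of `parts` would then be written to its own data file, one file per GPU.
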
## Usage

The tool is located in the tools directory. To view its help message, run:

```shell
python tools/process_base64_files.py --help
```

The tool supports the following command-line options:

* data_dir: the root directory of the training data
* file_list: the file that lists the training data files, e.g. file_list.txt
* nranks: the number of GPUs used for training

Run the tool with the following command:

```shell
python tools/process_base64_files.py --data_dir=./dataset --file_list=file_list.txt --nranks=8
```

This generates 8 data files, each containing the same number of training samples.

The resulting directory layout is:

```shell
dataset
|-- file_list.txt
|-- dataset.part1
|-- dataset.part2
... ....
`-- dataset.part8
```
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
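# Distributed Parameter Conversion

## Introduction

The parameters of the last fully connected layer (W and b; if the bias b is absent, the layer has only W) are usually sharded across all training GPUs. For example, if N GPUs are used during training, then
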
$$W = [W_{1}, W_{2}, \ldots, W_{N}]$$
$$b = [b_{1}, b_{2}, \ldots, b_{N}]$$
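
The parameters $W_{i}$ and $b_{i}$ are stored on the i-th GPU.

When the model is saved, the distributed parameters of every GPU are saved.

During warm start or fine-tuning, if the number of training GPUs differs from the number used before the warm start or during pre-training, the distributed parameters must be converted so that the number of parameter shards matches the number of GPUs used for training.
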
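Conceptually, the conversion described above merges the saved shards back into the full W and b and splits them again for the new GPU count. The sketch below illustrates this with plain Python lists, one entry per class. The helpers `split_evenly` and `reshard` are hypothetical, and the rule used for uneven splits is an assumption that may differ from what PLSC actually does.

```python
def split_evenly(items, n):
    """Split a list into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for rank in range(n):
        size = base + (1 if rank < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def reshard(shards, nranks):
    """Merge per-GPU shards into one list, then split it for nranks GPUs."""
    merged = [item for shard in shards for item in shard]
    return split_evenly(merged, nranks)


# One entry per class: a toy weight column (emb_dim = 2) and a bias value.
num_classes = 10
W = [[float(c), float(c) + 0.5] for c in range(num_classes)]
b = [float(c) for c in range(num_classes)]

# Shards as a 3-GPU pre-training run might have saved them.
W_old, b_old = split_evenly(W, 3), split_evenly(b, 3)

# Shards needed for a 4-GPU fine-tuning run.
W_new, b_new = reshard(W_old, 4), reshard(b_old, 4)
```

Merging the new shards in order reproduces the original W and b, so only the partitioning changes and the parameter values are untouched.
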
By default, the distributed parameters are converted automatically when the plsc.entry.Entry.train() method is used.

## Usage

The distributed parameter conversion tool can also be used on its own. To view its usage, run:

```shell
python -m plsc.utils.process_distfc_parameter --help
```

The tool supports the following command-line options:

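| Option | Description |
| :---------------------- | :------------------- |
| name_feature | Name feature used to identify distributed parameters. By default, distributed parameter names have the prefix dist@arcface@rank@rankid or dist@softmax@rank@rankid, where rankid is the id of the GPU. The default value of name_feature is @rank@. Users normally do not need to change it |
| pretrain_nranks | Number of GPUs used during pre-training |
| nranks | Number of GPUs to be used for the current training run |
| num_classes | Number of classes |
| emb_dim | Output dimension of the second-to-last fully connected layer, excluding the batch size |
| pretrained_model_dir | Directory where the pretrained model is saved |
| output_dir | Directory where the converted distributed parameters are saved |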
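
A pretrained model usually contains a meta.pickle file that records the number of GPUs used during pre-training, the number of classes, and the output dimension of the second-to-last fully connected layer. The pretrain_nranks, num_classes, and emb_dim options therefore usually do not need to be specified.

Convert the distributed parameters with the following command:

```shell
python -m plsc.utils.process_distfc_parameter --nranks=4 --pretrained_model_dir=./output --output_dir=./output_post
```

Note that the output directory contains only the converted distributed parameters and no other model parameters. You therefore usually need to replace the distributed parameters in the pretrained model with the converted ones.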
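
One way to do that replacement is sketched below. It assumes that each parameter is stored as a separate file named after the parameter in a flat directory, and that distributed parameters can be recognised by the name_feature substring (@rank@ by default). The function `replace_dist_params` is hypothetical and is not part of PLSC. It touches only files whose names contain name_feature, so anything else in the directory, such as meta.pickle, is left as it is.

```python
import os
import shutil


def replace_dist_params(pretrained_dir, converted_dir, name_feature="@rank@"):
    """Swap the distributed-parameter files in pretrained_dir for the converted ones."""
    # Remove the old shards, identified by name_feature in the file name.
    for name in os.listdir(pretrained_dir):
        if name_feature in name:
            os.remove(os.path.join(pretrained_dir, name))

    # Copy in the converted shards and report which files were copied.
    copied = []
    for name in sorted(os.listdir(converted_dir)):
        if name_feature in name:
            shutil.copy(os.path.join(converted_dir, name),
                        os.path.join(pretrained_dir, name))
            copied.append(name)
    return copied
```

With the directories from the command above, this would be called as `replace_dist_params("./output", "./output_post")`. Working on a copy of the pretrained model directory keeps the original recoverable.
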
Binary file not shown.
Binary file not shown.