add analysis tool of nan and inf op #49512
Closed
PR types
Others
PR changes
Others
Describe
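Environment variables
With the two environment variables below enabled, every OP is checked for precision problems and the results are printed to the screen (for usage details, see #47672).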
export FLAGS_check_nan_inf=1
export FLAGS_check_nan_inf_level= (see #47672 for the value)
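If the training job is launched from Python rather than from a shell, the same two flags can be passed through the child process environment. The helper below is only a sketch: `nan_inf_env` is a hypothetical name, and the level is left as a parameter because its valid values are documented in #47672, not here.

```python
import os


def nan_inf_env(level, base_env=None):
    """Return a copy of the environment with both check flags set.

    `level` is whatever value #47672 prescribes; it is not chosen here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FLAGS_check_nan_inf"] = "1"
    env["FLAGS_check_nan_inf_level"] = str(level)
    return env
```

The returned dictionary can be handed to `subprocess.run(..., env=nan_inf_env(level))` when starting the training script, which leaves the parent process environment untouched.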
First
Add the following call to your Python code:
paddle.fluid.core.set_nan_inf_debug_path("output_dir")
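The argument is the log output path; subsequent logs are written to that directory.
When output_dir is set, the logs produced by check_nan_inf are written to the output_dir directory. In a multi-card run the files are distinguished as _gpu.*, where * is the index of the current card, for example workerlog_cpu.0
The result looks like this: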
image
Note: writing the logs to files is slow, so enable it only when needed.
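Once the logs are on disk, it can help to see which files belong to which card before comparing anything. The sketch below assumes only that each log file name ends in `.<card index>`, as `workerlog_cpu.0` does. The helper name `group_logs_by_card` is hypothetical and is not part of this PR.

```python
from collections import defaultdict
from pathlib import Path


def group_logs_by_card(log_dir):
    """Map card index -> sorted file names whose name ends in '.<index>'."""
    groups = defaultdict(list)
    for path in sorted(Path(log_dir).iterdir()):
        suffix = path.suffix.lstrip(".")
        # Skip directories and files without a numeric trailing suffix.
        if path.is_file() and suffix.isdigit():
            groups[int(suffix)].append(path.name)
    return dict(groups)
```

Calling `group_logs_by_card("output_dir")` returns a plain dictionary, so a card whose log is missing shows up as an absent key.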
This PR mainly compares the results in fp32_dir and fp16_dir and saves the comparison to an Excel file.
Then
Run python tools/analysis_nan_inf.py --fp32_dir output_dir_fp32 --fp16_dir output_fp16_dir to generate the Excel comparison file.


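Script arguments:
--fp32_dir input path of the fp32 log files
--fp16_dir input path of the fp16 log files
--out_file_name name of the output Excel file
--loss_scale the loss scale set during AMP training
--num_workerlogs number of log files to compare
--skip_normal_tensors write only problematic operators to the Excel file (by default all operators are written)
--specified_op_list list of ops to write out (by default all ops are written)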
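For orientation, the arguments above could be declared with `argparse` roughly as follows. This is a sketch and not the code in tools/analysis_nan_inf.py. The flag names come from the list above, but the types, the defaults, which flags are required, and the comma-separated form of `--specified_op_list` are all assumptions.

```python
import argparse


def build_parser():
    """Declare the documented flags; types and defaults are assumed."""
    parser = argparse.ArgumentParser(
        description="Compare fp32 and fp16 check_nan_inf logs.")
    parser.add_argument("--fp32_dir", required=True,
                        help="input path of the fp32 log files")
    parser.add_argument("--fp16_dir", required=True,
                        help="input path of the fp16 log files")
    # Hypothetical default file name, chosen only for this sketch.
    parser.add_argument("--out_file_name", default="compare_result.xlsx",
                        help="name of the output Excel file")
    # Assumed to be a float; 1.0 means "no scaling" in this sketch.
    parser.add_argument("--loss_scale", type=float, default=1.0,
                        help="loss scale set during AMP training")
    parser.add_argument("--num_workerlogs", type=int, default=1,
                        help="number of log files to compare")
    parser.add_argument("--skip_normal_tensors", action="store_true",
                        help="write only problematic operators")
    # Assumed to be a comma-separated string such as "matmul,softmax".
    parser.add_argument("--specified_op_list", default=None,
                        help="ops to write out (default: all)")
    return parser
```

A call such as `build_parser().parse_args(["--fp32_dir", "output_dir_fp32", "--fp16_dir", "output_fp16_dir"])` mirrors the command line shown earlier.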
A problematic operator is one that meets any of the following conditions: