add analysis tool of nan and inf op #49512

Contributor

@AnnaTrainingG AnnaTrainingG commented Jan 3, 2023

PR types

Others

PR changes

Others

Describe

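Environment variables

When the two environment variables below are set, every OP is checked for numerical precision and the results are printed to the screen (for detailed usage, see #47672):
export FLAGS_check_nan_inf=1
export FLAGS_check_nan_inf_level= (see #47672 for the value)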
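The same flags can also be set from Python rather than from the shell. The sketch below is only an illustration: the helper name `enable_nan_inf_check` is hypothetical, and it assumes the flags need to be in the environment before paddle is imported. The valid values for the level flag are documented in #47672, so the helper leaves it unset unless the caller passes one.

```python
import os


def enable_nan_inf_check(level=None, env=os.environ):
    """Set the two flags from the PR description in an environment mapping.

    `level` is passed through unchanged; its valid values are described
    in #47672 and are deliberately not guessed here.
    """
    env["FLAGS_check_nan_inf"] = "1"
    if level is not None:
        env["FLAGS_check_nan_inf_level"] = str(level)
    return env
```

Calling `enable_nan_inf_check()` at the very top of the training script, before `import paddle`, mirrors the two `export` lines.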

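First

Add the following call to your Python code:
paddle.fluid.core.set_nan_inf_debug_path("output_dir")
The argument is the log output path; subsequent logs are written to that directory.
When output_dir is set, the logs produced by check_nan_inf are written into the output_dir directory. In multi-GPU runs the files are distinguished as _gpu.*, where * is the index of the current device; for example, workerlog_cpu.0.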
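Once the logs are on disk, it can help to see which files belong to which device. The following is a minimal sketch, not part of the PR: the function name `group_logs_by_device` is hypothetical, and it assumes only that each log file name ends in `.<index>`, as in the `workerlog_cpu.0` example above.

```python
import os
import re
from collections import defaultdict

# Assumed naming convention: "<anything>.<device index>", e.g. "workerlog_cpu.0".
_LOG_NAME = re.compile(r"^(?P<prefix>.+)\.(?P<index>\d+)$")


def group_logs_by_device(output_dir):
    """Return {device_index: [log file paths]} for files in output_dir."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(output_dir)):
        match = _LOG_NAME.match(name)
        if match:  # skip files that do not end in a numeric index
            index = int(match.group("index"))
            groups[index].append(os.path.join(output_dir, name))
    return dict(groups)
```

For example, `group_logs_by_device("output_dir")` gives one entry per device index found in the directory.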

The result looks like this:
image

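Note: writing the logs to files takes considerably longer, so enable it only when needed.

This PR mainly compares the results in fp32_dir and fp16_dir and saves the comparison to an Excel file.

Then

Run python tools/analysis_nan_inf.py --fp32_dir output_dir_fp32 --fp16_dir output_fp16_dir to generate the comparison Excel file.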
image
image

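Script arguments:
--fp32_dir input directory containing the fp32 log files
--fp16_dir input directory containing the fp16 log files
--out_file_name name of the output Excel file
--loss_scale the loss scale set during AMP training
--num_workerlogs number of log files to compare
--skip_normal_tensors write only problematic operators to the Excel file (default: write all)
--specified_op_list list of ops to include in the output (default: all)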
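The argument list above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions, not the script's actual code: the flag names come from the description, while the types, the defaults, and the comma-separated format for `--specified_op_list` are guesses made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (types/defaults assumed)."""
    parser = argparse.ArgumentParser(
        description="Compare fp32 and fp16 check_nan_inf logs (illustrative)."
    )
    parser.add_argument("--fp32_dir", required=True, help="fp32 log directory")
    parser.add_argument("--fp16_dir", required=True, help="fp16 log directory")
    # Assumed default file name; the PR does not state one.
    parser.add_argument("--out_file_name", default="compare_result.xlsx")
    # Assumed to be a float with a neutral default of 1.
    parser.add_argument("--loss_scale", type=float, default=1.0)
    parser.add_argument("--num_workerlogs", type=int, default=1)
    # Documented default is to output every operator, so this is off by default.
    parser.add_argument("--skip_normal_tensors", action="store_true")
    # Assumed format: comma-separated op names; None means all ops.
    parser.add_argument("--specified_op_list", default=None)
    return parser
```

With this layout, `build_parser().parse_args(["--fp32_dir", "output_dir_fp32", "--fp16_dir", "output_fp16_dir"])` corresponds to the command shown earlier.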

A problematic operator is one that meets any of the following conditions:

# 1. The number of OP outputs exceeds the representable range of int32
# 2. The output data exceeds the representable range of fp16
# 3. NaN or Inf appears in the fp16 output data
# 4. The maximum value of fp32 is not equal to the maximum value of fp16
# 5. The minimum value of fp32 is not equal to the minimum value of fp16
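The five conditions can be read as a simple predicate over per-tensor statistics. The sketch below is one possible interpretation, not the script's implementation: `TensorStats` and `is_problematic` are hypothetical names, condition 1 is assumed to refer to the element count, condition 2 is checked against the fp32 values, and any adjustment for `--loss_scale` is left out.

```python
from dataclasses import dataclass

FP16_MAX = 65504.0        # largest finite fp16 value
INT32_MAX = 2**31 - 1     # largest int32 value


@dataclass
class TensorStats:
    """Hypothetical summary of one operator output parsed from a log."""
    numel: int
    max_value: float
    min_value: float
    num_nan: int = 0
    num_inf: int = 0


def is_problematic(fp32: TensorStats, fp16: TensorStats) -> bool:
    """Return True if any of the five listed conditions holds."""
    # 1. Element count does not fit in int32 (assumed meaning of "number of outputs").
    exceeds_int32 = fp32.numel > INT32_MAX
    # 2. The fp32 values fall outside what fp16 can represent.
    exceeds_fp16 = max(abs(fp32.max_value), abs(fp32.min_value)) > FP16_MAX
    # 3. NaN or Inf in the fp16 output.
    has_nan_inf = fp16.num_nan > 0 or fp16.num_inf > 0
    # 4 and 5. Exact comparison, following the wording "not equal".
    max_differs = fp32.max_value != fp16.max_value
    min_differs = fp32.min_value != fp16.min_value
    return exceeds_int32 or exceeds_fp16 or has_nan_inf or max_differs or min_differs
```

Exact equality in conditions 4 and 5 follows the wording of the list; a real comparison may want a tolerance, since fp16 rounding alone can change the extreme values.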

@paddle-bot

paddle-bot bot commented Jan 3, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@zhangting2020
Contributor

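I don't think this should go under the Paddle/tools directory; that directory holds CI-related and development-related tools. Please move it under an AMP-related directory: create one AMP tools directory and put all the related files there.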

@AnnaTrainingG
Contributor Author

OK.

@AnnaTrainingG
Contributor Author

Closing this PR; the work has moved to #51957.
