Unify all external API error message mechanism and enhance third-party API error msg #33003

Merged
merged 3 commits into from
May 27, 2021

@zhwesky2010 zhwesky2010 commented May 19, 2021

PR types

New features

PR changes

Others

Describe

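1. Unify the error-message mechanism for all external APIs, managed centrally through external_error.proto;

2. Enhance the error messages for five Nvidia API types (CURAND, CUDNN, CUBLAS, CUSOLVER, NCCL): a detailed Hint is now printed, whereas competing frameworks report only the error code;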


1). CURAND API:

----------------------
Error Message Summary:
----------------------
OSError: (External)  CURAND error(204). 
  [Hint: 'CURAND_STATUS_ARCH_MISMATCH'.  Architecture mismatch, GPU does not support requested feature. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)

2). CUDNN API:

----------------------
Error Message Summary:
----------------------
OSError: (External)  CUDNN error(3), CUDNN_STATUS_BAD_PARAM. 
  [Hint: 'CUDNN_STATUS_BAD_PARAM'. An incorrect value or parameter was passed to the function. To correct, ensure that all the parameters being passed have valid values. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)

3). CUBLAS API:

----------------------
Error Message Summary:
----------------------
OSError: (External)  CUBLAS error(13). 
  [Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons.  To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)

4). CUSOLVER API:

----------------------
Error Message Summary:
----------------------
OSError: (External)  CUSOLVER error(7). 
  [Hint: 'CUSOLVER_STATUS_INTERNAL_ERROR'. An internal cuSolver operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)

5). NCCL API:

----------------------
Error Message Summary:
----------------------
OSError: (External)  NCCL error(3), internal error. 
  [Hint: 'ncclInternalError'. An internal check failed. This is either a bug in NCCL or due to memory corruption. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)
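
The message layout shown in the samples above (API name, numeric code, then a Hint with the symbolic name and advice) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ApiKind`, `Hint`, `HintTable` and `FormatExternalError` are hypothetical names, and the table is a small in-memory stand-in for the data that the PR keeps in external_error.proto. The two entries are copied from the sample output above.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the five API kinds listed in this PR.
enum class ApiKind { CURAND, CUDNN, CUBLAS, CUSOLVER, NCCL };

// Symbolic error name plus the advice text shown inside "[Hint: ...]".
struct Hint {
  std::string name;
  std::string advice;
};

inline std::string ApiName(ApiKind kind) {
  switch (kind) {
    case ApiKind::CURAND:   return "CURAND";
    case ApiKind::CUDNN:    return "CUDNN";
    case ApiKind::CUBLAS:   return "CUBLAS";
    case ApiKind::CUSOLVER: return "CUSOLVER";
    case ApiKind::NCCL:     return "NCCL";
  }
  return "UNKNOWN";
}

// Assumption: a tiny in-memory table keyed by (API kind, error code).
// Only two entries, taken from the sample messages, are filled in.
inline const std::map<std::pair<ApiKind, int>, Hint>& HintTable() {
  static const std::map<std::pair<ApiKind, int>, Hint> table = {
      {{ApiKind::CURAND, 204},
       {"CURAND_STATUS_ARCH_MISMATCH",
        "Architecture mismatch, GPU does not support requested feature."}},
      {{ApiKind::NCCL, 3},
       {"ncclInternalError",
        "An internal check failed. This is either a bug in NCCL or due to "
        "memory corruption."}},
  };
  return table;
}

// Builds "<API> error(<code>)." and appends the hint when the code is known.
inline std::string FormatExternalError(ApiKind kind, int code) {
  std::ostringstream out;
  out << ApiName(kind) << " error(" << code << ").";
  const auto it = HintTable().find(std::make_pair(kind, code));
  if (it != HintTable().end()) {
    out << " [Hint: '" << it->second.name << "'. " << it->second.advice
        << " ]";
  }
  return out.str();
}
```

Calling `FormatExternalError(ApiKind::CURAND, 204)` yields the CURAND line with its hint, while a code that is missing from the table yields only the `error(<code>).` prefix.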

6). CUDA API:

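116 error types and messages in total; these are left unchanged, since they have been supported since 1.8.


Default message:

If the error code for any of the APIs above cannot be found in the data gathered by the crawler, the following is printed by default: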

----------------------
Error Message Summary:
----------------------
OSError: (External)  CUDNN error(100). 
  [Hint: Please search for the error code(100) on website(https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error. ] (at /workspace/Paddle5/paddle/fluid/platform/gpu_info.cc:99)


Note: CUDNN is only an example; the actual URL returned depends on the type of the External API.
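
A rough sketch of the fallback path, assuming one documentation URL per API type. The names `ExternalApi`, `DocUrl`, `ApiLabel` and `DefaultHint` are hypothetical. Only the cuDNN URL comes from the sample above; the other API types return a placeholder string here because their URLs are not given in this description.

```cpp
#include <sstream>
#include <string>

// Hypothetical enum covering the six API types mentioned in this PR.
enum class ExternalApi { CUDA, CURAND, CUDNN, CUBLAS, CUSOLVER, NCCL };

inline std::string ApiLabel(ExternalApi api) {
  switch (api) {
    case ExternalApi::CUDA:     return "CUDA";
    case ExternalApi::CURAND:   return "CURAND";
    case ExternalApi::CUDNN:    return "CUDNN";
    case ExternalApi::CUBLAS:   return "CUBLAS";
    case ExternalApi::CUSOLVER: return "CUSOLVER";
    case ExternalApi::NCCL:     return "NCCL";
  }
  return "UNKNOWN";
}

// One documentation URL per API type. Only the cuDNN URL is taken from the
// PR description; the placeholder marks URLs that are not shown there.
inline std::string DocUrl(ExternalApi api) {
  switch (api) {
    case ExternalApi::CUDNN:
      return "https://docs.nvidia.com/deeplearning/cudnn/api/"
             "index.html#cudnnStatus_t";
    default:
      return "<documentation URL for this API>";
  }
}

// Default hint used when the error code is not found in the stored data.
inline std::string DefaultHint(ExternalApi api, int code) {
  std::ostringstream out;
  out << "[Hint: Please search for the error code(" << code
      << ") on website(" << DocUrl(api)
      << ") to get Nvidia's official solution and advice about "
      << ApiLabel(api) << " Error. ]";
  return out.str();
}
```

With these assumptions, `DefaultHint(ExternalApi::CUDNN, 100)` reproduces the hint text of the sample default message.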

@paddle-bot-old
Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

@@ -0,0 +1,363 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.

2020->2021?

// Version of cuda API
required int32 version = 1;
// Indicates which kind of third-party API
required ApiType type = 1;

Why is the version no longer needed?

@zhwesky2010 zhwesky2010 May 26, 2021


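There are now six API types, each with its own URL; splitting further by minor version would produce too many entries. Because the APIs of lower versions are all included in higher versions, the CUDA 11.2 set currently used is fairly complete.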

namespace details {

template <typename T>
struct CudaStatusType {};

Isn't the name CudaStatusType a bit imprecise? How about NvidiaLib or something else?


image
Is this part a mistake?

Yes, it was a mistake; it has been corrected.


I used ExternalApiType; or should it be NvidiaApiType, or something else?
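
For readers unfamiliar with the pattern under discussion, a trait template of this kind maps each library's status type to its API kind and success value through specializations. The sketch below only illustrates that idea under stated assumptions: `FakeCudnnStatus` and `FakeCublasStatus` are stand-ins for the real CUDA status enums (so no CUDA headers are needed), and the member names `kSuccess` and `kKind`, the `ExternalApiKind` enum and the `sketch` namespace are hypothetical, not taken from the PR.

```cpp
// Hypothetical API-kind enum; the PR defines its own ApiType in the proto.
enum class ExternalApiKind { CUDNN, CUBLAS };

// Stand-ins for cudnnStatus_t / cublasStatus_t (assumed values for the demo).
enum FakeCudnnStatus { kFakeCudnnSuccess = 0, kFakeCudnnBadParam = 3 };
enum FakeCublasStatus { kFakeCublasSuccess = 0, kFakeCublasExecFailed = 13 };

namespace sketch {

// Primary template is empty, so using an unsupported status type
// fails at compile time instead of silently picking a default.
template <typename T>
struct ExternalApiType {};

template <>
struct ExternalApiType<FakeCudnnStatus> {
  static constexpr FakeCudnnStatus kSuccess = kFakeCudnnSuccess;
  static constexpr ExternalApiKind kKind = ExternalApiKind::CUDNN;
};

template <>
struct ExternalApiType<FakeCublasStatus> {
  static constexpr FakeCublasStatus kSuccess = kFakeCublasSuccess;
  static constexpr ExternalApiKind kKind = ExternalApiKind::CUBLAS;
};

// Generic helpers that work for any status type with a specialization.
template <typename T>
constexpr bool IsSuccess(T status) {
  return status == ExternalApiType<T>::kSuccess;
}

template <typename T>
constexpr ExternalApiKind KindOf(T) {
  return ExternalApiType<T>::kKind;
}

}  // namespace sketch
```

The design point is that one generic error-checking routine can serve every library: the status type alone selects the right success value and API kind, which is why a name tied to CUDA only would be misleading.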

@chenwhql

image
Is this part a mistake?

@chenwhql chenwhql left a comment

LGTM
