This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

update error spec (#4804)
Binyang2014 authored Aug 12, 2020
1 parent 4afd1d4 commit e0aba7f
Showing 1 changed file with 29 additions and 23 deletions.
src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml
@@ -614,6 +614,30 @@ spec:
       nameRegex: '(?ms).*'
       messageRegex: '(?mi).*(nvidia-container-cli: device error: unknown device id).*'
 
+  ###########################################################################
+  # Range: [1, 128]
+  # Owner: PAI_RUNTIME
+  # Description: User Container issued failures:
+  #   -> Involuntary failures caused by hardware
+  ###########################################################################
+  - code: 128
+    phrase: ContainerMayFailDueToGpuDeviceEccError
+    issuer: USER_CONTAINER
+    causer: PAI_HW
+    type: PLATFORM_FAILURE
+    stage: RUNNING
+    behavior: UNKNOWN
+    reaction: RETRY_TO_MAX
+    reason: "Container may have failed due to a GPU ECC error"
+    repro:
+      - "Run a program on a GPU that has an ECC error"
+    solution:
+      - "Wait for the result of the next retry"
+      - "Contact Cluster Admin"
+    pattern:
+      runtimeContainerPatterns:
+      - gpuInfo:
+          nvidiaDoubleEccError: true
 
   ###########################################################################
   # Range: [129, 192]
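
A note on the new code-128 entry: unlike most entries, which match an exit code or a log regex, it matches on the structured gpuInfo field. Below is a minimal sketch of how a runtime might derive that flag, assuming nvidia-smi's uncorrected-ECC counter is available; the has_double_ecc_error helper is hypothetical, not code from this repository.

```python
# Hypothetical sketch only: derive a gpuInfo.nvidiaDoubleEccError-style
# flag from nvidia-smi's uncorrected (double-bit) ECC counter.
# Assumes nvidia-smi is on PATH; verify the query field on your driver.
import subprocess

def has_double_ecc_error() -> bool:
    """Return True if any GPU reports uncorrected volatile ECC errors."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        token = line.strip()
        # GPUs without ECC enabled report "[N/A]"; treat that as zero.
        if token.isdigit() and int(token) > 0:
            return True
    return False
```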
@@ -747,16 +771,17 @@ spec:
   - code: 139
     phrase: ContainerSigSegvReceived
     issuer: USER_CONTAINER
-    causer: USER_CONTAINER
-    type: USER_FAILURE
+    causer: UNKNOWN
+    type: UNKNOWN_FAILURE
     stage: RUNNING
-    behavior: PERMANENT
-    reaction: NEVER_RETRY
+    behavior: UNKNOWN
+    reaction: RETRY_TO_MAX
     reason: "Container killed by OS Signal: SIGSEGV"
     repro:
       - "User program accesses an illegal memory address"
     solution:
       - "Check container log and fix your program bug"
+      - "Contact Cluster Admin"
     pattern:
       runtimeContainerPatterns:
       - exitCode: 139
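
For context on the exitCode: 139 pattern: container runtimes follow the shell convention of reporting death-by-signal as 128 plus the signal number, and SIGSEGV is signal 11, hence 128 + 11 = 139. A small illustrative decoder of that convention (the killed_by helper is made up for this note, not part of the spec):

```python
# Illustrative decoder for 128+N container exit codes; not from this repo.
import signal
from typing import Optional

def killed_by(exit_code: int) -> Optional[str]:
    """Map a container exit code to the killing signal's name, if any.

    Shells and container runtimes commonly report death-by-signal
    as 128 + signal number, so 139 -> SIGSEGV (signal 11).
    """
    if exit_code > 128:
        try:
            return signal.Signals(exit_code - 128).name
        except ValueError:
            return None
    return None

assert killed_by(139) == "SIGSEGV"
```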
@@ -937,25 +962,6 @@ spec:
       runtimeContainerPatterns:
       - userLogRegex: "(?i)cuda runtime error (2) : out of memory"
 
-  - code: 227
-    phrase: ContainerMayFailDueToGpuDeviceEccError
-    issuer: PAI_RUNTIME
-    causer: PAI_HW
-    type: PLATFORM_FAILURE
-    stage: RUNNING
-    behavior: TRANSIENT
-    reaction: RETRY_TO_MAX
-    reason: "Container failed may due to GPU ecc error"
-    repro:
-      - "Run program in GPU with ECC error"
-    solution:
-      - "Wait result from next retry"
-      - "Contact Cluster Admin"
-    pattern:
-      runtimeContainerPatterns:
-      - gpuInfo:
-          nvidiaDoubleEccError: true
-
   - code: 228
     phrase: ContainerIncorrectParameterError
     issuer: PAI_RUNTIME
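Reading the file as a whole, each spec entry pairs an exit code with runtimeContainerPatterns such as exitCode, userLogRegex, or gpuInfo. A minimal matcher sketch follows, assuming the YAML has already been loaded into dicts and that the first matching entry wins; the pattern_matches and match_exit_code helpers are illustrative, not OpenPAI's actual runtime code:

```python
# Hypothetical matcher: pick the first spec entry whose
# runtimeContainerPatterns match the observed container state.
# Field names mirror the YAML above; everything else is assumed.
import re
from typing import Optional

def pattern_matches(p: dict, container: dict) -> bool:
    """True if a single runtimeContainerPattern matches the container."""
    if "exitCode" in p and container.get("exitCode") != p["exitCode"]:
        return False
    if "userLogRegex" in p and not re.search(
            p["userLogRegex"], container.get("userLog", "")):
        return False
    if "gpuInfo" in p:
        gpu = container.get("gpuInfo", {})
        if any(gpu.get(k) != v for k, v in p["gpuInfo"].items()):
            return False
    return True

def match_exit_code(spec: list, container: dict) -> Optional[int]:
    """Return the code of the first spec entry whose patterns match."""
    for entry in spec:
        patterns = entry.get("pattern", {}).get("runtimeContainerPatterns", [])
        if any(pattern_matches(p, container) for p in patterns):
            return entry["code"]
    return None

# Example: a SIGSEGV exit resolves to code 139 under the entry above.
spec = [{"code": 139,
         "pattern": {"runtimeContainerPatterns": [{"exitCode": 139}]}}]
print(match_exit_code(spec, {"exitCode": 139}))  # -> 139
```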