Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。 #3469

Closed
jichangyun opened this issue May 15, 2024 · 17 comments · Fixed by #3711
Closed

vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。 #3469

jichangyun opened this issue May 15, 2024 · 17 comments · Fixed by #3711
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@jichangyun
Copy link

代码位置:

func affinityCheck(job *api.JobInfo, affinity [][]string) error {
if job == nil || affinity == nil {
return fmt.Errorf("empty input, job: %v, affinity: %v", job, affinity)
}
var taskNumber = len(job.Tasks)
var taskRef = make(map[string]bool, taskNumber)
for _, task := range job.Tasks {
tmpStrings := strings.Split(task.Name, "-")
if _, exist := taskRef[tmpStrings[len(tmpStrings)-2]]; !exist {
taskRef[tmpStrings[len(tmpStrings)-2]] = true
}
}
for _, aff := range affinity {
affTasks := make(map[string]bool, len(aff))
for _, task := range aff {
if len(task) == 0 {
continue
}
if _, exist := taskRef[task]; !exist {
return fmt.Errorf("task %s do not exist in job <%s/%s>", task, job.Namespace, job.Name)
}
if _, exist := affTasks[task]; exist {
return fmt.Errorf("task %s is duplicated in job <%s/%s>", task, job.Namespace, job.Name)
}
affTasks[task] = true
}
}
return nil
}

问题点:

tmpStrings := strings.Split(task.Name, "-")

task 名称包含 - 时, tmpStrings := strings.Split(task.Name, "-") 切割的结果会非期望, 放进 taskRef 的 key 就不对了,下面 check 的时候就会报错 “task %s do not exist in job <%s/%s>”

@jichangyun jichangyun changed the title vsjob的task名称包含 - 时, task-topolopy 插件会误判而失效。 vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。 May 15, 2024
@Monokaix
Copy link
Member

Monokaix commented May 15, 2024

It has been fixed by #2940 :)

@jichangyun
Copy link
Author

jichangyun commented May 16, 2024

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。

打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 Job 的 uid
看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

@jichangyun jichangyun reopened this May 16, 2024
@Monokaix
Copy link
Member

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。

打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

Is it a vcjob or deplopyemnt?

@jichangyun
Copy link
Author

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

Is it a vcjob or deplopyemnt?

vcjob

@jichangyun
Copy link
Author

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

@Monokaix
Copy link
Member

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

can you paste the whole job name and
pod name and job yaml?

@jichangyun
Copy link
Author

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

can you paste the whole job name and pod name and job yaml?

# kubectl get pods
NAME                            READY   STATUS                 RESTARTS         AGE
test-binpack-xytsyh2-task-1-0   0/1     CreateContainerError   0                9m1s
test-binpack-xytsyh2-task-2-0   0/1     CreateContainerError   0                9m1s

# kubectl get podgroup
NAME                                                        STATUS    MINMEMBER   RUNNINGS   AGE
test-binpack-xytsyh2-d7e821cc-2e6c-4b5a-9818-f3efca39e5da   Running   2                      59m

# 
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  creationTimestamp: "2024-05-16T03:21:00Z"
  generation: 1
  name: test-binpack-xytsyh2
  namespace: default
  resourceVersion: "1025579"
  uid: d7e821cc-2e6c-4b5a-9818-f3efca39e5da
spec:
  maxRetry: 3
  minAvailable: 2
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: task-1
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - image: alpine
          imagePullPolicy: IfNotPresent
          name: test
        restartPolicy: OnFailure
  - maxRetry: 3
    minAvailable: 1
    name: task-2
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - image: alpine
          imagePullPolicy: IfNotPresent
          name: test
        restartPolicy: OnFailure
status:
  conditions:
  - lastTransitionTime: "2024-05-16T03:21:02Z"
    status: Pending
  minAvailable: 2
  pending: 2
  state:
    lastTransitionTime: "2024-05-16T03:21:02Z"
    phase: Pending
  taskStatusCount:
    task-1:
      phase:
        Pending: 1
    task-2:
      phase:
        Pending: 1

@Monokaix
Copy link
Member

/good-first-issue

@volcano-sh-bot
Copy link
Contributor

@Monokaix:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot volcano-sh-bot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels May 16, 2024
@Monokaix
Copy link
Member

@loheagn can you take a look?

@Monokaix
Copy link
Member

@jichangyun Have you set affinity rules in job's annotation?

@jichangyun
Copy link
Author

@jichangyun Have you set affinity rules in job's annotation?

恩,是的,就是 task-1 和 task-2 反亲和。

大概看了下,问题点应该在这里, SetPodGroup 时,会置 ji.Name = pg.Name ,而 pg.Name 是 job.Name + "-" + job.UID .

func (ji *JobInfo) SetPodGroup(pg *PodGroup) {
ji.Name = pg.Name
ji.Namespace = pg.Namespace
ji.MinAvailable = pg.Spec.MinMember
ji.Queue = QueueID(pg.Spec.Queue)
ji.CreationTimestamp = pg.GetCreationTimestamp()

私有项目就先把 topology.go 文件中的 var jobNamePrefix = job.Name + "-" 改成 var jobNamePrefix = job.Name[:len(job.Name)-36] 了,自测没问题。
但是,个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适,但是目前研究还不够,不敢直接改 job_info.go ,所以先不提 PR 过来了。

@lowang-bh
Copy link
Member

个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适

ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。

@loheagn
Copy link

loheagn commented May 17, 2024

So the current issue is that the full task name in topology.go is not necessarily formatted as ${job name}-${task name}-${index}, which might lead to the patch in #2940 not being able to retrieve the correct task name?

@jichangyun
Copy link
Author

个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适

ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。

恩,如果这两个名称是一样的,那应该没问题。
我在 1.7.0 版本上自测,似乎都是 pg.name = job.Name + "-" + job.UID

翻了下 master 分支的代码,好像是在这里。

func (cc *jobcontroller) createOrUpdatePodGroup(job *batch.Job) error {
// If PodGroup does not exist, create one for Job.
pgName := job.Name + "-" + string(job.UID)
var pg *scheduling.PodGroup
var err error
pg, err = cc.pgLister.PodGroups(job.Namespace).Get(pgName)
if err != nil {
if !apierrors.IsNotFound(err) {
klog.Errorf("Failed to get PodGroup for Job <%s/%s>: %v",
job.Namespace, job.Name, err)
return err
}
// try to get old pg if new pg not exist

@JesseStutler
Copy link
Contributor

It's a bug definitely. As @jichangyun said, the name of podgroup created by vc-controller isjob.Name - job.UID, but the job name in task.Name has no uid. It means that the annotation configured does not work now, but in old code #2940 , if the task does not contain "-", we will get the right task name no matter what the jobNamePrefix is.

@JesseStutler
Copy link
Contributor

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants