This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Tasks running the wrong command when built and deployed with v0.3.2 #1579

Closed
ragesoss opened this issue Jun 2, 2021 · 9 comments
Labels
bug Something isn't working plugin/pack question Further information is requested
Milestone

Comments

@ragesoss

ragesoss commented Jun 2, 2021

Describe the bug
For the last month or so, I've been using Waypoint in conjunction with Docker, Nomad, and Consul to run a Rails app, Wiki Education Dashboard. I recently upgraded the waypoint client on my development machine from 0.3.1 to 0.3.2. Yesterday, the first time I built new Docker images of the app and deployed them (i.e., waypoint up) with version 0.3.2, the behavior of the system changed: the "sidekiq" tasks stopped working, although they were allocated and appeared to be running. I found that each of the "sidekiq" tasks appeared (according to stdout on the Nomad UI) to be running the wrong command, puma, which corresponds to the web server task rather than the sidekiq task.

I downgraded my Waypoint client to 0.3.1 and then ran waypoint deploy (to redeploy the images previously built and deployed with 0.3.2), but this did not fix the problem. Running waypoint up with 0.3.1 to both rebuild the Docker images and deploy them did successfully fix the problem, getting the sidekiq allocations running sidekiq instead of puma.

See:

Expected behavior
Waypoint tasks should run the command specified for their own config, not the command for another task's config.

Waypoint Platform Versions

  • Waypoint CLI Version: 0.3.2
  • Waypoint Server Platform and Version: 0.3.1, nomad
  • Waypoint Plugin: pack

Additional context
Here is the infrastructure repo that we're running this on: https://github.com/WikiEducationFoundation/wikieduinfra

cc @nateberkopec, who developed our waypoint configuration.

@ragesoss ragesoss added the new label Jun 2, 2021
@briancain
Member

Hey there @ragesoss ! Thanks for opening an issue with Waypoint.

I noticed in your waypoint.hcl that your deploy step is just running nomad directly:

https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/master/waypoint.hcl#L35-L41

  deploy {
    use "exec" {
      command = ["nomad","run","<TPL>"]


      template {
        path = "${path.app}/job.hcl.tpl"
      }
    }
  }

I don't think there are any exec plugin changes between 0.3.1 and 0.3.2. Is it possible the job.hcl.tpl file changed between upgrades in your waypoint client binary? If you were using the nomad plugin to deploy, I could see this potentially being a Waypoint bug, but since the deploy step is using exec and Waypoint is using nomad directly, I wonder if the behavior you are seeing could be reproduced by doing a waypoint build and just doing a nomad run job.hcl.tpl.

@briancain briancain added plugin/exec question Further information is requested and removed new labels Jun 2, 2021
@ragesoss
Author

ragesoss commented Jun 2, 2021

@briancain thanks! I'll look into these suggestions. job.hcl.tpl has not changed recently; (after trying other things before the downgrade) ultimately the only change I made to restore the previous behavior was to sudo apt-get install waypoint=0.3.1 and then run waypoint up. Since waypoint deploy didn't result in different behavior after I reverted to 0.3.1, trying to replicate the problem via waypoint build and then nomad run job.hcl.tpl makes sense. I guess I can first try to waypoint deploy with 0.3.2 without rebuilding the image, to see if that breaks it, and then do the steps you suggest.

@ragesoss
Author

ragesoss commented Jun 3, 2021

The problem does appear to be specific to the waypoint build step:

  1. With an image from waypoint build on 0.3.1, I can run waypoint deploy with 0.3.2 and it works correctly.
  2. With an image from waypoint build on 0.3.2, I can run waypoint deploy with 0.3.1 and replicate the problem.

I attempted to run nomad run job.hcl.tpl but it looks like it relies on going through Waypoint to turn the template into something Nomad can understand:

2.7.1  ~/h/WikiEduDashboard   $-  nomad run job.hcl.tpl                          261ms  Thu 03 Jun 2021 02:06:55 PM PDT
Error getting job struct: Error parsing job file from job.hcl.tpl:
job.hcl.tpl:78,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:78,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:79,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:135,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:135,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:136,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:192,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:192,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:193,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:256,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:256,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:257,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:314,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:314,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:315,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:372,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:372,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:373,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:430,17-18: Invalid character; This character is not used within the language.
job.hcl.tpl:430,23-24: Invalid character; This character is not used within the language.
job.hcl.tpl:431,11-12: Invalid character; This character is not used within the language.
job.hcl.tpl:78,9-10: Argument or block definition required; An argument or block definition is required here.

@briancain
Member

@ragesoss - thanks for the extra info, could you provide us with two gists: one of waypoint build -vvv for version 0.3.1 and one for version 0.3.2? I'm also assuming waypoint.hcl is the same on both attempts? We'll take a look and see what could be going on. Thank you! 🙏🏻

> I attempted to run nomad run job.hcl.tpl but it looks like it relies on going through Waypoint to turn the template into something Nomad can understand:

Oops, I should have caught that. That's correct, Waypoint injects some env variables into the job file to work with Nomad and the .tpl means template. But I think that's ok, it looks like you were able to isolate it to an issue with waypoint build!

@briancain briancain added bug Something isn't working plugin/pack and removed plugin/exec labels Jun 3, 2021
@ragesoss
Author

ragesoss commented Jun 3, 2021

@briancain: yes, waypoint.hcl and everything else except for the waypoint client version was the same between the two cases I described today.

I've just run them with -vvv and captured the gists:

@briancain briancain added this to the 0.4.x milestone Jun 4, 2021
@briancain
Member

Hey there @ragesoss! Thank you for providing those gists. I've done a bit of digging into why this might be happening with your project. We are pretty sure this change between the versions you mentioned was initially caused by upgrading the version of pack that Waypoint uses, #1457, which was needed to fix other bugs related to pack.

One thing I noticed between the two Docker images built on version 0.3.1 and the latest version of Waypoint, 0.4.0, is that the previous pack version may have used a launcher to start all of your tasks, whereas the newer one attempts to launch your web process directly, which is potentially the issue here. I generated a gist that shows the difference between the two container images, which you can see below. The changes in RED are from Waypoint 0.3.1 building your project's container, whereas the GREEN changes are from Waypoint 0.4.0:

https://gist.github.com/briancain/4503bd73b1f16b601930d03b6834ae4c/revisions#diff-467802a887a7b72e265f260a42ceec4c6d281bdc384be95625353afed956cc67

We're still looking into the cause, but hopefully that gives you something to go off of to get you unblocked. Thanks for all of the information you provided here! 🙏🏻

@ragesoss
Author

ragesoss commented Jun 9, 2021

Thanks! The pack upgrade from 0.15.1 to 0.18.1 makes sense... I'm guessing, based on a pretty cursory exploration, that it's because the pack bump included a bump to lifecycle, which included a change that sets the default entrypoint to web if it's using a buildpack API lower than 0.6... and ours is 0.2. So (fingers crossed) I can try bumping the buildpack API for our copy of https://github.com/fagiani/apt-buildpack and it might fix it.

@ragesoss
Author

ragesoss commented Jun 9, 2021

(That didn't work, as it looks like the buildpack isn't compatible with buildpack API 0.6, so I guess I'll have to learn more about pack and buildpack and lifecycle to sort this out.)

ragesoss added a commit to WikiEducationFoundation/WikiEduDashboard that referenced this issue Jun 15, 2021
This makes the jobspec compatible with Waypoint 0.3.2 and up. Waypoint 0.3.2 included a version bump of `pack`, which in turn bumped `lifecycle`, which introduced setting the `web` process as the default process type for the image.

Adding a default process type meant that Nomad would use the entrypoint for the default process type unless another entrypoint is specified. So instead of simply running the command for the sidekiq task, it would run the default web entrypoint and use the `command` as arguments.

We intend for the `command` and `args` to completely specify how to launch the container for each task, so we specify the base buildpack `launcher` command as the entrypoint for each job.

Relevant context: hashicorp/waypoint#1579
@ragesoss
Author

Okay! I think I've got an understanding of what was going wrong (although I'm not 100% confident that I've got the right picture). The fix was to add entrypoint = ["launcher"] to the config for each of the Nomad tasks.

For images built by the older pack, it seems that starting a task without specifying an entrypoint would result in running the "command" without an entrypoint (as described here). (Per @briancain above, I guess the actual behavior was to default to using launcher, but the result is similar.) With the newer pack, the web entrypoint (generated by the Ruby buildpack, I assume) became the default entrypoint for the image, so the "command" for each task would be handled as arguments to the default entrypoint rather than as its own command without an entrypoint. Explicitly switching to the built-in basic launcher entrypoint that pack provides restored the intended behavior of treating the configured "command" as the launch command for each task.
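
For anyone hitting the same thing, here is a minimal sketch of what that change might look like in a single Nomad task using the Docker driver. The task name, image reference, and the `command`/`args` values are placeholders chosen for illustration, not copied from our actual job.hcl.tpl; only the `entrypoint = ["launcher"]` line is the fix described above.

```hcl
# Hypothetical fragment of a Nomad jobspec. The task name, image, and
# command/args are placeholders, not the contents of the real job.hcl.tpl.
task "sidekiq" {
  driver = "docker"

  config {
    image = "example/dashboard:latest" # placeholder image reference

    # Override the image's default entrypoint (the `web` process on images
    # built with the newer pack) with the generic buildpack launcher.
    entrypoint = ["launcher"]

    # With `launcher` as the entrypoint, command and args describe the
    # process to start instead of being appended to the web entrypoint.
    command = "bundle"
    args    = ["exec", "sidekiq"]
  }
}
```

The same `entrypoint` line would presumably need to be repeated in every task that relies on `command` to choose its process, since each task's `config` block is independent.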

Relevant doc: https://www.nomadproject.io/docs/drivers/docker#entrypoint

So I guess this probably isn't a bug in waypoint, just a behavior change in pack that required a corresponding change to the Nomad jobspec.

Thanks much for the help, @briancain!
