Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear Architecture field in platform constraint for arm architectures #34021

Merged

Conversation

nishanttotla
Copy link
Contributor

This is a potential fix for moby/swarmkit#2294

This is a temporary requirement.

Also related: opencontainers/image-spec#661

@aaronlehmann
Copy link
Contributor

I wonder if we could append all known ARM variants to platforms in this case.

@nishanttotla
Copy link
Contributor Author

Ping @stevvooe

@nishanttotla
Copy link
Contributor Author

@aaronlehmann based on the discussion in the SwarmKit issue, I don't think we should be doing that. This will eventually be fixed for images.

@aaronlehmann
Copy link
Contributor

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

@nishanttotla
Copy link
Contributor Author

@aaronlehmann yes, that will happen. I think until the manifest is fixed, it is better to err on the side of putting lesser constraints.

@stevvooe
Copy link
Contributor

stevvooe commented Jul 7, 2017

This fix needs to happen server-side:

  1. If the platform comes from a manifest list or oci index, it is fine.
  2. If it has to be resolved from the image config, it should undergo similar logic.

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

Yes, but right now, we have no way of resolving which version of arm an image belongs to.

@nishanttotla nishanttotla force-pushed the dont-set-architecture-constraint branch from 444fdca to 5fa6df3 Compare July 10, 2017 16:55
@estesp
Copy link
Contributor

estesp commented Jul 11, 2017

It is strange that I'm just learning this, but a weird mismatch that we actually query the Linux UTS subsystem for the daemon's architecture setting, but all image handling is done with GOOS and GOARCH. It was confusing to me that an image was trying to match on something that Go doesn't even acknowledge (armv7l) which to me is more the root of the problem than the well understood issue with ARM variants.

Seems we need to at some point resolve matching to all use GOOS/GOARCH (with the added variant processing that is already available in OCI and also available when building a manifest list) so that we don't have some resolver of a bunch of random distro uname choices back to something "understood".

@thaJeztah
Copy link
Member

I agree with all of the above; we should fix the current situation, because it's a regression from 17.03;

  • use "OS", if it's a known value
  • use architecture, if it's a known value
  • if not, ignore, and use the same behaviour as docker 17.03

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

This is what 17.03 did, so not a regression. Mixed architecture Swarm clusters are probably rare, and if someone currently uses a mixed architecture (or operating system) cluster, they probably use constraints already to influence scheduling.

In the reporter's case moby/swarmkit#2294, things worked fine in the old situation; the service used an image specific to the architecture, and could just be deployed.

@justincormack
Copy link
Contributor

justincormack commented Jul 12, 2017 via email

@estesp
Copy link
Contributor

estesp commented Jul 12, 2017

@justincormack true, the connection to Go is not to take their view of OS/architecture as gospel, but realize that as Go programs, we have a few options for determining our current runtime architecture and OS; either ask the Go runtime, or try to divine it from Linux syscalls/environment data.

FWIW, IMHO, YMMV :) .. Go is a much more stable source of this information than Linux as distros have shown to be quite open to variance and finding their own way when it comes to embedded architectures and uname -a responses :)

To make this more concrete, there is an issue (I think it was over in the OCI spec repo) that lists about 60 different variants (strings) of arch-os tuples on ARM platforms running Linux. Go provides us 2: "arm" and "arm64". That leaves us having to find another method (using the proposed variant field in manifest list/OCI indexes) when building images for v6, v7, v8 specific ARM families, but at least it means we aren't trying to harmonize 60 different strings down to "arm" or "arm64" + a variant list of 3 items. Of course we will have to find a way to query the runtime "variant" when arch = {arm,arm64}, which IMO is the only missing piece, but that only applies to manifest list/index-based image resolving, not single manifest images which only provide us the two pieces of info: OS and arch.

If I haven't rambled on too long (don't answer that), another way to look at this bug is that you will never have a stable comparison between runtime.GOARCH and uname -a; and images are created using runtime.GOARCH but they are being compared to output from uname -a. It was already shown to be a problem in the old days because x86_64 (a Linux-ism) was being hardcoded in images, but runtime.GOARCH in Go is called "amd64" for the same platform.

@stevvooe
Copy link
Contributor

@justincormack UTS doesn't exactly help here. Right now, it returns armv7l in certain cases, which I cannot find any information about, anywhere. UTS (or gnu triples) seem identify a toolchain, which can be very different for identical architectures.

Go's approach is at least pragmatic: they try to define a real machine target.

If you can identify a standard approach that actually works, then, by all means, suggest one. Laying this on the footsteps of Go's toolchain is fairly dismissive.

@alexellis
Copy link
Contributor

I also tested 17.07.0-ce-rc1 this morning on armv7l and after upgrading from 17.05 (fully working with stacks and services) - I get 0/1 replicas after deploying a stack.

In addition there are no error messages on service inspect/ls/ps.

@andrewhsu
Copy link
Member

@alexellis fyi the code changes in this PR was not included in the 17.07.0-ce-rc1 release because it has not been merged yet. Curious to know if you ran your test with a custom compile of the 17.07.0-ce-rc1 codebase with this PR patch applied?

@alexellis
Copy link
Contributor

alexellis commented Jul 27, 2017

I think Stefan and myself tried the same install script from Eli @ Docker Inc via the #arm Slack channel-

curl -fsSL
https://raw.githubusercontent.com/seemethere/docker-install/update_for_170701rc1/install.sh
| sh
 Version:      17.07.0-ce-rc1
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8c4be39
 Built:        Wed Jul 26 21:00:23 2017
 OS/Arch:      linux/arm

All services from docker-compose.armhf.yml are pinned at 0/1 replicas.
https://github.com/alexellis/faas

DEBU[0116] no suitable node available for task module=node node.id=a82troq1z5gmjwnpdlw76k5v0 task.id=dcxqoaohw872432weilc9voc8

However with 17.05 all come out at 1/1 replicas.

 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:30:54 2017
 OS/Arch:      linux/arm

@thaJeztah
Copy link
Member

@alexellis what does

docker service inspect -f '{{json .Spec.TaskTemplate.Placement}}' <service-id>

show for those services?

@alexellis
Copy link
Contributor

{"Platforms":[{"Architecture":"arm","OS":"linux"}]}

@alexellis
Copy link
Contributor

@thaJeztah ... it's still broken in the following version
@andrewhsu

Client:
 Version:      17.07.0-ce-rc2
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   36ce605
 Built:        Mon Aug  7 23:53:53 2017
 OS/Arch:      linux/arm

Server:
 Version:      17.07.0-ce-rc2
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   36ce605
 Built:        Mon Aug  7 23:47:08 2017
 OS/Arch:      linux/arm
 Experimental: false

@justincormack
Copy link
Contributor

@alexellis nothing has changed since #34021 (comment) "fyi the code changes in this PR was not included in the 17.07.0-ce-rc1 release because it has not been merged yet. Curious to know if you ran your test with a custom compile of the 17.07.0-ce-rc1 codebase with this PR patch applied?"

Have you tested it? Does it fix the issue?

@alexellis
Copy link
Contributor

alexellis commented Aug 11, 2017

I interpreted the comment to mean "tell me what build you're using?" and I replied.

Do you have a "custom compile" of Docker CE available for testing?

@alexellis
Copy link
Contributor

alexellis commented Aug 20, 2017

I can confirm the fix works.. I deployed a full application (FaaS).

I tested it along with rallying some of the community to help with the ARM testing and opened additional issues in the docker-arm repo with the tag community-supports-docker.

alexellis/docker-arm#14 (comment)

@nishanttotla @estesp @StefanScherer

@vattybear
Copy link

Concur with @alexellis above - have similarly tested the fix today - commented with details on alexellis/docker-arm#17

@andrewhsu
Copy link
Member

Is this PR a suitable stepping-stone step in the interim?

@@ -123,12 +123,19 @@ func (s *distributionRouter) getDistributionInfo(ctx context.Context, w http.Res
if err == nil {
err := json.Unmarshal(configJSON, &platform)
if err == nil && (platform.OS != "" || platform.Architecture != "") {
if platform.Architecture == "arm" || platform.Architecture == "Arm" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at this approach and doing it on the server side will create future problems for this endpoint. Clearing the architecture should be done on the client side, when setting up the constraint. That way, we can fix this in the future and still be able to support this version of docker.

@stevvooe
Copy link
Contributor

stevvooe commented Sep 1, 2017

LGTM

@nishanttotla Thanks!

@@ -98,6 +98,11 @@ func imageDigestAndPlatforms(ctx context.Context, cli DistributionAPIClient, ima
if len(distributionInspect.Platforms) > 0 {
platforms = make([]swarm.Platform, 0, len(distributionInspect.Platforms))
for _, p := range distributionInspect.Platforms {
// clear architecture field for arm. This is a temporary fix.
arch := p.Architecture
if arch == "arm" || arch == "Arm" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if stringsToLower(arch) == "arm"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should this be HasPrefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the images I've seen, it's only been "arm". We could use HasPrefix to be extra careful but I don't think it's necessary.

@@ -98,6 +98,11 @@ func imageDigestAndPlatforms(ctx context.Context, cli DistributionAPIClient, ima
if len(distributionInspect.Platforms) > 0 {
platforms = make([]swarm.Platform, 0, len(distributionInspect.Platforms))
for _, p := range distributionInspect.Platforms {
// clear architecture field for arm. This is a temporary fix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more detail here as to why it's being cleared?

Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
@nishanttotla nishanttotla force-pushed the dont-set-architecture-constraint branch from f172404 to 772af60 Compare September 1, 2017 20:08
@nishanttotla
Copy link
Contributor Author

@cpuguy83 addressed your comments.

@cpuguy83
Copy link
Member

cpuguy83 commented Sep 1, 2017

Lint errors are being dealt with here: #34706

@nishanttotla
Copy link
Contributor Author

This commit will also need vendoring into the docker/cli.

Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrewhsu
Copy link
Member

Is this one ready to go or are there any other types of verification that is needed before merging?

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐮

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.