MTLCreateSystemDefaultDevice returns nil #1779

Closed
1 of 6 tasks
drewcrawford opened this issue Oct 10, 2020 · 2 comments
Labels
Area: Apple question Further information is requested

Comments

@drewcrawford

Description
Inside a GitHub-hosted runner, calls to the macOS API MTLCreateSystemDefaultDevice return nil. This prevents any use of Metal and is not generally expected to happen on macOS. It can break arbitrary software, and such breakage will become more likely over time as more code depends on Metal. This appears to be caused by the GPU configuration in the guest environment.

Area for Triage:
Apple

Question, Bug, or Feature?:
?

Virtual environments affected

  • [x] macOS 10.15
  • [ ] Ubuntu 16.04 LTS
  • [ ] Ubuntu 18.04 LTS
  • [ ] Ubuntu 20.04 LTS
  • [ ] Windows Server 2016 R2
  • [ ] Windows Server 2019

Expected behavior
MTLCreateSystemDefaultDevice() should return a non-nil value

Actual behavior
MTLCreateSystemDefaultDevice() returns nil

Repro steps

In the linked action run, this API is called in both the macOS and iOS Simulator environments. The output is:

Metal device is  nil
all devices []
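
For anyone who wants to reproduce this without the linked run, here is a minimal sketch that prints output in the same shape. It assumes the check uses MTLCreateSystemDefaultDevice and MTLCopyAllDevices (the latter is macOS-only); the helper names describeDefaultDevice and allDeviceNames are hypothetical and not taken from the original project. The conditional compilation is only there so the file also builds where Metal is missing.

```swift
#if canImport(Metal)
import Metal
#endif

/// Returns the name of the default Metal device, or "nil" if there is none.
/// (Hypothetical helper for this sketch.)
func describeDefaultDevice() -> String {
    #if canImport(Metal)
    if let device = MTLCreateSystemDefaultDevice() {
        return device.name
    }
    #endif
    return "nil"
}

/// Returns the names of every Metal device visible to this process.
/// MTLCopyAllDevices is only used on macOS; elsewhere this returns [].
func allDeviceNames() -> [String] {
    #if canImport(Metal) && os(macOS)
    return MTLCopyAllDevices().map { $0.name }
    #else
    return []
    #endif
}

print("Metal device is \(describeDefaultDevice())")
print("all devices \(allDeviceNames())")
```

On a physical Mac this should print a device name and a non-empty list; on the runner described here both come back empty.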

What is this API?

This API is the chokepoint for any use of Metal, the only non-deprecated graphics library on macOS. In addition, Metal is used for general-purpose GPU computing, so it may be doing the heavy lifting when you call some other system API. It's increasingly likely over time that some software you use or test in a CI environment on Apple platforms is trying to call it.

What is the significance of the current behavior?

Errors related to this appear in other reports, so I wonder if other macOS issues are related to this issue.

It is generally imagined that this API returning nil is not really possible on modern macOS. A brief survey of usage on GitHub supports this view: the predominant pattern is force-unwrapping the result (!), which crashes in a virtual environment. A minority of results generate a soft error, and I wasn't immediately able to turn up any examples that would function correctly in a GitHub runner.

Developers assume it works because a GPU supporting Metal has been a minimum system requirement for macOS since 10.14, and for iOS even longer. So this API working (e.g., slowly with integrated graphics) is imagined to be part of the macOS 10.14+ platform, rather than a question of availability on specific hardware. This is a very different expectation from the one on Windows/Linux.

I asked someone with knowledge of the implementation for this API if there is any reason a developer today ought to handle a nil response, and they suggested nil probably indicates a serious OS fault, so not really.
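
To make the two patterns from the survey concrete, here is a rough sketch of the soft-error variant, with the force-unwrap shown only as a comment. The names GPUSetupError, makeDevice, and metalIsAvailable are hypothetical and exist only for illustration. This does not make Metal work on a runner; it only lets a test suite skip GPU work instead of crashing.

```swift
#if canImport(Metal)
import Metal
#endif

/// Hypothetical error type for this sketch; not part of Metal.
enum GPUSetupError: Error {
    case noMetalDevice
}

// The predominant pattern found in the survey crashes when the API returns nil:
//     let device = MTLCreateSystemDefaultDevice()!

#if canImport(Metal)
/// Soft-error variant: throws instead of crashing when no device exists.
func makeDevice() throws -> MTLDevice {
    guard let device = MTLCreateSystemDefaultDevice() else {
        throw GPUSetupError.noMetalDevice
    }
    return device
}
#endif

/// Reports whether GPU-dependent work can run in this environment.
func metalIsAvailable() -> Bool {
    #if canImport(Metal)
    return (try? makeDevice()) != nil
    #else
    return false
    #endif
}

if metalIsAvailable() {
    print("Metal device found; running GPU tests")
} else {
    print("No Metal device; skipping GPU tests")
}
```

The obvious downside is that anything skipped this way gets no CI coverage at all.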

Isn't there a software fallback for this?

Not for Metal itself. Codebases that predate widespread Metal availability may have kept around their old codepath which incidentally supported a fallback. These are increasingly not maintained or actively developed, and so if they exist they usually aren't the priority for running or testing/CI workflows.

Roblox recently wrote:

"Today, for our audience, [opengl] is ~2% – which means our OpenGL backend barely matters anymore. We still maintain it but this will not continue for long."

Of course, new code written today is likely to skip this entirely and assume Metal is available.

What can be done about this?

The method I'm aware of is to pass the host GPU through to the guest environment. I don't know if this can be done for multiple guests or would be sensible in GitHub's environment (I'm guessing not).

For virtualizing macOS 11, Apple is forcing the use of a new set of low-level APIs. Some VMware products have experimental support for using these APIs to paravirtualize the host GPU into the guest environment, which fixes this issue. So it seems the situation for macOS 11 will be better, but it might require additional or experimental configuration to make it work.

LeonidLapshin self-assigned this Oct 12, 2020
LeonidLapshin added the "question" (Further information is requested) label Oct 12, 2020
@LeonidLapshin
Contributor

Hey, @drewcrawford
Thank you for your research and the information provided. Unfortunately, our current virtualization approach for macOS doesn't allow us to provide GPU passthrough due to several hardware and software limitations. Moreover, I'm afraid there are no plans to make GPU functions available on macOS runners in the near future.

drewcrawford added a commit to drewcrawford/blitcurve that referenced this issue Nov 18, 2020
In FB8904929 we noticed that metal::pow often returns the wrong value on macOS.  This causes incorrect results for some operations like BCCubicSplit (and left/right variants).  It could theoretically affect BCCubicEvaluate, BCCubicEvaluatePrime and BCNormalization as well but it seems a lot less likely for those functions.

This may be related to use of negative base and odd exponent, which ought to be well-defined for metal::pow per Metal Shading Language Specification Table 6.4, but in practice behaves like UB with different results on different GPUs.  Seems to be quite finicky about whether the operands are statically known or the dataflow of the arguments.

The fix is to use explicit multiplication where possible (e.g. integer exponent) and where we think the base might be negative.
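
The actual change lives in Metal Shading Language code that is not shown in this thread. As a rough CPU-side illustration of the same idea in Swift, an integer exponent can be handled by repeated multiplication, which is well-defined for a negative base. The function name powInt is hypothetical and is not the helper used in blitcurve.

```swift
/// Raises `base` to a non-negative integer power by repeated multiplication,
/// avoiding any call to a pow() routine. (Hypothetical helper for this sketch.)
func powInt(_ base: Float, _ exponent: Int) -> Float {
    precondition(exponent >= 0, "this sketch only handles non-negative exponents")
    var result: Float = 1
    for _ in 0..<exponent {
        result *= base
    }
    return result
}

// Negative base with an odd exponent, the case suspected above.
print(powInt(-2, 3))
```

For small fixed exponents, writing the product out directly (x * x * x) does the same thing without a loop.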

Also add test coverage.  This test coverage is known to trip on some AMD and Intel systems.  However some GPUs are known to pass the test even though they experience the issue, so it’s not total.

Due to actions/runner-images#1779, there is no CI coverage for this issue.

See also FB8904929, mt2-109.
drewcrawford added a commit to drewcrawford/blitcurve that referenced this issue Nov 19, 2020
@Gustl22

Gustl22 commented Jan 30, 2023

@LeonidLapshin Would it be reasonable to add Metal support when introducing the M1 runners?
Since Flutter 3.7.x, OpenGL support has been removed, so the options for testing UI are very limited or nonexistent.
CC @Steve-Glass
