
Expose multitasking support for webXR #15

Open
TrevorDev opened this issue May 5, 2018 · 18 comments

@TrevorDev

Has multitasking support in WebXR already been discussed for later versions of the spec? I searched online but haven't found any discussions.

Background:

Today's WebVR/WebXR applications are limited to fully immersive experiences that take full control of the user's environment. On mobile or desktop, workflows often involve some form of multitasking, such as running two applications side by side to get the productivity boost of not having to switch between apps to consume their content concurrently. Existing AR/VR desktop environments such as the Windows Mixed Reality home or Oculus Dash already allow multitasking with 2D web browser windows (e.g. https://www.youtube.com/watch?v=SvP_RI_S-bw), so it would be natural to provide similar support for 3D web content through a browser.

Use Cases:

In general, most WebXR applications could provide some support for multitasking, similar to desktop applications.

  • Comparison web shopping:
    One furniture store website would display a couple of interactive 3D chairs (change fabric, fold up, etc.), and another would do the same. The user could then compare both offers directly without having to jump back and forth between apps.

  • Gaming and video watching:
    One website would display a game such as flying an RC plane. Another would be presenting a video. Neither of these apps may be exciting enough to view alone, but each enhances the experience of whatever it is running concurrently with.

  • Feed notifications:
    A feed-based website could be launched and then placed off to the side while the user works in other applications. When a new item is added to the feed, it could indicate that new content is ready.

Proposed approach:

In order to get basic multitasking support, an app should have a way to provide what the user is seeing (possibly color and depth textures) to the browser for it to be composited on top of other apps. Some mechanism for window management might be needed, where each WebXR application's position could be controlled externally to avoid heavily overlapping apps. Input handling would need to cover scenarios such as two apps having buttons in front of or behind each other while the user selects with a ray that passes through both. Finally, some method to retrieve information about the current environment the apps are running in would be needed (e.g. lighting, to ensure multiple apps are lit consistently).
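
As a purely illustrative sketch: assuming a hypothetical 'immersive-multitask' session mode (not part of any spec) and app-provided `canvas` and `drawScene` helpers, the rendering side could stay very close to today's WebXR flow, with the UA compositing each app's color+depth output:

```js
// Hypothetical sketch: 'immersive-multitask' does not exist in the WebXR spec.
// Everything else is the standard WebXR rendering flow; the only difference is
// that the UA would composite this app's color + depth output with other apps.
async function startMultitaskSession(canvas, drawScene) {
  const gl = canvas.getContext('webgl', { xrCompatible: true });
  const session = await navigator.xr.requestSession('immersive-multitask'); // hypothetical mode
  session.updateRenderState({
    baseLayer: new XRWebGLLayer(session, gl),
    depthNear: 0.1,
    depthFar: 100.0,
  });
  const refSpace = await session.requestReferenceSpace('local');

  session.requestAnimationFrame(function onFrame(time, frame) {
    session.requestAnimationFrame(onFrame);
    const pose = frame.getViewerPose(refSpace);
    if (!pose) return;
    const layer = session.renderState.baseLayer;
    gl.bindFramebuffer(gl.FRAMEBUFFER, layer.framebuffer);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
    for (const view of pose.views) {
      const vp = layer.getViewport(view);
      gl.viewport(vp.x, vp.y, vp.width, vp.height);
      drawScene(view); // app writes color and depth; the UA would read both when compositing
    }
  });
}
```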

Here is an old prototype video I made with rough use cases: https://www.youtube.com/watch?v=R3xZ1G291Ks

@AlbertoElias

I think it'd be interesting to think of some kind of Extension-like API for sites that don't need to occupy your whole FOV. It might just be a tool or a virtual character or something similar that you can interact with across all immersive sites.

This might be something that could work nicely with https://github.com/WICG/webpackage

@TrevorFSmith
Contributor

TrevorFSmith commented Jul 30, 2018

(pulling @janerivi over from #16 so that we can consolidate discussion)

This is a topic that has come up quite a bit and the usual response is that the WebXR Device API approach of providing a rendering context makes it very difficult to visually composite multiple apps. And then someone brings up the complexity of cross-origin security, audio mixing, and managing input among apps and everyone backs away slowly because it's just too big.

So, step zero in this situation is to figure out whether it's possible to tease out sub-problems that aren't by themselves too massive to reason about and then find consensus.

When I think about the problem, there are several pieces:

  • How can the UA orchestrate the graphics layer so that multiple, possibly antagonistic, web apps can present their visual information without conflicting with each other?
  • How can the UA help the user understand which app is generating visual information?
  • How can the UA help the user understand which app is receiving their input (pointer, voice, gaze...)?
  • How can the UA orchestrate audio information for multiple apps?
  • How can the UA orchestrate haptic information for multiple apps?

@TrevorDev
Author

Thanks for your insight

I really feel that multitasking scenarios should be a high priority and hope people don't back away again on this. I am afraid that most non-gaming/experience apps (e.g. shopping or news consumption) are unlikely to be used if they are forced into running alone.

Some thoughts:

  • For multiple concurrent graphics layers from different apps, I'd like to suggest the XR API provide a "multitasking" session type that has a color+depth framebuffer. Multiple apps would each write to such a layer, and the layers would then be composited on top of each other by the browser before being sent to the headset. This is somewhat similar to how the UtilityLayerRenderer in Babylon is able to composite multiple isolated scenes on top of each other. Prototype
  • Potentially, input could always be shared between all apps. If sensitive input needs to be sent to a single app, that app could ask for exclusive access briefly. Apps could also provide a hit-test method which the UA could call to check where/if a controller's ray is currently hitting an app (see the sketch after this list).
  • @TrevorFSmith Could you expand on what you mean by orchestrating audio information? What differences do you see that might be needed compared to how multiple tabs can already play audio concurrently?
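
A minimal sketch of the input ideas in the second bullet. Everything here is hypothetical: `session.onhittest`, `requestExclusiveInput`, and `releaseExclusiveInput` are invented names, and `myScene.intersectRay` and `field.collectInput` are assumed app-side helpers.

```js
// Hypothetical only: these hooks illustrate the proposal, not any real API.

// The UA would call this with a controller ray to ask "does this ray hit your app?"
session.onhittest = (rayOrigin, rayDirection) => {           // hypothetical hook
  const hit = myScene.intersectRay(rayOrigin, rayDirection); // assumed app-side picking
  return hit ? { distance: hit.distance } : null;            // null = ray passes through this app
};

// Sensitive UI could briefly take exclusive input, then hand it back.
async function readSensitiveInput(field) {
  const granted = await session.requestExclusiveInput();     // hypothetical
  if (!granted) return;
  try {
    await field.collectInput();                              // assumed app-specific helper
  } finally {
    session.releaseExclusiveInput();                         // hypothetical
  }
}
```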

Other questions to consider (With my current thinking inline):

  • Would the web apps be responsible for positioning/anchoring themselves, or should the UA manage this?
    • Leave this out of scope for now. Applications can manage this within their own logic. If overlapping applications becomes an issue, this could be added in the future by providing the app and anchor to attach to.
  • What interaction model between web apps should exist (e.g. if one app has a wall and the other has a ball, should they collide?)
    • The complexity of this feels too high to me, so I would recommend keeping scenarios like this out of scope. For now, apps can use existing mechanisms such as WebSockets or BroadcastChannel to communicate if needed (see the sketch after this list).
  • How can this be iterated on quickly/experimented with:
    • I would really like to see more experimental browsers or behind-a-flag implementations of proposals like this. If no one has looked at the Exokit browser, I highly suggest taking a look. The stability is not ideal at the moment, but they have some cool multitasking features working. Additionally, running multiple apps on a single webpage works OK, too.
  • If multiple apps are running concurrently, should the UA be responsible for controller rendering/ray casting?
    • I'd suggest yes. Possibly apps could request to take control over this in the future, but if compositing is handled by the UA, maybe input should be too.
  • How might 3D web apps be able to run concurrently with 3D native apps?
    • If the host OS wanted to do the compositing, could the UA provide it with the composited scene so that the host OS could composite the browser along with native apps?
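
For the cross-app communication point above, the existing BroadcastChannel API already covers simple cases between same-origin pages running side by side; the channel and message names below are just examples:

```js
// Real API: BroadcastChannel lets same-origin pages/tabs exchange simple messages.
const channel = new BroadcastChannel('xr-multitask-demo');

// App A: announce where it placed its content so a companion app can avoid overlap.
channel.postMessage({ type: 'placed', app: 'rc-plane-game', position: [0, 1, -2] });

// App B: listen and react.
channel.onmessage = ({ data }) => {
  if (data.type === 'placed') {
    console.log(`${data.app} is occupying`, data.position);
    // e.g. shift this app's content away from data.position
  }
};
```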

@TrevorFSmith
Contributor

For most of the points we've made, there's a tension between what the browser can control and how much control the apps have. Tip too far toward browser control and the use cases are limited to what we can imagine today. Tip too far toward app control and they'll clash with each other on every output channel (auditory, visual, haptic, ...) and every input channel (hand gestures, wands, voice, ...).

I've heard from a couple of people that there might be a middle ground for the web, where there is a constrained declaration of the initial content (sometimes using markup, sometimes a format like glTF) and then a constrained protocol between scripts and the UA for receiving input from the user and making changes to the content. So, it's a way for web devs to create something the UA can manage that is also pretty flexible in what it shows and does.

This is, however, very different than the current WebXR Device API approach of giving each session a full rendering context and control over the entire visual field.

I'm not sure how (or if) we bridge those two disparate approaches.

@TrevorDev
Author

The trade-off between browser and app control definitely exists, but my current thinking is to expose as much control to the app as possible while still allowing users to avoid/close apps that cause issues. If this is done in an experimental way, I'd expect developers to naturally find patterns to play nicely with other apps, or risk users abandoning them, similar to what happens to intrusive 2D websites today. If browsers provided a method to composite multiple XR frames from different tabs, the MVP API contract would be very close to that of the existing XR spec (this seems to work well for Exokit, which is trying to avoid existing XR apps having to be rewritten at all).

I would like to learn more about the possible markup/glTF solution you mentioned. What might be responsible for rendering the 3D content? If this existed separately from the XR Device API, what might be the reasons to use one API over the other?

@TrevorFSmith
Contributor

I don't yet see a way to give two or more apps the freedom to directly render into a 3D space (even in separate composited graphics contexts) without opening up the user to a wide range of undesirable behavior where the apps are conflicting in both accidental and malicious ways. We probably don't want to share depth buffers between apps because that would leak quite a bit of information across domains. If we solve that problem, somehow, then how can the user know which application is responsible for a specific bit of rendered information? A malicious app could insert itself in front of another app and mimic its UI, causing the user to unknowingly interact with the malicious app, possibly leaking personal data.

I do see a way that the app could give the browser a declarative renderable (e.g. a glTF file) and the browser would be in charge of loading the renderable, keeping it visually separate from other apps' renderables, and clearly denoting which app is receiving input.

@TrevorDev
Author

If the color+depth compositing is done externally to the apps, depth information does not need to be shared cross-domain; apps could write to a write-only buffer or submit buffers to be composited.

There are a couple of possible solutions to stop malicious apps. For example, when entering sensitive information, an app could enter a "security mode" where it is the only app able to render and the browser can display to the user which app they are interacting with. Additionally, an API could be exposed that apps could call to let the browser know they are being interacted with, so that for any given interaction the browser could display which apps are responding.
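
To make those two ideas concrete, here is one hypothetical shape they could take. `requestSecurityMode` and `reportInteraction` are invented names, `session` is an assumed active session, and `passwordField.collectInput` is an assumed app-side helper; nothing like this is specified anywhere.

```js
// Hypothetical only: requestSecurityMode() and reportInteraction() do not exist.

// "Security mode": ask the UA to stop compositing other apps and show the user
// which origin they are currently interacting with.
async function enterPassword(session, passwordField) {
  const lock = await session.requestSecurityMode();    // hypothetical
  try {
    await passwordField.collectInput();                // assumed app-specific helper
  } finally {
    lock.release();                                    // hypothetical
  }
}

// Outside security mode: tell the UA an interaction landed here, so it can show
// the user which app(s) responded to a given selection.
function onSelect() {
  session.reportInteraction({ target: 'buy-button' }); // hypothetical
}
```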

I'm somewhat concerned that having the browser in charge of rendering, and of things like keeping apps visually separate, may be leaning too far towards browser control. However, if security is a higher priority, I agree it's probably the better option to restrict apps from misbehaving from the start. (I do still hope I can fly a virtual RC rocket around my entire multitasking space while watching a video at some point in the future, though 🚀)

@AlbertoElias

I really like the idea of an API for some kind of security mode when there's a delicate interaction going on with a specific app.

Additionally, an API could be exposed that apps could call to let the browser know they are being interacted with, so that for any given interaction the browser could display which apps are responding.

Can't browsers know that already?

@TrevorDev
Author

@AlbertoElias If any app was able to draw anywhere in your 3D space and multiple apps were running at the same time, the browser and the user wouldn't know what content belongs to which app; all they would see is the composited views from all apps.

In the scenario you mentioned with a virtual character, as well as some of the scenarios that @janerivi mentioned in his issue, such as the fox running on the ground, these apps need some freedom in where they draw to achieve their desired functionality, while at the same time making trade-offs to ensure users are secure. Maybe in the glTF API @TrevorFSmith proposed, apps could request to move to different areas of the screen if my original compositing proposal is too risky.

@AlbertoElias

Yes, that's true, but I think needing an API for the different apps to call to let the browser know is too much to ask of developers. I don't know much about this, but it'd be interesting to see if browsers could hook into the drawing pipeline to see which sites drew what content in the 3D scene.

Maybe it'd be interesting to check out what Magic Leap is doing with their Lumin runtime. From a quick read, it seems each landscape app is deployed in a prism, so they do seem to be separated from one another. Maybe someone at Mozilla can tell us more about how these prisms work together.

@blairmacintyre

(@TrevorDev Sorry for the late reply; somehow I wasn't getting notifications on this repo, I thought I'd "watched" it)

This topic is near and dear to my heart; running multiple AR apps at once is something I also think is critical, and it was central to the ideas behind the first version of the Argon browser we released back in 2011; all versions (up to the current Argon4) allow multiple pages to be overlaid.

I've brought this topic up at our face-to-face meetings, and @TrevorFSmith and I included some initial support for it in our proposed API last year, but we have generally gotten a lot of pushback because of two main things:

  • performance issues (i.e., it's going to be hard enough to get a single app to run with reliable performance on the mobile devices)
  • UI issues (some of which @TrevorFSmith has summarized).

Some thoughts on the above (again, sorry for not keeping up).

How can the UA orchestrate the graphics layer so that multiple, possibly antagonistic, web apps can present their visual information without conflicting with each other?
How can the UA help the user understand which app is generating visual information?
How can the UA help the user understand which app is receiving their input (pointer, voice, gaze...)?
How can the UA orchestrate audio information for multiple apps?
How can the UA orchestrate haptic information for multiple apps?

I don't think these all need to be solved completely, as they can just be left up to the UA (which is perhaps what you meant).

For example, a UA would feed each app/layer the same WebXR pose info, but may have different ways of choosing the order in which they get called/rendered (i.e., a notion like "tabs").

In Argon, we settled on having a single page be the "current page"; that page was the only one to get input. We tried different schemes of sending input to all pages, but it was easy to come up with security problems. Having one page (that the user chooses) as the "current" one simplifies this, and lets other issues (i.e. audio) be managed as well (e.g., UAs can provide UIs to mute "background" audio or have it off by default).

Potentially, input could always be shared between all apps. If sensitive input needs to be sent to a single app, that app could ask for exclusive access briefly. Apps could also provide a hit-test method which the UA could call to check where/if a controller's ray is currently hitting an app.

It would be safer, and simpler, to have input go to a single app at a time; implementing UI controls (i.e., the equivalent of click-drag-release) becomes insanely complicated if multiple channels can get the same input ... and relying on devs to note when input is sensitive seems dangerous!

[Regarding the security/depth leakage/etc.]

I don't actually see an issue if the app can't read the depth or framebuffer back.

There are actually two reasons to have many apps:

  1. to allow many long running apps to be seen/interacted with simultaneously. This is hard, as the content needs to work robustly, continuously, and allow for high quality interaction that is pleasant, intuitive, etc. I DO NOT see this as a good candidate for multiple simultaneous WebXR sessions (even though this is more or less what I suggested at the face-to-faces last year).

  2. to allow users to choose to run multiple web apps at the same time, for reasons like comparing content or briefly accessing multiple disjoint things (e.g., bringing up two repair manuals; seeing items from multiple stores in the living room at the same time). The expectation here is that the UA makes sure the user knows that "things may look wacked, and performance may suffer since both apps may assume they have the whole machine", but they want to do it temporarily for good reasons. This seems like a really good use case for having multiple simultaneous WebXR sessions.

My view is that the most obvious way to allow multiple simultaneous WebXR sessions is this:

  • each app thinks it has complete control, as it does now
  • each app renders RGBA + depth
  • the UA handles compositing such that no app can read back another app's content
  • only one app gets input at a time (the UA provides a method for the user to choose and control which)
  • the UA controls the ordering of rendering, and how the user understands which app currently has input

There are some pretty simple metaphors for input management. For example, perhaps the UA requires the user to choose the input layer; when no input is being sent to layers, they are all visible, but when one is selected, the content of "background" layers is greyed out or appears to be behind a slightly distorted visual layer (e.g., water, glass, etc.).

But, any of that could be up to the UA.

From WebXR's perspective, the implementation needs to support multiple layers, but the standard just needs to acknowledge it as a possibility and not prevent it.

For example, instead of coupling input to one immersive session at creation time, it could make clear that if the UA supports multiple immersive sessions, only one can get any input (beyond pose) at a time. Things like hit testing might only return success on the layer that is getting input; there would have to be a way for "background" apps to know they aren't getting hits because they are backgrounded. This could be as simple as there being "focus" and "blur" events on the layers (for foreground/background), with hit testing only succeeding if the app has focus.
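
A sketch of how that could look from an app's point of view, assuming hypothetical 'focus' and 'blur' events on the session; the 'select' event is real WebXR, `session` is an assumed active XRSession, and `handleSelection` is an assumed app-side handler. (The closest existing analog to focus/blur is XRSession's visibilitychange event and visibilityState.)

```js
// 'focus' and 'blur' here are hypothetical per the comment above.
let hasFocus = true;

session.addEventListener('focus', () => { hasFocus = true; });   // hypothetical event
session.addEventListener('blur',  () => { hasFocus = false; });  // hypothetical event

session.addEventListener('select', (event) => {
  // A backgrounded app never sees successful hits, and the blur event tells it
  // why, instead of leaving it to guess.
  if (!hasFocus) return;
  handleSelection(event.inputSource);
});
```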

@AlbertoElias

I agree with currently focusing on the second reason, especially as there is a lot of utility there already and it is a point we can get to from the current direction WebXR is heading. We can look in the future at how to improve on that.

Can't the type of immersive session change for a multi-app environment and, in that way, let the developer know?

@TrevorDev
Author

I had the opportunity to meet up with @NellWaliczek and @toji at SIGGRAPH this week and asked for their thoughts on this. From my understanding of their response (please correct me if I'm wrong), they have heavy concerns around the same issues that Trevor brought up and believe that the declarative model may be the right direction moving forward, and that this would likely be external to the WebXR spec. This topic is something that the Immersive Web Working Group will likely be looking into as native AR/VR platforms get more mature.

@blairmacintyre Argon looks very interesting (I had never heard of it and am glad others have similar concerns 👍). I think experimental browsers like this or Exokit are the right way to push for desired new features or to create an entirely new platform for app development. If scenarios can be developed there that make users go out of their way to install, it is a good way to prove value. Restricting input to a single app at a time seems like a good way to avoid accidental input noise; I'd be interested in how this might work for playing a 3D game while watching a 3D video.

@AlbertoElias The Lumin runtime is very cool; it's the first AR headset runtime to ship 3D multitasking support that I'm aware of! One thing I fear, though, is that as other platforms ship similar features, their runtimes may not end up similar enough for a browser spec to feasibly produce useful content that works on all of them. This is one of the reasons why RGB+depth compositing seems like the right API approach to me, as the contract with the OS may be less complex.

@TrevorFSmith
Contributor

Yes, I don't currently see a way to use the session-oriented WebXR APIs as the basis for the sort of long-lived, simultaneously running XR apps that I think would be super interesting.

I'm knocking around some ideas for an internal project that uses various features that already exist in headset browsers (workers, home environments, etc.) as a way for UAs to host such apps, but the ideas are still very early. The rough idea is that the web dev declares a set of 3D and audio assets and a script. The UA loads the assets (using its own loader, not one provided by the developer) and then loads the script in a worker. There would be a specified message protocol between the UA and the worker script, with the UA providing input messages like 'hover' and 'activate' and the script providing messages about modifying aspects of the assets, like changing texture IDs, posing meshes, and starting and stopping audio assets.

The UA would then be totally in charge of managing and rendering the 3D and audio assets. It would also handle input and choose how to indicate to the user that input is going to a specific app. The UA would also manage the lifecycle of the app scripts.
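
A minimal sketch of what such a protocol might look like, using only standard worker messaging. The message names ('hover', 'activate', 'set-texture', 'pose-mesh', 'play-audio'), node names, and asset IDs are assumptions drawn from the description above, not a real specification.

```js
// ---- app-logic.js: the developer's script, loaded by the UA into a Worker ----
// The script never renders; it only reacts to UA input messages and asks the UA
// to mutate the assets it declared. All message names here are hypothetical.
self.onmessage = ({ data: msg }) => {
  switch (msg.type) {
    case 'hover':
      self.postMessage({ type: 'set-texture', node: msg.node, textureId: 'highlight' });
      break;
    case 'activate':
      self.postMessage({ type: 'play-audio', assetId: 'click-sound' });
      self.postMessage({ type: 'pose-mesh', node: msg.node, position: [0, 1.2, -0.5] });
      break;
  }
};

// ---- UA side (conceptual): load the declared assets, then bridge input ----
// const appWorker = new Worker('app-logic.js');
// appWorker.postMessage({ type: 'hover', node: 'chair-1' });
// appWorker.onmessage = ({ data }) => applyAssetChange(data); // UA-internal, assumed
```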

It's an essentially different model for XR apps than page-based WebXR sessions, but I feel like it's a worthwhile experiment.

@JeroMiya

Not sure if it was mentioned, but the potential security threat of having multiple apps receive input at the same time is that one app might have some custom 3D input method for passwords, logins, or other sensitive information. Even if a malicious app can't read back the frame buffer, it could potentially machine-learn the password from gaze, for example, or at least reduce the search space statistically.

And yet, if we go by the model of each app just being given its own frame buffer and a pose, apps can't even render their scene to be composited without that head pose.

This suggests to me a different rendering model when apps are composited. Instead of telling the UA how to render, an app would need to tell the UA what to render. In other words, instead of a frame buffer, something like a DOM for 3D objects (glTF?) where each app adds objects to its own DOM in a UA-provided coordinate space (an app might ask for a world anchor or head space, etc.), and then the UA merges the DOMs for rendering. Apps can read back only their own DOM, and only one app gets input at a time.

Because the UA knows more about the scene, this also makes it easier for the UA to visualize app ownership of 3D objects, optionally limit apps to an enclosure in world space, and let the user hide/show apps. It would also help the UA manage performance on mobile devices.

Of course, this would be more limiting than the current WebXR model, so I see the two working side by side; apps that need the full model would have to request full access and then run one at a time, like today.

@avaer

avaer commented Feb 13, 2019

Several good points from the Feb 12th call, as well as the F2F:

  • The concept can be naturally bound to volumes/"prisms". Potential solutions come out of that in terms of security: UA can capture input/manage permissions on a per-volume basis.
  • We'll need some good, higher level UA UX to manage these "apps".
  • The UA would have to handle slowness from applications, and deal with system reprojection.
  • Scriptability/programmability is important for many use cases.
  • "We should do the simplest thing" -- but it's not clear what is simplest.

@TrevorDev
Author

It's been a while. Is this something that can be added to the TPAC agenda? /tpac @AdaRoseCannon

@AdaRoseCannon
Member

We should have time to discuss this. At least talk about where things like this have appeared in recent operating systems.

It seems there are some cases where this is already happening and it doesn't need explicit API support.
