Expose multitasking support for webXR #15
I think it'd be interesting to think of some kind of extension API for sites that don't need to occupy your whole FOV. It might just be a tool or a virtual character or something else that you can interact with in all immersive sites. This might be something that could work nicely with https://github.com/WICG/webpackage
(pulling @janerivi over from #16 so that we can consolidate discussion) This is a topic that has come up quite a bit, and the usual response is that the WebXR Device API approach of providing a rendering context makes it very difficult to visually composite multiple apps. And then someone brings up the complexity of cross-origin security, audio mixing, and managing input among apps, and everyone backs away slowly because it's just too big. So, step zero in this situation is to figure out whether it's possible to tease out sub-problems that aren't by themselves too massive to reason about and then find consensus. When I think about the problem, there are several pieces:
Thanks for your insight. I really feel that multitasking scenarios should be a high priority and hope people don't back away again on this. I am afraid that most non-gaming/experience apps (e.g. shopping or news consumption) are unlikely to be used if they are forced into running alone. Some thoughts:
Other questions to consider (with my current thinking inline):
For most of the points we've made, there's a tension between what the browser can control and how much control the apps have. Tip too far toward browser control and the use cases are limited to what we can imagine today. Tip too far toward app control and they'll clash with each other on every output channel (auditory, visual, haptic, ...) and every input channel (hand gestures, wands, voice, ...). I've heard from a couple of people that there might be a middle ground for the web, where there is a constrained declaration of the initial content (sometimes using markup, sometimes a format like glTF) and then a constrained protocol between scripts and the UA for receiving input from the user and making changes to the content. So, it's a way for web devs to create something the UA can manage that is also pretty flexible in what it shows and does. This is, however, very different from the current WebXR Device API approach of giving each session a full rendering context and control over the entire visual field. I'm not sure how (or if) we bridge those two disparate approaches.
The scale between browser vs app control definitely does exist, but my current thinking is to expose as much control to the app as possible while still allowing users to avoid/close apps that cause issues. If this is done in an experimental way, I'd expect developers will naturally find patterns to play nicely with other apps or risk users abandoning them, similar to what happens to intrusive 2D websites today. If browsers provided a method to composite multiple XR frames from different tabs, the MVP API contract would be very close to that of the existing XR spec (this seems to work well for Exokit, which is trying to keep existing XR apps from even having to be rewritten). I would like to learn more about the possible markup/glTF solution you mentioned. What might be responsible for rendering the 3D content? If this did exist separately from the XR Device API, what might be reasons to use one API over the other?
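To make the MVP contract concrete, here is a minimal sketch under the assumption of a hypothetical `immersive-composited` session mode (an invented name) where the UA blends the submitted frames with other origins' output; the rest mirrors the existing WebXR Device API rendering loop:

```js
// Minimal sketch, assuming a hypothetical "immersive-composited" session mode.
// The per-frame flow is unchanged from today's WebXR Device API; the only
// difference is that the UA, not the page, decides how this layer is blended
// with other apps' output.
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl2', { xrCompatible: true });

const session = await navigator.xr.requestSession('immersive-composited'); // invented mode string
session.updateRenderState({ baseLayer: new XRWebGLLayer(session, gl) });
const refSpace = await session.requestReferenceSpace('local-floor');

session.requestAnimationFrame(function onFrame(time, frame) {
  const pose = frame.getViewerPose(refSpace);
  if (pose) {
    for (const view of pose.views) {
      // Draw this app's scene into its own framebuffer as usual; the UA
      // composites the result (color, and possibly depth) across apps.
    }
  }
  session.requestAnimationFrame(onFrame);
});
```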
I don't yet see a way to give two or more apps the freedom to directly render into a 3D space (even in separate composited graphics contexts) without opening up the user to a wide range of undesirable behavior where the apps conflict in both accidental and malicious ways. We probably don't want to share depth buffers between apps because that would leak quite a bit of information across domains. If we solve that problem, somehow, then how can the user know which application is responsible for a specific bit of rendered information? A malicious app could insert itself in front of another app and mimic its UI, causing the user to unknowingly interact with the malicious app, possibly leaking personal data. I do see a way that the app could give the browser a declarative renderable (e.g. a glTF file) and the browser would be in charge of loading the renderable, keeping it visually separate from other apps' renderables, and clearly denoting which app is receiving input.
If the color+depth compositing is done externally from the apps, depth information does not need to be shared cross-domain; apps could write to a write-only buffer or submit buffers to be composited. There are a couple of possible solutions to stop malicious apps. For example, when entering sensitive information, apps could enter a "security mode" where it is the only app able to render and the browser can display to the user which app they are interacting with. Additionally, an API could be exposed that apps could call to let the browser know they are being interacted with, so given any interaction, the browser could display which apps are responding. I'm somewhat concerned that having the browser in charge of rendering and things like keeping apps visually separate may be leaning too far towards browser control. However, if security is a higher priority, I agree it's probably the better option to restrict apps from misbehaving from the start. (I do still hope I can fly a virtual RC rocket around my entire multitasking space while watching a video at some point in the future though 🚀)
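As a very rough sketch of how that "security mode" request might look from the page's side (requestExclusiveFocus, releaseExclusiveFocus, and showPasswordKeyboard are invented names for illustration; nothing like this exists in the WebXR spec today):

```js
// Hypothetical sketch only: ask the UA for exclusive rendering while the user
// enters sensitive input, then hand control back afterwards.
async function collectPassword(session) {
  await session.requestExclusiveFocus(); // invented: UA hides other apps and indicates which origin has focus
  try {
    return await showPasswordKeyboard(); // app-defined 3D input UI (assumed to exist)
  } finally {
    session.releaseExclusiveFocus();     // invented: other apps become visible/interactive again
  }
}
```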
I really like the idea of an API for some kind of security mode when there's a delicate interaction going on with a specific app.
Can't browsers know that already?
@AlbertoElias if any app was able to draw anywhere in your 3D space and multiple apps were running at the same time, the browser and the user wouldn't know what content belongs to each app; all they would see is the composited views from all apps. In the scenario you mentioned with a virtual character, as well as some of the scenarios that @janerivi mentioned in his issue such as the fox running on the ground, these apps need to have some sort of freedom in their drawing location to achieve their desired functionality while at the same time making trade-offs to ensure users are secure. Maybe in the glTF API @TrevorFSmith proposed, apps could request to move to different areas of the screen if my original compositing proposal was too risky.
Yes, that's true, but I think needing an API for the different apps to call to let the browser know is too much to ask of developers. I don't know much about this, but it'd be interesting to see if browsers could hook into the drawing pipeline to see which sites drew what content in the 3D scene. Maybe it'd be interesting to check out what Magic Leap is doing with their Lumin runtime. From a quick read, it seems each landscape app is deployed on a prism, so they do seem to be separated from one another. Maybe someone at Mozilla can tell us more about how these prisms work together.
(@TrevorDev Sorry for the late reply; somehow I wasn't getting notifications on this repo, I thought I'd "watched" it) This topic is near and dear to my heart; running multiple AR apps at once is something I also think is critical, and it was central to the ideas behind the first version of the Argon browser we released back in 2011; all versions (up to the current Argon4) allow multiple pages to be overlaid. I've brought this topic up at our face-to-face meetings, and @TrevorFSmith and I included some initial support for it in our proposed API last year, but have generally gotten a lot of pushback because of two main things
Some thoughts on the above (again, sorry for not keeping up).
I don't think these all need to be solved completely, as they can just be left up to the UA (which is perhaps what you meant). For example, a UA would feed each app/layer the same WebXR pose info, but may have different ways of choosing the order they get called/rendered (i.e., a notion like "tabs"). In Argon, we settled on having a single page being the "current page"; that page was the only one to get input. We tried different schemes of sending input to all pages, but it was easy to come up with security problems. Having one page (that the user chooses) as the "current" one simplifies this, and lets other issues (e.g., audio) be managed as well (e.g., UAs can provide UIs to mute "background" audio or have it off by default).
It would be safer, and simpler, to have input go to a single app at a time; implementing UI controls (i.e., the equivalent of click-drag-release) becomes insanely complicated if multiple channels can get the same input ... and relying on devs to note when input is sensitive seems dangerous!
I don't actually see an issue if the app can't read the depth or framebuffer back. There are actually two reasons to have many apps:
My view is that the most obvious way to allow multiple simultaneous WebXR sessions is this:
There are some pretty simple metaphors for input management. For example, perhaps the UA requires the user to choose the input layer; when no input is being sent to layers, they are both visible, but when one is selected, the content of "background" layers is greyed out or appears to be behind a slightly distorted visual layer (e.g., water, glass, etc.). But, any of that could be up to the UA. From WebXR's perspective, the implementation needs to support multiple layers, but the standard just needs to acknowledge it as a possibility and not prevent it. For example, instead of coupling input to one immersive session at creation time, it could make clear that if the UA supports multiple immersive sessions, only one can get any input (beyond pose) at a time. Things like hit testing might only return success on the layer that is getting input; there would have to be a way for "background" apps to know they aren't getting hits because they are backgrounded. This could be as simple as there being "focus" and "blur" events on the layers (for foreground/background) and hit testing only succeeding if the app has focus.
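A hedged sketch of how that focus/blur contract could surface to a page; the "focus"/"blur" events on the session and the idea that hit tests only succeed while focused are assumptions about a possible extension, not existing WebXR API:

```js
// Sketch: every session keeps receiving poses so backgrounded apps can render,
// but only the focused session receives hit-test results and select input.
let hasFocus = false;
session.addEventListener('focus', () => { hasFocus = true; });  // hypothetical event
session.addEventListener('blur',  () => { hasFocus = false; }); // hypothetical event

session.requestAnimationFrame(function onFrame(time, frame) {
  const pose = frame.getViewerPose(refSpace); // refSpace obtained earlier; poses flow to every layer
  if (hasFocus) {
    // Process hit-test results and select events here; a backgrounded app
    // knows "no hits" means "no focus", not "nothing was hit".
  }
  session.requestAnimationFrame(onFrame);
});
```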
I agree with currently focusing on the second reason, especially as there is a lot of utility already for that and it is a point we can get to based on the current direction WebXR is heading. We can look in the future at how to improve on that. Can't the type of immersive session change for a multi-app environment and, in that way, let the developer know?
I had the opportunity to meet up with @NellWaliczek and @toji at SIGGRAPH this week and asked for their thoughts on this. From my understanding of their response (please correct me if I'm wrong), they have heavy concerns around the same issues that Trevor brought up and believe that the declarative model may be the right direction moving forward, and that this would likely be external to the WebXR spec. This topic is something that the Immersive Web Working Group will likely be looking into as native AR/VR platforms get more mature. @blairmacintyre Argon looks very interesting (I had never heard of it and am glad others have concerns similar to mine 👍). I think experimental browsers like this or Exokit are the right way to push for desired new features or to create an entirely new platform for app development. If scenarios can be developed here that make users go out of their way to install, it is a good way to prove value. Restricting input to a single app at a time seems like a good way to avoid accidental input noise; I'd be interested in how this might work for playing a 3D game while watching a 3D video. @AlbertoElias The Lumin runtime is very cool, the first AR headset to ship 3D multitasking support that I'm aware of! One concern, though, is whether, as other platforms ship similar features, their runtimes will be similar enough that a browser spec can feasibly deliver useful content that works on all of them. This is one of the reasons why RGB+depth compositing seems like the right API approach to me, as the contract with the OS may be less complex.
Yes, I don't currently see a way to use the session-oriented WebXR APIs as the basis for the sort of long-lived, simultaneously running XR apps that I think would be super interesting. I'm knocking around some ideas for an internal project that uses various features that already exist in headset browsers (workers, home environments, etc.) as a way for UAs to host such apps, but the ideas are still very early. The rough idea is that the web dev declares a set of 3D and audio assets and a script. The UA loads the assets (using its own loader, not one provided by the developer) and then loads the script in a worker. There would be a specified message protocol between the UA and the worker script, with the UA providing input messages like 'hover' and 'activate' and the script providing messages about modifying aspects of the assets like changing texture IDs, posing meshes, and starting and stopping audio assets. The UA would then be totally in charge of managing and rendering the 3D and audio assets. It would also handle input and choose how to indicate to the user that input is going to a specific app. The UA would also manage the lifecycle of the app scripts. It's an essentially different model for XR apps than page-based WebXR sessions, but I feel like it's a worthwhile experiment.
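To give a feel for the shape of that protocol, here is a minimal sketch of the worker side; the message names ('hover', 'activate', 'set-texture', 'play-audio', 'pose-mesh') and the overall structure are assumptions about one possible experiment, not an existing API:

```js
// Hypothetical app script, loaded by the UA into a worker. The UA owns and
// renders the declared assets; the script only exchanges messages with it.
self.addEventListener('message', (event) => {
  const msg = event.data;
  switch (msg.type) {
    case 'hover':    // UA: the user is pointing at a named node
      self.postMessage({ type: 'set-texture', node: msg.node, textureId: 'highlight' });
      break;
    case 'activate': // UA: the user selected a named node
      self.postMessage({ type: 'play-audio', assetId: 'click-sound' });
      self.postMessage({ type: 'pose-mesh', node: msg.node, position: [0, 1.2, -0.5] });
      break;
  }
});
```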
Not sure if it was mentioned, but the potential security threat of having multiple apps receive input at the same time is if one app has some custom 3D input method for passwords or login or other sensitive information. Even if the malicious app can't read back the framebuffer, it could potentially use machine learning to, for example, recover the password from gaze, or else reduce the search space statistically. And yet if we go by the model of each app just being given its own framebuffer and a pose, apps can't even render their scene to be composed without that head pose. This suggests to me a different rendering model when apps are composited. Instead of telling the UA how to render, an app would need to tell the UA what to render. In other words, instead of a framebuffer, something like a DOM for 3D objects (glTF?) where each app adds objects to its own DOM in a UA-provided coordinate space (an app might ask for a world anchor or head space, etc.) and then the UA merges the DOMs for rendering. Apps can read back only their own DOM and only one app gets input at a time. Because the UA knows more about the scene, this also makes it easier for the UA to visualize app ownership of 3D objects, optionally limit apps to an enclosure in world space, and let the user hide/show apps. It would also help the UA manage performance on mobile devices. Of course this would be more limiting than the current WebXR model, so I see the two working side by side, but those apps would need to request full access and then run one app at a time like today.
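A rough sketch of what that "tell the UA what to render" model could look like from an app; every call here (requestSharedSpace, addModel, setTransform) is invented purely to illustrate the idea that the app hands the UA declarative content plus an anchor request instead of a framebuffer:

```js
// Hypothetical declarative model: the app never receives a framebuffer or the
// head pose, so gaze can't be inferred by a backgrounded or malicious app.
const space = await navigator.xr.requestSharedSpace({ anchor: 'world' }); // invented API

// The app adds glTF-described objects to its own scene graph; the UA merges
// all apps' graphs, tracks which origin owns which object, and can clamp the
// app to an enclosure or hide it entirely.
const rocket = await space.addModel('https://example.com/rocket.glb'); // invented; URL is a placeholder
rocket.setTransform({ position: [0, 1, -2] });

// Input arrives per object, and only while this app holds input focus.
rocket.addEventListener('select', () => {
  rocket.setTransform({ position: [0, 1.5, -2] });
});
```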
Several good points from the Feb 12th call, as well as the F2F:
It's been a while. Is this something that can be added to the TPAC agenda? /tpac @AdaRoseCannon
We should have time to discuss this. At least talk about where things like this have appeared in recent operating systems. It seems there are some cases where this is already happening and it doesn't need explicit API support. |
Has support for multitasking in WebXR already been discussed for later versions of the spec? I searched online but haven't found any discussions.
Background:
Today's WebVR/WebXR applications are limited to fully immersive experiences that take full control of the user's environment. On mobile or desktop, workflows often involve some form of multitasking, such as running two applications side by side to get the productivity boost of not having to switch between apps to consume their content concurrently. Existing AR/VR desktop environments such as the Windows Mixed Reality home or Oculus Dash already allow multitasking web browsers in 2D (e.g. https://www.youtube.com/watch?v=SvP_RI_S-bw), so it would be natural to provide similar support for 3D web content through a browser.
Use Cases:
In general, most WebXR applications could provide some support for multitasking, similar to desktop applications.
Comparison Web Shopping:
One furniture store website would display a couple of interactive 3D chairs (change fabric, fold up, etc.), and another would do the same. The user could then compare both offers directly without having to jump back and forth between apps.
Gaming and video watching:
One website would display a game such as flying an RC plane. Another would be presenting a video. Neither app may be exciting enough on its own, but each enhances the experience of whatever it is running concurrently with.
Feed notifications:
A feed-based website could be launched and then placed off to the side while using other applications. When a new item is added to the feed, it could indicate that new content is ready.
Proposed approach:
In order to get basic multitasking support, an app should have a way to provide what the user is seeing (possibly color and depth textures) to the browser for it to be composited on top of other apps. Some mechanism for window management might be needed, where each WebXR application's position could be controlled externally to avoid heavily overlapping apps. Input for the apps would need to handle scenarios such as two apps having buttons in front of or behind each other while the user selects with a ray that passes through both buttons. Some method to retrieve information about the environment the apps are running in would also be needed (e.g. lighting, to ensure multiple apps are consistently lit).
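As a minimal sketch of this per-frame contract under the assumptions above (submitFrame, getEnvironmentLighting, and the texture plumbing are invented names; renderScene stands in for the app's own renderer):

```js
// Hypothetical per-frame flow for the color+depth compositing proposal.
function onFrame(time, frame) {
  const pose = frame.getViewerPose(refSpace);              // refSpace set up earlier
  const lighting = session.getEnvironmentLighting();       // invented: shared lighting so apps are lit consistently
  renderScene(pose, lighting, colorTexture, depthTexture); // app-defined renderer writing into its own textures
  session.submitFrame({ color: colorTexture, depth: depthTexture }); // invented: write-only hand-off to the UA compositor
  session.requestAnimationFrame(onFrame);
}
session.requestAnimationFrame(onFrame);
```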
Here is an old prototype video I made with rough use cases: https://www.youtube.com/watch?v=R3xZ1G291Ks