Considerations for Accessibility in Hubs
For users who are deaf or hard of hearing, the reliance on audio within a Hubs environment makes it impossible to interact with others in the space. While we do use visual indicators to show who is speaking at a given time, it is currently impossible to do much more than turn up the volume on users who need to be heard.
Using a machine learning speech-to-text service, we could automatically caption the avatar audio from another user. Tied to the speaker's username, these captions would provide a data store that transcribes spoken audio into a format that does not rely on full hearing. The captions could be presented in a user-configurable manner depending on the needs of the user, including:
- Spatialized: displayed over or on the avatar who is speaking. While this positioning would make it very quick to map the avatar to the information, it does have tradeoffs. One challenge is that it would generally require the speaking avatar to be in the user’s view frustum; if the avatar were behind the user, information would likely be missed while the application alerted the user that someone outside of their view is speaking. This mechanic would also require users to be relatively close to the speaker in order to read the diegetic captions.
- Non-spatialized: a HUD / 2D window callout. This option would present the captions as, e.g., a tab in the chat window or a separate window (either a React component or browser instance) for a user to refer to during a conversation. The spoken audio would be transcribed in a way that is decoupled from the spatial environment and works independently of where the user is looking within the scene.
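The tradeoff between the two presentations could be handled by falling back to the HUD whenever the speaker cannot be read in-world. The following is a minimal sketch under stated assumptions, not Hubs code: the `Viewer` type, the `chooseCaptionPlacement` function, the cone-shaped approximation of the view frustum, and the `maxReadableDistance` cutoff are all hypothetical.

```typescript
// Hypothetical types and names; none of these come from the Hubs codebase.
type Vec3 = { x: number; y: number; z: number };
type CaptionPlacement = "spatial" | "hud";

interface Viewer {
  position: Vec3;
  forward: Vec3;               // unit vector the camera is facing
  halfFovRadians: number;      // view frustum approximated as a cone
  maxReadableDistance: number; // assumed limit for reading in-world text
}

// Caption over the avatar only when the speaker is in view and close enough;
// otherwise fall back to the non-spatialized (HUD) presentation.
function chooseCaptionPlacement(viewer: Viewer, speaker: Vec3): CaptionPlacement {
  const dx = speaker.x - viewer.position.x;
  const dy = speaker.y - viewer.position.y;
  const dz = speaker.z - viewer.position.z;
  const distance = Math.sqrt(dx * dx + dy * dy + dz * dz);

  if (distance === 0 || distance > viewer.maxReadableDistance) {
    return "hud";
  }

  // Cosine of the angle between the view direction and the speaker direction.
  const cosAngle =
    (dx * viewer.forward.x + dy * viewer.forward.y + dz * viewer.forward.z) / distance;
  return cosAngle >= Math.cos(viewer.halfFovRadians) ? "spatial" : "hud";
}
```

With this approach a user who prefers spatialized captions would still receive text for a speaker standing behind them, at the cost of the caption moving between two places on screen.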
When multiple audio sources are playing in a room at the same time, for example a video and an avatar speaking simultaneously, it can be difficult to follow the various audio threads. One option that could be beneficial is the ability to enter a mode that isolates audio to a single source (avatar or video). The source could be selected by navigating the 3D space or through the user list, which could be updated to indicate who is speaking and automatically focus on the avatar who is talking (similar to how video conferencing solutions work).
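As a rough illustration of the isolation mode, the sketch below computes a per-source gain from a focused source and picks the loudest source for automatic focus. The `AudioSource` shape, the function names, and the idea of a normalized `level` value are assumptions made for the example; they do not describe how Hubs manages audio.

```typescript
// Hypothetical model of the audio sources in a room.
interface AudioSource {
  id: string;
  kind: "avatar" | "video";
  level: number; // assumed normalized loudness, 0..1
}

// Gain per source: everything plays when nothing is focused,
// otherwise only the focused source is audible.
function isolationGains(
  sources: AudioSource[],
  focusedId: string | null
): Record<string, number> {
  const gains: Record<string, number> = {};
  for (const source of sources) {
    gains[source.id] = focusedId === null || source.id === focusedId ? 1 : 0;
  }
  return gains;
}

// Automatic focus: the loudest source at or above a minimum level, if any.
function loudestSource(sources: AudioSource[], minLevel: number): string | null {
  let best: AudioSource | null = null;
  for (const source of sources) {
    if (source.level >= minLevel && (best === null || source.level > best.level)) {
      best = source;
    }
  }
  return best === null ? null : best.id;
}
```

A real implementation would likely fade gains rather than switch them abruptly, and would hold focus for a short time so that brief pauses do not cause the focus to jump between speakers.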
We should consider the ways that media with audio is presented in a Hubs room. While there are use cases where synchronized media that begins at the same time for everyone is the highest priority, this can also be a negative experience. Although it would likely add synchronization complexity, we may want to allow a global media volume setting per client, or the ability to turn auto-play off for videos and audio on a specific client.
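One way to think about such client-side settings is as a layer applied on top of the shared room state, so that synchronization itself is left untouched. The sketch below assumes hypothetical `ClientMediaPrefs` and `MediaState` shapes and a `startedByUser` flag; none of these names come from Hubs.

```typescript
// Hypothetical per-client preferences; stored locally, never synchronized.
interface ClientMediaPrefs {
  globalVolume: number; // 0..1 multiplier applied to all media
  autoPlay: boolean;    // when false, media waits for an explicit user action
}

// Hypothetical shared state for one media object in the room.
interface MediaState {
  volume: number;  // 0..1, as set in the room
  playing: boolean;
}

// Derive what this client should actually do from the shared state.
function effectiveMediaState(
  roomState: MediaState,
  prefs: ClientMediaPrefs,
  startedByUser: boolean
): MediaState {
  const scaled = roomState.volume * prefs.globalVolume;
  const volume = Math.min(1, Math.max(0, scaled));
  const playing = roomState.playing && (prefs.autoPlay || startedByUser);
  return { volume, playing };
}
```

Because the shared playback position is not modified, a user who starts the media manually would join at the room's current position rather than from the beginning.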
Across the different modalities that Hubs runs on, there are a number of different input mechanics that can be used to interact with one’s avatar and the other users and objects within a space. For example, a mobile phone might allow a user to navigate by a pinch-and-zoom mechanic, while a 3DoF VR headset would provide a single controller to the user and more complex headsets would allow gestures. This means that ultimately, Hubs will need to support accessibility best practices on mobile and desktop (which are generally fairly well understood) in addition to helping uncover and support best practices in VR locomotion. Examples of accessible input mechanics that could be supported within Hubs include:
- Being able to navigate the lobby and room entry flow using only a keyboard
- Supporting full controls with just one hand controller in a 6DoF system
- Implementation of a binary movement/interaction system that uses three different signal lengths to select the correct command (based on the sip-and-puff input mechanic, ideally accessible via any combination of assistive technology controllers that can be used with browser apps)
- Ability to navigate through an environment with voice commands
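The binary input system in the list above can be sketched as a classifier that turns the duration of a single on/off signal into one of three commands. The thresholds and the command names below are hypothetical placeholders chosen for illustration; a real system would make both configurable per user.

```typescript
// Hypothetical signal classes and thresholds for a single binary switch.
type SignalLength = "short" | "medium" | "long";

interface SignalThresholds {
  shortMaxMs: number;  // signals up to this duration count as short
  mediumMaxMs: number; // signals up to this duration count as medium
}

// Classify how long the switch (key, sip, puff, button) was held.
function classifySignal(durationMs: number, thresholds: SignalThresholds): SignalLength {
  if (durationMs <= thresholds.shortMaxMs) return "short";
  if (durationMs <= thresholds.mediumMaxMs) return "medium";
  return "long";
}

// Assumed example mapping from signal length to a command.
const commandForSignal: Record<SignalLength, string> = {
  short: "select",
  medium: "next-target",
  long: "move-forward",
};

function commandFor(durationMs: number, thresholds: SignalThresholds): string {
  return commandForSignal[classifySignal(durationMs, thresholds)];
}
```

Keeping the classifier independent of the device means that any controller able to produce a timed press in the browser could drive the same set of commands.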
Many 3D environments are currently inaccessible to blind or low-vision users. In Hubs in particular, the prevalence of user-generated content and media composition tools can result in rooms and environments that are visually very busy: scenes can contain countless 3D and 2D objects, including web pages and images, videos, other avatars, and more.
One accessibility consideration that we may want to adapt for Hubs is the concept of aria-hidden: the ARIA attribute that marks which HTML elements on a given webpage are hidden from assistive technologies such as screen readers. While some elements in a scene may be important for a room’s context at any particular time, not all content is. It would likely be impossible to apply a general rule that determines this without knowledge of a room’s context, but we could explore opportunities to reduce the number of objects rendered to a user at a given time. Some approaches to this could be:
- Hiding the scene elements from a room, displaying only the avatars. This would have implications related to room navigation as well as collisions (perhaps a modified flying mode could address some of these issues)
- Providing Spoke authors the ability to mark things in the scene as essential or non-essential. Clients could surface an option that does not render non-essential items, which would provide more control than simply turning off all visual elements in the scene. This may also be helpful for reducing the overall bandwidth that a scene requires, but introduces the idea that not all users might see the same thing or share the “same” reality.
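The essential / non-essential idea could be sketched as a simple client-side filter. The `SceneObject` shape and the `essential` flag are hypothetical and do not correspond to an existing Spoke or Hubs component. The sketch also assumes that unmarked objects are treated as essential, so that existing scenes would render unchanged.

```typescript
// Hypothetical scene object with an optional author-provided flag.
interface SceneObject {
  name: string;
  essential?: boolean; // assumed default: essential when unmarked
}

// Return the objects a client should render for the given user preference.
function objectsToRender(objects: SceneObject[], hideNonEssential: boolean): SceneObject[] {
  if (!hideNonEssential) {
    return objects;
  }
  return objects.filter((object) => object.essential !== false);
}
```

Since the filter runs per client, two users in the same room may render different sets of objects, which is the shared-reality tradeoff noted above.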
There are several ways that we could approach description of scene elements to make Hubs rooms more accessible through assistive technologies. One way of doing this could be to provide a method for Spoke scene creators to include descriptive text elements in the components that they add to a scene, which could then be read by Hubs when a user requests information about what is in their camera’s view frustum. Alternatively, we could try to implement an automatic captioning service that would attempt to describe a scene automatically, but this has limitations. We would also need to consider various ways that the Hubs client could surface this information in a useful way - perhaps as captions on hover, or spoken aloud.
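For the author-provided description approach, a client could assemble a short summary of what is in view and hand it to a caption display or a speech synthesizer. The sketch below is an assumption-laden illustration: the `DescribedObject` shape, the `inView` flag (which a real client would compute from the camera), and the wording of the summary are all hypothetical.

```typescript
// Hypothetical object carrying an optional author-written description.
interface DescribedObject {
  name: string;
  description?: string;
  inView: boolean; // assumed to be computed elsewhere from the camera frustum
}

// Build a text summary of the objects currently in view.
function describeView(objects: DescribedObject[]): string {
  const inView = objects.filter((object) => object.inView);
  if (inView.length === 0) {
    return "Nothing is in view.";
  }

  const described = inView.filter(
    (object) => object.description !== undefined && object.description !== ""
  );
  const parts = described.map((object) => object.name + ": " + object.description);

  // Report undescribed objects so the user knows the summary is incomplete.
  const undescribedCount = inView.length - described.length;
  if (undescribedCount > 0) {
    parts.push(undescribedCount + " object(s) without a description");
  }
  return parts.join(". ") + ".";
}
```

The same string could be shown as a caption on hover or spoken aloud, depending on the user's preference.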