Support capturing audio output from sound card #629

Closed
guest271314 opened this issue Oct 9, 2019 · 20 comments

@guest271314

Add the capability to capture audio output from sound card. Whether headphones are plugged in or not it should be possible to capture audio output from the system soundcard.

navigator.mediaDevices.getUserMedia({audio:{speakers:true}})

or

navigator.mediaDevices.getUserMedia({audio:{headphones:true}})

or

navigator.mediaDevices.getUserMedia({audio:{output:true}})

Fixes

@alvestrand
Contributor

Why does this need special treatment?
Wouldn't a sound card just be another input device?

@guest271314
Author

When let stream = await navigator.mediaDevices.getUserMedia() is executed, the options currently available are microphone and camera. Once enumerateDevices() has been used to get the deviceId of an "audiooutput" device, stream = await navigator.mediaDevices.getUserMedia({audio:{deviceId:{exact: deviceId}}}) needs to be executed again.

There should be a means to directly select "audiooutput" at the prompt by default, rather than leaving it to implementer design alone (Firefox might provide such a UI; Chromium and Chrome, where the devicechange event is not dispatched on mediaDevices, do not).

It is not immediately clear from the specification that, to select "audiooutput", navigator.mediaDevices.getUserMedia({audio:<settings>}) needs to be executed twice to get only audio output and not input from a microphone.
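For readers following along, the two-call flow described above can be sketched roughly as follows. This is only an illustration: the function names constraintsForFirstAudioOutput and captureFirstAudioOutput are hypothetical, the mediaDevices object is passed in as a parameter (navigator.mediaDevices in a page), and whether a browser actually honors an "audiooutput" deviceId in getUserMedia() is an assumption that depends on the implementation.

```javascript
// Pure helper: build getUserMedia() constraints that name the first
// "audiooutput" entry from an enumerateDevices() result.
// Returns null when the list has no usable audiooutput entry.
function constraintsForFirstAudioOutput(devices) {
  const output = devices.find(
    (device) => device.kind === "audiooutput" && device.deviceId !== ""
  );
  return output ? { audio: { deviceId: { exact: output.deviceId } } } : null;
}

// The two-call flow. `mediaDevices` is passed in so the flow can be
// exercised with a stand-in object outside a browser.
async function captureFirstAudioOutput(mediaDevices) {
  // First call: device ids and labels are generally hidden until the
  // user has granted permission at least once.
  const first = await mediaDevices.getUserMedia({ audio: true });
  first.getTracks().forEach((track) => track.stop());

  const constraints = constraintsForFirstAudioOutput(
    await mediaDevices.enumerateDevices()
  );
  if (constraints === null) {
    throw new Error("no audiooutput device listed");
  }
  // Second call: whether an audiooutput id is honored here is
  // implementation-dependent (an assumption, not specified behavior).
  return mediaDevices.getUserMedia(constraints);
}
```

In a page this would be called as captureFirstAudioOutput(navigator.mediaDevices). The first stream is stopped immediately because it is only needed to unlock the device list.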

What is the canonical procedure to capture audio output (not input from a microphone) from the system?

@guest271314
Author

guest271314 commented Oct 12, 2019

@alvestrand Firefox lists

Monitor of Built-in Audio Analog Stereo

(emphasis added) at getUserMedia() prompt, Chromium does not.

How To Record Speaker Output on Linux

Note: monitor means turning the output of the speakers into an input.

Firefox throws an OverconstrainedError when exact is used to set the deviceId constraint.
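A rough sketch of how a page might look for such a monitor source and request it without exact. The helper names findMonitorInputs and idealConstraintsFor are hypothetical, and the "Monitor of" label prefix is an assumption taken from the Firefox prompt quoted above; labels vary by platform and are only exposed after permission has been granted.

```javascript
// Assumption: the monitor source is listed as an "audioinput" device whose
// label starts with "Monitor of" (the label Firefox shows on Linux above).
function findMonitorInputs(devices) {
  return devices.filter(
    (device) =>
      device.kind === "audioinput" && device.label.startsWith("Monitor of")
  );
}

// A bare deviceId is treated as an "ideal" constraint, so the browser may
// fall back to another device instead of rejecting the request.
function idealConstraintsFor(device) {
  return { audio: { deviceId: device.deviceId } };
}
```

Because an ideal constraint permits fallback, the returned track may still come from a different device, so the result should be checked rather than assumed.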

@guest271314
Author

At https://w3c.github.io/mediacapture-main/#methods-5

6.5. Request permission to use a PermissionDescriptor with its name member set to the permission name associated with kind (e.g. "camera", "microphone"), and, optionally, its deviceId member set to the device's deviceId, while considering all devices attached to a live and same-permission MediaStreamTrack in the current browsing context to have permission status "granted", resulting in a set of provided media. Same-permission in this context means a MediaStreamTrack that required the same level of permission to obtain as what is being requested (e.g. not isolated).

does a PR need to be filed to explicitly include "audiooutput" at

with kind (e.g. "camera", "microphone")

?

@youennf
Contributor

youennf commented Oct 24, 2019

What is the canonical procedure to capture audio output (not input from a microphone) from the system?

getUserMedia is focusing on microphones and cameras, extending it to audiooutput would probably not work well.

The getDisplayMedia spec provides the ability to capture system sound, see https://w3c.github.io/mediacapture-screen-share/#dom-mediadevices-getdisplaymedia.

@guest271314
Author

@youennf

getUserMedia is focusing on microphones and cameras, extending it to audiooutput would probably not work well.

It is already possible to capture audio output with getUserMedia() and enumerateDevices().

It is simply not clear that procedure is possible. Chromium only lists microphone and camera at UI prompt. "Speakers" or "Audio Output" should be listed, where the equivalent of enumerateDevices() is executed internally to select "audiooutput" device when that option is selected at getUserMedia() UI permission prompt.

@guest271314
Author

The getDisplayMedia spec provides the ability to capture system sound, see https://w3c.github.io/mediacapture-screen-share/#dom-mediadevices-getdisplaymedia.

Is the suggestion to file an issue at that specification?

Attempted to get audio using getDisplayMedia() with two different approaches so far. Executing getDisplayMedia() once does not resolve to a MediaStream containing an audio track. Executing getDisplayMedia() twice (once while audio is being output, then navigating to the prompt again to select the same display device) does not help either: where the output is not known in advance, e.g., audio in the form of speech, at least half of the audio has already been output before the second call resolves, and that media stream contains no audio track.
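The check described above (does the resolved stream carry an audio track at all?) could look something like this. The names summarizeTracks and requestDisplayWithAudio are hypothetical, and mediaDevices stands for navigator.mediaDevices; since audio capture in getDisplayMedia() is optional for implementations, a count of zero audio tracks is a normal outcome rather than an error.

```javascript
// Count the audio and video tracks in a MediaStream-like object.
function summarizeTracks(stream) {
  return {
    audio: stream.getAudioTracks().length,
    video: stream.getVideoTracks().length,
  };
}

// Ask for display capture with audio and report what was delivered.
async function requestDisplayWithAudio(mediaDevices) {
  const stream = await mediaDevices.getDisplayMedia({
    video: true,
    audio: true,
  });
  const counts = summarizeTracks(stream);
  if (counts.audio === 0) {
    // The browser or the chosen surface did not provide audio.
    console.warn("getDisplayMedia() resolved without an audio track");
  }
  return { stream, counts };
}
```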

What is the canonical procedure per any specification or implementation relevant to media streams and capture origin to capture only audio output?

@jan-ivar
Member

From #651 (comment):

@jan-ivar How do you suggest to proceed to specify that audio output can be captured?

I don't think mediacapture-main should specify generic capture of audio output from a browser or document without the context of a compelling user problem that it best solves.

Especially not to work around another web API failing to provide adequate access¹ to audio it generates, to solve a use case that seems reasonable in that spec's domain.

Instead, I'd ask that API to solve it without workarounds. Feel free to refer to mozilla/standards-positions#170 (comment).

The audio capture support in getDisplayMedia is not a good fit, as it's A) aimed at the screen-sharing use-case B) optional, and C) solely complementary to video.


1. In the form of a MediaStreamTrack or otherwise. A precedent here is MediaStreamAudioDestinationNode which gives JS a track with web audio.
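As a small illustration of the precedent named in the footnote, this is roughly how a page obtains a MediaStreamTrack carrying Web Audio output. The function name trackFromWebAudio is hypothetical; createMediaStreamDestination() and the stream property on the resulting node are the standard Web Audio API members.

```javascript
// Route a Web Audio node into a MediaStreamAudioDestinationNode and
// return the audio track that carries the node's output.
function trackFromWebAudio(audioContext, sourceNode) {
  const destination = audioContext.createMediaStreamDestination();
  sourceNode.connect(destination);
  return destination.stream.getAudioTracks()[0];
}
```

In a page, something like const ac = new AudioContext(); const osc = ac.createOscillator(); const track = trackFromWebAudio(ac, osc); osc.start(); would yield a track with the oscillator's audio.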

@guest271314
Author

@jan-ivar

I don't think mediacapture-main should specify generic capture of audio output from a browser or document without the context of a compelling user problem that it best solves.

Compelling use cases are listed at #651 (comment).

As previously stated, that functionality is already possible at both Firefox and Chromium.

What is missing is the specification acknowledging the technical facts and canonical example of how to do so.

There's no way to recognize speech from audio sources other than local mic, e.g. from a WebRTC call.

There's no way to pre-process audio—e.g. with web audio—before feeding it to the speech api.

are not true and correct.

@guest271314
Author

@jan-ivar One example proof of

There's no way to recognize speech from audio sources other than local mic, e.g. from a WebRTC call.

There's no way to pre-process audio—e.g. with web audio—before feeding it to the speech api.

being not true and correct https://bugzilla.mozilla.org/show_bug.cgi?id=1604994#c5.

Are you stating that the non-exhaustive list of use cases at #651 (comment) are not compelling?

Are you asking for an exhaustive list of individuals, institutions which have published use cases for capturing audio output?

@jan-ivar
Member

"that it best solves". I think those are compelling use cases for web speech to solve cleanly.

@guest271314
Author

@jan-ivar There is no bright-line rule that the input to SpeechRecognition must be from a live human voice. https://stackoverflow.com/a/47113924.

Consider an individual with vision impairment. They have a book they want to read or write. They can feed the text of the book to SpeechRecognition via speechSynthesis. Before feeding the text to speechSynthesis they can modify the plain text or Braille input, capture the audio output and test the result of SpeechRecognition prior to publishing their work.

In reverse, audio output can be converted to plain text (in one or more languages) or Braille, etc.

Without the ability to capture audio output it becomes difficult to test input and output.

I think those are compelling use cases for web speech to solve cleanly.

Well, you can refer to the document that you cited mozilla/standards-positions#170 (comment), in this case the author of the post is correct in their analysis

I’m not sure how versed you are at reading specs, but if you take a look at the actual spec you will see that there are parts of the API that are either impossible to implement in an interoperable manner or the spec doesn’t say what to do: to be blunt, the spec hardly qualify as a spec at all... its more of a wish list thinly disguised as technical speciation only because it uses a W3C stylesheet: There are no algorithms. There is basically zero specified error handling. The eventing model is a total mystery. And much of it is just hand waving that magical things will happen and speech will somehow be generated/recognized (see the grammars section of the spec for a good hardy chuckle).

Essentially, the Web Speech API is dead. While a novel and worthy start, there are several issues with the current specification.

Media Capture and Streams is well-suited to take on the task of filling in the holes, which are actually in accord with what is already possible.

Am still not gathering the reluctance to acknowledge that the procedure is already technically possible.

@guest271314
Author

Am performing due diligence before writing TTS/SST for the web platform from scratch. The technology and infrastructure already exist within the body of Media Capture and Streams, so to avoid repeating what is already technically possible, all that is really needed at this point, relevant to capturing audio output under the umbrella of this specification, is the acknowledgement that that behaviour is already technically possible. Once that acknowledgment is made, the canonical algorithm to do so can be incorporated into the specification officially.

Have listed compelling use cases for the subject matter of TTS/SST, though audio capture is not limited to those use cases alone. How users utilize the functionality, once unequivocally specified, is up to them. It does not appear to be a difficult task to simply amend the specification to acknowledge what is already possible, even if those possible outputs were/are unintended, and to make sure that an algorithm is clearly defined to guide the implementation of the edge cases, if you will, or unintended consequences of the power of this API to be useful in more than only the perhaps limited intent conceived by the original authors of the technical document.

It is mathematically impossible to conceive of all of the possible use cases from within the official body, hierarchical structure, or any variance of a system itself, no matter the field or domain of human activity. Though try to avoid citing secondary sources, in this case Wikipedia provides a concise synopsis of the indisputable mathematical fact demonstrated by Kurt Gödel's Incompleteness Theorem (On Formally Undecidable Propositions of "Principia Mathematica" and Related Systems, 1931):

  1. If a (logical or axiomatic formal) system is consistent, it cannot be complete.
  2. The consistency of axioms cannot be proved within their own system.

@aboba
Contributor

aboba commented Jan 9, 2020

Jan-Ivar said:

"I don't think mediacapture-main should specify generic capture of audio output from a browser or document without the context of a compelling user problem that it best solves.

Especially not to work around another web API failing to provide adequate access¹ to audio it generates, to solve a use case that seems reasonable in that spec's domain.

Instead, I'd ask that API to solve it without workarounds. Feel free to refer to mozilla/standards-positions#170 (comment).

The audio capture support in getDisplayMedia is not a good fit, as it's A) aimed at the screen-sharing use-case B) optional, and C) solely complementary to video."

[BA] Closing this issue.

@aboba aboba closed this as completed Jan 9, 2020
@guest271314
Author

@aboba The closure of this issue leads to other questions.

If capturing from a sound card were not possible at all, and if implementations did not provide a means to create fake media devices and fake input streams available to the main thread, then closure of this would be in accord with the position that capturing from an output device (or any other device, virtual or otherwise) is simply not possible under the umbrella of this specification.

However, since browsers already do allow for setting a fake media device and fake input media stream the question then becomes what is the canonical procedure to create a specification compliant fake media device and fake media stream from a file (https://chromium.googlesource.com/chromium/src/+/4cdbc38ac425f5f66467c1290f11aa0e7e98c6a3/media/audio/fake_audio_output_stream.cc; https://chromium.googlesource.com/chromium/src/+/4cdbc38ac425f5f66467c1290f11aa0e7e98c6a3/media/audio/fake_audio_manager.cc; https://stackoverflow.com/a/40783725).

It is not as if the substance of this feature request is not already possible. Am asking for the functionality to be officially standardized, instead of now having to embark on writing and implementing code outside of the official standard (https://github.com/auscaster/webrtc-native-to-browser-peerconnection-example), if only to prove the requirement is possible. Not specifying the same will only lead to disparate code in the wild that will eventually lead right back to this specification.

chromium-browser --allow-file-access-from-files --autoplay-policy=no-user-gesture-required --use-fake-device-for-media-stream --use-fake-ui-for-media-stream --use-file-for-fake-audio-capture=$HOME/test.wav%noloop --user-data-dir=$HOME/test 'file:///home/user/testUseFileForFakeAudioCaptureChromium.html'

navigator.mediaDevices.getUserMedia({audio: true})
.then(mediaStream => {
  const ac = new AudioContext();
  const source = ac.createMediaStreamSource(mediaStream);
  source.connect(ac.destination);
});

The problem is that the implementation has several bugs that could be fixed by the official specification body, to allow for creation of a MediaStream (MediaStreamTrack) directly from a file using, if necessary, a created fake device. In part, this is because this specification refuses to acknowledge that capturing Monitor from <device> is a concrete use case that is well within the purview of this standard.

Will ask the question ("How to create a fake media device and MediaStream from a file") officially in an issue.

@guest271314
Author

@aboba

Instead, I'd ask that API to solve it without workarounds.

As already stated, the Web Speech API is essentially dead. Filed issues over a year ago to implement SSML parsing at both Chromium and Firefox. The fix is relatively simple and the patch has already been posted (https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18).

ChromiumOS authors, again, focus on extension code using espeak-ng https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome, instead of actually working on the Web Speech API.

Besides, am banned from WICG for 1,000 years, and when joined the W3C (https://lists.w3.org/Archives/Public/public-wicg/2019Oct/0000.html) to contribute to Web Speech API, was un-joined from that parent organization, for fraudulent reasoning easily proven hypocritical (image: w3c_real_name).

Have zero confidence that W3C and especially WICG are operating in a non-biased manner.

@guest271314
Author

@aboba Though, as usual, the closure of this official feature request will probably be beneficial in the end, in order to be free of the (at times moving, and occasionally arbitrary) constraints of any specification when trying to meet a requirement for a use case. Performed due diligence in any case in an attempt to get what is already possible actually specified.

@ivyfae

ivyfae commented Aug 4, 2020

"that it best solves". I think those are compelling use cases for web speech to solve cleanly.

I'm not sure what the etiquette is for commenting on closed issues, so let me know if this would be better posted elsewhere, but I have a use case that has not been considered and is completely unrelated to web speech: music production, video editing, and other similar work. Many music producers record or stream the music production process for educational purposes, or for live performances. This is even more common now that so many musicians aren't able to perform live due to COVID. In these cases, there doesn't appear to be any viable substitute for capturing the main output of an external audio interface/sound card.

Why does this need special treatment?
Wouldn't a sound card just be another input device?

If you want the sound card input, yes. But in use cases related to music production or video editing, if I want to stream or record, the sound card's main output is going to be the only useful audio to capture. None of the other sources will have all parts of the master mix. The DAW needs to use the resources the sound card provides directly instead of just passing audio data to it like most audio-light applications do, so the sound can't be grabbed using the 'desktop' generic source.

In the specific application I'm writing I was disappointed to find that calling getUserMedia with an audiooutput device ID confusingly gave up and instead gave me a stream containing audio from my laptop's built-in mic.
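One way a page might detect this kind of silent substitution is to compare the delivered track's settings against the requested id. This is a sketch only: deliveredDeviceMatches is a hypothetical helper, and it assumes the browser reports deviceId in MediaStreamTrack.getSettings(), which is not guaranteed for every track.

```javascript
// Return true only when the first audio track in `stream` reports the
// device id that was requested.
function deliveredDeviceMatches(stream, requestedDeviceId) {
  const [track] = stream.getAudioTracks();
  if (track === undefined) {
    return false; // no audio track was delivered at all
  }
  return track.getSettings().deviceId === requestedDeviceId;
}
```

A false result after requesting an audiooutput id would indicate the fallback to another device, such as the built-in microphone, described above.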

@guest271314
Author

@ivyfae See also #708.

@guest271314
Author

@alvestrand

Why does this need special treatment?
Wouldn't a sound card just be another input device?

Agree.

Reading this language https://chromium-review.googlesource.com/c/chromium/src/+/1064373/

PulseAudio: Filter out unavailable inputs and refuse to open monitor inputs

If there are no available inputs, PulseAudio will, for some reason,
select the monitor of the current default sink as the default source.
...
this CL also explicitly returns
invalid AudioParameters for monitor devices, and explicitly fails to
open them as inputs - to be on the safe side.

it appears that Chromium just "refuses" to open certain devices rather than fix the actual problem with PulseAudio being described. Mozilla Firefox and Nightly do not have that problem, and using pavucontrol as a workaround confirms that selection of monitor devices is possible.

Perhaps you can actually fix whatever problem was reported at https://issuetracker.google.com/79580580 re PulseAudio in order for Chromium to be in alignment with your evaluation?
