MediaStreamAPI
Streams, RTC, audio API and media controllers
The ideas here build on Ian Hickson's WhatWG proposal, adding features partly inspired by the Mozilla audio API and the Chrome audio API. Unlike previous audio API proposals, the API presented here integrates with the proposed APIs for media capture from local devices and for peer-to-peer media streaming, handles audio and video in a unified framework, incorporates Worker-based JavaScript audio processing, and specifies synchronization across multiple media sources and effects. The API presented here does not include a library of "native" effects; those can be added as a clean extension to StreamProcessor, perhaps as a "level 2" spec.
The work here is nascent. Until a prototype implementation exists, this proposal is likely to be incomplete and possibly not even implementable.
Scenarios
These are higher-level than use-cases.
1) Play video with processing effect applied to the audio track
2) Play video with processing effects mixing in out-of-band audio tracks (in sync)
3) Capture microphone input and stream it out to a peer with a processing effect applied to the audio
4) Capture microphone input and visualize it as it is being streamed out to a peer and recorded
5) Capture microphone input, visualize it, mix in another audio track and stream the result to a peer and record
6) Receive audio streams from peers, mix them with spatialization effects, and play
7) Seamlessly chain from the end of one input stream to another
8) Seamlessly switch from one input stream to another, e.g. to implement adaptive streaming
9) Synthesize samples from JS data
10) Trigger a sound sample to be played through the effects graph ASAP but without causing any blocking
11) Synchronized MIDI + Audio capture
12) Synchronized MIDI + Audio playback (Would that just work if streams could contain MIDI data?)
13) Capture video from a camera and analyze it (e.g. face recognition)
14) Capture video, record it to a file and upload the file (e.g. YouTube)
15) Capture video from a canvas element, record it and upload (e.g. Screencast/"Webcast" or composite multiple video sources with effects into a single canvas then record)
Straw-man Proposal
Streams
The semantics of a stream:
- A window of timecoded video and audio data.
- The timecodes are in the stream's own internal timeline. The internal timeline can have any base offset but always advances at the same rate as real time, if it's advancing at all.
- Not seekable, resettable etc. The window moves forward automatically in real time (or close to it).
- A stream can be "blocked". While it's blocked, its timeline and data window do not advance.
- A stream can be "ended". While it's ended, it must also be blocked. An ended stream will not normally produce data in the future (although it might if the source is reset somehow).
We do not allow streams to have independent timelines (e.g. no adjustable playback rate or seeking within an arbitrary Stream), because that leads to a single Stream being consumed at multiple different offsets at the same time, which requires either unbounded buffering or multiple internal decoders and streams for a single Stream. It seems simpler and more predictable in performance to require authors to create multiple streams (if necessary) and change the playback rate in the original stream sources.
Hard case:
- Mix http://slow with http://fast, and mix http://fast with http://fast2; does the http://fast stream have to provide data at two different offsets?
- Solution: if a (non-live) stream feeds into a blocking mixer, then it itself gets blocked. This has the same effect as the entire graph of (non-live) connected streams blocking as a unit.
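As a rough illustration of the implication (a non-normative sketch using the captureStream() extension defined below), an author who wants the same resource consumed at two different playback rates creates two source elements, each producing its own stream, rather than reusing one stream at two offsets:

<video src="foo.webm" id="a"></video>
<video src="foo.webm" id="b"></video>
<script>
  var fast = document.getElementById("a");
  var slow = document.getElementById("b");
  fast.playbackRate = 2.0;   // rate changes are made at the source element,
  slow.playbackRate = 0.5;   // never on the Stream itself
  var fastStream = fast.captureStream();
  var slowStream = slow.captureStream();
</script>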
Media elements
interface HTMLMediaElement {
  readonly attribute Stream stream;
  // Returns the same stream as 'stream', but also sets the captureAudio attribute.
  Stream captureStream();
  // This attribute is NOT reflected into the DOM. It's initially false.
  attribute boolean captureAudio;
  attribute any src;
};
'stream' returns the stream of "what the element is playing" --- whatever the element is currently playing, after its volume and playbackRate are taken into account. While the element is not playing (e.g. because it's paused, seeking, or buffering), the stream is blocked. When the element is in the ended state, the stream is in the ended state. When something else causes this stream to be blocked, we block the output of the media element.
When 'captureAudio' is set, the element does not produce direct audio output. Audio output is still sent to 'stream'.
'src' can be set to a Stream. Blocked streams play silence and show the last video frame.
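A minimal sketch of how these extensions fit together (the element ids and file name are hypothetical): capture an element's audio instead of letting it play directly, and route its stream to another element.

<video src="movie.webm" id="source"></video>
<audio id="sink" autoplay></audio>
<script>
  var v = document.getElementById("source");
  v.captureAudio = true;                            // suppress the element's direct audio output
  document.getElementById("sink").src = v.stream;   // equivalent to using v.captureStream()
</script>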
Stream extensions
Streams can have attributes that transform their output:
interface Stream {
  attribute double volume;
  void setVolume(double volume, [optional] double atTime);
  // When set, destinations treat the stream as not blocking. While the stream is
  // blocked, its data are replaced with silence.
  attribute boolean live;
  // When set, the stream is blocked while it is not an input to any StreamProcessor.
  attribute boolean waitForUse;
  // When the stream enters the "ended" state, an HTML task is queued to run this callback.
  attribute Function onended;
  // Create a new StreamProcessor with this Stream as the input.
  StreamProcessor createProcessor();
  // Create a new StreamProcessor with this Stream as the input,
  // initializing worker.
  StreamProcessor createProcessor(Worker worker);
};
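As a rough illustration (the element id and file name are hypothetical; createProcessor and StreamProcessor are defined in the next section), a secondary stream can be marked live so it never stalls the mix, attenuated, and watched for its end:

<audio src="stinger.webm" id="stinger"></audio>
<script>
  var stinger = document.getElementById("stinger").captureStream();
  stinger.live = true;      // if this stream blocks, consumers hear silence instead of stalling
  stinger.setVolume(0.5);   // attenuate it relative to other inputs
  stinger.onended = function() {
    console.log("stinger finished");
  };
  var mixer = stinger.createProcessor();
</script>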
Stream mixing and processing
[Constructor]
interface StreamProcessor : Stream {
  readonly attribute Stream[] inputs;
  void addStream(Stream input, [optional] double atTime);
  void setInputParams(Stream input, any params, [optional] double atTime);
  void removeStream(Stream input, [optional] double atTime);
  attribute Worker worker;
};
This object combines multiple streams with synchronization to create a new stream. While any input stream is blocked and not live, the StreamProcessor is blocked. While the StreamProcessor is blocked, all its input streams are forced to be blocked. (Note that this can cause other StreamProcessors using the same input stream(s) to block, etc.) A StreamProcessor is ended if all its inputs are ended (including if there are no inputs).
'inputs' returns the current set of input streams. A stream can be used as multiple inputs to the same StreamProcessor, so 'inputs' can contain multiple references to the same stream.
'setInputParams' sets the parameters object for the given input stream. All inputs using that stream must share the same parameters object. These parameters are only for this StreamProcessor; if the input stream is used by other StreamProcessors, they will have separate input parameters.
When 'atTime' is specified, the operation happens instantaneously at the given media time, and all changes with the same atTime happen atomically. Media times are on the same timeline as "animation time" (window.mozAnimationStartTime or whatever the standardized version of that turns out to be). If atTime is in the past or omitted, the change happens as soon as possible, and all such immediate changes issued by a given HTML5 task happen atomically.
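For instance (a non-normative sketch, assuming 'mixer', 'oldStream' and 'newStream' have already been set up), a source swap can be scheduled so that both changes take effect atomically at the same media time:

var t = window.currentTime + 5.0;  // a media time five seconds ahead, on the shared animation timeline
mixer.addStream(newStream, t);     // both changes carry the same atTime,
mixer.removeStream(oldStream, t);  // so they are applied as a single atomic change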
While 'worker' is null, the output is produced simply by adding the streams together. Video frames are composited with the last-added stream on top, everything letterboxed to the size of the last-added stream that has video. While there is no input stream, the StreamProcessor produces silence and no video.
While 'worker' is non-null, input stream data is fed into the worker by dispatching onprocessstream callbacks. Each onprocessstream callback takes a StreamEvent as a parameter. A StreamEvent provides audio sample buffers for each input stream; the event callback can write audio output buffers and a list of output video frames. If the callback does not output audio, default audio output is automatically generated as above. Each StreamEvent contains the parameters associated with each input stream contributing to the StreamEvent.
Currently the StreamEvent API does not offer access to video data. This should be added later.
Note that 'worker' cannot be a SharedWorker. This ensures that the worker can run in the same process as the page in multiprocess browsers, so media streams can be confined to a single process.
An ended stream is treated as producing silence and no video. (Alternative: automatically remove the stream as an input. But this might confuse scripts.)
interface DedicatedWorkerGlobalScope {
  attribute Function onprocessstream;
  attribute float streamRewindMax;
  attribute boolean variableAudioFormats;
};
'onprocessstream' stores the callback function to be called whenever stream data needs to be processed.
interface StreamEvent {
  readonly attribute float rewind;
  readonly attribute StreamBuffer[] inputs;
  void writeAudio(long sampleRate, short channels, Float32Array data);
};
To support graph changes with low latency, we might need to throw out processed samples that have already been buffered and reprocess them. The 'rewind' attribute indicates how far back in the stream's history we have moved before the current inputs start. It is a non-negative value less than or equal to the value of streamRewindMax on entry to the event handler. The default value of streamRewindMax is zero so by default 'rewind' is always zero; filters that support rewinding need to opt into it.
'inputs' provides access to a StreamBuffer representing data produced by each input stream.
interface StreamBuffer {
  readonly attribute any parameters;
  readonly attribute long audioSampleRate;
  readonly attribute short audioChannels;
  readonly attribute long audioLength;
  readonly attribute Float32Array audioSamples;
  // TODO something for video frames.
};
'parameters' returns a structured clone of the latest parameters set for each input stream.
'audioSampleRate' and 'audioChannels' represent the format of the samples. 'audioSampleRate' is the number of samples per second. 'audioChannels' is the number of channels; the channel mapping is as defined in the Vorbis specification.
'audioLength' is the number of samples per channel.
If 'variableAudioFormats' is false (the default) when the event handler fires, the UA will convert all the input audio to a single common format before presenting them to the event handler. Typically the UA would choose the highest-fidelity format to avoid lossy conversion. If variableAudioFormats was false for the previous invocation of the event handler, the UA also ensures that the format stays the same as the format used by the previous invocation of the handler.
'audioSamples' gives access to the audio samples for each input stream. The array length will be 'audioLength' multiplied by 'audioChannels'. The samples are floats ranging from -1 to 1, laid out non-interleaved, i.e. consecutive segments of 'audioLength' samples each. The durations of the input buffers for the input streams will be equal (or as equal as possible given varying sample rates).
Streams not containing audio will have audioChannels set to zero, and the audioSamples array will be empty --- unless variableAudioFormats is false and some input stream has audio.
'writeAudio' writes audio data to the stream output. If 'writeAudio' is not called before the event handler returns, the inputs are automatically mixed and written to the output. The 'data' array length must be a multiple of 'channels'. 'writeAudio' can be called more than once during an event handler; the data will be appended to the output stream.
There is no requirement that the amount of data output match the input buffer duration. A filter with a delay will output less data than the duration of the input buffer, at least during the first event; the UA will compensate by trying to buffer up more input data and firing the event again to get more output. A synthesizer with no inputs can output as much data as it wants; the UA will buffer data and fire events as necessary. Filters that misbehave, e.g. by continuously writing zero-length buffers, will cause the stream to block.
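Putting the worker-side pieces together, a script like the "effect.js" used in the examples below might look roughly like this (a non-normative sketch of a fixed-gain filter over the first input; the 0.5 gain value is arbitrary):

// effect.js --- a sketch of a worker-side filter, not part of the proposal itself.
onprocessstream = function(event) {
  if (event.inputs.length == 0)
    return;                           // nothing to process; default output is silence
  var input = event.inputs[0];
  if (input.audioChannels == 0)
    return;                           // no audio; fall back to the default mixing behaviour
  var samples = input.audioSamples;   // non-interleaved: audioLength samples per channel
  var output = new Float32Array(samples.length);
  for (var i = 0; i < samples.length; ++i) {
    output[i] = samples[i] * 0.5;     // apply the fixed gain
  }
  event.writeAudio(input.audioSampleRate, input.audioChannels, output);
};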
Graph cycles
If a cycle is formed in the graph, the streams involved block until the cycle is removed.
Dynamic graph changes
Dynamic graph changes performed by a script take effect atomically after the script has run to completion. Effectively we post a task to the HTML event loop that makes all the pending changes. The exact timing is up to the implementation but the implementation should try to minimize the latency of changes.
Canvas Recording
To enable video synthesis and some easy kinds of video effects we can record the contents of a canvas:
interface HTMLCanvasElement { readonly attribute Stream stream; };
'stream' is a stream containing the "live" contents of the canvas as video frames, and no audio.
Examples
1) Play video with processing effect applied to the audio track
<video src="foo.webm" id="v" controls></video> <audio id="out" autoplay></audio> <script> document.getElementById("out").src = document.getElementById("v").captureStream().createProcessor(new Worker("effect.js")); </script>
2) Play video with processing effects mixing in out-of-band audio tracks (in sync)
<video src="foo.webm" id="v"></video> <audio src="back.webm" id="back"></audio> <audio id="out" autoplay></audio> <script> var mixer = document.getElementById("v").captureStream().createProcessor(new Worker("audio-ducking.js")); mixer.addStream(document.getElementById("back").captureStream()); document.getElementById("out").src = mixer; function startPlaying() { document.getElementById("v").play(); document.getElementById("back").play(); } // We probably need additional API to more conveniently tie together // the controls for multiple media elements. </script>
3) Capture microphone input and stream it out to a peer with a processing effect applied to the audio
<script>
  navigator.getUserMedia('audio', gotAudio);
  function gotAudio(stream) {
    peerConnection.addStream(stream.createProcessor(new Worker("effect.js")));
  }
</script>
4) Capture microphone input and visualize it as it is being streamed out to a peer and recorded
<canvas id="c"></canvas> <script> navigator.getUserMedia('audio', gotAudio); var streamRecorder; function gotAudio(stream) { var worker = new Worker("visualizer.js"); var processed = stream.createProcessor(worker); worker.onmessage = function(event) { drawSpectrumToCanvas(event.data, document.getElementById("c")); } streamRecorder = processed.record(); peerConnection.addStream(processed); } </script>
5) Capture microphone input, visualize it, mix in another audio track and stream the result to a peer and record
<canvas id="c"></canvas> <mediaresource src="back.webm" id="back"></mediaresource> <script> navigator.getUserMedia('audio', gotAudio); var streamRecorder; function gotAudio(stream) { var worker = new Worker("visualizer.js"); var processed = stream.createProcessor(worker); worker.onmessage = function(event) { drawSpectrumToCanvas(event.data, document.getElementById("c")); } var mixer = processed.createProcessor(); mixer.addStream(document.getElementById("back").Stream()); streamRecorder = mixer.record(); peerConnection.addStream(mixer); } </script>
6) Receive audio streams from peers, mix them with spatialization effects, and play
<audio id="out" autoplay></audio> <script> var worker = new Worker("spatializer.js"); var spatialized = stream.createProcessor(worker); peerConnection.onaddstream = function (event) { spatialized.addStream(event.stream); spatialized.setInputParams(event.stream, {x:..., y:..., z:...}); }; document.getElementById("out").src = spatialized; </script>
7) Seamlessly chain from the end of one input stream to another
<mediaresource src="in1.webm" id="in1" preload></mediaresource> <mediaresource src="in2.webm" id="in2"></mediaresource> <audio id="out" autoplay></audio> <script> var in1 = document.getElementById("in1"); in1.onloadeddata = function() { var mixer = in1.captureStream().createProcessor(); var in2 = document.getElementById("in2"); mixer.addStream(in2.captureStream(), window.currentTime + in1.duration); document.getElementById("out").src = mixer; in1.play(); } </script>
8) Seamlessly switch from one input stream to another, e.g. to implement adaptive streaming
<mediaresource src="in1.webm" id="in1" preload></mediaresource> <mediaresource src="in2.webm" id="in2"></mediaresource> <audio id="out" autoplay></audio> <script> var stream1 = document.getElementById("in1").captureStream(); var mixer = stream1.createProcessor(); document.getElementById("out").src = mixer; function switchStreams() { var in2 = document.getElementById("in2"); in2.currentTime = in1.currentTime; var stream2 = in2.captureStream(); stream2.volume = 0; stream2.live = true; // don't block while this stream is blocked, just play silence mixer.addStream(stream2); stream2.onplaying = function() { if (mixer.inputs[0] == stream1) { stream2.volume = 1.0; stream2.live = false; // allow output to block while this stream is playing mixer.removeStream(stream1); } } } </script>
9) Synthesize samples from JS data
<audio id="out" autoplay></audio> <script> document.getElementById("out").src = new StreamProcessor(new Worker("synthesizer.js")); </script>
10) Trigger a sound sample to be played through the effects graph ASAP but without causing any blocking
<script>
  var effectsMixer = ...;
  function playSound(src) {
    var audio = new Audio(src);
    audio.oncanplaythrough = function() {
      var stream = audio.captureStream();
      stream.live = true;
      effectsMixer.addStream(stream);
      stream.onended = function() { effectsMixer.removeStream(stream); };
      audio.play();
    };
  }
</script>
13) Capture video from a camera and analyze it (e.g. face recognition)
<script>
  navigator.getUserMedia('video', gotVideo);
  function gotVideo(stream) {
    stream.createProcessor(new Worker("face-recognizer.js"));
  }
</script>
14) Capture video, record it to a file and upload the file (e.g. YouTube)
<script>
  navigator.getUserMedia('video', gotVideo);
  var streamRecorder;
  function gotVideo(stream) {
    streamRecorder = stream.record();
  }
  function stopRecording() {
    streamRecorder.getRecordedData(gotData);
  }
  function gotData(blob) {
    var x = new XMLHttpRequest();
    x.open('POST', 'uploadMessage');
    x.send(blob);
  }
</script>
15) Capture video from a canvas, record it to a file then upload
<canvas width="640" height="480" id="c"></canvas> <script> var canvas = document.getElementById("c"); var streamRecorder = canvas.createStream().record(); function stopRecording() { streamRecorder.getRecordedData(gotData); } function gotData(blob) { var x = new XMLHttpRequest(); x.open('POST', 'uploadMessage'); x.send(blob); } var frame = 0; function updateCanvas() { var ctx = canvas.getContext("2d"); ctx.clearRect(0, 0, 640, 480); ctx.fillText("Frame " + frame, 0, 200); ++frame; } setInterval(updateCanvas, 30); </script>
Related Proposals
W3C-RTC charter (Harald et al.): RTCStreamAPI
WhatWG proposal (Ian et al.): [1]
Chrome audio API: [2]
Mozilla audio API: [3]
WhatWG MediaController API: [4]