Media/WebRTC/WebRTCE10S

 
and between the PeerConnection and video/audio tags. This doesn't mean that
the media actually flows through the JS, however.
Below we show a proposed process split with E10S:
https://raw.github.com/mozilla/webrtc/master/planning/architecture-e10s.png
== System Resources to be Proxied ==
The following system resources need to somehow be made accessible to the renderer
process.
* Video rendering (accessed via a video tag)  [TODO: Is this actually a system resource? Not clear on what the display model is.]
* The speaker (accessed via an audio tag)
* The camera and microphone
* Hardware video encoders and decoders (if any)
* The network interfaces
In addition, we use the Socket Transport Service (STS) to do socket input processing. We create
UDP sockets via NSPR and then attach them to the STS in order to be informed when data is
available.
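As a rough illustration of this "attach a socket and get told when data is available" pattern, here is a minimal sketch. It uses Python's <code>selectors</code> module as a stand-in for the STS; it is not the actual STS or NSPR API, and the names <code>SocketService</code>, <code>attach</code>, <code>detach</code>, and <code>poll_once</code> are hypothetical.

```python
import selectors
import socket


class SocketService:
    """Toy stand-in for the STS: invokes a handler when a socket is readable."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def attach(self, sock, on_readable):
        # Non-blocking, so a handler can never stall the polling loop.
        sock.setblocking(False)
        self._selector.register(sock, selectors.EVENT_READ, on_readable)

    def detach(self, sock):
        self._selector.unregister(sock)

    def poll_once(self, timeout=1.0):
        """Wait up to `timeout` seconds and dispatch handlers for readable sockets."""
        events = self._selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)  # key.data is the handler passed to attach()
        return len(events)

    def close(self):
        self._selector.close()


def demo():
    """Send one datagram over loopback and collect it via the service."""
    received = []
    service = SocketService()
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def on_readable(sock):
        data, _remote = sock.recvfrom(2048)
        received.append(data)

    service.attach(receiver, on_readable)
    sender.sendto(b"ping", receiver.getsockname())
    service.poll_once()
    service.detach(receiver)
    service.close()
    receiver.close()
    sender.close()
    return received
```

Under E10S the open question is where this loop runs: the readiness notification happens wherever the real socket lives, so any design that keeps sockets in the parent also has to forward these events to the child.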
=== Input Device Access (getUserMedia) ===
We assume that camera and microphone access will be available only in the
parent process. However, since most of the WebRTC stack will live in the
child process, we need some mechanism for making the media available to
it.
The basic idea is to create a new backend for MediaManager/GetUserMedia
that is just a proxy talking to the real media devices over IPDL. The
incoming media frames would then be passed over the IPDL channel
to the child process where they are injected into the MediaStreamGraph.
This shouldn't be too complicated, but there are a few challenges:
* Making sure that we don't do superfluous copies of the data. I understand that we can move the data via gralloc buffers, so maybe that will be OK for video. [OPEN ISSUE: Will that work for audio?]
* Latency. We need to make sure that moving the data across the IPDL interface doesn't introduce too much latency. Hopefully this is a solved problem.
=== Output Access ===
[TODO: Presumably this works the same as rendering now?]
=== Hardware Acceleration ===
In this design, we make no attempt to combine HW acceleration and capture
or rendering. I.e., if we have a standalone HW encoder, we just insert it
into the pipeline in place of the SW encoder and then redirect the
encoded media out the network interface. The same goes for decoding.
There's no attempt made to shortcut the rest of the stack. This design
promotes modularity, since we can just make the HW encoder look
like another module inside of GIPS. In the longer term, we may want
to revisit this, but I think it's the best design for now.
Note that if we have an integrated encoder (e.g., in a camera) then
we *can* accommodate that by just having gUM return encoded frames
instead of I420 and then we pass those directly to the network without
encoding them. (Though this is somewhat complicated by the need
to render them locally in a video tag.)
=== Network Access ===
All networking access in WebRTC is mediated through the ICE stack (media/mtransport/third_party/nICEr and media/mtransport/nr*).
From a technical perspective, the requirements look like:
* The ability to send and receive UDP datagrams with any valid local address and any remote address.
* The ability to enumerate every network interface.
* The ability to have events happen at specific times.
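The third requirement (timed events) can be pictured as a small deadline queue that the owner of the event loop drains. The sketch below is only an illustration of that requirement, not how the ICE stack or the STS schedules timers; <code>TimerQueue</code> and its methods are hypothetical names, and the caller supplies the current time so the example stays deterministic.

```python
import heapq
import itertools


class TimerQueue:
    """Minimal deadline queue: callbacks fire once their deadline has passed."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so entries with equal deadlines never compare callbacks.
        self._sequence = itertools.count()

    def schedule(self, when, callback):
        """Arrange for `callback` to run once the clock reaches `when`."""
        heapq.heappush(self._heap, (when, next(self._sequence), callback))

    def next_deadline(self):
        """Earliest pending deadline, or None if nothing is scheduled."""
        return self._heap[0][0] if self._heap else None

    def run_due(self, now):
        """Run every callback whose deadline is <= `now`; return how many ran."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _when, _seq, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired
```

An event loop would typically use <code>next_deadline()</code> to bound how long it sleeps waiting for socket input, then call <code>run_due()</code> when it wakes.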
Below is a schematic diagram of the interaction of the ICE stack with the rest of the system which shows
how things actually work.
https://raw.github.com/mozilla/webrtc/master/planning/network-e10s.png
As before, the boxes on the left signify the currently protected operations.
There are two natural designs, discussed below.
==== Network Proxies ====
The first design is to do only the primitive networking operations in the parent
process and have ICE talk to the proxies that remote those operations,
as shown below. This is approximately the design Google uses.
https://raw.github.com/mozilla/webrtc/master/planning/network-e10s-socket-proxy.png
The advantage of this design is that it is relatively straightforward to execute
and that the APIs that are required are relatively limited. I.e.,
* List all the interfaces and their addresses
* Bind a socket to a given interface/address
* Send a packet to a given remote address from a given socket
* Receive a packet on a given socket and learn the remote address
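To make the size of this surface concrete, the four operations above can be sketched as an abstract interface. This is an illustration only, not the proposed IPDL protocol: <code>NetworkProxy</code>, <code>InMemoryNetworkProxy</code>, and every method name are hypothetical, and the in-memory implementation merely fakes datagram delivery so the example runs without touching the network.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Address = Tuple[str, int]  # (IP address, port)


class NetworkProxy(ABC):
    """Hypothetical child-side view of the operations remoted to the parent."""

    @abstractmethod
    def list_interfaces(self) -> List[Tuple[str, str]]:
        """Return (interface name, address) pairs."""

    @abstractmethod
    def bind(self, address: str, port: int = 0) -> int:
        """Bind a socket to a local address; return an opaque handle."""

    @abstractmethod
    def send_to(self, handle: int, data: bytes, remote: Address) -> None:
        """Send one datagram from the given socket to `remote`."""

    @abstractmethod
    def recv_from(self, handle: int) -> Optional[Tuple[bytes, Address]]:
        """Return (data, remote address) for one datagram, or None if idle."""


class InMemoryNetworkProxy(NetworkProxy):
    """Test double: delivers datagrams between its own bound sockets."""

    def __init__(self, interfaces: List[Tuple[str, str]]):
        self._interfaces = list(interfaces)
        self._bound: Dict[int, Address] = {}
        self._queues: Dict[Address, Deque[Tuple[bytes, Address]]] = {}
        self._next_handle = 1
        self._next_port = 50000  # arbitrary start for auto-assigned ports

    def list_interfaces(self):
        return list(self._interfaces)

    def bind(self, address, port=0):
        if address not in [addr for _name, addr in self._interfaces]:
            raise ValueError("not a local address: %s" % address)
        if port == 0:
            while (address, self._next_port) in self._queues:
                self._next_port += 1
            port = self._next_port
        local = (address, port)
        if local in self._queues:
            raise OSError("address already in use: %s:%d" % local)
        handle = self._next_handle
        self._next_handle += 1
        self._bound[handle] = local
        self._queues[local] = deque()
        return handle

    def local_address(self, handle: int) -> Address:
        return self._bound[handle]

    def send_to(self, handle, data, remote):
        source = self._bound[handle]
        queue = self._queues.get(remote)
        if queue is not None:  # datagrams to unbound addresses are silently lost
            queue.append((bytes(data), source))

    def recv_from(self, handle):
        queue = self._queues[self._bound[handle]]
        return queue.popleft() if queue else None
```

A real implementation would replace the polling <code>recv_from</code> with a notification from the parent, but the shape of the interface stays this small.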
The major disadvantage of this design is that it provides the content process
with a fair amount of control over the network and thus potentially represents
a threat if/when the content process is compromised. For instance,
if the content process is compromised, it could send arbitrary UDP or
TCP packets to anywhere in the network that is accessible to the phone.
Of course, this is already a risk in the desktop version.
We might be able to mitigate this risk somewhat by installing some
primitive packet filtering on the parent process side. For instance, we
could enforce the following policy:
* A socket maintains two tables:
** An outstanding STUN transaction table
** A "permissions" table of accepted remote addresses
* When a content process tries to send a non-STUN formatted packet, the socket rejects it unless the remote address is in the permissions table
* When a content process sends a STUN-formatted packet, it gets transmitted and added to the outstanding STUN transaction table
* When a packet is received, it is checked against the outstanding STUN transaction table. If a transaction completes, then the address is added to the permissions table.
This would be relatively easy to implement and would provide a measure of protection
against misuse of this interface. It would require some STUN-parsing smarts in the
parent, but those can be kept relatively minimal.
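The policy above can be sketched in a few dozen lines. This is a minimal illustration, not the proposed implementation. It makes several assumptions that the text does not state: a packet counts as "STUN-formatted" if it passes the RFC 5389 header checks (top two bits zero, magic cookie, consistent length); only STUN requests are recorded as outstanding transactions; and a transaction "completes" when a success response with the same transaction ID arrives from the address the request was sent to. The names <code>StunPacketFilter</code>, <code>allow_outgoing</code>, <code>on_incoming</code>, and <code>make_stun</code> are hypothetical.

```python
import struct

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value from RFC 5389
STUN_HEADER_LEN = 20
CLASS_REQUEST = 0
CLASS_SUCCESS = 2


def parse_stun_header(packet):
    """Return (message class, transaction id) if `packet` looks like STUN, else None."""
    if len(packet) < STUN_HEADER_LEN:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type & 0xC000:  # the two most significant bits must be zero
        return None
    if cookie != STUN_MAGIC_COOKIE:
        return None
    if msg_len % 4 != 0 or msg_len != len(packet) - STUN_HEADER_LEN:
        return None
    # The class is encoded in bits 8 (C1) and 4 (C0) of the message type.
    msg_class = ((msg_type >> 7) & 0x2) | ((msg_type >> 4) & 0x1)
    return msg_class, packet[8:20]


def make_stun(msg_type, transaction_id, body=b""):
    """Build a bare STUN message (helper for the usage example only)."""
    return struct.pack("!HHI", msg_type, len(body), STUN_MAGIC_COOKIE) + transaction_id + body


class StunPacketFilter:
    """Per-socket filter holding the two tables described in the policy."""

    def __init__(self):
        self._pending = {}       # transaction id -> remote address of the request
        self._permitted = set()  # remote addresses that answered a STUN request

    def allow_outgoing(self, packet, remote):
        """Return True if the content process may send `packet` to `remote`."""
        header = parse_stun_header(packet)
        if header is None:
            # Non-STUN traffic is allowed only to already-verified peers.
            return remote in self._permitted
        msg_class, transaction_id = header
        if msg_class == CLASS_REQUEST:  # assumption: only requests are tracked
            self._pending[transaction_id] = remote
        return True

    def on_incoming(self, packet, remote):
        """Update the tables for a received packet; delivery is not filtered here."""
        header = parse_stun_header(packet)
        if header is None:
            return
        msg_class, transaction_id = header
        if msg_class != CLASS_SUCCESS:
            return
        if self._pending.get(transaction_id) == remote:
            del self._pending[transaction_id]
            self._permitted.add(remote)
```

For example, with <code>0x0001</code> (Binding request) and <code>0x0101</code> (Binding success response) as message types:

```python
packet_filter = StunPacketFilter()
peer = ("192.0.2.1", 3478)
transaction_id = bytes(range(12))

packet_filter.allow_outgoing(b"media", peer)                               # rejected: peer not verified
packet_filter.allow_outgoing(make_stun(0x0001, transaction_id), peer)      # allowed: STUN request
packet_filter.on_incoming(make_stun(0x0101, transaction_id), peer)         # transaction completes
packet_filter.allow_outgoing(b"media", peer)                               # now allowed
```

A production version would also need to expire entries in both tables so that a compromised content process cannot grow them without bound.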
A detailed API proposal is at [[Media/WebRTC/WebRTCE10S/NetworkProxyInterface]].
==== ICE In Parent ====
The alternative design is to push the entire ICE stack into the parent process, as shown
below.
https://raw.github.com/mozilla/webrtc/master/planning/network-e10s-ice-parent.png
The advantage of this design from a security perspective is that by pushing the
connectivity checking into the parent process we completely remove the
ability of a compromised content process to send arbitrary network
traffic.
The two major drawbacks of this design are:
* The interface to the ICE stack is very complicated, which makes the engineering task harder.
* The ICE stack itself is also complicated, which increases the surface area in the "secure" parent process.
The ICE stack interface is found at:
* http://hg.mozilla.org/mozilla-central/file/b553e9ca2354/media/mtransport/nricectx.h
* http://hg.mozilla.org/mozilla-central/file/b553e9ca2354/media/mtransport/nricemediastream.h
This API has around 20 distinct calls, each of which would need to be
remoted separately. A number of them have fairly complicated semantics, which would tend
to leak into the rest of the program.
==== Recommendation ====
In my opinion we should go for the "Network Proxies" design. It's going to be a lot simpler
to implement than the "ICE in the parent" design and can be largely hidden by an
already replaceable component (nr_socket_prsock.cpp) without impacting the rest
of the code. It also lets us work in parallel because we can do a simple implementation
without the packet filter described above and then add the packet filter transparently
later.