Platform/XML Rewrite: Difference between revisions

Remove mention of SAX and XBL, and otherwise update the plan for 2021
Line 16: Line 16:
* Replacing expat
* Replacing expat
* Hiding expat from sinks
* Hiding expat from sinks
* Moving XUL/XBL1/SAX/RDF/XSLT off the main thread
* Moving chrome prototype parser or XSLT off the main thread


==Background observations==
==Background observations==
Line 30: Line 30:


I propose making the XML Web content load path have the same structure as the HTML load path (with document.write simplified out). That is, it would have these major parts:
I propose making the XML Web content load path have the same structure as the HTML load path (with document.write simplified out). That is, it would have these major parts:
* A parser object (mozilla::parser::xml::Parser) that nsDocument sees and that holds the rest together.
* A parser object (mozilla::XmlParser) that nsDocument sees and that holds the rest together.
* An IO driver (mozilla::parser::xml::StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to expat.
* An IO driver (mozilla::XmlStreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to expat.
* expat (portable parser core)
* expat (portable parser core)
* An object that implements handler callbacks for expat and produces tree ops. (mozilla::parser::xml::TreeOpGenerator)
* An object that implements handler callbacks for expat and produces tree ops. (mozilla::XmlTreeOpGenerator)
* The same executor for the tree ops as on the HTML side (nsHtml5TreeOpExecutor, eventually to be named mozilla::parser::TreeOpExecutor)
* The same executor for the tree ops as on the HTML side (nsHtml5TreeOpExecutor, eventually to be named mozilla::TreeOpExecutor)
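
To make the split between the tree op generator and the executor concrete, here is a minimal sketch. It is not the nsHtml5TreeOpExecutor design: TreeOp, SketchTreeOpGenerator, SketchTreeOpExecutor and Node are hypothetical names invented for illustration, and the "tree" is a toy structure rather than a DOM. The point is only that the generator records operations without touching the tree, so it can run on any thread, while the executor is the one place that mutates the tree.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree op: a record describing one tree mutation.
struct TreeOp {
  enum class Kind { StartElement, EndElement, Text };
  Kind kind;
  std::string data;  // element name or character data
};

// Hypothetical generator: parser callbacks land here. It only queues ops
// and never touches the tree itself.
class SketchTreeOpGenerator {
 public:
  void StartElement(const std::string& aName) {
    mOps.push_back({TreeOp::Kind::StartElement, aName});
  }
  void EndElement() { mOps.push_back({TreeOp::Kind::EndElement, ""}); }
  void Characters(const std::string& aText) {
    mOps.push_back({TreeOp::Kind::Text, aText});
  }
  // Hands the queued ops over and leaves the internal queue empty.
  std::vector<TreeOp> TakeOps() {
    std::vector<TreeOp> ops;
    ops.swap(mOps);
    return ops;
  }

 private:
  std::vector<TreeOp> mOps;
};

// Toy stand-in for a DOM node.
struct Node {
  std::string name;
  std::string text;
  std::vector<std::unique_ptr<Node>> children;
};

// Hypothetical executor: the only code that mutates the tree.
class SketchTreeOpExecutor {
 public:
  SketchTreeOpExecutor() : mRoot(std::make_unique<Node>()) {
    mStack.push_back(mRoot.get());
  }
  void Run(const std::vector<TreeOp>& aOps) {
    for (const TreeOp& op : aOps) {
      switch (op.kind) {
        case TreeOp::Kind::StartElement: {
          auto child = std::make_unique<Node>();
          child->name = op.data;
          Node* raw = child.get();
          mStack.back()->children.push_back(std::move(child));
          mStack.push_back(raw);
          break;
        }
        case TreeOp::Kind::EndElement:
          assert(mStack.size() > 1);  // never pop the synthetic root
          mStack.pop_back();
          break;
        case TreeOp::Kind::Text:
          mStack.back()->text += op.data;
          break;
      }
    }
  }
  const Node& Root() const { return *mRoot; }

 private:
  std::unique_ptr<Node> mRoot;
  std::vector<Node*> mStack;  // open elements, innermost last
};
```

In this sketch the generator could be driven from expat callbacks on the parser thread and the resulting vector handed to the executor on the main thread; how the hand-off is scheduled is left out.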


===Character encodings===
===Character encodings===


expat has built-in capability to decode US-ASCII, ISO-8859-1, UTF-8 and UTF-16 and has an API for plugging in support for other decoders. So why bother with putting bytes to UTF-16 conversion in mozilla::parser::xml::StreamParser outside expat?
expat has built-in capability to decode US-ASCII, ISO-8859-1, UTF-8 and UTF-16 and has an unconventional API for plugging in other decoders.


Unfortunately, expat has an unconventional API for encoding pluggability. Instead of having an API where byte buffers go in and UTF-16 or UTF-8 buffers come out, expat has an API for loading conversion tables into expat in the format that expat wants. Our pre-existing decoders don't expose their internals in that format. Therefore, to be able to use our pre-existing converters, we can't let expat manage the conversion.
We should continue to handle character encodings outside expat, but we should do the buffering in clearer code than the current nsScanner, which wasn't meant for XML to begin with and is now a very strange way to buffer data before it reaches expat.
 
Encoding sniffing should be handled the [https://bugzilla.mozilla.org/attachment.cgi?id=524615&action=diff same way nsHtml5StreamParser handles it in the XML View Source mode]: mozilla::parser::xml::StreamParser itself should handle UTF-8 and UTF-16 BOM sniffing. If there's no BOM, an instance of expat itself should be used for extracting the encoding name from the XML declaration.
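
As an illustration of the BOM sniffing step only, a minimal sketch follows. BomEncoding and SniffBom are hypothetical names, not existing Gecko APIs, and the sketch assumes the caller has already buffered the first few bytes of the stream. It does not cover the no-BOM case, where the encoding name would come from the XML declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for the sketch.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Checks the start of the stream for a UTF-8 or UTF-16 byte order mark.
// Returns None when no BOM is present or too few bytes are available.
inline BomEncoding SniffBom(const std::uint8_t* aBytes, std::size_t aLength) {
  assert(aBytes != nullptr || aLength == 0);
  if (aLength >= 3 && aBytes[0] == 0xEF && aBytes[1] == 0xBB &&
      aBytes[2] == 0xBF) {
    return BomEncoding::Utf8;
  }
  if (aLength >= 2 && aBytes[0] == 0xFF && aBytes[1] == 0xFE) {
    return BomEncoding::Utf16LE;
  }
  if (aLength >= 2 && aBytes[0] == 0xFE && aBytes[1] == 0xFF) {
    return BomEncoding::Utf16BE;
  }
  return BomEncoding::None;
}
```

A real stream parser would also have to cope with the first network buffer being shorter than three bytes, which this sketch simply reports as None.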


===Connecting handlers to expat===
===Connecting handlers to expat===


Looking at the existing sinks, it looks like there's no real value in having an abstraction between expat and code that does the actual work in response to expat's callbacks. If we switched away from expat today, we'd have to change the current abstraction layer anyway. That is, I think it doesn't make sense to have a single class (like the old nsExpatDriver) that provides a set of expat callbacks and then provides another abstraction for concrete handler classes that do the real work. I propose we make the concrete handler classes set themselves as expat callbacks directly. That is, mozilla::parser::xml::TreeOpGenerator should know how to register itself as the handler of various expat callbacks. This way, we don't need a layer of virtual calls right on top of expat's function pointer-based calls.
Looking at the existing sinks, it looks like there's no real value in having an abstraction between expat and code that does the actual work in response to expat's callbacks. If we switched away from expat today, we'd have to change the current abstraction layer anyway. That is, I think it doesn't make sense to have a single class (like the old nsExpatDriver) that provides a set of expat callbacks and then provides another abstraction for concrete handler classes that do the real work. I propose we make the concrete handler classes set themselves as expat callbacks directly. That is, mozilla::XmlTreeOpGenerator should know how to register itself as the handler of various expat callbacks. This way, we don't need a layer of virtual calls right on top of expat's function pointer-based calls.
 
I think it would make sense to have a common superclass for classes that can handle expat callbacks, but instead of being anything like nsIXMLContentSink or nsIExpatSink, I think the only commonality the classes need to have is the ability to register themselves as expat callback handlers. So the common superclass could look like this:
 
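class mozilla::parser::xml::AExpatHandler {
 public:
  virtual ~AExpatHandler() {}
  // Installs this object's static callback functions on aParser.
  virtual void RegisterCallbacksInto(XML_Parser aParser) = 0;
};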
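
To show what "register themselves directly" could mean in practice, here is a self-contained sketch. FakeParser is a stand-in invented for this example so that it compiles without expat; it only mimics expat's style of a user data pointer plus C function pointers. ExpatHandlerSketch and LoggingHandler are likewise hypothetical. With real expat, the registration would presumably go through calls such as XML_SetUserData and XML_SetElementHandler, and the start handler would also receive the attribute array.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an expat parser handle: user data plus function pointers.
// This is NOT expat's real type; it exists only to keep the sketch runnable.
struct FakeParser {
  void* userData = nullptr;
  void (*startElement)(void* aUserData, const char* aName) = nullptr;
  void (*endElement)(void* aUserData, const char* aName) = nullptr;
};

// The only shared interface: the ability to install callbacks.
class ExpatHandlerSketch {
 public:
  virtual ~ExpatHandlerSketch() = default;
  virtual void RegisterCallbacksInto(FakeParser& aParser) = 0;
};

// A concrete handler that records the events it sees.
class LoggingHandler final : public ExpatHandlerSketch {
 public:
  void RegisterCallbacksInto(FakeParser& aParser) override {
    aParser.userData = this;
    aParser.startElement = &LoggingHandler::HandleStart;
    aParser.endElement = &LoggingHandler::HandleEnd;
  }
  const std::vector<std::string>& Events() const { return mEvents; }

 private:
  // Static trampolines: the parser calls these through plain function
  // pointers, so no virtual call sits between the parser and the handler.
  static void HandleStart(void* aUserData, const char* aName) {
    static_cast<LoggingHandler*>(aUserData)->mEvents.push_back(
        std::string("start:") + aName);
  }
  static void HandleEnd(void* aUserData, const char* aName) {
    static_cast<LoggingHandler*>(aUserData)->mEvents.push_back(
        std::string("end:") + aName);
  }
  std::vector<std::string> mEvents;
};

// Drives the fake parser as if it had just parsed <aName></aName>.
inline void FakeParseOneElement(FakeParser& aParser, const char* aName) {
  assert(aParser.startElement && aParser.endElement);
  aParser.startElement(aParser.userData, aName);
  aParser.endElement(aParser.userData, aName);
}
```

The single virtual call happens once, at registration time; per-event dispatch goes through the function pointers only.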


===Dealing with stream data off the main thread===
===Dealing with stream data off the main thread===


mozilla::parser::xml::StreamParser should implement nsIStreamListener on the main thread and copy data over to the parser thread the way nsHtml5StreamParser does.
mozilla::XmlStreamParser should implement nsIStreamListener on the main thread and copy data over to the parser thread the way nsHtml5StreamParser does.


===Dealing with entity references off the main thread===
===Dealing with entity references off the main thread===
Line 64: Line 56:
Since our IO APIs are meant to be called on the main thread, starting IO for the local DTD file from the parser thread is not good. And in any case, it's rather silly to parse an actual file when we know in advance what the file will contain.
Since our IO APIs are meant to be called on the main thread, starting IO for the local DTD file from the parser thread is not good. And in any case, it's rather silly to parse an actual file when we know in advance what the file will contain.


Instead of parsing a special file in this case, expat should be hacked in such a way that its internal entity tables can be mutated to a state that's equivalent with the state they'd end up in by parsing the special DTD without actually parsing anything.
Instead of parsing a special file in this case, expat should be hacked in such a way that its internal entity tables can be mutated to a state that's equivalent with the state they'd end up in by parsing the special DTD without actually parsing anything. Failing that, we could bake the data into the shared library so that it's available as static data on any thread.
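
As a rough sketch of the "static data" idea, the entity definitions could live in a read-only table that is compiled into the library. EntityEntry, kSketchEntities and LookupEntity are hypothetical names, the table holds only four sample entries, and how the values would actually be fed into expat's internal entity tables is not shown, since that depends on the expat changes discussed above.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical table entry: entity name and the code point it expands to.
struct EntityEntry {
  const char* name;
  char32_t codePoint;
};

// A few sample entries only; a real table would cover the full DTD.
// The data is immutable, so reading it needs no locking on any thread.
static const EntityEntry kSketchEntities[] = {
    {"amp", U'&'},
    {"copy", U'\u00A9'},
    {"lt", U'<'},
    {"nbsp", U'\u00A0'},
};

// Linear lookup by name; returns nullptr for unknown entities.
inline const EntityEntry* LookupEntity(const char* aName) {
  assert(aName != nullptr);
  for (const EntityEntry& entry : kSketchEntities) {
    if (std::strcmp(entry.name, aName) == 0) {
      return &entry;
    }
  }
  return nullptr;
}
```

A full-size table would probably want a sorted array with binary search or a perfect hash instead of the linear scan used here.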


===Lack of actual speculation===
===Lack of actual speculation===
Line 72: Line 64:
==Parsing chrome: XML==
==Parsing chrome: XML==


Chrome documents in Firefox are localized using external DTDs that define named entities. This needs to work with the new implementation. Since initiating DTD IO from off the main thread is trouble, chrome: documents should be parsed on the main thread. To enable this, there should be an on-the-main-thread alternative for mozilla::parser::xml::StreamParser: mozilla::parser::xml::MainThreadStreamParser. To get assertions about which methods should run on which thread right, it is probably useful to actually have two classes instead of having one class with a flag that picks different code paths within the class. The two classes should probably share encoding sniffing code in a common superclass.
Chrome documents in Firefox are localized using external DTDs that define named entities. This needs to work with the new implementation. Since initiating DTD IO from off the main thread is trouble, chrome: documents should be parsed on the main thread. To enable this, there should be an on-the-main-thread alternative for mozilla::XmlStreamParser: mozilla::XmlMainThreadStreamParser. To get assertions about which methods should run on which thread right, it is probably useful to actually have two classes instead of having one class with a flag that picks different code paths within the class. The two classes should probably share encoding sniffing code in a common superclass.


Since chrome: documents can be XHTML, it follows that mozilla::parser::xml::TreeOpGenerator needs to work on the main thread, too. This shouldn't be a big deal considering that tree op generation in the HTML case can run on either thread.
Since chrome: documents can be XHTML, it follows that mozilla::parser::xml::TreeOpGenerator needs to work on the main thread, too. This shouldn't be a big deal considering that tree op generation in the HTML case can run on either thread.
Line 78: Line 70:
==Parsing XML that's not Web content-like==
==Parsing XML that's not Web content-like==


We have various uses of expat that don't currently target nsXMLContentSink. XUL, XBL1, XSLT (transformation program compilation), RDF and SAX all have special sinks. It's probably not worthwhile to move these off the main thread or to otherwise make substantial changes here. I propose making these cases use mozilla::parser::xml::MainThreadStreamParser for feeding content to expat and to deCOMtaminate the special sinks so that they inherit from mozilla::parser::xml::AExpatHandler (presented above) and know how to set themselves directly as expat callback handlers.
We use expat for 1) XHR, 2) XML Web content, 3) Firefox UI files using the prototype parser and 4) XSLT programs. Moving the last two off the main thread may not be worthwhile.


==Fragment parsing==
==Fragment parsing==
Line 93: Line 85:


  class mozilla::parser::AParser {
  class mozilla::parser::AParser {
   virtual void setCharsetAndSource(nsACString aCharset, PRInt32 aSource) = 0;
   virtual void setCharsetAndSource(mozilla::Encoding* aEncoding, uint32_t aSource) = 0;
   virtual nsIStreamListener* GetStreamListener() = 0;
   virtual nsIStreamListener* GetStreamListener() = 0;
   virtual void UnblockParser() = 0;
   virtual void UnblockParser() = 0;
Line 101: Line 93:
  };
  };


Both mozilla::parser::xml::Parser and nsHtml5Parser (to be renamed mozilla::parser::html::Parser) would inherit from this abstract class.
Both mozilla::XmlParser and nsHtml5Parser would inherit from this abstract class.