Platform/XML Rewrite

From MozillaWiki

Revision as of 14:00, 27 April 2011

Goals

  • Get rid of nsParser
  • Get rid of nsScanner
  • Get rid of nsIContentSink and related nsI stuff
  • Get rid of nsIParser
  • Get rid of content-initiated flushes
  • Move Web content XML parsing off the main thread
  • For Web content, reuse code from the HTML side
  • Less COMtamination

Non-Goals

  • Replacing expat
  • Hiding expat from application code
  • Moving XUL/XBL1/SAX/RDF/XSLT off the main thread

Background observations

The HTML5 parser has a design that works. When document.write handling complexity is not considered, the HTML5 parser has these major parts:

  • A parser object (nsHtml5Parser) that nsDocument sees and that holds the rest together.
  • An IO driver (nsHtml5StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to the portable parser core.
  • The portable parser core (nsHtml5Tokenizer and nsHtml5TreeBuilder).
  • Glue code that produces tree ops from what the portable core does (nsHtml5TreeBuilderCppSupplement)
  • An executor for the tree ops (nsHtml5TreeOpExecutor)

The parser object also supports fragment parsing, but that functionality doesn't really benefit from being in the class that's oriented towards full page loading, so I think even on the HTML side, the fragment parsing functionality should be separated from nsHtml5Parser.

Basics for Web content loading on the XML side

I propose making the XML Web content load path have the same structure as the HTML load path (with document.write simplified out). That is, it would have these major parts:

  • A parser object (mozilla::parser::xml::Parser) that nsDocument sees and that holds the rest together.
  • An IO driver (mozilla::parser::xml::StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to expat.
  • expat (portable parser core)
  • An object that implements handler callbacks for expat and produces tree ops (mozilla::parser::xml::TreeOpGenerator)
  • The same executor for the tree ops as on the HTML side (nsHtml5TreeOpExecutor, eventually to be named mozilla::parser::TreeOpExecutor)
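
To make the generator/executor split concrete, here is a minimal sketch of what "tree ops" could look like. Everything in it is an assumption made for illustration: SketchTreeOp, SketchNode and SketchExecutor are hypothetical names, and the real nsHtml5TreeOpExecutor operates on Gecko content nodes and has a much richer set of operations. The point is only that the producer records operations as plain data and the consumer replays them in order.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified tree op: a kind plus one string payload.
struct SketchTreeOp {
  enum Kind { OpenElement, CloseElement, AppendText } mKind;
  std::string mData;
};

// Hypothetical stand-in for a DOM node.
struct SketchNode {
  std::string mName;
  std::string mText;
  std::vector<std::unique_ptr<SketchNode>> mChildren;
  SketchNode* mParent = nullptr;
};

// Replays a batch of ops against a tree, in order. In the proposed design the
// equivalent step would happen on the main thread.
class SketchExecutor {
 public:
  SketchExecutor() : mRoot(new SketchNode()), mCurrent(mRoot.get()) {}

  void Run(const std::vector<SketchTreeOp>& aOps) {
    for (const SketchTreeOp& op : aOps) {
      switch (op.mKind) {
        case SketchTreeOp::OpenElement: {
          std::unique_ptr<SketchNode> child(new SketchNode());
          child->mName = op.mData;
          child->mParent = mCurrent;
          SketchNode* raw = child.get();
          mCurrent->mChildren.push_back(std::move(child));
          mCurrent = raw;  // Subsequent ops apply inside the new element.
          break;
        }
        case SketchTreeOp::CloseElement:
          if (mCurrent->mParent) {
            mCurrent = mCurrent->mParent;
          }
          break;
        case SketchTreeOp::AppendText:
          mCurrent->mText += op.mData;
          break;
      }
    }
  }

  const SketchNode& Root() const { return *mRoot; }

 private:
  std::unique_ptr<SketchNode> mRoot;
  SketchNode* mCurrent;
};
```

Because the ops are plain data, a batch can be built on one thread and handed to another without the producer ever touching the tree.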

Details about Web content loading

Character encodings

expat has built-in capability to decode US-ASCII, ISO-8859-1, UTF-8 and UTF-16 and has an API for plugging in support for other decoders. So why bother with putting byte-to-UTF-16 conversion in mozilla::parser::xml::StreamParser outside expat?

Unfortunately, expat has an unconventional API for encoding pluggability. Instead of having an API where byte buffers go in and UTF-16 or UTF-8 buffers come out, expat has an API for loading conversion tables into expat in the format that expat wants. Our pre-existing decoders don't expose their internals in that format. Therefore, to be able to use our pre-existing converters, we can't let expat manage the conversion.

Encoding sniffing should be handled the same way nsHtml5StreamParser handles it in the XML View Source mode: mozilla::parser::xml::StreamParser itself should handle UTF-8 and UTF-16 BOM sniffing. If there's no BOM, an instance of expat itself should be used for extracting the encoding name from the XML declaration.
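
The BOM sniffing step can be sketched as follows. This is a minimal illustration, not the nsHtml5StreamParser code: SniffBom and BomEncoding are hypothetical names, and the sketch assumes the first few bytes of the stream are available in one buffer. Only the byte patterns themselves (EF BB BF for UTF-8, FF FE for little-endian UTF-16, FE FF for big-endian UTF-16) are fixed facts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type; the names are illustrative, not Gecko API.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Inspects the start of a byte stream and reports which BOM, if any, is
// present. *aBomLength receives the number of bytes the caller should skip.
inline BomEncoding SniffBom(const uint8_t* aBytes, size_t aLength,
                            size_t* aBomLength) {
  assert(aBomLength);
  assert(aBytes || aLength == 0);
  *aBomLength = 0;
  if (aLength >= 3 && aBytes[0] == 0xEF && aBytes[1] == 0xBB &&
      aBytes[2] == 0xBF) {
    *aBomLength = 3;
    return BomEncoding::Utf8;
  }
  if (aLength >= 2 && aBytes[0] == 0xFF && aBytes[1] == 0xFE) {
    *aBomLength = 2;
    return BomEncoding::Utf16LE;
  }
  if (aLength >= 2 && aBytes[0] == 0xFE && aBytes[1] == 0xFF) {
    *aBomLength = 2;
    return BomEncoding::Utf16BE;
  }
  // No BOM: the caller would fall back to reading the XML declaration.
  return BomEncoding::None;
}
```

When the result is BomEncoding::None, the declaration-based path described above takes over.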

Connecting handlers to expat

Looking at the existing sinks, it looks like there's no real value in having an abstraction between expat and code that does the actual work in response to expat's callbacks. If we switched away from expat today, we'd have to change the current abstraction layer anyway. That is, I think it doesn't make sense to have a single class (like the old nsExpatDriver) that provides a set of expat callbacks and then provides another abstraction for concrete handler classes that do the real work. I propose we make the concrete handler classes set themselves as expat callbacks directly. That is, mozilla::parser::xml::TreeOpGenerator should know how to register itself as the handler of various expat callbacks.

I think it would make sense to have a common superclass for classes that can handle expat callbacks, but instead of being anything like nsIXMLContentSink or nsIExpatSink, I think the only commonality the classes need to have is the ability to register themselves as expat callback handlers. So the common superclass could look like this:

class mozilla::parser::xml::AExpatHandler {
 public:
  virtual void RegisterCallbacksInto(XML_Parser aParser) = 0;
};
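
As a rough illustration of a concrete handler registering itself, here is a minimal sketch. To keep it compilable without expat, it declares small stand-ins for XML_Parser, XML_SetUserData and XML_SetElementHandler that mirror the shape of the real expat declarations; with the real library, including expat.h would replace that block. ElementLogger is a hypothetical handler invented for this example and merely records element names where a real TreeOpGenerator would produce tree ops.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for the expat declarations (assumption: used only so the sketch
// builds on its own; #include <expat.h> would replace this whole block).
typedef char XML_Char;
typedef void (*XML_StartElementHandler)(void* userData, const XML_Char* name,
                                        const XML_Char** atts);
typedef void (*XML_EndElementHandler)(void* userData, const XML_Char* name);
struct XML_ParserStruct {
  void* mUserData = nullptr;
  XML_StartElementHandler mStart = nullptr;
  XML_EndElementHandler mEnd = nullptr;
};
typedef XML_ParserStruct* XML_Parser;
inline void XML_SetUserData(XML_Parser aParser, void* aData) {
  aParser->mUserData = aData;
}
inline void XML_SetElementHandler(XML_Parser aParser,
                                  XML_StartElementHandler aStart,
                                  XML_EndElementHandler aEnd) {
  aParser->mStart = aStart;
  aParser->mEnd = aEnd;
}

namespace mozilla {
namespace parser {
namespace xml {

class AExpatHandler {
 public:
  virtual ~AExpatHandler() {}
  virtual void RegisterCallbacksInto(XML_Parser aParser) = 0;
};

// Hypothetical concrete handler: logs element boundaries.
class ElementLogger : public AExpatHandler {
 public:
  void RegisterCallbacksInto(XML_Parser aParser) override {
    // The user data pointer is how the static callbacks find this object.
    XML_SetUserData(aParser, this);
    XML_SetElementHandler(aParser, HandleStart, HandleEnd);
  }
  const std::vector<std::string>& Log() const { return mLog; }

 private:
  static void HandleStart(void* aUserData, const XML_Char* aName,
                          const XML_Char** /* aAtts */) {
    static_cast<ElementLogger*>(aUserData)->mLog.push_back(
        std::string("start ") + aName);
  }
  static void HandleEnd(void* aUserData, const XML_Char* aName) {
    static_cast<ElementLogger*>(aUserData)->mLog.push_back(
        std::string("end ") + aName);
  }
  std::vector<std::string> mLog;
};

}  // namespace xml
}  // namespace parser
}  // namespace mozilla
```

The design choice this illustrates is that the code owning the parser only needs to call RegisterCallbacksInto once; which callbacks get installed is entirely the handler's business.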

Dealing with stream data off the main thread

mozilla::parser::xml::StreamParser should implement nsIStreamListener on the main thread and copy data over to the parser thread the way nsHtml5StreamParser does.
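
The copy-and-hand-off idea can be sketched with standard C++ primitives. This is not a description of how nsHtml5StreamParser does it, and it uses none of Gecko's threading APIs: ByteQueue is a hypothetical class, and a blocking queue is an assumption chosen for brevity. What it shows is that the receiving side copies the bytes before returning, because the caller's buffer is only valid for the duration of the call.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical hand-off queue between a producer ("main") thread and a
// consumer ("parser") thread.
class ByteQueue {
 public:
  // Producer side: copies the bytes so the caller may reuse its buffer.
  void Push(const uint8_t* aData, size_t aLength) {
    std::vector<uint8_t> copy(aData, aData + aLength);
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mChunks.push_back(std::move(copy));
    }
    mCondVar.notify_one();
  }

  // Producer side: signals that no more data will arrive.
  void Finish() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mDone = true;
    }
    mCondVar.notify_all();
  }

  // Consumer side: blocks until a chunk is available or the stream has
  // finished. Returns false once the stream is finished and drained.
  bool Pop(std::vector<uint8_t>& aOut) {
    std::unique_lock<std::mutex> lock(mMutex);
    mCondVar.wait(lock, [this] { return !mChunks.empty() || mDone; });
    if (mChunks.empty()) {
      return false;
    }
    aOut = std::move(mChunks.front());
    mChunks.pop_front();
    return true;
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCondVar;
  std::deque<std::vector<uint8_t>> mChunks;
  bool mDone = false;
};
```

In use, the stream listener callback would call Push for each incoming buffer and Finish at the end of the stream, while the parser thread loops on Pop and feeds each chunk onward.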

Dealing with entity references off the main thread

Currently, we map a small set of magic public ids to a DTD file that we actually feed to expat so that it gets parsed every time the user loads a document that references one of the magic public ids, such as the public ids for the XHTML 1.0 DTDs. This way, entities defined in the XHTML 1.0 DTDs are available to documents.

Since our IO APIs are meant to be called on the main thread, starting IO for the local DTD file from the parser thread is not good. And in any case, it's rather silly to parse an actual file when we know in advance what the file will contain.

Instead of parsing a special file in this case, expat should be hacked so that its internal entity tables can be mutated directly into the state that parsing the special DTD would have produced, without actually parsing anything.
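
To illustrate the kind of data involved, here is a minimal sketch of a precomputed entity table. It does not touch expat's internals, which is where the actual work would be; IsMagicPublicId, LookupPredefinedEntity and the table layout are hypothetical, and both lists are deliberately incomplete samples. The public ids shown are the three XHTML 1.0 ones, and the four entities are a small excerpt of what the XHTML 1.0 DTDs define.

```cpp
#include <cassert>
#include <cstring>

// Sample of public ids that would trigger the precomputed tables
// (illustrative subset, not the list Gecko actually uses).
static const char* const kMagicPublicIds[] = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
};

// One precomputed entity: name and the UTF-16 code unit it expands to.
struct EntityEntry {
  const char* mName;
  char16_t mValue;
};

// Small excerpt of the entities defined by the XHTML 1.0 DTDs.
static const EntityEntry kXhtmlEntities[] = {
    {"nbsp", 0x00A0},
    {"copy", 0x00A9},
    {"eacute", 0x00E9},
    {"mdash", 0x2014},
};

// Returns true if the doctype's public id is one of the sample magic ids.
inline bool IsMagicPublicId(const char* aPublicId) {
  for (const char* id : kMagicPublicIds) {
    if (std::strcmp(id, aPublicId) == 0) {
      return true;
    }
  }
  return false;
}

// Returns the expansion for a known entity name, or 0 if it is not in the
// table. No file IO and no DTD parsing is involved.
inline char16_t LookupPredefinedEntity(const char* aName) {
  for (const EntityEntry& entry : kXhtmlEntities) {
    if (std::strcmp(entry.mName, aName) == 0) {
      return entry.mValue;
    }
  }
  return 0;
}
```

Since the table is constant data, it can be consulted from the parser thread without any of the main-thread IO concerns described above.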