Platform/XML Rewrite: Difference between revisions

Revision as of 13:20, 27 April 2011

Goals

Get rid of nsParser
Get rid of nsScanner
Get rid of nsIContentSink and related nsI stuff
Get rid of nsIParser
Get rid of content-initiated flushes
Move Web content XML parsing off the main thread
For Web content, reuse code from the HTML side
Less COMtamination

Non-Goals

Replacing expat
Hiding expat from application code
Moving XUL/XBL1/SAX/RDF/XSLT off the main thread

Background observations

The HTML5 parser has a design that works. When document.write handling complexity is not considered, the HTML5 parser has these major parts:

A parser object (nsHtml5Parser) that nsDocument sees and that holds the rest together.
An IO driver (nsHtml5StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to the portable parser core.
The portable parser core (nsHtml5Tokenizer and nsHtml5TreeBuilder).
Glue code that produces tree ops from what the portable core does (nsHtml5TreeBuilderCppSupplement).
An executor for the tree ops (nsHtml5TreeOpExecutor).

The parser object also supports fragment parsing, but that functionality doesn't really benefit from being in the class that's oriented towards full page loading, so I think even on the HTML side, the fragment parsing functionality should be separated from nsHtml5Parser.

Basics for Web content loading on the XML side

I propose making the XML Web content load path have the same structure as the HTML load path (with document.write simplified out). That is, it would have these major parts:

A parser object (mozilla::parser::xml::Parser) that nsDocument sees and that holds the rest together.
An IO driver (mozilla::parser::xml::StreamParser) that can receive bytes from a network stream, manages the character encoding conversion and pushes UTF-16 code units to expat.
expat (portable parser core)
An object (mozilla::parser::xml::TreeOpGenerator) that implements handler callbacks for expat and produces tree ops.
The same executor for the tree ops as on the HTML side (nsHtml5TreeOpExecutor, eventually to be named mozilla::parser::TreeOpExecutor).
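
The split between the tree op generator and the executor can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the TreeOp, Node, TreeOpGenerator and TreeOpExecutor types below are simplified, hypothetical stand-ins and not the real Gecko classes, and the three callbacks only loosely mirror the shape of expat's start-element, character-data and end-element handlers.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tree op: a record of what should happen to the
// tree, produced by the parser side and applied later by the executor.
struct TreeOp {
  enum class Kind { OpenElement, AppendText, CloseElement };
  Kind kind;
  std::string data;  // element name for OpenElement, text for AppendText
};

// Hypothetical generator: turns expat-style handler callbacks into queued
// tree ops without touching the document tree itself.
class TreeOpGenerator {
 public:
  void OnStartElement(const std::string& name) {
    mOps.push_back({TreeOp::Kind::OpenElement, name});
  }
  void OnCharacters(const std::string& text) {
    mOps.push_back({TreeOp::Kind::AppendText, text});
  }
  void OnEndElement() { mOps.push_back({TreeOp::Kind::CloseElement, ""}); }

  // Hands the accumulated ops over and leaves the internal queue empty.
  std::vector<TreeOp> TakeOps() {
    std::vector<TreeOp> ops;
    ops.swap(mOps);
    return ops;
  }

 private:
  std::vector<TreeOp> mOps;
};

// Hypothetical, minimal tree node.
struct Node {
  std::string name;
  std::string text;
  std::vector<std::unique_ptr<Node>> children;
  Node* parent = nullptr;
};

// Hypothetical executor: the only part that mutates the tree.
class TreeOpExecutor {
 public:
  TreeOpExecutor() : mCurrent(&mRoot) {}

  void Run(const std::vector<TreeOp>& ops) {
    for (const TreeOp& op : ops) {
      switch (op.kind) {
        case TreeOp::Kind::OpenElement: {
          auto child = std::make_unique<Node>();
          child->name = op.data;
          child->parent = mCurrent;
          Node* raw = child.get();
          mCurrent->children.push_back(std::move(child));
          mCurrent = raw;  // descend into the new element
          break;
        }
        case TreeOp::Kind::AppendText:
          mCurrent->text += op.data;
          break;
        case TreeOp::Kind::CloseElement:
          if (mCurrent->parent) mCurrent = mCurrent->parent;
          break;
      }
    }
  }

  const Node& Root() const { return mRoot; }

 private:
  Node mRoot;      // stands in for the document node
  Node* mCurrent;  // insertion point
};
```

In this sketch the generator never touches a Node, so in principle it could run on a parser thread while the executor applies batches of ops on the main thread. How ops would actually be handed between threads is not shown here.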

Details about Web content loading

Character encodings

expat has built-in capability to decode US-ASCII, ISO-8859-1, UTF-8 and UTF-16 and has an API for plugging in support for other decoders. So why bother with putting the bytes-to-UTF-16 conversion in mozilla::parser::xml::StreamParser, outside expat?

Unfortunately, expat has an unconventional API for encoding pluggability. Instead of having an API where byte buffers go in and UTF-16 or UTF-8 buffers come out, expat has an API for loading conversion tables into expat in the format that expat wants. Our pre-existing decoders don't expose their internals in that format. Therefore, to be able to use our pre-existing converters, we can't let expat manage the conversion.
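
The mismatch between the two API styles can be sketched as follows. This is only an illustration under stated assumptions: EncodingTable is a hypothetical local struct that mirrors just the `int map[256]` member of expat's XML_Encoding (the real struct also carries a data pointer plus convert and release callbacks), and DecodeWithBufferApi is a hypothetical stand-in for a "bytes in, UTF-16 out" decoder, not an existing Gecko interface. The identity mapping is used only because its table is trivial to write down.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of the table-shaped part of expat's XML_Encoding:
// one entry per byte value, giving the code point that byte decodes to
// (or -1 for a byte that is malformed in the encoding).
struct EncodingTable {
  int map[256];
};

// Table style: the caller describes the encoding to the parser up front.
// Here every byte maps to the code point with the same value.
inline EncodingTable MakeIdentityTable() {
  EncodingTable table;
  for (int b = 0; b < 256; ++b) {
    table.map[b] = b;
  }
  return table;
}

// Buffer style: bytes go in, UTF-16 code units come out, and the decoder's
// internals stay hidden. A decoder shaped like this cannot be handed to a
// table-style API without exposing those internals.
inline std::u16string DecodeWithBufferApi(const unsigned char* bytes,
                                          std::size_t length) {
  std::u16string out;
  out.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    out.push_back(static_cast<char16_t>(bytes[i]));
  }
  return out;
}
```

With buffer-style decoders already available, running them in front of the parser and feeding it UTF-16 avoids having to re-express each encoding as a table.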

Encoding sniffing should be handled the same way nsHtml5StreamParser handles it in the XML View Source mode: mozilla::parser::xml::StreamParser itself should handle UTF-8 and UTF-16 BOM sniffing. If there's no BOM, an instance of expat itself should be used for extracting the encoding name from the XML declaration.
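
The BOM sniffing step can be sketched as below. The names BomEncoding, BomResult and SniffBom are hypothetical and chosen for this example; only the byte sequences themselves (EF BB BF for UTF-8, FF FE for little-endian UTF-16, FE FF for big-endian UTF-16) are fixed. The sketch assumes the first bytes of the stream are already available in one buffer; a real stream parser would also have to cope with those bytes arriving in separate network chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result types for this sketch.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

struct BomResult {
  BomEncoding encoding;   // encoding indicated by the BOM, if any
  std::size_t bomLength;  // number of leading bytes to skip before decoding
};

// Inspects the start of a byte buffer for a UTF-8 or UTF-16 byte order mark.
// Returns {None, 0} when no BOM is present, in which case the encoding has
// to be determined some other way (for example from the XML declaration).
inline BomResult SniffBom(const std::uint8_t* bytes, std::size_t length) {
  if (length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB &&
      bytes[2] == 0xBF) {
    return {BomEncoding::Utf8, 3};
  }
  if (length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
    return {BomEncoding::Utf16LE, 2};
  }
  if (length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
    return {BomEncoding::Utf16BE, 2};
  }
  return {BomEncoding::None, 0};
}
```

A caller would skip bomLength bytes and set up the matching decoder when a BOM is found, and fall back to the XML declaration otherwise.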

