PDF.js

PDF.js is an HTML5-based Portable Document Format renderer.

Project Manager: Bill Walker

Developers: Artur Adib, Brendan Dahl

Repository: https://github.com/mozilla/pdf.js

IRC: #pdfjs on irc.mozilla.org

Mailing list: dev-pdf-js

Meeting Notes

Current work items, 2011-08-22

Image support

JPEG2000 support (Yury?) (issue 286)
Image rendering bugs (Shaon?)

Use Web Workers for responsiveness and throughput (Shaon)

Automated testing

Store reference renderings on Amazon S3 (~Chris, would like to hand off)
Run regression tests on EC2 on every push to pdf.js repo (~Chris, would like to hand off)

Ship pdf.js as version 0.1 Firefox extension

Mechanism for users to report broken PDFs (Vivien)
UI polish (Vivien)

Render PDF 1.7 spec perfectly

(tracked by github issues)

Analyze pdf.js's feature completeness across all PDFs

gather large corpus of PDFs (unassigned)
run pdf.js over all PDFs to collect data on missing and broken features (unassigned)
prioritize missing features (unassigned)
implement missing features (unknown work)

Integrate code into Firefox/Gecko proper

(unknown work)

Milestone: Big-splash demo

Probably a pixel-perfect rendering of the TraceMonkey paper, with a non-trivial UI (i.e., eye candy).

Pixel-perfect rendering

~~Type1 fonts~~
~~Bitmaps and SMask blending~~
~~canvas.setDash()~~
~~even-odd fills~~
~~axial shading~~
~~TTF fonts (pass the sanitizer)~~
bring Quartz font rendering to parity with pango/freetype

Non-trivial UI

~~zooming~~
~~pre-rendering pages~~
~~"continuous" scrolling~~
~~integrate Worker code into multi-page viewer~~ (punted)
preview panel

Milestone: PDF.js Firefox extension 1.0

The code for a restartless Firefox extension is available. It handles the application/pdf MIME type and redirects the display to the URL stored in the extensions.pdf.js.url preference.

The URL of the opened document replaces the '%s' argument in the preference URL.

https://github.com/andreasgal/pdf.js/tree/master/extensions/firefox

or

http://hg.mozilla.org/users/vnicolas_mozilla.com/pdfjs/

Feel free to push changes to this repository.
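
As a minimal sketch of the '%s' substitution described above, the function below fills the placeholder in a viewer URL template with the URL of the opened document. The function name, the example viewer URL, and the decision to percent-encode the document URL are assumptions made for illustration; they are not taken from the extension's code.

```javascript
'use strict';

// Hypothetical helper: fill the '%s' placeholder of a viewer URL template.
// Percent-encoding the document URL is an assumption, made so that the
// document's own query string does not mix with the viewer's.
function buildViewerUrl(template, documentUrl) {
  var encoded = encodeURIComponent(documentUrl);
  // A replacer function avoids special "$" patterns in the replacement.
  return template.replace('%s', function () {
    return encoded;
  });
}

// Example preference value (made up for this sketch).
var template = 'http://example.com/viewer.html?file=%s';
console.log(buildViewerUrl(template, 'http://foo.com/bar.pdf'));
```

Only the first '%s' is replaced, which matches a template with a single placeholder.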

Tracking in-progress work

We'll try to use github issues to track work. (If that proves too unwieldy, we can move to another system.)

Like bugzilla, "issues" are used to track both bugs filed by users and specific work items for developers. Try to file one issue per problem observed. For example,

Text looks bad in PDFs

is an unspecific issue. A more specific and more helpful one is

Glyph spacing is weird on page 10 of http://foo.com/bar.pdf

Similarly,

Implement fonts

is an unspecific work item. Better is

Add code to convert Type 1 into CFF

For big projects that span many issues, let's try to use issue "labels" to track the work. (In the same way one would use metabugs to track big projects in bugzilla.) So for the font example above, while Type1->CFF conversion is a specific work item, we might want to tag it "fonts" for easier searching.

Once a specific work item (issue) has been filed, please assign the issue to yourself if you're working on it. This keeps several people from working on the same item and wasting time.

(TODO)

Backend

zooming
- the general idea is that the UI will set a zoom factor, say 200%
- we'll redraw the canvas, but with a scale transform to 2x, and a translation set to move the content we want to fill the screen to the top-left
draw subpage
SVG backend (issue 229)
linearization
- byte range requests
~~hyperlinks (hash URLs, intra-doc links)~~
perf (use workers for some stuff?)
color spaces (big, pervasive)
build something like gecko's display list, for hit testing
- ~~click-on-link (easy)~~
- text selection (hard) (issue 10)
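
The zooming idea in the list above (scale by the zoom factor, then translate so the wanted content sits at the top-left) can be sketched as plain matrix arithmetic. The function names are hypothetical; the matrix layout follows the six-argument form used by the canvas 2D context's setTransform().

```javascript
'use strict';

// Build a canvas-style transform [a, b, c, d, e, f] that scales by `zoom`
// and moves the page point (originX, originY) to the canvas top-left.
function zoomTransform(zoom, originX, originY) {
  return [zoom, 0, 0, zoom, -zoom * originX, -zoom * originY];
}

// Apply a transform to a point, using the same matrix layout as
// CanvasRenderingContext2D.setTransform(a, b, c, d, e, f).
function applyTransform(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// 200% zoom with the page point (100, 50) moved to the top-left corner.
var m = zoomTransform(2, 100, 50);
console.log(applyTransform(m, 100, 50)); // the chosen point lands at the origin
console.log(applyTransform(m, 150, 75)); // offsets from that point are doubled
```

In a browser, one would then pass the six values to the context's setTransform() and redraw the page.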

UI

animations (page flip, etc.)
~~hyperlinks~~
page transitions
dual-page display (issue 590)
page-transition animations
~~pan/zoom/next/prev gestures~~ (Edit: Felipe Gomes and cjones discussed a better way to support these, but it will require new web APIs)

Platform

TextMetrics.maxHeight (to compute more accurate bounding boxes; can approximate without this, though)
implement text selection in SVG documents (bugzilla 292498)
(determine extent of SVG a11y implementation, if any)

Testing

reftest-style harness, compare hand-written PDF commands to hand-written canvas (?)
compare to poppler output, keep list of differences
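
A reftest-style comparison ultimately comes down to comparing two pixel buffers. The sketch below is one hypothetical way to do that, assuming both snapshots are RGBA arrays of the kind canvas getImageData() returns in its data property; it is not the project's actual harness.

```javascript
'use strict';

// Count pixels that differ between two RGBA buffers of equal size.
// Hypothetical helper for this sketch.
function countDifferentPixels(actual, expected) {
  if (actual.length !== expected.length) {
    throw new Error('Snapshots have different sizes');
  }
  var different = 0;
  for (var i = 0, ii = actual.length; i < ii; i += 4) {
    if (actual[i] !== expected[i] ||
        actual[i + 1] !== expected[i + 1] ||
        actual[i + 2] !== expected[i + 2] ||
        actual[i + 3] !== expected[i + 3]) {
      different++;
    }
  }
  return different;
}

// Two 2-pixel snapshots; only the second pixel differs (blue channel).
var reference = [255, 255, 255, 255, 0, 0, 0, 255];
var rendering = [255, 255, 255, 255, 0, 0, 1, 255];
console.log(countDifferentPixels(rendering, reference));
```

A test passes when the count is zero; a tolerance threshold could be added if anti-aliasing differences turn out to matter.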

Analysis

dump stream info
dump font info
dump raster image info

Big project: Color spaces

Approach: map input color values (fillcolor, strokecolor etc.) to output color space. Map input bitmaps to output space with SVG color-matrix filter/WebGL shader program/hand-written JS as available. Problem: will this work correctly for interpolated color values, like intermediate colors in a gradient, and other computed values like the result of composition operators? Does canvas need color-space support? Do we care enough? (What do other PDF renderers do?)
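
To make the interpolation question concrete, here is a small sketch using a naive DeviceCMYK-to-RGB formula. The formula is an assumption chosen for simplicity (it ignores ICC profiles) and is not necessarily what pdf.js does. It compares interpolating in the input space and then converting against converting the endpoints and then interpolating in the output space.

```javascript
'use strict';

// Naive DeviceCMYK -> RGB mapping (an assumption for this sketch).
function cmykToRgb(c, m, y, k) {
  return [
    Math.round(255 * (1 - c) * (1 - k)),
    Math.round(255 * (1 - m) * (1 - k)),
    Math.round(255 * (1 - y) * (1 - k))
  ];
}

// Linear interpolation between two colors with the same number of components.
function mix(from, to, t) {
  return from.map(function (value, i) {
    return value + (to[i] - value) * t;
  });
}

var cyan = [1, 0, 0, 0];
var black = [0, 0, 0, 1];

// Interpolate in CMYK, then convert the midpoint.
var mid = mix(cyan, black, 0.5);
var convertedLast = cmykToRgb(mid[0], mid[1], mid[2], mid[3]);

// Convert the endpoints first, then interpolate in RGB.
var convertedFirst = mix(cmykToRgb(1, 0, 0, 0), cmykToRgb(0, 0, 0, 1), 0.5);

console.log(convertedLast, convertedFirst);
```

With this formula the two midpoints disagree in the red channel, because the mapping is not linear. That suggests a gradient from cyan to black would look different depending on where the conversion happens.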

Big project: Hyperlinks

~~Parse link data from PDF~~
~~Add UI to highlight/set cursor on link hover~~
~~Implement "go to point X in page Y" interface in backend~~
Figure out encoding scheme for absolute links, e.g. http://foo.com/bar.pdf#[encoded link]
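
One possible encoding, purely as a starting point for discussion: put the "point X in page Y" destination into the URL fragment as key=value pairs. The parameter names (page, x, y) and both function names are hypothetical; no scheme has been decided.

```javascript
'use strict';

// Hypothetical scheme: "#page=Y&x=...&y=..." appended to the PDF URL.
function encodeLink(pdfUrl, dest) {
  return pdfUrl + '#page=' + dest.page + '&x=' + dest.x + '&y=' + dest.y;
}

// Parse the fragment back into {page, x, y}. Returns null when the URL
// carries no destination in this scheme.
function decodeLink(url) {
  var hashIndex = url.indexOf('#');
  if (hashIndex === -1) {
    return null;
  }
  var dest = {};
  url.substring(hashIndex + 1).split('&').forEach(function (pair) {
    var parts = pair.split('=');
    dest[parts[0]] = Number(parts[1]);
  });
  if (isNaN(dest.page)) {
    return null;
  }
  return { page: dest.page, x: dest.x || 0, y: dest.y || 0 };
}

var link = encodeLink('http://foo.com/bar.pdf', { page: 10, x: 72, y: 540 });
console.log(link);
console.log(decodeLink(link));
```

Because the destination lives in the fragment, it is never sent to the server, so the same PDF file is fetched regardless of the target.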

Big project: SVG backend

Most of SVG maps well to PDF (was it influenced by PDF?). There are existing PDF->SVG translators. Performance is the biggest concern. We want to build the SVG document in the background, without affecting main-thread interactivity. The way to do that is to build the document in a Web Worker thread. The problem is that Workers don't have access to any DOM APIs. We'll probably need to build the document as a string in the background, then send it over to the main thread for parsing.
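
A minimal sketch of the "build a string in the Worker, parse on the main thread" idea follows. The list-of-operations input format and the function names are made up for illustration. The serializer uses no DOM APIs, so it could run inside a Worker, and the Worker wiring is shown only in comments.

```javascript
'use strict';

// Escape characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Serialize simple drawing operations (a made-up format for this sketch)
// into an SVG document string.
function buildSvgString(width, height, ops) {
  var parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="' + width +
               '" height="' + height + '">'];
  ops.forEach(function (op) {
    if (op.type === 'rect') {
      parts.push('<rect x="' + op.x + '" y="' + op.y + '" width="' + op.w +
                 '" height="' + op.h + '" fill="' + escapeXml(op.fill) + '"/>');
    } else if (op.type === 'text') {
      parts.push('<text x="' + op.x + '" y="' + op.y + '">' +
                 escapeXml(op.text) + '</text>');
    }
  });
  parts.push('</svg>');
  return parts.join('');
}

// In the Worker:        self.postMessage(buildSvgString(612, 792, ops));
// On the main thread, inside the worker's onmessage handler:
//   new DOMParser().parseFromString(event.data, 'image/svg+xml');
console.log(buildSvgString(10, 10, [{ type: 'text', x: 1, y: 2, text: 'a < b' }]));
```

Whether parsing a large string on the main thread is cheap enough to keep the UI responsive would still need to be measured.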

Big project: Text selection

Option 1: In SVG backend
- Draw to canvas first. On first selection, switch to SVG-rendered content.
- Let Gecko do all text selection in SVG document
Option 2: In canvas backend
- Build data structure representing text drawn to screen (e.g., display list/BSP/etc.). For best results, collapse adjacent and same-height/width "text runs".
- Walk data structure and compute textruns at a particular point and/or within a bounding box
- Add UI for "highlighted" text above PDF and saving selected text to clipboard
  - Corner cases: clipped text, occluded text, non-white backgrounds, non-black text
- Maybe: render without display-list building first, then on first selection re-interpret PDF to build display list. Or pre-build display list in the background.
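
For Option 2, the simplest form of the data structure is a flat list of text runs with bounding boxes, queried by point or by rectangle. The sketch below uses a linear scan rather than a BSP or a real display list, and the field and function names are assumptions.

```javascript
'use strict';

// A flat "display list" of text runs, each with its bounding box in page
// coordinates. The field names are assumptions for this sketch.
var textRuns = [
  { text: 'Hello', x: 10, y: 10, width: 40, height: 12 },
  { text: 'world', x: 55, y: 10, width: 42, height: 12 },
  { text: 'Next line', x: 10, y: 30, width: 70, height: 12 }
];

// True when two axis-aligned boxes overlap.
function intersects(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Text runs touching a selection rectangle, in drawing order.
function runsInRect(runs, rect) {
  return runs.filter(function (run) {
    return intersects(run, rect);
  });
}

// Text runs under a single point (e.g., the mouse position).
function runsAt(runs, x, y) {
  return runs.filter(function (run) {
    return x >= run.x && x < run.x + run.width &&
           y >= run.y && y < run.y + run.height;
  });
}

// Select the first line only.
var selected = runsInRect(textRuns, { x: 0, y: 0, width: 100, height: 25 });
console.log(selected.map(function (run) { return run.text; }).join(' '));
```

A linear scan is probably fine for a single page; a spatial index would only pay off if pages turn out to contain very many runs.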

See notes from "Baz", a poppler developer.

Big project: Accessibility

This is similar to text selection, except that there is no web-visible accessibility API we could hook into with canvas. So:

Somehow detect that a11y is enabled, permanently switch to SVG backend
Let Gecko implement a11y interfaces

(Possibly) Big project: Vertical text

Somewhat pervasive mode switch in text-drawing code. Is it just a matter of transform hackery to put glyphs in the right place, or do we need canvas support? Canvas support might be a big project.

Big project: Search

We want browsers' find-in-page features to see PDF text too. This is hard with canvas. With an SVG or HTML backend, if we set things up properly, we might get search almost for free. Search is similar to text selection but slightly harder: text selection only needs to know which text is adjacent, whereas search potentially needs to know about word, sentence, paragraph, column, and page breaks in order to tell which chunks of text belong to the same word.

Big project: XFA or AcroForms

Add support for interactive form objects, which allow users to fill out input fields (text, checkboxes, radio buttons, etc.) in a PDF and then "submit" them. We need to find out whether more PDFs with forms use XFA or AcroForms. Rumor has it that most tax forms are XFA, which is the biggest use case driving this so far. Once we figure out whether XFA or AcroForms is more important to implement, then we need to research whether these controls map 1:1 to HTML5 form controls. If so, implementing forms would approximately mean overlaying the PDF display with the appropriate HTML5 elements. We also need to research how these controls can be styled; that might be difficult to implement. Lastly, forms have some support for running JavaScript, which we may not want to allow. It would be an interesting problem to see if such script could be sandboxed properly.
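
For the AcroForms case only, a first guess at the mapping to HTML controls might look like the sketch below. The field type names (Tx, Btn, Ch) and the button flag bits follow the PDF 1.7 specification as far as we understand it; the function name and the returned object shape are assumptions. Whether the mapping is really 1:1 is exactly the open research question, and XFA is not covered here.

```javascript
'use strict';

// Button field flags (PDF 1.7: bit 16 = Radio, bit 17 = Pushbutton).
var FLAG_RADIO = 1 << 15;
var FLAG_PUSHBUTTON = 1 << 16;

// Map an AcroForm field dictionary to a description of an HTML control.
// `field.FT` is the field type and `field.Ff` the field flags.
function htmlControlFor(field) {
  var flags = field.Ff || 0;
  switch (field.FT) {
    case 'Tx':
      return { tag: 'input', type: 'text' };
    case 'Btn':
      if (flags & FLAG_PUSHBUTTON) {
        return { tag: 'button', type: 'button' };
      }
      if (flags & FLAG_RADIO) {
        return { tag: 'input', type: 'radio' };
      }
      return { tag: 'input', type: 'checkbox' };
    case 'Ch':
      return { tag: 'select', type: null };
    default:
      // E.g., signature fields: no obvious HTML equivalent.
      return null;
  }
}

console.log(htmlControlFor({ FT: 'Btn', Ff: FLAG_RADIO }));
```

The overlay step would then create the described element and position it over the field's rectangle on the rendered page.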

Testing

To run tests

Have a look at this directory
Then,

pdf.js/test$ python test.py --help

Once you decide how you want to run it, that script will spin up one or more browsers, and a little web server to serve up test cases.

Utils

To uncompress a PDF

install pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/)
run |pdftk foo.pdf output uncompressed.foo.pdf uncompress|

Coding Style

make sure HTML files are declared <!DOCTYPE html> (i.e., HTML5). IE9+ will load them in compatibility mode otherwise.

add a "use strict"; statement (exactly that!) to the top of your JS files

2 spaces for indentation.

Line breaks are free (I promise), so don't hesitate to use them to separate logical blocks inside your functions.

Be sure to declare a variable with 'var' before using it. You don't want to be hurt by random variables living in the global scope.

Files are named like_this.js.
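
A short file that follows the rules above might look like this. The file name and the function are invented for the example.

```javascript
'use strict';

// Example file name: page_counter.js (lowercase, words joined by underscores).
// With "use strict", assigning to an undeclared variable throws a
// ReferenceError instead of silently creating a global.

function countNonEmptyPages(pages) {
  var count = 0;

  for (var i = 0, ii = pages.length; i < ii; i++) {
    if (pages[i].length > 0) {
      count++;
    }
  }

  return count;
}

console.log(countNonEmptyPages(['abc', '', 'def']));
```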

Useful resources:

Also, some particular points (wording taken from https://developer.mozilla.org/en/JavaScript_style_guide):

Don't use object methods and properties more than you have to. It is often faster to store the result in a temporary variable.
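
For instance, a nested property that is needed on every loop iteration can be read once up front. The object shape here is made up, and how much time this saves depends on the JavaScript engine.

```javascript
'use strict';

var page = { view: { width: 612, height: 792 } };
var scales = [0.5, 1, 2];

// Look up page.view once instead of on every iteration.
var view = page.view;
var sizes = [];
for (var i = 0, ii = scales.length; i < ii; i++) {
  sizes.push([view.width * scales[i], view.height * scales[i]]);
}

console.log(sizes);
```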

If you have to do DOM manipulations (hopefully not!):

Don't call getAttribute to see if an attribute exists, call hasAttribute instead.
Prefer to loop through childNodes rather than using first/lastChild with next/previousSibling. But prefer hasChildNodes() to childNodes.length > 0. Similarly prefer document.getElementsByTagName(aTag).item(0) != null to document.getElementsByTagName(aTag).length > 0.
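
The DOM recommendations above can be summarized as small helpers. The helper names are hypothetical; each one takes a DOM node (or document) and uses the preferred call, with the discouraged alternative noted in a comment.

```javascript
'use strict';

function hasTitle(element) {
  // Preferred over: element.getAttribute('title') !== null
  return element.hasAttribute('title');
}

function isEmpty(node) {
  // Preferred over: node.childNodes.length === 0
  return !node.hasChildNodes();
}

function hasTag(doc, tagName) {
  // Preferred over: doc.getElementsByTagName(tagName).length > 0
  return doc.getElementsByTagName(tagName).item(0) != null;
}

function childNodeNames(node) {
  // Loop through childNodes rather than walking firstChild/nextSibling.
  var names = [];
  var children = node.childNodes;
  for (var i = 0, ii = children.length; i < ii; i++) {
    names.push(children[i].nodeName);
  }
  return names;
}
```

In a browser, these would be called as, for example, hasTag(document, 'canvas') or childNodeNames(document.body).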

Review (aka pull-request) policy

All pushes to the master must go through pull requests.

NB: this isn't being enforced yet

New code has to pass all tests (FORTHCOMING)
New code can't regress performance on (TBD) as measured by (TBD), unless it implements a new feature major enough to justify a temporary perf regression. This is up to common sense.
Major new features should have architectural review from (TBD). Less major patches can be reviewed by (TBD).
