Firefox/Projects/FTS and Awesomebar: Difference between revisions

Firefox/Projects/FTS and Awesomebar (view source)

Revision as of 00:43, 14 March 2010

4,621 bytes added , 14 March 2010

Massive update to include more details

Adw

Confirmed users

764

edits

@@ Line 1: / Line 1: @@
 == Intro ==
+This project will attempt make Storage text searches faster, focusing on the awesomebar.
 SQLite supports indexed full text search (FTS) through its fts extensions.  Indexed FTS makes text searches fast:  Rather than looking at every record in the database to see if it contains a search string, target records are found by comparing the search string against an index.
-But the fts extensions aren't by default suitable for international text.  This project will try to make them suitable so we can use FTS to improve the experience of Storage- and search-related Firefox features for all of our users.
+But the fts extensions aren't by default suitable for international text.  This project will try to make them suitable so we can use FTS to improve Firefox features that rely on Storage text searches for all of our users.
-A good first target feature in Firefox is the awesomebar.  It is highly visible, and since it makes extensive use of text searches, it stands to benefit from this work.
+A good first target feature is the awesomebar.  It is highly visible, and since it makes extensive use of text searches, it stands to benefit from this work.
-<strong>Note</strong> that full text search does not mean that we store the entire text of pages.  We will use it to store page titles, URLs, tags, etc., as we already do now, only lookup will be fast.
+Note that full text search does not mean that we store the entire text of pages.  We will use it to store page titles, URLs, tags, etc., as we already do now, only lookup will be fast.
 * Champion, lead: adw
@@ Line 15: / Line 17: @@
 TAKING OFF
-* Gecko has some [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/idl/nsISemanticUnitScanner.idl facilities] for i18n word boundary analysis, but they're not comprehensive or suitable for Firefox's 300 million users.  (They aren't even used anywhere in the tree.)  Too bad.
+* Gecko has some facilities for i18n word boundary analysis under the  [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/ intl/lwbrk/] directory.
-* Thunderbird does FTS.  I talked with asuth about it, and unfortunately their i18n tokenizer doesn't seem appropriate for us either.
+** [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/idl/nsISemanticUnitScanner.idl <tt>nsISemanticUnitScanner</tt>] is the only interface there, and [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSemanticUnitScanner.h <tt>nsSemanticUnitScanner</tt>] is its only implementation.  <tt>nsSemanticUnitScanner</tt> is derived from [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSampleWordBreaker.cpp <tt>nsSampleWordBreaker</tt>], which as its name implies is not robust.  It supports ASCII but as far as I can tell has incomplete support for CJK and Thai and no support for other scripts.
-* Investigating pulling some components of [http://site.icu-project.org/ ICU] into our tree.  ICU is a large, established, and widely used i18n library that has facilities for word boundary analysis and tokenization.  SQLite supports an ICU tokenizer out of the box.
+** There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h <tt>nsIWordBreaker</tt>], but its [http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#65 only implementation] is <tt>nsSampleWordBreaker</tt>.
-* I was able to build our SQLite with ICU support.  It works!  I'm building and linking against an ICU build outside of our tree, because I don't want to focus on the grunt work of building and linking inside the tree with our tools right now.  I assume doing so is not impossible...
+** There are some other files.  There are several line breakers in the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/ src] directory.  There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/rulebrk.h another word breaker], but it's for Thai text only.  There's a scattering of files related to [http://en.wikipedia.org/wiki/JIS_encoding JIS encoding].
+* Thunderbird 3 does FTS with a custom tokenizer in the [http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/ mailnews/extensions/fts3/] directory.
+** According to the [http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/README.mozilla readme], the tokenizer "supports CJK indexing using bi-gram. So you have to use bi-gram search string if you wanto to search CJK character."  There is no mention of other scripts.
+** I chatted with asuth about it, asked how well their tokenizer serves their users and whether he could recommend it to us.
+*** What's in their tree right now is mainly just improvements to CJK tokenization.
+*** There's a pending patch on {{bug|525537}} that handles case and accent folding, but that's only an improvement to the Porter stemmer.  Before we (Firefox) even think about stemming, we need to competently support i18n text in the first place.  I also think that stemming, while definitely a good thing in general, is more useful for Thunderbird, where they are indexing the entire bodies of emails.  We don't currently plan to index bodies of Web pages.
+*** He too looked at our libintl library recently for the aforementioned bug and agreed that it's not a featureful or actively developed library to rely on.
+*** When I asked if they'd considered bringing in ICU or any other outside libraries, he mentioned that Mozilla had shied away from bringing in ICU before.
+*** He said their plan was to reconsider later if they should move to something more featureful, like Lucene.  Their tokenizer improvements are due entirely to one (awesome) Japanese contributor.
+*** He noted that they don't see many non-English user complaints about their search.  I'm not sure if that's a useful metric for us, and there are some complaints in the bug linked above.
+*** They don't expect to do a lot more work on their tokenizer anytime soon.
+* So.  The i18n word boundary analysis work done around Mozilla doesn't seem sufficient for us.  We could make it sufficient by brute force.  That would require expertise that Mozilla to my knowledge does currently not possess.  Even if we did, or if we called out to the Mozilla community at large, my impression is that writing a comprehensive, well tested i18n word boundary analysis library is a large undertaking.
+* If we called out to the <em>open source</em> community, however, I think we'd have better luck, specifically because it's already written these libraries!  asuth mentioned Lucene, an Apache text retrieval library.  SQLite supports the [http://site.icu-project.org/ ICU] library's tokenizer out of the box.  ICU is large, established, and widely used.
+* So, I investigated using ICU for tokenization.
+* ICU's license is the [http://source.icu-project.org/repos/icu/icu/trunk/license.html ICU License], which is the MIT license with additional clauses about documentation and promotion.  I'm pretty sure it's compatible with our tri-license, but I'm no lawyer...
+* ICU is designed to be modular.  Here's a picture showing [http://userguide.icu-project.org/design#TOC-API-Dependencies API dependencies].  We need "BreakIterator", which is inside "Common library", which depends on the "Data library."  There are many compilation options that allow you to pick only the pieces you need.  ([http://source.icu-project.org/repos/icu/icu/trunk/readme.html 1], [http://userguide.icu-project.org/packaging 2], [http://icu-project.org/repos/icu/icuhtml/trunk/design/modularization/icu_modularization_overview.html 3].)  I haven't investigated them yet, but the important point is that we don't need all of ICU to support i18n tokenization.
+* I was able to build our SQLite with ICU support.  I'm building and linking against an ICU build outside of our tree, because I don't want to focus on the grunt work of building and linking inside the tree and build system right now.  I assume doing so is not technically impossible...  Contrary to the API dependency picture I linked to above, I had to link against three ICU components:  common library, data library, and i18n library.  Since I built using default options, I presume I built more than I needed, including the i18n lib, or perhaps I misunderstood the tokenizer dependencies.  More investigation needed.
 * Next I would like to do some tests to determine its potential for improving awesomebar searches.
 == Goals ==
-* Make awesomebar results come back from the database faster.  (Note that async awesomebar prevents the UI from locking up, which is great and necessary, but it doesn't make the database queries any faster.)
+* Make awesomebar results come back from the database faster.
 == Non Goals ==
@@ Line 32: / Line 50: @@
 == Milestones ==
-Note: Dates in the future are only estimates.
+Dates in the future are only estimates.
 * 2010/03 - [Complete] Investigate i18n tokenizers