User:Beckley/Indexed Search Proposal: Difference between revisions

Revision as of 21:35, 6 August 2008

3,115 bytes added , 6 August 2008

→‎Incorporating an Indexing Engine

DavidA

@@ Line 19: / Line 19: @@
 * Each indexing engine has different capabilities, which leads to least-common-denominator solutions or differences between platforms
 * Not all OSes ship with an indexing engine (e.g. Windows XP, Linux), so users would have to download and configure one
-* Glue code needs to be written/maintained to keep a single interface in the front-end
+* Glue code needs to be written and maintained to keep a single interface in the front-end (existing APIs such as [http://xesam.org Xesam] do this, but they don't support all of the indexing engines we would want)
 * The indexing engine can get disabled, have settings changed, or get upgraded to an incompatible state, making it unusable
 Even though OS-based indexing engines are not well suited for search inside Thunderbird, they are still useful for users who want to search outside of Thunderbird and have their email show up in the results.
-== Rolling Our Own ==
+== Incorporating an Indexing Engine ==
+As mentioned in the section above, a better route is to include an indexing engine inside Thunderbird.  That way we can control it and ensure that it is present, enabled, and compatible.  A number of FOSS indexing engines are available, but a few in particular stand out: C-Lucene, SQLite Full Text Search, Ferret and Sphinx.  They're discussed below, but ideally the design for integrating indexed full-text search would be flexible enough to allow different engines to be plugged in.
+=== Architectural Approach ===
+Regardless of what indexing engine is used, we have to figure out how those engines are fed data, queried, etc.
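+One way to picture a pluggable design is a small front-end interface that every engine backend implements. The sketch below is only an illustration: the names <code>IndexingEngine</code>, <code>InMemoryEngine</code>, <code>add_document</code>, <code>remove_document</code> and <code>search</code> are hypothetical and not part of Thunderbird, Gloda, or any of the engines discussed here, and the toy in-memory backend merely stands in for a real one.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from collections import defaultdict


class IndexingEngine(ABC):
    """Hypothetical interface the front-end would code against."""

    @abstractmethod
    def add_document(self, doc_id: str, fields: dict[str, str]) -> None: ...

    @abstractmethod
    def remove_document(self, doc_id: str) -> None: ...

    @abstractmethod
    def search(self, query: str) -> list[str]: ...


class InMemoryEngine(IndexingEngine):
    """Toy backend (an assumption for illustration): inverted index, all terms must match."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)  # term -> doc ids
        self._terms_by_doc: dict[str, set[str]] = {}            # doc id -> terms

    @staticmethod
    def _tokenize(text: str) -> set[str]:
        # Lowercased word tokens; no stemming or stop words.
        return set(re.findall(r"\w+", text.lower()))

    def add_document(self, doc_id: str, fields: dict[str, str]) -> None:
        self.remove_document(doc_id)  # re-adding a document replaces it
        terms: set[str] = set()
        for value in fields.values():
            terms |= self._tokenize(value)
        self._terms_by_doc[doc_id] = terms
        for term in terms:
            self._postings[term].add(doc_id)

    def remove_document(self, doc_id: str) -> None:
        for term in self._terms_by_doc.pop(doc_id, set()):
            self._postings[term].discard(doc_id)

    def search(self, query: str) -> list[str]:
        terms = self._tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


# The front-end only ever sees the IndexingEngine interface.
engine: IndexingEngine = InMemoryEngine()
engine.add_document("msg-1", {"subject": "Indexed search proposal", "body": "Lucene and SQLite"})
engine.add_document("msg-2", {"subject": "Lunch", "body": "SQLite talk at noon"})
print(engine.search("sqlite"))
print(engine.search("sqlite proposal"))
```

+A C-Lucene or SQLite backend would implement the same three methods, so swapping engines would not touch the front-end; query features that only some engines support would still need a policy of their own.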
+=== Lucene ===
+[http://lucene.apache.org Lucene] is an indexing engine from the Apache project.  It was originally written in [http://lucene.apache.org/java/ Java], but has been ported to many other languages, including (for our purposes) [http://clucene.wiki.sourceforge.net/ C/C++].  Lucene is considered the de facto standard text indexing engine, and is used in thousands of projects.  The [http://lucene.apache.org/java/docs/queryparsersyntax.html capabilities of Lucene] include multiple fields, boolean operators, wildcards, fuzzy matches, proximity searching, multiple-word phrases, and more.
+Java Lucene (the original) is not being considered because of the complexities of bundling Java software with Thunderbird.
+C-Lucene is a possibility.  Flock has used C-Lucene, and has XPCOM bindings for it (currently under GPL).  C-Lucene is under the Apache license.
+TBD: What was the Flock experience w/ C-Lucene, and how much does it apply to us?
+TBD: What's involved in building stemmers for more languages for C-Lucene?
+=== SQLite Full Text Search ===
+[http://www.sqlite.org/cvstrac/wiki?p=FtsTwo SQLite-FTS] is the full-text indexer that is part of SQLite.
+Advantages:
+* part of SQLite, which we already bundle & ship.
+Disadvantages:
+* FTS currently works only for text stored in the SQLite database.  Gloda currently doesn't store any of the message content, so either FTS needs to be taught how to index "foreign" text, or we need to bring more text into a SQLite database.
+* Stemming is currently done through the IBM ICU library, which is reportedly very large.
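+To make the "foreign" text issue concrete, here is a minimal sketch of indexing message text without storing it. It assumes a much newer SQLite than the FTS2 module linked above: the FTS5 extension with a contentless table (<code>content=''</code>), accessed through Python's standard <code>sqlite3</code> module. The table name <code>msg_index</code>, the sample messages and the <code>search</code> helper are made up for illustration, and FTS5 must be compiled into the SQLite build in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# content='' makes this a contentless FTS5 table: terms are indexed,
# but the original text is not kept in the database.
conn.execute("CREATE VIRTUAL TABLE msg_index USING fts5(subject, body, content='')")

# Hypothetical messages; the integer key stands in for whatever id
# the message store already uses.
messages = {
    1: ("Indexed search proposal", "Comparing Lucene, Ferret and Sphinx"),
    2: ("Lunch", "Sphinx talk at noon"),
}
for msg_key, (subject, body) in messages.items():
    conn.execute(
        "INSERT INTO msg_index(rowid, subject, body) VALUES (?, ?, ?)",
        (msg_key, subject, body),
    )


def search(query):
    """Return the keys of messages matching an FTS5 query string."""
    rows = conn.execute(
        "SELECT rowid FROM msg_index WHERE msg_index MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


print(search("sphinx"))
print(search("subject:proposal"))
print(search("lucene AND sphinx"))
```

+Only the message keys come back from a query, so the text itself would still be fetched from the existing message store; whether that trade-off suits Gloda is an open question.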
+=== Ferret ===
+Ferret started off as a Ruby port of Lucene, but has evolved into a thin Ruby wrapper around a reportedly clean and well-architected C library, which could be wrapped with XPCOM.
+Advantages:
+* MIT license IIRC
+* small
+* stemmers for many languages & multiple encodings already there
+Disadvantages:
+* [http://www.ruby-forum.com/topic/137629 Possibly not production ready for all environments]
+=== Sphinx ===
+[http://www.sphinxsearch.com/ Sphinx] is a full-text indexing engine targeted at MySQL.  It gets some positive reviews.
+Advantages:
+* At least two languages supported (so i18n is possible, if not already done)
+Disadvantages:
+* May contain MySQL-specific code
+* Currently GPL