** [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/idl/nsISemanticUnitScanner.idl <tt>nsISemanticUnitScanner</tt>] is the only interface there, and [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSemanticUnitScanner.h <tt>nsSemanticUnitScanner</tt>] is its only implementation. <tt>nsSemanticUnitScanner</tt> is derived from [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSampleWordBreaker.cpp <tt>nsSampleWordBreaker</tt>], which, as its name implies, is not robust. It supports ASCII but, as far as I can tell, has incomplete support for CJK and Thai and no support for other scripts.
** There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h <tt>nsIWordBreaker</tt>], but its [http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#65 only implementation] is <tt>nsSampleWordBreaker</tt>.
** There are some other files: several line breakers in the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/ src] directory, [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/rulebrk.h another word breaker] that handles Thai text only, and a smattering of files related to [http://en.wikipedia.org/wiki/JIS_encoding JIS encoding].
* Thunderbird 3 does FTS with a custom tokenizer in the [http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/ mailnews/extensions/fts3/] directory.
** According to the [http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/README.mozilla readme], the tokenizer "supports CJK indexing using bi-gram. So you have to use bi-gram search string if you wanto to search CJK character." There is no mention of other scripts.
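
To make the bi-gram idea from the readme concrete, here is a minimal sketch in Python. It is not the Thunderbird tokenizer's code. The function names (<tt>is_cjk</tt>, <tt>bigram_tokens</tt>), the code point ranges used to detect CJK, and the handling of a lone CJK character are all assumptions made for illustration.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test (assumed ranges, not taken from the Thunderbird source)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def bigram_tokens(text):
    """Split text into lowercase words and overlapping CJK bi-grams."""
    tokens = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            tokens.append(run.lower())
        elif cls == "cjk":
            if len(run) == 1:
                # Assumption: a lone CJK character is kept as its own token.
                tokens.append(run)
            else:
                # Every adjacent pair of characters becomes one token.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # Separator runs produce no tokens.
    return tokens


print(bigram_tokens("東京都 Mozilla"))
```

With this scheme an indexed run such as 東京都 yields the tokens 東京 and 京都, so a query only matches if it is split into bi-grams the same way, which is presumably what the readme means by having to "use bi-gram search string".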