Confirmed users
764
edits
m (→Status) |
(→Status) |
||
Line 18: | Line 18: | ||
* Gecko has some facilities for i18n word boundary analysis under the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/ intl/lwbrk/] directory. | * Gecko has some facilities for i18n word boundary analysis under the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/ intl/lwbrk/] directory. | ||
** [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/idl/nsISemanticUnitScanner.idl <tt>nsISemanticUnitScanner</tt>] is the only interface there, and [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSemanticUnitScanner.h <tt>nsSemanticUnitScanner</tt>] is its only implementation. <tt>nsSemanticUnitScanner</tt> is derived from [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSampleWordBreaker.cpp <tt>nsSampleWordBreaker</tt>], which as its name implies is not robust. It supports ASCII but as far as I can tell has incomplete support for CJK and Thai and no support for other scripts. | ** [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/idl/nsISemanticUnitScanner.idl <tt>nsISemanticUnitScanner</tt>] is the only interface there, and [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSemanticUnitScanner.h <tt>nsSemanticUnitScanner</tt>] is its only implementation. <tt>nsSemanticUnitScanner</tt> is derived from [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSampleWordBreaker.cpp <tt>nsSampleWordBreaker</tt>], which as its name implies is not robust. It supports ASCII but as far as I can tell has incomplete support for CJK and Thai and no support for other scripts. Manual testing shows that it does not tokenize Japanese satisfactorily. | ||
** There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h <tt>nsIWordBreaker</tt>], but its [http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#65 only implementation] is <tt>nsSampleWordBreaker</tt>. | ** There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h <tt>nsIWordBreaker</tt>], but its [http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#65 only implementation] is <tt>nsSampleWordBreaker</tt>. | ||
** There are some other files. There are several line breakers in the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/ src] directory. There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/rulebrk.h another word breaker], but it's for Thai text only. There's a smattering of files related to [http://en.wikipedia.org/wiki/JIS_encoding JIS encoding]. | ** There are some other files. There are several line breakers in the [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/ src] directory. There's [http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/rulebrk.h another word breaker], but it's for Thai text only. There's a smattering of files related to [http://en.wikipedia.org/wiki/JIS_encoding JIS encoding]. |