Rkentjames (talk | contribs) |
Another is, what do you do with messages that are hard to categorize? One example is a forward sent to me by my father-in-law of some cutesy stuff that I would consider annoying at best if received unsolicited from a stranger. Another example: a forward of a hoax to a long list of family members. Do I mark it as spam (and thereby increase the spam weighting of my entire extended family on the CC list)? Or do I mark it as ham (and encourage acceptance of such junk in my filter)? I think that any display of uncertain emails requires some way to remove the email other than training it as ham or spam.
=== Bayes data refactoring ===
The main goals of this will be to:
# reduce the memory footprint of Bayes
# convert the external storage to an SQLite format
# support multiple feature types in counts
# support saving detailed data to assist in understanding per-message Bayes performance
The main concepts are:
#Have separate token management for corpus and message
#Convert storage of training.dat into an SQLite database
#Store string values in the SQLite database; store only hash keys in memory.
#Unify storage of corpus counts per token. Store counts as "per feature" with Junk and Good as first two supported features.
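To make the storage concepts above concrete, here is a rough sketch in Python (chosen only for brevity; the real filter is not written in Python). Everything named here is an assumption for illustration: the table names <code>tokens</code> and <code>counts</code>, the feature ids <code>GOOD</code> and <code>JUNK</code>, and the use of CRC32 as the hash. It keeps token strings only in SQLite, holds only hash keys in memory, and stores counts per (token, feature) pair.

```python
import sqlite3
import zlib

# Hypothetical feature ids; Good and Junk are the first two supported features.
GOOD, JUNK = 0, 1


def token_hash(text: str) -> int:
    """Stable 32-bit hash of a token string (CRC32 is an illustrative choice)."""
    return zlib.crc32(text.encode("utf-8"))


class CorpusStore:
    """Corpus counts in SQLite; only hash keys are held in memory."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.executescript(
            """
            CREATE TABLE IF NOT EXISTS tokens (
                hash INTEGER PRIMARY KEY,
                text TEXT NOT NULL
            );
            CREATE TABLE IF NOT EXISTS counts (
                hash    INTEGER NOT NULL,
                feature INTEGER NOT NULL,
                count   INTEGER NOT NULL DEFAULT 0,
                PRIMARY KEY (hash, feature)
            );
            """
        )
        # In-memory index: integers only, never the token strings.
        self.known = {h for (h,) in self.db.execute("SELECT hash FROM tokens")}

    def train(self, tokens, feature: int) -> None:
        """Add one occurrence per token to the given feature's count."""
        for text in tokens:
            h = token_hash(text)
            if h not in self.known:
                # Hash collisions are ignored in this sketch.
                self.db.execute(
                    "INSERT OR IGNORE INTO tokens (hash, text) VALUES (?, ?)",
                    (h, text),
                )
                self.known.add(h)
            self.db.execute(
                "INSERT OR IGNORE INTO counts (hash, feature, count) VALUES (?, ?, 0)",
                (h, feature),
            )
            self.db.execute(
                "UPDATE counts SET count = count + 1 WHERE hash = ? AND feature = ?",
                (h, feature),
            )
        self.db.commit()

    def count(self, text: str, feature: int) -> int:
        row = self.db.execute(
            "SELECT count FROM counts WHERE hash = ? AND feature = ?",
            (token_hash(text), feature),
        ).fetchone()
        return row[0] if row else 0

    def text_for(self, h: int):
        """Recover the string for a hash, e.g. for a detail display."""
        row = self.db.execute("SELECT text FROM tokens WHERE hash = ?", (h,)).fetchone()
        return row[0] if row else None
```

With this layout, adding a third feature type would only mean using a new feature id; the schema itself would not change.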
So how might I divide that into separate bugs?
#Split tokenizer into Message and Corpus versions
#Support detail interface to message classification
#Create an SQLite database that parallels training.dat
#Convert local hash functions to use saved hash function results instead of using strings directly
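As a rough illustration of the first two bugs, the following Python sketch separates per-message tokens from corpus tokens and exposes a per-token detail view. All class and field names (<code>MessageTokens</code>, <code>CorpusTokens</code>, <code>TokenDetail</code>) are hypothetical, the tokenizer is a plain whitespace split, and <code>junk_ratio</code> is a simple frequency ratio used as a stand-in, not the probability formula the actual filter uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TokenDetail:
    """One row of the hypothetical per-message detail view."""
    token: str
    good: int
    junk: int
    junk_ratio: float  # junk / (good + junk); 0.5 when the token is unseen


class MessageTokens:
    """Short-lived tokens for a single message; holds no corpus counts."""

    def __init__(self, body: str):
        # Naive whitespace tokenizer, for illustration only.
        self.counts = Counter(body.lower().split())


class CorpusTokens:
    """Long-lived training counts, kept apart from message tokens."""

    def __init__(self):
        self.good = Counter()
        self.junk = Counter()

    def train(self, message: MessageTokens, is_junk: bool) -> None:
        # Count each distinct token once per trained message.
        target = self.junk if is_junk else self.good
        target.update(message.counts.keys())

    def details(self, message: MessageTokens) -> list:
        """Explain a message token by token against the corpus."""
        rows = []
        for tok in sorted(message.counts):
            g, j = self.good[tok], self.junk[tok]
            total = g + j
            ratio = j / total if total else 0.5
            rows.append(TokenDetail(tok, g, j, ratio))
        return rows
```

Keeping the two classes apart means that classifying a message never has to load or modify the corpus token table, which is where the memory savings would be expected to come from.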