User:Rkentjames:Spam: Difference between revisions

Revision as of 18:27, 21 August 2008

927 bytes added , 21 August 2008

→‎Spam Management

Rkentjames

@@ Line 6: / Line 6: @@
 Another is, what do you do with messages that are hard to categorize? One example is a forward sent to me by my father-in-law of some cutesy stuff that I would consider annoying at best if I received it unsolicited from a stranger. Another example: a forward of a hoax to a long list of family members. Do I mark it as spam (and thereby increase the spam weighting of my entire extended family on the CC list)? Or do I mark it as ham (and encourage acceptance of such junk in my filter)? I think that any display of uncertain emails requires some way to remove the email other than training it as ham or spam.
+=== Bayes data refactoring ===
+The main goals of this refactoring are to:
+# reduce the memory footprint of Bayes
+# convert the external storage to an SQLite format
+# support multiple feature types in counts
+# support saving detailed data to assist in understanding per-message Bayes performance.
+The main concepts are:
+# Manage tokens separately for the corpus and for individual messages
+# Convert storage of training.dat into an SQLite database
+# Store string values in the SQLite database; store only hash keys in memory
+# Unify storage of corpus counts per token. Store counts per feature, with Junk and Good as the first two supported features
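One possible shape for this storage, as a minimal sketch in Python rather than the filter's own code. The table names, column names, the `token_hash` helper, and the choice of CRC32 as the hash are all assumptions made for illustration; nothing here reflects the actual layout of training.dat.

```python
import sqlite3
import zlib


def token_hash(token: str) -> int:
    # Hypothetical hash function (CRC32 of the UTF-8 bytes).
    # The real filter may use a different hash entirely.
    return zlib.crc32(token.encode("utf-8"))


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Token strings live only in the database; counts are stored
    # one row per (token, feature) instead of fixed junk/good columns.
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS tokens (
            hash INTEGER PRIMARY KEY,
            text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS counts (
            hash    INTEGER NOT NULL REFERENCES tokens(hash),
            feature TEXT    NOT NULL,
            count   INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (hash, feature)
        );
        """
    )
    return db


def train(db, memory_counts, tokens, feature):
    # memory_counts maps (hash, feature) -> count, so no strings are
    # held in memory after training. Hash collisions are ignored here:
    # the first string seen for a hash is the one that gets stored.
    for token in tokens:
        h = token_hash(token)
        db.execute(
            "INSERT OR IGNORE INTO tokens (hash, text) VALUES (?, ?)",
            (h, token),
        )
        db.execute(
            "INSERT OR IGNORE INTO counts (hash, feature, count) VALUES (?, ?, 0)",
            (h, feature),
        )
        db.execute(
            "UPDATE counts SET count = count + 1 WHERE hash = ? AND feature = ?",
            (h, feature),
        )
        key = (h, feature)
        memory_counts[key] = memory_counts.get(key, 0) + 1
    db.commit()
```

Keeping the feature name as a column value means a third feature type could be added later without a schema change, which is the point of storing counts per feature.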
+So how might I divide that into separate bugs?
+# Split tokenizer into Message and Corpus versions
+# Support detail interface to message classification
+# Create an SQLite database that parallels training.dat
+# Convert local hash functions to use saved hash function results instead of strings directly
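To make the first two items more concrete, here is a rough sketch in Python of how a message-side token set (which keeps its strings) might query a corpus-side table (which keeps only hash keys) and report per-token detail. The class names, the `TokenDetail` record, the CRC32 hash, and the simple junk/good ratio are all assumptions for illustration; the ratio in particular is a stand-in and not the probability formula the filter actually uses.

```python
import zlib
from dataclasses import dataclass


def token_hash(token: str) -> int:
    # Hypothetical hash function; the real one may differ.
    return zlib.crc32(token.encode("utf-8"))


@dataclass
class TokenDetail:
    text: str
    junk_count: int
    good_count: int
    junk_probability: float


class CorpusTokens:
    """Corpus side: only hash keys and per-feature counts in memory."""

    def __init__(self):
        self.counts = {}  # hash -> {"junk": n, "good": n}
        self.messages = {"junk": 0, "good": 0}

    def train(self, tokens, feature):
        self.messages[feature] += 1
        for token in set(tokens):
            entry = self.counts.setdefault(
                token_hash(token), {"junk": 0, "good": 0}
            )
            entry[feature] += 1


class MessageTokens:
    """Message side: keeps the strings so detail can be shown per token."""

    def __init__(self, text):
        self.tokens = sorted(set(text.lower().split()))

    def details(self, corpus):
        out = []
        for token in self.tokens:
            entry = corpus.counts.get(
                token_hash(token), {"junk": 0, "good": 0}
            )
            junk_rate = entry["junk"] / max(corpus.messages["junk"], 1)
            good_rate = entry["good"] / max(corpus.messages["good"], 1)
            total = junk_rate + good_rate
            # Simplified ratio; unseen tokens get a neutral 0.5.
            p = junk_rate / total if total else 0.5
            out.append(TokenDetail(token, entry["junk"], entry["good"], p))
        return out
```

With this split, only the message being classified carries strings in memory, and the list returned by `details` is the kind of per-message data that could be saved to help understand how a classification was reached.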