How does Bayesian filtering work?
First of all, the email needs to be tokenized: the message in the email body is separated into small parts (tokens). After the message has been tokenized, the next step is to map the tokens into the dictionary table, also known as the frequency table. In this frequency table, the number of occurrences of each word is counted. Then, the probability that the email is spam is calculated using Bayes' theorem, by scoring whether each word or token points to spam or non-spam.
The final step is to adjust the values of the tokens in the dictionary, for example by setting a threshold level and removing the least frequent items. This process works best when a binary (spam / non-spam) verdict is required, but even when a binary result is not needed, the filter can still produce the probability of a bulk mail being spam. This probability can be used in many ways, but most Bayesian filters implemented today are based on this convention: messages scoring under 0.5 are judged non-spam, while messages scoring above 0.5 (i.e. in the 0.5–1 range) are judged possible spam.
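As a minimal sketch of the pipeline above, assuming a small hand-built frequency table (the table contents, counts, and function names here are illustrative, not from a real filter):

```python
import re

# Hypothetical frequency table: token -> (occurrences in spam, occurrences in ham)
FREQ_TABLE = {
    "winner": (40, 2),
    "meeting": (1, 30),
    "free": (35, 5),
}
SPAM_TOTAL, HAM_TOTAL = 100, 100  # number of spam / legitimate training e-mails

def tokenize(body):
    # Step 1: separate the message body into small parts (tokens)
    return re.findall(r"[a-z']+", body.lower())

def spam_probability(body):
    # Steps 2-3: look each token up in the frequency table and apply Bayes' theorem
    total = SPAM_TOTAL + HAM_TOTAL
    p_spam, p_ham = SPAM_TOTAL / total, HAM_TOTAL / total
    for token in tokenize(body):
        if token in FREQ_TABLE:
            n_spam, n_ham = FREQ_TABLE[token]
            p_spam *= n_spam / SPAM_TOTAL
            p_ham *= n_ham / HAM_TOTAL
    return p_spam / (p_spam + p_ham)  # normalized probability that the mail is spam

# Step 4: threshold the score at 0.5 -- above 0.5 means possible spam
is_spam = spam_probability("free winner") > 0.5
```

In practice the frequency table would be built from a training corpus rather than written by hand, and the per-token probabilities are usually multiplied in log space to avoid underflow on long messages.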
How to extract messages using N-grams?
In the N-gram extraction approach, frequent tokens such as N-word phrases are extracted for use in corpus training. Let (g1, g2, ..., gL) be the ordered list (in decreasing frequency) of the L most frequent n-grams of the training corpus. Then, each message is represented as a vector of length L, <x1, x2, ..., xL>, where xi depends on gi. Two text representation approaches are used in the N-gram process:
1. Binary: the value of xi is one if gi is included at least once in the message, or zero if gi is not included in the message.
2. Term Frequency (TF): the value of xi corresponds to the frequency of occurrence of gi in the message, normalized by the message length.
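The two representations above can be sketched as follows. This example uses character n-grams for compactness (word n-grams work the same way); all names are illustrative:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # All overlapping character n-grams of a message
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def extract_top_ngrams(corpus, n=3, L=5):
    # (g1, g2, ..., gL): the L most frequent n-grams of the training corpus,
    # ordered by decreasing frequency
    counts = Counter()
    for message in corpus:
        counts.update(char_ngrams(message, n))
    return [g for g, _ in counts.most_common(L)]

def binary_vector(message, grams):
    # Binary: xi = 1 if gi occurs at least once in the message, else 0
    present = set(char_ngrams(message, len(grams[0])))
    return [1 if g in present else 0 for g in grams]

def tf_vector(message, grams):
    # TF: xi = frequency of gi in the message, normalized by message length
    ngrams = char_ngrams(message, len(grams[0]))
    counts = Counter(ngrams)
    return [counts[g] / max(len(ngrams), 1) for g in grams]
```

A message is then classified by feeding its vector <x1, ..., xL> to the Bayesian scorer instead of raw word tokens.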
How to test the filter?
To test the filter, the probability is first calculated as described below in figure 1, and according to its result the records in the token dictionary are modified. Each probability value is initialized to one, to cover the case where none of the words match the token dictionary. ALL (the number of all e-mails) = SPAM + HAM (the number of spam letters plus the number of legitimate letters).
-----------------------------------------------------------------------------------------------------------------
- Let us call a word a “matching word” if the word exists both in the letter and in the token dictionary.
- P (“matching words” | “letter is spam”) = product over all matching words of (N1 value of the current word / SPAM).
- P (“matching words” | “letter is legitimate”) = product over all matching words of (N2 value of the current word / HAM).
- P (“letter is spam”) = SPAM/ ALL.
- P (“letter is legitimate”) = HAM/ ALL.
- P (“letter is spam” | “matching words”) = P (“letter is spam”) * P (“matching words” | “letter is spam”).
- P (“letter is legitimate” | “matching words”) = P (“letter is legitimate”) * P (“matching words” | “letter is legitimate”).
- Final result: P (“letter is spam” | “matching words”) / P (“letter is legitimate” | “matching words”).
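The calculation above translates almost line for line into code. A minimal sketch, assuming a token dictionary that stores the N1 (spam) and N2 (ham) counts per word; the dictionary contents and names are illustrative:

```python
# Hypothetical token dictionary: word -> (N1: occurrences in spam, N2: occurrences in ham)
TOKENS = {"viagra": (50, 1), "invoice": (5, 20)}
SPAM, HAM = 100, 200        # counts of spam / legitimate training letters
ALL = SPAM + HAM            # ALL (no. of all e-mails) = SPAM + HAM

def spam_ratio(words):
    # Initialize to one, for the case where no word matches the token dictionary
    p_words_given_spam = 1.0
    p_words_given_ham = 1.0
    for word in words:
        if word in TOKENS:  # a "matching word": in both the letter and the dictionary
            n1, n2 = TOKENS[word]
            p_words_given_spam *= n1 / SPAM   # N1 value / SPAM
            p_words_given_ham *= n2 / HAM     # N2 value / HAM
    p_spam_given_words = (SPAM / ALL) * p_words_given_spam
    p_ham_given_words = (HAM / ALL) * p_words_given_ham
    # Final result: the ratio of the two posteriors; a value above 1 suggests spam
    return p_spam_given_words / p_ham_given_words
```

After each classified letter, the N1/N2 counts in the dictionary would be updated with the letter's tokens, which is the "modify the records" step described above.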
For more information about the coding, kindly email me or message me. Cheers =)