
Refer to the links below for more information on the mathematical basis of Bayesian filtering.

This same technique has been adapted by GFI MailEssentials to identify and classify spam. If a snippet of text frequently occurs in spam emails but not in legitimate emails, it is reasonable to assume that an email containing it is probably spam.

Creating a tailor-made Bayesian word database

Before Bayesian filtering can be used, a database of words and tokens (for example the $ sign, IP addresses and domains) must be created. This can be collected from a sample of spam email and valid email (referred to as ‘ham’).

A probability value is then assigned to each word or token, based on calculations that account for how often that word occurs in spam as opposed to ham. This is done by analyzing the users' outbound email and known spam: all the words and tokens in both pools of email are analyzed to generate the probability that a particular word points to the email being spam.

This probability is calculated as in the following example: if the word ‘mortgage’ occurs in 400 out of 3,000 spam emails and in 5 out of 300 legitimate emails, then its spam probability would be 0.8889.

The analysis of ham email is performed on the company's own email and is therefore tailored to that particular company. Example: a financial institution might use the word ‘mortgage’ many times and would get many false positives if it used a general anti-spam rule set. On the other hand, if the Bayesian filter is tailored to your company through an initial training period, it takes note of the company's valid outbound email (and recognizes ‘mortgage’ as being frequently used in legitimate messages), so it will have a much better spam detection rate and a far lower false positive rate.

Besides ham email, the Bayesian filter also relies on a spam data file. This spam data file must include a large sample of known spam. In addition, it must be constantly updated with the latest spam by the anti-spam software. This ensures that the Bayesian filter is aware of the latest spam trends, resulting in a high spam detection rate.

Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use. On arrival, a new email is broken down into words, and the most relevant words (those that are most significant in identifying whether the email is spam or not) are identified. Using these words, the Bayesian filter calculates the probability of the new message being spam. If the probability is greater than a threshold, the message is classified as spam.
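The ‘mortgage’ figure and the classification steps described in this section can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not GFI MailEssentials' actual implementation. The per-word formula is the one that reproduces the 0.8889 figure from the example counts. The way per-word probabilities are combined into a message score (a naive Bayes style product), the choice of "most relevant" words as those furthest from 0.5, the cutoff of 15 words and the 0.9 threshold are all assumptions made for illustration, and every function name is hypothetical.

```python
from math import prod


def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one word from its frequency in each pool.

    Assumes the word occurs at least once in one of the two pools,
    otherwise the denominator is zero.
    """
    spam_rate = spam_hits / spam_total   # share of spam emails containing the word
    ham_rate = ham_hits / ham_total      # share of ham emails containing the word
    return spam_rate / (spam_rate + ham_rate)


def most_relevant(word_probs, n=15):
    """Keep the n probabilities furthest from the neutral value 0.5.

    Assumption: 'most relevant' means most strongly spam or ham.
    """
    return sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def message_spam_probability(word_probs):
    """Combine per-word probabilities into one score for the message.

    Assumption: a naive Bayes style combination that treats the words
    as independent. The source does not state the rule actually used.
    """
    spam_product = prod(word_probs)
    ham_product = prod(1.0 - p for p in word_probs)
    return spam_product / (spam_product + ham_product)


def is_spam(word_probs, threshold=0.9):
    """Classify as spam when the combined score exceeds the threshold."""
    return message_spam_probability(most_relevant(word_probs)) > threshold


# The 'mortgage' example: 400 of 3,000 spam emails, 5 of 300 ham emails.
mortgage = word_spam_probability(400, 3000, 5, 300)
print(round(mortgage, 4))  # 0.8889
```

In this sketch, a company whose legitimate email uses ‘mortgage’ often would have a larger ham count for that word, which lowers its spam probability and with it the chance of a false positive.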
