(Caveat: This Blog post is adapted from Chapter 4 of 'Machine Learning in R' by Lantz (2013)
In this Blog post we will explain the probabilistic learning algorithm,
Naive Bayes, which is widely used in ‘Big Data’ projects in the current climate although it has been used for predicting the weather, among other things, for quite some time. Naïve Bayes also uses principles of probability for classification. Just as meteorologists forecast weather, naive Bayes uses data about prior events to estimate the probability of future events. For instance, a common application of naive Bayes uses the frequency of words in past junk email messages to identify new junk mail and we will be guiding you through just such an example later.
Typically, Bayesian methods utilize all available evidence to subtly change the predictions.
Bayesian probability theory is rooted in the idea that the estimated likelihood of an event should be based on the evidence at hand. Key words are: ‘Events’ and ‘Trial’. Events are possible outcomes while a trial is a single opportunity for the event to occur, such as a coin flip or day’s weather.
Again, we will be talking the reader through an abbreviated step by step example as suggested by Lantz (2013) in the next Blog.
“Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each class based on feature values. When the classifier is used later, on unlabelled data, it uses the observed probabilities to predict the most likely class for the new features” Lantz, (2013).
Lantz, (2013) goes on to explain that; “Typically, Bayesian classifiers are best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome. While many algorithms ignore features that have weak effects, Bayesian methods utilize all available evidence to subtly change the predictions. If a large number of features, have relatively minor effects, taken together their combined impact could be quite large.”
Typical problems tackled using Bayseian methods might be:
Text classification, such as junk email (spam) filtering, author identification, or topic categorization
Intrusion detection or anomaly detection in computer networks
Diagnosing medical conditions, when given a set of observed symptoms
The probability of an event can be estimated from observed data by dividing the number of trials in which an event occurred by the total number of trials. P(a)=0.20 for example (Probability of an event ‘A’ being 20%).
If the trial, mentioned above, has only two outcomes that cannot occur simultaneously then knowing the probability of one outcome reveals the other. Like heads or tails on a coin.
It is often helpful to imagine probability as a two-dimensional space that is partitioned into event probabilities for events.
In the following diagram, the rectangle represents the set of all possible outcomes for an email message. The circle represents the probability that the message is spam. The remaining 80 percent represents the messages that are not spa
Often, we are interested in monitoring several non-mutually exclusive events for the same trial. If some events occur with the event of interest, we may be able to use them to make predictions. Consider, for instance, a second event based on the outcome that the email message contains the word Viagra. For most people, this word is only likely to appear in a spam message; its presence in a message is therefore a very strong piece of evidence that the email is spam. The preceding diagram, updated for this second event, might appear as shown in the following diagram:
But not all Viagra mentions in e-mails are spam and not all e-mails contain the word Viagra hence we should view these other possibilities, as seen in the Venn diagram below, which serves as a reminder to allocate probability to all possible combinations of events:
We might know that 20% of e-mail messages were without the word Viagra (left circle) and 5% appearing in ham (None Spam) on the right so we have to work out the overlap of P(Spam) and P(Viagra) occurring. This depends on how the joint probability of one event occurring and how it relates to the other event occurring. If the two events are totally unrelated then they are independent variables. Dependent events form the basis of predictive modelling, like the presence of clouds on a rainy day or the word Viagra predicting Spam e-mail.
With the knowledge that P(spam) and P(Viagra) were independent, we could then easily calculate P(spam ∩ Viagra); the probability of both events happening at the same time. Because 20 percent of all messages are spam, and 5 percent of all emails contain the word Viagra, we could assume that 5 percent of 20 percent (0.05 * 0.20 = 0.01), or 1 percent of all messages are spam containing the word Viagra. More generally, for independent events A and B, the probability of both happening is P(A ∩ B) = P(A) * P(B). It is far more likely that P(spam) and P(Viagra) are highly dependent, which means that this calculation is incorrect. We need to use a more careful formulation of the relationship between these two events.
Conditional Theory with Bayes Theorem
The relationships between dependent events can be described using Bayes' theorem, as shown in the following formula. The notation P(A|B) can be read as the probability of event A given that event B occurred. This is known as conditional probability, since the probability of A is dependent (that is, conditional) on what happened with event B.
To understand how Bayes' theorem works in practice, suppose that you were tasked with guessing the probability that an incoming email was spam. Without any additional evidence, the most reasonable guess would be the probability that any prior message was spam (that is, 20 percent in the preceding example). This estimate is known as the prior probability.
Now, also suppose that you obtained an additional piece of evidence; you were told that the incoming message used the term Viagra. The probability that the word Viagra was used in previous spam messages is called the likelihood and the probability that Viagra appeared in any message at all is known as the marginal likelihood. By applying Bayes' theorem to this evidence, we can compute a posterior probability that measures how likely the message is to be spam. If the posterior probability is greater than 50 percent, the message is more likely to be spam than ham, and it should be filtered. The following formula is the Bayes' theorem for the given evidence:
To calculate the components of Bayes' theorem, we must construct a frequency table (shown on the left in the following diagram) that records the number of times Viagra appeared in spam and ham messages. Just like a two-way cross-tabulation, one dimension of the table indicates levels of the class variable (spam or ham), while the other dimension indicates levels for features (Viagra: yes or no). The cells then indicate the number of instances having the particular combination of class value and feature value. The frequency table can then be used to construct a likelihood table, as shown on right in the following diagram:
The likelihood table reveals that P(Viagra|spam) = 4/20 = 0.20, indicating that the probability is 20 percent that a spam message contains the term Viagra. Additionally, since the theorem says that P(B|A) * P(A) = P(A ∩ B), we can calculate P(spam ∩ Viagra) as P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04. This is four times greater than the previous estimate under the faulty independence assumption illustrating the importance of Bayes' theorem when calculating joint probability.
To compute the posterior probability, P(spam|Viagra), we simply take P(Viagra|spam) * P(spam) / P(Viagra), or (4/20) * (20/100) / (5/100) = 0.80. Therefore, the probability is 80 percent that a message is spam, given that it contains the word Viagra. Therefore, any message containing this term should be filtered.
The Naive Bayes Algorithm
The Naive Bayes algorithm is named as such because it makes a couple of "naive" assumptions about the data. In particular, Naive Bayes assumes that all of the features in the dataset are equally important and independent. These assumptions are rarely true in most of the real-world applications. For example, if you were attempting to identify spam by monitoring email messages, it is almost certainly true that some features will be more important than others. For example, the sender of the email may be a more important indicator of spam than the message text. Additionally, the words that appear in the message body are not independent from one another, since the appearance of some words is a very good indication that other words are also likely to appear. A message with the word Viagra is probably likely to also contain the words prescription or drugs.
However, in most cases when these assumptions are violated, Naive Bayes still performs fairly well. Adding more words to monitor and classify can be heavy on memory so work becomes much easier if we can exploit the fact that Naive Bayes assumes independence among events. Specifically, Naive Bayes assumes class-conditional independence, which means that events are independent so long as they are conditioned on the same class value. Assuming conditional independence allows us to simplify the formula using the probability rule for independent events, which you may recall is P(A ∩ B) = P(A) * P(B). These results in a much easier-to-compute formulation, shown as follows:
Theorem Formula_1 (For Spam)
The result of this formula should be compared to the probability that the message is ham:
Naive Bayes Theorem Formula_2 (For Ham)
Using the values in the likelihood table, we can start filling numbers in these equations. Because the denominator is the same in both cases, it can be ignored for now. The overall likelihood of spam is then:
Bayes_Theorem_Formula_3 (Overall Likelihood of Spam)
While the likelihood of ham given this pattern of words is:
_Bayes_Theorem_Formula_4 (Overall Likelihood of Ham)
Because 0.012 / 0.002 = 6, we can say that this message is six times more likely to be spam than ham. However, to convert these numbers to probabilities, we need one last step.
The probability of spam is equal to the likelihood that the message is spam divided by the likelihood that the message is either spam or ham:
Similarly, the probability of ham is equal to the likelihood that the message is ham divided by the likelihood that the message is either spam or ham:
Given the pattern of words in the message, we expect that the message is spam with 85.7 percent probability, and ham with 14.3 percent probability. Because these are mutually exclusive and exhaustive events, the probabilities sum up to one. The Naive Bayes classification algorithm we used in the preceding example can be summarized by the following formula. Essentially, the probability of level L for class C, given the evidence provided by features F1 through Fn, is equal to the product of the probabilities of each piece of evidence conditioned on the class level, the prior probability of the class level, and a scaling factor 1 / Z, which converts the result to a probability:
Sometimes when we add more features, or in this case words to the list of words to filter for the Laplace estimator is required to correct anomalies that may occur in the data. However we will not venture into this area just right now.
Using Numeric Features with Naive Bayes
Naïve Bayes is perfect for looking at categorical data but sometimes we desire to include some numbers but how does the algorithm deal with this? We can discretize numeric features, which simply means that the numbers are put into categories known as bins. This method is ideal when there are large amounts of training data, a common condition when working with naive Bayes.
There are several different ways to discretize a numeric feature. Perhaps the most common is to explore the data for natural categories or cut points in the distribution of data. For example, suppose that you added a feature to the spam dataset that recorded the time of night or day the email was sent, from 0 to 24 hours past midnight. The following histogram diagram below shows how one might divide this into four ‘nominal’ bins of data.
Alternatively, if cut points aren’t obvious one could divide the numeric data in quartiles but the more you break down the data the more of the original information you lose as the granularity gets smaller.
Watch out for the next Blog for a step by step exercise from ‘Machine Learning with R’ by Lantz, (2013). This exercise will aim to filter out unwanted mobile text messages using Naive Bayes.