Getting to Grips with Naïve Bayes – A Step by Step Exercise in Spam Vs Ham from Mobile Text Messages
Step 1 – Collect the Data
To develop the naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.
This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labelled spam, while legitimate messages are labelled ham.
The following is a sample of ‘ham’ messages:
Better. Made up for Friday and stuffed myself like a pig yesterday. Now I feel
bleh. But at least it’s not writhing pain kind of bleh.
If he started searching, he will get job in few days. He has great potential
I got another job! The one at the hospital doing data analysis or something, starts
on Monday! Not sure when my thesis will get finished.
The following is a sample of ‘spam’ messages:
Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free
entry 2 100 wkly draw txt MUSIC to 87066
December only! Had your mobile 11mths+? You are entitled to update to the latest
colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906
Valentines Day Special! Win over £1000 in our quiz and take your partner on the
trip of a lifetime! Send GO to 83600 now. 150p/msg rcvd.
Looking at the preceding sample messages, some distinguishing characteristics stand out. Two of the three spam messages use the word ‘free’, which appears in none of the ham messages. On the other hand, two of the ham messages mention specific days of the week, while none of the spam messages do.
Our naive Bayes classifier will take advantage of such patterns in word frequency to determine whether an SMS message better fits the profile of spam or ham. While it's not inconceivable that the word "free" would appear outside of a spam SMS, a legitimate message is likely to include additional words that provide context. For instance, a ham message might state "are you free on Sunday?", whereas a spam message might use the phrase "free ringtones." The classifier will compute the probability of spam and ham given the evidence provided by all the words in the message.
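To make the idea concrete, here is a minimal sketch of how evidence from several words might be combined. All the numbers (the prior and the per-word probabilities) are made up for illustration and are not estimated from the dataset, and the helper name posterior is hypothetical. For simplicity the sketch only uses the words that are present in a message and ignores the words that are absent.

```r
# Hypothetical numbers for illustration only; not taken from the SMS data.
prior <- c(spam = 0.2, ham = 0.8)

# Assumed P(word appears | class) for two illustrative words.
likelihood <- rbind(
  free   = c(spam = 0.30, ham = 0.02),
  sunday = c(spam = 0.01, ham = 0.10)
)

# Naive Bayes treats words as independent given the class, so the
# per-word likelihoods are multiplied together with the prior and
# the result is rescaled so that the two classes sum to 1.
posterior <- function(words) {
  unnormalised <- prior * apply(likelihood[words, , drop = FALSE], 2, prod)
  unnormalised / sum(unnormalised)
}

posterior("free")                # "free" alone leans towards spam
posterior(c("free", "sunday"))   # added context pulls it back towards ham
```

With these assumed values, "free" on its own points to spam, but adding "sunday" shifts most of the probability to ham, which mirrors the "are you free on Sunday?" example.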
Step 2 – Explore and Prepare the Data
Text data are challenging to prepare because the words and sentences must be transformed into a form that a computer can understand. We will transform our data into a representation known as bag-of-words, which ignores the order in which words appear and simply provides a variable indicating whether each word appears at all.
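As a rough illustration of the bag-of-words idea, the following sketch builds a small word-presence matrix using base R only. The two messages are invented for the example, and this is not the tooling used for the real data; it is just meant to show what the representation looks like.

```r
# Toy messages invented for illustration; they are not from the dataset.
messages <- c("are you free on sunday", "free ringtones free entry")

# Split each message into lower-case words on whitespace.
tokens <- strsplit(tolower(messages), "\\s+")

# The vocabulary is every distinct word seen in any message.
vocab <- sort(unique(unlist(tokens)))

# One row per message, one column per word: 1 if the word appears, else 0.
# Word order and repeat counts are discarded.
bag <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(bag) <- vocab

bag
```

Note that the second message contains "free" twice, yet its entry is still 1, because this representation records only whether a word appears.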
The data can be obtained from the Packt Publishing website and should be downloaded to your R working directory. A purchase may be required first, as the original dataset has been modified for use with R. We will read the file into a data frame called sms_raw:
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
Looking at the structure of sms_raw, we find 5559 observations (one per message) and two features: type, which is either ham or spam, and text, which contains the unstructured SMS message.
The type variable is currently a character vector. Since this is a categorical variable, it would be better to convert it to a factor, as shown in the following code:
sms_raw$type <- factor(sms_raw$type)
Examining the type variable with the str() and table() functions, we see that the variable has now been appropriately recoded as a factor. Additionally, we see that 747 (or about 13 percent) of the SMS messages in our data were labelled spam, while the remainder were labelled ham:
Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
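The recoding and inspection steps can be tried without the CSV file. The sketch below uses a tiny stand-in data frame, sms_toy, whose name and contents are invented for illustration; on the real data the same calls would presumably be applied to sms_raw$type.

```r
# Toy stand-in for sms_raw, invented for illustration.
sms_toy <- data.frame(
  type = c("ham", "ham", "spam", "ham"),
  text = c("see you monday", "ok thanks", "win free entry now", "running late"),
  stringsAsFactors = FALSE
)

# Convert the character vector to a factor (a categorical variable).
sms_toy$type <- factor(sms_toy$type)

str(sms_toy$type)     # structure of the recoded variable
table(sms_toy$type)   # number of messages per class

# Share of each class in percent, rounded to one decimal place.
round(prop.table(table(sms_toy$type)) * 100, 1)
```

The prop.table() call is an optional extra that turns the counts into proportions, which is one way to arrive at a figure such as "about 13 percent".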
Data preparation – Processing text data for analysis
We need a powerful set of tools to process text data.
SMS messages are strings of text composed of words, spaces, numbers, and punc