Getting to Grips with Naïve Bayes – A Step by Step Exercise in Spam Vs Ham from Mobile Text Messages

Step 1 – Collect the Data

To develop the naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.


This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labelled spam, while legitimate messages are labelled ham.

The following is a sample of ‘ham’ messages:


Better. Made up for Friday and stuffed myself like a pig yesterday. Now I feel bleh. But at least it’s not writhing pain kind of bleh.

If he started searching, he will get job in few days. He has great potential and talent.

I got another job! The one at the hospital doing data analysis or something, starts on Monday! Not sure when my thesis will get finished.


The following is a sample of ‘spam’ messages:

Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066

December only! Had your mobile 11mths+? You are entitled to update to the latest colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906

Valentines Day Special! Win over £1000 in our quiz and take your partner on the trip of a lifetime! Send GO to 83600 now. 150p/msg rcvd.


Looking at the preceding sample messages, there are some distinguishing characteristics of spam, such as the word ‘free’. Days of the week, on the other hand, are mentioned in the ham messages but not in the spam ones.


Our naive Bayes classifier will take advantage of such patterns in word frequency to determine whether an SMS message better fits the profile of spam or ham. While it's not inconceivable that the word "free" would appear outside of a spam SMS, a legitimate message is likely to supply additional words that provide context. For instance, a ham message might state "are you free on Sunday?", whereas a spam message might use the phrase "free ringtones." The classifier will compute the probability of spam and ham given the evidence provided by all the words in the message.
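For reference, this is the standard naive Bayes formulation: the words w_1 through w_n of a message are treated as conditionally independent given the class, so the posterior probability of spam is proportional to the prior probability of spam multiplied by the likelihood of each word given spam (an analogous expression is computed for ham, and the two are compared):

P(\text{spam} \mid w_1, \ldots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})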


Step 2 – Exploring and preparing the data

Text data are challenging to prepare because it is necessary to transform the words and sentences into a form that a computer can understand. We will transform our data into a representation known as bag-of-words, which ignores the order that words appear in and simply provides a variable indicating whether the word appears at all.

The data for this example can be found on the Packt Publishing website, from where the file should be downloaded to your R working directory/folder. It will likely be necessary to make a purchase first, as the original dataset has been modified to suit the R language. We will read the file into a data frame called sms_raw.


sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

Looking at the structure, we find that there are 5,559 observations and two features, namely type (ham or spam) and text, which contains the unstructured SMS messages.
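You can verify this with the str() function; the output should look roughly like the following (exact formatting may vary with your version of R, and the message text is truncated here):

> str(sms_raw)
'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" ...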


The type variable is currently a character vector. Since this is a categorical variable, it would be better to convert it to a factor, as shown in the following code:


sms_raw$type <- factor(sms_raw$type)


Examining the type variable with the str() and table() functions, we see that the variable has now been appropriately recoded as a factor. Additionally, we see that 747 (or about 13 percent) of SMS messages in our data were labeled spam, while the remainder were labeled ham:


> str(sms_raw$type)

Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...

> table(sms_raw$type)

ham spam

4812 747


Data preparation – Processing text data for analysis

We need a powerful set of tools to process text data.

SMS messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of complex data takes a large amount of thought and effort. One needs to consider how to remove numbers and punctuation, how to handle uninteresting words such as and, but, and or, and how to break apart sentences into individual words. Thankfully, this functionality has been provided by members of the R community in a text mining package titled tm.


The tm text mining package can be installed via the install.packages("tm") command and loaded with library(tm).
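If you have not already done so, the two commands are:

> install.packages("tm")
> library(tm)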


The first step in processing text data involves creating a corpus, which refers to a collection of text documents. In our project, a text document refers to a single SMS message. We'll build a corpus containing the SMS messages in the training data using the following command:


> sms_corpus <- Corpus(VectorSource(sms_raw$text))


This command uses two functions. First, the Corpus() function creates an R object to store text documents. This function takes a parameter specifying the format of the text documents to be loaded. Since we have already read the SMS messages and stored them in an R vector, we specify VectorSource(), which tells Corpus() to use the messages in the vector sms_raw$text. The result is assigned to an object named sms_corpus.

If we print() the corpus we just created, we will see that it contains documents for each of the 5,559 SMS messages in the training data:


> print(sms_corpus)

A corpus with 5559 text documents


To look at the contents of the corpus, we can use the inspect() function. By combining this with methods for accessing vectors, we can view specific SMS messages. The following command will view the first, second, and third SMS messages:


> inspect(sms_corpus[1:3])

[[1]]

Hope you are having a good week. Just checking in

[[2]]

K..give back my thanks.

[[3]]

Am also doing in cbe only. But have to pay.


The corpus now contains the raw text of 5,559 text messages. Before splitting the text into words, we will need to perform some common cleaning steps in order to remove punctuation and other characters that may clutter the result. For example, we would like to count hello!, HELLO..., and Hello as instances of the word hello.

The function tm_map() provides a method for transforming (that is, mapping) a tm corpus. We will use this to clean up our corpus using a series of transformation functions, and save the result in a new object called corpus_clean.


First, we will convert all of the SMS messages to lowercase and remove any numbers:


> corpus_clean <- tm_map(sms_corpus, tolower)

> corpus_clean <- tm_map(corpus_clean, removeNumbers)


A common practice when analyzing text data is to remove filler words such as to, and, but, and or. These are known as stop words. Rather than define a list of stop words ourselves, we will use the stopwords() function provided by the tm package, which returns a list of numerous stop words. To see them all, type stopwords() at the command line. As we did before, we'll use the tm_map() function to apply this function to the data:
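For instance, the first few entries of the default English list look like this (the exact contents depend on the lexicon bundled with your version of tm):

> head(stopwords())
[1] "i"      "me"     "my"     "myself" "we"     "our"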


corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())


We'll also remove punctuation:


corpus_clean <- tm_map(corpus_clean, removePunctuation)


Now that we have removed numbers, stop words, and punctuation, the text messages are left with blank spaces where these characters used to be. The last step then is to remove additional whitespace, leaving only a single space between words.



> corpus_clean <- tm_map(corpus_clean, stripWhitespace)


The following table shows the first three messages in the SMS corpus before and after the cleaning process. The messages have been limited to the most interesting words, and punctuation and capitalization have been removed:
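If you would like to check the cleaned messages yourself, you can inspect the first few documents of the cleaned corpus just as we did for the raw corpus (the exact output format depends on your version of tm):

> inspect(corpus_clean[1:3])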



When the dataset is finally to our liking, the last step is to split the messages into individual components through a process called tokenization. A token is a single element of a text string; in this case, the tokens are words.


The tm package provides functionality to tokenize the SMS message corpus. The DocumentTermMatrix() function will take a corpus and create a data structure called a sparse matrix, in which the rows of the matrix indicate documents (that is, SMS messages) and the columns indicate terms (that is, words). Each cell in the matrix stores a number indicating a count of the times the word indicated by the column appears in the document indicated by the row. The following screenshot illustrates only a small portion of the document term matrix for the SMS corpus, as the complete matrix has 5,559 rows and over 7,000 columns:




The fact that each cell in the table is zero implies that none of the words listed at the top of the columns appears in any of the first five messages in the corpus. This highlights the reason why this data structure is called a sparse matrix; the vast majority of cells in the matrix are filled with zeros. Although each message contains some words, the probability of any specific word appearing in a given message is small.


Creating a sparse matrix given a tm corpus involves a single command:


> sms_dtm <- DocumentTermMatrix(corpus_clean)


This will tokenize the corpus and return the sparse matrix with the name sms_dtm. From here, we'll be able to perform analyses involving word frequency.
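As a quick sanity check, you can confirm the dimensions of the matrix; the row count should match the 5,559 messages, while the exact column count (the number of distinct terms) may vary slightly depending on how the cleaning steps behaved on your system:

> # rows = SMS messages, columns = distinct terms
> dim(sms_dtm)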

However, as you can see from the output in R, there was an error at this point. A suggestion presented on Stack Overflow, shown below, appears to have fixed the error.



The lines were re-entered, starting with the command > corpus_clean <- tm_map(corpus_clean, PlainTextDocument), and the DocumentTermMatrix() command was then run again:
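As a sketch of that workaround (assuming the error is the 'inherits(doc, "TextDocument") is not TRUE' message that newer versions of tm raise after tm_map() has been called with a base function such as tolower()), the re-entered sequence looks like this:

> # Coerce the cleaned documents back into plain text documents,
> # then build the document-term matrix again
> corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
> sms_dtm <- DocumentTermMatrix(corpus_clean)

An alternative with newer versions of tm is to wrap base R functions in content_transformer() when calling tm_map(), for example tm_map(sms_corpus, content_transformer(tolower)), which avoids the issue in the first place.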



Data preparation – Creating Training and Test Datasets


Now that our data have been prepared for analysis, we need to split them into a training dataset and a test dataset so that the spam classifier can be evaluated on data it has not seen previously. We'll divide the data into two portions: 75 percent for training and 25 percent for testing. Since the SMS messages are sorted in a random order, we can simply take the first 4,169 for training and leave the remaining 1,390 for testing.

We'll begin by splitting the raw data frame:


> sms_raw_train <- sms_raw[1:4169, ]


> sms_raw_test <- sms_raw[4170:5559, ]


Then the document-term matrix:


> sms_dtm_train <- sms_dtm[1:4169, ]


> sms_dtm_test <- sms_dtm[4170:5559, ]


And finally, the corpus:


> sms_corpus_train <- corpus_clean[1:4169]


> sms_corpus_test <- corpus_clean[4170:5559]


To confirm that the subsets are representative of the complete set of SMS data, let's compare the proportion of spam in the training and test data frames:


> prop.table(table(sms_raw_train$type))


ham spam

0.8647158 0.1352842


> prop.table(table(sms_raw_test$type))


ham spam

0.8683453 0.1316547


Both the training data and test data contain about 13 percent spam. This suggests that the spam messages were divided evenly between the two datasets.


Visualizing Text Data – Word Clouds


A word cloud is a way to visually depict the frequency at which words appear in text data. The cloud is made up of words scattered somewhat randomly around the figure. Words appearing more often in the text are shown in a larger font, while less common terms are shown in smaller fonts. This type of figure has grown in popularity recently since it provides a way to observe trending topics on social media websites.

The wordcloud package provides a simple R function to create this type of diagram. We'll use it to visualize the types of words in SMS messages. Comparing the word clouds for spam and ham messages will help us gauge whether our naive Bayes spam filter is likely to be successful. If you haven't already done so, install the package by typing install.packages("wordcloud") and load the package by typing library(wordcloud) at the R command line.


A word cloud can be created directly from a tm corpus object using the syntax:


> wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)


This will create a word cloud from the sms_corpus_train corpus. Since we specified random.order = FALSE, the cloud will be arranged in non-random order, with higher-frequency words placed closer to the center. If we had not specified random.order, the cloud would have been arranged randomly by default. The min.freq parameter specifies the number of times a word must appear in the corpus before it will be displayed in the cloud. A general rule is to begin by setting min.freq to a number roughly 1 percent of the number of documents in the corpus; in this case, 1 percent of the 4,169 training messages is about 40. Therefore, words in the cloud must appear in at least 40 SMS messages.
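If you prefer not to hard-code the threshold, one option (a convenience tweak of my own, not part of the original instructions) is to derive min.freq from the size of the training corpus:

> # Set min.freq to roughly 1 percent of the number of training messages
> wordcloud(sms_corpus_train, min.freq = round(0.01 * length(sms_corpus_train)), random.order = FALSE)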

The instructions and resulting ‘word cloud’ can be seen below: