[Ryan Carlson, Blog Editor] Here at The Nerdery we like to tinker and create, even if it means we have to build our own playground and fill the sandbox with sand. Below is an expedition into sandbox to tinker with technology.
———————
In this post I will describe how to build a simple spam detector with python. Without going too deep into the theory (or half-heartedly regurgitating Wikipedia), I will give a high level overview of how a probabilistic classifier works. Then we will get into the business of training and classifying emails using the Enron spam/ham corpora, which contains several thousand emails that have been pre-categorized as spam or ham. We will use this data set both to train and validate our classifier.
I hope this post will show one interesting application of probabilistic classification, and also highlight the adaptability of the python language.
Overview of Probabilistic Classification
The classifier we will be building will use a technique called “supervised learning”. This means we will spend some time up-front training the classifier before we will trust it to accurately detect spam.
So how will we train this thing? Or, in other words, “what clues do we have that a message is spam?” The answer is actually more simple than you may be thinking – we will just use the individual words in the message. These words and their association with either spam or ham messages will form the basis of our classifier.
Once we have associated the various words with our two classifications (spam and ham), we can calculate the probability that a given word belongs to one label or another. For instance, the probability that the word “money” appears in a spam message is much higher than the probability it appears in a legitimate email. This begs the question, how do you calculate this probability?
This is actually pretty straightforward. Say we have trained our classifier using 200 documents, 100 are spam and 100 are ham. Now, suppose that the word “money” appears in 25 spam documents, but only 5 ham documents. The probability, then, that the word “money” indicates a spam document is calculated:
Probability that "money" is spam = (.25 * .5) / ((.25 * .5) + (.05 * .5)) = .83, or 83%.
Where did these numbers come from? The “.25″ and “.05″ are the percentage of documents containing the word money that are spam and ham respectively. The “.5″ is the interesting number and is the percentage of documents that are spam or ham. Since we have classified 100 of each, the total number of documents is 200, and it is overall 50% likely that a document is spam.
By combining the probabilities for all the words in a document, it is possible to get an overall view of the likelihood a document is either spam or ham.
If you are interested in reading a bit more, perhaps the best introduction to Bayes’ theorem is the Wikipedia introductory example – I strongly recommend you check it out. For a more thorough introduction I recommend reading the excellent post “An intuitive and short explanation of Bayes’ Theorem”.
Building the classifier
This classifier will be based in part on the classifier in Toby Segaran’s excellent book Programming Collective Intelligence. I recommend picking up a copy of this book! It is packed with useful information and interesting applications of machine learning algorithms.
I hope its clear from the previous section that what we are interested in storing is “counts” of things, because it is these “counts” that allow us to calculate percentages, which can be combined to give an overall probability.
Recollecting the example above, we have:
= ((25 / 100) * (100 / 200)) / (((25 / 100) * (5 / 100)) + ((5 / 100) * (100 / 200)))
= (.25 * .5) / ((.25 * .5) + (.05 * .5))
So we will need to store the following counts of things:
- how many documents we have seen (200)
- how many documents go in each label (100 each)
- how often a word is associated with each label (25 and 5)
Python provides the perfect datastructure for us – the dictionary. Dictionaries provide O(1) lookup, guaranteeing that when we look up or update the count of something it’s going to be as fast as possible:
# file: classify.py
class Classifier(object):
def __init__(self):
# ``defaultdict`` is an optimized dictionary-like object that
# allows us to specify a default value when a key is accessed
# that has not been previously set.
self.features = defaultdict(int)
self.labels = defaultdict(int)
self.feature_counts = defaultdict(lambda: defaultdict(int))
self.total_count = 0
Now we need to build the training method. This will simply update the counts of various items:
def train(self, features, labels):
# what labels are these features associated with?
for label in labels:
# update the count of each feature for the given label
for feature in features:
self.feature_counts[feature][label] += 1
self.features[feature] += 1
# update the count of documents associated with this label
self.labels[label] += 1
# update the total count of documents processed
self.total_count += 1
Believe it or not, the above is all the code we need to start training our classifier! Of course, we’re not done yet — we need to write the code to classify new documents. Let’s start plugging the training data into some methods we can use to classify documents. Looking back at the formula we used to calculate the likelihood “money” indicated a spam document, let’s try to generate that with python:
def feature_probability(self, feature, label):
# get the count of this feature in the given label, this would
# be "25" for "money"/"spam", or "5" for "money"/"ham"
feature_count = self.feature_counts[feature][label]
# get the count of documents with this label (e.g. 100)
label_count = self.labels[label]
if feature_count and label_count:
# divide by the count of all features in the given category
return float(feature_count) / label_count
return 0
def weighted_probability(self, feature, label, weight=1.0, ap=0.5):
# calculate the "initial" probability that the given feature will
# appear in the label -- this is .25 for "money"/"spam"
initial_prob = self.feature_probability(feature, label)
# sum the counts of this feature across all labels -- e.g.,
# how many times overall does the word "money" appear? (30)
feature_total = self.features[feature]
# calculate weighted avg -- this is slightly different than what
# we did in the above example and helps give a more evenly
# weighted result and prevents us returning "0"
return float((weight * ap) + (feature_total * initial_prob)) / (weight + feature_total)
The above “weighted_probability” function allows us to calculate the probability that a feature is associated with a given label. Now it will get more interesting as we will be calculating the probability that a set of features matches a label. To calculate this, simply multiply together all the probabilities of the individual features:
def document_probability(self, features, label):
# calculate the probability these features match the label
p = 1
for feature in features:
p *= self.weighted_probability(feature, label)
return p
The final step is to weight the probabilities of the individual features by the overall probability that a document has a given label.
def probability(self, features, label):
if not self.total_count:
# avoid doing a divide by zero
return 0
# calculate the probability that a document will have the given
# label -- in our example this is (100 / 200)
label_prob = float(self.labels[label]) / self.total_count
# get the probabilities of each feature for the given label
doc_prob = self.document_probability(features, label)
# weight the document probability by the label probability
return doc_prob * label_prob
Now we can write a method to classify a set of features. This will calculate the probability for each label (i.e., the probability for spam and ham) and then return them sorted so the best match is first:
def classify(self, features, limit=5):
# calculate the probability for each label
probs = {}
for label in self.labels.keys():
probs[label] = self.probability(features, label)
# sort the results so the highest probabilities come first
return sorted(probs.items(), key=lambda (k,v): v, reverse=True)[:limit]
That’s all there is to it. There are several steps — I hope you didn’t get bored. In the next section we will use this classifier to process data from Enron’s spam corpus.
Processing data from the Enron spam corpus
To begin with it will be necessary to download the spam corpora. This file contains 3 different collections of spam / ham emails from Enron and will be used to train and test the classifier.
Let’s start a new script called “enron.py” that will read the emails from the Enron corpora and train our classifier. The first function we write will read all the files in a given corpus and train the classifier. This is straightforward in python:
import os
# import our classifier, assumed to be in same directory
from classify import Classifier
def train(corpus='corpus'):
classifier = Classifier()
curdir = os.path.dirname(__file__)
# paths to spam and ham documents
spam_dir = os.path.join(curdir, corpus, 'spam')
ham_dir = os.path.join(curdir, corpus, 'ham')
# train the classifier with the spam documents
train_classifier(classifier, spam_dir, 'spam')
# train the classifier with the ham documents
train_classifier(classifier, ham_dir, 'ham')
def train_classifier(classifier, path, label):
for filename in os.listdir(path):
with open(os.path.join(path, filename)) as fh:
contents = fh.read()
# extract the words from the document
features = extract_features(contents)
# train the classifier to associate the features with the label
classifier.train(features, [label])
As you can see in the above code, we are calling a function “extract_features” to extract the words from the file contents.
def extract_features(s, min_len=2, max_len=20):
"""
Extract all the words in the string ``s`` that have a length within
the specified bounds
"""
words = []
for w in s.lower().split():
wlen = len(w)
if wlen > min_len and wlen < max_len:
words.append(w)
return words
After training the classifier, let’s test it on a different corpus. The following function should look pretty similar to the training code:
def test(classifier, corpus='corpus2'):
curdir = os.path.dirname(__file__)
# paths to spam and ham documents
spam_dir = os.path.join(curdir, corpus, 'spam')
ham_dir = os.path.join(curdir, corpus, 'ham')
correct = total = 0
for path, label in ((spam_dir, 'spam'), (ham_dir, 'ham')):
for filename in os.listdir(path):
with open(os.path.join(path, filename)) as fh:
contents = fh.read()
# extract the words from the document
features = extract_features(contents)
results = classifier.classify(features)
if results[0][0] == label:
correct += 1
total += 1
pct = 100 * (float(correct) / total)
print '[%s]: processed %s documents, %02f%% accurate' % (corpus, total, pct)
Let’s make it so that when we run our script from the command line it will train itself using “corpus” and will then test itself against the other 2 corpora:
if __name__ == '__main__':
classifier = train()
test(classifier, 'corpus2')
test(classifier, 'corpus3')
Here is the output I get from running the script:
$ python enron2.py
[corpus2]: processed 5175 documents, 90.318841% accurate
[corpus3]: processed 6000 documents, 85.533333% accurate
That’s not too bad!
Improving Accuracy
While the accuracy is better than a random guess, it could definitely be improved. How can we improve the accuracy of the classifier? The easiest way is to try and select “better” features in the extract_features() function.
A couple ideas:
- filter out noise while extracting words, things like common stop words
- treat the email subject on it’s own, distinct from the words that make up the body
- check for things like words in all caps or the presence of links in the text
Since the features themselves are identified by a string, you can indicate a feature is a “subject” word by prefixing it with an “s:”. Or you can add “meta”-features like “ALL_CAPS” or “CONTAINS_LINKS”.
For instance, simply by filtering out stop words I was able to bump the accuracy up by 2%:
$ python enron2.py
[corpus2]: processed 5175 documents, 91.826087% accurate
[corpus3]: processed 6000 documents, 87.350000% accurate
Closing Remarks
I hope you enjoyed reading this post! As you may have noticed, the classifier module is not written in such a way that it is “spam”-specific, so you can adapt it to all sorts of other uses. One example might be suggesting tags for a blog post. If you’re interested in learning more, I again would suggest picking up a copy of Programming Collective Intelligence.
All the source code can be found on GitHub: https://gist.github.com/coleifer/75f4a428b0250822579e
You can also check out the git repo:
$ git clone https://gist.github.com/75f4a428b0250822579e.git classifier