Technology

Hacking Google Glass – Google I/O Session Summary

Woman wearing early prototype Glass

One talk at Google I/O was about how to get debug access and how to root your Google Glass. I really like that Google has allowed and demonstrated these options. One of the many reasons I switched to Android in the first place was to get rid of the restrictions on what I could and could not do with a device I owned. (If you haven’t heard yet, Google Glass runs Android.)

Read more

Filed under Technology

Nerd Reaction To Google I/O 2013

Google I/O 2013 is over and there has been a deluge of new updates about services and tools for developers. Unlike previous years, there wasn’t a fancy new device or jaw-dropping new technology debut. From a developer’s perspective, that’s just fine, because what was announced is a strong foundation for future work on Google’s various platforms. Let’s review the announcements and their meaning by going through the platforms one by one. Read more

Filed under Technology

Taking a Spin with Facebook Home. It’s Pretty, but Lacking Features and a Security Concern.

I’ve spent the last few hours checking out Facebook Home and my initial impressions are that it’s a pretty cool app, but not something that I’d want permanently.

Facebook-Home

I only played with it for a short period of time, so my impressions are pretty introductory. The first 20 minutes are probably the hardest. As with anything new, there is a learning curve. Even though Facebook Home is pretty simple, it’s going to take a little bit of time to figure it out. I think I liked and unliked quite a few status updates that I didn’t mean to. However, once I figured it out, it was easy going. Read more

Filed under Technology

The New Facebook Logo

Facebook's New Brand

Facebook has updated their brand, including their signature “f” logo. All the details, dos and don’ts, and downloads can be seen at their new Brand Resources site, which is really well done.

It’s important to point out the one thing every designer and website owner should know. Read more

Filed under Technology

Tech Tips: Creating a SharePoint 2010 Master Page with Code-Behind


In our regular series of Tech Tips, we asked resident SharePoint expert Joseph Szczesniak to share some of the information in his brain about SharePoint. If you’d like to see more information about SharePoint, let us know!

————————

Starter Page
For this example, I’m going to get you kicked off using a quick and easy custom Master Page starter solution available from CodePlex. Go ahead and download Starter 2010 Master Pages by Randy Drisgill.zip. This will give you a stripped down version of your desired Master Page with tons of documentation right inside the page. Very useful!

Read more

Filed under Tech Tips, Technology

Facebook Home – Facebook’s “Mobile Best” Strategy

If you missed the announcement about the new Facebook Home, in which Mark Zuckerberg talked about their “mobile best” strategy (as opposed to the common “mobile first” strategy for interactive software), you can catch the 40-minute video, Facebook Home – Facebook Live, which demos the flagship features:

  • Replacing the lock-screen with Timeline updates and Photos from your Facebook account.
  • Chat Heads – Their new feature that keeps the communication portion of SMS and Facebook Messenger in the forefront of your phone experience.
  • People First, Apps Second – The user interface that highlights “the people in your life” and lets apps take a back seat.

If you want to hear more in-depth discussion about what these updates mean for developers of Android and Facebook applications on the web, make sure you check out the recorded conversation between developers on yesterday’s NerdCast.

Filed under Technology, Web Culture

Beautifully Redesigned Facebook News Feed – A Review

In early March, Facebook announced that the News Feed was going to be redesigned. The redesign is meant to de-clutter your News Feed and make it easier to consume content. They’re working to make it easier to find the updates you are looking for and using user feedback to make Facebook better overall, and they are succeeding.

I’ve had the new News Feed design for a few weeks now and I really like it. The information that you see hasn’t changed; however, how it’s presented has.

Facebook Less Clutter

Upon switching to the new layout, the first thing you’ll notice is that it has actually been cleaned up and de-cluttered quite a bit. Visually, it’s very impressive. It feels like a well-polished, professional site.

Newsfeed Redesign

The right sidebar ticker and chat have been merged with the left sidebar and placed on a dark background. This is similar to how the mobile apps are set up. The sidebar also collapses on smaller screens to free up space for your News Feed. This is a great feature, except that it expands when you hover over it, which gets annoying at times. The right-side ads remain; however, they are laid out more nicely.

The blue bar at the top has been replaced with Graph Search, and the amount of information up there has been cut down as well. This area is still in development, though: I’ve seen a few different designs, and recently my name and image disappeared from the blue bar. It’s also a bit weird that the Facebook logo, which takes you to your News Feed when clicked, also changes into a search icon for the Graph Search bar. Still, the usability has been getting better over the past few weeks.

In the News Feed, most everything is bigger. Avatars, images, links and Likes all get more space. Status updates have nice boxes around them to separate the content, which makes it easier to consume as you scroll down the page. Like, Comment and Share buttons are bigger too, but not too big.

If multiple friends have shared the same thing, it’ll be combined in a way that doesn’t take up more space and yet doesn’t lose the share data. This is a nice upgrade to the awful way Facebook combined shared information in the past.

Facebook Multiple Shares

The other big feature that comes with the new News Feed is dedicated feeds. Dedicated feeds show you only updates for Friends, Following (aka Likes or Pages), Most Recent, Photos, Groups, Games and more. This is where the really good stuff comes into play.

Facebook Dedicated Feeds

The All Friends feed shows no pages, games or ads in the feed. Overall, this is kind of what I always wished Facebook was: just my friends. Facebook does throw in likes, cover photo updates, and other information on what your friends are doing, so there is a bit of noise, but it’s very minimal.

The All Following feed shows you all the posts from the pages you’ve Liked on Facebook. Unlike the main News Feed, you actually do see all posts from all pages. Once you take a look at this feed, you’ll realize how much information Facebook doesn’t show you from things you’ve opted into. In the past, this page also showed information from pages that your friends like and that Facebook thinks you’d like; however, that seems to be gone today.

The Most Recent feed shows you the most recent activity from your friends and pages on Facebook. You see games played, cover photo updates, Likes, status updates and all that jazz. Again, viewing this feed really makes you realize what Facebook isn’t showing you in the main News Feed; good or bad.

The Photos feed is beautiful. It’s only the photos your friends have shared and they are all front and center. No additional noise. Facebook also prompts you to organize your photos by removing the right sidebar ads and asking you to fill out information about your albums. From what I’ve been hearing, this is one of the most loved new features.

There are also group-specific feeds that show you updates from just the people in those groups. I think groups are often underused on Facebook, and this is one way for Facebook to get you to start using them. They are quite nice, and chances are you already have a bunch of groups that were set up automatically.

Overall, the News Feed redesign is great. It really puts the focus on what your friends are doing and makes it much more enjoyable to be on Facebook.

As Facebook continues to roll out the News Feed design they will continue to tweak and change things. What I see today is much improved over what I saw in early March. It’s actually gotten quite a bit better.

If you’d like to test out the News Feed redesign, look for this banner at the top of your current News Feed and give it a try.

Opt Into Facebook Newsfeed

Have you opted into the new design yet? If not, I’d recommend taking it for a spin. Chances are, you won’t look back.

Filed under Technology

Should You Be Worried or Excited About Facebook’s Graph Search?

Facebook’s Graph Search is supposed to be a great way to connect you with others on Facebook, to find information, and explore something new. Users can search by location, Likes, businesses, interests, or many other things. Results can also be refined by a number of different options based on what you search for. The idea behind Graph Search is great, but the reality leaves something to be desired at the moment.

You can do searches for things like:

  • Images taken in Minneapolis, Minnesota
  • Photos I Like
  • My friends who like Apple
  • My friends of friends who work at Google
  • Microsoft employees that like Apple Inc.
  • Restaurants people who like Star Wars like.
  • Favorite movies of people who like Mystery Science Theater 3000
  • Facebook employees that are single and are under 30 years old and that live in Arizona and like Cats
  • My friends over 50 years old who like Anime and Justin Bieber
  • Photos of my friends who work at The Nerdery

Sample Graph Search looking for photos of coworkers.

As you can see, the searches can be pretty basic, or you can get oddly specific in your searches.

From a business perspective, Graph Search could be a great way to get additional exposure for your company, product or band. Finding local events, new restaurants to try, and things to buy based on friends’ recommendations is a huge opportunity, as long as users are checking in, rating things and talking about your business. Read more

Filed under Technology

Playing in the Sandbox: Building a Spam Detector With Python

[Ryan Carlson, Blog Editor] Here at The Nerdery we like to tinker and create, even if it means we have to build our own playground and fill the sandbox with sand. Below is an expedition into the sandbox to tinker with technology.

———————

In this post I will describe how to build a simple spam detector with Python. Without going too deep into the theory (or half-heartedly regurgitating Wikipedia), I will give a high-level overview of how a probabilistic classifier works. Then we will get into the business of training and classifying emails using the Enron spam/ham corpora, which contain several thousand emails that have been pre-categorized as spam or ham. We will use this data set both to train and validate our classifier.

I hope this post will show one interesting application of probabilistic classification, and also highlight the adaptability of the Python language.

Overview of Probabilistic Classification

The classifier we will be building will use a technique called “supervised learning”. This means we will spend some time up-front training the classifier before we will trust it to accurately detect spam.

So how will we train this thing? Or, in other words, “what clues do we have that a message is spam?” The answer is actually simpler than you may be thinking – we will just use the individual words in the message. These words and their association with either spam or ham messages will form the basis of our classifier.

Once we have associated the various words with our two classifications (spam and ham), we can calculate the probability that a given word belongs to one label or another. For instance, the probability that the word “money” appears in a spam message is much higher than the probability it appears in a legitimate email. This raises the question: how do you calculate this probability?

This is actually pretty straightforward. Say we have trained our classifier using 200 documents, 100 are spam and 100 are ham. Now, suppose that the word “money” appears in 25 spam documents, but only 5 ham documents. The probability, then, that the word “money” indicates a spam document is calculated:

Probability that "money" is spam = (.25 * .5) / ((.25 * .5) + (.05 * .5)) = .83, or 83%.

Where did these numbers come from? The “.25” and “.05” are the fractions of spam and ham documents, respectively, that contain the word “money” (25 of 100 and 5 of 100). The “.5” is the interesting number: it is the fraction of all documents that carry each label. Since we have classified 100 of each, the total number of documents is 200, and before looking at any words a document is 50% likely to be spam and 50% likely to be ham.
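
The arithmetic above can be checked in a few lines of Python. This is only a sketch of the worked example: the function name spam_probability and its parameter names are made up for illustration and are not part of the classifier we build later.

```python
def spam_probability(spam_with_word, ham_with_word, spam_total, ham_total):
    """Bayes' rule for a single word, using raw document counts.

    Hypothetical helper for the worked example only.
    """
    total = spam_total + ham_total
    p_word_given_spam = float(spam_with_word) / spam_total   # .25
    p_word_given_ham = float(ham_with_word) / ham_total      # .05
    p_spam = float(spam_total) / total                       # .5
    p_ham = float(ham_total) / total                         # .5

    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * p_ham)


money = spam_probability(25, 5, 100, 100)
print('%.2f' % money)  # 0.83
```

Changing the counts is a quick way to build intuition: a word that shows up equally often in spam and ham comes out at exactly .5 when the two labels are equally common.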

By combining the probabilities for all the words in a document, it is possible to get an overall view of the likelihood a document is either spam or ham.

If you are interested in reading a bit more, perhaps the best introduction to Bayes’ theorem is the Wikipedia introductory example – I strongly recommend you check it out. For a more thorough introduction I recommend reading the excellent post “An intuitive and short explanation of Bayes’ Theorem”.

Building the classifier

This classifier will be based in part on the classifier in Toby Segaran’s excellent book Programming Collective Intelligence. I recommend picking up a copy of this book! It is packed with useful information and interesting applications of machine learning algorithms.

I hope it’s clear from the previous section that what we are interested in storing is “counts” of things, because it is these “counts” that allow us to calculate percentages, which can be combined to give an overall probability.

Recollecting the example above, we have:

= ((25 / 100) * (100 / 200)) / (((25 / 100) * (100 / 200)) + ((5 / 100) * (100 / 200)))
= (.25 * .5) / ((.25 * .5) + (.05 * .5))

So we will need to store the following counts of things:

  • how many documents we have seen (200)
  • how many documents go in each label (100 each)
  • how often a word is associated with each label (25 and 5)

Python provides the perfect data structure for us: the dictionary. Dictionaries provide average-case O(1) lookup, so looking up or updating a count stays fast no matter how many words we have seen:

# file: classify.py

from collections import defaultdict

class Classifier(object):
    def __init__(self):
        # ``defaultdict`` is an optimized dictionary-like object that
        # allows us to specify a default value when a key is accessed
        # that has not been previously set.

        self.features = defaultdict(int)
        self.labels = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.total_count = 0
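
If you haven’t used defaultdict before, here is a small standalone sketch of the behaviour the constructor relies on. The variable names mirror the attributes above, but this snippet is separate from the class and exists only to show what happens when a key is missing.

```python
from collections import defaultdict

# counts start at int() == 0, so "+=" works on a key we have never seen
labels = defaultdict(int)
labels['spam'] += 1
labels['spam'] += 1

# a dictionary of dictionaries: the inner counter is created on first access
feature_counts = defaultdict(lambda: defaultdict(int))
feature_counts['money']['spam'] += 1

print(labels['spam'])                  # 2
print(feature_counts['money']['ham'])  # 0
```

One side effect worth knowing about: merely reading a missing key from a defaultdict inserts that key with its default value, so the dictionaries can grow during lookups as well as during training.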

Now we need to build the training method. This will simply update the counts of various items:

def train(self, features, labels):
    # what labels are these features associated with?
    for label in labels:
        # update the count of each feature for the given label
        for feature in features:
            self.feature_counts[feature][label] += 1
            self.features[feature] += 1

        # update the count of documents associated with this label
        self.labels[label] += 1

    # update the total count of documents processed
    self.total_count += 1

Believe it or not, the above is all the code we need to start training our classifier! Of course, we’re not done yet — we need to write the code to classify new documents. Let’s start plugging the training data into some methods we can use to classify documents. Looking back at the formula we used to calculate the likelihood “money” indicated a spam document, let’s try to generate that with python:

def feature_probability(self, feature, label):
    # get the count of this feature in the given label, this would
    # be "25" for "money"/"spam", or "5" for "money"/"ham"
    feature_count = self.feature_counts[feature][label]

    # get the count of documents with this label (e.g. 100)
    label_count = self.labels[label]

    if feature_count and label_count:
        # divide by the count of documents with this label
        return float(feature_count) / label_count
    return 0

def weighted_probability(self, feature, label, weight=1.0, ap=0.5):
    # calculate the "initial" probability that the given feature will
    # appear in the label -- this is .25 for "money"/"spam"
    initial_prob = self.feature_probability(feature, label)

    # sum the counts of this feature across all labels -- e.g.,
    # how many times overall does the word "money" appear? (30)
    feature_total = self.features[feature]

    # calculate weighted avg -- this is slightly different than what
    # we did in the above example and helps give a more evenly
    # weighted result and prevents us returning "0"
    return float((weight * ap) + (feature_total * initial_prob)) / (weight + feature_total)
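
To see what the weighting does, here is the same formula pulled out into a standalone function and fed the numbers from our “money” example. The function name weighted is hypothetical; in the classifier the two inputs come from feature_probability and self.features.

```python
def weighted(initial_prob, feature_total, weight=1.0, ap=0.5):
    # same formula as weighted_probability, with the two looked-up
    # values passed in directly (illustrative helper only)
    return ((weight * ap) + (feature_total * initial_prob)) / (weight + feature_total)


# "money"/"spam": in 25 of 100 spam documents, 30 documents overall
seen = weighted(0.25, 30)

# a word the classifier has never seen before
unseen = weighted(0.0, 0)

print('%.3f' % seen)    # 0.258
print('%.3f' % unseen)  # 0.500
```

A word we know a lot about barely moves from its observed value, while a word we have never seen falls back to the assumed probability “ap” instead of zeroing out the whole document.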

The above “weighted_probability” function allows us to calculate the probability that a feature is associated with a given label. Now it will get more interesting as we will be calculating the probability that a set of features matches a label. To calculate this, simply multiply together all the probabilities of the individual features (this treats the words as independent of one another, which is the “naive” part of a naive Bayes classifier):


def document_probability(self, features, label):
    # calculate the probability these features match the label
    p = 1
    for feature in features:
        p *= self.weighted_probability(feature, label)
    return p
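
One thing to be aware of with the multiplication above: for long documents the product of many small numbers can underflow to zero. A common workaround, which is not what the code in this post does, is to add logarithms instead of multiplying. The sketch below uses made-up probabilities purely to show the effect.

```python
import math

# 200 features, each with a small (made-up) probability
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p    # 1e-400 is smaller than any float can represent

# summing logs keeps the score in a comfortable range; this is safe
# whenever every probability is greater than zero
log_score = sum(math.log(p) for p in probs)

print(product)              # 0.0
print('%.2f' % log_score)   # -921.03
```

Because the logarithm is monotonic, comparing log scores ranks the labels the same way as comparing the raw products would.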

The final step is to weight the probabilities of the individual features by the overall probability that a document has a given label.

def probability(self, features, label):
    if not self.total_count:
        # avoid doing a divide by zero
        return 0

    # calculate the probability that a document will have the given
    # label -- in our example this is (100 / 200)
    label_prob = float(self.labels[label]) / self.total_count

    # get the probabilities of each feature for the given label
    doc_prob = self.document_probability(features, label)

    # weight the document probability by the label probability
    return doc_prob * label_prob

Now we can write a method to classify a set of features. This will calculate the probability for each label (i.e., the probability for spam and ham) and then return them sorted so the best match is first:

def classify(self, features, limit=5):
    # calculate the probability for each label
    probs = {}
    for label in self.labels.keys():
        probs[label] = self.probability(features, label)

    # sort the results so the highest probabilities come first
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:limit]
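
It helps to know what shape of result comes back. The sketch below runs the same sort on a hand-written dictionary of made-up scores; the normalization step at the end is an optional extra of my own and is not part of the classifier.

```python
# made-up scores standing in for self.probability(features, label)
probs = {'spam': 0.0031, 'ham': 0.0004}

# a list of (label, score) tuples, best match first
ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:5]
print(ranked)        # [('spam', 0.0031), ('ham', 0.0004)]
print(ranked[0][0])  # spam

# the scores do not add up to 1; dividing by their sum fixes that
total = sum(probs.values())
normalized = dict((label, score / total) for label, score in probs.items())
```

The first element of the first tuple is the winning label, which is what we will compare against when testing.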

That’s all there is to it. There are several steps — I hope you didn’t get bored. In the next section we will use this classifier to process data from Enron’s spam corpus.

Processing data from the Enron spam corpus

To begin with it will be necessary to download the spam corpora. This file contains 3 different collections of spam / ham emails from Enron and will be used to train and test the classifier.

Let’s start a new script called “enron.py” that will read the emails from the Enron corpora and train our classifier. The first function we write will read all the files in a given corpus and train the classifier. This is straightforward in python:

import os

# import our classifier, assumed to be in same directory
from classify import Classifier

def train(corpus='corpus'):
    classifier = Classifier()
    curdir = os.path.dirname(__file__)

    # paths to spam and ham documents
    spam_dir = os.path.join(curdir, corpus, 'spam')
    ham_dir = os.path.join(curdir, corpus, 'ham')

    # train the classifier with the spam documents
    train_classifier(classifier, spam_dir, 'spam')

    # train the classifier with the ham documents
    train_classifier(classifier, ham_dir, 'ham')

    # hand the trained classifier back to the caller
    return classifier

def train_classifier(classifier, path, label):
    for filename in os.listdir(path):
        with open(os.path.join(path, filename)) as fh:
            contents = fh.read()

        # extract the words from the document
        features = extract_features(contents)

        # train the classifier to associate the features with the label
        classifier.train(features, [label])

As you can see in the above code, we are calling a function “extract_features” to extract the words from the file contents.

def extract_features(s, min_len=2, max_len=20):
    """
    Extract all the words in the string ``s`` that have a length within
    the specified bounds
    """
    words = []
    for w in s.lower().split():
        wlen = len(w)
        if wlen > min_len and wlen < max_len:
            words.append(w)
    return words
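
Here is a quick standalone run of the extractor on a made-up message, so you can see exactly what counts as a “word”. The function body is equivalent to the one above, with the two comparisons chained.

```python
def extract_features(s, min_len=2, max_len=20):
    words = []
    for w in s.lower().split():
        # both bounds are exclusive: two-letter words are dropped
        if min_len < len(w) < max_len:
            words.append(w)
    return words


features = extract_features('Make MONEY fast, no risk to you!!!')
print(features)  # ['make', 'money', 'fast,', 'risk', 'you!!!']
```

Notice that punctuation stays attached to the words, so “fast,” and “fast” would be counted as two different features.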

After training the classifier, let’s test it on a different corpus. The following function should look pretty similar to the training code:

def test(classifier, corpus='corpus2'):
    curdir = os.path.dirname(__file__)

    # paths to spam and ham documents
    spam_dir = os.path.join(curdir, corpus, 'spam')
    ham_dir = os.path.join(curdir, corpus, 'ham')

    correct = total = 0

    for path, label in ((spam_dir, 'spam'), (ham_dir, 'ham')):
        for filename in os.listdir(path):
            with open(os.path.join(path, filename)) as fh:
                contents = fh.read()

            # extract the words from the document
            features = extract_features(contents)

            results = classifier.classify(features)

            if results[0][0] == label:
                correct += 1
            total += 1

    pct = 100 * (float(correct) / total)
    print('[%s]: processed %s documents, %02f%% accurate' % (corpus, total, pct))

Let’s make it so that when we run our script from the command line it will train itself using “corpus” and will then test itself against the other 2 corpora:

if __name__ == '__main__':
    classifier = train()
    test(classifier, 'corpus2')
    test(classifier, 'corpus3')

Here is the output I get from running the script:

$ python enron.py
[corpus2]: processed 5175 documents, 90.318841% accurate
[corpus3]: processed 6000 documents, 85.533333% accurate

That’s not too bad!

Improving Accuracy

While the accuracy is better than a random guess, it could definitely be improved. How can we improve the accuracy of the classifier? The easiest way is to try and select “better” features in the extract_features() function.

A couple ideas:

  • filter out noise while extracting words, things like common stop words
  • treat the email subject on its own, distinct from the words that make up the body
  • check for things like words in all caps or the presence of links in the text

Since the features themselves are identified by a string, you can indicate a feature is a “subject” word by prefixing it with an “s:”. Or you can add “meta”-features like “ALL_CAPS” or “CONTAINS_LINKS”.
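
As a rough sketch of how those ideas could fit together, here is one possible extractor. Everything in it is an assumption made for illustration: the name extract_features_v2, the tiny stop-word list, the idea that the subject and body have already been split apart, and the rule for what counts as “all caps”. It is not the code that produced the numbers in this post.

```python
import re

# a tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = set(['the', 'and', 'for', 'you', 'that', 'this', 'with'])
LINK_RE = re.compile(r'https?://', re.IGNORECASE)


def extract_features_v2(subject, body, min_len=2, max_len=20):
    features = []

    # subject words get an "s:" prefix so they are counted separately
    for w in subject.lower().split():
        if min_len < len(w) < max_len and w not in STOP_WORDS:
            features.append('s:' + w)

    for w in body.split():
        # meta-feature: shouting words such as "FREE"
        if len(w) > 2 and w.isupper():
            features.append('ALL_CAPS')
        w = w.lower()
        if min_len < len(w) < max_len and w not in STOP_WORDS:
            features.append(w)

    # meta-feature: the body contains at least one link
    if LINK_RE.search(body):
        features.append('CONTAINS_LINKS')
    return features


feats = extract_features_v2('Free money',
                            'Claim this FREE prize at http://example.com')
print(feats)
```

Since the classifier only ever sees strings, none of this requires touching classify.py; the meta-features are trained and looked up exactly like ordinary words.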

For instance, simply by filtering out stop words I was able to bump the accuracy up by roughly 1.5 to 2 percentage points:

$ python enron.py
[corpus2]: processed 5175 documents, 91.826087% accurate
[corpus3]: processed 6000 documents, 87.350000% accurate

Closing Remarks

I hope you enjoyed reading this post! As you may have noticed, the classifier module is not written in such a way that it is “spam”-specific, so you can adapt it to all sorts of other uses. One example might be suggesting tags for a blog post. If you’re interested in learning more, I again would suggest picking up a copy of Programming Collective Intelligence.

All the source code can be found on GitHub: https://gist.github.com/coleifer/75f4a428b0250822579e

You can also check out the git repo:

$ git clone https://gist.github.com/75f4a428b0250822579e.git classifier

Filed under Tech Tips, Technology

Google+ Platform Applications

On Feb 26th, Google revealed a series of new APIs that help mobile and web developers integrate their applications into the social network.

The way they’re engaging applications is interesting because it comes in three layers:

1. Almost every language, Almost every platform:
They’ve released code samples for just about every programming language and have made it clear that they have iOS, Android, and web development as primary targets for this API. Google has repeatedly stated that Google+ is Google, meaning that it isn’t a product but rather is involved in everything they do. To me, nothing makes that clearer than attempting to drive all of this activity information into their social network. Read more

Filed under Technology