My latest target was a basket of different libraries in the Python ecosystem covering things like web development, caching, asynchronous messaging, and visualization. And since I’m a data scientist, I threw in a machine learning component just for fun. To explore these technologies, I created a semi-practical application that reads from the Twitter stream, parses tweets, and does some machine learning magic to score the tweet’s sentiment and project it into a two-dimensional grid, where tweets with similar content will appear closer to each other. It does all of this more or less in real time using asynchronous messaging.

The remainder of this blog post is devoted to showing how to build this from scratch. Just to be completely transparent about what it does and what technologies are involved, here are the components that I’ll demonstrate how to build:

- Basic web development in Python using Flask + standard front-end stuff (Bootstrap, JQuery etc.)
- Asynchronous, chainable task queues using Celery and Redis
- Real-time event-based communication between Flask and connected clients using Socket-IO
- Twitter stream filtering/parsing using Pattern
- Streaming real-time visualization using NVD3
- Sentiment analysis and word embeddings using Scikit-learn and Gensim (word2vec)

And here's a screenshot of what the finished app looks like.

Why are we doing this? Does this app have some tangible, real-world purpose? Probably not. But it’s fun and neat and you’ll hopefully learn a lot. If that sounds interesting then read on!

(NOTE: The completed application is located here. Feel free to reference this often as I may leave out some details throughout the post).

To get started, let’s talk about the setup required to run the application. In addition to a working Python interpreter, you’ll need to install a bunch of libraries that the application uses. You’ll also need to know how to perform a few tasks such as starting Redis, launching Celery, and running a Flask server. Fortunately these are all very easy to do. I wrote up detailed instructions in the project’s README file. Follow those steps and you should be up and running.

Now let’s build sentiment and word vector models to transform tweets. We’ll use this Twitter sentiment dataset as training data for both models. The first step is to read in the dataset and do some pre-processing using TF-IDF to convert each tweet to a bag-of-words representation.

```
print('Reading in data file...')
data = pd.read_csv(path + 'Sentiment Analysis Dataset.csv',
usecols=['Sentiment', 'SentimentText'], error_bad_lines=False)
print('Pre-processing tweet text...')
corpus = data['SentimentText']
vectorizer = TfidfVectorizer(decode_error='replace', strip_accents='unicode',
stop_words='english', tokenizer=tokenize)
X = vectorizer.fit_transform(corpus.values)
y = data['Sentiment'].values
```

Note that we’re using a custom tokenizer designed to handle patterns common in tweets. I borrowed this from a script Christopher Potts wrote and adapted it slightly (final version is in the "scripts" folder). Next, we can train the sentiment classifier and word2vec model.

```
print('Training sentiment classification model...')
classifier = MultinomialNB()
classifier.fit(X, y)
print('Training word2vec model...')
corpus = corpus.map(lambda x: tokenize(x))
word2vec = Word2Vec(corpus.tolist(), size=100, window=4, min_count=10, workers=4)
word2vec.init_sims(replace=True)
```

This should run pretty fast since the training data set is not that big. We now have a model that can read a tweet and classify its sentiment as either positive or negative, and another model that transforms the words in a tweet to 100-dimensional vectors. But we still need a way to use those 100-dimensional vectors to spatially plot a tweet on a 2-dimensional grid. To do that, we’re going to fit a PCA transform for the word vectors and keep only the first 2 principal components.

```
print('Fitting PCA transform...')
word_vectors = [word2vec[word] for word in word2vec.vocab]
pca = PCA(n_components=2)
pca.fit(word_vectors)
```

Finally, we’re going to save all of these artifacts to disk so we can call them later from the web application.

```
print('Saving artifacts to disk...')
joblib.dump(vectorizer, path + 'vectorizer.pkl')
joblib.dump(classifier, path + 'classifier.pkl')
joblib.dump(pca, path + 'pca.pkl')
word2vec.save(path + 'word2vec.pkl')
```

Now that we have all the models we need ready to go, we can get started on the meat of the application. First, some initialization. This code runs only once, when the Flask server is launched.

```
# Initialize and configure Flask
app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost:6379/0'
app.config['SOCKETIO_REDIS_URL'] = 'redis://localhost:6379/0'
app.config['BROKER_TRANSPORT'] = 'redis'
app.config['CELERY_ACCEPT_CONTENT'] = ['pickle']
# Initialize SocketIO
socketio = SocketIO(app, message_queue=app.config['SOCKETIO_REDIS_URL'])
# Initialize and configure Celery
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)
```

There’s a bunch of stuff going on here so let’s break it down. We’ve created a variable called "app" that’s an instantiation of Flask, and set some configuration items to do things like tell it to use Redis as the broker (note that "config" is just a dictionary of key/value pairs which we can use for other settings not required by Flask). We also created a SocketIO instance, which is a class from the Flask-SocketIO integration library that basically wraps Flask with SocketIO support. Finally, we created our Celery app and updated its configuration settings to use the "config" dictionary we defined for Flask.

Next we need to load the models we created earlier into memory so they can be used by the application.

```
# Load transforms and models
vectorizer = joblib.load(path + 'vectorizer.pkl')
classifier = joblib.load(path + 'classifier.pkl')
pca = joblib.load(path + 'pca.pkl')
word2vec = Word2Vec.load(path + 'word2vec.pkl')
```

Finally, we’ll create some helper functions that use these models to classify the sentiment of a tweet and transform a tweet into 2D coordinates.

```
def classify_tweet(tweet):
"""
Classify a tweet with either a positive (1) or negative (0) sentiment.
"""
pred = classifier.predict(vectorizer.transform(np.array([tweet.text])))
return str(pred[0])
def vectorize_tweet(tweet):
"""
Convert a tweet to vector space using a pre-trained word2vec model, then transform
a sum of the vectorized words to 2-dimensional space using PCA to give a simple
2D coordinate representation of the original tweet.
"""
tweet_vector = np.zeros(100)
for word in tokenize(tweet.text):
if word in word2vec.vocab:
tweet_vector = tweet_vector + word2vec[word]
components = pca.transform(tweet_vector)
x = components[0, 0]
y = components[0, 1]
return str(x), str(y)
```

The server code handling web requests is very simple, in fact there are only two routes. The first one is the root path and just returns the main page ("index.html").

```
@app.route('/', methods=['GET'])
def index():
"""
Route that maps to the main index page.
"""
return render_template('index.html')
```

The second route accepts an input phrase and initiates a task chain that will kick off a stream of tweets from the Twitter firehose that match the input phrase.

```
@app.route('/twitter/<phrase>', methods=['POST'])
def twitter(phrase):
"""
Route that accepts a twitter search phrase and queues a task to initiate
a connection to twitter.
"""
queue = app.config['SOCKETIO_REDIS_URL']
# create_stream.apply_async(args=[phrase, queue])
chain(create_stream.s(phrase, queue), send_complete_message.s()).apply_async()
return 'Establishing connection...'
```

This is probably confusing at first glance so let’s examine a bit closer. The "queue" variable is just a string pointing to the URL where Redis is running ("redis://localhost:6379/0"), which we’ll use later. The "chain" line is a Celery function that composes a sequence of tasks to be queued up and run asynchronously. In this case, we’re queuing a call to the "create_stream" task following by a call to the "send_complete_message" task. These tasks may or may not begin executing immediately (depending on the state of the task queue) but in any event, execution of the function continues on to the last line where we return a message to the client saying that a connection (to Twitter) is being established.

In the last section we queued calls to two functions, "create_stream" and "send_complete_message". We said that these were defined as tasks in Celery, meaning that they execute asynchronously apart from the main program thread. Let’s take a look at those functions.

```
@celery.task
def create_stream(phrase, queue):
"""
Celery task that connects to the twitter stream and runs a loop, periodically
emitting tweet information to all connected clients.
"""
local = SocketIO(message_queue=queue)
stream = Twitter().stream(phrase, timeout=30)
for i in range(60):
stream.update()
for tweet in reversed(stream):
sentiment = classify_tweet(tweet)
x, y = vectorize_tweet(tweet)
local.emit('tweet', {'id': str(i),
'text': str(tweet.text.encode('ascii', 'ignore')),
'sentiment': sentiment,
'x': x,
'y': y})
stream.clear()
time.sleep(1)
return queue
```

There’s a lot to digest here so let’s step through it. First, "local" is a Socket-IO instance. Recall that we’re using Socket-IO (through the Flask-SocketIO wrapper library) to essentially "push" information from the Flask server to all connected clients, so this object is creating a channel that will allow us to do that. The reason we’re creating a local instance within the function and not using the global instance is because our Celery task runs in a separate thread (remember, this function gets called asynchronously).

In the next line, we create an object that we imported from the Pattern library that creates a persistent connection with Twitter. Notice we pass in the search phrase that we got via client input. This will restrict the Twitter stream to only capture tweets with that phrase in it. The next section enters a 60-second loop where each second we check for new tweets, and if any are found, we run our sentiment and vectorization functions on them. The last step is the "emit" line, where we tell Socket-IO to push out a message with all of our data for that tweet.

At the very end of the function’s execution, it returns the "queue" variable. You may be wondering – what is this for? Where does it go? Remember that we chained multiple Celery tasks together. When "create_stream" finishes executing the next task in the chain ("send_complete_message") is called, with the output of the first task passed to the next task. In this case we’re using the chain to propagate information (the URL of the Redis queue) through to each task.

The "send_complete_message" function is much simpler and shouldn’t require much explanation. Essentially what we’re doing is emitting a message to our clients that the loop in the previous function finished executing.

```
@celery.task
def send_complete_message(queue):
"""
Celery task that notifies the client that the twitter loop has completed executing.
"""
local = SocketIO(message_queue=queue)
local.emit('complete', {'data': 'Operation complete!'})
```

As a practical matter, it wasn’t necessary to set things up this way. We could have very easily just stuck this at the end of the "create_stream" function. But I wanted to demonstrate a simple example of how to compose workflows in Celery and pass information between tasks. The whole point is to learn about these libraries and what they can do!

All we have left is to create a UI for the user to interact with and wire up some javascript to communicate through Socket-IO and handle the events pushed from the server. The UI code is pretty generic so I won’t cover it in this post but feel free to inspect index.html in the repo to see it all put together. I’ll say a few things about the javascript code though. First we need a function that takes a user-provided phrase and posts it to the web server, which initiates the whole process. This is pretty standard JQuery stuff.

```
function twitter() {
phrase = $('#phrase').val();
url = encodeURI('/twitter/' + phrase);
$.ajax({
type: 'POST',
url: url,
success: function(data, status, request) {
$('#twitter-status').html(data);
},
error: function() {
alert('An error occurred submitting the request.');
}
});
}
```

We also need a function to create the chart what we’ll use to display tweets. I pulled some examples from the NVD3 website and came up with the following, which renders a scatter plot using the “data” variable as input (this gets populated later).

```
function loadGraph() {
var chart = nv.models.scatterChart()
.pointRange([2000, 2000])
.color(d3.scale.category10().range());
chart.xAxis.tickFormat(d3.format('.02f'));
chart.yAxis.tickFormat(d3.format('.02f'));
d3.select('#chart svg')
.datum(data)
.transition()
.duration(500)
.call(chart);
nv.utils.windowResize(chart.update);
return chart;
}
```

Additionally there’s a few things going on in the main script body, which runs a single time when the page is first loaded. First, we open a socket to establish a persistent connection with the server.

```
var socket = io.connect('http://' + document.domain + ':' + location.port);
```

Next we need to create an event handler that responds to a "tweet" message from the server. The event handler does a few things:

- Update some labels with a new status
- Add a new JSON record to the data structure used by the graph
- Redraw the graph to display the new data
- Append the raw text from the tweet to a display panel

```
socket.on('tweet', function(msg) {
$('#phrase').val('');
$('#twitter-status').html(
'Connection established. Streaming for 60 seconds (currently at ' + msg.id + ')...');
sentiment = parseInt(msg.sentiment);
x = parseFloat(msg.x);
y = parseFloat(msg.y);
data[sentiment].values.push({
id: msg.id,
x: x,
y: y,
size: 2000,
shape: "circle"});
loadGraph();
$('#twitter-results').append(
'<br>' + $('<div/>').text('(' + x.toFixed(2) + ', ' + y.toFixed(2) + ') ' + msg.text).html());
});
```

We also need an event handler that responds to the "complete" message.

```
socket.on('complete', function(msg) {
$('#twitter-status').html(msg.data);
});
```

The last step is just to wire up the beginning click event and initiate the graph.

```
$('#begin').click(twitter);
nv.addGraph(loadGraph);
```

That’s essentially it. If you want to run it yourself, the easiest way would be to download the source code here and follow the instructions in README.md. If all goes well, you should see something like this.

There are a fair amount of details that I glossed over in the interest of not making this post too long, but if you’d like to learn more about any of the libraries I used I’d highly encourage spinning through the documentation for that library. Pretty much all of them are well-documented with lots of example code, which is how I learned a lot of what I needed to know to make this app.

That wraps up the walk-through. As I said at the top, experimenting with new technologies is a great way to learn. By combining several new technologies at once, it forces even greater comprehension. This app is pretty bare-bones and not all that useful in its current state beyond serving as a learning tool, but there are a million ways it could be extended. If you find this stuff interesting, feel free to fork the repo and make your own version!

]]>But there’s more to it than that. The common link between each story is a theme that the seeds of innovation are not random, and there are some commonalities that increase the likelihood not just of having a breakthrough but making sure it takes off. Throughout the book, Isaacson frames each story in a similar narrative – that innovation is a collaborative process, and the legend of the lone hacker changing the world on his or her own is a cultural myth. As a telling of historical events, this could be viewed as problematic. There are probably lots of examples of important research, discoveries, inventions etc. that were left out of the book. It would be perfectly reasonable to argue that the sample set used in the book is biased toward examples that fit the narrative that the author wanted to tell. For this reason I would not characterize it as a history book, but recognizing this fact doesn’t diminish its value, because the questions it raises around innovation are arguably more interesting.

It’s worth pausing for a moment to consider a relatively foundational question – what is innovation? How do we define it? Why is it so desirable? As far as I can tell, there’s no single agreed-upon definition. According to Wikipedia, the most “complete” way to describe innovation is:

…production or adoption, assimilation, and exploitation of a value-added novelty in economic and social spheres; renewal and enlargement of products, services, and markets; development of new methods of production; and establishment of new management systems. It is both a process and an outcome.

Maybe a good way to summarize it is that innovation unlocks some sort of economic value by doing something new/different from what existed before. You may notice that this definition is highly market-centric. There's no mention of things like ground-breaking scientific discoveries. I think the key realization here is that new discoveries/inventions aren’t automatically innovative – you have to do something WITH them that enriches people’s lives.

Let’s do a hypothetical thought experiment. Imagine that I write a research paper that demonstrably shows how to create an effective vaccine for malaria, but it’s published in an academic journal where it goes largely unnoticed. Was it innovative? What if someone follows up on my work years later and eventually, after several iterations, it leads to the mass production of a vaccine that saves lives? Based on the framing of innovation in the book, it seems like the right conclusion is fairly straightforward - new science, no matter how important or significant, is only innovative if someone ultimately does something with it for the betterment of humanity.

I think what Isaacson was trying to convey is that to be truly innovative, it’s not enough to be a genius. Having a big impact goes beyond discovery. You also need to be able to communicate why it’s important, build a product/service around it, find an appropriate market, convince people it’s worthwhile etc. Isaacson uses this argument as rationale for his thesis, but here I think we diverge a bit. I think innovation isn’t by nature *collaborative*, but rather *multidisciplinary*. Innovation requires ability in lots of different areas. Isaacson himself references this idea when he points out that innovative people are often operating at the intersection of the sciences and the humanities. One reason it may seem like innovation requires collaboration is because people that excel at many disciplines are very rare.

The multidisciplinary nature of innovation reminds me of the teachings of someone I enjoy reading about immensely, Warren Buffet’s partner Charlie Munger. Charlie speaks often about the benefits of being well-versed in a wide range of subject areas and developing a latticework of mental models, a network of concepts and their relationships to one another that captures the really important ideas from every major discipline. I think having access to a robust set of mental models, or partnering with others whose mental models complement your own, is probably one of the keys to successful innovation.

]]>The original approach that Cesar used was to pick the most interesting or memorable moment each day and record that, because it would "nudge" you to do interesting things so you've got something worth recording. But there are lots of different ways to approach this. It could be random, otherwise mundane moments. It could be snapshots of your kids, or a selfie, or anything really. What I've found most fascinating about this experiment is how it doesn't really matter what you record. That one-second clip is enough to sort of "jog" your memory and surface details about that day that otherwise might have been hard to remember. I don't know if the effect is quite the same watching someone else's life play out as it is for your own. There's so much context around each clip that only you will be aware of. One might think that a single second of video isn't much, but it's interesting how much can be recalled just from that. It's like a trigger that loads bits and pieces from your brain's long-term storage for that day and puts them into working memory. I would probably never be able to remember what I did two Thursdays ago, but I can see the moment I recorded that day and go "oh yeah, we did x and then y and then z happened". And when you put it all together, the result is mesmerizing.

The hardest part is just remembering to record something every day, but this is also one of the benefits. By forcing yourself to think about finding a moment to record, you'll unconsciously spend more time in the present and less time worrying about things that don't really matter. I don't want to oversell this with some deep philosophical argument but anecdotally my experience has been that just thinking about finding a snapshot in time to "immortalize" each day puts a lot of things in perspective.

If I've successfully convinced you to give this a try, you might be thinking about how it would be a pain in the ass to record and stitch together one-second video clips every day. And you're right, it would be. Fortunately, there's an app for that. It's not free, but the cost is marginal and well worth the time savings (and to be clear, I have no involvement or personal stake in this app, just passing along a recommendation). I've only used the iOS version but I think it's available for Android as well.

I can't wait to see what my highlight reel looks like years from now. I think seeing my kids grow up day by day over a period of months or years is going to be a surreal experience. So much of what we do and experience each day is lost, but with this one simple idea, it's possible to reclaim a small piece of it.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We've now reached the last post in this series! It's been an interesting journey. Andrew's class was really well-done and translating it all to python has been a fun experience. In this final installment we'll cover the last two topics in the course - anomaly detection and recommendation systems. We'll implement an anomaly detection algorithm using a Gaussian model and apply it to detect failing servers on a network. We'll also see how to build a recommendation system using collaborative filtering and apply it to a movie recommendations data set. As always, it helps to follow along using the exercise text for the course (posted here).

Our first task is to use a Gaussian model to detect if an unlabeled example from a data set should be considered an anomaly. We have a simple 2-dimensional data set to start off with so we can easily visualize what the algorithm is doing. Let's pull in and plot the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
data = loadmat('data/ex8data1.mat')
X = data['X']
X.shape
```

(307L, 2L)

```
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:,0], X[:,1])
```

It appears that there's a pretty tight cluster in the center with several values further out away from the cluster. In this simple example, these could be considered anomalies. To find out, we're tasked with estimating a Gaussian distribution for each feature in the data. You may recall that to define a probability distribution we need two things - mean and variance. To accomplish this we'll create a simple function that calculates the mean and variance for each feature in our data set.

```
def estimate_gaussian(X):
mu = X.mean(axis=0)
sigma = X.var(axis=0)
return mu, sigma
mu, sigma = estimate_gaussian(X)
mu, sigma
```

(array([ 14.11222578, 14.99771051]), array([ 1.83263141, 1.70974533]))

Now that we have our model parameters, we need to determine a probability threshold which indicates that an example should be considered an anomaly. To do this, we need to use a set of labeled validation data (where the true anomalies have been marked for us) and test the model's performance at identifying those anomalies given different threshold values.

```
Xval = data['Xval']
yval = data['yval']
Xval.shape, yval.shape
```

((307L, 2L), (307L, 1L))

We also need a way to calculate the probability that a data point belongs to a normal distribution given some set of parameters. Fortunately SciPy has this built-in.

```
from scipy import stats
dist = stats.norm(mu[0], sigma[0])
dist.pdf(X[:,0])[0:50]
```

array([ 0.183842 , 0.20221694, 0.21746136, 0.19778763, 0.20858956, 0.21652359, 0.16991291, 0.15123542, 0.1163989 , 0.1594734 , 0.21716057, 0.21760472, 0.20141857, 0.20157497, 0.21711385, 0.21758775, 0.21695576, 0.2138258 , 0.21057069, 0.1173018 , 0.20765108, 0.21717452, 0.19510663, 0.21702152, 0.17429399, 0.15413455, 0.21000109, 0.20223586, 0.21031898, 0.21313426, 0.16158946, 0.2170794 , 0.17825767, 0.17414633, 0.1264951 , 0.19723662, 0.14538809, 0.21766361, 0.21191386, 0.21729442, 0.21238912, 0.18799417, 0.21259798, 0.21752767, 0.20616968, 0.21520366, 0.1280081 , 0.21768113, 0.21539967, 0.16913173])

In case it isn't clear, we just calculated the probability that each of the first 50 instances of our data set's first dimension belong to the distribution that we defined earlier by calculating the mean and variance for that dimension. Essentially it's computing how far each instance is from the mean and how that compares to the "typical" distance from the mean for this data.

Let's compute and save the probability density of each of the values in our data set given the Gaussian model parameters we calculated above.

```
p = np.zeros((X.shape[0], X.shape[1]))
p[:,0] = stats.norm(mu[0], sigma[0]).pdf(X[:,0])
p[:,1] = stats.norm(mu[1], sigma[1]).pdf(X[:,1])
p.shape
```

(307L, 2L)

We also need to do this for the validation set (using the same model parameters). We'll use these probabilities combined with the true label to determine the optimal probability threshold to assign data points as anomalies.

```
pval = np.zeros((Xval.shape[0], Xval.shape[1]))
pval[:,0] = stats.norm(mu[0], sigma[0]).pdf(Xval[:,0])
pval[:,1] = stats.norm(mu[1], sigma[1]).pdf(Xval[:,1])
```

Next, we need a function that finds the best threshold value given the probability density values and true labels. To do this we'll calculate the F1 score for varying values of epsilon. F1 is a function of the number of true positives, false positives, and false negatives.

```
def select_threshold(pval, yval):
best_epsilon = 0
best_f1 = 0
f1 = 0
step = (pval.max() - pval.min()) / 1000
for epsilon in np.arange(pval.min(), pval.max(), step):
preds = pval < epsilon
tp = np.sum(np.logical_and(preds == 1, yval == 1)).astype(float)
fp = np.sum(np.logical_and(preds == 1, yval == 0)).astype(float)
fn = np.sum(np.logical_and(preds == 0, yval == 1)).astype(float)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = (2 * precision * recall) / (precision + recall)
if f1 > best_f1:
best_f1 = f1
best_epsilon = epsilon
return best_epsilon, best_f1
epsilon, f1 = select_threshold(pval, yval)
epsilon, f1
```

(0.0095667060059568421, 0.7142857142857143)

Finally, we can apply the threshold to the data set and visualize the results.

```
# indexes of the values considered to be outliers
outliers = np.where(p < epsilon)
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:,0], X[:,1])
ax.scatter(X[outliers[0],0], X[outliers[0],1], s=50, color='r', marker='o')
```

Not bad! The points in red are the ones that were flagged as outliers. Visually these seem pretty reasonable. The top right point that has some separation (but was not flagged) may be an outlier too, but it's fairly close. There's another example in the text of applying this to a higher-dimensional data set, but since it's a trivial extension of the two-dimensional example we'll move on to the last section.

Recommendation engines use item and user-based similarity measures to examine a user's historical preferences to make recommendations for new "things" the user might be interested in. In this exercise we'll implement a particular recommendation algorithm called collaborative filtering and apply it to a data set of movie ratings. Let's first load and examine the data we'll be working with.

```
data = loadmat('data/ex8_movies.mat')
data
```

{'R': array([[1, 1, 0, ..., 1, 0, 0], [1, 0, 0, ..., 0, 0, 1], [1, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'Y': array([[5, 4, 0, ..., 5, 0, 0], [3, 0, 0, ..., 0, 0, 5], [4, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Dec 1 17:19:26 2011', '__version__': '1.0'}

Y is a (number of movies x number of users) array containing ratings from 1 to 5. R is an "indicator" array containing binary values indicating if a user has rated a movie or not. Both should have the same shape.

```
Y = data['Y']
R = data['R']
Y.shape, R.shape
```

((1682L, 943L), (1682L, 943L))

We can look at the average rating for a movie by averaging over a row in Y for indexes where a rating is present.

```
Y[1,R[1,:]].mean()
```

2.5832449628844114

We can also try to "visualize" the data by rendering the matrix as if it were an image. We can't glean too much from this but it does give us an idea of a relative density of ratings across users and movies.

```
fig, ax = plt.subplots(figsize=(12,12))
ax.imshow(Y)
ax.set_xlabel('Users')
ax.set_ylabel('Movies')
fig.tight_layout()
```

Next we're going to implement a cost function for collaborative filtering. Intuitively, the "cost" is the degree to which a set of movie rating predictions deviate from the true predictions. The cost equation is given in the exercise text. It is based on two sets of parameter matrices called X and Theta in the text. These are "unrolled" into the "params" input so that we can use SciPy's optimization package later on. Note that I've included the array/matrix shapes in comments to help illustrate how the matrix interactions work.

```
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
return J
```

In order to test this, we're provided with a set of pre-trained parameters that we can evaluate. To keep the evaluation time down, we'll look at just a small sub-set of the data.

```
users = 4
movies = 5
features = 3
params_data = loadmat('data/ex8_movieParams.mat')
X = params_data['X']
Theta = params_data['Theta']
X_sub = X[:movies, :features]
Theta_sub = Theta[:users, :features]
Y_sub = Y[:movies, :users]
R_sub = R[:movies, :users]
params = np.concatenate((np.ravel(X_sub), np.ravel(Theta_sub)))
cost(params, Y_sub, R_sub, features)
```

22.224603725685675

This answer matches what the exercise text said we're supposed to get. Next we need to implement the gradient computations. Just like we did with the neural networks implementation in exercise 4, we'll extend the cost function to also compute the gradients.

```
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# calculate the gradients
X_grad = error * Theta
Theta_grad = error.T * X
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features)
J, grad
```

(22.224603725685675, array([ -2.52899165, 7.57570308, -1.89979026, -0.56819597, 3.35265031, -0.52339845, -0.83240713, 4.91163297, -0.76677878, -0.38358278, 2.26333698, -0.35334048, -0.80378006, 4.74271842, -0.74040871, -10.5680202 , 4.62776019, -7.16004443, -3.05099006, 1.16441367, -3.47410789, 0. , 0. , 0. , 0. , 0. , 0. ]))

Our next step is to add regularization to both the cost and gradient calculations. We'll create one final regularized version of the function (note that this version includes an additional learning rate parameter called "lambda").

```
def cost(params, Y, R, num_features, learning_rate):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# add the cost regularization
J = J + ((learning_rate / 2) * np.sum(np.power(Theta, 2)))
J = J + ((learning_rate / 2) * np.sum(np.power(X, 2)))
# calculate the gradients with regularization
X_grad = (error * Theta) + (learning_rate * X)
Theta_grad = (error.T * X) + (learning_rate * Theta)
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features, 1.5)
J, grad
```

(31.344056244274221, array([ -0.95596339, 6.97535514, -0.10861109, 0.60308088, 2.77421145, 0.25839822, 0.12985616, 4.0898522 , -0.89247334, 0.29684395, 1.06300933, 0.66738144, 0.60252677, 4.90185327, -0.19747928, -10.13985478, 2.10136256, -6.76563628, -2.29347024, 0.48244098, -2.99791422, -0.64787484, -0.71820673, 1.27006666, 1.09289758, -0.40784086, 0.49026541]))

This result again matches up with the expected output from the exercise code, so it looks like the regularization is working. Before we train the model, we have one final step. We're tasked with creating our own movie ratings so we can use the model to generate personalized recommendations. A file is provided for us that links the movie index to its title. Let's load the file into a dictionary and use some sample ratings provided in the exercise.

```
movie_idx = {}
f = open('data/movie_ids.txt')
for line in f:
tokens = line.split(' ')
tokens[-1] = tokens[-1][:-1]
movie_idx[int(tokens[0]) - 1] = ' '.join(tokens[1:])
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
print('Rated {0} with {1} stars.'.format(movie_idx[0], str(int(ratings[0]))))
print('Rated {0} with {1} stars.'.format(movie_idx[6], str(int(ratings[6]))))
print('Rated {0} with {1} stars.'.format(movie_idx[11], str(int(ratings[11]))))
print('Rated {0} with {1} stars.'.format(movie_idx[53], str(int(ratings[53]))))
print('Rated {0} with {1} stars.'.format(movie_idx[63], str(int(ratings[63]))))
print('Rated {0} with {1} stars.'.format(movie_idx[65], str(int(ratings[65]))))
print('Rated {0} with {1} stars.'.format(movie_idx[68], str(int(ratings[68]))))
print('Rated {0} with {1} stars.'.format(movie_idx[97], str(int(ratings[97]))))
print('Rated {0} with {1} stars.'.format(movie_idx[182], str(int(ratings[182]))))
print('Rated {0} with {1} stars.'.format(movie_idx[225], str(int(ratings[225]))))
print('Rated {0} with {1} stars.'.format(movie_idx[354], str(int(ratings[354]))))
```

Rated Toy Story (1995) with 4 stars. Rated Twelve Monkeys (1995) with 3 stars. Rated Usual Suspects, The (1995) with 5 stars. Rated Outbreak (1995) with 4 stars. Rated Shawshank Redemption, The (1994) with 5 stars. Rated While You Were Sleeping (1995) with 3 stars. Rated Forrest Gump (1994) with 5 stars. Rated Silence of the Lambs, The (1991) with 2 stars. Rated Alien (1979) with 4 stars. Rated Die Hard 2 (1990) with 5 stars. Rated Sphere (1998) with 5 stars.

We can add this custom ratings vector to the data set so it gets included in the model.

```
R = data['R']
Y = data['Y']
Y = np.append(Y, ratings, axis=1)
R = np.append(R, ratings != 0, axis=1)
```

We're now ready to train the collaborative filtering model. We're going to normalize the ratings and then run the optimization routine using our cost function, parameter vector, and data matrices at inputs.

```
from scipy.optimize import minimize
movies = Y.shape[0]
users = Y.shape[1]
features = 10
learning_rate = 10.
X = np.random.random(size=(movies, features))
Theta = np.random.random(size=(users, features))
params = np.concatenate((np.ravel(X), np.ravel(Theta)))
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
idx = np.where(R[i,:] == 1)[0]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx] - Ymean[i]
fmin = minimize(fun=cost, x0=params, args=(Ynorm, R, features, learning_rate),
method='CG', jac=True, options={'maxiter': 100})
fmin
```

status: 1 success: False njev: 149 nfev: 149 fun: 38953.88249706676 x: array([-0.07177334, -0.08315075, 0.1081135 , ..., 0.1817828 , 0.16873062, 0.03383596]) message: 'Maximum number of iterations has been exceeded.' jac: array([ 0.01833555, 0.07377974, 0.03999323, ..., -0.00970181, 0.00758961, -0.01181811])

Since everything was "unrolled" for the optimization routine to work properly, we need to reshape our matrices back to their original dimensions.

```
X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))
Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))
X.shape, Theta.shape
```

((1682L, 10L), (944L, 10L))

Our trained parameters are now in X and Theta. We can use these to create some recommendations for the user we added earlier.

```
predictions = X * Theta.T
my_preds = predictions[:, -1] + Ymean
sorted_preds = np.sort(my_preds, axis=0)[::-1]
sorted_preds[:10]
```

matrix([[ 5.00000264], [ 5.00000249], [ 4.99999831], [ 4.99999671], [ 4.99999659], [ 4.99999253], [ 4.99999238], [ 4.9999915 ], [ 4.99999019], [ 4.99998643]]

That gives us an ordered list of the top ratings, but we lost what index those ratings are for. We actually need to use argsort so we know what movie the predicted rating corresponds to.

```
idx = np.argsort(my_preds, axis=0)[::-1]
print("Top 10 movie predictions:")
for i in range(10):
j = int(idx[i])
print('Predicted rating of {0} for movie {1}.'.format(str(float(my_preds[j])), movie_idx[j]))
```

Top 10 movie predictions: Predicted rating of 5.00000264002 for movie Prefontaine (1997). Predicted rating of 5.00000249142 for movie Santa with Muscles (1996). Predicted rating of 4.99999831018 for movie Marlene Dietrich: Shadow and Light (1996) . Predicted rating of 4.9999967124 for movie Saint of Fort Washington, The (1993). Predicted rating of 4.99999658864 for movie They Made Me a Criminal (1939). Predicted rating of 4.999992533 for movie Someone Else's America (1995). Predicted rating of 4.99999238336 for movie Great Day in Harlem, A (1994). Predicted rating of 4.99999149604 for movie Star Kid (1997). Predicted rating of 4.99999018592 for movie Aiqing wansui (1994). Predicted rating of 4.99998642746 for movie Entertaining Angels: The Dorothy Day Story (1996).

The recommended movies don't actually line up that well with what's in the exercise text. The reason why isn't too clear and I haven't found anything to account for it, but it's possible there's a mistake in the code somewhere. Bonus points if someone spots an error and points it out! Still, even if there's some minor difference the bulk of the example is accurate.

That concludes the last exercise! When I started this series, my goal was to become more proficient in python as well as refine the machine learning knowledge I'd gained from taking Andrew's class. I feel confident that I accomplished that goal. My hope though is that it's just as valuable to read as it was for me to create.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We're now down to the last two posts in this series! In this installment we'll cover two fascinating topics: K-means clustering and principal component analysis (PCA). K-means and PCA are both examples of unsupervised learning techniques. Unsupervised learning problems do not have any label or target for us to learn from to make predictions, so unsupervised algorithms instead attempt to learn some interesting structure in the data itself. We'll first implement K-means and see how it can be used it to compress an image. We'll also experiment with PCA to find a low-dimensional representation of images of faces. As always, it helps to follow along using the exercise text for the course (posted here).

To start out we're going to implement and apply K-means to a simple 2-dimensional data set to gain some intuition about how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances together into clusters. The algorithm starts by guessing the initial centroids for each cluster, and then repeatedly assigns instances to the nearest cluster and re-computes the centroid of that cluster. The first piece that we're going to implement is a function that finds the closest centroid for each instance in the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
def find_closest_centroids(X, centroids):
m = X.shape[0]
k = centroids.shape[0]
idx = np.zeros(m)
for i in range(m):
min_dist = 1000000
for j in range(k):
dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
if dist < min_dist:
min_dist = dist
idx[i] = j
return idx
```

Let's test the function to make sure it's working as expected. We'll use the test case provided in the exercise.

```
data = loadmat('data/ex7data2.mat')
X = data['X']
initial_centroids = initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]
```

array([ 0., 2., 1.])

The output matches the expected values in the text (remember our arrays are zero-indexed instead of one-indexed so the values are one lower than in the exercise). Next we need a function to compute the centroid of a cluster. The centroid is simply the mean of all of the examples currently assigned to the cluster.

```
def compute_centroids(X, idx, k):
m, n = X.shape
centroids = np.zeros((k, n))
for i in range(k):
indices = np.where(idx == i)
centroids[i,:] = (np.sum(X[indices,:], axis=1) / len(indices[0])).ravel()
return centroids
compute_centroids(X, idx, 3)
```

array([[ 2.42830111, 3.15792418], [ 5.81350331, 2.63365645], [ 7.11938687, 3.6166844 ]])

This output also matches the expected values from the exercise. So far so good. The next part involves actually running the algorithm for some number of iterations and visualizing the result. This step was implmented for us in the exercise, but since it's not that complicated I'll build it here from scratch. In order to run the algorithm we just need to alternate between assigning examples to the nearest cluster and re-computing the cluster centroids.

```
def run_k_means(X, initial_centroids, max_iters):
m, n = X.shape
k = initial_centroids.shape[0]
idx = np.zeros(m)
centroids = initial_centroids
for i in range(max_iters):
idx = find_closest_centroids(X, centroids)
centroids = compute_centroids(X, idx, k)
return idx, centroids
idx, centroids = run_k_means(X, initial_centroids, 10)
```

We can now plot the result using color coding to indicate cluster membership.

```
cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()
```

One step we skipped over is a process for initializing the centroids. This can affect the convergence of the algorithm. We're tasked with creating a function that selects random examples and uses them as the initial centroids.

```
def init_centroids(X, k):
m, n = X.shape
centroids = np.zeros((k, n))
idx = np.random.randint(0, m, k)
for i in range(k):
centroids[i,:] = X[idx[i],:]
return centroids
init_centroids(X, 3)
```

array([[ 1.15354031, 4.67866717], [ 6.27376271, 2.24256036], [ 2.20960296, 4.91469264]])

Our next task is to apply K-means to image compression. The intuition here is that we can use clustering to find a small number of colors that are most representative of the image, and map the original 24-bit colors to a lower-dimensional color space using the cluster assignments. Here's the image we're going to compress.

The raw pixel data has been pre-loaded for us so let's pull it in.

```
image_data = loadmat('data/bird_small.mat')
image_data
```

{'A': array([[[219, 180, 103], [230, 185, 116], [226, 186, 110], ..., [ 14, 15, 13], [ 13, 15, 12], [ 12, 14, 12]], ..., [[ 15, 19, 19], [ 20, 20, 18], [ 18, 19, 17], ..., [ 65, 43, 39], [ 58, 37, 38], [ 52, 39, 34]]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Tue Jun 5 04:06:24 2012', '__version__': '1.0'}

We can quickly look at the shape of the data to validate that it looks like what we'd expect for an image.

```
A = image_data['A']
A.shape
```

(128L, 128L, 3L)

Now we need to apply some pre-processing to the data and feed it into the K-means algorithm.

```
# normalize value ranges
A = A / 255.
# reshape the array
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
# randomly initialize the centroids
initial_centroids = init_centroids(X, 16)
# run the algorithm
idx, centroids = run_k_means(X, initial_centroids, 10)
# get the closest centroids one last time
idx = find_closest_centroids(X, centroids)
# map each pixel to the centroid value
X_recovered = centroids[idx.astype(int),:]
# reshape to the original dimensions
X_recovered = np.reshape(X_recovered, (A.shape[0], A.shape[1], A.shape[2]))
plt.imshow(X_recovered)
```

Cool! You can see that we created some artifacts in the compression but the main features of the image are still there despite mapping the original image to only 16 colors. That's it for K-means. We'll now move on to principal component analysis.

PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set. It can be used for dimension reduction among other things. In this exercise we're first tasked with implementing PCA and applying it to a simple 2-dimensional data set to see how it works. Let's start off by loading and visualizing the data set.

```
data = loadmat('data/ex7data1.mat')
X = data['X']
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:, 0], X[:, 1])
```

The algorithm for PCA is fairly simple. After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data. Since numpy already has built-in functions to calculate the covariance and SVD of a matrix, we'll use those rather than build from scratch.

```
def pca(X):
# normalize the features
X = (X - X.mean()) / X.std()
# compute the covariance matrix
X = np.matrix(X)
cov = (X.T * X) / X.shape[0]
# perform SVD
U, S, V = np.linalg.svd(cov)
return U, S, V
U, S, V = pca(X)
U, S, V
```

(matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]), array([ 1.43584536, 0.56415464]), matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]))

Now that we have the principal components (matrix U), we can use these to project the original data into a lower-dimensional space. For this task we'll implement a function that computes the projection and selects only the top K components, effectively reducing the number of dimensions.

```
def project_data(X, U, k):
U_reduced = U[:,:k]
return np.dot(X, U_reduced)
Z = project_data(X, U, 1)
Z
```

matrix([[-4.74689738], [-7.15889408], [-4.79563345], [-4.45754509], [-4.80263579], ..., [-6.44590096], [-2.69118076], [-4.61386195], [-5.88236227], [-7.76732508]])

We can also attempt to recover the original data by reversing the steps we took to project it.

```
def recover_data(Z, U, k):
U_reduced = U[:,:k]
return np.dot(Z, U_reduced.T)
X_recovered = recover_data(Z, U, 1)
X_recovered
```

matrix([[ 3.76152442, 2.89550838], [ 5.67283275, 4.36677606], [ 3.80014373, 2.92523637], [ 3.53223661, 2.71900952], [ 3.80569251, 2.92950765], ..., [ 5.10784454, 3.93186513], [ 2.13253865, 1.64156413], [ 3.65610482, 2.81435955], [ 4.66128664, 3.58811828], [ 6.1549641 , 4.73790627]])

If we then attempt to visualize the recovered data, the intuition behind how the algorithm works becomes really obvious.

```
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X_recovered[:, 0], X_recovered[:, 1])
```

Notice how the points all seem to be compressed down to an invisible line. That invisible line is essentially the first principal component. The second principal component, which we cut off when we reduced the data to one dimension, can be thought of as the variation orthogonal to that line. Since we lost that information, our reconstruction can only place the points relative to the first principal component.

Our last task in this exercise is to apply PCA to images of faces. By using the same dimension reduction techniques we can capture the "essence" of the images using much less data than the original images.

```
faces = loadmat('data/ex7faces.mat')
X = faces['X']
X.shape
```

(5000L, 1024L)

The exercise code includes a function that will render the first 100 faces in the data set in a grid. Rather than try to re-produce that here, you can look in the exercise text for an example of what they look like. We can at least render one image fairly easily though.

```
face = np.reshape(X[3,:], (32, 32))
plt.imshow(face)
```

Yikes, that looks awful! These are only 32 x 32 grayscale images though (it's also rendering sideways, but we can ignore that for now). Our next step is to run PCA on the faces data set and take the top 100 principal components.

```
U, S, V = pca(X)
Z = project_data(X, U, 100)
```

Now we can attempt to recover the original structure and render it again.

```
X_recovered = recover_data(Z, U, 100)
face = np.reshape(X_recovered[3,:], (32, 32))
plt.imshow(face)
```

Notice that we lost some detail, though not as much as you might expect for a 10x reduction in the number of dimensions.

That concludes exercise 7! In the final exercise we'll implement algorithms for anomaly detection and build a recommendation system using collaborative filtering.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We're now hitting the home stretch of both the course content and this series of blog posts. In this exercise, we'll be using support vector machines (SVMs) to build a spam classifier. We'll start with SVMs on some simple 2D data sets to see how they work. Then we'll look at a set of email data and build a classifier on the processed emails using a SVM to determine if they are spam or not. As always, it helps to follow along using the exercise text for the course (posted here).

Before jumping into the code, let's quickly recap what an SVM is and why it's worth learning about. Broadly speaking, SVMs are a class of supervised learning algorithm that builds a representation of the examples in the training data as points in space, mapped so that the examples belonging to each class in the data are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall on. SVMs are a binary classifcation tool by default, although there are ways to use them in multi-class scenarios. SVMs can also handle non-linear classification using something called the kernel trick to project the data into a high-dimensional space before attempting to find a hyperplane. SVMs are a powerful class of algorithms and are used often in practical machine learning applications.

The first thing we're going to do is look at a simple 2-dimensional data set and see how a linear SVM works on the data set for varying values of C (similar to the regularization term in linear/logistic regression). Let's load the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
raw_data = loadmat('data/ex6data1.mat')
raw_data
```

{'X': array([[ 1.9643 , 4.5957 ], [ 2.2753 , 3.8589 ], [ 2.9781 , 4.5651 ], ..., [ 0.9044 , 3.0198 ], [ 0.76615 , 2.5899 ], [ 0.086405, 4.1045 ]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:28:43 2011', '__version__': '1.0', 'y': array([[1], [1], [1], ..., [0], [0], [1]], dtype=uint8)}

We'll visualize it as a scatter plot where the class label is denoted by a symbol ('+' for positive, 'o' for negative).

```
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
```

Notice that there is one outlier positive example that sits apart from the others. The classes are still linearly separable but it's a very tight fit. We're going to train a linear support vector machine to learn the class boundary. In this exercise we're not tasked with implementing an SVM from scratch, so I'm going to use the one built into scikit-learn.

```
from sklearn import svm
svc = svm.LinearSVC(C=1, loss='hinge', max_iter=1000)
svc
```

LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)

For the first experiment we'll use C=1 and see how it performs.

```
svc.fit(data[['X1', 'X2']], data['y'])
svc.score(data[['X1', 'X2']], data['y'])
```

0.98039215686274506

It appears that it mis-classified the outlier. Let's see what happens with a larger value of C.

```
svc2 = svm.LinearSVC(C=100, loss='hinge', max_iter=1000)
svc2.fit(data[['X1', 'X2']], data['y'])
svc2.score(data[['X1', 'X2']], data['y'])
```

1.0

This time we got a perfect classification of the training data, however by increasing the value of C we've created a decision boundary that is no longer a natural fit for the data. We can visualize this by looking at the confidence level for each class prediction, which is a function of the point's distance from the hyperplane.

```
data['SVM 1 Confidence'] = svc.decision_function(data[['X1', 'X2']])
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 1 Confidence'], cmap='seismic')
ax.set_title('SVM (C=1) Decision Confidence')
```

```
data['SVM 2 Confidence'] = svc2.decision_function(data[['X1', 'X2']])
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 2 Confidence'], cmap='seismic')
ax.set_title('SVM (C=100) Decision Confidence')
```

The difference is a bit subtle but look at the color of the points near the boundary. In the first image the points near the boundary are a strong red or blue, indicating that they're a solid distance from the hyperplane. This is not the case in the second image, where a number of points are nearly white, indicating that they are directly adjacent to the hyperplane.

Now we're going to move from a linear SVM to one that's capable of non-linear classification using kernels. We're first tasked with implementing a gaussian kernel function. Although scikit-learn has a gaussian kernel built in, for transparency we'll implement one from scratch.

```
def gaussian_kernel(x1, x2, sigma):
return np.exp(-(np.sum((x1 - x2) ** 2) / (2 * (sigma ** 2))))
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
gaussian_kernel(x1, x2, sigma)
```

0.32465246735834974

That result matches the expected value from the exercise. Next we're going to examine another data set, this time with a non-linear decision boundary.

```
raw_data = loadmat('data/ex6data2.mat')
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=30, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=30, marker='o', label='Negative')
ax.legend()
```

For this data set we'll build a support vector machine classifier using the built-in RBF kernel and examine its accuracy on the training data. To visualize the decision boundary, this time we'll shade the points based on the predicted probability that the instance has a negative class label. We'll see from the result that it gets most of them right.

```
svc = svm.SVC(C=100, gamma=10, probability=True)
svc.fit(data[['X1', 'X2']], data['y'])
data['Probability'] = svc.predict_proba(data[['X1', 'X2']])[:,0]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')
```

For the third data set we're given both training and validation sets and tasked with finding optimal hyper-parameters for an SVM model based on validation set performance. Although we could use scikit-learn's built-in grid search to do this quite easily, in the spirit of following the exercise directions we'll implement a simple grid search from scratch.

```
raw_data = loadmat('data/ex6data3.mat')
X = raw_data['X']
Xval = raw_data['Xval']
y = raw_data['y'].ravel()
yval = raw_data['yval'].ravel()
C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
best_score = 0
best_params = {'C': None, 'gamma': None}
for C in C_values:
for gamma in gamma_values:
svc = svm.SVC(C=C, gamma=gamma)
svc.fit(X, y)
score = svc.score(Xval, yval)
if score > best_score:
best_score = score
best_params['C'] = C
best_params['gamma'] = gamma
best_score, best_params
```

(0.96499999999999997, {'C': 0.3, 'gamma': 100})

Now we'll move on to the last part of the exercise. In this part our objective is to use SVMs to build a spam filter. In the exercise text, there's a task involving some text pre-processing to get our data in a format suitable for an SVM to handle. However, the task is pretty trivial (mapping words to an ID from a dictionary that's provided for the exercise) and the rest of the pre-processing steps such as HTML removal, stemming, normalization etc. are already done. Rather than reproduce these pre-processing steps, I'm going to skip ahead to the machine learning task which involves building a classifier from pre-processed train and test data sets consisting of spam and non-spam emails transformed to word occurance vectors.

```
spam_train = loadmat('data/spamTrain.mat')
spam_test = loadmat('data/spamTest.mat')
spam_train
```

{'X': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 1, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:27:25 2011', '__version__': '1.0', 'y': array([[1], [1], [0], ..., [1], [0], [0]], dtype=uint8)}

```
X = spam_train['X']
Xtest = spam_test['Xtest']
y = spam_train['y'].ravel()
ytest = spam_test['ytest'].ravel()
X.shape, y.shape, Xtest.shape, ytest.shape
```

((4000L, 1899L), (4000L,), (1000L, 1899L), (1000L,))

Each document has been converted to a vector with 1,899 dimensions corresponding to the 1,899 words in the vocabulary. The values are binary, indicating the presence or absence of the word in the document. At this point, training and evaluation are just a matter of fitting the testing the classifer.

```
svc = svm.SVC()
svc.fit(X, y)
print('Test accuracy = {0}%'.format(np.round(svc.score(Xtest, ytest) * 100, 2)))
```

Test accuracy = 95.3%

This result is with the default parameters. We could probably get it a bit higher using some parameter tuning, but 95% accuracy still isn't bad.

That concludes exercise 6! In the next exercise we'll perform clustering and image compression with K-Means and principal component analysis.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

In part four we wrapped up our implementation of logistic regression by extending our solution to handle multi-class classification and testing it on the hand-written digits data set. Using just logistic regression we were able to hit a classification accuracy of about 97.5%, which is reasonably good but pretty much maxes out what we can achieve with a linear model. In this blog post we'll again tackle the hand-written digits data set, but this time using a feed-forward neural network with backpropagation. We'll implement un-regularized and regularized versions of the neural network cost function and compute gradients via the backpropagation algorithm. Finally, we'll run the algorithm through an optimizer and evaluate the performance of the network on the handwritten digits data set

I'll note up front that the math (and code) in this exercise gets a bit hairy. Implementing a neural network from scratch is not for the faint of heart. For those that venture on, be warned - here be dragons. I also strongly encourage the reader to skim through the corresponding exercise text from Andrew Ng's class. I uploaded a copy here. This text contains all of the equations that we'll be implementing. With that, let's dive in!

Since the data set is the same one we used in the last exercise, we'll re-use the code from last time to load the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline
data = loadmat('data/ex3data1.mat')
data
```

{'X': array([[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011', '__version__': '1.0', 'y': array([[10], [10], [10], ..., [ 9], [ 9], [ 9]], dtype=uint8)}

Since we're going to need these later (and will use them often), let's create some useful variables up-front.

```
X = data['X']
y = data['y']
X.shape, y.shape
((5000L, 400L), (5000L, 1L))
```

We're also going to need to one-hot encode our labels. One-hot encoding turns a class label \(n\) (out of \(k\) classes) into a vector of length \(k\) where index \(n\) is "hot" (1) while the rest are zero. Scikit-learn has a built in utility we can use for this.

```
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
y_onehot = encoder.fit_transform(y)
y_onehot.shape
(5000L, 10L)
```

```
y[0], y_onehot[0,:]
(array([10], dtype=uint8),
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]))
```

The neural network we're going to build for this exercise has an input layer matching the size of our instance data (400 + the bias unit), a hidden layer with 25 units (26 with the bias unit), and an output layer with 10 units corresponding to our one-hot encoding for the class labels. The first piece we need to implement is a cost function to evaluate the loss for a given set of network parameters. The source mathematical function is in the exercise text, and looks pretty intimidating, but it helps to really break it down into pieces. Here are the functions required to compute the cost.

```
def sigmoid(z):
return 1 / (1 + np.exp(-z))
```

```
def forward_propagate(X, theta1, theta2):
m = X.shape[0]
a1 = np.insert(X, 0, values=np.ones(m), axis=1)
z2 = a1 * theta1.T
a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
z3 = a2 * theta2.T
h = sigmoid(z3)
return a1, z2, a2, z3, h
```

```
def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# reshape the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# run the feed-forward pass
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
# compute the cost
J = 0
for i in range(m):
first_term = np.multiply(-y[i,:], np.log(h[i,:]))
second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))
J += np.sum(first_term - second_term)
J = J / m
return J
```

We've used the sigmoid function before so that's not new. The forward-propagate function computes the hypothesis for each training instance given the current parameters (in other words, given some current state of the network and a set of inputs, it calculates the outputs at each layer in the network). The shape of the hypothesis vector (denoted by \(h\)), which contains the prediction probabilities for each class, should match our one-hot encoding for y. Finally, the cost function runs the forward-propagation step and calculates the error of the hypothesis (predictions) vs. the true label for the instance.

We can test this real quick to convince ourselves that it's working as expected. Seeing the output from intermediate steps is also useful to understand what's going on.

```
# initial setup
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1
# randomly initialize a parameter array of the size of the full network's parameters
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# unravel the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
theta1.shape, theta2.shape
((25L, 401L), (10L, 26L))
```

```
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
a1.shape, z2.shape, a2.shape, z3.shape, h.shape
((5000L, 401L), (5000L, 25L), (5000L, 26L), (5000L, 10L), (5000L, 10L))
```

The cost function, after computing the hypothesis matrix \(h\), applies the cost equation to compute the total error between \(y\) and \(h\).

```
cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
6.8228086634127862
```

Our next step is to add regularization to the cost function, which adds a penalty term to the cost that scales with the magnitude of the parameters. The equation for this looks pretty ugly, but it can be boiled down to just one line of code added to the original cost function. Just add the following right before the return statement.

```
J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))
```

Next up is the backpropagation algorithm. Backpropagation computes the parameter updates that will reduce the error of the network on the training data. The first thing we need is a function that computes the gradient of the sigmoid function we created earlier.

```
def sigmoid_gradient(z):
return np.multiply(sigmoid(z), (1 - sigmoid(z)))
```

Now we're ready to implement backpropagation to compute the gradients. Since the computations required for backpropagation are a superset of those required in the cost function, we're actually going to extend the cost function to also perform backpropagation and return both the cost and the gradients. If you're wondering why I'm not just calling the existing cost function from within the backprop function to make the design more modular, it's because backprop uses a number of other variables calculated inside the cost function. Here's the full implementation. I skipped ahead and added gradient regularization rather than first create an un-regularized version.

```
def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):
##### this section is identical to the cost function logic we already saw #####
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# reshape the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# run the feed-forward pass
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
# initializations
J = 0
delta1 = np.zeros(theta1.shape) # (25, 401)
delta2 = np.zeros(theta2.shape) # (10, 26)
# compute the cost
for i in range(m):
first_term = np.multiply(-y[i,:], np.log(h[i,:]))
second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))
J += np.sum(first_term - second_term)
J = J / m
# add the cost regularization term
J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))
##### end of cost function logic, below is the new part #####
# perform backpropagation
for t in range(m):
a1t = a1[t,:] # (1, 401)
z2t = z2[t,:] # (1, 25)
a2t = a2[t,:] # (1, 26)
ht = h[t,:] # (1, 10)
yt = y[t,:] # (1, 10)
d3t = ht - yt # (1, 10)
z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)
d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)
delta1 = delta1 + (d2t[:,1:]).T * a1t
delta2 = delta2 + d3t.T * a2t
delta1 = delta1 / m
delta2 = delta2 / m
# add the gradient regularization term
delta1[:,1:] = delta1[:,1:] + (theta1[:,1:] * learning_rate) / m
delta2[:,1:] = delta2[:,1:] + (theta2[:,1:] * learning_rate) / m
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))
return J, grad
```

There's a lot going on here so let's try to unpack it a bit. The first half of the function is calculating the error by running the data plus current parameters through the "network" (the forward-propagate function) and comparing the output to the true labels. The total error across the whole data set is represented as \(J\). This is the part we discussed earlier as the cost function.

The rest of the function is essentially answering the question "how can I adjust my parameters to reduce the error the next time I run through the network"? It does this by computing the contributions at each layer to the total error and adjusting appropriately by coming up with a "gradient" matrix (or, how much to change each parameter and in what direction).

The hardest part of the backprop computation (other than understanding WHY we're doing all these calculations) is getting the matrix dimensions right, which is why I annotated some of the calculations with comments showing the resulting shape of the computation. By the way, if you find it confusing when to use A * B vs. np.multiply(A, B), you're not alone. Basically the former is a matrix multiplication and the latter is an element-wise multiplication (unless A or B is a scalar value, in which case it doesn't matter). I wish there was a more concise syntax for this (maybe there is and I'm just not aware of it).

Anyway, let's test it out to make sure the function returns what we're expecting it to.

```
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
J, grad.shape
(6.8281541822949299, (10285L,))
```

We're finally ready to train our network and use it to make predictions. This is roughly similar to the previous exercise with multi-class logistic regression.

```
from scipy.optimize import minimize
# minimize the objective function
fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate),
method='TNC', jac=True, options={'maxiter': 250})
fmin
```

status: 3 success: False nfev: 250 fun: 0.33900736818312283 x: array([ -8.85740564e-01, 2.57420350e-04, -4.09396202e-04, ..., 1.44634791e+00, 1.68974302e+00, 7.10121593e-01]) message: 'Max. number of function evaluations reach' jac: array([ -5.11463703e-04, 5.14840700e-08, -8.18792403e-08, ..., -2.48297749e-04, -3.17870911e-04, -3.31404592e-04]) nit: 21

We put a bound on the number of iterations since the objective function is not likely to completely converge. Our total cost has dropped below 0.5 though so that's a good indicator that the algorithm is working. Let's use the parameters it found and forward-propagate them through the network to get some predictions. We have to reshape the output from the optimizer to match the parameter matrix shapes that our network is expecting, then run the forward propagation to generate a hypothesis for the input data.

```
X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)
y_pred
```

array([[10], [10], [10], ..., [ 9], [ 9], [ 9]], dtype=int64)

Finally we can compute the accuracy to see how well our trained network is doing.

```
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print 'accuracy = {0}%'.format(accuracy * 100)
accuracy = 99.22%
```

And we're done! We've successfully implemented a rudimentary feed-forward neural network with backpropagation and used it to classify images of handwritten digits. It might seem surprising at first that we managed to do this without implementing any classes or data structures resembling a network in any way. Isn't that what neural networks are all about? This was one of the biggest surprises to me when I took the class - how the whole thing basically boils down to a series of matrix multiplications. It turns out that this is by far the most efficient way to solve the problem. In fact if you look at any of the popular deep learning frameworks such as Tensorflow, they're essentially graphs of linear algebra computations. It's a very useful and practical way to think about machine learning algorithms.

That concludes the exercise on neural networks. In the next exercise we'll look at another popular supervised learning algorithm, support vector machines.

]]>Dynamic programming is related to a number of other fundamental concepts in computer science in interesting ways. Recursion, for example, is similar to (but not identical to) dynamic programming. The key difference is that in a naive recursive solution, answers to sub-problems may be computed many times. A recursive solution that caches answers to sub-problems which were already computed is called memoization, which is basically the inverse of dynamic programming. Another variation is when the sub-problems don't actually overlap at all, in which case the technique is known as divide and conquer. Finally, dynamic programming is tied to the concept of mathematical induction and can be thought of as a specific application of inductive reasoning in practice.

While the core ideas behind dynamic programming are actually pretty simple, it turns out that it's fairly challenging to use on non-trivial problems because it's often not obvious how to frame a difficult problem in terms of overlapping sub-problems. This is where experience and practice come in handy, which is the idea for this blog post. We'll build both naive and "intelligent" solutions to several well-known problems and see how the problems are decomposed to use dynamic programming solutions. The code is written in basic python with no special dependencies.

First we'll look at the problem of computing numbers in the Fibonacci sequence. The problem definition is very simple - each number in the sequence is the sum of the two previous numbers in the sequence. Or, more formally:

\(F_n = F_{n-1} + F_{n-2}\), with \(F_0 = 0\) and \(F_1 = 1\) as the seed values.

Our solution will be responsible for calculating each of the Fibonacci numbers up to some defined limit. We'll first implement a naive solution that re-calculates each number in the sequence from scratch.

```
def fib(n):
if n == 0:
return 0
if n == 1:
return 1
return fib(n - 1) + fib(n - 2)
def all_fib(n):
fibs = []
for i in range(n):
fibs.append(fib(i))
return fibs
```

Let's try it out on a pretty small number first.

```
%time print(all_fib(10))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Wall time: 0 ns
```

Okay, probably too trivial. Let's try a bit bigger...

```
%time print(all_fib(20))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]
Wall time: 5 ms
```

The runtime was at least measurable now, but still pretty quick. Let's try one more time...

```
%time print(all_fib(40))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986]
Wall time: 1min 9s
```

That escalated quickly! Clearly this is a pretty bad solution. Let's see what it looks like when applying dynamic programming.

```
def all_fib_dp(n):
fibs = []
for i in range(n):
if i < 2:
fibs.append(i)
else:
fibs.append(fibs[i - 2] + fibs[i - 1])
return fibs
```

This time we're saving the result at each iteration and computing new numbers as a sum of the previously saved results. Let's see what this does to the performance of the function.

```
%time print(all_fib_dp(40))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986]
Wall time: 0 ns
```

By not computing the full recusrive tree on each iteration, we've essentially reduced the running time for the first 40 numbers from ~75 seconds to virtually instant. This also happens to be a good example of the danger of naive recursive functions. Our new Fibonaci number function can compute additional values in linear time vs. exponential time for the first version.

```
%time print(all_fib_dp(100))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073L, 4807526976L, 7778742049L, 12586269025L, 20365011074L, 32951280099L, 53316291173L, 86267571272L, 139583862445L, 225851433717L, 365435296162L, 591286729879L, 956722026041L, 1548008755920L, 2504730781961L, 4052739537881L, 6557470319842L, 10610209857723L, 17167680177565L, 27777890035288L, 44945570212853L, 72723460248141L, 117669030460994L, 190392490709135L, 308061521170129L, 498454011879264L, 806515533049393L, 1304969544928657L, 2111485077978050L, 3416454622906707L, 5527939700884757L, 8944394323791464L, 14472334024676221L, 23416728348467685L, 37889062373143906L, 61305790721611591L, 99194853094755497L, 160500643816367088L, 259695496911122585L, 420196140727489673L, 679891637638612258L, 1100087778366101931L, 1779979416004714189L, 2880067194370816120L, 4660046610375530309L, 7540113804746346429L, 12200160415121876738L, 19740274219868223167L, 31940434634990099905L, 51680708854858323072L, 83621143489848422977L, 135301852344706746049L, 218922995834555169026L]
Wall time: 0 ns
```

The Fibonacci problem is a good starter example but doesn't really capture the challenge of representing problems in terms of optimal sub-problems because for Fibonacci numbers the answer is pretty obvious. Let's move up one step in difficulty to a problem known as the longest increasing subsequence problem. The objective is to find the longest subsequence of a given sequence such that all elements in the subsequence are sorted in increasing order. Note that the elements do not need to be contiguous; that is, they are not required to appear next to each other. For example, in the sequence [10, 22, 9, 33, 21, 50, 41, 60, 80] the longest increasing subsequence (LIS) is [10, 22, 33, 50, 60, 80].

It turns out that it's fairly difficult to do a "brute-force" solution to this problem. The dynamic programming solution is much more concise and a natural fit for the problem definition, so we'll skip creating an unnecessarily complicated naive solution and jump straight to the DP solution.

```
def find_lis(seq):
n = len(seq)
max_length = 1
best_seq_end = -1
# keep a chain of the values of the lis
prev = [0 for i in range(n)]
prev[0] = -1
# the length of the lis at each position
length = [0 for i in range(n)]
length[0] = 1
for i in range(1, n):
length[i] = 0
prev[i] = -1
# start from index i-1 and work back to 0
for j in range(i - 1, -1, -1):
if (length[j] + 1) > length[i] and seq[j] < seq[i]:
# there's a number before position i that increases the lis at i
length[i] = length[j] + 1
prev[i] = j
if length[i] > max_length:
max_length = length[i]
best_seq_end = i
# recover the subsequence
lis = []
element = best_seq_end
while element != -1:
lis.append(seq[element])
element = prev[element]
return lis[::-1]
```

The intuition here is that for a given index \(i\), we can compute the length of the longest increasing subsequence \(length(i)\) by looking at all indices \(j < i\) and if \(length(j) + 1 > i\) and \(seq[j] < seq[i]\) (meaning there's a number at position \(j\) that increases the longest subsequence at that index such that it is now longer than the longest recorded subsequence at \(i\)) then we increase \(length(i)\) by 1. It's a bit confusing at first glance but step through it carefully and convince yourself that this solution finds the optimal subsequence. The "prev" list holds the indices of the elements that form the actual values in the subsequence.

Let's generate some test data and try it out.

```
import numpy as np
seq_small = list(np.random.randint(0, 20, 20))
seq_small
[16, 10, 17, 18, 9, 0, 2, 19, 4, 3, 1, 14, 12, 6, 2, 4, 11, 5, 19, 4]
```

Now we can run a quick test to see if it works on a small sequence.

```
%time print(find_lis(seq_small))
[0, 1, 2, 4, 5, 19]
Wall time: 0 ns
```

Just based on the eye test the output looks correct. Let's see how well it performs on much larger sequences.

```
seq = list(np.random.randint(0, 10000, 10000))
%time print(find_lis(seq))
[29, 94, 125, 159, 262, 271, 274, 345, 375, 421, 524, 536, 668, 689, 694, 755, 763, 774, 788, 854, 916, 1018, 1022, 1098, 1136, 1154, 1172, 1237, 1325, 1361, 1400, 1401, 1406, 1450, 1498, 1633, 1693, 1745, 1765, 1793, 1835, 1949, 1997, 2069, 2072, 2096, 2157, 2336, 2345, 2468, 2519, 2529, 2624, 2630, 2924, 3103, 3291, 3321, 3380, 3546, 3635, 3657, 3668, 3703, 3775, 3836, 3850, 3961, 4002, 4004, 4039, 4060, 4128, 4361, 4377, 4424, 4432, 4460, 4465, 4493, 4540, 4595, 4693, 4732, 4735, 4766, 4831, 4850, 4873, 4908, 4940, 4969, 5013, 5073, 5087, 5139, 5144, 5271, 5280, 5299, 5300, 5355, 5393, 5430, 5536, 5538, 5559, 5565, 5822, 5891, 5895, 5906, 6157, 6199, 6286, 6369, 6431, 6450, 6510, 6533, 6577, 6585, 6683, 6701, 6740, 6745, 6829, 6853, 6863, 6872, 6884, 6923, 6925, 7009, 7019, 7028, 7040, 7170, 7235, 7304, 7356, 7377, 7416, 7490, 7495, 7662, 7676, 7703, 7808, 7925, 7971, 8036, 8073, 8282, 8295, 8332, 8342, 8360, 8429, 8454, 8499, 8557, 8585, 8639, 8649, 8725, 8759, 8831, 8860, 8899, 8969, 9046, 9146, 9161, 9245, 9270, 9374, 9451, 9465, 9515, 9522, 9525, 9527, 9664, 9770, 9781, 9787, 9914, 9993]
Wall time: 4.94 s
```

So it's still pretty fast, but the difference is definitely noticable. At 10,000 integers in the sequence our algorithm already takes several seconds to complete. In fact, even though this solution uses dynamic programming its runtime is still \(O(n^2)\). The lesson here is that dynamic programming doesn't always result in lightning-fast solutions. There are also different ways to apply DP to the same problem. In fact there's a solution to this problem that uses binary search trees and runs in \(O(nlogn)\) time, significantly better than the solution we just came up with.

The knapsack problem is another classic dynamic programming exercise. The generalization of this problem is very old and comes in many variations, and there are actually multiple ways to tackle this problem aside from dynamic programming. Still, it's a common example for DP exercises.

The problem at its core is one of combinatorial optimization. Given a set of items, each with a mass and a value, determine the collection of items that results in the highest possible value while not exceeding some limit on the total weight. The variation we'll look at is commonly referred to as the 0-1 knapsack problem, which restricts the number of copies of each kind of item to 0 or 1. More formally, given a set of \(n\) items each with weight \(w_i\) and value \(v_i\) along with a maximum total weight \(W\), our objective is:

\(\Large max \Sigma v_i x_i\), where \(\Large \Sigma w_i x_i \leq W\)

Let's see what the implementation looks like then discuss why it works.

```
def knapsack(W, w, v):
# create a W x n solution matrix to store the sub-problem results
n = len(v)
S = [[0 for x in range(W)] for k in range(n)]
for x in range(1, W):
for k in range(1, n):
# using this notation k is the number of items in the solution and x is the max weight of the solution,
# so the initial assumption is that the optimal solution with k items at weight x is at least as good
# as the optimal solution with k-1 items for the same max weight
S[k][x] = S[k-1][x]
# if the current item weighs less than the max weight and the optimal solution including this item is
# better than the current optimum, the new optimum is the one resulting from including the current item
if w[k] < x and S[k-1][x-w[k]] + v[k] > S[k][x]:
S[k][x] = S[k-1][x-w[k]] + v[k]
return S
```

The intuition behind this algorithm is that once you've solved for the optimal combination of items at some weight \(x < W\) and with some number of items \(k < n\), then it's easy to solve the problem with one more item or one higher max weight because you can just check to see if the solution obtained by incorporating that item is better than the best solution you've already found. So how do you get the initial solution? Keep going down the rabbit hole until to reach 0 (in which case the answer is 0). At first glance it's very hard to grasp, but that's part of the magic of dynamic programming. Let's run an example to see what it looks like. We'll start with some randomly-generated weights and values.

```
w = list(np.random.randint(0, 10, 5))
v = list(np.random.randint(0, 100, 5))
w, v
([3, 9, 3, 6, 5], [40, 45, 72, 77, 16])
```

Now we can run the algorithm with a constraint that the weights of the items can't add up to more than 15.

```
knapsack(15, w, v)
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 45, 45, 45, 45, 45],
[0, 0, 0, 0, 72, 72, 72, 72, 72, 72, 72, 72, 72, 117, 117],
[0, 0, 0, 0, 72, 72, 72, 77, 77, 77, 149, 149, 149, 149, 149],
[0, 0, 0, 0, 72, 72, 72, 77, 77, 88, 149, 149, 149, 149, 149]]
```

The output here is the array of optimal values for a given max weight (think of it as the column index) and max number of items (the row index). Notice how the output follows what looks sort of like a wavefront pattern. This seems to be a recurring phenomenon with dynamic programming solutions. The value in the lower right corner is the max value that we were looking for under the given constraints and is the answer to the problem.

That concludes our introduction to dynamic programming! Using this technique in the real world definitely requires a lot of practice; most applications of dynamic programming are not very obvious and take some skill to discover. Personally it doesn't come naturally to me at all and even learning these relatively simple examples took quite a bit of thought. It might seem like these sorts of problems don't come up all that often in practice, and there's probably some truth to that. However I've found that simply knowing about dynamic programming and how it fits into a more general problem-solving framework has made me a better engineer, and that in of itself makes it worth the time investment to understand.

]]>I'm planning to do a future blog post where I'll go more in depth on this class, and deep learning in general, but for now I'll just offer a few initial impressions of the class in case anyone else is planning to give it a try.

The video content, while informative, is very light on substance. After doing a number of MOOCs on Coursera, I was a bit disappointed by the lack of depth in Udacity's course format. I get that a lot of the Coursera content is developed by universities and adapted from real semester-long courses, and by contrast, many Udacity courses (such as Google's course) are developed by companies with a somewhat limited investment in the curriculum. But there's a pretty stark difference between the depth of a class like Andrew Ng's machine learning class and this one.

As a direct consequence of this, don't expect to learn everything you need to complete the assignments from watching the videos. You really need to be willing to go out and read research papers on relevant topics to meaningfully progress on the labs. If that's not your thing then you can just watch the videos and skip the assignments but all of the "meat" of the class is in the notebook code posted with the labs. You're expected to be reasonably comfortable building fairly complex computation graphs in Tensorflow (or at least comfortable enough to learn as you go). The course itself doesn't teach you any of this so you'll have to seek that knowledge out yourself.

If you can get past those hurdles, there's quite a lot of value in the code that Google provides with the labs. I found it to be fairly complex and hard to follow at times, but if you're willing to invest the time there's a lot to learn. It's amazing how much effort can go into just pre-processing, formatting, and batching data to get it ready for input to a neural net, and there are some great of tips and tricks to learn from the code. Likewise, the example code in Tensorflow is very helpful and sets a good foundation to build on with the labs.

Overall I enjoyed the class but wouldn't recommend it as a first step into deep learning. It seems much better suited for individuals with prior conceptual knowledge of the subjects that are looking to develop hands-on experience. If you're totally unfamilar with deep learning and just looking to get started, I'd recommend some of the free online books linked in Awesome Deep Learning as a starting point.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

In part three of this series we implemented both simple and regularized logistic regression, completing our Python implementation of the second exercise from Andrew Ng's machine learning class. There's a limitation with our solution though - it only works for binary classification. In this post we'll extend our solution from the previous exercise to handle multi-class classification. In doing so, we'll cover the first half of exercise 3 and set ourselves up for the next big topic, neural networks.

Just a quick note on syntax - to show the output of a statement I'm appending the result in the code block with a ">" to indicate that it's the result of running the previous statement. If the result is very long (more than 1-2 lines) then I'm sticking it below the code block in a separate block of its own. Hopefully it's clear which statements are input and which are output.

Our task in this exercise is to use logistic regression to recognize hand-written digits (0 to 9). Let's get started by loading the data set. Unlike the previous examples, our data file is in a format native to MATLAB and not automatically recognizable by pandas, so to load it in Python we need to use a SciPy utility (remember the data files are available at the link at the top of the post).

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline
data = loadmat('data/ex3data1.mat')
data
```

{'X': array([[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011', '__version__': '1.0', 'y': array([[10], [10], [10], ..., [ 9], [ 9], [ 9]], dtype=uint8)}

Let's quickly examine the shape of the arrays we just loaded into memory.

```
data['X'].shape, data['y'].shape
> ((5000L, 400L), (5000L, 1L))
```

Great, we've got our data loaded. The images are represented in martix X as a 400-dimensional vector (of which there are 5,000 of them). The 400 "features" are grayscale intensities of each pixel in the original 20 x 20 image. The class labels are in the vector y as a numeric class representing the digit that's in the image. There's an illustration below gives an example of what some of the digits look like. Each grey box with a white hand-drawn digit represents on 400-dimensional row in our dataset.

Our first task is to modify our logistic regression implementation to be completely vectorized (i.e. no "for" loops). This is because vectorized code, in addition to being short and concise, is able to take advantage of linear algebra optimizations and is typically much faster than iterative code. However if you look at our cost function implementation from exercise 2, it's already vectorized! So we can re-use the same implementation here. Note we're skipping straight to the final, regularized version.

```
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def cost(theta, X, y, learningRate):
theta = np.matrix(theta)
X = np.matrix(X)
y = np.matrix(y)
first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
reg = (learningRate / 2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
return np.sum(first - second) / (len(X)) + reg
```

Again, this cost function is identical to the one we created in the previous exercise so if you're not sure how we got to this point, check out the previous post before moving on.

Next we need the function that computes the gradient. We already defined this in the previous exercise, only in this case we do have a "for" loop in the update step that we need to get rid of. Here's the original code for reference:

```
def gradient_with_loop(theta, X, y, learningRate):
theta = np.matrix(theta)
X = np.matrix(X)
y = np.matrix(y)
parameters = int(theta.ravel().shape[1])
grad = np.zeros(parameters)
error = sigmoid(X * theta.T) - y
for i in range(parameters):
term = np.multiply(error, X[:,i])
if (i == 0):
grad[i] = np.sum(term) / len(X)
else:
grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[:,i])
return grad
```

Let's take a step back and remind ourselves what's going on here. We just defined a cost function that says, in a nutshell, "given some candidate solution *theta* applied to input *X*, how far off is the result from the true desired outcome *y*". This is our method that evaluates a set of parameters. By contrast, the gradient function specifies how to *change* those parameters to get an answer that's slightly better than the one we've already got. The math behind how this all works can get a little hairy if you're not comfortable with linear algebra, but it's laid out pretty well in the exercise text, and I would encourage the reader to get comfortable with it to better understand why these functions work.

Now we need to create a version of the gradient function that doesn't use any loops. In our new version we're going to pull out the "for" loop and compute the gradient for each parameter at once using linear algebra (except for the intercept parameter, which is not regularized so it's computed separately).

Also note that we're converting the data structures to NumPy matrices (which I've used for the most part throughout these exercises). This is done in an attempt to make the code look more similar to Octave than it would using arrays because matrices automatically follow matrix operation rules vs. element-wise operations, which is the default for arrays. There is some debate in the community over whether or not the matrix class should be used at all, but it's there so we're using it in these examples.

```
def gradient(theta, X, y, learningRate):
theta = np.matrix(theta)
X = np.matrix(X)
y = np.matrix(y)
parameters = int(theta.ravel().shape[1])
error = sigmoid(X * theta.T) - y
grad = ((X.T * error) / len(X)).T + ((learningRate / len(X)) * theta)
# intercept gradient is not regularized
grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X)
return np.array(grad).ravel()
```

Now that we've defined our cost and gradient functions, it's time to build a classifier. For this task we've got 10 possible classes, and since logistic regression is only able to distiguish between 2 classes at a time, we need a strategy to deal with the multi-class scenario. In this exercise we're tasked with implementing a one-vs-all classification approach, where a label with k different classes results in k classifiers, each one deciding between "class i" and "not class i" (i.e. any class other than i). We're going to wrap the classifier training up in one function that computes the final weights for each of the 10 classifiers and returns the weights as a k X (n + 1) array, where n is the number of parameters.

```
from scipy.optimize import minimize
def one_vs_all(X, y, num_labels, learning_rate):
rows = X.shape[0]
params = X.shape[1]
# k X (n + 1) array for the parameters of each of the k classifiers
all_theta = np.zeros((num_labels, params + 1))
# insert a column of ones at the beginning for the intercept term
X = np.insert(X, 0, values=np.ones(rows), axis=1)
# labels are 1-indexed instead of 0-indexed
for i in range(1, num_labels + 1):
theta = np.zeros(params + 1)
y_i = np.array([1 if label == i else 0 for label in y])
y_i = np.reshape(y_i, (rows, 1))
# minimize the objective function
fmin = minimize(fun=cost, x0=theta, args=(X, y_i, learning_rate), method='TNC', jac=gradient)
all_theta[i-1,:] = fmin.x
return all_theta
```

A few things to note here...first, we're adding an extra parameter to theta (along with a column of ones to the training data) to account for the intercept term. Second, we're transforming y from a class label to a binary value for each classifier (either is class i or is not class i). Finally, we're using SciPy's newer optimization API to minimize the cost function for each classifier. The API takes an objective function, an initial set of parameters, an optimization method, and a jacobian (gradient) function if specified. The parameters found by the optimization routine are then assigned to the parameter array.

One of the more challenging parts of implementing vectorized code is getting all of the matrix interactions written correctly, so I find it useful to do some sanity checks by looking at the shapes of the arrays/matrices I'm working with and convincing myself that they're sensible. Let's look at some of the data structures used in the above function.

```
rows = data['X'].shape[0]
params = data['X'].shape[1]
all_theta = np.zeros((10, params + 1))
X = np.insert(data['X'], 0, values=np.ones(rows), axis=1)
theta = np.zeros(params + 1)
y_0 = np.array([1 if label == 0 else 0 for label in data['y']])
y_0 = np.reshape(y_0, (rows, 1))
X.shape, y_0.shape, theta.shape, all_theta.shape
> ((5000L, 401L), (5000L, 1L), (401L,), (10L, 401L))
```

These all appear to make sense. Note that theta is a one-dimensional array, so when it gets converted to a matrix in the code that computes the gradient, it turns into a (1 X 401) matrix. Let's also check the class labels in y to make sure they look like what we're expecting.

```
np.unique(data['y'])
> array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint8)
```

Let's make sure that our training function actually runs, and we get some sensible outputs, before going any further.

```
all_theta = one_vs_all(data['X'], data['y'], 10, 1)
all_theta
```

array([[ -5.79312170e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.22140973e-02, 2.88611969e-07, 0.00000000e+00], [ -4.91685285e+00, 0.00000000e+00, 0.00000000e+00, ..., 2.40449128e-01, -1.08488270e-02, 0.00000000e+00], [ -8.56840371e+00, 0.00000000e+00, 0.00000000e+00, ..., -2.59241796e-04, -1.12756844e-06, 0.00000000e+00], ..., [ -1.32641613e+01, 0.00000000e+00, 0.00000000e+00, ..., -5.63659404e+00, 6.50939114e-01, 0.00000000e+00], [ -8.55392716e+00, 0.00000000e+00, 0.00000000e+00, ..., -2.01206880e-01, 9.61930149e-03, 0.00000000e+00], [ -1.29807876e+01, 0.00000000e+00, 0.00000000e+00, ..., 2.60651472e-04, 4.22693052e-05, 0.00000000e+00]])

We're now ready for the final step - using the trained classifiers to predict a label for each image. For this step we're going to compute the class probability for each class, for each training instance (using vectorized code of course!) and assign the output class label as the class with the highest probability.

```
def predict_all(X, all_theta):
rows = X.shape[0]
params = X.shape[1]
num_labels = all_theta.shape[0]
# same as before, insert ones to match the shape
X = np.insert(X, 0, values=np.ones(rows), axis=1)
# convert to matrices
X = np.matrix(X)
all_theta = np.matrix(all_theta)
# compute the class probability for each class on each training instance
h = sigmoid(X * all_theta.T)
# create array of the index with the maximum probability
h_argmax = np.argmax(h, axis=1)
# because our array was zero-indexed we need to add one for the true label prediction
h_argmax = h_argmax + 1
return h_argmax
```

Now we can use the predict_all function to generate class predictions for each instance and see how well our classifier works.

```
y_pred = predict_all(data['X'], all_theta)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print 'accuracy = {0}%'.format(accuracy * 100)
> accuracy = 97.58%
```

Close to 98% is actually pretty good for a relatively simple method like logistic regression. We can get much, much better though. In the next post, we'll see how to improve on this result by implementing a feed-forward neural network from scratch and applying it to the same image classification task.

]]>There are a number of different ways to look at these seemingly unrelated events. I think it would be interesting to try to parse out 1) why each move was made, 2) whether or not the move makes sense strategically for those involved, and 3) what the long-term impact will be on open-source machine learning, and perhaps on the overall field of artifical intelligence.

First let's look at Tensorflow. Machine learning is really important to Google and they probably have the best team on the planet dedicated to advancing the state of the art in this area. So on the surface, it appears like Google is giving away something really valuable that lessens their competitve advantage. After all, couldn't Facebook or Microsoft or some random startup use this software to build some awesome machine learning apps that compete with Google? Well...not exactly. It's important to remember that there are three critical components to a large-scale machine learning project. Algorithms/software are one of those pieces, but you also need lots of data and lots of infrastructure to run it on, two areas where Google also has a huge competitive advantage over most firms. Data and infrastructure are also MUCH harder and/or expensive to acquire at massive scale.

But still, why give away ANY of their competitive advantage, even if it's the easiest part to replicate? I think there are a number of advantages Google gets out of this model that likely outweighed any downside. Open-sourcing their software allows contributors outside of Google to fix bugs, add new features and improve the software basically for free, which benefits Google. It also gives new up-and-coming talent a chance to gain experience using Google's toolset on their own time, effectively pre-training them for Google to then hire and drop into an engineering team ready to go (this is further supported by the fact that Google just released a class on Udacity for this very purpose). Finally, the open model is very appealing to top researchers and engineers so it's a powerful recruiting draw as well.

It's also important to remember that Google did NOT give away all of their algorithmic advantage. Notably absent from the release is anything pertaining to the work that the DeepMind team is doing on reinforcement learning. In fact, by most accounts that do not even use Tensorflow, they're still running an in-house version of Torch. Tensorflow, while still novel and a significant contribution, is closer to being a better Theano than it is to some revolutionary new technology unleashed on the world. The algorithms baked into the software are already widely known and understood. The real value is in the engineering that runs those algorithms cross-platform and at scale across a wide range of devices. That's not a trivial feat by any stretch, but it also isn't something that only Google could do. Given all of the benefits that Google gets from open-sourcing it relative to what they're giving up, it was likely a very easy decision.

A similar argument could be made with Facebook's release of their deep learning server design. Facebook's competitive advantage is their data, not their server designs. And it's debatable who would even benefit from specs like this. Is a typical ML startup going to invest tons of capital on servers with dozens of video cards in it to build deep learning models? Nope, they're going to fire up AWS instead. Is it possible that Google or Microsoft could look at these plans and go "hey, there are some good ideas here, let's copy what they're doing"? Maybe, but it's probably unlikely. The designs could make their way into some university labs, but that doesn't really pose a threat to Facebook. Again, there's very little competitive advantage in design specs for a commoditized resource.

The benefits for Facebook, however, are similar to those for Google, particularly around attracting talent. I don't think it's a coincidence that Google, Facbook, and OpenAI all had major announcements in the span of a few weeks. These guys are basically at war with each other to get the top researchers and engineers on their teams, and their strategies around "openness" are a key factor in luring those individuals. I don't think it would be a stretch to argue that the MOST important factor for success in large-scale, bleeding edge machine learning is the people on the team. More important than software, more important than data, more important than anything. When viewed through this lens their actions make a LOT of sense, because giving away stuff that others might consider valuable is a distant second to getting the best people to come work for them.

Which brings us to OpenAI. Their announcement was interesting in that it sounded very grandiose and far-reaching but lacked many specifics other than defining the initial team. The basic idea is that a bunch of wealthy technologists and venture capitalists and getting together and promising $1 billion to build a team to work on various problems in AI and give away all of their output for free. We don't really know yet what this output will look like but it could be new research, software tools, data, etc. On the surface this seems like an incredibly altruistic thing to do. And if the group of founders really, truly believes that human-level AI is simultaneously an existential threat to humanity and a massive growth opportunity, then it's possible (though debatable) to reach the conclusion that developing this technology in the open is the logical next step.

But it's ALSO possible, maybe even likely, that this is a really smart strategy move from a group of individuals that have a vested interest in giving startups an opportunity to compete in this arena. Imagine that you're a venture capitalist funding AI startups. In today's world you have to compete with the likes of Google, Facebook etc. on talent, technology, data, resources, you name it. It's a pretty daunting proposition. But what if you could put together a group that could negate some of the incumbents' advantages by developing the same technology in the open, effectively commoditizing a portion of it? And in the process you even manage to hire away some of their key talent, thus further weakening the incumbents. The output from OpenAI could become a "building block" of sorts that allows startups to build competitive AI-based products without hiring all the researchers and engineers necessary to start from scratch. If this turned out to be the case, it's concievable that the value unlocked by that output could greatly exceed the $1 billion investment, with at least some of the value going back to the original investors.

I'm not saying that this is the OpenAI founders' true motivation, because I have no idea. But it would be foolish to think they haven't considered it. There's no reason that OpenAI can't be both a genuine attempt to get a step ahead in the hypothetical AI arms race AND a pretty smart strategy move. I do not, however, buy the notion that OpenAI is purely meant to counter the so-called "killer robot" scenario that Elon Musk has talked about - the idea that a super-intelligent machine could escape the control of its creators and eventually wipe out humanity.

While it's possible that we will - at some point - create a sentient, artifical general intelligence that rivals or surpasses human capability, I do not think it's as imminent a theat as some seem to believe. The general consensus among those actually doing research on the front lines seems to be that these concerns are overblown. Andrew Ng probably stated it best when he said "I don’t work on not turning AI evil today for the same reason I don't worry about the problem of overpopulation on the planet Mars". The implication is that it's possible that someday we will have to concern ourselves with this scenario, but we're so far away from building a truly sentient machine that this just isn't a problem than anyone can productively work on. There's a slightly better argument to be made for things like autonomous weapons, which are probably closer to becoming reality than sentient machines. But it's unclear how OpenAI would do anything to solve that problem.

At the end of the day, it doesn't really matter what the group's motivations are. Just as with the Google and Facebook announcements, the end result is that great technology is given away for free to anyone with the ability to use it. This is obviously a good outcome for open source advocates and individuals or startups interested in using this technology. More broadly, I think these announcements represent the continuation of a trend that has been underway for a while but continues to accelerate. Big tech firms are increasingly adopting the open source model for portions of their technology stack, and in doing so are becoming a driving force in the open source landscape. This effect is particularly noticable in functional areas where expertise is highly sought-after because significant open source contributions can boost a firm's reputation with the community.

This is really important because there's a different dynamic at work with company-driven open source projects than there is with community-driven projects. The latter are much more organic and free-flowing but also tend to lack the focus and cohesion of a major corporate effort (this is certainly not true in all cases but is probably true on average). The open-source community has produced amazing things and will continue to produce amazing things. But it's not clear to me that the open source community could have built TensorFlow.

What does all of this mean for the state of open source machine learning? I think the end result is that the quality and capability of open source machine learning tools will increase much faster than they would organically. This is probably already true, but it will become much more evident. We could get to a point where certain parts of the stack are effectively standardized/commoditized, much like AWS has commoditized the infrastructure needed to conduct experiments (TensorFlow may already be this for tensor math). Looking further down the road, I think the tools will become much more sophisticated as they build on top of each other, providing higher and higher levels of abstraction as companies master and then open-source new technologies.

It's really fascinating to take a step back and look at the current state of things. We're effectively seeing a market where the rational behaviour is to spend lots of time and effort building something incredibly useful, and then giving it away. This was probably unthinkable just a decade ago, but things have changed quickly. It would be fascinating to analyze how we arrived at this point, but ultimately it's not that important. What matters is that the value created by this trend is probably going to be huge. When competitive market forces drive big corporations toward a model of "openness", everyone wins.

]]>The basic idea behind the efficient market hypothesis, or EMH, is that asset prices fully reflect all available information. Whatever the price of an equity is today represents the true value of that company based on all publicly-available information, and any new information that comes out is immediately priced in to the stock. There are several variants of the theory that play with different levels of stringency on what the word "efficiency" means, but they all adhere to this basic framework. The ultimate implication of the theory is that the market is rational, that "inefficiences" are quickly corrected, and that it is not possible to reliably and consistently earn excess returns.

Let's disect this a bit because there are several different perspectives through which we can examine EMH. The first is what I'll call the "academic" perspective. This is a broad set of theories that try to base their formulation of EMH on sound mathematical and economic understanding. The idea is that we have this really complex thing called the stock market and we need some model to be able to explain how it works, so we come up with things like CAPM (capital asset pricing model) and MPT (market portfolio theory) to try to generalize what we see happening empirically. The undercurrent to these theories is that expected returns are explicitly tied to risk. Make more risky investments and you have a chance at higher returns. Make conservative investments and the potential for profit is reduced. Theories falling under this category assume that some form of EMH holds; in other words, the market is rational.

A different way to look at EMH is through the lens of behavioral finance. The core idea of behavioral finance is that humans tend to act irrationally on occassion and irrational behavior can lead to inefficiences in the market. Behavioral finance was pioneered by guys like Daniel Kahneman and asserts that anomalies such as trends and asset bubbles can largely be attributed to cognitive biases and other human errors in reasoning and information processing. Tendencies such as overconfidence, information bias, herd mentality, etc. can lead investors to do things that logically may not make sense and thus can break the supposed rationality of the market.

The final way we can examine EMH is from the perspective of fundamental analysis, or "value" investing. This is the school of thought that was championed by Ben Graham in the early 1900s, and later by his most famous student, Warren Buffett. There are lots of different ways to describe the ideas put forward by Graham, Buffett, and the many other hyper-successful investors that used fundamental analysis to achieve their success. But the foundational idea of value investing is very simple: to buy shares of great companies at attractive prices. Value investors analyze the fundamentals of a business - earnings, cash flow, growth, return on capital etc. They dig into the details of how the business works, how it's positioned in its market, how efficient it is, how well it's managed, what the prospects are for its industry, and so on. In doing so, they identify opportunities to purchase portions of a business that they believe are worth more than the market does (with an appropriate margin of safety).

While behavioral finance provides some contrarian evidence against EMH, it's not 100% incompatible with at least the weaker forms of the theory. Value investing, on the other hand, basically spits in the face of EMH. In order for value investing to work at all, by definition there must be securities that are mis-priced - i.e. the market cannot be rational. And while there is no known theoretical framework or mathematical model that explains the stock market from a fundamentals perspective, the empirical results are decidedly in favor of fundamental analysis. There is perhaps no better example of this than Warren Buffet's 1984 essay The Superinvestors of Graham-And-Doddsville.

Throughout the essay, Buffet shows case after case of investors that beat the market averages by very wide margins over long periods of time, and demonstrates that these investors all shared a common intellectual origin. This is important because a common explanation for extraordinary investment success is that statistically it is expected that there will be outliers so these cases are consistent with EMH. However, for this thesis to hold the outliers must be independent and identically distributed. Warren showed very clearly in the essay that this assumption is false. In 1991 Seth Klarman (another very successful value investor) wrote a book in which he stated "Buffett's argument has never, to my knowledge, been addressed by the efficient-market theorists; they evidently prefer to continue to prove in theory what was refuted in practice".

While I presented the behavioral finance in value investing perspectives separately, in reality they've always been somewhat tied together. Even Ben Graham's early work recognized the effect that investor psychology could have on the markets and the opportunites that it could create. This was evident in his famous caricature of the stock market as a manic-depressive gentlemen called "Mr. Market" - a guy who from day to day may choose to buy or sell securities at vastly different prices depending on his mood that day. Graham understood that for human beings, absolute logic and rationality does not always come easy, and the market is a function of human behaviors and tendencies.

So where do things stand today? The stock market is certainly more liquid than it's ever been. Information is more freely available than at any time in history. According to the academic view, the case for the efficient market hypothesis should be stronger than ever. And yet, time and again, investors practicing a value-oriented philosophy of purchasing quality businesses for less than they're worth are able to beat the market in the long run. The case that Buffet made in his essay more than 30 years ago still seems to hold true. But despite this seemingly apparent contradiction of EMH, the theory is still quite popular while fundamental analysis remains somewhat of a niche style of investing. Why is this the case?

I think there are a couple things driving the continued belief in EMH. Investing in general is a relatively complicated thing and most individuals do not have the time or energy to devote to really understanding even some of the more basic concepts in investing. To someone who has not gone down that path of learning about alternative theories in investing, EMH intuitively seems to make sense. The average investor probably views the stock market as something that tons of professionals all over the world are constantly analyzing and number crunching, shrewdly capturing any inefficiencies as soon as they happen. From that perspective, it would be logical to conclude that the market must be efficient and a simple "retail" investor would have no chance of competing.

Another reason EMH is still widely believed is because it is often still taught in schools. In the academic world, at least to my knowledge, some form of EMH is still believed to be true and so students taking classes on business, economics etc. still learn about things like CAPM, MPT and so on. As new generations of students are brought up believing that EMH holds, it's likely that those beliefs become part of a self-reinforcing cycle. By that I mean, investors who assume the market is efficient will invest in ways that probably reinforce the idea that the market is efficient. You can't begin with the premise that the market is efficient and then empirically arrive at the conclusion that it is not based on your own personal results, unless the investment decisions that led to those results were based on a rejection of the idea of an efficient market in the first place!

That leads to my last point about why the efficient market hypothesis is so enduring. Believeing that it is false is actually somewhat difficult. Assuming that markets are efficient is the natural state for one to subscribe to, and running contrary to that belief requires evidence. But evidence that the market is NOT efficient is hard to come by. By that I don't mean that it's hard to find - there are plently of examples such as Warren's essay mentioned earlier that provide convincing evidence against EMH. But it's one thing to read about past instances of seemingly convincing examples that happened to someone else. It's quite a different thing to tangibly and empirically witness it yourself, which is very challenging to do. Even if it is the case that some stocks are undervalued in the market, that doesn't make it easy to find them or invest in them with a long enough time horizon to see those investments truly pay off.

Believing in the idea that it's possible to beat the market takes both a willingness to go against conventional wisdom and the conviction to stick with your hypothesis even when it doesn't always work. Most people do not have this temperament. The ideas that Graham, Buffet et al. have put forth are simple ideas, but they are not easy to follow. This is perhaps the biggest reason why EMH persists. It's much easier to assume that markets are efficient than to convince yourself otherwise. And as long as investors continue believing in an efficient market, there will continue to be discrepencies between price and value for intelligent investors to exploit.

]]>The first thing we need is some text to work with. NLTK (Natural Language Toolkit) has many popular corpora available for download directly from the API, so let's use that. For this exercise we're going to use the Brown corpus. Start by initiating the download tool and selecting the corpus.

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import nltk
nltk.download()
```

Check to make sure that the download was successful and we can access the data.

```
from nltk.corpus import reuters
len(reuters.words())
```

1720901

If you've never heard of the Reuters corpus, it's a collection of more than 10,000 news documents published in 1987 categorized into 90 different topics. The content is obviously a bit dated but it's still interesting to work with because of the size of the collection and the fact that we have labeled topics for each document that we can use for certain types of supervised learning algorithms. The corpus contains over 1.3 million words in total.

NLTK's feature set is focused more on the linguistic aspect of natural language processing than the machine learning aspect, with functions for tasks such as tokenization, stemming, tagging, and parsing. Even so, there are some very useful functions we can take advantage of to get a sense of what's in this corpus. Let's start by tabulating the number of unique words in the corpus.

```
vocabulary = set(reuters.words())
len(vocabulary)
```

There are 41,600 unique tokens in the corpus. This doesn't tell us anything about the distribution of these tokens though. NLTK has a built-in function to compute a frequency distribution for a text corpus.

```
fdist = nltk.FreqDist(reuters.words())
fdist.most_common(30)
```

[(u'.', 94687), (u',', 72360), (u'the', 58251), (u'of', 35979), (u'to', 34035), (u'in', 26478), (u'said', 25224), (u'and', 25043), (u'a', 23492), (u'mln', 18037), (u'vs', 14120), (u'-', 13705), (u'for', 12785), (u'dlrs', 11730), (u"'", 11272), (u'The', 10968), (u'000', 10277), (u'1', 9977), (u's', 9298), (u'pct', 9093), (u'it', 8842), (u';', 8762), (u'&', 8698), (u'lt', 8694), (u'on', 8556), (u'from', 7986), (u'cts', 7953), (u'is', 7580), (u'>', 7449), (u'that', 7377)]

We can plot these cumulatively to get a sense for how much of the corpus they represent.

```
fig, ax = plt.subplots(figsize=(16,12))
ax = fdist.plot(30, cumulative=True)
```

Just 30 tokens make up around 35% of the entire corpus! Moreover, most of these are things like punctuation and articles such as "and", "to", "of" and so on. This is useful to know as we may want to strip out tokens like these. You might also notice that the word 'the' appears on the list twice. That's because the corpus contains both upper-case and lower-case words, and they are each counted separately. Before we attempt to do anything with this data we'll need to correct these issues.

```
stopwords = nltk.corpus.stopwords.words()
cleansed_words = [w.lower() for w in reuters.words() if w.isalnum() and w.lower() not in stopwords]
vocabulary = set(cleansed_words)
len(vocabulary)
```

30618

After converting everything to lowercase, removing punctuation, and removing "stop words" using a pre-defined list of words that do not add any semantic value, we've reduced the vocabulary from almost 42,000 to just over 30,000. Note that we still didn't address things like singular vs. plural being different words. To handle this we'd have to get into topics like stemming, but for now let's leave as-is. Let's look at the top 30 again.

```
fdist = nltk.FreqDist(cleansed_words)
fdist.most_common(30)
```

[(u'said', 25383), (u'mln', 18623), (u'vs', 14341), (u'dlrs', 12417), (u'000', 10277), (u'1', 9977), (u'pct', 9810), (u'lt', 8696), (u'cts', 8361), (u'year', 7529), (u'net', 6989), (u'2', 6528), (u'billion', 5829), (u'loss', 5124), (u'3', 5091), (u'5', 4683), (u'would', 4673), (u'company', 4670), (u'1986', 4392), (u'4', 4363), (u'shr', 4182), (u'inc', 4121), (u'bank', 3654), (u'7', 3450), (u'corp', 3399), (u'6', 3376), (u'oil', 3272), (u'last', 3243), (u'8', 3218), (u'share', 3160)]

The list is arguably a bit more interesting now. There's a lot more that we could do with NLTK, but since we're interested in using this data to build statistical models, we need to find ways to "vectorize" this data. One common way to represent text data is called the "bag of words" representation. A bag of words represents each document in a corpus as a series of features that ask a question about the document. Most commonly, the features are the collection of all distinct words in the vocabulary of the entire corpus. The values are usually either binary (representing the presence or absence of that word in the document) or a count of the number of times that word appears in the document. A corpus is then represented as a matrix with one row per document and one column per unique word.

To build our initial bag of words matrix, we're going to use some of scikit-learn's built-in text processing capabilities. We'll start off using scikit-learn's CountVectorizer class to transform our corpus into a sparse bag of words representation. CountVectorizer expects as input a list of raw strings containing the documents in the corpus. It takes care of the tokenization, transformation to lowercase, filtering stop words, building the vocabulary etc. It also tabulates occurrance counts per document for each feature.

Since CountVectorizer expects the raw data as input rather than the pre-processed data we were working with in NLTK, we need to create a list of documents to pass to the vectorizer.

```
files = [f for f in reuters.fileids() if 'training' in f]
corpus = [reuters.raw(fileids=[f]) for f in files]
len(corpus)
```

7769

Let's look at an example from the corpus to get a sense of the what kind of raw text we're dealing with (part of the document has been omitted for brevity).

```
corpus[0]
```

u'BAHIA COCOA REVIEW\n Showers continued throughout the week in\n the Bahia cocoa zone, alleviating the drought since early\n January and improving prospects for the coming temporao,\n although normal humidity levels have not been restored,\n Comissaria Smith said in its weekly review.\n The dry period means the temporao will be late this year.\n Arrivals for the week ended February 22 were 155,221 bags\n of 60 kilos making a cumulative total for the season of 5.93\n mln against 5.81 at the same stage last year. Again it seems\n that cocoa delivered earlier on consignment was included in the\n arrivals figures.\n ...Final figures for the period to February 28 are expected to\n be published by the Brazilian Cocoa Trade Commission after\n carnival which ends midday on February 27.\n \n\n'

Now we have the training corpus defined as a list of raw text documents. We can create our vectorizer and pass in the corpus to build our bag of words matrix.

```
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
X
```

<7769x26001 sparse matrix of type '

with 426016 stored elements in Compressed Sparse Row format>

The vectorizer stores the data as a sparse matrix since a dense matrix would use way too much space and most of the values would be zero anyway (because each document only contains a small number of the total words in the vocabulary). We can transform this to a numpy array if necessary though.

```
X.toarray()
```

array([[0, 0, 0, ..., 0, 0, 0], [0, 3, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 2, 0, ..., 0, 0, 0]], dtype=int64)

The vectorizer stores the feature names (words) that map to the matrix column indexes. We can inspect those if desired. Note that I'm skipping to index 2000 because if you look at the beginning of the index, it's all numbers! The reuters corpus, being news articles, contains quite a high volume of numeric symbols. It's debatable wether or not we should really include these in the vocabulary, but for now they're there.

```
vectorizer.get_feature_names()[2000:2030]
```

[u'aero', u'aeroe', u'aerojet', u'aeronautics', u'aeroperu', u'aerosol', u'aerosols', u'aerospace', u'aerotech', u'aet', u'aetna', u'afbf', u'affadavit', u'affair', u'affairs', u'affandi', u'affect', u'affected', u'affecting', u'affects', u'affiliate', u'affiliated', u'affiliates', u'affiliation', u'affinerie', u'affirmation', u'affirmative', u'affirmed', u'afflicted', u'afford']

One potential issue with this representation is that it holds an in-memory mapping of the vocabulary to the document matrix that can get unwieldy on large datasets. This approach also doesn't work when training in an on-line fashion since it needs to build the entire vocabulary ahead of time. There's another vectorization algorithm implemented in scikit-learn that uses feature hashing to build the matrix in a stateless manner. This HashingVectorizer class solves both of the above problems, however is comes with some tradeoffs - it's not possible to "inverse transform" the vector back to the original words, and there's a possibility of collisions that could cause some information to be lost.

```
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer()
X_hash = hv.transform(corpus)
X_hash
```

<7769x1048576 sparse matrix of type '

with 573305 stored elements in Compressed Sparse Row format>

For now we'll continue using the count vectorizer, but for very large corpora this approach would be faster and more efficient.

We now have a bag of words matrix, however there's another problem - some words appear much more frequently across the corpora as a whole than other words, so their presence in a document should carry less weight than a word that is very infrequent in general. To adjust for this, we'll use something called TF-IDF weighting. TF stands for term frequency, while IDF stands for inverse document frequency. The TF-IDF calculation lowers the relative weight of common words and increases the relative weight of uncommon words.

Scikit-learn implements TF-IDF as a separate transform that we can apply to the output of our vectorizer.

```
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
X_weighted = tfidf.fit_transform(X)
```

Now that we have a weighted term-document matrix, let's do something with it. A common natural language processing task is to classify documents as belonging to a particular category. Since the reuters corpus is labeled, we can used a supervised learning algorithm to attempt to learn how to categorize similar news articles.

To do this we need a few additional pieces of information. We need a set of labels, and we need a test set to evaluate performance of the model. Forunately we have both available to us for the reuters dataset.

```
# build the term-document matrix for the test set using the existing transforms
test_files = [f for f in reuters.fileids() if 'test' in f]
test_corpus = [reuters.raw(fileids=[f]) for f in test_files]
X_test = vectorizer.transform(test_corpus)
X_test_weighted = tfidf.transform(X_test)
# get the categories for each document in both the train and test sets
train_labels = [reuters.categories(fileids=[f]) for f in files]
test_labels = [reuters.categories(fileids=[f]) for f in test_files]
```

Since there are 90 distinct categories, and each document can be assigned to more than one category, we probably don't have enough documents per category to build a really good document classifier. We're going to simplify the problem a bit and reduce the classification to a binary problem - wether or not the document belongs to the 'acq' category.

```
y = np.asarray([1 if 'acq' in label else 0 for label in train_labels])
y_test = np.asarray([1 if 'acq' in label else 0 for label in test_labels])
X_weighted.shape, y.shape, X_test_weighted.shape, y_test.shape
```

((7769, 26001), (7769L,), (3019, 26001), (3019L,))

Now we're ready to train a classifier. We'll use multinomial Naive Bayes in this example.

```
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
# train the classifier
classifier = MultinomialNB()
classifier.fit(X_weighted, y)
# predict labels for the test set
predictions = classifier.predict(X_test_weighted)
# output the classification report
label_names = ['not acq', 'acq']
print(classification_report(y_test, predictions, target_names=label_names))
```

precision recall f1-score support not acq 0.87 1.00 0.93 2300 acq 0.99 0.52 0.68 719 avg / total 0.90 0.88 0.87 3019

So that's a pretty reasonable outcome, although the recall is not as high as we would like it to be. There are a number of ways we could work to improve this result, such as experimenting with removing extraenous tokens such as numbers from our vocabulary or constructing additional high-level features about the documents. For a simple bag-of-words model though it's not too bad.

Supervised learning is nice when we have a labeled dataset, but the vast majority of text in the wild does not come with any sort of label so its usefulness in natural language processing is often limited. What about unsupervised techniques to categorize documents? Scikit-learn packages a decomposition technique called non-negative matrix factorization (NMF) that we can use for topic extraction.

```
from sklearn.decomposition import NMF
nmf = NMF(n_components=10).fit(X_weighted)
feature_names = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(nmf.components_):
print('Topic #%d:' % topic_idx)
print(' '.join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))
print('')
```

Topic #0: loss profit vs 000 cts dlrs oper shr revs year qtr 4th includes net discontinued note operations 1986 quarter excludes Topic #1: pct february january rose year rise rate index december prices 1986 inflation said fell compared growth consumer statistics base production Topic #2: cts qtly div record april pay prior vs dividend sets march quarterly lt payout 15 10 payable regular 30 31 Topic #3: said trade japan dollar bank yen japanese exchange foreign rates market rate dealers economic currency paris countries nations government told Topic #4: vs mln 000 net cts shr revs qtr avg shrs oper dlrs note profit 4th lt year mths 31 sales Topic #5: billion dlrs surplus deficit mln francs marks 1986 january year rose february reserves trade fell 1985 account foreign december exports Topic #6: fed customer repurchase says reserves federal agreements funds reserve temporary sets dlrs week repurchases securities economists supply billion add trading Topic #7: stg mln bank money market england bills band assistance shortage revised today forecast help given provided 16 estimate compares central Topic #8: tonnes 000 wheat sugar corn export said ec grain 87 traders 1986 maize production tonne exports year soviet usda china Topic #9: dlrs said mln company shares share stock lt corp offer split common quarter group earnings dividend oil board shareholders unit

The above output takes the components derived from the factorization (here assumed to model a "topic" from the corpus) and extracts the 20 words that most significantly contributed to that topic. Although it's not perfect, we can see some commonalites among the groups of words.

NMF gives some interesting results, but there are more advanced algorithms for topic modeling. Latent Dirichlet Allocation (LDA), for example, is a technique that models documents as though they are composed of some undefined number of topics. Each of the words in the document are then said to be attributed to some combination of those topics.

Scikit-learn does not implement LDA so if we want to try it out, we'll need to look elsewhere. Fortunately there's a library called gensim that's focused specifically on topic modeling. To start off we need our corpus in a format that gensim models can use as input. Gensim implements a lot of the same transforms that we just applied to the data, but rather that re-create the same transforms we can re-use what we've already done and convert our term-document matrix into gensim's expected format.

```
from gensim import corpora, models, similarities, matutils
# create the corpus using a conversion utility
gensim_corpus = matutils.Sparse2Corpus(X_weighted)
# build the LDA model
lda = models.LdaModel(gensim_corpus, num_topics=100)
```

With our LDA model we could now examine the words that most contribute to each topic (as we did with NMF), or we could compare new documents to the model to identify either the topics that make up that document or the existing documents that they are most similar to. However, I want to move on from NMF/LDA to another algorithm implemented in gensim called word2vec because it's really, really interesting. Word2vec is an unsupervised neural network model that runs on a corpus of text and learns vector representations for the individual words in the text. The word vectors are modeled in a way such that words that are semantically "close" to each other are also close in a mathematical sense, and this results in some interesting properties. Let's explore some of the implications on the reuters dataset.

Since word2vec expects a list of sentences as input, we'll need to go back to the pre-transformed sentence list provided by NLTK

```
model = models.Word2Vec(reuters.sents(), size=100, window=5, min_count=5, workers=4)
```

We now have a trained word2vec model. It's possible to look at the vector for a word directly, although it won't mean much to a person.

```
model['market']
```

array([ 0.36218089, 0.04660718, -0.03312639, -0.00589092, 0.08425491, -0.05584015, 0.39824656, -0.19913128, 0.21185778, -0.16018888, -0.30720046, 0.41359827, 0.05477867, 0.40744004, -0.15048127, -0.21775401, -0.0918686 , 0.08254269, -0.36109206, -0.08484149, -0.37724456, 0.19134018, 0.18765855, 0.17301551, -0.13106611, 0.10278706, 0.14409529, 0.09305458, -0.27449781, -0.16971849, 0.20959041, 0.12159102, 0.07963905, 0.03050068, 0.31353745, 0.06859812, -0.26051152, 0.1805039 , 0.28199297, -0.19140336, 0.13152425, 0.04389969, 0.06004116, -0.31306067, -0.12013798, -0.17255786, -0.05460097, -0.35606486, 0.31404966, 0.03737779, -0.11519474, 0.31271645, -0.31853175, 0.08142728, 0.09033886, -0.15671426, -0.07798025, 0.06073617, 0.2294289 , 0.13113637, -0.04398542, -0.34159404, 0.06506728, 0.20032322, -0.11604583, -0.14258914, -0.06725569, -0.06181487, 0.13476266, 0.17378812, 0.01733109, -0.0836978 , -0.24637276, 0.06484974, -0.02348729, 0.27839953, -0.12627478, 0.50229609, 0.02701729, -0.11646958, -0.3040815 , -0.18003054, 0.01555716, -0.11430902, -0.40754062, 0.05430043, 0.27255279, 0.12115923, 0.16014519, -0.03295279, -0.50409102, 0.38960707, -0.19293144, -0.19752754, -0.14633107, 0.24427678, -0.13369191, 0.18097162, -0.26153758, -0.11974742], dtype=float32)

Every word in the vocabulary now has a vector representation that looks like this. Since we're dealing with vectors, it's possible to compare words using vector math such as the cosine similarity.

```
model.similarity('dollar', 'yen')
```

0.7455091131291105

```
model.similarity('dollar', 'potato')
```

0.28685558842610581

According to the model, 'dollar' and 'yen' are much more similar to each other (both being currencies) than 'dollar' and 'potato'. The relationship is deeper than just a similarity measure though. The word2vec model is capable of capturing abstract concepts as well. The ubiquitous example is "woman + king - man = queen". When properly trained on a large enough amount of text, the model is able to detect that the relationship between 'woman' and 'queen' is similar to the relationship between 'man' and 'king'. Let's see if the model we just trained can do something similar.

```
model.most_similar(positive=['Japan', 'dollar'], negative=['US'])
```

[(u'appreciation', 0.6466031670570374), (u'appreciated', 0.596366822719574), (u'sterling', 0.5693594813346863), (u'Taiwan', 0.5512674450874329), (u'slowly', 0.5457212924957275), (u'mark', 0.5433770418167114), (u'yen', 0.5293248891830444), (u'stems', 0.5171161890029907), (u'pleas', 0.5137792825698853), (u'brake', 0.5080464482307434)]

It didn't get it exactly right, although 'yen' is on the list, but there's a lot of other noise too. This is most likely due to the relatively small size of the dataset. Word2vec needs a huge amount of training data to work really well. Some parameter tuning might help too - for example, a size of 100 dimensions might be way too big for the amount of data in the reuters dataset.

Let's see if there's a way to visualize some of the information captured by the model. Since the vectors are high-dimensional we can't visualize them directly, but we can apply a dimension reduction technique like PCA and use the first two principal components as coordinates. We can try this with a group of words that should be somewhat similar, such as countries.

```
from sklearn.decomposition import PCA
words = ['US', 'China', 'Japan', 'England', 'France', 'Germany', 'Soviet']
word_vectors = [model[word] for word in words]
# create and apply PCA transform
pca = PCA(n_components=2)
principal_components = pca.fit_transform(word_vectors)
# slice the 2D array
x = principal_components[:, 0]
y = principal_components[:, 1]
# plot with text annotation
fig, ax = plt.subplots(figsize=(16,12))
ax.scatter(x, y, s=0)
for i, label in enumerate(words):
ax.annotate(label, (x[i], y[i]), size='x-large')
```

They're a bit more spread out than I thought they would be, but a few (such as U.S. and China) are very close. These probably appeared frequently in the text so there may have been a larger amount of training data for these terms. The results become more interesting when applied to very large datasets, and indeed Google and others have done just that. In fact, applying this methodology to words is only the beginning. It's already been extended or is being extended to phrases and even entire documents. It's a very promising research area and I'm excited to see what comes out of it in the next few years.

]]>(NOTE: This blog post is partially adapted from the PyMC3 "Getting Started" tutorial at https://pymc-devs.github.io/pymc3/getting_started. PyMC3 is currently considered beta software and should be treated as such. The source code for this post is available here.)

For our first exercise we're going to implement multiple linear regression using a very simple two-variable dataset that I'm borrowing from an exercise I did for a machine learning course. The data is related to home sales and contains independent variables for the size of and number of bedrooms in a house, and a dependent variable for the price of the house. Let's load up and visualize the data.

```
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
path = os.getcwd() + '/data/ex1data2.txt'
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
data.head()
```

Size | Bedrooms | Price | |
---|---|---|---|

0 | 2104 | 3 | 399900 |

1 | 1600 | 3 | 329900 |

2 | 2400 | 3 | 369000 |

3 | 1416 | 2 | 232000 |

4 | 3000 | 4 | 539900 |

```
fig, ax = plt.subplots(1, 2, figsize=(15,6))
ax[0].scatter(data.Size, data.Price)
ax[1].scatter(data.Bedrooms, data.Price)
ax[0].set_ylabel('Price')
ax[0].set_xlabel('Size')
ax[1].set_xlabel('Bedrooms')
```

We also need a baseline linear regression technique to compare to the bayesian approach so we have an idea of how well it's performing. For that we'll load up scikit-learn's linear regression module. We'll use squared error to evaluate the performance.

```
from sklearn import linear_model
model = linear_model.LinearRegression()
# normalize features
data = (data - data.mean()) / data.std()
X = data[['Size', 'Bedrooms']].values
y = data['Price'].values
model.fit(X, y)
y_pred = model.predict(X)
squared_error = ((y - y_pred) ** 2).sum()
squared_error
```

12.284529170669948

For fun let's see what parameters it came up with too. We can also compare these later on.

```
model.intercept_, model.coef_
```

(-1.2033011196250568e-16, array([ 0.88476599, -0.05317882]))

Okay, now we're ready to proceed. In order to define a model in PyMC3, we need to frame the problem in bayesian terms. This is no small task for a beginner in bayesian statistics and takes some getting used to. In the case of linear regression, we're interested in predicting outcomes Y as normally-distributed observations with an expected value μ that is a linear function of two predictor variables, X1 and X2. Using this model, μ is our expected value, alpha is our intercept, beta is an array of coefficients, and sigma represents the observation error. Since these are all unknown, non-deterministic variables, we need to specify a prior to instantiate them. We'll use normal distributions with a mean of zero.

By the way, if any of the concepts I just mentioned sound completely foreign, I encourage the reader to brush up on the basics of Bayesian statistics. There are some great resources online. In particular, I've found this book to be very helpful.

```
from pymc3 import Model, Normal, HalfNormal
regression_model = Model()
with regression_model:
# priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
# expected value of outcome
mu = alpha + beta[0] * X[:,0] + beta[1] * X[:,1]
# likelihood (sampling distribution) of observations
y_obs = Normal('y_obs', mu=mu, sd=sigma, observed=y)
```

Note that each variable is either declared with a prior representing the distribution of that variable, or (in the case of μ) is a deterministic outcome of other stochastic variables. Also of note is the y_obs variable, which is a special type of "observed" variable that represents the data likelihood of the model. Finally, observe that we're mixing PyMC variables with the variables holding the data. PyMC's models are very expressive and we could use a variety of mathematical functions to define our variables.

Now that we've specified the model, we need to obtain posterior estimates for the unknown variables in the model. There are two techniques we can leverage for this and they can be used to complement each other. The first thing we're going to do is find the **Maximum A Priori Estimate (MAP)** for the model. The MAP is a point estimate for the model parameters obtained using numerical optimization methods.

```
from pymc3 import find_MAP
from scipy import optimize
map_estimate = find_MAP(model=regression_model, fmin=optimize.fmin_powell)
print(map_estimate)
```

{'alpha': array(3.396023412944341e-09), 'beta': array([ 0.8846891, -0.0531327]), 'sigma_log': array(-0.6630286280877248)}

In this case we're overriding the default optimization algorithm (BFGS) and specifying our own, but we could have left it as the default too. Any optimization algorithm that minimizes the loss on the objective function should work.

Finding the MAP is a good starting point but it's not necessarily the best answer we can find. To do that, we need to use a simulation-based approach such as Markov Chain Monte Carlo (MCMC). PyMC3 implements a variety of MCMC sampling algorithms including the No-U-Turn Sampler (NUTS), which is especially good for models that have many continuous variables because it uses gradient-based techniques to converge much faster than traditional sampling algorithms.

Let's use the MAP as a starting point and sample the posterior distribution 1000 times using MCMC.

```
from pymc3 import NUTS, sample
with regression_model:
# obtain starting values via MAP
start = find_MAP(fmin=optimize.fmin_powell)
# instantiate sampler
step = NUTS(scaling=start)
# draw posterior samples
trace = sample(5000, step, start=start)
```

[-----------------100%-----------------] 5000 of 5000 complete in 11.4 sec

We can examine the trace object directly to see the sampled values for each of the variables in the model.

```
trace['alpha'][-5:]
```

array([ 0.07095636, -0.05955168, 0.09537584, 0.04383463, 0.10311347])

Although the flexibility to inspect the values directly is useful, PyMC3 provides plotting and summarization functions for inspecting the sampling output that tell us much more at a glance. A simple posterior plot can be created using traceplot.

```
from pymc3 import traceplot
traceplot(trace)
```

There's also a text-based output available using the summary function.

```
from pymc3 import summary
summary(trace)
```

alpha: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------- 0.002 0.080 0.001 [-0.154, 0.157] Posterior quantiles: 2.5 25 50 75 97.5 |--------------|==============|==============|--------------| -0.152 -0.051 0.003 0.055 0.162 beta: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------- 0.884 0.096 0.002 [0.688, 1.063] -0.052 0.097 0.002 [-0.237, 0.139] Posterior quantiles: 2.5 25 50 75 97.5 |--------------|==============|==============|--------------| 0.697 0.821 0.884 0.946 1.076 -0.237 -0.118 -0.052 0.010 0.139 sigma_log: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------- -0.617 0.103 0.001 [-0.819, -0.425] Posterior quantiles: 2.5 25 50 75 97.5 |--------------|==============|==============|--------------| -0.807 -0.688 -0.622 -0.549 -0.401 sigma: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------- 0.543 0.057 0.001 [0.440, 0.653] Posterior quantiles: 2.5 25 50 75 97.5 |--------------|==============|==============|--------------| 0.446 0.502 0.537 0.578 0.669

So we now have posterior distributions for each of our model parameters, but what do we do with them? Remember our original task was to find a model that minimizes the squared error of the training data. We've created a probablistic model of the parameters in the linear regression model, but we need to use point values to calculate the squared error. The most straightforward approach would be to use the mean, or expected value, of the parameters and plug those into the linear regression equation. Let's try that.

```
# so we can use vector math to compute the predictions at once
data.insert(0, 'Ones', 1)
X = data.iloc[:, 0:3].values
params = np.array([0, 0.883, -0.052])
y_pred = np.dot(X, params.T)
bayes_squared_error = ((y - y_pred) ** 2).sum()
bayes_squared_error
```

12.284629306712745

Total squared error probably isn't the best way to evaluate the performance of a model. In a real scenario we'd be testing predictions on unseen data, testing the robustness of the model, etc. But it's still pretty interesting that the parameter means it found resulted in a model with basically the exact same squared error as the scikit-learn linear regression model. Fascinating stuff.

Since linear models are pretty common, PyMC3 also has a GLM module that makes specifying models in this format much easier. It uses patsy to define model equations as a string (similar to R-style formulas) and creates the necessary variables underneath.

```
from pymc3 import glm
with Model() as model:
glm.glm('Price ~ Size + Bedrooms', data)
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(5000, step, start=start)
```

[-----------------100%-----------------] 5000 of 5000 complete in 12.4 sec

The end result should look pretty similar to the trace that we obtained from the first example.

```
traceplot(trace)
```

So that's what bayesian inference looks like when applied to multiple linear regression. At this point it's worth asking - how would one use this in the broader picture, especially on non-trivial problems? For someone interested primarily in machine learning applications (vs. scientific analysis), where does this fit in the toolbox? These are questions that I'm still figuring out. I think the real power of bayesian modeling comes from incorporating domain knowledge into graphical models such as what Daphne Koller teaches in her Probablistic Graphical Models class. I hope to explore this further in a future notebook and attempt to apply it to a real machine learning problem.

]]>