Markov chains are essentially a way to capture the probability of state transitions in a system. A process can be considered a Markov process if one can make predictions about the future state of the process based solely on its present state (or several of the most recent states for a higher-order Markov process). In other words, the history doesn't matter beyond a certain point. There are lots of great explainers out there so I'll leave that for the reader to explore independently (this one is my favorite). It will become clearer as we step through the code, so let's dive in.

For this example we're going to build a language-based Markov chain. More specifically, we'll read in a corpus of text and identify pairs of words that appear together. The pairings are sequential such that when a word \(w1\) is followed by a word \(w2\), then we say that the system has a probabilistic state transition from \(w1\) to \(w2\). An example will help. Consider the phrase "the brown fox jumped over the lazy dog". If we break this down by word pairings, our state transitions would look like this:

the: [brown, lazy]

brown: [fox]

fox: [jumped]

over: [the]

lazy: [dog]

This set of state transitions is called a Markov chain. With this in hand we can now choose a starting point (i.e. a word in the corpus) and "walk the chain" to create a new phrase. Markov chains built in this manner over large amounts of text can produce surprisingly realistic-sounding phrases.

In order to get started we need a corpus of text. Anything sufficiently large will do, but to really have some fun (and at the risk of bringing politics into the mix) we're going to make Markov chains great again by using this collection of text from Donald Trump's campaign speeches. Our first step is to import the text file and parse it into words.

```
import urllib2
text = urllib2.urlopen('https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt')
words = []
for line in text:
line = line.decode('utf-8-sig', errors='ignore')
line = line.encode('ascii', errors='ignore')
line = line.replace('\r', ' ').replace('\n', ' ')
new_words = line.split(' ')
new_words = [word for word in new_words if word not in ['', ' ']]
words = words + new_words
print('Corpus size: {0} words.'.format(len(words)))
```

Corpus size: 166259 words.

I did some clean-up by converting it to ASCII and removing line breaks but that's about it, the rest of the text is just left as it appears in the source file. Our next step is to build the transition probabilities. We'll represent our transitions as a dictionary where the keys are the distinct words in the corpus and the value for a given key is a list of words that appear after that key. To build the chain we just need to iterate through the list of words, add it to the dictionary if it's not already there, and add the word proceeding it to the list of transition words.

```
chain = {}
n_words = len(words)
for i, key in enumerate(words):
if n_words > (i + 1):
word = words[i + 1]
if key not in chain:
chain[key] = [word]
else:
chain[key].append(word)
print('Chain size: {0} distinct words.'.format(len(chain)))
```

Chain size: 13292 distinct words.

It may come as a surprise that we're just naively inserting words into the transition list without caring if that word had appeared already or not. Won't we get duplicates, and isn't that a problem? Yes we will, and no it's not. Think of this as a simplistic way of representing the transition probability. If a word appears multiple times in the list, and we sample from the list randomly during a transition, there's a higher likelihood that we pick that word proportional to the number of times it appeared after the key relative to all the other words in the corpus that appeared after that key.

Now that we've built our Markov chain, we can get to the fun part - using it to generate phrases! To do this we only need two pieces of information - a starting word, and a phrase length. We're going to randomly select a starting word from the corpus and make our phrases tweet-length by sampling until our phrase hits 140 characters (assume we're part of the #never280 crowd). Let's give it a try.

```
import random
w1 = random.choice(words)
tweet = w1
while len(tweet) < 140:
w2 = random.choice(chain[w1])
tweet += ' ' + w2
w1 = w2
print(tweet)
```

Were not going to run by the 93 million people are, where were starting. New Hampshire." I PROMISE. I do so incredible, and be insulted, Chuck.

Not bad! The limitations of using only one word for context are readily apparent though. We can improve it by using a 2nd-order Markov chain instead. This time, instead of using simple word pairings, our "keys" will be the set of distinct tuples of words that appear in the text. Borrowing from the example phrase earlier, a 2nd-order Markov chain for "the brown fox jumped over the lazy dog" would look like:

(the, brown): [fox]

(brown, fox): [jumped]

(fox, jumped): [over]

(jumped, over): [the]

(over, the): [lazy]

(the, lazy): [dog]

In order to build a 2nd-order chain, we have to make a few modifications to the code.

```
chain = {}
n_words = len(words)
for i, key1 in enumerate(words):
if n_words > i + 2:
key2 = words[i + 1]
word = words[i + 2]
if (key1, key2) not in chain:
chain[(key1, key2)] = [word]
else:
chain[(key1, key2)].append(word)
print('Chain size: {0} distinct word pairs.'.format(len(chain)))
```

Chain size: 72373 distinct word pairs.

We can do a sanity check to make sure it's doing what we expect by choosing a word pair that appears somewhere in the text and then examining the transitions in the chain for that pair of words.

```
chain[("Its", "so")]
```

['great', 'great', 'easy.', 'preposterous.', 'important...', 'simple.', 'simple.', 'horrible.', 'out', 'terrible.', 'sad.', 'much', 'can', 'easy.', 'embarrassing', 'astronomical']

Looks about like what I'd expect. Next we need to modify the "tweet" code to handle the new design.

```
r = random.randint(0, len(words) - 1)
key = (words[r], words[r + 1])
tweet = key[0] + ' ' + key[1]
while len(tweet) < 140:
w = random.choice(chain[key])
tweet += ' ' + w
key = (key[1], w)
print(tweet)
```

there. They saw it. He talks about medical cards. He talks about fixing the VA health care. They want to talk to me from Georgia? "Dear So and

Better! Let's turn this into a function that we can call repeatedly to see a few more examples.

```
def markov_tweet(chain, words):
r = random.randint(0, len(words) - 1)
key = (words[r], words[r + 1])
tweet = key[0] + ' ' + key[1]
while len(tweet) < 140:
w = random.choice(chain[key])
tweet += ' ' + w
key = (key[1], w)
print(tweet + '\n')
```

```
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
```

East. But we have a huge subject. Ive been with the Romney campaign. Guys made tens of thousands of people didnt care about the vets in one hour. somebody is going to put American-produced steel back into the sky. It will be the candidate. But I think 11 is a huge problem. And Im on the THAT WE CAN ONLY DREAM ABOUT. THEY HAVE A VERY BIG BEAUTIFUL GATE IN THAT WALL, BIG AND BEAUTIFUL, RIGHT. NO. NO, I DON'T KNOW WHERE THEY HAVE We need to get so sick of me. I didnt want the world my tenant. They buy condos for tens of millions of dollars overseas. And too many executive Wont be as good as you know, started going around and were going to win. Were going to happen. Thank you. SPEECH 8 This is serious rifle. This

That's all there is to it! Incredibly simple yet surprisingly effective. It's obviously not perfect but it's not complete gibberish either. If you run it enough times you'll find some combinations that actually sound pretty plausible. These results could probably be improved significantly with a much more powerful technique like a recurrent neural net, but relative to the effort involved it's hard to beat Markov chains.

]]>One of the most interesting themes throughout the book is the role that imagination has played in our species' ascendancy on the world stage. I think most people probably believe that humans have always been at the top of the food chain, but for most of our history we were essentially foragers (or as Harari put it, "an animal of no significance"). It wasn't until the "cognitive revolution" around 70,000 years ago that humans moved to the top of the food chain. The reason seems to be that our new cognitive skills allowed us to coordinate with other humans on a level that the world had not seen before. What specifically enabled this were two things - new language skills, and the ability to invent fiction. Our newfound ability to imagine things that aren't real allowed humans to organize around shared beliefs in myths that we, ourselves, created.

If you think about it, transitioning from a world in which all life operates via basic biological principles to one in which a species can imagine alternate realities is a big deal. This change arguably led to everything that has come since, to all of the accumulated knowledge that our civilization has amassed in the many generations that followed. Before this transition, an organism's operating system was programmed in its DNA. Some organisms could acquire limited amounts of knowledge from their environment, such as the best places to hunt or tricks to avoid being eaten, but everything else was hard-wired. With the ability to imagine, humans could rewrite their own operating system. We could create an idea - a conception of a thing that didn't exist, and make it reality.

What I find particularly fascinating is Harari's argument that basically every organizing principle in modern human society falls under the category of a shared belief in fiction. Religions, laws, nations, companies, human rights - these are all, in a sense, figments of our collective imaginations. Biology doesn't take a stance on how humans should treat each other. There is no physical necessity for the borders we draw around nations, no tangible thread holding together the assets of a corporation. All of this seems completely obvious upon introspection, but I discovered that until reading this book I had never really thought about it. I've found that the way I view things like political debate has changed as a result.

*Sapiens* delves into numerous topics that one might be surprised to find in a "history" book. For instance, there's a chapter that questions whether or not we're any happier or more fulfilled as a species for all the progress we've made. In another chapter, Harari looks at the trends of the past and begins to speculate about where we're headed next. He discusses the role that technologies like artificial intelligence and genetic engineering might play in this future, and whether or not we'll even still call ourselves *Homo sapiens*.

Philosophy aside, *Sapiens* does a great job of covering the major themes in our species' history. The agricultural revolution, the invention of written language and money, imperialism, the scientific revolution, the rise of capitalism and industry. The book covers a lot of ground. Harari's analysis is often counter-intuitive (and probably controversial), but always thoroughly researched and very well-reasoned. For example, he called the agricultural revolution "history's biggest fraud" because for most humans, the transition led to a much harsher and less rewarding life. He documents the "imagined order" of large societies that placed people into hierarchies, and concludes that the particular structures of these hierarchies are mostly accidents of history. He argues that scientific research flourished only in alliance with ideologies such as capitalism or imperialism, because these ideologies funded the cost of research. These conclusions are not immediately obvious but he makes a compelling case.

There was so much packed into this book that I'm really just scratching the surface. It's not a quick read but the writing style is very engaging so it doesn't feel laborious like you might expect for a book with this subject matter. I would highly recommend it to anyone interested in expanding their knowledge about humankind's social and cultural evolution.

]]>There are a few requirements you'll need to meet in order to use the library. It uses PyStan to do all of its inference, so PyStan has to be installed. PyStan has its own dependencies, including a C++ compiler. Python 3 also appears to be a requirement. Full installation instructions are here.

Let's take a quick tour through Prophet's capabilities. We can start by reading in some sample time series data. In this case we're using Wikipedia page hits for Peyton Manning, which is the data set that Facebook collected for the library's example code.

```
%matplotlib inline
import os
import pandas as pd
import numpy as np
from fbprophet import Prophet
path = os.path.dirname(os.path.dirname(os.getcwd())) + '/data/manning.csv'
data = pd.read_csv(path)
data['ds'] = pd.to_datetime(data['ds'])
data.head()
```

ds | y | |
---|---|---|

0 | 2007-12-10 | 14629 |

1 | 2007-12-11 | 5012 |

2 | 2007-12-12 | 3582 |

3 | 2007-12-13 | 3205 |

4 | 2007-12-14 | 2680 |

There are only two columns in the data, a date and a value. The naming convention of using 'ds' for the date and 'y' for the value is apparently a requirement to use Prophet; it's expecting those exact names and will not work otherwise!

Let's examine the data by plotting it using pandas' built-in plotting function.

```
data.set_index('ds').plot(figsize=(12, 9))
```

The data is highly volatile with order-of-magnitude differences between a typical day and a high-traffic day. This will be hard to model directly. Let's try applying a log transform to see if that helps.

```
data['y'] = np.log(data['y'])
data.set_index('ds').plot(figsize=(12, 9))
```

Much better! Not only is it stationary, but we've also revealed what looks like some cyclical patterns in the data. We can now instantiate a Prophet model and fit it to our data.

```
m = Prophet()
m.fit(data)
```

That was easy! This is one of the most attractive features of Prophet. It essentially does all of the model selection work for you and gives you a result that works well without much user input required. In this case we didn't have to specify anything at all, just give it some data and we get a model.

We'll explore below what the model looks like but it's worth spending a moment first to explain what's going on here. Unlike typical time-series methods like ARIMA (which are considered generative models), Prophet uses something called an additive regression model. This is essentially a sophisticated curve-fitting model. I haven't dug into any of the math, but based on the description in their introductory blog post, Prophet builds separate components for the trend, yearly seasonality, and weekly seasonality in the time series (with holidays as an optional fourth component). We can witness this directly by looking at one of the undocumented properties on the model object that shows the fitted parameters.

```
m.params
```

{u'beta': array([[ 0. , -0.03001147, 0.04819977, 0.00999481, -0.00228437, 0.01252909, 0.01559136, 0.00950633, 0.00075704, 0.00391209, -0.00586589, 0.0075454 , -0.00524287, 0.00208091, -0.00477578, -0.00410379, -0.0077744 , -0.00081338, 0.00125811, 0.00187115, 0.0069828 , -0.01233829, -0.01057246, 0.00938595, 0.00847051, 0.00088024, -0.00352237]]), u'delta': array([[ 1.62507395e-07, 1.29092081e-08, 3.48169254e-01, 4.57815903e-01, 1.61826714e-07, -5.66144938e-04, -2.34969389e-01, -2.46905754e-01, 9.96595883e-08, -1.82605683e-07, 6.12381739e-08, 2.78653912e-01, 2.30631082e-01, 2.83118248e-03, 1.55276178e-03, -8.61134360e-01, -3.14239669e-07, 5.54456073e-09, 4.91423429e-07, 4.71475093e-01, 7.93935609e-03, 1.36547372e-07, -3.38274613e-01, -3.20008088e-07, 1.16410210e-07]]), u'gamma': array([[ -5.37486490e-09, -8.40863029e-10, -3.59567303e-02, -6.19588853e-02, -2.69802216e-08, 1.12158987e-04, 5.44799089e-02, 6.53304459e-02, -2.95648930e-08, 6.03344459e-08, -2.21556944e-08, -1.09561865e-01, -9.78411305e-02, -1.28994139e-03, -7.57253043e-04, 4.47568989e-01, 1.73293155e-07, -3.23167613e-09, -3.01853068e-07, -3.04398195e-01, -5.37507537e-03, -9.67767399e-08, 2.50366597e-01, 2.46999155e-07, -9.35053320e-08]]), u'k': array([[-0.35578215]]), u'm': array([[ 0.62604285]]), u'sigma_obs': array([[ 0.03759107]])}

I think the beta, delta, and gamma arrays correspond to the distributions for the three different components. The way I think about this is we're saying we have three different regression models with some unknown set of parameters, and we want to find the combination of those models that best explains the data. We can attempt to do this using maximum a-priori (MAP) estimation, where our priors are the equations for the regression components (piecewise linear for the trend, Fourier series for the seasonal component, and so on). This appears to be what Prophet is doing. I can't say I've looked at it in any great detail so part of that explanation could be wrong, but I think it's broadly correct.

Now that we have a model, let's see what we can do with it. The obvious place to start is to forecast what we think our value will be for some future dates. Prophet makes this easy with a helper function.

```
future_data = m.make_future_dataframe(periods=365)
future_data.tail()
```

ds | |
---|---|

3265 | 2007-01-15 |

3266 | 2007-01-16 |

3267 | 2007-01-17 |

3268 | 2007-01-18 |

3269 | 2007-01-19 |

That gives us a data frame with dates going one year forward from where our data ends. We can then use the "predict" function to populate this data frame with forecast information.

```
forecast = m.predict(future_data)
forecast.columns
```

Index([u'ds', u't', u'trend', u'seasonal_lower', u'seasonal_upper', u'trend_lower', u'trend_upper', u'yhat_lower', u'yhat_upper', u'weekly', u'weekly_lower', u'weekly_upper', u'yearly', u'yearly_lower', u'yearly_upper', u'seasonal', u'yhat'], dtype='object')

The point estimate forecasts are in the "yhat" column, but note how many columns got added. In addition to the forecast itself we also have point estimates for each of the components, as well as upper and lower bounds for each of these projections. That's a lot of detail provided out-of-the-box just by calling a single function!

Let's see an example.

```
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
```

ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|

3265 | 2007-01-15 | 8.200620 | 7.493151 | 8.886727 |

3266 | 2007-01-16 | 8.525638 | 7.791967 | 9.266697 |

3267 | 2007-01-17 | 8.313019 | 7.620597 | 9.000529 |

3268 | 2007-01-18 | 8.145577 | 7.449701 | 8.870133 |

3269 | 2007-01-19 | 8.157476 | 7.467178 | 8.860933 |

Prophet also supplies several useful plotting functions. The first one is just called "plot", which displays the actual values along with the estimates. For the forecast period it only displays the projections since we don't have actual values for this period.

```
m.plot(forecast);
```

I found this to be a bit confusing because the data frame we passed in only contained the "forecast" date range, so where did the rest of it come from? I think the model object is storing the data it was trained on and using it as part of this function, so it looks like it will plot the whole date range regardless.

We can use another built-in plot to show each of the individual components. This is quite useful to visually inspect what the model is capturing from the data. In this case there are a few clear takeaways such as higher activity during football season or increased activity on Sunday & Monday.

```
m.plot_components(forecast);
```

In addition to the above components, Prophet can also incorporate possible effects from holidays. Holidays and dates for each holiday have to be manually specified over the entire range of the data set (including the forecast period). The way holidays get defined and incorporated into the model is fairly simple. Below are some holiday definitions for our current data set that include Peyton Manning's playoff and Superbowl appearances (taken from the example code).

```
playoffs = pd.DataFrame({
'holiday': 'playoff',
'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
'2010-01-24', '2010-02-07', '2011-01-08',
'2013-01-12', '2014-01-12', '2014-01-19',
'2014-02-02', '2015-01-11', '2016-01-17',
'2016-01-24', '2016-02-07']),
'lower_window': 0,
'upper_window': 1,
})
superbowls = pd.DataFrame({
'holiday': 'superbowl',
'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
'lower_window': 0,
'upper_window': 1,
})
holidays = pd.concat((playoffs, superbowls))
```

Once we have holidays defined in a data frame, using them in the model is just a matter of passing in the data frame as a parameter when we define the model.

```
m = Prophet(holidays=holidays)
forecast = m.fit(data).predict(future_data)
m.plot_components(forecast);
```

Our component plot now includes a holidays component with spikes indicating the magnitude of influence those holidays have on the value.

While the Prophet library itself is very powerful, there are some useful features that we'd typically want when doing time series modeling that it currently doesn't provide. One very simple and obvious thing that's needed is a way to evaluate the forecasts. We can do this ourselves using scikit-learn's metrics (you could also calculate it yourself). Note that since we took the natural log of the series earlier we need to reverse that to get a meaningful number.

```
from sklearn.metrics import mean_absolute_error
data = m.predict(data)
mean_absolute_error(np.exp(data['y']), np.exp(data['yhat']))
```

2436.9620410194648

That works fine as a very simple example, but for real applications we'd probably want something more robust like cross-validation over sliding windows of the data set. Currently in order to accomplish this we'd have to implement it ourselves.

Another limitation is the lack of ability to incorporate additional information into the model. One can imagine variables that could be used along with the time series to further improve the forecast (for example, a variable indicating if Peyton Manning had just won a game, or had a particularly good performance, or appeared in some news articles). We can't do anything like this with Prophet directly. However, one idea I've experimented with in the past that may get around this limitation is building a two-stage model. The first stage is the Prophet model, and we use that to generate predictions. The second stage is a normal regression model that includes the additional signals as independent variables. The wrinkle is that instead of predicting the target directly, we predict the **error** from the time series model. When you put the two together, this may result in an even better overall forecast.

All things considered, Prophet is a great addition to the toolbox for time series problems. There are a number of knobs and dials that one can tweak that I didn't get into because I still haven't tried them out, but they provide options for advanced users to improve their forecasts even further. It's worth cautioning that this software is fairly immature so proceed carefully if using it for any serious tasks. That said, the authors claim Facebook uses it extensively so take that for what it's worth.

]]>In Peter's view this technology is most likely to come from startups (as he put it, a startup is the largest group of people you can convince of a plan to build a different future). With that perspective in mind, much of the book focuses on lessons about how to "build the future" through a startup. But I think the main points of the book are valuable whether you have any interest in startups or not. They offer new ways of thinking about technology, markets, and competition. It's bold, contrarian, and thought-provoking. Here are some of the key insights.

One of the most notable threads that gets touched on often is the nature of competition. Most economists view perfectly competitive markets as capitalism working as intended. Competition causes each participant to raise their game, lower prices, deliver greater value etc. in the pursuit of customers and profit. But Peter argues that competition and capitalism are actually opposites. Competitive markets erode profits until no one is making any money. When margins are very thin and profit is hard to come by, companies can't afford to do anything except fight for market share. They're trapped by short-term thinking.

The opposite state is when a company has a monopoly - complete dominance of a market protected by a huge moat. Most of us would say that monopolies are bad. Monopolies allow companies to charge high prices, slack off on customer service, cut R&D investment, and lots of other bad things. But Thiel argues this is only true in a world where nothing changes. In dynamic, technology-driven markets, creative monopolies can actually be good for society because they have both the incentive and the available cash flow to invest enormous resources into inventing new technologies. Rather than rent-seeking, this class of monopoly creates new categories of abundance.

Thiel's perspective on competition and monopoly is a core component of his advice for building a company. If competition is problematic then the best thing to do is start by owning a small market as a monopoly and expand to adjacent markets from there. This sounds easy but is very hard in practice, which is why few companies achieve it. Companies that do gain a monopoly are able to build some durable, lasting advantage in their market. This usually comes in the form of proprietary technology, network effects, economies of scale, and branding. Some combination of these advantages can generate significant long-term value.

My own perspective is that having a monopoly isn't binary. Nor are markets entirely static or dynamic. These things exist on a continuous spectrum. All companies are probably somewhere between perfect monopoly and perfect competition in any given market, no matter how that market is defined. And all monopolies probably cause some harm to consumers, even if they also lead to some new inventions. It's useful to speak about these things with such rigidity in the abstract to make a point, but the real world is messy. That doesn't mean he's wrong, just that there's probably more to the story in most cases.

Another theme from the book that I found really interesting was the trichotomy between easy, hard, and impossible. Thiel argues that most things are either easy or impossible to accomplish. Easy things have already been done, and impossible things can never be done no matter how hard we try. However, there are some things that are hard but possible. In some cases these hard things were previously impossible but have become possible thanks to new technology. Hard things that are possible to achieve are "secrets". They're truths about the world that most people do not know. This is because common knowledge about what is possible changes much slower than what is possible in reality. Secrets can come in many forms. They can be about the natural world or they can be about people. These hard things, these "secrets", are what great companies are built on. Companies doing hard things are a shared conspiracy to change the world, founded on a secret known only to those on the inside.

This is a pretty radical view of what it means to be part of a company, however I think it makes sense. Most of us aren't part of a conspiracy to change the world because we're not working for companies doing hard things. We're all probably aware of companies that seem to fit this description though. Tesla would be a good example. Tesla is trying to bring about the end of the fossil fuels era by making electric cars mainstream - hard, but possible. Tesla's employees generally appear to be fanatical about the company's mission. Many on the outside probably don't think Tesla will succeed in its goal, but I bet most of its employees do. Their "secret", though widely known, still may not be widely believed.

Most companies that try to do something hard end up failing. They can fail for a variety of reasons, many of which Peter talks about in the book (culture, team, incentives alignment, distribution, and so on). The distribution of companies that try to do hard things and succeed vs. those that fail probably looks like a power law. This explains why most secrets never see the light of day (and why the secret about hard things remains relatively unknown).

The last theme in the book that I found interesting has to do with salesmanship and persuasion. As an engineer I'm naturally distrustful of anything that involves "selling". It just feels messy, ambiguous, and somehow dishonest. I think Peter actually changed my mind though. He addressed this very point in the book about engineers underrating sales. It makes sense to talk about sales from the standpoint of building a company. Every successful product or service needs a distribution channel. What intrigued me was thinking about sales in a much more general capacity. Sales is just the art of persuasion. In some sense, we're all selling all the time because persuading people is a normal part of human life. When I really think about it, almost every conversation I have at work or at home involves some amount of persuasion, no matter how mundane. In hindsight, it seems obvious that this is a skill that one can improve on just like anything else.

One other point that Thiel makes is that great salesman are hidden from sight because it's not obvious that they're selling something. He points out the Elon Musk, widely thought of as the consummate engineer, is also a grandmaster salesman. If you think about it this actually makes a lot of sense. Elon is able to get some of the smartest, most driven people in the world to work at his companies. He got a massive loan from the government to build electric cars at a time when the world economy was collapsing. Most of the developed world thinks he's the real-life version of Iron Man. These are the marks of someone who knows how to persuade.

This was one of the most information-dense books I've ever read. Most books use a lot of words to say very little. "Zero to One" completely inverts this norm. Nassim Taleb recommended that everyone read this book three times. I've gone through it twice (it's a quick read) and I'm still picking up new things from it. I think he may be on to something.

]]>The central thesis is a dichotomy between two modes of thought - "System 1" and "System 2". "System 1" is fast, instinctive, and effortless. It happens automatically without us even realizing it. "System 2" is slow, deliberate, and logical. It involves effort and focus. In reality there is no physical distinction between the two modes of thought, it's all happening inside our brains at the same time, but it's a useful abstraction because it captures all the ways that "System 1" thinking can mislead us. The way I think about it is "System 1" is a filter on the raw sensory input we're taking in every second. If we think about it in stream processing terms, "System 1" is reading from the raw input stream and applying successively higher-order functions to the stream so we can quickly make sense of the input. It's what allows us to recognize faces, to read a book, to drive a car, or to catch a ball. These things don't require pupil-dilating mental effort on our part - they just happen. We might be aware that they're happening, but it's a passive awareness. We don't have to expend mental energy to make it so.

"System 1" thinking is absolutely essential for us to be able to function on a basic level. But it can also lead us astray. The filters being applied automatically to our input stream lead to thoughts and actions that frequently deviate from the rational agent model of human behavior. They lead to all sorts of cognitive biases such as anchoring, availability, substitution, framing, and overconfidence. These effects can be overcome with deliberate thought and effort - in other words, engaging "System 2" thinking to come up with an objective, logical conclusion. The problem is, these cognitive biases kick in so often and with such ease that it's hard to even recognize when such an error in judgement has occurred.

It's jarring to see just how often we're unconsciously influenced by cognitive biases, and how dramatic of an effect they can have. In one example, Kahneman was illustrating our tendency to accept a default option if one is available. He cited organ donor statistics in two different countries. In one country, the rate of civilians that opted to be an organ donor was something like 90%. In another, similar country the rate was 4%. So what accounted for the difference? In the first country you had to opt OUT of being a donor, and in the second country you had to opt IN. That was it. This is a startling conclusion. It's possible that there were confounding factors that could account for some of this variation, but it's just one example of an effect that's been observed in many different contexts with a high degree of statistical significance.

The reason I think it would be valuable for everyone to read this book is because understanding these biases likely leads to clearer thinking overall. It may be impossible, or even undesirable, to recognize every time our "System 1" thinking is making judgments that objectively do not make logical sense. But with some effort it's certainly possible to weed out the worst offenses and recognize when we're making egregious errors. In my own mental models, I've incorporated what I learned from Kahneman into my generally Bayesian approach to rational thought. I think this is a fairly sensible way to approach solving problems and making decisions.

]]>Below is a brief synopsis of the courses they offer. They are completely free to try out - just follow the link above, create an account, and register for the course you're interested in. In addition, I put all of the content for the courses I worked through (including labs with example code) in a github repo.

Note that due to the fast-paced rate of change that I alluded to earlier (and MapR's vested interest in staying current) the course catalog will likely evolve over time. It's possible that this post will become outdated fairly quickly, although I'll try to revisit it periodically to make sure the guidance is still relevant. I should also note that are there snippets of content throughout the training that are specific to the MapR platform, however more than 90% of it is platform-agnostic.

This list is not exhaustive. It only includes the courses that I spent time working on. Feel free to visit the landing page for a complete list of courses.

These are short, introductory courses that present a very high-level overview of the Hadoop ecosystem.

ESS 100 - Introduction to Big Data

ESS 101 - Apache Hadoop Essentials

ESS 102 - MapR Converged Data Platform Essentials

MapReduce is how it all got started, and is still used quite a bit. MapReduce is a programming model for distributing work over very large data sets across a cluster of machines. The name comes from the two principal steps involved in the process - map (filtering, sorting etc.) and reduce (summary operations).

DEV301 - Developing Hadoop Applications

HBase is an open-source, non-relational, distributed column-store database written in Java. HBase is very widely used as an alternative to relational databases for certain types of applications where scale is an issue.

DEV320 - HBase Data Model and Architecture

DEV325 - HBase Schema Design

DEV330 - Developing HBase Applications: Basics

DEV335 - Developing HBase Applications: Advanced

DEV340 - HBase Bulk Loading, Performance, and Security

Spark is an open-source cluster-computing framework that runs on Hadoop. I've written about Spark in the past. Suffice to say that it is a very exciting (and very popular) framework.

DEV360 - Spark Essentials

DEV361 - Build and Monitor Spark Applications

DEV362 - Create Data Pipeline Using Spark

Drill is an open-source framework for querying semi-structured and unstructured data at scale using SQL-like syntax. I haven't seen a lot of interest in this outside of the MapR distribution but it's a mature technology that has a lot of potential.

DA410 - Drill Essentials

DA415 - Drill Architecture

Hive is a data warehousing infrastructure built on top of Hadoop that provides the capability to query data in the Hadoop file system using SQL-like syntax. There's some conceptual overlap between Hive, HBase and Drill that requires some background and context to understand. The relevant courses do a good job of clarifying these relationships.

Pig is a high-level programming language and framework for doing ETL (extract, transform, and load) tasks with data. I'm not sure how much Pig is used anymore with newer technologies like Spark offering similar capabilities, but I think there is still a use case for it.

]]>Below is my curated list of podcasts to try out. This list is obviously biased toward my own fields of interest. If you happen to be an art enthusiast or love pop culture, this list will be next to useless for you. But I'm guessing if you're reading this blog then we have at least SOME shared interests. Even if it's not your thing, give one or two a try - you might be surprised.

Without further ado, here are 20 awesome podcasts that you should definitely check out.

Perfect for technology nuts and wannabe-entrepreneurs. The a16z staff hosts prominent authors, investors, CEOs etc. to talk emerging technology and how the business of technology is being disrupted. For instance, a few recent episodes covered VR for storytelling, genomics, and the next evolution of cities.

https://medium.com/conversations-with-tyler/all

Tyler Cowen hosts various guests to discuss a wide range of topics, usually with some angle relating to macro-economics. It's actually pretty hard to be more specific than that, they're kind of all over the place.

Ben Thompson (the "Stratechery" guy) and his co-host James Allworth discuss the business and strategy of technology, with the occasional diversion into bigger-picture topics in society and politics. Ben is really, really good at understanding how tech companies operate and where they're headed. It's kind of like getting a business school degree, but better (and for free).

http://www.npr.org/podcasts/510308/hidden-brain

Consider it a weekly lesson in psychology, sociology and human behavior, using real-world stories. I mean, everyone likes stories right?

http://investorfieldguide.com/

Ostensibly it's about investing, but a lot of the guests aren't investors at all so it's hard to pigeonhole. Consider it more an exercise in learning how to think and cultivate curiosity, while also learning about things like hedge funds, venture capital, value investing etc.

https://www.oreilly.com/topics/oreilly-data-show-podcast

For big data nerds. Actually most episodes lately are about AI, because fucking everyone in tech now has to talk about AI at every opportunity. But it's also about big data. Fairly in-the-weeds discussion, and a lot of emphasis on open-source projects. It's a great way to stay up-to-date on the big data landscape.

http://partiallyderivative.com/

For data science nerds. The crew talks data and machine learning while drinking obscure artisanal beer. Hilarty (and learning) ensues. They've also had a lot of good interviews with various data scientists lately (although they're slacking a bit on the beer).

http://www.radiolab.org/series/podcasts/

Radiolab is just awesome. Not sure how else to put it. They do episodes on all sorts of topics (the latest one on CRISPR was really good). Production quality is super high. The topics are also really accessible. If you're new to podcasting this is a great place to start.

http://rationallyspeakingpodcast.org/archive/

Their tagline is "exploring the borderlands between reason and nonsense". If you're skeptical of that claim, then you should probably be listening to this podcast!

http://www.recode.net/recode-decode-podcast-kara-swisher

Kara Swisher grills various silicon valley elites about various tech topics. Okay, it's not only that. But it's MOSTLY that.

http://revisionisthistory.com/

Malcolm Gladwell did a ten-part series where he goes in-depth on random stories from the past and shows how the popular narrative around those events was either wrong or at least incomplete. The show has been on hiatus for a while but worth spinning through the archive.

https://www.samharris.org/podcast/full_archive

Sam is a really smart dude. He invites lots of other really smart dudes on his show, and they spend a few hours going ludicrously in-depth on various philosophical, scientific, and political subjects. In takes a special type of person to be into this sort of thing, but if you're that type of person, you won't find anything else like this anywhere else.

https://www.startalkradio.net/show/

Neil deGrasse Tyson, also known as the millennials' Carl Sagan, is at his best in these entertaining, free-flowing discussions about various science or science-adjacent topics. All sorts of interesting guests and very wide-ranging interviews. Also, one of the comedians that frequents the show kind of sounds like the guy from Archer, which is pretty cool.

http://www.thetalkingmachines.com/blog/

For hardcore machine learning nerds. They do lots of interviews with researchers in various specializations. It's pretty technical but very useful if you're an ML practitioner. No new episodes in the last few months but I'm keeping an eye on it.

http://www.npr.org/programs/ted-radio-hour/

This is one of my favorites. Very high production quality, super-wide range of interesting topics. Each hour-long show stitches together excerpts from several TED talks that share some common theme. They also add some narration and frequently interview the TED speakers to add some color to the original talks. Awesome series.

http://www.vox.com/ezra-klein-show-podcast

Ezra is a political journalist but the podcast isn't focused on politics (although some of his guests are politicians). Just a lot of really good interviews with really smart people. Good podcast hosts are able to frame questions in ways that guide the discussion in interesting directions, and Ezra is especially good at this.

https://www.farnamstreetblog.com/the-knowledge-project/

Just recently discovered this one. The stated goal is to focus on "actionable strategies that help you make better decisions, avoid stupidity, and live a better life". A lot of the interviews are geared towards reading and knowledge acquisition.

Pretty much everyone knows who Tim is. In addition to his famous "4-hour" books, he's one of the guys that really launched podcasting as a medium into the mainstream. His tagline is "deconstructing world-class performers". A lot of the interviews are really good, although some could be edited down a bit.

I'm not gonna lie, these conversations are dry as hell. But if you're serious about investing then there's a LOT of good information to learn here.

For policy nerds (yes, that's a thing). Actually politics more broadly, but they spend a lot of time focused specifically on policy details (health care, taxes, education, and so on). It isn't called "The Weeds" for nothing. For some reason I find it surprisingly fascinating. Useful if you want to talk circles around baffled family members at the next politically-charged Thanksgiving dinner.

Happy podcasting!

]]>My latest target was a basket of different libraries in the Python ecosystem covering things like web development, caching, asynchronous messaging, and visualization. And since I’m a data scientist, I threw in a machine learning component just for fun. To explore these technologies, I created a semi-practical application that reads from the Twitter stream, parses tweets, and does some machine learning magic to score the tweet’s sentiment and project it into a two-dimensional grid, where tweets with similar content will appear closer to each other. It does all of this more or less in real time using asynchronous messaging.

The remainder of this blog post is devoted to showing how to build this from scratch. Just to be completely transparent about what it does and what technologies are involved, here are the components that I’ll demonstrate how to build:

- Basic web development in Python using Flask + standard front-end stuff (Bootstrap, JQuery etc.)
- Asynchronous, chainable task queues using Celery and Redis
- Real-time event-based communication between Flask and connected clients using Socket-IO
- Twitter stream filtering/parsing using Pattern
- Streaming real-time visualization using NVD3
- Sentiment analysis and word embeddings using Scikit-learn and Gensim (word2vec)

And here's a screenshot of what the finished app looks like.

Why are we doing this? Does this app have some tangible, real-world purpose? Probably not. But it’s fun and neat and you’ll hopefully learn a lot. If that sounds interesting then read on!

(NOTE: The completed application is located here. Feel free to reference this often as I may leave out some details throughout the post).

To get started, let’s talk about the setup required to run the application. In addition to a working Python interpreter, you’ll need to install a bunch of libraries that the application uses. You’ll also need to know how to perform a few tasks such as starting Redis, launching Celery, and running a Flask server. Fortunately these are all very easy to do. I wrote up detailed instructions in the project’s README file. Follow those steps and you should be up and running.

Now let’s build sentiment and word vector models to transform tweets. We’ll use this Twitter sentiment dataset as training data for both models. The first step is to read in the dataset and do some pre-processing using TF-IDF to convert each tweet to a bag-of-words representation.

```
print('Reading in data file...')
data = pd.read_csv(path + 'Sentiment Analysis Dataset.csv',
usecols=['Sentiment', 'SentimentText'], error_bad_lines=False)
print('Pre-processing tweet text...')
corpus = data['SentimentText']
vectorizer = TfidfVectorizer(decode_error='replace', strip_accents='unicode',
stop_words='english', tokenizer=tokenize)
X = vectorizer.fit_transform(corpus.values)
y = data['Sentiment'].values
```

Note that we’re using a custom tokenizer designed to handle patterns common in tweets. I borrowed this from a script Christopher Potts wrote and adapted it slightly (final version is in the "scripts" folder). Next, we can train the sentiment classifier and word2vec model.

```
print('Training sentiment classification model...')
classifier = MultinomialNB()
classifier.fit(X, y)
print('Training word2vec model...')
corpus = corpus.map(lambda x: tokenize(x))
word2vec = Word2Vec(corpus.tolist(), size=100, window=4, min_count=10, workers=4)
word2vec.init_sims(replace=True)
```

This should run pretty fast since the training data set is not that big. We now have a model that can read a tweet and classify its sentiment as either positive or negative, and another model that transforms the words in a tweet to 100-dimensional vectors. But we still need a way to use those 100-dimensional vectors to spatially plot a tweet on a 2-dimensional grid. To do that, we’re going to fit a PCA transform for the word vectors and keep only the first 2 principal components.

```
print('Fitting PCA transform...')
word_vectors = [word2vec[word] for word in word2vec.vocab]
pca = PCA(n_components=2)
pca.fit(word_vectors)
```

Finally, we’re going to save all of these artifacts to disk so we can call them later from the web application.

```
print('Saving artifacts to disk...')
joblib.dump(vectorizer, path + 'vectorizer.pkl')
joblib.dump(classifier, path + 'classifier.pkl')
joblib.dump(pca, path + 'pca.pkl')
word2vec.save(path + 'word2vec.pkl')
```

Now that we have all the models we need ready to go, we can get started on the meat of the application. First, some initialization. This code runs only once, when the Flask server is launched.

```
# Initialize and configure Flask
app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost:6379/0'
app.config['SOCKETIO_REDIS_URL'] = 'redis://localhost:6379/0'
app.config['BROKER_TRANSPORT'] = 'redis'
app.config['CELERY_ACCEPT_CONTENT'] = ['pickle']
# Initialize SocketIO
socketio = SocketIO(app, message_queue=app.config['SOCKETIO_REDIS_URL'])
# Initialize and configure Celery
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)
```

There’s a bunch of stuff going on here so let’s break it down. We’ve created a variable called "app" that’s an instantiation of Flask, and set some configuration items to do things like tell it to use Redis as the broker (note that "config" is just a dictionary of key/value pairs which we can use for other settings not required by Flask). We also created a SocketIO instance, which is a class from the Flask-SocketIO integration library that basically wraps Flask with SocketIO support. Finally, we created our Celery app and updated its configuration settings to use the "config" dictionary we defined for Flask.

Next we need to load the models we created earlier into memory so they can be used by the application.

```
# Load transforms and models
vectorizer = joblib.load(path + 'vectorizer.pkl')
classifier = joblib.load(path + 'classifier.pkl')
pca = joblib.load(path + 'pca.pkl')
word2vec = Word2Vec.load(path + 'word2vec.pkl')
```

Finally, we’ll create some helper functions that use these models to classify the sentiment of a tweet and transform a tweet into 2D coordinates.

```
def classify_tweet(tweet):
"""
Classify a tweet with either a positive (1) or negative (0) sentiment.
"""
pred = classifier.predict(vectorizer.transform(np.array([tweet.text])))
return str(pred[0])
def vectorize_tweet(tweet):
"""
Convert a tweet to vector space using a pre-trained word2vec model, then transform
a sum of the vectorized words to 2-dimensional space using PCA to give a simple
2D coordinate representation of the original tweet.
"""
tweet_vector = np.zeros(100)
for word in tokenize(tweet.text):
if word in word2vec.vocab:
tweet_vector = tweet_vector + word2vec[word]
components = pca.transform(tweet_vector)
x = components[0, 0]
y = components[0, 1]
return str(x), str(y)
```

The server code handling web requests is very simple, in fact there are only two routes. The first one is the root path and just returns the main page ("index.html").

```
@app.route('/', methods=['GET'])
def index():
"""
Route that maps to the main index page.
"""
return render_template('index.html')
```

The second route accepts an input phrase and initiates a task chain that will kick off a stream of tweets from the Twitter firehose that match the input phrase.

```
@app.route('/twitter/<phrase>', methods=['POST'])
def twitter(phrase):
"""
Route that accepts a twitter search phrase and queues a task to initiate
a connection to twitter.
"""
queue = app.config['SOCKETIO_REDIS_URL']
# create_stream.apply_async(args=[phrase, queue])
chain(create_stream.s(phrase, queue), send_complete_message.s()).apply_async()
return 'Establishing connection...'
```

This is probably confusing at first glance so let’s examine a bit closer. The "queue" variable is just a string pointing to the URL where Redis is running ("redis://localhost:6379/0"), which we’ll use later. The "chain" line is a Celery function that composes a sequence of tasks to be queued up and run asynchronously. In this case, we’re queuing a call to the "create_stream" task following by a call to the "send_complete_message" task. These tasks may or may not begin executing immediately (depending on the state of the task queue) but in any event, execution of the function continues on to the last line where we return a message to the client saying that a connection (to Twitter) is being established.

In the last section we queued calls to two functions, "create_stream" and "send_complete_message". We said that these were defined as tasks in Celery, meaning that they execute asynchronously apart from the main program thread. Let’s take a look at those functions.

```
@celery.task
def create_stream(phrase, queue):
"""
Celery task that connects to the twitter stream and runs a loop, periodically
emitting tweet information to all connected clients.
"""
local = SocketIO(message_queue=queue)
stream = Twitter().stream(phrase, timeout=30)
for i in range(60):
stream.update()
for tweet in reversed(stream):
sentiment = classify_tweet(tweet)
x, y = vectorize_tweet(tweet)
local.emit('tweet', {'id': str(i),
'text': str(tweet.text.encode('ascii', 'ignore')),
'sentiment': sentiment,
'x': x,
'y': y})
stream.clear()
time.sleep(1)
return queue
```

There’s a lot to digest here so let’s step through it. First, "local" is a Socket-IO instance. Recall that we’re using Socket-IO (through the Flask-SocketIO wrapper library) to essentially "push" information from the Flask server to all connected clients, so this object is creating a channel that will allow us to do that. The reason we’re creating a local instance within the function and not using the global instance is because our Celery task runs in a separate thread (remember, this function gets called asynchronously).

In the next line, we create an object that we imported from the Pattern library that creates a persistent connection with Twitter. Notice we pass in the search phrase that we got via client input. This will restrict the Twitter stream to only capture tweets with that phrase in it. The next section enters a 60-second loop where each second we check for new tweets, and if any are found, we run our sentiment and vectorization functions on them. The last step is the "emit" line, where we tell Socket-IO to push out a message with all of our data for that tweet.

At the very end of the function’s execution, it returns the "queue" variable. You may be wondering – what is this for? Where does it go? Remember that we chained multiple Celery tasks together. When "create_stream" finishes executing the next task in the chain ("send_complete_message") is called, with the output of the first task passed to the next task. In this case we’re using the chain to propagate information (the URL of the Redis queue) through to each task.

The "send_complete_message" function is much simpler and shouldn’t require much explanation. Essentially what we’re doing is emitting a message to our clients that the loop in the previous function finished executing.

```
@celery.task
def send_complete_message(queue):
"""
Celery task that notifies the client that the twitter loop has completed executing.
"""
local = SocketIO(message_queue=queue)
local.emit('complete', {'data': 'Operation complete!'})
```

As a practical matter, it wasn’t necessary to set things up this way. We could have very easily just stuck this at the end of the "create_stream" function. But I wanted to demonstrate a simple example of how to compose workflows in Celery and pass information between tasks. The whole point is to learn about these libraries and what they can do!

All we have left is to create a UI for the user to interact with and wire up some javascript to communicate through Socket-IO and handle the events pushed from the server. The UI code is pretty generic so I won’t cover it in this post but feel free to inspect index.html in the repo to see it all put together. I’ll say a few things about the javascript code though. First we need a function that takes a user-provided phrase and posts it to the web server, which initiates the whole process. This is pretty standard JQuery stuff.

```
function twitter() {
phrase = $('#phrase').val();
url = encodeURI('/twitter/' + phrase);
$.ajax({
type: 'POST',
url: url,
success: function(data, status, request) {
$('#twitter-status').html(data);
},
error: function() {
alert('An error occurred submitting the request.');
}
});
}
```

We also need a function to create the chart what we’ll use to display tweets. I pulled some examples from the NVD3 website and came up with the following, which renders a scatter plot using the “data” variable as input (this gets populated later).

```
function loadGraph() {
var chart = nv.models.scatterChart()
.pointRange([2000, 2000])
.color(d3.scale.category10().range());
chart.xAxis.tickFormat(d3.format('.02f'));
chart.yAxis.tickFormat(d3.format('.02f'));
d3.select('#chart svg')
.datum(data)
.transition()
.duration(500)
.call(chart);
nv.utils.windowResize(chart.update);
return chart;
}
```

Additionally there’s a few things going on in the main script body, which runs a single time when the page is first loaded. First, we open a socket to establish a persistent connection with the server.

```
var socket = io.connect('http://' + document.domain + ':' + location.port);
```

Next we need to create an event handler that responds to a "tweet" message from the server. The event handler does a few things:

- Update some labels with a new status
- Add a new JSON record to the data structure used by the graph
- Redraw the graph to display the new data
- Append the raw text from the tweet to a display panel

```
socket.on('tweet', function(msg) {
$('#phrase').val('');
$('#twitter-status').html(
'Connection established. Streaming for 60 seconds (currently at ' + msg.id + ')...');
sentiment = parseInt(msg.sentiment);
x = parseFloat(msg.x);
y = parseFloat(msg.y);
data[sentiment].values.push({
id: msg.id,
x: x,
y: y,
size: 2000,
shape: "circle"});
loadGraph();
$('#twitter-results').append(
'<br>' + $('<div/>').text('(' + x.toFixed(2) + ', ' + y.toFixed(2) + ') ' + msg.text).html());
});
```

We also need an event handler that responds to the "complete" message.

```
socket.on('complete', function(msg) {
$('#twitter-status').html(msg.data);
});
```

The last step is just to wire up the beginning click event and initiate the graph.

```
$('#begin').click(twitter);
nv.addGraph(loadGraph);
```

That’s essentially it. If you want to run it yourself, the easiest way would be to download the source code here and follow the instructions in README.md. If all goes well, you should see something like this.

There are a fair amount of details that I glossed over in the interest of not making this post too long, but if you’d like to learn more about any of the libraries I used I’d highly encourage spinning through the documentation for that library. Pretty much all of them are well-documented with lots of example code, which is how I learned a lot of what I needed to know to make this app.

That wraps up the walk-through. As I said at the top, experimenting with new technologies is a great way to learn. By combining several new technologies at once, it forces even greater comprehension. This app is pretty bare-bones and not all that useful in its current state beyond serving as a learning tool, but there are a million ways it could be extended. If you find this stuff interesting, feel free to fork the repo and make your own version!

]]>But there’s more to it than that. The common link between each story is a theme that the seeds of innovation are not random, and there are some commonalities that increase the likelihood not just of having a breakthrough but making sure it takes off. Throughout the book, Isaacson frames each story in a similar narrative – that innovation is a collaborative process, and the legend of the lone hacker changing the world on his or her own is a cultural myth. As a telling of historical events, this could be viewed as problematic. There are probably lots of examples of important research, discoveries, inventions etc. that were left out of the book. It would be perfectly reasonable to argue that the sample set used in the book is biased toward examples that fit the narrative that the author wanted to tell. For this reason I would not characterize it as a history book, but recognizing this fact doesn’t diminish its value, because the questions it raises around innovation are arguably more interesting.

It’s worth pausing for a moment to consider a relatively foundational question – what is innovation? How do we define it? Why is it so desirable? As far as I can tell, there’s no single agreed-upon definition. According to Wikipedia, the most “complete” way to describe innovation is:

…production or adoption, assimilation, and exploitation of a value-added novelty in economic and social spheres; renewal and enlargement of products, services, and markets; development of new methods of production; and establishment of new management systems. It is both a process and an outcome.

Maybe a good way to summarize it is that innovation unlocks some sort of economic value by doing something new/different from what existed before. You may notice that this definition is highly market-centric. There's no mention of things like ground-breaking scientific discoveries. I think the key realization here is that new discoveries/inventions aren’t automatically innovative – you have to do something WITH them that enriches people’s lives.

Let’s do a hypothetical thought experiment. Imagine that I write a research paper that demonstrably shows how to create an effective vaccine for malaria, but it’s published in an academic journal where it goes largely unnoticed. Was it innovative? What if someone follows up on my work years later and eventually, after several iterations, it leads to the mass production of a vaccine that saves lives? Based on the framing of innovation in the book, it seems like the right conclusion is fairly straightforward - new science, no matter how important or significant, is only innovative if someone ultimately does something with it for the betterment of humanity.

I think what Isaacson was trying to convey is that to be truly innovative, it’s not enough to be a genius. Having a big impact goes beyond discovery. You also need to be able to communicate why it’s important, build a product/service around it, find an appropriate market, convince people it’s worthwhile etc. Isaacson uses this argument as rationale for his thesis, but here I think we diverge a bit. I think innovation isn’t by nature *collaborative*, but rather *multidisciplinary*. Innovation requires ability in lots of different areas. Isaacson himself references this idea when he points out that innovative people are often operating at the intersection of the sciences and the humanities. One reason it may seem like innovation requires collaboration is because people that excel at many disciplines are very rare.

The multidisciplinary nature of innovation reminds me of the teachings of someone I enjoy reading about immensely, Warren Buffet’s partner Charlie Munger. Charlie speaks often about the benefits of being well-versed in a wide range of subject areas and developing a latticework of mental models, a network of concepts and their relationships to one another that captures the really important ideas from every major discipline. I think having access to a robust set of mental models, or partnering with others whose mental models complement your own, is probably one of the keys to successful innovation.

]]>The original approach that Cesar used was to pick the most interesting or memorable moment each day and record that, because it would "nudge" you to do interesting things so you've got something worth recording. But there are lots of different ways to approach this. It could be random, otherwise mundane moments. It could be snapshots of your kids, or a selfie, or anything really. What I've found most fascinating about this experiment is how it doesn't really matter what you record. That one-second clip is enough to sort of "jog" your memory and surface details about that day that otherwise might have been hard to remember. I don't know if the effect is quite the same watching someone else's life play out as it is for your own. There's so much context around each clip that only you will be aware of. One might think that a single second of video isn't much, but it's interesting how much can be recalled just from that. It's like a trigger that loads bits and pieces from your brain's long-term storage for that day and puts them into working memory. I would probably never be able to remember what I did two Thursdays ago, but I can see the moment I recorded that day and go "oh yeah, we did x and then y and then z happened". And when you put it all together, the result is mesmerizing.

The hardest part is just remembering to record something every day, but this is also one of the benefits. By forcing yourself to think about finding a moment to record, you'll unconsciously spend more time in the present and less time worrying about things that don't really matter. I don't want to oversell this with some deep philosophical argument but anecdotally my experience has been that just thinking about finding a snapshot in time to "immortalize" each day puts a lot of things in perspective.

If I've successfully convinced you to give this a try, you might be thinking about how it would be a pain in the ass to record and stitch together one-second video clips every day. And you're right, it would be. Fortunately, there's an app for that. It's not free, but the cost is marginal and well worth the time savings (and to be clear, I have no involvement or personal stake in this app, just passing along a recommendation). I've only used the iOS version but I think it's available for Android as well.

I can't wait to see what my highlight reel looks like years from now. I think seeing my kids grow up day by day over a period of months or years is going to be a surreal experience. So much of what we do and experience each day is lost, but with this one simple idea, it's possible to reclaim a small piece of it.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We've now reached the last post in this series! It's been an interesting journey. Andrew's class was really well-done and translating it all to python has been a fun experience. In this final installment we'll cover the last two topics in the course - anomaly detection and recommendation systems. We'll implement an anomaly detection algorithm using a Gaussian model and apply it to detect failing servers on a network. We'll also see how to build a recommendation system using collaborative filtering and apply it to a movie recommendations data set. As always, it helps to follow along using the exercise text for the course (posted here).

Our first task is to use a Gaussian model to detect if an unlabeled example from a data set should be considered an anomaly. We have a simple 2-dimensional data set to start off with so we can easily visualize what the algorithm is doing. Let's pull in and plot the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
data = loadmat('data/ex8data1.mat')
X = data['X']
X.shape
```

(307L, 2L)

```
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:,0], X[:,1])
```

It appears that there's a pretty tight cluster in the center with several values further out away from the cluster. In this simple example, these could be considered anomalies. To find out, we're tasked with estimating a Gaussian distribution for each feature in the data. You may recall that to define a probability distribution we need two things - mean and variance. To accomplish this we'll create a simple function that calculates the mean and variance for each feature in our data set.

```
def estimate_gaussian(X):
mu = X.mean(axis=0)
sigma = X.var(axis=0)
return mu, sigma
mu, sigma = estimate_gaussian(X)
mu, sigma
```

(array([ 14.11222578, 14.99771051]), array([ 1.83263141, 1.70974533]))

Now that we have our model parameters, we need to determine a probability threshold which indicates that an example should be considered an anomaly. To do this, we need to use a set of labeled validation data (where the true anomalies have been marked for us) and test the model's performance at identifying those anomalies given different threshold values.

```
Xval = data['Xval']
yval = data['yval']
Xval.shape, yval.shape
```

((307L, 2L), (307L, 1L))

We also need a way to calculate the probability that a data point belongs to a normal distribution given some set of parameters. Fortunately SciPy has this built-in.

```
from scipy import stats
dist = stats.norm(mu[0], sigma[0])
dist.pdf(X[:,0])[0:50]
```

array([ 0.183842 , 0.20221694, 0.21746136, 0.19778763, 0.20858956, 0.21652359, 0.16991291, 0.15123542, 0.1163989 , 0.1594734 , 0.21716057, 0.21760472, 0.20141857, 0.20157497, 0.21711385, 0.21758775, 0.21695576, 0.2138258 , 0.21057069, 0.1173018 , 0.20765108, 0.21717452, 0.19510663, 0.21702152, 0.17429399, 0.15413455, 0.21000109, 0.20223586, 0.21031898, 0.21313426, 0.16158946, 0.2170794 , 0.17825767, 0.17414633, 0.1264951 , 0.19723662, 0.14538809, 0.21766361, 0.21191386, 0.21729442, 0.21238912, 0.18799417, 0.21259798, 0.21752767, 0.20616968, 0.21520366, 0.1280081 , 0.21768113, 0.21539967, 0.16913173])

In case it isn't clear, we just calculated the probability that each of the first 50 instances of our data set's first dimension belong to the distribution that we defined earlier by calculating the mean and variance for that dimension. Essentially it's computing how far each instance is from the mean and how that compares to the "typical" distance from the mean for this data.

Let's compute and save the probability density of each of the values in our data set given the Gaussian model parameters we calculated above.

```
p = np.zeros((X.shape[0], X.shape[1]))
p[:,0] = stats.norm(mu[0], sigma[0]).pdf(X[:,0])
p[:,1] = stats.norm(mu[1], sigma[1]).pdf(X[:,1])
p.shape
```

(307L, 2L)

We also need to do this for the validation set (using the same model parameters). We'll use these probabilities combined with the true label to determine the optimal probability threshold to assign data points as anomalies.

```
pval = np.zeros((Xval.shape[0], Xval.shape[1]))
pval[:,0] = stats.norm(mu[0], sigma[0]).pdf(Xval[:,0])
pval[:,1] = stats.norm(mu[1], sigma[1]).pdf(Xval[:,1])
```

Next, we need a function that finds the best threshold value given the probability density values and true labels. To do this we'll calculate the F1 score for varying values of epsilon. F1 is a function of the number of true positives, false positives, and false negatives.

```
def select_threshold(pval, yval):
best_epsilon = 0
best_f1 = 0
f1 = 0
step = (pval.max() - pval.min()) / 1000
for epsilon in np.arange(pval.min(), pval.max(), step):
preds = pval < epsilon
tp = np.sum(np.logical_and(preds == 1, yval == 1)).astype(float)
fp = np.sum(np.logical_and(preds == 1, yval == 0)).astype(float)
fn = np.sum(np.logical_and(preds == 0, yval == 1)).astype(float)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = (2 * precision * recall) / (precision + recall)
if f1 > best_f1:
best_f1 = f1
best_epsilon = epsilon
return best_epsilon, best_f1
epsilon, f1 = select_threshold(pval, yval)
epsilon, f1
```

(0.0095667060059568421, 0.7142857142857143)

Finally, we can apply the threshold to the data set and visualize the results.

```
# indexes of the values considered to be outliers
outliers = np.where(p < epsilon)
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:,0], X[:,1])
ax.scatter(X[outliers[0],0], X[outliers[0],1], s=50, color='r', marker='o')
```

Not bad! The points in red are the ones that were flagged as outliers. Visually these seem pretty reasonable. The top right point that has some separation (but was not flagged) may be an outlier too, but it's fairly close. There's another example in the text of applying this to a higher-dimensional data set, but since it's a trivial extension of the two-dimensional example we'll move on to the last section.

Recommendation engines use item and user-based similarity measures to examine a user's historical preferences to make recommendations for new "things" the user might be interested in. In this exercise we'll implement a particular recommendation algorithm called collaborative filtering and apply it to a data set of movie ratings. Let's first load and examine the data we'll be working with.

```
data = loadmat('data/ex8_movies.mat')
data
```

{'R': array([[1, 1, 0, ..., 1, 0, 0], [1, 0, 0, ..., 0, 0, 1], [1, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'Y': array([[5, 4, 0, ..., 5, 0, 0], [3, 0, 0, ..., 0, 0, 5], [4, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Thu Dec 1 17:19:26 2011', '__version__': '1.0'}

Y is a (number of movies x number of users) array containing ratings from 1 to 5. R is an "indicator" array containing binary values indicating if a user has rated a movie or not. Both should have the same shape.

```
Y = data['Y']
R = data['R']
Y.shape, R.shape
```

((1682L, 943L), (1682L, 943L))

We can look at the average rating for a movie by averaging over a row in Y for indexes where a rating is present.

```
Y[1,R[1,:]].mean()
```

2.5832449628844114

We can also try to "visualize" the data by rendering the matrix as if it were an image. We can't glean too much from this but it does give us an idea of a relative density of ratings across users and movies.

```
fig, ax = plt.subplots(figsize=(12,12))
ax.imshow(Y)
ax.set_xlabel('Users')
ax.set_ylabel('Movies')
fig.tight_layout()
```

Next we're going to implement a cost function for collaborative filtering. Intuitively, the "cost" is the degree to which a set of movie rating predictions deviate from the true predictions. The cost equation is given in the exercise text. It is based on two sets of parameter matrices called X and Theta in the text. These are "unrolled" into the "params" input so that we can use SciPy's optimization package later on. Note that I've included the array/matrix shapes in comments to help illustrate how the matrix interactions work.

```
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
return J
```

In order to test this, we're provided with a set of pre-trained parameters that we can evaluate. To keep the evaluation time down, we'll look at just a small sub-set of the data.

```
users = 4
movies = 5
features = 3
params_data = loadmat('data/ex8_movieParams.mat')
X = params_data['X']
Theta = params_data['Theta']
X_sub = X[:movies, :features]
Theta_sub = Theta[:users, :features]
Y_sub = Y[:movies, :users]
R_sub = R[:movies, :users]
params = np.concatenate((np.ravel(X_sub), np.ravel(Theta_sub)))
cost(params, Y_sub, R_sub, features)
```

22.224603725685675

This answer matches what the exercise text said we're supposed to get. Next we need to implement the gradient computations. Just like we did with the neural networks implementation in exercise 4, we'll extend the cost function to also compute the gradients.

```
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# calculate the gradients
X_grad = error * Theta
Theta_grad = error.T * X
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features)
J, grad
```

(22.224603725685675, array([ -2.52899165, 7.57570308, -1.89979026, -0.56819597, 3.35265031, -0.52339845, -0.83240713, 4.91163297, -0.76677878, -0.38358278, 2.26333698, -0.35334048, -0.80378006, 4.74271842, -0.74040871, -10.5680202 , 4.62776019, -7.16004443, -3.05099006, 1.16441367, -3.47410789, 0. , 0. , 0. , 0. , 0. , 0. ]))

Our next step is to add regularization to both the cost and gradient calculations. We'll create one final regularized version of the function (note that this version includes an additional learning rate parameter called "lambda").

```
def cost(params, Y, R, num_features, learning_rate):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# add the cost regularization
J = J + ((learning_rate / 2) * np.sum(np.power(Theta, 2)))
J = J + ((learning_rate / 2) * np.sum(np.power(X, 2)))
# calculate the gradients with regularization
X_grad = (error * Theta) + (learning_rate * X)
Theta_grad = (error.T * X) + (learning_rate * Theta)
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features, 1.5)
J, grad
```

(31.344056244274221, array([ -0.95596339, 6.97535514, -0.10861109, 0.60308088, 2.77421145, 0.25839822, 0.12985616, 4.0898522 , -0.89247334, 0.29684395, 1.06300933, 0.66738144, 0.60252677, 4.90185327, -0.19747928, -10.13985478, 2.10136256, -6.76563628, -2.29347024, 0.48244098, -2.99791422, -0.64787484, -0.71820673, 1.27006666, 1.09289758, -0.40784086, 0.49026541]))

This result again matches up with the expected output from the exercise code, so it looks like the regularization is working. Before we train the model, we have one final step. We're tasked with creating our own movie ratings so we can use the model to generate personalized recommendations. A file is provided for us that links the movie index to its title. Let's load the file into a dictionary and use some sample ratings provided in the exercise.

```
movie_idx = {}
f = open('data/movie_ids.txt')
for line in f:
tokens = line.split(' ')
tokens[-1] = tokens[-1][:-1]
movie_idx[int(tokens[0]) - 1] = ' '.join(tokens[1:])
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
print('Rated {0} with {1} stars.'.format(movie_idx[0], str(int(ratings[0]))))
print('Rated {0} with {1} stars.'.format(movie_idx[6], str(int(ratings[6]))))
print('Rated {0} with {1} stars.'.format(movie_idx[11], str(int(ratings[11]))))
print('Rated {0} with {1} stars.'.format(movie_idx[53], str(int(ratings[53]))))
print('Rated {0} with {1} stars.'.format(movie_idx[63], str(int(ratings[63]))))
print('Rated {0} with {1} stars.'.format(movie_idx[65], str(int(ratings[65]))))
print('Rated {0} with {1} stars.'.format(movie_idx[68], str(int(ratings[68]))))
print('Rated {0} with {1} stars.'.format(movie_idx[97], str(int(ratings[97]))))
print('Rated {0} with {1} stars.'.format(movie_idx[182], str(int(ratings[182]))))
print('Rated {0} with {1} stars.'.format(movie_idx[225], str(int(ratings[225]))))
print('Rated {0} with {1} stars.'.format(movie_idx[354], str(int(ratings[354]))))
```

Rated Toy Story (1995) with 4 stars. Rated Twelve Monkeys (1995) with 3 stars. Rated Usual Suspects, The (1995) with 5 stars. Rated Outbreak (1995) with 4 stars. Rated Shawshank Redemption, The (1994) with 5 stars. Rated While You Were Sleeping (1995) with 3 stars. Rated Forrest Gump (1994) with 5 stars. Rated Silence of the Lambs, The (1991) with 2 stars. Rated Alien (1979) with 4 stars. Rated Die Hard 2 (1990) with 5 stars. Rated Sphere (1998) with 5 stars.

We can add this custom ratings vector to the data set so it gets included in the model.

```
R = data['R']
Y = data['Y']
Y = np.append(Y, ratings, axis=1)
R = np.append(R, ratings != 0, axis=1)
```

We're now ready to train the collaborative filtering model. We're going to normalize the ratings and then run the optimization routine using our cost function, parameter vector, and data matrices at inputs.

```
from scipy.optimize import minimize
movies = Y.shape[0]
users = Y.shape[1]
features = 10
learning_rate = 10.
X = np.random.random(size=(movies, features))
Theta = np.random.random(size=(users, features))
params = np.concatenate((np.ravel(X), np.ravel(Theta)))
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
idx = np.where(R[i,:] == 1)[0]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx] - Ymean[i]
fmin = minimize(fun=cost, x0=params, args=(Ynorm, R, features, learning_rate),
method='CG', jac=True, options={'maxiter': 100})
fmin
```

status: 1 success: False njev: 149 nfev: 149 fun: 38953.88249706676 x: array([-0.07177334, -0.08315075, 0.1081135 , ..., 0.1817828 , 0.16873062, 0.03383596]) message: 'Maximum number of iterations has been exceeded.' jac: array([ 0.01833555, 0.07377974, 0.03999323, ..., -0.00970181, 0.00758961, -0.01181811])

Since everything was "unrolled" for the optimization routine to work properly, we need to reshape our matrices back to their original dimensions.

```
X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))
Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))
X.shape, Theta.shape
```

((1682L, 10L), (944L, 10L))

Our trained parameters are now in X and Theta. We can use these to create some recommendations for the user we added earlier.

```
predictions = X * Theta.T
my_preds = predictions[:, -1] + Ymean
sorted_preds = np.sort(my_preds, axis=0)[::-1]
sorted_preds[:10]
```

matrix([[ 5.00000264], [ 5.00000249], [ 4.99999831], [ 4.99999671], [ 4.99999659], [ 4.99999253], [ 4.99999238], [ 4.9999915 ], [ 4.99999019], [ 4.99998643]]

That gives us an ordered list of the top ratings, but we lost what index those ratings are for. We actually need to use argsort so we know what movie the predicted rating corresponds to.

```
idx = np.argsort(my_preds, axis=0)[::-1]
print("Top 10 movie predictions:")
for i in range(10):
j = int(idx[i])
print('Predicted rating of {0} for movie {1}.'.format(str(float(my_preds[j])), movie_idx[j]))
```

Top 10 movie predictions: Predicted rating of 5.00000264002 for movie Prefontaine (1997). Predicted rating of 5.00000249142 for movie Santa with Muscles (1996). Predicted rating of 4.99999831018 for movie Marlene Dietrich: Shadow and Light (1996) . Predicted rating of 4.9999967124 for movie Saint of Fort Washington, The (1993). Predicted rating of 4.99999658864 for movie They Made Me a Criminal (1939). Predicted rating of 4.999992533 for movie Someone Else's America (1995). Predicted rating of 4.99999238336 for movie Great Day in Harlem, A (1994). Predicted rating of 4.99999149604 for movie Star Kid (1997). Predicted rating of 4.99999018592 for movie Aiqing wansui (1994). Predicted rating of 4.99998642746 for movie Entertaining Angels: The Dorothy Day Story (1996).

The recommended movies don't actually line up that well with what's in the exercise text. The reason why isn't too clear and I haven't found anything to account for it, but it's possible there's a mistake in the code somewhere. Bonus points if someone spots an error and points it out! Still, even if there's some minor difference the bulk of the example is accurate.

That concludes the last exercise! When I started this series, my goal was to become more proficient in python as well as refine the machine learning knowledge I'd gained from taking Andrew's class. I feel confident that I accomplished that goal. My hope though is that it's just as valuable to read as it was for me to create.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We're now down to the last two posts in this series! In this installment we'll cover two fascinating topics: K-means clustering and principal component analysis (PCA). K-means and PCA are both examples of unsupervised learning techniques. Unsupervised learning problems do not have any label or target for us to learn from to make predictions, so unsupervised algorithms instead attempt to learn some interesting structure in the data itself. We'll first implement K-means and see how it can be used it to compress an image. We'll also experiment with PCA to find a low-dimensional representation of images of faces. As always, it helps to follow along using the exercise text for the course (posted here).

To start out we're going to implement and apply K-means to a simple 2-dimensional data set to gain some intuition about how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances together into clusters. The algorithm starts by guessing the initial centroids for each cluster, and then repeatedly assigns instances to the nearest cluster and re-computes the centroid of that cluster. The first piece that we're going to implement is a function that finds the closest centroid for each instance in the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
def find_closest_centroids(X, centroids):
m = X.shape[0]
k = centroids.shape[0]
idx = np.zeros(m)
for i in range(m):
min_dist = 1000000
for j in range(k):
dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
if dist < min_dist:
min_dist = dist
idx[i] = j
return idx
```

Let's test the function to make sure it's working as expected. We'll use the test case provided in the exercise.

```
data = loadmat('data/ex7data2.mat')
X = data['X']
initial_centroids = initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]
```

array([ 0., 2., 1.])

The output matches the expected values in the text (remember our arrays are zero-indexed instead of one-indexed so the values are one lower than in the exercise). Next we need a function to compute the centroid of a cluster. The centroid is simply the mean of all of the examples currently assigned to the cluster.

```
def compute_centroids(X, idx, k):
m, n = X.shape
centroids = np.zeros((k, n))
for i in range(k):
indices = np.where(idx == i)
centroids[i,:] = (np.sum(X[indices,:], axis=1) / len(indices[0])).ravel()
return centroids
compute_centroids(X, idx, 3)
```

array([[ 2.42830111, 3.15792418], [ 5.81350331, 2.63365645], [ 7.11938687, 3.6166844 ]])

This output also matches the expected values from the exercise. So far so good. The next part involves actually running the algorithm for some number of iterations and visualizing the result. This step was implmented for us in the exercise, but since it's not that complicated I'll build it here from scratch. In order to run the algorithm we just need to alternate between assigning examples to the nearest cluster and re-computing the cluster centroids.

```
def run_k_means(X, initial_centroids, max_iters):
m, n = X.shape
k = initial_centroids.shape[0]
idx = np.zeros(m)
centroids = initial_centroids
for i in range(max_iters):
idx = find_closest_centroids(X, centroids)
centroids = compute_centroids(X, idx, k)
return idx, centroids
idx, centroids = run_k_means(X, initial_centroids, 10)
```

We can now plot the result using color coding to indicate cluster membership.

```
cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()
```

One step we skipped over is a process for initializing the centroids. This can affect the convergence of the algorithm. We're tasked with creating a function that selects random examples and uses them as the initial centroids.

```
def init_centroids(X, k):
m, n = X.shape
centroids = np.zeros((k, n))
idx = np.random.randint(0, m, k)
for i in range(k):
centroids[i,:] = X[idx[i],:]
return centroids
init_centroids(X, 3)
```

array([[ 1.15354031, 4.67866717], [ 6.27376271, 2.24256036], [ 2.20960296, 4.91469264]])

Our next task is to apply K-means to image compression. The intuition here is that we can use clustering to find a small number of colors that are most representative of the image, and map the original 24-bit colors to a lower-dimensional color space using the cluster assignments. Here's the image we're going to compress.

The raw pixel data has been pre-loaded for us so let's pull it in.

```
image_data = loadmat('data/bird_small.mat')
image_data
```

{'A': array([[[219, 180, 103], [230, 185, 116], [226, 186, 110], ..., [ 14, 15, 13], [ 13, 15, 12], [ 12, 14, 12]], ..., [[ 15, 19, 19], [ 20, 20, 18], [ 18, 19, 17], ..., [ 65, 43, 39], [ 58, 37, 38], [ 52, 39, 34]]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Tue Jun 5 04:06:24 2012', '__version__': '1.0'}

We can quickly look at the shape of the data to validate that it looks like what we'd expect for an image.

```
A = image_data['A']
A.shape
```

(128L, 128L, 3L)

Now we need to apply some pre-processing to the data and feed it into the K-means algorithm.

```
# normalize value ranges
A = A / 255.
# reshape the array
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
# randomly initialize the centroids
initial_centroids = init_centroids(X, 16)
# run the algorithm
idx, centroids = run_k_means(X, initial_centroids, 10)
# get the closest centroids one last time
idx = find_closest_centroids(X, centroids)
# map each pixel to the centroid value
X_recovered = centroids[idx.astype(int),:]
# reshape to the original dimensions
X_recovered = np.reshape(X_recovered, (A.shape[0], A.shape[1], A.shape[2]))
plt.imshow(X_recovered)
```

Cool! You can see that we created some artifacts in the compression but the main features of the image are still there despite mapping the original image to only 16 colors. That's it for K-means. We'll now move on to principal component analysis.

PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set. It can be used for dimension reduction among other things. In this exercise we're first tasked with implementing PCA and applying it to a simple 2-dimensional data set to see how it works. Let's start off by loading and visualizing the data set.

```
data = loadmat('data/ex7data1.mat')
X = data['X']
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:, 0], X[:, 1])
```

The algorithm for PCA is fairly simple. After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data. Since numpy already has built-in functions to calculate the covariance and SVD of a matrix, we'll use those rather than build from scratch.

```
def pca(X):
# normalize the features
X = (X - X.mean()) / X.std()
# compute the covariance matrix
X = np.matrix(X)
cov = (X.T * X) / X.shape[0]
# perform SVD
U, S, V = np.linalg.svd(cov)
return U, S, V
U, S, V = pca(X)
U, S, V
```

(matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]), array([ 1.43584536, 0.56415464]), matrix([[-0.79241747, -0.60997914], [-0.60997914, 0.79241747]]))

Now that we have the principal components (matrix U), we can use these to project the original data into a lower-dimensional space. For this task we'll implement a function that computes the projection and selects only the top K components, effectively reducing the number of dimensions.

```
def project_data(X, U, k):
U_reduced = U[:,:k]
return np.dot(X, U_reduced)
Z = project_data(X, U, 1)
Z
```

matrix([[-4.74689738], [-7.15889408], [-4.79563345], [-4.45754509], [-4.80263579], ..., [-6.44590096], [-2.69118076], [-4.61386195], [-5.88236227], [-7.76732508]])

We can also attempt to recover the original data by reversing the steps we took to project it.

```
def recover_data(Z, U, k):
U_reduced = U[:,:k]
return np.dot(Z, U_reduced.T)
X_recovered = recover_data(Z, U, 1)
X_recovered
```

matrix([[ 3.76152442, 2.89550838], [ 5.67283275, 4.36677606], [ 3.80014373, 2.92523637], [ 3.53223661, 2.71900952], [ 3.80569251, 2.92950765], ..., [ 5.10784454, 3.93186513], [ 2.13253865, 1.64156413], [ 3.65610482, 2.81435955], [ 4.66128664, 3.58811828], [ 6.1549641 , 4.73790627]])

If we then attempt to visualize the recovered data, the intuition behind how the algorithm works becomes really obvious.

```
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X_recovered[:, 0], X_recovered[:, 1])
```

Notice how the points all seem to be compressed down to an invisible line. That invisible line is essentially the first principal component. The second principal component, which we cut off when we reduced the data to one dimension, can be thought of as the variation orthogonal to that line. Since we lost that information, our reconstruction can only place the points relative to the first principal component.

Our last task in this exercise is to apply PCA to images of faces. By using the same dimension reduction techniques we can capture the "essence" of the images using much less data than the original images.

```
faces = loadmat('data/ex7faces.mat')
X = faces['X']
X.shape
```

(5000L, 1024L)

The exercise code includes a function that will render the first 100 faces in the data set in a grid. Rather than try to re-produce that here, you can look in the exercise text for an example of what they look like. We can at least render one image fairly easily though.

```
face = np.reshape(X[3,:], (32, 32))
plt.imshow(face)
```

Yikes, that looks awful! These are only 32 x 32 grayscale images though (it's also rendering sideways, but we can ignore that for now). Our next step is to run PCA on the faces data set and take the top 100 principal components.

```
U, S, V = pca(X)
Z = project_data(X, U, 100)
```

Now we can attempt to recover the original structure and render it again.

```
X_recovered = recover_data(Z, U, 100)
face = np.reshape(X_recovered[3,:], (32, 32))
plt.imshow(face)
```

Notice that we lost some detail, though not as much as you might expect for a 10x reduction in the number of dimensions.

That concludes exercise 7! In the final exercise we'll implement algorithms for anomaly detection and build a recommendation system using collaborative filtering.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

We're now hitting the home stretch of both the course content and this series of blog posts. In this exercise, we'll be using support vector machines (SVMs) to build a spam classifier. We'll start with SVMs on some simple 2D data sets to see how they work. Then we'll look at a set of email data and build a classifier on the processed emails using a SVM to determine if they are spam or not. As always, it helps to follow along using the exercise text for the course (posted here).

Before jumping into the code, let's quickly recap what an SVM is and why it's worth learning about. Broadly speaking, SVMs are a class of supervised learning algorithm that builds a representation of the examples in the training data as points in space, mapped so that the examples belonging to each class in the data are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall on. SVMs are a binary classifcation tool by default, although there are ways to use them in multi-class scenarios. SVMs can also handle non-linear classification using something called the kernel trick to project the data into a high-dimensional space before attempting to find a hyperplane. SVMs are a powerful class of algorithms and are used often in practical machine learning applications.

The first thing we're going to do is look at a simple 2-dimensional data set and see how a linear SVM works on the data set for varying values of C (similar to the regularization term in linear/logistic regression). Let's load the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
raw_data = loadmat('data/ex6data1.mat')
raw_data
```

{'X': array([[ 1.9643 , 4.5957 ], [ 2.2753 , 3.8589 ], [ 2.9781 , 4.5651 ], ..., [ 0.9044 , 3.0198 ], [ 0.76615 , 2.5899 ], [ 0.086405, 4.1045 ]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:28:43 2011', '__version__': '1.0', 'y': array([[1], [1], [1], ..., [0], [0], [1]], dtype=uint8)}

We'll visualize it as a scatter plot where the class label is denoted by a symbol ('+' for positive, 'o' for negative).

```
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
```

Notice that there is one outlier positive example that sits apart from the others. The classes are still linearly separable but it's a very tight fit. We're going to train a linear support vector machine to learn the class boundary. In this exercise we're not tasked with implementing an SVM from scratch, so I'm going to use the one built into scikit-learn.

```
from sklearn import svm
svc = svm.LinearSVC(C=1, loss='hinge', max_iter=1000)
svc
```

LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)

For the first experiment we'll use C=1 and see how it performs.

```
svc.fit(data[['X1', 'X2']], data['y'])
svc.score(data[['X1', 'X2']], data['y'])
```

0.98039215686274506

It appears that it mis-classified the outlier. Let's see what happens with a larger value of C.

```
svc2 = svm.LinearSVC(C=100, loss='hinge', max_iter=1000)
svc2.fit(data[['X1', 'X2']], data['y'])
svc2.score(data[['X1', 'X2']], data['y'])
```

1.0

This time we got a perfect classification of the training data, however by increasing the value of C we've created a decision boundary that is no longer a natural fit for the data. We can visualize this by looking at the confidence level for each class prediction, which is a function of the point's distance from the hyperplane.

```
data['SVM 1 Confidence'] = svc.decision_function(data[['X1', 'X2']])
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 1 Confidence'], cmap='seismic')
ax.set_title('SVM (C=1) Decision Confidence')
```

```
data['SVM 2 Confidence'] = svc2.decision_function(data[['X1', 'X2']])
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 2 Confidence'], cmap='seismic')
ax.set_title('SVM (C=100) Decision Confidence')
```

The difference is a bit subtle but look at the color of the points near the boundary. In the first image the points near the boundary are a strong red or blue, indicating that they're a solid distance from the hyperplane. This is not the case in the second image, where a number of points are nearly white, indicating that they are directly adjacent to the hyperplane.

Now we're going to move from a linear SVM to one that's capable of non-linear classification using kernels. We're first tasked with implementing a gaussian kernel function. Although scikit-learn has a gaussian kernel built in, for transparency we'll implement one from scratch.

```
def gaussian_kernel(x1, x2, sigma):
return np.exp(-(np.sum((x1 - x2) ** 2) / (2 * (sigma ** 2))))
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
gaussian_kernel(x1, x2, sigma)
```

0.32465246735834974

That result matches the expected value from the exercise. Next we're going to examine another data set, this time with a non-linear decision boundary.

```
raw_data = loadmat('data/ex6data2.mat')
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=30, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=30, marker='o', label='Negative')
ax.legend()
```

For this data set we'll build a support vector machine classifier using the built-in RBF kernel and examine its accuracy on the training data. To visualize the decision boundary, this time we'll shade the points based on the predicted probability that the instance has a negative class label. We'll see from the result that it gets most of them right.

```
svc = svm.SVC(C=100, gamma=10, probability=True)
svc.fit(data[['X1', 'X2']], data['y'])
data['Probability'] = svc.predict_proba(data[['X1', 'X2']])[:,0]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')
```

For the third data set we're given both training and validation sets and tasked with finding optimal hyper-parameters for an SVM model based on validation set performance. Although we could use scikit-learn's built-in grid search to do this quite easily, in the spirit of following the exercise directions we'll implement a simple grid search from scratch.

```
raw_data = loadmat('data/ex6data3.mat')
X = raw_data['X']
Xval = raw_data['Xval']
y = raw_data['y'].ravel()
yval = raw_data['yval'].ravel()
C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
best_score = 0
best_params = {'C': None, 'gamma': None}
for C in C_values:
for gamma in gamma_values:
svc = svm.SVC(C=C, gamma=gamma)
svc.fit(X, y)
score = svc.score(Xval, yval)
if score > best_score:
best_score = score
best_params['C'] = C
best_params['gamma'] = gamma
best_score, best_params
```

(0.96499999999999997, {'C': 0.3, 'gamma': 100})

Now we'll move on to the last part of the exercise. In this part our objective is to use SVMs to build a spam filter. In the exercise text, there's a task involving some text pre-processing to get our data in a format suitable for an SVM to handle. However, the task is pretty trivial (mapping words to an ID from a dictionary that's provided for the exercise) and the rest of the pre-processing steps such as HTML removal, stemming, normalization etc. are already done. Rather than reproduce these pre-processing steps, I'm going to skip ahead to the machine learning task which involves building a classifier from pre-processed train and test data sets consisting of spam and non-spam emails transformed to word occurance vectors.

```
spam_train = loadmat('data/spamTrain.mat')
spam_test = loadmat('data/spamTest.mat')
spam_train
```

{'X': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 1, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:27:25 2011', '__version__': '1.0', 'y': array([[1], [1], [0], ..., [1], [0], [0]], dtype=uint8)}

```
X = spam_train['X']
Xtest = spam_test['Xtest']
y = spam_train['y'].ravel()
ytest = spam_test['ytest'].ravel()
X.shape, y.shape, Xtest.shape, ytest.shape
```

((4000L, 1899L), (4000L,), (1000L, 1899L), (1000L,))

Each document has been converted to a vector with 1,899 dimensions corresponding to the 1,899 words in the vocabulary. The values are binary, indicating the presence or absence of the word in the document. At this point, training and evaluation are just a matter of fitting the testing the classifer.

```
svc = svm.SVC()
svc.fit(X, y)
print('Test accuracy = {0}%'.format(np.round(svc.score(Xtest, ytest) * 100, 2)))
```

Test accuracy = 95.3%

This result is with the default parameters. We could probably get it a bit higher using some parameter tuning, but 95% accuracy still isn't bad.

That concludes exercise 6! In the next exercise we'll perform clustering and image compression with K-Means and principal component analysis.

]]>*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part*

*Part 1 - Simple Linear Regression*

*Part 2 - Multivariate Linear Regression*

*Part 3 - Logistic Regression*

*Part 4 - Multivariate Logistic Regression*

*Part 5 - Neural Networks*

*Part 6 - Support Vector Machines*

*Part 7 - K-Means Clustering & PCA*

*Part 8 - Anomaly Detection & Recommendation*

In part four we wrapped up our implementation of logistic regression by extending our solution to handle multi-class classification and testing it on the hand-written digits data set. Using just logistic regression we were able to hit a classification accuracy of about 97.5%, which is reasonably good but pretty much maxes out what we can achieve with a linear model. In this blog post we'll again tackle the hand-written digits data set, but this time using a feed-forward neural network with backpropagation. We'll implement un-regularized and regularized versions of the neural network cost function and compute gradients via the backpropagation algorithm. Finally, we'll run the algorithm through an optimizer and evaluate the performance of the network on the handwritten digits data set

I'll note up front that the math (and code) in this exercise gets a bit hairy. Implementing a neural network from scratch is not for the faint of heart. For those that venture on, be warned - here be dragons. I also strongly encourage the reader to skim through the corresponding exercise text from Andrew Ng's class. I uploaded a copy here. This text contains all of the equations that we'll be implementing. With that, let's dive in!

Since the data set is the same one we used in the last exercise, we'll re-use the code from last time to load the data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline
data = loadmat('data/ex3data1.mat')
data
```

{'X': array([[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]), '__globals__': [], '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011', '__version__': '1.0', 'y': array([[10], [10], [10], ..., [ 9], [ 9], [ 9]], dtype=uint8)}

Since we're going to need these later (and will use them often), let's create some useful variables up-front.

```
X = data['X']
y = data['y']
X.shape, y.shape
((5000L, 400L), (5000L, 1L))
```

We're also going to need to one-hot encode our labels. One-hot encoding turns a class label \(n\) (out of \(k\) classes) into a vector of length \(k\) where index \(n\) is "hot" (1) while the rest are zero. Scikit-learn has a built in utility we can use for this.

```
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
y_onehot = encoder.fit_transform(y)
y_onehot.shape
(5000L, 10L)
```

```
y[0], y_onehot[0,:]
(array([10], dtype=uint8),
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]))
```

The neural network we're going to build for this exercise has an input layer matching the size of our instance data (400 + the bias unit), a hidden layer with 25 units (26 with the bias unit), and an output layer with 10 units corresponding to our one-hot encoding for the class labels. The first piece we need to implement is a cost function to evaluate the loss for a given set of network parameters. The source mathematical function is in the exercise text, and looks pretty intimidating, but it helps to really break it down into pieces. Here are the functions required to compute the cost.

```
def sigmoid(z):
return 1 / (1 + np.exp(-z))
```

```
def forward_propagate(X, theta1, theta2):
m = X.shape[0]
a1 = np.insert(X, 0, values=np.ones(m), axis=1)
z2 = a1 * theta1.T
a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
z3 = a2 * theta2.T
h = sigmoid(z3)
return a1, z2, a2, z3, h
```

```
def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# reshape the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# run the feed-forward pass
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
# compute the cost
J = 0
for i in range(m):
first_term = np.multiply(-y[i,:], np.log(h[i,:]))
second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))
J += np.sum(first_term - second_term)
J = J / m
return J
```

We've used the sigmoid function before so that's not new. The forward-propagate function computes the hypothesis for each training instance given the current parameters (in other words, given some current state of the network and a set of inputs, it calculates the outputs at each layer in the network). The shape of the hypothesis vector (denoted by \(h\)), which contains the prediction probabilities for each class, should match our one-hot encoding for y. Finally, the cost function runs the forward-propagation step and calculates the error of the hypothesis (predictions) vs. the true label for the instance.

We can test this real quick to convince ourselves that it's working as expected. Seeing the output from intermediate steps is also useful to understand what's going on.

```
# initial setup
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1
# randomly initialize a parameter array of the size of the full network's parameters
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# unravel the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
theta1.shape, theta2.shape
((25L, 401L), (10L, 26L))
```

```
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
a1.shape, z2.shape, a2.shape, z3.shape, h.shape
((5000L, 401L), (5000L, 25L), (5000L, 26L), (5000L, 10L), (5000L, 10L))
```

The cost function, after computing the hypothesis matrix \(h\), applies the cost equation to compute the total error between \(y\) and \(h\).

```
cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
6.8228086634127862
```

Our next step is to add regularization to the cost function, which adds a penalty term to the cost that scales with the magnitude of the parameters. The equation for this looks pretty ugly, but it can be boiled down to just one line of code added to the original cost function. Just add the following right before the return statement.

```
J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))
```

Next up is the backpropagation algorithm. Backpropagation computes the parameter updates that will reduce the error of the network on the training data. The first thing we need is a function that computes the gradient of the sigmoid function we created earlier.

```
def sigmoid_gradient(z):
return np.multiply(sigmoid(z), (1 - sigmoid(z)))
```

Now we're ready to implement backpropagation to compute the gradients. Since the computations required for backpropagation are a superset of those required in the cost function, we're actually going to extend the cost function to also perform backpropagation and return both the cost and the gradients. If you're wondering why I'm not just calling the existing cost function from within the backprop function to make the design more modular, it's because backprop uses a number of other variables calculated inside the cost function. Here's the full implementation. I skipped ahead and added gradient regularization rather than first create an un-regularized version.

```
def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):
##### this section is identical to the cost function logic we already saw #####
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# reshape the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# run the feed-forward pass
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
# initializations
J = 0
delta1 = np.zeros(theta1.shape) # (25, 401)
delta2 = np.zeros(theta2.shape) # (10, 26)
# compute the cost
for i in range(m):
first_term = np.multiply(-y[i,:], np.log(h[i,:]))
second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))
J += np.sum(first_term - second_term)
J = J / m
# add the cost regularization term
J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))
##### end of cost function logic, below is the new part #####
# perform backpropagation
for t in range(m):
a1t = a1[t,:] # (1, 401)
z2t = z2[t,:] # (1, 25)
a2t = a2[t,:] # (1, 26)
ht = h[t,:] # (1, 10)
yt = y[t,:] # (1, 10)
d3t = ht - yt # (1, 10)
z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)
d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)
delta1 = delta1 + (d2t[:,1:]).T * a1t
delta2 = delta2 + d3t.T * a2t
delta1 = delta1 / m
delta2 = delta2 / m
# add the gradient regularization term
delta1[:,1:] = delta1[:,1:] + (theta1[:,1:] * learning_rate) / m
delta2[:,1:] = delta2[:,1:] + (theta2[:,1:] * learning_rate) / m
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))
return J, grad
```

There's a lot going on here so let's try to unpack it a bit. The first half of the function is calculating the error by running the data plus current parameters through the "network" (the forward-propagate function) and comparing the output to the true labels. The total error across the whole data set is represented as \(J\). This is the part we discussed earlier as the cost function.

The rest of the function is essentially answering the question "how can I adjust my parameters to reduce the error the next time I run through the network"? It does this by computing the contributions at each layer to the total error and adjusting appropriately by coming up with a "gradient" matrix (or, how much to change each parameter and in what direction).

The hardest part of the backprop computation (other than understanding WHY we're doing all these calculations) is getting the matrix dimensions right, which is why I annotated some of the calculations with comments showing the resulting shape of the computation. By the way, if you find it confusing when to use A * B vs. np.multiply(A, B), you're not alone. Basically the former is a matrix multiplication and the latter is an element-wise multiplication (unless A or B is a scalar value, in which case it doesn't matter). I wish there was a more concise syntax for this (maybe there is and I'm just not aware of it).

Anyway, let's test it out to make sure the function returns what we're expecting it to.

```
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
J, grad.shape
(6.8281541822949299, (10285L,))
```

We're finally ready to train our network and use it to make predictions. This is roughly similar to the previous exercise with multi-class logistic regression.

```
from scipy.optimize import minimize
# minimize the objective function
fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate),
method='TNC', jac=True, options={'maxiter': 250})
fmin
```

status: 3 success: False nfev: 250 fun: 0.33900736818312283 x: array([ -8.85740564e-01, 2.57420350e-04, -4.09396202e-04, ..., 1.44634791e+00, 1.68974302e+00, 7.10121593e-01]) message: 'Max. number of function evaluations reach' jac: array([ -5.11463703e-04, 5.14840700e-08, -8.18792403e-08, ..., -2.48297749e-04, -3.17870911e-04, -3.31404592e-04]) nit: 21

We put a bound on the number of iterations since the objective function is not likely to completely converge. Our total cost has dropped below 0.5 though so that's a good indicator that the algorithm is working. Let's use the parameters it found and forward-propagate them through the network to get some predictions. We have to reshape the output from the optimizer to match the parameter matrix shapes that our network is expecting, then run the forward propagation to generate a hypothesis for the input data.

```
X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)
y_pred
```

array([[10], [10], [10], ..., [ 9], [ 9], [ 9]], dtype=int64)

Finally we can compute the accuracy to see how well our trained network is doing.

```
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print 'accuracy = {0}%'.format(accuracy * 100)
accuracy = 99.22%
```

And we're done! We've successfully implemented a rudimentary feed-forward neural network with backpropagation and used it to classify images of handwritten digits. It might seem surprising at first that we managed to do this without implementing any classes or data structures resembling a network in any way. Isn't that what neural networks are all about? This was one of the biggest surprises to me when I took the class - how the whole thing basically boils down to a series of matrix multiplications. It turns out that this is by far the most efficient way to solve the problem. In fact if you look at any of the popular deep learning frameworks such as Tensorflow, they're essentially graphs of linear algebra computations. It's a very useful and practical way to think about machine learning algorithms.

That concludes the exercise on neural networks. In the next exercise we'll look at another popular supervised learning algorithm, support vector machines.

]]>