Curious Insight

Language Exploration Using Vector Space Models

John Wittenauer — Fri, 06 Nov 2015 02:20:17 GMT

Natural language processing is a huge sub-field of artificial intelligence that deals with models and representations for natural language. A very common way to represent words, phrases, and documents in modern NLP involves the use of sparse vectors. Here we'll explore a variety of Python libraries that implement various algorithoms related to natural language processing using vector space models.

The first thing we need is some text to work with. NLTK (Natural Language Toolkit) has many popular corpora available for download directly from the API, so let's use that. For this exercise we're going to use the Brown corpus. Start by initiating the download tool and selecting the corpus.

%matplotlib inline
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn  
import nltk  
nltk.download()

Check to make sure that the download was successful and we can access the data.

from nltk.corpus import reuters  
len(reuters.words())

1720901

If you've never heard of the Reuters corpus, it's a collection of more than 10,000 news documents published in 1987 categorized into 90 different topics. The content is obviously a bit dated but it's still interesting to work with because of the size of the collection and the fact that we have labeled topics for each document that we can use for certain types of supervised learning algorithms. The corpus contains over 1.3 million words in total.

NLTK's feature set is focused more on the linguistic aspect of natural language processing than the machine learning aspect, with functions for tasks such as tokenization, stemming, tagging, and parsing. Even so, there are some very useful functions we can take advantage of to get a sense of what's in this corpus. Let's start by tabulating the number of unique words in the corpus.

vocabulary = set(reuters.words())  
len(vocabulary)

There are 41,600 unique tokens in the corpus. This doesn't tell us anything about the distribution of these tokens though. NLTK has a built-in function to compute a frequency distribution for a text corpus.

fdist = nltk.FreqDist(reuters.words())  
fdist.most_common(30)

[(u'.', 94687),
 (u',', 72360),
 (u'the', 58251),
 (u'of', 35979),
 (u'to', 34035),
 (u'in', 26478),
 (u'said', 25224),
 (u'and', 25043),
 (u'a', 23492),
 (u'mln', 18037),
 (u'vs', 14120),
 (u'-', 13705),
 (u'for', 12785),
 (u'dlrs', 11730),
 (u"'", 11272),
 (u'The', 10968),
 (u'000', 10277),
 (u'1', 9977),
 (u's', 9298),
 (u'pct', 9093),
 (u'it', 8842),
 (u';', 8762),
 (u'&', 8698),
 (u'lt', 8694),
 (u'on', 8556),
 (u'from', 7986),
 (u'cts', 7953),
 (u'is', 7580),
 (u'>', 7449),
 (u'that', 7377)]

We can plot these cumulatively to get a sense for how much of the corpus they represent.

fig, ax = plt.subplots(figsize=(16,12))  
ax = fdist.plot(30, cumulative=True)

Just 30 tokens make up around 35% of the entire corpus! Moreover, most of these are things like punctuation and articles such as "and", "to", "of" and so on. This is useful to know as we may want to strip out tokens like these. You might also notice that the word 'the' appears on the list twice. That's because the corpus contains both upper-case and lower-case words, and they are each counted separately. Before we attempt to do anything with this data we'll need to correct these issues.

stopwords = nltk.corpus.stopwords.words()  
cleansed_words = [w.lower() for w in reuters.words() if w.isalnum() and w.lower() not in stopwords]  
vocabulary = set(cleansed_words)  
len(vocabulary)

30618

After converting everything to lowercase, removing punctuation, and removing "stop words" using a pre-defined list of words that do not add any semantic value, we've reduced the vocabulary from almost 42,000 to just over 30,000. Note that we still didn't address things like singular vs. plural being different words. To handle this we'd have to get into topics like stemming, but for now let's leave as-is. Let's look at the top 30 again.

fdist = nltk.FreqDist(cleansed_words)  
fdist.most_common(30)

[(u'said', 25383),
 (u'mln', 18623),
 (u'vs', 14341),
 (u'dlrs', 12417),
 (u'000', 10277),
 (u'1', 9977),
 (u'pct', 9810),
 (u'lt', 8696),
 (u'cts', 8361),
 (u'year', 7529),
 (u'net', 6989),
 (u'2', 6528),
 (u'billion', 5829),
 (u'loss', 5124),
 (u'3', 5091),
 (u'5', 4683),
 (u'would', 4673),
 (u'company', 4670),
 (u'1986', 4392),
 (u'4', 4363),
 (u'shr', 4182),
 (u'inc', 4121),
 (u'bank', 3654),
 (u'7', 3450),
 (u'corp', 3399),
 (u'6', 3376),
 (u'oil', 3272),
 (u'last', 3243),
 (u'8', 3218),
 (u'share', 3160)]

The list is arguably a bit more interesting now. There's a lot more that we could do with NLTK, but since we're interested in using this data to build statistical models, we need to find ways to "vectorize" this data. One common way to represent text data is called the "bag of words" representation. A bag of words represents each document in a corpus as a series of features that ask a question about the document. Most commonly, the features are the collection of all distinct words in the vocabulary of the entire corpus. The values are usually either binary (representing the presence or absence of that word in the document) or a count of the number of times that word appears in the document. A corpus is then represented as a matrix with one row per document and one column per unique word.

To build our initial bag of words matrix, we're going to use some of scikit-learn's built-in text processing capabilities. We'll start off using scikit-learn's CountVectorizer class to transform our corpus into a sparse bag of words representation. CountVectorizer expects as input a list of raw strings containing the documents in the corpus. It takes care of the tokenization, transformation to lowercase, filtering stop words, building the vocabulary etc. It also tabulates occurrance counts per document for each feature.

Since CountVectorizer expects the raw data as input rather than the pre-processed data we were working with in NLTK, we need to create a list of documents to pass to the vectorizer.

files = [f for f in reuters.fileids() if 'training' in f]  
corpus = [reuters.raw(fileids=[f]) for f in files]  
len(corpus)

7769

Let's look at an example from the corpus to get a sense of the what kind of raw text we're dealing with (part of the document has been omitted for brevity).

corpus[0]

u'BAHIA COCOA REVIEW\n  Showers continued throughout the week in\n  the Bahia cocoa zone, alleviating the drought since early\n  January and improving prospects for the coming temporao,\n  although normal humidity levels have not been restored,\n  Comissaria Smith said in its weekly review.\n      The dry period means the temporao will be late this year.\n      Arrivals for the week ended February 22 were 155,221 bags\n  of 60 kilos making a cumulative total for the season of 5.93\n  mln against 5.81 at the same stage last year. Again it seems\n  that cocoa delivered earlier on consignment was included in the\n  arrivals figures.\n      ...Final figures for the period to February 28 are expected to\n  be published by the Brazilian Cocoa Trade Commission after\n  carnival which ends midday on February 27.\n  \n\n'

Now we have the training corpus defined as a list of raw text documents. We can create our vectorizer and pass in the corpus to build our bag of words matrix.

from sklearn.feature_extraction.text import CountVectorizer  
vectorizer = CountVectorizer(stop_words='english')  
X = vectorizer.fit_transform(corpus)  
X

<7769x26001 sparse matrix of type ''
with 426016 stored elements in Compressed Sparse Row format>

The vectorizer stores the data as a sparse matrix since a dense matrix would use way too much space and most of the values would be zero anyway (because each document only contains a small number of the total words in the vocabulary). We can transform this to a numpy array if necessary though.

X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 3, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0]], dtype=int64)

The vectorizer stores the feature names (words) that map to the matrix column indexes. We can inspect those if desired. Note that I'm skipping to index 2000 because if you look at the beginning of the index, it's all numbers! The reuters corpus, being news articles, contains quite a high volume of numeric symbols. It's debatable wether or not we should really include these in the vocabulary, but for now they're there.

vectorizer.get_feature_names()[2000:2030]

[u'aero',
 u'aeroe',
 u'aerojet',
 u'aeronautics',
 u'aeroperu',
 u'aerosol',
 u'aerosols',
 u'aerospace',
 u'aerotech',
 u'aet',
 u'aetna',
 u'afbf',
 u'affadavit',
 u'affair',
 u'affairs',
 u'affandi',
 u'affect',
 u'affected',
 u'affecting',
 u'affects',
 u'affiliate',
 u'affiliated',
 u'affiliates',
 u'affiliation',
 u'affinerie',
 u'affirmation',
 u'affirmative',
 u'affirmed',
 u'afflicted',
 u'afford']

One potential issue with this representation is that it holds an in-memory mapping of the vocabulary to the document matrix that can get unwieldy on large datasets. This approach also doesn't work when training in an on-line fashion since it needs to build the entire vocabulary ahead of time. There's another vectorization algorithm implemented in scikit-learn that uses feature hashing to build the matrix in a stateless manner. This HashingVectorizer class solves both of the above problems, however is comes with some tradeoffs - it's not possible to "inverse transform" the vector back to the original words, and there's a possibility of collisions that could cause some information to be lost.

from sklearn.feature_extraction.text import HashingVectorizer  
hv = HashingVectorizer()  
X_hash = hv.transform(corpus)  
X_hash

<7769x1048576 sparse matrix of type ''
with 573305 stored elements in Compressed Sparse Row format>

For now we'll continue using the count vectorizer, but for very large corpora this approach would be faster and more efficient.

We now have a bag of words matrix, however there's another problem - some words appear much more frequently across the corpora as a whole than other words, so their presence in a document should carry less weight than a word that is very infrequent in general. To adjust for this, we'll use something called TF-IDF weighting. TF stands for term frequency, while IDF stands for inverse document frequency. The TF-IDF calculation lowers the relative weight of common words and increases the relative weight of uncommon words.

Scikit-learn implements TF-IDF as a separate transform that we can apply to the output of our vectorizer.

from sklearn.feature_extraction.text import TfidfTransformer  
tfidf = TfidfTransformer()  
X_weighted = tfidf.fit_transform(X)

Now that we have a weighted term-document matrix, let's do something with it. A common natural language processing task is to classify documents as belonging to a particular category. Since the reuters corpus is labeled, we can used a supervised learning algorithm to attempt to learn how to categorize similar news articles.

To do this we need a few additional pieces of information. We need a set of labels, and we need a test set to evaluate performance of the model. Forunately we have both available to us for the reuters dataset.

# build the term-document matrix for the test set using the existing transforms
test_files = [f for f in reuters.fileids() if 'test' in f]  
test_corpus = [reuters.raw(fileids=[f]) for f in test_files]  
X_test = vectorizer.transform(test_corpus)  
X_test_weighted = tfidf.transform(X_test)
# get the categories for each document in both the train and test sets
train_labels = [reuters.categories(fileids=[f]) for f in files]  
test_labels = [reuters.categories(fileids=[f]) for f in test_files]

Since there are 90 distinct categories, and each document can be assigned to more than one category, we probably don't have enough documents per category to build a really good document classifier. We're going to simplify the problem a bit and reduce the classification to a binary problem - wether or not the document belongs to the 'acq' category.

y = np.asarray([1 if 'acq' in label else 0 for label in train_labels])  
y_test = np.asarray([1 if 'acq' in label else 0 for label in test_labels])  
X_weighted.shape, y.shape, X_test_weighted.shape, y_test.shape

((7769, 26001), (7769L,), (3019, 26001), (3019L,))

Now we're ready to train a classifier. We'll use multinomial Naive Bayes in this example.

from sklearn.naive_bayes import MultinomialNB  
from sklearn.metrics import classification_report
# train the classifier
classifier = MultinomialNB()  
classifier.fit(X_weighted, y)
# predict labels for the test set
predictions = classifier.predict(X_test_weighted)
# output the classification report
label_names = ['not acq', 'acq']  
print(classification_report(y_test, predictions, target_names=label_names))

             precision    recall  f1-score   support
    not acq       0.87      1.00      0.93      2300
        acq       0.99      0.52      0.68       719
avg / total       0.90      0.88      0.87      3019

So that's a pretty reasonable outcome, although the recall is not as high as we would like it to be. There are a number of ways we could work to improve this result, such as experimenting with removing extraenous tokens such as numbers from our vocabulary or constructing additional high-level features about the documents. For a simple bag-of-words model though it's not too bad.

Supervised learning is nice when we have a labeled dataset, but the vast majority of text in the wild does not come with any sort of label so its usefulness in natural language processing is often limited. What about unsupervised techniques to categorize documents? Scikit-learn packages a decomposition technique called non-negative matrix factorization (NMF) that we can use for topic extraction.

from sklearn.decomposition import NMF  
nmf = NMF(n_components=10).fit(X_weighted)
feature_names = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(nmf.components_):  
    print('Topic #%d:' % topic_idx)
    print(' '.join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))
    print('')

Topic #0:
loss profit vs 000 cts dlrs oper shr revs year qtr 4th includes net discontinued note operations 1986 quarter excludes
Topic #1:
pct february january rose year rise rate index december prices 1986 inflation said fell compared growth consumer statistics base production
Topic #2:
cts qtly div record april pay prior vs dividend sets march quarterly lt payout 15 10 payable regular 30 31
Topic #3:
said trade japan dollar bank yen japanese exchange foreign rates market rate dealers economic currency paris countries nations government told
Topic #4:
vs mln 000 net cts shr revs qtr avg shrs oper dlrs note profit 4th lt year mths 31 sales
Topic #5:
billion dlrs surplus deficit mln francs marks 1986 january year rose february reserves trade fell 1985 account foreign december exports
Topic #6:
fed customer repurchase says reserves federal agreements funds reserve temporary sets dlrs week repurchases securities economists supply billion add trading
Topic #7:
stg mln bank money market england bills band assistance shortage revised today forecast help given provided 16 estimate compares central
Topic #8:
tonnes 000 wheat sugar corn export said ec grain 87 traders 1986 maize production tonne exports year soviet usda china
Topic #9:
dlrs said mln company shares share stock lt corp offer split common quarter group earnings dividend oil board shareholders unit

The above output takes the components derived from the factorization (here assumed to model a "topic" from the corpus) and extracts the 20 words that most significantly contributed to that topic. Although it's not perfect, we can see some commonalites among the groups of words.

NMF gives some interesting results, but there are more advanced algorithms for topic modeling. Latent Dirichlet Allocation (LDA), for example, is a technique that models documents as though they are composed of some undefined number of topics. Each of the words in the document are then said to be attributed to some combination of those topics.

Scikit-learn does not implement LDA so if we want to try it out, we'll need to look elsewhere. Fortunately there's a library called gensim that's focused specifically on topic modeling. To start off we need our corpus in a format that gensim models can use as input. Gensim implements a lot of the same transforms that we just applied to the data, but rather that re-create the same transforms we can re-use what we've already done and convert our term-document matrix into gensim's expected format.

from gensim import corpora, models, similarities, matutils
# create the corpus using a conversion utility
gensim_corpus = matutils.Sparse2Corpus(X_weighted)
# build the LDA model
lda = models.LdaModel(gensim_corpus, num_topics=100)

With our LDA model we could now examine the words that most contribute to each topic (as we did with NMF), or we could compare new documents to the model to identify either the topics that make up that document or the existing documents that they are most similar to. However, I want to move on from NMF/LDA to another algorithm implemented in gensim called word2vec because it's really, really interesting. Word2vec is an unsupervised neural network model that runs on a corpus of text and learns vector representations for the individual words in the text. The word vectors are modeled in a way such that words that are semantically "close" to each other are also close in a mathematical sense, and this results in some interesting properties. Let's explore some of the implications on the reuters dataset.

Since word2vec expects a list of sentences as input, we'll need to go back to the pre-transformed sentence list provided by NLTK

model = models.Word2Vec(reuters.sents(), size=100, window=5, min_count=5, workers=4)

We now have a trained word2vec model. It's possible to look at the vector for a word directly, although it won't mean much to a person.

model['market']

array([ 0.36218089,  0.04660718, -0.03312639, -0.00589092,  0.08425491,
       -0.05584015,  0.39824656, -0.19913128,  0.21185778, -0.16018888,
       -0.30720046,  0.41359827,  0.05477867,  0.40744004, -0.15048127,
       -0.21775401, -0.0918686 ,  0.08254269, -0.36109206, -0.08484149,
       -0.37724456,  0.19134018,  0.18765855,  0.17301551, -0.13106611,
        0.10278706,  0.14409529,  0.09305458, -0.27449781, -0.16971849,
        0.20959041,  0.12159102,  0.07963905,  0.03050068,  0.31353745,
        0.06859812, -0.26051152,  0.1805039 ,  0.28199297, -0.19140336,
        0.13152425,  0.04389969,  0.06004116, -0.31306067, -0.12013798,
       -0.17255786, -0.05460097, -0.35606486,  0.31404966,  0.03737779,
       -0.11519474,  0.31271645, -0.31853175,  0.08142728,  0.09033886,
       -0.15671426, -0.07798025,  0.06073617,  0.2294289 ,  0.13113637,
       -0.04398542, -0.34159404,  0.06506728,  0.20032322, -0.11604583,
       -0.14258914, -0.06725569, -0.06181487,  0.13476266,  0.17378812,
        0.01733109, -0.0836978 , -0.24637276,  0.06484974, -0.02348729,
        0.27839953, -0.12627478,  0.50229609,  0.02701729, -0.11646958,
       -0.3040815 , -0.18003054,  0.01555716, -0.11430902, -0.40754062,
        0.05430043,  0.27255279,  0.12115923,  0.16014519, -0.03295279,
       -0.50409102,  0.38960707, -0.19293144, -0.19752754, -0.14633107,
        0.24427678, -0.13369191,  0.18097162, -0.26153758, -0.11974742], dtype=float32)

Every word in the vocabulary now has a vector representation that looks like this. Since we're dealing with vectors, it's possible to compare words using vector math such as the cosine similarity.

model.similarity('dollar', 'yen')

0.7455091131291105

model.similarity('dollar', 'potato')

0.28685558842610581

According to the model, 'dollar' and 'yen' are much more similar to each other (both being currencies) than 'dollar' and 'potato'. The relationship is deeper than just a similarity measure though. The word2vec model is capable of capturing abstract concepts as well. The ubiquitous example is "woman + king - man = queen". When properly trained on a large enough amount of text, the model is able to detect that the relationship between 'woman' and 'queen' is similar to the relationship between 'man' and 'king'. Let's see if the model we just trained can do something similar.

model.most_similar(positive=['Japan', 'dollar'], negative=['US'])

[(u'appreciation', 0.6466031670570374),
 (u'appreciated', 0.596366822719574),
 (u'sterling', 0.5693594813346863),
 (u'Taiwan', 0.5512674450874329),
 (u'slowly', 0.5457212924957275),
 (u'mark', 0.5433770418167114),
 (u'yen', 0.5293248891830444),
 (u'stems', 0.5171161890029907),
 (u'pleas', 0.5137792825698853),
 (u'brake', 0.5080464482307434)]

It didn't get it exactly right, although 'yen' is on the list, but there's a lot of other noise too. This is most likely due to the relatively small size of the dataset. Word2vec needs a huge amount of training data to work really well. Some parameter tuning might help too - for example, a size of 100 dimensions might be way too big for the amount of data in the reuters dataset.

Let's see if there's a way to visualize some of the information captured by the model. Since the vectors are high-dimensional we can't visualize them directly, but we can apply a dimension reduction technique like PCA and use the first two principal components as coordinates. We can try this with a group of words that should be somewhat similar, such as countries.

from sklearn.decomposition import PCA
words = ['US', 'China', 'Japan', 'England', 'France', 'Germany', 'Soviet']  
word_vectors = [model[word] for word in words]
# create and apply PCA transform
pca = PCA(n_components=2)  
principal_components = pca.fit_transform(word_vectors)
# slice the 2D array
x = principal_components[:, 0]  
y = principal_components[:, 1]
# plot with text annotation
fig, ax = plt.subplots(figsize=(16,12))  
ax.scatter(x, y, s=0)
for i, label in enumerate(words):  
    ax.annotate(label, (x[i], y[i]), size='x-large')

They're a bit more spread out than I thought they would be, but a few (such as U.S. and China) are very close. These probably appeared frequently in the text so there may have been a larger amount of training data for these terms. The results become more interesting when applied to very large datasets, and indeed Google and others have done just that. In fact, applying this methodology to words is only the beginning. It's already been extended or is being extended to phrases and even entire documents. It's a very promising research area and I'm excited to see what comes out of it in the next few years.

An Intro To Probablistic Programming

John Wittenauer — Wed, 23 Sep 2015 23:58:41 GMT

Probablistic programming is an expressive and flexible way to build Bayesian statistical models in code. It's an entirely different mode of programming that involves using stochastic variables defined using probability distributions instead of concrete, deterministic values. Stochastic variables can be composed together in expressions and functions, just like in normal programming. But unlike regular programming, we use these relationships to conduct inference on the data that the variables model. Probablistic programming is gaining significant traction as a paradigm for building next-generation intelligent systems. In fact, there's a major research initiative funded by DARPA aimed at advancing the state of the art in the field. In this blog post we'll apply bayesian methods to solve a few simple problems to illustrate the power probablistic programming. For this exercise we'll be using PyMC3.

(NOTE: This notebook is partially adapted from the PyMC3 "Getting Started" tutorial at https://pymc-devs.github.io/pymc3/getting_started. PyMC3 is currently considered beta software and should be treated as such. The source code for this post is available here.)

For our first exercise we're going to implement multiple linear regression using a very simple two-variable dataset that I'm borrowing from an exercise I did for a machine learning course. The data is related to home sales and contains independent variables for the size of and number of bedrooms in a house, and a dependent variable for the price of the house. Let's load up and visualize the data.

import os  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
%matplotlib inline
path = os.getcwd() + '/data/ex1data2.txt'  
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])  
data.head()

	Size	Bedrooms	Price
0	2104	3	399900
1	1600	3	329900
2	2400	3	369000
3	1416	2	232000
4	3000	4	539900

fig, ax = plt.subplots(1, 2, figsize=(15,6))  
ax[0].scatter(data.Size, data.Price)  
ax[1].scatter(data.Bedrooms, data.Price)  
ax[0].set_ylabel('Price')  
ax[0].set_xlabel('Size')  
ax[1].set_xlabel('Bedrooms')

We also need a baseline linear regression technique to compare to the bayesian approach so we have an idea of how well it's performing. For that we'll load up scikit-learn's linear regression module. We'll use squared error to evaluate the performance.

from sklearn import linear_model  
model = linear_model.LinearRegression()
# normalize features
data = (data - data.mean()) / data.std()
X = data[['Size', 'Bedrooms']].values  
y = data['Price'].values
model.fit(X, y)  
y_pred = model.predict(X)  
squared_error = ((y - y_pred) ** 2).sum()  
squared_error

12.284529170669948

For fun let's see what parameters it came up with too. We can also compare these later on.

model.intercept_, model.coef_

(-1.2033011196250568e-16, array([ 0.88476599, -0.05317882]))

Okay, now we're ready to proceed. In order to define a model in PyMC3, we need to frame the problem in bayesian terms. This is no small task for a beginner in bayesian statistics and takes some getting used to. In the case of linear regression, we're interested in predicting outcomes Y as normally-distributed observations with an expected value μ that is a linear function of two predictor variables, X1 and X2. Using this model, μ is our expected value, alpha is our intercept, beta is an array of coefficients, and sigma represents the observation error. Since these are all unknown, non-deterministic variables, we need to specify a prior to instantiate them. We'll use normal distributions with a mean of zero.

By the way, if any of the concepts I just mentioned sound completely foreign, I encourage the reader to brush up on the basics of Bayesian statistics. There are some great resources online. In particular, I've found this book to be very helpful.

from pymc3 import Model, Normal, HalfNormal
regression_model = Model()  
with regression_model:  
    # priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)
    # expected value of outcome
    mu = alpha + beta[0] * X[:,0] + beta[1] * X[:,1]
    # likelihood (sampling distribution) of observations
    y_obs = Normal('y_obs', mu=mu, sd=sigma, observed=y)

Note that each variable is either declared with a prior representing the distribution of that variable, or (in the case of μ) is a deterministic outcome of other stochastic variables. Also of note is the y_obs variable, which is a special type of "observed" variable that represents the data likelihood of the model. Finally, observe that we're mixing PyMC variables with the variables holding the data. PyMC's models are very expressive and we could use a variety of mathematical functions to define our variables.

Now that we've specified the model, we need to obtain posterior estimates for the unknown variables in the model. There are two techniques we can leverage for this and they can be used to complement each other. The first thing we're going to do is find the Maximum A Priori Estimate (MAP) for the model. The MAP is a point estimate for the model parameters obtained using numerical optimization methods.

from pymc3 import find_MAP  
from scipy import optimize
map_estimate = find_MAP(model=regression_model, fmin=optimize.fmin_powell)  
print(map_estimate)

{'alpha': array(3.396023412944341e-09), 'beta': array([ 0.8846891, -0.0531327]), 'sigma_log': array(-0.6630286280877248)}

In this case we're overriding the default optimization algorithm (BFGS) and specifying our own, but we could have left it as the default too. Any optimization algorithm that minimizes the loss on the objective function should work.

Finding the MAP is a good starting point but it's not necessarily the best answer we can find. To do that, we need to use a simulation-based approach such as Markov Chain Monte Carlo (MCMC). PyMC3 implements a variety of MCMC sampling algorithms including the No-U-Turn Sampler (NUTS), which is especially good for models that have many continuous variables because it uses gradient-based techniques to converge much faster than traditional sampling algorithms.

Let's use the MAP as a starting point and sample the posterior distribution 1000 times using MCMC.

from pymc3 import NUTS, sample
with regression_model:  
    # obtain starting values via MAP
    start = find_MAP(fmin=optimize.fmin_powell)
    # instantiate sampler
    step = NUTS(scaling=start)
    # draw posterior samples
    trace = sample(5000, step, start=start)

[-----------------100%-----------------] 5000 of 5000 complete in 11.4 sec

We can examine the trace object directly to see the sampled values for each of the variables in the model.

trace['alpha'][-5:]

array([ 0.07095636, -0.05955168, 0.09537584, 0.04383463, 0.10311347])

Although the flexibility to inspect the values directly is useful, PyMC3 provides plotting and summarization functions for inspecting the sampling output that tell us much more at a glance. A simple posterior plot can be created using traceplot.

from pymc3 import traceplot  
traceplot(trace)

There's also a text-based output available using the summary function.

from pymc3 import summary  
summary(trace)

alpha:
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.002            0.080            0.001            [-0.154, 0.157]
  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  -0.152         -0.051         0.003          0.055          0.162
beta:
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.884            0.096            0.002            [0.688, 1.063]
  -0.052           0.097            0.002            [-0.237, 0.139]
  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.697          0.821          0.884          0.946          1.076
  -0.237         -0.118         -0.052         0.010          0.139
sigma_log:
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  -0.617           0.103            0.001            [-0.819, -0.425]
  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  -0.807         -0.688         -0.622         -0.549         -0.401
sigma:
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.543            0.057            0.001            [0.440, 0.653]
  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.446          0.502          0.537          0.578          0.669

So we now have posterior distributions for each of our model parameters, but what do we do with them? Remember our original task was to find a model that minimizes the squared error of the training data. We've created a probablistic model of the parameters in the linear regression model, but we need to use point values to calculate the squared error. The most straightforward approach would be to use the mean, or expected value, of the parameters and plug those into the linear regression equation. Let's try that.

# so we can use vector math to compute the predictions at once
data.insert(0, 'Ones', 1)
X = data.iloc[:, 0:3].values  
params = np.array([0, 0.883, -0.052])  
y_pred = np.dot(X, params.T)
bayes_squared_error = ((y - y_pred) ** 2).sum()  
bayes_squared_error

12.284629306712745

Total squared error probably isn't the best way to evaluate the performance of a model. In a real scenario we'd be testing predictions on unseen data, testing the robustness of the model, etc. But it's still pretty interesting that the parameter means it found resulted in a model with basically the exact same squared error as the scikit-learn linear regression model. Fascinating stuff.

Since linear models are pretty common, PyMC3 also has a GLM module that makes specifying models in this format much easier. It uses patsy to define model equations as a string (similar to R-style formulas) and creates the necessary variables underneath.

from pymc3 import glm
with Model() as model:  
    glm.glm('Price ~ Size + Bedrooms', data)
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(5000, step, start=start)

[-----------------100%-----------------] 5000 of 5000 complete in 12.4 sec

The end result should look pretty similar to the trace that we obtained from the first example.

traceplot(trace)

So that's what bayesian inference looks like when applied to multiple linear regression. At this point it's worth asking - how would one use this in the broader picture, especially on non-trivial problems? For someone interested primarily in machine learning applications (vs. scientific analysis), where does this fit in the toolbox? These are questions that I'm still figuring out. I think the real power of bayesian modeling comes from incorporating domain knowledge into graphical models such as what Daphne Koller teaches in her Probablistic Graphical Models class. I hope to explore this further in a future notebook and attempt to apply it to a real machine learning problem.

Why Spark May Be Even Bigger Than The Hype

John Wittenauer — Sat, 15 Aug 2015 18:25:10 GMT

Spark is currently one of the hottest open source projects in the big data space, even eclipsing Hadoop in terms of excitement. Originally a Berkley AMPlab project, Spark became a top-level Apache project early last year and has been on a tear ever since. The company now backing Spark – Databricks – has been churning our major releases every few months with no signs of slowing down. But to claim that Spark is solely a Databricks-driven endeavor is a bit misleading. Spark is getting contributions from all over the place, including many of its biggest users, which is one of the characteristics of the project that make it so interesting. As of this writing there have been over 600 individual contributors to the project according to GitHub’s stats. That goes way beyond any single organization.

To give a better sense of just how quickly Spark as taken off, consider that 3 years ago it was an obscure academic experiment based off of distributed systems research taking place at AMPlab. By the end of 2013, the first Spark Summit was held with over 400 developers in attendance. And just last month, the 2015 event in San Francisco sold out with over 2,000 in attendance. Spark is currently the most active project under the stewardship of the Apache foundation and already boasts more than 500 production deployments, according to Patrick Wendell, one of the project’s founders and a co-founder of Databricks.

Early adopters have clearly taken notice of Spark’s rapid rise, and with good reason. Spark brings a lot of innovation to big data processing. Hadoop pushed the boundary of what is possible in terms of handling scale and variety of data, but MapReduce is fundamentally a batch-oriented approach. You write a MapReduce job, set it off, go get some coffee (or maybe lunch, depending on the size of the job) and hopefully get some results when you return. SQL abstractions such as Hive and Impala have reduced this friction somewhat by both alleviating the need to write MapReduce code and optimizing the execution plan of the code to improve performance. However, one would rarely hear the experience described as “real-time” except for relatively small jobs. Spark brings a different approach to the table. By intelligently using in-memory storage along with more sophisticated distributed processing algorithms, Spark is able to bring a more interactive feel to working with big data. Along with much more developer-friendly APIs in languages like Scala and Python, Spark opens the doors to a whole new audience that would never have even considered touching MapReduce if they didn’t have to.

Yet while Spark appears on the surface to be competing with Hadoop, it also runs on YARN and supports HDFS file storage. Spark is frequently installed on existing Hadoop clusters alongside Hive, HBase and the like, giving users another option for interacting with the data already in the cluster. In that sense it is more complementary to Hadoop than adversarial, and may even help to increase adoption of Hadoop. There’s a reason that all of the major Hadoop vendors have embraced Spark, and companies like IBM (a traditional enterprise stalwart) are purportedly going all in with initiatives to expand Spark’s footprint. The tidal wave is already occurring – why not embrace it?

But despite all of the optimism, there are definitely reasons to take pause. One major red flag is probably Databricks itself. The history of companies whose business model is based entirely on open source products is mixed to say the least. Although their model isn’t the traditional “license support from the experts” approach, they’re arguably taking an even more tenuous position. They’re essentially betting that they can entice enterprises to use their cloud platform (Spark-as-a-service?) over base-level Spark by providing some nice “extras” such as managed deployment/scheduling and an interactive workspace tool (which looks an awful lot like the Jypyter project, formerly IPython notebook). But what happens if the open source community or a major Hadoop vendor catches up to Databricks’ proprietary value-add code, thus negating any advantage to using their platform? Perhaps even more concerning – how does Databricks balance the need to keep Spark moving forward (which benefits everyone) vs. improving their own platform (which makes their business viable)? This is new territory and it may be years before we see how this plays out.

I think concerns over whether or not Spark can keep this momentum going are probably justified, but I still feel pretty optimistic about its future. One of the reasons I feel this way is because the vision that the project’s founders have for Spark going forward sounds amazing. If you watch some of the keynote speeches from the last summit, there’s a lot of excitement about where things are at today but even more excitement about where things are going. Essentially the goal is to turn Spark into an “operating system” for data processing. A variety of lightweight front-ends in varying languages will compile code down to a single logical model based on a common data frames API. From there, initiatives like Project Tungsten will take that logical plan and push the performance envelope with innovations like cache-aware computation and advanced code generation. Longer term, Spark may even be able to compile instructions to frameworks like LLVM or OpenCL and leverage alternative computation engines like GPUs (currently the de-facto standard for training “deep learning” neural nets). If the front-end APIs and library of distributed algorithms evolves to a point where it’s as easy to run analytics and machine learning on Spark as it is on a single computer today (or perhaps even easier), this could be a game-changer.

So although there’s a great deal of hype around Spark today, and it’s entirely possible that much of it ends up being overblown, I think it’s also possible that it ends up getting much, much bigger. IBM has called Spark “potentially the most significant open source project of the next decade”. Andrew Brust wrote earlier this year that “…when platforms get beyond a certain critical mass of support, they eventually become what the hype has made them out to be. In other words, belief in the quality of a platform tends to self-fulfill.” Will great vision, top-notch engineering talent, and a widespread belief that it will be successful end up being enough to overcome the challenges that lay ahead? Only time will tell, but one thing is certain – it’s going to be fun to watch.

Machine Learning Exercises In Python, Part 3

John Wittenauer — Wed, 15 Jul 2015 00:06:18 GMT

This post is part of a series covering the exercises from Andrew Ng's machine learning class on Coursera. The original code for this post is available here. The exercise text is available here.

Part 1 - Simple Linear Regression
Part 2 - Multivariate Linear Regression

In part 2 of the series we wrapped up our implementation of multivariate linear regression using gradient descent and applied it to a simple housing prices data set. In this post we’re going to switch our objective from predicting a continuous value (regression) to classifying a result into two or more discrete buckets (classification) and apply it to a student admissions problem. Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set. For each training example, you have the applicant's scores on two exams and the admissions decision. To accomplish this, we're going to build a classification model that estimates the probability of admission based on the exam scores using a somewhat confusingly-named technique called logistic regression.

Logistic Regression

You may be wondering – why are we using a “regression” algorithm on a classification problem? Although the name seems to indicate otherwise, logistic regression is actually a classification algorithm. I suspect it’s named as such because it’s very similar to linear regression in its learning approach, but the cost and gradient functions are formulated differently. In particular, logistic regression uses a sigmoid or “logit” activation function instead of the continuous output in linear regression (hence the name). We’ll see more later on when we dive into the implementation.

To get started, let’s import and examine the data set we’ll be working with.

import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
%matplotlib inline
import os  
path = os.getcwd() + '\data\ex2data1.txt'  
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])  
data.head()

	Exam 1	Exam 2	Admitted
0	34.623660	78.024693	0
1	30.286711	43.894998	0
2	35.847409	72.902198	0
3	60.182599	86.308552	1
4	79.032736	75.344376	1

There are two continuous independent variables in the data - “Exam 1” and “Exam 2”. Our prediction target is the “Admitted” label, which is binary-valued. A value of 1 means the student was admitted and a value of 0 means the student was not admitted. Let’s see this graphically with a scatter plot of the two scores and use color coding to visualize if the example is positive or negative.

positive = data[data['Admitted'].isin([1])]  
negative = data[data['Admitted'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))  
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')  
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')  
ax.legend()  
ax.set_xlabel('Exam 1 Score')  
ax.set_ylabel('Exam 2 Score')

From this plot we can see that there’s a nearly linear decision boundary. It curves a bit so we can’t classify all of the examples correctly using a straight line, but we should be able to get pretty close. Now we need to implement logistic regression so we can train a model to find the optimal decision boundary and make class predictions. The first step is to implement the sigmoid function.

def sigmoid(z):  
    return 1 / (1 + np.exp(-z))

This function is the “activation” function for the output of logistic regression. It converts a continuous input into a value between zero and one. This value can be interpreted as the class probability, or the likelihood that the input example should be classified positively. Using this probability along with a threshold value, we can obtain a discrete label prediction. It helps to visualize the function’s output to see what it’s really doing.

nums = np.arange(-10, 10, step=1)
fig, ax = plt.subplots(figsize=(12,8))  
ax.plot(nums, sigmoid(nums), 'r')

Our next step is to write the cost function. Remember that the cost function evaluates the performance of the model on the training data given a set of model parameters. Here’s the cost function for logistic regression.

def cost(theta, X, y):  
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))

Note that we reduce the output down to a single scalar value, which is the sum of the “error” quantified as a function of the difference between the class probability assigned by the model and the true label of the example. The implementation is completely vectorized – it’s computing the model’s predictions for the whole data set in one statement (sigmoid(X * theta.T)). If the math here isn’t making sense, refer to the exercise text I linked to above for a more detailed explanation.

We can test the cost function to make sure it’s working, but first we need to do some setup.

# add a ones column - this makes the matrix multiplication work out easier
data.insert(0, 'Ones', 1)
# set X (training data) and y (target variable)
cols = data.shape[1]  
X = data.iloc[:,0:cols-1]  
y = data.iloc[:,cols-1:cols]
# convert to numpy arrays and initalize the parameter array theta
X = np.array(X.values)  
y = np.array(y.values)  
theta = np.zeros(3)

I like to check the shape of the data structures I’m working with fairly often to convince myself that their values are sensible. This technique if very useful when implementing matrix multiplication.

X.shape, theta.shape, y.shape

((100L, 3L), (3L,), (100L, 1L))

Now let’s compute the cost for our initial solution given zeros for the model parameters, here represented as “theta”.

cost(theta, X, y)

0.69314718055994529

Now that we have a working cost function, the next step is to write a function that computes the gradient of the model parameters to figure out how to change the parameters to improve the outcome of the model on the training data. Recall that with gradient descent we don’t just randomly jigger around the parameter values and see what works best. At each training iteration we update the parameters in a way that’s guaranteed to move them in a direction that reduces the training error (i.e. the “cost”). We can do this because the cost function is differentiable. The calculus involved in deriving the equation is well beyond the scope of this blog post, but the full equation is in the exercise text. Here’s the function.

def gradient(theta, X, y):  
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)
    return grad

Note that we don't actually perform gradient descent in this function - we just compute a single gradient step. In the exercise, an Octave function called "fminunc" is used to optimize the parameters given functions to compute the cost and the gradients. Since we're using Python, we can use SciPy's optimization API to do the same thing.

import scipy.optimize as opt  
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))  
cost(result[0], X, y)

0.20357134412164668

We now have the optimal model parameters for our data set. Next we need to write a function that will output predictions for a dataset X using our learned parameters theta. We can then use this function to score the training accuracy of our classifier.

def predict(theta, X):  
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]
theta_min = np.matrix(result[0])  
predictions = predict(theta_min, X)  
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]  
accuracy = (sum(map(int, correct)) % len(correct))  
print 'accuracy = {0}%'.format(accuracy)

accuracy = 89%

Our logistic regression classifer correctly predicted if a student was admitted or not 89% of the time. Not bad! Keep in mind that this is training set accuracy though. We didn't keep a hold-out set or use cross-validation to get a true approximation of the accuracy so this number is likely higher than its true performance (this topic is covered in a later exercise).

Regularized Logistic Regression

Now that we have a working implementation of logistic regression, we'll going to improve the algorithm by adding regularization. Regularization is a term in the cost function that causes the algorithm to prefer "simpler" models (in this case, models will smaller coefficients). The theory is that this helps to minimize overfitting and improve the model's ability to generalize. We’ll apply our regularized version of logistic regression to a slightly more challenging problem. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a data set of test results on past microchips, from which you can build a logistic regression model.

Let's start by visualizing the data.

path = os.getcwd() + '\data\ex2data2.txt'  
data2 = pd.read_csv(path, header=None, names=['Test 1', 'Test 2', 'Accepted'])
positive = data2[data2['Accepted'].isin([1])]  
negative = data2[data2['Accepted'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))  
ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')  
ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')  
ax.legend()  
ax.set_xlabel('Test 1 Score')  
ax.set_ylabel('Test 2 Score')

This data looks a bit more complicated than the previous example. In particular, you'll notice that there is no linear decision boundary that will perform well on this data. One way to deal with this using a linear technique like logistic regression is to construct features that are derived from polynomials of the original features. We can try creating a bunch of polynomial features to feed into the classifier.

degree = 5  
x1 = data2['Test 1']  
x2 = data2['Test 2']
data2.insert(3, 'Ones', 1)
for i in range(1, degree):  
    for j in range(0, i):
        data2['F' + str(i) + str(j)] = np.power(x1, i-j) * np.power(x2, j)
data2.drop('Test 1', axis=1, inplace=True)  
data2.drop('Test 2', axis=1, inplace=True)
data2.head()

	Accepted	Ones	F10	F20	F21	F30	F31	F32
0	1	1	0.051267	0.002628	0.035864	0.000135	0.001839	0.025089
1	1	1	-0.092742	0.008601	-0.063523	-0.000798	0.005891	-0.043509
2	1	1	-0.213710	0.045672	-0.147941	-0.009761	0.031616	-0.102412
3	1	1	-0.375000	0.140625	-0.188321	-0.052734	0.070620	-0.094573
4	1	1	-0.513250	0.263426	-0.238990	-0.135203	0.122661	-0.111283

Now we need to modify the cost and gradient functions to include the regularization term. In each case, the regularizer is added on to the previous calculation. Here’s the updated cost function.

def costReg(theta, X, y, learningRate):  
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (learningRate / 2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / (len(X)) + reg

Notice that we’ve added a new variable called “reg” that is a function of the parameter values. As the parameters get larger, the penalization added to the cost function increases. Also note that we’ve added a new “learning rate” parameter to the function. This is also part of the regularization term in the equation. The learning rate gives us a new hyper-parameter that we can use to tune how much weight the regularization holds in the cost function.

Next we’ll add regularization to the gradient function.

def gradientReg(theta, X, y, learningRate):  
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        if (i == 0):
            grad[i] = np.sum(term) / len(X)
        else:
            grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[:,i])
    return grad

Just as with the cost function, the regularization term is added on to the original calculation. However, unlike the cost function, we included logic to make sure that the first parameter is not regularized. The intuition behind this decision is that the first parameter is considered the “bias” or “intercept” of the model and shouldn’t be penalized.

We can test out the new functions just as we did before.

# set X and y (remember from above that we moved the label to column 0)
cols = data2.shape[1]  
X2 = data2.iloc[:,1:cols]  
y2 = data2.iloc[:,0:1]
# convert to numpy arrays and initalize the parameter array theta
X2 = np.array(X2.values)  
y2 = np.array(y2.values)  
theta2 = np.zeros(11)
learningRate = 1
costReg(theta2, X2, y2, learningRate)

0.6931471805599454

We can also re-use the optimization code from earlier to find the optimal model parameters.

result2 = opt.fmin_tnc(func=costReg, x0=theta2, fprime=gradientReg, args=(X2, y2, learningRate))  
result2

(array([ 0.35872309, -3.22200653, 18.97106363, -4.25297831, 18.23053189, 20.36386672, 8.94114455, -43.77439015, -17.93440473, -50.75071857, -2.84162964]), 110, 1)

Finally, we can use the same method we applied earlier to create label predictions for the training data and evaluate the performance of the model.

theta_min = np.matrix(result2[0])  
predictions = predict(theta_min, X2)  
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y2)]  
accuracy = (sum(map(int, correct)) % len(correct))  
print 'accuracy = {0}%'.format(accuracy)

accuracy = 91%

That's all for part 3! In the next post in the series we'll expand on our implementation of logistic regression to tackle multi-class image classification.

A Simple Time Series Analysis Of The S&P 500 Index

John Wittenauer — Wed, 17 Jun 2015 00:27:53 GMT

In this blog post we'll examine some common techniques used in time series analysis by applying them to a data set containing daily closing values for the S&P 500 stock market index from 1950 up to present day. The objective is to explore some of the basic ideas and concepts from time series analysis, and observe their effects when applied to a real world data set. Although it's not possible to actually predict changes in the index using these techniques, the ideas presented here could theoretically be used as part of a larger strategy involving many additional variables to conduct a regression or machine learning effort.

Time series analysis is a branch of statistics that involves reasoning about ordered sequences of related values in order to extract meaningful statistics and other characteristics of the data. It's used in a wide range of disciplines including econometrics, signal processing, weather forecasting, and basically any other field that involves time series data. These techniques are often used to develop models that can be used to attempt to forecast future values of a series, either on their own or in concert with other variables.

To get started, let's first download the data. I got the historial data set from Yahoo Finance, which includes a link to download the whole thing as a .csv file. Now we can load up the data set and take a look. I'll be using several popular Python libraries for the analysis, so all of the code is in Python.

%matplotlib inline
import os  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import statsmodels.api as sm  
import seaborn as sb  
sb.set_style('darkgrid')
path = os.getcwd() + '\data\stock_data.csv'  
stock_data = pd.read_csv(path)  
stock_data['Date'] = stock_data['Date'].convert_objects(convert_dates='coerce')  
stock_data = stock_data.sort_index(by='Date')  
stock_data = stock_data.set_index('Date')

The data is in reverse chronological order so I sorted it by date and then set the index of the data frame to the date column. If you look at the data there are several fields but we'll be focusing on the closing price only. Let's plot the data first and see what it looks like.

stock_data['Close'].plot(figsize=(16, 12))

The first obvious thing to note, aside from the two giant dips at the tail end corresponding to the market crashes in 2002 and 2008, is that the data is clearly non-stationary. This makes sense for market data as it tends to go up in the long run more than it goes down. This is a problem for time series analysis though as non-stationary data is hard to reason about. The first thing we can try is a first difference of the series. In other words, subtract the previous value t-1 from the current value t to get the difference d(t).

stock_data['First Difference'] = stock_data['Close'] - stock_data['Close'].shift()  
stock_data['First Difference'].plot(figsize=(16, 12))

The data no longer appears to be trending up over time and is instead centered around 0. There's another problem though. Look at the variance. It's very small early on and steadily increases over time. This is a sign that the data is not only non-stationary but also exponentially increasing. The magnitude of the day-to-day variations at present day completely dwarf the magnitude of the changes in 1950. To deal with this, we'll apply a log transform to the original series.

stock_data['Natural Log'] = stock_data['Close'].apply(lambda x: np.log(x))  
stock_data['Natural Log'].plot(figsize=(16, 12))

So that gives us the original closing price with a log transform applied to "flatten" the data from an exponential curve to a linear curve. One way to visually see the effect that the log transform had is to analyze the variance over time. We can use a rolling variance statistic and compare both the original series and the logged series.

stock_data['Original Variance'] = pd.rolling_var(stock_data['Close'], 30, min_periods=None, freq=None, center=True)  
stock_data['Log Variance'] = pd.rolling_var(stock_data['Natural Log'], 30, min_periods=None, freq=None, center=True)
fig, ax = plt.subplots(2, 1, figsize=(13, 12))  
stock_data['Original Variance'].plot(ax=ax[0], title='Original Variance')  
stock_data['Log Variance'].plot(ax=ax[1], title='Log Variance')  
fig.tight_layout()

Observe that in the top graph, we can't even see any of the variations until the late 80s. In the bottom graph however it's a different story, changes in the value are clearly visible throughout the entire data set. From this view, it's clear that our transformation has made the variance relatively constant.

Now we can see the earlier variations in the data set quite a bit better than before. We still need to take the first difference though so let's calculate that from the logged series.

stock_data['Logged First Difference'] = stock_data['Natural Log'] - stock_data['Natural Log'].shift()  
stock_data['Logged First Difference'].plot(figsize=(16, 12))

Much better! We now have a stationary time series model of daily changes to the S&P 500 index. Now let's create some lag variables y(t-1), y(t-2) etc. and examine their relationship to y(t). We'll look at 1 and 2-day lags along with weekly and monthly lags to look for "seasonal" effects.

stock_data['Lag 1'] = stock_data['Logged First Difference'].shift()  
stock_data['Lag 2'] = stock_data['Logged First Difference'].shift(2)  
stock_data['Lag 5'] = stock_data['Logged First Difference'].shift(5)  
stock_data['Lag 30'] = stock_data['Logged First Difference'].shift(30)

One interesting visual way to evaluate the relationship between lagged variables is to do a scatter plot of the original variable vs. the lagged variable and see where the distribution lies. We can do this with a joint plot using the seaborn package.

sb.jointplot('Logged First Difference', 'Lag 1', stock_data, kind='reg', size=13)

Notice how tightly packed the mass is around 0. It also appears to be pretty evenly distributed - the marginal distributions on both axes are roughly normal. This seems to indicate that knowing the index value one day doesn't tell us much about what it will do the next day.

It probably comes as no surprise that there's very little correlation between the change in value from one day to the next. Although I didn't plot them out here, the other lagged variables that we created above show similar results. There could be a relationship to other lag steps that we haven't tried, but it's impractical to test every possible lag value manually. Fortunately there is a class of functions that can systematically do this for us.

from statsmodels.tsa.stattools import acf  
from statsmodels.tsa.stattools import pacf
lag_correlations = acf(stock_data['Logged First Difference'].iloc[1:])  
lag_partial_correlations = pacf(stock_data['Logged First Difference'].iloc[1:])

The auto-correlation function computes the correlation between a variable and itself at each lag step up to some limit (in this case 40). The partial auto-correlation function computes the correlation at each lag step that is NOT already explained by previous, lower-order lag steps. We can plot the results to see if there are any significant correlations.

fig, ax = plt.subplots(figsize=(16,12))  
ax.plot(lag_correlations, marker='o', linestyle='--')

The auto-correlation and partial-autocorrelation results are very close to each other (I only plotted the auto-correlation results above). What this shows is that there is no significant (> 0.2) correlation between the value at time t and at any time prior to t up to 40 steps behind. In order words, the series is a random walk.

Another interesting technique we can try is a decomposition. This is a technique that attempts to break down a time series into trend, seasonal, and residual factors. Statsmodels comes with a decompose function out of the box.

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(stock_data['Natural Log'], model='additive', freq=30)  
fig = plt.figure()  
fig = decomposition.plot()

Since we don't see any real cycle in the data, the visualization is not that effective in this case. For data where this is a strong seasonal pattern though it can be very useful. The folling instance, for example, is a sample from the statsmodels documentation showing CO2 emissions data over time.

co2_data = sm.datasets.co2.load_pandas().data  
co2_data.co2.interpolate(inplace=True)  
result = sm.tsa.seasonal_decompose(co2_data.co2)  
fig = plt.figure()  
fig = result.plot()

The decomposition is much more useful in this case. There are three clearly distinct components to the time series - a trend line, a seasonal adjustment, and residual values. Each of these would need to be accounted for and modeled appropriately.

Going back to our stock data, we're already observed that it's a random walk and that our lagged variables don't seem to have much impact, but we can still try fitting some ARIMA models and see what we get. Let's start with a simple moving average model.

model = sm.tsa.ARIMA(stock_data['Natural Log'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Natural Log', 'Forecast']].plot(figsize=(16, 12))

So at first glance it seems like this model is doing pretty well. But although it appears like the forecasts are really close (the lines are almost indistinguishable after all), remember that we used the un-differenced series! The index only fluctuates a small percentage day-to-day relative to the total absolute value. What we really want is to predict the first difference, or the day-to-day moves. We can either re-run the model using the differenced series, or add an "I" term to the ARIMA model (resulting in a (1, 1, 0) model) which should accomplish the same thing. Let's try using the differenced series.

model = sm.tsa.ARIMA(stock_data['Logged First Difference'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Logged First Difference', 'Forecast']].plot(figsize=(16, 12))

It's a little hard to tell, but it appears like our forecasted changes are generally much smaller than the actual changes. It might be worth taking a closer look at a subset of the data to see what's really going on.

stock_data[['Logged First Difference', 'Forecast']].iloc[1200:1600, :].plot(figsize=(16, 12))

So now it's pretty obvious that the forecast is way off. We're predicting tiny little variations relative to what is actually happening day-to-day. Again, this is more of less expected with a simple moving average model of a random walk time series. There's not enough information from the previous days to accurately forcast what's going to happen the next day.

A moving average model doesn't appear to do so well. What about an exponential smoothing model? Exponential smoothing spreads out the impact of previous values using an exponential weighting, so things that happened more recently are more impactful than things that happened a long time ago. Maybe this "smarter" form of averaging will be more accurate?

model = sm.tsa.ARIMA(stock_data['Logged First Difference'].iloc[1:], order=(0, 0, 1))  
results = model.fit(disp=-1)  
stock_data['Forecast'] = results.fittedvalues  
stock_data[['Logged First Difference', 'Forecast']].plot(figsize=(16, 12))

You can probably guess the answer...if predicting the stock market were this easy, everyone would be doing it! As I mentioned at the top, the point of this analysis wasn't to claim that you can predict the market with these techniques, but rather to demonstrate the types of the analysis one might use when breaking down time series data. Time series analysis is a very rich field with lots more technical analysis than what I went into here (much of which I'm still learning). If you're interested in doing a deeper dive, I recommend these notes from a professor at Duke. A lot of what I learned about the field I picked up from reading online resources like this one. Finally, the original source code from this post is hosted on GitHub here, along with a variety of other notebooks. Feel free to check it out!

Configuring Theano For High Performance Deep Learning

John Wittenauer — Sat, 23 May 2015 15:41:26 GMT

The topic of this post is perhaps a bit mundane, but after spending a considerable amount of time getting this right, I decided to put together a step-by-step guide so I would remember how to do it next time. And since I already went to the trouble, I might as well share it and hopefully save others a few headaches.

Theano, if you're not aware, is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It has a number of really interesting features, particularly its transparent use of a GPU for massively parallel computation and the ability to dynamically compile expressions into optimized C code. Theano provides developers with a general-purpose computing framework for very fast tensor operations and is widely used in the Python machine learning community, particularly for deep learning research. If you'd like to learn more, there's detailed documentation available here.

While Theano is very powerful, it's unfortunately not that easy to set up and configure properly. The website does provide instructions, but they're often inconsistent or out of date. There are also a number of blog posts and Stack Overflow questions floating around that discuss various elements of the process, but nothing that completely covered what I needed. So that's what this post will address.

To be clear, my goal was relatively simple - install and configure Theano with an efficient BLAS implementation for CPU operations, along with the ability to leverage a GPU when desired, on a clean install of Ubuntu 15.04 so I could run deep learning experiments. Although I'm running Ubuntu 15, this guide should also work on 14.xx. As a bonus, I'll also cover setting up a fresh install of Anaconda and downloading/installing a new but promising deep learning library called Keras.

First, open a terminal and run the following commands to make sure your OS is up-to-date.

sudo apt-get update  
sudo apt-get upgrade  
sudo apt-get install build-essential  
sudo apt-get autoremove

Next, install and set up Git. We need this to download and build OpenBLAS ourselves.

sudo apt-get install git  
git config --global user.name   
git config --global user.email

Next we need a fortran compiler (again required to build OpenBLAS).

sudo apt-get install gfortran

Now we're going to retrieve and build OpenBLAS. It's possible to skip this part and just download a pre-built BLAS package, but you'll get much better performance by compiling it yourself, and it requires relatively little effort. First create a Git folder (I added it to my home directory), then run these commands:

cd git  
git clone https://github.com/xianyi/OpenBLAS  
cd OpenBLAS  
make FC=gfortran  
sudo make PREFIX=/usr/local install

Next let's set up the required GPU tooling. If you're using a relatively modern NVIDIA card, your best bet is to use CUDA. It's possible to configure Theano using OpenCL as well, but I haven't tried this so I won't cover it here. There are a number of ways to install CUDA, and many guides advise you to download the binaries from NVIDIA's website and run some commands to install it, but actually there's a package available that installs everything you need. Run the following commands:

sudo apt-get install nvidia-current  
sudo apt-get install nvidia-cuda-toolkit

The first line installs NVIDIA's graphics card drivers, and the second line installs the CUDA tools. Note that the "nvidia-current" package may (despite what the name indicates) install an older version of the drivers than is actally available. Don't fall into the trap of thinking to have to use the legacy drivers to use CUDA. You don't! I ended up installing newer drivers from the system menu and it still worked fine.

You should restart your system after installing new drivers to make sure everything gets loaded properly. To verify that CUDA is installed, run this at the terminal (it should output some text that includes your graphics card model in it somewhere):

nvidia-smi

Next we'll get Anaconda set up, which will install Theano along with all of its dependencies. Download the binaries here and save them somewhere locally, then navigate to the path where you saved the file and run:

bash Anaconda-2.2.0-Linux-x86_64.sh

(note that the file name will change when a new version is published, so use the actual name of the file you downloaded)

To make sure your Anaconda install is up-to-date and all of Theano's dependencies are there, run a few statements at the terminal using the "conda" package manager:

conda update conda  
conda update anaconda  
conda install pydot  
conda update theano

We're now in the home stretch. All that's left are some configuration items to tell Theano where your BLAS/CUDA libraries are and how to use them. First, create a file called ".theanorc" in your home directory and add the following contents to it:

[global]
device = gpu  
floatX = float32
[blas]
ldflags = -L/usr/local/lib -lopenblas
[nvcc]
fastmath = True
[cuda]
root = /usr/lib/nvidia-cuda-toolkit

(switch the "device" flag to cpu when you want to use that instead)

Finally, we need to add the CUDA path to an environment variable that Theano also looks for. Run the following statement:

export LD_LIBRARY_PATH = "/usr/lib/nvidia-cuda-toolkit"

That's it! You're now set up to use Theano. In order to verify that everything is working, add the following code to "theano_test.py" in your home directory:

from theano import function, config, shared, sandbox  
import theano.tensor as T  
import numpy  
import time
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core  
iters = 1000
rng = numpy.random.RandomState(22)  
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))  
f = function([], T.exp(x))  
print f.maker.fgraph.toposort()  
t0 = time.time()  
for i in xrange(iters):  
    r = f()
t1 = time.time()  
print 'Looping %d times took' % iters, t1 - t0, 'seconds'  
print 'Result is', r  
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):  
    print 'Used the cpu'
else:  
    print 'Used the gpu'

Then run it from the terminal and verify that it completes using the correct device (based on your configuration earlier):

python theano_test.py

NOTE: At this point I ran into an issue that had me banging my head against the wall for a few hours. When I tried to run the above test, Theano could not access my GPU even though CUDA was configured correctly. Although the exact underlying cause is still unclear to me, it turns out that running any CUDA process with root access first will "initialize" things such that any future processes will run successfully without root access. This needs to be done once after each restart and then you're good to go. One possible solution to this is to run the above script with root access:

sudo python theano_test.py

However, this presented two new problems for me:

1) My Anaconda python distro wasn't properly linked while running with root, so it was using the base Linux install and couldn't find any of the libraries
2) The process creates a temporary folder for Theano's compiled code that was then inaccessible without root access, causing future attempts to use Theano to fail

I resolved both of these issues by creating a simple bash script that I run once each time I reboot the machine. Create a file called "init.sh" in your home directory and add the following lines of code to it:

/home//anaconda/bin/python theano_test.py
cd .theano  
rm -rf compiledir_Linux-3.19--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/

(make sure you replace "user" with your actual path name)

The folder with the compiled code may have a different name, but it appears to be the same each time Theano runs, so just check to see what yours looks like and put it in the script. Then run this each time you reboot:

sudo bash init.sh

Now you should be good to go. The last step is adding a library to build deep learning nets with. I recommend trying Keras. It's still a bit immature but already has a great feature set, and the API is the best I've seen for this kind of stuff. To get Keras installed, run the following commands from your git folder:

git clone https://github.com/fchollet/keras  
cd keras  
python setup.py install

That's all there is to it. You're now ready to start building ultra-fast deep learning nets. Enjoy!

A Few Useful Things To Know About Machine Learning

John Wittenauer — Sat, 02 May 2015 13:53:24 GMT

One of the deep, dark secrets of machine learning is that beneath all of the math, statistics, and algorithms, there's sort of a "black art" to actually building useful models. I think the reason for this stems primarily from the counter-intuitiveness of working in high-dimensional spaces coupled with a tendancy to focus on things that can be explicitly defined and measured (such as a new algorithm) rather than more subjective nuances like developing features based on domain-specific knowledge. While the former is interesting and necessary to advance the state of the art, the latter is more likely to have a significant impact on most real-world projects. It is both very helpful and somewhat difficult to find practical lessons in machine learning that are well-articulated and based on sound principles.

One resource I've come across that is incredibly useful in this regard is Pedro Domingos' "A Few Useful Things to Know about Machine Learning" (link), from which the title of this blog post was unabashedly stolen. The paper is a survey of helpful reminders about the nature of the problems one deals with when designing predictive models, and practical guidelines for overcoming them. The remainder of this post is dedicated to distilling these reminders down to their essence for quick reference. However, I would encourage the interested reader to explore the full paper in depth.

Learning = Representation + Evaluation + Optimization

There's a huge variety of learning algorithms out there with more being published all the time. However it's important to realize that any supervised learning algorithm can be reduced to three components:

Representation - The hypothesis space of the learner. Equates to the way in which a model is defined. For example, k-nearest neighbor uses an instance-based representation while a decision tree is represented by the set of branches forming the tree.

Evaluation - The objective or scoring function used to distinguish good models from bad models. This might be accuracy, f1 score, squared error, information gain, or something entirely different.

Optimization - The method used to search among the set of possible models allowed by the representation (i.e. the hypothesis space) to find the "best" model (as defined by the evaluation function). Many algorithms with convex search spaces use some form of gradient descent, for example.

The selection of these three categories defines the set of possible models that can be learned, what the "best" model looks like, and how to search for that model.

It's Generalization That Counts

The fundamental goal of machine learning is to build a model that generalizes well. It's easy to build a model that learns the training data exceptionally well but performs miserably on new data. This is why cross-validation is so important - without evaluating a model on data not seen during training, there's no way to tell how good it actually is. Even when using cross-validation properly, it's still possible to screw this up. The most common way is overtuning hyper-parameters. This will appear to be improving the model's performance on test data but could actually harm generalization if taken too far.

Data Alone Is Not Enough

There's no free lunch in machine learning - any learner must have some form of domain knowledge provided to it in order to perform well (this can take many forms and depends on the choice of representation). Even very general assumptions, such as similar examples having similar classes, will often do very well. However, much more specific assumptions will often do much better. This is because induction is capable of turning a small amount of input knowledge into a large amount of output knowledge. One may think of learning in this context as analogous to farming. Just as farmers combine seeds with nutrients to grow crops, learners combine knowledge with data to grow models.

Overfitting Has Many Faces

Overfitting is a common problem in machine learning where a learner tends to model random sampling noise in the data and ends up not generalizing well. Overfitting is usually decomposed into bias and variance. Bias is the tendency to consistently learn the wrong thing. Variance is the tendency to learn random things that don't model the true function. The below graphic, taken from Domingos' paper cited earlier, is a great illustration of these principles.

Source: "A Few Useful Things to Know about Machine Learning", Pedro Domingos

On some problems it is possible to get both low bias and low variance, but not always. Because learning algorithms often involve tradeoffs between bias and variance, it is sometimes true that simpler and less expressive algorithms like linear models may perform better than more powerful algorithms like decision trees. In general, the more powerful a learner is, the more likely it is to exhibit low bias but high variance. In these situations, the tendency to overfit must be counter-balanced with good cross-validation and adding a regularization term to the scoring function.

Intuition Fails In High Dimensions

The second big challenge after overfitting is often called the curse of dimensionality. Roughly speaking, as the number of features in a data set increases it becomes exponentially more difficult to train a model that generalizes well because the data set represents a tiny fraction of the total input space. Even worse, similarity measures fail to effectively discriminate between examples in high dimensions because at very high dimensions everything looks alike. This particular insight highlights the general observation that our intuition fails very badly in high dimensions because our tendency is to draw comparisions to what we know (i.e. three-dimensional space) when such comparisions are not appropriate. The lesson here is to be very cautious about adding features recklessly. Additionally, employing good feature selection techniques is probably very important to building a model that generalizes well.

Feature Engineering Is The Key

The choice of features used for a project is easily the biggest factor in its success or failure. Raw data (even cleaned and scrubbed data) is often not in a form that's amenable to learning. Aside from gathering data, designing features may be the most time-consuming part of any machine learning project. Because feature engineering usually requires some amount of domain knowledge specific to the problem at hand, it is very difficult to automate. This is also why algorithms that are good at incorporating domain knowledge are often the most useful.

One may be tempted to try lots of different features and evaluate their impact by measuring their correlation to the target variable, however the assumption that correlation to the target is a proxy for usefulness is false. The classic example here is the XOR problem. Two booleans measured independently against a target that is the exclusive-or of the dependent variables will have no correlation to the target, however taken together they provide a perfect decision boundary. The lesson is that feature engineering and feature selection typically cannot be automated by software, but may have a disproportionate effect on the project's outcome, so don't neglect them.

More Data Beats A Cleverer Algorithm

When faced with the challenge of overcoming a model performance bottleneck there are two fundamental choices - use a more powerful learning algorithm or get more data (either number of examples or additional features). Although the former is tempting, it turns out that the latter is often more pragmatic. Generally speaking, a simple algorithm with lots of data will beat a much more powerful algorithm with much less data. In some cases it may be possible to use a powerful algorithm with lots and lots of data, but often this problem is still computationally intractable, although there is significant and rapid progress being made in this area.

Another potential benefit of simple algorithms is that the learned model may be interpretable by human beings whereas a complex model is usually very opaque. There is a trade-off, however, in that a simple algorithm may not have sufficient expressiveness or depth in its representation model to learn the true function. Some representations are exponentially more compact than others for certain functions, and a function that may be easy for complex algorithms to learn may be intractable or completely impossible for simple algorithms to learn.

Comparing The "Big Three" Hadoop Distros

John Wittenauer — Sat, 11 Apr 2015 17:06:29 GMT

Anyone that’s paying attention to the “big data” buzz these days is probably aware of just how hot Hadoop is right now. In fact the hype is so great, and the market so untapped, that adoption is expected to increase 25-fold over the next 5 years. Given that, it’s no surprise that large enterprises are looking at Hadoop and going “should we be getting in on this? What’s the deal here?”. Equally unsurprising is the observation - admittedly my own, likely biased observation, but still probably reasonably accurate - that most large enterprises have virtually no clue what Hadoop is or how to use it. Still, let’s say that you work for a company that has decided to “do Hadoop”, regardless of whether or not said company actually knows anything about or has any actual need for Hadoop. Let’s also suppose that said company is not likely to roll their own distro by compiling the various Apache projects from source on GitHub. You’re now at a point where you need to pick a vendor and use their distro (and likely support model, training, and so on). How do you choose?

Although there is no easy and completely obvious answer (if there was then there wouldn’t be any market competition in the first place), there are certainly some good heuristics to draw on. Fortunately, I recently had the opportunity to sit down with teams from each of the major vendors and discuss their offering, and I think some of the general observations that I took away from these meetings may be useful to others in a similar position.

There’s essentially a “big three” in the world of enterprise Hadoop distros – Hortonworks, Cloudera, and MapR. There are technically other distros in the market, but nothing really worth mentioning. If you’re starting an evaluation today, these are the main players. What’s really fascinating though is how different they all are, and how much their business strategies vary, despite all of them literally pushing the exact same core code base.

First let’s get some of the similarities out of the way. All three companies are more or less entirely focused on Hadoop. It’s not just a piece of their business, it's entire revenue stream. All three are mid-size companies (500-1000 employees) with paying customers numbering in the hundreds and an array of partnerships across various industries. They all provide free, downloadable versions of their Hadoop distributions (Cloudera and MapR also ship “premium” distributions for paying customers). They all rely on support contracts for at least part of their customer revenue. They also all employ engineers with committer status to at least some of the projects in the very broad Hadoop “ecosystem”. But that’s where the similarities end.

Hortonworks is both the youngest of the three and the only one that has already gone public. Right out of the gate their strategy is pretty clear – they’re all about open source. Hortonworks pretty much wants to be for Hadoop what Red Hat is for Linux – successful based largely on a robust support model for otherwise entirely free software. Of the three enterprise distros, theirs is the only one that doesn’t include any proprietary software at all. Rather, their approach is to focus on improving the open source projects (or creating new projects if necessary) to fill gaps and improve the overall story for Hadoop. Their business is then based around being the go-to experts to provide support for the platform, because they have the engineers that are writing the code in the first place (of course the other vendors are committing on most of the same projects as well). They’ve also formed close partnerships with major enterprise software vendors like Microsoft and SAP, which helps them get their foot in the door.

If Hortonworks is the pure open-source advocate of the bunch, Cloudera is something of a hybrid. They’re definitely heavily invested in open source and they’ve got quite a few Apache committers themselves, so it’s not like they’re riding on Hortonworks’ coattails. But they’re also championing non-Apache projects like the Impala SQL engine which, while still open source, is owned and controlled by Cloudera. In addition, Cloudera packages their own proprietary tools into their distribution for things like security and cluster management. They consider these tools part of their competitive advantage. Cloudera touts its size and first-mover status (they were the first to create an enterprise distribution by quite a wide margin) along with its support model as differentiators in the marketplace.

MapR seems to be taking an entirely different approach. Their strategy is 100% focused on the product and using it to solve problems. While they do use (most of) core Hadoop, they’re not as concerned with open source as the other two. MapR has by far the most proprietary software in their stack, most notably a custom file system (they do NOT use HDFS, although it sounded like you could if you really wanted to). Another key differentiator is that MapR doesn’t use a “NameNode” architecture, they’re developed their own way to track meta-data about the cluster from within the cluster nodes themselves. Their claim is that all of this results in significantly better performance and flexibility than the other distros. They’re also the only distro that supports Apache Drill, which is another flavor of Hadoop query language.

All three companies seem to have a lot going for them, and one would probably do well using any of their distros. But given that they’re all pushing the same core product, I found it surprising how different they are. Probably the most interesting (and entertaining) part of the presentations was how each company would throw in subtle (and often not-so-subtle) digs at their competitors. Hortonworks’ message was basically “we’re the only pure open-source option, and in the long run, open source wins”. Cloudera’s message was “we’ve been first to market on every front, we’re the biggest, we have the most resources, and we’ve got enterprise-grade IP that sets us apart”. MapR’s message, on the other hand, was essentially “HDFS sucks, NameNodes suck, our competitors suck, our distro is way better at everything” (the “Googleyness” is strong with those guys). MapR even went so far as to call out Doug Cutting, the original creator of Hadoop, for building a half-baked, reverse-engineered file system in HDFS that was a poor man’s version of the “real thing” that Google…I mean MapR built (I swear I’m not making this stuff up).

In the end, after six hours of marketing slides and competitor-shaming, we were left with lots of new information to digest but no clear winner in sight. That, I suppose, is to be expected – we were never going to decide based on a PowerPoint deck. There was, however a clear distinction in philosophy and strategy between the three. That element alone may be enough for some organizations to reach a conclusion. If open source is your thing, then Hortonworks is for you. If it’s all about product, then MapR may be the way to go. If you fit somewhere in the middle then Cloudera may be a good choice. On the other hand, if you just want to have some fun then invite them all in, grab some popcorn, and ask them what they think of each other.

GDELT: Watching The Entire World Unfold

John Wittenauer — Sun, 15 Mar 2015 19:25:59 GMT

I think it's a fairly well-accepted observation that we're in the middle of a cloud computing revolution. The cloud is making businesses more agile, lowering the barrier to entry for start-ups, driving seamless user experiences across devices, and empowering individuals with near-limitless computing resources at their fingertips, completely on-demand. The transformation often feels gradual though. Costs are coming down steadily but incrementally. New services are introduced, but are usually sublte improvements on existing services or simply moving workloads that are already done elsewhere to the cloud. Still, there are times when I'm reminded just how powerful these cloud computing platforms are, and the extent to which they enable grand ideas to be made reality. Such is the case with the GDELT Project, which is the subject of this post.

The Global Database of Events, Language, and Tone - more commonly known as The GDELT Project - is, according to its creators, "the largest, most comprehensive, and highest resolution open database of human society ever created". GDELT, funded by Google Ideas in collaboration with various universities, media entities, and non-profits, is a vast collection of global events dating all the way back to 1979 (with plans to go back even further). But the really fascinating part is the breadth and scale of the data it continues to collect each day. The project monitors news from all across the globe, in a variety of different formats and in over 100 different languages, and aggregates it all together into a single database. We're not just talking about copying news articles though, as that by itself would not be all that impressive. Instead, the platform takes all that raw data and uses sophisticated natural language and data mining algorithms to create structured data that researchers can work with (close to 400 million geo-tagged records to date). Best of all, the data and tools created by the project are completely free!

The output created by GDELT comes in two primary forms - the events database and the global knowledge graph. The events database records physical activities happening around the world. The data is organized into over 300 different categories. The categories are primarily socio-political in nature - things like riots, military actions, diplomatic exchanges, and so on. For example, I downloaded the events file from yesterday (of which there are over 120,000 events recorded) and skimmed through the top of the listing. Even in the first few dozen entries there was a fascinating variety of recordings - everything from news related to Tesla's new battery factory to coverage of Julian Asange's trial to the Pope's upcoming visit to Africa. When considering the mind-boggling scale of the data collected just for a single day, and the fact that it's all pre-structured and formatted for computers to process and draw inferences from, it's easy to see how incredibly useful this data could be.

But the events database is only one part of what GDELT does. There's another product called the global knowledge graph. The knowledge graph uses named entity recognition to turn all of the raw data that GDELT collects into a list of every person, organization, location etc. referenced in that data combined with a list of over 230 themes used to describe the context in which the named entity was mentioned. The end result is a graph of connections that describes not only what is happening around the world, but how it's all connected together.

The creation of these two sources of data - the events database and the global knowledge graph - would, by itself, be a pretty big deal. But the project doesn't stop there. Fueled by an acknowledgement that simply exposing the data in raw form still keeps the barrier to entry for a typical user pretty high (there's over 100 GB of data to host, after all), GDELT also provides a variety of options for consuming and analyzing the data. Aside from downloading the raw data in CSV form, it's also hosted as a Google BigQuery service in the cloud (the project IS sponsored by Google, after all). But perhaps more interesting is something called the GDELT Analysis Service, which is a collection of tools and services that lets you perform high-level visualization and analysis of the GDELT data. Examples include interactive heatmaps, word clouds, timelines, and network visualizations. Users provide inputs to describe what they want to analyze, and the system emails a link to the results when the analysis is complete. It's unfortunately not true real-time interaction yet, but I wouldn't be surprised to see these services evolve in that direction. For certain types of very common high-level analyses, pre-built reports have been created that can be subscribed to and emailed to interested parties on a daily basis. These include things like the Global Daily Trend Reports and the World Leaders Index.

Finally, there's a GDELT Blog which tracks news coverage and innovative usage examples for the data and services provided by GDELT around the world. There are examples of everything from real-time conflict and protest maps, to networks of "influence" between world leaders, to maps of global reactions to significant policy changes such as the passage of the Affordable Care Act. These particular applications represent a small sampling of what's possible with the data that GDELT provides.

If all of this sounds a bit wild, trust me - I'm right there with you. My initial reaction upon learning about this project went something like "how the hell am I just now finding out that something like this exists?". Considering the magnitude and potential impact, I think there's been a startling lack of media coverage about GDELT (which is somewhat ironic, considering what it does - I wonder if it has a named entity for itself?). Maybe GDELT is a largely academic exercise that doesn't mean a whole lot to anyone outside of the relatively small group of people in the world that might be equipped to really do something with it. Even if that's the case, it still feels like it has the potential to be a game-changer for those that CAN really use this type of data. I'm looking forward to seeing the creative new tools and applications that arise from this, and I may even try exploring some of those applications myself in a future blog post. But until then, it's simply a reminder that grand ideas that may have seemed far-fetched even a decade ago are entirely possible today, enabled in part by the seismic movements in fields such as cloud computing.

A Compendium Of Machine Learning Resources

John Wittenauer — Sun, 15 Feb 2015 18:52:30 GMT

If you've spent any amount of time studying machine learning, especially going out on your own and trying to learn independently with no formal training or guidance, there have probably been times where you've come away feeling daunted by the task in front of you. Where to start? What areas do you focus on? Should you work on practical real-world problems or develop a solid theoretical framework first? The list of possible questions goes on and on.

Unfortunately there's no easy answer, and even if there was, it certainly couldn't be boiled down into a blog post. Truth be told, I still stuggle with this myself at times. But along my journey I've complied a ton of useful resources that I'd like to share, with the hope that providing a list like this may save others some time tracking down a lot of this information. I can't really help with the "where to start" question, but I can provide a few options. Here are the most useful courses, books, websites, tools, blogs, articles etc. that I've come across during my study of machine learning.

Online Courses

I've discussed my affinity for MOOCs (massive open online courses) in previous posts, so it should come as no surprise that I've used them quite a bit while studying machine learning. These courses provide high-quality education by world-class experts in a structured and guided format. Most importantly, every one of them is completely free. There are lots of different sites that host these courses and they all vary a bit, but the basic idea is the same - watch video lectures and complete tests and programming assignments on the material covered by the lectures. You can access these at any time and work through the content at your own pace. I've listed a bunch of good ones below. They're organized in roughly the order that I would tackle them, although one could certainly argue for different ordering depending on your objective.

Linear Algebra - Video lectures from an MIT class on linear algebra taught by Gilbert Strang.

Introduction to Probability - The Science of Uncertainty - An introduction to probabilistic models, including random processes and the basic elements of statistical inference.

Intro to Descriptive Statistics - Descriptive statistics will teach you the basic concepts used to describe data. This is a great beginner course for those interested in Data Science, Economics, Psychology, Machine Learning, Sports analytics and just about any other field.

Intro to Inferential Statistics - Inferential statistics allows us to draw conclusions from data that might not be immediately obvious. This course focuses on enhancing your ability to develop hypotheses and use common tests such as t-tests, ANOVA tests, and regression to validate your claims.

Intro to Machine Learning - This is a class that will teach you the end-to-end process of investigating data through a machine learning lens. It will teach you how to extract and identify useful features that best represent your data, a few of the most important machine learning algorithms, and how to evaluate the performance of your machine learning algorithms.

Machine Learning - Learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.

Artificial Intelligence - UC Berkeley's upper division course CS188: Introduction to Artificial Intelligence now available to everyone online.

Data Science Lectures - Video lectures and content from a data science course at Harvard.

Probablistic Graphical Models - In this class, you will learn the basics of the PGM representation and how to construct them, using both human knowledge and machine learning techniques.

Natural Language Processing - This class covers the fundamentals of mathematical and computational models of language, and the application of these models to key problems in natural language processing.

Convex Optimization - A series of Stanford video lectures on convex optimization, hosted on YouTube as a playlist.

Linear Dynamical Systems - An online Stanford class on optimization and linear dynamical systems.

Advanced Optimization and Randomized Methods - More advaned optimization lectures from a CMU class provided online.

Big Data, Large Scale Machine Learning - This course is for people interested in automatically extracting knowledge from large amounts of data. Students should have some prior knowledge or experience with basic machine learning methods.

Neural Networks for Machine Learning - Learn about artificial neural networks and how they're being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We'll emphasize both the basic algorithms and the practical tricks needed to get them to work well.

NYU Deep Learning Course - Deep learning as taught by Yann LeCunn, one of the leading researchers in the field.

Neural Networks - A comprehensive class on neural networks taught by Hugo Larochelle, posted entirely on YouTube as a playlist.

Books/Research Articles

Rather than compile my own list, I'll offer links to several other lists that may be useful. Note that pretty much all of the recommendations here are fairly advanced. I wouldn't suggest going out and buying these books or reading academic research papers as a starting point. Spin through some of the online classes first and jump to these to go really in-depth on a topic. It's also worth mentioning that I haven't read most of these myself yet (although I plan to eventually) so I can't comment on how good or useful they are, but these lists were mostly curated by experts who know what they're talking about.

One piece of advice I found interesting (this comes from Michael Jordan's Reddit AMA) is to read each text 3 times. The first time you can barely follow it, the second time you're starting to get it, and the third time it all seems obvious.

Reddit ML Book List - A list of machine learning books. There are lots of additional recommendations in the discussion below the main list.

Michael Jordan's Reading List - A compiled list of books recommended for research-level students interested in ML. Definitely not introductory material but if you want to go deep, these books will provide a solid foundation.

Deep Learning Reading List - Reading list for all new researchers working at the LISA lab in Montreal (one of the primary labs works on deep learning).

Websites

Below is a collection of various websites/resources that I've discovered that are extremely valuable to any ML practicioner. They're listed in no particular order.

Cross Validated - Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization. It's part of the Stack Exchange network.

Reddit (MachineLearning) - The machine learning sub-reddit.

DataTau - Reddit-like user forum where people link and discuss articles related to machine learning.

MetaOptimize Q+A - Another question & answer site focused on machine learning (similar to Cross Validated but specifically targeted at ML).

Kaggle - Website that hosts machine learning competitions. This is a great place to pick up some practical experience. Pay particular attention to the user forums following a competition as the winners will usually disclose tons of useful advice.

KDNuggets - A comprehensive data mining community.

Metacademy - A wiki-like resource for machine learning.

Machine Learning Google+ Community - The Google+ community for machine learning.

Visualizing.org - A community of creative people making sense of complex issues through data and design.

Visualizing Data - Tons of data visualization examples.

Deeplearning.net - A website dedicated to hosting a variety of resources related to deep learning.

Deep Learning Tutorials - Some introductory tutorials on deep learning.

Reddit AMAs - Geoffrey Hinton, Michael Jordan, Yann LeCun, Yoshua Bengio

Software

I've previously blogged about getting started with data science in Python. I would encourage the reader to start here for advice on tools and software if you're willing to go with Python. Aside from that, there are lots of other possibilities. There's no way to do a comprehensive list of relevant software, the number of options is huge and it's constantly changing. That said, here are a few promising directions to consider investigating.

R - Along with Python, R is the most commonly used tool for statistical analysis and machine learning. The CRAN package library has a vast collection of open-source software for a variety of ML tasks.

Hadoop - Hadoop is the "big data" platform used for most data sets at the terabyte level and beyond. It's a massively distributed data processing framework with a rich ecosystem of open source tools.

Spark - Spark is on the cutting edge of scalable in-memory computing and has libraries for distributed machine learning and graph processing built in.

H2O - Emerging open source project for scalable machine learning.

Deeplearning4j - A scalable deep learning library in Java.

Mloss.org - A vast collection of machine learning open source software.

Data

Data is pretty important for any machine learning task, right? Here are a few sites with lots of freely available public data to start playing around with.

Data.gov - Massive repository of publicly available data published by the U.S. government.

UN Data - More public data sets.

World Bank - More public data sets.

UCI Machine Learning Repository - A curated list of data sets designed for use in various machine learning tasks.

Wikidata - Online repository of structured data.

Conclusion

Feeling overwhelmed yet? It's easy to get lost in a sea of options with a topic so broad and so deep at the same time. I find that it helps to think of the learning process as a journey taking place over a very long time. Don't think of it in terms of having an end in sight, just try to make some progress every day (no matter how small) and it adds up over time.

Do you have any recommendations for other resources that I missed? Feel free to note in the comments!

Machine Learning Exercises In Python, Part 2

John Wittenauer — Sat, 24 Jan 2015 21:40:26 GMT

In part 1 of my series on machine learning in Python, we covered the first part of exercise 1 in Andrew Ng's Machine Learning class. In this post we'll wrap up exercise 1 by completing part 2 of the exercise. If you recall, in part 1 we implemented linear regression to predict the profits of a new food truck based on the population of the city that the truck would be placed in. For part 2 we've got a new task - predict the price that a house will sell for. The difference this time around is we have more than one dependent variable. We're given both the size of the house in square feet, and the number of bedrooms in the house. Can we easily extend our previous code to handle multiple linear regression? Let's find out!

First let's take a look at the data.

path = os.getcwd() + '\data\ex1data2.txt'  
data2 = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])  
data2.head()

	Size	Bedrooms	Price
0	2104	3	399900
1	1600	3	329900
2	2400	3	369000
3	1416	2	232000
4	3000	4	539900

Notice that the scale of the values for each variable is vastly different. A house will typically have 2-5 bedrooms but may have anywhere from hundreds to thousands of square feet. If we were to run our regression algorithm on this data as-is, the "size" variable would be weighted too heavily and would end up dwarfing any contributions from the "number of bedrooms" feature. To fix this, we need to do something called "feature normalization". That is, we need to adjust the scale of the features to level the playing field. One way to do this is by subtracting from each value in a feature the mean of that feature, and then dividing by the standard deviation. Fortunately this is one line of code using pandas.

data2 = (data2 - data2.mean()) / data2.std()  
data2.head()

	Size	Bedrooms	Price
0	0.130010	-0.223675	0.475747
1	-0.504190	-0.223675	-0.084074
2	0.502476	-0.223675	0.228626
3	-0.735723	-1.537767	-0.867025
4	1.257476	1.090417	1.595389

Next, we need to modify our implementation of linear regression from part 1 to handle more than 1 dependent variable. Or do we? Let's take a look at the code for the gradient descent function again.

def gradientDescent(X, y, theta, alpha, iters):  
    temp = np.matrix(np.zeros(theta.shape))
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost

Look closely at the line of code calculating the error term: error = (X * theta.T) - y. It might not be obvious at first but we're using all matrix operations! This is the power of linear algebra at work. This code will work correctly no matter how many variables (columns) are in X, as long as the number of parameters in theta agree. Similarly, it will compute the error term for every row in X as long as the number of rows in y agree. On top of that, it's a very efficient calculation. This is a powerful way to apply ANY expression to a large number of instances at once.

Since both our gradient descent and cost function are using matrix operations, there is in fact no change to the code necessary to handle multiple linear regression. Let's test it out. We first need to perform a few initializations to create the appropriate matrices to pass to our functions.

# add ones column
data2.insert(0, 'Ones', 1)
# set X (training data) and y (target variable)
cols = data2.shape[1]  
X2 = data2.iloc[:,0:cols-1]  
y2 = data2.iloc[:,cols-1:cols]
# convert to matrices and initialize theta
X2 = np.matrix(X2.values)  
y2 = np.matrix(y2.values)  
theta2 = np.matrix(np.array([0,0,0]))

Now we're ready to give it a try. Let's see what happens.

# perform linear regression on the data set
g2, cost2 = gradientDescent(X2, y2, theta2, alpha, iters)
# get the cost (error) of the model
computeCost(X2, y2, g2)

0.13070336960771897

Looks promising! We can also plot the training progress to confirm that the error was in fact decreasing with each iteration of gradient descent.

fig, ax = plt.subplots(figsize=(12,8))  
ax.plot(np.arange(iters), cost2, 'r')  
ax.set_xlabel('Iterations')  
ax.set_ylabel('Cost')  
ax.set_title('Error vs. Training Epoch')

The cost, or error of the solution, dropped with each sucessive iteration until it bottomed out. This is exactly what we would expect to happen. It looks like our algorithm worked.

It's worth noting that we don't HAVE to implement any algorithms from scratch to solve this problem. The great thing about Python is its huge developer community and abundance of open-source software. In the machine learning realm, the top Python library is scikit-learn. Let's see how we could have handled our simple linear regression task from part 1 using scikit-learn's linear regression class.

from sklearn import linear_model  
model = linear_model.LinearRegression()  
model.fit(X, y)

It doesn't get much easier than that. There are lots of parameters to the "fit" method that we could have tweaked depending on how we want the algorithm to function, but the defaults are sensible enough for our problem that I left them alone. Let's try plotting the fitted parameters to see how it compares to our earlier results.

x = np.array(X[:, 1].A1)  
f = model.predict(X).flatten()
fig, ax = plt.subplots(figsize=(12,8))  
ax.plot(x, f, 'r', label='Prediction')  
ax.scatter(data.Population, data.Profit, label='Traning Data')  
ax.legend(loc=2)  
ax.set_xlabel('Population')  
ax.set_ylabel('Profit')  
ax.set_title('Predicted Profit vs. Population Size')

Notice I'm using the "predict" function to get the predicted y values in order to draw the line. This is much easier than trying to do it manually. Scikit-learn has a great API with lots of convenience functions for typical machine learning workflows. We'll explore some of these in more detail in future posts.

That's it for today. In part 3, we'll take a look at exercise 2 and dive into some classification tasks using logistic regression.

Going Reactive: A Primer On Reactive Programming

John Wittenauer — Thu, 01 Jan 2015 16:34:21 GMT

If you've never come across the term "reactive programming" before, or have never heard the word "reactive" mentioned in a software development context, you may be forgiven for being in the dark on the subject. I first learned of the reactive paradigm after coming across something called the Reactive Manifesto. The manifesto, written by several Typesafe engineers in collaboration with a number of prominent figures such as Martin Thompson, Erik Meijer, and Martin Odersky (the creator of scala), describes the design pillars of reactive systems. The first version was published in July 2013 but was updated in September 2014 to reflect feedback from the community and a "simplification" of the scope and message.

So what is the reactive manifesto, and why should you care? Essentially, it describes a system design approach that lends itself well to large-scale, distributed, high-performance applications. To borrow a passage from the document itself:

Only a few years ago a large application had tens of servers, seconds of response time, hours of offline maintenance and gigabytes of data. Today applications are deployed on everything from mobile devices to cloud-based clusters running thousands of multi-core processors. Users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday’s software architectures.

According to the manifesto, there are four key attributes that describe systems ready to meet the challenges listed above:

Responsive - The system responds in a timely manner if at all possible.

Resilient - The system stays responsive in the face of failure.

Elastic - The system stays responsive under varying workload.

Message Driven - The system relies on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages.

I won't go into too much detail on these characteristics beyond the relatively vauge descriptions above, but the interested reader should check out the manifesto itself as well as this blog post for some background on why several attributes were changed from the original version. What I'd like to do instead is discuss how these attributes intersect with other concepts you may already be familiar with, and how these concepts allow us to implement reactive principles.

In a sense, reactive programming is simply an extension of functional programming. After all, what are the core principles of functional programming? Aside from higher-order functions, I would argue it's immutability (data structures are not modified once created), statelessness (functions have no "history" and always give the same output for a given input), and composition (functions are broken down into the smallest possible units and complex operations are built out of simpler functions in a composition chain). These same principles are the key to building distributed systems, which is what reactive programming is all about. In fact, in the Reactive Programming course offered on Coursera, the first few weeks of the course focus exclusively on functional programming. Clearly there is some conceptual overlap here.

(If you're totally unfamiliar with functional programming, this article has a nice introduction with some practical examples in Javascript)

So we've covered how functional and reactive programming are similar, but how do they differ? Why create this new "paradigm" in the first place if it's really just a re-hash of functional programming? I think there are three concepts in reactive programming that make it unique - concurrency, asychrony, and data streams. Reactive is all about managing distributed state by coordinating and orchestrating distributed data streams, and this requires both a functional approach as well as an awareness and understanding of concurrency, asychrony, and data streams. Let's break these down a bit and see how they fit into the big picture.

Concurrency is exactly what it sounds like - running operations in parallel across some system boundary. It could be across several threads running on the same processor, or it could be across several cores on the same processor, or it could be across physical computers over a network. Conceptually they're all related because they share some of the same challenges, namely resource contention and the deadlocks that can arise from an improper design.

Asynchronus computing should be familar to anyone that's done web development before. Ever registered an on-click event in jQuery? That's an asynchronous function. More generally, asynchronous operations are event-driven and non-blocking. Unlike a "traditional" computer program where each instruction is executed sequentially, an asynchronous operation is not guaranteed to be complete before the next operation begins executing. Instead, it fires an event upon completion signaling that it is done executing and its result is available for use.

Closely tied to the concept of asynchronous computing is the notion of a "future" or "promise". Going back to the previous example, if a program is executing sequentially and calls an asynchronous function, what does it get back from the function? It's not the return value because we already said the operation is non-blocking - in other words, the program doesn't wait around for the async function to finish but instead keeps going to the next operation. So if the async function doesn't yield it's final return value, what is it returning? The answer is a "future". This is a really general concept that could be implemented a lot of different ways, but essentially it's a data structure that at some point in the future will contain the return value of the async function.

Once we've got the concept of a future down, one fairly obvious question might be - what happens if there's more than one return value? This leads to the last concept - a data stream. You might see this referred to as an "observable" in some contexts. To grok the idea of a data stream, think about a stream of clicks being recorded on a website. Clicks can occur at any time and there is no "end point" after which there will be no more clicks (i.e. it goes on indefinitely). The collection of click events can be thought of as a data stream. Put another way, data streams are async collections. They are to futures as lists are to unit values in the synchronous world.

Now that we've defined these concepts, let's circle back to reactive programming. Reactive, in my view, is the amalgation of the functional concepts I described earlier with the concepts of concurrency, asychrony, and data streams. Reactive applications use all of this together to provide responsive, resilient, elastic services built on compositions of distributed micro-services interacting through immutable data streams with no shared internal state.

The observant reader may notice that I didn't mention the "message-driven" attribute. In the original version of the manifesto this was "event-driven" instead but was changed in the latest revision. I think the ideas are somewhat interchangable, the difference being that messages are more general in the sense that they may communicate things other than events. Conceptually, messages provide an asynchronous boundary that necessitates a decoupling from implementation details of the message's sender and reciever, which may be why "message-driven" was chosen instead. If we relax the definition of "message" to be understood in the general rather than literal sense, then I believe this attribute can be used to describe the conduit through which events are communicated. Under this assumption, the examples we discussed above may also be considered "message-driven".

I'm obviously just scratching the surface of this topic, and to be perfectly honest I'm still learning it myself! I'll conclude this post with some helpful resources to further pursue learning about reactive programming. I mentioned the Coursera course already, which is great if you really want to get into the nuts and bolts of reactive programming (I personally found it hard to follow but your mileage may vary). The other resources I want to mention are the ReactiveX website and the Akka library. ReactiveX is an implementation of the "observer" (data stream) pattern on most of the major computing platforms. The website contains tons of useful documentation and practical code examples. Akka is a software library written on the JVM that implements the actor model. Akka is used in a lot of large-scale distrubuted computing architectures and is worth investigating if you're serious about understanding and building reactive systems.

That concludes my primer on reactive programming! This is a fascinating subject that is almost certainly going to become more mainstream as time goes on. Hopefully this information gave you a decent introduction to the concepts behind reactive systems and spurred some interest to dive deeper and learn more about reactive programming.

How To Think Like An Entrepreneur

John Wittenauer — Thu, 18 Dec 2014 02:23:42 GMT

Back in early September, I came across an article on hacker news introducing a brand new class at Stanford called "How to Start a Startup". The class, which was hosted by Y Combinator founder Sam Altman, eschewed the traditional lecture model and instead lined up a series of A-list founders and venture capitalists to come and give talks about various aspects of startups and entreprenership. The list of speakers is pretty much a who's who of Silicon Valley - Paul Graham, Peter Thiel, Marc Andreessen, Ben Horowitz, Dustin Moskovitz, Brian Chesky, Aaron Levie. The list goes on. By all accounts it appeared that the class would provide a wealth of information from some of the most successful entrepreneurs in tech to those lucky enough to be in attendance.

But it turned out that the reality is even better, because the whole thing is available online for free.

I'm a huge fan of the online education movement championed by Coursera, edX and others. In fact, I've participated in a number of so-called "MOOCs" (massive open online courses) myself this year and continue to leverage them for my own personal growth and development. But in a way this startup class goes even further. The content in these videos is the type of thing that previously would only be accessible to a select few founders entering Y Combinator's relatively "elite" class. Whereas I could probably find a book or tutorial that covers much of the content avaiable in a typical MOOC if I tried hard enough, it's unlikely that I could simply go out and find first-hand, highly targeted advice from world-class entrepreneurs presented in a cohesive story with almost no effort required on my part. Sam and the various presenters in the class deserve a huge amount of credit for putting this together and having the guts to throw the whole thing out on the web for anyone to use.

Since discovering the class a few months ago, I've been able to keep up as new videos are released and recently reached the conclusion of the lectures. Having now digested around 1,000 minutes of content from a lot of incredibly smart people, my advice would be the following - go watch these videos. Seriously, start watching them right now. Even if you have no interest in doing a startup. Even if you have no interest in ever becoming an entrepreneur. The information is so valuable that even though it's targeted at tech startups that intend to become large companies, there are many elements that can be applied to virtually any business, and at any level. If nothing else, hearing the stories of how these founders got started and became successful should motivate and drive you to pursue your own goals, whatever they may be.

Below are a few of the most interesting takeaways that I got from the class. Honestly there's so much good information that I'm just scratching the surface, but these are the ideas that have stuck with me at the moment.

Startups are hard

Okay, this one is pretty obvious and you probably don't need take this class to figure that out. It's worth mentioning though because startups have been glamorized a bit by the media and movies like "The Social Network", and the reality is much harsher. Many of the speakers reinforced the notion that building a company is the hardest thing they ever did, and these are the successful ones. Sam has a great quote early on about the chances of success with a new venture:

You may still fail. The outcome is something like idea x product x execution x team x luck, where luck is a random number between zero and ten thousand. Literally that much. But if you do really well in the four areas you can control, you have a good chance at at least some amount of success.

Dustin Moskovitz talked about the motivation behind doing a startup and when you should pursue it: "...basically you can't not do it. You're super passionate about this idea, you're the right person to do it, you've gotta make it happen." Basically you have to be really passionate about your idea, and it should be something that you personally want and wish existed but doesn't, so you need to create it.

The best ideas usually sound crazy

One of the more interesting themes in the lectures had to do with ideas for a startup. It seems intuitive that great ideas might come as sort of a "Eureka!" moment where the soon-to-be founder suddenly has a bout of inspiration and rushes off to start building on his/her idea before someone else has the very same idea and acts on it themselves. It turns out that it's often not like this in practice. In fact, some of the best ideas seem downright crazy before they actually work. As Sam says in one of the lectures:

The hardest part about coming up with great ideas, is that the best ideas often look terrible at the beginning. ...That's why it's also not dangerous to tell people your idea. The truly good ideas don't sound like they're worth stealing. You want an idea where you can say, "I know it sounds like a bad idea, but here's specifically why it's actually a great one." You want to sound crazy, but you want to actually be right. And you want an idea that not many other people are working on. And it's okay if it doesn't sound big at first.

A related thread of disucssion is the notion that ideas don't really need to be big in the beginning. There's a general mantra that was repeated often that Sam summed up early on: "Something that we say at YC a lot is that its better to build something that a small number of users love, then a large number of users like". The theory here is that it's easier to develop from a product that a few people love to a product that a lot of people love, than it is to go from something that a lot of people like to something a lot of people love. In other words, the degree of passion that early users have for your product matters more than having a lot of users.

How does this translate to the world outside of startups? Basically, quality and user experience trumps almost everything else, so when building software of any kind, really pay attention to your users and have an almost manical obsession with quality. This isn't just advice for startups - some of the most successful companies in the world live by this principle. As Sam states in one of the lectures:

Apple, Google, and Facebook have each done this extremely well. It's not about the product, it's about everything they do. They move fast and they break things, they're frugal in the right places, but they care about quality everywhere.

Moving fast while still caring a great deal about quality is extremely hard to do. They seem like mutually exclusive things where increasing one naturally leads to a decrease in the other, and maybe in a purely practical sense that still holds true. I think this point is more about the mentality or the culture of the people doing the work. Looked at from that perspective, it's probably very useful to adopt this idealized viewpoint in a lot of different aspects of work.

Competition is for losers

Peter Thiel's lecture on competition and monopolies was one of the more enlightening moments in the series. The basic premise goes like this: small markets that no one is really going after are easier to capture than large markets that already have lots of competition, and sometimes those small markets can expand into large markets with the right strategy. In other words, one should always strive to be a monopoly in a small market and then work on growing the market. As Peter puts it: "You want to be a one of a kind company. You want to be the only player in a small ecosystem".

The basis for this theory stems from his views on the different types of companies that exist, which even he admits is a bit extreme but makes a lot of sense if you really think about it:

I do think the extreme binary view of the world I always articulate is that there are exactly two kinds of businesses in this world, there are businesses that are perfectly competitive and there are businesses that are monopolies. There is shockingly little that is in between. And this dichotomy is not understood very well because people are constantly lying about the nature of the businesses they are in. And in my mind this is not necessarily the most important thing in business, but I think it's the most important business idea that people don't understand, that there are just these two kinds of businesses.

Peter gives a lot of examples to back this theory up, many of which are fairly compelling. He also discusses the notion of value creation and how businesses succeed (or fail) at capturing some of the value they're creating for the world. Very fascinating stuff.

Work on things that matter

My favorite lecture of the whole series was Paul Graham's talk on the counter-intuitiveness of startups and how to have a mindset that generates novel ideas for startups. He brings up a lot of notable ways that startups tend to contradict our basic intuition, which would be useful to know if you're actually doing a startup, but more fascinating to me was his succint advice on coming up with ideas for a startup:

How do you turn your mind into the kind that has startup ideas unconsciously? One, learn about a lot of things that matter. Two, work on problems that interest you. Three, with people you like and or respect.

The advice is simple and straightforward - be interested in interesting problems and pursue solutions to those problems with people you can actually stand to work with. It seems obvious when you write it out but I suspect that many of us cannot honestly claim to live by this mantra in our professional careers all the time.

I'll conclude this post with perhaps my favorite quote from the whole class, also by Paul, when discussing how he came to work on the things he did throughout his career:

I find it very hard to make myself work on boring things even if they're supposed to be important. My life is full of case after case where I worked on things just because I was interested and they turned out to be useful later in some worldly way. Y Combinator itself is something I only did because it seemed interesting. I seem to have some internal compass that helps me out. This is for you not me and I don't know what you have in your heads. Maybe if I think more about it I can come up some heuristics for recognizing genuinely interesting ideas. For now all I can give you is the hopelessly question begging advice. Incidentally this is the actual meaning of the phrase begging the question. The hopelessly question begging advice that if you’re interested in generally interesting problems, gratifying your interest energetically is the best way to prepare yourself for a startup and probably the best way to live.

Machine Learning Exercises In Python, Part 1

John Wittenauer — Fri, 05 Dec 2014 03:12:26 GMT

One of the pivotal moments in my professional development this year came when I discovered Coursera. I'd heard of the "MOOC" phenomenon but had not had the time to dive in and take a class. Earlier this year I finally pulled the trigger and signed up for Andrew Ng's Machine Learning class. I completed the whole thing from start to finish, including all of the programming exercises. The experience opened my eyes to the power of this type of education platform, and I've been hooked ever since.

This blog post will be the first in a series covering the programming exercises from Andrew's class. One aspect of the course that I didn't particularly care for was the use of Octave for assignments. Although Octave/Matlab is a fine platform, most real-world "data science" is done in either R or Python (certainly there are other languages and tools being used, but these two are unquestionably at the top of the list). Since I'm trying to develop my Python skills, I decided to start working through the exercises from scratch in Python. The full source code is available at my IPython repo on Github. You'll also find the data used in these exercises and the original exercise PDFs in sub-folders off the root directory if you're interested.

While I can explain some of the concepts involved in this exercise along the way, it's impossible for me to convey all the information you might need to fully comprehend it. If you're really interested in machine learning but haven't been exposed to it yet, I encourage you to check out the class (it's completely free and there's no commitment whatsoever). With that, let's get started!

Examining The Data

In the first part of exercise 1, we're tasked with implementing simple linear regression to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You'd like to figure out what the expected profit of a new food truck might be given only the population of the city that it would be placed in.

Let's start by examining the data which is in a file called "ex1data1.txt" in the "data" directory of my repository above. First we need to import a few libraries.

import os  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
%matplotlib inline

Now let's get things rolling. We can use pandas to load the data into a data frame and display the first few rows using the "head" function.

path = os.getcwd() + '\data\ex1data1.txt'  
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])  
data.head()

	Population	Profit
0	6.1101	17.5920
1	5.5277	9.1302
2	8.5186	13.6620
3	7.0032	11.8540
4	5.8598	6.8233

Another useful function that pandas provides out-of-the-box is the "describe" function, which calculates some basic statistics on a data set. This is helpful to get a "feel" for the data during the exploratory analysis stage of a project.

data.describe()

	Population	Profit
count	97.000000	97.000000
mean	8.159800	5.839135
std	3.869884	5.510262
min	5.026900	-2.680700
25%	5.707700	1.986900
50%	6.589400	4.562300
75%	8.578100	7.046700
max	22.203000	24.147000

Examining stats about your data can be helpful, but sometimes you need to find ways to visualize it too. Fortunately this data set only has one dependent variable, so we can toss it in a scatter plot to get a better idea of what it looks like. We can use the "plot" function provided by pandas for this, which is really just a wrapper for matplotlib.

data.plot(kind='scatter', x='Population', y='Profit', figsize=(12,8))

It really helps to actually look at what's going on, doesn't it? We can clearly see that there's a cluster of values around cities with smaller populations, and a somewhat linear trend of increasing profit as the size of the city increases. Now let's get to the fun part - implementing a linear regression algorithm in python from scratch!

Implementing Simple Linear Regression

If you're not familiar with linear regression, it's an approach to modeling the relationship between a dependent variable and one or more independent variables (if there's one independent variable then it's called simple linear regression, and if there's more than one independent variable then it's called multiple linear regression). There are lots of different types and variances of linear regression that are outside the scope of this discussion so I won't go into that here, but to put it simply - we're trying to create a linear model of the data X, using some number of parameters theta, that describes the variance of the data such that given a new data point that's not in X, we could accurately predict what the outcome y would be without actually knowing what y is.

In this implementation we're going to use an optimization technique called gradient descent to find the parameters theta. If you're familiar with linear algebra, you may be aware that there's another way to find the optimal parameters for a linear model called the "normal equation" which basically solves the problem at once using a series of matrix calculations. However, the issue with this approach is that it doesn't scale very well for large data sets. In contrast, we can use variants of gradient descent and other optimization methods to scale to data sets of unlimited size, so for machine learning problems this approach is more practical.

Okay, that's enough theory. Let's write some code. The first thing we need is a cost function. The cost function evaluates the quality of our model by calculating the error between our model's prediction for a data point, using the model parameters, and the actual data point. For example, if the population for a given city is 4 and we predicted that it was 7, our error is (7-4)^2 = 3^2 = 9 (assuming an L2 or "least squares" loss function). We do this for each data point in X and sum the result to get the cost. Here's the function:

def computeCost(X, y, theta):  
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

Notice that there are no loops. We're taking advantage of numpy's linear algrebra capabilities to compute the result as a series of matrix operations. This is far more computationally efficient than an unoptimizted "for" loop.

In order to make this cost function work seamlessly with the pandas data frame we created above, we need to do some manipulating. First, we need to insert a column of 1s at the beginning of the data frame in order to make the matrix operations work correctly (I won't go into detail on why this is needed, but it's in the exercise text if you're interested - basically it accounts for the intercept term in the linear equation). Second, we need to separate our data into independent variables X and our dependent variable y.

# append a ones column to the front of the data set
data.insert(0, 'Ones', 1)
# set X (training data) and y (target variable)
cols = data.shape[1]  
X = data.iloc[:,0:cols-1]  
y = data.iloc[:,cols-1:cols]

Finally, we're going to convert our data frames to numpy matrices and instantiate a parameter matirx.

# convert from data frames to numpy matrices
X = np.matrix(X.values)  
y = np.matrix(y.values)  
theta = np.matrix(np.array([0,0]))

One useful trick to remember when debugging matrix operations is to look at the shape of the matrices you're dealing with. It's also helpful to remember when walking through the steps in your head that matrix multiplications look like (i x j) * (j x k) = (i x k), where i, j, and k are the shapes of the relative dimensions of the matrix.

X.shape, theta.shape, y.shape

((97L, 2L), (1L, 2L), (97L, 1L))

Okay, so now we can try out our cost function. Remember the parameters were initialized to 0 so the solution isn't optimal yet, but we can see if it works.

computeCost(X, y, theta)

32.072733877455676

So far so good. Now we need to define a function to perform gradient descent on the parameters theta using the update rules defined in the exercise text. Here's the function for gradient descent:

def gradientDescent(X, y, theta, alpha, iters):  
    temp = np.matrix(np.zeros(theta.shape))
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost

The idea with gradient descent is that for each iteration, we compute the gradient of the error term in order to figure out the appropriate direction to move our parameter vector. In other words, we're calculating the changes to make to our parameters in order to reduce the error, thus bringing our solution closer to the optimal solution (i.e best fit).

This is a fairly complex topic and I could easily devote a whole blog post just to discussing gradient descent. If you're interested in learning more, I would recommend starting with this article and branching out from there.

Once again we're relying on numpy and linear algebra for our solution. You may notice that my implementation is not 100% optimal. In particular, there's a way to get rid of that inner loop and update all of the parameters at once. I'll leave it up to the reader to figure it out for now (I'll cover it in a later post).

Now that we've got a way to evaluate solutions, and a way to find a good solution, it's time to apply this to our data set.

# initialize variables for learning rate and iterations
alpha = 0.01  
iters = 1000
# perform gradient descent to "fit" the model parameters
g, cost = gradientDescent(X, y, theta, alpha, iters)  
g

matrix([[-3.24140214, 1.1272942 ]])

Note that we've initialized a few new variables here. If you look closely at the gradient descent function, it has parameters called alpha and iters. Alpha is the learning rate - it's a factor in the update rule for the parameters that helps determine how quickly the algorithm will converge to the optimal solution. Iters is just the number of iterations. There is no hard and fast rule for how to initialize these parameters and typically some trial-and-error is involved.

We now have a parameter vector descibing what we believe is the optimal linear model for our data set. One quick way to evaluate just how good our regression model is might be to look at the total error of our new solution on the data set:

computeCost(X, y, g)

4.5159555030789118

That's certainly a lot better than 32, but it's not a very intuitive way to look at it. Fortunately we have some other techniques at our disposal.

Viewing The Results

We're now going to use matplotlib to visualize our solution. Remember the scatter plot from before? Let's overlay a line representing our model on top of a scatter plot of the data to see how well it fits. We can use numpy's "linspace" function to create an evenly-spaced series of points within the range of our data, and then "evaluate" those points using our model to see what the expected profit would be. We can then turn it into a line graph and plot it.

x = np.linspace(data.Population.min(), data.Population.max(), 100)  
f = g[0, 0] + (g[0, 1] * x)
fig, ax = plt.subplots(figsize=(12,8))  
ax.plot(x, f, 'r', label='Prediction')  
ax.scatter(data.Population, data.Profit, label='Traning Data')  
ax.legend(loc=2)  
ax.set_xlabel('Population')  
ax.set_ylabel('Profit')  
ax.set_title('Predicted Profit vs. Population Size')

Not bad! Our solution looks like and optimal linear model of the data set. Since the gradient decent function also outputs a vector with the cost at each training iteration, we can plot that as well.

fig, ax = plt.subplots(figsize=(12,8))  
ax.plot(np.arange(iters), cost, 'r')  
ax.set_xlabel('Iterations')  
ax.set_ylabel('Cost')  
ax.set_title('Error vs. Training Epoch')

Notice that the cost always decreases - this is an example of what's called a convex optimization problem. If you were to plot the entire solution space for the problem (i.e. plot the cost as a function of the model parameters for every possible value of the parameters) you would see that it looks like a "bowl" shape with a "basin" representing the optimal solution.

That's all for now! In part 2 we'll finish off the first exercise by extending this example to more than 1 variable. I'll also show how the above solution can be reached by using a popular machine learning library called scikit-learn.

Getting Started With Data Science In Python

John Wittenauer — Mon, 24 Nov 2014 01:52:12 GMT

I've spent a lot of time this year learning about the Python ecosystem. It's not something that I've ever used formally on the job, so I had to sort of go out of my way to get exposed to Python. The reason I was motivated to learn Python stems from my interest in machine learning, and after doing a lot of research on the tools that people use most often for both cutting-edge research and practical applications, I came to the conclusion that R and Python have the largest and most active communities by far. Among those two choices, there seems to be an interesting dicotomy where most people coming into data science from a statistics background are using R and most people coming from a computer science background use Python. Since I fit in the latter camp, and there were some other aspects of Python that appealed to me more than R, I focused my energy on Python. I have to say that while I've since dabbled in R, and what I've seen of it is impressive, I have not been disappointed with Python. The open source community is simply amazing, and the tools you have at your disposal are both cutting-edge and extremely high quality.

In this post, I want to introduce a number of Python libraries that constitute the core of the "data science toolstack" for Python. These libraries are very commonly used for data analysis, data visualization, machine learning, and a wide variety of other tasks. Additionally, I'll talk about where to find these tools and how to get a Python environment set up with these tools available so you can start coding. Hopefully this will give you a basic understanding of some of the most popular Python libraries used for data science, where to go to learn more than the basics, and how to ultimately start using these tools.

NumPy

NumPy is probably the most fundamental library in the Python arsenal for scientific computing and data science. Almost everything else you'll use depends on NumPy either directly or indirectly. At it's core, NumPy is a linear algebra library. But it's also much more than that. Obviously I'm not going to talk too much here about library details as it could take a whole book to cover everything you might need to know. What I'll do instead is point to some useful resources to get started, like this NumPy tutorial. J.R. Johansson has a great IPython notebook series on scientific computing, including a notebook focused specifically on NumPy. Finally, if you're feeling really ambitious, you can check out the source code for the whole thing here (yay for open source).

Matplotlib

Matplotlib is the de facto plotting library for Python. The API takes some getting used to but it's incredibly powerful. Matplotlib can generate almost any type of plot you can think of, everything from simple line charts to animated 3D contour graphs. There are even packages like Seaborn that use matplotlib's rich functionality to generate complex statistical visualizations with almost no effort. To get started, try browsing the matplotlib gallery to see if anything catches your attention. There's also a great tutorial on matplotlib by J.R. Johansson on GitHub. As with NumPy, the full source code for the package is available on GitHub as well.

Pandas

Pandas is an advanced data analysis library for Python. The defining characteristic of pandas is the implementation of a "data frame" object that is very much like R's data frame. In that sense, pandas provides a very R-like approach to manipulating and analyzing data that may otherwise take a lot more effort to accomplish in Python. The pandas documentation is really thorough but may feel overwhelming to newcomers. If you're looking for a good place to start, I would recommend the 10 minutes to pandas tutorial. The pandas source is also available on Github.

Scikit-learn

Scikit-learn is easily the go-to machine learning library in Python. Aside from implementing the largest variety of features, scikit-learn also has a huge community and is heavily battle-tested as it is used in production deployments all over the world. Scikit-learn basically has everything you need to build a full machine learning pipeline - data pre-processing and transforms, clustering, classification, regression, cross-validation, grid search, and so on. The scikit-learn user guide has extensive documentation on the various algorithms provided by the library. As with everything else on this list, scikit-learn is fully open-source and available on GitHub.

IPython

IPython, or "Interactive Python", is an interesting package. Unlike the other software on this list, IPython is not a library per-se as it doesn't have a callable API that implements some particular set of functionality. Instead, IPython provides a rich, interactive shell that supports advanced capabilites not found in the default shell such as in-line data visualization. Perhaps the coolest feature of IPython is the concept of a notebook, which allows one to build a document that mixes code, text, plots, images etc. from a live interactive computing environment. J.R. Johansson's series has a nice introduction that covers IPython, and in fact the entire series itself are IPython notebooks displayed over the web! You can also find the IPython source code available here.

How Do I Get Started?

It's actually really easy to get a Python environment set up with these and many other popular Python libraries thanks to packaged distributions provided by companies in the business of offering Python-related services. The one that I use, and the one I would recommend, is the Anaconda Python distribution. Anaconda is a free package provided by a company called Continuum Analytics. It's available for all of the major platforms and installs Python along with a huge (100+) collection of open source libraries aimed at large-scale data processing, predictive analytics, and scientific computing.

New versions of Anaconda are released periodically that update the various libraries making up the package in a way that ensures everything remains compatible. Continuum even wrote several libraries of its own that are highly useful, including a package management system called "conda" that seems to be significantly better than anything else available. If you'd like to get started using Python for data science, check out the link and give it a try!