Getting Started With Data Science In Python
24th November 2014
I've spent a lot of time this year learning about the Python ecosystem. It's not something that I've ever used formally on the job, so I had to go out of my way to get exposed to Python. My motivation to learn Python stems from my interest in machine learning. After doing a lot of research on the tools that people use most often for both cutting-edge research and practical applications, I came to the conclusion that R and Python have the largest and most active communities by far. Between those two choices, there seems to be an interesting dichotomy: most people coming into data science from a statistics background use R, and most people coming from a computer science background use Python. Since I fit in the latter camp, and there were some other aspects of Python that appealed to me more than R, I focused my energy on Python. I have to say that while I've since dabbled in R, and what I've seen of it is impressive, I have not been disappointed with Python. The open source community is simply amazing, and the tools you have at your disposal are both cutting-edge and extremely high quality.
In this post, I want to introduce a number of Python libraries that constitute the core of the "data science toolstack" for Python. These libraries are very commonly used for data analysis, data visualization, machine learning, and a wide variety of other tasks. Additionally, I'll talk about where to find these tools and how to get a Python environment set up with these tools available so you can start coding. Hopefully this will give you a basic understanding of some of the most popular Python libraries used for data science, where to go to learn more than the basics, and how to ultimately start using these tools.
NumPy
NumPy is probably the most fundamental library in the Python arsenal for scientific computing and data science. Almost everything else you'll use depends on NumPy either directly or indirectly. At its core, NumPy provides a fast N-dimensional array type along with the linear algebra routines that operate on it. But it's also much more than that. Obviously I'm not going to talk too much here about library details as it could take a whole book to cover everything you might need to know. What I'll do instead is point to some useful resources to get started, like this NumPy tutorial. J.R. Johansson has a great IPython notebook series on scientific computing, including a notebook focused specifically on NumPy. Finally, if you're feeling really ambitious, you can check out the source code for the whole thing here (yay for open source).
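To give a feel for what working with NumPy looks like, here is a minimal sketch. The matrix and vector values are made up purely for illustration; the calls themselves (`np.array`, `np.linalg.solve`) are standard NumPy.

```python
import numpy as np

# A small 2x2 matrix and a length-2 vector (illustrative values only)
a = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.array([4.0, 9.0])

# Arithmetic is applied element-wise to the whole array at once
doubled = a * 2

# Solve the linear system a @ x = b for x
x = np.linalg.solve(a, b)
print(x)
```

The main thing to notice is that there are no explicit loops: operations are expressed on whole arrays, which is the style most of the other libraries in this post build on.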
Matplotlib
Matplotlib is the de facto plotting library for Python. The API takes some getting used to but it's incredibly powerful. Matplotlib can generate almost any type of plot you can think of, everything from simple line charts to animated 3D contour graphs. There are even packages like Seaborn that use matplotlib's rich functionality to generate complex statistical visualizations with almost no effort. To get started, try browsing the matplotlib gallery to see if anything catches your attention. There's also a great tutorial on matplotlib by J.R. Johansson on GitHub. As with NumPy, the full source code for the package is available on GitHub as well.
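As a rough idea of what the API looks like, here is a small sketch that draws a sine curve and writes it to an image file. The output filename `sine.png` is just a placeholder I picked for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points between 0 and 2*pi
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with a single set of axes and draw one line on it
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line chart")
ax.legend()

# Write the figure to disk (the filename is a placeholder)
fig.savefig("sine.png")
```

In an interactive session you would typically call `plt.show()` instead of saving to a file.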
Pandas
Pandas is an advanced data analysis library for Python. The defining characteristic of pandas is the implementation of a "data frame" object that is very much like R's data frame. In that sense, pandas provides a very R-like approach to manipulating and analyzing data that may otherwise take a lot more effort to accomplish in Python. The pandas documentation is really thorough but may feel overwhelming to newcomers. If you're looking for a good place to start, I would recommend the 10 minutes to pandas tutorial. The pandas source is also available on GitHub.
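Here is a minimal sketch of the data frame idea. The column names and values are invented for the example; the operations shown (boolean filtering and `groupby`) are ordinary pandas.

```python
import pandas as pd

# A tiny data frame built from a dict of columns (made-up data)
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Boston"],
    "sales": [10, 20, 30, 40],
})

# Keep only the rows where sales exceed 15
big = df[df["sales"] > 15]

# Total sales per city
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Filtering and grouping like this would take noticeably more code with plain Python lists and dicts, which is a big part of the appeal.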
Scikit-learn
Scikit-learn is easily the go-to machine learning library in Python. Aside from implementing a very wide variety of algorithms, scikit-learn also has a huge community and is heavily battle-tested, as it is used in production deployments all over the world. Scikit-learn has basically everything you need to build a full machine learning pipeline: data pre-processing and transforms, clustering, classification, regression, cross-validation, grid search, and so on. The scikit-learn user guide has extensive documentation on the various algorithms provided by the library. As with everything else on this list, scikit-learn is fully open-source and available on GitHub.
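To make the "pipeline" idea a bit more concrete, here is a small sketch that chains a pre-processing step and a classifier, then scores the result with cross-validation. It assumes a reasonably recent scikit-learn release, where the cross-validation helpers live in `sklearn.model_selection`; the choice of the bundled iris dataset and of logistic regression is mine, just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Chain feature scaling and a classifier into one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; returns one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Because the scaler is part of the pipeline, it is re-fit on the training portion of each fold, so nothing leaks from the held-out data into the pre-processing step.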
IPython
IPython, or "Interactive Python", is an interesting package. Unlike the other software on this list, IPython is not a library per se, as it doesn't have a callable API that implements some particular set of functionality. Instead, IPython provides a rich, interactive shell that supports advanced capabilities not found in the default shell, such as in-line data visualization. Perhaps the coolest feature of IPython is the concept of a notebook, which allows one to build a document that mixes code, text, plots, images, etc. from a live interactive computing environment. J.R. Johansson's series has a nice introduction that covers IPython, and in fact the entire series itself is a set of IPython notebooks displayed over the web! You can also find the IPython source code available here.
How Do I Get Started?
It's actually really easy to get a Python environment set up with these and many other popular Python libraries thanks to packaged distributions provided by companies in the business of offering Python-related services. The one that I use, and the one I would recommend, is the Anaconda Python distribution. Anaconda is a free package provided by a company called Continuum Analytics. It's available for all of the major platforms and installs Python along with a huge (100+) collection of open source libraries aimed at large-scale data processing, predictive analytics, and scientific computing.
New versions of Anaconda are released periodically that update the various libraries making up the package in a way that ensures everything remains compatible. Continuum even wrote several libraries of its own that are highly useful, including a package management system called "conda" that seems to be significantly better than anything else available. If you'd like to get started using Python for data science, check out the link and give it a try!
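Once a distribution is installed, a quick way to confirm that the libraries covered in this post are available is to try importing each one and print its version. This is only a sketch of my own, not something Anaconda provides; it leans on the `__version__` attribute these packages conventionally expose.

```python
import importlib

# Import names of the libraries discussed in this post
names = ["numpy", "matplotlib", "pandas", "sklearn", "IPython"]

versions = {}
for name in names:
    try:
        module = importlib.import_module(name)
        # Most packages expose __version__; fall back if one does not
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        # None marks a library that is not installed
        versions[name] = None

for name, version in versions.items():
    print(name, version if version is not None else "not installed")
```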