A Compendium Of Machine Learning Resources
15th February 2015
If you've spent any amount of time studying machine learning, especially going out on your own and trying to learn independently with no formal training or guidance, there have probably been times where you've come away feeling daunted by the task in front of you. Where to start? What areas do you focus on? Should you work on practical real-world problems or develop a solid theoretical framework first? The list of possible questions goes on and on.
Unfortunately there's no easy answer, and even if there were, it certainly couldn't be boiled down into a blog post. Truth be told, I still struggle with this myself at times. But along my journey I've compiled a ton of useful resources that I'd like to share, with the hope that providing a list like this may save others some time tracking down a lot of this information. I can't really help with the "where to start" question, but I can provide a few options. Here are the most useful courses, books, websites, tools, blogs, articles etc. that I've come across during my study of machine learning.
Online Courses
I've discussed my affinity for MOOCs (massive open online courses) in previous posts, so it should come as no surprise that I've used them quite a bit while studying machine learning. These courses provide high-quality education by world-class experts in a structured and guided format. Most importantly, every one of them is completely free. There are lots of different sites that host these courses and they all vary a bit, but the basic idea is the same - watch video lectures and complete tests and programming assignments on the material covered by the lectures. You can access these at any time and work through the content at your own pace. I've listed a bunch of good ones below. They're organized in roughly the order that I would tackle them, although one could certainly argue for different ordering depending on your objective.
Linear Algebra - Video lectures from an MIT class on linear algebra taught by Gilbert Strang.
Introduction to Probability - The Science of Uncertainty - An introduction to probabilistic models, including random processes and the basic elements of statistical inference.
Intro to Descriptive Statistics - Descriptive statistics will teach you the basic concepts used to describe data. This is a great beginner course for those interested in Data Science, Economics, Psychology, Machine Learning, Sports analytics and just about any other field.
Intro to Inferential Statistics - Inferential statistics allows us to draw conclusions from data that might not be immediately obvious. This course focuses on enhancing your ability to develop hypotheses and use common tests such as t-tests, ANOVA tests, and regression to validate your claims.
Intro to Machine Learning - This is a class that will teach you the end-to-end process of investigating data through a machine learning lens. It will teach you how to extract and identify useful features that best represent your data, a few of the most important machine learning algorithms, and how to evaluate the performance of your machine learning algorithms.
Machine Learning - Learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.
Artificial Intelligence - UC Berkeley's upper division course CS188: Introduction to Artificial Intelligence, now available to everyone online.
Data Science Lectures - Video lectures and content from a data science course at Harvard.
Probabilistic Graphical Models - In this class, you will learn the basics of the PGM representation and how to construct them, using both human knowledge and machine learning techniques.
Natural Language Processing - This class covers the fundamentals of mathematical and computational models of language, and the application of these models to key problems in natural language processing.
Convex Optimization - A series of Stanford video lectures on convex optimization, hosted on YouTube as a playlist.
Linear Dynamical Systems - An online Stanford class on optimization and linear dynamical systems.
Advanced Optimization and Randomized Methods - More advanced optimization lectures from a CMU class provided online.
Big Data, Large Scale Machine Learning - This course is for people interested in automatically extracting knowledge from large amounts of data. Students should have some prior knowledge or experience with basic machine learning methods.
Neural Networks for Machine Learning - Learn about artificial neural networks and how they're being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We'll emphasize both the basic algorithms and the practical tricks needed to get them to work well.
NYU Deep Learning Course - Deep learning as taught by Yann LeCun, one of the leading researchers in the field.
Neural Networks - A comprehensive class on neural networks taught by Hugo Larochelle, posted entirely on YouTube as a playlist.
Books/Research Articles
Rather than compile my own list, I'll offer links to several other lists that may be useful. Note that pretty much all of the recommendations here are fairly advanced. I wouldn't suggest going out and buying these books or reading academic research papers as a starting point. Spin through some of the online classes first and jump to these to go really in-depth on a topic. It's also worth mentioning that I haven't read most of these myself yet (although I plan to eventually) so I can't comment on how good or useful they are, but these lists were mostly curated by experts who know what they're talking about.
One piece of advice I found interesting (this comes from Michael Jordan's Reddit AMA) is to read each text 3 times. The first time you can barely follow it, the second time you're starting to get it, and the third time it all seems obvious.
Reddit ML Book List - A list of machine learning books. There are lots of additional recommendations in the discussion below the main list.
Michael Jordan's Reading List - A compiled list of books recommended for research-level students interested in ML. Definitely not introductory material but if you want to go deep, these books will provide a solid foundation.
Deep Learning Reading List - Reading list for all new researchers working at the LISA lab in Montreal (one of the primary labs working on deep learning).
Websites
Below is a collection of various websites/resources that I've discovered that are extremely valuable to any ML practitioner. They're listed in no particular order.
Cross Validated - Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization. It's part of the Stack Exchange network.
Reddit (MachineLearning) - The machine learning sub-reddit.
DataTau - Reddit-like user forum where people link and discuss articles related to machine learning.
MetaOptimize Q+A - Another question & answer site focused on machine learning (similar to Cross Validated but specifically targeted at ML).
Kaggle - Website that hosts machine learning competitions. This is a great place to pick up some practical experience. Pay particular attention to the user forums following a competition as the winners will usually disclose tons of useful advice.
KDNuggets - A comprehensive data mining community.
Metacademy - A wiki-like resource for machine learning.
Machine Learning Google+ Community - The Google+ community for machine learning.
Visualizing.org - A community of creative people making sense of complex issues through data and design.
Visualizing Data - Tons of data visualization examples.
Deeplearning.net - A website dedicated to hosting a variety of resources related to deep learning.
Deep Learning Tutorials - Some introductory tutorials on deep learning.
Reddit AMAs - Geoffrey Hinton, Michael Jordan, Yann LeCun, Yoshua Bengio
Software
I've previously blogged about getting started with data science in Python. I would encourage the reader to start with that post for advice on tools and software if you're willing to go with Python. Aside from that, there are lots of other possibilities. There's no way to do a comprehensive list of relevant software, since the number of options is huge and it's constantly changing. That said, here are a few promising directions to consider investigating.
R - Along with Python, R is the most commonly used tool for statistical analysis and machine learning. The CRAN package library has a vast collection of open-source software for a variety of ML tasks.
Hadoop - Hadoop is the "big data" platform used for most data sets at the terabyte level and beyond. It's a massively distributed data processing framework with a rich ecosystem of open source tools.
Spark - Spark is on the cutting edge of scalable in-memory computing and has libraries for distributed machine learning and graph processing built in.
H2O - Emerging open source project for scalable machine learning.
Deeplearning4j - A scalable deep learning library in Java.
Mloss.org - A vast collection of machine learning open source software.
Data
Data is pretty important for any machine learning task, right? Here are a few sites with lots of freely available public data to start playing around with.
Data.gov - Massive repository of publicly available data published by the U.S. government.
UN Data - More public data sets.
World Bank - More public data sets.
UCI Machine Learning Repository - A curated list of data sets designed for use in various machine learning tasks.
Wikidata - Online repository of structured data.
Conclusion
Feeling overwhelmed yet? It's easy to get lost in a sea of options with a topic so broad and so deep at the same time. I find that it helps to think of the learning process as a journey taking place over a very long time. Don't think of it in terms of having an end in sight. Just try to make some progress every day, no matter how small, and it adds up over time.
Do you have any recommendations for other resources that I missed? Feel free to note them in the comments!