Changelog
July 10th, 2016 - I've received a few emails over the last year with suggested additions to this page. Of particular note - thanks to:
Also, over the last year or so, I worked on an internal review panel for Manning Publications' "Introducing Data Science". I've added my public review below.
Finally, proceed with this in mind: this page is somewhat outdated and in serious need of a much more thorough update (and possibly, pruning). Feel free to reach out with suggestions on what should be added or removed!
April 9th, 2014 - Added "Building Machine Learning Systems with Python", "A Programmer's Guide to Data Mining", "Neural Networks and Deep Learning" and Hugo Larochelle's neural networks class.
March 27th, 2014 - Misc updates and edits. Added my thoughts / reviews for "Introduction to the Math of Neural Networks", "Machine Learning in Action", Project Polymath's "Introduction to Higher Math" and Professor Ng's Coursera course. Added a few new courses and resources along with the Udacity and Coursera data science tracks.
July 15th, 2013 - Someone on reddit found this and linked to it in a discussion in /r/machinelearning. I added a few resources that some helpful redditors shared in that thread. Thanks everyone!
This list was initially a small set of resources put together during a session I led on machine learning self-study at Barcamp Chiang Mai 6.
Since then, it's become a place for me to both track my progress and aggregate resources for future study / exploration. Please leave a comment if you see something I've missed or want to share your experience with one of the courses / books / resources listed below.
Note that this is a list of resources targeting beginners without a strong math background who are interested in Machine Learning self-study at the undergraduate level (as opposed to exploring the bleeding edge of ML theory as a PhD student). If you are looking for more concise reference materials - you might check out this list of recommended academic ML / math texts.
The list does assume, however, that you have at least college algebra and pre-calc behind you. If not - you'll definitely need to go sort that out first. Udacity and Coursera both have a number of courses available. Khan Academy and PatrickJMT can be helpful if you get stuck and need to see some concrete examples.
Mathematics
Better Explained
Before you do anything else - consider having a look at the topics covered at Better Explained.
Concise symbolic representations and terse, formal English are an efficient way for mathematicians to communicate with each other. These tools in isolation, however, offer a poor vehicle for teaching mathematical concepts to beginners whose primary math education failed them.
Freshman and sophomore year of undergrad, I often found myself stumbling upon the intuitive elegance of some concept after spending an entire day banging my head against pages and pages of Greek letters and complex notation. I remember asking myself: "Seriously? Is that all there is to it? Why didn't they just SAY THAT".
If that's where you are right now, you might enjoy the "intuition over infallible formal representation" approach taken by Better Explained. Apart from being an absolute lifesaver when you're just getting started - these intuitions and insights will help you see why the formal representations are necessary (and how you can apply them) once you arrive at a concept that's complex enough to merit a more structured, precise method for organizing your thoughts.
Basics
If you need to bootstrap yourself with a full undergraduate calculus series to reach a minimum viable level of mathematical maturity - check out "Calculus Revisited", a home study course created by professor Herbert Gross at MIT in the late 60s and early 70s. Don't let the age of this course and the non-HD black and white videos fool you. I've taken bits and pieces from several videos here as needed to fill in holes in my understanding and I can't recommend this series enough.
Better Explained also has a guide to calculus that's worth checking out.
Mathematical Reasoning / Logic / Proofs / Terminology / Notation, etc.
Linear Algebra
Probability Theory / Statistics
Math for AI / ML / Neural Networks
- Heaton Research - One of our discussion participants, @mccarthy, pinged me about Jeff Heaton yesterday - and I think much of his work is worth mentioning here given Heaton's efforts to teach math for AI / ML with college algebra as the only (strict) prerequisite.
- Introduction to the Math of Neural Networks
- Review (03/27/14) - I worked through the first five chapters. If given the chance, I would gladly buy it again to support Heaton's efforts (I'll have access to future versions), but I can only give it a lukewarm recommendation in its current state. There are a number of very confusing errors. Not just typos (which are to be expected in a rough draft) but cases where he says one thing but upon watching his YouTube series (to clear up the confusion) it became obvious that he'd done something backwards or mixed up the order in the book. I recommend using the book together with the YouTube series for the best result.
Also, if you have any formal background, you may find his notation confusing. He bucks notational conventions pretty hard - possibly in an effort to "simplify" things. I found, however, that the ambiguity this creates just makes things more confusing. The notation he uses for computing node deltas was especially difficult to follow. Note that his target audience is people whose background includes only algebra, so this isn't really a complaint - just an observation. You'll need to "unlearn" some of the notation you know if you want to follow his.
- Heaton Research YouTube Channel
- Artificial Intelligence for Humans
Other ML-Specific Math
- Mathematical Monk - Youtube channel with 200+ old-school Khan Academy style mini-lessons on probability theory and machine learning.
Machine Learning Courses / Books
Enough Theory to Know Which Way is Up...
Applied Machine Learning
- Statistical Learning (Stanford Online) - According to the site: "This is an introductory-level course in supervised learning, with a focus on regression and classification methods... This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics."
- Machine Learning in Action (Python based book)
- Review (03/27/14) - I keep coming back to this book whenever I find that I want or need a deeper understanding of an algorithm's implementation. So far, I've worked through the chapters on K-nearest neighbors, naive Bayesian classification, linear regression and support vector machines. Very satisfied with how all of the content I've covered so far has been presented. Explanations are clear and concise as the author covers formatting / loading data from a file and building each piece of the process from scratch using NumPy and Matplotlib.
- Building Machine Learning Systems with Python - a python based book that covers a wide range of algorithms and applications.
- Probabilistic Programming and Bayesian Methods for Hackers (Python based book - free on github)
- A Programmer's Guide to Data Mining - A free, online python based book that covers basic recommendation systems and classifiers
- Analyzing Big Data with Twitter - A special UC Berkeley iSchool course. Thanks to reader Tim Osterbuhr for the recommendation.
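To give a sense of what "building each piece of the process from scratch" can look like for an algorithm such as K-nearest neighbors (mentioned in the Machine Learning in Action review above), here is a minimal sketch. It is not taken from the book: the function names and the toy data are made up for illustration, and it uses only the Python standard library rather than NumPy so that it runs without installing anything.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, points, labels, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (euclidean_distance(query, p), label) for p, label in zip(points, labels)
    )
    # Keep the labels of the k closest points and count the votes.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up data (an assumption for illustration): two small clusters.
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]

print(knn_classify((0.9, 0.9), points, labels, k=3))  # prints A
print(knn_classify((0.1, 0.2), points, labels, k=3))  # prints B
```

An odd k with two classes avoids tied votes. In practice you would likely reach for a library implementation (scikit-learn, listed below, has one), but writing the distance and voting steps by hand once makes the library version much less of a black box.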
Deeper Down the Rabbit Hole
Competitions
Languages, Libraries and APIs
- Scikit-learn
- Python library containing ML algorithms, visualizations, model selection and more.
- Built on top of numpy, scipy, and matplotlib
- Source
- Docs
- Algorithm Cheat Sheet
- GNU Octave
- High-level language primarily intended for numerical computation. Extensive visualization / GUI features.
- Used in Ng's Machine learning class
- Download
- Docs
- WEKA
- Set of Java libraries, visualization tools and ML algorithms
- Source
- Docs
- JavaScript Libraries - Note - These are standalone libraries of algorithms only, not intended to compete with the full-featured suites of tools above. JavaScript historically has had pretty awful native math support and isn't typically a go-to language for computationally intensive applications. The author of these libraries has worked around most of these weaknesses where possible though, and these libraries are pretty awesome. In any case, they are by far the most promising options at the moment for things like browser plugins, HTML5 mobile apps using local storage as a persistence layer, etc.
NLP Specific
- NLTK
- Python library for natural language processing
- Source
- Docs
- GATE
- Large collection of applications, Java libraries and architectures
- Overview
- Natural
- Sentimental
Computer Vision Specific
- OpenCV
- Computer vision libraries available in C++, C, Python and Java
- Builds available for Linux, Mac, Windows, iOS and Android
Other Resources
Most of these people tweet blog posts, tutorials, links to papers, etc. relevant to machine learning:
@AndrewYNg, @NandoDF, @YhatHQ, @karpathy, @PhilemonBrakel, @vnfrombucharest, @zaxtax, @revodavid, @seanjtaylor, @jakehofman, @drewconway, @medriscoll, @bigdata, @mrogati, @ogrisel, @johnmyleswhite, @dpatil, @zybler, @ChrisDiehl, @peteskomoroch, @DataJunkie, @dwf, @siah, @hmason
Last but not least, don't forget to follow @bigdataborat - lest we begin to take ourselves too seriously. :)
Data Science Resources
I debated about whether or not to add these resources, as they go beyond what is (strictly) needed for a pure machine learning self-study curriculum.
To teach yourself anything beyond the theory and mathematics and do some applied machine learning, however, you will need to experiment using real-world data sets captured in the wild. To do this you'll need to be able to collect, store and refine data, design experiments and interpret the results.
These resources attempt to teach the foundational knowledge needed to develop this kind of skill set.
Data Science Specialization - Coursera (Johns Hopkins University)
I will be starting this track on April 7th. I chose it over the Udacity track because it looks a bit more rigorous and well thought out. I also did the first course in the Udacity track and found it to be a bit too focused on the sponsoring company's stack / toolkit. For now I'd like to focus more on the general concepts. From there, specializing in any particular stack or framework will be easier.
The data science specialization consists of 9 courses and a capstone project. The courses are listed below along with a description taken from the course website. I will be adding reviews and comments as I finish the courses:
- The Data Scientist's Toolbox
- "Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account."
- R Programming
- "In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment and describe generic programming language concepts as they are implemented in a high-level statistical language."
- Getting and Cleaning Data
- "Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation."
- Exploratory Data Analysis
- "After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques."
- Reproducible Research
- "In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others."
- Statistical Inference
- "In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use the skills developed as a roadmap for more complex inferential challenges."
- Regression Models
- "In this course students will learn how to fit regression models, how to interpret coefficients, how to investigate residuals and variability. Students will further learn special cases of regression models including use of dummy variables and multi-variable adjustment. Extensions to generalized linear models, especially considering Poisson and logistic regression will be reviewed."
- Practical Machine Learning
- "Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools. You will also learn to apply these tools to build and evaluate predictors on real data."
- Developing Data Products
- "Students will learn how communicate using statistics and statistical products. Emphasis will be paid to communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products."
- Capstone Project
- "The capstone project class will allow students to create a usable/public data product that can be used to show your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners. The capstone project will be four weeks long, offered in conjunction with the series. The capstone class will be offered thrice yearly."
Data Science Track - Udacity
The first course Udacity offered in this track was the "Introduction to Hadoop and MapReduce" course. To me, this felt too focused on a particular set of tools and not general enough for a beginner like myself. Other courses in the track look better. I may come back to this after the Coursera specialization. Notably, the final three courses listed below are offered via Georgia Tech.
The courses available in the data science track are listed below, along with descriptions taken from the course website.
- Intro to Computer Science
- "Learn key concepts in computer science including how to write your own computer programs. This course teaches Python in the context of building a search engine."
- Intro to Statistics
- "Statistics is about extracting meaning from data. In this class, we will introduce techniques for visualizing relationships in data and systematic techniques for understanding the relationships using mathematics."
- Intro to Data Science
- "What does a data scientist do? In this course, we will survey the main topics in data science so you can understand the skills that are needed to become a data scientist!"
- Data Wrangling with MongoDB
- "Data Scientists spend most of their time cleaning data. In this course, you will learn to convert and manipulate messy data to extract what you need."
- Exploratory Data Analysis
- "Data is everywhere and so much of it is unexplored. Learn how to investigate and summarize data sets using R and eventually create your own analysis."
- Intro to Hadoop and MapReduce
- "In this short course, learn the fundamentals of MapReduce and Apache Hadoop to start making sense of Big Data in the real world!"
- Machine Learning 1 - Supervised Learning
- "In this course, you'll learn how to apply Supervised Learning techniques important for solving a range of data science problems. And for surviving the robot uprising."
- Machine Learning 2 - Unsupervised Learning
- "Ever wonder how Netflix can predict what movies you'll like? Or how Amazon knows what you want to buy before you do? The answer can be found in Unsupervised Learning!"
- Machine Learning 3 - Reinforcement Learning
- "Can we program machines to learn like humans? This course will teach you the algorithms for designing self-learning agents like us!"
Data Analysis Course Via Springboard
Course link
Thanks to reader Tim Osterbuhr for the suggestion. The course looks to be free, with around 310 hours of instruction. The full curriculum is available here. I suspect many readers will be able to skip the "learn to program" and "git" bits, but if you are absolutely new to programming (perhaps with a strong mathematics / statistics background) this course could be especially helpful.
A blurb from the about page:
"The Data Analysis learning path provides a short but intensive introduction to the field of data analysis. The path is divided into three parts. In part 1, we learn general programming practices (software design, version control) and tools (python, sql, unix, and Git). In part 2, we learn R and focus more narrowly on data analysis, studying statistical techniques, machine learning, and presentation of findings. Part 3 includes a choice of elective topics: visualization, social network analysis, and big data (Hadoop and MapReduce)."
Introducing Data Science
Review - 07/10/16 - Over the course of the last year, I had an opportunity to work with the authors of this book as part of its internal review panel. In general, I think it's a fantastic resource with a few caveats:
- This is an industrial, non-academic resource. Expect a few places to be conceptually accurate but technically imprecise
- You will need substantial experience setting up and working with a reasonably complex development environment
- Since this article is primarily about machine learning self-study - I'll add that the chapter covering Machine Learning, especially, was inappropriate (in my opinion) for a book with "introducing" in the title. Both Professor Abu-Mostafa and Professor Ng (courses linked above) do a much better job of introducing this topic, with examples that are approachable by anyone with a basic undergraduate mathematics background. Without the conceptual and technical bootstrapping I received from these courses, it's likely that I would not have been able to follow (at a deep, meaningful level) the more complex examples in this book.
- I discovered several code samples that would not run, or were otherwise erroneous. I believe many of these are fixed - but have not had an opportunity to revisit them with the material that went to press
All in all, my only real gripe with the book is its title. I think "Case Studies in Applied Data Science" or similar would be more appropriate.
In that context, if you'd enjoy being guided through some real-world problems by hackers / startup founders informed by experience in industry - again, this is a fantastic resource. Just don't expect it to be the hand-holding introduction the title implies. You'll need to work through a few bumps in the road, and will need a foundational background in machine learning and/or statistics to get through the more technical parts. If that background is something you lack, you're much better served by taking advantage of one of the academic resources linked above.
If you enjoyed this post, consider following me on twitter.