7 Steps to Mastering Machine Learning With Python
There are many Python machine learning resources freely available online. Where to begin? How to proceed? Go from zero to Python machine learning hero in 7 steps!
Getting started. Two of the most de-motivational words in the English language. The first step is often the hardest to take, and when given too much choice in terms of direction it can often be debilitating.
Where to begin?
This post aims to take a newcomer from minimal knowledge of machine learning in Python all the way to knowledgeable practitioner in 7 steps, all while using freely available materials and resources along the way. The prime objective of this outline is to help you wade through the numerous free options that are available; there are many, to be sure, but which are the best? Which complement one another? What is the best order in which to use selected resources?
Moving forward, I make the assumption that you are not an expert in:
- Machine learning
- Python
- Any of Python's machine learning, scientific computing, or data analysis libraries
It would probably be helpful to have some basic understanding of one or both of the first 2 topics, but even that won't be necessary; some extra time spent on the earlier steps should help compensate.
Step 1: Basic Python Skills
If we intend to leverage Python in order to perform machine learning, having some base understanding of Python is crucial. Fortunately, due to its widespread popularity as a general purpose programming language, as well as its adoption in both scientific computing and machine learning, coming across beginner's tutorials is not very difficult. Your level of experience, both in Python and in programming in general, determines the best starting point.
First, you need Python installed. Since we will be using scientific computing and machine learning packages at some point, I suggest that you install Anaconda. It is an industrial-strength Python distribution for Linux, OS X, and Windows, complete with the required packages for machine learning, including numpy, scikit-learn, and matplotlib. It also includes IPython Notebook, an interactive environment used in many of our tutorials. I would suggest Python 2.7, for no other reason than it is still the dominant installed version.
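Once the installation finishes, a quick check along the following lines can confirm that the packages named above are importable. This is only a minimal sketch: the helper name `check_packages` is made up for illustration, and the only assumption about the libraries is that scikit-learn is imported under the name `sklearn`.

```python
import importlib


def check_packages(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is not installed in this environment
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    wanted = ["numpy", "pandas", "matplotlib", "sklearn"]
    for name, version in sorted(check_packages(wanted).items()):
        status = version if version is not None else "NOT INSTALLED"
        print("{0}: {1}".format(name, status))
```

If any entry comes back as not installed, rerunning the Anaconda installer or installing the missing package is probably the easiest fix before moving on.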
If you have no knowledge of programming, my suggestion is to start with the following free online book, then move on to the subsequent materials:
- Learn Python the Hard Way, by Zed A. Shaw
If you have experience in programming but not with Python in particular, or if your Python is elementary, I would suggest one or both of the following:
- Google Developers Python Course (highly recommended for visual learners)
- An Introduction to Python for Scientific Computing (from UCSB Engineering), by M. Scott Shell (a great scientific Python intro ~60 pages)
And for those looking for a 30-minute crash course in Python, here you go:
Of course, if you are an experienced Python programmer you will be able to skip this step. Even if so, I suggest keeping the very readable Python documentation handy.
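As a rough self-check for whether this step can be skipped, here is a small sketch touching a few core features that the introductory materials cover: functions, dictionaries, loops, list comprehensions, and sorting. The function name `word_counts` and the sample sentence are invented for illustration. If everything here reads naturally, you are likely ready for the next step.

```python
def word_counts(text):
    """Count how often each lowercased word appears in text."""
    counts = {}
    for word in text.lower().split():
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


sentence = "the quick brown fox jumps over the lazy dog"
counts = word_counts(sentence)

# List comprehension: keep only the words longer than four characters
long_words = [w for w in sentence.split() if len(w) > 4]

# Sort (word, count) pairs by count, highest first, and take the top one
most_common = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[0]

print(long_words)    # ['quick', 'brown', 'jumps']
print(most_common)   # ('the', 2)
```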
Step 2: Foundational Machine Learning Skills
KDnuggets' own Zachary Lipton has pointed out that there is a lot of variation in what people consider a "data scientist." This actually is a reflection of the field of machine learning, since much of what data scientists do involves using machine learning algorithms to varying degrees. Is it necessary to intimately understand kernel methods in order to efficiently create and gain insight from a support vector machine model? Of course not. Like almost anything in life, required depth of theoretical understanding is relative to practical application. Gaining an intimate understanding of machine learning algorithms is beyond the scope of this article, and generally requires substantial amounts of time investment in a more academic setting, or via intense self-study at the very least.
The good news is that you don't need to possess a PhD-level understanding of the theoretical aspects of machine learning in order to practice, in the same manner that not all programmers require a theoretical computer science education in order to be effective coders.
Andrew Ng's Coursera course often gets rave reviews for its content; my suggestion, however, is to browse the course notes compiled by a former student of the online course's previous incarnation. Skip over the notes specific to Octave (a Matlab-like language unrelated to our Python pursuits). Be warned that these are not "official" notes, but they do seem to capture the relevant content from Andrew's course material. Of course, if you have the time and interest, now would be the time to take Andrew Ng's Machine Learning course on Coursera.
There are all sorts of video lectures out there if you prefer them, in addition to Ng's course mentioned above. I'm a fan of Tom Mitchell, so here's a link to his recent lecture videos (along with Maria-Florina Balcan), which I find particularly approachable:
You don't need all of the notes and videos at this point. A valid strategy involves moving forward to particular exercises below, and referencing applicable sections of the above notes and videos when appropriate. For example, when you come across an exercise implementing a regression model below, read the appropriate regression section of Ng's notes and/or view Mitchell's regression videos at that time.
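To give a feel for the kind of regression exercise mentioned above, here is a minimal sketch of one-variable linear regression fitted by ordinary least squares in plain Python. It is not taken from Ng's notes or Mitchell's lectures; the function name `fit_line` and the toy data are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)

    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1 with no noise
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print("slope={0}, intercept={1}".format(slope, intercept))
```

On noise-free data like this the fit recovers the generating line; the regression sections of the notes and videos explain why this formula minimizes the squared error, and how the idea extends to many input variables.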
Step 3: Scientific Python Packages Overview
Alright. We have a handle on Python programming and understand a bit about machine learning. Beyond Python there are a number of open source libraries generally used to facilitate practical machine learning. In general, these are the main so-called scientific Python libraries we put to use when performing elementary machine learning tasks (there is clearly subjectivity in this):
- numpy - mainly useful for its N-dimensional array objects
- pandas - Python data analysis library, including structures such as dataframes
- matplotlib - 2D plotting library producing publication quality figures
- scikit-learn - the machine learning algorithms used for data analysis and data mining tasks
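To make the first two entries a little more concrete, the following sketch shows a numpy array and a pandas dataframe in action. The column names and numbers are made up for illustration, and it assumes both libraries are installed (as they are with Anaconda).

```python
import numpy as np
import pandas as pd

# numpy: an N-dimensional array supports element-wise arithmetic without loops
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
heights_m = heights_cm / 100.0
print(heights_m.shape)    # (4,)
print(heights_m.mean())

# pandas: a dataframe is a table of labeled columns
df = pd.DataFrame({
    "height_m": heights_m,
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# New columns can be computed from existing ones, again element-wise
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Boolean indexing selects the rows that satisfy a condition
tall = df[df["height_m"] > 1.65]
print(tall)
```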
A good approach to learning these is to cover this material:
- Scipy Lecture Notes, by Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras
This pandas tutorial is good, and to the point:
You will see some other packages in the tutorials below, including, for example, Seaborn, which is a data visualization library based on matplotlib. The aforementioned packages are (again, subjectively) the core of a wide array of machine learning tasks in Python, and understanding them should let you adapt to additional and related packages without confusion when they are referenced in the following tutorials.
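As a preview of how the core packages fit together, here is a minimal sketch that generates synthetic data with numpy, fits a scikit-learn `LinearRegression` model, and draws the result with matplotlib. The data, the random seed, and the commented-out output filename are assumptions chosen for illustration; the `Agg` backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little Gaussian noise
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)   # scikit-learn expects a 2D feature array
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

# Fit the model and predict on the same inputs
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print("slope={0:.2f}, intercept={1:.2f}".format(model.coef_[0], model.intercept_))

# Plot the raw points and the fitted line
fig, ax = plt.subplots()
ax.scatter(X.ravel(), y, label="data")
ax.plot(X.ravel(), predictions, color="red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# fig.savefig("fit.png")  # uncomment to write the figure to disk
```

The estimated slope and intercept should land close to 2 and 1, though the exact values depend on the noise.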
Now, on to the good stuff...