7 Steps to Mastering Basic Machine Learning with Python — 2019 Edition
With a new year upon us, I thought it would be a good time to revisit the concept and put together a new learning path for mastering machine learning with Python. With these 7 steps you can master basic machine learning with Python!
There is an awful lot of freely-available material out there for folks who are interested in a crash course in machine learning with Python.
Some time ago I wrote 7 Steps to Mastering Machine Learning With Python and 7 More Steps to Mastering Machine Learning With Python, a pair of posts which attempted to aggregate and organize some of this available quality material into just such a crash course. However, these posts are getting stale, having been around for a few years at this point. With a new year upon us, I thought it would be a good time to revisit the concept and put together a new learning path for mastering machine learning with Python.
This time around, we will split the path up into 3 posts, one each for basic, intermediate and advanced topics. Let's make sure we view these terms in a relative sense, however, and not expect to be research-caliber machine learning engineers after getting through the (eventual) advanced post. The learning path is aimed at those with some understanding of programming, computer science concepts, and/or machine learning in an abstract sense, who are wanting to be able to use the implementations of machine learning algorithms of the prevalent Python libraries to build their own machine learning models.
This first post will start from zero, and get readers to the point of having set up an environment, gained an understanding of Python, and tried out a variety of algorithms for different scenarios. We will leverage the existing tutorials, videos, and works of a variety of folks, so the thanks for anything included herein should be directed solely at them.
Instead of having a high number of resources for each topic step (say, clustering), I have tried to select a quality tutorial or two, along with an accessible video preliminarily describing the underlying theory, math, or intuition of the given topic.
Fear not if the steps seem mostly aimed at machine learning algorithms, as along the way you will also come across additional important concepts, such as data preprocessing, loss metrics, data visualization, and much more.
So grab a cup of your favorite beverage and settle in for the first of three in the series, and start mastering basic machine learning with Python in these 7 steps.
Step 1. Mastering Python Basics
I looked for some updated materials for this section, beyond those I pointed out in previous iterations, both for the sake of change and for the sake of keeping up with recent versions of Python.
This GitHub repository by Jerry Pussinen, containing "Jupyter notebooks for teaching/learning Python 3," seems to have a decent overview of Python, which those who have an understanding of programming, or who are motivated, could get through within a few hours. You will need Python 3.5+ installed, as well as Jupyter notebook.
As you will be needing a number of Python's more popular scientific libraries as we progress, I recommend using the Anaconda distribution, which you can download here (choose whichever Python 3.X version is the latest, not Python 2.X), instead of installing components separately. Just launch the installer, and when it's done you will have Python, Jupyter notebook, and everything else you will need moving forward.
Step 2. Understanding the Python Scientific Computing Environment
So, you have Python and the scientific computing stack installed and ready to go. But why?
Before going too much further, it's a good idea to understand what the scientific computing stack is, what its most prominent and important components are, and how they will be used in a machine learning environment.
This article from Dataquest, aptly titled Jupyter Notebook for Beginners: A Tutorial, dives into why we are using Jupyter notebooks at all, and introduces some of the most important Python libraries you will encounter along this path, namely Pandas, Numpy, and Matplotlib.
The tutorials does not cover Scikit-learn, one of the main engines of the actual machine learning process in the Python ecosystem, which contains implementations of dozens of algorithms for you to implement in your own projects. However, the introductory article An introduction to machine learning with Scikit-learn, directly from the maintainers of Scikit-learn will give you an overview of its basics in 5 minutes.
As an exercise left to the reader, I would suggest locating and becoming familiar the contents of the documentation for Pandas, Numpy, Matplotlib, and Scikit-learn, and would keep the links handy as references moving forward. At any rate, make sure you are comfortable with the basics of these 4 tools specifically, as they are well-used in basic Python machine learning.
Step 3. Classification
Classification is one of the main methods of supervised learning, and the manner in which prediction is carried out as relates to data with class labels. Classification involves finding a model which describes data classes, which can then be used to classify instances of unknown data. The concept of training data versus testing data is of integral importance to classification. Popular classification algorithms for model building, and manners of presenting classifier models, include (but are not limited to) decision trees, logistic regression, support vector machines, and neural networks.
First, watch MIT professor John Guttag's lecture on classification.
Then have a look at the following tutorials, each of which covers an elementary machine learning classification algorithm (how exciting, your first machine learning algorithm!).
Susan Li provides a detailed overview of implementing the most basic classifier, logistic regression, in Building A Logistic Regression in Python, Step by Step.
Once you have completed Susan's tutorial, follow Russell Brown's concise Creating and Visualizing Decision Trees with Python.
As a bonus, and since we are also learning how to use Jupyter notebooks at the same time, have a look at Interactive Visualization of Decision Trees with Jupyter Widgets by Dafni Sidiropoulou Velidou to gain exposure to some of the benefits of using Jupyter notebooks for visualizing and tweaking your models.
Step 4. Regression
Regression is similar to classification, in that it is another dominant form of supervised learning and is useful for predictive analysis. They differ in that classification is used for predictions of data with distinct finite classes, while regression is used for predicting continuous numeric data. As a form of supervised learning, training/testing data is an important concept in regression as well.
First, watch CMU professor Tom Mitchell's lecture on regression.
Then have a look at, and follow along with, Adi Bronshtein's tutorial titled Simple and Multiple Linear Regression in Python.
Step 5. Clustering
Clustering is used for analyzing data which does not include pre-labeled classes. Data instances are grouped together using the concept of maximizing intraclass similarity and minimizing the similarity between differing classes. This translates to the clustering algorithm identifying and grouping instances which are very similar, as opposed to ungrouped instances which are much less-similar to one another. As clustering does not require the pre-labeling of classes, it is a form of unsupervised learning.
First, watch this lecture by MIT's John Guttag.
k-means clustering is perhaps the most well-known example of a clustering algorithm, but is not the only one. Different clustering schemes exist, including hierarchical clustering, fuzzy clustering, and density clustering, as do different takes on centroid-style clustering (the family to which k-means belongs). For more on this, read Jake Huneycutt's An Introduction to Clustering Algorithms in Python. Then read Michael J. Garbade's Understanding K-means Clustering in Machine Learning and implement k-means for yourself.
Then take a look at Gabriel Pierobon's DBSCAN clustering for data shapes k-means can’t handle well (in Python) to implement a density-based clustering model.
Step 6. More Classification
Now that we have sampled around, let's switch gears back to classification and check out a more complex algorithm.
Watch CMU's Maria Florina Balcan's discuss support vector machines (SVMs) in this lecture video.
Then read Aakash Tandel's Support Vector Machines — A Brief Overview, a high-level treatment of SVMs. Follow this up with Support Vector Machine vs Logistic Regression by Georgios Drakos.
Finally, round out your SVM understanding by reading In-Depth: Support Vector Machines by Jake VanderPlas, an excerpt from his highly recommended Python Data Science Handbook.
Step 7. Ensemble Methods
Lastly, let's learn a bit about ensemble methods.
Start by watching this video lecture by Peter Bloem of Vrije University.
Then read these 2 mostly explanatory articles:
- Gradient Boosting from scratch, by Prince Grover
- Random Forest Simple Explanation, by Will Koehrsen
Finally, follow these tutorials to try your hand at ensemble methods.
- Introduction to Python Ensembles, by Sebastian Flennerhag
- CatBoost vs. Light GBM vs. XGBoost, by Alvira Swalin
- Using XGBoost in Python, by Manish Pathak
Hopefully you have benefited from these 7 steps to mastering basic machine learning with Python. Join us next time when we will move on to some more intermediate topics.
Related:
- 7 Steps to Mastering Machine Learning With Python
- 7 More Steps to Mastering Machine Learning With Python
- 7 Steps to Mastering Data Preparation with Python