Getting Started with Scikit-learn in 5 Steps
This tutorial offers a comprehensive hands-on walkthrough of machine learning with Scikit-learn. Readers will learn key concepts and techniques including data preprocessing, model training and evaluation, hyperparameter tuning, and building ensemble models for enhanced performance.
Introduction to Scikit-learn
When learning how to use Scikit-learn, we should already have some understanding of the underlying concepts of machine learning, since Scikit-learn is ultimately a practical tool for implementing machine learning principles and related tasks. Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. The algorithms use training data to make predictions or decisions by uncovering patterns and insights. There are three main types of machine learning:
- Supervised learning - Models are trained on labeled data, learning to map inputs to outputs
- Unsupervised learning - Models work to uncover hidden patterns and groupings within unlabeled data
- Reinforcement learning - Models learn by interacting with an environment, receiving rewards and punishments to encourage optimal behavior
As you are undoubtedly aware, machine learning powers many aspects of modern society, and those systems generate enormous amounts of data. As data availability continues to grow, so does the importance of machine learning.
Scikit-learn is a popular open source Python library for machine learning. Some key reasons for its widespread use include:
- Simple and efficient tools for data analysis and modeling
- Accessible to Python programmers, with focus on clarity
- Built on NumPy, SciPy and matplotlib for easier integration
- Wide range of algorithms for tasks like classification, regression, clustering, dimensionality reduction
This tutorial aims to offer a step-by-step walkthrough of using Scikit-learn (mainly for common supervised learning tasks), focusing on getting started with extensive hands-on examples.
Step 1: Getting Started with Scikit-learn
Installation and Setup
In order to install and use Scikit-learn, your system must have a working Python installation. We won't cover Python setup here, and will assume you already have it in place.
Scikit-learn can be installed using pip, Python's package manager:
pip install scikit-learn
This will also install any required dependencies like NumPy and SciPy. Once installed, Scikit-learn can be imported in your Python scripts as follows:
import sklearn
Testing Your Installation
Once installed, you can start a Python interpreter and run the import command above.
Python 3.10.11 (main, May 2 2023, 00:28:57) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sklearn
So long as you do not see any error messages, you are now ready to start using Scikit-learn!
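To confirm exactly which version you have, you can also print the library's version string from the same interpreter session (the version number shown will depend on your installation):
>>> print(sklearn.__version__)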
Loading Sample Datasets
Scikit-learn provides a variety of sample datasets that we can use for testing and experimentation:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
The digits dataset contains images of handwritten digits along with their labels. We can start familiarizing ourselves with Scikit-learn using these sample datasets before moving on to real-world data.
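For instance, we can inspect the shapes of the data arrays to get a feel for what each dataset contains:
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(digits.images.shape)  # (1797, 8, 8): 1797 images of 8x8 pixels
print(digits.target[:10])   # labels for the first ten digit images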
Step 2: Data Preprocessing
Importance of Data Preprocessing
Real-world data is often incomplete, inconsistent, and contains errors. Data preprocessing transforms raw data into a usable format for machine learning, and is an essential step that can impact the performance of downstream models.
Many novice practitioners often overlook proper data preprocessing, instead jumping right into model training. However, low quality data inputs will lead to low quality model outputs, regardless of the sophistication of the algorithms used. Steps like properly handling missing data, detecting and removing outliers, feature encoding, and feature scaling help boost model accuracy.
Data preprocessing accounts for a major portion of the time and effort spent on machine learning projects. The old computer science adage "garbage in, garbage out" very much applies here. High quality data inputs are a prerequisite for high performance machine learning. The data preprocessing steps transform the raw data into a refined training set that allows the machine learning algorithms to effectively uncover predictive patterns and insights.
So in summary, properly preprocessing the data is an indispensable step in any machine learning workflow, and should receive substantial focus and diligent effort.
Loading and Understanding Data
Let's load a sample dataset using Scikit-learn for demonstration:
from sklearn.datasets import load_iris
iris_data = load_iris()
We can explore the features and target values:
print(iris_data.data[0]) # Feature values for first sample
print(iris_data.target[0]) # Target value for first sample
We should understand the meaning of the features and target before proceeding.
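The loaded dataset object also exposes the feature and class names, which helps here:
print(iris_data.feature_names)  # the four measurements, e.g. sepal length (cm)
print(iris_data.target_names)   # the three iris species: setosa, versicolor, virginica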
Data Cleaning
Real data often contains missing, corrupt or outlier values. Scikit-learn provides tools to handle these issues:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(iris_data.data)
The imputer replaces missing values with the column mean, a common (though not the only) imputation strategy; the strategy parameter also accepts options such as the median or most frequent value. This is just one approach to data cleaning.
Feature Scaling
Algorithms like Support Vector Machines (SVMs) and neural networks are sensitive to the scale of input features. Inconsistent feature scales can result in these algorithms giving undue importance to features with larger scales, thereby affecting the model's performance. Therefore, it's essential to normalize or standardize the features to bring them onto a similar scale before training these algorithms.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(iris_data.data)
StandardScaler standardizes features to have mean 0 and variance 1. Other scalers are also available.
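For example, MinMaxScaler rescales each feature to a fixed range (0 to 1 by default) instead of standardizing it; a minimal sketch:
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(iris_data.data)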
Visualizing the Data
We can also visualize the data using matplotlib to gain further insights:
import matplotlib.pyplot as plt
plt.scatter(iris_data.data[:, 0], iris_data.data[:, 1], c=iris_data.target)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
Data visualization serves multiple critical functions in the machine learning workflow. It allows you to spot underlying patterns and trends in the data, identify outliers that may skew model performance, and gain a deeper understanding of the relationships between variables. By visualizing the data beforehand, you can make more informed decisions during the feature selection and model training phases.
Step 3: Model Selection and Training
Overview of Scikit-learn Algorithms
Scikit-learn provides a variety of supervised and unsupervised algorithms:
- Classification: Logistic Regression, SVM, Naive Bayes, Decision Trees, Random Forest
- Regression: Linear Regression, SVR, Decision Trees, Random Forest
- Clustering: k-Means, DBSCAN, Agglomerative Clustering
Along with many others.
Choosing an Algorithm
Choosing the most appropriate machine learning algorithm is vital for building high quality models. The best algorithm depends on a number of key factors:
- The size and type of data available for training. Is it a small or large dataset? What kinds of features does it contain - images, text, numerical?
- The available computing resources. Algorithms differ in their computational complexity. Simple linear models train faster than deep neural networks.
- The specific problem we want to solve. Are we doing classification, regression, clustering, or something more complex?
- Any special requirements like the need for interpretability. Linear models are more interpretable than black-box methods.
- The desired accuracy/performance. Some algorithms simply perform better than others on certain tasks.
For our particular sample problem of categorizing iris flowers, a classification algorithm like Logistic Regression or Support Vector Machine would be most suitable. These can efficiently categorize the flowers based on the provided feature measurements. Much simpler baselines, such as always predicting the most common class, would not provide sufficient accuracy, while very complex methods like deep neural networks would be overkill for this relatively small and simple dataset.
As we train models going forward, it is crucial to always select the most appropriate algorithms for our specific problems at hand, based on considerations such as those outlined above. Reliably choosing suitable algorithms will ensure we develop high quality machine learning systems.
Training a Simple Model
Let's train a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(scaled_data, iris_data.target)
That's it! The model is trained and ready for evaluation and use.
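For instance, we can ask the fitted model for predictions right away (here on the same scaled data, purely for illustration):
predictions = model.predict(scaled_data)
print(predictions[:5])                              # predicted class labels for the first five samples
print(model.score(scaled_data, iris_data.target))   # accuracy on the training data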
Training a More Complex Model
While simple linear models like logistic regression can often provide decent performance, for more complex datasets we may need to leverage more sophisticated algorithms. For example, ensemble methods combine multiple models together, using techniques like bagging and boosting, to improve overall predictive accuracy. As an illustration, we can train a random forest classifier, which aggregates many decision trees:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(scaled_data, iris_data.target)
The random forest can capture non-linear relationships and complex interactions among the features, allowing it to produce more accurate predictions than any single decision tree. We can also employ algorithms like SVM, gradient boosted trees, and neural networks for further performance gains on challenging datasets. The key is to experiment with different algorithms beyond simple linear models to harness their strengths.
Note, however, that whether using a simple or more complex algorithm for model training, the Scikit-learn syntax allows for the same approach, reducing the learning curve dramatically. In fact, almost every task using the library can be expressed with the fit/transform/predict paradigm.
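For example, swapping in a support vector classifier requires no change to the surrounding code beyond the estimator itself:
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(scaled_data, iris_data.target)        # same fit call as before
svm_predictions = svm_model.predict(scaled_data)    # same predict call as before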
Step 4: Model Evaluation
Importance of Evaluation
Evaluating a machine learning model's performance is an absolutely crucial step before final deployment into production. Comprehensively evaluating models builds essential trust that the system will operate reliably once deployed. It also identifies potential areas needing improvement to enhance the model's predictive accuracy and generalization ability. A model may appear highly accurate on the training data it was fit on, but still fail miserably on real-world data. This highlights the critical need to test models on held-out test sets and new data, not just the training data.
We must simulate how the model will perform once deployed. Rigorously evaluating models also provides insights into possible overfitting, where a model memorizes patterns in the training data but fails to learn generalizable relationships useful for out-of-sample prediction. Detecting overfitting prompts appropriate countermeasures like regularization and cross-validation. Evaluation further allows comparing multiple candidate models to select the best performing option. Models that do not provide sufficient lift over a simple benchmark model should potentially be re-engineered or replaced entirely.
In summary, comprehensively evaluating machine learning models is indispensable for ensuring they are dependable and adding value. It is not merely an optional analytic exercise, but an integral part of the model development workflow that enables deploying truly effective systems. So machine learning practitioners should devote substantial effort towards properly evaluating their models across relevant performance metrics on representative test sets before even considering deployment.
Train/Test Split
We split the data to evaluate model performance on new data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_data, iris_data.target)
By convention, X refers to the features and y refers to the target variable. Note that y_test is not the same as iris_data.target: it holds only the target values for the randomly held-out test portion of the data (25% of the samples by default).
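With the split in place, the model should be fit (or refit) on the training portion only, so that the test set remains genuinely unseen when we evaluate it:
model.fit(X_train, y_train)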
Evaluation Metrics
For classification, key metrics include:
- Accuracy: Overall proportion of correct predictions
- Precision: Proportion of positive predictions that are actual positives
- Recall: Proportion of actual positives that are correctly predicted as positive
These can be computed via Scikit-learn's classification report:
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
This gives us insight into model performance.
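A confusion matrix is another useful view of the same predictions, showing exactly which classes get mistaken for which (just one more of the metrics Scikit-learn offers):
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, model.predict(X_test)))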
Step 5: Improving Performance
Hyperparameter Tuning
Hyperparameters are model configuration settings. Tuning them can improve performance:
from sklearn.model_selection import GridSearchCV
params = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(model, params, cv=5)
grid_search.fit(scaled_data, iris_data.target)
This searches over several values of the regularization parameter C, using 5-fold cross-validation, to find the setting that gives the best accuracy.
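Once the search has finished, the best hyperparameter values and the corresponding cross-validated score can be read off the fitted search object:
print(grid_search.best_params_)  # e.g. {'C': 1}
print(grid_search.best_score_)   # mean cross-validated accuracy of the best setting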
Cross-Validation
Cross-validation provides more reliable evaluation of hyperparameters:
from sklearn.model_selection import cross_val_score
cross_val_scores = cross_val_score(model, scaled_data, iris_data.target, cv=5)
It splits the data into 5 folds and evaluates performance on each.
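The result is an array of five scores, typically summarized by its mean:
print(cross_val_scores)          # score on each of the 5 folds
print(cross_val_scores.mean())   # average cross-validated accuracy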
Ensemble Methods
Combining multiple models can enhance performance. To demonstrate this, let's first train a random forest model:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(scaled_data, iris_data.target)
Now we can proceed to create an ensemble model using both our logistic regression and random forest models:
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[('lr', model), ('rf', random_forest)])
voting_clf.fit(scaled_data, iris_data.target)
This ensemble model combines our previously trained logistic regression model, referred to as lr, with the newly defined random forest model, referred to as rf.
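We can check whether the ensemble actually helps by scoring it the same way as any other estimator, for example with cross-validation:
from sklearn.model_selection import cross_val_score
print(cross_val_score(voting_clf, scaled_data, iris_data.target, cv=5).mean())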
Model Stacking and Blending
More advanced ensemble techniques like stacking and blending build a meta-model to combine multiple base models. After training base models separately, a meta-model learns how best to combine them for optimal performance. This provides more flexibility than simple averaging or voting ensembles. The meta-learner can learn which models work best on different data segments. Stacking and blending ensembles with diverse base models often achieve state-of-the-art results across many machine learning tasks.
# Train base models
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
rf = RandomForestClassifier()
svc = SVC()
rf.fit(X_train, y_train)
svc.fit(X_train, y_train)
# Make predictions to train meta-model
rf_predictions = rf.predict(X_test)
svc_predictions = svc.predict(X_test)
# Create dataset for meta-model
blender = np.vstack((rf_predictions, svc_predictions)).T
blender_target = y_test
# Fit meta-model on predictions
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(blender, blender_target)
# Make final predictions
final_predictions = gb.predict(blender)
This trains a random forest and an SVM model separately, then trains a gradient boosted tree on their predictions to produce the final output. The key steps are generating predictions from the base models, then using those predictions as input features to train the meta-model. Note that, for simplicity, the meta-model here is fit on predictions made for the test set, which leaks test data into training; in practice you would generate these meta-features on a separate validation set or via out-of-fold predictions so the final test set remains unseen.
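Scikit-learn also ships a built-in StackingClassifier that generates out-of-fold meta-features internally, avoiding the manual bookkeeping above; a minimal sketch using the same train/test split:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacked = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression(),
    cv=5)
stacked.fit(X_train, y_train)           # base models and meta-model fit together
print(stacked.score(X_test, y_test))    # accuracy on the held-out test set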
Moving Forward
Scikit-learn provides an extensive toolkit for machine learning with Python. In this tutorial, we covered the complete machine learning workflow using Scikit-learn, from installing the library and understanding its capabilities, to loading data, training models, evaluating model performance, tuning hyperparameters, and building ensembles. The library has become hugely popular due to its well-designed API, breadth of algorithms, and integration with the PyData stack. Sklearn empowers users to quickly and efficiently build models and generate predictions without getting bogged down in implementation details. With this solid foundation, you can now practically apply machine learning to real-world problems using Scikit-learn. The next step entails identifying issues that are amenable to ML techniques, and leveraging the skills from this tutorial to extract value.
Of course, there is always more to learn about Scikit-learn specifically and machine learning in general. The library implements algorithms ranging from neural network models (multi-layer perceptrons) to manifold learning and gradient boosting, all exposed through the same estimator API. You can always extend your competency by studying the theoretical workings of these methods. Scikit-learn also integrates with other Python libraries like Pandas for added data manipulation capabilities. Furthermore, a product like SageMaker provides a production platform for operationalizing Scikit-learn models at scale.
This tutorial is just the starting point — Scikit-learn is a versatile toolkit that will continue to serve your modeling needs as you take on more advanced challenges. The key is to continue practicing and honing your skills through hands-on projects. Practical experience with the full modeling lifecycle is the best teacher. With diligence and creativity, Scikit-learn provides the tools to unlock deep insights from all kinds of data.
Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Editor-in-Chief of KDnuggets, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.