Getting Started with Scikit-learn for Classification in Machine Learning
This tutorial introduces the scikit-learn library and its core features, then walks through a multiclass classification problem using several different algorithms.
Scikit-learn is one of the most widely used machine learning libraries in Python. Its popularity can be attributed to its simple and consistent API, which is friendly to beginners, along with strong community support and the flexibility to integrate third-party functionality, which make it robust and suitable for production. The library contains machine learning models for classification, regression, and clustering. In this tutorial, we will explore a multiclass classification problem using several of these algorithms. Let’s dive right in and build our scikit-learn models.
Install the Latest Version
pip install scikit-learn
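If you want to confirm which version was installed, a quick optional check (our addition, not part of the original steps) is:
import sklearn
print(sklearn.__version__)  # prints the installed version, e.g. 1.x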
Loading the Dataset
We will use the “Wine” dataset available in the datasets module of scikit-learn. This dataset contains 178 samples spread across 3 classes. It is already pre-processed into numeric feature vectors, so we can use it directly to train our models.
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
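As a quick sanity check (this snippet is our addition), we can confirm the 178 samples, 13 features, and 3 classes mentioned above:
import numpy as np
print(X.shape)       # (178, 13)
print(np.unique(y))  # [0 1 2]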
Creating Training and Testing Data
We will keep 67% of the data for training and the remaining 33% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
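As a check on the split (this print is our addition), the test set should end up with 59 of the 178 samples, which matches the support counts in the reports below:
print(len(X_train), len(X_test))  # 119 59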
Now we will experiment with five models of varying complexity and evaluate their results on our dataset.
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_pred_lr, y_test))
print(classification_report(y_pred_lr, y_test))
Output
Accuracy Score: 0.9830508474576272
precision recall f1-score support
0 1.00 0.95 0.98 21
1 0.96 1.00 0.98 23
2 1.00 1.00 1.00 15
accuracy 0.98 59
macro avg 0.99 0.98 0.98 59
weighted avg 0.98 0.98 0.98 59
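Logistic regression also exposes per-class probabilities, which can be handy for inspecting borderline predictions. A short example (our addition, using the model trained above):
probs = model_lr.predict_proba(X_test[:3])
print(probs.round(3))  # one row per sample, one column per class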
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier(n_neighbors=1)
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)
print("Accuracy Score:", accuracy_score(y_pred_knn, y_test))
print(classification_report(y_pred_knn, y_test))
Output
Accuracy Score: 0.7796610169491526
precision recall f1-score support
0 0.90 0.78 0.84 23
1 0.75 0.82 0.78 22
2 0.67 0.71 0.69 14
accuracy 0.78 59
macro avg 0.77 0.77 0.77 59
weighted avg 0.79 0.78 0.78 59
Upon changing the parameter to ‘n_neighbors=2’, we observe a drop in accuracy. This suggests the classes are separated well enough that considering only the single nearest neighbor works best here; the re-run shown below produced the report that follows.
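For reference, a minimal re-run with n_neighbors=2 (the variable names here are ours, not from the original) looks like this:
model_knn2 = KNeighborsClassifier(n_neighbors=2)
model_knn2.fit(X_train, y_train)
y_pred_knn2 = model_knn2.predict(X_test)
print("Accuracy Score:", accuracy_score(y_pred_knn2, y_test))
print(classification_report(y_pred_knn2, y_test))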
Output
Accuracy Score: 0.6949152542372882
precision recall f1-score support
0 0.90 0.72 0.80 25
1 0.75 0.69 0.72 26
2 0.33 0.62 0.43 8
accuracy 0.69 59
macro avg 0.66 0.68 0.65 59
weighted avg 0.76 0.69 0.72 59
Naive Bayes
from sklearn.naive_bayes import GaussianNB
model_nb = GaussianNB()
model_nb.fit(X_train, y_train)
y_pred_nb = model_nb.predict(X_test)
print("Accuracy Score:", accuracy_score(y_pred_nb, y_test))
print(classification_report(y_pred_nb, y_test))
Output
Accuracy Score: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 20
1 1.00 1.00 1.00 24
2 1.00 1.00 1.00 15
accuracy 1.00 59
macro avg 1.00 1.00 1.00 59
weighted avg 1.00 1.00 1.00 59
Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
model_dtclassifier = DecisionTreeClassifier()
model_dtclassifier.fit(X_train, y_train)
y_pred_dtclassifier = model_dtclassifier.predict(X_test)
print("Accuracy Score:", accuracy_score(y_pred_dtclassifier, y_test))
print(classification_report(y_pred_dtclassifier, y_test))
Output
Accuracy Score: 0.9661016949152542
precision recall f1-score support
0 0.95 0.95 0.95 20
1 1.00 0.96 0.98 25
2 0.93 1.00 0.97 14
accuracy 0.97 59
macro avg 0.96 0.97 0.97 59
weighted avg 0.97 0.97 0.97 59
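A nice property of decision trees is that the learned rules can be printed directly. A quick peek (our addition, reusing the fitted classifier and the dataset's feature names) uses export_text:
from sklearn.tree import export_text
print(export_text(model_dtclassifier, feature_names=list(load_wine().feature_names)))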
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
def get_best_parameters():
    params = {
        "n_estimators": [10, 50, 100],
        # note: "auto" was deprecated and later removed for RandomForestClassifier
        # in recent scikit-learn releases; drop it if your version rejects it
        "max_features": ["auto", "sqrt", "log2"],
        "max_depth": [5, 10, 20, 50],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [2, 4, 6],
        "bootstrap": [True, False],
    }
    model_rfclassifier = RandomForestClassifier(random_state=42)
    rf_randomsearch = RandomizedSearchCV(
        estimator=model_rfclassifier,
        param_distributions=params,
        n_iter=5,
        cv=3,
        verbose=2,
        random_state=42,
    )
    rf_randomsearch.fit(X_train, y_train)
    best_parameters = rf_randomsearch.best_params_
    print("Best Parameters:", best_parameters)
    return best_parameters

parameters_rfclassifier = get_best_parameters()

model_rfclassifier = RandomForestClassifier(
    **parameters_rfclassifier, random_state=42
)
model_rfclassifier.fit(X_train, y_train)
y_pred_rfclassifier = model_rfclassifier.predict(X_test)
print("Accuracy Score:", accuracy_score(y_pred_rfclassifier, y_test))
print(classification_report(y_pred_rfclassifier, y_test))
Output
Best Parameters: {'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'bootstrap': True}
Accuracy Score: 0.9830508474576272
precision recall f1-score support
0 1.00 0.95 0.98 21
1 0.96 1.00 0.98 23
2 1.00 1.00 1.00 15
accuracy 0.98 59
macro avg 0.99 0.98 0.98 59
weighted avg 0.98 0.98 0.98 59
For this algorithm, we performed hyperparameter tuning to obtain the best accuracy. We defined a parameter grid with several candidate values for each hyperparameter, then used RandomizedSearchCV to sample combinations from this space and pick the best one by cross-validation. Finally, we fed the chosen parameters to the classifier and trained the model.
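If the grid of candidates is small, an exhaustive search is an alternative to random sampling. Here is a minimal sketch (our addition, not part of the original tutorial) using GridSearchCV over a reduced grid:
from sklearn.model_selection import GridSearchCV

small_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
}
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)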
Comparison of Models
Models | Accuracy | Observations
Logistic Regression | 98.30% | Achieves high accuracy; the model generalizes well to the test set.
K-Nearest Neighbors | 77.96% | Performs noticeably worse; distance-based methods like KNN are sensitive to feature scale, and no scaling was applied here.
Naive Bayes | 100% | The simple Gaussian assumption suits this small, well-separated dataset, so every test sample is classified correctly.
Decision Tree Classifier | 96.61% | Achieves decent accuracy.
Random Forest Classifier | 98.30% | As an ensemble method it outperforms a single decision tree; with hyperparameter tuning it matches logistic regression.
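To reproduce the accuracy column in one place, a small loop over the already-fitted models (a convenience sketch we add here, using the variable names defined earlier) could look like:
models = {
    "Logistic Regression": model_lr,
    "K-Nearest Neighbors": model_knn,
    "Naive Bayes": model_nb,
    "Decision Tree": model_dtclassifier,
    "Random Forest": model_rfclassifier,
}
for name, model in models.items():
    # score() returns mean accuracy on the given test data
    print(f"{name}: {model.score(X_test, y_test):.4f}")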
Conclusion
In this tutorial, we learned how to get started building and training machine learning models in scikit-learn. We implemented and evaluated a few algorithms to get a basic sense of their performance. You can always adopt more advanced strategies for feature engineering, hyperparameter tuning, or training to improve performance further. To read more about the functionality scikit-learn offers, head over to the official documentation: Introduction to machine learning with scikit-learn and Machine Learning in Python with scikit-learn.
Yesha Shastri is a passionate AI developer and writer pursuing a Master’s in Machine Learning at Université de Montréal. Yesha is keen to explore responsible AI techniques that solve challenges benefiting society and to share her learnings with the community.