Encoding Categorical Features with MultiLabelBinarizer
Transform multi-label format into a binary matrix for multi-label classification.
Image by Author
In the past, you might have converted categorical features into numerical ones using one-hot, label, or ordinal encoders. In those cases, each sample had only one label. But how do you deal with samples that have multiple labels?
In this mini tutorial, you will learn the difference between multi-class and multi-label classification. We will then apply Scikit-learn's MultiLabelBinarizer class to transform an iterable of iterables into a multilabel binary format.
Multi-Class vs. Multi-Label
In machine learning, multi-class classification data consists of more than two classes, and each sample is assigned exactly one label. In multi-label classification, by contrast, each sample can be assigned multiple labels.
Image from Thamme Gowda
We will review two examples to understand both types of classification tasks.
Multi-Class
In the multi-class case, every student record has only one label (Major), and there are more than two classes. Each student has exactly one major: Math, Science, or English.
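As a point of comparison, here is a minimal sketch of how a single-label column like this is usually encoded. The list of majors is made up for illustration, and LabelEncoder is just one of the encoders mentioned in the introduction.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical multi-class column: exactly one major per student.
majors = ["Math", "Science", "English", "Math"]

encoder = LabelEncoder()
major_ids = encoder.fit_transform(majors)

# Classes are stored in sorted order; each sample maps to one integer.
print(encoder.classes_.tolist())  # ['English', 'Math', 'Science']
print(major_ids.tolist())         # [1, 2, 0, 1]
```

One integer per sample is enough here because a student never has more than one major.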
Image by Author
Multi-Label
In the multi-label case, a student can have more than one major. For example, Nisha has selected English, Law, and History as her majors.
As we can also see, the length of the array varies: some students have two majors, and some have three.
A student can have anywhere from 0 to N majors.
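The ragged shape of multi-label data can be sketched in plain Python. Nisha's majors come from the text above; the other two students are hypothetical. The second half builds a fixed-width 0/1 row per student by hand, which is the idea behind the binary encoding used later.

```python
# Nisha's majors come from the text; the other two students are hypothetical.
majors_by_student = {
    "Nisha": ["English", "Law", "History"],
    "Student B": ["Math", "Science"],
    "Student C": [],  # a student with no major yet
}

# The number of labels differs per student, so the rows are ragged.
for student, majors in majors_by_student.items():
    print(student, len(majors))

# One fixed-width 0/1 row per student, one column per known major.
all_majors = sorted({m for majors in majors_by_student.values() for m in majors})
indicator_rows = {
    student: [int(major in majors) for major in all_majors]
    for student, majors in majors_by_student.items()
}

print(all_majors)               # ['English', 'History', 'Law', 'Math', 'Science']
print(indicator_rows["Nisha"])  # [1, 1, 1, 0, 0]
```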
Image by Author
Scikit-Learn MultiLabelBinarizer Examples
We will now use the Scikit-learn MultiLabelBinarizer to convert an iterable of iterables into a binary encoding.
Example 1
In the first example, we transform a list of lists into a binary encoding using the MultiLabelBinarizer class. The fit_transform method learns the set of classes from the data and applies the transformation in one step.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
print(mlb.fit_transform([["Abid", "Matt"], ["Nisha"]]))
We get an array of 1s and 0s.
Output:
[[1 1 0]
 [0 0 1]]
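To see which column belongs to which label, the short sketch below inspects classes_ and then reverses the encoding with inverse_transform. It uses the same input as Example 1; the variable names are chosen here for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
encoded = binarizer.fit_transform([["Abid", "Matt"], ["Nisha"]])

# Columns follow the sorted class order stored in classes_.
print(list(binarizer.classes_))  # ['Abid', 'Matt', 'Nisha']

# inverse_transform maps each binary row back to a tuple of labels.
decoded = binarizer.inverse_transform(encoded)
print(decoded)  # [('Abid', 'Matt'), ('Nisha',)]
```

So the first row, 1 1 0, means "Abid and Matt, but not Nisha".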
Example 2
We can also convert a list of sets to a binary matrix indicating the presence of each class label.
After the transformation, you can view the class labels through the .classes_ attribute.
y = mlb.fit_transform(
    [
        {"Abid", "Matt"},
        {"Nisha", "Abid", "Matt"},
        {"Nisha", "Abid", "Sara", "Matt"},
        {"Matt", "Sara"},
    ]
)
print(list(mlb.classes_))
Output:
['Abid', 'Matt', 'Nisha', 'Sara']
To understand the binary matrix, we will convert the output into a Pandas DataFrame with the classes as column names.
res = pd.DataFrame(y, columns=mlb.classes_)
res
Just like one-hot encoding, it represents labels as 1s and 0s, except that a row can contain more than one 1.
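For readers following along without a notebook, here is a self-contained sketch that rebuilds the table from Example 2 and reads it in two ways: labels per sample (row sums) and samples per label (column sums). The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

label_sets = [
    {"Abid", "Matt"},
    {"Nisha", "Abid", "Matt"},
    {"Nisha", "Abid", "Sara", "Matt"},
    {"Matt", "Sara"},
]

binarizer = MultiLabelBinarizer()
table = pd.DataFrame(binarizer.fit_transform(label_sets), columns=binarizer.classes_)

# Rows are samples; columns are Abid, Matt, Nisha, Sara.
print(table.values.tolist())
# [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1], [0, 1, 0, 1]]

# Number of labels attached to each sample.
print(table.sum(axis=1).tolist())  # [2, 3, 4, 2]

# Number of samples carrying each label.
print(table.sum(axis=0).to_dict())  # {'Abid': 3, 'Matt': 4, 'Nisha': 2, 'Sara': 2}
```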
The MultiLabelBinarizer is commonly used in image and news classification. After the transformation, the binary matrix can be used directly as the target for models such as a random forest or a neural network.
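As a rough sketch of that last step, the snippet below fits a small random forest on a binarized target. The features, tags, and hyperparameters are invented purely for illustration; a real project would need far more data and a proper train/test split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy data: 4 samples, 2 numeric features, and a set of tags each.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
tags = [{"sports"}, {"politics"}, {"sports", "politics"}, set()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # columns: politics, sports

# RandomForestClassifier accepts a 2D binary indicator matrix as the target.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)

# Predictions come back as one 0/1 row per sample, one column per label.
pred = clf.predict(X[:1])
print(pred.shape)  # (1, 2)

# Map the predicted row back to label names.
print(mlb.inverse_transform(pred.astype(int)))
```

Keeping the fitted MultiLabelBinarizer around matters: the same object is needed to turn predictions back into label names.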
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.