Pandas: How to One-Hot Encode Data

In this article, we will explore how to use Pandas to one-hot encode categorical data.

By Muhammad Arham, Machine Learning Engineer at Vyro on July 24, 2023 in Data Science

What is One-Hot Encoding

One-hot encoding is a data preprocessing step to convert categorical values into compatible numerical representations.

categorical_column	bool_col	col_1	col_2	label
value_A	True	9	4	0
value_B	False	7	2	0
value_D	True	9	5	0
value_D	False	8	3	1
value_D	False	9	0	1
value_D	False	5	4	1
value_B	True	8	1	1
value_D	True	6	6	1
value_C	True	0	5	0
value_D	True	1	8	0

In this dummy dataset, for example, the categorical column holds several string values. Many machine learning algorithms require numerical input, so we need a way to convert this attribute into a compatible form. One-hot encoding does this by breaking the categorical column into multiple binary-valued columns, one per category.

How to use Pandas Library for One-Hot Encoding

First, read the .csv file (or any other supported file) into a Pandas data frame.

import pandas as pd
df = pd.read_csv("data.csv")

To check the unique values and better understand our data, we can use the following Pandas methods.

df['categorical_column'].nunique()
df['categorical_column'].unique()

For this dummy data, the functions return the following output:

>>> 4
>>> array(['value_A', 'value_B', 'value_D', 'value_C'], dtype=object)
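To try the inspection step without a data.csv file on disk, the sketch below loads the same dummy data from an in-memory string. The rows are read off the tables in this article, and the `csv_text` variable is only a stand-in for the real file. Note that unique() lists values in order of first appearance.

```python
import io

import pandas as pd

# In-memory stand-in for data.csv; rows are read off the article's tables.
csv_text = """categorical_column,bool_col,col_1,col_2,label
value_A,True,9,4,0
value_B,False,7,2,0
value_D,True,9,5,0
value_D,False,8,3,1
value_D,False,9,0,1
value_D,False,5,4,1
value_B,True,8,1,1
value_D,True,6,6,1
value_C,True,0,5,0
value_D,True,1,8,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of distinct categories, then the categories themselves.
n_unique = df["categorical_column"].nunique()
values = df["categorical_column"].unique()

print(n_unique)
print(list(values))
```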

We can break the categorical column down into multiple columns. For this, we use the pandas.get_dummies() function. It takes the following arguments, among others:

Argument	Description
data: array-like, Series, or DataFrame	The original pandas data frame object
columns: list-like, default None	List of categorical columns to one-hot encode
drop_first: bool, default False	Drops the first level of each encoded column, leaving k-1 columns for k categories
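As a rough illustration of how these arguments interact, here is a minimal sketch on a small made-up frame. The `color`, `size`, and `price` columns are hypothetical and not part of the article's dataset. The `dtype=int` argument is added so the result holds 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the arguments.
toy = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["S", "M", "S", "L"],
    "price": [1, 2, 3, 4],
})

# Only the columns named in `columns` are encoded; the rest pass through.
only_color = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True removes the first (alphabetically sorted) level, here "blue".
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(list(only_color.columns))
print(list(reduced.columns))
```

Untouched columns stay at the front in their original order, and the new indicator columns are appended after them.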

To better understand the function, let us work on one-hot encoding the dummy dataset.

One-Hot Encoding the Categorical Columns

We use the get_dummies method and pass the original data frame as data input. In columns, we pass a list containing only the categorical_column header.

df_encoded = pd.get_dummies(df, columns=['categorical_column'])

The command above drops categorical_column and creates a new column for each unique value. The single categorical column is thus converted into 4 new columns, where exactly one of the 4 holds a 1 in each row and the other 3 hold 0. This is why it is called one-hot encoding. Note that in pandas 2.0 and later, get_dummies returns True/False values by default; pass dtype=int to get the 0/1 values shown here.

categorical_column_value_A	categorical_column_value_B	categorical_column_value_C	categorical_column_value_D
1	0	0	0
0	1	0	0
0	0	0	1
0	0	0	1
0	0	0	1
0	0	0	1
0	1	0	0
0	0	0	1
0	0	1	0
0	0	0	1
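The "exactly one 1 per row" property can be checked directly. The sketch below rebuilds only the categorical column from the table above, encodes it, and confirms that every row sums to 1. Recovering the original labels with idxmax is an illustrative extra, not a step from the article.

```python
import pandas as pd

# The categorical values, in the row order of the encoded table above.
df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_D",
        "value_D", "value_B", "value_D", "value_C", "value_D",
    ]
})

# dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# One-hot property: each row contains exactly one 1.
row_sums = encoded.sum(axis=1)
print((row_sums == 1).all())

# The "hot" column name identifies the original category of each row.
recovered = encoded.idxmax(axis=1).str.replace(
    "categorical_column_", "", regex=False
)
print(recovered.tolist() == df["categorical_column"].tolist())
```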

A problem arises when we one-hot encode the boolean column: it is also split into two new columns.

One-Hot Encoding Binary Columns

df_encoded = pd.get_dummies(df, columns=['bool_col'])

bool_col_False	bool_col_True
0	1
1	0
0	1
1	0

This adds an unnecessary column: a single column in which True is encoded as 1 and False as 0 carries the same information. To get that, we use the drop_first argument.

df_encoded = pd.get_dummies(df, columns=['bool_col'], drop_first=True)

bool_col_True
1
0
1
0
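A small sketch of the drop_first behaviour on the first four boolean values from the table above. The comparison with astype(int) is an alternative not used in this article, shown only to confirm that the single remaining column is the boolean cast to 0/1.

```python
import pandas as pd

# First four values of bool_col, as in the two-column table above.
df = pd.DataFrame({"bool_col": [True, False, True, False]})

# Without drop_first: two complementary columns.
both = pd.get_dummies(df, columns=["bool_col"], dtype=int)

# With drop_first: the first level (False) is dropped, leaving one column.
single = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

print(list(both.columns))
print(list(single.columns))

# Alternative (not from the article): cast the boolean column directly.
as_int = df["bool_col"].astype(int)
print(single["bool_col_True"].tolist() == as_int.tolist())
```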

Conclusion

The dummy dataset is now one-hot encoded, and the final result looks like this:

col_1	col_2	bool	A	B	C	D	label
9	4	1	1	0	0	0	0
7	2	0	0	1	0	0	0
9	5	1	0	0	0	1	0
8	3	0	0	0	0	1	1
9	0	0	0	0	0	1	1
5	4	0	0	0	0	1	1
8	1	1	0	1	0	0	1
6	6	1	0	0	0	1	1
0	5	1	0	0	1	0	0
1	8	1	0	0	0	1	0

The categorical values and boolean values have been converted to numerical values that can be used as input to machine learning algorithms.
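One possible way to reproduce the final table end to end is sketched below. Because drop_first applies to every column listed in a single call, the two columns are encoded in separate calls so that all four category columns are kept. The renaming to the short headers (bool, A, B, C, D) and the column reordering are assumptions made to match the table, not steps stated in the article.

```python
import pandas as pd

# Dummy dataset; rows are read off the article's tables.
df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_D",
        "value_D", "value_B", "value_D", "value_C", "value_D",
    ],
    "bool_col": [True, False, True, False, False, False, True, True, True, True],
    "col_1": [9, 7, 9, 8, 9, 5, 8, 6, 0, 1],
    "col_2": [4, 2, 5, 3, 0, 4, 1, 6, 5, 8],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Step 1: keep all four category columns (no drop_first).
encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Step 2: collapse the boolean column into a single 0/1 column.
encoded = pd.get_dummies(encoded, columns=["bool_col"], drop_first=True, dtype=int)

# Shorten the generated names and reorder to match the table's headers.
encoded = encoded.rename(
    columns=lambda name: name.replace("categorical_column_value_", "")
).rename(columns={"bool_col_True": "bool"})
final = encoded[["col_1", "col_2", "bool", "A", "B", "C", "D", "label"]]

print(final.to_string(index=False))
```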

Muhammad Arham is a Deep Learning Engineer working in Computer Vision and Natural Language Processing. He has worked on the deployment and optimizations of several generative AI applications that reached the global top charts at Vyro.AI. He is interested in building and optimizing machine learning models for intelligent systems and believes in continual improvement.
