Pandas: How to One-Hot Encode Data
In this article, we will explore how to use Pandas to one-hot encode categorical data.
What is One-Hot Encoding
One-hot encoding is a data preprocessing step to convert categorical values into compatible numerical representations.
categorical_column | bool_col | col_1 | col_2 | label |
value_A | True | 9 | 4 | 0 |
value_B | False | 7 | 2 | 0 |
value_D | True | 9 | 5 | 0 |
value_D | False | 8 | 3 | 1 |
value_D | False | 9 | 0 | 1 |
value_D | False | 5 | 4 | 1 |
value_B | True | 8 | 1 | 1 |
value_D | True | 6 | 6 | 1 |
value_C | True | 0 | 5 | 0 |
value_D | True | 1 | 8 | 0 |
For example, in this dummy dataset, the categorical column holds string values. Many machine learning algorithms require their input data to be numerical, so we need a way to convert this attribute into a compatible form. To do so, we break the categorical column down into multiple binary-valued columns, one per distinct value.
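To make the idea concrete, here is a minimal sketch that builds the binary columns by hand, without any dedicated encoding function. The three-row Series `s` is a hypothetical sample that only mirrors the values of the dummy dataset.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the dummy dataset's categorical column.
s = pd.Series(["value_A", "value_B", "value_D"], name="categorical_column")

# One binary column per distinct value: 1 where the row holds that value, else 0.
manual = pd.DataFrame(
    {f"{s.name}_{level}": (s == level).astype(int) for level in sorted(s.unique())}
)

print(manual)
```

Each row ends up with exactly one 1, in the column that matches its original value.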
How to use Pandas Library for One-Hot Encoding
First, import Pandas and read the .csv file (or any other supported file type) into a Pandas DataFrame.
import pandas as pd
df = pd.read_csv("data.csv")
To check the unique values and better understand our data, we can use the following Pandas methods.
df['categorical_column'].nunique()
df['categorical_column'].unique()
For this dummy data, the functions return the following output:
>>> 4
>>> array(['value_A', 'value_B', 'value_D', 'value_C'], dtype=object)
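If you do not have a data.csv file at hand, the same checks can be reproduced in memory. The sketch below assumes the ten rows that appear in the final encoded table in the conclusion, and it additionally uses `value_counts()`, which the article does not use, to show how often each value occurs.

```python
import pandas as pd

# The dummy dataset built in memory instead of read from data.csv.
# Rows are reconstructed from the final encoded table in the conclusion.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D", "value_D", "value_D",
                           "value_D", "value_B", "value_D", "value_C", "value_D"],
    "bool_col": [True, False, True, False, False, False, True, True, True, True],
    "col_1": [9, 7, 9, 8, 9, 5, 8, 6, 0, 1],
    "col_2": [4, 2, 5, 3, 0, 4, 1, 6, 5, 8],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Number of distinct values in the categorical column.
n_levels = df["categorical_column"].nunique()

# How many rows hold each value, most frequent first.
counts = df["categorical_column"].value_counts()

print(n_levels)
print(counts)
```

Knowing the number of distinct values up front tells us how many new columns the encoding will create.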
We can break the categorical column down into multiple columns with the pandas.get_dummies() function. Its most relevant arguments are:
Argument | Description |
data: array-like, Series, or DataFrame | The data to encode, here the original Pandas DataFrame |
columns: list-like, default None | Names of the columns to one-hot encode; if None, all columns with object, string, or category dtype are encoded |
drop_first: bool, default False | Whether to drop the first level of each encoded column, giving k-1 new columns for k distinct values |
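As a quick illustration of these arguments, the sketch below encodes a hypothetical two-column frame named `toy` (not part of the article's dataset). It also passes `dtype=int`, an argument not listed in the table, so that the new columns hold 0 and 1 regardless of the Pandas version.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Encode only the "color" column; "size" is passed through unchanged.
full = pd.get_dummies(toy, columns=["color"], dtype=int)

# Same call with drop_first=True: the first level ("green") is dropped.
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(list(full.columns))     # ['size', 'color_green', 'color_red']
print(list(reduced.columns))  # ['size', 'color_red']
```

Note that the untouched columns come first in the result and the new columns are appended after them, named with the original column name as a prefix.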
To better understand the function, let us work on one-hot encoding the dummy dataset.
One-Hot Encoding the Categorical Columns
We call get_dummies and pass the original DataFrame as the data input. For columns, we pass a list containing only the categorical_column header.
df_encoded = pd.get_dummies(df, columns=['categorical_column'])
The command above drops categorical_column and creates a new column for each unique value. The single categorical column is therefore converted into 4 new columns, where exactly one of the 4 holds a 1 in each row and the other 3 hold 0. This is why it is called One-Hot Encoding. Note that in Pandas 2.0 and later the new columns are boolean (True/False) by default; pass dtype=int to get 1 and 0 as shown here.
categorical_column_value_A | categorical_column_value_B | categorical_column_value_C | categorical_column_value_D |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
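The "exactly one 1 per row" property can be checked directly. The sketch below rebuilds only the categorical column with the ten values from the table above, encodes it, and then recovers the original labels from the position of the 1. The decoding step via `idxmax` is an illustration added here, not part of the original workflow.

```python
import pandas as pd

# Only the categorical column, with the ten values from the encoded table above.
df = pd.DataFrame({"categorical_column": [
    "value_A", "value_B", "value_D", "value_D", "value_D",
    "value_D", "value_B", "value_D", "value_C", "value_D",
]})

# dtype=int requests 0/1 columns; Pandas 2.0+ returns booleans by default.
encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Exactly one column should be "hot" in every row.
ones_per_row = encoded.sum(axis=1)

# Recover the original label: find the column holding the 1, then strip the prefix.
prefix = "categorical_column_"
decoded = encoded.idxmax(axis=1).str[len(prefix):]

print(ones_per_row.tolist())
print(decoded.tolist())
```

Because the encoding can be reversed this way, no information from the original column is lost.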
A problem arises when we one-hot encode the boolean column: it is also split into two new columns.
One-Hot Encoding Binary Columns
df_encoded = pd.get_dummies(df, columns=['bool_col'])
bool_col_False | bool_col_True |
0 | 1 |
1 | 0 |
0 | 1 |
1 | 0 |
This adds a redundant column, because a single column with True encoded as 1 and False encoded as 0 carries the same information. To avoid this, we set the drop_first argument to True, which drops the first level (bool_col_False).
df_encoded = pd.get_dummies(df, columns=['bool_col'], drop_first=True)
bool_col_True |
1 |
0 |
1 |
0 |
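For a column that is already boolean, a plain integer cast gives the same values as the drop_first approach. The sketch below compares the two on a hypothetical four-row frame matching the rows shown above; the `astype(int)` route is an alternative suggested here, not the method used in the article.

```python
import pandas as pd

# Hypothetical four-row frame matching the boolean values shown above.
df = pd.DataFrame({"bool_col": [True, False, True, False]})

# The article's approach: encode, then drop the first level (bool_col_False).
via_dummies = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

# Alternative: cast the boolean column straight to integers.
via_cast = df["bool_col"].astype(int)

print(via_dummies["bool_col_True"].tolist())
print(via_cast.tolist())
```

The cast keeps the original column name, while get_dummies renames the column to bool_col_True, so the choice mostly depends on which naming suits the rest of the pipeline.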
Conclusion
After both steps, the one-hot encoded dummy dataset looks like this (column names are shortened for readability):
col_1 | col_2 | bool | A | B | C | D | label |
9 | 4 | 1 | 1 | 0 | 0 | 0 | 0 |
7 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
9 | 5 | 1 | 0 | 0 | 0 | 1 | 0 |
8 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
9 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 | 4 | 0 | 0 | 0 | 0 | 1 | 1 |
8 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
6 | 6 | 1 | 0 | 0 | 0 | 1 | 1 |
0 | 5 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 8 | 1 | 0 | 0 | 0 | 1 | 0 |
The categorical values and boolean values have been converted to numerical values that can be used as input to machine learning algorithms.
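Putting the steps together, the final table can be reproduced with the sketch below. It assumes the ten rows shown in the table above. The two columns are encoded in separate calls because drop_first applies to every column listed in a single call, and the rename and reordering steps are additions made here only to match the shortened headers.

```python
import pandas as pd

# Dummy dataset reconstructed from the final table above.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D", "value_D", "value_D",
                           "value_D", "value_B", "value_D", "value_C", "value_D"],
    "bool_col": [True, False, True, False, False, False, True, True, True, True],
    "col_1": [9, 7, 9, 8, 9, 5, 8, 6, 0, 1],
    "col_2": [4, 2, 5, 3, 0, 4, 1, 6, 5, 8],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Step 1: one new column per categorical value (all four levels kept).
encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Step 2: a single 0/1 column for the boolean attribute.
encoded = pd.get_dummies(encoded, columns=["bool_col"], drop_first=True, dtype=int)

# Shorten the generated names to match the headers of the final table.
rename_map = {f"categorical_column_value_{c}": c for c in "ABCD"}
rename_map["bool_col_True"] = "bool"
encoded = encoded.rename(columns=rename_map)

# Reorder the columns as in the final table.
final = encoded[["col_1", "col_2", "bool", "A", "B", "C", "D", "label"]]

print(final)
```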
Muhammad Arham is a Deep Learning Engineer working in Computer Vision and Natural Language Processing. He has worked on the deployment and optimizations of several generative AI applications that reached the global top charts at Vyro.AI. He is interested in building and optimizing machine learning models for intelligent systems and believes in continual improvement.