How to Merge Pandas DataFrames
Data merge is a common data processing activity. Learn how Pandas provide various ways to merge our data.
Image by catalyststuff on Freepik
Data merging between two datasets or more is typical during data processing. In this blog, we will learn how data merging with Pandas is done and various tips to improve our data merging skills. Let’s explore the data merge technique.
Merge Pandas DataFrame
First; we need to import the Pandas Python package.
import pandas as pd
Merging two Pandas DataFrames would require the merge method from the Pandas package. This function would merge two DataFrame by the variable or columns we intended to join. Let’s try the Pandas merging method with an example DataFrame.
# Create Population DataFrame
df1 = pd.DataFrame({
'Country': ['America', 'Indonesia', 'France'],
'Location': ['New York', 'Jakarta', 'Paris'],
'Population': [731800, 575030, 183305]
})
# Create Income DataFrame
df2 = pd.DataFrame({
'Country': ['America', 'America', 'Indonesia', 'India', 'France', 'Greece'],
'Location': ['New York', 'Chicago', 'Jakarta', 'Mumbai', 'Paris', 'Yunani'],
'Income': [1000, 1500, 1400, 1100, 900, 1200]
})
# Merge Dataframe
merged_df = pd.merge(df1, df2, on='Country')
merged_df
Image by Author
In the example above, we create two different DataFrame with slightly different columns, and we merge them on the ‘Country’ column. The result is the rows from both DataFrame with similar values were merged. With one line, we manage to merge two different DataFrame.Â
Applying Optional ParametersÂ
Pandas .merge method has various optional parameters we could take advantage of. Let’s take a look at some of the useful ones.
Renaming merged columns with the same name
In our example above, we can see a column called ‘Location’ which contains suffixes _x and _y after the merge. If we want to change the column name when the merge happens, we can do that with the following code.
merged_df = pd.merge(df1, df2, on='Country', suffixes = ('_Population', '_Income'))
merged_df
Image by Author
In this code, we pass the suffixes parameter with tuple contain two values; the first and second DataFrame name. In my example, we named the first DataFrame Population and the second Income.
Merge based on different column names
What if we have two DataFrame with two different column names that refer to the same definition? We can still merge them, but we need to specify which DataFrame and column we want to merge.Â
df2 = pd.DataFrame({
'Index': ['America', 'America', 'Indonesia', 'India', 'France', 'Greece'],
'Location': ['New York', 'Chicago', 'Jakarta', 'Mumbai', 'Paris', 'Yunani'],
'Income': [1000, 1500, 1400, 1100, 900, 1200]
})
merged_df = pd.merge(df1, df2, left_on='Country', right_on = 'Index')
merged_df
Image by Author
In the example above, we change the second DataFrame ‘Country’ column as ‘Index’ then we merge the dataset by specify the column name on each DataFrame. Left_on parameter is for the first DataFrame and the right_on for the second DataFrame.
Change the merge type
There are five different types merged in the Pandas merge method. By default, the merge is an Inner merge which only includes rows with matching values in both columns. However, we can change the merge type by passing the values on the how parameter:
- Left
Left merge uses only the values from the first DataFrame.
merged_df = pd.merge(df1, df2, on='Country', how = 'left')
merged_df
Image by Author
- Right
Right merge uses only the values from the second DataFrame.
merged_df = pd.merge(df1, df2, on='Country', how = 'right')
merged_df
Image by Author
- Outer
Include all rows from both DataFrame by using the union of both DataFrame keys.
Image by Author
- Cross
Create a cartesian product from both DataFrame
merged_df = pd.merge(df1, df2, how = 'cross')
merged_df
Image by Author
Merge by two or more different columns
It is possible with Pandas merge to join two DataFrame with two or more different columns. We need to specify within the ‘on’ parameter by passing the list of columns we want to merge.
merged_df = pd.merge(df1, df2, on = ['Country', 'Location'])
merged_df
Image by Author
Conclusion
During data processing, it’s a common activity to merge two different DataFrame. To do that, we can use the Pandas method called merge. There are various optional parameters we can access within the Pandas merge to perform specific tasks, including changing the merged column name, merging DataFrame based on the different column name, changing the merge type, and merging by two other columns or more.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.