Pandas vs SQL: When Data Scientists Should Use Each Tool
Exploring data sets and understanding its structure, content, and relationships is a routine and core process for any Data Scientist. Multiple tools exist for performing such analysis, and we take a deep dive into the benefits and different approaches of two important tools, SQL and Pandas.
Photo by rigel on Unsplash.
Both of these tools are important to not only data scientists but also to those in similar positions like data analytics and business intelligence. With that being said, when should data scientists specifically use pandas over SQL and vice versa? In some situations, you can get away with just using SQL, and some other times, pandas is much easier to use, especially for data scientists who focus on research in a Jupyter Notebook setting. Below, I will discuss when you should use SQL and when you should use pandas. Keep in mind that both of these tools have specific use cases, but there are many times where their functionality overlap, and that is what I will be comparing below as well.
Pandas
Photo by Kalen Kemp on Unsplash.
Pandas is an open-source data analysis tool in the Python programing language. The benefit of pandas starts when you already have your main dataset, usually from a SQL query. This main difference can mean that the two tools are separate. However, you can also perform several of the same functions in each respective tool. For example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.
It is important to note that I am not comparing what pandas does that SQL cannot do and vice versa. I will be picking the tool that can do the function more efficiently or preferable for data science work — in my opinion, from personal experience.
Here are times where using pandas is more beneficial than SQL — while also having the same functionality as SQL:
- creating calculated fields from existing features
When incorporating a more complex SQL query, you often are incorporating subqueries as well in order to divide values from different columns. In pandas, you can simply divide features much easier like the following:
df["new_column"] = df["first_column"]/df["second_column"]
The code above is showing how you can divide two separate columns and assign those values to a new column. In this case, you are performing the feature creation on the whole entire dataset or dataframe. You can use this function in both feature exploration and feature engineering in the process of data science.
- grouping by
Also referring to subqueries, grouping by in SQL can become quite complex and require lines and lines of code that can be visually overwhelming. In pandas, you can simply group by one line of code. I am not referring to the ‘group by’ at the end of a simple select from table query, but one where there are multiple subqueries involved.
df.groupby(by="first_column").mean()
This result would be returning the mean of the first_column for every column in the dataframe. There are many other ways to use this grouping function, which are outlined nicely in the pandas documentation.
- checking data types
In SQL, you will often have to cast types, but sometimes it can be a little clearer to see the way pandas lays out data types in a vertical format rather than scrolling through a horizontal output in SQL. You can expect some examples of data types returned to be int64, float64, datetime64[ns], and object.
df.dtypes
While these are all fairly simple functions of pandas and SQL, in SQL, they are particularly tricky and sometimes just much easier to implement in a pandas dataframe. Now, let’s look at what SQL is better at performing.
SQL
Photo by Caspar Camille Rubin on Unsplash.
SQL is probably the language that is used most by the most amount of different positions. For example, a data engineer could use SQL, a Tableau developer, or a product manager. With that being said, data scientists tend to use SQL frequently. It is important to note that there are several different versions of SQL, usually all having a similar function, just slightly formatted differently.
Here are times where using SQL is more beneficial than pandas — while also having the same functionality as pandas:
- WHERE clause
This clause in SQL is used frequently and can also be performed in pandas. In pandas, however, it is slightly more difficult or less intuitive. For example, you have to write out redundant code, whereas, in SQL, you simply need the WHERE.
SELECT id FROM table WHERE id > 100
In pandas, it would be something like:
df[df["id"] > 100]["id"]
Yes, both are simple, but SQL is just a little more intuitive.
- JOINS
Pandas has a few ways to join, which can be a little overwhelming, whereas in SQL, you can perform simple joins like the following:Â INNER, LEFT, RIGHT.
SELECT one.column_A, two.column_B FROM first_table one INNER JOIN second_table two ON two.id = one.id
In this code, joining is slightly easier to read than in pandas, where you have to merge dataframes, and especially as you merge more than two dataframes, it can be quite complex in pandas. SQL can perform multiple joins, whether it be INNER, etc., all in the same query.
All of these examples, whether it be SQL or pandas, can be used in at least the exploratory data analysis portion of the data science process, as well as in feature engineering, and querying model results once they are stored in a database.
Summary
This comparison of pandas versus SQL is more of a personal preference. With that being said, you may feel the opposite of my opinion. However, I hope it still sheds light on the differences between pandas and SQL, as well as what you can perform the same in both tools, using slightly different coding techniques and a different language altogether.
To summarize, we have compared the benefits of using pandas over SQL and vice versa for a few of their shared functions:
- Â creating calculated fields from existing features
- grouping by
- checking data types
- WHERE clause
- JOINS
Original. Reposted with permission.
Related: