Five Ways to do Conditional Filtering in Pandas
Learn five ways to perform conditional filtering with Pandas to help slice and dice your data.
Image by Editor
When I started my journey into data science, I learned R as my first programming language. I became very familiar using the dplyr package to filter data based on certain conditions. Fast forward a few years to when I started using Python and I initially had an aversion to Pandas because of how different it was from dplyr.
As time went on, I became more comfortable using Pandas methods to filter data and it became less intimidating. This blog post is a walkthrough of five ways to conditionally filter data using Pandas, using a single condition filter and then a multi-condition filter.
The filtering techniques used in this post are:
- Pandas filtering with selection brackets
- Pandas series methods: isin(), between(), contains()
- Defining separate filters outside of selection bracket filtering
- query()
- loc[]
- Bonus: Using the pandas filter() method
The Data
The dataset for this post comes from “The Vehicle Dataset” by Nehla Birla hosted on Kaggle.com and contains information about used cars for sale.
To begin, I import Pandas and read in the dataset.
To get a feeling for the syntax and readability of each method, this post looks at two examples to filter the vehicle dataset: one simple and one with multiple conditions.
- Simple filter: find all cars from the year 2013 and newer
- Multi-condition filter: Find all Hondas from the years 2013, 2014, and 2015 between the prices 300000 and 450000 (inclusive).
Filtering Method 1: Selection Brackets
Finding all the vehicles that have a year of 2013 or newer is a fairly standard Pandas filtering task: select the column of the dataset to filter on, tell it what value to filter against, and plug that condition into brackets for the entire dataframe.
If we want to make our multi-condition search, we can put each individual filters inside parentheses () separated by our Boolean search criteria (& for and, | for or, and ~ for not).
These multiple conditions technically work, but the readability of this code is not great. There are brackets and parentheses all over the place. To clean up the code and use fewer conditions, pandas has various methods that we can apply for the same results, one of which we just used in the code chunk above, called str.contains().
Filtering Method 2: Selection Brackets with Series Functions
There are numerous pandas.Series methods we can apply to our columns. They are listed in the Pandas documentation. The reason we look at series methods as we filter is because each column of our Pandas.DataFrame individually is a Pandas.Series element, so we can apply Pandas.Series methods and functionality to it.
There are numerous methods we could use with the vehicles dataset, but to filter the data with our multiple condition example, we will use:
- isin() – check to see if the series values are in a given list
- str.contains() – check to see if a string is in the series
- between() – find series value that are between two values
We will use isin() to check which vehicles meet our years of interest, str.contains() to find which vehicles have Honda in the name, and between() to find vehicles in our price range.
This cleans up the code somewhat, and takes advantage of a few Pandas.Series methods, but the code still isn’t exactly readable. To make this look better, we can drop our code across multiple lines, one line per filtering action. The way to do that is by putting regular parentheses just inside our initial dataframe selection brackets, then inserting all conditions inside these parentheses.
Filtering Method 3: Selection Brackets with External Filters and Series Methods
A blend of the two methods above, we can define filters outside of our selection brackets as variables and then call each variable inside the selection brackets. This is a clean way to write each filter on its own individual line and then call all filters in one line of code. It means less overall parentheses and line breaks throughout the code.
Filtering Method 4: query()
I first heard of pandas.Series.query a year or two ago on a podcast, and I wasn’t a fan at first. Over time, it has really grown on me. A query expression is a great way to subset data: they can be basic and easy or complex and powerful. The query expression to subset vehicles with years 2013 and newer is simple. You feed your filtering parameter(s) in as a string.
As you move on to multi-condition filters, you can make your query string more complex. Instead of typing & or | between your filter parameters, you simply type and or or, respectively. Below is the code to write a query expression for our multi-condition filter. *Note: to call variables that are inside the environment but outside of the DataFrame/ Series you are querying, you need to use an @ before calling the variable. See the use of @ immediately before calling the list “years.”
This is a really neat way to subset your data! Yet, the more query parameters you add, the less readable it becomes. To overcome this problem, using query, we can simply add \ at the place where we want a line break and continue the query expression on the next line. If we want, we can maintain the notation of putting one filter condition per line.
Filtering Method 5: loc[]
I really enjoy the power that comes with using python lambda functions. How can we translate lambda into filtering the vehicles dataset with our conditions? With the simple, single condition filter we have been applying, we call loc off of our dataframe, and with lambda, we can insert our condition.
If we want to add multiple conditions, we can just chain another loc off the results of the previous one. However, if left on one line, there are brackets and periods everywhere! It becomes very difficult to read. To make this more readable, we can wrap the entire right side of our expression in parentheses and then can put each loc filter on its own line.
Filtering Bonus: Use Pandas.DataFrame.filter
As an interesting aside, the Pandas.DataFrame method filter()does not allow you to filter datasets based on data inside the dataset, like the name implied to me originally. Rather, the method filter() allows you to filter based on row/ index names and/ or columns names as a way to subset data.
Filtering the vehicle dataset that only has a few columns and the index names as integers does not exactly demonstrate the power of filter(). We could make the values inside the “name” column the index, which consists of car make and model. However, there are then various rows that have the same make and model and would have the same index, which is not a best practice. Therefore, to use make and model but also make each index unique, the current integer index is concatenated to the row’s make and model, and then that becomes the new index. You may or may not want to do that in practice, but I wanted to do it just to demonstrate the method.
With this new index, we can use the Pandas.DataFrame method filter() to filter based on index name. With filter(), we can search for particular indices using the input like (which is similar to str.contains() like we used before) or we can search for indices by using regular expressions. To filter the data based on the multi-condition filter we have used throughout this post, we can return the same results shown in the other five techniques. The only difference now is that we are using index values to help filter results.
Summary
That’s it! We have seen five techniques and one bonus technique to help slice and dice your data given one or multiple filtering criteria. I’m not sure which one I like the best; I suppose it depends on the use case. I can see myself using query() for the ease and readability of query string expressions, but I can also see myself using loc[] with lambda functions!
I hope this post gave you a few ideas of how you can apply some powerful Pandas functionality to your data!
Bryan Kolano is an active duty U.S. Army officer and data scientist living outside of Washington, D.C. When he is not trying to improve his Python skills, he finds himself spending time with his wife and four kids or studying Spanish and French.