
Pandas – filtering records in 20 ways
Filtering records is a quite common operation when you process or analyze data with pandas,a lot of times you will have to apply filters so that you can concentrate to the data you want. Pandas is so powerful and flexible that it provides plenty of ways you can filter records, whether you want to filtering by columns to focus on a subset of the data or base on certain conditions. In this article, we will be discussing the various ways of filtering records in pandas.
Prerequisite:
You will need to install pandas package in order to follow the below examples. Below is the command to install pandas with pip:
pip install pandas
And I will be using the sample data from here, so you may also want to download a copy into your local machine to try out the later examples.
With the below codes, we can get a quick view of how the sample data looks like:
import pandas as pd df = pd.read_excel(r"C:\Sample-Sales-Data.xlsx") df.head(5)
Below is the output of the first 5 rows of data:
Let’s get started with our examples.
Filtering records by label or index
Filtering by column name/index is the most straightforward way to get a subset of the data frame in case you are only interested in a few columns of the data rather than the full data frame. The syntax is to use df[[column1,..columnN]] to filter only the specified columns. For instance, the below will get a subset of data with only 2 columns – “Salesman” and “Item Desc”:
new_df = df[["Salesman","Item Desc"]] new_df.head(5)
Output from the above would be:
If you are pretty sure which are the rows you are looking for, you can use the df.loc function which allows you to specify both the row and column labels to filter the records. You can pass in a list of row labels and column labels like below:
df.loc[[0,4], ["Salesman", "Item Desc"]]
And you would see the row index 0 and 4, column label “Salesman” and “Item Desc” are selected as per below output:
Or you can specify the label range with : to filter the records by a range:
df.loc[0:4, ["Salesman", "Item Desc"]]
You would see 5 rows (row index 0 to 4) selected as per below output:
Note that currently we are using the default row index which is a integer starting from 0, so it happens to be same as the position of the rows. Let’s say you have Salesman as your index, then you will need to do filtering based on the index label (value of the Salesman), e.g.:
df.set_index("Salesman", inplace=True)
df.loc["Sara", ["Item Desc", "Order Quantity"]]
With the above code, you will be able to select all the records with Salesman as “Sara”:
Filtering records by row/column position
Similarly, you can use iloc function to achieve the same as what can be done with loc function. But the difference is that, for iloc, you shall pass in the integer position for both row and columns. E.g.:
df.iloc[[0,4,5,10],0:2]
The integers are the position of the row/column from 0 to length-1 for the axis. So the below output will be generated when you run the above code:
Filtering records by single condition
If you would like to filter the records based on a certain condition, for instance, the value of a particular column, you may have a few options to do the filtering based on what type of data you are dealing with.
The eq and == work the same when you want to compare if the value matches:
flt_wine = df["Item Desc"].eq("White Wine")
df[flt_wine]
Or:
flt_wine = (df["Item Desc"] == "White Wine")
df[flt_wine]
Both will generate the below output:
If you run the flt_wine alone, you will see the output is a list of True/False with their index. This is how the filter works as pandas data frame would filter out the index with False value.
To get the data with the negation of certain condition, you can use ~ before your condition statement as per below:
df[~flt_wine]
#or
df[~(df["Item Desc"] == "White Wine")]
#or
df[(df["Item Desc"] != "White Wine")]
This will return the data with “Item Desc” other than “White Wine”.
And for string data type, you can also use the str.contains to match if the column has a particular sub string.
df[df["Item Desc"].str.contains("Wine")]
If you want to filter by matching multiple values, you can use isin with a list of values:
flt_wine = df["Item Desc"].isin(["White Wine", "Red Wine"])
df[flt_wine].head(5)
And you can also use data frame query function to achieve the same. But the column label with spaces in-between would cause errors when using this function, so you will need to reformat a bit of your column header, such as replacing spaces with underscore (refer to this article for more details ).
With this change in the column header, you shall be able to run the below code with the same result as above isin method.
df1 = df.query("Item_Desc in ('White Wine','Red Wine')")
df1.head(5)
There are other Series functions you can use to filter your records, such as isnull, isna, notna, notnull, find etc. You may want to check pandas Series documentation.
Filtering records by multiple conditions
When you need to filter by multiple conditions where multiple columns are involved, you can also do similar as what we have discussed in above with the & or | to join the conditions.
For filtering records when both conditions are true:
flt_whisky_bulk_order = (df["Item Desc"] == "Whisky") & (df["Order Quantity"] >= 10) df[flt_whisky_bulk_order]
The output would be :
For filtering the records when either condition is true:
flt_high_value_order = (df["Item Desc"] == "Whisky") | (df["Price Per Unit"] >= 50)
df[flt_high_value_order]
The output would be :
Similarly, the above can be done with data frame query function. Below is the example of AND condition:
df1 = df.query("Item_Desc == 'Whisky' and Order_Quantity >= 10") df1.head(5)
Below is the example of OR condition:
df1 = df.query("Item_Desc_ == 'Whisky' or Price_Per_Unit >= 10")
df1.head(5)
Filtering records by dataframe.filter
There is also another filter method which can be used to filter by the row or column label.
Below is an example that can be used to get all the columns with the name starting with “Order” keyword:
df.filter(regex="Order*", axis=1)
you shall see the below output:
Similarly, when applying to row labels, you can axis=0
df.set_index("Order Date", inplace=True) df.filter(like="2020-06-21", axis=0)
Take note that data frame query function only works on the row or column label not any specific data series.
Conclusion
Filtering records is a so frequently used operation whenever you need to deal with the data in pandas, and in this article we have discussed a lot of methods you can use under different scenarios. It may not cover everything you need but hopefully it can solve 80% of your problems. There are other Series functions you may employ to filter your data, but probably you would see the syntax still falls under what we have summarized in this article.
If you are interested in other topics about pandas, you may refer to here.