Pandas – filtering records in 20 ways

Filtering records is a very common operation when you process or analyze data with pandas. A lot of the time you will have to apply filters so that you can concentrate on the data you want. Pandas is so powerful and flexible that it provides plenty of ways to filter records, whether you want to filter by columns to focus on a subset of the data or filter based on certain conditions. In this article, we will discuss the various ways of filtering records in pandas.

Prerequisite:

You will need to install the pandas package in order to follow the examples below. Below is the command to install pandas with pip:

pip install pandas

I will be using the sample data from here, so you may want to download a copy to your local machine to try out the later examples.

With the code below, we can get a quick view of what the sample data looks like:

import pandas as pd
df = pd.read_excel(r"C:\Sample-Sales-Data.xlsx")
df.head(5)

Below is the output of the first 5 rows of data:

[Image: first 5 rows of the sample data]

Let’s get started with our examples.

Filtering records by label or index

Filtering by column name/index is the most straightforward way to get a subset of the data frame if you are only interested in a few columns of the data rather than the full data frame. The syntax is df[[column1, ..., columnN]] to select only the specified columns. For instance, the below will get a subset of the data with only 2 columns – “Salesman” and “Item Desc”:

new_df = df[["Salesman","Item Desc"]]
new_df.head(5)

Output from the above would be:

[Image: subset with the “Salesman” and “Item Desc” columns]

If you are pretty sure which rows you are looking for, you can use the df.loc function, which allows you to specify both the row and column labels to filter the records. You can pass in a list of row labels and a list of column labels like below:

df.loc[[0,4], ["Salesman", "Item Desc"]]

And you would see the rows with index 0 and 4 and the columns labelled “Salesman” and “Item Desc” selected, as per the below output:

[Image: df.loc output for rows 0 and 4]

Or you can specify a label range with : to filter the records by a range:

df.loc[0:4, ["Salesman", "Item Desc"]]

You would see 5 rows (row index 0 to 4) selected, as per the below output. Note that unlike Python list slicing, a loc label range is inclusive of both the start and the end label, which is why 0:4 returns 5 rows:

[Image: df.loc output for the label range 0:4]

Note that currently we are using the default row index, which is an integer starting from 0, so it happens to be the same as the position of the rows. Let’s say you have Salesman as your index; then you will need to filter based on the index label (the value of Salesman), e.g.:

df.set_index("Salesman", inplace=True)
df.loc["Sara", ["Item Desc", "Order Quantity"]]

With the above code, you will be able to select all the records with Salesman as “Sara”:

[Image: df.loc output for rows with Salesman “Sara”]
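
Note that set_index with inplace=True has modified df permanently. The later examples assume the default integer index, so you can reset it first:

#move "Salesman" back from the index to a regular column
df.reset_index(inplace=True)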

Filtering records by row/column position

Similarly, you can use the iloc function to achieve the same as what can be done with the loc function. The difference is that, for iloc, you pass in the integer positions for both rows and columns. E.g.:

df.iloc[[0,4,5,10],0:2]

The integers are the positions of the rows/columns, from 0 to length-1 of the axis. Note that unlike loc, an iloc position range excludes the end position, so 0:2 selects only the first two columns. The below output will be generated when you run the above code:

[Image: df.iloc output]

Filtering records by single condition

If you would like to filter the records based on a certain condition, for instance the value of a particular column, you have a few options depending on the type of data you are dealing with.

The eq function and the == operator work the same when you want to check if values match:

flt_wine = df["Item Desc"].eq("White Wine")
df[flt_wine]

Or:

flt_wine = (df["Item Desc"] == "White Wine")
df[flt_wine]

Both will generate the below output:

[Image: rows where “Item Desc” equals “White Wine”]

If you run flt_wine alone, you will see that the output is a Series of True/False values with their index. This is how the filter works: the data frame keeps the rows whose mask value is True and drops the ones that are False.
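
You can inspect the mask directly to see this (the actual True/False values depend on your sample data):

#flt_wine is a boolean Series aligned to the index of df
print(flt_wine.head())
print(flt_wine.dtype)  #bool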

To get the data matching the negation of a certain condition, you can put ~ before your condition statement as per below:

df[~flt_wine]
#or
df[~(df["Item Desc"] == "White Wine")]
#or
df[(df["Item Desc"] != "White Wine")]

This will return the data with “Item Desc” other than “White Wine”.

And for the string data type, you can also use str.contains to check if the column contains a particular substring:

df[df["Item Desc"].str.contains("Wine")]

If you want to filter by matching multiple values, you can use isin with a list of values:

flt_wine = df["Item Desc"].isin(["White Wine", "Red Wine"])
df[flt_wine].head(5)

[Image: rows matched by isin]

And you can also use the data frame query function to achieve the same. But column labels with spaces in between would cause errors when using this function, so you will need to reformat your column headers a bit, such as replacing the spaces with underscores (refer to this article for more details).
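
One simple way to make that change is below (note it renames the columns on df itself):

#replace spaces in all column headers with underscores so query() can reference them
df.columns = df.columns.str.replace(" ", "_")
#on newer pandas versions you could instead keep the original headers and
#wrap the label in backticks inside the query string:
#df.query("`Item Desc` in ('White Wine', 'Red Wine')")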

With this change in the column header, you shall be able to run the below code with the same result as the above isin method.

df1 = df.query("Item_Desc in ('White Wine','Red Wine')")
df1.head(5)

There are other Series functions you can use to filter your records, such as isnull, isna, notna, notnull, etc. You may want to check the pandas Series documentation.
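
For example, to keep only the rows where a particular column is populated, you can filter with notna (a quick sketch; it simply returns all rows if the sample data has no missing values):

#keep only rows where "Order Quantity" is not NaN
df[df["Order Quantity"].notna()]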

Filtering records by multiple conditions

When you need to filter by multiple conditions involving multiple columns, you can join the conditions with & (and) or | (or), similar to what we have discussed above. Note that each condition needs to be wrapped in parentheses, since & and | have higher precedence than the comparison operators.

For filtering records when both conditions are true:

flt_whisky_bulk_order = (df["Item Desc"] == "Whisky") & (df["Order Quantity"] >= 10)
df[flt_whisky_bulk_order]

The output would be:

[Image: rows matching both conditions]

For filtering the records when either condition is true:

flt_high_value_order = (df["Item Desc"] == "Whisky") | (df["Price Per Unit"] >= 50) 
df[flt_high_value_order]

The output would be:

[Image: rows matching either condition]

Similarly, the above can be done with the data frame query function. Below is the example of the AND condition:

df1 = df.query("Item_Desc == 'Whisky' and Order_Quantity >= 10") 
df1.head(5)

Below is the example of the OR condition:

df1 = df.query("Item_Desc == 'Whisky' or Price_Per_Unit >= 50")
df1.head(5)

Filtering records by dataframe.filter

There is also the data frame filter method, which can be used to filter by row or column label.

Below is an example that gets all the columns whose name starts with the “Order” keyword:

df.filter(regex="Order*", axis=1)

You shall see the below output:

[Image: columns starting with “Order”]
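
Besides regex, the filter function also accepts an items parameter for exact label matching and a like parameter for substring matching:

#exact column labels
df.filter(items=["Salesman", "Item Desc"], axis=1)
#columns whose label contains "Order"
df.filter(like="Order", axis=1)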

Similarly, when applying it to row labels, you can use axis=0:

df.set_index("Order Date", inplace=True)
df.filter(like="2020-06-21", axis=0)

[Image: rows filtered by the “Order Date” label]

Take note that the data frame filter function only works on the row or column labels, not on the values of any data series.

Conclusion

Filtering records is such a frequently used operation whenever you deal with data in pandas, and in this article we have discussed many methods you can use under different scenarios. It may not cover everything you need, but hopefully it can solve 80% of your problems. There are other Series functions you may employ to filter your data, but you will probably find that the syntax still falls under what we have summarized in this article.

If you are interested in other topics about pandas, you may refer to here.


Pandas format column headers

When using pandas to deal with data from various sources, you will often see data headers in various formats: for instance, some people prefer upper case, while others use lower case or camel case. There are also different ways of joining the words in a column label, with spaces, hyphens, and underscores all commonly seen. This potentially causes problems when you want to reference a particular column, since pandas column labels are case sensitive and you may get confused about the correct spelling. In this case, you would need to format the column headers into a standard format before processing the data. This article will explain the different ways to format column headers.

Prerequisite:

You will need to install the pandas package in order to follow the below examples. Below is the command to install pandas with pip:

pip install pandas

With the package installed, let’s create a sample data set for our later use:

import pandas as pd
df = pd.DataFrame({"Salesman" : ["Patrick", "Sara", "Randy"],
                  "order date" : pd.date_range(start='2020-08-01', periods=3),
                  "Item Desc " : ["White Wine", "Whisky", "Red Wine"],
                  "Price Per-Unit": [10, 20, 30], 
                  "Order Quantity" : [50, 10, 40],
                  99: ["remak1", "remark2", "remark3"]})

You can preview your data set from a Jupyter Notebook; it would be similar to below:

[Image: preview of the sample data set]

You probably wonder why someone would use a number as a column header, but it does happen in the real world for various reasons.

If you use df[99] or df.loc[0, 99], you are able to see the correct data, which means pandas is not confused about whether your column label is a string or numeric.

[Image: accessing the column with the numeric label 99]

But it sometimes causes readability issues for humans and introduces errors, especially if you assume column labels are always strings and perform string operations on them.

Convert column header to string

So the first thing we probably want to do is to convert the column headers into strings. You can use the astype method to convert them:

df.columns = df.columns.astype("str")

A lot of pandas methods have an “inplace” parameter to apply the changes without creating new objects, but astype does not support “inplace”, so we need to re-assign the formatted object back to df.columns.

Format column header with cases

If you would like to convert all column headers to uppercase, you have at least three ways of doing it:

  • Use the str method from pandas Index object
  • Use the map method from pandas Index object
  • Use Python built-in map method

Below is the sample code for the above 3 options:

#Index.str method
df.columns = df.columns.str.upper()

#Index.map method
df.columns = df.columns.map(str.upper)

#Python built-in map method
df.columns = map(str.upper, df.columns)

The column headers would be all converted to uppercase:

Index(['SALESMAN', 'ORDER DATE', 'ITEM DESC ', 'PRICE PER-UNIT',
       'ORDER QUANTITY', '99'],
      dtype='object')

Option 1 seems to be the most straightforward way, as long as the operation is supported by str, such as ljust, rjust, split, etc.

Similarly, you can convert column headers to lowercase with str.lower():

df.columns = df.columns.str.lower()

or title case with str.title if this is the format you wish to standardize on across all data sources:

df.columns = df.columns.str.title()

Replace characters in column header

If you noticed, there is a trailing space accidentally added in my column header – “Item Desc ” – and this will cause a KeyError if I use df[“Item Desc”] to access the column. To fix this, we can use str.strip to remove all the leading and trailing spaces:

df.columns = df.columns.str.strip()

But the spaces in between words cannot be removed this way, and if you want to use df.Item Desc, it will give you an error. The best way is to replace all the spaces with underscores, so that you can use both the df[“Item_Desc”] and df.Item_Desc formats to reference the column. Below is how you can use a simple lambda function to replace the spaces and hyphens with underscores:

df.columns = df.columns.map(lambda x : x.replace("-", "_").replace(" ", "_"))
# Or
df.columns = map(lambda x : x.replace("-", "_").replace(" ", "_"), df.columns)

If you check again, the column header would be updated as per below:

Index(['Salesman', 'Order_Date', 'Item_Desc', 'Price_Per_Unit',
       'Order_Quantity', '99'],
      dtype='object')

Note that if you use df.columns.str.replace, you cannot simply chain .replace calls together, as the first replace returns an Index object, not a string.
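
Instead, you chain the str accessor again on the returned Index:

df.columns = df.columns.str.replace("-", "_").str.replace(" ", "_")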

Often you will see newline characters in the column header; you can remove them with the replace method as per below:

df.columns = df.columns.str.replace("\n", "")

Add prefix or suffix to column header

With map and a lambda, you can also easily add a prefix or suffix to the column headers, e.g.:

#adding prefix with "Label_"
df.columns = df.columns.map(lambda x : "Label_" + x)

#adding suffix with "_Col"
df.columns = df.columns.map(lambda x : x + "_Col")
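
As a side note, the data frame also has built-in add_prefix and add_suffix methods, which return a new data frame with the headers updated:

#equivalent built-in methods
df = df.add_prefix("Label_")
df = df.add_suffix("_Col")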

Use of rename method

If you find the entire column header is not meaningful to you, you can manually rename multiple column names at one time with the data frame rename method as per below:

df.rename(columns={"Salesman" : "Sales Person", "Item Desc " : "Order Desc"}, inplace=True)

The rename method supports the inplace parameter, so you can immediately apply the changes to the original data frame.

[Image: data frame with renamed columns]

Flatten multi index column

After you have aggregated your data with the groupby and agg functions, you may sometimes get a multi index column header, for instance:

df_sum = df.groupby("Salesman").agg({"Order Quantity": ["mean", "sum"]})

When you calculate both mean and sum of the “Order Quantity” column at the same time, you will get the result similar to below:

[Image: multi index column header after aggregation]

The column header becomes a multi index header, so if you want to flatten it by joining the two levels into one, you can make use of a list comprehension as per below:

df_sum.columns = [' '.join(col) for col in df_sum.columns]

With the above, you would see the column header changed from hierarchical to flattened, as per the below:

[Image: flattened column header]
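
If you are on a relatively recent pandas version, you can also use to_flat_index to get the header tuples before joining them:

#each element is a tuple like ("Order Quantity", "mean")
df_sum.columns = ["_".join(col) for col in df_sum.columns.to_flat_index()]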


Conclusion

In this article, we have discussed a few options you can use to format column headers, such as using the str and map methods of the pandas Index object; and if you want something more than just string operations, you can also pass in a lambda function. All these methods are not limited to column headers or row labels (Index objects); you can also use them to format your data series.

If you are interested in other topics about pandas, you may refer to here.



Pandas tricks – pass multiple columns to lambda

Pandas is one of the most powerful tools for analyzing and manipulating data. In this article, I will be sharing with you the solution for a very common issue you might have faced when dealing with your data in pandas – how to pass multiple columns to a lambda or self-defined function.

Prerequisite

You will have to install pandas in your working environment:

pip install pandas

When dealing with data, you will often have the scenario where you want to calculate something based on the values of a few columns, and you may need a lambda or self-defined function to write the calculation logic. But how do you pass multiple columns to the lambda function as parameters?

Let me use a real-world example so that it is easier to understand the issue I am talking about. The table below shows part of the e-commerce delivery charges offered by some company; the delivery charges are determined by the package size (H+L+W), the package weight, and the delivery mode you choose.

Size (cm/kg)               | 3 Hours Express | Next Day Delivery | Same Day Delivery
<60 CM (H+L+W) & MAX 1KG   | 12              | 8                 | 10
<80 CM (H+L+W) & MAX 5KG   | 15              | 9                 | 11
<100 CM (H+L+W) & MAX 8KG  | 17              | 11                | 13
<120 CM (H+L+W) & MAX 10KG | 19              | 14                | 16

And assuming we have the below order data, we want to simulate the delivery charges. Let’s create the data in a pandas dataframe.

import pandas as pd

df = pd.DataFrame({
    "Order#" : ["1", "2", "3", "4"], 
    "Weight" : [5.0, 2.1, 8.1, 7.5], 
    "Package Size" : [80, 45, 110, 90],
    "Delivery Mode": ["Same Day", "Next Day", "Express", "Next Day"]})

If you view the dataframe from a Jupyter Notebook (you can sign up here to use it for free), you shall be able to see the data as per below.

[Image: the sample order data]

Let’s also implement a calculate_rate function where we need to pass in the weight, package size, and delivery mode in order to calculate the delivery charges:

def calculate_rate(weight, package_size, delivery_mode):
    #default the charges to $20 for packages beyond the partial rate card above
    charges = 20
    if weight <=1 and package_size <60:
        if delivery_mode == "Express":
            charges = 12
        elif delivery_mode == "Next Day":
            charges = 8
        else:
            charges = 10
    elif weight <=5 and package_size <80:
        if delivery_mode == "Express":
            charges = 15
        elif delivery_mode == "Next Day":
            charges = 9
        else:
            charges = 11
    elif weight <=8 and package_size <100:
        if delivery_mode == "Express":
            charges = 17
        elif delivery_mode == "Next Day":
            charges = 11
        else:
            charges = 13
    elif weight <=10 and package_size <120:
        if delivery_mode == "Express":
            charges = 19
        elif delivery_mode == "Next Day":
            charges = 14
        else:
            charges = 16
    return charges

Pass multiple columns to lambda

Here comes the most important part. You probably already know the data frame has an apply function where you can apply a lambda function to the selected dataframe. We will use the apply function, and we have a few ways to pass the columns to our calculate_rate function.

Option 1:

We can select the columns involved in our calculation as a subset of the original data frame, and use the apply function on it.

In the apply function, we pass the parameter axis=1 to indicate that the x in the lambda represents a row, so we can unpack x with *x and pass its values to calculate_rate.

df["Delivery Charges"] = df[["Weight", "Package Size", "Delivery Mode"]].apply(lambda x : calculate_rate(*x), axis=1)

If we check the df again in Jupyter Notebook, you should see the new column “Delivery Charges” with the figures calculated based on the logic we defined in calculate_rate function.

[Image: data frame with the new “Delivery Charges” column]

Option 2:

If you do not want to take a subset of the data frame before applying the lambda, you can also use the apply function directly on the original data frame. In this case, you will need to select the columns inside the lambda before passing them to the calculate_rate function. Same as above, we need to specify axis=1 to indicate the function applies to each row.

df["Delivery Charges"] = df.apply(lambda x : calculate_rate(x["Weight"], x["Package Size"], x["Delivery Mode"]), axis=1)

This will produce the same result as option 1. You can also use x.Weight instead of x[“Weight”] when passing in the parameters.


Conclusion

The two options we discussed for passing multiple columns to a lambda are basically the same; it is just a matter of applying to a subset or to the original data frame. I have not yet tested with a large data set, so there might be some difference in performance; you may want to take note if you are dealing with a lot of data.

You may also be interested in reading some other articles related to pandas.



Pandas Tricks – Calculate Percentage Within Group

Pandas groupby is probably the most frequently used function whenever you need to analyse your data, as it is so powerful for summarizing and aggregating data. Often you still need to do some calculation on the summarized data, e.g. calculating the percentage vs the total within a certain category. In this article, I will be sharing with you some tricks to calculate percentages within groups of your data.

Prerequisite

You will need to install pandas if you have not yet done so:

pip install pandas
#or conda install pandas

I am going to use a real-world example to demonstrate what kind of problem we are trying to solve. The sample data I am using is from this link, and you can also download it and try it by yourself.

Let’s first read the data from this sample file:

import pandas as pd

# You can also replace the below file path to the URL of the file
df = pd.read_excel(r"C:\Sample Sales Data.xlsx", sheet_name="Sheet")

The data will be loaded into a pandas dataframe, and you will be able to see something as per below:

[Image: the sample sales data]

Let’s calculate the sales amount for each transaction by multiplying the quantity and the unit price columns.

df["Total Amount"] = df["Quantity"] * df["Price Per Unit"]

You can see the calculated result like below:

[Image: data with the calculated “Total Amount” column]

Calculate percentage within group

With the above details, you may want to group the data by sales person and the items they sold, so that you have an overall view of each person’s performance. You can do that with the below:

#df.groupby(["Salesman","Item Desc"])["Total Amount"].sum()
df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})

And you will be able to see the total amount for each sales person and item:

[Image: total amount grouped by sales person and item]

This is good, as you can see the total sales for each person and product within the given period.

Calculate the best performer

Now let’s see how we can get the percentage contribution to total revenue for each sales person, so that we can immediately see who is the best performer.

To achieve that, firstly we need to group and sum up the “Total Amount” by “Salesman”, which we have already done previously.

df.groupby(["Salesman"]).agg({"Total Amount" : "sum"})

And then we calculate each sales amount against the total of the entire group. Here we take “Total Amount” as a subset of the grouped dataframe, and then use the apply function to calculate each value vs the total. Take note that the default value of axis is 0 for the apply function.

[["Total Amount"]].apply(lambda x: 100*x/x.sum())

With the above, we are able to get the percentage contribution to total sales for each sales person. Let’s also sort the percentages from largest to smallest:

sort_values(by="Total Amount", ascending=False)

Let’s put it all together and run the below in Jupyter Notebook:

df.groupby(["Salesman"])\
.agg({"Total Amount" : "sum"})[["Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by="Total Amount", ascending=False)

You shall be able to see the below result, with the sales contributions in descending order. (Do not be confused by the column name “Total Amount”; pandas uses the original column name for the aggregated data. You can rename it to whatever you want later.)

[Image: % of total sales per sales person, in descending order]


Calculate the most popular products

Similarly, we can follow the same logic to calculate the most popular products. This time we want to summarize the sales by product, and calculate the percentage vs total for both “Quantity” and “Total Amount”. We also want to sort the data in descending order for both fields, e.g.:

df.groupby(["Item Desc"])\
.agg({"Quantity": "sum", "Total Amount" : "sum"})[["Quantity", "Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Quantity","Total Amount"], ascending=[False,False])

This will produce the below result, which shows “Whisky” is the most popular product in terms of quantity sold, but “Red Wine” contributes the most in terms of total revenue, probably because of its higher unit price.

[Image: % of quantity and total amount per product]


Calculate best sales by product for each sales person

What if we still want to understand, within each sales person, what the percentage of sales for each product is vs his/her total sales amount?

In this case, we first group by “Salesman” and “Item Desc” to get the total sales amount for each group. And on top of that, we calculate the percentage within each “Salesman” group, which is achieved with groupby(level=0).apply(lambda x: 100*x/x.sum()).

Note: After grouping, the original data frame becomes a multi-index data frame, hence level=0 here refers to the top level index, which is “Salesman” in our case.

df.groupby(["Salesman", "Item Desc"])\
.agg({"Total Amount" : "sum"})\
.groupby(level=0).apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Salesman", "Item Desc","Total Amount"], ascending=[True, True, False])

You will be able to see the below result, already sorted by the percentage of sales contribution for each sales person.

[Image: % of sales per product within each sales person]
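
As a side note, the same per-group percentage can also be computed with transform, which broadcasts each group’s total back onto the rows (a minimal sketch; df_grp and the “Pct” column are just illustrative names):

df_grp = df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})
#transform("sum") returns the per-Salesman total aligned to each row
df_grp["Pct"] = 100 * df_grp["Total Amount"] / df_grp.groupby(level=0)["Total Amount"].transform("sum")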


Conclusion

These are just some simple use cases where we want to calculate a percentage within a group with the pandas apply function. You may also be interested to see what else the apply function can do from here.