
Pandas Tricks – Calculate Percentage Within Group
Pandas groupby probably is the most frequently used function whenever you need to analyse your data, as it is so powerful for summarizing and aggregating data. Often you still need to do some calculation on your summarized data, e.g. calculating the % of vs total within certain category. In this article, I will be sharing with you some tricks to calculate percentage within groups of your data.
Prerequisite
You will need to install pandas if you have not yet installed:
pip install pandas #or conda install pandas
I am going to use some real world example to demonstrate what kind of problems we are trying to solve. The sample data I am using is from this link , and you can also download it and try by yourself.
Let’s first read the data from this sample file:
import pandas as pd # You can also replace the below file path to the URL of the file df = pd.read_excel(r"C:\Sample Sales Data.xlsx", sheet_name="Sheet")
The data will be loaded into pandas dataframe, you will be able to see something as per below:
Let’s first calculate the sales amount for each transaction by multiplying the quantity and unit price columns.
df["Total Amount"] = df["Quantity"] * df["Price Per Unit"]
You can see the calculated result like below:
Calculate percentage within group
With the above details, you may want to group the data by sales person and the items they sold, so that you have a overall view of their performance for each person. You can do with the below :
#df.groupby(["Salesman","Item Desc"])["Total Amount"].sum() df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})
And you will be able to see the total amount per each sales person:
This is good as you can see the total of the sales for each person and products within the given period.
Calculate the best performer
Now let’s see how we can get the % of the contribution to total revenue for each of the sales person, so that we can immediately see who is the best performer.
To achieve that, firstly we will need to group and sum up the “Total Amount” by “Salemans”, which we have already done previously.
df.groupby(["Salesman"]).agg({"Total Amount" : "sum"})
And then we calculate the sales amount against the total of the entire group. Here we can get the “Total Amount” as the subset of the original dataframe, and then use the apply function to calculate the current value vs the total. Take note, here the default value of axis is 0 for apply function.
[["Total Amount"]].apply(lambda x: 100*x/x.sum())
With the above, we should be able get the % of contribution to total sales for each sales person. And let’s also sort the % from largest to smallest:
sort_values(by="Total Amount", ascending=False)
Let’s put all together and run the below in Jupyter Notebook:
df.groupby(["Salesman"])\ .agg({"Total Amount" : "sum"})[["Total Amount"]]\ .apply(lambda x: 100*x/x.sum())\ .sort_values(by="Total Amount", ascending=False)
You shall be able to see the below result with the sales contribution in descending order. (Do not confuse with the column name “Total Amount”, pandas uses the original column name for the aggregated data. You can rename it to whatever name you want later)
Calculate the most popular products
Similarly, we can follow the same logic to calculate what is the most popular products. This time we want to summarize the sales amount by product, and calculate the % vs total for both “Quantity” and “Total Amount”. And also we want to sort the data in descending order for both fields. e.g.:
df.groupby(["Item Desc"])\ .agg({"Quantity": "sum", "Total Amount" : "sum"})[["Quantity", "Total Amount"]]\ .apply(lambda x: 100*x/x.sum())\ .sort_values(by=["Quantity","Total Amount"], ascending=[False,False])
This will produce the below result, which shows “Whisky” is the most popular product in terms of number of quantity sold. But “Red Wine” contributes the most in terms of the total revenue probably because of the higher unit price.
Calculate best sales by product for each sales person
What if we still wants to understand within each sales person, what is the % of sales for each product vs his/her total sales amount?
In this case, we shall first group the “Salesman” and “Item Desc” to get the total sales amount for each group. And on top of it, we calculate the % within each “Salesman” group which is achieved with groupby(level=0).apply(lambda x: 100*x/x.sum()).
Note: After grouping, the original datafram becomes multiple index dataframe, hence the level = 0 here refers to the top level index which is “Salesman” in our case.
df.groupby(["Salesman", "Item Desc"])\ .agg({"Total Amount" : "sum"})\ .groupby(level=0).apply(lambda x: 100*x/x.sum())\ .sort_values(by=["Salesman", "Item Desc","Total Amount"], ascending=[True, True, False])
You will be able see the below result which already sorted by % of sales contribution for each sales person.
Conclusion
This is just some simple use cases where we want to calculate percentage within group with the pandas apply function, you may also be interested to see what else the apply function can do from here.