pandas split data into buckets with cut and qcut

Pandas – split data into buckets with cut and qcut

If you do a lot of data analysis on your daily job, you may have encountered problems that you would want to split data into buckets or groups based on certain criteria and then analyse your data within each group. For instance, you would like to check the popularity of your products or website within each age groups, or understand how many percent of the students fall under each score range. The most straightforward way might be to categorize your data based on the conditions and then summarize the information, but this usually requires some additional effort to massage the data. In this article, I will be sharing with you a simple way to bin your data with pandas cut and qcut function.

Prerequisite

You will need to install pandas package if you do not have it yet in your working environment. Below is the command to install pandas with pip:

pip install pandas

And let’s import the necessary packages and create some sample sales data for our later examples.

import pandas as pd
import numpy as np
df = pd.DataFrame({"Consignee" : ["Patrick", "Sara", "Randy", "John", "Patrick", "Joe"],
                   "Age" : [44, 51, 23, 30, 44, 39],
                  "Order Date" : pd.date_range(start='2020-08-01', end="2020-08-05", periods=6),
                  "Item Desc" : ["White Wine", "Whisky", "Red Wine", "Whisky", "Red Wine", "Champagne"],
                  "Price Per Unit": [10, 20, 30, 20, 30, 30], 
                  "Order Quantity" : [50, 60, 40, 20, 10, 50],
                  "Order Dimensions" : [0.52, 0.805, 0.48, 0.235,0.12, 0.58]})

With the above codes, we can do a quick view of how the data looks like:

pandas split data into segments

And let’s also calculate the total sales amount by multiplying the price per unit and the order quantity:

df["Total Amount"] = df["Price Per Unit"] * df["Order Quantity"]

Once this data is ready, let’s dive into the problems we are going to solve today.

split data into buckets by cut

If we would like to classify our customers into a few age groups and have a overall view of how much money each age group has spent on our product, how shall we do it ? As I mentioned earlier, we are not going to apply some lambda function with conditions like : if the age is less than 30 then classify the customer as young, because this can easily drive you crazy when you have hundreds or thousands of groups to be defined. Instead, we will be using a powerful data frame cut function to achieve this.

The cut function has two mandatory arguments:

  • x – an array of values to be binned
  • bins – indicate how you want to bin your values

For instance, if you supply the df[“Age”] as the first argument, and indicate bins as 2, you are telling pandas to split your age data into 2 equal groups. In our case, the minimum age value is 23, and maximum age value is 51, so the first group will be from 23 to 23 + (51-23)/2, and second group from 23 + (51-23)/2 to 51. When you run the below code:

pd.cut(df["Age"],2)

You shall see the output similar to below:

pandas split data segment category

Pandas already classified our age data into these two groups and the output shows that data type is a pandas category object. This is very useful as you can actually assign this category column back to the original data frame, and do further analysis based on the categories from there.

Since we don’t want the decimal points for age data, we can set precision = 0, and we also want to label our age data into 3 groups as Yong, Mid-Aged and Old.

Below is the code that we assign our binned age data into “Age Group” column:

df["Age Group"] = pd.cut(df["Age"],3, precision=0, labels=["Young","Mid-Aged","Old"])

If you examine the data again, you would see:

pandas split data into buckets - age group

Pandas mapped out our age data into 3 groups evenly based on the min and max of the age values. But you may have noticed that age 44 has been classified as “Old” which does not sound that true. In this case, we would want to give our own definition of young, mid-aged and old in the bins argument. Let’s delete the “Age Group” column and redo it with below:

df["Age Group"] = pd.cut(df["Age"],[20, 30, 50, 60], precision=0, labels=["Young","Mid-Aged","Old"])

With this list of integer intervals, we are telling pandas to split our data into 3 groups (20, 30], (30, 50] and (50, 60], and label them as Young, Mid-Aged and Old respectively. (here “(” means exclusive, and “]” means inclusive)

If we check the data again:

df[["Age", "Age Group"]]

You shall see the correct result as per we expected:

pandas split data into buckets- age groups with custom intervals

Now with this additional column, you can easily find out how much each age group contributed to the total sales amount. For example:

df.groupby("Age Group").agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

This would calculate the contribution % to the total sales amount within each group (more details from here):

pandas split data into buckets - cut age groups

If you do not wish to have any intermediate data column (for our case, the “Age Group”) added to you data frame, you can directly pass the output of the cut into the groupby function:

df.groupby(pd.cut(df["Age"],[20, 30, 50, 55], precision=0, labels=["Young","Mid-Aged","Old"])).agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

The above code will produce the same result as previously.

There are times you may want to define your bins with a start point & end point at a fixed interval, for instance, to understand for order dimensions at each 0.1, how much is the total sales amount.

For such case, we can make use of the arange function from numpy package, e.g.:

np.arange(0, 1, 0.1)

This would give us an array of values between 0 and 1 with interval of 0.1, and we can supply it as the bins to cut function:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1, 0.1))).agg({"Total Amount": "sum"})

With the above code, we can see pandas split the order dimensions into small chunks of every 0.1 range, and then summarized the sales amount for each of these ranges:

pandas split data into buckets - order dimensions

Note that arange does not include the stop number 1, so if you wish to include 1, you may want to add an extra step into the stop number, e.g.: np.arange(0, 1 + 0.1, 0.1). And cut function also has two arguments – right and include_lowest to control how you want to include the left and right edge. E.g.:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1 + 0.1, 0.1), right=False, include_lowest=True)).agg({"Total Amount": "sum"})

This will make the left edge inclusive and right edge exclusive, the output will be similar to below:

pandas split data into buckets - order dimensions left inclusive

cut vs qcut

Pandas also provides another function qcut, which helps to split your data based on quantiles (the cut points based on the distribution of the data). For instance, if you use qcut for the “Age” column:

pd.qcut(df["Age"],2, duplicates="drop")

You would see the age data has been split into two groups : (22.999, 41.5] and (41.5, 51.0]. 

pandas split data into buckets - age groups qcut

If you examine the data inside each group:

pd.qcut(df["Age"],2, duplicates="drop").value_counts()

You would see qcut has split the total of 6 rows of age data equally into 2 groups, and the cut point is at 41.5:

pandas split data into buckets - age groups qcut - value_counts1

So if you would like to understand what are the 4 age groups spent similar amount of money on your product, you can do as below:

df.groupby(pd.qcut(df["Age"],4, duplicates="drop")).agg({"Total Amount" : "sum"})

And you would see if we split our data into these 4 groups, the total sale amount are relatively the same:

pandas split data into buckets - age groups qcut - sales amount

Conclusion

In this article, we have reviewed through the pandas cut and qcut function where we can make use of them to split our data into buckets either by self defined intervals or based on cut points of the data distribution.

Hope this gives you some hints when you are solving the problems similar to what we have discussed here.

 

You may also like

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x