

Pandas Tricks – Combine Data in Different Ways

Introduction

If you have used pandas for your data analysis work, you probably already have some idea of how powerful and flexible it is for data processing. Often there is more than one way to solve a problem, and choosing the best approach becomes another tough decision. For instance, in one of my previous articles, I tried to summarize 20 ways to filter records in pandas, which is definitely not a complete list of all the possible solutions. In this article, I will discuss the different ways to merge/combine data in pandas and when to use each of them, since combining data is probably one of the necessary steps you will perform before starting your data analysis.

Prerequisites

If you have not yet installed pandas, you may use the below command to install it from PyPI:

pip install pandas

And import the module at the beginning of your code:

import pandas as pd

Let’s dive into the code examples.

Combine Data with Append vs Concat

Imagine you have the below two data frames from different sources, and you would like to merge them into one data frame.

df1 = pd.DataFrame({"ID" : [1, 2, 3, 4, 5], 
"Name" : ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name" : ["Kelly"]})

The most straightforward way would be using the append method from the pandas DataFrame object:

df1.append(df2, ignore_index=True)

The append method adds the rows of the other data frame to the end of the caller, and with the ignore_index parameter set to True, the resulting index is relabeled starting from 0. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions you should use the concat function (covered next) instead.

You would see the output as per below:


Alternatively, you can use the pandas concat function, which is self-explanatory by its name. It provides a few more parameters to shape the resulting data frame, such as the axis along which to concatenate and the join logic (union or intersection) applied to the other axis.

You can use the below to generate the same output as previously:

pd.concat([df1, df2], ignore_index=True)
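As a minimal sketch of the join parameter (reusing df1 and df2 defined above): when concatenating along axis=1, join controls how the row indexes of the two frames are matched.

pd.concat([df1, df2], axis=1, join="outer")  # union of the index labels, missing cells become NaN
pd.concat([df1, df2], axis=1, join="inner")  # intersection of the index labels, only index 0 survives here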

And if you would like to retain a reference to the data sources in your result, you can use the keys argument as per below:

pd.concat([df1, df2], keys=["src_1", "src_2"])

This would return a multi-index data frame where you can easily refer back to the data by source (e.g. df.loc["src_1"]).
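For example, a small sketch (the variable name combined is just for illustration):

combined = pd.concat([df1, df2], keys=["src_1", "src_2"])
combined.loc["src_1"]   # returns only the rows that originally came from df1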


Adding a new data frame as columns can also be done with axis=1, for instance:

df3 = pd.DataFrame({"Age" : [12, 13, 13, 12, 13]})
pd.concat([df1, df3], axis=1)

The new data frame has been appended as an additional column to the caller:


As the concat function accepts a list of data frames, you can combine multiple data frames in one call, which is much faster than using append to add them one by one.
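For example, a minimal sketch (the extra frames below are made up purely for illustration):

df_a = pd.DataFrame({"ID": [7], "Name": ["Tom"]})
df_b = pd.DataFrame({"ID": [8], "Name": ["May"]})
pd.concat([df1, df2, df_a, df_b], ignore_index=True)  # one call instead of appending one by one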

Merge Data with Join vs Merge

Besides appending rows or columns along an axis, sometimes you may need more sophisticated operations similar to the left/right join in a relational database. For such scenarios, you can make use of the pandas merge or join methods.

For the previous example of appending df2 to df1, you can achieve the same with merge as well:

df1.merge(df2, how="outer")

Output as follows:


It would be more tedious to achieve the same via join, since join matches on the index of the other data frame, so you first have to set that index to the columns you would like to use as the key. Below is how you can do it via join:

df1.join(df2.set_index(["ID", "Name"]), 
        on=["ID", "Name"], how="outer").reset_index(drop=True)

Now assume you have the below data frame with each student's score per subject, and you want to merge it with the student information (df1) based on the "Name" column:

df4 = pd.DataFrame({"ID" : [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"], 
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"], 
                    "Score" : ["A", "B", "C", "B", "B"]})

With the merge method, you can specify a left join on the "Name" column as per below:

df1.merge(df4, on="Name", how="left")

Pandas automatically adds suffixes whenever the two data frames have columns with the same name (e.g. "ID" in both df1 and df4). Below is the output you may see:


To generate the same output via join, you can use the below code, where you need to pre-set the index for df4 and specify the suffixes for the left and right data frames:

df1.join(df4.set_index("Name"), on="Name", lsuffix="_x", rsuffix="_y")

Of course, if you would like to perform a right join on the above two data frames, you can do it as per below:

df1.merge(df4, on="Name", how="right")
# or
df1.join(df4.set_index("Name"), on="Name", how="right", lsuffix="_x", rsuffix="_y")

Output as per below:


Merge DataFrame with Duplicate Keys

When merging multiple DataFrame objects, you may occasionally encounter the scenario where there are duplicate values in the columns you want to use as join keys. For instance, you may have the below records if one subject has more than one lecturer:

df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"], 
                    "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})

When you merge this information with the student scores based on the subject, using either the merge or join method:

df4.merge(df5, on="Subject", how="left")
#or 
df4.join(df5.set_index("Subject"), on="Subject", how="left")

You would see the below output, where each key that appears M times in df4 and N times in df5 produces M x N rows, due to the duplicate keys in df5:


If your objective is to perform something similar to an Excel VLOOKUP and return only the first matched value, then you can use the drop_duplicates method to remove the duplicate records before joining. E.g.:

df4.merge(df5.drop_duplicates("Subject"), on="Subject", how="left")

This would allow you to combine the two data frames with the first matched record from df5:


And in case you do not want to lose the information from the lecturer data frame, you will need to perform some sort of data aggregation before joining, e.g.:

df4.merge(df5.groupby("Subject").agg({"Lecturer" : lambda x: ','.join(x)}),
          on="Subject", how="left")

With this aggregation on the lecturer values, you would be able to see the below output:


Based on the above examples, you may find that merge and join are interchangeable in most cases, although you may have to type a bit more when using the join method due to its different default arguments. Since join always matches on the index of the other data frame, you have to pre-set that index to the key columns before joining.

Conclusion

In this article, we have reviewed a few methods pandas offers for combining data frames, with some sample code. To wrap up, append and concat are usually used for combining two or more data frames based on the row or column index, and concat performs better than append when you have multiple data frames to work on. If you need high-performance, in-memory join operations like the SQL joins of a relational database, you will need to use the merge or join method, which are interchangeable in most scenarios. In addition, if the data frame you work on does not have its index set on the joining rows/columns, using merge over join will probably save you some typing.

 


How to group consecutive rows of same values in pandas

Problem Statement

You have a data set in which you would like to group consecutive rows whenever one of the columns has the same value. If a different value appears in between, those records shall be split into separate groups.

To better illustrate the issue, let's use an example.

Assume you have the connection log data for some devices such as Bluetooth. An event is triggered when the connection is established as well as when the device is disconnected from its paired device. In between, there may be additional events triggered for connectivity tests. Let's load the data and visualize it:

import pandas as pd 

df = pd.read_excel("connection log.xlsx")

df.head(10)

You can see the below output from Jupyter Lab:


If you would like to check the duration of each connection for each device, you probably want to group the records whose events were triggered during the same connection. To determine whether records belong to the same connection, you can sort the events by time in ascending order; if the device ID differs between consecutive rows, they must be events from different connections. So how can this be done in pandas?

Solution to group the consecutive rows

Let's sort our data first to make sure the records are in chronological order based on the event time:

df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)

To compare the value of the current row with that of the previous row for a particular column, we can use the Series shift method. For instance, we can shift the "Device ID" values down by one row and store the result in a new column named "Device ID X":

df["Device ID X"] = df["Device ID"].shift()

After the shifting, you shall see the updated data frame as per below:


If you try to compare the values of both columns:

df["Device ID"] != df["Device ID X"]

You will get the True/False results back as a Series:


Now comes the most critical step. Since we know there is exactly one True value whenever the device ID switches to a new ID, we can use the pandas cumsum method to sum up these True values cumulatively as per below:

df["cumsum"] = (df["Device ID"] != df["Device ID X"]).cumsum()

During the cumulative sum, True values are counted as 1 and False values as 0, so you would see the below output:


You can see that the same value is calculated for the rows we would like to group together, and you can make use of this value to group the records for further analysis.
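For instance, a quick sketch that uses this helper column to get the connection window per group (assuming the columns shown above):

df.groupby(["Device ID", "cumsum"]).agg({"Event Time" : ["min", "max"]})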

You can even combine all the above steps into a one-liner, without creating the helper columns, to get the earliest and latest event time for each group as per below:

df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Event Time" : ["min", "max"]})

Output as per below:



 


How to calculate date difference between rows in pandas

Problem:

You have some data with date (or numeric) columns. You already know that you can directly use the - operator to calculate the difference between columns, but now you would like to calculate the date difference between two consecutive rows.

For instance, you have some sample GPS tracking data with the start and end time at each location (latitude and longitude), and you would like to calculate the time gap within each location or between two locations.

import pandas as pd
import numpy as np

df = pd.read_excel("GPS Data.xlsx")
df.head(10)

For a quick view, you can see the sample data output as per below:


Solutions:

  Option 1: Using Series or Data Frame diff

The data frame diff function is the most straightforward way to compare values between the current row and previous rows. By default, it compares the current and previous row, and you can also specify the periods argument to compare the current row with the row that many positions earlier. To calculate the time gap of the start time between two consecutive rows:

df["Start Time"].diff()

You shall see the below output:


If you check the data type of the result, you will see it shows as dtype('<m8[ns]'), i.e. a timedelta in nanoseconds. You can convert the result into integer minutes for easier reading.

In this case, you can use the below timedelta from numpy to convert the date difference into minutes:

df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')

You shall see the below output:


You can also select multiple date columns as a data frame to apply the diff function.
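For example, a quick sketch of both variations:

df[["Start Time", "End Time"]].diff()  # each column compared against its own previous row
df["Start Time"].diff(periods=2)       # compare against the row two positions earlier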

  Option 2: Using Series or Data Frame shift with – operator

The shift function allows us to move values up/down (or left/right) by the given number of periods, depending on which axis is specified. You can imagine it as the Excel shift-cells function.

To calculate the difference between the current row and the previous row, you need to shift the subtrahend column down by one cell. Below is how to calculate the difference between the current End Time and the Start Time from the previous row:

df["End Time"] - df["Start Time"].shift(1)

You shall see the below result:


If you want to calculate the difference for multiple date columns, you can use the data frame shift operation as per below (note that the subtraction aligns on column labels, so each column is compared against its own previous-row values):

df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)


  Option 3: Using data frame sub

The data frame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For our case, to calculate the date difference between two consecutive rows, you can subtract a shifted copy of the date columns from the original data frame, which is the same idea as Option 2 expressed with sub. Below is the code:

(df[["Start Time", "End Time"]].sub(df[["Start Time", "End Time"]].shift(1))/np.timedelta64(1, "m")).fillna(0).astype("int64")

You can see the below output:


If you would like to calculate the gap between the current End Time and the Start Time from the previous row, you can use the below:

df["End Time"].sub(df["Start Time"].shift(1))

It should produce the same result as previously when we used - together with shift.

Conclusion:

Among the three options discussed above, using diff is the most straightforward approach, but note that it can only compare values within the same column. If you would like to calculate the difference between the End Time of the current row and the Start Time of a different row, you will have to use sub or - together with shift. One more difference between diff and sub is that sub has a fill_value argument which substitutes missing values with a default value, so you do not need another line of code to handle the NA values.
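As a minimal numeric sketch of the fill_value behaviour (the sample series is made up purely for illustration):

s = pd.Series([1, 3, 6])
s.diff()                           # the first element is NaN
s.sub(s.shift(1), fill_value=0)    # the missing value is treated as 0, so the first element becomes 1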


Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will need to spend a lot of time wrangling data from various sources into a standard structure for your further analysis. There are cases where you get the raw data in some sort of summary view and need to split one row of data into multiple rows based on certain conditions, in order to do grouping and matching from different perspectives. In this article, we will discuss a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:

pip install pandas

And let's import the necessary modules and use a sample data file for our demonstration. You can download the file into your local folder, or supply its URL directly to the pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:


Assume this is data extracted from an eCommerce system where someone runs an online shop for footwear and apparel products, and the shop provides free 7-day returns for the items it sells. You can see that each row has the order information, when and by whom the delivery was performed, and, if the customer requested a return, when the item was returned and by which courier service provider. The data is organized from the shop owner's view, and you may find it difficult to analyse it from the courier service providers' perspective in the current format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let's say we would like to split one row of data into two rows whenever a return happened, so that each row has the order info as well as the courier service info, and we can easily do some analysis such as calculating the return rate for each product, or the courier service cost per month by courier company.

The output format we would like is more transaction based, so let's format our date columns and rename the delivery-related columns so that they won't confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date" : "Transaction Date",
"Delivery Courier" : "Courier",
"Delivery Charges" : "Charges"}, inplace=True)

And we add one more column, the transaction type, to indicate whether the record is for a delivery or a return. For now, we just assign "DELIVERY" for all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let's create a filter that checks whether the return date is not empty:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:


To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate each of these rows into 2 rows
  • Change the transaction type to "RETURN" for the second (duplicated) row
  • Copy the values of Return Date, Return Courier and Return Charges into Transaction Date, Courier and Charges respectively

To duplicate these records, we use the data frame's index.repeat() to repeat each index twice, and then use the loc function to fetch the rows for these repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let's save the duplicated-row marker into a variable, so that we can refer to it multiple times even after some data in the duplicated rows has changed. We use the data frame duplicated function, which returns a boolean marker of the duplicated rows. For this function, the keep="first" argument marks the first occurrence as non-duplicate and the subsequent rows as duplicates, while keep="last" marks the first occurrence as the duplicate.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

Next, we shall copy the return info into the Transaction Date, Courier and Charges fields for these return records. You can either select rows based on the transaction type value, or continue to use idx_duplicate to identify the return records.

Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate, 
                                                     ["Return Date", "Return Courier","Return Charges"]].to_numpy()

If you check the data now, you shall see for the return rows, the return info is already copied over:


(Note: to_numpy() is needed here because pandas aligns on column labels during assignment; without it, the values would not be copied across the differently named columns.)
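Below is a tiny illustrative sketch of that behaviour (the frame tmp is made up for this demo):

tmp = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
tmp.loc[:, ["a", "b"]] = tmp[["b", "a"]]             # label alignment: nothing is swapped, values stay the same
tmp.loc[:, ["a", "b"]] = tmp[["b", "a"]].to_numpy()  # positional copy: the two columns are actually swapped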

Finally, we need to combine the original rows that only have delivery info with the processed data above. Let's also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return-related columns are now redundant, we shall drop them to avoid confusion. Let's use the below code to drop them by matching the "Return" keyword in the column labels:

new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)

(The df.filter call above selects the columns whose labels match the given regular expression.)

Once we have deleted the redundant columns, you shall see the final result in new_df as per below:


So we have successfully transformed our data from the shop owner's view to the courier companies' view; each delivery and return record is now an individual row.

Conclusion

Data wrangling can sometimes be tough, depending on what kind of source data you get. In this article, we have gone through a solution for splitting one row of data into multiple rows by using pandas index.repeat to duplicate the rows and the loc function to copy the values across columns. There are other possible ways to handle this, so please do share your comments if you have a better idea.


Pandas – split data into buckets with cut and qcut

If you do a lot of data analysis in your daily job, you may have encountered problems where you want to split data into buckets or groups based on certain criteria and then analyse the data within each group. For instance, you may want to check the popularity of your products or website within each age group, or understand what percentage of students falls into each score range. The most straightforward way might be to categorize your data based on the conditions and then summarize the information, but this usually requires some additional effort to massage the data. In this article, I will be sharing a simple way to bin your data with the pandas cut and qcut functions.

Prerequisite

You will need to install the pandas package if you do not have it yet in your working environment. Below is the command to install pandas with pip:

pip install pandas

And let’s import the necessary packages and create some sample sales data for our later examples.

import pandas as pd
import numpy as np
df = pd.DataFrame({"Consignee" : ["Patrick", "Sara", "Randy", "John", "Patrick", "Joe"],
                   "Age" : [44, 51, 23, 30, 44, 39],
                  "Order Date" : pd.date_range(start='2020-08-01', end="2020-08-05", periods=6),
                  "Item Desc" : ["White Wine", "Whisky", "Red Wine", "Whisky", "Red Wine", "Champagne"],
                  "Price Per Unit": [10, 20, 30, 20, 30, 30], 
                  "Order Quantity" : [50, 60, 40, 20, 10, 50],
                  "Order Dimensions" : [0.52, 0.805, 0.48, 0.235,0.12, 0.58]})

With the above code, we can take a quick look at how the data looks:


And let’s also calculate the total sales amount by multiplying the price per unit and the order quantity:

df["Total Amount"] = df["Price Per Unit"] * df["Order Quantity"]

Once this data is ready, let’s dive into the problems we are going to solve today.

split data into buckets by cut

If we would like to classify our customers into a few age groups and get an overall view of how much money each age group has spent on our products, how shall we do it? As mentioned earlier, we are not going to apply a lambda function with conditions like "if the age is less than 30, classify the customer as young", because this can easily drive you crazy when you have hundreds or thousands of groups to define. Instead, we will use the powerful cut function to achieve this.

The cut function has two mandatory arguments:

  • x – an array of values to be binned
  • bins – indicates how you want to bin your values

For instance, if you supply df["Age"] as the first argument and set bins to 2, you are telling pandas to split your age data into 2 equal-width groups. In our case, the minimum age is 23 and the maximum age is 51, so the first group will span from 23 to 23 + (51-23)/2 = 37, and the second group from 37 to 51 (pandas also extends the lower edge slightly so that the minimum value itself is included). When you run the below code:

pd.cut(df["Age"],2)

You shall see the output similar to below:


Pandas has classified our age data into these two groups, and the output shows that the data type is a pandas category object. This is very useful, as you can assign this category column back to the original data frame and do further analysis based on the categories from there.

Since we don't want decimal points for the age data, we can set precision=0, and we also want to label our age data into 3 groups as Young, Mid-Aged and Old.

Below is the code that assigns our binned age data to the "Age Group" column:

df["Age Group"] = pd.cut(df["Age"],3, precision=0, labels=["Young","Mid-Aged","Old"])

If you examine the data again, you would see:


Pandas mapped our age data into 3 equal-width groups based on the min and max of the age values. But you may have noticed that age 44 has been classified as "Old", which does not sound right. In this case, we want to give our own definition of young, mid-aged and old via the bins argument. Let's delete the "Age Group" column and redo it as below:

df["Age Group"] = pd.cut(df["Age"],[20, 30, 50, 60], precision=0, labels=["Young","Mid-Aged","Old"])

With this list of bin edges, we are telling pandas to split our data into the 3 groups (20, 30], (30, 50] and (50, 60], and label them as Young, Mid-Aged and Old respectively (here "(" means exclusive and "]" means inclusive).

If we check the data again:

df[["Age", "Age Group"]]

You shall see the correct result as we expected:


Now with this additional column, you can easily find out how much each age group contributed to the total sales amount. For example:

df.groupby("Age Group").agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

This would calculate each group's percentage contribution to the total sales amount:


If you do not wish to add any intermediate column (in our case, "Age Group") to your data frame, you can pass the output of cut directly into the groupby function:

df.groupby(pd.cut(df["Age"],[20, 30, 50, 55], precision=0, labels=["Young","Mid-Aged","Old"])).agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

The above code will produce the same result as previously.

There are times you may want to define your bins from a start point to an end point at a fixed interval, for instance, to understand how much the total sales amount is for each 0.1 range of order dimensions.

For such cases, we can make use of the arange function from the numpy package, e.g.:

np.arange(0, 1, 0.1)

This would give us an array of values between 0 and 1 with an interval of 0.1, and we can supply it as the bins argument to the cut function:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1, 0.1))).agg({"Total Amount": "sum"})

With the above code, pandas splits the order dimensions into chunks of 0.1 and then summarizes the sales amount for each of these ranges:


Note that arange does not include the stop number 1, so if you wish to include 1, you may want to add one extra step to the stop number, e.g. np.arange(0, 1 + 0.1, 0.1). The cut function also has two more arguments, right and include_lowest, to control whether the left and right edges are inclusive. E.g.:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1 + 0.1, 0.1), right=False, include_lowest=True)).agg({"Total Amount": "sum"})

This will make the left edge inclusive and the right edge exclusive; the output will be similar to below:


cut vs qcut

Pandas also provides another function, qcut, which splits your data based on quantiles (cut points derived from the distribution of the data). For instance, if you use qcut on the "Age" column:

pd.qcut(df["Age"],2, duplicates="drop")

You would see the age data has been split into two groups: (22.999, 41.5] and (41.5, 51.0].


If you examine the data inside each group:

pd.qcut(df["Age"],2, duplicates="drop").value_counts()

You would see that qcut has split the 6 rows of age data equally into 2 groups, with the cut point at 41.5:

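To see the contrast with cut directly, you can compare the bucket sizes side by side (a quick sketch on the same Age column):

pd.cut(df["Age"], 2).value_counts()    # equal-width bins: the counts can be unbalanced (2 vs 4 here)
pd.qcut(df["Age"], 2).value_counts()   # equal-frequency bins: 3 rows fall into each bucket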

So if you would like to split your customers into 4 equal-sized age groups and check how much each group spent on your products, you can do as below:

df.groupby(pd.qcut(df["Age"],4, duplicates="drop")).agg({"Total Amount" : "sum"})

And you would see that when we split our data into these 4 groups, the total sales amounts are relatively similar:


Conclusion

In this article, we have reviewed the pandas cut and qcut functions, which we can use to split our data into buckets either by self-defined intervals or based on cut points from the data distribution.

Hope this gives you some hints when you are solving problems similar to what we have discussed here.