
How to group consecutive rows with the same values in pandas
Problem Statement
You have a data set where you would like to group consecutive rows together if a particular column has the same value, and start a new group whenever a different value appears in between.
To better illustrate the problem, let's use an example.
Assume you have connection log data for devices such as Bluetooth headsets. An event is triggered when a connection to the paired device is established, and another when it is disconnected. In between, additional events may be triggered for connectivity tests. Let's load the data and take a look:
import pandas as pd
df = pd.read_excel("connection log.xlsx")
df.head(10)
You can see the below output from Jupyter Lab:
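Since the original Excel file is not included with the post, here is a minimal stand-in DataFrame with the same assumed columns ("Device ID" and "Event Time") and a couple of interleaved connections; all values are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for "connection log.xlsx": the column names
# and every value below are assumptions for illustration only
df = pd.DataFrame({
    "Device ID": [1001, 1001, 1002, 1002, 1001, 1001],
    "Event Time": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 08:10",
        "2021-01-01 08:20", "2021-01-01 08:40",
        "2021-01-01 09:00", "2021-01-01 09:30",
    ]),
})
print(df.head(10))
```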
If you would like to check the duration of each connection per device, you will want to group the records that were triggered during the same connection. To determine whether records belong to the same connection, you can sort the events by time in ascending order; if the device ID differs between consecutive rows, those rows must belong to different connections. So how can this be done in pandas?
Solution to group the consecutive rows
Let's sort the data first to make sure the records are in chronological order by event time:
df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)
To compare the value of the current row with the previous row for a particular column, we can use the Series shift method. For instance, we can shift the "Device ID" values down by one row and store the result in a new column named "Device ID X":
df["Device ID X"] = df["Device ID"].shift()
After the shifting, the updated data frame looks like this:
If you compare the values of the two columns:
df["Device ID"] != df["Device ID X"]
You get the True/False results back as a data series:
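On a small Series with hypothetical device IDs, the shift-and-compare step works like this:

```python
import pandas as pd

# Hypothetical device IDs: 1001 twice, then 1002, then 1001 again
device_ids = pd.Series([1001, 1001, 1002, 1001])

# shift() moves every value down one row; the first row becomes NaN
shifted = device_ids.shift()

# The first comparison is against NaN, which is never equal,
# so the very first row is always True
changed = device_ids != shifted
print(changed.tolist())  # → [True, False, True, True]
```

Note that the first row always compares as True, because shifting leaves a NaN there and NaN never equals anything.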
Now comes the most critical step. Since there is exactly one True value at each point where the device ID switches to a new one, we can use the pandas cumsum method to sum up these True values cumulatively as per below:
df["cumsum"] = (df["Device ID"] != df["Device ID X"]).cumsum()
In the cumulative sum, each True value counts as 1 and each False value counts as 0. So you would see the below output:
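As a self-contained sketch with hypothetical IDs, the cumulative sum turns the change markers into group labels:

```python
import pandas as pd

# Hypothetical device IDs: 1001 twice, then 1002, then 1001 again
device_ids = pd.Series([1001, 1001, 1002, 1001])

# True marks a change; cumsum counts True as 1 and False as 0,
# so each consecutive run of identical IDs gets its own label
groups = (device_ids != device_ids.shift()).cumsum()
print(groups.tolist())  # → [1, 1, 2, 3]
```

Rows 1 and 2 share the label 1 because they form one consecutive run, while the final 1001 gets a new label (3) since a different ID appeared in between.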
You can see that the rows we would like to group together all receive the same value, and you can make use of this value to regroup the records for further analysis.
You can even combine the above steps into a one-liner to get the earliest and latest event time for each group as per below:
df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Event Time" : ["min", "max"]})
Output as per below:
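A self-contained sketch, using hypothetical timestamps, shows what the one-liner produces:

```python
import pandas as pd

# Hypothetical log: device 1001 connects twice, with 1002 in between
df = pd.DataFrame({
    "Device ID": [1001, 1001, 1002, 1002, 1001],
    "Event Time": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 08:05",
        "2021-01-01 09:00", "2021-01-01 09:30",
        "2021-01-01 10:00",
    ]),
})

# Each consecutive run of identical device IDs becomes one group
# (labels 1, 2, 3), then we take the earliest and latest event
# time within each group
result = df.groupby(
    (df["Device ID"] != df["Device ID"].shift()).cumsum()
).agg({"Event Time": ["min", "max"]})
print(result)
```

The result has one row per connection: two rows for device 1001 (before and after the 1002 connection) and one for 1002.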
I have read many pandas tutorials, but your articles gave me the most satisfaction. While others use toy examples that are either too simplistic or bear no relevance to daily work, your examples are real-life and I can easily relate them to what I am doing. So thank you very much again for a good blog!
Thank you very much for your good words, Jianhong 🙂
This is a fantastic breakdown of a common issue when dealing with time series data in pandas!
Thank you!
Thanks, Johnny, for reading :-)
Hi, thank you for the tutorial, it helped me quite a lot. However, I am trying to understand if there is a way of keeping the Device ID values (1001, 1002, …) instead of returning the 1,2,3 values as presented in the last output’s 1st row. Could you help me with that? Cheers!
Hi Maurizio,
The simplest way I can think of without much change to the existing code would be using unique to aggregate the Device ID column as per below:
df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Device ID": lambda x: x.unique(),
                                                                       "Event Time": ["min", "max"]})
Thank you so much for the prompt answer, Ken! You are a legend!
Thank for your reply, this is exactly what I was looking for 😀