
How to group consecutive rows with the same values in pandas
Problem Statement
You have a data set in which you would like to group consecutive rows whenever one of the columns has the same value. If a different value appears in between, the records shall be split into separate groups.
To better illustrate the problem, let's use an example.
Assume you have connection log data for some devices, such as Bluetooth devices. An event is triggered when a connection is established as well as when the device disconnects from its paired device. In between, additional events may be triggered for connectivity tests. Let's load the data and take a look at it:
import pandas as pd

df = pd.read_excel("connection log.xlsx")
df.head(10)
You can see the below output from Jupyter Lab:
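If you don't have the original Excel file on hand, a minimal frame with the same column names lets you follow along; the device IDs, timestamps, and event names below are invented for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for "connection log.xlsx";
# all values are made up for illustration.
df = pd.DataFrame({
    "Device ID": ["BT-01", "BT-01", "BT-02", "BT-02", "BT-01", "BT-01"],
    "Event Time": pd.to_datetime([
        "2021-03-01 09:00", "2021-03-01 09:05",
        "2021-03-01 09:10", "2021-03-01 09:20",
        "2021-03-01 09:30", "2021-03-01 09:45",
    ]),
    "Event": ["connected", "disconnected", "connected",
              "disconnected", "connected", "disconnected"],
})
print(df.head(10))
```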
If you would like to check the duration of each connection per device, you probably want to group the records that were triggered during the same connection. To determine whether records belong to the same connection, you can sort by event time in ascending order; if the device ID differs between consecutive rows, those rows must belong to different connections. So how can this be done in pandas?
Solution to group the consecutive rows
Let’s do some sorting to our data first to make sure the records are in chronological order based on the event date:
df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)
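On a small made-up frame, the effect of the sort can be sketched as:

```python
import pandas as pd

# Made-up rows in shuffled order to show the sort.
df = pd.DataFrame({
    "Device ID": ["B", "A", "A"],
    "Event Time": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 09:00", "2021-01-01 11:00",
    ]),
})
df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)
print(df["Device ID"].tolist())  # ['A', 'B', 'A']
```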
To compare the value in the current row with the value in the previous row for a particular column, we can use the Series shift method. For instance, we can shift the "Device ID" values down by one row and store the result in a new column named "Device ID X":
df["Device ID X"] = df["Device ID"].shift()
After the shifting, you shall see the updated data frame as per below:
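On a toy Series, the shift looks like this: every value moves down one row, and the first row becomes NaN since there is nothing above it.

```python
import pandas as pd

# Toy Series to illustrate shift(): each value moves down one row.
s = pd.Series(["A", "A", "B", "B"])
print(s.shift().tolist())  # [nan, 'A', 'A', 'B']
```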
If you try to compare the values of both columns:
df["Device ID"] != df["Device ID X"]
You will get a boolean Series of True/False values:
Now comes the most critical step. Since a True value appears only where the device ID switches to a different ID, we can use the pandas cumsum method to sum up these True values cumulatively as per below:
df["cumsum"] = (df["Device ID"] != df["Device ID X"]).cumsum()
In the cumulative sum, True values count as 1 and False values count as 0, so you would see the below output:
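The comparison-plus-cumsum trick can be sketched on a toy Series (the values are invented for illustration):

```python
import pandas as pd

# Toy Device ID column with two separate runs of "A".
device = pd.Series(["A", "A", "B", "B", "A"])

# True wherever the value differs from the previous row
# (the first row compares against NaN, so it is also True).
change = device != device.shift()
print(change.tolist())  # [True, False, True, False, True]

# The cumulative sum turns each True into the start of a new group number.
print(change.cumsum().tolist())  # [1, 1, 2, 2, 3]
```

Note that the two runs of "A" end up in different groups (1 and 3), which is exactly the "consecutive rows only" behavior we want, unlike a plain groupby on "Device ID".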
You can see that the rows we want to group together get the same value, and you can use this value to re-group the records for further analysis.
You can even combine the above steps into a one-liner to get the earliest and latest event time for each group:
df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Event Time" : ["min", "max"]})
Output as per below:
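Putting it all together on made-up data (column names as in the article, values invented), the one-liner produces one row per connection:

```python
import pandas as pd

# Invented connection-log rows: two connections for BT-01
# with a BT-02 connection in between.
df = pd.DataFrame({
    "Device ID": ["BT-01", "BT-01", "BT-02", "BT-02", "BT-01", "BT-01"],
    "Event Time": pd.to_datetime([
        "2021-03-01 09:00", "2021-03-01 09:05",
        "2021-03-01 09:10", "2021-03-01 09:20",
        "2021-03-01 09:30", "2021-03-01 09:45",
    ]),
})
df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)

# Three groups: BT-01, BT-02, then BT-01 again.
result = df.groupby(
    (df["Device ID"] != df["Device ID"].shift()).cumsum()
).agg({"Event Time": ["min", "max"]})
print(result)
```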