Problem:
You have some data with date (or numeric) columns. You already know that you can use the - operator directly to calculate the difference between two columns, but now you would like to calculate the date difference between two consecutive rows.
For instance, you have some sample data for GPS tracking, with the start and end time at each location (latitude and longitude). You would like to calculate the time gap within each location or between two locations.
import pandas as pd
import numpy as np
df = pd.read_excel("GPS Data.xlsx")
df.head(10)
For a quick view, you can see the sample data output as per below:
Solutions:
Option 1: Using Series or Data Frame diff
The data frame diff function is the most straightforward way to compare the values in the current row with those in a previous row. By default, it compares the current row with the row immediately before it, and you can also specify the periods argument to compare the current row with the row that many positions earlier. To calculate the time gap of the start time between two consecutive rows:
df["Start Time"].diff()
You shall see the below output:
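As a minimal sketch of how diff and its periods argument behave, assume a small made-up frame in place of GPS Data.xlsx. The timestamps below are hypothetical and only the "Start Time" column name comes from the sample data:

```python
import pandas as pd

# Hypothetical timestamps standing in for the "Start Time" column.
df = pd.DataFrame({
    "Start Time": pd.to_datetime([
        "2021-01-01 08:00",
        "2021-01-01 08:10",
        "2021-01-01 08:25",
        "2021-01-01 08:45",
    ])
})

# Default periods=1: current row minus the previous row.
gap_1 = df["Start Time"].diff()

# periods=2: current row minus the row two positions earlier.
gap_2 = df["Start Time"].diff(periods=2)

print(gap_1)
print(gap_2)
```

The first row of gap_1 and the first two rows of gap_2 are NaT, because there is no earlier row to subtract.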
If you check the data type of the result, you will see dtype('<m8[ns]'), which is a timedelta. You can convert the result into integer minutes for easier reading.
In this case, you can divide by a one-minute timedelta from numpy to convert the date difference into minutes:
df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')
You shall see the below output:
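The same conversion can also be written without apply, since a timedelta Series can be divided directly. The sketch below uses made-up timestamps (an assumption, not the article's data) and shows two equivalent routes:

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps standing in for the "Start Time" column.
start = pd.Series(
    pd.to_datetime([
        "2021-01-01 08:00",
        "2021-01-01 08:10",
        "2021-01-01 08:25",
        "2021-01-01 08:45",
    ]),
    name="Start Time",
)

gaps = start.diff()

# Dividing by a one-minute timedelta yields float minutes (NaN for the first row).
minutes_np = (gaps / np.timedelta64(1, "m")).fillna(0).astype("int64")

# Equivalent route through the .dt accessor.
minutes_dt = (gaps.dt.total_seconds() / 60).fillna(0).astype("int64")

print(minutes_np.tolist())
print(minutes_dt.tolist())
```

Note that fillna(0) makes the first row indistinguishable from a genuine zero-minute gap, and astype("int64") drops any fractional minutes.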
You can also select multiple date columns as a data frame to apply the diff function.
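A minimal sketch of the multi-column case, again on hypothetical timestamps. Each selected column is differenced independently:

```python
import pandas as pd

# Hypothetical stand-in for the GPS sample data.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 08:10", "2021-01-01 08:25"]
    ),
    "End Time": pd.to_datetime(
        ["2021-01-01 08:05", "2021-01-01 08:22", "2021-01-01 08:40"]
    ),
})

# diff runs column by column: each value minus the value one row above it.
both_gaps = df[["Start Time", "End Time"]].diff()

print(both_gaps)
```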
Option 2: Using Series or Data Frame shift with the - operator
The shift function moves the values up/down or left/right by the given number of periods, depending on the axis you have specified. You can think of it as the equivalent of shifting cells in Excel.
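To calculate the difference between the current and next row, you will need to shift the subtrahend column up by 1 cell. In pandas that is shift(-1), whereas a positive value such as shift(1) moves the values down and pairs each row with the previous one. Below is how to calculate the difference between the current End Time and the Start Time from the following row (the result is negative when the next Start Time is later than the current End Time, so swap the operands if you want a positive gap):
df["End Time"] - df["Start Time"].shift(-1)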
You should see the below result:
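The direction of the shift decides which neighbour each row is paired with. As a sketch on made-up timestamps (hypothetical values, not the article's data), shift(1) gives each row the previous row's value and shift(-1) gives each row the next row's value:

```python
import pandas as pd

# Hypothetical stand-in for the GPS sample data.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 08:10", "2021-01-01 08:25"]
    ),
    "End Time": pd.to_datetime(
        ["2021-01-01 08:05", "2021-01-01 08:22", "2021-01-01 08:40"]
    ),
})

# shift(1) moves values down: row i holds the Start Time of row i-1.
prev_start = df["Start Time"].shift(1)

# shift(-1) moves values up: row i holds the Start Time of row i+1.
next_start = df["Start Time"].shift(-1)

# Time between the end of the current row and the start of the next row.
gap_to_next = next_start - df["End Time"]

print(gap_to_next)
```

The last row of gap_to_next is NaT, because there is no following row.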
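If you want to calculate the difference for multiple date columns, you can use the data frame shift operation as per below. Note that pandas aligns the two data frames on their column labels, not on position, so each column is subtracted from its own shifted values regardless of the order in which the columns are listed:
df[["Start Time", "End Time"]] - df[["Start Time", "End Time"]].shift(1)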
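Data frame arithmetic matches columns by label, which is worth checking before relying on column order. The sketch below, on hypothetical timestamps, shows that listing the columns in a different order on each side still subtracts each column from itself, and that a cross-column difference needs explicit Series:

```python
import pandas as pd

# Hypothetical stand-in for the GPS sample data.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 08:10", "2021-01-01 08:25"]
    ),
    "End Time": pd.to_datetime(
        ["2021-01-01 08:05", "2021-01-01 08:22", "2021-01-01 08:40"]
    ),
})

# Columns are aligned by label, so this is End - previous End and
# Start - previous Start, despite the swapped order on the right.
aligned = df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)

# For a cross-column difference, select the two Series explicitly:
# current Start Time minus the previous row's End Time.
cross = df["Start Time"] - df["End Time"].shift(1)

print(aligned)
print(cross)
```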
Option 3: Using data frame sub
The data frame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For example, you can subtract a single row from every row of the data frame. The code below subtracts the first row from each row, matching on column labels, which gives the number of minutes elapsed since the first record rather than the difference between consecutive rows:
(df.loc[:,["Start Time", "End Time"]].sub(df.loc[0,["Start Time", "End Time"]], axis='columns')/np.timedelta64(1, "m")).astype("int64")
You can see the below output:
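A minimal sketch of what this sub call computes, using hypothetical timestamps. The first row is a Series indexed by column name, and axis="columns" matches it against the data frame's columns:

```python
import numpy as np
import pandas as pd

cols = ["Start Time", "End Time"]

# Hypothetical stand-in for the GPS sample data.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 08:10", "2021-01-01 08:25"]
    ),
    "End Time": pd.to_datetime(
        ["2021-01-01 08:05", "2021-01-01 08:22", "2021-01-01 08:40"]
    ),
})

# The first row as a Series whose index holds the column names.
first_row = df.loc[0, cols]

# Subtract that row from every row, column by column.
elapsed = df[cols].sub(first_row, axis="columns")

# Convert the timedeltas to whole minutes.
elapsed_minutes = (elapsed / np.timedelta64(1, "m")).astype("int64")

print(elapsed_minutes)
```

Every value is measured from the first row, so the first row of the result is all zeros.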
If you would like to calculate the gap between the current End Time and the next Start Time, you can use the below:
df["End Time"].sub(df["Start Time"].shift(-1))
Because sub is the method form of the - operator, this produces the same result as writing the same expression with - and shift.
Conclusion:
Among the 3 options discussed above, diff is the most straightforward approach, but it only compares values within the same column. If you would like to calculate the difference between the End Time of the current row and the Start Time of the next row, you will have to use sub or - with shift. One more difference between diff and sub is that sub has a fill_value argument, which substitutes a default value for missing values in the inputs before subtracting, so you may not need another line of code to handle the NA values.
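One caveat on fill_value is that it is applied to the inputs before the subtraction, so it is not always interchangeable with calling fillna on the result. The sketch below uses a hypothetical numeric column (made-up odometer readings, not part of the GPS sample data) to show the difference:

```python
import pandas as pd

# Hypothetical cumulative odometer readings; a numeric column for illustration.
odometer = pd.Series([10.0, 15.0, 21.0])

# diff leaves NaN in the first row, which fillna then replaces with 0.
diff_then_fill = odometer.diff().fillna(0)

# fill_value=0 replaces the missing shifted value *before* subtracting,
# so the first row becomes 10.0 - 0 rather than 0.
sub_with_fill = odometer.sub(odometer.shift(1), fill_value=0)

print(diff_then_fill.tolist())
print(sub_with_fill.tolist())
```

Which behaviour is appropriate depends on whether the first row should count as "no change" or as a change from the fill value.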