
Pandas tricks – split one row of data into multiple rows
As a data scientist or analyst, you will need to spend a lot of time wrangling the data from various sources so that you can have a standard data structure for your further analysis. There are cases that you get the raw data in some sort of summary view and you would need to split one row of data into multiple rows based on certain conditions in order to do grouping and matching from different perspectives. In this article, we will be discussing a solution to solve this particular issue.
Prerequisites:
You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:
pip install pandas
And let’s import the necessary modules and use this sample data for our demonstration, you can download it into your local folder, or just supply this URL link to pandas read_excel method:
import pandas as pd import numpy as np df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)
So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:
Assume this is the data extracted from a eCommerce system where someone is running a online shop for footwear and apparel products, and the shop provides free 7 days return for the items that it is selling. You can see that each of the rows has the order information, when and who performed the delivery service, and if customer requested return, when the item was returned and by which courier service provider. The data is more from the shop owner’s view, and you may find some difficulty when you want to analyse from courier service providers’ perspective with the current data format. So probably we shall do some transformation to make the format simpler for analysis.
Split one row of data into multiple rows
Now let’s say we would like to split this one row of data into 2 rows if there is a return happening, so that each row has the order info as well as the courier service info and we can easily do some analysis such as calculating the return rate for each product, courier service cost for each month by courier companies, etc.
The output format we would like to have is more like a transaction based, so let’s try to format our date columns and rename the delivery related columns, so that it won’t confuse us later when splitting the data.
df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date df.rename(columns={"Delivery Date" : "Transaction Date", "Delivery Courier" : "Courier", "Delivery Charges" : "Charges"}, inplace=True)
And we add one more column as transaction type to indicate whether the record is for delivery or return. For now, we just assign it as “DELIVERY” for all records:
df["Transaction Type"] = "DELIVERY"
The rows we need to split are the ones with return info, so let’s create a filter by checking if return date is empty:
flt_returned = ~df["Return Date"].isna()
If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:
To split out the delivery and return info for these rows, we will need to perform the below steps:
- Duplicate the current 1 row into 2 rows
- Change the transaction type to “RETURN” for the second duplicated row
- Copy values of the Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively
To duplicate these records, we use data frame index.repeat() to repeat these index twice, and then use loc function to get the data for these repeated indexes. Below is the code to create the duplicate records for the rows with return info:
d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)
Next, let’s save the duplicated row indexes into a variable, so that we can refer to it multiple times even when some data in the duplicated row changed. We use the data frame duplicated function to return the index of the duplicated rows. For this function, the keep=”first” argument will mark 1st row as non-duplicate and the subsequent rows as duplicate, while keep=”last” will mark the 1st row as duplicate.
idx_duplicate = d.duplicated(keep="first") #the default value for keep argument is "first", so you can just use d.duplicated()
With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:
d.loc[idx_duplicate,"Transaction Type"] = "RETURN"
And next, we shall copy the return info into Transaction Date, Courier, Charges fields for these return records. You can either base on the transaction type value to select rows, or continue to use the idx_duplicate to identify the return records.
Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:
d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate, ["Return Date", "Return Courier","Return Charges"]].to_numpy()
If you check the data now, you shall see for the return rows, the return info is already copied over:
(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)
Finally, we need to combine the original rows which only has delivery info with the above processed data. Let’s also sort the values by order number and reset the index:
new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)
Since the return related columns are redundant now, we shall drop these columns to avoid the confusion, so let’s use the below code to drop them by the “Return” keywords in the column labels:
new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)
(To understand how df.filter works, check my this article)
Once we deleted the redundant columns, you shall see the below final result in the new_df as per below:
So we have successfully transformed our data from a shop owner’s view to courier companies’ view, each of the delivery and return records are now an individual row.
Conclusion
Data wrangling sometimes can be tough depends on what kind of source data you get. In this article, we have gone through a solution to split one row of data into multiple rows by using the pandas index.repeat to duplicate the rows and loc function to swapping the values. There are other possible ways to handle this, please do share your comments in case you have any better idea.