
Pandas Tips – Convert Columns To Rows

  Introduction

In one of my previous posts – Pandas tricks to split one row of data into multiple rows – we discussed a solution for splitting summary data from one row into multiple rows in order to standardize it for further analysis. Similarly, there are many scenarios where we have aggregated data, like an Excel pivot table, and we need to unpivot it from wide to long format for better analysis. In this article, I will be sharing a few tips for converting columns to rows with a pandas DataFrame.

Prerequisites

To run the code examples that follow, you need pandas installed in your working environment. Below is the pip command to install pandas:

pip install pandas

We will be using the data from this file for the demonstration, so you may download it and examine how the data looks with the code below:

import pandas as pd
import os
data_dir = "c:\\your_download_dir"
df = pd.read_excel(os.path.join(data_dir, "Sample-Data.xlsx"))

You shall see the sample sales data as per below:


The sales amount has been summarized per product in the last 4 columns. With this wide data format, it would be difficult to do certain analyses, for instance, finding the top salesman by month and product, or the best-selling products by month.

A better data format would be to transform the product columns into rows, so that each row represents a single product and its sales amount. Now let's explore the different ways to convert columns to rows with pandas.

Using Pandas Stack Method

The most immediate solution you may think of is the stack method, as it allows you to stack the columns vertically onto each other and turn them into multiple rows. For our case, we need to set the DataFrame index to "Salesman" and "Order Date", so that the product columns are stacked based on this index. For instance:

df.set_index(["Salesman", "Order Date"]).stack()

If you check the result now, you shall see the below output:


This is a MultiIndex Series with the index names ['Salesman', 'Order Date', None], so you can reset the index and name the resulting Series "Amount", meanwhile renaming the "None" level to "Product Desc" to make it more meaningful. E.g.:

df.set_index(["Salesman", "Order Date"])\
    .stack()\
    .reset_index(name='Amount')\
    .rename(columns={'level_2':'Product Desc'})

With the above code, you can see the output similar to below:


If you do not want to keep the records with a 0 sales amount, you can easily apply a filter to the DataFrame to get cleaner data.
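Below is a minimal sketch of such a filter, assuming the column names produced by the code above:

result = (df.set_index(["Salesman", "Order Date"])
            .stack()
            .reset_index(name="Amount")
            .rename(columns={"level_2": "Product Desc"}))
# keep only the rows with a non-zero sales amount
result = result[result["Amount"] != 0]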

Using Pandas Melt Method

The melt method is a very powerful function for unpivoting data from wide to long format. It is essentially the opposite of the pivot_table function, so if you are familiar with pivot_table or Excel pivot tables, you should find its parameters easy to understand.

To achieve the same result as with the stack approach, we can use the melt method as below:

df.melt(id_vars=['Salesman', 'Order Date'], 
        value_vars=['Beer', 'Red Wine', 'Whisky', 'White Wine'],
        var_name="Product Desc",
        value_name='Amount')

The id_vars parameter specifies the columns used as identifiers for the rows. The value_vars and var_name parameters specify the columns to unpivot and the name of the new column holding the former column names, and value_name gives the name of the value column. To better understand these parameters, imagine how the data would be generated via a pivot table in Excel; melt is the reverse of that process.
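As a quick sanity check (a small sketch reusing the column names above), you can pivot the melted result back with pivot_table, which essentially reproduces the original wide format:

melted = df.melt(id_vars=["Salesman", "Order Date"],
                 value_vars=["Beer", "Red Wine", "Whisky", "White Wine"],
                 var_name="Product Desc", value_name="Amount")

# reversing the melt: one column per product again
melted.pivot_table(index=["Salesman", "Order Date"],
                   columns="Product Desc",
                   values="Amount").reset_index()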


Using Pandas wide_to_long Method

The wide_to_long method is quite self-explanatory by its name. It uses pandas.melt under the hood and is designed to solve some particular problems. For instance, if your column names follow a certain pattern, such as containing a year, a number or a date, you can specify that pattern and extract the information into a new column when converting those columns to rows.
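For example, here is a small, hypothetical sketch where yearly sales are stored in columns named Sales2019 and Sales2020; the default numeric suffix pattern extracts the year into the new "Year" column:

sales = pd.DataFrame({"Store": ["A", "B"],
                      "Sales2019": [100, 150],
                      "Sales2020": [120, 170]})

# the year suffix is captured into the "Year" index level
pd.wide_to_long(sales, stubnames="Sales", i="Store", j="Year")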

Below is the code that generates the same output as our previous examples:

pd.wide_to_long(
    df, 
    stubnames="Amount", 
    i=["Salesman", "Order Date"], 
    j="Product Desc", 
    suffix=r"|Red Wine|White Wine|Whisky|Beer").reset_index()

The stubnames parameter specifies the prefix of the columns holding the values converted from the wide format, i specifies the columns for grouping the rows, and j is the name of the new column for the stacked column suffixes. Since our product column names do not follow any numeric pattern, we simply list out all the product names in the suffix parameter.

As wide_to_long returns a MultiIndex DataFrame, we need to reset the index to get a flat data structure.

You may not see the power of this function from the above example, but if you look at the examples in its official documentation, you will understand how useful this function is for solving this type of problem.


Performance Consideration

When testing the performance of the above 3 methods, wide_to_long takes significantly longer than the other two, and melt appears to be the fastest. The results may vary for larger data sets, so you will need to evaluate again based on your own data.
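Below is a minimal sketch of how such a comparison can be reproduced in a Jupyter/IPython session, using the df loaded earlier:

%timeit df.set_index(["Salesman", "Order Date"]).stack().reset_index(name="Amount")

%timeit df.melt(id_vars=["Salesman", "Order Date"],
                value_vars=["Beer", "Red Wine", "Whisky", "White Wine"],
                var_name="Product Desc", value_name="Amount")

%timeit pd.wide_to_long(df, stubnames="Amount", i=["Salesman", "Order Date"],
                        j="Product Desc",
                        suffix=r"|Red Wine|White Wine|Whisky|Beer").reset_index()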

#timeit for stack method
4.52 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for melt method
3.5 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for wide_to_long method
17.8 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

In this article, we have reviewed 3 pandas methods for converting columns to rows when you need to unpivot your data or transform it from wide to long format for further analysis. A simple test shows that melt performs the best and wide_to_long takes the longest time, but bear in mind that wide_to_long has specific use cases which the other functions may not be able to handle.

 


Pandas Tricks – Combine Data in Different Ways

Introduction

If you have used pandas for your data analysis work, you probably already have an idea of how powerful and flexible it is in terms of data processing. Many times there is more than one way to solve a problem, and choosing the best approach becomes another tough decision. For instance, in one of my previous articles, I tried to summarize 20 ways to filter records in pandas, which is definitely not a complete list of all possible solutions. In this article, I will be discussing the different ways to merge/combine data in pandas and when you should use them, since combining data is usually one of the necessary steps before starting your data analysis.

Prerequisites

If you have not yet installed pandas, you may use the below command to install it from PyPI:

pip install pandas

And import the module at the beginning of your code:

import pandas as pd

Let’s dive into the code examples.

Combine Data with Append vs Concat

Imagine you have the two data frames below from different sources, and you would like to merge them into one data frame.

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name": ["Kelly"]})

The most straightforward way would be using the append method from the pandas DataFrame object:

df1.append(df2, ignore_index=True)

The append method allows you to add rows to the end of the current data frame, and with the ignore_index parameter set to True, the resulting axis is relabeled starting from 0. Note that DataFrame.append has been deprecated in newer versions of pandas (and removed in pandas 2.0), so for new code the concat method introduced below is the recommended approach.

You would see the output as per below:


Alternatively, you can use the pandas concat method, which is also self-explanatory by its name. It provides a few more parameters to manipulate the resulting data frame, such as the axis along which to concatenate, as well as the join logic for either a union or an intersection operation.

You can use the below to generate the same output as previously:

pd.concat([df1, df2], ignore_index=True)

And if you would like to retain a reference to the sources in your result, you can use the keys parameter as per below:

pd.concat([df1, df2], keys=["src_1", "src_2"])

This would return a multi-index data frame where you can easily refer back to the data by source (e.g. df.loc["src_1"]).


Adding a new data frame as columns can also be done with axis=1, for instance:

df3 = pd.DataFrame({"Age" : [12, 13, 13, 12, 13]})
pd.concat([df1, df3], axis=1)

The data frame has been added as one column to the caller:


As the concat method accepts a list of data frames, you can combine multiple data frames at one time, which is much faster than appending them one by one.
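For example, below is a small sketch (the third frame here is hypothetical) that collects the frames in a list and concatenates them in a single call:

frames = [df1, df2, pd.DataFrame({"ID": [7, 8], "Name": ["Tom", "Mary"]})]

# one concat call instead of repeated append calls in a loop
pd.concat(frames, ignore_index=True)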

Merge Data with Join vs Merge

Besides appending rows or columns along an axis, sometimes you may need more sophisticated operations similar to the left/right joins in a relational database. For such scenarios, you shall make use of the pandas merge or join method.

For the previous example to append df2 to df1, you can achieve it with merge as well:

df1.merge(df2, how="outer")

Output as follows:


It is more tedious to achieve the same via join, since join can only join data frames based on the index, so you will have to set the index to the columns you would like to use as keys. Below is how you can do it via join:

df1.join(df2.set_index(["ID", "Name"]), 
        on=["ID", "Name"], how="outer").reset_index(drop=True)

Assuming you have the below student scores for each subject, and you want to merge the student information (df1) with them based on the "Name" column:

df4 = pd.DataFrame({"ID" : [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"], 
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"], 
                    "Score" : ["A", "B", "C", "B", "B"]})

With the merge function, you can specify the join logic as a left join on the "Name" column as per below:

df1.merge(df4, on="Name", how="left")

Pandas automatically adds suffixes whenever the two data frames have columns with duplicate names (e.g. "ID" in both df1 and df4). Below is the output you may see:


To generate the same output via join, you can use the code below, where you need to pre-set the index for df4 and specify the suffixes for the left and right data frames:

df1.join(df4.set_index("Name"), on="Name", lsuffix="_x", rsuffix="_y")

Of course, if you would like to perform a right join for the above two data frames, you can do it as per below:

df1.merge(df4, on="Name", how="right")
# or
df1.join(df4.set_index("Name"), on="Name", how="right", lsuffix="_x", rsuffix="_y")

Output as per below:


Merge DataFrame with Duplicate Keys

When merging multiple DataFrame objects, you may occasionally encounter duplicate values in the columns you want to use as join keys. For instance, you may have the below records if one subject has more than one lecturer:

df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"], 
                    "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})

When you merge this information with the student scores based on the subject, using either merge or join:

df4.merge(df5, on="Subject", how="left")
#or 
df4.join(df5.set_index("Subject"), on="Subject", how="left")

You would see the below output with M x N records due to the duplicate keys in df5:


If your objective is to perform something similar to an Excel VLOOKUP and return only the first matched value, you can use the drop_duplicates method to remove the duplicate records before joining. E.g.:

df4.merge(df5.drop_duplicates("Subject"), on="Subject", how="left")

This would allow you to combine the two data frames with the first matched record from df5:


And in case you do not want to lose any information from the lecturer data frame, you will need to perform some sort of data aggregation before joining, e.g.:

df4.merge(df5.groupby("Subject").agg({"Lecturer": lambda x: ','.join(x)}),
          on="Subject", how="left")

With this aggregation on the lecturer values, you would be able to see the below output:


Based on the above examples, you may find that merge and join are interchangeable in most cases, though you may have to type a bit more when using join due to its different default arguments. Since join always works on the index, you will have to pre-set the index on the key columns before joining.

Conclusion

In this article, we have reviewed a few methods pandas offers for combining data frames, along with some sample code. To wrap up, append and concat are usually used for merging two or more data frames based on the row or column index, and concat performs better than append when you have multiple data frames to work on. If you need high-performance in-memory join operations like the SQL joins of a relational database, you will need the merge or join method, which are interchangeable in most scenarios. In addition, if the data frame you work on does not have an index on the joining row/column, using merge over join will probably save you some typing.

 


How to group consecutive rows of same values in pandas

Problem Statement

You have a data set in which you would like to group consecutive rows together when one of the columns has the same value. If a different value appears in between, those records shall be split into separate groups.

To better elaborate the issue, let’s use an example.

Assume you have the connection log data for some devices, such as Bluetooth devices. An event is triggered when the connection is established as well as when the device is disconnected from its paired device. In between, additional events may be triggered for connectivity tests. Let's load the data and visualize it:

import pandas as pd 

df = pd.read_excel("connection log.xlsx")

df.head(10)

You can see the below output from Jupyter Lab:


If you would like to check the duration of each connection for each device, you probably want to group the records whose events were triggered during the same connection. To determine whether records belong to the same connection, you can sort the events by time in ascending order; if the device ID differs between consecutive rows, those events must belong to different connections. So can this be done in pandas?

Solution to group the consecutive rows

Let's sort our data first to make sure the records are in chronological order based on the event time:

df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)

To compare the value of the current row with that of a neighbouring row for a particular column, we can use the Series shift method. For instance, we can shift the "Device ID" values down by one row and store the result in a new column named "Device ID X":

df["Device ID X"] = df["Device ID"].shift()

After the shifting, you shall see the updated data frame as per below:


If you try to compare the values of both columns:

df["Device ID"] != df["Device ID X"]

You can see the True/False return values as a data series:


Now comes the most critical step. Since the comparison returns True exactly when the device ID switches to a new one, we can use the pandas cumsum method to sum up these True values cumulatively as per below:

df["cumsum"] = (df["Device ID"] != df["Device ID X"]).cumsum()

When doing the cumulative sum, the True values are counted as 1 and the False values as 0, so you would see the below output:


You can see that the same value is calculated for the rows we would like to group together, and you can make use of this value to re-group the records for further analysis.

You can even combine the above steps into a one-liner to get the earliest and latest event time for each group, as per below:

df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Event Time" : ["min", "max"]})

Output as per below:

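If you also want to keep the device ID and compute the duration of each connection, below is a small sketch along the same lines (assuming "Event Time" is a datetime column):

grp = (df["Device ID"] != df["Device ID"].shift()).cumsum()
summary = df.groupby(grp).agg(device=("Device ID", "first"),
                              start=("Event Time", "min"),
                              end=("Event Time", "max"))
# duration of each connection as a timedelta
summary["duration"] = summary["end"] - summary["start"]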

You may also be interested in some other similar topics here.

 


How to calculate date difference between rows in pandas

Problem:

You have some data with date (or numeric) columns. You already know you can directly use the - operator to calculate the difference between columns, but now you would like to calculate the date difference between two consecutive rows.

For instance, you have some sample GPS tracking data that has the start and end time at each location (latitude and longitude), and you would like to calculate the time gap within each location or between two locations.

import pandas as pd
import numpy as np

df = pd.read_excel("GPS Data.xlsx")
df.head(10)

For a quick view, you can see the sample data output as per below:


Solutions:

  Option 1: Using Series or Data Frame diff

The data frame diff function is the most straightforward way to compare values between the current row and previous rows. By default, it compares the current row with the previous one, and you can also specify the periods argument to compare the current row with the row located that many rows before. To calculate the time gap of the start time between two consecutive rows:

df["Start Time"].diff()

You shall see the below output:


If you check the data type of the result, you may see it showing as dtype('<m8[ns]'), i.e. a timedelta; you can convert the result into integer minutes for easier reading.

In this case, you can use numpy's timedelta64 to convert the date difference into minutes:

df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')

You shall see the below output:

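A slightly more idiomatic alternative (a small sketch producing an equivalent result) is to use the timedelta accessor instead of apply:

df["Start Time"].diff().dt.total_seconds().div(60).fillna(0).astype("int64")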

You can also select multiple date columns as a data frame and apply the diff function, as shown in the sketch below.
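A minimal sketch using the same columns as above:

# difference of each date column against its own value in the previous row
df[["Start Time", "End Time"]].diff()

# or compare against the row two positions earlier
df[["Start Time", "End Time"]].diff(periods=2)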

  Option 2: Using Series or Data Frame shift with the - operator

The shift function allows us to move values up/down (or left/right) by the given number of periods, depending on the axis specified. You can think of it as the equivalent of Excel's shift cells function.

To calculate the difference across consecutive rows, you need to shift the subtrahend column by one period so that each row is paired with the value from the previous row. Below is how to calculate the difference between the current End Time and the Start Time of the previous row:

df["End Time"] - df["Start Time"].shift(1)

You shall see the below result:


If you want to calculate the difference for multiple date columns, you can use the data frame shift operation as per below. Note that the subtraction aligns on column labels, so each column is compared with its own values from the previous row:

df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)


  Option 3: Using data frame sub

The data frame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For instance, to measure the elapsed time relative to the first record, you can subtract the values of the first row (passed as a Series, with axis='columns') from every row and convert the result into minutes. Below is the code:

(df.loc[:, ["Start Time", "End Time"]]
   .sub(df.loc[0, ["Start Time", "End Time"]], axis='columns')
 / np.timedelta64(1, "m")).astype("int64")

You can see the below output:


If you would like to calculate the gap between the current End Time and the Start Time of the previous row, you can use the below:

df["End Time"].sub(df["Start Time"].shift(1))

It produces the same result as the earlier example using - with shift.

Conclusion:

Among the 3 options discussed above, diff is the most straightforward approach, but note that it can only calculate the difference within the same column. If you would like to calculate the difference between the End Time of the current row and the Start Time of another row, you will have to use sub or the - operator together with shift. One more difference between diff and sub is that sub has a fill_value argument that substitutes missing values with a default value, so you do not need another line of code to handle the NA values.
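Below is a small, hypothetical numeric sketch of the fill_value behaviour: the NaN introduced by shift is treated as 0, so no separate fillna call is needed:

s = pd.Series([10, 13, 18, 25])
s.sub(s.shift(1), fill_value=0)
# 0    10.0
# 1     3.0
# 2     5.0
# 3     7.0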


Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will spend a lot of time wrangling data from various sources so that you have a standard structure for further analysis. There are cases where you get the raw data in some sort of summary view and need to split one row into multiple rows based on certain conditions, in order to do grouping and matching from different perspectives. In this article, we will be discussing a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:

pip install pandas

Let's import the necessary modules and use this sample data for our demonstration. You can download it into your local folder, or just supply the URL link to the pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:


Assume this is the data extracted from an eCommerce system where someone runs an online shop for footwear and apparel, and the shop provides a free 7-day return for the items it sells. You can see that each row has the order information, when and by whom the delivery was performed, and, if the customer requested a return, when the item was returned and by which courier service provider. The data is structured from the shop owner's view, and you may find it difficult to analyse from the courier service providers' perspective in this format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let's say we would like to split each row into 2 rows if a return happened, so that each row has the order info as well as the courier service info, and we can easily do some analysis such as calculating the return rate for each product, the courier service cost per month by courier company, etc.

The output format we would like is more transaction-based, so let's format the date columns and rename the delivery-related columns, so that they won't confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date": "Transaction Date",
                   "Delivery Courier": "Courier",
                   "Delivery Charges": "Charges"}, inplace=True)

And we add one more column as transaction type to indicate whether the record is for delivery or return. For now, we just assign it as “DELIVERY” for all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let's create a filter that checks whether the return date is not empty:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:


To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate each of these rows into 2 rows
  • Change the transaction type to "RETURN" for the second (duplicated) row
  • Copy the values of Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively

To duplicate these records, we use the data frame index.repeat() to repeat these indexes twice, and then use the loc function to get the data for the repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let's save the duplicated row indexes into a variable, so that we can refer to them multiple times even after some data in the duplicated rows has changed. We use the data frame duplicated function to flag the duplicated rows: the keep="first" argument marks the first row as non-duplicate and the subsequent copies as duplicates, while keep="last" marks the last occurrence as the non-duplicate instead.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

And next, we shall copy the return info into the Transaction Date, Courier and Charges fields for these return records. You can either select the rows based on the transaction type value, or continue to use idx_duplicate to identify the return records.

The below code copies the values from Return Date, Return Courier and Return Charges into Transaction Date, Courier and Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = \
    d.loc[idx_duplicate, ["Return Date", "Return Courier", "Return Charges"]].to_numpy()
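Alternatively (a small sketch of the first option mentioned above), you can select the return rows by the transaction type value instead of idx_duplicate:

ret = d["Transaction Type"] == "RETURN"
d.loc[ret, ["Transaction Date", "Courier", "Charges"]] = \
    d.loc[ret, ["Return Date", "Return Courier", "Return Charges"]].to_numpy()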

If you check the data now, you shall see that for the return rows, the return info has already been copied over:


(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)

Finally, we need to combine the original rows which only have delivery info with the processed data above. Let's also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return related columns are now redundant, we shall drop these columns to avoid confusion. Let's use the below code to drop them by matching the "Return" keyword in the column labels:

new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)

(To understand how df.filter works, check out this article of mine)

Once we have deleted the redundant columns, you shall see the final result in new_df as per below:


So we have successfully transformed our data from the shop owner's view to the courier companies' view; each delivery and return record is now an individual row.
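As an example of the analysis this format enables (a small sketch based on the columns produced above), you can now summarize the charges per courier, per month and per transaction type:

new_df["Month"] = pd.to_datetime(new_df["Transaction Date"]).dt.to_period("M")
new_df.groupby(["Courier", "Month", "Transaction Type"])["Charges"].sum()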

Conclusion

Data wrangling can sometimes be tough, depending on what kind of source data you get. In this article, we have gone through a solution for splitting one row of data into multiple rows by using the pandas index.repeat to duplicate rows and the loc function to swap the values. There are other possible ways to handle this, so please do share your comments if you have any better ideas.