Introduction
In one of my previous posts, Pandas tricks to split one row of data into multiple rows, we discussed a solution for splitting summary data from one row into multiple rows in order to standardize the data for further analysis. Similarly, there are many scenarios where we have aggregated data, like an Excel pivot table, and need to unpivot it from wide to long format for better analysis. In this article, I will share a few tips for converting columns to rows with a pandas DataFrame.
Prerequisites
To run the code examples below, you need pandas installed in your working environment. Below is the pip command to install pandas:
pip install pandas
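We will use the data from this file for the demonstrations below. You can download it and check what the data looks like with the code below (reading .xlsx files with read_excel also requires an Excel engine such as openpyxl to be installed):
import os

import pandas as pd

# Folder where the sample file was saved; adjust to your own path
data_dir = "c:\\your_download_dir"
df = pd.read_excel(os.path.join(data_dir, "Sample-Data.xlsx"))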
You should see the sample sales data as below:
The sales amount has been summarized by product in the last 4 columns. With this wide data format, some analyses are difficult, for instance finding the top salesman by month and product, or the best-selling products by month.
A better data format transforms the product columns into rows so that each row represents only one product and its sales amount. Now let’s explore the different ways to convert columns to rows with pandas.
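If you do not have the sample file at hand, a small stand-in DataFrame is enough to follow along. The sketch below reuses the column names from this article, but the salesmen, dates and amounts are invented purely for illustration and are not the contents of Sample-Data.xlsx.

```python
import pandas as pd

# Hypothetical stand-in for Sample-Data.xlsx: same column names as in the
# article, but every value below is made up for illustration.
df = pd.DataFrame(
    {
        "Salesman": ["Alice", "Bob", "Alice"],
        "Order Date": pd.to_datetime(["2020-01-05", "2020-01-05", "2020-02-10"]),
        "Beer": [100, 0, 50],
        "Red Wine": [0, 200, 75],
        "Whisky": [300, 0, 0],
        "White Wine": [0, 150, 25],
    }
)

# Wide format: one row per salesman and date, one column per product
print(df)
```

Each row here carries four sales amounts at once, which is exactly the shape the following sections turn into one row per product.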
Using Pandas Stack Method
The most immediate solution you may think of is the stack method, as it lets you stack the columns vertically on top of each other to produce multiple rows. In our case, we need to set the DataFrame index to “Salesman” and “Order Date”, so that the product columns are stacked against this index. For instance:
df.set_index(["Salesman", "Order Date"]).stack()
If you check the result now, you should see output like the below:
This is a MultiIndex Series with index names [‘Salesman’, ‘Order Date’, None]. You can reset the index and name the Series values “Amount”; the unnamed (None) index level then becomes a column called “level_2”, which you can rename to “Product Desc” to make it more meaningful. E.g.:
df.set_index(["Salesman", "Order Date"])\
  .stack()\
  .reset_index(name='Amount')\
  .rename(columns={'level_2':'Product Desc'})
With the above code, you should see output similar to the below:
If you do not want to have the 0 sales amount records, you can easily apply a filter to the DataFrame to have cleaner data.
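As a minimal sketch of that filter, the snippet below rebuilds the long format with stack and then keeps only the rows with a non-zero amount. The sample DataFrame is a made-up stand-in for the article's file (only the column names are taken from it), and the variable names long_df and non_zero are just illustrative.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article
df = pd.DataFrame(
    {
        "Salesman": ["Alice", "Bob", "Alice"],
        "Order Date": pd.to_datetime(["2020-01-05", "2020-01-05", "2020-02-10"]),
        "Beer": [100, 0, 50],
        "Red Wine": [0, 200, 75],
        "Whisky": [300, 0, 0],
        "White Wine": [0, 150, 25],
    }
)

# Same stack-based reshaping as above
long_df = (
    df.set_index(["Salesman", "Order Date"])
    .stack()
    .reset_index(name="Amount")
    .rename(columns={"level_2": "Product Desc"})
)

# Boolean mask: keep only the rows whose sales amount is not zero
non_zero = long_df[long_df["Amount"] != 0].reset_index(drop=True)
print(non_zero)
```

The final reset_index(drop=True) only renumbers the rows after filtering; you can drop it if you prefer to keep the original row labels.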
Using Pandas Melt Method
The melt method is a powerful function for unpivoting data from wide to long format. It is roughly the reverse of the pivot_table function, so if you are familiar with pivot_table or Excel pivot tables, you should find the parameters easy to understand.
To achieve the same result as with the stack method, we can use the below code with the melt method:
df.melt(id_vars=['Salesman', 'Order Date'],
        value_vars=['Beer', 'Red Wine', 'Whisky', 'White Wine'],
        var_name="Product Desc",
        value_name='Amount')
The id_vars parameter specifies the identifier columns, which are kept as they are and repeated for each unpivoted row. The value_vars parameter lists the columns to unpivot, var_name is the name of the new column that holds the original column names, and value_name is the name of the column that holds the values. To better understand these parameters, imagine how the data would be generated by a pivot table in Excel; melt is the reverse process.
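To check that melt and stack really give the same data, one option is to sort both results and compare them. The sketch below does this on a made-up sample DataFrame (only the column names come from the article); note that the two methods return the rows in a different order, which is why the sort is needed.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article
df = pd.DataFrame(
    {
        "Salesman": ["Alice", "Bob", "Alice"],
        "Order Date": pd.to_datetime(["2020-01-05", "2020-01-05", "2020-02-10"]),
        "Beer": [100, 0, 50],
        "Red Wine": [0, 200, 75],
        "Whisky": [300, 0, 0],
        "White Wine": [0, 150, 25],
    }
)
ids = ["Salesman", "Order Date"]
products = ["Beer", "Red Wine", "Whisky", "White Wine"]

melted = df.melt(
    id_vars=ids, value_vars=products, var_name="Product Desc", value_name="Amount"
)
stacked = (
    df.set_index(ids)
    .stack()
    .reset_index(name="Amount")
    .rename(columns={"level_2": "Product Desc"})
)

# melt goes product by product, stack goes row by row, so sort before comparing
keys = ids + ["Product Desc"]
melted_sorted = melted.sort_values(keys).reset_index(drop=True)
stacked_sorted = stacked.sort_values(keys).reset_index(drop=True)

# Raises an AssertionError if the two results differ in any value
pd.testing.assert_frame_equal(melted_sorted, stacked_sorted, check_dtype=False)
print("melt and stack produce the same rows")
```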
Using Pandas wide_to_long Method
The wide_to_long function is quite self-explanatory from its name. It uses pandas.melt under the hood and is designed to solve a particular kind of problem. For instance, if your column names follow a certain pattern, such as a common prefix followed by a year, number or date, you can specify the pattern and extract that information when converting those columns to rows.
Below is the code that generates the same output as our previous examples:
pd.wide_to_long(
    df,
    stubnames="Amount",
    i=["Salesman", "Order Date"],
    j="Product Desc",
    suffix=r"|Red Wine|White Wine|Whisky|Beer"
).reset_index()
The stubnames parameter gives the stub (prefix) that wide_to_long looks for in the column names, and it also becomes the name of the value column in the result. The i parameter specifies the columns that identify each row, and j is the name of the new column for the stacked columns. Since our product column names do not follow any pattern, we simply list all the product names in the suffix parameter; the leading | makes each product name an alternative in the regular expression, so the columns are matched by their full names rather than by an “Amount” prefix.
As wide_to_long returns a DataFrame with a MultiIndex, we need to reset the index to get a flat data structure.
You may not see the power of this function from the above example, but the examples in its official documentation show how useful it is for this type of problem.
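The documentation's examples are not reproduced here, so below is a small sketch in the same spirit. It assumes a made-up DataFrame whose columns follow a prefix-plus-year pattern (Sales2019, Cost2019 and so on); the data and the column names are hypothetical. With such a pattern, wide_to_long can unpivot several groups of columns at once and extract the year into its own column, relying on the default suffix pattern for digits.

```python
import pandas as pd

# Hypothetical data: column names are "<stub><year>"
wide = pd.DataFrame(
    {
        "Salesman": ["Alice", "Bob"],
        "Sales2019": [100, 200],
        "Sales2020": [150, 250],
        "Cost2019": [40, 90],
        "Cost2020": [60, 110],
    }
)

# Two stubs are unpivoted together; the numeric suffix ends up in "Year"
long_df = pd.wide_to_long(
    wide, stubnames=["Sales", "Cost"], i="Salesman", j="Year"
).reset_index()
print(long_df)
```

The result has one row per salesman and year, with Sales and Cost side by side, which would take more than one step with stack or melt alone.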
Performance Consideration
When testing the performance of the above 3 methods, wide_to_long takes significantly longer than the other two, and melt appears to be the fastest. The results may differ for larger data sets, so you will need to evaluate again based on your own data.
#timeit for stack method
4.52 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#timeit for melt method
3.5 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#timeit for wide_to_long method
17.8 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
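If you want to repeat the comparison on your own machine, the sketch below is one way to do it with the standard timeit module instead of the %timeit magic. It runs on a generated DataFrame with made-up values (not the article's sample file), and the helper names with_stack, with_melt and with_wide_to_long are just illustrative, so the absolute numbers will differ from the ones above.

```python
import timeit

import pandas as pd

# Generated stand-in data: 200 rows, each (Salesman, Order Date) pair is unique
n = 200
df = pd.DataFrame(
    {
        "Salesman": [f"S{k % 5}" for k in range(n)],
        "Order Date": pd.date_range("2020-01-01", periods=n, freq="D"),
        "Beer": [k % 7 for k in range(n)],
        "Red Wine": [k % 5 for k in range(n)],
        "Whisky": [k % 3 for k in range(n)],
        "White Wine": [k % 2 for k in range(n)],
    }
)
ids = ["Salesman", "Order Date"]
products = ["Beer", "Red Wine", "Whisky", "White Wine"]


def with_stack():
    return (
        df.set_index(ids)
        .stack()
        .reset_index(name="Amount")
        .rename(columns={"level_2": "Product Desc"})
    )


def with_melt():
    return df.melt(
        id_vars=ids, value_vars=products, var_name="Product Desc", value_name="Amount"
    )


def with_wide_to_long():
    return pd.wide_to_long(
        df,
        stubnames="Amount",
        i=ids,
        j="Product Desc",
        suffix=r"|Red Wine|White Wine|Whisky|Beer",
    ).reset_index()


# Average seconds per call over a fixed number of runs
runs = 20
timings = {
    func.__name__: timeit.timeit(func, number=runs) / runs
    for func in (with_stack, with_melt, with_wide_to_long)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms per call")
```

Change n to something closer to the size of your real data before drawing conclusions, since the ranking may shift with the data volume.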
Conclusion
In this article, we have reviewed 3 pandas methods for converting columns to rows when you need to unpivot your data or transform it from wide to long format for further analysis. A simple test shows that melt performs the best and wide_to_long takes the longest, but bear in mind that wide_to_long has specific use cases that the other functions may not be able to handle as easily.