
Pandas Tips – Convert Columns To Rows

  Introduction

In one of my previous posts – Pandas tricks to split one row of data into multiple rows – we discussed a solution for splitting summary data from one row into multiple rows in order to standardize it for further analysis. Similarly, there are many scenarios where we have aggregated data, like an Excel pivot table, and we need to unpivot it from wide to long format for better analysis. In this article, I will share a few tips for converting columns to rows with a pandas DataFrame.

Prerequisites

To run the code examples below, you need pandas installed in your working environment. Here is the pip command to install it:

pip install pandas

We will be using the data from this file for the demonstration, so you can download it and examine what the data looks like with the code below:

import pandas as pd
import os
data_dir = "c:\\your_download_dir"
df = pd.read_excel(os.path.join(data_dir, "Sample-Data.xlsx"))

You shall see the sample sales data as per below:

[Screenshot: sample sales data in wide format, with one column per product]

The sales amount has been summarized per product in the last four columns. With this wide format, it is difficult to answer questions such as who the top salesman is by month and product, or which products sell best each month.

A better format would transform the product columns into rows, so that each row represents a single product and its sales amount. Now let's explore the different ways to convert columns to rows with pandas.

Using Pandas Stack Method

The most immediate solution you may think of is the stack method, as it stacks the columns vertically onto each other and turns them into multiple rows. For our case, we need to set the DataFrame index to "Salesman" and "Order Date", so that the product columns are stacked against this index. For instance:

df.set_index(["Salesman", "Order Date"]).stack()

If you check the result now, you shall see the below output:

[Screenshot: stack output – a Series with a MultiIndex of Salesman, Order Date and the product column names]

This is a MultiIndex Series with index names ['Salesman', 'Order Date', None], so you can reset the index, name the Series "Amount", and rename the "None" index level to "Product Desc" to make it more meaningful. E.g.:

df.set_index(["Salesman", "Order Date"])\
    .stack()\
    .reset_index(name='Amount')\
    .rename(columns={'level_2':'Product Desc'})

With the above code, you can see the output similar to below:

[Screenshot: long-format result with Salesman, Order Date, Product Desc and Amount columns]

If you do not want to keep the records with 0 sales amount, you can easily apply a filter to the DataFrame to get cleaner data, for example as sketched below.
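A minimal sketch, assuming the stacked result is stored in a variable named df_long (the name is hypothetical) with the "Amount" column created as shown above:

df_long = df.set_index(["Salesman", "Order Date"])\
    .stack()\
    .reset_index(name='Amount')\
    .rename(columns={'level_2': 'Product Desc'})

# keep only the rows with a non-zero sales amount
df_long = df_long[df_long["Amount"] != 0]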

Using Pandas Melt method

The melt method is a very powerful function for unpivoting data from wide to long format. It is roughly the opposite of the pivot_table function, so if you are familiar with pivot_table or the Excel pivot table, you should be able to understand its parameters easily (a small round-trip sketch follows the melt example below).

To achieve the same result as with the stack method, we can use the melt method as below:

df.melt(id_vars=['Salesman', 'Order Date'], 
        value_vars=['Beer', 'Red Wine', 'Whisky', 'White Wine'],
        var_name="Product Desc",
        value_name='Amount')

The id_vars parameter specifies the identifier columns that stay as rows. The value_vars and var_name parameters specify the columns to unpivot and the name of the new column that holds their names, and value_name sets the name of the value column. To better understand these parameters, imagine how the data would be generated via a pivot table in Excel; melt is the reverse of that process.

[Screenshot: melt output, same long-format result as the stack example]

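To make the "opposite of pivot_table" intuition concrete, here is a minimal sketch (assuming the melt output is stored in a hypothetical variable df_long, and that each Salesman/Order Date/product combination appears only once) that pivots the long data back into the original wide shape:

# melt output -> back to one column per product
df_long = df.melt(id_vars=['Salesman', 'Order Date'],
                  value_vars=['Beer', 'Red Wine', 'Whisky', 'White Wine'],
                  var_name="Product Desc",
                  value_name='Amount')

df_wide = df_long.pivot_table(index=['Salesman', 'Order Date'],
                              columns='Product Desc',
                              values='Amount').reset_index()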

Using Pandas wide_to_long Method

The wide_to_long method is quite self-explanatory by its name. It uses pandas.melt under the hood and is designed for some particular problems: if your column names follow a certain pattern, such as embedding a year, a number or a date, you can specify that pattern and extract the information while converting those columns to rows.

Below is the code that generates the same output as our previous examples:

pd.wide_to_long(
    df, 
    stubnames="Amount", 
    i=["Salesman", "Order Date"], 
    j="Product Desc", 
    suffix=r"|Red Wine|White Wine|Whisky|Beer").reset_index()

The stubnames parameter gives the name of the value column created from the wide columns, i specifies the identifier columns for grouping the rows, and j is the name of the new column that holds the unpivoted column names. Since our product column names do not follow any pattern, we simply list all the product names in the suffix parameter.

As wide_to_long returns a DataFrame with a MultiIndex, we need to reset the index to get a flat structure.

You may not see the power of this function from the example above, but if you look at the kind of example shown in its official documentation, you will understand how useful it is for this type of problem. A simplified sketch is shown below.

[Screenshot: wide_to_long example from the pandas official documentation]
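A simplified sketch adapted from the pandas documentation, where the column names embed a year suffix that wide_to_long parses out automatically (the data values here are made up for illustration):

import pandas as pd

# column names like A1970 / B1980 embed a year suffix
dfw = pd.DataFrame({
    "id": [1, 2, 3],
    "A1970": [2.5, 1.2, 0.7],
    "A1980": [3.2, 1.3, 0.1],
    "B1970": [2.7, 1.4, 0.9],
    "B1980": [3.1, 1.5, 0.2],
})

# stubs "A"/"B" become value columns, the numeric suffix becomes the "year" index level
long_df = pd.wide_to_long(dfw, stubnames=["A", "B"], i="id", j="year")
print(long_df.reset_index())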

Performance Consideration

When testing the performance of the above three methods on this sample data, wide_to_long takes significantly longer than the other two, and melt appears to be the fastest. The results may vary for larger data sets, so you should benchmark again on your own data (a sketch of how to reproduce these timings follows the numbers below).

#timeit for stack method
4.52 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for melt method
3.5 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for wide_to_long method
17.8 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
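If you would like to reproduce the comparison yourself, a minimal sketch using the %timeit magic in a Jupyter cell (assuming df is the sample DataFrame loaded earlier) could look like this:

%timeit df.set_index(["Salesman", "Order Date"]).stack().reset_index(name='Amount')

%timeit df.melt(id_vars=['Salesman', 'Order Date'], var_name="Product Desc", value_name='Amount')

%timeit pd.wide_to_long(df, stubnames="Amount", i=["Salesman", "Order Date"], j="Product Desc", suffix=r"|Red Wine|White Wine|Whisky|Beer").reset_index()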

Conclusion

In this article, we reviewed three pandas methods for converting columns to rows when you need to unpivot data from wide to long format for further analysis. A simple test shows that melt performs best and wide_to_long takes the longest, but bear in mind that wide_to_long covers specific use cases that the other functions cannot handle as conveniently.



How to group consecutive rows of same values in pandas

Problem Statement

You have a data set in which you would like to group consecutive rows when one of the columns keeps the same value. If a different value appears in between, the records shall be split into separate groups.

To better elaborate the issue, let’s use an example.

Assume you have connection log data for some devices, such as Bluetooth devices. An event is triggered when a connection is established as well as when the paired device disconnects, and in between there may be additional events triggered for connectivity tests. Let's load the data and take a look:

import pandas as pd 

df = pd.read_excel("connection log.xlsx")

df.head(10)

You can see the below output from Jupyter Lab:

[Screenshot: connection log sample data]

If you would like to check the duration of each connection per device, you probably want to group the records that were triggered during the same connection. To determine whether records belong to the same connection, you can sort the event time in ascending order; if the device ID differs between consecutive rows, those rows must belong to different connections. So how can this be done in pandas?

Solution to group the consecutive rows

Let's sort the data first to make sure the records are in chronological order based on the event time:

df.sort_values(["Event Time", "Device ID"], ascending=[True, True], inplace=True)

To compare the value of a particular column between the current row and the previous row, we can use the Series shift method. For instance, we can shift the "Device ID" values down by one row and store the result in a new column named "Device ID X":

df["Device ID X"] = df["Device ID"].shift()

After the shifting, you shall see the updated data frame as per below:

[Screenshot: data frame with the shifted Device ID X column]

If you try to compare the values of both columns:

df["Device ID"] != df["Device ID X"]

You can see the return value of the True/False in a data series form:

[Screenshot: boolean Series from comparing Device ID with Device ID X]

Now comes the most critical step. Since the comparison is True only at the rows where the device ID switches to a new ID, we can use the pandas cumsum method to sum up these True values cumulatively as per below:

df["cumsum"] = (df["Device ID"] != df["Device ID X"]).cumsum()

In the cumulative sum, True values are counted as 1 and False values as 0, so each block of consecutive rows with the same device ID ends up with the same group number:

[Screenshot: data frame with the cumulative sum group numbers]

You can see that the rows we want to group together have been assigned the same value, and you can use this value to group the records for further analysis.

You can even combine the above steps into a one-liner to get the earliest and latest event time for each group:

df.groupby((df["Device ID"] != df["Device ID"].shift()).cumsum()).agg({"Event Time" : ["min", "max"]})

Output as per below:

[Screenshot: earliest and latest event time per group]
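Going one step further, a possible extension of the same idea (a sketch, assuming the column names shown above) keeps the device ID and computes the duration of each connection:

# group key: increases by 1 every time the device ID changes between consecutive rows
groups = (df["Device ID"] != df["Device ID"].shift()).cumsum()

summary = df.groupby(groups).agg(
    device=("Device ID", "first"),
    start=("Event Time", "min"),
    end=("Event Time", "max"),
)
summary["duration"] = summary["end"] - summary["start"]
print(summary)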

You may also be interested in some other similar topics from here.



Create Animated Charts In Python

Introduction

If you have been working as a data analyst or data scientist for some time, you probably already know how to use matplotlib to visualize and present data in various charts. The matplotlib library provides an animation module for generating dynamic charts to make your data more engaging; however, it still takes a few steps to format your data and to initialize and update it in the charts. In this article, I will demonstrate another Python library – pandas-alive – which lets you generate animated charts directly from a pandas data frame without any format conversion.

Prerequisites

You can install this library via pip command as per below if you do not have it in your working environment yet:

pip install pandas-alive

It will also install its dependencies such as pandas, pillow and numpy.

For the later examples, let's grab some sample covid-19 data from the internet; you can download it from here.

Before we start, we shall import all the necessary modules and do a preview of our sample data:

import pandas as pd
import pandas_alive

df_covid = pd.read_excel("covid-19 sample data.xlsx")

The data we will be working on would be something similar to the below:

[Screenshot: covid-19 sample data]

Now with all above ready, let’s dive into the code examples.

Generate animated bar chart race

A bar chart is the most straightforward way to present data, and it can be drawn horizontally or vertically. Let's do some minor reshaping of our data so that the date can be used as the horizontal or vertical axis:

df_covid = df_covid.pivot(index="date", columns="location", values="total_cases").fillna(0)

To create an animated bar chart horizontally, you can simply call plot_animated as per below:

df_covid.plot_animated("covid-19-h-bar.gif", period_fmt="%Y-%m", title="Covid-19 Cases")

The plot_animated function defaults to kind="race" and orientation="h", hence the output gif is generated as per below:

[Animated chart: horizontal bar chart race]

You can change the default values of these two parameters to generate a vertical bar chart race:

df_covid.plot_animated("covid-19-v-bar.gif", 
                     period_fmt="%Y-%m", 
                     title="Covid-19 Cases", 
                     orientation='v')

The output chart would be something similar to below:

[Animated chart: vertical bar chart race]

Generate animated line chart

To create an animated line chart, you just need to set the parameter kind="line" as per below:

df_covid.plot_animated("covid-19-line.gif",
                     title="Covid-19 Cases",
                     kind='line',
                     period_fmt="%Y-%m",
                     period_label={
                         'x':0.25,
                         'y':0.9,
                         'family': 'sans-serif',
                         'color':  'darkred'
                    })

There are some other parameters, such as period_label to control the position and format of the period label, or n_visible to constrain how many records are shown on the chart (a sketch of n_visible follows the chart below). The output chart would be as per below:

[Animated chart: line chart]

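As a rough sketch of the n_visible parameter mentioned above (shown here on the default bar chart race, where the library's own examples use it; treat the exact behaviour as an assumption and check the pandas-alive documentation for your version):

# limit the race to the top 5 locations at any point in time
df_covid.plot_animated("covid-19-top5-bar.gif",
                       period_fmt="%Y-%m",
                       title="Covid-19 Cases (Top 5)",
                       n_visible=5)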

Generate animated pie chart

Similar to other charts, you can create a simple pie chart with below parameters:

df_covid.plot_animated(filename='covid-19-pie-chart.gif',
                     kind="pie",                     
                     rotatelabels=True,
                     tick_label_size=5,
                     dpi=300,
                     period_fmt="%Y-%m",
                     )

You can also pass other matplotlib Axes.pie parameters to fine-tune the pie chart behaviour. The output from the above code would be:

[Animated chart: pie chart]

Generate scatter chart

Generating a scatter chart or bubble chart is slightly more complicated than the other charts, and for our sample data it does not make much sense to visualize it this way, but here is an example anyway:

df_covid.plot_animated(filename='covid-19-scatter-chart.gif',
                     kind="scatter",
                     period_label={'x':0.05,'y':0.9},
                     steps_per_period=5
                    )

You shall see the output is similar to the line chart:

[Animated chart: scatter chart]

Conclusion

Pandas-Alive provides very convenient ways to generate all sorts of animated charts from a pandas data frame, with underlying support from the matplotlib library. It accepts most of the parameters you already use in matplotlib, so you do not have to learn a lot of new things before applying it to your charts.

There are many more features beyond the basics covered above, such as supplying custom figures, generating geospatial charts or combining multiple animated charts in one view. You can check more examples on its project page.



How to calculate date difference between rows in pandas

Problem:

You have some data with date (or numeric) columns. You already know you can directly use the - operator to calculate the difference between columns, but now you would like to calculate the date difference between two consecutive rows.

For instance, you have some sample GPS tracking data with a start and end time at each location (latitude and longitude), and you would like to calculate the time gap within each location or between two locations.

import pandas as pd
import numpy as np

df = pd.read_excel("GPS Data.xlsx")
df.head(10)

For a quick view, you can see the sample data output as per below:

[Screenshot: GPS tracking sample data]
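As a baseline, the column-level difference mentioned above is just a plain subtraction (a sketch, assuming the Start Time and End Time columns shown in the sample data):

# time spent at each location: element-wise column subtraction
df["End Time"] - df["Start Time"]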

Solutions:

  Option 1: Using Series or Data Frame diff

The DataFrame diff function is the most straightforward way to compare values between the current row and previous rows. By default, it compares the current and previous row, and you can also specify the periods argument to compare the current row with the row that many positions before it. To calculate the time gap of the start time between two consecutive rows:

df["Start Time"].diff()

You shall see the below output:

[Screenshot: diff output as timedelta values]

If you check the data type of the result, you will see it shows as dtype('<m8[ns]'), i.e. a timedelta; you can convert the result into integer minutes for easier reading.

In this case, you can divide by a numpy timedelta64 of one minute to convert the date difference into minutes:

df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')

You shall see the below output:

[Screenshot: diff output converted to integer minutes]
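An equivalent and arguably more readable alternative (a sketch using the Series .dt accessor) converts the timedeltas to minutes like this:

df["Start Time"].diff().dt.total_seconds().div(60).fillna(0).astype("int64")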

You can also select multiple date columns as a data frame to apply the diff function.
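For example, a quick sketch of both ideas mentioned above – applying diff to several columns at once, and using the periods argument:

# consecutive-row difference for both date columns at once
df[["Start Time", "End Time"]].diff()

# compare each row with the row two positions earlier
df["Start Time"].diff(periods=2)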

  Option 2: Using Series or Data Frame shift with – operator

The shift function allows us to move values up/down or left/right by the given number of periods, depending on the axis you specify. You can think of it as similar to shifting cells in Excel.

To calculate a difference across rows, you shift the subtrahend column so that values from another row line up with the current row. For instance, shifting the Start Time column down by one row lets you subtract the previous row's Start Time from the current row's End Time (use shift(-1) instead if you want the following row's value):

df["End Time"] - df["Start Time"].shift(1)

You shall see the below result:

[Screenshot: End Time minus the previous row's Start Time]

If you want to calculate consecutive-row differences for multiple date columns at once, you can use the DataFrame shift operation as per below (note that the subtraction aligns on column names):

df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)

[Screenshot: consecutive-row differences for multiple date columns]

  Option 3: Using data frame sub

The DataFrame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For instance, subtracting the first row from every row (with axis='columns') gives the elapsed time, in minutes, since the first record. Below is the code:

(df.loc[:,["Start Time", "End Time"]].sub(df.loc[0,["Start Time", "End Time"]], axis='columns')/np.timedelta64(1, "m")).astype("int64")

You can see the below output:

[Screenshot: elapsed minutes since the first record]

If you would like to calculate the same "current End Time minus previous Start Time" difference using sub, you can write:

df["End Time"].sub(df["Start Time"].shift(1))

It produces the same result as the earlier example using the - operator with shift.

Conclusion:

Among the three options discussed above, diff is the most straightforward approach, but note that it only calculates differences within the same column; if you want the difference between the End Time of one row and the Start Time of another row, you will have to use sub or the - operator together with shift. One more difference between diff and sub is that sub has a fill_value argument to substitute missing values with a default, so you do not need another line of code to handle the NA values.


Pandas – split data into buckets with cut and qcut

If you do a lot of data analysis in your daily job, you may have encountered problems where you want to split data into buckets or groups based on certain criteria and then analyse the data within each group. For instance, you may want to check the popularity of your products or website within each age group, or understand what percentage of students fall into each score range. The most straightforward way might be to categorize your data with hand-written conditions and then summarize it, but this usually requires some additional effort to massage the data. In this article, I will share a simple way to bin your data with the pandas cut and qcut functions.

Prerequisite

You will need to install pandas package if you do not have it yet in your working environment. Below is the command to install pandas with pip:

pip install pandas

And let’s import the necessary packages and create some sample sales data for our later examples.

import pandas as pd
import numpy as np
df = pd.DataFrame({"Consignee" : ["Patrick", "Sara", "Randy", "John", "Patrick", "Joe"],
                   "Age" : [44, 51, 23, 30, 44, 39],
                  "Order Date" : pd.date_range(start='2020-08-01', end="2020-08-05", periods=6),
                  "Item Desc" : ["White Wine", "Whisky", "Red Wine", "Whisky", "Red Wine", "Champagne"],
                  "Price Per Unit": [10, 20, 30, 20, 30, 30], 
                  "Order Quantity" : [50, 60, 40, 20, 10, 50],
                  "Order Dimensions" : [0.52, 0.805, 0.48, 0.235,0.12, 0.58]})

With the above code, we can take a quick look at what the data looks like:

[Screenshot: sample sales data]

And let’s also calculate the total sales amount by multiplying the price per unit and the order quantity:

df["Total Amount"] = df["Price Per Unit"] * df["Order Quantity"]

Once this data is ready, let’s dive into the problems we are going to solve today.

split data into buckets by cut

If we would like to classify our customers into a few age groups and get an overall view of how much money each age group has spent on our products, how shall we do it? As I mentioned earlier, we are not going to apply some lambda function with conditions like "if the age is less than 30, classify the customer as young", because this can easily drive you crazy when you have hundreds or thousands of groups to define. Instead, we will use the powerful pandas cut function to achieve this.

The cut function has two mandatory arguments:

  • x – an array of values to be binned
  • bins – indicate how you want to bin your values

For instance, if you supply df["Age"] as the first argument and set bins to 2, you are telling pandas to split your age data into two equal-width groups. In our case, the minimum age is 23 and the maximum age is 51, so the first group runs from 23 to 23 + (51-23)/2 = 37, and the second group from 37 to 51. When you run the code below:

pd.cut(df["Age"],2)

You shall see the output similar to below:

[Screenshot: cut output with two age interval categories]

Pandas has classified our age data into these two groups, and the output shows that the data type is a pandas Categorical. This is very useful, as you can assign this category column back to the original data frame and do further analysis based on the categories from there.

Since we do not want decimal points for the age boundaries, we can set precision=0, and we also want to label our age data as 3 groups: Young, Mid-Aged and Old.

Below is the code that we assign our binned age data into “Age Group” column:

df["Age Group"] = pd.cut(df["Age"],3, precision=0, labels=["Young","Mid-Aged","Old"])

If you examine the data again, you would see:

[Screenshot: data frame with the Age Group column]

Pandas mapped our age data into 3 equal-width groups based on the minimum and maximum age values. But you may have noticed that age 44 has been classified as "Old", which does not sound right. In this case, we want to give our own definition of young, mid-aged and old in the bins argument. Let's delete the "Age Group" column and redo it as below:

df["Age Group"] = pd.cut(df["Age"],[20, 30, 50, 60], precision=0, labels=["Young","Mid-Aged","Old"])

With this list of integer intervals, we are telling pandas to split our data into 3 groups (20, 30], (30, 50] and (50, 60], and label them as Young, Mid-Aged and Old respectively. (here “(” means exclusive, and “]” means inclusive)

If we check the data again:

df[["Age", "Age Group"]]

You shall see the result we expected:

[Screenshot: Age and Age Group columns with custom intervals]

Now with this additional column, you can easily find out how much each age group contributed to the total sales amount. For example:

df.groupby("Age Group").agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

This would calculate the contribution % to the total sales amount within each group (more details from here):

[Screenshot: percentage contribution of each age group]

If you do not wish to have any intermediate data column (in our case, the "Age Group") added to your data frame, you can directly pass the output of cut into the groupby function:

df.groupby(pd.cut(df["Age"],[20, 30, 50, 55], precision=0, labels=["Young","Mid-Aged","Old"])).agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

The above code will produce the same result as previously.

There are times when you may want to define your bins with a fixed start point, end point and interval, for instance, to understand the total sales amount within each 0.1 range of the order dimensions.

For such case, we can make use of the arange function from numpy package, e.g.:

np.arange(0, 1, 0.1)

This would give us an array of values between 0 and 1 with interval of 0.1, and we can supply it as the bins to cut function:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1, 0.1))).agg({"Total Amount": "sum"})

With the above code, pandas splits the order dimensions into chunks of 0.1 and then summarizes the sales amount for each of these ranges:

[Screenshot: total amount per 0.1 order-dimension range]

Note that arange does not include the stop value 1, so if you wish to include 1, you can add one extra step to the stop value, e.g. np.arange(0, 1 + 0.1, 0.1). The cut function also has two arguments – right and include_lowest – to control whether the left and right edges are inclusive. E.g.:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1 + 0.1, 0.1), right=False, include_lowest=True)).agg({"Total Amount": "sum"})

This makes the left edge inclusive and the right edge exclusive; the output will be similar to below:

[Screenshot: total amount per range with left edge inclusive]

cut vs qcut

Pandas also provides another function, qcut, which splits your data based on quantiles (cut points derived from the distribution of the data). For instance, if you use qcut on the "Age" column:

pd.qcut(df["Age"],2, duplicates="drop")

You would see the age data has been split into two groups: (22.999, 41.5] and (41.5, 51.0].

[Screenshot: qcut output with two age groups]

If you examine the data inside each group:

pd.qcut(df["Age"],2, duplicates="drop").value_counts()

You would see qcut has split the total of 6 rows of age data equally into 2 groups, and the cut point is at 41.5:

[Screenshot: value counts of the two qcut groups]
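To see the difference between the two functions side by side, a quick sketch compares the group sizes produced by cut (equal-width bins over the value range) and qcut (bins holding roughly the same number of rows) on the same column:

# cut: bins of equal width -> group sizes may differ
print(pd.cut(df["Age"], 2).value_counts())

# qcut: bins chosen so each holds roughly the same number of rows
print(pd.qcut(df["Age"], 2, duplicates="drop").value_counts())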

So if you would like to find 4 age groups that each spent a similar amount of money on your products, you can do as below:

df.groupby(pd.qcut(df["Age"],4, duplicates="drop")).agg({"Total Amount" : "sum"})

And you would see that when we split the data into these 4 groups, the total sales amounts are relatively similar:

[Screenshot: total amount per qcut age group]

Conclusion

In this article, we reviewed the pandas cut and qcut functions, which we can use to split data into buckets either with self-defined intervals or based on cut points from the data distribution.

Hope this gives you some hints when you are solving the problems similar to what we have discussed here.