Introduction
If you have used pandas for data analysis, you already have some idea of how powerful and flexible it is for data processing. There is often more than one way to solve a problem, and choosing the best approach becomes a tough decision of its own. For instance, in one of my previous articles, I tried to summarize 20 ways to filter records in pandas, which is certainly not a complete list of all possible solutions. In this article, I will discuss the different ways to merge or combine data in pandas and when to use each of them, since combining data is often a necessary step before you start your analysis.
Prerequisites
If you have not yet installed pandas, you can install it from PyPI with the command below:
pip install pandas
Then import the module at the beginning of your code:
import pandas as pd
Let’s dive into the code examples.
Combine Data with Append vs Concat
Imagine you have the two data frames below from different sources, and you would like to merge them into one data frame.
df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name": ["Kelly"]})
The most straightforward way is to use the append method of the pandas DataFrame object:
df1.append(df2, ignore_index=True)
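The append method adds rows to the end of the current data frame and returns a new data frame; with ignore_index=True, the rows of the result are relabeled starting from 0. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this call only works on older versions.
The output is a six-row data frame labeled 0 to 5, with Kelly's record as the last row.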
Alternatively, you can use the pandas concat function, whose name is self-explanatory. It offers a few more parameters for shaping the result, such as axis, which sets the axis to concatenate along, and join, which chooses between a union ("outer") and an intersection ("inner") of the labels on the other axis.
You can use the code below to generate the same output as before:
pd.concat([df1, df2], ignore_index=True)
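As a minimal sketch of the two parameters mentioned above, the snippet below contrasts the default row labels with ignore_index=True, and the default join="outer" with join="inner". The df_extra frame and its Age column are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name": ["Kelly"]})

# Default: each frame keeps its own row labels, so label 0 appears twice.
kept = pd.concat([df1, df2])
# ignore_index=True: the rows are relabeled 0 to 5.
relabeled = pd.concat([df1, df2], ignore_index=True)

# Hypothetical frame with one extra column, for illustration only.
df_extra = pd.DataFrame({"ID": [7], "Name": ["Liam"], "Age": [12]})

# join="outer" (the default) keeps the union of columns; gaps become NaN.
outer = pd.concat([df1, df_extra], ignore_index=True)
# join="inner" keeps only the columns present in every frame.
inner = pd.concat([df1, df_extra], join="inner", ignore_index=True)

print(kept.index.tolist())
print(relabeled.index.tolist())
print(outer.columns.tolist())
print(inner.columns.tolist())
```

The inner variant silently drops the Age column, so it is worth checking the resulting columns when the sources do not share the same schema.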
If you would like to retain a reference to the sources in your result, you can use the keys parameter as below:
pd.concat([df1, df2], keys=["src_1", "src_2"])
This returns a data frame with a MultiIndex, where you can easily refer back to the data by source (e.g. df.loc["src_1"]).
A new data frame can also be added as columns with axis=1, for instance:
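The lookup by source can be sketched as follows, reusing the same two frames. The variable names combined and from_src_2 are only chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name": ["Kelly"]})

combined = pd.concat([df1, df2], keys=["src_1", "src_2"])

# The outer index level holds the key, the inner level the original row label.
print(combined.index.tolist())

# Select all rows that came from the second source.
from_src_2 = combined.loc["src_2"]
print(from_src_2)
```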
df3 = pd.DataFrame({"Age": [12, 13, 13, 12, 13]})
pd.concat([df1, df3], axis=1)
The Age values have been added to the caller as a new column, aligned on the row index.
Because concat accepts a list of data frames, you can combine multiple data frames in a single call, which is much faster than calling append for each one in turn, since every append call copies the data accumulated so far.
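A rough sketch of the difference, assuming a hypothetical list of small frames named chunks: both routes give the same result, but the pairwise loop rebuilds the accumulated frame at every step, while the single call builds it once.

```python
import pandas as pd

# Hypothetical chunks standing in for data loaded from several sources.
chunks = [pd.DataFrame({"ID": [i], "Name": [f"student_{i}"]})
          for i in range(1, 6)]

# One call: the result is built in a single pass over all five frames.
all_at_once = pd.concat(chunks, ignore_index=True)

# Pairwise: every step copies everything accumulated so far.
one_by_one = chunks[0]
for chunk in chunks[1:]:
    one_by_one = pd.concat([one_by_one, chunk], ignore_index=True)

# Both approaches produce identical data frames.
print(all_at_once.equals(one_by_one))
```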
Merge Data with Join vs Merge
Besides appending rows or columns along an axis, you may sometimes need more sophisticated operations similar to the left/right joins in a relational database. For such scenarios, use the pandas merge or join method.
The previous example of appending df2 to df1 can be achieved with merge as well:
df1.merge(df2, how="outer")
The output contains the five rows of df1 plus Kelly's record from df2.
It is more tedious to achieve the same via join, since join matches against the index of the other data frame, so you have to set that index to the columns you would like to use as keys. Below is how you can do it via join:
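To see where each row of the outer merge came from, one option is merge's indicator parameter, which adds a _merge column. This is a small illustrative sketch and not part of the original example.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df2 = pd.DataFrame({"ID": [6], "Name": ["Kelly"]})

# With no "on" argument, merge joins on the columns both frames share
# (here ID and Name). indicator=True records the origin of each row.
result = df1.merge(df2, how="outer", indicator=True)
print(result)
```

Rows found only in df1 are marked left_only and Kelly's row is marked right_only; no row is marked both, because the two frames share no record.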
df1.join(df2.set_index(["ID", "Name"]), on=["ID", "Name"], how="outer").reset_index(drop=True)
Assume you have the students’ scores for each subject below, and you want to merge the student information (df1) with them based on the “Name” column:
df4 = pd.DataFrame({"ID" : [1001, 1002, 1003, 1002, 1001], "Subject": ["Science", "Math", "English", "Math", "Science"], "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"], "Score" : ["A", "B", "C", "B", "B"]})
With the merge function, you can specify a left join on the “Name” column as below:
df1.merge(df4, on="Name", how="left")
Pandas automatically adds suffixes whenever the two data frames have columns with the same name (e.g. “ID” in df1 and df4), so the output contains ID_x from df1 and ID_y from df4. Jimmy appears twice because he has two score records, and Jill's score columns are NaN because she has none.
To generate the same output via join, you can use the code below, in which you pre-set the index of df4 and specify the suffixes for the left and right data frames:
df1.join(df4.set_index("Name"), on="Name", lsuffix="_x", rsuffix="_y")
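The merge counterpart of lsuffix and rsuffix is the suffixes parameter, which takes a pair of strings. The sketch below uses the made-up suffixes _student and _score to make the origin of each ID column more readable.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df4 = pd.DataFrame({"ID": [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"],
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"],
                    "Score": ["A", "B", "C", "B", "B"]})

# Hypothetical suffix names, chosen only for this illustration.
result = df1.merge(df4, on="Name", how="left",
                   suffixes=("_student", "_score"))
print(result.columns.tolist())
```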
Of course, if you would like to perform a right join on the two data frames above, you can do it as below:
df1.merge(df4, on="Name", how="right")
# or
df1.join(df4.set_index("Name"), on="Name", how="right", lsuffix="_x", rsuffix="_y")
The output has five rows, one for each record in df4; Jill does not appear because she has no score record.
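A quick way to compare the join types is to count the rows each one returns. The sketch below does this for the same two frames; the counts dictionary is only a helper for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Aaron", "Jimmy", "Zoe", "Jill", "Jenny"]})
df4 = pd.DataFrame({"ID": [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"],
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"],
                    "Score": ["A", "B", "C", "B", "B"]})

# Count the rows returned by each join type on the "Name" column.
counts = {how: len(df1.merge(df4, on="Name", how=how))
          for how in ("left", "right", "inner")}
print(counts)
```

The left join keeps Jill with empty score columns, while the right and inner joins drop her because df4 has no matching record.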
Merge DataFrame with Duplicate Keys
When merging multiple DataFrame objects, you may occasionally find duplicate values in the columns you want to use as join keys. For instance, you may have the records below if a subject has more than one lecturer:
df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"], "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})
When you merge this information with the student scores based on the subject, using either merge or join:
df4.merge(df5, on="Subject", how="left")
# or
df4.join(df5.set_index("Subject"), on="Subject", how="left")
Because of the duplicate keys in df5, every df4 row is paired with each lecturer of its subject, giving M x N records per key. Here the output has 9 rows instead of 5.
If your objective is something similar to Excel's VLOOKUP, which returns the first matched value, you can use the drop_duplicates method to remove the duplicate records before joining, e.g.:
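One way to catch this situation early is merge's validate parameter, which raises pandas.errors.MergeError when the keys are not as unique as declared. The following is a minimal sketch; the duplicates_found flag is just a helper for the example.

```python
import pandas as pd

df4 = pd.DataFrame({"ID": [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"],
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"],
                    "Score": ["A", "B", "C", "B", "B"]})
df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"],
                    "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})

# Each df4 row pairs with every lecturer of its subject:
# 2*2 (Science) + 2*2 (Math) + 1*1 (English) = 9 rows.
expanded = df4.merge(df5, on="Subject", how="left")
print(len(df4), len(expanded))

# validate="many_to_one" asks pandas to check that the keys in df5 are unique.
try:
    df4.merge(df5, on="Subject", how="left", validate="many_to_one")
    duplicates_found = False
except pd.errors.MergeError:
    duplicates_found = True
print(duplicates_found)
```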
df4.merge(df5.drop_duplicates("Subject"), on="Subject", how="left")
This combines the two data frames using only the first matched record from df5, so Science maps to Michael and Math maps to Tim.
If you do not want to lose information from the lecturer data frame, you need to aggregate the data before joining, e.g.:
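Which record counts as the match is controlled by the keep parameter of drop_duplicates. As a small sketch, the code below compares the default keep="first" with keep="last" on the same data.

```python
import pandas as pd

df4 = pd.DataFrame({"ID": [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"],
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"],
                    "Score": ["A", "B", "C", "B", "B"]})
df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"],
                    "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})

# keep="first" (the default) retains the first lecturer listed per subject.
first = df4.merge(df5.drop_duplicates("Subject"), on="Subject", how="left")
# keep="last" retains the last lecturer listed per subject instead.
last = df4.merge(df5.drop_duplicates("Subject", keep="last"),
                 on="Subject", how="left")

print(first["Lecturer"].tolist())
print(last["Lecturer"].tolist())
```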
df4.merge(df5.groupby("Subject").agg({"Lecturer" : lambda x: ','.join(x)}), on="Subject", how="left")
With this aggregation, each score record keeps a single row, and the Lecturer column holds all lecturers of the subject as one comma-separated string (e.g. "Michael,John" for Science).
Based on the above examples, you may find that merge and join are interchangeable in most cases, though join needs a bit more typing because of its different default arguments. Since join always matches against the index of the other data frame, you have to preset that index on the key columns before joining.
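The aggregation can also be written in two steps, which makes the intermediate lookup table easier to inspect. This is an equivalent sketch rather than the exact code above; the lecturers variable is introduced only for this example.

```python
import pandas as pd

df4 = pd.DataFrame({"ID": [1001, 1002, 1003, 1002, 1001],
                    "Subject": ["Science", "Math", "English", "Math", "Science"],
                    "Name": ["Aaron", "Jimmy", "Jimmy", "Zoe", "Jenny"],
                    "Score": ["A", "B", "C", "B", "B"]})
df5 = pd.DataFrame({"Subject": ["Science", "Science", "Math", "Math", "English"],
                    "Lecturer": ["Michael", "John", "Tim", "Robert", "Alex"]})

# Step 1: collapse df5 to one row per subject, joining the names with commas.
lecturers = df5.groupby("Subject")["Lecturer"].agg(",".join).reset_index()
print(lecturers)

# Step 2: the keys are now unique, so the left join keeps one row per score.
result = df4.merge(lecturers, on="Subject", how="left")
print(result[["Name", "Subject", "Lecturer"]])
```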
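The difference in defaults is easiest to see when both frames already share an index. In the sketch below, students and ages are hypothetical frames indexed by ID; join needs no arguments, while merge has to be told to match on the index.

```python
import pandas as pd

# Hypothetical frames that already share an "ID" index.
students = pd.DataFrame({"Name": ["Aaron", "Jimmy", "Zoe"]},
                        index=pd.Index([1, 2, 3], name="ID"))
ages = pd.DataFrame({"Age": [12, 13, 13]},
                    index=pd.Index([1, 2, 3], name="ID"))

# join defaults to a left join on the index, so no extra arguments are needed.
joined = students.join(ages)

# The equivalent merge has to request index-based matching explicitly.
merged = students.merge(ages, left_index=True, right_index=True, how="left")

print(joined.equals(merged))
```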
Conclusion
In this article, we have reviewed a few methods pandas offers for combining data frames, with some sample code. To wrap up, append and concat are usually used for combining two or more data frames along the row or column axis, and concat performs better than append when you have multiple data frames to work on. If you need high-performance in-memory join operations like SQL joins in a relational database, use the merge or join method, which are interchangeable in most scenarios. In addition, if the data frames you work with do not have an index on the join key, using merge instead of join will probably save you some typing.