
Pandas format column headers
When using Pandas to deal with data from various sources, you may usually see the data headers in various formats, for instance, some people prefers to use upper case, some uses lowercase or camel case. And there are also different ways to join the words when using as column label, such as space, hyphen or underscore are commonly seen. This potentially causes some problem when you want to reference a particular column since pandas column label is case sensitive, and you may get confused what the correct spelling. In this case, you would need to format column headers into a standard format before processing the data. This article will be explaining the different ways to format column headers.
Prerequisite:
You will need to install pandas package in order to follow the below examples. Below is the command to install pandas with pip:
pip install pandas
With the package installed, let’s create a sample data set for our later use:
import pandas as pd df = pd.DataFrame({"Salesman" : ["Patrick", "Sara", "Randy"], "order date" : pd.date_range(start='2020-08-01', periods=3), "Item Desc " : ["White Wine", "Whisky", "Red Wine"], "Price Per-Unit": [10, 20, 30], "Order Quantity" : [50, 10, 40], 99: ["remak1", "remark2", "remark3"]})
You can preview your data set from Jupyter Notebook, it would be similar to below:
You probably wonder why someone would use number as column header, but it does happen in the real-world for various reasons.
If you use df[99] or df.loc[0,99], you are able to see the correct data, which means it does not confuse pandas whether your column label is string or numeric.
But it sometimes causes readability issue to human and introduce errors, especially if you always assume column labels are string and perform some string operation on them.
Convert column header to string
So the first thing we probably want to do it to convert column header into string. You can use the astype method to convert it:
df.columns = df.columns.astype("str")
A lot of pandas methods have “inplace” parameter to apply the changes without creating new objects, but astype does not support “inplace”, so we will need to re-assign the formatted object back to df.columns.
Format column header with cases
If you would like to reference all columns in uppercase, you can have at least three choices of doing it:
- Use the str method from pandas Index object
- Use the map method from pandas Index object
- Use Python built-in map method
Below is the sample code for above 3 options:
#Index.str method df.columns = df.columns.str.upper() #Index.map method df.columns = df.columns.map(str.upper) #Python built-in map method df.columns = map(str.upper, df.columns)
The column headers would be all converted to uppercase:
Index(['SALESMAN', 'ORDER DATE', 'ITEM DESC ', 'PRICE PER-UNIT', 'ORDER QUANTITY', '99'], dtype='object')
Option 1 seems to be most straightforward way as long as the operations are supported by str, such as ljust, rjust, split etc.
Similarly, you can convert column headers to lowercase with str.lower():
df.columns = df.columns.str.lower()
or camel case with str.title if this is the format you wish to standardize across all data sources:
df.columns = df.columns.str.title()
Replace characters in column header
If you noticed there is a space accidentally added in my column header – “Item Desc “, this will cause index error if I use df[“Item Desc”] to access the column. To fix this, we can use the str.strip to remove all the leading or trailing spaces:
df.columns = df.columns.str.strip()
But those spaces in-between cannot be removed, if want to you use df.Item Desc , it will give you error. The best way is to replace all the spaces with hyphen or underscore, so that you can use both df[“Item_Desc”] and df.Item_Desc format to reference the column. Below is how you can use a simple lambda function to replace the space and hyphen with underscore:
df.columns = df.columns.map(lambda x : x.replace("-", "_").replace(" ", "_")) # Or df.columns = map(lambda x : x.replace("-", "_").replace(" ", "_"), df.columns)
If you check again, the column header would be updated as per below:
Index(['Salesman', 'Order_Date', 'Item_Desc', 'Price_Per_Unit', 'Order_Quantity', '99'], dtype='object')
Note that, if you use df.columns.str.replace, you cannot just chain multiple replace function together, as the first replace function just return an Index object not a string.
Often you would see there are new line characters in the column header, you can remove them with the replace method as per below:
df.columns = df.columns.str.replace("\n", "")
Add prefix or suffix to column header
With the map and lambda, you can also easily add prefix or suffix to the column header, e.g.:
#adding prefix with "Label_" df.columns = df.columns.map(lambda x : "Label_" + x) #adding suffix with "_Col" df.columns = df.columns.map(lambda x : x + "_Col")
Use of rename method
If you find the entire column header is not meaningful to you, you can manually rename multiple column names at one time with the data frame rename method as per below:
df.rename(columns={"Salesman" : "Sales Person", "Item Desc " : "Order Desc"}, inplace=True)
The rename method support inplace parameter, so you can immediately apply the changes in the original data frame.
flatten multi index column
After you aggregated your data with groupby and agg function, you may sometimes get a multi index column header, for instance:
df_sum = df.groupby("Salesman").agg({"Order Quantity": ["mean", "sum"]})
When you calculate both mean and sum of the “Order Quantity” column at the same time, you will get the result similar to below:
The column header become a multi index header, so if you want to flatten this column header by join the two levels into one, you can make use of the list comprehension as per below :
df_sum.columns = [' '.join(col) for col in df_sum.columns]
With the above, you would see column header changed from hierarchical to flattened as per the below:
Conclusion
In this article, we have discussed a few options you can use to format column headers such as using str and map method of pandas Index object, and if you want something more than just some string operation, you can also pass in a lambda function. All these methods are not just limited to column header or row label (Index object), you can also use them to format your data series.
If you are interested in other topics about pandas, you may refer to here.