ken

pandas format column headers

Pandas format column headers

When using Pandas to deal with data from various sources, you may usually see the data headers in various formats, for instance, some people prefers to use upper case, some uses lowercase or camel case. And there are also different ways to join the words when using as column label, such as space, hyphen or underscore are commonly seen. This potentially causes some problem when you want to reference a particular column since pandas column label is case sensitive, and you may get confused what the correct spelling. In this case, you would need to format column headers into a standard format before processing the data. This article will be explaining the different ways to format column headers.

Prerequisite:

You will need to install pandas package in order to follow the below examples. Below is the command to install pandas with pip:

pip install pandas

With the package installed, let’s create a sample data set for our later use:

import pandas as pd
df = pd.DataFrame({"Salesman" : ["Patrick", "Sara", "Randy"],
                  "order date" : pd.date_range(start='2020-08-01', periods=3),
                  "Item Desc " : ["White Wine", "Whisky", "Red Wine"],
                  "Price Per-Unit": [10, 20, 30], 
                  "Order Quantity" : [50, 10, 40],
                  99: ["remak1", "remark2", "remark3"]})

You can preview your data set from Jupyter Notebook, it would be similar to below:

pandas format column headers

You probably wonder why someone would use number as column header, but it does happen in the real-world for various reasons.

If you use df[99] or df.loc[0,99], you are able to see the correct data, which means it does not confuse pandas whether your column label is string or numeric.

pandas format column headers

But it sometimes causes readability issue to human and introduce errors, especially if you always assume column labels are string and perform some string operation on them.

Convert column header to string

So the first thing we probably want to do it to convert column header into string. You can use the astype method to convert it:

df.columns = df.columns.astype("str")

A lot of pandas methods have “inplace” parameter to apply the changes without creating new objects, but astype does not support “inplace”, so we will need to re-assign the formatted object back to df.columns.

Format column header with cases

If you would like to reference all columns in uppercase, you can have at least three choices of doing it:

  • Use the str method from pandas Index object
  • Use the map method from pandas Index object
  • Use Python built-in map method

Below is the sample code for above 3 options:

#Index.str method
df.columns = df.columns.str.upper()

#Index.map method
df.columns = df.columns.map(str.upper)

#Python built-in map method
df.columns = map(str.upper, df.columns)

The column headers would be all converted to uppercase:

Index(['SALESMAN', 'ORDER DATE', 'ITEM DESC ', 'PRICE PER-UNIT',
       'ORDER QUANTITY', '99'],
      dtype='object')

Option 1 seems to be most straightforward way as long as the operations are supported by str, such as ljust, rjust, split etc.

Similarly, you can convert column headers to lowercase with str.lower():

df.columns = df.columns.str.lower()

or camel case with str.title if this is the format you wish to standardize across all data sources:

df.columns = df.columns.str.title()

Replace characters in column header

If you noticed there is a space accidentally added in my column header – “Item Desc “, this will cause index error if I use df[“Item Desc”] to access the column. To fix this, we can use the str.strip to remove all the leading or trailing spaces:

df.columns = df.columns.str.strip()

But those spaces in-between cannot be removed, if want to you use df.Item Desc , it will give you error. The best way is to replace all the spaces with hyphen or underscore, so that you can use both df[“Item_Desc”] and df.Item_Desc format to reference the column. Below is how you can use a simple lambda function to replace the space and hyphen with underscore:

df.columns = df.columns.map(lambda x : x.replace("-", "_").replace(" ", "_"))
# Or
df.columns = map(lambda x : x.replace("-", "_").replace(" ", "_"), df.columns)

If you check again, the column header would be updated as per below:

Index(['Salesman', 'Order_Date', 'Item_Desc', 'Price_Per_Unit',
       'Order_Quantity', '99'],
      dtype='object')

Note that, if you use df.columns.str.replace, you cannot just chain multiple replace function together, as the first replace function just return an Index object not a string.

Often you would see there are new line characters in the column header, you can remove them with the replace method as per below:

df.columns = df.columns.str.replace("\n", "")

Add prefix or suffix to column header

With the map and lambda, you can also easily add prefix or suffix to the column header, e.g.:

#adding prefix with "Label_"
df.columns = df.columns.map(lambda x : "Label_" + x)

#adding suffix with "_Col"
df.columns = df.columns.map(lambda x : x + "_Col")

Use of rename method

If you find the entire column header is not meaningful to you, you can manually rename multiple column names at one time with the data frame rename method as per below:

df.rename(columns={"Salesman" : "Sales Person", "Item Desc " : "Order Desc"}, inplace=True)

The rename method support inplace parameter, so you can immediately apply the changes in the original data frame.

pandas rename column name

flatten multi index column

After you aggregated your data with groupby and agg function, you may sometimes get a multi index column header, for instance:

df_sum = df.groupby("Salesman").agg({"Order Quantity": ["mean", "sum"]})

When you calculate both mean and sum of the “Order Quantity” column at the same time, you will get the result similar to below:

python pandas format column header multi index column

The column header become a multi index header, so if you want to flatten this column header by join the two levels into one, you can make use of the list comprehension as per below :

df_sum.columns = [' '.join(col) for col in df_sum.columns]

With the above, you would see column header changed from hierarchical to flattened as per the below:

python pandas format column header flatten multi index column

 

Conclusion

In this article, we have discussed a few options you can use to format column headers such as using str and map method of pandas Index object, and if you want something more than just some string operation, you can also pass in a lambda function. All these methods are not just limited to column header or row label (Index object), you can also use them to format your data series.

If you are interested in other topics about pandas, you may refer to here.

 

split or merge PDF files with PyPDF2

Split or merge PDF files with 5 lines of Python code

There are many cases you want to extract a particular page from a big PDF file or merge PDF files into one due to various reasons. You can make use of some PDF editor tools to do this, but you may realize the split or merge functions are usually not available in the free version, or it is too tedious when there are just so many pages or files to be processed. In this article, I will be sharing a simple solution to split or merge multiple PDF files with a few lines of Python code.

Prerequisite

We will be using a Python library called PyPDF2, so you will need to install this package in your working environment. Below is an example with pip:

pip install PyPDF2

Let’s get started

The PyPDF2 package has 4 major classes PdfFileWriter, PdfFileReader, PdfFileMerger and PageObject which looks quite self explanatory from class name itself. If you need to do something more than split or merge PDF pages, you may want to check this document to find out more about what you can do with this library.

Split PDF file

When you want to extract a particular page from the PDF file and make it a separate PDF file, you can use PdfFileReader to read the original file, and then you will be able to get a particular page by it’s page number (page number starts from 0). With the PdfFileWriter, you can use addPage function to add the PDF page into a new PDF object and save it.

Below is the sample code that extracts the first page of the file1.pdf and split it as a separate PDF file named first_page.pdf

from PyPDF2 import PdfFileWriter, PdfFileReader
input_pdf = PdfFileReader("file1.pdf")
output = PdfFileWriter()
output.addPage(input_pdf.getPage(0))
output.write("first_page.pdf")

The input_pdf.getPage(0) returns the PageObject which allows you to modify some of the attributes related to the PDF page, such as rotate and scale the page etc. So you may want to understand more from here.

Merge PDF files

To merge multiple PDF files into one file, you can use PdfFileMerger to achieve it. Although you can also do with PdfFileWriter, but PdfFileMerger probably is more straightforward when you do not need to edit the pages before merging them.

Below is the sample code which using append function from PdfFileMerger to append multiple PDF files and write into one PDF file named merged.pdf

from PyPDF2 import PdfFileReader, PdfFileMerger
pdf_file1 = PdfFileReader("file1.pdf")
pdf_file2 = PdfFileReader("file2.pdf")
output = PdfFileMerger()
output.append(pdf_file1)
output.append(pdf_file2)
output.write("merged.pdf")

If you do not want to include all pages from your original file, you can specify a tuple with starting and ending page number as pages argument for append function, so that only the pages specified would be add to the new PDF file.

The append function will always add new pages at the end, in case you want to specify the position where you wan to put in your pages, you shall use merge function. It allows you to specify the position of the page where you want to add in the new pages.

Conclusion

PyPDF2 package is a very handy toolkit for editing PDF files. In this article, we have reviewed how we can make use of this library to split or merge PDF files with some sample codes. You can modify these codes to suit your needs in order to automate the task in case you have many files or pages to be processed. There is also a pdfcat script included in this project folder which allows you to split or merge PDF files by calling this script from the command line. You may also want to take a look in case you just simply deal with one or two PDF files each time.

In case you are interested in other topics related to Python automation, you may check here. Thanks for reading.

python decorators

Why we should use Python decorator

Introduction

Decorator is one of the very important features in Python, and you may have seen it many places in Python code, for instance, the functions with annotation like @classmethod, @staticmethod, @property etc. By definition, decorator is a function that extends the functionality of another function without explicitly modifying it. It makes the code shorter and meanwhile improve the readability. In this article, I will be sharing with you how we shall use the Python decorators.

Basic Syntax

If you have checked my this article about the Python closure, you may still remember that we have discussed about Python allows to pass in a function into another function as argument. For example, if we have the below functions:

add_log – to add log to inspect all the positional and keyword arguments of a function before actually calling it

send_email – to accept some positional and keyword arguments for sending out emails

def add_log(func):
    def log(*args, **kwargs):
        for arg in args:
            print(f"{func.__name__} - args: {arg}")
        for key, val in kwargs.items():
            print(f"{func.__name__} - {key}, {val}")
        return func(*args, **kwargs)
    return log

def send_email(subject, to, **kwargs):  
    #send email logic 
    print(f"email sent to {to} with subject {subject}.")

We can pass in the send_email function to add_log as argument, and then we trigger the sending of the email.

sender = add_log(send_email)
sender("hello", "contact@codeforests.com", attachment="debug.log", urgent_flag=True)

This code will generate the output as per below:

python decorator pass function as argument

You can see that the send_email function has been invoked successfully after all the arguments were printed out. This is exactly what decorator is doing – extending the functionality of the send_email function without changing its original structure. When you directly call the send_email again, you can still see it’s original behavior without any change.

python decorator pass function as argument

Python decorator as a function

Before Python 2.4, the classmethod() and staticmethod() function were used to decorate functions by passing in the decorated function as argument. And later the @ symbol was introduced to make the code more concise and easier to read especially when the functions are very long.

So let implement our own decorator with @ syntax.

Assuming we have the below decorator function and we want to check if user is in the whitelist before allowing he/she to access certain resources. We follow the Python convention to use wrapper as the name of the inner function (although it is free of your choice to use any name).

class PermissionDenied(Exception):
    pass

def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]
    def wrapper(*args, **kwargs):
        user = args[0]
        if not user in whitelist:
            raise PermissionDenied
        func(*args, **kwargs)
    return wrapper

Next, we decorate our function with permission_required as per below:

@permission_required
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())

When we call our function as per normal, we shall expect the decorator function to be executed first to check if user is in the whitelist.

read_file("John", r"C:\pwd.txt")

You can see the below output has been printed out:

python decorator read file output -1

If we pass in some user name not in the whitelist:

read_file("Johnny", r"C:\pwd.txt")

You would see the permission denied exception raised which shows everything works perfect as per we expected.

python decorator read file permission denied

But if you are careful enough, you may find something strange when you check the below.

python decorator read file output -3

So it seems there is some flaw with this implementation although the functional requirement has been met. The function signature has been overwritten by the decorator, and this may cause some confusing to other people when they want to use your function.

Use of the functools.wraps

To solve this problem, we will need to introduce one more Python module functools, where we can use the wraps method to update back the metadata info for the original function.

Let update our decorator function again by adding @wraps(func) to the wrapper function:

from functools import wraps

def permission_required(func):
    ...
    @wraps(func)
    def wrapper(*args, **kwargs):
       ...
    return wrapper

Finally, when we check the function signature and name again, it shows the correct information now.

python decorator read file output -4

So what happened was that, the @wraps(func) would invoke a update_wrapper function which updates the metadata of the original function automatically so that you will not see the wrapper’s metadata. You may want to check the update_wrapper function in the functools module to further understand how the metadata is updated.

Beside decorating normal function, the decorator function can be also used to decorate the class function, for instance, the @staticmethod and @property are commonly seen in Python code to decorate the class functions.

Python decorator as a class

Decorator function can be also implemented as a class in case you find your wrapper function has grown too big or has nested too deeply. To make this happen, you will need to implement a __call__ function so that the class instance become callable with the decorated function as argument.

Below is the code that implements our earlier example as a class:

from functools import update_wrapper
class PermissionRequired:
    def __init__(self, func):
        self._whitelist = ["John", "Jane", "Joe"]
        update_wrapper(self, func)
        self._func = func
        
    def __call__(self, *args, **kwargs):  
        user = args[0]
        if not user in self._whitelist:
            raise PermissionDenied
        return self._func(*args, **kwargs)

Take note that we will need to call the update_wrapper function to manually update the metadata for our decorated function. And same as before, we can continue using @ with class name to decorate our function.

@PermissionRequired
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())

Conclusion

In this article, we have reviewed through the reasons of Python decorators being introduced with the basic syntax of implementing our own decorators. And we also discussed about the decorator as function and class with some examples. Hopefully this article would help you to enhance your understanding about Python decorator and guide you on how to use it in your project.

 

Photo by Ali Yahya on Unsplash

Master python closure with 3 real-world examples

Introduction

Python closure is a technique for binding function with an environment where the function gets access to all the variables defined in the enclosing scope. Closure typically appears in the programming language with first class function, which means functions are allowed to be passed as arguments, return value or assigned to a variable.

This definition sounds confusing to the python beginners, and sometimes the examples found from online also not intuitive enough in the way that most of the examples are trying to illustrate with some printing statement, so the readers may not get the whole idea of why and how the closure should be used. In this article, I will be using some real-world example to explain how to use closure in your code.

Nested function in Python

To understand closure, we must first know that Python has nested function where one function can be defined inside another. For instance, the below inner_func is the nested function and the outer_func returns it’s nested function as return value.

def outer_func():    
    print("starting outer func")
    def inner_func():
        pi = 3.1415926
        print(f"pi is : {pi}")
    return inner_func

When you invoke the outer_func, it returns the reference to the inner_func, and subsequently you can call the inner_func. Below is the output when you run in Jupyter Notebook:

python closure nested function example

After you have got some feeling about the nested function, let’s continue to explore how nested function is related to closure. If we modify our previous function and move the pi variable into outer function, surprisedly it generates the same result as previously.

def outer_func():    
    print("starting outer func")
    #move pi variable definition to outer function
    pi = 3.1415926
    def inner_func():
        print(f"pi is : {pi}")
    return inner_func

You may wonder the pi variable is defined in outer function which is a local variable to outer_func, why inner_func is able access it since it’s not a global scope? This is exactly where closure happens, the inner_func has the full access to the environment (variables) in it’s enclosing scope. The inner_func refers to pi variable as nonlocal variable since there is no other local variable called pi.

If you want to modify the value of the pi inside the inner_func, you will have to explicitly specify “nonlocal pi” before you modify it since it’s immutable data type.

With the above understanding, now let’s walk through some real-world examples to see how we can use closure in our code.

Hide data with Python closure

Let’s say we want to implement a counter to record how many time the word has been repeated. The first thing you may want to do is to define a dictionary in global scope, and then create a function to add in the words as key into this dictionary and also update the number of times it repeated. Below is the sample code:

counter = {}

def count_word(word):    
    global counter
    counter[word] = counter.get(word, 0) + 1
    return counter[word]

To make sure the count_word function updates the correct “counter”, we need to put the global keyword to explicitly tell Python interpreter to use the “counter” defined in global scope, not any variable we accidentally defined with the same name in the local scope (within this function).

Sample output:

python closure word counter sample output

The above code works as expected, but there are two potential issues: Firstly, the global variable is accessible to any of the other functions and you cannot guarantee your data won’t be modified by others. Secondly, the global variable exists in the memory as long as the program is still running, so you may not want to create so many global variables if not necessary.

To address these two issues, let’s re-implement it with closure:

def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

If we run it from Jupyter Notebook, you will see the below output:

python closure word counter example output

With this implementation, the counter dictionary is hidden from the public access and the functionality remains the same. (you may notice it works even after the word_counter function is deleted)

Convert small class to function with Python closure

Occasionally in your project, you may want to implement a small utility class to do some simple task. Let’s take a look at the below example:

import requests

class RequestMaker:
    def __init__(self, base_url):
        self.url = base_url
    def request(self, **kwargs):
        return requests.get(self.url.format_map(kwargs))

You can see the below output when you call the make_request from an instance of RequestMaker:

python closure small class example

Since you’ve already seen in the word counter example, the closure can also hold the data for your later use, the above class can be converted into a function with closure:

import requests

def request_maker(url):
    def make_request(**kwargs):
        return requests.get(url.format_map(kwargs))
    return make_request

The code becomes more concise and achieves the same result. Take note that in the above code, we are able to pass in the arguments into the nested function with **kwargs (or *args).

python closure convert small class to closure

Replace text with case matching

When you use regular express to find and replace some text, you may realize if you are trying to match text in case insensitive mode, you will not able to replace the text with proper case. For instance:

import re

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'
re.sub("python", "java", paragraph, flags=re.I)

Output from above:

python closure replace with case

It indeed replaced all the occurrence of the “python”, but the case does not match with the original text. To solve this problem, let’s implement the replace function with closure:

def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

In the above code, the replace function has the access to the original text we intend to replace with, and when we detect the case of the matched text, we can convert the case of original text and return it back.

So in our original substitute function, let’s pass in a function replace_case(“java”) as the second argument. (You may refer to Python official doc in case you want to know what is the behavior when passing in function to re.sub)

re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

If we run the above again, you should be able to see the case has been retained during the replacement as per below:

python closure replace with case

Conclusion

In this article, we have discussed about the general reasons why Python closure is used and also demonstrated how it can be used in your code with 3 real-world examples. In fact, Python decorator is also a use case of closure, I will be discussing this topic in the next article.

 

Pyinstaller upxdir and icon options

In previous article, we have discussed about most of the commonly used options for PyInstaller library. There are two more very useful options but you may encounter some issues when you use them for the first time. In this article, we will discuss about the common issues for using PyInstaller –icon and –upxdir options.

Customize icon for your exe file with –icon

PyInstaller has the –icon option to specify your own icon when creating the executable file. If this option is not given, the exe files will be generated with default icon as per below.

pyinstaller logo

You can use –icon followed by image file name to let PyInstaller to use your own icon. You may see errors when you try to use a normal image format as icon, in this case you can convert your image file into .ico format and run the command again.

For demo purpose, I downloaded an icon from this website into my project folder to use it for my app. And with the below command, I shall be able to get new look for my exe file.

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --clean

Below is how it looks like when the new exe file generated:

Pyinstaller generate exe with icon

Sometimes, you may also find that the icon did not get changed after you rebuilt the executable file, but when checking the “General” tab in file properties, you are able to see the new icon displayed. This is due to the window icon cache, you may try to delete the cache files from the below directory and retry.

User\AppData\Local\Microsoft\Windows\Explorer\IconCacheToDelete

Or if you specify a new name for your exe file, you shall be able to see the new icon applied.

 

Reduce file size with PyInstaller –upx-dir option

When you used a lot of libraries or resource files, your executable file can grow very big and become difficult for distribution. In this case, you can use upx to compress your exe file.

You can download the upx executable file into your PC and copy the full path as the parameter value for –upx-dir option. E.g.:

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --upx-dir "c:\upx-3.96-win64" --clean

Sometimes you may find even there is no error when you build the executable file, there can be a runtime error such as the below, which showing that VCRUNTIME140.dll is either not designed to run on Windows or it contains an error.

pyinstaller-VCRUNTIME140.dll-error

This issue is due to PyInstaller modified the dll files during packing and compressing. The workaround is that you use the –upx-exclude to exclude the particular dll files. (No need to specify the path for the dll)

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --upx-dir "c:\upx-3.96-win64" --upx-exclude "VCRUNTIME140.dll" --clean

Conclusion

Beside the above issues we discussed, you may occasional encounter some other errors, you will need to check  both your Python and PyInstaller versions to see if is it some compatibility issues. And also not all the Python libraries are supported by PyInstaller, you will need to check this list to see if you have used any libraries not in supported by PyInstaller.