Pandas – filtering records in 20 ways

Filtering records is a very common operation when you process or analyze data with pandas. Much of the time you will have to apply filters so that you can concentrate on the data you want. Pandas is powerful and flexible enough to provide plenty of ways to filter records, whether you want to filter by columns to focus on a subset of the data or filter based on certain conditions. In this article, we will discuss the various ways of filtering records in pandas.

Prerequisite:

You will need to install the pandas package in order to follow the examples below (reading .xlsx files with read_excel also needs an Excel engine such as openpyxl). Below is the command to install pandas with pip:

pip install pandas

And I will be using the sample data from here, so you may also want to download a copy to your local machine to try out the later examples.

With the code below, we can get a quick view of what the sample data looks like:

import pandas as pd
df = pd.read_excel(r"C:\Sample-Sales-Data.xlsx")
df.head(5)

Below is the output of the first 5 rows of data:

pandas filtering data

Let’s get started with our examples.

Filtering records by label or index

Filtering by column name/index is the most straightforward way to get a subset of the data frame in case you are only interested in a few columns of the data rather than the full data frame. The syntax is df[[column1, ..., columnN]], which keeps only the specified columns. For instance, the below will get a subset of data with only 2 columns, “Salesman” and “Item Desc”:

new_df = df[["Salesman","Item Desc"]]
new_df.head(5)

Output from the above would be:

pandas filtering data subset

If you already know which rows you are looking for, you can use the df.loc indexer, which allows you to specify both the row and column labels to filter the records. You can pass in a list of row labels and a list of column labels like below:

df.loc[[0,4], ["Salesman", "Item Desc"]]

And you would see that the rows with index 0 and 4 and the columns “Salesman” and “Item Desc” are selected as per the output below:

pandas filtering loc

Or you can specify the label range with : to filter the records by a range:

df.loc[0:4, ["Salesman", "Item Desc"]]

You would see 5 rows (row index 0 to 4) selected as per the output below, because a label range in loc includes both the start and the end label:

pandas filtering loc with label range

Note that currently we are using the default row index, which is an integer starting from 0, so it happens to be the same as the position of the rows. Let’s say you have Salesman as your index; then you will need to do the filtering based on the index label (the value of the Salesman), e.g.:

df.set_index("Salesman", inplace=True)
df.loc["Sara", ["Item Desc", "Order Quantity"]]

With the above code, you will be able to select all the records with Salesman as “Sara”:

pandas filtering loc with row label
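To make the difference between the default integer index and a real label index more concrete, here is a minimal sketch. The tiny data frame is a made-up stand-in for the sample file (only the column names are taken from it), and set_index is called without inplace=True so the original frame keeps its integer index.

```python
import pandas as pd

# Hypothetical miniature of the sales data; the values are invented.
df = pd.DataFrame({
    "Salesman": ["Sara", "Bob", "Sara", "Lee"],
    "Item Desc": ["White Wine", "Whisky", "Red Wine", "Whisky"],
    "Order Quantity": [5, 12, 8, 3],
})

# Default integer index: a label slice includes both end points,
# so 0:2 selects the three rows labelled 0, 1 and 2.
first_three = df.loc[0:2, ["Salesman", "Item Desc"]]

# set_index returns a new frame here, so df keeps its integer index.
by_salesman = df.set_index("Salesman")

# "Sara" appears twice as a label, so loc returns both matching rows.
sara_rows = by_salesman.loc["Sara", ["Item Desc", "Order Quantity"]]
print(sara_rows)
```

Working on a copy like this is handy when later steps still need the “Salesman” column as a regular column.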

Filtering records by row/column position

Similarly, you can use the iloc indexer to achieve the same as what can be done with loc. The difference is that, for iloc, you pass in integer positions for both rows and columns. E.g.:

df.iloc[[0,4,5,10],0:2]

The integers are the position of the row/column from 0 to length-1 for the axis. So the below output will be generated when you run the above code:

pandas filtering iloc function
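The sketch below puts loc and iloc side by side on a made-up frame whose index labels (10, 20, 30, 40) deliberately differ from the row positions, so it is easier to see which one is being used. The data and the index values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows with a non-default index so labels and positions differ.
df = pd.DataFrame(
    {
        "Salesman": ["Sara", "Bob", "Amy", "Lee"],
        "Item Desc": ["White Wine", "Whisky", "Red Wine", "Whisky"],
        "Order Quantity": [5, 12, 8, 3],
    },
    index=[10, 20, 30, 40],
)

# iloc: first and third rows, first two columns (the slice end is excluded).
by_position = df.iloc[[0, 2], 0:2]

# loc: the same cells, addressed by their labels instead.
by_label = df.loc[[10, 30], ["Salesman", "Item Desc"]]

print(by_position.equals(by_label))
```

Note that a position slice such as 0:2 stops before position 2, unlike a label slice in loc.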

Filtering records by single condition

If you would like to filter the records based on a certain condition, for instance, the value of a particular column, you may have a few options to do the filtering based on what type of data you are dealing with. 

The eq method and the == operator work the same way when you want to check whether the value matches:

flt_wine = df["Item Desc"].eq("White Wine")
df[flt_wine]

Or:

flt_wine = (df["Item Desc"] == "White Wine")
df[flt_wine]

Both will generate the below output:

pandas filtering equals condition

If you run flt_wine alone, you will see the output is a Series of True/False values with the same index as the data frame. This is how the filter works: when you pass this Series to the data frame, pandas keeps the rows marked True and drops the rows marked False.
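A small sketch of what the mask looks like on its own, using a few invented rows rather than the real sample file:

```python
import pandas as pd

# Invented rows; only the column names follow the sample data.
df = pd.DataFrame({
    "Item Desc": ["White Wine", "Whisky", "Red Wine", "White Wine"],
    "Order Quantity": [5, 12, 8, 3],
})

# The mask is a Series of True/False aligned on the row index.
flt_wine = df["Item Desc"].eq("White Wine")
print(flt_wine)

# Indexing with the mask keeps only the rows where it is True.
white_wine = df[flt_wine]
print(white_wine)
```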

To get the data with the negation of a certain condition, you can use ~ before your condition statement as per below:

df[~flt_wine]
#or
df[~(df["Item Desc"] == "White Wine")]
#or
df[(df["Item Desc"] != "White Wine")]

This will return the data with “Item Desc” other than “White Wine”.

And for the string data type, you can also use str.contains to check whether the column value contains a particular substring.

df[df["Item Desc"].str.contains("Wine")]
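Real data often has missing values and inconsistent letter case, and depending on the pandas version and column type a missing value can end up as a missing entry in the mask. As a hedged sketch on made-up rows, the case and na parameters of str.contains can take care of both:

```python
import pandas as pd

# Invented rows, including a missing description and a lower-case one.
df = pd.DataFrame({
    "Item Desc": ["White Wine", None, "Whisky", "red wine"],
})

# case=False ignores letter case; na=False treats missing values as
# "no match", so the result is a clean True/False mask.
flt_any_wine = df["Item Desc"].str.contains("wine", case=False, na=False)
wines = df[flt_any_wine]
print(wines)
```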

If you want to filter by matching multiple values, you can use isin with a list of values:

flt_wine = df["Item Desc"].isin(["White Wine", "Red Wine"])
df[flt_wine].head(5)

pandas filtering isin function

You can also use the data frame query function to achieve the same. A column label with spaces in it cannot be written directly in the query expression, so you will need to reformat your column headers a bit, such as replacing the spaces with underscores (refer to this article for more details).

With this change in the column headers, you shall be able to run the below code and get the same result as the above isin method.

df1 = df.query("Item_Desc in ('White Wine','Red Wine')")
df1.head(5)
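As an alternative to renaming, pandas 0.25 and later also accept column names quoted with backticks inside the query expression. The sketch below, on invented rows, shows both routes and compares them with the isin mask; the rename via a lambda is just one possible way to replace the spaces.

```python
import pandas as pd

# Invented rows; only the column names follow the sample data.
df = pd.DataFrame({
    "Item Desc": ["White Wine", "Whisky", "Red Wine", "Beer"],
    "Order Quantity": [5, 12, 8, 3],
})

# Option 1: quote the original column name with backticks.
quoted = df.query("`Item Desc` in ['White Wine', 'Red Wine']")

# Option 2: rename the columns first, then use plain identifiers.
renamed = df.rename(columns=lambda name: name.replace(" ", "_"))
plain = renamed.query("Item_Desc in ['White Wine', 'Red Wine']")

# Both select the same rows as the isin-based mask.
expected = df[df["Item Desc"].isin(["White Wine", "Red Wine"])]
print(quoted.equals(expected))
```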

There are other Series functions you can use to filter your records, such as isnull, isna, notna, notnull, str.find etc. You may want to check the pandas Series documentation.

Filtering records by multiple conditions

When you need to filter by multiple conditions where multiple columns are involved, you can do it in a similar way as discussed above, using & or | to join the conditions.

For filtering records when both conditions are true:

flt_whisky_bulk_order = (df["Item Desc"] == "Whisky") & (df["Order Quantity"] >= 10)
df[flt_whisky_bulk_order]

The output would be:

pandas filtering and condition

For filtering the records when either condition is true:

flt_high_value_order = (df["Item Desc"] == "Whisky") | (df["Price Per Unit"] >= 50) 
df[flt_high_value_order]

The output would be:

pandas filtering or condition
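The parentheses around each condition matter, because in Python & and | bind more tightly than comparison operators such as == and >=. One way to avoid forgetting them is to give each condition its own name first. The sketch below does that on a few invented rows (the item names and prices are assumptions):

```python
import pandas as pd

# Invented rows; only the column names follow the sample data.
df = pd.DataFrame({
    "Item Desc": ["Whisky", "Whisky", "Red Wine", "Champagne"],
    "Order Quantity": [12, 3, 20, 1],
    "Price Per Unit": [40, 40, 15, 80],
})

# Name each condition, then combine the masks.
is_whisky = df["Item Desc"] == "Whisky"
is_bulk = df["Order Quantity"] >= 10
is_pricey = df["Price Per Unit"] >= 50

whisky_bulk = df[is_whisky & is_bulk]          # both conditions true
whisky_or_pricey = df[is_whisky | is_pricey]   # either condition true
neither = df[~(is_whisky | is_pricey)]         # negation of the OR
```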

Similarly, the above can be done with the data frame query function. Below is an example of the AND condition:

df1 = df.query("Item_Desc == 'Whisky' and Order_Quantity >= 10") 
df1.head(5)

Below is an example of the OR condition:

df1 = df.query("Item_Desc == 'Whisky' or Price_Per_Unit >= 50")
df1.head(5)

Filtering records by dataframe.filter

There is also a filter method on the data frame which can be used to filter by the row or column labels.

Below is an example that gets all the columns whose name starts with the “Order” keyword (the ^ anchors the regular expression to the start of the label):

df.filter(regex="^Order", axis=1)

You shall see the below output:

pandas filtering dataframe filter

Similarly, when applying it to row labels, you can set axis=0:

df.set_index("Order Date", inplace=True)
df.filter(like="2020-06-21", axis=0)

pandas filtering dataframe filter 2

Take note that the data frame filter function only works on the row or column labels, not on the values of any specific data series.
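The filter method accepts one of three label selectors: items (exact labels), like (substring match) and regex. The regex is searched anywhere in the label, so it needs a ^ if the match has to start at the beginning. Here is a small sketch; the “Re-Order Flag” column is a hypothetical addition that is not in the sample data, included only to show the difference.

```python
import pandas as pd

# One invented row; "Re-Order Flag" is a hypothetical column.
df = pd.DataFrame(
    [["2020-06-21", 5, "N", "Sara"]],
    columns=["Order Date", "Order Quantity", "Re-Order Flag", "Salesman"],
)

anchored = df.filter(regex="^Order", axis=1)    # labels starting with "Order"
unanchored = df.filter(regex="Order", axis=1)   # labels containing "Order"
substring = df.filter(like="Quantity", axis=1)  # plain substring match
exact = df.filter(items=["Salesman", "Order Date"], axis=1)

print(list(anchored.columns))
print(list(unanchored.columns))
```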

Conclusion

Filtering records is such a frequently used operation whenever you deal with data in pandas, and in this article we have discussed a lot of methods you can use under different scenarios. It may not cover everything you need, but hopefully it can solve 80% of your problems. There are other Series functions you may employ to filter your data, but you will probably find that the syntax still follows what we have summarized in this article.

If you are interested in other topics about pandas, you may refer to here.

Split or merge PDF files with 5 lines of Python code

There are many cases where you want to extract a particular page from a big PDF file, or merge several PDF files into one. You can make use of some PDF editor tools to do this, but you may find that the split or merge functions are usually not available in the free version, or that it is too tedious when there are many pages or files to be processed. In this article, I will be sharing a simple solution to split or merge multiple PDF files with a few lines of Python code.

Prerequisite

We will be using a Python library called PyPDF2, so you will need to install this package in your working environment. Below is an example with pip:

pip install PyPDF2

Let’s get started

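The PyPDF2 package has 4 major classes, PdfFileWriter, PdfFileReader, PdfFileMerger and PageObject, which are quite self-explanatory from the class names themselves. (These are the class names used by PyPDF2 releases before 3.0; later releases renamed them to PdfWriter, PdfReader and PdfMerger, so the examples below assume an older release.) If you need to do something more than split or merge PDF pages, you may want to check this document to find out more about what you can do with this library.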

Split PDF file

When you want to extract a particular page from the PDF file and make it a separate PDF file, you can use PdfFileReader to read the original file, and then you will be able to get a particular page by its page number (page numbers start from 0). With the PdfFileWriter, you can use the addPage function to add the PDF page into a new PDF object and save it.

Below is the sample code that extracts the first page of file1.pdf and saves it as a separate PDF file named first_page.pdf:

from PyPDF2 import PdfFileWriter, PdfFileReader
input_pdf = PdfFileReader("file1.pdf")
output = PdfFileWriter()
output.addPage(input_pdf.getPage(0))
with open("first_page.pdf", "wb") as output_stream:
    output.write(output_stream)

The input_pdf.getPage(0) call returns a PageObject, which allows you to modify some of the attributes of the PDF page, such as rotating or scaling the page. You may want to understand more from here.

Merge PDF files

To merge multiple PDF files into one file, you can use PdfFileMerger. Although you can also do this with PdfFileWriter, PdfFileMerger is probably more straightforward when you do not need to edit the pages before merging them.

Below is the sample code which uses the append function of PdfFileMerger to append multiple PDF files and write them into one PDF file named merged.pdf:

from PyPDF2 import PdfFileReader, PdfFileMerger
pdf_file1 = PdfFileReader("file1.pdf")
pdf_file2 = PdfFileReader("file2.pdf")
output = PdfFileMerger()
output.append(pdf_file1)
output.append(pdf_file2)

with open("merged.pdf", "wb") as output_stream:
    output.write(output_stream)

If you do not want to include all pages from your original file, you can specify a tuple with the starting and ending page numbers as the pages argument of the append function, so that only the pages specified will be added to the new PDF file.

The append function always adds the new pages at the end. In case you want to specify the position where the pages go, you shall use the merge function, which takes the position at which the new pages are to be inserted.

Conclusion

The PyPDF2 package is a very handy toolkit for editing PDF files. In this article, we have reviewed how we can make use of this library to split or merge PDF files with some sample code. You can modify the code to suit your needs in order to automate the task in case you have many files or pages to be processed. There is also a pdfcat script included in the project folder which allows you to split or merge PDF files from the command line. You may want to take a look at it in case you only deal with one or two PDF files each time.

In case you are interested in other topics related to Python automation, you may check here. Thanks for reading.

Why we should use Python decorator

Introduction

The decorator is one of the most important features in Python, and you may have seen it in many places in Python code, for instance, functions decorated with @classmethod, @staticmethod, @property etc. By definition, a decorator is a function that extends the functionality of another function without explicitly modifying it. It makes the code shorter and at the same time improves readability. In this article, I will be sharing with you how to use Python decorators.

Basic Syntax

If you have read my article about Python closures, you may remember that Python allows you to pass a function into another function as an argument. For example, if we have the below functions:

add_log – to add log to inspect all the positional and keyword arguments of a function before actually calling it

send_email – to accept some positional and keyword arguments for sending out emails

def add_log(func):
    def log(*args, **kwargs):
        for arg in args:
            print(f"{func.__name__} - args: {arg}")
        for key, val in kwargs.items():
            print(f"{func.__name__} - {key}, {val}")
        return func(*args, **kwargs)
    return log

def send_email(subject, to, **kwargs):  
    #send email logic 
    print(f"email sent to {to} with subject {subject}.")

We can pass the send_email function to add_log as an argument, and then trigger the sending of the email through the returned function.

sender = add_log(send_email)
sender("hello", "contact@codeforests.com", attachment="debug.log", urgent_flag=True)

This code will generate the output as per below:

python decorator pass function as argument

You can see that the send_email function has been invoked successfully after all the arguments were printed out. This is exactly what a decorator does: extending the functionality of the send_email function without changing its original structure. When you call send_email directly again, you still see its original behavior without any change.

python decorator pass function as argument

Python decorator as a function

Before Python 2.4, the classmethod() and staticmethod() functions were used to decorate functions by passing in the decorated function as an argument. Later the @ symbol was introduced to make the code more concise and easier to read, especially when the functions are very long.

So let’s implement our own decorator with the @ syntax.

Assume we have the below decorator function and we want to check whether the user is in the whitelist before allowing them to access certain resources. We follow the Python convention of using wrapper as the name of the inner function (although you are free to use any name).

class PermissionDenied(Exception):
    pass

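def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]
    def wrapper(*args, **kwargs):
        # the user is expected as the first positional argument
        user = args[0]
        if user not in whitelist:
            raise PermissionDenied
        # pass the result of the decorated function back to the caller
        return func(*args, **kwargs)
    return wrapper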

Next, we decorate our function with permission_required as per below:

@permission_required
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())

When we call our function as per normal, we shall expect the decorator function to be executed first to check if user is in the whitelist.

read_file("John", r"C:\pwd.txt")

You can see the below output has been printed out:

python decorator read file output -1

If we pass in some user name not in the whitelist:

read_file("Johnny", r"C:\pwd.txt")

You would see the permission denied exception raised, which shows everything works as we expected.

python decorator read file permission denied

But if you are careful enough, you may find something strange when you check the below.

python decorator read file output -3

So it seems there is a flaw in this implementation although the functional requirement has been met. The name and signature of the function have been overwritten by those of the wrapper, and this may confuse other people when they want to use your function.
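The effect can be reproduced with a few lines. In this sketch, describe_access is a hypothetical stand-in for read_file that returns a message instead of opening a file, so that it can be run anywhere:

```python
class PermissionDenied(Exception):
    pass

def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]
    def wrapper(*args, **kwargs):
        user = args[0]
        if user not in whitelist:
            raise PermissionDenied
        return func(*args, **kwargs)
    return wrapper

@permission_required
def describe_access(user, file_path):
    """Return a short message instead of touching the file system."""
    return f"{user} may read {file_path}"

# The name and docstring now belong to the inner wrapper function.
print(describe_access.__name__)  # wrapper
print(describe_access.__doc__)   # None
```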

Use of the functools.wraps

To solve this problem, we need to introduce one more Python module, functools, where we can use the wraps function to restore the metadata of the original function.

Let’s update our decorator function again by adding @wraps(func) to the wrapper function:

from functools import wraps

def permission_required(func):
    ...
    @wraps(func)
    def wrapper(*args, **kwargs):
       ...
    return wrapper

Finally, when we check the function signature and name again, it shows the correct information now.

python decorator read file output -4

What happens is that @wraps(func) invokes the update_wrapper function, which copies the metadata of the original function (such as its name and docstring) onto the wrapper, so that you no longer see the wrapper’s own metadata. You may want to check the update_wrapper function in the functools module to further understand how the metadata is updated.

Besides decorating normal functions, decorators can also be used on the methods of a class; for instance, @staticmethod and @property are commonly seen in Python code decorating class methods.
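For reference, here is a complete, runnable version of the decorator with @wraps applied. As a simplification, the decorated function describe_access is a hypothetical stand-in that returns a message rather than reading a file:

```python
from functools import wraps

class PermissionDenied(Exception):
    pass

def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]

    @wraps(func)  # copy __name__, __doc__ etc. from func onto wrapper
    def wrapper(*args, **kwargs):
        user = args[0]
        if user not in whitelist:
            raise PermissionDenied
        return func(*args, **kwargs)
    return wrapper

@permission_required
def describe_access(user, file_path):
    """Return a short message instead of touching the file system."""
    return f"{user} may read {file_path}"

print(describe_access.__name__)              # describe_access
print(describe_access.__wrapped__.__name__)  # describe_access
```

The __wrapped__ attribute is added by update_wrapper and points back to the undecorated function, which can be useful for testing.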

Python decorator as a class

A decorator can also be implemented as a class in case you find your wrapper function has grown too big or is nested too deeply. To make this happen, the class takes the decorated function as the argument of __init__ and implements a __call__ function, so that the class instance becomes callable in place of the decorated function.

Below is the code that implements our earlier example as a class:

from functools import update_wrapper
class PermissionRequired:
    def __init__(self, func):
        self._whitelist = ["John", "Jane", "Joe"]
        update_wrapper(self, func)
        self._func = func
        
    def __call__(self, *args, **kwargs):  
        user = args[0]
        if user not in self._whitelist:
            raise PermissionDenied
        return self._func(*args, **kwargs)

Take note that we will need to call the update_wrapper function to manually update the metadata for our decorated function. And same as before, we can continue using @ with class name to decorate our function.

@PermissionRequired
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())
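A quick way to see what the class-based decorator produces is to inspect the decorated name afterwards. The sketch below repeats the class so that it runs on its own; describe_access is again a hypothetical stand-in that does not touch any file.

```python
from functools import update_wrapper

class PermissionDenied(Exception):
    pass

class PermissionRequired:
    def __init__(self, func):
        self._whitelist = ["John", "Jane", "Joe"]
        update_wrapper(self, func)
        self._func = func

    def __call__(self, *args, **kwargs):
        user = args[0]
        if user not in self._whitelist:
            raise PermissionDenied
        return self._func(*args, **kwargs)

@PermissionRequired
def describe_access(user, file_path):
    """Return a short message instead of touching the file system."""
    return f"{user} may read {file_path}"

# The name now refers to an instance of the class, which is callable
# and carries the metadata copied over by update_wrapper.
print(type(describe_access).__name__)  # PermissionRequired
print(describe_access.__name__)        # describe_access
print(describe_access("Joe", "a.txt"))
```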

Conclusion

In this article, we have reviewed why Python decorators were introduced and the basic syntax for implementing our own decorators. We also discussed decorators implemented as a function and as a class, with some examples. Hopefully this article helps you to enhance your understanding of Python decorators and guides you on how to use them in your project.

 

Master python closure with 3 real-world examples

Introduction

Python closure is a technique for binding a function with an environment, where the function gets access to all the variables defined in the enclosing scope. Closures typically appear in programming languages with first-class functions, which means functions can be passed as arguments, returned as values or assigned to variables.

This definition sounds confusing to Python beginners, and the examples found online are often not intuitive enough, since most of them illustrate the idea with a few print statements, so readers may not get the whole idea of why and how closures should be used. In this article, I will be using some real-world examples to explain how to use closures in your code.

Nested function in Python

To understand closure, we must first know that Python has nested functions, where one function can be defined inside another. For instance, the below inner_func is a nested function, and outer_func returns its nested function as the return value.

def outer_func():    
    print("starting outer func")
    def inner_func():
        pi = 3.1415926
        print(f"pi is : {pi}")
    return inner_func

When you invoke the outer_func, it returns the reference to the inner_func, and subsequently you can call the inner_func. Below is the output when you run in Jupyter Notebook:

python closure nested function example

Now that you have got a feeling for nested functions, let’s continue to explore how a nested function is related to closure. If we modify our previous function and move the pi variable into the outer function, surprisingly it generates the same result as before.

def outer_func():    
    print("starting outer func")
    #move pi variable definition to outer function
    pi = 3.1415926
    def inner_func():
        print(f"pi is : {pi}")
    return inner_func

You may wonder: the pi variable is defined in the outer function and is local to outer_func, so why is inner_func able to access it when it is not in the global scope? This is exactly where the closure happens: inner_func has full access to the environment (variables) in its enclosing scope. Since inner_func has no local variable called pi, it refers to the pi of the enclosing scope as a nonlocal (free) variable.
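If you are curious where the captured value is kept, the function object itself can be inspected. This is a small sketch using a variant of the example above that returns the text instead of printing it:

```python
def outer_func():
    pi = 3.1415926
    def inner_func():
        return f"pi is : {pi}"
    return inner_func

func = outer_func()

# Names that inner_func takes from the enclosing scope.
print(func.__code__.co_freevars)          # ('pi',)
# The captured value lives in a cell attached to the function object.
print(func.__closure__[0].cell_contents)  # 3.1415926
print(func())                             # pi is : 3.1415926
```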

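If you want to modify the value of pi inside inner_func, you will have to explicitly declare “nonlocal pi” before you assign to it. A float is an immutable data type, so it cannot be changed in place, and without the declaration the assignment would create a new local variable named pi inside inner_func instead of rebinding the one in outer_func.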
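The following sketch shows the effect of the nonlocal statement with a running total instead of pi. The function names are made up for this illustration:

```python
def make_accumulator():
    total = 0
    def add(amount):
        nonlocal total           # rebind the variable of make_accumulator
        total = total + amount
        return total
    return add

def make_broken_accumulator():
    total = 0
    def add(amount):
        # Without nonlocal, the assignment makes total a local variable,
        # so reading it on the right-hand side fails.
        total = total + amount
        return total
    return add

add = make_accumulator()
add(5)
running_total = add(10)
print(running_total)  # 15

try:
    make_broken_accumulator()(5)
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name)  # UnboundLocalError
```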

With the above understanding, now let’s walk through some real-world examples to see how we can use closure in our code.

Hide data with Python closure

Let’s say we want to implement a counter to record how many times a word has been repeated. The first thing you may want to do is to define a dictionary in the global scope, and then create a function that adds the word as a key into this dictionary and updates the number of times it has been repeated. Below is the sample code:

counter = {}

def count_word(word):    
    global counter
    counter[word] = counter.get(word, 0) + 1
    return counter[word]

To make it explicit that the count_word function updates the “counter” defined in the global scope, and not a variable accidentally defined with the same name in the local scope (within this function), we put the global keyword in front of it. Strictly speaking, the global statement is only required when the function assigns to the name itself; updating the dictionary in place, as done here, would also work without it.

Sample output:

python closure word counter sample output

The above code works as expected, but there are two potential issues: Firstly, the global variable is accessible to any of the other functions and you cannot guarantee your data won’t be modified by others. Secondly, the global variable exists in the memory as long as the program is still running, so you may not want to create so many global variables if not necessary.

To address these two issues, let’s re-implement it with closure:

def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

If we run it from Jupyter Notebook, you will see the below output:

python closure word counter example output

With this implementation, the counter dictionary is hidden from public access and the functionality remains the same. (You may notice it works even after the word_counter function is deleted.)
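Since the notebook output is not reproduced here, below is a short sketch of how the closure-based counter behaves, including a second independent counter and the deletion of the word_counter name. The variable names are chosen for this illustration only.

```python
def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

count = word_counter()
other_count = word_counter()        # a second, independent counter

first = count("hello")              # 1
second = count("hello")             # 2
other_first = other_count("hello")  # 1, it has its own dictionary

del word_counter                    # the factory name is gone...
third = count("hello")              # ...but the closure keeps its data: 3

print(first, second, third, other_first)
```

Each call to word_counter creates a fresh dictionary, which is why the two counters do not interfere with each other.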

Convert small class to function with Python closure

Occasionally in your project, you may want to implement a small utility class to do some simple task. Let’s take a look at the below example:

import requests

class RequestMaker:
    def __init__(self, base_url):
        self.url = base_url
    def request(self, **kwargs):
        return requests.get(self.url.format_map(kwargs))

You can see the below output when you call the request method from an instance of RequestMaker:

python closure small class example

As you’ve already seen in the word counter example, a closure can also hold data for later use, so the above class can be converted into a function with a closure:

import requests

def request_maker(url):
    def make_request(**kwargs):
        return requests.get(url.format_map(kwargs))
    return make_request

The code becomes more concise and achieves the same result. Take note that in the above code, we are able to pass in the arguments into the nested function with **kwargs (or *args).

python closure convert small class to closure

Replace text with case matching

When you use regular expressions to find and replace some text, you may realize that if you match the text in case-insensitive mode, you are not able to replace it with text in the proper case. For instance:

import re

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'
re.sub("python", "java", paragraph, flags=re.I)

Output from above:

python closure replace with case

It indeed replaced all the occurrences of “python”, but the case does not match the original text. To solve this problem, let’s implement the replace function with a closure:

def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

In the above code, the replace function has access to the word we intend to use as the replacement, and once we detect the case of the matched text, we can convert the replacement word to the same case and return it.

So in our original substitute call, let’s pass in the function returned by replace_case(“java”) as the second argument. (You may refer to the official Python documentation in case you want to know how re.sub behaves when a function is passed in.)

re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

If we run the above again, you should be able to see the case has been retained during the replacement as per below:

python closure replace with case
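Putting the pieces together, the following self-contained sketch runs both the plain replacement and the closure-based one on the same paragraph, so the two results can be compared directly:

```python
import re

def replace_case(word):
    def replace(m):
        # m is the match object passed in by re.sub for each occurrence
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'

plain = re.sub("python", "java", paragraph, flags=re.IGNORECASE)
matched = re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

print(plain)
# To start java programming, you need to install java and configure java env.
print(matched)
# To start Java programming, you need to install java and configure JAVA env.
```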

Conclusion

In this article, we have discussed the general reasons why Python closures are used and also demonstrated how they can be used in your code with 3 real-world examples. In fact, the Python decorator is also a use case of closure; I will be discussing this topic in the next article.

 

Pyinstaller upxdir and icon options

In the previous article, we discussed most of the commonly used options for the PyInstaller library. There are two more very useful options, but you may encounter some issues when you use them for the first time. In this article, we will discuss the common issues when using the PyInstaller --icon and --upx-dir options.

Customize icon for your exe file with --icon

PyInstaller has the --icon option to specify your own icon when creating the executable file. If this option is not given, the exe file will be generated with the default icon as per below.

pyinstaller logo

You can use --icon followed by the image file name to let PyInstaller use your own icon. You may see errors when you try to use a normal image format as the icon; in this case you can convert your image file into .ico format and run the command again.

For demo purposes, I downloaded an icon from this website into my project folder to use it for my app. And with the below command, I shall be able to get a new look for my exe file.

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --clean

Below is how it looks when the new exe file is generated:

Pyinstaller generate exe with icon

Sometimes, you may also find that the icon did not change after you rebuilt the executable file, but when checking the “General” tab in the file properties, you are able to see the new icon displayed. This is due to the Windows icon cache; you may try to delete the cache files from the below directory and retry.

User\AppData\Local\Microsoft\Windows\Explorer\IconCacheToDelete

Or if you specify a new name for your exe file, you shall be able to see the new icon applied.

 

Reduce file size with PyInstaller --upx-dir option

When you use a lot of libraries or resource files, your executable file can grow very big and become difficult to distribute. In this case, you can use upx to compress your exe file.

You can download the upx executable file to your PC and pass the full path of its folder as the value of the --upx-dir option. E.g.:

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --upx-dir "c:\upx-3.96-win64" --clean

Sometimes you may find that even though there is no error when you build the executable file, there is a runtime error such as the below, which says that VCRUNTIME140.dll is either not designed to run on Windows or it contains an error.

pyinstaller-VCRUNTIME140.dll-error

This issue is due to the dll files being modified during packing and compressing. The workaround is to use the --upx-exclude option to exclude the particular dll files from compression. (There is no need to specify the path of the dll.)

pyinstaller --onefile hello.py --name "SuperHero" --add-data "test.config;." --icon "superhero.icon" --upx-dir "c:\upx-3.96-win64" --upx-exclude "VCRUNTIME140.dll" --clean

Conclusion

Besides the issues discussed above, you may occasionally encounter some other errors, and you will need to check both your Python and PyInstaller versions to see whether it is a compatibility issue. Also, not all Python libraries are supported by PyInstaller, so you will need to check this list to see if you have used any libraries that are not supported by PyInstaller.