
Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will spend a lot of time wrangling data from various sources into a standard structure for further analysis. There are cases where you get the raw data in some sort of summary view and need to split one row of data into multiple rows based on certain conditions, so that you can do grouping and matching from different perspectives. In this article, we will be discussing a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not done so yet. Below is the pip command to install pandas:

pip install pandas

And let’s import the necessary modules and use this sample data for our demonstration. You can download it into your local folder, or just supply the URL directly to the pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:

[Screenshot: first 5 rows of the delivery record data]

Assume this is the data extracted from an eCommerce system where someone is running an online shop for footwear and apparel products, and the shop provides a free 7-day return for the items it sells. You can see that each row has the order information, when and who performed the delivery service, and, if the customer requested a return, when the item was returned and by which courier service provider. The data reflects the shop owner’s view, and you may find it difficult to analyse from the courier service providers’ perspective with the current format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let’s say we would like to split one row of data into 2 rows whenever a return happened, so that each row has the order info as well as the courier service info, and we can easily do some analysis such as calculating the return rate for each product, the courier service cost per month by courier company, etc.

The output format we would like to have is more transaction-based, so let’s format our date columns and rename the delivery-related columns, so that they won’t confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date": "Transaction Date",
                   "Delivery Courier": "Courier",
                   "Delivery Charges": "Charges"}, inplace=True)

And we add one more column, the transaction type, to indicate whether the record is for a delivery or a return. For now, we just assign “DELIVERY” to all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let’s create a filter that checks whether the return date is not empty:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:

[Screenshot: rows with return info selected by the filter]

To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate the current 1 row into 2 rows
  • Change the transaction type to “RETURN” for the second duplicated row
  • Copy values of the Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively

To duplicate these records, we use the data frame’s index.repeat() to repeat each of these indexes twice, and then use the loc function to get the data for the repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let’s save the duplicated row indexes into a variable, so that we can refer to them multiple times even after some data in the duplicated rows has changed. We use the data frame duplicated function to return the index of the duplicated rows. For this function, the keep=”first” argument will mark the 1st row as non-duplicate and the subsequent rows as duplicates, while keep=”last” will mark the last row as non-duplicate and the 1st row as duplicate.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

And next, we shall copy the return info into the Transaction Date, Courier, Charges fields for these return records. You can either select the rows based on the transaction type value, or continue to use idx_duplicate to identify the return records.

Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate,
    ["Return Date", "Return Courier", "Return Charges"]].to_numpy()

If you check the data now, you shall see for the return rows, the return info is already copied over:

[Screenshot: return rows with the return info copied into Transaction Date, Courier and Charges]

(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)
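
In short, when you assign one set of columns directly to another, pandas aligns the assignment on the column labels, so the values do not actually move. Below is a minimal sketch with a hypothetical two-column frame (not part of our delivery data) to illustrate the difference:

import pandas as pd

tmp = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# direct assignment aligns on the column labels, so the frame stays unchanged
tmp[["a", "b"]] = tmp[["b", "a"]]

# to_numpy() drops the labels, so the values are actually swapped this time
tmp[["a", "b"]] = tmp[["b", "a"]].to_numpy()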

Finally, we need to combine the original rows which only have delivery info with the above processed data. Let’s also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return-related columns are now redundant, we shall drop them to avoid confusion. Let’s use the below code to drop them by the “Return” keyword in the column labels:

new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)

(To understand how df.filter works, check this article of mine)

Once we have deleted the redundant columns, you shall see the final result in new_df as per below:

[Screenshot: final new_df with one row per delivery or return transaction]

So we have successfully transformed our data from the shop owner’s view to the courier companies’ view; each delivery and return record is now an individual row.

Conclusion

Data wrangling can sometimes be tough depending on what kind of source data you get. In this article, we have gone through a solution for splitting one row of data into multiple rows by using pandas index.repeat to duplicate the rows and the loc function to swap the values. There are other possible ways to handle this, so please do share your comments in case you have a better idea.


Why your lambda function does not work

Introduction

A lambda function in Python is designed to be a one-liner, throwaway function that does not even need a name, so it is also known as an anonymous function. Compared to normal Python functions, you do not need to write the def and return keywords for a lambda function, and it can be defined right at the place where you need it, so it makes your code more concise and look a bit special. In this article, we will be discussing some unexpected results you may have encountered when using lambda functions.

Basic usage of lambda

Let’s cover some basics of the lambda function before we dive into the problems we are going to solve in this article.

Below is the syntax to define lambda function:

lambda [arguments] : expression

As you can see, a lambda function can be defined with or without arguments, and take note that it only accepts a single expression, not any of the Python statements. The difference is that an expression can be evaluated into a value (or object), e.g. 2**2, but you cannot evaluate a statement like while(True): into a value. You can think of the lambda as having an implicit “return” keyword before the expression, so the expression must eventually be computed into a value.

And here are some basic usage of lambda function:

square = lambda x: x**2
print(square(4))
#Output: 16

cryptocurrencies = [('Bitcoin', 10948.52),
                    ('Ethereum', 381.41),
                    ('Tether', 1.00),
                    ('XRP', 0.249940),
                    ('Bitcoin Cash', 231.86),
                    ('Polkadot', 4.91),
                    ('Binance Coin', 27.02),
                    ('Chainlink', 10.47),
                    ('Litecoin', 48.20),
                    ('EOS', 2.69),
                    ('TRON', 0.027157),
                    ('Neo', 24.29),
                    ('Stellar', 0.077903),
                    ('Huobi Token', 4.91)]

top5_by_name = sorted(cryptocurrencies, key=lambda token: token[0].lower())[0:5]
print(top5_by_name)
#Output: [('Binance Coin', 27.02), ('Bitcoin', 10948.52), ('Bitcoin Cash', 231.86), ('Chainlink', 10.47), ('EOS', 2.69)]

lowest = min(cryptocurrencies, key=lambda token: token[1])
print(lowest)
#Output: ('TRON', 0.027157)

highest = max(cryptocurrencies, key=lambda token: token[1])
print(highest)
#Output: ('Bitcoin', 10948.52)

highest_in_local_currency = lambda exchange_rate: highest[1] * exchange_rate
highest_sgd = highest_in_local_currency(1.38)
print(highest_sgd)
#Output: 15108.9576

You can see that it is quite convenient when you just need a very short function to be supplied to another function which accepts an argument like key=keyfunc, such as sorted, list.sort, min, max, heapq.nlargest, heapq.nsmallest, itertools.groupby and so on. The common thing about these use cases is that you do not need very complicated logic (it can be written in one line) in the keyfunc and you will probably not reuse it anywhere else. So it is an ideal scenario for a lambda function.
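
As one more quick illustration (a small sketch reusing the cryptocurrencies list above), the same throwaway-lambda style works as the key function for heapq.nlargest:

import heapq

# top 3 tokens by price, with a lambda as the key function
top3_by_price = heapq.nlargest(3, cryptocurrencies, key=lambda token: token[1])
print(top3_by_price)
#Output: [('Bitcoin', 10948.52), ('Ethereum', 381.41), ('Bitcoin Cash', 231.86)]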

Now let’s expand further on our previous example. Assume the Bitcoin price fluctuated a lot on Mon & Tue although it still dominated the market, and you would like to convert the price to SGD in the below way:

highest = ('Bitcoin', 10948.52)
mon_highest = lambda exchange_rate: highest[1] * exchange_rate

highest = ('Bitcoin', 10000)
tue_highest = lambda exchange_rate: highest[1] * exchange_rate

print("Mon:", mon_highest(1.36))
print("Tue:", tue_highest(1.36))

You want to assign different values to the highest variable to calculate the price in another currency, but you would be surprised when checking the result:

[Screenshot: Mon and Tue both print the same converted value]

Instead of scratching your head to figure out why it does not work, let’s try another approach. I am going to create a list of converter functions, one per cryptocurrency pair, each calculating the price based on the exchange rate supplied. Later I loop through these functions and print out the converted values:

converters = [lambda exchange_rate: crypto[1] * exchange_rate for crypto in cryptocurrencies]
for c in converters:
    print(c(1.36))

I am expecting to see all the prices converted into local currency based on the exchange rate 1.36, but running the above code gives the below result:

[Screenshot: every converter prints the same value, computed from the last cryptocurrency in the list]

Same as the previous behaviour, only the last value was used in the lambda function. So why does it not work as intended when I use lambda in this way?

Runtime data binding

When people run into this issue, it is usually due to a fundamental misunderstanding of variable binding for Python functions. For Python functions, regardless of whether they are normal or lambda functions, the variables used inside are bound at runtime, not at definition time. So for our first example, the lambda function only looks up the value of the highest variable at the moment it is executed.

With this concept clarified, you shall be able to understand the behaviour of the output from the above two examples: only the latest values at execution time were used in the lambda functions.

To fix this issue, we just need a minor change to our original code: pass the variable into the function definition as the default value of an argument. For instance, below is the fix for the first example:

mon_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate
tue_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate

Below is the fix for the second example:

converters = [lambda exchange_rate, crypto = crypto: crypto[1] * exchange_rate for crypto in cryptocurrencies]

You may wonder why we must use lambda in the above two examples; indeed they do not necessarily require a lambda function. For the first example, since you need to call the function more than once, you may as well use a normal function instead, and be more careful whenever you rely on a variable from outside the function.
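
For instance, a plain-function version of the first example might look like the sketch below (the names here are made up for illustration), where the rate tuple is passed in explicitly instead of being picked up from the enclosing scope:

def convert_to_local(exchange_rate, token):
    return token[1] * exchange_rate

mon_price = ('Bitcoin', 10948.52)
tue_price = ('Bitcoin', 10000)
print("Mon:", convert_to_local(1.36, mon_price))
print("Tue:", convert_to_local(1.36, tue_price))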

And for the second example, it can simply be replaced with map as per below:

list(map(lambda crypto: crypto[1] * 1.36, cryptocurrencies))
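
Or, if you prefer, the lambda can be dropped entirely with an equivalent list comprehension:

converted = [crypto[1] * 1.36 for crypto in cryptocurrencies]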

Conclusion:

The lambda function provides convenience for writing tiny, one-time-use functions and makes your code concise. But it is also highly restricted by its single expression, as you cannot use multiple statements, exception handling, conditions, etc. Whatever a lambda does, you can definitely replace it with a normal function. The only thing that matters is readability, so you will need to evaluate whether it is the right scenario for a lambda, and bear in mind how the variables are bound.

 


Python cache – the must read tips for code performance

Introduction

Most of us may have experienced scenarios where we need to implement some computationally expensive logic such as recursive functions, or need to read from I/O or the network multiple times. These functions typically require more resources and longer CPU time, and can eventually cause performance issues if handled without care. For such cases, you shall always pay special attention to them once you have completed all the functional requirements, as the additional cost in resources and time may eventually lead to user experience issues. In this article, I will be sharing how we can make use of the cache mechanism (aka memoization) to improve the code performance.

Prerequisites:

To follow the examples below, you will need to have the requests package installed in your working environment; you may use the below pip command to install it:

pip install requests

With this ready, let’s dive into the problem we are going to solve today.

As I mentioned before, computationally expensive logic such as recursive functions or reading from I/O or the network usually has a significant impact on the runtime, and is always a target for optimization. Let me illustrate with a specific example: assume we need to call some external API to get the rates:

import requests
import json

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return data["args"]["dim"]
    return ''

This function needs to make a call through the network and return the result (for demo purposes, this API call just echoes back the input as the result). If you want to provide this as a service to everybody, there is a high chance that different people will inquire the rate with the same dimension value. In this case, you may wish to have the result stored somewhere after the first person has inquired, so that later you can just return the stored result for subsequent inquiries rather than making the API call again. This sort of caching mechanism should speed up your code.

Implement cache with global dictionary

For the above example, the most straightforward way to implement a cache is to store the arguments and results in a dictionary, and check this dictionary to see whether the key already exists before calling the external API each time. We can implement this logic in a separate function as per below:

cached_rate = {}
def cached_inquire(dim):
    if dim in cached_rate:
        print(f"cached value: {cached_rate[dim]}")
        return cached_rate[dim]
    cached_rate[dim]= inquire_rate_online(dim)
    print(f"result from online : {cached_rate[dim]}")
    return cached_rate[dim]

With this code, you can cache the previous key and result in the dictionary, so that subsequent calls will be returned directly from the dictionary lookup rather than an external API call. This should dramatically improve your code performance since reading from a dictionary is much faster than making an API call through the network.

You can quickly test it from Jupyter Notebook with the %time magic:

%time cached_inquire(1)

The first time you call it, you would see the time taken is over 1 second (depending on the network conditions):

result from online : 1
Wall time: 1.22 s

When calling it again with the same argument, we should expect our cached result to kick in:

%time cached_inquire(1)

You can see the total time taken dropped to 997 microseconds for this call, which is over 1200 times faster than previously:

cached value: 1
Wall time: 997 µs

With this additional global dictionary, we can see a big improvement in performance. But some people may have concerns about the additional memory used to hold these values in a dictionary, especially if the result is a huge object such as an image file or an array. Python has a separate module called weakref which addresses this problem.

Implement cache with weakref

Python introduced weakref to allow creating weak references to objects, so that garbage collection is free to destroy those objects whenever needed in order to reclaim the memory.

For demonstration purpose, let’s modify our earlier code to return a Rate class instance as the inquiry result:

class Rate():
    def __init__(self, dim, amount):
        self.dim = dim
        self.amount = amount
    def __str__(self):
        return f"{self.dim} , {self.amount}"

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

And instead of a normal Python dictionary, we will be using WeakValueDictionary to hold weak references to the returned objects. Below is the updated code:

import weakref

wkrf_cached_rate = weakref.WeakValueDictionary()
def wkrf_cached_inquire(dim):
    if dim in wkrf_cached_rate:
        print(f"cached value: {wkrf_cached_rate[dim]}")
        return wkrf_cached_rate[dim]

    result = inquire_rate_online(dim)
    print(f"result from online : {result}")
    wkrf_cached_rate[dim] = result
    return wkrf_cached_rate[dim]

With the above changes, if you run wkrf_cached_inquire twice, you shall see a significant improvement in performance:

[Screenshot: timing of two wkrf_cached_inquire calls, the second one returning from the cache]

And the dictionary does not hold the Rate instance itself, but rather a weak reference to it, so you do not have to worry about the extra memory used: garbage collection will reclaim the object when needed, and your dictionary will be automatically updated with that particular entry removed. Subsequently the program can call the external API again as if it were the first time.
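
Below is a minimal sketch, independent of the rate example, showing how an entry in a WeakValueDictionary disappears once no strong reference to the value remains (in CPython the removal happens as soon as the last strong reference is dropped, thanks to reference counting):

import weakref

class Payload:
    def __init__(self, value):
        self.value = value

cache = weakref.WeakValueDictionary()
obj = Payload(42)          # a strong reference keeps the entry alive
cache["key"] = obj
print("key" in cache)      # True

del obj                    # drop the only strong reference
print("key" in cache)      # False - the entry was removed automatically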

If you stop reading here, you will miss the most important part of this article, because what we have gone through above is good but not perfect, due to the below issues:

  • In the example, we only have 1 argument for the inquire_rate_online function; things get tedious if you have more arguments, since all of them have to be combined into the key for the dictionary. In that case, re-implementing the caching as a decorator function would probably be easier
  • Sometimes you do not really want to let garbage collection determine which values get cached longer than others; rather, you want your cache to follow certain logic, for instance keeping entries based on how recently they were called and discarding the least recently used first

If the least recently used (LRU) cache mechanism makes sense for your use case, you shall consider making use of the lru_cache decorator from the functools module, which will save you a lot of effort reinventing the wheel.

Cache with lru_cache

The lru_cache decorator accepts two arguments:

  • maxsize limits the size of the cache; when it is None, the cache can grow without bound
  • typed, when set to True, makes arguments of different types be cached separately, e.g. inquire_rate_online(1) and inquire_rate_online(1.0) will be cached as different entries
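
Here is a tiny sketch of the typed behaviour, using a hypothetical square function rather than our rate example:

from functools import lru_cache

@lru_cache(maxsize=None, typed=True)
def square(x):
    return x * x

square(2)                    # cached under the int key 2
square(2.0)                  # cached separately under the float key 2.0
print(square.cache_info())   # misses=2, currsize=2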

With the understanding of the lru_cache, let’s decorate our inquire_rate_online function to have the cache capability:

from functools import lru_cache

@lru_cache(maxsize=None)
def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

If we re-run our inquire_rate_online twice, you can see the same effect as previously in terms of the performance improvement:

[Screenshot: timing of two inquire_rate_online calls, the second one returning from the lru_cache]

And with this decorator, you can also see how the cache is being used. The hits value shows the number of calls that have been returned from cached results:

inquire_rate_online.cache_info()
#CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)

Or you can manually clear all the cache to reset the hits and misses to 0:

inquire_rate_online.cache_clear()

Limitations:

Let’s also talk about the limitations of the solutions we discussed above:

  • The cache mechanism works best for deterministic functions, meaning that given the same set of inputs, the function always returns the same set of results. You would not benefit much if you try to cache the result of a nondeterministic function, e.g.:

import random

def random_x(x):
    return x*random.randint(1,1000)

  • For keyword arguments, if you swap the position of the keywords, the two calls will be cached as separate entries (see the sketch after this list)
  • It only works for arguments that are of an immutable (hashable) data type.
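
To illustrate the keyword-argument point, below is a small sketch with a hypothetical convert function; the two calls pass the same values but in a different keyword order, so they end up as two separate cache entries:

from functools import lru_cache

@lru_cache(maxsize=None)
def convert(amount=0.0, rate=1.0):
    return amount * rate

convert(amount=100, rate=1.36)
convert(rate=1.36, amount=100)   # same values, different keyword order
print(convert.cache_info())      # misses=2, currsize=2 - two separate entries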

Conclusion

In this article, we have discussed different ways of creating a cache to improve code performance whenever you have computationally expensive operations or heavy I/O or network reads. Although the lru_cache decorator provides an easy and clean solution for creating a cache, it is still better to understand the underlying implementation of a cache before just taking it and using it.

We also discussed the limitations of these solutions that you may need to take note of before implementing them. Nevertheless, these methods will still help you improve your code performance in a lot of scenarios.


Pandas – filtering records in 20 ways

Filtering records is quite a common operation when you process or analyze data with pandas; a lot of the time you will have to apply filters so that you can concentrate on the data you want. Pandas is so powerful and flexible that it provides plenty of ways to filter records, whether you want to filter by columns to focus on a subset of the data or filter based on certain conditions. In this article, we will be discussing the various ways of filtering records in pandas.

Prerequisite:

You will need to install the pandas package in order to follow the below examples. Below is the command to install pandas with pip:

pip install pandas

And I will be using the sample data from here, so you may also want to download a copy into your local machine to try out the later examples.

With the below code, we can get a quick view of what the sample data looks like:

import pandas as pd
df = pd.read_excel(r"C:\Sample-Sales-Data.xlsx")
df.head(5)

Below is the output of the first 5 rows of data:

[Screenshot: first 5 rows of the sample sales data]

Let’s get started with our examples.

Filtering records by label or index

Filtering by column name/index is the most straightforward way to get a subset of the data frame in case you are only interested in a few columns of the data rather than the full data frame. The syntax is to use df[[column1,..columnN]] to filter only the specified columns. For instance, the below will get a subset of data with only 2 columns –  “Salesman” and “Item Desc”:

new_df = df[["Salesman","Item Desc"]]
new_df.head(5)

Output from the above would be:

[Screenshot: subset with only the Salesman and Item Desc columns]

If you are pretty sure which rows you are looking for, you can use the df.loc function, which allows you to specify both the row and column labels to filter the records. You can pass in a list of row labels and column labels like below:

df.loc[[0,4], ["Salesman", "Item Desc"]]

And you would see that row indexes 0 and 4, with the column labels “Salesman” and “Item Desc”, are selected as per the below output:

[Screenshot: rows 0 and 4 with the Salesman and Item Desc columns]

Or you can specify the label range with : to filter the records by a range:

df.loc[0:4, ["Salesman", "Item Desc"]]

You would see 5 rows (row index 0 to 4) selected as per below output:

[Screenshot: rows 0 to 4 with the Salesman and Item Desc columns]

Note that currently we are using the default row index, which is an integer starting from 0, so it happens to be the same as the position of the rows. Let’s say you have Salesman as your index; then you will need to filter based on the index label (the value of Salesman), e.g.:

df.set_index("Salesman", inplace=True)
df.loc["Sara", ["Item Desc", "Order Quantity"]]

With the above code, you will be able to select all the records with Salesman as “Sara”:

[Screenshot: records with Salesman “Sara”]

Filtering records by row/column position

Similarly, you can use the iloc function to achieve the same as what can be done with the loc function. The difference is that, for iloc, you shall pass in the integer positions for both rows and columns. E.g.:

df.iloc[[0,4,5,10],0:2]

The integers are the positions of the rows/columns from 0 to length-1 along the axis. So the below output will be generated when you run the above code:

[Screenshot: rows at positions 0, 4, 5 and 10 with the first two columns]

Filtering records by single condition

If you would like to filter the records based on a certain condition, for instance, the value of a particular column, you may have a few options to do the filtering based on what type of data you are dealing with. 

The eq and == work the same when you want to compare if the value matches:

flt_wine = df["Item Desc"].eq("White Wine")
df[flt_wine]

Or:

flt_wine = (df["Item Desc"] == "White Wine")
df[flt_wine]

Both will generate the below output:

[Screenshot: records with Item Desc equal to “White Wine”]

If you run flt_wine alone, you will see the output is a Series of True/False values with their index. This is how the filter works: the data frame keeps only the rows whose index maps to True and drops those with a False value.
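
Here is a tiny sketch with a made-up frame, just to make the mechanism visible:

import pandas as pd

toy = pd.DataFrame({"Item Desc": ["White Wine", "Whisky", "White Wine"]})
flt = toy["Item Desc"].eq("White Wine")
print(flt)       # a boolean Series: True, False, True
print(toy[flt])  # only the rows whose index maps to True are kept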

To get the data matching the negation of a certain condition, you can use ~ before your condition as per below:

df[~flt_wine]
#or
df[~(df["Item Desc"] == "White Wine")]
#or
df[(df["Item Desc"] != "White Wine")]

This will return the data with “Item Desc” other than “White Wine”.

And for the string data type, you can also use str.contains to match if the column contains a particular substring.

df[df["Item Desc"].str.contains("Wine")]

If you want to filter by matching multiple values, you can use isin with a list of values:

flt_wine = df["Item Desc"].isin(["White Wine", "Red Wine"])
df[flt_wine].head(5)

[Screenshot: records with Item Desc “White Wine” or “Red Wine”]

And you can also use the data frame query function to achieve the same. But column labels with spaces in between would cause errors when using this function, so you will need to reformat your column headers a bit, such as replacing spaces with underscores (refer to this article for more details).
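
A minimal sketch of that renaming step, assuming every header only needs its spaces replaced with underscores:

df.columns = [c.replace(" ", "_") for c in df.columns]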

With this change to the column headers, you shall be able to run the below code with the same result as the above isin method.

df1 = df.query("Item_Desc in ('White Wine','Red Wine')")
df1.head(5)

There are other Series functions you can use to filter your records, such as isnull, isna, notna, notnull, find, etc. You may want to check the pandas Series documentation.
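
For example, notna follows the same boolean-filter pattern; the one-liner below (a sketch) keeps only the rows where “Item Desc” is not missing:

df[df["Item Desc"].notna()]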

Filtering records by multiple conditions

When you need to filter by multiple conditions where multiple columns are involved, you can do something similar to what we have discussed above, using & or | to join the conditions.

For filtering records when both conditions are true:

flt_whisky_bulk_order = (df["Item Desc"] == "Whisky") & (df["Order Quantity"] >= 10)
df[flt_whisky_bulk_order]

The output would be :

[Screenshot: Whisky orders with Order Quantity >= 10]

For filtering the records when either condition is true:

flt_high_value_order = (df["Item Desc"] == "Whisky") | (df["Price Per Unit"] >= 50) 
df[flt_high_value_order]

The output would be :

[Screenshot: records that are Whisky or have Price Per Unit >= 50]

Similarly, the above can be done with data frame query function. Below is the example of AND condition:

df1 = df.query("Item_Desc == 'Whisky' and Order_Quantity >= 10") 
df1.head(5)

Below is the example of OR condition:

df1 = df.query("Item_Desc == 'Whisky' or Price_Per_Unit >= 10")
df1.head(5)

Filtering records by dataframe.filter

There is also another filter method which can be used to filter by the row or column label.

Below is an example that can be used to get all the columns with names starting with the “Order” keyword:

df.filter(regex="Order*", axis=1)

You shall see the below output:

[Screenshot: columns whose names start with “Order”]

Similarly, when applying it to the row labels, you can use axis=0:

df.set_index("Order Date", inplace=True)
df.filter(like="2020-06-21", axis=0)

[Screenshot: rows with Order Date 2020-06-21]

Take note that the data frame filter function only works on the row or column labels, not on any specific data series.

Conclusion

Filtering records is such a frequently used operation whenever you need to deal with data in pandas, and in this article we have discussed a lot of methods you can use under different scenarios. It may not cover everything you need, but hopefully it can solve 80% of your problems. There are other Series functions you may employ to filter your data, but you will probably find the syntax still falls under what we have summarized in this article.

If you are interested in other topics about pandas, you may refer to here.


Split or merge PDF files with 5 lines of Python code

There are many cases where you want to extract a particular page from a big PDF file or merge PDF files into one for various reasons. You can make use of some PDF editor tools to do this, but you may realize the split or merge functions are usually not available in the free version, or it is too tedious when there are just so many pages or files to be processed. In this article, I will be sharing a simple solution to split or merge multiple PDF files with a few lines of Python code.

Prerequisite

We will be using a Python library called PyPDF2, so you will need to install this package in your working environment. Below is an example with pip:

pip install PyPDF2

Let’s get started

The PyPDF2 package has 4 major classes: PdfFileWriter, PdfFileReader, PdfFileMerger and PageObject, which look quite self-explanatory from the class names themselves. If you need to do something more than split or merge PDF pages, you may want to check this document to find out more about what you can do with this library.

Split PDF file

When you want to extract a particular page from a PDF file and make it a separate PDF file, you can use PdfFileReader to read the original file, and then you will be able to get a particular page by its page number (page numbers start from 0). With PdfFileWriter, you can use the addPage function to add the PDF page into a new PDF object and save it.

Below is the sample code that extracts the first page of file1.pdf and splits it into a separate PDF file named first_page.pdf:

from PyPDF2 import PdfFileWriter, PdfFileReader
input_pdf = PdfFileReader("file1.pdf")
output = PdfFileWriter()
output.addPage(input_pdf.getPage(0))
with open("first_page.pdf", "wb") as output_stream:
    output.write(output_stream)

The input_pdf.getPage(0) call returns a PageObject, which allows you to modify some of the attributes related to the PDF page, such as rotating and scaling the page. You may want to read more about it here.
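
For instance, a small sketch (using PageObject methods from the same PyPDF2 API) that rotates and scales the page before adding it to the writer:

page = input_pdf.getPage(0)
page.rotateClockwise(90)   # rotate the page by 90 degrees clockwise
page.scaleBy(0.5)          # scale the page content down to 50%
output.addPage(page)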

Merge PDF files

To merge multiple PDF files into one file, you can use PdfFileMerger. Although you can also do this with PdfFileWriter, PdfFileMerger is probably more straightforward when you do not need to edit the pages before merging them.

Below is the sample code which uses the append function from PdfFileMerger to append multiple PDF files and write them into one PDF file named merged.pdf:

from PyPDF2 import PdfFileReader, PdfFileMerger
pdf_file1 = PdfFileReader("file1.pdf")
pdf_file2 = PdfFileReader("file2.pdf")
output = PdfFileMerger()
output.append(pdf_file1)
output.append(pdf_file2)

with open("merged.pdf", "wb") as output_stream:
    output.write(output_stream)

If you do not want to include all pages from your original file, you can specify a tuple with the starting and ending page numbers as the pages argument for the append function, so that only the pages specified will be added to the new PDF file.
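
For example, the line below (a sketch based on the same append function) would append only the first three pages of file1.pdf, since the pages argument takes a (start, stop) range where the stop page is excluded:

output.append(pdf_file1, pages=(0, 3))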

The append function will always add new pages at the end; in case you want to specify the position where the new pages should go, you shall use the merge function instead. It allows you to specify the position at which you want to add the new pages.
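
A small sketch of the merge function: the first argument is the position at which the new pages are inserted, so the line below would insert all pages of file2.pdf right after the first page already in the merger:

output.merge(1, pdf_file2)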

Conclusion

The PyPDF2 package is a very handy toolkit for editing PDF files. In this article, we have reviewed how we can make use of this library to split or merge PDF files with some sample code. You can modify this code to suit your needs in order to automate the task in case you have many files or pages to be processed. There is also a pdfcat script included in the project folder which allows you to split or merge PDF files by calling it from the command line. You may also want to take a look at it in case you just deal with one or two PDF files each time.

In case you are interested in other topics related to Python automation, you may check here. Thanks for reading.