There are many cases where you want to extract a particular page from a big PDF file, or merge several PDF files into one. You can use a PDF editor to do this, but the split and merge functions are often not available in the free version, and the work becomes tedious when there are many pages or files to process. In this article, I will share a simple solution to split or merge PDF files with a few lines of Python code.
Prerequisite
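We will be using a Python library called PyPDF2, so you will need to install this package in your working environment. The class names used in this article (PdfFileReader, PdfFileWriter and PdfFileMerger) were removed in PyPDF2 3.0 in favor of PdfReader, PdfWriter and PdfMerger, so the example below pins an earlier version with pip:
pip install "PyPDF2<3.0"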
Let’s get started
The PyPDF2 package has 4 major classes: PdfFileWriter, PdfFileReader, PdfFileMerger and PageObject, whose names are fairly self-explanatory. If you need to do more than split or merge PDF pages, you may want to check the PyPDF2 documentation to find out what else the library can do.
Split PDF file
When you want to extract a particular page from a PDF file and save it as a separate PDF file, you can use PdfFileReader to read the original file and then get the page by its page number (page numbers start from 0). With PdfFileWriter, you can use the addPage function to add that page to a new PDF object and save it.
Below is sample code that extracts the first page of file1.pdf and saves it as a separate PDF file named first_page.pdf:
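from PyPDF2 import PdfFileWriter, PdfFileReader

# Read the original file
input_pdf = PdfFileReader("file1.pdf")

# Copy the first page (page numbers start from 0) into a new PDF object
output = PdfFileWriter()
output.addPage(input_pdf.getPage(0))

# Save the new PDF object to disk
with open("first_page.pdf", "wb") as output_stream:
    output.write(output_stream)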
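When there are many pages to extract, the same pattern can be wrapped in a loop. The sketch below is one possible way to do it, assuming PyPDF2 earlier than 3.0; the function names split_all_pages and page_filename, and the "_page_N" naming scheme, are made up for illustration and are not part of the library.

```python
from pathlib import Path


def page_filename(source_path, page_index):
    """Build an output name such as 'file1_page_1.pdf' (hypothetical naming scheme)."""
    return f"{Path(source_path).stem}_page_{page_index + 1}.pdf"


def split_all_pages(source_path, output_dir="."):
    """Write every page of source_path to its own single-page PDF file."""
    # Imported inside the function so page_filename() can be used without PyPDF2.
    # Assumes PyPDF2 < 3.0, where these class names are still available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(str(source_path))
    written = []
    for index in range(reader.getNumPages()):
        # A fresh writer per page keeps each output file to a single page
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(index))
        target = Path(output_dir) / page_filename(source_path, index)
        with open(target, "wb") as output_stream:
            writer.write(output_stream)
        written.append(target)
    return written
```

Calling split_all_pages("file1.pdf") would then write file1_page_1.pdf, file1_page_2.pdf and so on into the current folder.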
The input_pdf.getPage(0) call returns a PageObject, which allows you to modify some attributes of the PDF page, for example to rotate or scale it. The PageObject section of the PyPDF2 documentation describes these in more detail.
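As a small illustration of editing a page before saving it, the sketch below rotates the first page with the rotateClockwise method of PageObject. It assumes PyPDF2 earlier than 3.0; the function name rotate_first_page and its parameters are hypothetical.

```python
def rotate_first_page(source_path, output_path, angle=90):
    """Save the first page of source_path to output_path, rotated clockwise."""
    # Pages can only be rotated in 90 degree steps, so reject other angles early
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")

    # Assumes PyPDF2 < 3.0, where these class and method names are available
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(source_path)
    page = reader.getPage(0)      # PageObject for the first page
    page.rotateClockwise(angle)   # modifies the page object in place

    writer = PdfFileWriter()
    writer.addPage(page)
    with open(output_path, "wb") as output_stream:
        writer.write(output_stream)
```

For example, rotate_first_page("file1.pdf", "first_page_rotated.pdf") would write the first page turned by 90 degrees.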
Merge PDF files
To merge multiple PDF files into one file, you can use PdfFileMerger. You can also do this with PdfFileWriter, but PdfFileMerger is probably more straightforward when you do not need to edit the pages before merging them.
Below is sample code that uses the append function of PdfFileMerger to append multiple PDF files and write them into one PDF file named merged.pdf:
from PyPDF2 import PdfFileReader, PdfFileMerger

# Read the files to be merged
pdf_file1 = PdfFileReader("file1.pdf")
pdf_file2 = PdfFileReader("file2.pdf")

# Append them one after another
output = PdfFileMerger()
output.append(pdf_file1)
output.append(pdf_file2)

# Write the merged result to disk
with open("merged.pdf", "wb") as output_stream:
    output.write(output_stream)
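If you do not want to include all pages from your original file, you can pass a tuple as the pages argument of the append function, so that only the specified pages are added to the new PDF file. The tuple has the form (start, stop), with page numbers starting from 0 and the stop page itself excluded; for example, pages=(0, 2) adds only the first two pages.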
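Because the tuple is 0-based and excludes the stop page, it is easy to be off by one. The sketch below shows one way to convert human-friendly page numbers first, assuming PyPDF2 earlier than 3.0; the helper names to_page_range and merge_selected are made up for illustration.

```python
def to_page_range(first_page, last_page):
    """Convert 1-based inclusive page numbers to a 0-based (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return (first_page - 1, last_page)


def merge_selected(first_path, second_path, output_path, first_page, last_page):
    """Append only some pages of first_path, then all pages of second_path."""
    # Assumes PyPDF2 < 3.0, where PdfFileMerger is still available
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    # Only pages first_page..last_page (1-based, inclusive) of the first file
    merger.append(first_path, pages=to_page_range(first_page, last_page))
    # Every page of the second file
    merger.append(second_path)
    with open(output_path, "wb") as output_stream:
        merger.write(output_stream)
    merger.close()
```

For example, merge_selected("file1.pdf", "file2.pdf", "merged.pdf", 1, 3) would keep pages 1 to 3 of file1.pdf, followed by the whole of file2.pdf.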
The append function always adds the new pages at the end. If you want to control where the pages go, use the merge function instead, which takes the position at which the new pages should be inserted.
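A minimal sketch of the merge function is shown below, assuming PyPDF2 earlier than 3.0. The function name insert_pdf is hypothetical, and my understanding is that position counts pages from 0 in the document built so far, so position=1 should place the new pages right after the first page; it is worth checking the result on your own files.

```python
def insert_pdf(base_path, insert_path, output_path, position):
    """Insert all pages of insert_path into base_path at the given position."""
    if position < 0:
        raise ValueError("position must be zero or greater")

    # Assumes PyPDF2 < 3.0, where PdfFileMerger is still available
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    merger.append(base_path)             # start with the base document
    merger.merge(position, insert_path)  # then insert the other file into it
    with open(output_path, "wb") as output_stream:
        merger.write(output_stream)
    merger.close()
```

For example, insert_pdf("file1.pdf", "file2.pdf", "merged.pdf", 1) would put the pages of file2.pdf after the first page of file1.pdf.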
Conclusion
PyPDF2 is a very handy toolkit for editing PDF files. In this article, we have reviewed how to use this library to split or merge PDF files, with some sample code. You can modify the code to suit your needs and automate the task when you have many files or pages to process. There is also a pdfcat script included in the project, which allows you to split or merge PDF files from the command line. It is worth a look if you only deal with one or two PDF files at a time.
In case you are interested in other topics related to Python automation, you may check here. Thanks for reading.