How to Work With PDF Documents Using Python

I really admire Portable Document Format (PDF) files. They are immensely popular because you get exactly the same content and layout regardless of the operating system, reading device, or software you use.

Anyone who has worked with plain text files in Python might expect PDF files to be just as easy to handle, but things are a bit different here. PDF documents are binary files and are more complex than plain text files, especially since they contain different font types, colors, and so on.
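To make the "binary file" point concrete, here is a small sketch that uses only the standard library. It relies on the fact that a PDF file normally begins with the bytes %PDF-; the function name has_pdf_header and the in-memory sample data are assumptions made for illustration.

```python
import io

# A PDF file normally starts with these five bytes.
PDF_MAGIC = b"%PDF-"


def has_pdf_header(stream):
    """Return True if a binary stream starts with the PDF magic bytes."""
    return stream.read(len(PDF_MAGIC)) == PDF_MAGIC


# In-memory stand-ins, so the sketch runs without a real file on disk.
fake_pdf = io.BytesIO(b"%PDF-1.4\n% the rest of the document would follow")
plain_text = io.BytesIO(b"Once upon a time")

print(has_pdf_header(fake_pdf))    # True
print(has_pdf_header(plain_text))  # False
```

With a real file you would pass open(path, "rb") instead; the "rb" mode matters because the content is bytes rather than text.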

However, that doesn’t mean it is hard to work with PDF documents using Python. It is rather simple once you use an external module.

Initial Set Up

As I mentioned above, using an external module is the key. The module we will be using in this tutorial is PyPDF2. Since it is an external module, the first step is to install it. For that, we will be using pip, which Wikipedia describes as:

A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).

You can follow the steps mentioned in the official guide for installing pip. There is a good chance that pip was installed automatically for you if you downloaded Python from python.org.

PyPDF2 can now be installed by typing the following command in your terminal:

    pip install PyPDF2

Great! You now have PyPDF2 installed, and you’re ready to start playing with PDF documents.

PyPDF2 Basics

Before we dig deeper, I would like to give you a brief overview of the PyPDF2 module. This is a completely free and open source library that can do a lot of things with PDF documents. You can use the library not only for reading from a PDF file but also for writing, splitting and merging.

A lot has changed in the library since its older versions. For this tutorial, I am going to use version 2.11.1 of the library.

The PyPDF2 library doesn’t require any dependencies for its regular features. However, you will need some optional dependencies to work with cryptography and images in PDF files. All of them can be installed automatically with the command:

    pip install PyPDF2[full]

Alternatively, if you know that you will only need to encrypt and decrypt PDF documents with AES (Advanced Encryption Standard), you can install just the cryptography-related dependencies:

    pip install PyPDF2[crypto]

I should also point out that RC4 encryption is supported with the standalone installation of PyPDF2 without any dependencies.

Reading a PDF Document

The sample file we will be working with in this tutorial is a PDF version of Beauty and the Beast hosted on Project Gutenberg. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.

To get set up for extracting information from the file, we first import the PyPDF2 module so we can use it in our program. We then use the built-in open() function to open our PDF file in binary mode.

Once the file is open, we create a PdfReader object by passing our book to the PdfReader class from the module. We are now ready to handle a variety of reading operations on our book.
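The steps above can be sketched roughly as follows. This is not the original listing from the article: the helper name open_pdf is an assumption, and PyPDF2 is imported inside the function so that the library is only needed once a file is actually opened.

```python
from contextlib import contextmanager


@contextmanager
def open_pdf(pdf_path):
    """Yield a PdfReader for the file at `pdf_path` (hypothetical helper)."""
    # Third-party dependency (pip install PyPDF2), imported on first use.
    from PyPDF2 import PdfReader

    # "rb" opens the file for reading in binary mode, which PDFs require.
    with open(pdf_path, "rb") as book:
        # The reader keeps using `book`, so do your work inside the block.
        yield PdfReader(book)
```

Assuming the book was saved under a name such as beauty-and-the-beast.pdf, you would write "with open_pdf("beauty-and-the-beast.pdf") as reader:" and run the reading operations inside that block, so the file is closed automatically afterwards.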

More Operations on PDF Documents

After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.

Number of Pages

The number of pages in a PDF document is accessible through a read-only property of the PdfReader class called pages. This property gives us a list of Page objects, each representing an individual page of the PDF file.

You can easily get the number of pages by using the built-in len() function and passing the list of Page objects as a parameter.

In this case, the returned value was 48, which is equal to the number of pages in our document.
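A minimal sketch of that page count, assuming the PyPDF2 2.x API described above. The function names count_pages and count_pages_in_file are made up for this example, and the path you pass in is whatever name you saved the book under.

```python
def count_pages(reader):
    """Return how many pages a PdfReader exposes."""
    # `pages` behaves like a list of Page objects, so len() works on it.
    return len(reader.pages)


def count_pages_in_file(pdf_path):
    """Open the PDF at `pdf_path` and return its page count."""
    # Third-party dependency (pip install PyPDF2), imported on first use.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as book:
        return count_pages(PdfReader(book))
```

For the sample book used in this tutorial, count_pages_in_file() should return the 48 mentioned above.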

Directly Accessing a Page Number

We have seen in the previous section that the pages property of the PdfReader class returns a list of Page objects. You can directly access any page from the list by specifying its index. This is ordinary Python list indexing: in a list of languages, for example, the second item sits at index 1 because indexing starts at 0.
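Here is a plain-Python sketch of that list lookup. The language names are placeholders chosen for illustration, not taken from the original article.

```python
languages = ["Python", "Java", "Ruby", "Go"]

# Indexing starts at 0, so index 1 is the second item.
second = languages[1]
print(second)  # Java
```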

Directly accessing a page from the PDF document works the same way: indexing pages with 1 gives you the second page of the document.
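The same idea applied to a PDF might look like the sketch below. The helper page_at is hypothetical; it takes a human-friendly 1-based page number and translates it to the zero-based index that pages expects. The reader argument is assumed to be a PdfReader created as described in the Reading a PDF Document section.

```python
def page_at(reader, page_number):
    """Return the Page object for a 1-based `page_number`."""
    total = len(reader.pages)
    if page_number < 1 or page_number > total:
        raise IndexError(f"page {page_number} is outside 1..{total}")
    # reader.pages is zero-indexed, so page 2 lives at index 1.
    return reader.pages[page_number - 1]
```

If you prefer to work with indexes directly, reader.pages[1] does the same job as page_at(reader, 2).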

Now that we have learned how to access a Page object based on the page number, let’s see how to do the reverse and get the page number from a Page object. The PdfReader class has a very handy method called get_page_number() for this. All you need to do is pass the Page object as a parameter to get_page_number().

When we ask for the page number of the last page in our PDF document, it comes out to 47 since the indexing starts at 0. A value of 47 actually means page 48.

We can also try the same method with a page selected at random between index 15 and 35. The output was 19 in one particular run, but it will vary with every execution.
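Both lookups can be sketched as below, assuming reader is a PdfReader from PyPDF2 2.x, whose get_page_number() method takes a Page object. The two function names are invented for this example.

```python
import random


def last_page_number(reader):
    """Return the zero-based page number of the document's last page."""
    last_page = reader.pages[len(reader.pages) - 1]
    return reader.get_page_number(last_page)


def random_page_number(reader, low=15, high=35):
    """Pick a page with an index in low..high and return its page number."""
    # random.randint() includes both end points.
    page = reader.pages[random.randint(low, high)]
    return reader.get_page_number(page)
```

For a 48-page document, last_page_number() gives 47, and random_page_number() gives a different value between 15 and 35 on most runs.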

Page Mode and Page Layout

The library also allows you to easily access the page mode and page layout information for your PDF document. You simply need to use the properties called page_mode and page_layout to do so.

All the valid page mode values are shown in the table below:

Value             Description
/UseNone          Do not show outlines or thumbnails panels
/UseOutlines      Show outlines (aka bookmarks) panel
/UseThumbs        Show page thumbnails panel
/FullScreen       Fullscreen view
/UseOC            Show Optional Content Group (OCG) panel
/UseAttachments   Show attachments panel

The table below shows all the valid page layout values:

Value             Description
/NoLayout         Layout explicitly not specified
/SinglePage       Show one page at a time
/OneColumn        Show one column at a time
/TwoColumnLeft    Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight   Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft      Show two pages at a time, odd-numbered pages on the left
/TwoPageRight     Show two pages at a time, odd-numbered pages on the right

Reading both properties for our PDF document returns None, which means that neither the page mode nor the page layout is specified.
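A short sketch of reading both properties, assuming reader is a PdfReader from PyPDF2 2.x. The helper names and the "not specified" wording are assumptions made for this example.

```python
def viewer_settings(reader):
    """Collect the page mode and page layout of a PdfReader in a dict."""
    return {
        "page_mode": reader.page_mode,      # e.g. "/UseOutlines" or None
        "page_layout": reader.page_layout,  # e.g. "/SinglePage" or None
    }


def describe(value):
    """Turn a missing setting into a readable label."""
    return "not specified" if value is None else str(value)


def print_viewer_settings(reader):
    """Print one line per setting."""
    for name, value in viewer_settings(reader).items():
        print(f"{name}: {describe(value)}")
```

For the sample book, both lines would read "not specified", matching the None values reported above.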

Extract Metadata

The PdfReader class also has a property called metadata that returns the document information dictionary for the PDF file that you are reading. This metadata can contain information such as the author name, title of the document, creation date, and producer, and each of these can be read from our own PDF document.

Please keep in mind that some PDF files could have all of these values set to None.
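One way to collect those fields is sketched below, assuming reader is a PdfReader from PyPDF2 2.x. The author, title, and producer attributes belong to the metadata object. The creation date is read here through the raw /CreationDate key of the information dictionary, which is a choice made for this sketch rather than the article's original code. The function name summarize_metadata is hypothetical.

```python
def summarize_metadata(reader):
    """Return selected document information fields as a plain dict."""
    info = reader.metadata
    if info is None:
        # Some PDF files carry no information dictionary at all.
        return {}
    return {
        "author": info.author,
        "title": info.title,
        "producer": info.producer,
        # Raw dictionary lookup; the value is a PDF date string if present.
        "created": info.get("/CreationDate"),
    }
```

Any of the returned values can be None, so check before formatting or printing them.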

Extract Text

We have been wandering around the file so far, so let’s see what’s inside. The extract_text() method of a Page object will be our friend in this task: calling it on a page returns the text of that page as a string.

When I ran it on a page of our book, I was able to extract all the text on the page. However, extract_text() doesn’t get the spacing between the words right in some places. The final result depends on a variety of factors, one of them being the generator used to create the PDF file. This means that you won’t face such issues in all PDF files, but some of them are bound to have messed-up spacing upon text extraction.
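A minimal sketch of text extraction, assuming the PyPDF2 2.x API. The function names and the default of the first page are assumptions, since the article's original listing and its page choice are not available.

```python
def extract_page_text(reader, page_index=0):
    """Return the text of one page (zero-based index) as a string."""
    return reader.pages[page_index].extract_text()


def print_page_text(pdf_path, page_index=0):
    """Open the PDF at `pdf_path` and print the text of one page."""
    # Third-party dependency (pip install PyPDF2), imported on first use.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as book:
        # Extract while the file is still open; pages are read on demand.
        print(extract_page_text(PdfReader(book), page_index))
```

The string comes back exactly as the library reconstructs it, so any missing or extra spaces described above will show up in the printed output.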

Conclusion

As we can see, Python makes it simple to work with PDF documents. This tutorial just scratched the surface of this topic, and you can find more details on the different operations you can perform on PDF documents on the PyPDF2 documentation page.


APA
Abder-Rahman Ali | Sciencx (2024-03-28T10:03:40+00:00) » How to Work With PDF Documents Using Python. Retrieved from https://www.scien.cx/2016/01/17/how-to-work-with-pdf-documents-using-python/.