- Can Python Read PDF Files?
- 1. Reading PDF File Contents With PDFMiner
- 2. Extracting Text With PyPDF2
- 3. Importing Tabular Data Into Pandas With Tabula-py
- 4. Slate
- 5. Scraping And Querying PDF Files With PDFQuery
- 6. Xpdf_python
- 7. Pdflib
- 8. PyMuPDF
- Summary
- References
- About the Author
- Csongor Jozsa
- How to create a PDF viewer using Python
- Creating a PDF Viewer using Python
- Source Code: Create a PDF viewer GUI in Python
- Zain-Bin-Arshad/pdf-viewer
Can Python Read PDF Files?
Python is a great tool for task automation, and it makes working with text files and spreadsheets really easy. But can you use Python to read PDF files?
There are plenty of great Python libraries that can be used to parse PDF files, for example PDFMiner, PyPDF2, tabula-py, slate, PDFQuery, xpdf_python, pdflib and PyMuPDF.
In this brief tutorial I’ll show you how to install and use each of these libraries to read PDFs.
1. Reading PDF File Contents With PDFMiner
PDFMiner is a library for extracting text from PDF files. It can be used as an importable module in your Python scripts, but it also comes with a command-line interface, so you can invoke pdfminer directly from the shell as well.
Attention: The original pdfminer package is deprecated, as the repo has been abandoned by the original author. Make sure to install its community fork, pdfminer.six instead!
From the command line:
python pdf2txt.py /path/to/your/file.pdf
If you want to use it in your Python script, you can simply do:
import pdfminer.high_level

contents = pdfminer.high_level.extract_text("/path/to/your/file.pdf")
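If only some pages are needed, extract_text also accepts a page_numbers argument holding zero-based page indices. The following is a minimal sketch, assuming pdfminer.six is installed; the helper names to_zero_based and extract_pages_text are illustrative and not part of the library.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers to the sorted, de-duplicated
    zero-based indices that pdfminer.six expects."""
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted({p - 1 for p in pages})


def extract_pages_text(path, pages):
    """Return the text of the given 1-based pages of the PDF at `path`."""
    # Imported lazily so that to_zero_based() works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=to_zero_based(pages))
```

For example, extract_pages_text("/path/to/your/file.pdf", [1, 3]) would read only the first and third pages.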
2. Extracting Text With PyPDF2
PyPDF2 is a feature-rich Python library that makes manipulating PDF files easier. It can extract metadata, text and images, and can also modify PDF files by cropping, merging and splitting PDFs.
You can install it by running:
pip install PyPDF2
In PyPDF2 releases before 3.0, you can read text from PDF files with the PdfFileReader class, like so:
from PyPDF2 import PdfFileReader

contents = ""
with open("/path/to/your/file.pdf", "rb") as f:
    pdf = PdfFileReader(f)
    for page_num in range(pdf.getNumPages()):
        # Fetch the page at the current loop index
        page = pdf.getPage(page_num)
        contents += page.extractText()
This little snippet gets the number of pages from the metadata, then iterates through all the pages, and extracts the text content from each page one-by-one.
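PyPDF2 3.0 and later no longer accept the camelCase names PdfFileReader, getNumPages, getPage and extractText; the equivalents are PdfReader, the pages sequence and extract_text(). Here is a minimal sketch of the same loop with the newer names, assuming a recent PyPDF2 release is installed; join_page_text and read_pdf_text are illustrative helper names.

```python
def join_page_text(reader, separator="\n"):
    """Concatenate the text of every page of a reader-like object.

    `reader` only needs a `pages` sequence whose items provide
    `extract_text()`, which is what PyPDF2's PdfReader offers.
    """
    # Fall back to "" in case a page yields no text (e.g. scanned images).
    return separator.join(page.extract_text() or "" for page in reader.pages)


def read_pdf_text(path):
    """Return the text content of the PDF file at `path`."""
    # Imported lazily so join_page_text() can be used without PyPDF2.
    from PyPDF2 import PdfReader

    return join_page_text(PdfReader(path))
```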
3. Importing Tabular Data Into Pandas With Tabula-py
Tabula-py is a more specialized tool: it focuses on reading tables from PDF files. It returns the data as a pandas DataFrame, but you can also export it into TSV or CSV format.
Installation is simple with pip:
pip install tabula-py
Using it is pretty straightforward as well:
import tabula

df = tabula.read_pdf("/path/to/your/file.pdf", pages='all')
df will contain all the data that tabula-py manages to find in tabular format inside the input file. Older versions return a single pandas DataFrame; recent versions return a list of DataFrames, one per detected table.
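Because recent tabula-py versions can return several tables at once, it is often convenient to write each one to its own CSV file. Here is a minimal sketch, assuming tabula-py (and the Java runtime it depends on) is installed; save_tables_as_csv and pdf_tables_to_csv are illustrative helper names, and the table_N.csv naming scheme is an arbitrary choice.

```python
import os


def save_tables_as_csv(tables, out_dir, prefix="table"):
    """Write each table to its own CSV file and return the file paths.

    `tables` is any iterable of objects with a pandas-style
    `to_csv(path, index=False)` method, such as DataFrames.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"{prefix}_{index}.csv")
        table.to_csv(path, index=False)
        paths.append(path)
    return paths


def pdf_tables_to_csv(pdf_path, out_dir):
    """Extract every table from a PDF and save each one as a CSV file."""
    # Imported lazily so save_tables_as_csv() works without tabula-py.
    import tabula

    # multiple_tables=True asks for a list of DataFrames, one per table.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    return save_tables_as_csv(tables, out_dir)
```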
4. Slate
Slate is a wrapper around PDFMiner. It provides roughly the same feature set, but with a much cleaner, pythonic interface.
import slate

# PDF files must be opened in binary mode
with open("/path/to/your/file.pdf", "rb") as input_file:
    contents = slate.PDF(input_file)
contents will be a list of strings, where each element holds the text of one page.
5. Scraping And Querying PDF Files With PDFQuery
If you need to do more sophisticated manipulation of PDF data than just dumping all the contents of the file as raw text, your best bet would be PDFQuery. It allows you to traverse the document tree, just like you would with an XML or HTML document.
PDFQuery supports both XPath and jQuery syntax for querying.
import pdfquery

pdf = pdfquery.PDFQuery("/path/to/your/file.pdf")
pdf.load()  # parse the document so that it can be queried
The pdf variable will now contain a traversable and searchable representation of the PDF document. Contents of this document can be exported in an arbitrary, user-defined format.
You can also search the contents of the document, for example:
element = pdf.pq(':contains("text to find")')
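A common follow-up is to locate a label with :contains and then read the text in a box near it with PDFQuery's in_bbox selector. Here is a minimal sketch, assuming pdf is a PDFQuery object on which load() has already been called; bbox_selector and text_below_label are illustrative helpers, and the default box size is an arbitrary assumption that would need tuning for a real document.

```python
def bbox_selector(x0, y0, x1, y1, element="LTTextLineHorizontal"):
    """Build a selector matching `element` nodes inside a bounding box."""
    return f'{element}:in_bbox("{x0}, {y0}, {x1}, {y1}")'


def text_below_label(pdf, label_text, width=300, height=30):
    """Return the text found just below the line containing `label_text`.

    Returns None when the label is not found in the document.
    """
    label = pdf.pq(f'LTTextLineHorizontal:contains("{label_text}")')
    if len(label) == 0:
        return None
    # PDF coordinates start at the bottom-left corner of the page, so
    # "below" means smaller y values than the label's bottom edge.
    left = float(label.attr("x0"))
    bottom = float(label.attr("y0"))
    selector = bbox_selector(left, bottom - height, left + width, bottom)
    return pdf.pq(selector).text()
```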
6. Xpdf_python
xpdf_python is a wrapper for xpdf. It can export PDF files to text format.
As always, installation is easy with pip:
To get the contents of a PDF file as a string:
from xpdf_python import to_text

contents = to_text("/path/to/your/file.pdf")
7. Pdflib
Pdflib provides Python bindings for the Poppler PDF library. Pdflib can be installed by running:
Parsing PDF files is pretty easy using pdflib:
from pdflib import Document

doc = Document("/path/to/your/file.pdf")
# Collect every line of every page into a flat list
content = [line for page in doc for line in page.lines]
The above snippet will gather all the text in the pdf in the content variable line-by-line.
8. PyMuPDF
PyMuPDF provides Python bindings for MuPDF, a lightweight PDF/e-book viewer.
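Reading a PDF file into a variable:
import fitz  # PyMuPDF is imported under the name fitz

doc = fitz.open("/path/to/your/file.pdf")
# get_text() is the current name; older releases called it getText()
content = [page.get_text() for page in doc]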
content will be a list of pages, containing the content of each page as a string element.
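Once the text of each page is available as a string, finding the pages that mention a term takes only a few lines. Here is a minimal sketch, assuming a PyMuPDF version that provides page.get_text(); pages_containing and find_in_pdf are illustrative helper names.

```python
def pages_containing(page_texts, term):
    """Return the 1-based numbers of the pages whose text contains `term`.

    The comparison is case-insensitive.
    """
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


def find_in_pdf(path, term):
    """Return the 1-based page numbers of the PDF at `path` that mention `term`."""
    # Imported lazily so pages_containing() works without PyMuPDF.
    import fitz

    with fitz.open(path) as doc:
        return pages_containing((page.get_text() for page in doc), term)
```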
Summary
Those were the 8 most popular Python libraries that can be used to read PDF data. So which one should you pick?
If you need to parse data tables, I’d definitely recommend tabula-py, as it exports directly to a pandas DataFrame.
If you want to programmatically search in a PDF file, or extract only parts of it, you should choose PDFQuery.
However, if you need nothing fancy and just want to dump the contents of the file, any of the others will do, but I’d probably go with pdflib or PyMuPDF. They are actively maintained, fast, robust, easy to install, and provide a clean interface to work with.
About the Author
Csongor Jozsa
Main author and editor at pythonin1minute.com. He’s been working as a Software Developer/Security Engineer since 2005. Currently Lead Engineer at MBA.
How to create a PDF viewer using Python
In this tutorial we will learn how to create a PDF viewer in Python. It is a GUI application built with Python's Tkinter module, the pdf2image module, and the Python Imaging Library (PIL).
PDF has become a popular and widely compatible format for invoices, reports and other official documents, so it is handy to have a built-in PDF viewer or renderer for looking up information in them repeatedly.
- Tkinter – Python's standard GUI toolkit, which offers a fast and easy way of producing GUI software.
- pdf2image – An easy-to-use module that wraps the Poppler command-line tools to convert the pages of a PDF into images.
- PIL/Pillow – A free Python library that supports opening and manipulating many different image file formats.
Creating a PDF Viewer using Python
Before getting into the code you need to install the above-mentioned libraries.
$ sudo apt-get install python3-tk
$ pip3 install pdf2image
$ pip3 install pillow
After installing the above modules and required dependencies you can get into actual code.
Source Code: Create a PDF viewer GUI in Python
# Importing required modules
from tkinter import *
from PIL import Image, ImageTk
from pdf2image import convert_from_path

# Creating Tk container
root = Tk()
# Creating the frame for PDF Viewer
# (pack() returns None, so it is called separately from Frame())
pdf_frame = Frame(root)
pdf_frame.pack(fill=BOTH, expand=1)
# Adding Scrollbar to the PDF frame
scrol_y = Scrollbar(pdf_frame, orient=VERTICAL)
# Adding text widget for inserting images
pdf = Text(pdf_frame, yscrollcommand=scrol_y.set, bg="grey")
# Setting the scrollbar to the right side
scrol_y.pack(side=RIGHT, fill=Y)
scrol_y.config(command=pdf.yview)
# Finally packing the text widget
pdf.pack(fill=BOTH, expand=1)
# Here the PDF is converted to list of images
pages = convert_from_path('mypdf.pdf', size=(800, 900))
# Empty list for storing images
photos = []
# Storing the converted images into list
for page in pages:
    photos.append(ImageTk.PhotoImage(page))
# Adding all the images to the text widget
for photo in photos:
    pdf.image_create(END, image=photo)
    # For separating the pages
    pdf.insert(END, '\n\n')
# Starting the Tk event loop
mainloop()
You may be wondering why I used two for loops, one to add the images to a list and a second to add them to the text widget. I originally used a single loop, but then only the last page of the PDF was shown. This happened because I was using a single variable to hold each image before inserting it into the text widget. Tkinter does not keep its own reference to a PhotoImage, so once the variable is rebound to the next page, the previous image is garbage-collected and disappears. Every image therefore needs its own lasting reference for as long as the PDF is displayed.
The code looks simple, but it took me around 8 hours to arrive at this solution, because I tried every possible approach I could find before landing on this one. It was a really challenging and interesting task.
So this is how you can create a simple PDF viewer. I hope this article is useful to you. Thank you, and ‘Keep Learning, Keep Coding’.
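The same result can be had with a single loop, as long as every image object stays referenced somewhere. Here is a minimal sketch of that idea; insert_pages is an illustrative helper, and storing the list on a photos attribute of the widget is just one assumed way of tying the images' lifetime to it.

```python
def insert_pages(text_widget, pages, to_photo):
    """Insert one image per page into a Tk Text widget in a single loop.

    `to_photo` converts a page into an image object accepted by
    `image_create` (for example ImageTk.PhotoImage). Every image is kept
    in a list so that it is not garbage-collected while displayed.
    """
    photos = []
    for page in pages:
        photo = to_photo(page)
        photos.append(photo)  # keep a lasting reference to each image
        text_widget.image_create("end", image=photo)
        text_widget.insert("end", "\n\n")  # blank lines between pages
    # Attach the list to the widget so the images live as long as it does.
    text_widget.photos = photos
    return photos
```

With the viewer above, this would be called as insert_pages(pdf, pages, ImageTk.PhotoImage) in place of the two loops.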
A Pure Python PDFViewer, which provides functionalities same as other famous PDFViewers.
Zain-Bin-Arshad/pdf-viewer
README.md
Warning: This project is currently not being actively maintained. When I initially developed this project, I had only about a week available. PySimpleGUI was also in its early stages, and there may have been updates that could simplify certain sections of the code. I am working on finding some time to update the code to the latest Python and PySimpleGUI versions. However, please note that this project should still fulfill your requirements.
- Searching through a PDF document
- Table of contents
- Take notes and save them automatically
- Night mode for taking care of your eyes
- Zooming feature
- Developed in pure Python (python-3.7.4)
Why bother creating a new PDF Viewer?
I wanted to create an application that requires me to have a PDF Viewer with access to the source code, in Python. I searched the internet for a PDF Viewer written in Python, but all in vain. I couldn’t find any good results. Hence, I decided to create my own PDF Viewer, that uses pure Python packages.
Make sure you have installed these packages:
You can simply run this command to install all packages. First make sure you are in the same directory as requirements.txt, then type:
pip install -r requirements.txt
This will install all the packages, then you have to simply run this command:
The given command will generate this screen:
Now, you can open any book and enjoy reading.
We all want to search through a document and get things done quickly. Well, I've got your back: keep searching with this advanced, user-friendly search feature.
If you want to read more but your eyes can’t see the white screen anymore, switch to night mode. This not only updates the application’s theme but converts the PDF to black background and white text so that your eyes can relax.
You can also take notes as you go through the book. On the right side, there is a space for writing anything you want. There is a "Check Box" in the right-hand corner; make sure that box is checked, because only then will the note be saved.
A file named "%PDF Filename%_notes.txt" will appear once you close the document. This file will look something like this:
You can see how easy it is to navigate through this text file and see all the notes.
What’s so special about it?
Once you have taken notes, all of them will reappear on their respective pages whenever you open the same file again. Taking the notes.txt from the screenshot above as an example, when I open the same file again you can see the notes:
Make sure you don’t delete the text file that is created automatically by the PDFViewer.
If you like my effort and want this project to move forward, you can support this project by using the Sponsor button above. I appreciate your generous help.
This code is not as Pythonic as I want it to be, but I had just enough time and resources to get this done, so this is what it is. I might take 1-2 days out of my schedule to refactor this code base.
I want to thank PySimpleGUI for developing such a great GUI-Framework in Python.