Parsing pdf with python

Содержание

Экспортируем данные из PDF при помощи Python
Извлечение Текста с PDFMiner
Извлекаем весь текст
Извлечение текста постранично
How to Extract Data from PDF Files with Python
How to Use PDFQuery
Package installation
Import the libraries
Read and convert the PDF files
Access and extract the Data
Conclusion

Экспортируем данные из PDF при помощи Python

Существует много случаев, когда вам нужно извлечь данные из PDF и экспортировать их в другой формат при помощи Python. К сожалению, на сегодняшний день доступно не так уж много пакетов Python, которые выполняют извлечение лучшим образом. В данной статье мы рассмотрим различные пакеты, которые вы можете использовать для извлечения текста. Мы также научимся извлекать изображения из PDF. Так как в Python нет конкретного решения для этих задач, вам нужно уметь использовать эту информацию. После извлечения необходимых данных, мы рассмотрим, как мы можем взять эти данные и извлечь её в другом формате.

Есть вопросы по Python?

На нашем форуме вы можете задать любой вопрос и получить ответ от всего нашего сообщества!

Telegram Чат & Канал

Вступите в наш дружный чат по Python и начните общение с единомышленниками! Станьте частью большого сообщества!

Одно из самых больших сообществ по Python в социальной сети ВК. Видео уроки и книги для вас!

Начнем с того, как извлекать текст!

Извлечение Текста с PDFMiner

Наверное, самым известным является пакет PDFMiner. Данный пакет существует, начиная с версии Python 2.4. Его изначальная задача заключалась в извлечение текста из PDF. В целом, PDFMiner может указать вам точное расположение текста на странице, а также родительскую информацию о шрифтах. Для версий Python 2.4 – 2.7, вы можете ссылаться на следующие сайты с дополнительной информацией о PDFMiner:

PDFMiner не совместим с Python 3. К счастью, существует вилка для PDFMiner под названием PDFMiner.six, которая работает аналогичным образом. Вы можете найти её здесь: https://github.com/pdfminer/pdfminer.six

Инструкции по установке PDFMiner как минимум можно назвать устаревшими. Вы можете использовать pip для проведения установки:

Если вам нужно установить PDFMiner в Python 3 (что вы, скорее всего, и пытаетесь сделать), то вам нужно провести установку следующим образом:

Документация PDFMiner достаточно скудная. По большей части вам понадобится гугл и StackOverflow, чтобы понять, как использовать PDFMiner эффективнее в случаях, не описанных в данной статье.

Извлекаем весь текст

Возможно, вам нужно будет извлечь весь текст из PDF. Пакет PDFMiner предоставляет несколько разных методов, которые позволяют это сделать. Мы рассмотрим несколько программных методов для начала. Попробуем считать весь текст из формы W9 для внутренних доходов. Копию вы можете найти здесь: https://www.irs.gov/pub/irs-pdf/fw9.pdf

После удачного сохранения PDF файла, мы можем взглянуть на код:

PDFMiner имеет тенденцию быть через чур подробным в тех или иных случаях, если вы работаете с ним напрямую. Здесь мы импортируем фрагменты из различных частей PDFMiner. Так как для этих классов нет документации, как и docstrings, углубляться в то, чем они являются, мы в этой статьей не будем. Вы можете ознакомиться с исходным кодом лично, если вам действительно любопытно. Однако, я думаю мы можем следовать примеру кода.

Первое, что мы делаем, это создаем экземпляр ресурсного менеджера. Далее, мы создаем файловый объект через модуль io в Python. Если вы работаете в Python 2, то вам может понадобиться модуль StringIO. Наш следующий шаг – создание конвертера. В данном случае, мы выберем TextConverter, однако вы можете также использовать HTMLConverter или XMLConverter, если захотите. Наконец, мы создаем объект интерпретаторв PDF, который использует наш диспетчер ресурсов, объекты конвертера и извлечет текст.

Последний шаг, это открыть PDF и ввести цикл через каждую страницу. В конце мы захватим весь текст, закроем несколько обработчиков и выведем текст в stdout.

Извлечение текста постранично

Честно говоря, брать весь текст из многостраничного документа далеко не всегда оказывается полезным. Как правило, вам может понадобиться работать с отдельными фрагментами документа. Давайте перепишем код таким образом, чтобы он извлекал текст постранично. Это позволит нам проверить текст (страница за раз):

Источник

How to Extract Data from PDF Files with Python

Shittu Olumide

Data is present in all areas of the modern digital world, and it takes many different forms.

One of the most common formats for data is PDF. Invoices, reports, and other forms are frequently stored in Portable Document Format (PDF) files by businesses and institutions.

It can be laborious and time-consuming to extract data from PDF files. Fortunately, for easy data extraction from PDF files, Python provides a variety of libraries.

This tutorial will explain how to extract data from PDF files using Python. You’ll learn how to install the necessary libraries and I’ll provide examples of how to do so.

There are several Python libraries you can use to read and extract data from PDF files. These include PDFMiner, PyPDF2, PDFQuery and PyMuPDF. Here, we will use PDFQuery to read and extract data from multiple PDF files.

How to Use PDFQuery

PDFQuery is a Python library that provides an easy way to extract data from PDF files by using CSS-like selectors to locate elements in the document.

It reads a PDF file as an object, converts the PDF object to an XML file, and accesses the desired information by its specific location inside of the PDF document.

Let’s consider a short example to see how it works.

from pdfquery import PDFQuery pdf = PDFQuery('example.pdf') pdf.load() # Use CSS-like selectors to locate the elements text_elements = pdf.pq('LTTextLineHorizontal') # Extract the text from the elements text = [t.text for t in text_elements] print(text)

In this code, we first create a PDFQuery object by passing the filename of the PDF file we want to extract data from. We then load the document into the object by calling the load() method.

Next, we use CSS-like selectors to locate the text elements in the PDF document. The pq() method is used to locate the elements, which returns a PyQuery object that represents the selected elements.

Finally, we extract the text from the elements by accessing the text attribute of each element and we store the extracted text in a list called text .

Let’s consider another method we can use to read PDF files, extract some data elements, and create a structured dataset using PDFQuery. We will follow the following steps:

Package installation.
Import the libraries.
Read and convert the PDF files.
Access and extract the Data.

Package installation

First, we need to install PDFQuery and also install Pandas for some analysis and data presentation.

pip install pdfquery pip install pandas

Import the libraries

import pandas as pd import pdfquery

We import the two libraries to be be able to use them in our project.

Read and convert the PDF files

#read the PDF pdf = pdfquery.PDFQuery('customers.pdf') pdf.load() #convert the pdf to XML pdf.tree.write('customers.xml', pretty_print = True) pdf

We will read the pdf file into our project as an element object and load it. Convert the pdf object into an Extensible Markup Language (XML) file. This file contains the data and the metadata of a given PDF page.

The XML defines a set of rules for encoding PDF in a format that is readable by humans and machines. Looking at the XML file using a text editor, we can see where the data we want to extract is.

Access and extract the Data

We can get the information we are trying to extract inside the LTTextBoxHorizontal tag, and we can see the metadata associated with it.

The values inside the text box, [68.0, 231.57, 101.990, 234.893] in the XML fragment refers to Left, Bottom, Right, Top coordinates of the text box. You can think of this as the boundaries around the data we want to extract.

Let’s access and extract the customer name using the coordinates of the text box.

# access the data using coordinates customer_name = pdf.pq('LTTextLineHorizontal:in_bbox("68.0, 231.57, 101.990, 234.893")').text() print(customer_name) #output: Brandon James

Note: Sometimes the data we want to extract is not in the exact same location in every file which can cause issues. Fortunately, PDFQuery can also query tags that contain a given string.

Conclusion

Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing.

Python’s PDFQuery is a potent tool for extracting data from PDF files. Anyone looking to extract data from PDF files will find PDFQuery to be a great option thanks to its simple syntax and comprehensive documentation. It is also open-source and can be modified to suit specific use cases.

Let’s connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.

Источник