Thursday, May 2, 2024

OCR on PDF files using Python


Hi there folks! You might have heard about OCR using Python. The most famous library out there is tesseract, which is sponsored by Google. It is very easy to do OCR on an image. The trouble starts when you want to do OCR over a PDF document.

I'm working on a project where I want to input PDF files, extract text from them and then add the text to the database. I had to search a lot before I stumbled over the final solution. So without wasting any time, let's begin.

Installing Tesseract

It is very easy to install tesseract on various operating systems. For the sake of simplicity I will be using Ubuntu as an example. In Ubuntu you simply have to run the following command in the terminal:

sudo apt-get install tesseract-ocr

It will install Tesseract along with support for three languages.

Installing PyOCR

Now we have to install the Python bindings for tesseract. Fortunately, there are some pretty good bindings out there. We will be installing a recent one:

pip install git+https://github.com/jflesch/pyocr.git

Installing Wand and PIL

We need to install two other dependencies as well before we can move on. The first one is Wand, the Python bindings for ImageMagick. We will be using it for converting PDF files to images:

pip install wand

We will be using PIL as well because PyOCR needs it. You can take a look at the official docs on how to install it on your operating system.

Warming up

Let's start writing our script. First of all, we will be importing the required libraries:

from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io

Note: I imported Image from PIL as PI because otherwise it would have conflicted with the Image class from wand.image.

Get Going

Now we need to get a handle on the OCR library (in our case, tesseract) and the language which will be used by pyocr.

instrument = pyocr.get_available_tools()[0]
lang = instrument.get_available_languages()[1]

We used the second language in instrument.get_available_languages() because the last time I checked, it was English.
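Relying on a fixed index is fragile, because the order of the installed languages can differ between machines. Here is a minimal sketch of a safer selection. It assumes the English data is reported under the code eng, and pick_language is a hypothetical helper written for this post, not part of pyocr:

```python
def pick_language(available, preferred="eng"):
    """Return the preferred language code if installed, else the first one.

    `available` is the list returned by get_available_languages().
    `pick_language` is a hypothetical helper, not part of pyocr, and
    the default code "eng" is an assumption about how English is named.
    """
    if not available:
        raise RuntimeError("No OCR languages are installed")
    if preferred in available:
        return preferred
    # Fall back to whatever language is listed first.
    return available[0]


# Possible usage, assuming `instrument` is the pyocr tool obtained earlier:
# lang = pick_language(instrument.get_available_languages())
```

If the preferred language is missing, the helper falls back to the first installed one instead of failing, which may or may not be what you want for your documents.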

Now we need to set up two lists which will be used to hold our images and final_text.

req_image = []
final_text = []

The next step is to open the PDF file using wand and convert it to JPEG. Let's do it!

image_pdf = Image(filename="./PDF_FILE_NAME", resolution=300)
image_jpeg = image_pdf.convert('jpeg')

Note: Replace PDF_FILE_NAME with a valid PDF file name in the current directory.

wand has converted all the separate pages in the PDF into separate images. We can loop over them and append each one as a blob to the req_image list.

for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

Now we just need to run OCR over the image blobs. It is very easy.

for img in req_image: 
    txt = instrument.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)

Now all of the recognized text has been appended to the final_text list. You can use it in any way you want. I hope this tutorial was helpful for you guys!

If you have any comments and suggestions then do let me know in the comments section below.

Till next time! 🙂
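Since my goal was to add the text to a database, here is one way that last step could look. This is only a sketch using SQLite from the standard library. The save_pages helper, the ocr_pages table and its columns are all assumptions made up for illustration, and the sample strings stand in for the real contents of final_text:

```python
import sqlite3


def save_pages(conn, pdf_name, pages):
    """Store one row per page of recognized text and return the row count.

    The table name `ocr_pages` and its columns are assumptions made
    for this sketch; adapt them to your own schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_pages ("
        "pdf_name TEXT, page_number INTEGER, content TEXT)"
    )
    # Number pages from 1 so they match the page numbers in the PDF.
    rows = [(pdf_name, number, text)
            for number, text in enumerate(pages, start=1)]
    conn.executemany("INSERT INTO ocr_pages VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Example with an in-memory database and placeholder page text:
conn = sqlite3.connect(":memory:")
saved = save_pages(conn, "example.pdf", ["first page text", "second page text"])
```

In a real script you would pass final_text as the pages argument and connect to a database file instead of ":memory:".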
