• Hackers Realm

Image to Text Conversion & Extraction using Python | OCR | Machine Learning Project Tutorial

Image to Text Conversion and Extraction is an Optical Character Recognition (OCR) project in which text is extracted from an image and either stored in a file or displayed on screen. It is a useful way to pull text out of scanned documents, photos, street signs, and similar images.


In this project tutorial, we will use the pytesseract module for Optical Character Recognition (OCR) to extract text from images, and the re module to extract specific fields from that text.






Project Information

The project uses the pytesseract module to convert an image into text, and regular expressions to extract specific fields from the extracted text.


Download the pytesseract OCR source files here


Import Modules


import matplotlib.pyplot as plt
import PIL.Image  # import the Image submodule explicitly so that PIL.Image.open is available
import pytesseract
import re
%matplotlib inline
  • matplotlib - used for data visualization and graphical plotting; here it displays the image

  • PIL - Python Imaging Library (Pillow) for opening and manipulating images in different formats

  • re - regular expression module used to find particular patterns in text and extract them

  • pytesseract - Python wrapper around the Tesseract OCR engine, which performs the character recognition


The following prerequisites are needed to install and use the pytesseract module

# prerequisites
!pip install pytesseract
# also install the desktop version of the Tesseract OCR engine (pytesseract is only a wrapper around it)

Download link: https://sourceforge.net/projects/tesseract-ocr-alt/files/

Alternate Download link: https://digi.bib.uni-mannheim.de/tesseract/
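
Before moving on, it can help to confirm that the Tesseract executable is actually installed. The snippet below is a minimal sketch using only the standard library; the helper name find_tesseract and the list of candidate locations are assumptions for illustration, so adjust them to your own installation.

```python
import shutil

def find_tesseract(candidates=("tesseract",
                               "C:/Program Files/Tesseract-OCR/tesseract.exe")):
    """Return the first candidate that resolves to an executable, else None.

    The candidate names/paths are assumptions; change them for your system.
    """
    for candidate in candidates:
        path = shutil.which(candidate)  # looks on PATH, or checks a given path directly
        if path:
            return path
    return None

tesseract_path = find_tesseract()
if tesseract_path:
    print(f"Tesseract found at: {tesseract_path}")
else:
    print("Tesseract not found; install it from one of the links above.")
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd in the configuration step later on.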



Load the image


Now we will open and display the test image to view the text data

img = PIL.Image.open('test.JPG')
plt.imshow(img)
  • The test image is displayed as an example

  • We can see various types of text data, such as alphanumeric and special characters.



Convert Image to Text


Now we will convert the image into text

# config
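import os
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'
# TESSDATA_PREFIX must be an environment variable; a plain Python variable is not seen by Tesseract
os.environ['TESSDATA_PREFIX'] = 'C:/Program Files/Tesseract-OCR'
  • tesseract_cmd tells the pytesseract module where the Tesseract executable is; this is needed when Tesseract is not on the system PATH

  • TESSDATA_PREFIX tells Tesseract where its language data files are located; depending on the Tesseract version, it may need to point to the tessdata folder itself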


Now we process the image and get the output data

text_data = pytesseract.image_to_string(img.convert('RGB'), lang='eng')
  • img.convert('RGB') - Converts the image to three-channel RGB mode before it is passed to Tesseract

  • lang='eng' - Sets the language used for recognition; English is the default

  • If you want to extract text in another language, you must download the corresponding language data file and set lang accordingly

  • You can resize the image into a larger format for better extraction



print(text_data)

Name: Sample Unique Policy Number: 12345 Amount: 100000 Start Date: 1/10/2019 End Date: 1/11/2019 Geo-Coordinates: 13.89,83.49

  • This is the text extracted from the image

  • Comparing the extracted text with the displayed image shows that the extraction was successful and no text was left out

  • If the text in the image is too small, you can resize the image to a higher resolution for better extraction
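
The resizing tip above can be sketched with Pillow as follows. This is only an illustration: the helper name upscale and the factor of 2 are assumptions, and a blank canvas stands in for the test image so that the snippet runs on its own.

```python
from PIL import Image

def upscale(img, factor=2):
    """Return a copy of img enlarged by an integer factor (Lanczos resampling)."""
    new_size = (img.width * factor, img.height * factor)
    return img.resize(new_size, Image.LANCZOS)

# Stand-in for the tutorial's test image: a blank 200x100 white canvas
small = Image.new("RGB", (200, 100), "white")
large = upscale(small, factor=2)
print(large.size)  # (400, 200)
```

The enlarged image can then be passed to pytesseract.image_to_string in place of the original. How much enlargement helps depends on the image, so it is worth trying a few factors.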



Extract Specific Fields


Now we will extract specific fields from the text data

m = re.search(r"Name: (\w+)", text_data)  # raw string avoids invalid escape sequence warnings
name = m[1]
name

'Sample'


m = re.search(r"Start Date: (\S+)", text_data)
start_date = m[1]
start_date

'1/10/2019'


m = re.search(r"Geo-Coordinates: (\S+)", text_data)
coordinates = m[1]
coordinates

'13.89,83.49'

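  • (\w+) - Capturing group that matches one or more word characters (letters, digits and underscore), i.e. the first word after the field label

  • m[n] - Returns the text captured by the nth group of the match (m[0] is the whole match)

  • (\S+) - Capturing group that matches one or more non-whitespace characters, so it also captures special characters such as / and ,

  • You may use the split function on an extracted string to obtain other specific pieces of data from the text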
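
The same approach can be applied to all fields at once. The sketch below is one possible way to do it, not part of the original notebook: the dictionary keys, the extra patterns for the policy number, amount and end date, and the helper name extract_fields are assumptions for illustration. It also returns None for a missing field instead of raising an error, since re.search returns None when nothing matches.

```python
import re

# Sample OCR output copied from this tutorial; real output may contain line breaks
text_data = (
    "Name: Sample Unique Policy Number: 12345 Amount: 100000 "
    "Start Date: 1/10/2019 End Date: 1/11/2019 Geo-Coordinates: 13.89,83.49"
)

# Hypothetical mapping of output keys to patterns (one capturing group each)
FIELD_PATTERNS = {
    "name": r"Name: (\w+)",
    "policy_number": r"Unique Policy Number: (\d+)",
    "amount": r"Amount: (\d+)",
    "start_date": r"Start Date: (\S+)",
    "end_date": r"End Date: (\S+)",
    "coordinates": r"Geo-Coordinates: (\S+)",
}

def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of captured values; a field that is not found maps to None."""
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[key] = match[1] if match else None
    return fields

fields = extract_fields(text_data)
print(fields["name"], fields["start_date"])  # Sample 1/10/2019

# split turns the coordinate string into separate numeric values
latitude, longitude = (float(part) for part in fields["coordinates"].split(","))
print(latitude, longitude)  # 13.89 83.49
```

Keeping the patterns in one dictionary makes it easy to add or adjust fields when the layout of the scanned document changes.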



Final Thoughts


  • You can filter and further process the extracted text for other tasks such as sentiment analysis, or summarize it with plots or frequent-word counts.

  • You can preprocess images in various formats, similar to the test image, and use the Tesseract module to extract their text with little or no loss of information.
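
As a small illustration of the further processing mentioned above, the following sketch counts the most frequent words in a piece of extracted text using only the standard library. The helper name word_frequencies and the sample string are assumptions for illustration.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=3):
    """Return the top_n most common alphabetic words in text, case-insensitively."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)

# Fragment of the OCR output shown earlier in this tutorial
sample = "Start Date: 1/10/2019 End Date: 1/11/2019"
print(word_frequencies(sample, top_n=1))  # [('date', 2)]
```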


In this project tutorial, we have explored the Image to Text Conversion and Extraction project using OCR. This method is very practical and can be implemented in other projects to analyze and process the text data from an image.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
