Hackers Realm

Jul 7, 2022 · 3 min

Image to Text Conversion & Extraction using Python | OCR | Machine Learning Project Tutorial

Updated: May 31, 2023

Unleash the power of image-to-text conversion with Python! This comprehensive tutorial explores OCR (Optical Character Recognition) and machine learning techniques. Learn how to extract text from images, enhance accuracy with ML models, and unlock the potential of image processing. Elevate your data extraction capabilities and explore exciting project possibilities with this hands-on tutorial.

Image to Text Conversion & Extraction using OCR

In this project tutorial, we will use the pytesseract module for Optical Character Recognition (OCR) to extract text from images, and the re module to extract specific fields from that text.

Project Information

The project uses the pytesseract module to convert an image into text, and regular expressions to extract specific fields from the extracted text.

Download the pytesseract OCR source files here

Import Modules

import matplotlib.pyplot as plt
import PIL.Image
import pytesseract
import re

%matplotlib inline

  • matplotlib - used for data visualization and graphical plotting; here it displays the image

  • PIL - Python Imaging Library (installed as the Pillow package), used to open and manipulate images in different formats

  • re - Python's regular expression module, used to find particular patterns in text and extract them

  • pytesseract - Python wrapper around the Tesseract OCR engine, which recognizes the characters in an image

These are the prerequisites for using the pytesseract module

# prerequisites
!pip install pytesseract
# also install the Tesseract OCR desktop application (the engine that pytesseract calls)

Download link: https://sourceforge.net/projects/tesseract-ocr-alt/files/

Alternate Download link: https://digi.bib.uni-mannheim.de/tesseract/

Load the image

Now we will open and display the test image to view the text data

img = PIL.Image.open('test.JPG')
plt.imshow(img)

Image contains Text Data
  • This is the test image used in this example

  • It contains several types of text data, such as alphanumeric and special characters.

Convert Image to Text

Now we will convert the image into text

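# config
# path to the Tesseract executable (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'
# note: as a plain Python variable this is not seen by Tesseract; if the language
# data is not found, set os.environ['TESSDATA_PREFIX'] instead
TESSDATA_PREFIX = 'C:/Program Files/Tesseract-OCR'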

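  • This configuration is necessary for the pytesseract module; tesseract_cmd tells it where the Tesseract executable is installed

  • TESSDATA_PREFIX is the location where Tesseract looks for its language data files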
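
A hard-coded Windows path only works on machines that use the same install location. As a minimal sketch (the helper name find_tesseract is hypothetical, not part of pytesseract), the executable could be looked up on the system PATH first, with the tutorial's path kept as a fallback:

```python
import os
import shutil

def find_tesseract(candidates=()):
    """Return the first usable Tesseract executable path, or None."""
    # Prefer an executable that is already on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to explicit install locations, in order.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None

# Usage with the tutorial's configuration (assumes pytesseract is imported):
# path = find_tesseract(['C:/Program Files/Tesseract-OCR/tesseract.exe'])
# if path:
#     pytesseract.pytesseract.tesseract_cmd = path
```

If the function returns None, Tesseract is probably not installed yet, and the download links above are the place to start.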

Now we process the image and get the output data

text_data = pytesseract.image_to_string(img.convert('RGB'), lang='eng')

  • img.convert('RGB') - converts the image to three-channel RGB color mode

  • lang='eng' - sets the language used for recognition; English is the default

  • To extract text in another language, you must download the corresponding language data file and set lang accordingly

  • You can enlarge the image before extraction to improve the results
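
The resizing tip can be sketched with Pillow as follows. The helper name upscale and the factor of 2 are illustrative assumptions, and the best factor depends on the image, so treat this as a starting point rather than a recommended setting:

```python
from PIL import Image

def upscale(image, factor=2):
    """Return a copy of the image enlarged by an integer factor."""
    width, height = image.size
    # LANCZOS is a high-quality resampling filter that keeps text edges fairly sharp.
    return image.resize((width * factor, height * factor), Image.LANCZOS)

# Usage with the tutorial's variables (requires Tesseract to be installed):
# text_data = pytesseract.image_to_string(upscale(img.convert('RGB')), lang='eng')
```

Enlarging does not add detail to the image; it mainly helps when the characters are smaller than Tesseract handles well.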

print(text_data)

Name: Sample

Unique Policy Number: 12345
Amount: 100000

Start Date: 1/10/2019

End Date: 1/11/2019

Geo-Coordinates: 13.89,83.49

  • This is the text extracted from the image

  • Comparing the extracted text with the displayed image shows that the extraction was successful and no data was left out

  • If the text in the image is too small, you can resize the image to a higher resolution for better extraction

Extract Specific Fields

Now we will extract specific fields from the text data

m = re.search(r"Name: (\w+)", text_data)
name = m[1]
name

'Sample'

m = re.search(r"Start Date: (\S+)", text_data)
start_date = m[1]
start_date

'1/10/2019'

m = re.search(r"Geo-Coordinates: (\S+)", text_data)
coordinates = m[1]
coordinates

'13.89,83.49'
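
One caveat with the pattern used in these examples: re.search returns None when the label is not found, for instance when OCR misreads a character, and m[1] then raises a TypeError. A small helper can guard against that. This is a sketch, and the name extract_field and its parameters are hypothetical:

```python
import re

def extract_field(label, text, value_pattern=r"\S+"):
    """Return the value that follows 'label:' in text, or None if it is absent."""
    # re.escape keeps characters such as '-' in the label from acting as regex syntax.
    pattern = re.escape(label) + r":[ \t]*(" + value_pattern + r")"
    match = re.search(pattern, text)
    return match[1] if match else None

# Usage with the tutorial's variable:
# start_date = extract_field("Start Date", text_data)
# amount = extract_field("Amount", text_data, r"\d+")
```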

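  • (\w+) - capture group that matches one or more word characters (letters, digits and underscore) after the specified field label

  • m[n] - returns the text captured by the nth group; m[0] is the whole match

  • (\S+) - capture group that matches one or more non-whitespace characters, so special characters such as / and , are included

  • You may use the string split method to pull any other specific piece of data out of the text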
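
When every line follows the same "Label: value" layout, as in the sample output, all fields can be collected in one pass instead of one search per field. The sketch below hard-codes the sample text so it runs without Tesseract; in the notebook, text_data would take its place. It also shows split separating the two coordinates:

```python
import re

# Sample OCR output copied from the tutorial; use text_data in the notebook.
sample_text = """Name: Sample

Unique Policy Number: 12345
Amount: 100000

Start Date: 1/10/2019

End Date: 1/11/2019

Geo-Coordinates: 13.89,83.49
"""

def parse_fields(text):
    """Return a dict mapping each 'Label: value' line to its value."""
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    pattern = re.compile(r"^([^:\n]+):[ \t]*(.*)$", re.MULTILINE)
    return {label.strip(): value.strip() for label, value in pattern.findall(text)}

fields = parse_fields(sample_text)

# split breaks the coordinate pair at the comma
latitude, longitude = (float(part) for part in fields["Geo-Coordinates"].split(","))
```

This assumes that labels never contain a colon. Documents with a different layout would need a different pattern.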

Final Thoughts

  • You can filter and further process the extracted text, for example with sentiment analysis or by plotting the most frequent words.

  • You can preprocess other images laid out like the test image and use pytesseract to extract their text in the same way.
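
As a rough illustration of the frequent-words idea, the extracted text can be tokenized and counted with the standard library alone. The function name most_common_words is hypothetical, and the tokenization is deliberately simple (lowercase ASCII letters only):

```python
import re
from collections import Counter

def most_common_words(text, n=3):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first so that 'Date' and 'date' are counted together.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

# Usage with the tutorial's variable:
# most_common_words(text_data)
```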

In this project tutorial, we have explored the Image to Text Conversion and Extraction project using OCR. This method is very practical and can be implemented in other projects to analyze and process the text data from an image.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
