Image to Text Conversion & Extraction using Python | OCR | Machine Learning Project Tutorial

Updated: May 31, 2023

Unleash the power of image-to-text conversion with Python! This comprehensive tutorial explores OCR (Optical Character Recognition) and machine learning techniques. Learn how to extract text from images, enhance accuracy with ML models, and unlock the potential of image processing. Elevate your data extraction capabilities and explore exciting project possibilities with this hands-on tutorial.

Image to Text Conversion & Extraction using OCR

In this project tutorial, we will use the pytesseract module for Optical Character Recognition (OCR) to extract text from images, and the re module to extract specific fields from that text.


Project Information

The project uses the pytesseract module to convert an image into text and regular expressions to extract specific fields from the extracted text.

Download the pytesseract OCR source files here

Import Modules

import matplotlib.pyplot as plt
import PIL.Image  # import the submodule explicitly; "import PIL" alone does not guarantee PIL.Image is loaded
import pytesseract
import re
%matplotlib inline

matplotlib - used for data visualization and graphical plotting
PIL - Python Imaging Library, used to open and manipulate images in different formats
re - regular expression module, used to find particular patterns in text and process them
pytesseract - Python wrapper for the Tesseract OCR engine, used to recognize the characters in an image

These are the prerequisites for using the pytesseract module

# prerequisites
!pip install pytesseract
# also install the desktop version of the Tesseract OCR engine (links below)

Download link: https://sourceforge.net/projects/tesseract-ocr-alt/files/

Alternate Download link: https://digi.bib.uni-mannheim.de/tesseract/

Load the image

Now we will open and display the test image to view the text data

img = PIL.Image.open('test.JPG')
plt.imshow(img)

Image contains Text Data

Here is a test image for example
We can see various types of text data like alphanumerical and special characters.

Convert Image to Text

Now we will convert the image into text

# config
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'
TESSDATA_PREFIX = 'C:/Program Files/Tesseract-OCR'

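This is a necessary configuration for the pytesseract module
tesseract_cmd tells pytesseract where the Tesseract executable is installed; adjust the path to match your system
TESSDATA_PREFIX is meant to point Tesseract to its language data; note that Tesseract reads it as an environment variable, so assigning it to a plain Python variable as above has no effect by itself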
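The hard-coded path above only fits a default Windows installation. Here is a minimal sketch of a more portable lookup, assuming Tesseract may already be on the system PATH. The helper name find_tesseract and the fallback list are hypothetical and not part of pytesseract.

```python
import os
import shutil

def find_tesseract(name="tesseract", fallbacks=()):
    """Return a path to the Tesseract executable, or None if it is not found."""
    # Prefer an executable that is already on the PATH
    found = shutil.which(name)
    if found:
        return found
    # Otherwise try explicit candidate locations (hypothetical examples)
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None

tesseract_path = find_tesseract(
    fallbacks=["C:/Program Files/Tesseract-OCR/tesseract.exe"]
)
if tesseract_path is None:
    print("Tesseract executable not found; install it or set the path manually.")
else:
    print("Using Tesseract at:", tesseract_path)
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd in place of the hard-coded string.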

Now we process the image and get the output data

text_data = pytesseract.image_to_string(img.convert('RGB'), lang='eng')

img.convert('RGB') - Converts the image to RGB color mode before it is passed to Tesseract
lang='eng' - Sets the language of the text to extract; English is the default
If you want to extract text in another language, you must download the corresponding language data files and set lang accordingly
You can resize the image to a larger size for better extraction
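Tesseract stores each language as a file named after the language code with a .traineddata extension inside its tessdata folder. The sketch below checks whether a language file is present before you set lang. The helper name has_language is hypothetical, and the folder path is an assumption based on the installation path used in this tutorial.

```python
from pathlib import Path

def has_language(tessdata_dir, lang):
    """Check whether <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()

# Assumed location; adjust to your own installation
tessdata_dir = "C:/Program Files/Tesseract-OCR/tessdata"
for lang in ("eng", "deu"):
    status = "available" if has_language(tessdata_dir, lang) else "missing"
    print(lang, status)
```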

print(text_data)

Name: Sample

Unique Policy Number: 12345
Amount: 100000

Start Date: 1/10/2019

End Date: 1/11/2019

Geo-Coordinates: 13.89,83.49

Text extracted from the image
Comparing the extracted text with the displayed image, the extraction was successful and no data was left out
If the text in the image is too small, you can resize the image to a higher resolution for better extraction
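The resizing advice can be sketched as a small helper. This is one possible approach and not the only way to do it: the function name upscale and the default factor of 2 are assumptions, and it relies only on the size attribute and resize method that PIL images provide.

```python
def upscale(img, factor=2):
    """Return a copy of a PIL image enlarged by an integer factor."""
    width, height = img.size
    # resize() expects the new size as a (width, height) tuple
    return img.resize((width * factor, height * factor))

# Usage with the objects defined earlier in this tutorial:
# bigger = upscale(img, factor=2)
# text_data = pytesseract.image_to_string(bigger.convert('RGB'), lang='eng')
```

Larger is not always better; it is worth comparing the output at a couple of factors on your own images.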

Extract Specific Fields

Now we will extract specific fields from the text data

m = re.search(r"Name: (\w+)", text_data)  # raw string avoids invalid-escape warnings
name = m[1]
name

'Sample'

m = re.search(r"Start Date: (\S+)", text_data)
start_date = m[1]
start_date

'1/10/2019'

m = re.search(r"Geo-Coordinates: (\S+)", text_data)
coordinates = m[1]
coordinates

'13.89,83.49'

(\w+) - Capture group that matches one or more word characters (letters, digits and underscore) after the specified field
m[n] - Returns the text captured by the nth group; m[0] is the whole match
(\S+) - Capture group that matches one or more non-whitespace characters, so it also captures special characters
You may use the split function to retrieve any other specific data from the text
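The patterns above can be generalized. Below is a minimal sketch, assuming every field sits on its own line in the form "key: value" as in the sample output. The helper names extract_field and parse_fields are hypothetical. Note that re.search returns None when nothing matches, so the helper checks for that instead of indexing the result directly.

```python
import re

# Sample text copied from the OCR output shown earlier
text_data = """Name: Sample

Unique Policy Number: 12345
Amount: 100000

Start Date: 1/10/2019

End Date: 1/11/2019

Geo-Coordinates: 13.89,83.49
"""

def extract_field(label, text):
    """Return the non-whitespace value after 'label:', or None if absent."""
    match = re.search(re.escape(label) + r":\s*(\S+)", text)
    return match[1] if match else None

def parse_fields(text):
    """Split every 'key: value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            # Split only on the first colon so the value stays intact
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

fields = parse_fields(text_data)
print(fields["Unique Policy Number"])        # 12345
print(extract_field("End Date", text_data))  # 1/11/2019
print(extract_field("Phone", text_data))     # None

# split also breaks a single value into parts
latitude, longitude = (float(x) for x in fields["Geo-Coordinates"].split(","))
print(latitude, longitude)                   # 13.89 83.49
```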

Final Thoughts

You can filter and further process the extracted text, for example with sentiment analysis, plots or word-frequency counts, to obtain more results.
You can preprocess images in other formats similar to the test image and use the pytesseract module to extract their text without loss of information.
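As one example of such further processing, frequent words can be counted with the standard library. This is only a rough sketch: the function name word_frequencies is hypothetical, and treating a word as a run of the letters a to z is a simplifying assumption.

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Count lowercase alphabetic words and return the most common ones."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)

sample = "Start Date: 1/10/2019\nEnd Date: 1/11/2019"
print(word_frequencies(sample))
```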

In this project tutorial, we have explored the Image to Text Conversion and Extraction project using OCR. This method is very practical and can be implemented in other projects to analyze and process the text data from an image.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
