• Hackers Realm

Text Summarization using Python (NLP) | Word Frequency | Machine Learning Project Tutorial

Text summarization analyzes a large text document and produces a short summary of its most useful information. This project falls under Natural Language Processing and uses word frequencies to calculate a score for each sentence.



In this project tutorial, we will analyze a text document, use a word frequency technique to score each sentence, and build a summary from the highest-scoring sentences.





Dataset Information


Text summarization is a useful technique for extracting the important parts of a large text document. This project uses the frequencies of the words in each sentence to yield a score for that sentence. It uses the Natural Language Toolkit (NLTK) to process the text and regular expressions to preprocess the data.


This is a basic text summarization technique to get useful information from any document.


You may use any text document: an article from the internet, an investigative report, a paper, etc. For this project, we will use the following text article as the input data.



## input text article
article_text = """Just what is agility in the context of software engineering work? Ivar Jacobson [Jac02a] provides a useful discussion: Agility has become today’s buzzword when describing a modern software process. Everyone is agile. An agile team is a nimble team able to appropriately respond to changes. Change is what software development is very much about. Changes in the software being built, changes to the team members, changes because of new technology, changes of all kinds that may have an impact on the product they build or the project that creates the product. Support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software. An agile team recognizes that software is developed by individuals working in teams and that the skills of these people, their ability to collaborate is at the core for the success of the project.In Jacobson’s view, the pervasiveness of change is the primary driver for agility. Software engineers must be quick on their feet if they are to accommodate the rapid changes that Jacobson describes. But agility is more than an effective response to change. It also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter. It encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile. It emphasizes rapid delivery of operational software and deemphasizes the importance of intermediate work products (not always a good thing); it adopts the customer as a part of the development team and works to eliminate the “us and them” attitude that continues to pervade many software projects; it recognizes that planning in an uncertain world has its limits and that a project plan must be flexible. Agility can be applied to any software process. 
However, to accomplish this, it is essential that the process be designed in a way that allows the project team to adapt tasks and to streamline them, conduct planning in a way that understands the fluidity of an agile development approach, eliminate all but the most essential work products and keep them lean, and emphasize an incremental delivery strategy that gets working software to the customer as rapidly as feasible for the product type and operational environment. """



Import Modules


import re
import nltk
  • re – regular expression module, used to find particular patterns in the text and process them

  • nltk – the Natural Language Toolkit, a natural language processing library included with Anaconda


Data Preprocessing


Let us convert all letters to lower case for better processing

article_text = article_text.lower()
article_text

'just what is agility in the context of software engineering work? ivar jacobson [jac02a] provides a useful discussion: agility has become today’s buzzword when describing a modern software process. everyone is agile. an agile team is a nimble team able to appropriately respond to changes. change is what software development is very much about. changes in the software being built, changes to the team members, changes because of new technology, changes of all kinds that may have an impact on the product they build or the project that creates the product. support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software. an agile team recognizes that software is developed by individuals working in teams and that the skills of these people, their ability to collaborate is at the core for the success of the project.in jacobson’s view, the pervasiveness of change is the primary driver for agility. software engineers must be quick on their feet if they are to accommodate the rapid changes that jacobson describes. but agility is more than an effective response to change. it also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter. it encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile. it emphasizes rapid delivery of operational software and deemphasizes the importance of intermediate work products (not always a good thing); it adopts the customer as a part of the development team and works to eliminate the “us and them” attitude that continues to pervade many software projects; it recognizes that planning in an uncertain world has its limits and that a project plan must be flexible. agility can be applied to any software process. 
however, to accomplish this, it is essential that the process be designed in a way that allows the project team to adapt tasks and to streamline them, conduct planning in a way that understands the fluidity of an agile development approach, eliminate all but the most essential work products and keep them lean, and emphasize an incremental delivery strategy that gets working software to the customer as rapidly as feasible for the product type and operational environment. '



Now we remove punctuation and numbers, then collapse extra spaces

# replace every non-letter (punctuation, numbers) with a space
clean_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
# collapse runs of whitespace into a single space
clean_text = re.sub(r'\s+', ' ', clean_text)
clean_text

'just what is agility in the context of software engineering work ivar jacobson jac a provides a useful discussion agility has become today s buzzword when describing a modern software process everyone is agile an agile team is a nimble team able to appropriately respond to changes change is what software development is very much about changes in the software being built changes to the team members changes because of new technology changes of all kinds that may have an impact on the product they build or the project that creates the product support for changes should be built in everything we do in software something we embrace because it is the heart and soul of software an agile team recognizes that software is developed by individuals working in teams and that the skills of these people their ability to collaborate is at the core for the success of the project in jacobson s view the pervasiveness of change is the primary driver for agility software engineers must be quick on their feet if they are to accommodate the rapid changes that jacobson describes but agility is more than an effective response to change it also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter it encourages team structures and attitudes that make communication among team members between technologists and business people between software engineers and their managers more facile it emphasizes rapid delivery of operational software and deemphasizes the importance of intermediate work products not always a good thing it adopts the customer as a part of the development team and works to eliminate the us and them attitude that continues to pervade many software projects it recognizes that planning in an uncertain world has its limits and that a project plan must be flexible agility can be applied to any software process however to accomplish this it is essential that the process be designed in a way that allows the project team to adapt tasks and to 
streamline them conduct planning in a way that understands the fluidity of an agile development approach eliminate all but the most essential work products and keep them lean and emphasize an incremental delivery strategy that gets working software to the customer as rapidly as feasible for the product type and operational environment '
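To see what the two substitutions do, here is a minimal sketch on a short sample string. The helper name `clean` and the `sample` text are illustrative assumptions and are not part of the original notebook.

```python
import re

def clean(text):
    """Replace every non-letter with a space, then collapse runs of whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return re.sub(r'\s+', ' ', letters_only)

# hypothetical sample, already lowercased like the article text
sample = "ivar jacobson [jac02a] provides a useful discussion: everyone is agile."
cleaned = clean(sample)
print(repr(cleaned))
```

Note that the final period also becomes a space, so the cleaned string keeps a trailing space, just like the clean text shown above. Calling `.strip()` on the result would remove it if that matters for your use.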



Now we split the text into a list of sentences

# split into sentence list
sentence_list = nltk.sent_tokenize(article_text)
sentence_list

['just what is agility in the context of software engineering work?', 'ivar jacobson [jac02a] provides a useful discussion: agility has become today’s buzzword when describing a modern software process.', 'everyone is agile.', 'an agile team is a nimble team able to appropriately respond to changes.', 'change is what software development is very much about.', 'changes in the software being built, changes to the team members, changes because of new technology, changes of all kinds that may have an impact on the product they build or the project that creates the product.', 'support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software.', 'an agile team recognizes that software is developed by individuals working in teams and that the skills of these people, their ability to collaborate is at the core for the success of the project.in jacobson’s view, the pervasiveness of change is the primary driver for agility.', 'software engineers must be quick on their feet if they are to accommodate the rapid changes that jacobson describes.', 'but agility is more than an effective response to change.', 'it also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter.', 'it encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile.', 'it emphasizes rapid delivery of operational software and deemphasizes the importance of intermediate work products (not always a good thing); it adopts the customer as a part of the development team and works to eliminate the “us and them” attitude that continues to pervade many software projects; it recognizes that planning in an uncertain world has its limits and that a project plan must be flexible.', 'agility can be applied to any software process.', 'however, to accomplish this, it is essential that the 
process be designed in a way that allows the project team to adapt tasks and to streamline them, conduct planning in a way that understands the fluidity of an agile development approach, eliminate all but the most essential work products and keep them lean, and emphasize an incremental delivery strategy that gets working software to the customer as rapidly as feasible for the product type and operational environment.']


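  • The sentences are tokenized from article_text rather than clean_text because the tokenizer relies on periods and other end-of-sentence punctuation, which the cleaning step removed. Note that the missing space in 'project.in jacobson’s view' keeps those two sentences merged into one.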
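The effect of a missing space after a period can be reproduced with a much simpler splitter. The sketch below is a simplified stand-in written for illustration, not NLTK's tokenizer; the function name `naive_sentence_split` and both sample strings are assumptions.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' only when whitespace follows.

    A simplified stand-in for illustration; NLTK's tokenizer is more sophisticated.
    """
    return re.split(r'(?<=[.!?])\s+', text.strip())

spaced = "everyone is agile. change is constant."
unspaced = "everyone is agile.change is constant."

print(naive_sentence_split(spaced))    # two sentences
print(naive_sentence_split(unspaced))  # one merged sentence
```

If merged sentences are a problem in your own documents, one option is to insert the missing space in the source text before tokenizing.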



## run this cell once to download stopwords
import nltk
nltk.download('stopwords')
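  • The stopwords corpus must be downloaded from NLTK once before further processing. Depending on your NLTK version, sent_tokenize and word_tokenize may also require the 'punkt' or 'punkt_tab' tokenizer data, which can be downloaded the same way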


Word Frequencies


# load the English stop word list
stopwords = nltk.corpus.stopwords.words('english')

# count how often each non-stop word occurs in the cleaned text
word_frequencies = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
  • English stop words are loaded for preprocessing

  • A word frequency dictionary is created for the calculation

  • Stop words (such as "the", "is" and "and") carry little meaning on their own; skipping them may improve the results

  • Every word in the clean text is tokenized and compared against the stop words

  • A word that is not a stop word is added to the dictionary with a count of 1 the first time it appears; each later occurrence adds one to its count
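The same counting logic can be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's code; `sample_stopwords` is a tiny hypothetical stop word set and `sample_text` is an illustrative string, whereas the tutorial uses NLTK's English list and tokenizer.

```python
from collections import Counter

# hypothetical, tiny stop word set for illustration only
sample_stopwords = {"is", "a", "an", "to", "the"}

# illustrative text, split on whitespace instead of nltk.word_tokenize
sample_text = "an agile team is a nimble team able to respond to changes"

# count every word that is not a stop word
word_counts = Counter(
    word for word in sample_text.split() if word not in sample_stopwords
)
print(word_counts)
```

Using a set for the stop words also makes each `not in` check faster than searching a list, which can matter for long documents.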



Now we normalize the frequency of each word by dividing it by the highest frequency, so the most frequent word gets a value of 1.0

maximum_frequency = max(word_frequencies.values())

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
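As a small worked example of this normalization, the sketch below uses a handful of raw counts. The counts are an assumption inferred from the normalized values printed later in this tutorial (the smallest value, 0.0769..., equals 1/13, which suggests 'software' occurs 13 times); the name `raw_counts` is illustrative.

```python
# raw counts inferred from the normalized output (assumption): 1/13 = 0.0769...
raw_counts = {"software": 13, "team": 8, "changes": 7, "agility": 5, "context": 1}

# the most frequent word defines the scale
maximum_frequency = max(raw_counts.values())

# divide every count by the maximum so that all values fall in (0, 1]
normalized = {word: count / maximum_frequency for word, count in raw_counts.items()}

for word, weight in normalized.items():
    print(f"{word}: {weight:.4f}")
```

Dividing by the maximum keeps the ranking of the words unchanged; it only rescales the values so that scores are comparable across documents of different lengths.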

Calculate Sentence Scores


sentence_scores = {}

for sentence in sentence_list:
    for word in nltk.word_tokenize(sentence):
        if word in word_frequencies and len(sentence.split(' ')) < 30:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word]
            else:
                sentence_scores[sentence] += word_frequencies[word]
  • len(sentence.split(' ')) < 30 is a length threshold: only sentences with fewer than 30 words are scored, so longer sentences cannot appear in the summary. The threshold may be changed to suit your preference.

  • A sentence score dictionary is created to rank the sentences by their word frequencies
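To make the scoring rule concrete, here is a minimal sketch that scores a single sentence. The function `score_sentence`, its `max_words` parameter, and the regex tokenizer are assumptions standing in for the loop above and for nltk.word_tokenize; the three weights are taken from the word frequency values in this tutorial (1/13, 4/13 and 1.0).

```python
import re

# a few normalized weights taken from the tutorial's word frequencies
weights = {"everyone": 1 / 13, "agile": 4 / 13, "software": 1.0}

def score_sentence(sentence, weights, max_words=30):
    """Sum the weights of known words; return None for sentences at or over max_words."""
    if len(sentence.split(' ')) >= max_words:
        return None
    # simple regex tokenizer, an assumption standing in for nltk.word_tokenize
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(weights.get(token, 0.0) for token in tokens)

short_score = score_sentence("everyone is agile.", weights)
long_score = score_sentence(" ".join(["software"] * 30), weights)
print(short_score, long_score)
```

The short sentence scores 1/13 + 4/13, about 0.3846, because 'is' is a stop word with no weight. The 30-word sentence is skipped entirely even though every word in it has the maximum weight, which shows how the threshold keeps very long sentences out of the summary.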



Now let us see the Word Frequency dictionary with the scores


word_frequencies

{'agility': 0.38461538461538464, 'context': 0.07692307692307693, 'software': 1.0, 'engineering': 0.07692307692307693, 'work': 0.23076923076923078, 'ivar': 0.07692307692307693, 'jacobson': 0.23076923076923078, 'jac': 0.07692307692307693, 'provides': 0.07692307692307693, 'useful': 0.07692307692307693, 'discussion': 0.07692307692307693, 'become': 0.07692307692307693, 'today': 0.07692307692307693, 'buzzword': 0.07692307692307693, 'describing': 0.07692307692307693, 'modern': 0.07692307692307693, 'process': 0.23076923076923078, 'everyone': 0.07692307692307693, 'agile': 0.3076923076923077, 'team': 0.6153846153846154, 'nimble': 0.07692307692307693, 'able': 0.07692307692307693, 'appropriately': 0.07692307692307693, 'respond': 0.07692307692307693, 'changes': 0.5384615384615384, 'change': 0.23076923076923078, 'development': 0.23076923076923078, 'much': 0.07692307692307693,


'built': 0.15384615384615385, 'members': 0.15384615384615385, 'new': 0.07692307692307693, 'technology': 0.07692307692307693, 'kinds': 0.07692307692307693, 'may': 0.07692307692307693, 'impact': 0.07692307692307693, 'product': 0.23076923076923078, 'build': 0.07692307692307693, 'project': 0.3076923076923077, 'creates': 0.07692307692307693, 'support': 0.07692307692307693, 'everything': 0.07692307692307693, 'something': 0.07692307692307693, 'embrace': 0.07692307692307693, 'heart': 0.07692307692307693, 'soul': 0.07692307692307693, 'recognizes': 0.15384615384615385, 'developed': 0.07692307692307693, 'individuals': 0.07692307692307693, 'working': 0.15384615384615385, 'teams': 0.07692307692307693, 'skills': 0.07692307692307693, 'people': 0.15384615384615385, 'ability': 0.07692307692307693, 'collaborate': 0.07692307692307693, 'core': 0.07692307692307693, 'success': 0.07692307692307693, 'view': 0.07692307692307693, 'pervasiveness': 0.07692307692307693, 'primary': 0.07692307692307693, 'driver': 0.07692307692307693, 'engineers': 0.15384615384615385, 'must': 0.15384615384615385, 'quick': 0.07692307692307693, 'feet': 0.07692307692307693,


'accommodate': 0.07692307692307693, 'rapid': 0.15384615384615385, 'describes': 0.07692307692307693, 'effective': 0.07692307692307693, 'response': 0.07692307692307693, 'also': 0.07692307692307693, 'encompasses': 0.07692307692307693, 'philosophy': 0.07692307692307693, 'espoused': 0.07692307692307693, 'manifesto': 0.07692307692307693, 'noted': 0.07692307692307693, 'beginning': 0.07692307692307693, 'chapter': 0.07692307692307693, 'encourages': 0.07692307692307693, 'structures': 0.07692307692307693, 'attitudes': 0.07692307692307693, 'make': 0.07692307692307693, 'communication': 0.07692307692307693, 'among': 0.07692307692307693, 'technologists': 0.07692307692307693, 'business': 0.07692307692307693, 'managers': 0.07692307692307693, 'facile': 0.07692307692307693, 'emphasizes': 0.07692307692307693, 'delivery': 0.15384615384615385, 'operational': 0.15384615384615385, 'deemphasizes': 0.07692307692307693, 'importance': 0.07692307692307693, 'intermediate': 0.07692307692307693, 'products': 0.15384615384615385, 'always': 0.07692307692307693, 'good': 0.07692307692307693, 'thing': 0.07692307692307693, 'adopts': 0.07692307692307693, 'customer': 0.15384615384615385, 'part': 0.07692307692307693, 'works': 0.07692307692307693, 'eliminate': 0.15384615384615385, 'us': 0.07692307692307693, 'attitude': 0.07692307692307693, 'continues': 0.07692307692307693, 'pervade': 0.07692307692307693, 'many': 0.07692307692307693, 'projects': 0.07692307692307693, 'planning': 0.15384615384615385, 'uncertain': 0.07692307692307693, 'world': 0.07692307692307693,


'limits': 0.07692307692307693, 'plan': 0.07692307692307693, 'exible': 0.07692307692307693, 'applied': 0.07692307692307693, 'however': 0.07692307692307693, 'accomplish': 0.07692307692307693, 'essential': 0.15384615384615385, 'designed': 0.07692307692307693, 'way': 0.15384615384615385, 'allows': 0.07692307692307693, 'adapt': 0.07692307692307693, 'tasks': 0.07692307692307693, 'streamline': 0.07692307692307693, 'conduct': 0.07692307692307693, 'understands': 0.07692307692307693, 'uidity': 0.07692307692307693, 'approach': 0.07692307692307693, 'keep': 0.07692307692307693, 'lean': 0.07692307692307693, 'emphasize': 0.07692307692307693, 'incremental': 0.07692307692307693, 'strategy': 0.07692307692307693, 'gets': 0.07692307692307693, 'rapidly': 0.07692307692307693, 'feasible': 0.07692307692307693, 'type': 0.07692307692307693, 'environment': 0.07692307692307693}



sentence_scores

{'just what is agility in the context of software engineering work?': 1.7692307692307694, 'ivar jacobson [jac02a] provides a useful discussion: agility has become today’s buzzword when describing a modern software process.': 2.5384615384615383, 'everyone is agile.': 0.38461538461538464, 'an agile team is a nimble team able to appropriately respond to changes.': 2.3846153846153846, 'change is what software development is very much about.': 1.5384615384615385, 'support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software.': 3.0, 'software engineers must be quick on their feet if they are to accommodate the rapid changes that jacobson describes.': 2.5384615384615383, 'but agility is more than an effective response to change.': 0.7692307692307694, 'it also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter.': 0.6153846153846154, 'it encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile.': 3.4615384615384612,


  • The score for each sentence is calculated by adding up the frequency scores of its words



# get top 5 sentences
import heapq
summary = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)

print(" ".join(summary))

it encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile. support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software. ivar jacobson [jac02a] provides a useful discussion: agility has become today’s buzzword when describing a modern software process. software engineers must be quick on their feet if they are to accommodate the rapid changes that jacobson describes. an agile team is a nimble team able to appropriately respond to changes.


  • heapq - module that provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.

  • heapq.nlargest() - returns the n largest elements of an iterable; in this snippet, it returns the 5 sentences with the highest scores, ordered from highest to lowest score.
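The behaviour of heapq.nlargest() on a dictionary can be checked with a small sketch. The scores in `sample_scores` are rounded from the sentence scores shown earlier; the re-ordering step at the end is an optional variation that is not part of the original notebook.

```python
import heapq

# scores rounded from the sentence scores shown earlier
sample_scores = {
    "everyone is agile.": 0.38,
    "change is what software development is very much about.": 1.54,
    "an agile team is a nimble team able to appropriately respond to changes.": 2.38,
}

# iterating a dict yields its keys, so nlargest returns sentences ranked by score
top_two = heapq.nlargest(2, sample_scores, key=sample_scores.get)

# optional variation: restore the order in which the sentences appeared in the text
in_document_order = [s for s in sample_scores if s in top_two]

print(top_two)
print(in_document_order)
```

Ranking by score puts the strongest sentence first, while restoring document order can make the summary read more naturally. Which one to use depends on your preference.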



Final Thoughts


  • Text summarization is a useful technique for extracting the key information from large documents.

  • Simplifying and filtering the text yields cleaner data to process, which can give better results.


In this project tutorial, we have explored the text summarization process, summarizing a large document by using the word frequency technique to evaluate the relevance of each sentence in the text.

Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
