Before we move on to the detailed concepts, let us quickly understand what text summarization in Python is. Here is a definition:
"Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning" (Text Summarization Techniques: A Brief Survey, 2017)
Contributed by: Nitin Kumar
- Need for Text Summarization in Python
- Approaches used for Text Summarization
- Steps for Implementation
- Complete Code
Need for Text Summarization in Python
Organisations today, whether in online shopping, the private sector, government, the tourism and catering industry, or any other institute that offers customer services, are all keen to learn from their customers' feedback each time their services are used. Now consider that these companies receive an enormous amount of feedback and data every single day. It becomes quite a tedious task for management to analyse each of these data points and come up with insights.
However, we have reached a point in technological advancement where technology can take over such tasks so that we do not need to perform them ourselves. One field that makes this possible is Machine Learning. Machines have become capable of understanding human language with the help of NLP, or Natural Language Processing, and much of today's research builds on text analytics.
One application of text analytics and NLP is text summarization. Text summarization in Python helps shorten and condense the text in user feedback. It can be done with an algorithm that reduces a body of text while keeping its original meaning intact, thereby giving insights into the original text.
Two different approaches are used for Text Summarization
- Extractive Summarization
- Abstractive Summarization
In Extractive Summarization, we identify important phrases or sentences in the original text and extract only those. The extracted sentences form the summary.
In the Abstractive Summarization approach, we generate new sentences from the original text, in contrast to the extractive approach described above. The sentences generated this way may not even be present in the original text.
We are going to focus on using extractive methods. This method functions by identifying important sentences or excerpts from the text and reproducing them as part of the summary. In this approach, no new text is generated, only the existing text is used in the summarization process.
Steps for Implementation
Step 1: The first step is to import the required libraries. There are two NLTK libraries that are necessary for building an efficient text summarizer.
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
```
A collection of text is known as a corpus. This could be a data set such as the body of work of an author, the poems of a particular poet, and so on. In this blog, we will use a data set of predetermined stop words.
The tokenize module divides a text into a series of tokens. There are three main tokenizers: the sentence, word, and regex tokenizers. We will use only the word and sentence tokenizers.
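As a rough sketch of what the two tokenizers produce, here is an illustration that uses regular expressions as stand-ins, so it runs without downloading any NLTK data; NLTK's `sent_tokenize` and `word_tokenize` handle far more edge cases:

```python
import re

text = "Machines understand language. NLP makes this possible."

# sent_tokenize splits text into sentences; a naive regex stand-in:
sentences = re.split(r"(?<=[.!?])\s+", text)
# word_tokenize splits a sentence into word tokens; a naive stand-in:
words = re.findall(r"\w+|[.!?]", sentences[0])

print(sentences)  # ['Machines understand language.', 'NLP makes this possible.']
print(words)      # ['Machines', 'understand', 'language', '.']
```

Note that the punctuation mark becomes its own token, which is exactly how `word_tokenize` behaves as well.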
Step 2: Remove the Stop Words and store them in a separate array of words.
Words such as is, an, a, the, and for do not add value to the meaning of a sentence. For example, let us take a look at the following sentence:
GreatLearning is one of the most useful websites for ArtificialIntelligence aspirants.
After removing the stop words in the above sentence, we can narrow the number of words and preserve the meaning as follows:
[‘GreatLearning’, ‘one’, ‘useful’, ‘websites’, ‘ArtificialIntelligence’, ‘aspirants’, ‘.’]
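The filtering itself can be sketched as a simple list comprehension; the stop-word set below is a small illustrative stand-in for NLTK's full English list:

```python
# Small illustrative stop-word set; stopwords.words("english") in NLTK
# provides a much longer list.
stop_words = {"is", "of", "the", "most", "for"}

tokens = ["GreatLearning", "is", "one", "of", "the", "most", "useful",
          "websites", "for", "ArtificialIntelligence", "aspirants", "."]

# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# ['GreatLearning', 'one', 'useful', 'websites', 'ArtificialIntelligence', 'aspirants', '.']
```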
Step 3: We can then create a frequency table of the words.
A Python dictionary can keep a record of how many times each word appears in the text after the stop words are removed. We can then use this frequency table over each sentence to find which sentences have the most relevant content in the overall text.
```python
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()
```
Note that the set is stored as stopWords rather than stopwords, so that the variable does not shadow the imported stopwords corpus.
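The snippet above sets up the pieces but does not show the loop that actually fills the frequency table; one way to sketch it, with a hypothetical token list standing in for `word_tokenize(text)` so it runs without NLTK data:

```python
# Hypothetical tokens standing in for word_tokenize(text).
words = ["Great", "learning", "is", "great", "for", "learning"]
stopWords = {"is", "for"}  # stand-in for set(stopwords.words("english"))

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    # Count each occurrence of every non-stop word.
    freqTable[word] = freqTable.get(word, 0) + 1

print(freqTable)  # {'great': 2, 'learning': 2}
```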
Step 4: We will assign a score to each sentence depending on the words it contains and the frequency table.
Here, we will use the sent_tokenize() method that can be used to create the array of sentences. We will also need a dictionary to keep track of the score of each sentence, and we can later go through the dictionary to create a summary.
```python
sentences = sent_tokenize(text)
sentenceValue = dict()
```
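The scoring loop itself is not reproduced above; a common sketch sums, for each sentence, the frequencies of the table words that appear in it. The inputs below are hypothetical stand-ins for the tokenized sentences and the Step 3 frequency table:

```python
# Hypothetical stand-ins for sent_tokenize(text) and the Step 3 table.
sentences = ["Great learning is great.", "The weather is nice."]
freqTable = {"great": 2, "learning": 1, "weather": 1, "nice": 1}

sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        # Crude substring check; real code would tokenize each sentence.
        if word in sentence.lower():
            sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq

print(sentenceValue)
# {'Great learning is great.': 3, 'The weather is nice.': 2}
```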
Step 5: To compare the sentences within the text, assign a threshold score.
One simple approach that can be used to compare the scores is to find an average score of a particular sentence. This average score can be a good threshold.
```python
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))
```
Apply the threshold value and store the sentences, in their original order, in the summary.
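That final step can be sketched as follows. The sentence scores here are hypothetical, and the 1.2 multiplier on the average is just one common way of tightening the threshold so that only clearly above-average sentences survive:

```python
# Hypothetical values carried over from the earlier steps.
sentences = ["Great learning is great.", "The weather is nice.",
             "Learning never stops."]
sentenceValue = {"Great learning is great.": 3,
                 "The weather is nice.": 1,
                 "Learning never stops.": 2}
average = sum(sentenceValue.values()) / len(sentenceValue)  # 2.0

# Keep only sentences scoring above the threshold, in original order.
summary = " ".join(s for s in sentences
                   if sentenceValue[s] > 1.2 * average)
print(summary)  # Great learning is great.
```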
Find Complete Code Here
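Since the full code is linked rather than reproduced here, below is a runnable end-to-end sketch of the same pipeline. It substitutes regex tokenization and a small stop-word list for the NLTK calls so that it runs without downloading any corpora; swap in `stopwords.words("english")`, `word_tokenize`, and `sent_tokenize` for the real thing:

```python
import re

# Small illustrative stop-word list standing in for NLTK's
# stopwords.words("english"), so this sketch needs no downloads.
STOP_WORDS = {"is", "a", "an", "the", "of", "for", "and", "to",
              "in", "that", "it", "on", "are", "was", "with"}

def summarize(text):
    # Steps 2-3: frequency table of non-stop words
    # (re.findall stands in for word_tokenize).
    words = re.findall(r"[A-Za-z']+", text.lower())
    freq_table = {}
    for word in words:
        if word in STOP_WORDS:
            continue
        freq_table[word] = freq_table.get(word, 0) + 1

    # Step 4: score each sentence by summing its word frequencies
    # (the regex split stands in for sent_tokenize).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentence_value = {}
    for sentence in sentences:
        for word in re.findall(r"[A-Za-z']+", sentence.lower()):
            if word in freq_table:
                sentence_value[sentence] = (
                    sentence_value.get(sentence, 0) + freq_table[word])

    # Step 5: average sentence score as the threshold, slightly tightened.
    average = sum(sentence_value.values()) / len(sentence_value)
    return " ".join(s for s in sentences
                    if sentence_value.get(s, 0) > 1.2 * average)

text = ("Machine learning is fun. Machine learning powers summarization. "
        "Summarization shortens text. The weather is nice.")
print(summarize(text))  # Machine learning powers summarization.
```

The sentence whose words occur most often across the whole text scores highest and is the one extracted, which is exactly the extractive behaviour described earlier.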
This brings us to the end of the blog on Text Summarization in Python. We hope that you were able to learn more about the concept. If you wish to learn more such concepts, do take up the Python for Machine Learning free online course offered by Great Learning Academy.