NLTK with Python
  1. What is NLTK?
  2. Accessing a dataset in NLTK
  3. Data pre-processing
    Tokenization
    Punctuation Removal
    Stop Words Removal
    Stemming
    Lemmatization
    POS Tagging
    Chunking
  4. Synonyms using wordnet
  5. Word Embeddings
  6. Project in NLP

What is NLTK?

NLTK (Natural Language Toolkit) is a popular open-source Python library that provides prebuilt functions, utilities and corpora for working with text. It is one of the most widely used libraries for natural language processing and computational linguistics.

NLTK Installation Process

On a Windows system with Python already installed, open a command prompt and type:

pip install nltk

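Note: in a notebook environment such as Jupyter or Google Colab, run the command from a cell by prefixing it with an exclamation mark. On a hosted notebook the package may be installed only for the current session.

!pip install nltk

NLTK dataset download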

There are several datasets which can be used with NLTK. To use them, we first need to download them.
We can download them by executing this:
import nltk
nltk.download()

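Calling nltk.download() with no arguments opens a downloader window; click Download in the pop-up. To fetch a single package without the window, pass its identifier instead, for example nltk.download('movie_reviews').

Once the download finishes, we are set to go.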

Accessing a dataset in NLTK

A dataset is referred to as a corpus in NLTK.

A corpus is essentially a collection of text that serves as input. For further processing, a corpus is broken down into smaller pieces, as we will see in later sections.

Several corpora were downloaded in the earlier step; we use the movie_reviews corpus for this demonstration.


import nltk
from nltk.corpus import movie_reviews
movie_reviews.words()

We are now all set up for some text processing with NLTK.

Data pre-processing 

Data pre-processing converts raw text into a cleaner, more uniform form that is easier for a machine to work with. Some standard practices for doing that are:

1. Tokenization

Tokenization is the process of breaking text up into smaller chunks as per our requirements.

NLTK has a submodule, tokenize, which we will be using.

  • Word Tokenization

Word tokenization is the process of breaking a sentence into words. The word_tokenize function is used here; it returns the words as a list.


from nltk.tokenize import word_tokenize
data = "I pledge to be a data scientist one day"
tokenized_text=word_tokenize(data)
print(tokenized_text)
print(type(tokenized_text))

  • Sentence Tokenization

Sentence tokenization is the process of breaking a corpus into sentence-level tokens. It is typically used when the corpus consists of multiple paragraphs; each paragraph is broken down into sentences.


from nltk.tokenize import sent_tokenize
# A space after each full stop lets sent_tokenize detect the sentence boundaries.
para="""Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked. In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies. The most commonly used cake ingredients include flour, sugar, eggs, butter or oil or margarine, a liquid, and leavening agents, such as baking soda or baking powder. Common additional ingredients and flavourings include dried, candied, or fresh fruit, nuts, cocoa, and extracts such as vanilla, with numerous substitutions for the primary ingredients. Cakes can also be filled with fruit preserves, nuts or dessert sauces (like pastry cream), iced with buttercream or other icings, and decorated with marzipan, piped borders, or candied fruit."""
tokenized_para=sent_tokenize(para)
print(tokenized_para)
print(type(tokenized_para))

2. Punctuation Removal

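Punctuation marks often carry little information for NLP tasks such as classification, so they are commonly removed. Here we do this with a tokenizer that keeps only runs of word characters.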


from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
result = tokenizer.tokenize("Wow! I am excited to learn data science")
print(result)
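
One caveat worth checking before relying on the \w+ pattern is that it also splits contractions at the apostrophe. The sketch below compares it with a pattern that keeps apostrophes inside words. The second pattern and the sample sentence are illustrative choices, not part of the original example.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "Don't panic! It's only data science."

# Pattern used in the example above: runs of letters, digits and underscores.
plain = RegexpTokenizer(r"\w+")
# Assumed alternative: also allow apostrophes inside a token.
keep_apostrophes = RegexpTokenizer(r"[\w']+")

plain_tokens = plain.tokenize(sentence)
apostrophe_tokens = keep_apostrophes.tokenize(sentence)

print(plain_tokens)       # contractions are split in two
print(apostrophe_tokens)  # contractions stay whole
```

Which pattern is preferable depends on whether later steps, such as stop-word removal, expect whole contractions.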

3. Stop Words Removal

Stop words are words that occur very frequently in a language, e.g. a, an, the, in, and carry little meaning on their own. They are removed from the corpus for the sake of text normalization.


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
to_be_removed = set(stopwords.words('english'))
para="""Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations 
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, 
and pies."""
tokenized_para=word_tokenize(para)
print(tokenized_para)
# NLTK's stop-word list is lower-case, so compare the lower-cased token
modified_token_list=[word for word in tokenized_para if word.lower() not in to_be_removed]
print(modified_token_list)
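
The filtering step itself does not depend on NLTK. The following sketch uses a tiny hand-written stop-word set (an assumption for illustration; the real list comes from stopwords.words('english')) to show why lower-casing matters when tokens are compared against the list.

```python
# Hypothetical miniature stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "in", "is", "of", "that"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lower-cased form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "cake", "is", "a", "form", "of", "sweet", "food"]

# Case-sensitive comparison misses the capitalised "The".
case_sensitive = [tok for tok in tokens if tok not in STOP_WORDS]
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)
print(case_insensitive)
```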

4. Stemming

Stemming reduces inflected words to a common base form. Words with the same origin are reduced to the same stem, which may or may not be a valid word.

NLTK has different stemmers which implement different algorithms.

Some of them are:

  • Porter Stemmer

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked. In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations 
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies."""
tk_content=word_tokenize(content)
stemmed_words = [stemmer.stem(i) for i in tk_content] 
print(stemmed_words)

  • Lancaster Stemmer

from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
stemmer = LancasterStemmer()
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations 
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, 
and pies."""

tk_content=word_tokenize(content)
stemmed_words = [stemmer.stem(i) for i in tk_content]
print(stemmed_words)
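
To see how the two stemmers differ, it can help to run them side by side on the same words. This is a minimal sketch; the word list is an arbitrary sample taken from the paragraph above, and neither stemmer needs any downloaded data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

words = ["cakes", "baked", "modifications", "preparations", "usually"]

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
}

# Map each word to the stem produced by every stemmer.
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(word, stems)
```

The Lancaster stemmer is generally the more aggressive of the two, so its stems tend to be shorter.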

5. Lemmatization

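Lemmatization is another way of reducing inflection. It differs from stemming in that it reduces each word to its dictionary form (lemma), which is always a valid word, whereas stemming sometimes produces strings that are not words. Note that WordNetLemmatizer treats every word as a noun unless a part of speech is passed through the pos argument, for example lemmatizer.lemmatize(word, pos='v') for verbs.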

WordNet Lemmatizer


import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations 
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, 
and pies."""

tk_content=word_tokenize(content)
lemmatized_words = [lemmatizer.lemmatize(i) for i in tk_content] 
print(lemmatized_words)

6. POS Tagging

POS tagging is the process of identifying the part of speech of each word in a sentence. The tagger identifies nouns, pronouns, adjectives, etc. and assigns a POS tag to each word. There are different tagsets; here we use the universal tagset.


import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations 
that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, 
and pies."""
words= [word_tokenize(i) for i in sent_tokenize(content)]
pos_tag= [nltk.pos_tag(i,tagset="universal") for i in words]
print(pos_tag)

7. Chunking

Chunking, also known as shallow parsing, is applied to POS-tagged data to extract phrases from it. Words are grouped according to a pre-defined rule written over POS tags, and the tagged text is parsed with that rule to build phrases. In the code below, the rule NP: {<DT>?<JJ>*<NN>} matches an optional determiner, followed by any number of adjectives, followed by a noun.


import nltk
from nltk.tokenize import word_tokenize
content = "Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked."
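tokenized_text = word_tokenize(content)
tagged_token = nltk.pos_tag(tokenized_text)
# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
phrases = nltk.RegexpParser(grammar)
result = phrases.parse(tagged_token)
print(result)
# draw() opens a separate window with the parse tree, so it needs a graphical display
result.draw()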
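
The printed tree can be awkward to use programmatically. As a sketch, the noun phrases can be pulled out of the parse result by walking its subtrees. The tokens below are tagged by hand, an assumption made so that the snippet runs without the tagger model; in practice they would come from nltk.pos_tag.

```python
import nltk

# Hand-tagged tokens (Penn Treebank tags), so no tagger model is needed.
tagged_tokens = [
    ("the", "DT"), ("sweet", "JJ"), ("cake", "NN"),
    ("is", "VBZ"),
    ("a", "DT"), ("dessert", "NN"),
]

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged_tokens)

# Keep only the NP subtrees and join their words back into phrases.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```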

Bag of Words

Bag of words is a simple model that describes the contents of a corpus by the number of times each word occurs. It ignores grammar and word order and maps each word to its count in each document. In the code below, each line of the text is treated as a separate document.


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies."""

count_vectorizer = CountVectorizer()

bag_of_words = count_vectorizer.fit_transform(content.splitlines())

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(bag_of_words.toarray(), columns = count_vectorizer.get_feature_names_out())
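
To make the counting step explicit, here is a minimal pure-Python sketch of the same idea. It is a simplification, not a reimplementation of CountVectorizer: the two short documents are made up for illustration, tokens are split on whitespace only, and one-letter tokens such as "a" are kept, whereas CountVectorizer drops them by default.

```python
from collections import Counter

documents = [
    "cake is a sweet food",
    "cakes were modifications of bread but cakes cover a wide range",
]

# One Counter per document: word -> number of occurrences.
counts = [Counter(doc.split()) for doc in documents]

# A shared, sorted vocabulary defines the column order.
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per word.
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```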

Synonyms using wordnet

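WordNet is a lexical database available as a corpus in NLTK; it can be used to look up synonyms and antonyms of words.

Here is a short program that inspects the first sense of a word and then collects synonyms and antonyms:


from nltk.corpus import wordnet

# Each synset is one sense of the word.
syns = wordnet.synsets("dog")
print(syns[0].name())              # identifier of the first sense
print(syns[0].lemmas()[0].name())  # first lemma (word form) of that sense
print(syns[0].definition())        # dictionary-style definition
print(syns[0].examples())          # example sentences

# Collect synonyms and antonyms across all senses of a word.
synonyms, antonyms = set(), set()
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        antonyms.update(ant.name() for ant in lemma.antonyms())
print(synonyms)
print(antonyms)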

Frequency distribution of words

We can generate the frequency distribution of words in a corpus with the FreqDist class in NLTK. Plotting the result shows how often each word occurs, as in the code below.


import nltk
import matplotlib.pyplot as plt
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread"""
words = nltk.tokenize.word_tokenize(content)
fd = nltk.FreqDist(words)
fd.plot()
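
fd.plot() needs matplotlib and a display. When only the numbers are needed, the same object can be queried directly. A minimal sketch follows; the shortened text, the lower-casing and the use of RegexpTokenizer are choices made here for illustration, so that the snippet needs no downloaded data.

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

content = ("Cake is a sweet food. Cakes were modifications of bread, "
           "but cakes now cover a wide range.")

# Lower-case first so that "Cakes" and "cakes" are counted together.
tokens = RegexpTokenizer(r"\w+").tokenize(content.lower())
fd = FreqDist(tokens)

print(fd["cakes"])        # count for a single word
print(fd.most_common(3))  # three most frequent tokens with their counts
print(fd.N())             # total number of tokens counted
```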

Word Embeddings

Word embedding is an NLP technique that tries to capture the context, semantic meaning and relationships of words by representing each word as a vector. Words whose vectors lie close together in the vector space tend to be similar in meaning. The embedding technique we discuss here is Word2Vec. We will train it with the help of Gensim, another popular NLP library.


from gensim.models import Word2Vec
import nltk
# define training data
content="""Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies."""
sentences=nltk.sent_tokenize(content)
words=[]

for i in sentences:
    words.append(nltk.word_tokenize(i))

# train model
model = Word2Vec(words, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary (gensim 4.x; older versions used model.wv.vocab)
word_vec_words = list(model.wv.key_to_index)
print(word_vec_words)

# access vector for one word (vectors live on model.wv)
print(model.wv['sugar'])

# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
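
Similarity between word vectors is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors (they are not output of the model above, which is trained on far too little text to give meaningful neighbours). With a trained Gensim model, model.wv.similarity(word1, word2) returns the same measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
toy_vectors = {
    "cake":   [0.9, 0.8, 0.1],
    "pastry": [0.8, 0.9, 0.2],
    "flour":  [0.1, 0.2, 0.9],
}

cake_pastry = cosine_similarity(toy_vectors["cake"], toy_vectors["pastry"])
cake_flour = cosine_similarity(toy_vectors["cake"], toy_vectors["flour"])

print(cake_pastry)  # close to 1: the vectors point in similar directions
print(cake_flour)   # noticeably lower
```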

Project in NLP

As a final project, we will build a text classification model.

The dataset we will be using is the IMDB dataset, which ships with Keras.

The dataset contains movie reviews, each labelled as positive or negative.

The task is to classify each review by its sentiment.

For the sake of simplicity, we keep only the 10,000 most frequent words (num_words=10000). You are free to experiment with a larger vocabulary; the execution time increases with it.


import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import layers
from keras.datasets import imdb
 
(train_data, train_target), (test_data, test_target) = imdb.load_data(num_words=10000)
dt = np.concatenate((train_data, test_data), axis=0)
tar = np.concatenate((train_target, test_target), axis=0)
 
def convert(sequences, dimension = 10000):
    # one row per review, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results
 
dt = convert(dt)
tar = np.array(tar).astype("float32")
test_x = dt[:9000]
test_y = tar[:9000]
train_x = dt[9000:]
train_y = tar[9000:]
model = models.Sequential()
# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))
# Hidden - Layers
model.add(layers.Dropout(0.4, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
# Output- Layer
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()
# compiling the model
 
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)
results = model.fit(
 train_x, train_y,
 epochs= 2,
 batch_size = 500,
 validation_data = (test_x, test_y)
)
# the history key is "val_accuracy" in current Keras; older versions used "val_acc"
print("Test-Accuracy:", np.mean(results.history["val_accuracy"]))

In the above code, we first import the prebuilt dataset along with the other dependencies.

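The convert function turns each review, stored as a list of word indices, into a fixed-length vector of 10,000 zeros and ones, with a one at every index that occurs in the review.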
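
The encoding is easier to follow on a small input. This sketch repeats the same logic on two made-up "reviews" over a vocabulary of only eight word indices; the function name multi_hot and the toy data are illustrative, not part of the project code.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode lists of word indices as rows of zeros and ones."""
    encoded = np.zeros((len(sequences), dimension))
    for row, indices in enumerate(sequences):
        encoded[row, indices] = 1  # sets every listed column of this row
    return encoded

# Two made-up reviews; index 3 occurs twice in the first one.
toy_reviews = [[1, 3, 3, 5], [0, 7]]
encoded = multi_hot(toy_reviews, dimension=8)
print(encoded)
```

The repeated index still produces a single 1, so this representation records which words occur but not how often or in what order.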

We then divide our dataset into train and test sets.

Finally, we compile our model using compile() with the optimizer set to adam, a widely used default optimizer in Keras. Note that we use binary_crossentropy as the loss function, because there are only two classes. The output layer has a single sigmoid unit, so the model returns one value between 0 and 1 for each review: the predicted probability that the review is positive.

The dropout layers in the model help us regularize the model.

We have used only two epochs for the demonstration. You can increase the number of epochs, which may improve accuracy.