Natural Language Processing (NLP) Introduction:
Natural Language Processing (NLP) helps machines understand and analyse natural languages. It is an automated process for extracting required information from text data by applying machine learning algorithms.
When applying for job roles that deal with Natural Language Processing, applicants are often unsure what kind of questions the interviewer might ask. Apart from learning the basics of NLP, it is important to prepare specifically for the interviews. Check out this list of frequently asked NLP interview questions and answers, with explanations.
Top 10 NLP Interview Questions for Beginners
- What are the possible features of a text corpus?
- Which of the below are NLP use cases?
- TF-IDF helps you to establish?
- Transformer architecture was first introduced with?
- List 10 use cases to be solved using NLP techniques?
- Which NLP model gives the best accuracy amongst the following?
- Permutation Language models is a feature of
- What is the Naive Bayes algorithm, and when can we use it in NLP?
- Explain Dependency Parsing in NLP?
- What is text Summarization?
NLP Interview Questions and Answers with Explanations
1. Which of the following techniques can be used for keyword normalization, the process of converting a keyword into its base form?
c. Cosine Similarity
Lemmatization reduces a word to its base form, e.g. playing -> play, eating -> eat, etc.
Other options are meant for different purposes.
2. Which of the following techniques can be used to compute the distance between two word vectors?
b. Euclidean distance
c. Cosine Similarity
Answer: b) and c)
The distance between two word vectors can be computed using cosine similarity and Euclidean distance. Cosine similarity measures the cosine of the angle between the vectors of two words. A cosine value close to 1 indicates that the words are similar, and vice versa.
E.g. the cosine similarity between the words “Football” and “Cricket” will be closer to 1 than the cosine similarity between the words “Football” and “New Delhi”.
Python code to implement a cosine_similarity function would look like this:
import numpy as np
import wikipedia
from sklearn.feature_extraction.text import CountVectorizer

def cosine_similarity(x, y):
    # Dot product of the two vectors divided by the product of their norms
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

q1 = wikipedia.page('Strawberry')
q2 = wikipedia.page('Pineapple')
q3 = wikipedia.page('Google')
q4 = wikipedia.page('Microsoft')
cv = CountVectorizer()
# One row of word counts per page, in the order q1..q4
X = np.array(cv.fit_transform([q1.content, q2.content, q3.content, q4.content]).todense())
print("Strawberry Pineapple Cosine Similarity", cosine_similarity(X[0], X[1]))
print("Strawberry Google Cosine Similarity", cosine_similarity(X[0], X[2]))
print("Pineapple Google Cosine Similarity", cosine_similarity(X[1], X[2]))
print("Google Microsoft Cosine Similarity", cosine_similarity(X[2], X[3]))
print("Pineapple Microsoft Cosine Similarity", cosine_similarity(X[1], X[3]))
Strawberry Pineapple Cosine Similarity 0.8899200413701714
Strawberry Google Cosine Similarity 0.7730935582847817
Pineapple Google Cosine Similarity 0.789610214147025
Google Microsoft Cosine Similarity 0.8110888282851575
Document similarity is usually measured by how semantically close the content (or words) of the documents are to each other. When they are close, the similarity index is close to 1; otherwise it is near 0.
The Euclidean distance between two points is the length of the shortest path connecting them. It is usually computed using the Pythagorean theorem.
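To compare the two measures side by side, here is a minimal sketch using only the standard library. The three-dimensional word vectors are made up for illustration; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between points x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical word vectors, invented for this example.
football = [0.9, 0.8, 0.1]
cricket = [0.8, 0.9, 0.2]
new_delhi = [0.1, 0.2, 0.9]

print("cosine football/cricket:   ", cosine_similarity(football, cricket))
print("cosine football/new_delhi: ", cosine_similarity(football, new_delhi))
print("euclid football/cricket:   ", euclidean_distance(football, cricket))
print("euclid football/new_delhi: ", euclidean_distance(football, new_delhi))
```

With these assumed vectors, the similar pair has the higher cosine similarity and the smaller Euclidean distance, which is the pattern the explanation above describes.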
3. What are the possible features of a text corpus?
a. Count of the word in a document
b. Vector notation of the word
c. Part of Speech Tag
d. Basic Dependency Grammar
e. All of the above
All of the above can be used as features of the text corpus.
4. You created a document term matrix on the input data of 20K documents for a Machine learning model. Which of the following can be used to reduce the dimensions of data?
- Keyword Normalization
- Latent Semantic Indexing
- Latent Dirichlet Allocation
a. only 1
b. 2, 3
c. 1, 3
d. 1, 2, 3
5. Which of the text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection, and object detection.
a. Part of speech tagging
b. Skip Gram and N-Gram extraction
c. Continuous Bag of Words
d. Dependency Parsing and Constituency Parsing
6. Dissimilarity between words expressed using cosine similarity will have values significantly higher than 0.5
7. Which one of the following are keyword Normalization techniques
b. Part of Speech
c. Named entity recognition
Answer: a) and d)
Part of Speech (POS) tagging and Named Entity Recognition (NER) are not keyword normalization techniques. NER helps you extract entities such as Organization, Time, Date, City, etc. from the given sentence, whereas POS tagging helps you identify the Noun, Verb, Pronoun, Adjective, etc. among the given sentence tokens.
8. Which of the below are NLP use cases?
a. Detecting objects from an image
b. Facial Recognition
c. Speech Biometric
d. Text Summarization
a) And b) are Computer Vision use cases, and c) is Speech use case.
Only d) Text Summarization is an NLP use case.
9. In a corpus of N documents, one randomly chosen document contains a total of T terms and the term “hello” appears K times.
What is the correct value for the product of TF (term frequency) and IDF (inverse-document-frequency), if the term “hello” appears in approximately one-third of the total documents?
a. KT * Log(3)
b. T * Log(3) / K
c. K * Log(3) / T
d. Log(3) / KT
The formula for TF is K/T.
The formula for IDF is log(total docs / number of docs containing “hello”)
= log(1 / (⅓))
= log(3)
Hence the correct choice is c) K * Log(3) / T.
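The derivation can be checked numerically. This is a small sketch; the function name `tf_idf` and the concrete numbers (K = 4, T = 100, 300 of 900 documents) are assumptions picked for illustration, not values from the question.

```python
import math

def tf_idf(k, t, n_docs, docs_with_term):
    """TF-IDF with TF = k / t and IDF = log(n_docs / docs_with_term)."""
    tf = k / t
    idf = math.log(n_docs / docs_with_term)
    return tf * idf

# Hypothetical numbers: "hello" appears 4 times in a 100-term document
# and occurs in 300 of 900 documents (one-third of the corpus).
score = tf_idf(k=4, t=100, n_docs=900, docs_with_term=300)
closed_form = 4 * math.log(3) / 100  # K * log(3) / T

print(score, closed_form)
```

Both values agree up to floating-point rounding, matching option c.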
10. The algorithm decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents
a. Term Frequency (TF)
b. Inverse Document Frequency (IDF)
d. Latent Dirichlet Allocation (LDA)
11. The process of removing words like “and”, “is”, “a”, “an”, “the” from a sentence is called as
c. Stop word
d. All of the above
In stop word removal, all the stop words such as a, an, the, etc. are removed from the sentence. One can also define custom stop words for removal.
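A minimal sketch of stop word removal follows. The stop word list here is a small hand-picked set used only for illustration; libraries typically ship much longer lists that can be extended with custom words.

```python
# A small, hand-picked stop word list (an assumption for this example).
STOP_WORDS = {"and", "is", "a", "an", "the"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the words of `sentence` that are not stop words."""
    return [w for w in sentence.lower().split() if w not in stop_words]

print(remove_stop_words("The cat is sitting on a mat and purring"))
# Custom stop words can be added by passing a larger set.
print(remove_stop_words("The cat is sitting on a mat", STOP_WORDS | {"on"}))
```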
12. The process of converting a sentence or paragraph into tokens is referred to as Stemming
The statement describes the process of tokenization and not stemming, hence it is False.
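For illustration, tokenization can be sketched with a simple regular expression. This is a simplified assumption about what counts as a token; production tokenizers handle many more cases such as abbreviations, URLs, and subwords.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("He loves to play. She enjoys reading!")
print(tokens)
```

Stemming, by contrast, would operate on each of these tokens afterwards, trimming them towards a common root.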
13. Tokens are converted into numbers before giving to any Neural Network
In NLP, all words are converted into a number before feeding to a Neural Network.
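One simple way to turn tokens into numbers is a vocabulary lookup, sketched below. The helper name `build_vocab` is hypothetical; real pipelines usually also reserve ids for padding and unknown words.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = ["he", "loves", "to", "play", "he", "loves", "cricket"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(vocab)
print(ids)
```

The resulting integer ids are what an embedding layer would typically map to dense vectors.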
14. identify the odd one out
b. scikit learn
All the ones mentioned are NLP libraries except BERT, which is a pre-trained language model used to produce word embeddings.
15. TF-IDF helps you to establish?
a. most frequently occurring word in the document
b. most important word in the document
TF-IDF helps to establish how important a particular word is in the context of the document corpus. It takes into account the number of times the word appears in the document, offset by the number of documents in the corpus that contain the word.
- TF is the frequency of term divided by a total number of terms in the document.
- IDF is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.
- Tf.idf is then the multiplication of two values TF and IDF.
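Suppose that we have term count tables of a corpus consisting of only two documents, as listed here (only the rows used in the calculation are shown):
|Term||Document 1 Frequency||Document 2 Frequency|
|this||1||1|
|example||0||3|
|Total terms||5||7|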
The calculation of tf–idf for the term “this” is performed as follows:
tf(“this”, d1) = 1/5 = 0.2
tf(“this”, d2) = 1/7 = 0.14
idf(“this”, D) = log (2/2) =0
tfidf(“this”, d1, D) = 0.2* 0 = 0
tfidf(“this”, d2, D) = 0.14* 0 = 0
tf(“example”, d1) = 0/5 = 0
tf(“example”, d2) = 3/7 = 0.43
idf(“example”, D) = log(2/1) = 0.301
tfidf(“example”, d1, D) = tf(“example”, d1) * idf(“example”, D) = 0 * 0.301 = 0
tfidf(“example”, d2, D) = tf(“example”, d2) * idf(“example”, D) = 0.43 * 0.301 = 0.129
In its raw frequency form, TF is just the frequency of the “this” for each document. In each document, the word “this” appears once; but as document 2 has more words, its relative frequency is smaller.
An IDF is constant per corpus, and accounts for the ratio of documents that include the word “this”. In this case, we have a corpus of two documents and all of them include the word “this”. So TF–IDF is zero for the word “this”, which implies that the word is not very informative as it appears in all documents.
The word “example” is more interesting – it occurs three times, but only in the second document.
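The worked calculation can be reproduced with a short script. Only the counts for “this” and “example” and the document totals (5 and 7 terms) come from the calculation above; the remaining terms in each dictionary are assumed fillers chosen so the totals match. A base-10 logarithm is used because it reproduces the 0.301 value for idf.

```python
import math

# Term counts per document. Counts for "this" and "example" and the totals
# are from the worked example; the other terms are assumed placeholders.
d1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
d2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [d1, d2]

def tf(term, doc):
    """Term count divided by the total number of terms in the document."""
    return doc.get(term, 0) / sum(doc.values())

def idf(term, docs):
    """log10(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("this", "example"):
    for name, doc in (("d1", d1), ("d2", d2)):
        print(term, name, round(tfidf(term, doc, corpus), 3))
```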
16. The process of identifying people, organizations, etc. from a given sentence or paragraph is called
c. Stop word removal
d. Named entity recognition
17. Which one of the following is not a pre-processing technique
a. Stemming and Lemmatization
b. converting to lowercase
c. removing punctuations
d. removal of stop words
e. Sentiment analysis
Sentiment Analysis is not a pre-processing technique. It is done after pre-processing and is an NLP use case. All other listed ones are used as part of statement pre-processing.
18. In text mining, converting text into tokens and then converting them into an integer or floating-point vectors can be done using
c. Bag of Words
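CountVectorizer helps do the above, while the others are not applicable.
from sklearn.feature_extraction.text import CountVectorizer
text = ["Rahul is an avid writer, he enjoys studying understanding and presenting. He loves to play"]
vectorizer = CountVectorizer()
# Learn the vocabulary and encode the document in one step
vector = vectorizer.fit_transform(text)
print(vector.toarray())
[[1 1 1 1 2 1 1 1 1 1 1 1 1 1]]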
The second section of the interview questions covers advanced NLP techniques such as Word2Vec, GloVe word embeddings, and advanced models such as GPT, ELMo, BERT, XLNET based questions, and explanations.
19. Words represented as vectors are called as Neural Word Embeddings
Word2Vec, GloVe based models build word embedding vectors that are multidimensional.
20. Context modeling is supported with which one of the following word embeddings
- a. Word2Vec
- b) GloVe
- c) BERT
- d) All of the above
Only BERT (Bidirectional Encoder Representations from Transformers) supports context modelling, where the words before and after a token are taken into consideration. Word2Vec and GloVe produce a single static embedding per word, so the surrounding context is not considered.
21. Bidirectional context is supported by which of the following embedding
d. All the above
Only BERT provides a bidirectional context: the model uses the words on both the left and the right of a token to arrive at its context. Word2Vec and GloVe are static word embeddings and do not provide contextual representations.
22. Which one of the following Word embeddings can be custom trained for a specific subject
d. All the above
BERT allows transfer learning on existing pre-trained models and hence can be custom trained (fine-tuned) for a given specific subject. Word2Vec and GloVe embeddings, by contrast, are typically used as they are, without transfer learning on new text.
23. Word embeddings capture multiple dimensions of data and are represented as vectors
24. Word embedding vectors help establish distance between two tokens
One can use Cosine similarity to establish distance between two vectors represented through Word Embeddings
25. Language Biases are introduced due to historical data used during training of word embeddings, which one amongst the below is not an example of bias
a. New Delhi is to India, Beijing is to China
b. Man is to Computer, Woman is to Homemaker
Statement b) is a bias as it buckets Woman into Homemaker, whereas statement a) is not a biased statement.
26. which of the following will be a better choice to address NLP use cases such as semantic similarity, reading comprehension, and common sense reasoning
b. Open AI’s GPT
OpenAI’s GPT is able to learn complex patterns in data by using the Transformer model’s attention mechanism and hence is better suited for complex use cases such as semantic similarity, reading comprehension, and common sense reasoning.
27. Transformer architecture was first introduced with?
c. Open AI’s GPT
ULMFiT has an LSTM-based language modeling architecture. Among the models listed, OpenAI’s GPT was the first to replace this with the Transformer architecture.
28. Which of the following architecture can be trained faster and needs less amount of training data
a. LSTM based Language Modelling
b. Transformer architecture
Transformer architectures, used from GPT onwards, were faster to train and needed less training data.
29. Same word can have multiple word embeddings possible with ____________?
ELMo word embeddings support multiple embeddings for the same word. This lets the same word be used in different contexts, capturing the context rather than just the meaning of the word, unlike GloVe and Word2Vec. NLTK is not a word embedding.
30. For a given token, its input representation is the sum of embedding from the token, segment and position embedding
BERT uses token, segment and position embedding.
31. Trains two independent LSTM language model left to right and right to left and shallowly concatenates them
ELMo tries to train two independent LSTM language models (left to right and right to left) and concatenates the results to produce word embedding.
32. Uses unidirectional language model for producing word embedding
GPT is a unidirectional model: word embeddings are produced by training on information flowing from left to right. ELMo is bidirectional but shallow. Word2Vec provides simple word embeddings.
33. In this architecture, the relationship between all words in a sentence is modelled irrespective of their position. Which architecture is this?
a. OpenAI GPT
BERT Transformer architecture models the relationship between each word and all other words in the sentence to generate attention scores. These attention scores are later used as weights for a weighted average of all words’ representations which is fed into a fully-connected network to generate a new representation.
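The attention step described above can be sketched in plain Python. This is a deliberately simplified illustration: it omits the learned query, key, and value projections, multiple heads, and the fully-connected network, and the two-dimensional word vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """For each word vector, return the attention-weighted average of all vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every word, regardless of its position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Weighted average of all word representations.
        output = [sum(w * v[i] for w, v in zip(weights, vectors))
                  for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional representations for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds the attention weights one word assigns to every word in the sentence, and each row sums to 1.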
34. List 10 use cases to be solved using NLP techniques?
- Sentiment Analysis
- Language Translation (English to German, Chinese to English, etc..)
- Document Summarization
- Question Answering
- Sentence Completion
- Attribute extraction (Key information extraction from the documents)
- Chatbot interactions
- Topic classification
- Intent extraction
- Grammar or Sentence correction
- Image captioning
- Document Ranking
- Natural Language inference
35. Transformer model pays attention to the most important word in Sentence
Ans: a) Attention mechanisms in the Transformer model are used to model the relationship between all words and also provide weights to the most important word.
36. Which NLP model gives the best accuracy amongst the following?
Ans: b) XLNET
XLNET gave the best accuracy amongst these models at the time of its release. It outperformed BERT on 20 tasks and achieved state-of-the-art results on 18 tasks, including sentiment analysis, question answering, and natural language inference.
37. Permutation Language models is a feature of
XLNET provides permutation-based language modelling and is a key difference from BERT. In permutation language modeling, tokens are predicted in a random manner and not sequential. The order of prediction is not necessarily left to right and can be right to left. The original order of words is not changed but a prediction can be random.
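The idea of predicting tokens in a sampled order, while leaving their positions untouched, can be sketched as follows. This only illustrates which context is visible at each prediction step; it is not XLNET's training code, and the function name `permutation_contexts` is hypothetical.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample a prediction order and list the context visible at each step."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # prediction order; token positions stay unchanged
    steps = []
    seen = []
    for position in order:
        # The token at `position` is predicted from the tokens that came
        # earlier in the sampled order, which may lie to its left or right.
        context = sorted(seen)
        steps.append((position, tokens[position], [tokens[i] for i in context]))
        seen.append(position)
    return order, steps

tokens = ["New", "York", "is", "a", "city"]
order, steps = permutation_contexts(tokens)
print("prediction order:", order)
for position, target, context in steps:
    print(f"predict {target!r} at position {position} given {context}")
```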
The conceptual difference between BERT and XLNET can be seen from the following diagram.
38. Transformer XL uses relative positional embedding
Instead of embedding having to represent the absolute position of a word, Transformer XL uses an embedding to encode the relative distance between the words. This embedding is used to compute the attention score between any 2 words that could be separated by n words before or after.
39. What is the Naive Bayes algorithm, and when can we use it in NLP?
The Naive Bayes algorithm is a family of classifiers based on Bayes’ theorem, with the “naive” assumption that features are independent given the class. It can be used for a wide range of classification tasks, including sentiment prediction, spam filtering, document classification, and more.
Naive Bayes converges faster and requires less training data. Compared to discriminative models like logistic regression, a Naive Bayes model takes less time to train. It is well suited to multi-class problems and to text classification where the data is dynamic and changes frequently.
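As a rough illustration of how such a classifier works on text, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The tiny training set is made up, and the function names `train` and `predict` are hypothetical; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set for spam filtering.
examples = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team tomorrow", "ham"),
]
model = train(examples)
print(predict("free prize offer", model))
print(predict("team meeting tomorrow", model))
```

Training is a single pass of counting, which is why the model is quick to fit even when the data changes frequently.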
40. Explain Dependency Parsing in NLP?
Dependency parsing, a form of syntactic parsing, is the process of assigning a syntactic structure to a sentence by identifying the dependency relations between its words. This is crucial for understanding the relationships between the “head” words and the words that depend on them.
Dependency parsing can be a little complex because a sentence can have more than one dependency parse. Multiple parse trees are known as ambiguities. Dependency parsing needs to resolve these ambiguities in order to effectively assign a syntactic structure to a sentence.
Dependency parsing can be used in the semantic analysis of a sentence apart from the syntactic structuring.
41. What is text Summarization?
Text summarization is the process of shortening a long piece of text with its meaning and effect intact. Text summarization intends to create a summary of any given piece of text and outlines the main points of the document. This technique has improved in recent times and is capable of summarizing volumes of text successfully.
Text summarization has proved to be a blessing, since machines can summarise large volumes of text in very little time, a task that would otherwise be really time-consuming. There are two types of text summarization:
- Extraction-based summarization
- Abstraction-based summarization
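Extraction-based summarization can be sketched with a simple frequency heuristic: score each sentence by how common its words are across the whole text and keep the top-scoring ones. This is only one basic approach, assumed here for illustration; it does not remove stop words and is far simpler than modern summarizers.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return [s for s in sentences if s in top]

text = ("NLP helps machines understand language. "
        "Machines use NLP to summarize language. "
        "The weather was pleasant yesterday.")
print(extractive_summary(text))
print(extractive_summary(text, num_sentences=2))
```

Abstraction-based summarization, by contrast, generates new sentences rather than selecting existing ones, and generally requires a trained language model.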
42. What is NLTK? How is it different from spaCy?
NLTK, or Natural Language Toolkit, is a suite of libraries and programs used for symbolic and statistical natural language processing. This toolkit contains some of the most powerful libraries that can work on different ML techniques to break down and understand human language. NLTK is used for lemmatization, punctuation handling, character counts, tokenization, and stemming. The differences between NLTK and spaCy are as follows:
- While NLTK has a collection of algorithms to choose from, spaCy contains only the best-suited algorithm for a problem in its toolkit
- NLTK supports a wider range of languages compared to spaCy (spaCy supported only 7 languages at the time of writing)
- While spaCy has an object-oriented library, NLTK has a string processing library
- spaCy supports word vectors while NLTK does not
43. What is information extraction?
Information extraction in the context of Natural Language Processing refers to the technique of automatically extracting structured information from unstructured sources in order to ascribe meaning to it. This can include extracting information about the attributes of entities, the relationships between different entities, and more. The various modules of information extraction include:
- Tagger Module
- Relation Extraction Module
- Fact Extraction Module
- Entity Extraction Module
- Sentiment Analysis Module
- Network Graph Module
- Document Classification & Language Modeling Module
44. What is Bag of Words?
Bag of Words is a commonly used model that depends on word frequencies or occurrences to train a classifier. This model creates an occurrence matrix for documents or sentences irrespective of its grammatical structure or word order.
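A minimal sketch of building such an occurrence matrix with the standard library follows. The two example documents are made up, and the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["the cat sat", "the dog sat on the cat"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)

# Word order is ignored: a reordered document gets the same vector.
_, reordered = bag_of_words(["sat the cat", "the dog sat on the cat"])
print(reordered[0] == matrix[0])
```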
45. What is Pragmatic Ambiguity in NLP?
Pragmatic ambiguity arises when the context of a sentence leaves it open to more than one interpretation. More often than not, we come across sentences containing words or phrases with multiple meanings, so the intended meaning depends entirely on the context in which the sentence is used. This multiplicity of interpretations is known as pragmatic ambiguity in NLP.
46. What is Masked Language Model?
A masked language model learns deep representations for downstream tasks by reconstructing the original text from a corrupted input in which some words are masked. Such a model is often used to predict the missing words in a sentence.
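The input corruption step can be sketched as below. This is a simplified assumption of how masking might be done: it replaces randomly chosen tokens with a `[MASK]` placeholder and records the original words as prediction targets. Actual models such as BERT use a more elaborate masking procedure.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = token  # the model is trained to recover this word
        else:
            masked.append(token)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # corrupted input fed to the model
print(targets)  # position -> original word the model should predict
```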
47. What is the difference between NLP and CI(Conversational Interface)?
The difference between NLP and CI is as follows:
|Natural Language Processing||Conversational Interface|
|NLP attempts to help machines understand and learn how language concepts work.||CI focuses only on providing users with an interface to interact with.|
|NLP uses AI technology to identify, understand, and interpret the requests of users through language.||CI uses voice, chat, videos, images and more such conversational aid to create the user interface.|
48. What are the best NLP Tools?
Some of the best NLP tools from open sources are:
- Natural language Toolkit
- Stanford NLP
49. What is POS tagging?
Part-of-speech tagging, better known as POS tagging, is the process of identifying the words in a document and labelling each with its part of speech based on its context. POS tagging is also known as grammatical tagging, since it involves understanding grammatical structures and identifying the respective components.
POS tagging is a complicated process because the same word can be a different part of speech depending on the context. For the same reason, the generic process used for word mapping is quite ineffective for POS tagging.
50. What is NER?
Named entity recognition, more commonly known as NER, is the process of identifying specific entities in a text document that are more informative and have a unique context. These often denote places, people, organisations, and more. Even though it seems like these entities are proper nouns, the NER process goes well beyond identifying nouns. In fact, NER involves entity chunking or extraction, wherein entities are segmented and categorised under different predefined classes. This step further helps in extracting information.
There, you have it – all the probable questions for your NLP interview. Now go, give it your best shot. Check out Great Learning’s Deep Learning course to further your knowledge of the domain.