{"id":13981,"date":"2020-09-11T10:58:40","date_gmt":"2020-09-11T05:28:40","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/"},"modified":"2024-09-02T17:17:11","modified_gmt":"2024-09-02T11:47:11","slug":"bag-of-words","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/","title":{"rendered":"An Introduction to Bag of Words (BoW) | What is Bag of Words?"},"content":{"rendered":"\n<p>Bag of words is a Natural Language Processing technique of text modelling. Through this blog, we will learn more about why Bag of Words is used, we will understand the concept with the help of an example, learn more about it's implementation in Python, and more. <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"#sh1\">What is Bag of Words in NLP?<\/a><\/li>\n\n\n\n<li><a href=\"#sh2\">Why is the Bag of Words algorithm used?<\/a><\/li>\n\n\n\n<li><a href=\"#ed3\">Understanding Bag of Words with an example<\/a><\/li>\n\n\n\n<li><a href=\"#sh3\">Implementing Bag of Words with Python<\/a><\/li>\n\n\n\n<li><a href=\"#ed4\">Create a Bag of Words Model with Sklearn<\/a><\/li>\n\n\n\n<li><a href=\"#sh4\">What are N-Grams?<\/a><\/li>\n\n\n\n<li><a href=\"#ed6\">What is Tf-Idf ( term frequency-inverse document frequency)?<\/a><\/li>\n\n\n\n<li><a href=\"#sh5\">Feature Extraction with Tf-Idf vectorizer<\/a><\/li>\n\n\n\n<li><a href=\"#sh6\">Limitations of Bag of Word<\/a><\/li>\n<\/ol>\n\n\n\n<p>Using Natural Language Processing, we make use of the text data available across the internet to generate insights for the business. In order to understand this huge amount of data and make insights from them, we need to make them usable. Natural language processing helps us to do so. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-a-bag-of-words-in-nlp\"><strong>What is a Bag of Words in NLP?<\/strong><\/h2>\n\n\n\n<p>Bag of words is a <a rel=\"noreferrer noopener\" aria-label=\"Natural Language Processing (opens in a new tab)\" href=\"https:\/\/www.mygreatlearning.com\/blog\/natural-language-processing-tutorial\/\" target=\"_blank\">Natural Language Processing<\/a> technique of text modelling. In technical terms, it is a method of feature extraction from text data. This approach is a simple and flexible way of extracting features from documents.<\/p>\n\n\n\n<p>A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a \u201cbag\u201d of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-is-the-bag-of-words-algorithm-used\"><strong>Why is the Bag-of-Words algorithm used?<\/strong><\/h2>\n\n\n\n<p>So why bag-of-words? What is wrong with using the plain text directly?&nbsp;<\/p>\n\n\n\n<p>One of the biggest problems with text is that it is messy and unstructured, while <a href=\"https:\/\/www.mygreatlearning.com\/blog\/what-is-machine-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"machine learning (opens in a new tab)\">machine learning<\/a> algorithms prefer structured, well-defined, fixed-length inputs. By using the Bag-of-Words technique, we can convert variable-length texts into a fixed-length <strong>vector<\/strong>.<\/p>\n\n\n\n<p>Also, at a more granular level, machine learning models work with numerical data rather than textual data. 
So to be more specific, by using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"understanding-bag-of-words-with-an-example\"><strong>Understanding Bag of Words with an example<\/strong><\/h2>\n\n\n\n<p>Let us see an example of how the bag of words technique converts text into vectors.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"example1-without-preprocessing\">Example (1)<strong> without preprocessing:<\/strong>&nbsp;<\/h3>\n\n\n\n<p>Sentence 1:&nbsp; \u201cWelcome to Great Learning, Now start learning\u201d<\/p>\n\n\n\n<p>Sentence 2: \u201cLearning is a good practice\u201d<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Sentence 1<\/strong><\/td><td><strong>Sentence 2<\/strong><\/td><\/tr><tr><td>Welcome<\/td><td>Learning<\/td><\/tr><tr><td>to<\/td><td>is<\/td><\/tr><tr><td>Great<\/td><td>a<\/td><\/tr><tr><td>Learning<\/td><td>good<\/td><\/tr><tr><td>,<\/td><td>practice<\/td><\/tr><tr><td>Now<\/td><td>&nbsp;<\/td><\/tr><tr><td>start<\/td><td><br><\/td><\/tr><tr><td>learning<\/td><td><br><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Step 1: Go through all the words in the above text and make a list of all of the words in our model vocabulary.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Welcome<\/li>\n\n\n\n<li>to<\/li>\n\n\n\n<li>Great<\/li>\n\n\n\n<li>Learning<\/li>\n\n\n\n<li>,<\/li>\n\n\n\n<li>Now<\/li>\n\n\n\n<li>start<\/li>\n\n\n\n<li>learning<\/li>\n\n\n\n<li>is<\/li>\n\n\n\n<li>a<\/li>\n\n\n\n<li>good<\/li>\n\n\n\n<li>practice<\/li>\n<\/ul>\n\n\n\n<p>Note that the words \u2018Learning\u2019 and \u2018learning\u2019 are not the same here because of the difference in their cases, and hence both appear in the list. 
Also, note that the comma \u2018,\u2019 is taken into the list as well.<\/p>\n\n\n\n<p>Because we know the vocabulary has 12 words, we can use a fixed-length document representation of 12, with one position in the vector to score each word.<\/p>\n\n\n\n<p>The scoring method we use here is to count the occurrences of each word and mark 0 for absence. This is the scoring method most generally used.<\/p>\n\n\n\n<p>The scoring of sentence 1 would look as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Word<\/strong><\/td><td><strong>Frequency<\/strong><\/td><\/tr><tr><td>Welcome<\/td><td>1<\/td><\/tr><tr><td>to<\/td><td>1<\/td><\/tr><tr><td>Great<\/td><td>1<\/td><\/tr><tr><td>Learning<\/td><td>1<\/td><\/tr><tr><td>,<\/td><td>1<\/td><\/tr><tr><td>Now<\/td><td>1<\/td><\/tr><tr><td>start<\/td><td>1<\/td><\/tr><tr><td>learning<\/td><td>1<\/td><\/tr><tr><td>is<\/td><td>0<\/td><\/tr><tr><td>a<\/td><td>0<\/td><\/tr><tr><td>good<\/td><td>0<\/td><\/tr><tr><td>practice<\/td><td>0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Writing the above frequencies in the vector&nbsp;<\/p>\n\n\n\n<p>Sentence 1 \u279d<strong> [ 1,1,1,1,1,1,1,1,0,0,0,0 ] <\/strong><\/p>\n\n\n\n<p>Now for sentence 2, the scoring would look like&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Word<\/strong><\/td><td><strong>Frequency<\/strong><\/td><\/tr><tr><td>Welcome<\/td><td>0<\/td><\/tr><tr><td>to<\/td><td>0<\/td><\/tr><tr><td>Great<\/td><td>0<\/td><\/tr><tr><td>Learning<\/td><td>1<\/td><\/tr><tr><td>,<\/td><td>0<\/td><\/tr><tr><td>Now<\/td><td>0<\/td><\/tr><tr><td>start<\/td><td>0<\/td><\/tr><tr><td>learning<\/td><td>0<\/td><\/tr><tr><td>is<\/td><td>1<\/td><\/tr><tr><td>a<\/td><td>1<\/td><\/tr><tr><td>good<\/td><td>1<\/td><\/tr><tr><td>practice<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Similarly, writing the above frequencies in the vector form<\/p>\n\n\n\n<p>Sentence 2 \u279d<strong> [ 0,0,0,1,0,0,0,0,1,1,1,1 ] <\/strong><br><\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Sentence<\/strong><\/td><td>Welcome<\/td><td>to<\/td><td>Great<\/td><td>Learning<\/td><td>,<\/td><td>Now<\/td><td>start<\/td><td>learning<\/td><td>is<\/td><td>a<\/td><td>good<\/td><td>practice<\/td><\/tr><tr><td><strong>Sentence1<\/strong><\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td><strong>Sentence2<\/strong><\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>1<\/td><td>1<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>But is this the best way to perform a bag of words? The above example was not ideal. The words Learning and learning, although having the same meaning, are counted as two separate words. Also, a comma \u2018,\u2019, which does not convey any information, is included in the vocabulary.<\/p>\n\n\n\n<p>Let us make some changes and see how we can use bag of words in a more effective way.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"example2-with-preprocessing\"><strong>Example<\/strong> (2) <strong>with preprocessing<\/strong>:&nbsp;<\/h3>\n\n\n\n<p>Sentence 1: \u201cWelcome to Great Learning, Now start learning\u201d<\/p>\n\n\n\n<p>Sentence 2: \u201cLearning is a good practice\u201d<br><\/p>\n\n\n\n<p><strong>Step 1<\/strong>: Convert the above sentences to lower case, as the case of a word does not hold any information.<\/p>\n\n\n\n<p><strong>Step 2<\/strong>: Remove special characters and stopwords from the text. 
Stopwords are words that do not contain much information about the text, like \u2018is\u2019, \u2018a\u2019, \u2018the\u2019, and many more.<\/p>\n\n\n\n<p>After applying the above steps, the sentences are changed to<\/p>\n\n\n\n<p>Sentence 1:&nbsp; \u201cwelcome great learning now start learning\u201d<\/p>\n\n\n\n<p>Sentence 2: \u201clearning good practice\u201d<\/p>\n\n\n\n<p>Although the above sentences no longer read naturally, the maximum information is contained in these words.<\/p>\n\n\n\n<p><strong>Step 3<\/strong>: Go through all the words in the above text and make a list of all of the words in our model vocabulary.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>welcome<\/li>\n\n\n\n<li>great<\/li>\n\n\n\n<li>learning<\/li>\n\n\n\n<li>now<\/li>\n\n\n\n<li>start<\/li>\n\n\n\n<li>good<\/li>\n\n\n\n<li>practice<\/li>\n<\/ul>\n\n\n\n<p>Now, as the vocabulary has only 7 words, we can use a fixed-length document representation of 7, with one position in the vector to score each word.<\/p>\n\n\n\n<p>The scoring method we use here is the same as used in the previous example. 
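The preprocessing in Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal sketch, not the exact implementation shown later in the article: the `preprocess` helper and the toy stopword list are illustrative assumptions, and a real project would use a fuller list (for example from NLTK).

```python
import string

# Toy stopword list covering just these two sentences (an assumption for illustration).
stopwords = {"to", "is", "a"}

def preprocess(sentence):
    # Step 1: lower-case the sentence.
    # Step 2: strip punctuation, split into tokens, and drop stopwords.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in stopwords]

print(preprocess("Welcome to Great Learning, Now start learning"))
# ['welcome', 'great', 'learning', 'now', 'start', 'learning']
print(preprocess("Learning is a good practice"))
# ['learning', 'good', 'practice']
```

The cleaned token lists match the preprocessed sentences above, from which the 7-word vocabulary is built.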
For sentence 1, the count of words is as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Word<\/strong><\/td><td><strong>Frequency<\/strong><\/td><\/tr><tr><td>welcome<\/td><td>1<\/td><\/tr><tr><td>great<\/td><td>1<\/td><\/tr><tr><td>learning<\/td><td>2<\/td><\/tr><tr><td>now<\/td><td>1<\/td><\/tr><tr><td>start<\/td><td>1<\/td><\/tr><tr><td>good<\/td><td>0<\/td><\/tr><tr><td>practice<\/td><td>0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Writing the above frequencies in the vector&nbsp;<br><\/p>\n\n\n\n<p>Sentence 1 \u279d <strong> [ 1,1,2,1,1,0,0 ] <\/strong><br><\/p>\n\n\n\n<p>Now for sentence 2, the scoring would look like&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Word<\/strong><\/td><td><strong>Frequency<\/strong><\/td><\/tr><tr><td>welcome<\/td><td>0<\/td><\/tr><tr><td>great<\/td><td>0<\/td><\/tr><tr><td>learning<\/td><td>1<\/td><\/tr><tr><td>now<\/td><td>0<\/td><\/tr><tr><td>start<\/td><td>0<\/td><\/tr><tr><td>good<\/td><td>1<\/td><\/tr><tr><td>practice<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Similarly, writing the above frequencies in the vector form<\/p>\n\n\n\n<p>Sentence 2 \u279d<strong>  [ 0,0,1,0,0,1,1 ] <\/strong><br><\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>Sentence<\/strong><\/td><td>welcome<\/td><td>great<\/td><td>learning<\/td><td>now<\/td><td>start<\/td><td>good<\/td><td>practice<\/td><\/tr><tr><td><strong>Sentence1<\/strong><\/td><td>1<\/td><td>1<\/td><td>2<\/td><td>1<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td><strong>Sentence2<\/strong><\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The approach used in example two is the one generally used in the Bag-of-Words technique, the reason being that the datasets used in machine learning are tremendously large and can 
contain a vocabulary of a few thousand or even millions of words. Hence, preprocessing the text before using bag-of-words is a better way to go.<\/p>\n\n\n\n<p>There are various preprocessing steps that can increase the performance of Bag-of-Words. Some of them are explained in great detail in this <a href=\"https:\/\/www.mygreatlearning.com\/blog\/natural-language-processing-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"blog (opens in a new tab)\">blog<\/a>.<\/p>\n\n\n\n<p>In the examples above, we used all the words from the vocabulary to form the vector, which is neither practical nor the best way to implement the BoW model. In practice, only a subset of the vocabulary, preferably the most common words, is used to form the vector.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"implementing-bag-of-words-algorithm-with-python\"><strong>Implementing Bag of Words Algorithm with Python<\/strong><\/h2>\n\n\n\n<p>In this section, we are going to implement the bag of words algorithm with Python. 
Also, this is a very basic implementation, meant to show how the bag of words algorithm works, so I would not recommend using it in your projects; instead, use the method described in the next section.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndef vectorize(tokens):\n    &#039;&#039;&#039;This function takes a list of words in a sentence as input\n    and returns a vector of the size of filtered_vocab. It puts 0 if the\n    word is not present in tokens and the count of the token if present.&#039;&#039;&#039;\n    vector=&#x5B;]\n    for w in filtered_vocab:\n        vector.append(tokens.count(w))\n    return vector\ndef unique(sequence):\n    &#039;&#039;&#039;This function returns a list in which the order is preserved\n    and no item repeats. The set() function is not used on its own here\n    because it does not preserve the original ordering.&#039;&#039;&#039;\n    seen = set()\n    return &#x5B;x for x in sequence if not (x in seen or seen.add(x))]\n#create a list of stopwords. You can import stopwords from nltk too\nstopwords=&#x5B;&quot;to&quot;,&quot;is&quot;,&quot;a&quot;]\n#list of special characters. You can use regular expressions too\nspecial_char=&#x5B;&quot;,&quot;,&quot;:&quot;,&quot; &quot;,&quot;;&quot;,&quot;.&quot;,&quot;?&quot;]\n#Write the sentences in the corpus, in our case, just two\nstring1=&quot;Welcome to Great Learning , Now start learning&quot;\nstring2=&quot;Learning is a good practice&quot;\n#convert them to lower case\nstring1=string1.lower()\nstring2=string2.lower()\n#split the sentences into tokens\ntokens1=string1.split()\ntokens2=string2.split()\nprint(tokens1)\nprint(tokens2)\n#create a vocabulary list\nvocab=unique(tokens1+tokens2)\nprint(vocab)\n#filter the vocabulary list\nfiltered_vocab=&#x5B;]\nfor w in vocab: \n    if w not in stopwords and w not in special_char: \n        filtered_vocab.append(w)\nprint(filtered_vocab)\n#convert sentences into vectors\nvector1=vectorize(tokens1)\nprint(vector1)\nvector2=vectorize(tokens2)\nprint(vector2)\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<figure class=\"wp-block-image size-large is-resized zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow.png\"><img decoding=\"async\" width=\"880\" height=\"163\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow.png\" alt=\"\" class=\"wp-image-15810\" style=\"width:715px;height:132px\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow.png 880w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow-300x56.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow-768x142.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/06\/bow-696x129.png 696w\" sizes=\"(max-width: 880px) 100vw, 880px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"create-a-bag-of-words-model-with-sklearn\"><strong>Create a<\/strong> <strong>Bag of Words Model with Sklearn<\/strong><\/h2>\n\n\n\n<p>We can use the <strong>CountVectorizer()<\/strong> function from the <a rel=\"noreferrer noopener\" aria-label=\"Sk-learn library (opens in a new tab)\" href=\"https:\/\/www.mygreatlearning.com\/blog\/open-source-python-libraries\/\" target=\"_blank\">Sk-learn library<\/a> to easily implement the above BoW model using <a rel=\"noreferrer noopener\" aria-label=\"Python (opens in a new tab)\" href=\"https:\/\/www.mygreatlearning.com\/blog\/python-tutorial-for-beginners-a-complete-guide\/\" target=\"_blank\">Python<\/a>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n\nsentence_1=&quot;This is a good job. I will not miss it for anything&quot;\nsentence_2=&quot;This is not good at 
all&quot;\n\n\n\nCountVec = CountVectorizer(ngram_range=(1,1), # to use bigrams ngram_range=(2,2)\n                           stop_words=&#039;english&#039;)\n#transform\nCount_data = CountVec.fit_transform(&#x5B;sentence_1,sentence_2])\n\n#create dataframe\ncv_dataframe=pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out())\nprint(cv_dataframe)\n\n\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-are-n-grams\"><strong>What are<\/strong> <strong>N-Grams?<\/strong><\/h2>\n\n\n\n<p>Again, the same questions: what are N-grams, and why do we use them? Let us understand this with the example below.<\/p>\n\n\n\n<p>Sentence 1: \u201cThis is a good job. I will not miss it for anything\u201d<\/p>\n\n\n\n<p>Sentence 2: \u201cThis is not good at all\u201d<\/p>\n\n\n\n<p>For this example, let us take a vocabulary of 5 words only, the five words being:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>good<\/li>\n\n\n\n<li>job<\/li>\n\n\n\n<li>miss<\/li>\n\n\n\n<li>not<\/li>\n\n\n\n<li>all<\/li>\n<\/ul>\n\n\n\n<p>So, the respective vectors for these sentences are:<\/p>\n\n\n\n<p>\u201cThis is a good job. I will not miss it for anything\u201d=<strong>[1,1,1,1,0]<\/strong><\/p>\n\n\n\n<p>\u201cThis is not good at all\u201d=<strong>[1,0,0,1,1]<\/strong><\/p>\n\n\n\n<p>Can you guess what the problem is here? Sentence 2 is a negative sentence and sentence 1 is a positive sentence. Does this reflect in any way in the vectors above? Not at all. So how can we solve this problem? 
Here come the N-grams to our rescue.<\/p>\n\n\n\n<p>An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence like \u201creally good\u201d, \u201cnot good\u201d, or \u201cyour homework\u201d, and a 3-gram (more commonly called a trigram) is a three-word sequence like \u201cnot at all\u201d or \u201cturn off light\u201d.<\/p>\n\n\n\n<p>For example, the bigrams in sentence 2 of the previous section, \u201cThis is not good at all\u201d, are as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u201cThis is\u201d<\/li>\n\n\n\n<li>\u201cis not\u201d<\/li>\n\n\n\n<li>\u201cnot good\u201d<\/li>\n\n\n\n<li>\u201cgood at\u201d<\/li>\n\n\n\n<li>\u201cat all\u201d<\/li>\n<\/ul>\n\n\n\n<p>Now, if instead of using just words in the above example we use bigrams (bag-of-bigrams) as shown above, the model can differentiate between sentence 1 and sentence 2. Using bigrams also makes tokens more understandable (for example, \u201cHSR Layout\u201d, in Bengaluru, is more informative than \u201cHSR\u201d and \u201clayout\u201d separately).<\/p>\n\n\n\n<p>So we can conclude that a bag-of-bigrams representation is much more powerful than plain bag-of-words, and in many cases proves very hard to beat.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-tf-idf-term-frequency-inverse-document-frequency\"><strong>What is Tf-Idf (term frequency-inverse document frequency)?<\/strong><\/h2>\n\n\n\n<p>The scoring method used above takes the count of each word and represents the word in the vector by that count. What does a word having a high word count signify?<\/p>\n\n\n\n<p>Does this mean that the word is important in retrieving information about documents? The answer is NO. 
Let me explain: if a word occurs many times in a document but also appears in many other documents in our dataset, it may simply be a frequent word, not a relevant or meaningful one.<\/p>\n\n\n\n<p>One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like \u201cthe\u201d, which are frequent across all documents, are penalized. This approach is called term frequency-inverse document frequency, known for short as the Tf-Idf approach of scoring. TF-IDF is intended to reflect how relevant a term is in a given document. So how is the Tf-Idf of a document in a dataset calculated?<\/p>\n\n\n\n<p>TF-IDF for a word in a document is calculated by multiplying two different metrics:<\/p>\n\n\n\n<p>The<strong> term frequency (TF)<\/strong> of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of the instances a word appears in a document. There are also ways to adjust the frequency, for example by dividing the raw count by either the length of the document or the raw frequency of the most frequent word in the document. The formula to calculate term frequency is<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">TF(i,j)=n(i,j)\/\u03a3 n(i,j)<\/pre>\n\n\n\n<p>Where,<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">n(i,j) = number of times the ith word occurs in document j\n\u03a3 n(i,j) = total number of words in document j. <\/pre>\n\n\n\n<p>The <strong>inverse document frequency (IDF)<\/strong> of the word across a set of documents. This suggests how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. 
This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Bag of Words\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/IRKDrrzh4dE?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>So, if the word is very common and appears in many documents, the ratio approaches 1 and its logarithm approaches 0. For rare words, the value grows larger.<\/p>\n\n\n\n<p>Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.<\/p>\n\n\n\n<p>To put it in mathematical terms, the (smoothed) IDF is calculated as follows:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">IDF=1+log(N\/dN)<\/pre>\n\n\n\n<p>Where <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">N = total number of documents in the dataset\ndN = number of documents in which the word occurs <\/pre>\n\n\n\n<p>Also, note that the 1 added in the above formula is so that terms with zero IDF don\u2019t get suppressed entirely. This process is known as IDF smoothing.<\/p>\n\n\n\n<p>The TF-IDF is obtained by&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">TF-IDF=TF*IDF<\/pre>\n\n\n\n<p>Does this seem too complicated? 
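To make the formulas concrete, here is a minimal sketch that computes TF-IDF by hand for the two preprocessed sentences from the earlier example. It assumes the natural logarithm and the smoothed IDF formula above; the helper names `tf`, `idf`, and `tf_idf` are illustrative, not from any library.

```python
import math

# The two preprocessed sentences from the earlier example, as token lists.
docs = [
    ["welcome", "great", "learning", "now", "start", "learning"],
    ["learning", "good", "practice"],
]

def tf(word, doc):
    # TF(i,j) = n(i,j) / total number of words in document j
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Smoothed IDF = 1 + log(N / dN), with dN = documents containing the word
    d_n = sum(1 for doc in docs if word in doc)
    return 1 + math.log(len(docs) / d_n)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "learning" occurs in both documents, so IDF = 1 + log(2/2) = 1
print(round(tf_idf("learning", docs[0], docs), 3))  # 0.333
# "good" occurs in only one document, so IDF = 1 + log(2/1), a higher weight
print(round(tf_idf("good", docs[1], docs), 3))      # 0.564
```

Note how "good", though it occurs only once, outscores "learning", which occurs twice: rarity across the corpus boosts its weight.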
Don\u2019t worry: this can be achieved with just a few lines of code, and you don\u2019t even have to remember these formulas.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"feature-extraction-with-tf-idf-vectorizer\"><strong>Feature Extraction with Tf-Idf vectorizer<\/strong><\/h2>\n\n\n\n<p>We can use the <strong>TfidfVectorizer()<\/strong> function from the <a href=\"https:\/\/www.mygreatlearning.com\/blog\/open-source-python-libraries\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Sk-learn library (opens in a new tab)\">Sk-learn library<\/a> to easily implement the above BoW (Tf-Idf) model.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n\nsentence_1=&quot;This is a good job. I will not miss it for anything&quot;\nsentence_2=&quot;This is not good at all&quot;\n\n\n\n#without smooth IDF\nprint(&quot;Without Smoothing:&quot;)\n#define tf-idf\ntf_idf_vec = TfidfVectorizer(use_idf=True, \n                        smooth_idf=False,  \n                        ngram_range=(1,1),stop_words=&#039;english&#039;) # to use only bigrams ngram_range=(2,2)\n#transform\ntf_idf_data = tf_idf_vec.fit_transform(&#x5B;sentence_1,sentence_2])\n\n#create dataframe\ntf_idf_dataframe=pd.DataFrame(tf_idf_data.toarray(),columns=tf_idf_vec.get_feature_names_out())\nprint(tf_idf_dataframe)\nprint(&quot;\\n&quot;)\n\n#with smooth\ntf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  \n                        smooth_idf=True,  \n                        ngram_range=(1,1),stop_words=&#039;english&#039;)\n\n\ntf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(&#x5B;sentence_1,sentence_2])\n\nprint(&quot;With Smoothing:&quot;)\ntf_idf_dataframe_smooth=pd.DataFrame(tf_idf_data_smooth.toarray(),columns=tf_idf_vec_smooth.get_feature_names_out())\nprint(tf_idf_dataframe_smooth)\n\n<\/pre><\/div>\n\n\n<p><br>Output:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"limitations-of-bag-of-words\"><strong>Limitations of Bag-of-Words<\/strong><\/h2>\n\n\n\n<p>Although Bag-of-Words is quite efficient and easy to implement, the technique still has some disadvantages, given below:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The model ignores the location information of the word, and location is a very important piece of information in text. For example, \u201ctoday is off\u201d and \u201cIs today off\u201d have the exact same vector representation in the BoW model.<\/li>\n\n\n\n<li>Bag-of-words models don\u2019t respect the semantics of words. For example, the words \u2018soccer\u2019 and \u2018football\u2019 are often used in the same context, yet the vectors corresponding to these words are quite different in the bag of words model. The problem becomes more serious when modeling sentences: \u201cBuy used cars\u201d and \u201cPurchase old automobiles\u201d are represented by totally different vectors in the Bag-of-Words model.<\/li>\n\n\n\n<li>The range of vocabulary is a big issue for the Bag-of-Words model. If the model comes across a new word it has not seen yet, say a rare but informative word like \u201cbiblioklept\u201d (one who steals books), the BoW model will end up ignoring it, as the word is not in the model\u2019s vocabulary.<\/li>\n<\/ol>\n\n\n\n<p><em>This brings us to the end of this article, where we have learned about Bag of words and its implementation with Sk-learn. 
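The first limitation is easy to demonstrate in code. A minimal sketch in plain Python (the `bow` helper and the three-word vocabulary are illustrative assumptions): both word orderings below produce the identical vector, so the model cannot tell the two sentences apart.

```python
from collections import Counter

def bow(sentence, vocab):
    # Count each vocabulary word in the lower-cased sentence.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

vocab = ["today", "is", "off"]
print(bow("today is off", vocab))  # [1, 1, 1]
print(bow("Is today off", vocab))  # [1, 1, 1] -- identical: word order is lost
```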
If you wish to learn more about NLP, take up the Introduction to <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/introduction-to-natural-language-processing\" target=\"_blank\" rel=\"noreferrer noopener\">Natural Language Processing Free Online Course <\/a>offered by Great Learning Academy and upskill today. <\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"further-reading\">Further Reading<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.mygreatlearning.com\/blog\/natural-language-processing-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">Natural Language Processing (NLP) Tutorial: A Step by Step Guide<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.mygreatlearning.com\/blog\/named-entity-recognition\/\" target=\"_blank\" rel=\"noreferrer noopener\">What is Named Entity Recognition (NER) Applications and Uses?<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.mygreatlearning.com\/blog\/nltk-tutorial-with-python\/\" target=\"_blank\" rel=\"noreferrer noopener\">Natural Language Toolkit (NLTK) Tutorial with Python<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.mygreatlearning.com\/blog\/pos-tagging\/\" target=\"_blank\" rel=\"noreferrer noopener\">Part of Speech (POS) tagging with Hidden Markov Model<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.mygreatlearning.com\/blog\/tokenization\/\" target=\"_blank\" rel=\"noreferrer noopener\">Tokenising into Words and Sentences | What is Tokenization and it\u2019s Definition?<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Bag of words is a Natural Language Processing technique of text modelling. Through this blog, we will learn more about why Bag of Words is used, we will understand the concept with the help of an example, learn more about its implementation in Python, and more. 
Using Natural Language Processing, we make use of the [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":14065,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[2],"tags":[],"content_type":[],"class_list":["post-13981","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>An Introduction to Bag of Words in NLP using Python | What is BoW?<\/title>\n<meta name=\"description\" content=\"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"An Introduction to Bag of Words (BoW) | What is Bag of Words?\" \/>\n<meta property=\"og:description\" content=\"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-09-11T05:28:40+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-02T11:47:11+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1254\" \/>\n\t<meta property=\"og:image:height\" content=\"836\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"An Introduction to Bag of Words (BoW) | What is Bag of Words?\",\"datePublished\":\"2020-09-11T05:28:40+00:00\",\"dateModified\":\"2024-09-02T11:47:11+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/\"},\"wordCount\":2251,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/iStock-1174690086.jpg\",\"articleSection\":[\"AI and Machine Learning\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/\",\"name\":\"An Introduction to Bag of Words in NLP using Python | What is 
BoW?\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/iStock-1174690086.jpg\",\"datePublished\":\"2020-09-11T05:28:40+00:00\",\"dateModified\":\"2024-09-02T11:47:11+00:00\",\"description\":\"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/iStock-1174690086.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/iStock-1174690086.jpg\",\"width\":1254,\"height\":836,\"caption\":\"bag of words\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/bag-of-words\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI and Machine Learning\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/artificial-intelligence\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"An Introduction to Bag of Words (BoW) | What is Bag of 
Words?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. 
It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"An Introduction to Bag of Words in NLP using Python | What is BoW?","description":"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/","og_locale":"en_US","og_type":"article","og_title":"An Introduction to Bag of Words (BoW) | What is Bag of Words?","og_description":"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.","og_url":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your 
Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2020-09-11T05:28:40+00:00","article_modified_time":"2024-09-02T11:47:11+00:00","og_image":[{"width":1254,"height":836,"url":"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"An Introduction to Bag of Words (BoW) | What is Bag of Words?","datePublished":"2020-09-11T05:28:40+00:00","dateModified":"2024-09-02T11:47:11+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/"},"wordCount":2251,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg","articleSection":["AI and Machine Learning"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/","url":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/","name":"An Introduction to Bag of Words in NLP using Python | What is 
BoW?","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg","datePublished":"2020-09-11T05:28:40+00:00","dateModified":"2024-09-02T11:47:11+00:00","description":"What is Bag of Words (BoW): Bag of Words is a Natural Language Processing technique of text modeling which is used to extract features from text to train a machine learning model.","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg","width":1254,"height":836,"caption":"bag of words"},{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/bag-of-words\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"AI and Machine Learning","item":"https:\/\/www.mygreatlearning.com\/blog\/artificial-intelligence\/"},{"@type":"ListItem","position":3,"name":"An Introduction to Bag of Words (BoW) | What is Bag of Words?"}]},{"@type":"WebSite","@id":"https:\/\/www.mygreatlearning.com\/blog\/#website","url":"https:\/\/www.mygreatlearning.com\/blog\/","name":"Great Learning Blog","description":"Learn, Upskill &amp; Career Development Guide and 
Resources","publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"alternateName":"Great Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.mygreatlearning.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization","name":"Great Learning","url":"https:\/\/www.mygreatlearning.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","width":900,"height":900,"caption":"Great Learning"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/GreatLearningOfficial\/","https:\/\/x.com\/Great_Learning","https:\/\/www.instagram.com\/greatlearningofficial\/","https:\/\/www.linkedin.com\/school\/great-learning\/","https:\/\/in.pinterest.com\/greatlearning12\/","https:\/\/www.youtube.com\/user\/beaconelearning\/"],"description":"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.","email":"info@mygreatlearning.com","legalName":"Great Learning Education Services Pvt. 
Ltd","foundingDate":"2013-11-29","numberOfEmployees":{"@type":"QuantitativeValue","minValue":"1001","maxValue":"5000"}},{"@type":"Person","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad","name":"Great Learning Editorial Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","caption":"Great Learning Editorial Team"},"description":"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.","sameAs":["https:\/\/www.mygreatlearning.com\/","https:\/\/in.linkedin.com\/school\/great-learning\/","https:\/\/x.com\/https:\/\/twitter.com\/Great_Learning","https:\/\/www.youtube.com\/channel\/UCObs0kLIrDjX2LLSybqNaEA"],"award":["Best EdTech Company of the Year 2024","Education Economictimes Outstanding Education\/Edtech Solution Provider of the Year 2024","Leading E-learning Platform 
2024"],"url":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"}]}},"uagb_featured_image_src":{"full":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",1254,836,false],"thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086-150x150.jpg",150,150,true],"medium":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086-768x512.jpg",768,512,true],"large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086-1024x683.jpg",1024,683,true],"1536x1536":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",1254,836,false],"2048x2048":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",1254,836,false],"web-stories-poster-portrait":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",640,427,false],"web-stories-publisher-logo":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/04\/iStock-1174690086.jpg",150,100,false]},"uagb_author_info":{"display_name":"Great Learning Editorial Team","author_link":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"},"uagb_comment_info":1,"uagb_excerpt":"Bag of words is a Natural Language Processing technique of text modelling. Through this blog, we will learn more about why Bag of Words is used, we will understand the concept with the help of an example, learn more about it's implementation in Python, and more. 
Using Natural Language Processing, we make use of the&hellip;","_links":{"self":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/13981","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/comments?post=13981"}],"version-history":[{"count":86,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/13981\/revisions"}],"predecessor-version":[{"id":111031,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/13981\/revisions\/111031"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media\/14065"}],"wp:attachment":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media?parent=13981"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/categories?post=13981"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/tags?post=13981"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/content_type?post=13981"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}