{"id":15202,"date":"2020-05-29T23:19:47","date_gmt":"2020-05-29T17:49:47","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/tokenization\/"},"modified":"2024-09-02T17:17:18","modified_gmt":"2024-09-02T11:47:18","slug":"tokenization","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/tokenization\/","title":{"rendered":"Tokenising into Words and Sentences | What is Tokenization and its Definition?"},"content":{"rendered":"\n<ol class=\"wp-block-list\"><li><a href=\"#what-is-tokenization\">What is tokenisation<\/a><\/li><li><a href=\"#tokenisation-techniques-optional\">Tokenisation techniques (optional)<\/a><\/li><li><a href=\"#tokenisation-with-nltk\">Tokenising with NLTK<\/a><\/li><li><a href=\"#tokenising-with-textblob\">Tokenising with TextBlob<\/a><\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-tokenization\"><strong>What is Tokenization?<\/strong><\/h2>\n\n\n\n<p>Tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for processes such as parsing and text mining.<br><\/p>\n\n\n\n<p>Almost every <a href=\"https:\/\/www.mygreatlearning.com\/blog\/natural-language-processing-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Natural language processing (opens in a new tab)\">Natural language processing<\/a> task uses some sort of tokenisation technique. 
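<\/p>\n\n\n\n<p>For instance, here is a minimal sketch in plain Python (independent of any NLP library) showing how even a single regular expression can split a sentence into word and punctuation tokens:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\n# Toy word tokeniser: \\w+ matches runs of word characters,\n# &#91;^\\w\\s] matches any single punctuation character.\ntext = \"Tokenisation breaks text into tokens, doesn't it?\"\ntokens = re.findall(r\"\\w+|&#91;^\\w\\s]\", text)\nprint(tokens)\n# -&gt; &#91;'Tokenisation', 'breaks', 'text', 'into', 'tokens', ',', 'doesn', \"'\", 't', 'it', '?']\n<\/code><\/pre>\n\n\n\n<p>Note how the apostrophe in doesn't becomes a token of its own; real tokenisers handle such cases more carefully.<\/p>\n\n\n\n<p>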
It is vital to understand the patterns in the text in order to achieve tasks like sentiment analysis,<a href=\"https:\/\/www.mygreatlearning.com\/blog\/named-entity-recognition\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" named entity recognition (opens in a new tab)\"> named entity recognition<\/a> (also known as NER), <a href=\"https:\/\/www.mygreatlearning.com\/blog\/pos-tagging\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"POS tagging (opens in a new tab)\">POS tagging<\/a>, text classification, intelligent chatbots, language translation, text summarisation and many more.&nbsp;<br><\/p>\n\n\n\n<p>These tokens are very useful for finding such patterns, and tokenisation is also considered a base step for stemming and lemmatisation. Stemming and lemmatisation both generate the root form of the inflected words obtained from tokenisation.<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tokenisation-techniques-optional\"><strong>Tokenisation Techniques (Optional)<\/strong><br><\/h2>\n\n\n\n<p>In this section, we are going to explore some of the tokenisation techniques. If you are only concerned with implementing tokenisation, you can skip ahead to the next sections.&nbsp;<br><\/p>\n\n\n\n<p><strong>White space tokenisation:<\/strong><\/p>\n\n\n\n<p>Perhaps this is one of the simplest techniques to tokenise a sentence or paragraph into words. In this technique, the input sentence is broken apart every time a white space is encountered. Although this is a fast and easy way to implement tokenisation, it only works in languages where meaningful units are separated by spaces, e.g. English. For multi-word expressions such as living room, full moon, real estate and coffee table, this method does not work accurately, since each expression is split into separate word tokens.<br><\/p>\n\n\n\n<p><strong>Dictionary based tokenisation:<\/strong><\/p>\n\n\n\n<p>This is a much more advanced method than the white space tokeniser. We find tokens from the sentences that are already in the dictionary. 
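<\/p>\n\n\n\n<p>As a rough illustration, here is a toy longest-match implementation (a hypothetical sketch, not a function from any library) that recovers multi-word tokens from a small dictionary:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Toy dictionary-based tokeniser: at each position, greedily take the\n# longest dictionary entry that matches, else keep the word as-is.\ndictionary = {\"real estate\", \"coffee table\", \"coffee\", \"table\", \"is\", \"a\"}\n\ndef dict_tokenize(words):\n    tokens, i = &#91;], 0\n    while i &lt; len(words):\n        # try the longest span starting at position i first, then shrink\n        for j in range(len(words), i, -1):\n            candidate = \" \".join(words&#91;i:j])\n            if candidate in dictionary:\n                tokens.append(candidate)\n                i = j\n                break\n        else:\n            tokens.append(words&#91;i])  # unknown word: keep it as a token\n            i += 1\n    return tokens\n\nprint(dict_tokenize(\"a coffee table is real estate\".split()))\n# -&gt; &#91;'a', 'coffee table', 'is', 'real estate']\n<\/code><\/pre>\n\n\n\n<p>A real dictionary tokeniser would use a far larger lexicon and a more careful policy for unknown words.<\/p>\n\n\n\n<p>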
This approach needs specific guidance if the tokens in the sentence aren\u2019t in the dictionary. For languages without spaces between words, there is an additional step of word segmentation, where we find sequences of characters that have a certain meaning.<br><\/p>\n\n\n\n<p><strong>Subword Tokenisation:<\/strong><br><\/p>\n\n\n\n<p>This is a collection of approaches that usually use unsupervised <a href=\"https:\/\/www.mygreatlearning.com\/blog\/what-is-machine-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"machine learning (opens in a new tab)\">machine learning<\/a> techniques. This method finds short sequences of characters that are often used together and assigns each of them to be a separate token. As this is an unsupervised method, we may sometimes encounter tokens that have no real meaning.<br><\/p>\n\n\n\n<p>These techniques should have given you a brief overview of the technicalities behind tokenisation. In the next section, we will see how we can use some libraries and frameworks to do tokenisation for us.<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tokenisation-with-nltk\"><strong>Tokenisation with NLTK<\/strong><br><\/h2>\n\n\n\n<p><a rel=\"noreferrer noopener\" aria-label=\"NLTK (opens in a new tab)\" href=\"https:\/\/www.mygreatlearning.com\/blog\/nltk-tutorial-with-python\/\" target=\"_blank\">NLTK<\/a> is a standard Python library with prebuilt functions and utilities for ease of use and implementation. It is one of the most used libraries for natural language processing and computational linguistics.<br><\/p>\n\n\n\n<p>Tasks such as tokenisation, stemming, lemmatisation, chunking and many more can be implemented in just one line using NLTK. Now let us look at some of the popular tokenisers available in NLTK for tokenising text into sentences or words.<br><\/p>\n\n\n\n<p>First, install NLTK on your PC if it is not already installed. 
To install it, go to the command prompt and type:<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install nltk\n<\/code><\/pre>\n\n\n\n<p>Next, go to the editor and run these lines of code:<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import nltk\nnltk.download('all')\n<\/code><\/pre>\n\n\n\n<p>Tokenising into sentences: Some of the tokenisers that can split a paragraph into sentences are given below. The results obtained from each may be a little different, so you should choose the tokeniser that works best for you. Now let us take a look at some examples taken from the NLTK documentation.<br><\/p>\n\n\n\n<p><strong>sent_tokenize<\/strong><br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import nltk\nfrom nltk import sent_tokenize \n  \ntext = '''Hello everyone. welcome to the Great Learning.\n Mr. Smith(\"He is instructor\") and Johann S. Baech \n\n (He is also an instructor.) are waiting for you'''\nsent_tokenize(text) <\/code><\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/1.png\"><img decoding=\"async\" width=\"740\" height=\"85\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/1.png\" alt=\"\" class=\"wp-image-15203\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/1.png 740w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/1-300x34.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/1-696x80.png 696w\" sizes=\"(max-width: 740px) 100vw, 740px\" \/><\/figure>\n\n\n\n<p>Actually, sent_tokenize is a wrapper function that calls the tokenize method of the Punkt sentence tokenizer. This tokeniser divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. 
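<\/p>\n\n\n\n<p>To see why such a model is needed, consider a deliberately naive splitter (a toy sketch, not NLTK code) that ends a sentence at every full stop; an abbreviation such as Mr. immediately breaks it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\ntext = \"Mr. Smith is waiting. He will join you soon.\"\n# Naive rule: split wherever a full stop is followed by whitespace.\nnaive = re.split(r\"(?&lt;=\\.)\\s+\", text)\nprint(naive)\n# -&gt; &#91;'Mr.', 'Smith is waiting.', 'He will join you soon.']\n<\/code><\/pre>\n\n\n\n<p>An unsupervised model such as Punkt learns from the corpus that Mr. is an abbreviation, so no sentence boundary is placed after it.<\/p>\n\n\n\n<p>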
It must be trained on a large collection of plaintext in the target language before it can be used. Here is the code in NLTK:<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>text = '''Hello everyone. welcome to the Great Learning.Mr. Smith(\"He is instructor\") and Johann S. Baech (He is also an instructor.) are waiting for you'''\nsent_detector = nltk.data.load('tokenizers\/punkt\/english.pickle')\nprint(\"'\\n-----\\n'\".join(sent_detector.tokenize(text.strip())))<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/2.png\"><img decoding=\"async\" width=\"741\" height=\"139\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/2.png\" alt=\"\" class=\"wp-image-15204\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/2.png 741w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/2-300x56.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/2-696x131.png 696w\" sizes=\"(max-width: 741px) 100vw, 741px\" \/><\/figure>\n\n\n\n<p>As we can see, the output is the same. Also, the parameter realign_boundaries can change the output in the following way if set to False:&nbsp;<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>text = '''Hello everyone. welcome to the Great Learning.Mr. Smith(\"He is instructor\") and Johann S. Baech (He is also an instructor.) 
are waiting for you'''\nsent_detector = nltk.data.load('tokenizers\/punkt\/english.pickle')\nprint(\"'\\n-----\\n'\".join(sent_detector.tokenize(text.strip(),realign_boundaries=False)))<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/3.png\"><img decoding=\"async\" width=\"721\" height=\"135\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/3.png\" alt=\"\" class=\"wp-image-15205\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/3.png 721w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/3-300x56.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/3-696x130.png 696w\" sizes=\"(max-width: 721px) 100vw, 721px\" \/><\/figure>\n\n\n\n<p><strong>BlanklineTokenizer<\/strong>: This tokeniser separates the sentences when there is a blank line between them. We can use this tokeniser to extract paragraphs from a large corpus of text<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import BlanklineTokenizer\ntext = '''Hello everyone. welcome to the Great Learning.\nMr. Smith(\"He is instructor\") and Johann S. Baech \n \n(He is also an instructor.) 
are waiting for you'''\nBlanklineTokenizer().tokenize(text)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large is-resized zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4.png\"><img decoding=\"async\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4.png\" alt=\"\" class=\"wp-image-15206\" width=\"725\" height=\"39\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4.png 879w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4-300x16.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4-768x42.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/4-696x38.png 696w\" sizes=\"(max-width: 725px) 100vw, 725px\" \/><\/figure>\n\n\n\n<p>Now here are a few methods for tokenising a text into words.<\/p>\n\n\n\n<p><strong>word_tokenize<\/strong><br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk import word_tokenize, sent_tokenize\ntext = '''Hello everyone . welcome to the Great Learning .\nMr. Smith(\"He isn't an instructor\") and Johann S. Baech \n(He is  an instructor.) are waiting for you . 
They'll join you soon.'''\nfor t in sent_tokenize(text):\n  x=word_tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large td-caption-align-https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5.png zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5.png\"><img decoding=\"async\" width=\"1024\" height=\"77\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-1024x77.png\" alt=\"\" class=\"wp-image-15207\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-1024x77.png 1024w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-300x23.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-768x58.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-696x53.png 696w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5-1068x81.png 1068w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/5.png 1259w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>word_tokenize is a wrapper function that calls the tokenize method of the Treebank tokenizer. The Treebank tokenizer uses regular expressions to tokenise text as in the Penn Treebank. 
Here is the code for Treebank tokenizer<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import TreebankWordTokenizer\nfor t in sent_tokenize(text):\n  x=TreebankWordTokenizer().tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n<figure class=\"wp-block-image size-large td-caption-align-https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6.png zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6.png\"><img decoding=\"async\" width=\"1024\" height=\"97\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-1024x97.png\" alt=\"\" class=\"wp-image-15208\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-1024x97.png 1024w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-300x29.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-768x73.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-696x66.png 696w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6-1068x102.png 1068w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/6.png 1252w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>WhitespaceTokenizer<\/strong>: As the name suggests, this tokeniser splits the text whenever it encounters a space.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import WhitespaceTokenizer\nfor t in sent_tokenize(text):\n  x=WhitespaceTokenizer().tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7.png\"><img decoding=\"async\" width=\"953\" height=\"120\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7.png\" alt=\"\" class=\"wp-image-15209\" 
srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7.png 953w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7-300x38.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7-768x97.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/7-696x88.png 696w\" sizes=\"(max-width: 953px) 100vw, 953px\" \/><\/figure>\n\n\n\n<p><strong>wordpunct_tokenize<\/strong>: wordpunct_tokenize is based on a simple regexp tokenisation. Basically, it uses the regular expression <em>\\w+|[^\\w\\s]+<\/em> to split the input.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import wordpunct_tokenize\nfor t in sent_tokenize(text):\n  x=wordpunct_tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large td-caption-align-https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8.png zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8.png\"><img decoding=\"async\" width=\"1024\" height=\"94\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-1024x94.png\" alt=\"\" class=\"wp-image-15216\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-1024x94.png 1024w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-300x28.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-768x71.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-696x64.png 696w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8-1068x98.png 1068w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/8.png 1238w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Multi-Word Expression Tokenizer (MWETokenizer)<\/strong>: An MWETokenizer takes a string and merges multi-word 
expressions into single tokens, using a lexicon of MWEs. As you may have noticed in the above examples, Great Learning, being a single entity, is separated into two tokens. We can avoid this and also merge some other expressions such as Johann S. Baech and a lot into single tokens. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import MWETokenizer\ntext = '''Hello everyone . welcome to the Great Learning .\nMr. Smith(\"He isn't an instructor\") and Johann S. Baech \n(He is  an instructor.) are waiting for you . They'll join you soon.\nHope you enjoy a lot'''\ntokenizer = MWETokenizer(&#91;('Great', 'Learning'), ('Johann', 'S.', 'Baech'), ('a', 'lot')],separator=' ')\nfor t in sent_tokenize(text):\n  x=tokenizer.tokenize(t.split())\n  print(x)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1.png\"><img decoding=\"async\" width=\"908\" height=\"117\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1.png\" alt=\"\" class=\"wp-image-15215\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1.png 908w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1-300x39.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1-768x99.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/9-1-696x90.png 696w\" sizes=\"(max-width: 908px) 100vw, 908px\" \/><\/figure>\n\n\n\n<p><strong>Tweet Tokenizer<\/strong>: The tweet tokeniser is a special tokeniser which works best for tweets and, in general, social media comments and posts. It can preserve emojis and also comes with many handy options. 
A few examples are given below.<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import TweetTokenizer\ntknzr = TweetTokenizer(strip_handles=True)\ntweet= \" @GL : Great Learning is way tooo coool #AI: :-) :-P &lt;3 . Here are some arrows &lt; &gt; -&gt; &lt;--\"\nfor t in sent_tokenize(tweet):\n  x=tknzr.tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/10-1.png\"><img decoding=\"async\" width=\"767\" height=\"62\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/10-1.png\" alt=\"\" class=\"wp-image-15213\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/10-1.png 767w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/10-1-300x24.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/10-1-696x56.png 696w\" sizes=\"(max-width: 767px) 100vw, 767px\" \/><\/figure>\n\n\n\n<p>Here we are able to remove handles from the tokens. Also, as you may have noticed, #AI is not divided into separate tokens, which is exactly what we want when tokenising tweets.<\/p>\n\n\n\n<p><strong>RegexpTokenizer:<\/strong> This tokeniser splits a string into substrings using a regular expression. 
For example, the following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences:<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import RegexpTokenizer\ntokenizer = RegexpTokenizer('\\w+|\\$&#91;\\d\\.]+|\\S+')\nfor t in sent_tokenize(text):\n  x=tokenizer.tokenize(t)\n  print(x)<\/code><\/pre>\n\n\n\n<p> Output: <\/p>\n\n\n<figure class=\"wp-block-image size-large td-caption-align-https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11.png zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11.png\"><img decoding=\"async\" width=\"1024\" height=\"109\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-1024x109.png\" alt=\"\" class=\"wp-image-15214\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-1024x109.png 1024w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-300x32.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-768x81.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-696x74.png 696w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11-1068x113.png 1068w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/05\/11.png 1141w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>There are many more tokenisers available in NLTK library that you can find in their official documentation.<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tokenising-with-textblob\"><strong>Tokenising with TextBlob<\/strong><br><\/h2>\n\n\n\n<p>TextBlob is a Python library for processing textual data. Using its simple API we can easily perform many common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. 
So now let us see how TextBlob performs when it comes to tokenisation.<br><\/p>\n\n\n\n<p>To install it on your PC, go to the terminal and run this command:<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install textblob\n<\/code><\/pre>\n\n\n\n<p> Here is the code to tokenise a text into sentences and words: <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from textblob import TextBlob \n  \ntext = '''Hello everyone . welcome to the Great Learning .\nMr. Smith(\"He isn't an instructor\") and Johann S. Baech \n(He is  an instructor.) are waiting for you . They'll join you soon.\nHope you enjoy a lot. @GL : Great Learning is way tooo coool #AI: :-) :-P &lt;3 . Here are some arrows &lt; &gt; -&gt; &lt;--'''\n    \n# create a TextBlob object \nblob_object = TextBlob(text) \n  \n# tokenize paragraph into words. \nprint(\" Word Tokenize :\\n\", blob_object.words) \n  \n# tokenize paragraph into sentences. \nprint(\"\\n Sentence Tokenize :\\n\", blob_object.sentences) <\/code><\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Word Tokenize :\n &#91;'Hello', 'everyone', 'welcome', 'to', 'the', 'Great', 'Learning', 'Mr', 'Smith', 'He', 'is', \"n't\", 'an', 'instructor', 'and', 'Johann', 'S', 'Baech', 'He', 'is', 'an', 'instructor', 'are', 'waiting', 'for', 'you', 'They', \"'ll\", 'join', 'you', 'soon', 'Hope', 'you', 'enjoy', 'a', 'lot', 'GL', 'Great', 'Learning', 'is', 'way', 'tooo', 'coool', 'AI', 'P', '3', 'Here', 'are', 'some', 'arrows']\n\n Sentence Tokenize :\n &#91;Sentence(\"Hello everyone .\"), Sentence(\"welcome to the Great Learning .\"), Sentence(\"Mr. Smith(\"He isn't an instructor\") and Johann S. 
Baech \n(He is  an instructor.)\"), Sentence(\"are waiting for you .\"), Sentence(\"They'll join you soon.\"), Sentence(\"Hope you enjoy a lot.\"), Sentence(\"@GL : Great Learning is way tooo coool #AI: :-) :-P &lt;3 .\"), Sentence(\"Here are some arrows &lt; &gt; -&gt; &lt;--\")]<\/code><\/pre>\n\n\n\n<p>As you might have noticed, TextBlob automatically removes punctuation, including emojis, from the tokens. But we do not get as many customisation options as we do in NLTK.<\/p>\n\n\n\n<p>This brings us to the end of this article, where we have learned about tokenisation and various ways to implement it.<\/p>\n\n\n\n<p>If you wish to learn more about<a href=\"https:\/\/www.mygreatlearning.com\/blog\/python-tutorial-for-beginners-a-complete-guide\/\"> Python<\/a> and the concepts of <a href=\"https:\/\/www.mygreatlearning.com\/blog\/what-is-machine-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Machine Learning (opens in a new tab)\">Machine Learning<\/a>, upskill with<a href=\"https:\/\/www.mygreatlearning.com\/pg-program-artificial-intelligence-course\"> Great Learning\u2019s PG Program Artificial Intelligence and Machine Learning.<\/a><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is tokenisation Tokenisation techniques (optional) Tokenising with NLTK Tokenising with TextBlob What is Tokenization? Tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks may be discarded. 
The tokens usually [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":15219,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[2],"tags":[],"content_type":[],"class_list":["post-15202","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Tokenization of Textual Data into Words and Sentences and Definition?<\/title>\n<meta name=\"description\" content=\"What is Tokenization? 